How to create a custom expert with tutel? #226

zws98 opened this issue Mar 22, 2024 · 19 comments

zws98 commented Mar 22, 2024

Code:

self._moe_layer = tutel_moe.moe_layer(
    gate_type = {'type': 'top', 'k': top_value, 'fp32_gate': args.fp32_gate},
    experts = {'type': 'ffn', 'count_per_node': num_local_experts, 'hidden_size_per_expert': hidden_size, 'activation_fn': lambda x: F.relu(x)},
    model_dim = model_dim,
    scan_expert_func = lambda name, param: setattr(param, 'skip_allreduce', True),
    seeds = (1, dist_rank + 1, 1),
    a2a_ffn_overlap_degree = a2a_ffn_overlap_degree,
)

How can I define a custom expert, e.g., only one mlp layer?

ghostplant (Contributor) commented:

You can follow this example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py, which can be executed with: python3 -m tutel.examples.helloworld_demo --batch_size=16
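
For reference, here is a minimal sketch of a single-MLP-layer expert in the spirit of that demo. The `__init__`/`forward` signatures mirror the `CustomExpertDemo` pattern used later in this thread; the class name, weight shape, and the assumption that `x` arrives as `(local_experts, tokens, model_dim)` are illustrative, and registering the class in the `experts=...` dict follows the linked demo:

import torch

# Sketch of a one-layer MLP expert following the CustomExpertDemo pattern.
# `my_config` carries whatever extra static configuration you pass through tutel.
class SingleMlpExpert(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        if sharded_count > 1:
            raise Exception("`sharded_count > 1` is not handled by this simple expert.")
        # One weight matrix per local expert: (local_experts, model_dim, model_dim).
        self.weight = torch.nn.Parameter(torch.randn(local_experts, model_dim, model_dim) * 0.01)

    def forward(self, x, ctx):
        # x is assumed to arrive as (local_experts, tokens_per_expert, model_dim).
        return torch.nn.functional.relu(torch.matmul(x, self.weight))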

zws98 commented Mar 25, 2024

Thanks a lot!

zws98 closed this as completed Mar 25, 2024
zws98 commented Mar 25, 2024

What if I want to feed another parameter into `class CustomExpertDemo(torch.nn.Module)`? How should I revise the code in tutel?

zws98 reopened this Mar 25, 2024
zws98 commented Mar 25, 2024

e.g., `def forward(self, x, ctx, anew_param):`

ghostplant (Contributor) commented:

Is that a static parameter that can be set just in the `__init__` function of `CustomExpertDemo`?

zws98 commented Mar 25, 2024

No, it is a learnable parameter initialized outside the class `CustomExpertDemo`.

ghostplant (Contributor) commented:

Tutel still needs a few API upgrades to meet your requirement.

zws98 commented Mar 25, 2024

Thanks. Is there a way to make this modification after installing tutel (e.g., by revising xx.py inside the installed package)?

ghostplant (Contributor) commented:

You need to feed the extra argument data here: https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L238,
where `self.experts` is the layer object created from your custom `CustomExpertDemo`.

You also need to extend the corresponding argument list of the `forward` function to match the data you feed: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py#L101

If you cannot clone and install tutel from source after applying the changes above, you will have to locate the installed file (probably at /usr/..../tutel/impls/moe_layer.py) and apply the changes there.
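
A minimal sketch of those two edits, assuming a hypothetical extra tensor named `extra_bias`; the exact call site inside `moe_layer.py` may differ between tutel versions, so the library-side change is shown only as a comment:

import torch

# 1) In your custom expert (cf. helloworld_demo.py), extend the forward signature
#    so it matches the extra data you feed in. `extra_bias` is a placeholder name.
class CustomExpertDemo(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        self.fc = torch.nn.Linear(model_dim, model_dim)

    def forward(self, x, ctx, extra_bias=None):
        y = self.fc(x)
        if extra_bias is not None:
            y = y + extra_bias
        return y

# 2) In tutel/impls/moe_layer.py, around the referenced line, pass the extra data
#    through to the expert call, roughly along the lines of:
#        y = self.experts(dispatched_input, self, extra_bias)
#    where `extra_bias` is whatever you forwarded into moe_layer's own forward().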

zws98 commented Mar 25, 2024

Thanks a lot.

zws98 closed this as completed Mar 25, 2024
zws98 commented May 14, 2024

When I use the custom expert, it stops here:

if ctx.sharded_count > 1:
    raise Exception("`sharded_count > 1` is not implemented within this expert, Model parallel is disabled.")

import math

import torch
import torch.nn as nn
import torch.nn.init as init


class CustomExpert_lora(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config, act_layer=nn.GELU):
        super().__init__()
        self.r = 8
        self.scale = 1 / math.sqrt(self.r)
        # Low-rank factors per local expert: lora_A* is (E, r, d), lora_B* is (E, d, r).
        self.lora_A1 = torch.nn.Parameter(torch.empty(local_experts, self.r, model_dim))
        self.lora_B1 = torch.nn.Parameter(torch.empty(local_experts, model_dim, self.r))
        self.act = act_layer()
        self.lora_A2 = torch.nn.Parameter(torch.empty(local_experts, self.r, model_dim))
        self.lora_B2 = torch.nn.Parameter(torch.empty(local_experts, model_dim, self.r))
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_normal_(self.lora_A1)
        self.lora_A1.data *= self.scale
        init.constant_(self.lora_B1, 0)
        init.kaiming_normal_(self.lora_A2)
        self.lora_A2.data *= self.scale
        init.constant_(self.lora_B2, 0)

    def forward(self, x, ctx):
        if ctx.sharded_count > 1:
            raise Exception("`sharded_count > 1` is not implemented within this expert, Model parallel is disabled.")

        # Compose the rank-r weights: (E, d, r) @ (E, r, d) -> (E, d, d).
        t1 = torch.matmul(self.lora_B1, self.lora_A1)
        t2 = torch.matmul(self.lora_B2, self.lora_A2)
        y = torch.matmul(x, t1)   # (E, tokens, d) @ (E, d, d) -> (E, tokens, d)
        y = self.act(y)
        y = torch.matmul(y, t2)
        return y

zws98 reopened this May 14, 2024
ghostplant (Contributor) commented:


What is the value of adaptive_r in your moe forward setting?

zws98 commented May 14, 2024

Where can I find `adaptive_r`?

zws98 commented May 14, 2024

I have not changed the value of `adaptive_r`. When I directly replace the above custom MLP with the default FFN, the program works fine.


ghostplant commented May 14, 2024

So it looks like `num_global_experts` is smaller than the number of GPUs, right?

zws98 commented May 14, 2024

num_global_experts=2, self.world_size=8

ghostplant (Contributor) commented:

Yes. When num_global_experts < self.world_size, you have to handle the sharded_count > 1 case, which describes how the parameters of one expert are partitioned across more than one GPU (in your setting each expert would presumably be sharded across 8 / 2 = 4 GPUs). Typically, you can implement expert data parallelism to support this setting: create sharded parameters in initialization and then all_gather the shards in the forward function. The built-in FFN layer already includes this implementation, but I'll share a simpler example.
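
A minimal sketch of that expert-data-parallel pattern, not tutel's built-in implementation: each rank stores 1/sharded_count of every expert weight and reassembles it with all_gather in forward. It assumes sharded_count evenly divides model_dim and, for brevity, uses the default process group and a non-differentiable all_gather; a real implementation would gather over the sub-group of ranks sharing the expert and use an autograd-aware gather such as torch.distributed.nn.functional.all_gather:

import torch
import torch.distributed as dist

class ShardedCustomExpert(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        assert model_dim % sharded_count == 0, "sharded_count must divide model_dim in this sketch"
        self.sharded_count = sharded_count
        # Each rank owns a slice of the full (local_experts, model_dim, model_dim) weight.
        self.w_shard = torch.nn.Parameter(
            torch.randn(local_experts, model_dim // sharded_count, model_dim) * 0.01)

    def forward(self, x, ctx):
        if ctx.sharded_count > 1:
            # Collect the slices held by the other ranks and rebuild the full weight.
            shards = [torch.empty_like(self.w_shard) for _ in range(ctx.sharded_count)]
            dist.all_gather(shards, self.w_shard)
            w = torch.cat(shards, dim=1)          # (local_experts, model_dim, model_dim)
        else:
            w = self.w_shard
        return torch.matmul(x, w)                 # (E, tokens, d) @ (E, d, d) -> (E, tokens, d)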

zws98 commented May 14, 2024

Thanks a lot!
