How to create a custom expert with tutel? #226

zws98 opened this issue Mar 22, 2024 · 19 comments

zws98 commented Mar 22, 2024

Code:

self._moe_layer = tutel_moe.moe_layer(
    gate_type = {'type': 'top', 'k': top_value, 'fp32_gate': args.fp32_gate},
    experts = {'type': 'ffn', 'count_per_node': num_local_experts, 'hidden_size_per_expert': hidden_size, 'activation_fn': lambda x: F.relu(x)},
    model_dim = model_dim,
    scan_expert_func = lambda name, param: setattr(param, 'skip_allreduce', True),
    seeds = (1, dist_rank + 1, 1),
    a2a_ffn_overlap_degree = a2a_ffn_overlap_degree,
)

How can I define a custom expert, e.g., only one mlp layer?

ghostplant (Contributor) commented:

You can follow this example: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py, which can be executed with: python3 -m tutel.examples.helloworld_demo --batch_size=16
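
For reference, here is a minimal sketch of a single-MLP-layer expert in the spirit of that demo. The `__init__`/`forward` signatures mirror the `CustomExpertDemo` pattern used later in this thread; the class name, weight shape, and the assumption that `x` arrives as `(local_experts, tokens, model_dim)` are illustrative, and registering the class in the `experts=...` dict follows the linked demo:

import torch

# Sketch of a one-layer MLP expert following the CustomExpertDemo pattern.
# `my_config` carries whatever extra static configuration you pass through tutel.
class SingleMlpExpert(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        if sharded_count > 1:
            raise Exception("`sharded_count > 1` is not handled by this simple expert.")
        # One weight matrix per local expert: (local_experts, model_dim, model_dim).
        self.weight = torch.nn.Parameter(torch.randn(local_experts, model_dim, model_dim) * 0.01)

    def forward(self, x, ctx):
        # x is assumed to arrive as (local_experts, tokens_per_expert, model_dim).
        return torch.nn.functional.relu(torch.matmul(x, self.weight))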

zws98 commented Mar 25, 2024

Thanks a lot!

zws98 closed this as completed Mar 25, 2024
zws98 commented Mar 25, 2024

What if I want to feed another parameter into `class CustomExpertDemo(torch.nn.Module)`? How should I revise the code in tutel?

zws98 reopened this Mar 25, 2024
zws98 commented Mar 25, 2024

e.g., `def forward(self, x, ctx, anew_param):`

ghostplant (Contributor) commented:

Is that a static parameter that can be set just in the `__init__` function of `CustomExpertDemo`?

zws98 commented Mar 25, 2024

No, it is a learnable parameter initialized outside the class `CustomExpertDemo`.

ghostplant (Contributor) commented:

Tutel still needs a few API upgrades to meet your requirement.

zws98 commented Mar 25, 2024

Thanks. Is there a way to make this modification after installing tutel (e.g., by revising xx.py inside the installed package)?

ghostplant (Contributor) commented:

You need to feed the extra argument data here: https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L238,
where `self.experts` is the layer object created from your custom `CustomExpertDemo`.

You also need to extend the corresponding argument list of the `forward` function to match the data you feed: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_demo.py#L101

If you cannot clone and install tutel from source after applying the changes above, you will have to locate the installed file (probably at /usr/..../tutel/impls/moe_layer.py) and apply the changes there.
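
A minimal sketch of those two edits, assuming a hypothetical extra tensor named `extra_bias`; the exact call site inside `moe_layer.py` may differ between tutel versions, so the library-side change is shown only as a comment:

import torch

# 1) In your custom expert (cf. helloworld_demo.py), extend the forward signature
#    so it matches the extra data you feed in. `extra_bias` is a placeholder name.
class CustomExpertDemo(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        self.fc = torch.nn.Linear(model_dim, model_dim)

    def forward(self, x, ctx, extra_bias=None):
        y = self.fc(x)
        if extra_bias is not None:
            y = y + extra_bias
        return y

# 2) In tutel/impls/moe_layer.py, around the referenced line, pass the extra data
#    through to the expert call, roughly along the lines of:
#        y = self.experts(dispatched_input, self, extra_bias)
#    where `extra_bias` is whatever you forwarded into moe_layer's own forward().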

zws98 commented Mar 25, 2024

Thanks a lot.

zws98 closed this as completed Mar 25, 2024
zws98 commented May 14, 2024

When I use the custom expert, it stops here:

if ctx.sharded_count > 1:
    raise Exception("`sharded_count > 1` is not implemented within this expert, Model parallel is disabled.")

import math

import torch
import torch.nn as nn
import torch.nn.init as init


class CustomExpert_lora(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config, act_layer=nn.GELU):
        super().__init__()
        self.r = 8
        self.scale = 1 / math.sqrt(self.r)
        # Low-rank factors per local expert: lora_A* is (E, r, d), lora_B* is (E, d, r).
        self.lora_A1 = torch.nn.Parameter(torch.empty(local_experts, self.r, model_dim))
        self.lora_B1 = torch.nn.Parameter(torch.empty(local_experts, model_dim, self.r))
        self.act = act_layer()
        self.lora_A2 = torch.nn.Parameter(torch.empty(local_experts, self.r, model_dim))
        self.lora_B2 = torch.nn.Parameter(torch.empty(local_experts, model_dim, self.r))
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_normal_(self.lora_A1)
        self.lora_A1.data *= self.scale
        init.constant_(self.lora_B1, 0)
        init.kaiming_normal_(self.lora_A2)
        self.lora_A2.data *= self.scale
        init.constant_(self.lora_B2, 0)

    def forward(self, x, ctx):
        if ctx.sharded_count > 1:
            raise Exception("`sharded_count > 1` is not implemented within this expert, Model parallel is disabled.")

        # Compose the rank-r weights: (E, d, r) @ (E, r, d) -> (E, d, d).
        t1 = torch.matmul(self.lora_B1, self.lora_A1)
        t2 = torch.matmul(self.lora_B2, self.lora_A2)
        y = torch.matmul(x, t1)   # (E, tokens, d) @ (E, d, d) -> (E, tokens, d)
        y = self.act(y)
        y = torch.matmul(y, t2)
        return y

zws98 reopened this May 14, 2024
ghostplant (Contributor) commented:


What is the value of adaptive_r in your moe forward setting?

zws98 commented May 14, 2024

Where can I find `adaptive_r`?

zws98 commented May 14, 2024

I have not changed the value of `adaptive_r`. When I directly replace the above custom MLP with the default FFN, the program works fine.


ghostplant commented May 14, 2024

So it looks like `num_global_experts` is smaller than the number of GPUs, right?

zws98 commented May 14, 2024

num_global_experts=2, self.world_size=8

ghostplant (Contributor) commented:

Yes. When num_global_experts < self.world_size, you have to handle the sharded_count > 1 case, which describes how the parameters of one expert are partitioned across more than one GPU (in your setting each expert would presumably be sharded across 8 / 2 = 4 GPUs). Typically, you can implement expert data parallelism to support this setting: create sharded parameters in initialization and then all_gather the shards in the forward function. The built-in FFN layer already includes this implementation, but I'll share a simpler example.
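
A minimal sketch of that expert-data-parallel pattern, not tutel's built-in implementation: each rank stores 1/sharded_count of every expert weight and reassembles it with all_gather in forward. It assumes sharded_count evenly divides model_dim and, for brevity, uses the default process group and a non-differentiable all_gather; a real implementation would gather over the sub-group of ranks sharing the expert and use an autograd-aware gather such as torch.distributed.nn.functional.all_gather:

import torch
import torch.distributed as dist

class ShardedCustomExpert(torch.nn.Module):
    def __init__(self, model_dim, local_experts, sharded_count, my_config):
        super().__init__()
        assert model_dim % sharded_count == 0, "sharded_count must divide model_dim in this sketch"
        self.sharded_count = sharded_count
        # Each rank owns a slice of the full (local_experts, model_dim, model_dim) weight.
        self.w_shard = torch.nn.Parameter(
            torch.randn(local_experts, model_dim // sharded_count, model_dim) * 0.01)

    def forward(self, x, ctx):
        if ctx.sharded_count > 1:
            # Collect the slices held by the other ranks and rebuild the full weight.
            shards = [torch.empty_like(self.w_shard) for _ in range(ctx.sharded_count)]
            dist.all_gather(shards, self.w_shard)
            w = torch.cat(shards, dim=1)          # (local_experts, model_dim, model_dim)
        else:
            w = self.w_shard
        return torch.matmul(x, w)                 # (E, tokens, d) @ (E, d, d) -> (E, tokens, d)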

zws98 commented May 14, 2024

Thanks a lot!
