feat(moe): support isp for moe #57

blankde · 2024-02-26T02:55:37Z

Motivation

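This PR adds weight parallel (isp) support for moe. If isp is used, the world size satisfies ws = wps * eps * edps; otherwise, ws = tp * eps * edps. Here wps is the weight parallel size, tp the tensor parallel size, eps the expert parallel size, and edps the expert data parallel size. Related to #44.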
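As a rough illustration of the relation above, the sketch below solves it for edps and rejects sizes that do not divide the world size. The helper name `expert_data_parallel_size` and its parameters are hypothetical and are not taken from this PR; pipeline parallelism is ignored, as in the description.

```python
def expert_data_parallel_size(world_size: int, eps: int, inner_size: int) -> int:
    """Solve ws = inner_size * eps * edps for edps.

    inner_size stands for wps when isp is used and for tp otherwise.
    Hypothetical helper for illustration only; not part of the PR.
    """
    if min(world_size, eps, inner_size) < 1:
        raise ValueError("all sizes must be positive integers")
    denom = inner_size * eps
    if world_size % denom != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by inner_size * eps = {denom}"
        )
    return world_size // denom


# 16 devices, expert parallel size 2, weight (or tensor) parallel size 4
print(expert_data_parallel_size(16, eps=2, inner_size=4))  # 2
```

With isp the third argument would be wps; without it, tp. The same check applies in both cases.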

Modification

configs/7B_MoE4_sft.py: add a new parallel config for experts.
internlm/core/context/parallel_context.py: add checks for expert parallel and expert weight parallel.
internlm/core/context/process_group_initializer.py: if isp is used, run Initializer_Expert_Weight_Data to divide the Expert, Expert_Weight, and Expert_Data groups. Here Expert_Data means the expert weight data parallel groups, and each device saves only part of the weights of some experts. Otherwise, run Initializer_Expert_Data to divide the Expert and Expert_Data groups.
internlm/moe/gshard_moe.py: find the true mlp impl class.
internlm/core/communication/isp.py: change the ISPLinear check logic, since moe linears are deeper modules in a moe model.
internlm/solver/optimizer/hybrid_zero_optim.py: change the logic for the weight accum grad hooks.
internlm/train/training_internlm.py: set moe linear params as IS_WEIGHT_EXPERT_DATA_PARALLEL if isp is used.
internlm/solver/optimizer/utils.py: add norm calculation for moe linears if isp is used.
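To make the three-way split in process_group_initializer.py more concrete, here is a minimal sketch that partitions ranks into Expert_Weight, Expert, and Expert_Data groups. The rank layout (expert weight innermost, then expert, then expert data) and the function name `build_expert_groups` are assumptions chosen for illustration; the ordering actually used by Initializer_Expert_Weight_Data may differ.

```python
def build_expert_groups(world_size: int, wp_size: int, ep_size: int) -> dict:
    """Partition ranks into expert-weight, expert, and expert-data groups.

    Assumed layout (not confirmed by the PR):
        rank = (d * ep_size + e) * wp_size + w
    where w indexes expert weight parallel, e expert parallel,
    and d expert data parallel.
    """
    if world_size % (wp_size * ep_size) != 0:
        raise ValueError("world_size must be divisible by wp_size * ep_size")
    edp_size = world_size // (wp_size * ep_size)

    def rank(d: int, e: int, w: int) -> int:
        return (d * ep_size + e) * wp_size + w

    # Each group varies exactly one index and fixes the other two.
    weight_groups = [
        [rank(d, e, w) for w in range(wp_size)]
        for d in range(edp_size) for e in range(ep_size)
    ]
    expert_groups = [
        [rank(d, e, w) for e in range(ep_size)]
        for d in range(edp_size) for w in range(wp_size)
    ]
    data_groups = [
        [rank(d, e, w) for d in range(edp_size)]
        for e in range(ep_size) for w in range(wp_size)
    ]
    return {
        "expert_weight": weight_groups,
        "expert": expert_groups,
        "expert_data": data_groups,
    }


groups = build_expert_groups(world_size=8, wp_size=2, ep_size=2)
print(groups["expert_weight"])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(groups["expert"])         # [[0, 2], [1, 3], [4, 6], [5, 7]]
print(groups["expert_data"])    # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Under this layout every rank belongs to exactly one group of each kind, and ranks in the same Expert_Data group hold the same weight shard of the same experts.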

huangting4201 · 2024-02-26T03:46:10Z

configs/7B_MoE4_sft.py

    2. overlap: bool, enable/disable all_gather/reduce_scatter communication overlap, defaults to False.
    3. memory_pool: bool, enable/disable memory pool, defaults to False.
+expert parallel (dict):
+    1. size: int, the size of expert parallel, each device would save {num_expert/ep_size} local experts.
+expert parallel (dict):


Should this be changed to expert weight parallel (dict): ?
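For reference, the {num_expert/ep_size} figure in the docstring above can be sketched as follows. The contiguous assignment of expert ids to expert parallel ranks and the helper name `local_expert_ids` are assumptions for illustration, not something this diff confirms.

```python
def local_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> list:
    """Return the expert ids held by one expert parallel rank.

    Assumes experts are split contiguously and evenly across ranks
    (hypothetical helper; not part of the PR).
    """
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank must be in [0, ep_size)")
    per_rank = num_experts // ep_size
    return list(range(ep_rank * per_rank, (ep_rank + 1) * per_rank))


# 4 experts with expert parallel size 2: each device keeps 2 local experts
print(local_expert_ids(4, ep_size=2, ep_rank=0))  # [0, 1]
print(local_expert_ids(4, ep_size=2, ep_rank=1))  # [2, 3]
```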

huangting4201 · 2024-02-26T07:00:34Z

internlm/utils/parallel.py

@@ -71,6 +72,14 @@ def is_tensor_expert_data_parallel_parameter(p):
    )


+def is_weight_expert_data_parallel_parameter(p):
+    return (
+        gpc.is_initialized(ParallelMode.TENSOR)


Shouldn't this be gpc.is_initialized(ParallelMode.WEIGHT)?

blankde added 9 commits February 4, 2024 14:57

impl isp communication groups

0f97d0a

impl no overlap isp

3ac714d

add comments for Initializer_Expert_Weight_Data

6d3855a

fix bugs

074567c

support overlap isp

343c455

test

5dae5c7

fix wo module check error

d52d28f

merge with upstream develop

76a6080

refactor code

16e4318

mm-assistant bot assigned yhcc Feb 26, 2024

blankde marked this pull request as draft February 26, 2024 02:57

huangting4201 reviewed Feb 26, 2024

blankde added 3 commits March 7, 2024 11:10

support isp for moe checkpoint

92c5b2a

fix some bugs

55e4a7a

merge with upstream/develop

d65f1c2

blankde marked this pull request as ready for review March 12, 2024 06:26

blankde marked this pull request as draft March 14, 2024 14:16

blankde added 3 commits March 15, 2024 16:43

impl group linear

bcf57be

merge with upstream develop

2f0f2d7

support isp for megablock moe

341305a

blankde marked this pull request as ready for review March 18, 2024 11:20

blankde marked this pull request as draft March 18, 2024 11:20

blankde marked this pull request as ready for review March 18, 2024 11:21

blankde marked this pull request as draft March 18, 2024 11:22

blankde added 3 commits March 20, 2024 13:50

support mtp/msp/fsp for megablock moe

7a241df

Merge remote-tracking branch 'upstream/develop' into feat/support_wp_…

8b6f580

…for_moe

refactor code

849c176
