Mixture of Experts #19

Open · 3 of 11 tasks
xrsrke opened this issue Oct 25, 2023 · 0 comments
xrsrke commented Oct 25, 2023

APIs

import torch.nn as nn

from pipegoose.distributed import ParallelContext
from pipegoose.nn.expert_parallel import ExpertParallel, ExpertLoss

parallel_context = ParallelContext.from_torch(expert_parallel_size=8)

# user-defined expert module, router, and noise policy
# (CustomExpert, CustomRouter, CustomNoisePolicy are placeholders)
mlp = CustomExpert()
router = CustomRouter()
noise_policy = CustomNoisePolicy()
loss_func = nn.CrossEntropyLoss()

# "model" is a pretrained 🤗 transformers model to be converted into an MoE
model = ExpertParallel(
    model,
    expert=mlp,
    router=router,
    noise_policy=noise_policy,
    enable_tensor_parallelism=True,
    parallel_context=parallel_context,
).parallelize()

# wrap the base loss so the MoE auxiliary losses can be added, weighted by aux_weight
loss_func = ExpertLoss(loss_func, aux_weight=0.1)
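ExpertLoss wraps the base loss so that the MoE auxiliary terms can be added on top of the task loss, weighted by aux_weight. For the "Loss function (include aux and z loss)" TODO below, a minimal sketch of the two standard auxiliary terms, the load-balancing loss from Switch Transformers and the router z-loss from ST-MoE, might look like the following in plain PyTorch; the function names and the top-1 dispatch assumption are illustrative, not pipegoose's actual implementation.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices):
    # Switch Transformers aux loss: num_experts * sum_i (fraction_i * mean_prob_i)
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                   # [num_tokens, num_experts]
    dispatch = F.one_hot(expert_indices, num_experts).float()  # top-1 assignment assumed
    fraction_per_expert = dispatch.mean(dim=0)                 # f_i: share of tokens sent to each expert
    mean_prob_per_expert = probs.mean(dim=0)                   # P_i: mean router probability per expert
    return num_experts * torch.sum(fraction_per_expert * mean_prob_per_expert)

def router_z_loss(router_logits):
    # ST-MoE z-loss: penalizes large gate logits to keep routing numerically stable
    return torch.logsumexp(router_logits, dim=-1).square().mean()

The combined objective would then be roughly the task loss plus aux_weight times the load-balancing loss, plus a separately weighted z-loss.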

TODOs

  • Top-1, Top-2 router (see the router sketch after this list)

  • ExpertParallel (turn a 🤗 transformers model into an MoE automatically)

  • Does an expert's output embedding need to be multiplied by its corresponding router probability?

  • Make ExpertParallel work with data parallelism

    • Create a new process group for experts across the data parallelism dimension
    • Register a backward hook to synchronize gradients of the same expert across the data parallelism dimension
  • Optionally apply tensor parallelism to an expert layer

  • Make ExpertParallel work with pipeline parallelism

  • Make ExpertParallel work with ZeRO-1

  • Loss function (include aux loss and z-loss; see the sketch under APIs above)

  • Move inputs to the target expert's device
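As referenced in the "Top-1, Top-2 router" item above, a top-k router can be sketched in plain PyTorch as a linear gate followed by softmax and top-k selection. The interface below (returning top-k expert indices, renormalized gate weights, and the raw logits for the auxiliary losses) is an assumption for illustration, not pipegoose's actual router contract; a noise policy would typically perturb the gate logits before the top-k selection.

from torch import nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model, num_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # token -> expert scores
        self.k = k  # k=1 for top-1 routing, k=2 for top-2 routing

    def forward(self, hidden_states):
        # hidden_states: [num_tokens, d_model]
        router_logits = self.gate(hidden_states)                 # [num_tokens, num_experts]
        probs = F.softmax(router_logits, dim=-1)
        topk_weights, topk_indices = probs.topk(self.k, dim=-1)  # [num_tokens, k]
        # renormalize so each token's selected expert weights sum to 1
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
        return topk_indices, topk_weights, router_logits

The raw router_logits returned here are what the auxiliary losses sketched under APIs would consume.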

Engineering Reading

  • Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
  • DeepSpeed-TED: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
  • FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

MoE Reading

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models
  • Mixture-of-Experts with Expert Choice Routing
xrsrke added the "help wanted" label on Oct 25, 2023
xrsrke self-assigned this on Oct 25, 2023
xrsrke removed the "help wanted" label on Nov 14, 2023