
Model partitioning for pipeline parallelism #6

Open
xrsrke opened this issue Oct 25, 2023 · 1 comment

xrsrke (Owner) commented Oct 25, 2023

Basically, take a transformers model and num_pipeline_stage as arguments, then divide the model like this:
The first stage and the last stage must include the embedding layer and the lm_head, respectively.
All other stages in between should be divided evenly.
For example, if we have [embedding layer] > [8 x transformer blocks] > [language model head] and we want to shard them into 5 pipeline stages:

  • The first partition includes the embedding layer and the first block.
  • The 3 partitions in between each consist of 2 transformer blocks.
  • The last partition includes the language model head and the last block.

The goal is to arrange the first and the last pipeline stages so they do not become bottlenecks in terms of training speed, while all stages in between are distributed evenly to balance the computation.
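
A minimal sketch of that scheme, assuming a HuggingFace GPT-2-style layout where the blocks live at `model.transformer.h`, the embeddings at `model.transformer.wte` / `model.transformer.wpe`, and the head at `model.lm_head` (the attribute names and the even divisibility of the middle blocks are assumptions of this sketch, not part of the issue):

```python
from typing import List

from torch import nn


def naive_partition(model: nn.Module, num_pipeline_stage: int) -> List[List[nn.Module]]:
    """Group a GPT-2-style model's submodules into pipeline stages.

    First stage: token/position embeddings + the first block.
    Last stage: the last block + lm_head.
    Middle stages: an even share of the remaining blocks.
    """
    # Attribute names (transformer.wte, transformer.wpe, transformer.h, lm_head)
    # follow the HuggingFace GPT-2 layout and are an assumption of this sketch.
    wte, wpe = model.transformer.wte, model.transformer.wpe
    blocks = list(model.transformer.h)
    lm_head = model.lm_head

    num_middle = num_pipeline_stage - 2
    # Assumes (len(blocks) - 2) is divisible by the number of middle stages.
    blocks_per_middle = (len(blocks) - 2) // num_middle

    partitions = [[wte, wpe, blocks[0]]]
    idx = 1
    for _ in range(num_middle):
        partitions.append(blocks[idx : idx + blocks_per_middle])
        idx += blocks_per_middle
    partitions.append([blocks[idx], lm_head])
    return partitions
```

With the example above (8 blocks, 5 stages) this yields [wte, wpe, block 0], three middle stages of 2 blocks each, and [block 7, lm_head].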
xrsrke added this to the v1 milestone Oct 28, 2023
abourramouss (Contributor) commented:

PR #28 is a first approach. I am trying to understand how to combine wte and wpe.
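
For context on the wte/wpe question: in a standard GPT-2 forward pass the two embeddings are summed, so one possible way (a hypothetical sketch, not necessarily what PR #28 does) to expose them as a single first-stage module is:

```python
import torch
from torch import nn


class EmbeddingStage(nn.Module):
    """Hypothetical wrapper that fuses wte (token) and wpe (position) embeddings
    into one module for the first pipeline stage, mirroring the standard GPT-2
    forward pass where the two embeddings are summed."""

    def __init__(self, wte: nn.Embedding, wpe: nn.Embedding):
        super().__init__()
        self.wte = wte
        self.wpe = wpe

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Position ids are just 0..seq_len-1, broadcast over the batch dimension.
        position_ids = torch.arange(input_ids.size(-1), device=input_ids.device)
        return self.wte(input_ids) + self.wpe(position_ids)
```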
