
Model partitioning for pipeline parallelism #6

Open
xrsrke opened this issue Oct 25, 2023 · 1 comment

xrsrke (Owner) commented Oct 25, 2023

Basically, take a transformers model and num_pipeline_stage as arguments, then divide the model like this:
The first stage and the last stage must include the embedding layer and the lm_head, respectively.
All other stages in between should be divided evenly.
For example, if we have [embedding layer] > [8 x transformer blocks] > [language model head] and we want to shard them into 5 pipeline stages:

  • The first partition includes the embedding layer and the first block.
  • The 3 partitions in between each consist of 2 transformer blocks.
  • The last partition includes the language model head and the last block.

The goal is to arrange the first and the last pipeline stages so they do not become bottlenecks in terms of training speed, while all stages in between are distributed evenly to balance the computation.
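
A minimal sketch of that scheme, assuming a HuggingFace GPT-2-style layout where the blocks live at `model.transformer.h`, the embeddings at `model.transformer.wte` / `model.transformer.wpe`, and the head at `model.lm_head` (the attribute names and the even divisibility of the middle blocks are assumptions of this sketch, not part of the issue):

```python
from typing import List

from torch import nn


def naive_partition(model: nn.Module, num_pipeline_stage: int) -> List[List[nn.Module]]:
    """Group a GPT-2-style model's submodules into pipeline stages.

    First stage: token/position embeddings + the first block.
    Last stage: the last block + lm_head.
    Middle stages: an even share of the remaining blocks.
    """
    # Attribute names (transformer.wte, transformer.wpe, transformer.h, lm_head)
    # follow the HuggingFace GPT-2 layout and are an assumption of this sketch.
    wte, wpe = model.transformer.wte, model.transformer.wpe
    blocks = list(model.transformer.h)
    lm_head = model.lm_head

    num_middle = num_pipeline_stage - 2
    # Assumes (len(blocks) - 2) is divisible by the number of middle stages.
    blocks_per_middle = (len(blocks) - 2) // num_middle

    partitions = [[wte, wpe, blocks[0]]]
    idx = 1
    for _ in range(num_middle):
        partitions.append(blocks[idx : idx + blocks_per_middle])
        idx += blocks_per_middle
    partitions.append([blocks[idx], lm_head])
    return partitions
```

With the example above (8 blocks, 5 stages) this yields [wte, wpe, block 0], three middle stages of 2 blocks each, and [block 7, lm_head].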
xrsrke added this to the v1 milestone Oct 28, 2023
abourramouss (Contributor) commented:

PR #28 is a first approach. I am trying to understand how to combine wte and wpe.
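
For context on the wte/wpe question: in a standard GPT-2 forward pass the two embeddings are summed, so one possible way (a hypothetical sketch, not necessarily what PR #28 does) to expose them as a single first-stage module is:

```python
import torch
from torch import nn


class EmbeddingStage(nn.Module):
    """Hypothetical wrapper that fuses wte (token) and wpe (position) embeddings
    into one module for the first pipeline stage, mirroring the standard GPT-2
    forward pass where the two embeddings are summed."""

    def __init__(self, wte: nn.Embedding, wpe: nn.Embedding):
        super().__init__()
        self.wte = wte
        self.wpe = wpe

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Position ids are just 0..seq_len-1, broadcast over the batch dimension.
        position_ids = torch.arange(input_ids.size(-1), device=input_ids.device)
        return self.wte(input_ids) + self.wpe(position_ids)
```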
