
Sequence Parallelism #22

Open · 6 tasks

xrsrke opened this issue Oct 25, 2023 · 1 comment
xrsrke commented Oct 25, 2023

Implement distributed attention following the approach in LightSeq, Colossal-AI, or DeepSpeed's sequence parallelism; we have not decided which one yet.

import torch

from pipegoose.nn.sequence_parallel.attention import DistributedAttention

# embed_dim, num_heads, q, k, v, and parallel_context are assumed to be defined.
local_attention = torch.nn.MultiheadAttention(embed_dim, num_heads)
attention = DistributedAttention(local_attention, parallel_context)
outputs = attention(q, k, v)

# MultiheadAttention returns (output, attention weights), so compare against the
# first element, and use allclose rather than == for floating-point tensors.
assert torch.allclose(outputs, local_attention(q, k, v)[0])
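
A minimal sketch of how such a wrapper could work, assuming the inputs are sharded along the sequence dimension. The class name, the process-group argument, and the all-gather of keys/values are illustrative assumptions only, not pipegoose's, LightSeq's, or DeepSpeed's actual implementation (those exchange shards via all-to-all or point-to-point communication instead of an all-gather):

import torch
import torch.distributed as dist
from torch import nn


class NaiveDistributedAttention(nn.Module):
    """Illustrative only: each rank's query shard attends over all-gathered keys/values."""

    def __init__(self, local_attention: nn.Module, group: dist.ProcessGroup):
        super().__init__()
        self.local_attention = local_attention
        self.group = group

    def forward(self, q, k, v):
        # q, k, v: [local_seq_len, batch, embed_dim], sharded along the sequence dim.
        world_size = dist.get_world_size(self.group)

        # Gather the full keys and values so this rank can attend over the whole sequence.
        k_shards = [torch.empty_like(k) for _ in range(world_size)]
        v_shards = [torch.empty_like(v) for _ in range(world_size)]
        dist.all_gather(k_shards, k.contiguous(), group=self.group)
        dist.all_gather(v_shards, v.contiguous(), group=self.group)
        k_full = torch.cat(k_shards, dim=0)
        v_full = torch.cat(v_shards, dim=0)

        # Local attention: this rank's query shard against the complete keys/values.
        output, _ = self.local_attention(q, k_full, v_full)

        # Concatenating `output` across ranks along the sequence dim gives the full result.
        return output

On each rank this would wrap the same local torch.nn.MultiheadAttention and take the sequence-parallel process group; the sanity check after the TODO list shows why the per-rank outputs concatenate to the full attention output.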

TODOs

  • Take all the Triton kernels from LightSeq and structure them in a modular way: do not call a kernel directly, but go through a middle-man function.
  • Sequence parallelism scheduler.
  • Send and receive the query and key shards.
  • Calculate local attention.
  • Obtain the complete attention output (see the sanity check after this list).
  • Activation checkpointing.
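
To make the last two attention steps concrete, here is a small single-process sanity check in plain PyTorch (no pipegoose, no torch.distributed; the names are placeholders): softmax is computed per query row, so a query shard attending over the full keys and values produces exactly its rows of the full attention output, and concatenating the shards recovers the complete result.

import torch


def attention(q, k, v):
    # q: [q_len, dim], k and v: [kv_len, dim]
    scores = (q @ k.T) / k.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v


seq_len, dim, world_size = 8, 16, 2
q, k, v = (torch.randn(seq_len, dim) for _ in range(3))

reference = attention(q, k, v)

# Simulate sequence parallelism: each "rank" owns a contiguous shard of the queries
# and attends over the complete keys/values it has received from the other ranks.
local_outputs = [attention(q_shard, k, v) for q_shard in q.chunk(world_size, dim=0)]

# Concatenating the local outputs recovers the complete attention output.
assert torch.allclose(torch.cat(local_outputs, dim=0), reference, atol=1e-6)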

Reading

xrsrke self-assigned this Oct 25, 2023
@3outeille (Collaborator) commented

on it
