Fused Optimizer #13

Open
3 tasks
xrsrke opened this issue Oct 25, 2023 · 8 comments
Comments

@xrsrke
Owner

xrsrke commented Oct 25, 2023

Since our DistributedOptimizer takes another optimizer and turns it into ZeRO-1, can we make it produce a fused optimizer in the same way? It should take an optimizer and turn it into a fused ZeRO-1 optimizer in a generic way, like this:

APIs

from torch.optim import Adam
from pipegoose.optim import FusedOptim

optim = Adam(model.parameters(), lr=1e-3)
optim = FusedOptim(optim).fuse()

loss.backward()
optim.step()

TODO

  • Fused Adam
  • Fused SGD
  • Test all fused optimizers with DataParallel and ZeRO-1
@xrsrke added the "help wanted" label on Oct 25, 2023
@isamu-isozaki

@xrsrke
Owner Author

xrsrke commented Oct 26, 2023

@isamu-isozaki I was referring to a fused optimizer like FusedAdam from DeepSpeed (link). We fuse certain operations, such as element-wise operations, since these occupy the majority of the runtime during training.

Our goal is to enable the library to perform 3D parallelism in conjunction with DistributedOptimizer (ZeRO-1). We maintain a list of popular optimizers along with their fused versions. Then we create a mapping between a torch.optim.Optimizer and its corresponding fused version, which we subsequently feed to DistributedOptimizer. This is just one potential solution I have in mind :)
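
A minimal sketch of that mapping idea (the names FUSED_AVAILABLE and to_fused are hypothetical, and torch's built-in fused=True path for Adam/AdamW on recent PyTorch stands in for the custom kernels this issue proposes):

import torch
from torch.optim import Adam, AdamW

# Hypothetical registry: optimizer classes we have a fused counterpart for.
FUSED_AVAILABLE = {Adam, AdamW}

def to_fused(optim: torch.optim.Optimizer) -> torch.optim.Optimizer:
    # Rebuild the optimizer over the same param groups with the fused kernel enabled.
    if type(optim) not in FUSED_AVAILABLE:
        return optim  # no fused version known; keep the original optimizer
    groups = [dict(group, fused=True) for group in optim.param_groups]
    return type(optim)(groups)

The result could then be handed to DistributedOptimizer exactly like an unfused optimizer.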

@isamu-isozaki

@xrsrke I think this is definitely possible if we make a fused version of each optimizer beforehand, yup. For the above link, it was mainly for just converting generic PyTorch code to a fused version.
Then do you think this is pretty much the same issue as the porting CUDA kernels issue (or under it)?

@xrsrke
Owner Author

xrsrke commented Oct 26, 2023

@isamu-isozaki

Then do you think this is pretty much the same issue as the porting CUDA kernels issue?

Yes.

For the above link, it was mainly for just converting generic PyTorch code to a fused version.

Or maybe we could fuse the entire model after parallelizing it (using TensorParallel, PipelineParallel, ...).

Would you like to take on both issues (this one and the CUDA kernel port)? I will merge them for you and assign them to you. Let me know if you need a GPU for testing, although any GPU should work here, since we will just be testing the correctness of the fused version.
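
A rough sketch of that correctness check, assuming the fused optimizer exposes the usual torch.optim interface (torch's built-in fused Adam stands in here for the custom kernel under test; requires a CUDA device):

import copy
import torch
from torch.optim import Adam

def test_fused_matches_reference(steps=5, lr=1e-3):
    torch.manual_seed(0)
    ref_model = torch.nn.Linear(16, 16).cuda()
    fused_model = copy.deepcopy(ref_model)

    ref_optim = Adam(ref_model.parameters(), lr=lr)
    # Stand-in for the custom fused Adam under test.
    fused_optim = Adam(fused_model.parameters(), lr=lr, fused=True)

    for _ in range(steps):
        x = torch.randn(4, 16, device="cuda")
        for model, optim in ((ref_model, ref_optim), (fused_model, fused_optim)):
            optim.zero_grad()
            model(x).sum().backward()
            optim.step()

    # Both optimizers should produce (numerically) identical parameters.
    for p_ref, p_fused in zip(ref_model.parameters(), fused_model.parameters()):
        torch.testing.assert_close(p_ref, p_fused, atol=1e-6, rtol=0)

The same kind of comparison would be repeated after wrapping both optimizers with DistributedOptimizer for the DataParallel / ZeRO-1 cases.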

@isamu-isozaki

@xrsrke Sounds good. I think I can do the initial setup for how we want the CUDA code formatted, plus some examples, and then we can probably start accepting CUDA kernel PR contributions for each optimizer.

@xrsrke
Owner Author

xrsrke commented Oct 26, 2023

Thank you. @isamu-isozaki Also, if you look at those fused optimizers, the only thing they do is replace one or a few operations with their fused versions (am I missing something?) and keep everything else the same. So it would be amazing if we could take an arbitrary optimizer, replace only the operations for which we have a fused version available, and keep everything else the same, so that if users have tweaks in their optimizer, they can still use them. What do you think?
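
A tiny sketch of that idea, using torch's (private) multi-tensor _foreach ops as a stand-in for a real fused kernel, with plain SGD as the example (function names are hypothetical):

import torch

def sgd_step_reference(params, lr):
    # Original per-parameter loop: one elementwise kernel launch per tensor.
    for p in params:
        if p.grad is not None:
            p.data.add_(p.grad, alpha=-lr)

def sgd_step_partially_fused(params, lr):
    # Same optimizer logic, but the one op we have a fused/multi-tensor version
    # of (the weight update) is swapped out; everything else stays untouched.
    with_grad = [p for p in params if p.grad is not None]
    torch._foreach_add_([p.data for p in with_grad],
                        [p.grad for p in with_grad],
                        alpha=-lr)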

@isamu-isozaki

isamu-isozaki commented Oct 26, 2023

@xrsrke I think I mostly get you, but I think that will lead to decreased performance: the more segmented the update is, the more global reads/writes you incur, and those are the bottleneck for CUDA performance. So overall, replacing everything with CUDA to minimize read-writes tends to be the fastest (if the CUDA is optimized). For the design, I'm thinking of something like https://github.com/lucidrains/lion-pytorch, but with CUDA instead of Triton. (I'm mainly familiar with Triton + optimizers, where they pretty much just replace the main chunk of the update with Triton.)
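
For reference, a minimal sketch of the lion-pytorch-style approach (a single Triton kernel replacing the whole elementwise update), shown here for plain SGD; the kernel and launcher names are hypothetical and contiguous CUDA tensors are assumed:

import torch
import triton
import triton.language as tl

@triton.jit
def fused_sgd_kernel(param_ptr, grad_ptr, lr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    # One pass over global memory: load param and grad, update, store param.
    p = tl.load(param_ptr + offsets, mask=mask)
    g = tl.load(grad_ptr + offsets, mask=mask)
    tl.store(param_ptr + offsets, p - lr * g, mask=mask)

def fused_sgd_step(param: torch.Tensor, lr: float):
    assert param.is_cuda and param.grad is not None
    n = param.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_sgd_kernel[grid](param, param.grad, lr, n, BLOCK_SIZE=1024)

A fused Adam would fold the moment updates into that same single load/update/store pass, which is where the read/write savings come from.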

@xrsrke
Owner Author

xrsrke commented Oct 27, 2023

"So overall, replacing everything with CUDA to minimize read-writes tend to be the fastest(if CUDA is optimized)."

@isamu-isozaki
That sounds good. If that yields better results, then go for it. Thank you.

@xrsrke removed the "help wanted" label on Nov 14, 2023