
Update xpu related device setting #446

Open
wants to merge 2 commits into base: main

Conversation

@zhuhong61 commented Apr 19, 2024

What does this PR do?

This PR updates some XPU-related logic for correct device support.

Fixes # (issue)

Feature/Issue validation/testing

Please describe the tests that you ran to verify your changes and summarize the relevant results. Provide instructions so the tests can be reproduced.
Please also list any relevant details of your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Thanks for contributing 🎉!

@zhuhong61 (Author)

@abhilash1910 Could you please help review? Thanks!

@abhilash1910 (Contributor) left a comment

Thanks for this update!
@HamidShojanazeri could you help take a look?
cc @gujinghui

@@ -55,13 +58,15 @@ def train(model, train_dataloader,eval_dataloader, tokenizer, optimizer, lr_sche
     if train_config.use_fp16 and train_config.enable_fsdp:
         scaler = ShardedGradScaler()
     elif train_config.use_fp16 and not train_config.enable_fsdp:
-        scaler = torch.cuda.amp.GradScaler()
+        scaler = torch.xpu.amp.GradScaler() if is_xpu_available() else torch.cuda.amp.GradScaler()
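
For context, the changed branch reads roughly as in the sketch below: a minimal, self-contained version pulled out of train(), assuming is_xpu_available is accelerate's helper and ShardedGradScaler comes from torch.distributed.fsdp (the thread below questions whether torch.xpu.amp.GradScaler() exists at all).

    import torch
    from accelerate.utils import is_xpu_available  # assumed source of the helper
    from torch.distributed.fsdp.sharded_grad_scaler import ShardedGradScaler

    def pick_grad_scaler(train_config):
        """Choose a gradient scaler for fp16 training based on the run config."""
        if train_config.use_fp16 and train_config.enable_fsdp:
            # FSDP shards parameters, so it needs the sharded scaler.
            return ShardedGradScaler()
        if train_config.use_fp16 and not train_config.enable_fsdp:
            # Proposed change: prefer an XPU scaler when an XPU is present,
            # otherwise fall back to the CUDA scaler. Reviewers question below
            # whether torch.xpu.amp.GradScaler() is actually available yet.
            if is_xpu_available():
                return torch.xpu.amp.GradScaler()
            return torch.cuda.amp.GradScaler()
        return None  # fp16 disabled: no scaler needed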

Do we really have torch.xpu.amp.GradScaler() already?

Contributor

Yes, I also had a doubt regarding this; I am not sure we have xpu.amp.GradScaler().

@zhuhong61 (Author) commented Apr 22, 2024

> Do we really have torch.xpu.amp.GradScaler() already?

Thanks @gujinghui, @abhilash1910. Yes, we don't support torch.xpu.amp.GradScaler() yet; I will remove it and update the code once we support this API. By the way, do we need a warning or exit message to indicate that torch.xpu.amp.GradScaler() is not supported on XPU?

Should be yes? An assert to stop the workload with a graceful exit message?
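
A minimal sketch of what that guard could look like, assuming is_xpu_available is accelerate's helper; the function name, placement, and message are illustrative assumptions, not part of this PR:

    import torch
    from accelerate.utils import is_xpu_available  # assumed helper, as in the hunk above

    def make_fp16_scaler(train_config):
        """Return a GradScaler for non-FSDP fp16 runs, or stop early on XPU."""
        if not train_config.use_fp16 or train_config.enable_fsdp:
            return None
        # torch.xpu.amp.GradScaler() is not supported yet, so fail loudly
        # with a clear message instead of silently taking the CUDA path.
        assert not is_xpu_available(), (
            "fp16 gradient scaling is not supported on XPU yet; "
            "please disable use_fp16."
        )
        return torch.cuda.amp.GradScaler()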

BrodysgotMs commented Apr 23, 2024 via email
