
[BUG] [Shardformer]: Error in blip2 testing with half precision #5600

Open
insujang opened this issue Apr 15, 2024 · 1 comment
Labels: bug Something isn't working

@insujang (Contributor)
🐛 Describe the bug

  1. It seems the blip2 tests don't work correctly at all when the model is in half precision (torch.float16).
  2. With bfloat16, colossalai.shardformer.layer.FusedLayerNorm doesn't seem to work correctly.

https://github.com/hpcaitech/ColossalAI/blob/main/tests/test_shardformer/test_model/test_shard_blip2.py
This test file passes as-is.

But if I change dtype to torch.float16, it fails:

E         File "test_shard_blip2.py", line 28, in check_forward_backward
E           assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E         File "colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E           assert_hf_output_close(
E         File "colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E           assert_close(
E         File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E           raise error_metas[0].to_error(msg)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 5947392 / 5947392 (100.0%)
E       Greatest absolute difference: nan at index (0, 0) (up to 1e-06 allowed)
E       Greatest relative difference: nan at index (0, 0) (up to 1e-05 allowed)
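A plausible cause for the all-NaN mismatch above (my assumption, not confirmed from the trace) is float16's narrow range: the maximum finite value is about 65504, so intermediate activations can overflow to inf, and operations like inf - inf then produce NaN everywhere. A minimal stdlib sketch of both the precision and range limits, using Python's struct `'e'` (IEEE-754 half) format:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Precision loss: 0.1 is not exactly representable in fp16.
print(to_fp16(0.1))  # close to, but not equal to, 0.1

# Range limit: values above ~65504 cannot even be packed as fp16;
# in torch, the same overflow silently produces inf instead of raising.
try:
    struct.pack('<e', 70000.0)
except OverflowError as e:
    print("overflow:", e)
```

If this is what happens, the fp32 reference output stays finite while the fp16 shard output becomes NaN, which would explain the 100% mismatch rather than a small tolerance violation.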

With dtype=torch.bfloat16 and without enable_fused_normalization it passes, but with enable_fused_normalization enabled it fails again:

E         File "test_shard_blip2.py", line 28, in check_forward_backward
E           assert_hf_output_close(org_output, shard_output, ignore_keys=["past_key_values"])
E         File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E           assert_hf_output_close(
E         File "/colossalai/testing/comparison.py", line 125, in assert_hf_output_close
E           assert_hf_output_close(
E         File "/colossalai/testing/comparison.py", line 149, in assert_hf_output_close
E           assert_close(
E         File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1520, in assert_close
E           raise error_metas[0].to_error(msg)
E       AssertionError: Tensor-likes are not close!
E       
E       Mismatched elements: 24271 / 2161696 (1.1%)
E       Greatest absolute difference: 0.0078125 at index (0, 3, 47) (up to 1e-05 allowed)
E       Greatest relative difference: 169.0 at index (0, 3, 47325) (up to 1e-05 allowed)
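Notably, the greatest absolute difference reported above, 0.0078125, is exactly 2**-7, which is one bfloat16 ulp for values in [1, 2): bfloat16 keeps only 8 mantissa bits (7 explicit), so a fused kernel that accumulates or rounds differently from the unfused path can plausibly differ by one ulp. A stdlib sketch (my simplification: it simulates bfloat16 by truncating the low 16 bits of a float32 bit pattern, whereas real kernels usually round to nearest):

```python
import struct

def to_bf16_trunc(x: float) -> float:
    """Simulate bfloat16 by zeroing the low 16 bits of the float32
    bit pattern (truncation; real hardware typically rounds)."""
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

# One bf16 ulp for values in [1, 2) is 2**-7 == 0.0078125 -- the exact
# greatest absolute difference in the failure above.
print(2 ** -7)

# Any value in [1, 2) lands on a multiple of 2**-7, so truncation
# error stays strictly below one ulp.
x = 1.994
print(abs(x - to_bf16_trunc(x)))
```

If the mismatches really are one-ulp rounding differences between the fused and unfused layernorm, that points to a tolerance problem in the test rather than a wrong result.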

Environment

torch 2.2.1 / CUDA 12.1
colossalai 0.3.6
transformers 4.36.0

@insujang insujang added the bug Something isn't working label Apr 15, 2024
@insujang (Contributor, Author)
I am not sure whether this is a bug or an unavoidable error due to lower precision, i.e., whether the test was intended to run only on fp32. I would appreciate any insights you could share. Thanks.
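One observation on the tolerance question: the traces show the comparison allowed only 1e-06 absolute / 1e-05 relative error, which are fp32-level bounds. For reference, torch.testing.assert_close uses much looser defaults for reduced precision (rtol 1e-3 for float16 and 1.6e-2 for bfloat16, versus 1.3e-6 for float32). A pure-Python sketch of a dtype-aware check (the table and function names here are my own illustration, not ColossalAI's API):

```python
import math

# Per-dtype (rtol, atol), loosely modeled on torch.testing.assert_close
# defaults for float32 / float16 / bfloat16.
DTYPE_TOLS = {
    "float32": (1.3e-6, 1e-5),
    "float16": (1e-3, 1e-5),
    "bfloat16": (1.6e-2, 1e-5),
}

def is_close(a: float, b: float, dtype: str) -> bool:
    """Elementwise closeness check with tolerances scaled to the dtype."""
    rtol, atol = DTYPE_TOLS[dtype]
    return math.isclose(a, b, rel_tol=rtol, abs_tol=atol)

# A one-ulp bf16 mismatch (relative error ~0.4%) fails under fp32
# tolerances but passes under bf16 tolerances.
a, b = 1.9921875, 2.0
print(is_close(a, b, "float32"))
print(is_close(a, b, "bfloat16"))
```

If the intent is to run these tests in fp16/bf16 at all, passing dtype-dependent rtol/atol into the comparison (rather than fixed fp32 bounds) might be the fix, though that is for the maintainers to confirm.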

@Edenzzzz Edenzzzz assigned Edenzzzz and ver217 and unassigned Edenzzzz Apr 24, 2024