[Bug] Long context: Decoding performance degradation #1608
Comments
@lzhangzz is working on this issue. The related PR is #1606.
With the script above, on A100-SXM4-80G:
@lzhangzz Can you test the above llama3_lmdeploy.py script again? I noticed that the decoding speed dropped after upgrading to the new version 0.4.2.
@DayDayupupupup I got almost the same result on v0.4.2 compared with #1606.
mark
@lzhangzz Thanks for your reply. In fact, the decoding performance is consistent between 0.4.2 and 0.4.1 with PR1606. Correcting the data above: the 0.4.1 + PR1606 numbers were wrong. With 0.4.1 I made some changes to support head_dim=64 and then merged PR1606, so TPOT=12.24 ms at 200k is not accurate.
// src/turbomind/kernels/attention/decoding.cu, line 32
// I modified 128 to 64, so I got TPOT=12.24 ms (0.4.1 + my changes + PR1606 + LLAMA3)
// Using the original 128, TPOT=16.56 ms (0.4.1 + PR1606 + LLAMA3)
static constexpr std::integral_constant<int, 128> kHeadDim{};
Another question: when is head_dim=64 expected to be supported?
Likely in July.
You may try to find the bottleneck using Nsight Compute.
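For reference, a typical Nsight Compute invocation to capture kernel metrics for the decode run might look like the line below; the report file name is a placeholder, and the script is the llama3_lmdeploy.py mentioned above:
ncu --set full --target-processes all -o decode_profile python llama3_lmdeploy.py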
Thx, I'll give it a try.
Checklist
Describe the bug
Compared with vLLM 0.4.2, when the input length is 200k, the decoding time increases significantly.
Reproduction
TEST Environment
Model
Testing script
TEST MODE
llama3_lmdeploy.py
llama3_vllm.py
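The two scripts above are collapsed in the original issue. As an illustrative sketch only (not the author's llama3_lmdeploy.py; the model path, prompt construction, and token counts are assumptions), a long-context decode timing with the lmdeploy pipeline API could look like:
import time
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

MODEL_PATH = '/path/to/llama3-long-context-model'  # placeholder, not from the issue
prompt = 'lorem ipsum ' * 50000                    # stand-in for a ~200k-token input

# session_len must cover the prompt plus the tokens to be generated
pipe = pipeline(MODEL_PATH, backend_config=TurbomindEngineConfig(session_len=210000))
gen_config = GenerationConfig(max_new_tokens=128)

start = time.perf_counter()
pipe([prompt], gen_config=gen_config)
elapsed = time.perf_counter() - start

# Crude per-output-token figure; a proper benchmark would time prefill and decode separately.
print(f'~{elapsed / gen_config.max_new_tokens * 1000:.2f} ms per output token (prefill included)')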
Performance comparison
Environment
Error traceback
No response