
Optimize moe #1520

Merged · 8 commits into InternLM:main · May 22, 2024

Conversation

grimoire (Collaborator) commented Apr 29, 2024

deepseek-moe-16b-chat

concurrency: 256
elapsed_time: 250.754s

first token latency(s)(min, max, ave): 1.940, 13.106, 3.160
per-token latency(s) percentile(50, 75, 95, 99): [0.052, 0.067, 0.195, 0.229]

number of prompt tokens: 721793
number of completion tokens: 668265
token throughput (completion token): 2665.019 token/s
token throughput (prompt + completion token): 5543.506 token/s
RPS (request per second): 11.964 req/s
RPM (request per minute): 717.834 req/min

Qwen1.5-MoE-A2.7B-Chat TP=2

concurrency: 256
elapsed_time: 224.461s

first token latency(s)(min, max, ave): 1.834, 14.229, 3.098
per-token latency(s) percentile(50, 75, 95, 99): [0.057, 0.06, 0.13, 0.154]

number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 2766.490 token/s
token throughput (prompt + completion token): 5796.291 token/s
RPS (request per second): 13.365 req/s
RPM (request per minute): 801.920 req/min

Mixtral 8x7B TP=2

concurrency: 256
elapsed_time: 396.564s

first token latency(s)(min, max, ave): 2.646, 35.353, 6.054
per-token latency(s) percentile(50, 75, 95, 99): [0.07, 0.074, 0.308, 0.373]

number of prompt tokens: 741804
number of completion tokens: 712850
token throughput (completion token): 1797.565 token/s
token throughput (prompt + completion token): 3668.142 token/s
RPS (request per second): 7.565 req/s
RPM (request per minute): 453.899 req/min
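
A quick sanity check on how these summary figures relate to one another (assumed formulas; the request count is inferred from RPS × elapsed_time rather than reported):

```python
# Cross-check the deepseek-moe-16b-chat summary above.
# Assumption: the profiler computes throughput as tokens / elapsed_time.
elapsed = 250.754                 # s
prompt_tokens = 721793
completion_tokens = 668265
rps = 11.964                      # reported req/s

print(completion_tokens / elapsed)                    # ~2665.0 token/s, matches completion throughput
print((prompt_tokens + completion_tokens) / elapsed)  # ~5543.5 token/s, matches total throughput
print(rps * 60)                                       # ~717.8 req/min, matches RPM
print(round(rps * elapsed))                           # ~3000 -> implies the run sent ~3000 requests
```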

zhulinJulia24 (Collaborator) commented May 8, 2024

Qwen1.5-MoE-A2.7B-Chat

CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tp 2
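
(The client command isn't shown for this run; presumably it mirrored the deepseek invocation later in this thread, e.g. the following, with the default server port assumed — hypothetical:)

python benchmark/profile_restful_api.py localhost:23333 /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat ./ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 256 --stream-output True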

main

concurrency: 256
elapsed_time: 4845.707s

first_token latency(min, max, ave): 0.004s, 3.831s, 1.272s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 209.753 token/s
token throughput (prompt + completion token): 446.742 token/s
RPS (request per second): 1.032 req/s
RPM (request per minute): 61.910 req/min

this pr
concurrency: 256
elapsed_time: 1865.420s

first_token latency(min, max, ave): 0.006s, 2.110s, 0.436s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 544.864 token/s
token throughput (prompt + completion token): 1160.480 token/s
RPS (request per second): 2.680 req/s
RPM (request per minute): 160.822 req/min

vllm 0.4.0

python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tensor-parallel-size 2
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 83.39
Total input tokens: 217393
Total generated tokens: 201441
Request throughput (req/s): 11.99
Input token throughput (tok/s): 2606.86
Output token throughput (tok/s): 2415.57
--Time to First Token--
Mean TTFT (ms): 26022.18
Median TTFT (ms): 24357.74
P99 TTFT (ms): 62867.29
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 96.21
Median TPOT (ms): 97.28
P99 TPOT (ms): 145.80

zhulinJulia24 (Collaborator) commented:

deepseek-moe-16b-chat

this pr

CUDA_VISIBLE_DEVICES=6 lmdeploy serve api_server /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --server-port 24333
python benchmark/profile_restful_api.py localhost:24333 /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat ../ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 256 --stream-output True --num-prompts 1000

in progress

vllm 0.4.0

CUDA_VISIBLE_DEVICES=6 python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --trust-remote-code
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

Successful requests: 1000
Benchmark duration (s): 83.97
Total input tokens: 236158
Total generated tokens: 158473
Request throughput (req/s): 11.91
Input token throughput (tok/s): 2812.46
Output token throughput (tok/s): 1887.29
--Time to First Token--
Mean TTFT (ms): 16670.32
Median TTFT (ms): 5263.92
P99 TTFT (ms): 57460.03
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 120.28
Median TPOT (ms): 115.89
P99 TPOT (ms): 329.52

grimoire marked this pull request as draft May 9, 2024 07:50
zhulinJulia24 (Collaborator) commented:

mistralai/Mistral-7B-Instruct-v0.1, single GPU

== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 86.02
Total input tokens: 241080
Total generated tokens: 173935
Request throughput (req/s): 11.62
Input token throughput (tok/s): 2802.46
Output token throughput (tok/s): 2021.93
--Time to First Token--
Mean TTFT (ms): 16705.49
Median TTFT (ms): 6305.76
P99 TTFT (ms): 58420.31
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 113.28
Median TPOT (ms): 108.95
P99 TPOT (ms): 385.06

grimoire marked this pull request as ready for review May 14, 2024 11:57
grimoire mentioned this pull request May 16, 2024
zhulinJulia24 (Collaborator) commented:

| dataset | version | metric | mode | internlm2-chat-7b-pytorch | internlm2-chat-20b-pytorch | llama-2-7b-chat-pytorch | qwen1.5-7b-chat-pytorch | qwen1.5-moe-2.7b-chat-pytorch | llama-3-8b-instruct-pytorch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ceval | - | naive_average | gen | 58.2 | 63.45 | 28.51 | 70.67 | 77.24 | 50.63 |
| mmlu | - | naive_average | gen | 58.41 | 60.96 | 35.63 | 61.47 | 60.87 | 54.5 |
| WiC | d06864 | accuracy | gen | 58.62 | 60.19 | 0 | 63.01 | 60.03 | 21.32 |
| WSC | 7902a7 | accuracy | gen | 56.73 | 50 | 0 | 39.42 | 40.38 | 30.77 |
| triviaqa | 2121ce | score | gen | 56.82 | 63.87 | 56.12 | 44.63 | 54.59 | 63.83 |
| gsm8k | 1d7fe4 | accuracy | gen | 33.36 | 53.6 | 14.03 | 5.61 | 22.21 | 25.78 |
| race-middle | 9a54b6 | accuracy | gen | 76.32 | 86.84 | 58.57 | 87.6 | 82.31 | 86.35 |
| race-high | 9a54b6 | accuracy | gen | 75.19 | 83.7 | 51.17 | 82.7 | 79.3 | 80.93 |

lvhan028 (Collaborator) commented:

Hi @zhulinJulia24, please help double-check the inference speed.

zhulinJulia24 (Collaborator) commented May 20, 2024

deepseek-moe-16b-chat

this pr

CUDA_VISIBLE_DEVICES=6 lmdeploy serve api_server /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --server-port 24333
python benchmark/profile_restful_api.py localhost:24333 /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat ../ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 256 --stream-output True --num-prompts 1000

in progress

vllm 0.4.0

CUDA_VISIBLE_DEVICES=6 python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --trust-remote-code
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

Successful requests: 1000
Benchmark duration (s): 83.97
Total input tokens: 236158
Total generated tokens: 158473
Request throughput (req/s): 11.91
Input token throughput (tok/s): 2812.46
Output token throughput (tok/s): 1887.29
--Time to First Token--
Mean TTFT (ms): 16670.32
Median TTFT (ms): 5263.92
P99 TTFT (ms): 57460.03
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 120.28
Median TPOT (ms): 115.89
P99 TPOT (ms): 329.52

newest code:

concurrency: 256
elapsed_time: 350.753s

first_token latency(min, max, ave): 0.352s, 64.760s, 14.496s

number of prompt tokens: 721793
number of completion tokens: 668265
token throughput (completion token): 1905.230 token/s
token throughput (prompt + completion token): 3963.069 token/s
RPS (request per second): 8.553 req/s
RPM (request per minute): 513.182 req/min

zhulinJulia24 (Collaborator) commented:

Qwen1.5-MoE-A2.7B-Chat

CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tp 2

main

concurrency: 256
elapsed_time: 4845.707s

first_token latency(min, max, ave): 0.004s, 3.831s, 1.272s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 209.753 token/s
token throughput (prompt + completion token): 446.742 token/s
RPS (request per second): 1.032 req/s
RPM (request per minute): 61.910 req/min

this pr

concurrency: 256
elapsed_time: 1865.420s

first_token latency(min, max, ave): 0.006s, 2.110s, 0.436s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 544.864 token/s
token throughput (prompt + completion token): 1160.480 token/s
RPS (request per second): 2.680 req/s
RPM (request per minute): 160.822 req/min

vllm 0.4.0

python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tensor-parallel-size 2
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 83.39
Total input tokens: 217393
Total generated tokens: 201441
Request throughput (req/s): 11.99
Input token throughput (tok/s): 2606.86
Output token throughput (tok/s): 2415.57
--Time to First Token--
Mean TTFT (ms): 26022.18
Median TTFT (ms): 24357.74
P99 TTFT (ms): 62867.29
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 96.21
Median TPOT (ms): 97.28
P99 TPOT (ms): 145.80

newest code:

concurrency: 256
elapsed_time: 341.970s

first_token latency(min, max, ave): 1.632s, 118.259s, 14.703s

number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 1815.861 token/s
token throughput (prompt + completion token): 3804.554 token/s
RPS (request per second): 8.773 req/s
RPM (request per minute): 526.362 req/min

zhulinJulia24 (Collaborator) commented May 20, 2024

mistralai/Mistral-7B-Instruct-v0.1, single GPU

vllm
== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 86.02
Total input tokens: 241080
Total generated tokens: 173935
Request throughput (req/s): 11.62
Input token throughput (tok/s): 2802.46
Output token throughput (tok/s): 2021.93
--Time to First Token--
Mean TTFT (ms): 16705.49
Median TTFT (ms): 6305.76
P99 TTFT (ms): 58420.31
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 113.28
Median TPOT (ms): 108.95
P99 TPOT (ms): 385.06

newest code:

concurrency: 256
elapsed_time: 321.985s

first_token latency(min, max, ave): 1.181s, 61.126s, 13.595s

number of prompt tokens: 741804
number of completion tokens: 712850
token throughput (completion token): 2213.925 token/s
token throughput (prompt + completion token): 4517.773 token/s
RPS (request per second): 9.317 req/s
RPM (request per minute): 559.033 req/min

If I use the vllm benchmark script:

Successful requests: 1000
Benchmark duration (s): 84.30
Total input tokens: 241080
Total generated tokens: 151005
Request throughput (req/s): 11.86
Input token throughput (tok/s): 2859.71
Output token throughput (tok/s): 1791.23
Mean TTFT (ms): 22458.00
Median TTFT (ms): 16685.52
P99 TTFT (ms): 63616.28
Mean TPOT (ms): 50.99
Median TPOT (ms): 45.98
P99 TPOT (ms): 165.68

zhyncs (Contributor) commented May 20, 2024

Currently, the MoE models still show a gap in throughput and latency compared to vLLM.
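
For context on where such a gap can come from: an MoE forward adds a top-k routing step plus scattered per-expert GEMMs on top of a dense model. A minimal PyTorch sketch of generic top-k gating (an illustration under assumed shapes and hypothetical `experts` modules, not the kernel this PR implements):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate_w, experts, top_k=2):
    # x: [num_tokens, hidden]; gate_w: [hidden, num_experts] router weights
    # experts: list of per-expert FFN callables (hypothetical modules)
    probs = F.softmax(x @ gate_w, dim=-1)                  # routing probabilities
    weights, idx = torch.topk(probs, top_k, dim=-1)        # [tokens, top_k]
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
        if rows.numel() == 0:
            continue
        # a token selects each expert at most once, so indices in `rows` are unique
        out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
    return out
```

The per-expert Python loop is the naive formulation; optimized implementations batch tokens per expert and fuse the gather/GEMM/scatter steps, which is presumably where the throughput gains benchmarked in this thread come from.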

RunningLeon (Collaborator) commented:

> mistralai/Mistral-7B-Instruct-v0.1, single GPU
> vllm
> == Serving Benchmark Result ==
> Successful requests: 1000
> Benchmark duration (s): 86.02
> Total input tokens: 241080
> Total generated tokens: 173935
> Request throughput (req/s): 11.62
> Input token throughput (tok/s): 2802.46
> Output token throughput (tok/s): 2021.93
> --Time to First Token--
> Mean TTFT (ms): 16705.49
> Median TTFT (ms): 6305.76
> P99 TTFT (ms): 58420.31
> --Time per Output Token (excl. 1st token)--
> Mean TPOT (ms): 113.28
> Median TPOT (ms): 108.95
> P99 TPOT (ms): 385.06
> ....

@zhulinJulia24 mistralai/Mistral-7B-Instruct-v0.1 does not have MoE; mistralai/Mixtral-8x7B-Instruct-v0.1 does. We should test that model if necessary.

grimoire mentioned this pull request May 21, 2024
RunningLeon (Collaborator) left a review comment:

LGTM

lvhan028 merged commit 5ce3ed8 into InternLM:main May 22, 2024
5 checks passed