
Optimize moe #1520

Merged · 8 commits into InternLM:main · May 22, 2024

Conversation

grimoire (Collaborator) commented Apr 29, 2024

deepseek-moe-16b-chat

concurrency: 256
elapsed_time: 250.754s

first token latency(s)(min, max, ave): 1.940, 13.106, 3.160
per-token latency(s) percentile(50, 75, 95, 99): [0.052, 0.067, 0.195, 0.229]

number of prompt tokens: 721793
number of completion tokens: 668265
token throughput (completion token): 2665.019 token/s
token throughput (prompt + completion token): 5543.506 token/s
RPS (request per second): 11.964 req/s
RPM (request per minute): 717.834 req/min

Qwen1.5-MoE-A2.7B-Chat TP=2

concurrency: 256
elapsed_time: 224.461s

first token latency(s)(min, max, ave): 1.834, 14.229, 3.098
per-token latency(s) percentile(50, 75, 95, 99): [0.057, 0.06, 0.13, 0.154]

number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 2766.490 token/s
token throughput (prompt + completion token): 5796.291 token/s
RPS (request per second): 13.365 req/s
RPM (request per minute): 801.920 req/min

Mixtral 8x7B TP=2

concurrency: 256
elapsed_time: 396.564s

first token latency(s)(min, max, ave): 2.646, 35.353, 6.054
per-token latency(s) percentile(50, 75, 95, 99): [0.07, 0.074, 0.308, 0.373]

number of prompt tokens: 741804
number of completion tokens: 712850
token throughput (completion token): 1797.565 token/s
token throughput (prompt + completion token): 3668.142 token/s
RPS (request per second): 7.565 req/s
RPM (request per minute): 453.899 req/min
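
A quick sanity check on how these summary figures relate to one another (assumed formulas; the request count is inferred from RPS × elapsed_time rather than reported):

```python
# Cross-check the deepseek-moe-16b-chat summary above.
# Assumption: the profiler computes throughput as tokens / elapsed_time.
elapsed = 250.754                 # s
prompt_tokens = 721793
completion_tokens = 668265
rps = 11.964                      # reported req/s

print(completion_tokens / elapsed)                    # ~2665.0 token/s, matches completion throughput
print((prompt_tokens + completion_tokens) / elapsed)  # ~5543.5 token/s, matches total throughput
print(rps * 60)                                       # ~717.8 req/min, matches RPM
print(round(rps * elapsed))                           # ~3000 -> implies the run sent ~3000 requests
```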

zhulinJulia24 (Collaborator) commented May 8, 2024

Qwen1.5-MoE-A2.7B-Chat

CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tp 2
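
(The client command isn't shown for this run; presumably it mirrored the deepseek invocation later in this thread, e.g. the following, with the default server port assumed — hypothetical:)

python benchmark/profile_restful_api.py localhost:23333 /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat ./ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 256 --stream-output True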

main

concurrency: 256
elapsed_time: 4845.707s

first_token latency(min, max, ave): 0.004s, 3.831s, 1.272s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 209.753 token/s
token throughput (prompt + completion token): 446.742 token/s
RPS (request per second): 1.032 req/s
RPM (request per minute): 61.910 req/min

this pr
concurrency: 256
elapsed_time: 1865.420s

first_token latency(min, max, ave): 0.006s, 2.110s, 0.436s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 544.864 token/s
token throughput (prompt + completion token): 1160.480 token/s
RPS (request per second): 2.680 req/s
RPM (request per minute): 160.822 req/min

vllm 0.4.0

python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tensor-parallel-size 2
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 83.39
Total input tokens: 217393
Total generated tokens: 201441
Request throughput (req/s): 11.99
Input token throughput (tok/s): 2606.86
Output token throughput (tok/s): 2415.57
--Time to First Token--
Mean TTFT (ms): 26022.18
Median TTFT (ms): 24357.74
P99 TTFT (ms): 62867.29
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 96.21
Median TPOT (ms): 97.28
P99 TPOT (ms): 145.80

zhulinJulia24 (Collaborator) commented:

deepseek-moe-16b-chat

this pr

CUDA_VISIBLE_DEVICES=6 lmdeploy serve api_server /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --server-port 24333
python benchmark/profile_restful_api.py localhost:24333 /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat ../ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 256 --stream-output True --num-prompts 1000

in progress

vllm 0.4.0

CUDA_VISIBLE_DEVICES=6 python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --trust-remote-code
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

Successful requests: 1000
Benchmark duration (s): 83.97
Total input tokens: 236158
Total generated tokens: 158473
Request throughput (req/s): 11.91
Input token throughput (tok/s): 2812.46
Output token throughput (tok/s): 1887.29
--Time to First Token--
Mean TTFT (ms): 16670.32
Median TTFT (ms): 5263.92
P99 TTFT (ms): 57460.03
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 120.28
Median TPOT (ms): 115.89
P99 TPOT (ms): 329.52

grimoire marked this pull request as draft May 9, 2024 07:50
zhulinJulia24 (Collaborator) commented:

mistralai/Mistral-7B-Instruct-v0.1, single GPU

== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 86.02
Total input tokens: 241080
Total generated tokens: 173935
Request throughput (req/s): 11.62
Input token throughput (tok/s): 2802.46
Output token throughput (tok/s): 2021.93
--Time to First Token--
Mean TTFT (ms): 16705.49
Median TTFT (ms): 6305.76
P99 TTFT (ms): 58420.31
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 113.28
Median TPOT (ms): 108.95
P99 TPOT (ms): 385.06

grimoire marked this pull request as ready for review May 14, 2024 11:57
grimoire mentioned this pull request May 16, 2024
zhulinJulia24 (Collaborator) commented:

| dataset | version | metric | mode | internlm2-chat-7b-pytorch | internlm2-chat-20b-pytorch | llama-2-7b-chat-pytorch | qwen1.5-7b-chat-pytorch | qwen1.5-moe-2.7b-chat-pytorch | llama-3-8b-instruct-pytorch |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ceval | - | naive_average | gen | 58.2 | 63.45 | 28.51 | 70.67 | 77.24 | 50.63 |
| mmlu | - | naive_average | gen | 58.41 | 60.96 | 35.63 | 61.47 | 60.87 | 54.5 |
| WiC | d06864 | accuracy | gen | 58.62 | 60.19 | 0 | 63.01 | 60.03 | 21.32 |
| WSC | 7902a7 | accuracy | gen | 56.73 | 50 | 0 | 39.42 | 40.38 | 30.77 |
| triviaqa | 2121ce | score | gen | 56.82 | 63.87 | 56.12 | 44.63 | 54.59 | 63.83 |
| gsm8k | 1d7fe4 | accuracy | gen | 33.36 | 53.6 | 14.03 | 5.61 | 22.21 | 25.78 |
| race-middle | 9a54b6 | accuracy | gen | 76.32 | 86.84 | 58.57 | 87.6 | 82.31 | 86.35 |
| race-high | 9a54b6 | accuracy | gen | 75.19 | 83.7 | 51.17 | 82.7 | 79.3 | 80.93 |

lvhan028 (Collaborator) commented:

Hi @zhulinJulia24, please help double-check the inference speed.

zhulinJulia24 (Collaborator) commented May 20, 2024

deepseek-moe-16b-chat

this pr

CUDA_VISIBLE_DEVICES=6 lmdeploy serve api_server /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --server-port 24333
python benchmark/profile_restful_api.py localhost:24333 /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat ../ShareGPT_V3_unfiltered_cleaned_split.json --concurrency 256 --stream-output True --num-prompts 1000

in progress

vllm 0.4.0

CUDA_VISIBLE_DEVICES=6 python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --trust-remote-code
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/deepseek-ai/deepseek-moe-16b-chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

Successful requests: 1000
Benchmark duration (s): 83.97
Total input tokens: 236158
Total generated tokens: 158473
Request throughput (req/s): 11.91
Input token throughput (tok/s): 2812.46
Output token throughput (tok/s): 1887.29
--Time to First Token--
Mean TTFT (ms): 16670.32
Median TTFT (ms): 5263.92
P99 TTFT (ms): 57460.03
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 120.28
Median TPOT (ms): 115.89
P99 TPOT (ms): 329.52

newest code:

concurrency: 256
elapsed_time: 350.753s

first_token latency(min, max, ave): 0.352s, 64.760s, 14.496s

number of prompt tokens: 721793
number of completion tokens: 668265
token throughput (completion token): 1905.230 token/s
token throughput (prompt + completion token): 3963.069 token/s
RPS (request per second): 8.553 req/s
RPM (request per minute): 513.182 req/min

zhulinJulia24 (Collaborator) commented:

Qwen1.5-MoE-A2.7B-Chat

CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tp 2

main

concurrency: 256
elapsed_time: 4845.707s

first_token latency(min, max, ave): 0.004s, 3.831s, 1.272s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 209.753 token/s
token throughput (prompt + completion token): 446.742 token/s
RPS (request per second): 1.032 req/s
RPM (request per minute): 61.910 req/min

this pr

concurrency: 256
elapsed_time: 1865.420s

first_token latency(min, max, ave): 0.006s, 2.110s, 0.436s

number of prompt tokens: 1148381
number of completion tokens: 1016401
token throughput (completion token): 544.864 token/s
token throughput (prompt + completion token): 1160.480 token/s
RPS (request per second): 2.680 req/s
RPM (request per minute): 160.822 req/min

vllm 0.4.0

python3 -m vllm.entrypoints.openai.api_server --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --tensor-parallel-size 2
python3 benchmarks/benchmark_serving.py --model /nvme/qa_test_models/Qwen/Qwen1.5-MoE-A2.7B-Chat --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json

== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 83.39
Total input tokens: 217393
Total generated tokens: 201441
Request throughput (req/s): 11.99
Input token throughput (tok/s): 2606.86
Output token throughput (tok/s): 2415.57
--Time to First Token--
Mean TTFT (ms): 26022.18
Median TTFT (ms): 24357.74
P99 TTFT (ms): 62867.29
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 96.21
Median TPOT (ms): 97.28
P99 TPOT (ms): 145.80

newest code:

concurrency: 256
elapsed_time: 341.970s

first_token latency(min, max, ave): 1.632s, 118.259s, 14.703s

number of prompt tokens: 680073
number of completion tokens: 620970
token throughput (completion token): 1815.861 token/s
token throughput (prompt + completion token): 3804.554 token/s
RPS (request per second): 8.773 req/s
RPM (request per minute): 526.362 req/min

zhulinJulia24 (Collaborator) commented May 20, 2024

mistralai/Mistral-7B-Instruct-v0.1, single GPU

vllm
== Serving Benchmark Result ==
Successful requests: 1000
Benchmark duration (s): 86.02
Total input tokens: 241080
Total generated tokens: 173935
Request throughput (req/s): 11.62
Input token throughput (tok/s): 2802.46
Output token throughput (tok/s): 2021.93
--Time to First Token--
Mean TTFT (ms): 16705.49
Median TTFT (ms): 6305.76
P99 TTFT (ms): 58420.31
--Time per Output Token (excl. 1st token)--
Mean TPOT (ms): 113.28
Median TPOT (ms): 108.95
P99 TPOT (ms): 385.06

newest code:

concurrency: 256
elapsed_time: 321.985s

first_token latency(min, max, ave): 1.181s, 61.126s, 13.595s

number of prompt tokens: 741804
number of completion tokens: 712850
token throughput (completion token): 2213.925 token/s
token throughput (prompt + completion token): 4517.773 token/s
RPS (request per second): 9.317 req/s
RPM (request per minute): 559.033 req/min

If I use the vllm benchmark script:

Successful requests: 1000
Benchmark duration (s): 84.30
Total input tokens: 241080
Total generated tokens: 151005
Request throughput (req/s): 11.86
Input token throughput (tok/s): 2859.71
Output token throughput (tok/s): 1791.23
Mean TTFT (ms): 22458.00
Median TTFT (ms): 16685.52
P99 TTFT (ms): 63616.28
Mean TPOT (ms): 50.99
Median TPOT (ms): 45.98
P99 TPOT (ms): 165.68

zhyncs (Contributor) commented May 20, 2024

Currently, the MoE models still show a gap in throughput and latency compared to vLLM.
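
For context on where such a gap can come from: an MoE forward adds a top-k routing step plus scattered per-expert GEMMs on top of a dense model. A minimal PyTorch sketch of generic top-k gating (an illustration under assumed shapes and hypothetical `experts` modules, not the kernel this PR implements):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate_w, experts, top_k=2):
    # x: [num_tokens, hidden]; gate_w: [hidden, num_experts] router weights
    # experts: list of per-expert FFN callables (hypothetical modules)
    probs = F.softmax(x @ gate_w, dim=-1)                  # routing probabilities
    weights, idx = torch.topk(probs, top_k, dim=-1)        # [tokens, top_k]
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        rows, slots = (idx == e).nonzero(as_tuple=True)    # tokens routed to expert e
        if rows.numel() == 0:
            continue
        # a token selects each expert at most once, so indices in `rows` are unique
        out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
    return out
```

The per-expert Python loop is the naive formulation; optimized implementations batch tokens per expert and fuse the gather/GEMM/scatter steps, which is presumably where the throughput gains benchmarked in this thread come from.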

RunningLeon (Collaborator) commented:

> mistralai/Mistral-7B-Instruct-v0.1, single GPU
> vllm
> == Serving Benchmark Result ==
> Successful requests: 1000
> Benchmark duration (s): 86.02
> Total input tokens: 241080
> Total generated tokens: 173935
> Request throughput (req/s): 11.62
> Input token throughput (tok/s): 2802.46
> Output token throughput (tok/s): 2021.93
> --Time to First Token--
> Mean TTFT (ms): 16705.49
> Median TTFT (ms): 6305.76
> P99 TTFT (ms): 58420.31
> --Time per Output Token (excl. 1st token)--
> Mean TPOT (ms): 113.28
> Median TPOT (ms): 108.95
> P99 TPOT (ms): 385.06
> ....

@zhulinJulia24 mistralai/Mistral-7B-Instruct-v0.1 does not have MoE; mistralai/Mixtral-8x7B-Instruct-v0.1 does. We should test that model if necessary.

grimoire mentioned this pull request May 21, 2024
RunningLeon (Collaborator) left a review comment:

LGTM

lvhan028 merged commit 5ce3ed8 into InternLM:main May 22, 2024
5 checks passed