[WIP] support Medusa #1231

zhyncs · 2024-03-03T04:58:36Z

Motivation

As the title says, this PR adds Medusa support.

Modification

finished

1. Medusa weights conversion
2. Medusa weights loading
3. Porting the Medusa Heads code using LMDeploy components and utilities
4. TP support: distribute the weights equally based on hidden_size
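To illustrate the TP item above, here is a minimal sketch of splitting a weight matrix into equal shards across tensor-parallel ranks. It is not the code in this PR: `split_for_tp` is a hypothetical helper, the weight is a plain nested list, and which axis carries hidden_size for each Medusa layer is an assumption.

```python
def split_for_tp(weight, tp_size, axis):
    """Split a 2-D weight (list of rows) into tp_size equal shards.

    Hypothetical helper for illustration only; axis=0 splits rows,
    axis=1 splits columns. The dimension must divide evenly.
    """
    rows, cols = len(weight), len(weight[0])
    dim = rows if axis == 0 else cols
    if dim % tp_size != 0:
        raise ValueError(f"dimension {dim} is not divisible by tp_size {tp_size}")
    step = dim // tp_size
    shards = []
    for rank in range(tp_size):
        lo, hi = rank * step, (rank + 1) * step
        if axis == 0:
            shards.append(weight[lo:hi])
        else:
            shards.append([row[lo:hi] for row in weight])
    return shards


# Toy example: a 4x4 weight, assuming hidden_size = 4 and two ranks.
hidden_size = 4
weight = [[r * hidden_size + c for c in range(hidden_size)]
          for r in range(hidden_size)]
shards = split_for_tp(weight, tp_size=2, axis=1)
```

Each rank then holds `hidden_size / tp_size` columns; a dimension that does not divide evenly is rejected rather than padded.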

We used https://github.com/zhyncs/medusa-whl-centos7/releases/tag/2024.02.27, https://huggingface.co/FasterDecoding/medusa-vicuna-13b-v1.3, and https://huggingface.co/lmsys/vicuna-13b-v1.3 to verify the correctness of the ported code (fp16 and bf16).

during debugging

1. Porting generate_candidates and evaluate_posterior
2. Integrating with LlamaBatch
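For readers unfamiliar with the posterior step, the sketch below shows the greedy idea behind evaluate_posterior as I understand it: among the candidate continuations, keep the one whose speculated tokens agree with the base model's greedy tokens for the longest prefix. `evaluate_posterior_greedy` and its list-based inputs are hypothetical and simplified; this is not the ported implementation.

```python
def evaluate_posterior_greedy(candidates, verified):
    """Pick the candidate with the longest accepted prefix (greedy case).

    candidates[i]: tokens of candidate i; candidates[i][0] is the token
        already accepted from the base model.
    verified[i][j]: base model's greedy token after consuming
        candidates[i][:j + 1].
    Returns (best_index, accept_length), where accept_length counts the
    speculated tokens that were accepted.
    """
    best_index, best_len = 0, 0
    for i, (cand, ver) in enumerate(zip(candidates, verified)):
        accepted = 0
        # Compare each speculated token with what the base model would emit.
        for guess, truth in zip(cand[1:], ver):
            if guess != truth:
                break
            accepted += 1
        if accepted > best_len:
            best_index, best_len = i, accepted
    return best_index, best_len


# Toy token ids, made up for illustration.
candidates = [[5, 7, 9], [5, 7, 8], [5, 6, 8]]
verified = [[7, 8, 3], [7, 8, 3], [7, 2, 1]]
best = evaluate_posterior_greedy(candidates, verified)
```

Here the second candidate wins with two accepted tokens, because both of its speculated tokens match the base model's greedy choices.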

todo

1. add docs
2. add tests
3. benchmark

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, like docstring or example tutorials.

zhyncs · 2024-03-13T06:41:05Z

Hi @lvhan028, could you help change the base branch to turbomind-2.1 for this PR? Thanks.

zhyncs · 2024-03-27T09:55:59Z

Hi all. We've tested the acc rate with LM Head temperature 0 and Medusa Heads top 1 in our internal version. On both custom prompts and MT-Bench, the acc rate is relatively low: only 0-20%, and in the vast majority of cases it is 0. In this setting we did not achieve the desired benefits. We also verified LM Head temperature 0 with Medusa Heads top k in the official version, using medusa choices (64). There the acc rate is between 20% and 40%, which is closer to the greedy data in the paper. Since verifying Medusa choices with a multi-batch approach would incur significant costs, we have decided to build a Tree Mask version based on Flash Decoding on top of this work. We will align with @lzhangzz and provide the technical solution as soon as possible. Please stay tuned for updates.
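As background for the Tree Mask direction, the sketch below shows one common way to turn a list of Medusa choices into a tree attention mask, so that all candidates can be verified in a single forward pass instead of one batch entry per candidate. It is an illustration under assumptions, not the planned Flash Decoding kernel: `build_tree_mask` is a hypothetical helper, each choice is a path of per-head top-k indices, and every prefix of a path is assumed to be present in the list.

```python
def build_tree_mask(choices):
    """Build a 0/1 attention mask for a candidate tree.

    choices: list of paths; path [a, b] means "top-a token of head 1,
        then top-b token of head 2". Every prefix of a path must also
        appear in choices (assumption of this sketch).
    Node 0 is the root (the token from the LM head). Node i may attend
    to the root, its ancestors, and itself.
    """
    paths = sorted((tuple(c) for c in choices), key=lambda p: (len(p), p))
    index = {(): 0}
    for i, path in enumerate(paths, start=1):
        index[path] = i
    n = len(paths) + 1
    mask = [[0] * n for _ in range(n)]
    for path, i in index.items():
        # Mark the root, every ancestor prefix, and the node itself.
        for depth in range(len(path) + 1):
            mask[i][index[path[:depth]]] = 1
    return mask


# Toy tree with four candidate nodes (the PR mentions 64 choices).
tree_mask = build_tree_mask([[0], [1], [0, 0], [0, 1]])
```

With this mask, sibling branches never attend to each other, so their logits match what separate per-candidate batch entries would produce while sharing the prefix computation.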

This was referenced Mar 3, 2024

[WIP] porting Medusa #1213

Closed

[WIP] porting Medusa #1226

Closed

[Feature] Medusa weights conversion #1180

Closed

zhyncs force-pushed the medusa-plugin branch 2 times, most recently from 56393e1 to 61145b3 Compare March 11, 2024 02:34

zhyncs force-pushed the medusa-plugin branch from b4c5f33 to 2a27648 Compare March 13, 2024 09:55

zhyncs changed the base branch from main to turbomind-2.1 March 13, 2024 09:56

zhyncs force-pushed the medusa-plugin branch from 2a27648 to f11dbd8 Compare March 14, 2024 06:55

zhyncs changed the base branch from turbomind-2.1 to main March 19, 2024 07:28

zhyncs force-pushed the medusa-plugin branch from 42a8eff to a5f8c9a Compare March 19, 2024 07:34

zhyncs and others added 4 commits March 29, 2024 15:40

feat: porting medusa head and resblock with tp support

7f58c95

feat: update medusa_head_output

ceb7850

feat: add invokeBatchTopK and invokeMedusaBatchMatch

2f78465

sync

302f5ec

zhyncs force-pushed the medusa-plugin branch from a5f8c9a to 302f5ec Compare March 29, 2024 07:40
