
Question about the lmdeploy tutorial: how do you stack KV Cache quantization and W4A16 quantization? #376

Open
melonwine opened this issue Jan 17, 2024 · 2 comments

@melonwine

The quantization section of the lmdeploy tutorial explains how to do KV Cache quantization and W4A16 quantization separately, and each produces a model in TurboMind format.
But how can the two be combined, e.g. applying W4A16 quantization on top of the KV Cache quantization result?
Neither `lmdeploy lite calibrate` nor `lmdeploy lite auto_awq` accepts a TurboMind-format model, so how do I stack them?

Also, if I want to share the quantized model with others, how do I convert the TurboMind format back to Hugging Face format?

@SchweitzerGAO

Do it the other way around: W4A16 first, then KV Cache quantization.

@hscspring
Collaborator

@melonwine The tutorial should cover this: after W4A16 produces the quantized parameters, the data used for KV Cache quantization is written into that same parameter folder, and it takes effect once you modify the config.
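A hedged sketch of that order (W4A16 first, then KV-cache calibration into the same work dir). The two subcommand names come from this thread, but the exact flags, paths, and the config key depend on your lmdeploy version, so treat everything below as placeholders and check `lmdeploy lite --help`:

```shell
# 1. W4A16 weight quantization (AWQ) of the HF-format model
#    (paths are placeholders):
lmdeploy lite auto_awq ./my-hf-model --work-dir ./my-model-w4

# 2. KV-cache calibration; its statistics land in the same
#    parameter folder produced by step 1:
lmdeploy lite calibrate ./my-hf-model --work-dir ./my-model-w4

# 3. Enable KV-cache quantization in the TurboMind config after
#    conversion (e.g. the quant_policy setting in config.ini --
#    the exact value is version-dependent).
```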
The quantized model lives in the quantization output folder; the model is quantized first and only then converted to TurboMind format, so if you want to share it, just distribute that quantized model directly.
The TurboMind format is essentially the same as TritonServer/FasterTransformer's, and as noted above you don't need to convert it back.
If you really do want to convert back (for learning or exploration), the idea is the same: read the TurboMind parameters in one by one, then merge and assemble them into HF format.
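The "read the parameters in one by one, then assemble" idea above can be sketched as follows. This is a hypothetical illustration, not lmdeploy's actual converter: the file names, shapes, and the key mapping (`layout`) are all made up for the demo, and real TurboMind layouts also split/merge tensors in ways the model config dictates.

```python
import os
import tempfile

import numpy as np

# Hypothetical sketch: assume each TurboMind tensor is stored as a raw
# binary file whose shape and dtype are known from the model config.
# Reading each file back and renaming its key yields an HF-style
# state dict that could then be saved with torch/safetensors.

def load_turbomind_tensor(path, shape, dtype=np.float16):
    """Read one raw weight file into a numpy array of the given shape."""
    return np.fromfile(path, dtype=dtype).reshape(shape)

def assemble_hf_state_dict(workspace, layout):
    """layout maps a TurboMind file name -> (HF key, shape)."""
    state_dict = {}
    for tm_name, (hf_key, shape) in layout.items():
        state_dict[hf_key] = load_turbomind_tensor(
            os.path.join(workspace, tm_name), shape)
    return state_dict

# Demo with a synthetic workspace so the sketch is self-contained.
with tempfile.TemporaryDirectory() as ws:
    w = np.arange(12, dtype=np.float16).reshape(3, 4)
    w.tofile(os.path.join(ws, "layers.0.ffn.w1.weight"))
    layout = {"layers.0.ffn.w1.weight":
              ("model.layers.0.mlp.gate_proj.weight", (3, 4))}
    sd = assemble_hf_state_dict(ws, layout)
    print(sd["model.layers.0.mlp.gate_proj.weight"].shape)  # (3, 4)
```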
