[vLLM] optimizing vLLM models by qbits. #1548

Zhenzhong1 · 2024-05-15T06:58:15Z

Type of Change

New feature & API change

Complete the pipeline that replaces part of the vLLM linear modules with qbits linear modules. (chatglm2)
vLLM integration API design: vllm_model = AutoModelForCausalLM.from_pretrained(args.model, use_vllm=True)
ITREX uses pytorch==2.3.0 (CPU build).
Extend acceleration to more models.
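
The module replacement described above can be pictured with a minimal, dependency-free sketch. It is not the code from this PR: `Module`, `ParallelLinear`, `QuantizedLinear`, and `replace_linears` are hypothetical stand-ins for the vLLM linear layers and the qbits linear layer, and the sketch only assumes that the replacement is done by walking the model tree and swapping matching children in place.

```python
# Hypothetical stand-ins: the real PR operates on vLLM / PyTorch modules.
class Module:
    def __init__(self, **children):
        for name, child in children.items():
            setattr(self, name, child)

    def named_children(self):
        # Only attributes that are themselves modules count as children.
        return [(n, c) for n, c in vars(self).items() if isinstance(c, Module)]


class ParallelLinear(Module):
    """Stand-in for a vLLM linear layer such as QKVParallelLinear."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features


class QuantizedLinear(Module):
    """Stand-in for a quantized (qbits-style) linear layer."""

    def __init__(self, source):
        super().__init__()
        # Keep the shape of the layer being replaced.
        self.in_features = source.in_features
        self.out_features = source.out_features


def replace_linears(module, target_cls, make_replacement):
    """Recursively swap every child of type target_cls; return the count."""
    replaced = 0
    for name, child in list(module.named_children()):
        if isinstance(child, target_cls):
            setattr(module, name, make_replacement(child))
            replaced += 1
        else:
            replaced += replace_linears(child, target_cls, make_replacement)
    return replaced


# A toy model tree with three linear layers at different depths.
model = Module(
    attention=Module(qkv=ParallelLinear(8, 24), dense=ParallelLinear(8, 8)),
    mlp=Module(up=ParallelLinear(8, 32)),
)
count = replace_linears(model, ParallelLinear, QuantizedLinear)
```

Passing the target class and a factory as arguments keeps the traversal independent of any one model, which is one plausible way to extend the replacement to more architectures.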

Expected behavior triggered by this PR

How to reproduce the test (including hardware information)

Any library dependency introduced or removed


Zhenzhong1 added 8 commits May 6, 2024 22:23

replace pipeline ok

4954d18

remove useless code

83a09b4

register_new_model & replace_rms_to_layernorm done

cfb68d7

update vllm_chatglm_model & fixed register_new_model issue

7df82ff

rename vllm folder name

0e2a773

replace all QKVParallelLinear by linear successfully

3f51e1f

integrate qbits successfully

7cc238b

replace more linears done

569e467

Zhenzhong1 changed the title ~~Zhenzhong/vllm integrate~~ [vLLM] optimizing vLLM models by qbits. May 15, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

37d7e1b

for more information, see https://pre-commit.ci

Zhenzhong1 closed this May 20, 2024