
Releases: InternLM/lmdeploy

LMDeploy Release V0.4.2

27 May 08:56
54b7230

Highlight

  • Support 4-bit weight-only quantization and inference on VLMs, such as InternVL v1.5, LLaVA and InternLMXComposer2 (a serving sketch follows the examples below)

Quantization

lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ

Inference with quantized model

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('./InternVL-Chat-V1-5-AWQ', backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
  • Balance vision model when deploying VLMs with multiple GPUs
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5', backend_config=TurbomindEngineConfig(tp=2))

img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
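The AWQ model produced above can also be served instead of being used through the pipeline. A minimal sketch, assuming lmdeploy serve api_server accepts a --model-format awq flag analogous to the pipeline's model_format option; the path and port are illustrative:

lmdeploy serve api_server ./InternVL-Chat-V1-5-AWQ --server-port 23333 --model-format awq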

What's Changed

Full Changelog: v0.4.1...v0.4.2

LMDeploy Release V0.4.1

07 May 08:20
14e9953

What's Changed

🐞 Bug fixes

  • fix local variable 'response' referenced before assignment in async_engine.generate by @irexyc in #1513
  • Fix turbomind import in windows by @irexyc in #1533
  • Fix convert qwen2 to turbomind by @AllentDan in #1546
  • Adding api_key and model_name parameters to the restful benchmark by @NiuBlibing in #1478

Full Changelog: v0.4.0...v0.4.1

LMDeploy Release V0.4.0

23 Apr 11:18
04ba0ff

Highlights

Support for Llama3 and additional Vision-Language Models (VLMs):

  • We now support Llama3 and an extended range of Vision-Language Models (VLMs), including InternVL versions 1.1 and 1.2, MiniGemini, and InternLMXComposer2.

Introduce online int4/int8 KV quantization and inference

  • Data-free online quantization
  • Supports all NVIDIA GPUs with the Volta architecture (sm70) and above
  • KV int8 quantization is nearly lossless, and KV int4 quantization accuracy stays within an acceptable range
  • Efficient inference: with int8 and int4 KV quantization applied to llama2-7b, RPS improves by approximately 30% and 40% respectively compared to fp16 (a configuration sketch follows this list)
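Online KV quantization is enabled through the engine configuration rather than an offline conversion step. A minimal sketch, assuming the quant_policy field of TurbomindEngineConfig selects the KV cache precision (8 for int8, 4 for int4, 0 for the default fp16); the model name and prompt are illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy is assumed to pick the KV cache precision: 8 -> int8, 4 -> int4, 0 -> fp16 (default)
engine_config = TurbomindEngineConfig(quant_policy=8)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=engine_config)

print(pipe('hi, please introduce yourself'))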

The following table shows the evaluation results of three LLM models with different KV numerical precision:

| dataset | version | metric | llama2-7b-chat kv fp16 | kv int8 | kv int4 | internlm2-chat-7b kv fp16 | kv int8 | kv int4 | qwen1.5-7b-chat kv fp16 | kv int8 | kv int4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ceval | - | naive_average | 28.42 | 27.96 | 27.58 | 60.45 | 60.88 | 60.28 | 70.56 | 70.49 | 68.62 |
| mmlu | - | naive_average | 35.64 | 35.58 | 34.79 | 63.91 | 64 | 62.36 | 61.48 | 61.56 | 60.65 |
| triviaqa | 2121ce | score | 56.09 | 56.13 | 53.71 | 58.73 | 58.7 | 58.18 | 44.62 | 44.77 | 44.04 |
| gsm8k | 1d7fe4 | accuracy | 28.2 | 28.05 | 27.37 | 70.13 | 69.75 | 66.87 | 54.97 | 56.41 | 54.74 |
| race-middle | 9a54b6 | accuracy | 41.57 | 41.78 | 41.23 | 88.93 | 88.93 | 88.93 | 87.33 | 87.26 | 86.28 |
| race-high | 9a54b6 | accuracy | 39.65 | 39.77 | 40.77 | 85.33 | 85.31 | 84.62 | 82.53 | 82.59 | 82.02 |

The table below presents LMDeploy's inference performance with quantized KV cache.

| model | kv type | test settings | RPS | v.s. kv fp16 |
|---|---|---|---|---|
| llama2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 14.98 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 19.01 | 1.27 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 20.81 | 1.39 |
| llama2-chat-13b | fp16 | tp1 / ratio 0.9 / bs 128 / prompts 10000 | 8.55 | 1.0 |
| - | int8 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 10.96 | 1.28 |
| - | int4 | tp1 / ratio 0.9 / bs 256 / prompts 10000 | 11.91 | 1.39 |
| internlm2-chat-7b | fp16 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 24.13 | 1.0 |
| - | int8 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.28 | 1.05 |
| - | int4 | tp1 / ratio 0.8 / bs 256 / prompts 10000 | 25.80 | 1.07 |

What's Changed

Full Changelog: v0.3.0...v0.4.0

LMDeploy Release V0.3.0

03 Apr 01:55
4822fba

Highlight

  • Refactor attention and optimize GQA (#1258, #1307, #1116), achieving 22+ RPS for internlm2-7b and 16+ RPS for internlm2-20b, about 1.8x faster than vLLM
  • Support new models, including Qwen1.5-MoE (#1372), DBRX (#1367) and DeepSeek-VL (#1335)

What's Changed

Full Changelog: v0.2.6...v0.3.0

LMDeploy Release V0.2.6

19 Mar 02:43
b69e717

Highlight

Support vision-language model (VLM) inference pipeline and serving.
Currently, the following models are supported: Qwen-VL-Chat, the LLaVA series (v1.5 and v1.6) and Yi-VL.

  • VLM Inference Pipeline
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

Please refer to the detailed guide here.

  • VLM serving with the OpenAI-compatible api_server (a client sketch follows this list)
lmdeploy serve api_server liuhaotian/llava-v1.6-vicuna-7b --server-port 8000
  • VLM serving with Gradio
lmdeploy serve gradio liuhaotian/llava-v1.6-vicuna-7b --server-port 6006
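Once the api_server is running, it can be queried with any OpenAI-compatible client. A minimal sketch, assuming the server from the api_server bullet above listens on port 8000 and accepts OpenAI-style vision messages; the model id is a placeholder that should match what /v1/models reports:

from openai import OpenAI

# Point the client at the locally started api_server (assumed address and port)
client = OpenAI(base_url='http://0.0.0.0:8000/v1', api_key='none')

response = client.chat.completions.create(
    model='liuhaotian/llava-v1.6-vicuna-7b',  # assumed served model id; check /v1/models if unsure
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)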

What's Changed

Full Changelog: v0.2.5...v0.2.6

LMDeploy Release V0.2.5

05 Mar 08:39
c5f4014

What's Changed

Full Changelog: v0.2.4...v0.2.5

LMDeploy Release V0.2.4

22 Feb 03:44
24ea5dc

What's Changed

Full Changelog: v0.2.3...v0.2.4

LMDeploy Release V0.2.3

06 Feb 06:14
2831dc2

What's Changed

💥 Improvements

  • Remove caching tokenizer.json by @grimoire in #1074
  • Refactor get_logger to remove the dependency of MMLogger from mmengine by @yinfan98 in #1064
  • Use TM_LOG_LEVEL environment variable first by @zhyncs in #1071
  • Speed up the initialization of w8a8 model for torch engine by @yinfan98 in #1088
  • Make logging.logger's behavior consistent with MMLogger by @irexyc in #1092
  • Remove owned_session for torch engine by @grimoire in #1097
  • Unify engine initialization in pipeline by @irexyc in #1085
  • Add skip_special_tokens in GenerationConfig by @grimoire in #1091
  • Use default stop words for turbomind backend in pipeline by @irexyc in #1119
  • Add input_token_len to Response and update Response document by @AllentDan in #1115

🐞 Bug fixes

  • Fix fast tokenizer swallows prefix space when there are too many white spaces by @AllentDan in #992
  • Fix turbomind CUDA runtime error invalid argument by @zhyncs in #1100
  • Add safety check for incremental decode by @AllentDan in #1094
  • Fix device type of get_ppl for turbomind by @RunningLeon in #1093
  • Fix pipeline init turbomind from workspace by @irexyc in #1126
  • Add dependency version check and fix ignore_eos logic by @grimoire in #1099
  • Change configuration_internlm.py to configuration_internlm2.py by @HIT-cwh in #1129

Full Changelog: v0.2.2...v0.2.3

LMDeploy Release V0.2.2

31 Jan 09:57
4a28f12

Highlight

  • The TurboMind engine's allocation strategy for the k/v cache has changed. The parameter cache_max_entry_count now refers to the proportion of free GPU memory rather than total memory, and its default value is 0.8. This helps prevent OOM issues (a configuration sketch follows this list).
  • The pipeline API supports streaming inference. You may give it a try!
from lmdeploy import pipeline
pipe = pipeline('internlm/internlm2-chat-7b')
for item in pipe.stream_infer('hi, please intro yourself'):
    print(item)
  • Add api key and ssl to api_server
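The cache ratio can also be set explicitly when the default needs tuning. A minimal sketch, assuming cache_max_entry_count is exposed on TurbomindEngineConfig and passed to the pipeline as backend_config; the value 0.5 is illustrative:

from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is assumed to be the ratio described above: the proportion of
# free GPU memory (after weights are loaded) reserved for the k/v cache
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.5)
pipe = pipeline('internlm/internlm2-chat-7b', backend_config=backend_config)

for item in pipe.stream_infer('hi, please introduce yourself'):
    print(item)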


What's Changed

Full Changelog: v0.2.1...v0.2.2

LMDeploy Release V0.2.1

19 Jan 10:38
e96e2b4

What's Changed

📚 Documentations

  • add guide about installation on cuda 12+ platform by @lvhan028 in #988

Full Changelog: v0.2.0...v0.2.1