[Feature]: Support cogvlm-chat #1502

Open · wants to merge 40 commits into main
Conversation

@RunningLeon (Collaborator) commented Apr 26, 2024

Motivation

Support cogvlm-chat-hf and CogVLM2 for pytorch engine

Usage:

Warning

CogVLM-Chat-hf uses 'lmsys/vicuna-7b-v1.5' as its tokenizer; you need to copy the tokenizer model and configs into the CogVLM model directory (a sketch of how to do this follows the usage example below).

from lmdeploy import pipeline
from lmdeploy.vl import load_image

model_path = './models--THUDM--cogvlm-chat-hf'

pipe = pipeline(model_path)

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
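Before running the pipeline above, the vicuna tokenizer files can be copied into the CogVLM directory. A minimal sketch, assuming the files are fetched with huggingface_hub; the file list is an assumption, adjust it to what 'lmsys/vicuna-7b-v1.5' actually ships:

```python
import shutil

from huggingface_hub import hf_hub_download

cogvlm_dir = './models--THUDM--cogvlm-chat-hf'
# Assumed tokenizer artifacts shipped by 'lmsys/vicuna-7b-v1.5'.
for filename in ('tokenizer.model', 'tokenizer_config.json', 'special_tokens_map.json'):
    # Download (or reuse from cache) each file, then copy it next to the CogVLM weights.
    local_path = hf_hub_download(repo_id='lmsys/vicuna-7b-v1.5', filename=filename)
    shutil.copy(local_path, cogvlm_dir)
```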

Modification

TODOs

BC-breaking (Optional)

Profiling

tp=1 batch_size=128 num-prompts=3000

cogvlm-chat-hf

Without images
concurrency: 128
elapsed_time: 282.105s

first token latency(s)(min, max, ave): 0.859, 5.820, 1.471
per-token latency(s) percentile(50, 75, 95, 99): [0.032, 0.033, 0.111, 0.237]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 2571.792 token/s
token throughput (prompt + completion token): 5260.009 token/s
RPS (request per second): 10.634 req/s
RPM (request per minute): 638.060 req/min
With one image

Changed profile_throughput.py to prepend 1234 tokens and image embeddings to each prompt (a sketch of this kind of change follows the profiling numbers below).

concurrency: 128
elapsed_time: 1066.815s

first token latency(s)(min, max, ave): 0.881, 39.275, 31.872
per-token latency(s) percentile(50, 75, 95, 99): [0.033, 0.036, 0.266, 0.334]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 680.077 token/s
token throughput (prompt + completion token): 1390.940 token/s
RPS (request per second): 2.812 req/s
RPM (request per minute): 168.726 req/min
REST API

using PR #1662

concurrency: 16
elapsed_time: 447.581s

first_token latency(min, max, ave): 0.055s, 7.081s, 1.278s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 537.516 token/s
token throughput (prompt + completion token): 1092.364 token/s
RPS (request per second): 2.234 req/s
RPM (request per minute): 134.054 req/min
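For illustration only, a hypothetical sketch of the profile_throughput.py change mentioned above, prepending placeholder image tokens to each benchmark prompt; the function and parameter names are assumptions, not the PR's actual modification:

```python
IMAGE_TOKEN_LEN = 1234  # placeholder tokens prepended per prompt, per the note above


def prepend_image_placeholders(input_ids, pad_token_id=0):
    """Pad a benchmark prompt as if 1234 image tokens (and their embeddings)
    were injected in front of it, so prompt lengths match the VL workload."""
    return [pad_token_id] * IMAGE_TOKEN_LEN + list(input_ids)


# Sketch of usage inside the benchmark's sampling loop:
# input_ids = tokenizer.encode(prompt)
# input_ids = prepend_image_placeholders(input_ids)
```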

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm [WIP]: Support cogvlm-chat May 10, 2024
@RunningLeon RunningLeon marked this pull request as ready for review May 15, 2024 12:24
@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm-chat [Feature]: Support cogvlm-chat May 15, 2024
@RunningLeon RunningLeon removed the WIP label May 15, 2024
@@ -75,7 +89,7 @@ class ModelConfig:
     num_attention_heads: int
     num_key_value_heads: int
     bos_token_id: int
-    eos_token_id: int
+    eos_token_id: List[int]
Collaborator:

AFAIK, there is only one eos_token_id for each model.

Collaborator Author:

But CogVLM2 has two eos tokens, and the PyTorch engine is able to handle multiple eos tokens.

There are often multiple EOS tokens, and it has tripped up HF transformers as well. E.g. Llama-3 has eot and eos, and both need to be stopped on. Other models are the same.

Collaborator:

Would it be one eos_token_id and multiple stop words instead?

For generation, it's identified as 2 eos_token_ids. HF added proper support for this after llama-3 and other models required it.

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/generation_config.json#L3

For the model itself, there's still only 1 eos token id:

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/config.json#L8

So yes, it's equivalent to saying that those 2 are just stop ids; it depends on how one uses the information.
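For illustration, a minimal sketch of stopping on multiple EOS ids in the spirit of this discussion (a hypothetical helper, not the engine's actual stopping logic):

```python
from typing import List, Sequence


def is_finished(next_token_id: int,
                eos_token_ids: List[int],
                stop_words_ids: Sequence[int] = ()) -> bool:
    """Stop decoding when the sampled token is any EOS id or any extra stop-word id."""
    return next_token_id in eos_token_ids or next_token_id in stop_words_ids


# Llama-3-Instruct declares two terminators in its generation_config:
# is_finished(token_id, eos_token_ids=[128001, 128009])
```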

@lvhan028 (Collaborator):

According to the note in this PR, "you need to copy the tokenizer model and configs into the CogVLM model directory", I suggest writing a user guide about CogVLM deployment.
I think we can make a "multi_modal" folder in docs/en, with cogvlm.md as one example in that folder, like swift did:
https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal

@AllentDan (Collaborator) left a comment:

This is what I got

from lmdeploy import pipeline
from lmdeploy.vl import load_image

model = '/nvme/shared_data/models--THUDM--cogvlm-chat-hf/snapshots/e29dc3ba206d524bf8efbfc60d80fc4556ab0e3c'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, log_level='INFO')

response = pipe('describe this image')
print(response)
Response(text='in a and nobody. everyone  The SPA and the 20000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', generate_token_len=512......

Can cogvlm do text-only chat, without an image?

@RunningLeon (Collaborator Author):

> Can cogvlm do text-only chat, without an image?

It seems that chatting without images does not work even for the HF model. This is the output of the HF model:
query=Describe this image answer= nobody was 2016-02-11 16:30:32</s>

@pseudotensor

CogVLM2 works fine without images. Here's a modification of their OpenAI script that works:

https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py

The client is like the OpenAI client, with or without images.

@AllentDan (Collaborator):

I replaced the model path with cogvlm2, but found it responded as if there were an image.

Comment on lines +45 to +50
# cogvlm-chat
CogVLMForCausalLM=True,
# llava
LlavaLlamaForCausalLM=False,
# deepseekvl
MultiModalityCausalLM=False,
Collaborator:

Does it mean other VL model architectures supported by turbomind also need to be set to False here?

Collaborator Author:

If the arch does not exist, it defaults to False. No need to add them in this PR.
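A minimal sketch of the default-to-False lookup being described (the mapping and function names here are illustrative assumptions, and the flag's exact meaning in the PR is not shown):

```python
# Hypothetical architecture -> flag mapping mirroring the reviewed snippet.
VL_ARCH_FLAGS = dict(
    CogVLMForCausalLM=True,       # cogvlm-chat
    LlavaLlamaForCausalLM=False,  # llava
    MultiModalityCausalLM=False,  # deepseekvl
)


def arch_flag(arch: str) -> bool:
    # Unknown architectures fall back to False, so they need no entry here.
    return VL_ARCH_FLAGS.get(arch, False)
```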

@AllentDan (Collaborator) commented May 31, 2024:

Previously, when users ran VL models with the PyTorch engine, they would get a "not supported" hint. However, the current branch will lead users to fix errors one by one and end up with nothing but failure, even if it is a DeepSeek-VL model or a LLaVA model.

CogVLM is a powerful open-source visual language model (VLM). LMDeploy supports CogVLM-17B models like [THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf) and CogVLM2-19B models like [THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) in PyTorch engine.

## Quick Start

Collaborator:

We may introduce `pip install lmdeploy` here.

@pseudotensor commented May 31, 2024:

Curious if you guys would know why cogvlm2 would hit these kinds of issues with transformers. I don't have a repro, but I also noticed the same thing in sglang. I'm wondering if I'm misusing the model or something.

THUDM/CogVLM2#68

The server code is just a FastAPI wrapper (single thread at a time) around transformers:

https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py

This is just based upon their code:

https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_demo.py

client side: https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_request.py

Many things work, then it just hits issues and everything is dead.

It doesn't seem to be GPU OOM, because 40GB of the 80GB is still free.

I bring it up, because I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.

@RunningLeon (Collaborator Author) commented May 31, 2024:

> Curious if you guys would know why cogvlm2 would hit these kinds of issues with transformers. [...] I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.

@pseudotensor hi, maybe you could try lmdeploy with the latest code from this PR to see if it still happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service
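For reference, a minimal OpenAI-style client call against an lmdeploy api_server as described in the linked doc; the base URL, port, and model name below are assumptions:

```python
from openai import OpenAI

# Assumes a server was launched with, e.g.:
#   lmdeploy serve api_server THUDM/cogvlm2-llama3-chat-19B
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# Send a text prompt together with an image URL, OpenAI vision-style.
response = client.chat.completions.create(
    model='THUDM/cogvlm2-llama3-chat-19B',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }])
print(response.choices[0].message.content)
```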

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
Collaborator:

xformers should be installed

Collaborator:

I installed xformers. It pulls in torch 2.3.0, but lmdeploy requires torch<=2.2.2,>=2.0.0.

@lvhan028 (Collaborator):

@zhulinJulia24 please add cogvlm and cogvlm2 into test cases

lmdeploy/vl/model/cogvlm.py (outdated review comments, resolved)
@pseudotensor commented May 31, 2024:

> @pseudotensor hi, maybe you could try lmdeploy with the latest code from this PR to see if it still happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service

OK, trying now. First, building a docker image from this PR:

docker build . -f docker/Dockerfile -t cogvlm2 --no-cache

@pseudotensor

I modified the LLaVA-like Dockerfile for this cogvlm2 case:

# git clone https://github.com/InternLM/lmdeploy.git
# cd lmdeploy
# git fetch origin pull/1502/head:pr-1502
# git checkout pr-1502
# docker build . -f docker/Dockerfile -t cogvlm2
# cd ~/h2ogpt_ops
# docker build - < Dockerfile.cogvlm2 -t cogvlm2_internalvl

FROM cogvlm2:latest

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install --upgrade pip
RUN pip3 install timm xformers triton==2.2.0
RUN pip3 install git+https://github.com/haotian-liu/LLaVA.git --no-deps

COPY . .

CMD ["lmdeploy", "serve", "api_server", "THUDM/cogvlm2-llama3-chat-19B"]

And I notice this on startup:

2024-06-01 00:39:16,053 - lmdeploy - WARNING - Fallback to pytorch engine because `/root/.cache/huggingface/hub/models--THUDM--cogvlm2-llama3-chat-19B/snapshots/2bf7de6892877eb50142395af14847519ba95998` not supported by turbomind engine.

Is that OK?

@pseudotensor

pseudotensor commented Jun 1, 2024

A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image:

[screenshot]

Another funny one:

[screenshot]

Another bad one:

[screenshot]

It isn't always bad, but something seems off. I never noticed such oddities with the cogvlm2 demos locally.

But if I pass an image, it responds OK:

[screenshot]

@pseudotensor

pseudotensor commented Jun 1, 2024

I see this in the logs. Maybe something unintended is going on? It's OK as long as it's not doing CUDA in a fork.

INFO:     172.16.0.225:19544 - "POST /v1/chat/completions HTTP/1.1" 200 OK
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
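For reference, the warning can be silenced by setting the environment variable it mentions before the tokenizer is first used; a minimal sketch (where exactly to set it in the server process is an assumption):

```python
import os

# Must be set before `tokenizers` is first used in the process that later forks.
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
```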
