[Feature]: Support cogvlm-chat #1502

Open · wants to merge 40 commits into main
Conversation

@RunningLeon (Collaborator) commented Apr 26, 2024

Motivation

Support cogvlm-chat-hf and CogVLM2 for pytorch engine

Usage:

Warning

CogVLM-Chat-hf uses 'lmsys/vicuna-7b-v1.5' as its tokenizer; you need to copy the tokenizer model and configs into the CogVLM model directory (a sketch of how to do this follows the usage example below).

from lmdeploy import pipeline
from lmdeploy.vl import load_image

model_path = './models--THUDM--cogvlm-chat-hf'

pipe = pipeline(model_path)

image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
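Before running the pipeline above, the vicuna tokenizer files can be copied into the CogVLM directory. A minimal sketch, assuming the files are fetched with huggingface_hub; the file list is an assumption, adjust it to what 'lmsys/vicuna-7b-v1.5' actually ships:

```python
import shutil

from huggingface_hub import hf_hub_download

cogvlm_dir = './models--THUDM--cogvlm-chat-hf'
# Assumed tokenizer artifacts shipped by 'lmsys/vicuna-7b-v1.5'.
for filename in ('tokenizer.model', 'tokenizer_config.json', 'special_tokens_map.json'):
    # Download (or reuse from cache) each file, then copy it next to the CogVLM weights.
    local_path = hf_hub_download(repo_id='lmsys/vicuna-7b-v1.5', filename=filename)
    shutil.copy(local_path, cogvlm_dir)
```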

Modification

TODOs

BC-breaking (Optional)

Profiling

tp=1 batch_size=128 num-prompts=3000

cogvlm-chat-hf

Without images
concurrency: 128
elapsed_time: 282.105s

first token latency(s)(min, max, ave): 0.859, 5.820, 1.471
per-token latency(s) percentile(50, 75, 95, 99): [0.032, 0.033, 0.111, 0.237]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 2571.792 token/s
token throughput (prompt + completion token): 5260.009 token/s
RPS (request per second): 10.634 req/s
RPM (request per minute): 638.060 req/min
With one image

Changed profile_throughput.py to prepend 1234 tokens and image embeddings to each prompt (a sketch of this kind of change follows the profiling numbers below).

concurrency: 128
elapsed_time: 1066.815s

first token latency(s)(min, max, ave): 0.881, 39.275, 31.872
per-token latency(s) percentile(50, 75, 95, 99): [0.033, 0.036, 0.266, 0.334]

number of prompt tokens: 758360
number of completion tokens: 725516
token throughput (completion token): 680.077 token/s
token throughput (prompt + completion token): 1390.940 token/s
RPS (request per second): 2.812 req/s
RPM (request per minute): 168.726 req/min
REST API

using PR #1662

concurrency: 16
elapsed_time: 447.581s

first_token latency(min, max, ave): 0.055s, 7.081s, 1.278s

number of prompt tokens: 248339
number of completion tokens: 240582
token throughput (completion token): 537.516 token/s
token throughput (prompt + completion token): 1092.364 token/s
RPS (request per second): 2.234 req/s
RPM (request per minute): 134.054 req/min
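For illustration only, a hypothetical sketch of the profile_throughput.py change mentioned above, prepending placeholder image tokens to each benchmark prompt; the function and parameter names are assumptions, not the PR's actual modification:

```python
IMAGE_TOKEN_LEN = 1234  # placeholder tokens prepended per prompt, per the note above


def prepend_image_placeholders(input_ids, pad_token_id=0):
    """Pad a benchmark prompt as if 1234 image tokens (and their embeddings)
    were injected in front of it, so prompt lengths match the VL workload."""
    return [pad_token_id] * IMAGE_TOKEN_LEN + list(input_ids)


# Sketch of usage inside the benchmark's sampling loop:
# input_ids = tokenizer.encode(prompt)
# input_ids = prepend_image_placeholders(input_ids)
```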

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm [WIP]: Support cogvlm-chat May 10, 2024
@RunningLeon RunningLeon marked this pull request as ready for review May 15, 2024 12:24
@RunningLeon RunningLeon changed the title [WIP]: Support cogvlm-chat [Feature]: Support cogvlm-chat May 15, 2024
@RunningLeon RunningLeon removed the WIP label May 15, 2024
@@ -75,7 +89,7 @@ class ModelConfig:
     num_attention_heads: int
     num_key_value_heads: int
     bos_token_id: int
-    eos_token_id: int
+    eos_token_id: List[int]
Collaborator:

AFAIK, there is only one eos_token_id for each model.

Collaborator Author:

But CogVLM2 has two eos tokens, and the PyTorch engine is able to handle multiple eos tokens.

There are often multiple EOS tokens, and it has tripped up HF transformers as well. E.g. Llama-3 has eot and eos, and both need to be stopped on. Other models are the same.

Collaborator:

Would it be one eos_token_id and multiple stop words instead?

For generation, it's identified as 2 eos_token_ids. HF added proper support for this after llama-3 and other models required it.

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/generation_config.json#L3

For the model itself, there's still only 1 eos token id:

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/config.json#L8

So yes, it's equivalent to saying that those 2 are just stop ids; it depends on how one uses the information.
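For illustration, a minimal sketch of stopping on multiple EOS ids in the spirit of this discussion (a hypothetical helper, not the engine's actual stopping logic):

```python
from typing import List, Sequence


def is_finished(next_token_id: int,
                eos_token_ids: List[int],
                stop_words_ids: Sequence[int] = ()) -> bool:
    """Stop decoding when the sampled token is any EOS id or any extra stop-word id."""
    return next_token_id in eos_token_ids or next_token_id in stop_words_ids


# Llama-3-Instruct declares two terminators in its generation_config:
# is_finished(token_id, eos_token_ids=[128001, 128009])
```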

@lvhan028 (Collaborator):

According to the note in this PR, "you need to copy the tokenizer model and configs into the CogVLM model directory", I suggest writing a user guide about CogVLM deployment.
I think we can make a "multi_modal" folder in docs/en, with cogvlm.md as one example in that folder, like swift did:
https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal

@AllentDan (Collaborator) left a comment:

This is what I got

from lmdeploy import pipeline
from lmdeploy.vl import load_image

model = '/nvme/shared_data/models--THUDM--cogvlm-chat-hf/snapshots/e29dc3ba206d524bf8efbfc60d80fc4556ab0e3c'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, log_level='INFO')

response = pipe('describe this image')
print(response)
Response(text='in a and nobody. everyone  The SPA and the 20000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', generate_token_len=512......

Can cogvlm do text-only chat, without an image?

@RunningLeon (Collaborator Author):

> Can cogvlm do text-only chat, without an image?

It seems that chatting without images does not work even for the HF model. This is the output of the HF model:
query=Describe this image answer= nobody was 2016-02-11 16:30:32</s>

@pseudotensor

CogVLM2 works fine without images. Here's a modification of their OpenAI script that works:

https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py

The client is like the OpenAI client, with or without images.

@AllentDan (Collaborator):

I replaced the model path with cogvlm2, but found it responded as if there were an image.

Comment on lines +45 to +50
# cogvlm-chat
CogVLMForCausalLM=True,
# llava
LlavaLlamaForCausalLM=False,
# deepseekvl
MultiModalityCausalLM=False,
Collaborator:

Does it mean other VL model architectures supported by turbomind also need to be set to False here?

Collaborator Author:

If the arch does not exist, it defaults to False. No need to add them in this PR.
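A minimal sketch of the default-to-False lookup being described (the mapping and function names here are illustrative assumptions, and the flag's exact meaning in the PR is not shown):

```python
# Hypothetical architecture -> flag mapping mirroring the reviewed snippet.
VL_ARCH_FLAGS = dict(
    CogVLMForCausalLM=True,       # cogvlm-chat
    LlavaLlamaForCausalLM=False,  # llava
    MultiModalityCausalLM=False,  # deepseekvl
)


def arch_flag(arch: str) -> bool:
    # Unknown architectures fall back to False, so they need no entry here.
    return VL_ARCH_FLAGS.get(arch, False)
```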

@AllentDan (Collaborator) commented May 31, 2024:

Previously, when users ran VL models with the PyTorch engine, they would get a "not supported" hint. However, the current branch will lead users to fix errors one by one and end up with nothing but failure, even if it is a DeepSeek-VL model or a LLaVA model.

CogVLM is a powerful open-source visual language model (VLM). LMDeploy supports CogVLM-17B models like [THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf) and CogVLM2-19B models like [THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) in PyTorch engine.

## Quick Start

Collaborator:

We may introduce `pip install lmdeploy` here.

@pseudotensor commented May 31, 2024:

Curious if you guys would know why cogvlm2 would hit these kinds of issues with transformers. I don't have a repro, but I also noticed the same thing in sglang. I'm wondering if I'm misusing the model or something.

THUDM/CogVLM2#68

The server code is just a FastAPI wrapper (single thread at a time) around transformers:

https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py

This is just based upon their code:

https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_demo.py

client side: https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_request.py

Many things work, then it just hits issues and everything is dead.

It doesn't seem to be GPU OOM, because 40GB of the 80GB is still free.

I bring it up, because I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.

@RunningLeon (Collaborator Author) commented May 31, 2024:

> Curious if you guys would know why cogvlm2 would hit these kinds of issues with transformers. [...] I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.

@pseudotensor hi, maybe you could try lmdeploy with the latest code from this PR to see if it still happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service
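For reference, a minimal OpenAI-style client call against an lmdeploy api_server as described in the linked doc; the base URL, port, and model name below are assumptions:

```python
from openai import OpenAI

# Assumes a server was launched with, e.g.:
#   lmdeploy serve api_server THUDM/cogvlm2-llama3-chat-19B
client = OpenAI(base_url='http://0.0.0.0:23333/v1', api_key='none')

# Send a text prompt together with an image URL, OpenAI vision-style.
response = client.chat.completions.create(
    model='THUDM/cogvlm2-llama3-chat-19B',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }])
print(response.choices[0].message.content)
```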

Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
Collaborator:

xformers should be installed

Collaborator:

I installed xformers. It pulls in torch 2.3.0, but lmdeploy requires torch<=2.2.2,>=2.0.0.

@lvhan028 (Collaborator):

@zhulinJulia24 please add cogvlm and cogvlm2 into test cases

lmdeploy/vl/model/cogvlm.py (outdated review comments, resolved)
@pseudotensor commented May 31, 2024:

> @pseudotensor hi, maybe you could try lmdeploy with the latest code from this PR to see if it still happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service

OK, trying now. First, building a docker image from this PR:

docker build . -f docker/Dockerfile -t cogvlm2 --no-cache

@pseudotensor

I modified the LLaVA-like Dockerfile for this cogvlm2 case:

# git clone https://github.com/InternLM/lmdeploy.git
# cd lmdeploy
# git fetch origin pull/1502/head:pr-1502
# git checkout pr-1502
# docker build . -f docker/Dockerfile -t cogvlm2
# cd ~/h2ogpt_ops
# docker build - < Dockerfile.cogvlm2 -t cogvlm2_internalvl

FROM cogvlm2:latest

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install --upgrade pip
RUN pip3 install timm xformers triton==2.2.0
RUN pip3 install git+https://github.com/haotian-liu/LLaVA.git --no-deps

COPY . .

CMD ["lmdeploy", "serve", "api_server", "THUDM/cogvlm2-llama3-chat-19B"]

And I notice this on startup:

2024-06-01 00:39:16,053 - lmdeploy - WARNING - Fallback to pytorch engine because `/root/.cache/huggingface/hub/models--THUDM--cogvlm2-llama3-chat-19B/snapshots/2bf7de6892877eb50142395af14847519ba95998` not supported by turbomind engine.

Is that OK?

@pseudotensor

pseudotensor commented Jun 1, 2024

A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image:

[screenshot]

Another funny one:

[screenshot]

Another bad one:

[screenshot]

It isn't always bad, but something seems off. I never noticed such oddities with the cogvlm2 demos locally.

But if I pass an image, it responds OK:

[screenshot]

@pseudotensor

pseudotensor commented Jun 1, 2024

I see this in the logs. Maybe something unintended is going on? It's OK as long as it's not doing CUDA in a fork.

INFO:     172.16.0.225:19544 - "POST /v1/chat/completions HTTP/1.1" 200 OK
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
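For reference, the warning can be silenced by setting the environment variable it mentions before the tokenizer is first used; a minimal sketch (where exactly to set it in the server process is an assumption):

```python
import os

# Must be set before `tokenizers` is first used in the process that later forks.
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
```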
