[Feature]: Support cogvlm-chat #1502
```diff
@@ -75,7 +89,7 @@ class ModelConfig:
     num_attention_heads: int
     num_key_value_heads: int
     bos_token_id: int
-    eos_token_id: int
+    eos_token_id: List[int]
```
AFAIK, there is only one eos_token_id for each model.
But cogvlm2 has two eos tokens, and the PyTorch engine is able to deal with multiple eos tokens.
There are often multiple EOS tokens, and it has tripped up HF transformers as well. E.g. llama-3 has eot and eos, and both need to be stopped on. Other models are the same.
Would it be one eos_token_id and multiple stop words?
For generation, it's identified as 2 eos_token_ids. HF added proper support for this after llama-3 and other models required it.
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/generation_config.json#L3
For the model itself, there's still only 1 eos token id:
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/config.json#L8
So yes, it's equivalent to saying that those 2 are just stop ids. It depends upon how one uses the information.
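For illustration, here is a minimal sketch of how those two stop ids surface through transformers, assuming access to the gated Llama-3 repo (the printed value is taken from the linked generation_config.json):

```python
# Minimal sketch: read the generation config and inspect its stop ids.
# Assumes `transformers` is installed and the gated repo is accessible.
from transformers import GenerationConfig

cfg = GenerationConfig.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct')
# eos_token_id is a list here, e.g. [128001, 128009]; generation must stop
# on either id, while config.json still reports a single model-level eos.
print(cfg.eos_token_id)
```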
According to the note in this PR ("you need to copy the tokenizer model and configs into CogVLM model directory"), I suggest adding a user guide about CogVLM deployment.
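As one hedged illustration of that note, the copy step could look like the sketch below; the local directory path is a placeholder and the exact file list is an assumption:

```python
# Sketch: fetch the vicuna-7b-v1.5 tokenizer files and copy them into a
# local cogvlm-chat-hf directory. The path and file list are assumptions.
import shutil
from huggingface_hub import hf_hub_download

cogvlm_dir = '/path/to/cogvlm-chat-hf'  # placeholder model directory
for name in ('tokenizer.model', 'tokenizer_config.json',
             'special_tokens_map.json'):
    src = hf_hub_download('lmsys/vicuna-7b-v1.5', name)
    shutil.copy(src, cogvlm_dir)
```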
This is what I got:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

model = '/nvme/shared_data/models--THUDM--cogvlm-chat-hf/snapshots/e29dc3ba206d524bf8efbfc60d80fc4556ab0e3c'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
pipe = pipeline(model, log_level='INFO')
response = pipe('describe this image')
print(response)
```

```
Response(text='in a and nobody. everyone The SPA and the 20000.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000', generate_token_len=512......
```
Can cogvlm do only chatting without an image?
It seems chatting without images is not working even for HF models. This is the output for the HF model:
Cogvlm2 works fine without images. Here's a modification of their openai script that works: https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py The client is like the OpenAI client, with or without images.
I replaced the model path with cogvlm2, but found it responded pretending there was an image.
```python
# cogvlm-chat
CogVLMForCausalLM=True,
# llava
LlavaLlamaForCausalLM=False,
# deepseekvl
MultiModalityCausalLM=False,
```
Does it mean other VL model architectures supported by turbomind also need to be updated to False here?
If the arch does not exist, it would be False. No need to add them in this PR.
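A minimal sketch of the lookup behaviour described here, assuming the flags live in a plain dict (names are illustrative, not the exact PR code):

```python
# Unknown architectures fall back to False, so unsupported VL models do
# not need an explicit entry in the table.
SUPPORTED_ARCHS = dict(
    CogVLMForCausalLM=True,       # cogvlm-chat: supported by this PR
    LlavaLlamaForCausalLM=False,  # llava: handled by another path
)

def is_supported(arch: str) -> bool:
    return SUPPORTED_ARCHS.get(arch, False)

print(is_supported('SomeUnknownArch'))  # False, without any new entry
```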
Previously, when users ran VL models with the PyTorch engine, they would get a "not supported" hint. However, the current branch will lead users to fix errors individually and finally get nothing but failure, even though it is a deepseek-vl model or a llava model.
CogVLM is a powerful open-source visual language model (VLM). LMDeploy supports CogVLM-17B models like [THUDM/cogvlm-chat-hf](https://huggingface.co/THUDM/cogvlm-chat-hf) and CogVLM2-19B models like [THUDM/cogvlm2-llama3-chat-19B](https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B) in PyTorch engine.

## Quick Start
You may want to introduce `pip install lmdeploy` here.
Curious if you guys would know why cogvlm2 would hit these kinds of issues with transformers. I don't have a repro, but I also noticed the same thing in sglang, so I'm wondering if I'm misusing the model or something. The server code is just a fastapi wrapper (single thread at a time) with transformers: https://github.com/h2oai/h2ogpt/blob/main/openai_server/cogvlm2_server/cogvlm2.py It is just based upon their code: https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_demo.py Client side: https://github.com/THUDM/CogVLM2/blob/main/basic_demo/openai_api_request.py Many things work, then it just hits issues and everything is dead. It doesn't seem to be GPU OOM, because 40GB of 80GB is still left. I bring it up because I'd like to use lmdeploy for cogvlm2, but I'm worried something is not right.
@pseudotensor hi, maybe you could try lmdeploy using the latest code from this PR to see if it happens. You can refer to this doc: https://lmdeploy.readthedocs.io/en/latest/serving/api_server_vl.html#launch-service
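For reference, a client call against such a server might look like the sketch below, assuming an OpenAI-compatible endpoint on lmdeploy's default port (the base URL and port are assumptions, per the linked doc):

```python
# Sketch: query an lmdeploy api_server with text plus an image URL via the
# OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(api_key='none', base_url='http://0.0.0.0:23333/v1')
model = client.models.list().data[0].id
resp = client.chat.completions.create(
    model=model,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url', 'image_url': {
                'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }])
print(resp.choices[0].message.content)
```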
Install LMDeploy with pip (Python 3.8+). Refer to [Installation](https://lmdeploy.readthedocs.io/en/latest/get_started.html#installation) for more.

```shell
pip install lmdeploy
```
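The quick start could then continue with an offline inference sketch like the one below, mirroring the pipeline call used earlier in this thread (the model path is a placeholder):

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Placeholder path: a local cogvlm-chat-hf directory with the vicuna
# tokenizer files copied in, as noted in this PR.
pipe = pipeline('/path/to/cogvlm-chat-hf')
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
```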
xformers should be installed
I installed xformers. It will install torch 2.3.0, but lmdeploy requires torch<=2.2.2,>=2.0.0.
@zhulinJulia24 please add cogvlm and cogvlm2 into the test cases.
Ok, trying now. First, building a docker image from this PR:
I modified the docker llava-like thing for this cogvlm2 case:

And noticed on startup:

Is that ok?
A quite strange response that the transformers usage of the model doesn't produce. It seems like the prompting is off. This is with no image:

Another funny one:

Another bad one:

It isn't always bad, but something seems off. I never noticed such oddities with the cogvlm2 demos locally. But if I pass an image, it responds ok:
I see this in the logs. Maybe something unintended is going on? It's ok as long as it's not doing CUDA in a fork.
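If that fork warning matters in practice, the usual pattern is to force the 'spawn' start method so CUDA is never touched in a forked worker; a generic sketch, not something this PR requires:

```python
# Generic pattern: use 'spawn' so child processes start fresh instead of
# inheriting a forked CUDA context.
import multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    # ... launch CUDA-using workers here ...
```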
Motivation
Support cogvlm-chat-hf and CogVLM2 for the PyTorch engine.
Usage:
Warning
CogVLM-Chat-hf uses 'lmsys/vicuna-7b-v1.5' as the tokenizer; you need to copy the tokenizer model and configs into the CogVLM model directory.

Modification
TODOs
ModelInputs.split with vision embeddings

BC-breaking (Optional)
Profiling
tp=1 batch_size=128 num-prompts=3000
cogvlm-chat-hf
Without images
With one image
Change profile_throughput.py to prepend 1234 tokens and image embeddings to each prompt (see the sketch below).
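A rough sketch of that profiling tweak, assuming prompts are token-id lists (the names here are illustrative, not the actual profile_throughput.py code):

```python
# Prepend a fixed block of placeholder ids to each prompt so its length
# mimics a prompt carrying image embeddings. 1234 is the value quoted above.
IMAGE_TOKEN_LEN = 1234
IMAGE_TOKEN_ID = 0  # placeholder id standing in for vision embeddings

def with_image_prefix(input_ids: list) -> list:
    return [IMAGE_TOKEN_ID] * IMAGE_TOKEN_LEN + list(input_ids)
```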
REST API

Using PR #1662
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist