
[Bug] Deploying llava-v1.6-34b: the model keeps producing repetitive output #1604

Open · 2 tasks
wssywh opened this issue May 16, 2024 · 4 comments

wssywh commented May 16, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

After deploying the OpenAI-compatible server, the model keeps producing the same repeated output.
(screenshot of the repeated output)

Reproduction

Server launch command:

CUDA_VISIBLE_DEVICES=0,3,4,5 NCCL_SHM_DISABLE=1 lmdeploy serve api_server /data/workspace/models/llava-v1.6-34b --server-port 23333 --tp 4 --cache-max-entry-count 0.5 --chat-template template.json

Contents of template.json:

{
    "model_name": "llava-chatml",
    "system": "system\n",
    "meta_instruction": "You are a robot developed by lmdeploy.",
    "eosys": "\n",
    "user": "user\n",
    "eoh": "\n",
    "assistant": "assistant\n",
    "eoa": "",
    "separator": "\n",
    "capability": "chat",
    "stop_words": [""]
}

Client code:

from openai import OpenAI
import sys
# from qwen_test import prompts  # local helper module; not used in this snippet

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
stream = True
responses = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': '描述一下这张图片',  # "Describe this image"
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'http://yanshi.jxgh.vip:8000/000000-zg119/upload/20240123/6f3a0494cc268f34b146031d73eaaaf8.jpeg',
            },
        }],
    }],
    temperature=0.1,
    stream=stream,
    # top_p=0.8
)
if stream:
    for response in responses:
        # The final streamed chunk may carry no content, so guard against None.
        result = response.choices[0].delta.content
        if result:
            sys.stdout.write(result)
            sys.stdout.flush()
else:
    print(responses.choices[0].message.content)

Program output:
(screenshot of the repeated, looping output)

Environment

sys.platform: linux
Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu121
LMDeploy: 0.4.1+398c2aa
transformers: 4.40.2
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

@irexyc irexyc self-assigned this May 16, 2024
irexyc (Collaborator) commented May 16, 2024

I suspect it's the generation parameters. After commenting out temperature=0.1, I ran it many times and never saw the repetition.

Also, I see you specified template.json and changed meta_instruction, but why did you remove the <|im_start|> and <|im_end|> tokens?
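
For comparison, a llava-chatml style template that keeps the special tokens might look roughly like the sketch below. This is an assumption based on the standard ChatML layout, not a template taken from this thread, so it should be checked against lmdeploy's built-in llava-chatml definition:

{
    "model_name": "llava-chatml",
    "system": "<|im_start|>system\n",
    "meta_instruction": "You are a robot developed by lmdeploy.",
    "eosys": "<|im_end|>\n",
    "user": "<|im_start|>user\n",
    "eoh": "<|im_end|>\n",
    "assistant": "<|im_start|>assistant\n",
    "eoa": "<|im_end|>",
    "separator": "\n",
    "capability": "chat",
    "stop_words": ["<|im_end|>"]
}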

irexyc (Collaborator) commented May 16, 2024

How are you running it with ollama? Are you also using the same custom template and the same generation parameters there?

wssywh (Author) commented May 17, 2024

I suspect it's the generation parameters. After commenting out temperature=0.1, I ran it many times and never saw the repetition.

Also, I see you specified template.json and changed meta_instruction, but why did you remove the <|im_start|> and <|im_end|> tokens?

I also tested with the <|im_start|> and <|im_end|> tokens added back, and I get the same error.
With temperature=0.1 commented out, the repeated output no longer appears. It is strange that setting temperature=0.1 would cause repetitive output.
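
For reference, a minimal sketch of the same request with the low temperature dropped (reusing the client and model_name from the reproduction script above, and relying on the server's default sampling) could look like this; the top_p value shown is only an illustrative assumption, not something confirmed in this thread:

# Same request as in the reproduction script, but without temperature=0.1,
# which is the configuration where repetition was not observed.
responses = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': '描述一下这张图片'},  # "Describe this image"
            {'type': 'image_url', 'image_url': {
                'url': 'http://yanshi.jxgh.vip:8000/000000-zg119/upload/20240123/6f3a0494cc268f34b146031d73eaaaf8.jpeg',
            }},
        ],
    }],
    # temperature left at the server default instead of 0.1
    top_p=0.8,  # illustrative assumption
    stream=True,
)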

wssywh (Author) commented May 17, 2024

How are you running it with ollama? Are you also using the same custom template and the same generation parameters there?

ollama uses the same prompt, image, and parameters, with no custom template, and its output is normal.
