
[Bug] Deploying llava-v1.6-34b: the model keeps producing repetitive output #1604

Open · 2 tasks
wssywh opened this issue May 16, 2024 · 4 comments

wssywh commented May 16, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

After deploying the OpenAI-compatible server, the model keeps producing the same repeated output.
(screenshot of the repeated output)

Reproduction

Server launch command:

CUDA_VISIBLE_DEVICES=0,3,4,5 NCCL_SHM_DISABLE=1 lmdeploy serve api_server /data/workspace/models/llava-v1.6-34b --server-port 23333 --tp 4 --cache-max-entry-count 0.5 --chat-template template.json

Contents of template.json:

{
    "model_name": "llava-chatml",
    "system": "system\n",
    "meta_instruction": "You are a robot developed by lmdeploy.",
    "eosys": "\n",
    "user": "user\n",
    "eoh": "\n",
    "assistant": "assistant\n",
    "eoa": "",
    "separator": "\n",
    "capability": "chat",
    "stop_words": [""]
}

Client code:

from openai import OpenAI
import sys
# from qwen_test import prompts  # local helper module; not used in this snippet

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
model_name = client.models.list().data[0].id
stream = True
responses = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [{
            'type': 'text',
            'text': '描述一下这张图片',  # "Describe this image"
        }, {
            'type': 'image_url',
            'image_url': {
                'url': 'http://yanshi.jxgh.vip:8000/000000-zg119/upload/20240123/6f3a0494cc268f34b146031d73eaaaf8.jpeg',
            },
        }],
    }],
    temperature=0.1,
    stream=stream,
    # top_p=0.8
)
if stream:
    for response in responses:
        # The final streamed chunk may carry no content, so guard against None.
        result = response.choices[0].delta.content
        if result:
            sys.stdout.write(result)
            sys.stdout.flush()
else:
    print(responses.choices[0].message.content)

Program output:
(screenshot of the repeated, looping output)

Environment

sys.platform: linux
Python: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5: NVIDIA A100-PCIE-40GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.2, V12.2.140
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.2.1+cu121
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.3.2 (Git Hash 2dc95a2ad0841e29db8b22fbccaf3e5da7992b01)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 12.1
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.9.2
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.2.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 

TorchVision: 0.17.1+cu121
LMDeploy: 0.4.1+398c2aa
transformers: 4.40.2
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.1
triton: 2.2.0

Error traceback

No response

@irexyc irexyc self-assigned this May 16, 2024
irexyc (Collaborator) commented May 16, 2024

I suspect it's the generation parameters. After commenting out temperature=0.1, I ran it many times and never saw the repetition.

Also, I see you specified template.json and changed meta_instruction, but why did you remove the <|im_start|> and <|im_end|> tokens?
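
For comparison, a llava-chatml style template that keeps the special tokens might look roughly like the sketch below. This is an assumption based on the standard ChatML layout, not a template taken from this thread, so it should be checked against lmdeploy's built-in llava-chatml definition:

{
    "model_name": "llava-chatml",
    "system": "<|im_start|>system\n",
    "meta_instruction": "You are a robot developed by lmdeploy.",
    "eosys": "<|im_end|>\n",
    "user": "<|im_start|>user\n",
    "eoh": "<|im_end|>\n",
    "assistant": "<|im_start|>assistant\n",
    "eoa": "<|im_end|>",
    "separator": "\n",
    "capability": "chat",
    "stop_words": ["<|im_end|>"]
}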

irexyc (Collaborator) commented May 16, 2024

How are you running it with ollama? Are you also using the same custom template and the same generation parameters there?

wssywh (Author) commented May 17, 2024

I suspect it's the generation parameters. After commenting out temperature=0.1, I ran it many times and never saw the repetition.

Also, I see you specified template.json and changed meta_instruction, but why did you remove the <|im_start|> and <|im_end|> tokens?

I also tested with the <|im_start|> and <|im_end|> tokens added back, and I get the same error.
With temperature=0.1 commented out, the repeated output no longer appears. It is strange that setting temperature=0.1 would cause repetitive output.
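
For reference, a minimal sketch of the same request with the low temperature dropped (reusing the client and model_name from the reproduction script above, and relying on the server's default sampling) could look like this; the top_p value shown is only an illustrative assumption, not something confirmed in this thread:

# Same request as in the reproduction script, but without temperature=0.1,
# which is the configuration where repetition was not observed.
responses = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': '描述一下这张图片'},  # "Describe this image"
            {'type': 'image_url', 'image_url': {
                'url': 'http://yanshi.jxgh.vip:8000/000000-zg119/upload/20240123/6f3a0494cc268f34b146031d73eaaaf8.jpeg',
            }},
        ],
    }],
    # temperature left at the server default instead of 0.1
    top_p=0.8,  # illustrative assumption
    stream=True,
)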

wssywh (Author) commented May 17, 2024

How are you running it with ollama? Are you also using the same custom template and the same generation parameters there?

ollama uses the same prompt, image, and parameters, with no custom template, and its output is normal.
