
add chatglm3-6b model support [help wanted] #6999

Open · wants to merge 4 commits into base: master

Conversation


@mnlife mnlife commented Apr 30, 2024

Text generation has been implemented.

The following features (that I know of) have not been implemented yet, compared with the PyTorch version:

  • The model input does not include the prefix tokens {"[gMASK]", "sop", "<|user|>", "_", "<0x0A>"} or the suffix token {"<|assistant|>"}.
    • For example: when we input "hi", after tokenization it should be {"[gMASK]", "sop", "<|user|>", "_", "<0x0A>", "hi", "<|assistant|>"}. To implement this feature, what do we need to change in llama.cpp?
    • When I add 9a8db6b and run the command below, the changes do not take effect:
./build/bin/main -m ~/models/chatglm3-6b-Q4_K_M.gguf --verbose-prompt -p 你好
  • The inference results are incorrect with the CUDA version.
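
The prefix/suffix wrapping described above can be sketched as follows. This is a minimal illustration only, assuming the token sequence reported in this comment; `wrap_chatglm3_prompt` is a hypothetical helper, not a llama.cpp function. In llama.cpp itself this would correspond to emitting the special tokens during tokenization / chat-template handling.

```python
# Hypothetical sketch of ChatGLM3 prompt wrapping (not llama.cpp API).
# Token strings are taken verbatim from the tokenizer output quoted above.

def wrap_chatglm3_prompt(user_text: str) -> list[str]:
    """Wrap a raw user prompt with the special tokens ChatGLM3 expects."""
    prefix = ["[gMASK]", "sop", "<|user|>", "_", "<0x0A>"]  # prepended before the user text
    suffix = ["<|assistant|>"]  # appended so the model knows to start replying
    return prefix + [user_text] + suffix

# wrap_chatglm3_prompt("hi")
# -> ["[gMASK]", "sop", "<|user|>", "_", "<0x0A>", "hi", "<|assistant|>"]
```

The key point is that the wrapping happens at the token level, not by string concatenation: "[gMASK]", "sop", "<|user|>", and "<|assistant|>" must map to their dedicated special-token IDs rather than being re-tokenized as plain text.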

Below are some links about the ChatGLM model:
The Hugging Face model path for chatglm3-6b: https://huggingface.co/THUDM/chatglm3-6b
gguf model: https://modelscope.cn/api/v1/models/mnlife/chatglm3-6b-gguf/repo?Revision=master&FilePath=chatglm3-6b-Q4_K_M.gguf


github-actions bot commented Apr 30, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 531 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8835.79ms p(95)=22264.85ms fails=, finish reason: stop=477 truncated=54
  • Prompt processing (pp): avg=105.84tk/s p(95)=462.75tk/s
  • Token generation (tg): avg=49.51tk/s p(95)=47.34tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=chatglm3 commit=ed1d3ffc2a97d4d7aff94e419b9701c14487f6c0

prompt_tokens_seconds
[chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations; y-axis: llamacpp:prompt_tokens_seconds]
predicted_tokens_seconds
[chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations; y-axis: llamacpp:predicted_tokens_seconds]

kv_cache_usage_ratio
[chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations; y-axis: llamacpp:kv_cache_usage_ratio]
requests_processing
[chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 531 iterations; y-axis: llamacpp:requests_processing]

@mofosyne mofosyne added help wanted Extra attention is needed enhancement New feature or request review complexity : medium Generally require more time to grok but manageable by beginner to medium expertise level labels May 9, 2024
convert-hf-to-gguf.py: 1 review comment (outdated, resolved)
@mofosyne mofosyne self-assigned this May 10, 2024
@mofosyne mofosyne removed their assignment May 10, 2024
huggingface model: https://hf-mirror.com/THUDM/chatglm3-6b

Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
@mofosyne mofosyne marked this pull request as ready for review May 15, 2024 03:12
convert-hf-to-gguf.py: 3 review comments (outdated, resolved)
Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
convert-hf-to-gguf.py: 4 review comments (outdated, resolved); 1 review comment (resolved)
Signed-off-by: XingXing Qiao <qiaoxx@dingdao.com>
4 participants