
Bug fix for server crash if first token is the stop word and asking for logprobs #7038

Merged
merged 1 commit into ggerganov:master from maor-ps:patch-2 on May 4, 2024

Conversation

@maor-ps (Contributor) commented May 2, 2024

If the stop token is the first token suggested by the model and logprobs are requested, the server crashes while generating the logprobs vector.

This fix makes the allocation of that vector safe, so the server no longer crashes. The response then has empty content, but it also does not return any logprobs...
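
A minimal sketch of the clamping logic, written in Python for illustration only (the actual fix is C++ iterator arithmetic in the server code; the variable names mirror the review comment below):

# Sketch only: mirrors the fix's clamping with a Python list.
# generated_token_probs holds one logprob entry per generated token;
# stop_word_toks is the tokenization of the matched stop word.
def trim_stop_word_probs(generated_token_probs, stop_word_toks):
    # Clamp the number of trailing entries to drop so the slice can never
    # reach past the front of the list. The pre-fix C++ effectively computed
    # end() - stop_word_toks.size(), which is undefined behavior when the
    # stop word has more tokens than were generated.
    n = min(len(stop_word_toks), len(generated_token_probs))
    return generated_token_probs[: len(generated_token_probs) - n]

# If the very first token is the stop word, everything is trimmed away,
# matching the empty content / no logprobs behavior described above:
assert trim_stop_word_probs([-0.01], [13, 13]) == []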

Toy example to reproduce on Llama-2-13B:

{
    'prompt': 'Q: hello world \nA: ',
    'stop': ['\n'],
    'temperature': 0.0,
    'n_predict': 10,
    'cache_prompt': True,
    'n_probs': 10
}
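
For completeness, a small script that sends this payload. Assumptions: a llama.cpp server listening on its default http://localhost:8080 and the standard /completion endpoint; before this fix, the request below crashed the server.

import requests

payload = {
    'prompt': 'Q: hello world \nA: ',
    'stop': ['\n'],
    'temperature': 0.0,
    'n_predict': 10,
    'cache_prompt': True,
    'n_probs': 10,
}

# After the fix, this returns normally with empty content and no logprobs
# instead of taking the server down.
r = requests.post('http://localhost:8080/completion', json=payload)
print(r.status_code)
print(r.json())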

github-actions bot commented May 2, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 549 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8545.68ms p(95)=20299.56ms fails=, finish reason: stop=487 truncated=62
  • Prompt processing (pp): avg=102.25tk/s p(95)=428.51tk/s
  • Token generation (tg): avg=34.57tk/s p(95)=47.19tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=patch-2 commit=534db8eb3e6c95712193809071b8a4036c2f2a07

[Benchmark charts omitted; the comment included four time-series plots for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 (duration=10m, 549 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, and requests_processing.]

@maor-ps changed the title from "Bug fix for server crash if first token is the stop word" to "Bug fix for server crash if first token is the stop word and asking for logprobs" on May 2, 2024
@ngxson self-requested a review on May 4, 2024
@ngxson (Collaborator) left a comment

LGTM. Thanks!

This seems to be an edge case where stop_word_toks.size() > generated_token_probs.size(), which makes slot.generated_token_probs.end() - stop_word_toks.size() point before the start of the vector and crashes the server. For example, when the very first generated token is the stop word and that stop word tokenizes to two tokens, only one entry has been generated, so end() - 2 lands before begin().

@ngxson merged commit 03fb8a0 into ggerganov:master on May 4, 2024
64 checks passed
nopperl pushed a commit to nopperl/llama.cpp that referenced this pull request May 5, 2024
teleprint-me pushed a commit to teleprint-me/llama.cpp that referenced this pull request May 7, 2024