Grammar not working #2159

Closed
loganlebanoff opened this issue May 16, 2024 · 2 comments

@loganlebanoff

Running the sample wav file with a grammar gives no change in the output compared to running without the grammar. I'm purposely not giving any prompt because I want to see how it works without the help of a prompt.

Command:
./main -f samples/jfk.wav -m models/ggml-tiny.en.bin -t 8 --grammar ./grammars/chess.gbnf --grammar-penalty 100

Output:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =    77.11 MB
whisper_model_load: model size    =   77.11 MB
whisper_init_state: kv self size  =    8.26 MB
whisper_init_state: kv cross size =    9.22 MB
whisper_init_state: compute buffer (conv)   =   13.32 MB
whisper_init_state: compute buffer (encode) =   85.66 MB
whisper_init_state: compute buffer (cross)  =    4.01 MB
whisper_init_state: compute buffer (decode) =   96.02 MB
main: grammar:
root ::= init color [.]
init ::= [ ] [r] [e] [d] [,] [ ] [g] [r] [e] [e] [n] [,] [ ] [b] [l] [u] [e]
color ::= [,] [ ] color_4
prompt ::= init [.]
color_4 ::= [r] [e] [d] | [g] [r] [e] [e] [n] | [b] [l] [u] [e]


system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:07.960]   And so my fellow Americans ask not what your country can do for you
[00:00:07.960 --> 00:00:10.760]   ask what you can do for your country.


whisper_print_timings:     load time =   288.52 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    12.49 ms
whisper_print_timings:   sample time =    44.04 ms /   139 runs (    0.32 ms per run)
whisper_print_timings:   encode time =   466.53 ms /     1 runs (  466.53 ms per run)
whisper_print_timings:   decode time =    10.08 ms /     2 runs (    5.04 ms per run)
whisper_print_timings:   batchd time =   128.82 ms /   133 runs (    0.97 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   962.72 ms

It doesn't seem to make any difference if I increase the grammar-penalty or use a different grammar. It doesn't even work when I run it on audio whose content matches the grammar.

Command:
./main -f knight.wav -m models/ggml-tiny.en.bin -t 8 --grammar ./grammars/chess.gbnf --grammar-penalty 100

Output:

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-tiny.en.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head  = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 384
whisper_model_load: n_text_head   = 6
whisper_model_load: n_text_layer  = 4
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 1 (tiny)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =    77.11 MB
whisper_model_load: model size    =   77.11 MB
whisper_init_state: kv self size  =    8.26 MB
whisper_init_state: kv cross size =    9.22 MB
whisper_init_state: compute buffer (conv)   =   13.32 MB
whisper_init_state: compute buffer (encode) =   85.66 MB
whisper_init_state: compute buffer (cross)  =    4.01 MB
whisper_init_state: compute buffer (decode) =   96.02 MB
main: grammar:
root ::= init move root_3 root_4 [.]
init ::= [ ] [r] [o] [o] [k] [ ] [t] [o] [ ] [b] [4] [,] [ ] [f] [3]
move ::= [,] [ ] move_12 [a-h] [1-8]
root_3 ::= move |
root_4 ::= move |
prompt ::= init [.]
move_6 ::= move_7 [ ] move_11
move_7 ::= piece | pawn | king
piece ::= [b] [i] [s] [h] [o] [p] | [r] [o] [o] [k] | [k] [n] [i] [g] [h] [t] | [q] [u] [e] [e] [n]
pawn ::= [p] [a] [w] [n]
king ::= [k] [i] [n] [g]
move_11 ::= [t] [o] [ ] |
move_12 ::= move_6 |


system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'knight.wav' (44360 samples, 2.8 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:03.640]   night to E5.


whisper_print_timings:     load time =   267.92 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     6.95 ms
whisper_print_timings:   sample time =     9.25 ms /    34 runs (    0.27 ms per run)
whisper_print_timings:   encode time =   448.45 ms /     1 runs (  448.45 ms per run)
whisper_print_timings:   decode time =     8.05 ms /     1 runs (    8.05 ms per run)
whisper_print_timings:   batchd time =    25.38 ms /    29 runs (    0.88 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   776.40 ms

It should give "knight to e5" instead of "night to E5".

@ggerganov
Owner

Along with --grammar, you also have to pass the name of the top-level grammar rule to use, for example: --grammar-rule root
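
For example, the second command above would then look something like this (a sketch that keeps the reporter's original flags unchanged; flag order is not significant):

./main -f knight.wav -m models/ggml-tiny.en.bin -t 8 --grammar ./grammars/chess.gbnf --grammar-rule root --grammar-penalty 100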

@loganlebanoff
Author

Thanks, that worked!
