
Whisper stutters #774

Open

jrp2014 opened this issue May 10, 2024 · 8 comments


jrp2014 commented May 10, 2024

Using

import mlx_whisper

speech_file = "stereo.mp3"

text = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
)["text"]

f = open("result.txt", "w+")
f.write(text)
f.close()

I find that the output contains repeated phrases from time to time, enough to ruin the transcription. E.g.:

... you used the address you put it in. Yes.        Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes.        Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. One day ... he arrived at the station and      he was in the village. And he was in the village. And he was in the village. And he was in the             village. And he was in the village. And he was in the village. And he was in the village. And he was       in the village. And he was in the village. And he was in the village. And he was in the village. And       he was in the village. And he was in the village. And he was in the village. And he was in the             village. And he was in the village. And he was in the village. And he was in the village. And he was       in the village. And he was in the village. And he was in the village. And he was in the village. And       he was in the village. And he was in the village. And he was in the village. And he was in the             village. And he was in the village. And he was in the village. And he was in the village. or horses        or whatever.

Maybe this is a feature of the underlying model?


awni commented May 11, 2024

Interesting... I've seen that behaviour before in lower-quality models. Two questions:

  1. Are you using 16-bit or 32-bit precision?
  2. Did you try the PyTorch implementation on the same audio file? https://github.com/openai/whisper


jrp2014 commented May 11, 2024

I expect that the mp3 will be 16-bit.

The problem seems to be a feature of the underlying architecture, e.g. the padding of input into 30s chunks. The original paper offered some mitigations (e.g. using the results from the previous chunk as a prompt, fiddling with temperatures and beam search, etc.), but they were far from completely effective. WhisperX seems to do a better job, but needs components that are only x86/CUDA based.

It seems ironic that AI effectiveness should rely on hand tuning that is input-specific. 😇
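For reference, a minimal sketch of the kind of hand tuning meant here, with parameter names borrowed from the OpenAI reference transcribe() and assumed (not verified) to be accepted by mlx_whisper.transcribe:

import mlx_whisper

# Sketch of the mitigations mentioned above; the exact keywords are an
# assumption based on the OpenAI reference implementation.
result = mlx_whisper.transcribe(
    "stereo.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    # retry a window at higher temperatures when its decode looks degenerate
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    # flag highly compressible (i.e. repetitive) output as a failed decode
    compression_ratio_threshold=2.4,
    # feed the previous window's text as the prompt for the next window
    condition_on_previous_text=True,
    verbose=False,
)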


jrp2014 commented May 20, 2024

As most of the output seems very accurate, I can only suppose that the repetition is caused by some heuristic that says "if you cannot generate output, just repeat what you just produced". Reasons for not generating output could include silence, padding from the 30s chunking, background noise, or ...


awni commented May 20, 2024

Repetition is a common problem with encoder-decoder style models, though it usually becomes vanishingly rare in high-quality models. Indeed, it could be that edge-case inputs are more likely to trigger it.

> I expect that the mp3 will be 16-bit.

I meant the model parameters. The default model is fp16, which may be slightly worse. You could try fp32 (pass fp16=False to the transcribe function). Also, you could try a larger model (like Whisper large).


jrp2014 commented May 20, 2024

Thanks. I'm just trying a recording of a back-and-forth chat. Most of the transcription looks great; it's just these repetitions that are anomalous.

I've tried using this:

import mlx_whisper

speech_file = "/Users/jrp/.cache/whisper/alice.mp3"

result = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
    fp16=False,
)

f = open("result.txt", "w+")
for segment in result["segments"]:
    print(segment["text"], file=f)
f.close()

with the fp16 and fp32 versions on the Alice chapter, to get started. Most of the differences (see attachment) seem to be just in how the output is segmented, with the fp16 version (<) being preferable in most cases, but there are a couple of oddities. E.g.:
diff.txt

65,71c65
<  what is the reason for my being so different?
<  I wonder if I've changed in the night.
<  Let me think, was I the same when I got up this morning?
<  I almost think I can remember feeling a little different. But if I'm not the same, the next question is,
< 
< 
<  The next question is, who in the world am I?
---
>  who in the world am I?

:

>  and then I'll tell you my history,
>  and you'll understand why it is that I hate cats and dogs.
280,281c296
<  for the pool was getting quite crowded
<  with birds and animals that had fallen into it.
---
>  for the pool was getting quite crowded with birds and animals that had fallen into it.
289c304
<  End of chapter two.
---
>  Chapter 2


jrp2014 commented May 20, 2024

Blimey, the fp32 version is about half the speed of the fp16 one. It doesn't half exercise the fans on this 48GB machine. GPUs are at 100%...

Does transcription stream, or is it going to just increase memory demand?

Looking at the various Whisper offshoots (the original, lightning, kit, etc.), they all seem to suffer from the same problem, with various heuristics being added and subtracted.


jrp2014 commented May 24, 2024

... and the fp32 version also stutters / hallucinates for me.

This is a pity. Most of the output is remarkably good; it just seems that chopping up the input, padding it, and stitching it back together introduces errors.


awni commented May 25, 2024

There is an option that I don't think is currently implemented in the MLX example:

Edit: it is actually implemented and should be enabled by default.

    condition_on_previous_text: bool
        if True, the previous output of the model is provided as a prompt for the next window;
        disabling may make the text inconsistent across windows, but the model becomes less prone to
        getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
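For anyone wanting to experiment, a minimal sketch of turning it off, assuming transcribe() accepts condition_on_previous_text as a keyword as the docstring above suggests:

import mlx_whisper

# Sketch: disable previous-window prompting to see whether it breaks the
# repetition loop; per the docstring above, the trade-off is that text may
# become less consistent across windows.
result = mlx_whisper.transcribe(
    "stereo.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    condition_on_previous_text=False,
    verbose=False,
)
print(result["text"][:200])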
