
Whisper stutters #774

Open

jrp2014 opened this issue May 10, 2024 · 8 comments


jrp2014 commented May 10, 2024

Using

import mlx_whisper

speech_file = "stereo.mp3"

text = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
)["text"]

f = open("result.txt", "w+")
f.write(text)
f.close()

I find that the output contains repeated phrases from time to time, enough to ruin the transcription. E.g.:

... you used the address you put it in. Yes.        Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes.        Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. Yes. One day ... he arrived at the station and      he was in the village. And he was in the village. And he was in the village. And he was in the             village. And he was in the village. And he was in the village. And he was in the village. And he was       in the village. And he was in the village. And he was in the village. And he was in the village. And       he was in the village. And he was in the village. And he was in the village. And he was in the             village. And he was in the village. And he was in the village. And he was in the village. And he was       in the village. And he was in the village. And he was in the village. And he was in the village. And       he was in the village. And he was in the village. And he was in the village. And he was in the             village. And he was in the village. And he was in the village. And he was in the village. or horses        or whatever.

Maybe this is a feature of the underlying model?


awni commented May 11, 2024

Interesting... I've seen that behaviour before in lower-quality models. Two questions:

  1. Are you using 16-bit or 32-bit precision?
  2. Did you try the PyTorch implementation on the same audio file? https://github.com/openai/whisper


jrp2014 commented May 11, 2024

I expect that the mp3 will be 16-bit.

The problem seems to be a feature of the underlying architecture, e.g. the padding of input into 30s chunks. The original paper offered some mitigations (e.g. using the results from the previous chunk as a prompt, fiddling with temperatures and beam search, etc.), but they were far from completely effective. WhisperX seems to do a better job, but needs components that are only x86/CUDA based.

It seems ironic that AI effectiveness should rely on hand tuning that is input-specific. 😇
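For reference, a minimal sketch of the kind of hand tuning meant here, with parameter names borrowed from the OpenAI reference transcribe() and assumed (not verified) to be accepted by mlx_whisper.transcribe:

import mlx_whisper

# Sketch of the mitigations mentioned above; the exact keywords are an
# assumption based on the OpenAI reference implementation.
result = mlx_whisper.transcribe(
    "stereo.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    # retry a window at higher temperatures when its decode looks degenerate
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    # flag highly compressible (i.e. repetitive) output as a failed decode
    compression_ratio_threshold=2.4,
    # feed the previous window's text as the prompt for the next window
    condition_on_previous_text=True,
    verbose=False,
)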


jrp2014 commented May 20, 2024

As most of the output seems very accurate, I can only suppose that the repetition is caused by some heuristic that says "if you cannot generate output, just repeat what you just produced". Reasons for not generating output could include silence, padding from the 30s chunking, background noise, or ...


awni commented May 20, 2024

Repetition is a common problem with encoder-decoder style models, though it usually becomes vanishingly rare in high-quality models. Indeed, it could be that edge-case inputs are more likely to trigger it.

> I expect that the mp3 will be 16-bit.

I meant the model parameters. The default model is fp16, which may be slightly worse. You could try fp32 (pass fp16=False to the transcribe function). Also, you could try a larger model (like Whisper large).


jrp2014 commented May 20, 2024

Thanks. I'm just trying a recording of a back-and-forth chat. Most of the transcription looks great; it's just these repetitions that are anomalous.

I've tried using this:

import mlx_whisper

speech_file = "/Users/jrp/.cache/whisper/alice.mp3"

result = mlx_whisper.transcribe(
    speech_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    verbose=False,
    fp16=False,
)

f = open("result.txt", "w+")
for segment in result["segments"]:
    print(segment["text"], file=f)
f.close()

with the fp16 and fp32 versions on the Alice chapter, to get started. Most of the differences (see attachment) seem to be just in how the output is segmented, with the fp16 version (<) being preferable in most cases, but there are a couple of oddities. E.g.:
diff.txt

65,71c65
<  what is the reason for my being so different?
<  I wonder if I've changed in the night.
<  Let me think, was I the same when I got up this morning?
<  I almost think I can remember feeling a little different. But if I'm not the same, the next question is,
< 
< 
<  The next question is, who in the world am I?
---
>  who in the world am I?

:

>  and then I'll tell you my history,
>  and you'll understand why it is that I hate cats and dogs.
280,281c296
<  for the pool was getting quite crowded
<  with birds and animals that had fallen into it.
---
>  for the pool was getting quite crowded with birds and animals that had fallen into it.
289c304
<  End of chapter two.
---
>  Chapter 2


jrp2014 commented May 20, 2024

Blimey, the fp32 version is about half the speed of the fp16 one. It doesn't half exercise the fans on this 48GB machine. GPUs are at 100%...

Does transcription stream, or is it going to just increase memory demand?

Looking at the various Whisper offshoots (the original, lightning, kit, etc.), they all seem to suffer from the same problem, with various heuristics being added and subtracted.


jrp2014 commented May 24, 2024

... and the fp32 version also stutters / hallucinates for me.

This is a pity. Most of the output is remarkably good; it just seems that chopping up the input, padding it, and stitching it back together introduces errors.


awni commented May 25, 2024

There is an option that I don't think is currently implemented in the MLX example:

Edit: it is actually implemented and should be enabled by default.

    condition_on_previous_text: bool
        if True, the previous output of the model is provided as a prompt for the next window;
        disabling may make the text inconsistent across windows, but the model becomes less prone to
        getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
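For anyone wanting to experiment, a minimal sketch of turning it off, assuming transcribe() accepts condition_on_previous_text as a keyword as the docstring above suggests:

import mlx_whisper

# Sketch: disable previous-window prompting to see whether it breaks the
# repetition loop; per the docstring above, the trade-off is that text may
# become less consistent across windows.
result = mlx_whisper.transcribe(
    "stereo.mp3",
    path_or_hf_repo="mlx-community/whisper-large-v3-mlx",
    condition_on_previous_text=False,
    verbose=False,
)
print(result["text"][:200])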
