Using token sequences as stop criteria does not work in mlx_lm #524
Additionally, the condition for checking the stop criteria is `if len(stop_sequence_buffer) > max_stop_id_sequence_len:`. Since the comparison is exclusive, an extra token is always appended before the check fires, so no stop sequence can ever be matched.
Just curious, how did you find the issue? I ran a few tests before and didn't see any extra tokens added.
I'll try to sum it up. `max_stop_id_sequence_len` is the length of the longest stop id sequence. Now assume the buffer has grown to that same length. Since the check `if len(stop_sequence_buffer) > max_stop_id_sequence_len:` is exclusive, the loop runs one extra iteration before the body of the `if` executes, so the buffer is now one token longer than the longest stop sequence. Combine that with the fact that the `stop_criteria` function only checks for a perfect match at the tail of `tokens`, and a match can no longer occur, because the extra token was generated before `stop_criteria` was called.
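To make the off-by-one concrete, here is a hypothetical reconstruction of the buffering loop as described in this thread; the names mirror the discussion, not the exact `mlx_lm.server` source:

```python
def tail_matches(tokens, stop_id_sequences):
    """Return True only when the tail of `tokens` equals a stop sequence."""
    return any(
        len(tokens) >= len(seq) and tokens[-len(seq):] == seq
        for seq in stop_id_sequences
    )

def generate_with_buffer(token_stream, stop_id_sequences):
    """Sketch of the flawed loop: the exclusive `>` lets the buffer grow
    one token past the longest stop sequence before the check runs."""
    max_stop_id_sequence_len = max(len(s) for s in stop_id_sequences)
    tokens, stop_sequence_buffer = [], []
    for tok in token_stream:
        tokens.append(tok)
        stop_sequence_buffer.append(tok)
        # Exclusive comparison: fires only once the buffer already holds
        # max_stop_id_sequence_len + 1 tokens, so a stop sequence that
        # ended one token earlier can no longer sit at the tail of `tokens`.
        if len(stop_sequence_buffer) > max_stop_id_sequence_len:
            if tail_matches(tokens, stop_id_sequences):
                return tokens, True
            stop_sequence_buffer = []
    return tokens, False

# The stream ends with the stop sequence [7, 8] followed by one more token,
# but the check only fires after that extra token (9) is appended, so the
# tail is [8, 9] and no match is ever found.
tokens, stopped = generate_with_buffer([1, 2, 7, 8, 9], [[7, 8]])
```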
I see.
Yeah, I think it was intended that the check run on every generated token, so we can throw away the buffer entirely. That would address most issues. It still needs a special case for streaming, since we need to anticipate a stop word, or we might stream parts of it. I have a prototype for that check currently, but I'm still doing some testing.
Yeah, you are right. The original implementation didn't have a buffer, so it ended up sending the stop word back to the client during streaming. The buffer was introduced to solve that issue, but the implementation wasn't well thought out.
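One way to get streaming right without a fixed-size buffer is to check on every token and only hold back a tail that is a proper prefix of some stop sequence. This is a hypothetical sketch of that anticipation logic, not the prototype mentioned above:

```python
def overlap_with_stop(tokens, stop_id_sequences):
    """Length of the longest tail of `tokens` that is a proper prefix of
    some stop sequence, i.e. tokens that must be withheld from streaming
    because they might still turn into a stop sequence."""
    best = 0
    for seq in stop_id_sequences:
        for n in range(1, len(seq)):
            if len(tokens) >= n and tokens[-n:] == seq[:n]:
                best = max(best, n)
    return best

def stream_tokens(token_stream, stop_id_sequences):
    """Check after every token; yield only tokens that can no longer be
    part of a stop sequence. Stops silently on a full match."""
    tokens, sent = [], 0
    for tok in token_stream:
        tokens.append(tok)
        # Full match at the tail: stop without streaming the stop word.
        if any(len(tokens) >= len(s) and tokens[-len(s):] == s
               for s in stop_id_sequences):
            return
        held = overlap_with_stop(tokens, stop_id_sequences)
        while sent < len(tokens) - held:
            yield tokens[sent]
            sent += 1
    # Stream ended without a stop match: flush anything held back.
    while sent < len(tokens):
        yield tokens[sent]
        sent += 1
```

With stop sequence `[7, 8]`, the stream `[1, 2, 7, 8]` yields only `[1, 2]` (the stop word is withheld), while `[1, 7, 9]` yields all three tokens once the held-back `7` is disproven as a stop prefix.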
Probably; I have roughly the same problem. The `generate` function outputs a single token at a time. Here is some pseudocode for the problem:
Is there an easier way (or an example) to retrieve the produced token ids, such as 8586 or 374, per generated token, and use those as stop criteria? Ideally multiple tokens as a stop criterion would be best, to stop synthetic JSON generation just as it finishes.
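I'm not aware of a single-call helper for this, but the usual pattern is to encode each stop string into token ids once, then compare the tail of the running id list after every generated token. A minimal, tokenizer-agnostic sketch, where `encode` is a toy stand-in (one code point per character), not a real tokenizer:

```python
def make_stop_checker(stop_strings, encode):
    """Pre-encode stop strings; return a per-token tail check."""
    stop_id_sequences = [encode(s) for s in stop_strings]
    def should_stop(token_ids):
        return any(
            len(token_ids) >= len(seq) and token_ids[-len(seq):] == seq
            for seq in stop_id_sequences
        )
    return should_stop

# Toy "tokenizer": each character maps to its code point.
encode = lambda s: [ord(c) for c in s]

# Stop as soon as the JSON object is closed, using the multi-token
# stop sequence "}\n".
should_stop = make_stop_checker(["}\n"], encode)
ids = []
for tok in encode('{"a": 1}\n and more'):
    ids.append(tok)
    if should_stop(ids):
        break
# `ids` now ends right after the closing `}` and newline of the JSON object.
```

With a real tokenizer, one caveat: a stop string may tokenize differently depending on the surrounding text, so matching on decoded text can be more robust than matching raw id sequences.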
The implementation of `stop_criteria` in `mlx_lm.server` is inherently flawed. Stop sequences are only matched when the newest generated tokens perfectly match a stop sequence at the end of the token list. Generation does not stop if a stop sequence appears anywhere else inside those tokens.
This only checks if the newest tokens perfectly match a sequence
stopping_criteria only gets called once max_stop_id_sequence_len tokens have been generated, where max_stop_id_sequence_len is the length of the longest stop sequence.
Example where it breaks:
I have two stop sequences, one of length 4 and one of length 6. Once 6 tokens have been generated, stop_criteria is called. However, the tokens I have generated only match the stop sequence of length 4, and the match happens at the start of the new tokens, not at the end. Since stop_criteria only checks the end of the full token list, the match is missed and generation does not stop.
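The failure can be reproduced in a few lines. Below, `suffix_only_match` mimics the tail-only check described above, while `window_match` is a hypothetical fix that scans every position in the newly generated window:

```python
def suffix_only_match(tokens, stop_id_sequences):
    """The flawed check: only a perfect match at the very end counts."""
    return any(
        len(tokens) >= len(seq) and tokens[-len(seq):] == seq
        for seq in stop_id_sequences
    )

def window_match(new_tokens, stop_id_sequences):
    """Scan every position in the window; return the index just past
    the first match, or None if no stop sequence occurs."""
    for seq in stop_id_sequences:
        for i in range(len(new_tokens) - len(seq) + 1):
            if new_tokens[i:i + len(seq)] == seq:
                return i + len(seq)
    return None

# Stop sequences of length 4 and 6. After 6 tokens the length-4 sequence
# sits at the START of the window, so the suffix-only check misses it.
stops = [[1, 2, 3, 4], [9, 9, 9, 9, 9, 9]]
window = [1, 2, 3, 4, 5, 6]
suffix_only_match(window, stops)   # False: nothing matches at the tail
window_match(window, stops)        # 4: match found at positions 0..3
```

Checking on every generated token (as discussed above) would make the window scan unnecessary, since a match can then only ever end at the newest token.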