Model doesn't know when to stop generating. #745
Comments
It has the chat template, you can directly use tokenizer.apply_chat_template instead of doing the role mapping yourself.
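For context, a minimal sketch of that approach (assuming an mlx_lm-style `load`/`generate` API and a placeholder repo name):

```python
from mlx_lm import load, generate

# Load the converted model and its tokenizer (repo name is a placeholder).
model, tokenizer = load("your-hf-username/phi3-128k-instruct-mlx")

# Let the tokenizer's bundled chat template insert the role tags and
# special tokens instead of writing them by hand.
messages = [{"role": "user", "content": "Explain unified memory in one paragraph."}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```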
Thanks a lot! I wouldn't have known about the tokenizer's apply_chat_template otherwise. Where or how can I learn more about these kinds of features? Another question I have: how can I stream the model's response instead of waiting for the entire response to complete?
If you want to print the streaming output to the console, you can pass verbose=True to generate.
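A short sketch of that (assuming mlx_lm's `generate` accepts a `verbose` flag; the repo name is a placeholder):

```python
from mlx_lm import load, generate

model, tokenizer = load("your-hf-username/phi3-128k-instruct-mlx")  # placeholder repo

# Format the prompt with the chat template as above.
messages = [{"role": "user", "content": "Tell me a short story."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# With verbose=True the tokens are printed to the console as they are
# generated, instead of only being returned at the end.
generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```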
I have tried using
You can take a look at the mlx_lm server's implementation here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/server.py. It's only a few hundred lines of code and quite self-contained. For more information, you can also refer to the SERVER.md file here: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/SERVER.md.
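For a quick end-to-end test, the server exposes an OpenAI-style chat completions endpoint; a sketch assuming the default host/port and the documented `/v1/chat/completions` route (the model repo name is a placeholder):

```python
# First start the server in a separate terminal, e.g.:
#   python -m mlx_lm.server --model your-hf-username/phi3-128k-instruct-mlx
import json
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128,
        "stream": True,
    },
    stream=True,
)

# Streamed responses arrive as server-sent events ("data: {...}" chunks).
for line in resp.iter_lines():
    if line and line.startswith(b"data: ") and line != b"data: [DONE]":
        chunk = json.loads(line[len(b"data: "):])
        print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```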
I am relatively new to running inference on my own. Previously, I used ollama, but recently I decided to try out mlx since I have an M3 with sufficient unified memory and I was curious about how it compares to llama.cpp in terms of speed.
I have been trying to run phi3-128k-instruct. I converted the model to an mlx-compatible format myself and uploaded it to my Hugging Face repository.
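For reference, a conversion along those lines can be done with mlx_lm's convert utility; a sketch assuming its Python entry point (the upload repo name is a placeholder, and the equivalent CLI is `python -m mlx_lm.convert`):

```python
from mlx_lm import convert

# Convert the Hugging Face weights to an mlx-compatible format, quantize
# them, and push the result to a (placeholder) Hugging Face repo.
convert(
    "microsoft/Phi-3-mini-128k-instruct",
    quantize=True,
    upload_repo="your-hf-username/phi3-128k-instruct-mlx",  # placeholder
)
```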
Unlike Meta's Llama 3 models, whose chat prompt formats and special tokens are well documented (e.g., https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3), Microsoft doesn't provide as thorough an explanation of how to format chat prompts and use special tokens with its models.
Here is the code snippet I am using for inference:
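(The snippet below is a minimal sketch of that setup, assuming mlx_lm's `load`/`generate` API, Phi-3's `<|user|>` / `<|end|>` / `<|assistant|>` tags for the hand-rolled role mapping, and a placeholder repo name.)

```python
from mlx_lm import load, generate

model, tokenizer = load("your-hf-username/phi3-128k-instruct-mlx")  # placeholder repo

# Hand-rolled Phi-3 chat formatting (the role mapping that
# tokenizer.apply_chat_template would otherwise handle).
user_message = "Summarize the benefits of unified memory on Apple silicon."
prompt = f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```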
This issue may not be directly related to mlx, but I need help with properly formatting prompts and using special tokens. I have tried running phi3 on HuggingChat, and there is a notable difference in the outputs: the responses from HuggingChat are significantly better than when I run the model locally with mlx. I would appreciate any guidance on what I might be doing wrong.
Here is the response I am getting: