Memory optimization for low-memory machines #275

Your mistral.rs is really excellent work, especially considering that few C++ and Rust frameworks currently support LongRoPE, which you do. However, I have found an issue. My machine's memory is not very large (16 GB RAM, 4 GB VRAM). When I use mistral.rs to run the microsoft/Phi-3-mini-4k-instruct model, it fails with an "out of memory" error, but there is no problem running the same model with Ollama under the same conditions. It seems like there is still room for optimization in the program?

Additional info: I run the model on the GPU, and 4 GB of VRAM is enough for Ollama. When I switch to the CPU, the process is still killed (I run inside WSL2 and allow it to use at most 12 GB of RAM).

Comments
Thank you for reporting this! Can you please provide a minimal reproducible example? My suspicion is that mistral.rs is running a non-GGUF model while Ollama is running a GGUF one. Additionally, if you are using mistral.rs's ISQ feature, it will load the model on the CPU before copying it to the device (the GPU, I assume).
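For reference, loading an unquantized model with in-situ quantization (ISQ) looks roughly like the sketch below. The binary name, the `--isq Q4K` value, and the `plain`/`-a phi3` flags are written from memory and may differ between versions, so treat them as assumptions and check `--help`.

```bash
# ISQ path (sketch): download the full-precision safetensors model,
# quantize it in CPU RAM at load time, then copy the quantized weights
# to the GPU. Peak memory is therefore much higher than the final
# footprint, which can OOM a 16 GB RAM / 4 GB VRAM machine.
./mistralrs-server --isq Q4K plain -m microsoft/Phi-3-mini-4k-instruct -a phi3
```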
The command I use in mistral.rs:
Ah, OK, that makes sense. I'm working on support for quantized Phi-3 (#276), which should be done soon and will allow you to run Phi-3.
@chenwanqq, I just merged support for GGUF Phi3:
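A command along these lines should then work. This is a sketch, not the exact merged invocation: the model id and filename below are Microsoft's published GGUF build of Phi-3-mini-4k-instruct, and the flags (including whether a separate tokenizer source via `-t` is needed) may differ by version.

```bash
# Sketch: run the pre-quantized GGUF build directly, so the
# full-precision weights never have to fit in memory.
# -t supplies the tokenizer source; some versions can derive it
# from the GGUF file and omit this flag.
./mistralrs-server gguf \
  -t microsoft/Phi-3-mini-4k-instruct \
  -m microsoft/Phi-3-mini-4k-instruct-gguf \
  -f Phi-3-mini-4k-instruct-q4.gguf
```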
It works! That's so great! Thank you!
Glad to help! Please feel free to reopen.