
Memory Optimization for low memory machines #275

Closed
chenwanqq opened this issue May 9, 2024 · 6 comments

Comments

@chenwanqq

chenwanqq commented May 9, 2024

Your mistral.rs is really excellent work, especially considering that few C++ and Rust frameworks currently support LongRoPE, which you do. However, I have found an issue. My machine's memory is not very large (16 GB RAM, 4 GB VRAM). When I use mistral.rs to run the microsoft/Phi-3-mini-4k-instruct model, it reports an "out of memory" error, but there is no problem running the same model with Ollama under the same conditions. It seems there is still room for optimization in the program?
Additional info:
I run the model on the GPU, and 4 GB is enough for Ollama. When I switch to the CPU, the process is still killed (I run inside WSL2 and allow it to use 12 GB RAM at most).
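(For reference, a WSL2 memory cap like the one described above is usually set on the Windows side in %UserProfile%\.wslconfig; a minimal sketch of such a limit, assuming that is how it was configured here:

[wsl2]
memory=12GB
)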

@EricLBuehler
Owner

Thank you for reporting this! Can you please provide a minimum-reproducible example? My suspicion is that mistral.rs is running a non-GGUF model, while Ollama is running GGUF. Additionally, if you are using mistral.rs's ISQ feature, then it will load the model on CPU before copying to the device (GPU I would assume).
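(For context, a minimal sketch of the kind of ISQ invocation referred to above; the --isq flag and the Q4K level are assumptions about this version's CLI, and exact flag placement may differ:

cargo run --release --features cuda -- -i --isq Q4K plain -m microsoft/Phi-3-mini-4k-instruct -a phi3

With ISQ, the full-precision weights are first loaded on the CPU and then quantized before being copied to the GPU, which is why peak host RAM can still be large.)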

@chenwanqq
Author

> Thank you for reporting this! Can you please provide a minimum-reproducible example? My suspicion is that mistral.rs is running a non-GGUF model, while Ollama is running GGUF. Additionally, if you are using mistral.rs's ISQ feature, then it will load the model on CPU before copying to the device (GPU I would assume).

the command I use in mistral.rs:
cargo run --release --features cuda -- -i plain -m microsoft/Phi-3-mini-4k-instruct -a phi3
and it returns:
36%|██████████████████████████████████▉ | 47/128 [00:05<00:07, 10.22it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:06<00:00, 16.49it/s]
Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
and in ollama:
ollama run phi3
It works fine.

@EricLBuehler
Owner

> cargo run --release --features cuda -- -i plain -m microsoft/Phi-3-mini-4k-instruct -a phi3

Ah, ok, that makes sense. I'm working on support for quantized phi3 (#276) which should be done soon and allow you to run phi3.

@EricLBuehler
Owner

@chenwanqq, I just merged support for GGUF Phi3:

cargo run --release --features ... -- -i gguf -m microsoft/Phi-3-mini-4k-instruct-gguf -f Phi-3-mini-4k-instruct-q4.gguf -t microsoft/Phi-3-mini-4k-instruct
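(As a rough sketch of why the Q4 GGUF fits where the unquantized model did not, assuming Phi-3-mini's roughly 3.8B parameters:

FP16 weights: 3.8e9 params x 2 bytes ≈ 7.6 GB, which exceeds 4 GB VRAM
Q4 weights:   3.8e9 params x ~0.5 bytes ≈ 1.9 GB, which fits with room left for activations and the KV cache
)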

@chenwanqq
Author

It works! That's so great! Thank you!

@EricLBuehler
Owner

Glad to help! Please feel free to reopen.
