
Memory Optimization for low memory machines #275

Closed
chenwanqq opened this issue May 9, 2024 · 6 comments

Comments

@chenwanqq

chenwanqq commented May 9, 2024

Your mistral.rs is really excellent work, especially considering that few C++ and Rust frameworks currently support LongRoPE, which you do. However, I have found an issue. My machine's memory is not very large (16 GB RAM, 4 GB VRAM). When I use mistral.rs to run the microsoft/Phi-3-mini-4k-instruct model, it reports an "out of memory" error, but there is no problem running the same model with Ollama under the same conditions. It seems there is still room for optimization in the program?
Additional info:
I run the model on the GPU, and 4 GB is enough for Ollama. When I switch to the CPU, the process is still killed (I run inside WSL2 and allow it to use 12 GB RAM at most).
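(For reference, a WSL2 memory cap like the one described above is usually set on the Windows side in %UserProfile%\.wslconfig; a minimal sketch of such a limit, assuming that is how it was configured here:

[wsl2]
memory=12GB
)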

@EricLBuehler
Owner

Thank you for reporting this! Can you please provide a minimum-reproducible example? My suspicion is that mistral.rs is running a non-GGUF model, while Ollama is running GGUF. Additionally, if you are using mistral.rs's ISQ feature, then it will load the model on CPU before copying to the device (GPU I would assume).
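(For context, a minimal sketch of the kind of ISQ invocation referred to above; the --isq flag and the Q4K level are assumptions about this version's CLI, and exact flag placement may differ:

cargo run --release --features cuda -- -i --isq Q4K plain -m microsoft/Phi-3-mini-4k-instruct -a phi3

With ISQ, the full-precision weights are first loaded on the CPU and then quantized before being copied to the GPU, which is why peak host RAM can still be large.)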

@chenwanqq
Author

> Thank you for reporting this! Can you please provide a minimum-reproducible example? My suspicion is that mistral.rs is running a non-GGUF model, while Ollama is running GGUF. Additionally, if you are using mistral.rs's ISQ feature, then it will load the model on CPU before copying to the device (GPU I would assume).

the command I use in mistral.rs:
cargo run --release --features cuda -- -i plain -m microsoft/Phi-3-mini-4k-instruct -a phi3
and it returns:
36%|██████████████████████████████████▉ | 47/128 [00:05<00:07, 10.22it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:06<00:00, 16.49it/s]
Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
and in ollama:
ollama run phi3
It works fine.

@EricLBuehler
Owner

> cargo run --release --features cuda -- -i plain -m microsoft/Phi-3-mini-4k-instruct -a phi3

Ah, ok, that makes sense. I'm working on support for quantized phi3 (#276) which should be done soon and allow you to run phi3.

@EricLBuehler
Owner

@chenwanqq, I just merged support for GGUF Phi3:

cargo run --release --features ... -- -i gguf -m microsoft/Phi-3-mini-4k-instruct-gguf -f Phi-3-mini-4k-instruct-q4.gguf -t microsoft/Phi-3-mini-4k-instruct
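(As a rough sketch of why the Q4 GGUF fits where the unquantized model did not, assuming Phi-3-mini's roughly 3.8B parameters:

FP16 weights: 3.8e9 params x 2 bytes ≈ 7.6 GB, which exceeds 4 GB VRAM
Q4 weights:   3.8e9 params x ~0.5 bytes ≈ 1.9 GB, which fits with room left for activations and the KV cache
)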

@chenwanqq
Author

It works! That's so great! Thank you!

@EricLBuehler
Owner

Glad to help! Please feel free to reopen.
