
Running out of memory with TheBloke/CodeLlama-7B-AWQ #5

Open
bonuschild opened this issue Oct 26, 2023 · 2 comments

@bonuschild

Looking for help from 2 communities 😄 thanks!

@bonuschild (Author)

I've re-tested this on an A100 instead of the RTX 3060, and it ends up occupying about 20+ GB of VRAM! Why is that?
I used this command:

python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code

That is so weird...
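
If this api_server.py is a vLLM wrapper (the flags suggest it is), the 20+ GB is probably not the weights at all: vLLM profiles the model and then preallocates KV-cache blocks up to the --gpu-memory-utilization fraction of total VRAM (0.9 by default), so the footprint tracks the size of the card rather than the 7B AWQ checkpoint. Below is a minimal sketch of the same knob through vLLM's offline Python API; it assumes that backend, and the model path and the 0.3 fraction are placeholders:

from vllm import LLM

# Sketch only: cap vLLM's GPU preallocation instead of using the ~0.9 default.
llm = LLM(
    model="path/to/7b-awq/model",   # placeholder path
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.3,     # reserve ~30% of total VRAM for weights + KV cache
    max_model_len=4096,             # a smaller context also shrinks the KV cache
)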


jkrauss82 commented Nov 27, 2023

I had success running TheBloke's Mistral-7B-v0.1-AWQ and CodeLlama-7B-AWQ on an A6000 with 48 GB of VRAM, restricting the server to ~8 GB of VRAM with the following parameters:

python api_server.py --model path/to/model --port 8000 --quantization awq --dtype float16 --gpu-memory-utilization 0.167 --max-model-len 4096 --max-num-batched-tokens 4096

nvidia-smi then shows around 8 GB of memory consumed by the Python process. It should run on the 3060 as well, I hope (you need to omit --gpu-memory-utilization there, obviously).
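
A quick sanity check on that fraction (back-of-the-envelope arithmetic only, assuming --gpu-memory-utilization simply caps vLLM's allocation at that share of total VRAM):

# rough budget: fraction * total VRAM should roughly match what nvidia-smi reports
total_vram_gb = 48    # A6000
fraction = 0.167      # --gpu-memory-utilization
print(f"~{total_vram_gb * fraction:.1f} GB")   # ~8.0 GB

On a 12 GB RTX 3060 the default fraction (~0.9) already caps the process near ~10.8 GB, which is why omitting the flag should be enough there.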
