
Running out of memory with TheBloke/CodeLlama-7B-AWQ #5

Open
bonuschild opened this issue Oct 26, 2023 · 2 comments

@bonuschild

Looking for help from 2 communities 😄 thanks!

@bonuschild (Author)

I've re-tested this on an A100 instead of the RTX 3060, and it ends up occupying about 20+ GB of VRAM! Why is that?
I used this command:

python api_server.py --model path/to/7b-awq/model --port 8000 -q awq --dtype half --trust-remote-code

That is so weird...
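
If this api_server.py is a vLLM wrapper (the flags suggest it is), the 20+ GB is probably not the weights at all: vLLM profiles the model and then preallocates KV-cache blocks up to the --gpu-memory-utilization fraction of total VRAM (0.9 by default), so the footprint tracks the size of the card rather than the 7B AWQ checkpoint. Below is a minimal sketch of the same knob through vLLM's offline Python API; it assumes that backend, and the model path and the 0.3 fraction are placeholders:

from vllm import LLM

# Sketch only: cap vLLM's GPU preallocation instead of using the ~0.9 default.
llm = LLM(
    model="path/to/7b-awq/model",   # placeholder path
    quantization="awq",
    dtype="half",
    gpu_memory_utilization=0.3,     # reserve ~30% of total VRAM for weights + KV cache
    max_model_len=4096,             # a smaller context also shrinks the KV cache
)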


jkrauss82 commented Nov 27, 2023

I had success running TheBloke's Mistral-7B-v0.1-AWQ and CodeLlama-7B-AWQ on an A6000 with 48 GB of VRAM, restricting the server to ~8 GB of VRAM with the following parameters:

python api_server.py --model path/to/model --port 8000 --quantization awq --dtype float16 --gpu-memory-utilization 0.167 --max-model-len 4096 --max-num-batched-tokens 4096

nvidia-smi then shows around 8 GB of memory consumed by the Python process. It should run on the 3060 as well, I hope (you need to omit --gpu-memory-utilization there, obviously).
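
A quick sanity check on that fraction (back-of-the-envelope arithmetic only, assuming --gpu-memory-utilization simply caps vLLM's allocation at that share of total VRAM):

# rough budget: fraction * total VRAM should roughly match what nvidia-smi reports
total_vram_gb = 48    # A6000
fraction = 0.167      # --gpu-memory-utilization
print(f"~{total_vram_gb * fraction:.1f} GB")   # ~8.0 GB

On a 12 GB RTX 3060 the default fraction (~0.9) already caps the process near ~10.8 GB, which is why omitting the flag should be enough there.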
