I'm running out of memory running commandR (llm_load_print_meta: model type = 35B). Why is device 0 running out of memory? It's using about 22 GB of RAM and the device has 24. If I load less on it, then it fails on device 1, etc. How much does the KV cache actually use? How do I calculate the usage given the model size and context size? I want to know if I'm doing something wrong or if it's broken. Thanks.
Replies: 1 comment
I'm running out of memory too. After lots of experiments, I observed that besides the model weights using up RAM, we need space for the KV cache and especially plenty for the compute buffers. commandR+ also seems to be on the high side with its demands. It looks like I would need 200+ GB of VRAM to get the full 128k context.
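On the calculation question: a reasonable first-order estimate for the f16 KV cache is 2 (one K and one V tensor per layer) × n_layer × n_ctx × n_head_kv × head_dim × 2 bytes. Here is a minimal sketch; the commandR hyperparameters below are my assumptions, so check them against the n_layer / n_head_kv / n_embd_head lines in your own llm_load_print_meta output:

```python
def kv_cache_bytes(n_layer: int, n_ctx: int, n_head_kv: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # 2x for the separate K and V tensors per layer;
    # bytes_per_elem = 2 for the default f16 cache.
    return 2 * n_layer * n_ctx * n_head_kv * head_dim * bytes_per_elem

# Hypothetical ballpark for commandR 35B (no GQA, so n_head_kv == n_head).
# Verify every number against your own llm_load_print_meta log.
gib = kv_cache_bytes(n_layer=40, n_ctx=8192, n_head_kv=64, head_dim=128) / 2**30
print(f"{gib:.1f} GiB")  # 10.0 GiB at 8k context in f16
```

If those hyperparameters are right, scaling linearly to 128k context gives ~160 GiB for the KV cache alone, which is consistent with needing 200+ GB once you add weights and compute buffers. Note also that the compute buffers grow with context and are allocated per device, which, as far as I can tell, is why device 0 tips over first even when the weight split looks balanced.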