
CudaMalloc failed: out of memory with TinyLlama-1.1B #372

Open

Lathanao opened this issue Apr 26, 2024 · 4 comments

@Lathanao

I am trying to get TinyLlama working on the GPU with:

./TinyLlama-1.1B-Chat-v1.0.F32.llamafile -ngl 9999

But it seems it cannot allocate 66.50 MB of memory on my card, even right after booting the machine without having used the GPU at all.

Here is the error:

[...]
link_cuda_dso: note: dynamically linking /home/yo/.llamafile/ggml-cuda.so
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
link_cuda_dso: GPU support loaded
[...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:        CPU buffer size =   250.00 MiB
llm_load_tensors:      CUDA0 buffer size =  3946.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    66.50 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 66.50 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 69730304
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'TinyLlama-1.1B-Chat-v1.0.F32.gguf'
{"function":"load_model","level":"ERR","line":443,"model":"TinyLlama-1.1B-Chat-v1.0.F32.gguf","msg":"unable to load model","tid":"8545344","timestamp":1714117560}

I have CUDA in this version:

Version         : 12.3.2-1
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : LicenseRef-NVIDIA-CUDA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcublas.so=12-64  libcusolver.so=11-64  libcusolver.so=11-64
                  libcusparse.so=12-64  libcusparse.so=12-64

Here are the specs of my machine.

System:
  Kernel: 6.6.26-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
  Desktop: GNOME v: 45.4 tk: GTK v: 3.24.41 Distro: Manjaro
    base: Arch Linux
Machine:
  Type: Laptop System: HP product: HP Pavilion Gaming Laptop 15-cx0xxx
Memory:
  System RAM: total: 32 GiB available: 31.24 GiB used: 4.16 GiB (13.3%)
CPU:
  Info: model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Coffee Lake
    gen: core 8 level: v3 note: 
Graphics:
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile]
    vendor: Hewlett-Packard driver: nvidia v: 550.67
    alternate: nouveau,nvidia_drm non-free: 545.xx+ status: current (as of
    2024-04; EOL~2026-12-xx) arch: Pascal code: GP10x process: TSMC 16nm
    built: 2016-2021 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 3
    speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1c8c class-ID: 0300

Is there a way to solve that?

@qkiel

qkiel commented Apr 26, 2024

Try a smaller version of TinyLlama, Q8 instead of F32: TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
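
For reference, a minimal sketch of how that might look, assuming the Q8_0 llamafile has already been downloaded and made executable:

chmod +x TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -ngl 9999

If even the Q8_0 build does not fit in the card's VRAM, lowering -ngl (for example -ngl 15) offloads only part of the layers to the GPU and keeps the rest on the CPU, which reduces the amount of GPU memory needed.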

@jart
Collaborator

jart commented Apr 26, 2024

Can you try llamafile-0.8.1, which was just released, and tell me if it works?
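
A rough sketch of how one might grab the new release and test it against an existing GGUF; the release asset name and URL pattern below are assumptions based on the project's usual naming, so adjust as needed:

curl -L -o llamafile-0.8.1 https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.1/llamafile-0.8.1
chmod +x llamafile-0.8.1
./llamafile-0.8.1 -m TinyLlama-1.1B-Chat-v1.0.F32.gguf -ngl 9999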

@Lathanao
Author

Lathanao commented Apr 27, 2024

Works perfectly, and it is far faster than before! Thank you.

[image]

@Lathanao
Author

Lathanao commented Apr 27, 2024

Mea culpa: above, I had gotten a model with a lower quantization format working.
And now I am not able to run the files again without errors.

So I downloaded several models:
- Meta-Llama-3-8B-Instruct.F16.llamafile -> doesn't load
- Meta-Llama-3-8B-Instruct.Q2_K.llamafile -> SIGSEGV
- Model/Meta-Llama-3-8B-Instruct.Q8_0.llamafile -> doesn't load
- Model/Phi-3-mini-4k-instruct.Q8_0.llamafile -> doesn't load
- Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile -> SIGSEGV
- Model/TinyLlama-1.1B-Chat-v1.0.F32.llamafile -> doesn't load
- Model/TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -> SIGSEGV

I rebooted my machine and tested again. The model that was working for me this morning (Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile) now hits SIGSEGV every time.
There is no way to make it work again.

The SIGSEGV issue has been reported in #378.

Lathanao reopened this Apr 27, 2024