
CudaMalloc failed: out of memory with TinyLlama-1.1B #372

Open

Lathanao opened this issue Apr 26, 2024 · 4 comments

@Lathanao

I am trying to get TinyLlama working on the GPU with:

./TinyLlama-1.1B-Chat-v1.0.F32.llamafile -ngl 9999

But it seems it cannot allocate 66.50 MB of memory on my card, even right after booting the machine without having used the GPU at all.

Here is the error:

[...]
link_cuda_dso: note: dynamically linking /home/yo/.llamafile/ggml-cuda.so
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
link_cuda_dso: GPU support loaded
[...]
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1050 Ti, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:        CPU buffer size =   250.00 MiB
llm_load_tensors:      CUDA0 buffer size =  3946.35 MiB
..........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =    66.50 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 66.50 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 69730304
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model 'TinyLlama-1.1B-Chat-v1.0.F32.gguf'
{"function":"load_model","level":"ERR","line":443,"model":"TinyLlama-1.1B-Chat-v1.0.F32.gguf","msg":"unable to load model","tid":"8545344","timestamp":1714117560}

I have CUDA in this version:

Version         : 12.3.2-1
Description     : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL             : https://developer.nvidia.com/cuda-zone
Licenses        : LicenseRef-NVIDIA-CUDA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcublas.so=12-64  libcusolver.so=11-64  libcusolver.so=11-64
                  libcusparse.so=12-64  libcusparse.so=12-64

Here are the specs of my machine.

System:
  Kernel: 6.6.26-1-MANJARO arch: x86_64 bits: 64 compiler: gcc v: 13.2.1
  Desktop: GNOME v: 45.4 tk: GTK v: 3.24.41 Distro: Manjaro
    base: Arch Linux
Machine:
  Type: Laptop System: HP product: HP Pavilion Gaming Laptop 15-cx0xxx
Memory:
  System RAM: total: 32 GiB available: 31.24 GiB used: 4.16 GiB (13.3%)
CPU:
  Info: model: Intel Core i7-8750H bits: 64 type: MT MCP arch: Coffee Lake
    gen: core 8 level: v3 note: 
Graphics:
  Device-2: NVIDIA GP107M [GeForce GTX 1050 Ti Mobile]
    vendor: Hewlett-Packard driver: nvidia v: 550.67
    alternate: nouveau,nvidia_drm non-free: 545.xx+ status: current (as of
    2024-04; EOL~2026-12-xx) arch: Pascal code: GP10x process: TSMC 16nm
    built: 2016-2021 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 link-max: gen: 3
    speed: 8 GT/s bus-ID: 01:00.0 chip-ID: 10de:1c8c class-ID: 0300

Is there a way to solve that?

@qkiel

qkiel commented Apr 26, 2024

Try a smaller version of TinyLlama, Q8 instead of F32: TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
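
For reference, a minimal sketch of how that might look, assuming the Q8_0 llamafile has already been downloaded and made executable:

chmod +x TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile
./TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -ngl 9999

If even the Q8_0 build does not fit in the card's VRAM, lowering -ngl (for example -ngl 15) offloads only part of the layers to the GPU and keeps the rest on the CPU, which reduces the amount of GPU memory needed.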

@jart
Collaborator

jart commented Apr 26, 2024

Can you try llamafile-0.8.1, which was just released, and tell me if it works?
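
A rough sketch of how one might grab the new release and test it against an existing GGUF; the release asset name and URL pattern below are assumptions based on the project's usual naming, so adjust as needed:

curl -L -o llamafile-0.8.1 https://github.com/Mozilla-Ocho/llamafile/releases/download/0.8.1/llamafile-0.8.1
chmod +x llamafile-0.8.1
./llamafile-0.8.1 -m TinyLlama-1.1B-Chat-v1.0.F32.gguf -ngl 9999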

@Lathanao
Author

Lathanao commented Apr 27, 2024

Works perfectly, and it is far faster than before! Thank you.

[image]

@Lathanao
Author

Lathanao commented Apr 27, 2024

Mea culpa: above, I had gotten a model with a lower quantization format working.
And now I am not able to run the files again without errors.

So I downloaded several models:
- Meta-Llama-3-8B-Instruct.F16.llamafile -> doesn't load
- Meta-Llama-3-8B-Instruct.Q2_K.llamafile -> SIGSEGV
- Model/Meta-Llama-3-8B-Instruct.Q8_0.llamafile -> doesn't load
- Model/Phi-3-mini-4k-instruct.Q8_0.llamafile -> doesn't load
- Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile -> SIGSEGV
- Model/TinyLlama-1.1B-Chat-v1.0.F32.llamafile -> doesn't load
- Model/TinyLlama-1.1B-Chat-v1.0.Q8_0.llamafile -> SIGSEGV

I rebooted my machine and tested again. The model that was working for me this morning (Model/TinyLlama-1.1B-Chat-v1.0.F16.llamafile) now hits SIGSEGV every time.
There is no way to make it work again.

The SIGSEGV issue has been reported in #378.

Lathanao reopened this Apr 27, 2024