I noticed that when running on one GPU I can load models larger than 16 GB using shared GPU memory. I'm not sure how it determines how much system memory to dedicate to this, but the default appears to be 35.6 GB, so I can load Llama 2 Chat 70B Q4, a nearly 40.9 GB model, when I configure Jan to use one GPU. It's dog-slow (0.3 t/s), and that's fine... understandable when sharing system memory over the bus.
I figured that if I enabled the second GPU it would still do this, but instead I get an error in the log:
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 20038.81 MiB on device 0: cudaMalloc failed: out of memory
Might this be related to MMQ? I noticed that "force MMQ" is enabled with one GPU and disabled with two. I have no idea what this setting does.
Is this the expected behavior? I expected it to fill up both GPUs, then spill into system memory and share the processing load, even if it's ultimately limited by the shared memory.
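For what it's worth, a rough arithmetic check (my assumption: the loader tries to place roughly half of the ~40.9 GB model on each card) shows the failed allocation alone already exceeds one card's VRAM, so the per-device spill into shared memory apparently isn't kicking in on the two-GPU path:

```python
# Rough sanity check (assuming ~half the ~40.9 GB model goes to each card):
# a ~20 GiB cudaMalloc can never succeed on a 16 GiB A4000 unless the
# driver spills the allocation into shared (system) memory.
vram_per_gpu_gib = 16              # physical VRAM on each RTX A4000
requested_mib = 20038.81           # allocation size from the log line
requested_gib = requested_mib / 1024
print(f"{requested_gib:.2f} GiB requested vs {vram_per_gpu_gib} GiB VRAM")
print(requested_gib > vram_per_gpu_gib)  # True: exceeds physical VRAM
```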
My rig: i9-14000k, 96 GB RAM, 2x RTX A4000 16 GB