I noticed that when running on one GPU I can load models larger than 16 GB using shared GPU memory. I'm not sure how it determines how much system memory to dedicate to this, but the default appears to be 35.6 GB, so I can load Llama 2 Chat 70B Q4, a nearly 40.9 GB model, when I configure Jan to use one GPU. It's dog-slow (0.3 t/s), and that's fine... understandable when sharing system memory over the bus.
I figured that if I enabled the second GPU it would still do this, but instead I get an error in the log:
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 20038.81 MiB on device 0: cudaMalloc failed: out of memory
Might this be related to MMQ? I noticed that "force MMQ" is enabled with one GPU and disabled with two. I have no idea what this setting does.
Is this the expected behavior? I expected it to fill up both GPUs, then spill into system memory and share the processing load, even if it's ultimately limited by the shared memory.
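For what it's worth, a rough arithmetic check (my assumption: the loader tries to place roughly half of the ~40.9 GB model on each card) shows the failed allocation alone already exceeds one card's VRAM, so the per-device spill into shared memory apparently isn't kicking in on the two-GPU path:

```python
# Rough sanity check (assuming ~half the ~40.9 GB model goes to each card):
# a ~20 GiB cudaMalloc can never succeed on a 16 GiB A4000 unless the
# driver spills the allocation into shared (system) memory.
vram_per_gpu_gib = 16              # physical VRAM on each RTX A4000
requested_mib = 20038.81           # allocation size from the log line
requested_gib = requested_mib / 1024
print(f"{requested_gib:.2f} GiB requested vs {vram_per_gpu_gib} GiB VRAM")
print(requested_gib > vram_per_gpu_gib)  # True: exceeds physical VRAM
```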
My rig: i9-14000k, 96 GB RAM, 2x RTX A4000 16 GB