
High RAM usage while loading Llama3 #817

Open
FrostyMisa opened this issue May 1, 2024 · 6 comments

Comments

FrostyMisa commented May 1, 2024

KoboldCpp 1.64, Hardware: Steam Deck

I was using KoboldCpp on the Steam Deck LCD with Vulkan before, and it worked fast and great. After Llama3 came out I downloaded the latest KoboldCpp, and even though I'm using a smaller Llama3 model, it seems to take about twice as much system RAM as, for example, a Mistral model.

I'll provide both logs so you can see if you find something. I can load a Mistral 7B Q6 model without problems, but I have trouble running Llama3 8B Q3_K_M. At its spike it takes around 15 GB.

So is it loading Llama3 models differently, or is Vulkan just not optimized for Llama3 yet? Or is it something else?

https://wormhole.app/KvKBP#HPyt1TRWVKTwYT_uvQVKkw
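(For anyone trying to reproduce this: a minimal sketch of how the loading process's peak resident memory can be sampled from a second terminal, assuming the third-party psutil package is available; the script name and helper below are hypothetical and not part of KoboldCpp.)

```python
#!/usr/bin/env python3
# peak_rss.py -- hypothetical helper: sample a process's resident set size
# (RSS) until it exits and report the peak. Not part of KoboldCpp.
# Usage: python3 peak_rss.py <pid-of-koboldcpp>
import sys
import time

import psutil  # third-party: pip install psutil


def watch_peak_rss(pid: int, interval: float = 0.1) -> int:
    """Poll RSS every `interval` seconds; return the peak in bytes."""
    proc = psutil.Process(pid)
    peak = 0
    try:
        while proc.is_running():
            peak = max(peak, proc.memory_info().rss)
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # the process exited between samples
    return peak


if __name__ == "__main__":
    peak_bytes = watch_peak_rss(int(sys.argv[1]))
    print(f"peak RSS: {peak_bytes / 2**30:.2f} GiB")
```

Running this against the koboldcpp PID during model load would show whether the spike is resident in the process itself, as opposed to OS page cache, which RSS does not count.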

LostRuins (Owner) commented

Actually, according to your logs above, the Mistral model is using more memory than the Llama 3 model. 15 GB seems a bit much for a Q3 8B.

FrostyMisa commented May 2, 2024

According to the log, yes, but you can see the system monitor in the photos here: Llama3 (the smaller model) spikes into the swap file. OpenBLAS doesn't have this problem, and there the RAM usage is in line with the model size. So it must be something with Vulkan. [two system monitor screenshots]

FrostyMisa commented May 3, 2024

I found someone writing about Vulkan on Reddit, so I tried version 1.61.2, and there it works as expected, without the spikes.

Someone in the thread mentioned this: "We know 1.61 is the last version Vulkan works correctly on; it's because of a regression in Vulkan upstream that Occam hasn't had time to submit his fixes for yet, since it's tied to MoE support. It will eventually be fixed; for now it's better to keep using 1.61 until you notice that we support MoE for Vulkan."

But it's definitely something between that version and 1.63/1.64; those are the two I tested and had problems loading the Llama3 model with.

Here is a photo of RAM usage in 1.61. As you can see, the RAM usage is normal, with none of the spikes shown in the photos I provided earlier for the latest version. [screenshot]

henk717 commented May 11, 2024

That Reddit comment was by me. 1.65 will have the incoherency issues I was referencing fixed, but we only discovered the Llama3 memory error recently, so that will remain a thing until Occam finds a solution for it separately.

FrostyMisa (Author) commented

I can confirm that even 1.65 has this RAM spike problem with a Llama3 model on Vulkan. I'll wait for a fix and report back once it works again.
Thanks, guys, for your hard work making KoboldCpp great!

MadLightTheDoggo commented May 24, 2024

Yeah, I'm having exactly the same problem with any version above 1.61.2.
In fact, on that version I can easily launch Mixtral 8x7B Q4_K_M with 8192 context within my 32 GB of memory, but on anything newer it fills all the memory and starts spilling out onto the disk. Because of that I can't tell exactly how much more memory it eats, but if the disk usage is any indicator, I'd say at least 10 GB more.
I thought that maybe it was fixed in recent builds, but nope.
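(A rough sanity check, using the published Mixtral 8x7B configuration and an approximate file size rather than anything from these logs: the Q4_K_M GGUF is about 26 GB, and an fp16 KV cache at 8192 context adds only around 1 GB because Mixtral uses grouped-query attention with 8 KV heads, so a healthy load should land near 27-28 GB and just fit in 32 GB.)

```python
# Back-of-the-envelope estimate of what a healthy Mixtral 8x7B Q4_K_M load
# at 8192 context should cost. Architecture numbers are the published
# Mixtral config; the file size is an approximation, not from these logs.
n_layers   = 32     # transformer layers
n_kv_heads = 8      # grouped-query attention KV heads
head_dim   = 128    # dimension per attention head
n_ctx      = 8192   # context length
fp16_bytes = 2      # bytes per fp16 K/V entry

# K and V caches together: 2 * layers * kv_heads * head_dim * ctx * bytes
kv_cache_gb = 2 * n_layers * n_kv_heads * head_dim * n_ctx * fp16_bytes / 1e9
print(f"KV cache: ~{kv_cache_gb:.1f} GB")  # ~1.1 GB

model_file_gb = 26.4  # approx. Q4_K_M GGUF size for Mixtral 8x7B
print(f"expected total: ~{model_file_gb + kv_cache_gb:.1f} GB")  # ~27.5 GB
```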
