mlx 0.13 very slow with q8 and fp16 #776
Comments
I tested on M3 Max and got the same result, with even lower power consumption and GPU frequency. Really strange.
It's also pretty slow for me in 8-bit (and presumably fp16 as well, but I didn't test). Not sure why yet.
@ivanfioravanti just curious: you phrased this issue as if it used to be faster in previous MLX versions. Is that the case?
Sorry for the delay, I was out the whole week and am happy to be back playing with MLX 🛝
Could this be related to model size (70B)? I will try comparing with another large model.
Oh yes, it's almost certainly related to the size / amount of RAM required. There seems to be a performance cliff for very large models. It shouldn't be swapping, because it's still not using all the RAM on the machine, but it does seem related to memory page demand. Still debugging... this one might take a little while to iron out.
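For anyone trying to narrow this down, here is a minimal sketch (not from the thread) of printing MLX's Metal memory counters after a slow run, to see whether the q8/fp16 models push active memory toward the machine's limit. The functions are under `mx.metal` in MLX releases around 0.13; they may live elsewhere in other versions.

```python
# Hedged sketch: print MLX's Metal memory counters after a generation run.
# mx.metal.get_active_memory / get_peak_memory / get_cache_memory exist in
# MLX around 0.13; names/locations may differ in other releases.
import mlx.core as mx

def report_metal_memory() -> None:
    gib = 2**30
    print(f"active: {mx.metal.get_active_memory() / gib:.2f} GiB")
    print(f"peak:   {mx.metal.get_peak_memory() / gib:.2f} GiB")
    print(f"cache:  {mx.metal.get_cache_memory() / gib:.2f} GiB")

# Example: call this right after a generation run and compare q4 vs q8/fp16.
# report_metal_memory()
```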
I was testing the new quantization by @angeloskath with some Italian prompts that were failing with the previous version and are now PERFECT! But while doing this I noticed extreme slowness with the q8 and fp16 versions.
I'm using meta-llama/Meta-Llama-3-70B-Instruct for testing on an M2 Ultra with 192GB.
I created 3 conversions of it: q4, q8, and fp16.
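The original conversion commands aren't preserved in this capture. A hedged sketch of how the three variants could be produced with mlx_lm's `convert()` (argument names as of mlx-lm in this timeframe; the output paths are hypothetical):

```python
# Sketch of producing the three variants with mlx_lm's convert().
# Keyword names (quantize, q_bits, mlx_path, dtype) are per mlx-lm around
# this era and may differ in other releases; the output paths are made up.
from mlx_lm import convert

HF_MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

# 4-bit and 8-bit quantized conversions
convert(HF_MODEL, mlx_path="llama3-70b-q4", quantize=True, q_bits=4)
convert(HF_MODEL, mlx_path="llama3-70b-q8", quantize=True, q_bits=8)

# fp16 conversion (no quantization)
convert(HF_MODEL, mlx_path="llama3-70b-fp16", dtype="float16")
```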
Then I tested generation with:
q4: 14.9 t/s (GPU >1300 MHz, 115 W)
q8 & fp16: 0.4 t/s (GPU <1100 MHz, peak 2 W)
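The exact generation command isn't preserved above either. A hedged sketch of a comparable run with mlx_lm, which prints tokens/sec when `verbose=True` (the model path and prompt are placeholders):

```python
# Hedged sketch of a generation run against one of the converted models.
# The local path and prompt are placeholders; generate()'s verbose flag
# prints prompt and generation speed in tokens/sec in this era of mlx-lm.
from mlx_lm import load, generate

model, tokenizer = load("llama3-70b-q8")  # hypothetical path from the convert step
text = generate(
    model,
    tokenizer,
    prompt="Scrivi una breve poesia sull'estate.",
    max_tokens=256,
    verbose=True,
)
```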