Script to convert Grok-1 weights from raw JAX pickle files. #7058
base: master
Conversation
Does this merge the experts into a single tensor? |
It does the opposite -- in the raw data, the 8 experts are part of the same tensor. This splits them, which is also what the chatllm.cpp script does. If there is a way to keep them within one tensor I'm happy to make that change. |
The preferred way to export the expert tensors is as a single 3D tensor for all the experts. It is still possible to use one tensor per expert for backwards compatibility, but it forces the model weights to be copied to a buffer while loading, rather than using them directly from the memory mapped file. For large models like grok, I think it is especially important to be able to avoid this copy and use mmap. |
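As a toy illustration of the layout difference (not the PR's actual code, and with made-up sizes much smaller than Grok-1's), per-expert 2D weights can be stacked into the single 3D tensor the reviewer describes:

```python
import numpy as np

# Toy sizes; Grok-1 has 8 experts but far larger n_ff / n_embd.
n_expert, n_ff, n_embd = 8, 4, 3
experts = [np.full((n_ff, n_embd), i, dtype=np.float32) for i in range(n_expert)]

# One contiguous 3D tensor of shape (n_expert, n_ff, n_embd) -- the layout
# that can be used directly from a memory-mapped GGUF file without copying.
stacked = np.stack(experts, axis=0)
print(stacked.shape)  # (8, 4, 3)
```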
Understood. That will actually make the script simpler. Would you happen to know the tensor names I should use in this case? Currently when using splitting, they are
The tensor names are defined in gguf-py: llama.cpp/gguf-py/gguf/constants.py Lines 249 to 251 in 60325fa
It would be good to use these constants rather than hardcoding the names. |
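For illustration, the lookup pattern is roughly as follows; the dict here is a stand-in for gguf-py's real `TENSOR_NAMES` mapping (which is keyed by `MODEL_TENSOR` enum members), and the exact name strings are assumptions for the sketch:

```python
# Stand-in for gguf.TENSOR_NAMES from gguf-py/gguf/constants.py; the real
# mapping is keyed by MODEL_TENSOR enum members, sketched here as strings.
TENSOR_NAMES = {
    "FFN_GATE_EXP": "blk.{bid}.ffn_gate_exps",
    "FFN_DOWN_EXP": "blk.{bid}.ffn_down_exps",
    "FFN_UP_EXP": "blk.{bid}.ffn_up_exps",
}

def tensor_name(kind: str, bid: int) -> str:
    # Format the per-block tensor name from the table instead of hardcoding.
    return TENSOR_NAMES[kind].format(bid=bid)

print(tensor_name("FFN_GATE_EXP", 3))  # blk.3.ffn_gate_exps
```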
As per ggerganov#7058 (comment). This helps avoid a memcopy when running.
This saves weights in the order in which they are in the Grok-1 files. Since we operate weight-by-weight now, we no longer need caches and name2key translations. Per reviewer request, I also moved to using keys in gguf.TENSOR_NAMES.
Thanks! I have updated the branch to no longer split the MoE weights into separate tensors. That simplifies the script as it's now one weight per file. The original script permuted the order in which these weights are written for some reason; I stopped doing that, so there's now only one list of weight names. I also moved to the values in gguf.TENSOR_NAMES. PTAL. |
@heiner, name of my project is |
We can merge after lint fixes
Hm, I tested
Might need more work |
convert_grok.py
Outdated
if name.endswith("attn_k"):
    return permute(tensor, config.num_key_value_heads)
elif name.endswith("attn_q"):
    return permute(tensor, config.num_attention_heads)
According to llama_rope_type(), Grok uses LLAMA_ROPE_TYPE_NEOX, so do not permute.
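For context, this is the kind of Q/K row reordering llama.cpp's convert scripts apply for models with the "normal" RoPE layout (an adapted sketch assuming NumPy arrays, not this PR's code); for NeoX-style RoPE, as Grok uses, it must be skipped:

```python
import numpy as np

def permute(w: np.ndarray, n_head: int) -> np.ndarray:
    # Reorder the rows of a Q/K projection into the layout llama.cpp expects
    # for LLAMA_ROPE_TYPE_NORM models; shape is preserved, rows are shuffled.
    return (w.reshape(n_head, 2, w.shape[0] // n_head // 2, *w.shape[1:])
             .swapaxes(1, 2)
             .reshape(w.shape))

w = np.arange(24, dtype=np.float32).reshape(8, 3)
p = permute(w, 2)
```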
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the call to this function and the function itself.
"output_multiplier_scale": 0.5773502691896257,
"embedding_multiplier_scale": 78.38367176906169,
embedding_multiplier_scale should be multiplied into the embeddings. output_multiplier_scale can be ignored, IMO.
Added:

    if name == "token_embd":
        weight *= config.embedding_multiplier_scale
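As a toy sketch of the offline scaling being discussed (standalone, not the script's exact code):

```python
import numpy as np

embedding_multiplier_scale = 78.38367176906169  # value from Grok-1's config

# Toy embedding table; the real token_embd tensor is (vocab_size, n_embd).
weight = np.ones((4, 3), dtype=np.float32)

name = "token_embd"
if name == "token_embd":
    # Bake the scale into the weights once at conversion time, so it does
    # not have to be applied on every forward pass at runtime.
    weight *= embedding_multiplier_scale
```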
My original reason for not doing that was that this appears to also happen in the C++ code here (line 7487 in befddd0):

    inpL = ggml_scale(ctx0, inpL, 78.38367176906169f); |
Oh, I missed this. Then embedding_multiplier_scale should not be multiplied in again.
But it's better to do the multiplication once, offline, than to do it each time at runtime.
My apologies. As I said above, I couldn't actually test running the full model on my setup. I will fix @foldl's suggestions. Would you happen to have something like the SHA-1 of each tensor of a checkpoint based on the HF weights? Otherwise I can download those and run that conversion for comparison. |
Thanks @foldl for the hints. It's well possible I mixed something else up as well, e.g., swapped two tensors with the same shape and dtype. Would you happen to have a |
Thanks. I have removed the multiplication with
It's likely something else is wrong but I'm unsure what it is, and the multiple-hour iteration time makes it infeasible to just try out random things. |
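One way to compare two conversions tensor-by-tensor, as asked about above, is to hash each tensor's raw bytes; this is a sketch with a hypothetical helper, not part of the PR:

```python
import hashlib

import numpy as np

def tensor_digest(arr: np.ndarray) -> str:
    # SHA-1 over the raw bytes: identical dtype, layout, and values give
    # identical digests, so two conversion outputs can be diffed quickly.
    return hashlib.sha1(np.ascontiguousarray(arr).tobytes()).hexdigest()

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = a.copy()
c = a.copy()
c[0, 0] = 1.0  # a single changed value should change the digest
```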
This adds a script to convert the raw weights in the pickle files to GGUF format. This allows using @arki05's work in #6204 directly from the Grok-1 torrent.
Code is based on @foldl's conversion script in chatllm.cpp, which in turn is based on @chu-tianxiang's gist.
Main ideas to avoid excessive memory usage:

- Operate weight-by-weight and load tensors via mmap.

Note that I couldn't run the full model due to RAM constraints, and it's possible I mixed up some tensor names.
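The mmap idea above can be sketched with `np.memmap` as a toy stand-in for how the script avoids loading whole files into RAM (illustrative only; the file name and sizes are made up):

```python
import os
import tempfile

import numpy as np

# Write a toy float32 tensor to disk, then view it through a memory map so
# pages are faulted in on access instead of copying the whole file into RAM.
path = os.path.join(tempfile.mkdtemp(), "toy_weight.bin")
np.arange(12, dtype=np.float32).tofile(path)

view = np.memmap(path, dtype=np.float32, mode="r", shape=(3, 4))
total = float(view.sum())  # touches the mapped pages lazily
```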