
Script to convert Grok-1 weights from raw JAX pickle files. #7058

Draft
wants to merge 7 commits into base: master
Conversation

@heiner commented May 3, 2024

This adds a script to convert the raw weights in the pickle files to GGUF format. This allows using @arki05's work in #6204 directly from the Grok-1 torrent.

Code is based on @foldl's conversion script in chatllm.cpp, which in turn is based on @chu-tianxiang's gist.

Main ideas to avoid excessive memory use:

  • Parse pickle files using mmap.
  • Use PyTorch "meta" tensors to simulate shape and dtype results without having to do all conversions beforehand.

Note that I couldn't run the full model due to RAM constraints, and it's possible I mixed up some tensor names.
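
A minimal sketch of the meta-tensor idea (illustrative only, not the PR's actual code; the shapes are the Grok-1 MoE dimensions discussed below):

    import torch

    def dry_run(shape, dtype=torch.bfloat16):
        # A "meta" tensor tracks shape and dtype but allocates no memory.
        t = torch.empty(shape, dtype=dtype, device="meta")
        t = t.reshape(8, -1, shape[-1])   # e.g. split out the 8 stacked experts
        t = t.to(torch.float32)           # the dtype change is recorded, data is never touched
        return t.shape, t.dtype

    print(dry_run((8 * 32768, 6144)))     # (torch.Size([8, 32768, 6144]), torch.float32)

This lets the script plan every output tensor's name, shape, and dtype before streaming the actual weights through.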

@slaren (Collaborator) commented May 3, 2024

Does this merge the experts into a single tensor?

@heiner (Author) commented May 3, 2024

Does this merge the experts into a single tensor?

It does the opposite -- in the raw data, the 8 experts are part of the same tensor. This splits them, which is also what the chatllm.cpp script does.

If there is a way to keep them within one tensor I'm happy to make that change.

@slaren (Collaborator) commented May 3, 2024

The preferred way to export the expert tensors is as a single 3D tensor for all the experts. It is still possible to use one tensor per expert for backwards compatibility, but it forces the model weights to be copied to a buffer while loading, rather than using them directly from the memory mapped file. For large models like grok, I think it is especially important to be able to avoid this copy and use mmap.
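
As an illustration of the suggested layout, here is a small sketch (shapes and the tensor name are taken from the discussion below; this is not the script's actual code):

    import numpy as np

    # Eight per-expert (32768, 6144) slices stacked into a single (8, 32768, 6144)
    # tensor, written once as e.g. "blk.0.ffn_gate_exps.weight" so llama.cpp can
    # use it straight from the memory-mapped file.
    experts = [np.zeros((32768, 6144), dtype=np.float32) for _ in range(8)]
    ffn_gate_exps = np.stack(experts, axis=0)
    print(ffn_gate_exps.shape)  # (8, 32768, 6144)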

@heiner (Author) commented May 3, 2024

Understood. That will actually make the script simpler. Would you happen to know the tensor names I should use in this case? Currently, when splitting, they are:

| blk.{layer}.ffn_gate.{expert}.weight        | torch.Size([32768, 6144])  | Q4_0    |
| blk.{layer}.ffn_down.{expert}.weight        | torch.Size([6144, 32768])  | Q4_0    |
| blk.{layer}.ffn_up.{expert}.weight          | torch.Size([32768, 6144])  | Q4_0    |

@slaren (Collaborator) commented May 3, 2024

The tensor names are defined in gguf-py:

MODEL_TENSOR.FFN_GATE_EXP: "blk.{bid}.ffn_gate_exps",
MODEL_TENSOR.FFN_DOWN_EXP: "blk.{bid}.ffn_down_exps",
MODEL_TENSOR.FFN_UP_EXP: "blk.{bid}.ffn_up_exps",

It would be good to use these constants rather than hardcoding the names.
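
For example, assuming gguf-py exposes this table as gguf.TENSOR_NAMES keyed by gguf.MODEL_TENSOR (a sketch, not code from this PR):

    import gguf

    bid = 0  # block / layer index
    name = gguf.TENSOR_NAMES[gguf.MODEL_TENSOR.FFN_GATE_EXP].format(bid=bid) + ".weight"
    print(name)  # blk.0.ffn_gate_exps.weight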

heiner added 2 commits May 3, 2024 22:38:

  • As per ggerganov#7058 (comment). This helps avoid a memcopy when running.
  • This saves weights in the order in which they are in the Grok-1 files. Since we operate weight-by-weight now, we no longer need caches and name2key translations. Per reviewer request, I also moved to using keys in gguf.TENSOR_NAMES.

@heiner (Author) commented May 3, 2024

Thanks!

I have updated the branch to no longer split the MoE weights into separate tensors. That simplifies the script, as it's now one weight per file. The original script permuted the order in which these weights are written for some reason; I stopped doing that, so there's now only one list of weight names.

I also moved to the values in the gguf.TENSOR_NAMES dict as per your suggestion. I'm not sure that's a clear improvement ("Explicit is better than implicit."), especially in view of code like name.endswith("attn_k") and name.endswith("_exps"), but it's also not much worse.

PTAL.

@foldl (Contributor) commented May 3, 2024

@heiner, the name of my project is ChatLLM.cpp, not ChatLLM.ccp 😄

@mofosyne added the "python" (python script changes) label May 9, 2024
@ggerganov (Owner) previously approved these changes May 9, 2024 and left a comment:
We can merge after lint fixes

@mofosyne added the "review complexity: medium" (generally require more time to grok but manageable by beginner to medium expertise level) label May 9, 2024
@ggerganov (Owner) commented May 9, 2024

Hm, I tested Q4_0 conversion and it does not seem to work:

python3 convert_grok.py -i ~/Data/huggingface/grok-1/ckpt/ --vocab_dir ../grok-1 -o x.gguf -t q4_0

make -j && ./main -m x.gguf -p "I believe the meaning of life is" -ngl 0

...

[BOS] I believe the meaning of life is it:000000000000000 of0000000000000000000000000000000000000000000000

Might need more work

@ggerganov marked this pull request as draft May 9, 2024 14:03
@ggerganov dismissed their stale review May 9, 2024 14:03

Did not work on my machine

convert_grok.py Outdated
Comment on lines 289 to 292
if name.endswith("attn_k"):
return permute(tensor, config.num_key_value_heads)
elif name.endswith("attn_q"):
return permute(tensor, config.num_attention_heads)
@foldl (Contributor) commented May 9, 2024

According to llama_rope_type(), Grok uses LLAMA_ROPE_TYPE_NEOX, so do not permute.

@heiner (Author) replied:

Removed the call to this function and the function itself.

Comment on lines +356 to +357
"output_multiplier_scale": 0.5773502691896257,
"embedding_multiplier_scale": 78.38367176906169,
@foldl (Contributor) commented May 9, 2024

embedding_multiplier_scale should be multiplied into the embeddings.

output_multiplier_scale can be ignored, IMO.

@heiner (Author) replied:

Added

    if name == "token_embd":
        weight *= config.embedding_multiplier_scale

@heiner (Author) added:

My original reason for not doing that was that this appears to also happen in the C++ code here:

inpL = ggml_scale(ctx0, inpL, 78.38367176906169f);

@foldl (Contributor) replied:

Oh, I missed this. Then embedding_multiplier_scale should not be multiplied again.

That said, it's better to do the multiplication once, offline, than to do it each time at runtime.

@heiner (Author) commented May 9, 2024

[BOS] I believe the meaning of life is it:000000000000000 of0000000000000000000000000000000000000000000000

My apologies. As I said above, I couldn't actually test running the full model on my setup. I will address @foldl's suggestions.

Would you happen to have something like the SHA-1 of each tensor of a checkpoint based on the HF weights? Otherwise I can download those and run that conversion for comparison.

@mofosyne added the "enhancement" (New feature or request) label May 9, 2024
@heiner (Author) commented May 9, 2024

Thanks @foldl for the hints. It's quite possible I mixed something else up as well, e.g., swapped two tensors with the same shape and dtype. Would you happen to have a tensor name -> hash table for a correct conversion?
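
One way such a table could be produced (a sketch using gguf-py's GGUFReader; the file name is hypothetical):

    import hashlib
    import gguf

    # Map each tensor name to the SHA-1 of its raw data so that two
    # conversions of the same checkpoint can be compared tensor by tensor.
    reader = gguf.GGUFReader("grok-q4_0.gguf")
    digests = {t.name: hashlib.sha1(t.data.tobytes()).hexdigest() for t in reader.tensors}
    for name in sorted(digests):
        print(name, digests[name])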

@foldl (Contributor) commented May 9, 2024

@heiner You need to compare the result against #6204, although I don't think the SHA-1s would match.

chatllm.cpp permutes the k_proj/q_proj weights, so the SHA-1s would not match there, either.

@heiner (Author) commented May 10, 2024

Thanks. I have removed the multiplication by embedding_multiplier_scale again and converted the full model with all 8 experts. The output is not great, but also not as bad as before (gist with full output):

./build/bin/main -m grok.bin -p "I believe the meaning of life is" -s 2 -n 10 -ngl 0
(...)
[BOS] I believe the meaning of life is important it could possibly the general of the general I

It's likely something else is wrong but I'm unsure what it is, and the multiple-hour iteration time makes it infeasible to just try out random things.
