
convert-hf : reduce repeated boilerplate from write_tensors #7031

Closed · wants to merge 19 commits

Conversation

@compilade (Collaborator) commented May 1, 2024

A lot of the lines in convert-hf-to-gguf.py are repeated from Model.write_tensors.
With some refactoring, it is possible to remove 600+ lines while mostly keeping the original behavior.

Important

Continued at #7075. The testing section below is still valid for both PRs.

Changes Summary

  • Fix (almost) all type errors reported by Pyright
    • I've also included an appropriate pyrightconfig.json to let it load the gguf library correctly
  • upgrade SentencePiece requirement to ~=0.2.0 from ~=0.1.98
  • rename all encoding="utf8" to encoding="utf-8", for consistency, even if they are the same in Python.
  • make Model a plain class instead of an ABC (Abstract Base Class)
    • no abstract methods are used, so using ABC doesn't seem necessary anymore
    • but forbid direct instantiation in Model.__init__
    • allow setting the model_arch field in subclasses exactly as it's already done, but without making pyright output type errors about Literal[gguf.MODEL_ARCH.SOMETHING] not compatible with type property
      • an error is now thrown when forgetting to set the model_arch field, by checking for this in __init_subclass__ (so it's checked even for unused subclasses of Model)
  • biggest change: add a modify_tensors callback which is used in the model classes instead of overriding write_tensors (see the sketch after this list)
    • called for each tensor of the model
    • allows filtering, renaming, removing, modifying, and adding new tensors
    • all modifications are done on data_torch, which should make lazy computation easier to implement in the future (instead of also having to wrap complex numpy operations)
      • MoE weight stacking has been simplified
  • when f16 output is chosen, it's possible to selectively override the type of output tensors by overriding extra_f32_tensors and/or extra_f16_tensors in the architecture-specific model classes
  • allow unusual model part names
    • These unusual names were previously included in the part count, but not in the part names
    • This allows converting (for example) https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe which has a single model-00001-of-00001.safetensors part (previously, single-part models were expected to name their model file model.safetensors)
  • support conversion of Mamba models with a SentencePiece tokenizer
  • add ps.einops to llama-python-extra Nix packages
    • was previously missing, and is used for InternLM2Model in convert-hf-to-gguf.py
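
To make the callback shape concrete, here is a minimal sketch of the pattern described above: an __init_subclass__ check for model_arch plus a per-architecture modify_tensors override. It is illustrative only; map_tensor_name and the BERT-like subclass are hypothetical placeholders, not the PR's actual code.

```python
from __future__ import annotations

from typing import Iterable

import torch


class Model:
    model_arch: object  # in the real script this is a gguf.MODEL_ARCH enum member

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # raise early when a subclass forgets to set model_arch,
        # even if that subclass is never instantiated
        if "model_arch" not in cls.__dict__:
            raise TypeError(f"{cls.__name__} must set the model_arch field")

    def modify_tensors(self, data_torch: torch.Tensor, name: str, bid: int | None) -> Iterable[tuple[str, torch.Tensor]]:
        # called once per tensor; the default keeps the tensor under a mapped name
        return [(self.map_tensor_name(name), data_torch)]

    def map_tensor_name(self, name: str) -> str:
        return name  # placeholder for the real HF-to-GGUF name mapping


class BertLikeModel(Model):
    model_arch = "bert"  # hypothetical value

    def modify_tensors(self, data_torch, name, bid):
        # filtering: returning an empty list drops the tensor from the output
        if name in ("embeddings.position_ids", "pooler.dense.weight", "pooler.dense.bias"):
            return []
        return super().modify_tensors(data_torch, name, bid)
```

The extra_f32_tensors / extra_f16_tensors hooks mentioned above are meant to be overridden in the same per-architecture way.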

Testing

To make sure I didn't break previous work, I need help with testing conversion of existing architectures; I'm sure some of you already have the original model files downloaded, so testing conversion would be greatly appreciated.

$ python3 convert-hf-to-gguf.py --outtype f16 --outfile ./path/to/output.f16.gguf ./path/to/original/model/

If your username is mentioned below, it's because I think you might have the relevant model nearby, and/or I'd like you to check whether or not I broke your previous work, because I did modify quite a bit of model-specific conversion code.

It would be nice to automate such testing in the future, though I'm not sure how feasible it is. It could require finding the smallest model for each architecture, though unfortunately some architectures have only been used for very large models.

Also, @mofosyne, note that this will cause (some) conflicts with #6511, though they shouldn't be too hard to resolve (since most of them will be caused by removed code). I would be fine with your PR getting merged first to handle conflicts here, though I don't know if you'd prefer to do it on your side instead.

Also pinging @cebtenzzre because of #5825 which was another effort at simplifying convert-hf-to-gguf.py, so you should be familiar with the code.

Out of scope

Anything related to BPE pre-tokenizer errors was not caused by this PR and won't be fixed here to reduce potential conflicts.

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.
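
A quick illustration of the difference (not the PR's code; the shapes are made up):

```python
import torch

# three hypothetical expert weight matrices, each of shape (8, 4)
experts = [torch.randn(8, 4) for _ in range(3)]

stacked = torch.stack(experts, dim=0)     # adds a new experts dimension: (3, 8, 4)
concatenated = torch.cat(experts, dim=0)  # merges along an existing one: (24, 4)

print(stacked.shape, concatenated.shape)
```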

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.
`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".
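
A rough sketch of what that sorting amounts to (dir_model and the suffix filter are assumptions for illustration, not the PR's exact code):

```python
import os

dir_model = "./path/to/original/model"  # hypothetical path

# os.listdir gives no ordering guarantee, so sort part names before loading them;
# this way "model-00009-of-00042.safetensors" reliably precedes
# "model-00010-of-00042.safetensors"
part_names = sorted(
    name for name in os.listdir(dir_model) if name.endswith(".safetensors")
)
```
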
@compilade added the refactoring and need feedback (Testing and feedback with results are needed) labels on May 1, 2024
@@ -101,7 +101,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new

for tensor in reader.tensors:
    # Dimensions are written in reverse order, so flip them first
    shape = np.flipud(tensor.shape)
    shape = np.flipud(tensor.shape).tolist()
Contributor

Any particular reason for coercing shape to list here (and not elsewhere)?

@compilade (Collaborator Author) May 2, 2024

It's passed to GGUFWriter.add_tensor_info, which expects a Sequence[int] for the shape, and this shape is of type NDArray[uint32] which caused the error:

Argument of type "NDArray[uint32]" cannot be assigned to parameter "tensor_shape" of type "Sequence[int]" in function "add_tensor_info"
"NDArray[uint32]" is incompatible with "Sequence[int]"
(reportGeneralTypeIssues)

This comes from the shape field of a ReaderTensor, and it is coerced to list in other places, like in gguf-dump.py:

"shape": tensor.shape.tolist(),

prettydims = ', '.join('{0:5}'.format(d) for d in list(tensor.shape) + [1] * (4 - len(tensor.shape)))

But the shape field of ReaderTensor is only used in 7 places (including its definition). In other places, the "shape" usually comes directly from either NumPy or PyTorch, which use tuple[int, ...] for the shape type, and that is compatible with Sequence[int].

Contributor

I see; although GGUFWriter.add_tensor_info's typing is then perhaps not correct, I understand why it's simpler to just make it a list.

Collaborator Author

GGUFWriter.add_tensor_info's typing seems correct to me; it's used in 2 other places, and both use shapes which are already compatible with Sequence[int].

shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

self.gguf.add_tensor_info(name, tensor.shape, data_type, data_nbytes, raw_dtype=raw_dtype)

So using Sequence[int] there seems appropriate, as it's the most general type (I think?) which can be indexed (it avoids having to cast tuple[int, ...] into list[int], or list[int] into tuple[int, ...]).
This is how the shape is used in add_tensor_info:

for i in range(n_dims):
    self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])

@CISC (Contributor) May 2, 2024

Sure, all I'm saying is that that also works with NDArray[uint32] (even though it's not compatible with Sequence[int]).

@jploski (Contributor) commented May 2, 2024

An attempt to convert MPT fails early with an unhelpful error message, as shown below:

$ python3 convert-hf-to-gguf.py --outfile /mnt/f2fs/mpt/mpt7b.gguf --outtype f16 /mnt/seagate/dalai/mpt/models/mpt-7b
Loading model: mpt-7b
Traceback (most recent call last):
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2306, in <module>
    main()
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2287, in main
    model_instance = model_class(dir_model, ftype_map[args.outtype], fname_out, args.bigendian, args.use_temp_file)
  File "/mnt/seagate/miniconda3/lib/python3.10/typing.py", line 1456, in _no_init_or_replace_init
    cls.__init__(self, *args, **kwargs)
TypeError: MPTModel.__init__() takes exactly one argument (the instance to initialize)

@teleprint-me (Contributor) commented May 2, 2024

Could be a bound method being treated as a static method. cls may have no self instance and so raises an error? It's strange because it's obviously receiving the model path as well as other required arguments for the factory.

@jploski (Contributor) commented May 2, 2024

> Could be a bound method being treated as a static method. cls may have no self instance and so raises an error? It's strange because it's obviously receiving the model path as well as other required arguments for the factory.

Not sure what that means in terms of fixing it "properly".

The error disappears when I add an explicit __init__ method in MPTModel which calls super().__init__ with the same parameters. Then a different error occurs, which seems to be less related to Python and more to the actual conversion at hand:

Traceback (most recent call last):
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2311, in <module>
    main()
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 2295, in main
    model_instance.set_gguf_parameters()
  File "/mnt/seagate/dalai/llama.cpp.pr7031/convert-hf-to-gguf.py", line 628, in set_gguf_parameters
    block_count = self.hparams["n_layers"]
AttributeError: 'MPTModel' object has no attribute 'hparams'
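
For reference, a minimal sketch of the workaround described above (assuming the Model base class and gguf from convert-hf-to-gguf.py are in scope; this is illustrative, not the PR's eventual fix, which dropped Protocol instead):

```python
# inside convert-hf-to-gguf.py, where Model and gguf are already defined/imported
class MPTModel(Model):
    model_arch = gguf.MODEL_ARCH.MPT

    def __init__(self, *args, **kwargs):
        # forwarding explicitly to the base class bypasses the no-op __init__
        # that typing.Protocol substitutes into its subclasses
        super().__init__(*args, **kwargs)
```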

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.
@compilade (Collaborator Author) commented May 2, 2024

I think this error happens because Protocol in Python isn't meant to be used as a statically-checked ABC. Subclasses of Protocol aren't supposed to be instantiated.
It's weird that it worked for me though.

Seems like Model needs to be an ABC instead (or simply an object?), but I still want an error to be thrown when the model_arch property of any subclass of Model is not defined. Maybe using __init_subclass__ would be more appropriate than using the Model.register decorator for this check, and would also make it possible to ensure that this decorator is always used?

…tion

There are no abstract methods used anyway,
so using ABC isn't really necessary.
@jploski (Contributor) commented May 2, 2024

> I think this error happens because Protocol in Python isn't meant to be used as a statically-checked ABC. Subclasses of Protocol aren't supposed to be instantiated. It's weird that it worked for me though.
>
> Seems like Model needs to be an ABC instead (or simply an object?), but I still want an error to be thrown when the model_arch property of any subclass of Model is not defined. Maybe using __init_subclass__ would be more appropriate than using the Model.register decorator for this check, and would also make it possible to ensure that this decorator is always used?

I can't recommend anything in that regard. But I checked that after your last commit the script works and produces exactly the same GGUF for MPT-7B as the master branch version (and the resulting model works).

# we don't need these
if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq")):
if name.endswith((".attention.masked_bias", ".attention.bias", ".rotary_emb.inv_freq")):
Collaborator

If you want to blacklist tensors on a per-model basis you can use

MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
    MODEL_ARCH.LLAMA: [
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
    MODEL_ARCH.BAICHUAN: [
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_ROT_EMBD,
    ],
    MODEL_ARCH.PERSIMMON: [
        MODEL_TENSOR.ROPE_FREQS,
    ],

@compilade (Collaborator Author) May 4, 2024

Nice idea! I didn't know about MODEL_TENSOR_SKIP. Right now it's only used for MODEL_ARCH.LLAMA in convert.py. The exclusion lists for the other architectures were never really tested (e.g. MODEL_ARCH.PLAMO and MODEL_ARCH.GROK both have MODEL_TENSOR.ATTN_ROT_EMBD in MODEL_TENSORS, but they are not in MODEL_TENSOR_SKIP, since they rely on matching the name like you pointed out).

I'm not sure about using this; modify_tensors can already be used to skip tensors, and I think it's cleaner this way for architecture-specific use, while common skips (as it is) belong in the base Model.write_tensors (the line on which this comment thread was started). BertModel has a nice example for an architecture-specific exclusion:

# we are only using BERT for embeddings so we don't need the pooling layer
if name in ("embeddings.position_ids", "pooler.dense.weight", "pooler.dense.bias"):
    return []  # we don't need these

It was already done similarly in write_tensors (I tried to keep most of the model-specific code from write_tensors unchanged when moving it into modify_tensors). Otherwise, tensor types that won't be serialized would still need a serialized name, even if they're never used. BERT's pooling layers don't have an enum, and they probably shouldn't get one if it would just be carried around while remaining unused, like LLM_TENSOR_ATTN_ROT_EMBD in llama.cpp. (If they do get used, I agree they should be added, but they aren't for now.)

{ LLM_TENSOR_ATTN_ROT_EMBD, "blk.%d.attn_rot_embd" },

So instead, maybe it would be worth considering removing MODEL_TENSOR.ROPE_FREQS and MODEL_TENSOR.ATTN_ROT_EMBD? But this is likely out of the scope of this PR.

@compilade force-pushed the compilade/convert-hf-refactor branch from 66ecf6c to 98f2d0e on May 4, 2024 03:06
@CISC (Contributor) commented May 4, 2024

@compilade BTW, I added your typing changes to #6778 (and a progress bar, inspired by your other PR), so it will be a simpler rebase for whichever gets merged last. :)

@ggerganov added the high priority (Very important issue) label on May 5, 2024
@compilade (Collaborator Author)

(For anyone following only this and not #7075)

#7075 is starting to be more ready and I'm considering unifying both PRs to make the branches easier to manage. To keep the naming less confusing, I think it would be best to change the base branch of #7075 to master so that it includes the commits from here, and then close this PR as superseded.

However, doing this won't move over the review comments, so I'm wondering if that would still be fine? (@CISC, @Galunid)

@CISC (Contributor) commented May 6, 2024

> However, doing this won't move over the review comments, so I'm wondering if that would still be fine?

Fine by me, it was more for my benefit than for this PR anyway. :)

@Galunid (Collaborator) commented May 6, 2024

> However, doing this won't move over the review comments, so I'm wondering if that would still be fine?

Sure, we can do that

@compilade (Collaborator Author)

Closing in favor of #7075

@compilade closed this on May 6, 2024