convert-hf : reduce repeated boilerplate from write_tensors #7031

Closed
wants to merge 19 commits into master from compilade/convert-hf-refactor
Changes from 11 of 19 commits

Commits:
47e02eb convert-hf : begin refactoring write_tensor (compilade, Apr 30, 2024)
0d720ac Merge branch 'master' into compilade/convert-hf-refactor (compilade, Apr 30, 2024)
c33775b convert : upgrade to sentencepiece v0.2.0 (compilade, Apr 30, 2024)
698f0b3 convert-hf : remove unused n_dims in extra_*_tensors (compilade, Apr 30, 2024)
cde9ea6 convert-hf : simplify MoE weights stacking (compilade, Apr 30, 2024)
56f60f5 convert-hf : flake8 linter doesn't like semicolons (compilade, May 1, 2024)
3870164 convert-hf : allow unusual model part names (compilade, May 1, 2024)
dcd8dfa convert : use a string for the SentencePiece tokenizer path (compilade, May 1, 2024)
21068b6 convert-hf : display tensor shape (compilade, May 1, 2024)
639b374 convert-hf : convert norms to f32 by default (compilade, May 1, 2024)
644c269 convert-hf : sort model part names (compilade, May 1, 2024)
ce067af convert-hf : use an ABC for Model again (compilade, May 2, 2024)
13f4cf7 convert-hf : use a plain class for Model, and forbid direct instantiation (compilade, May 2, 2024)
6a54973 Merge branch 'master' into compilade/convert-hf-refactor (compilade, May 3, 2024)
3e5e0dc Merge branch 'master' into compilade/convert-hf-refactor (compilade, May 3, 2024)
98f2d0e convert-hf : more consistent formatting of cmdline args (compilade, May 4, 2024)
f2099c5 convert-hf : align the message logged for converted tensors (compilade, May 4, 2024)
215a0d3 convert-hf : fix Refact conversion (compilade, May 5, 2024)
c32d39c Merge branch 'master' into compilade/convert-hf-refactor (mofosyne, May 6, 2024)
1 change: 1 addition & 0 deletions .devops/nix/package.nix

@@ -86,6 +86,7 @@ let
       # TODO(Green-Sky): find a better way to opt-into the heavy ml python runtime
       llama-python-extra = python3.withPackages (
         ps: [
+          ps.einops
           ps.numpy
           ps.sentencepiece
           ps.tiktoken
1,823 changes: 564 additions & 1,259 deletions convert-hf-to-gguf.py

Large diffs are not rendered by default.

20 changes: 12 additions & 8 deletions convert.py

@@ -281,6 +281,7 @@ def loadOriginalParamsJson(model: LazyModel, config_path: Path) -> Params:
         n_experts = None
         n_experts_used = None
         f_rope_freq_base = None
+        n_ff = None
 
         # hack to determine LLaMA v1 vs v2 vs CodeLlama
         if config.get("moe"):
@@ -305,6 +306,8 @@ def loadOriginalParamsJson(model: LazyModel, config_path: Path) -> Params:
             n_experts_used = config["moe"]["num_experts_per_tok"]
             f_rope_freq_base = 1e6
 
+        assert n_ff is not None
+
         return Params(
             n_vocab = model["tok_embeddings.weight"].shape[0],
             n_embd  = config["dim"],
@@ -459,7 +462,8 @@ def __init__(self, base_path: Path):
             # not found in alternate location either
             raise FileNotFoundError('Cannot find tokenizer.model')
 
-        self.sentencepiece_tokenizer = SentencePieceProcessor(str(fname_tokenizer))
+        self.sentencepiece_tokenizer = SentencePieceProcessor()
+        self.sentencepiece_tokenizer.LoadFromFile(str(fname_tokenizer))
         vocab_size = self.sentencepiece_tokenizer.vocab_size()
 
         new_tokens = {id: piece for piece, id in added_tokens.items() if id >= vocab_size}
@@ -479,23 +483,23 @@ def __init__(self, base_path: Path):
     def sentencepiece_tokens(self) -> Iterable[tuple[bytes, float, gguf.TokenType]]:
         tokenizer = self.sentencepiece_tokenizer
         for i in range(tokenizer.vocab_size()):
-            piece = tokenizer.id_to_piece(i)
+            piece = tokenizer.IdToPiece(i)
             text = piece.encode("utf-8")
-            score: float = tokenizer.get_score(i)
+            score: float = tokenizer.GetScore(i)
 
             toktype = gguf.TokenType.NORMAL
-            if tokenizer.is_unknown(i):
+            if tokenizer.IsUnknown(i):
                 toktype = gguf.TokenType.UNKNOWN
-            if tokenizer.is_control(i):
+            if tokenizer.IsControl(i):
                 toktype = gguf.TokenType.CONTROL
 
             # NOTE: I think added_tokens are user defined.
             # ref: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto
             # if tokenizer.is_user_defined(i): toktype = gguf.TokenType.USER_DEFINED
 
-            if tokenizer.is_unused(i):
+            if tokenizer.IsUnused(i):
                 toktype = gguf.TokenType.UNUSED
-            if tokenizer.is_byte(i):
+            if tokenizer.IsByte(i):
                 toktype = gguf.TokenType.BYTE
 
             yield text, score, toktype
@@ -904,7 +908,7 @@ def load() -> UnquantizedTensor:
 def rebuild_from_type_v2(func, new_type, args, state):
     return func(*args)
 
-CLASSES = {
+CLASSES: dict[tuple[str, str], type[LazyTensor] | LazyStorageKind] = {
     # getattr used here as a workaround for mypy not being smart enough to determine
     # the staticmethods have a __func__ attribute.
     ('torch._tensor', '_rebuild_from_type_v2'): getattr(rebuild_from_type_v2, '__func__'),
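
Most of the convert.py hunk above tracks the sentencepiece 0.2.0 API: model loading moves out of the constructor into LoadFromFile, and the snake_case accessors (id_to_piece, get_score, is_unknown, ...) become their CamelCase counterparts. A minimal standalone sketch of the calls used in this diff, assuming a local tokenizer.model file:

    # Sketch of the sentencepiece 0.2.0 calls appearing in this diff;
    # "tokenizer.model" is a placeholder path.
    from sentencepiece import SentencePieceProcessor

    sp = SentencePieceProcessor()
    sp.LoadFromFile("tokenizer.model")

    for i in range(sp.vocab_size()):
        piece = sp.IdToPiece(i)  # token text
        score = sp.GetScore(i)   # log-probability-like score
        special = sp.IsUnknown(i) or sp.IsControl(i) or sp.IsUnused(i) or sp.IsByte(i)
        print(i, piece, score, special)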
2 changes: 1 addition & 1 deletion examples/server/tests/features/steps/steps.py

@@ -882,7 +882,7 @@ async def oai_chat_completions(user_prompt,
         while event_received:
             event_received = False
             async for line_in_bytes in response.content:
-                line = line_in_bytes.decode('utf8')
+                line = line_in_bytes.decode('utf-8')
                 line = line.rstrip('\n').rstrip('\r')
                 if line == '':
                     continue
2 changes: 1 addition & 1 deletion gguf-py/gguf/constants.py

@@ -861,7 +861,7 @@ def get_type(val: Any) -> GGUFValueType:
 # Note: Does not support GGML_QKK_64
 QK_K = 256
 # Items here are (block size, type size)
-GGML_QUANT_SIZES = {
+GGML_QUANT_SIZES: dict[GGMLQuantizationType, tuple[int, int]] = {
     GGMLQuantizationType.F32: (1, 4),
     GGMLQuantizationType.F16: (1, 2),
     GGMLQuantizationType.Q4_0: (32, 2 + 16),
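
The new annotation makes the table's value shape explicit: each quantization type maps to (block size in elements, type size in bytes), and a tensor's byte count follows as n_bytes = n_elems * type_size // block_size, the formula used by the reader changes below. A small illustrative check with the Q4_0 entry from this table (the tensor dimensions are a made-up example):

    # Q4_0 packs 32 weights into 2 + 16 = 18 bytes per block.
    block_size, type_size = 32, 2 + 16  # GGML_QUANT_SIZES[GGMLQuantizationType.Q4_0]

    n_elems = 4096 * 4096                        # hypothetical tensor
    n_bytes = n_elems * type_size // block_size
    print(n_bytes)                               # 9437184 bytes, vs. 67108864 at f32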
8 changes: 4 additions & 4 deletions gguf-py/gguf/gguf_reader.py

@@ -63,7 +63,7 @@ class ReaderTensor(NamedTuple):
 
 class GGUFReader:
     # I - same as host, S - swapped
-    byte_order: Literal['I' | 'S'] = 'I'
+    byte_order: Literal['I'] | Literal['S'] = 'I'
     alignment: int = GGUF_DEFAULT_ALIGNMENT
 
     # Note: Internal helper, API may change.
@@ -81,7 +81,7 @@ class GGUFReader:
         GGUFValueType.BOOL: np.bool_,
     }
 
-    def __init__(self, path: os.PathLike[str] | str, mode: Literal['r' | 'r+' | 'c'] = 'r'):
+    def __init__(self, path: os.PathLike[str] | str, mode: Literal['r'] | Literal['r+'] | Literal['c'] = 'r'):
         self.data = np.memmap(path, mode = mode)
         offs = 0
         if self._get(offs, np.uint32, override_order = '<')[0] != GGUF_MAGIC:
@@ -126,7 +126,7 @@ def get_tensor(self, idx: int) -> ReaderTensor:
         return self.tensors[idx]
 
     def _get(
-        self, offset: int, dtype: npt.DTypeLike, count: int = 1, override_order: None | Literal['I' | 'S' | '<'] = None,
+        self, offset: int, dtype: npt.DTypeLike, count: int = 1, override_order: None | Literal['I'] | Literal['S'] | Literal['<'] = None,
     ) -> npt.NDArray[Any]:
         count = int(count)
         itemsize = int(np.empty([], dtype = dtype).itemsize)
@@ -248,7 +248,7 @@ def _build_tensors(self, start_offs: int, fields: list[ReaderField]) -> None:
             raise ValueError(f'Found duplicated tensor with name {tensor_name}')
         tensor_names.add(tensor_name)
         ggml_type = GGMLQuantizationType(raw_dtype[0])
-        n_elems = np.prod(dims)
+        n_elems = int(np.prod(dims))
         block_size, type_size = GGML_QUANT_SIZES[ggml_type]
         n_bytes = n_elems * type_size // block_size
         data_offs = int(start_offs + offset_tensor[0])
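
The Literal changes above fix an invalid annotation form: in Literal['I' | 'S'], the | is applied to two plain strings before Literal ever sees them, which type checkers reject (and which would raise TypeError if the annotation were evaluated eagerly). A short sketch of the valid spellings, assuming Python 3.10+ for | between types:

    from typing import Literal

    ByteOrder = Literal['I'] | Literal['S']  # union of single-value Literals, as in this PR
    ByteOrderAlt = Literal['I', 'S']         # multi-value Literal; same meaning

    def set_byte_order(order: ByteOrder) -> None:
        print(order)

    set_byte_order('I')  # OK
    set_byte_order('X')  # accepted at runtime, but flagged by pyright/mypy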
6 changes: 3 additions & 3 deletions gguf-py/gguf/gguf_writer.py

@@ -173,7 +173,7 @@ def add_val(self, val: Any, vtype: GGUFValueType | None = None, add_vtype: bool
         if pack_fmt is not None:
             self.kv_data += self._pack(pack_fmt, val, skip_pack_prefix = vtype == GGUFValueType.BOOL)
         elif vtype == GGUFValueType.STRING:
-            encoded_val = val.encode("utf8") if isinstance(val, str) else val
+            encoded_val = val.encode("utf-8") if isinstance(val, str) else val
             self.kv_data += self._pack("Q", len(encoded_val))
             self.kv_data += encoded_val
         elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and val:
@@ -202,7 +202,7 @@ def add_tensor_info(
             raise ValueError(f'Duplicated tensor name {name}')
         self.ti_names.add(name)
 
-        encoded_name = name.encode("utf8")
+        encoded_name = name.encode("utf-8")
         self.ti_data += self._pack("Q", len(encoded_name))
         self.ti_data += encoded_name
         n_dims = len(tensor_shape)
@@ -476,7 +476,7 @@ def add_add_space_prefix(self, value: bool) -> None:
         self.add_bool(Keys.Tokenizer.ADD_PREFIX, value)
 
     def add_chat_template(self, value: str | Sequence[Mapping[str, str]]) -> None:
-        if isinstance(value, list):
+        if not isinstance(value, str):
            template_default = None
            template_names = set()
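
The add_chat_template change widens the multi-template branch from lists to any non-string value, so e.g. a tuple of named templates also takes that path. A hedged usage sketch: GGUFWriter and add_chat_template are real, but the file name, architecture string, and template contents are placeholders, and the name/template keys follow the Hugging Face tokenizer_config convention this code appears to consume:

    import gguf

    writer = gguf.GGUFWriter("model.gguf", "llama")  # placeholder path and arch

    # Any non-str sequence of named templates now works, not just a list:
    writer.add_chat_template((
        {"name": "default", "template": "{% for m in messages %}{{ m.content }}{% endfor %}"},
        {"name": "tool_use", "template": "{% ... %}"},
    ))

A plain string template still takes the other branch and is stored as a single value.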
6 changes: 3 additions & 3 deletions gguf-py/gguf/vocab.py

@@ -4,7 +4,7 @@
 import os
 import sys
 from pathlib import Path
-from typing import Any, Callable
+from typing import Any, Callable, Sequence, Mapping, Iterable
 
 from .gguf_writer import GGUFWriter
 
@@ -13,11 +13,11 @@ class SpecialVocab:
     merges: list[str]
     add_special_token: dict[str, bool]
     special_token_ids: dict[str, int]
-    chat_template: str | None
+    chat_template: str | Sequence[Mapping[str, str]] | None
 
     def __init__(
         self, path: str | os.PathLike[str], load_merges: bool = False,
-        special_token_types: tuple[str, ...] | None = None,
+        special_token_types: Iterable[str] | None = None,
         n_vocab: int | None = None,
     ):
         self.special_token_ids = {}
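
The widened typing matches how callers pass these values: special_token_types only needs to be iterated, and chat_template may now hold the named-template list form. A small sketch based on the signature in this diff; the model path and token-type names are placeholder assumptions:

    from gguf.vocab import SpecialVocab

    sv = SpecialVocab(
        "path/to/model-dir",                        # hypothetical model directory
        load_merges=True,
        special_token_types=["bos", "eos", "unk"],  # a list is fine now, not only a tuple
        n_vocab=32000,
    )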
2 changes: 1 addition & 1 deletion gguf-py/scripts/gguf-dump.py

@@ -43,7 +43,7 @@ def dump_metadata(reader: GGUFReader, args: argparse.Namespace) -> None:
         if len(field.types) == 1:
             curr_type = field.types[0]
             if curr_type == GGUFValueType.STRING:
-                print(' = {0}'.format(repr(str(bytes(field.parts[-1]), encoding='utf8')[:60])), end = '')
+                print(' = {0}'.format(repr(str(bytes(field.parts[-1]), encoding='utf-8')[:60])), end = '')
             elif field.types[0] in reader.gguf_scalar_to_np:
                 print(' = {0}'.format(field.parts[-1][0]), end = '')
         print()
12 changes: 6 additions & 6 deletions gguf-py/scripts/gguf-new-metadata.py

@@ -7,7 +7,7 @@
 from pathlib import Path
 
 import numpy as np
-from typing import Any, Mapping, Sequence
+from typing import Any, Sequence
 
 # Necessary to load the local gguf package
 if "NO_LOCAL_GGUF" not in os.environ and (Path(__file__).parent.parent.parent / 'gguf-py').exists():
@@ -34,19 +34,19 @@ def get_byteorder(reader: gguf.GGUFReader) -> gguf.GGUFEndian:
     return host_endian
 
 
-def decode_field(field: gguf.ReaderField) -> Any:
+def decode_field(field: gguf.ReaderField | None) -> Any:
     if field and field.types:
         main_type = field.types[0]
 
         if main_type == gguf.GGUFValueType.ARRAY:
             sub_type = field.types[-1]
 
             if sub_type == gguf.GGUFValueType.STRING:
-                return [str(bytes(field.parts[idx]), encoding='utf8') for idx in field.data]
+                return [str(bytes(field.parts[idx]), encoding='utf-8') for idx in field.data]
             else:
                 return [pv for idx in field.data for pv in field.parts[idx].tolist()]
         if main_type == gguf.GGUFValueType.STRING:
-            return str(bytes(field.parts[-1]), encoding='utf8')
+            return str(bytes(field.parts[-1]), encoding='utf-8')
         else:
             return field.parts[-1][0]
 
@@ -59,7 +59,7 @@ def get_field_data(reader: gguf.GGUFReader, key: str) -> Any:
     return decode_field(field)
 
 
-def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new_metadata: Mapping[str, str], remove_metadata: Sequence[str]) -> None:
+def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new_metadata: dict[str, str], remove_metadata: Sequence[str]) -> None:
     for field in reader.fields.values():
         # Suppress virtual fields and fields written by GGUFWriter
         if field.name == gguf.Keys.General.ARCHITECTURE or field.name.startswith('GGUF.'):
@@ -101,7 +101,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new
 
     for tensor in reader.tensors:
         # Dimensions are written in reverse order, so flip them first
-        shape = np.flipud(tensor.shape)
+        shape = np.flipud(tensor.shape).tolist()
Review thread on the line above (shape = np.flipud(tensor.shape).tolist()):

Contributor: Any particular reason for coercing shape to list here (and not elsewhere)?

compilade (Collaborator, author) on May 2, 2024: It's passed to GGUFWriter.add_tensor_info, which expects a Sequence[int] for the shape, and this shape is of type NDArray[uint32], which caused the error:

    Argument of type "NDArray[uint32]" cannot be assigned to parameter "tensor_shape" of type "Sequence[int]" in function "add_tensor_info"
        "NDArray[uint32]" is incompatible with "Sequence[int]" (reportGeneralTypeIssues)

This comes from the shape field of a ReaderTensor, and it is coerced to list in other places, like in gguf-dump.py:

    "shape": tensor.shape.tolist(),

    prettydims = ', '.join('{0:5}'.format(d) for d in list(tensor.shape) + [1] * (4 - len(tensor.shape)))

But the shape field of ReaderTensor is only used in 7 places (including its definition). In the other places, the shape usually comes directly from either NumPy or PyTorch, which use tuple[int, ...] for the shape type, which is compatible with Sequence[int].

Contributor: I see; although GGUFWriter.add_tensor_info's typing is then perhaps not correct, I understand why it's simpler to just make it a list.

compilade (Collaborator, author): GGUFWriter.add_tensor_info's typing seems correct to me; it's used in 2 other places, and both use shapes which are already compatible with Sequence[int]:

    shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
    self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

    self.gguf.add_tensor_info(name, tensor.shape, data_type, data_nbytes, raw_dtype=raw_dtype)

So using Sequence[int] there seems appropriate, as it's the most general type (I think?) which can be indexed; it avoids having to cast tuple[int, ...] into list[int], or list[int] into tuple[int, ...]. This is how the shape is used in add_tensor_info:

    for i in range(n_dims):
        self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])

CISC (Contributor) on May 2, 2024: Sure, all I'm saying is that that also works with NDArray[uint32] (even though it's not compatible with Sequence[int]).
         writer.add_tensor_info(tensor.name, shape, tensor.data.dtype, tensor.data.nbytes, tensor.tensor_type)
 
     writer.write_header_to_file()
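
The thread above reproduces in isolation; a minimal standalone sketch (not from the PR) of why the .tolist() coercion satisfies the type checker where the raw NumPy array does not:

    # add_tensor_info here is a stand-in with the Sequence[int] parameter
    # type from GGUFWriter; the tensor name and dims are placeholders.
    import numpy as np
    from typing import Sequence

    def add_tensor_info(name: str, tensor_shape: Sequence[int]) -> None:
        n_dims = len(tensor_shape)
        for i in range(n_dims):
            print(name, tensor_shape[n_dims - 1 - i])  # dims packed in reverse order

    shape = np.flipud(np.array([32000, 4096], dtype=np.uint32))
    add_tensor_info("token_embd.weight", shape)           # runs, but pyright flags it:
                                                          # NDArray[uint32] is not Sequence[int]
    add_tensor_info("token_embd.weight", shape.tolist())  # list[int] satisfies Sequence[int]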
3 changes: 3 additions & 0 deletions pyrightconfig.json

@@ -0,0 +1,3 @@
+{
+    "extraPaths": ["gguf-py"],
+}
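
A note on this new config: extraPaths adds gguf-py to pyright's import search path, presumably so that import gguf in the conversion scripts type-checks against the in-tree gguf-py package rather than requiring an installed copy.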
2 changes: 1 addition & 1 deletion requirements/requirements-convert.txt

@@ -1,5 +1,5 @@
 numpy~=1.24.4
-sentencepiece~=0.1.98
+sentencepiece~=0.2.0
 transformers>=4.35.2,<5.0.0
 gguf>=0.1.0
 protobuf>=4.21.0,<5.0.0