Added support for the ArcticForCausalLM. #7020

Merged on May 24, 2024 (19 commits)

Changes from 2 commits

Commits
71d8bd6  Added support for the snowflake-arctic model. (sszymczy, May 1, 2024)
c95013d  Whitespace formatting fixes. (sszymczy, May 2, 2024)
c6f15a7  Read vocabulary for ArcticForCausalLM from sentencepiece model instea… (sszymczy, May 7, 2024)
0cffda8  Moved ArcticModel to the end of the file. (sszymczy, May 9, 2024)
f3d1227  Merge branch 'ggerganov:master' into snowflake-arctic-clean (fairydreaming, May 9, 2024)
a892571  Applied changes from upstream PR: save memory with lazy evaluation #7… (sszymczy, May 9, 2024)
4ebb52c  Replaced prints with logger calls. (sszymczy, May 9, 2024)
9acc3ec  Removed unnecessary method - LlamaModel.permute is used instead. (sszymczy, May 9, 2024)
f4421f7  convert-hf : Corrected sentencepiece API calls. (sszymczy, May 14, 2024)
7a5df5f  Merge branch 'ggerganov:master' into snowflake-arctic (sszymczy, May 15, 2024)
85263f0  Minor fixes after merging. (sszymczy, May 15, 2024)
5b2be25  gguf-py : Moved non-conflicting block mappings from architecture-spec… (sszymczy, May 16, 2024)
5553226  Reordered tensors for visual consistency. (sszymczy, May 17, 2024)
f93acb5  llama : Removed usage of bias tensors in LLM_ARCH_ARCTIC, as they are… (sszymczy, May 17, 2024)
b53fd29  Reordered tensors for visual consistency. (sszymczy, May 17, 2024)
eb58c4b  Merge remote-tracking branch 'upstream/master' into snowflake-arctic-… (sszymczy, May 22, 2024)
a1a5508  llama : Replaced obsolete ggml_rope_custom() calls with ggml_rope_ext(). (sszymczy, May 22, 2024)
3aa20e1  Merge remote-tracking branch 'upstream/master' into snowflake-arctic-… (sszymczy, May 24, 2024)
602c80d  llama : fix whitespace formatting (sszymczy, May 24, 2024)
113 changes: 113 additions & 0 deletions convert-hf-to-gguf.py
@@ -1517,6 +1517,119 @@ def write_tensors(self):
            raise ValueError(f"Unprocessed experts: {experts.keys()}")


@Model.register("ArcticForCausalLM")
class ArcticModel(Model):
    model_arch = gguf.MODEL_ARCH.ARCTIC

    def set_vocab(self):
        self._set_vocab_llama_hf()
Collaborator:

re: #6877 (comment), this should be:

Suggested change:

-        self._set_vocab_llama_hf()
+        try:
+            self._set_vocab_sentencepiece()
+        except FileNotFoundError:
+            self._set_vocab_llama_hf()

The assertion exists because LlamaHfVocab was primarily written to convert HF "fast" tokenizers with a tokenizer.json. Since before it existed, "slow" sentencepiece tokenizers with a tokenizer.model have (almost?) always been converted using SentencePieceProcessor, which doesn't depend on HF transformers and directly preserves the token types and scores.

If you want to start converting slow tokenizers using HfVocab as well, I won't stop you, but in order to be consistent you'd have to remove all references to SentencePieceProcessor in the convert scripts, and make HF transformers a hard requirement for converting models with a Llama vocab. Otherwise, we'd be making an exception for this model for no clear reason.
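
For context, a rough sketch of what the SentencePieceProcessor path does (illustrative only; the loop below is an assumption about the general shape of the converters' sentencepiece handling, not the exact code in convert-hf-to-gguf.py):

    from sentencepiece import SentencePieceProcessor

    # Read pieces, scores, and token types straight from tokenizer.model,
    # without going through HF transformers.
    sp = SentencePieceProcessor()
    sp.Load("tokenizer.model")

    tokens, scores, toktypes = [], [], []
    for token_id in range(sp.vocab_size()):
        tokens.append(sp.id_to_piece(token_id).encode("utf-8"))
        scores.append(sp.get_score(token_id))
        if sp.is_unknown(token_id):
            toktypes.append("UNKNOWN")
        elif sp.is_control(token_id):
            toktypes.append("CONTROL")
        elif sp.is_byte(token_id):
            toktypes.append("BYTE")
        else:
            toktypes.append("NORMAL")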

Collaborator Author:

My reason is that the official tokenizer.model file for snowflake-arctic-instruct contains wrong BOS and EOS tokens, as confirmed in https://huggingface.co/Snowflake/snowflake-arctic-instruct/discussions/12.
That's why I used the llama_hf vocab, which reads tokens from JSON files, instead. If there is a better solution, I'm fully open to suggestions.

Collaborator Author:

@cebtenzzre What if I implement ArcticModel::set_vocab() myself, the way it was done for XverseForCausalLM? Would that be acceptable?

Collaborator Author:

@cebtenzzre I now load the vocabulary with SentencePieceProcessor as you suggested and apply the necessary token modifications based on the added_tokens_decoder field from tokenizer_config.json.
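
A simplified sketch of that approach (file names and the override logic are assumptions for illustration, not the PR's exact code):

    import json
    from sentencepiece import SentencePieceProcessor

    # Start from the sentencepiece vocabulary...
    sp = SentencePieceProcessor()
    sp.Load("tokenizer.model")
    tokens = [sp.id_to_piece(i) for i in range(sp.vocab_size())]
    toktypes = ["CONTROL" if sp.is_control(i) else "NORMAL" for i in range(sp.vocab_size())]

    # ...then let added_tokens_decoder from tokenizer_config.json override the
    # piece text and special/control status of specific ids (e.g. BOS/EOS).
    with open("tokenizer_config.json", encoding="utf-8") as f:
        added = json.load(f).get("added_tokens_decoder", {})

    for token_id_str, info in added.items():
        token_id = int(token_id_str)
        tokens[token_id] = info["content"]
        if info.get("special", False):
            toktypes[token_id] = "CONTROL"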


    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        hparams = self.hparams
        self.gguf_writer.add_vocab_size(hparams["vocab_size"])
        self.gguf_writer.add_rope_dimension_count(hparams["hidden_size"] // hparams["num_attention_heads"])

    # Same as super class, but permuting q_proj, k_proj
    def write_tensors(self):
        block_count = self.hparams.get("n_layers", self.hparams.get("num_hidden_layers", self.hparams.get("n_layer")))
        tensor_map = gguf.get_tensor_name_map(self.model_arch, block_count)
        n_head = self.hparams.get("num_attention_heads")
        n_kv_head = self.hparams.get("num_key_value_heads")
        n_experts = self.hparams.get("num_local_experts")
        experts = dict()
        for name, data_torch in self.get_tensors():
            # we don't need these
            if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq")):
                continue

            old_dtype = data_torch.dtype

            # convert any unsupported data types to float32
            if data_torch.dtype not in (torch.float16, torch.float32):
                data_torch = data_torch.to(torch.float32)

            data = data_torch.numpy()

            if name.endswith("q_proj.weight"):
                data = permute(data, n_head, n_head)
            if name.endswith("k_proj.weight"):
                data = permute(data, n_head, n_kv_head)

            data = data.squeeze()

            # process the experts separately
            if name.find("block_sparse_moe.experts") != -1:
                experts[name] = data
                if len(experts) >= n_experts:
                    # merge the experts into a single 3d tensor
                    for bid in range(block_count):
                        for wid in range(1, 4):
                            full = True
                            for xid in range(n_experts):
                                ename = f"model.layers.{bid}.block_sparse_moe.experts.{xid}.w{wid}.weight"
                                if ename not in experts:
                                    full = False
                                    break
                            if not full:
                                continue

                            datas = []
                            for xid in range(n_experts):
                                ename = f"model.layers.{bid}.block_sparse_moe.experts.{xid}.w{wid}.weight"
                                datas.append(experts[ename])
                                del experts[ename]

                            data = np.stack(datas, axis=0)
                            data_dtype = data.dtype

                            if self.ftype == 0 and data_dtype == np.float16:
                                data = data.astype(np.float32)

                            if self.ftype == 1 and data_dtype == np.float32:
                                data = data.astype(np.float16)

                            merged_name = f"layers.{bid}.feed_forward.experts.w{wid}.weight"

                            new_name = tensor_map.get_name(merged_name, try_suffixes=(".weight", ".bias"))
                            if new_name is None:
                                print(f"Can not map tensor {name!r}")
                                sys.exit()

                            print(f"{new_name}, n_dims = {len(data.shape)}, shape = {data.shape} --> {data.dtype}")

                            self.gguf_writer.add_tensor(new_name, data)
                continue

            # map tensor names
            new_name = tensor_map.get_name(name, try_suffixes=(".weight", ".bias"))
            if new_name is None:
                print(f"Can not map tensor {name!r}")
                sys.exit()

            n_dims = len(data.shape)
            data_dtype = data.dtype

            # if f32 desired, convert any float16 to float32
            if self.ftype == 0 and data_dtype == np.float16:
                data = data.astype(np.float32)

            # 1d tensors need to be converted to float32
            if self.ftype == 1 and data_dtype == np.float16 and n_dims == 1:
                data = data.astype(np.float32)

            # if f16 desired, convert any float32 2-dim weight tensors to float16
            if self.ftype == 1 and data_dtype == np.float32 and name.endswith(".weight") and n_dims == 2:
                data = data.astype(np.float16)

            print(f"{new_name}, n_dims = {n_dims}, {old_dtype} --> {data.dtype}")

            self.gguf_writer.add_tensor(new_name, data)

        if len(experts) > 0:
            raise ValueError(f"Unprocessed experts: {experts.keys()}")


@Model.register("GrokForCausalLM")
class GrokModel(Model):
    model_arch = gguf.MODEL_ARCH.GROK
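
For readers unfamiliar with the q_proj/k_proj permutation used in write_tensors above: llama.cpp's converters reorder the rows of the query and key projection weights so the rotary-embedding dimensions end up in the layout the GGML RoPE kernel expects. A sketch of that reordering, written here as a standalone NumPy function; it is assumed to mirror LlamaModel.permute (the method referenced in commit 9acc3ec), and the exact helper in the repository may differ:

    import numpy as np

    def permute(weights: np.ndarray, n_head: int, n_head_kv) -> np.ndarray:
        # Split each head's rows into (2, head_dim // 2), swap those two axes,
        # and flatten back, turning the HF interleaved rotary layout into the
        # two contiguous halves that the GGML RoPE implementation expects.
        if n_head_kv is not None and n_head != n_head_kv:
            n_head = n_head_kv
        return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                       .swapaxes(1, 2)
                       .reshape(weights.shape))
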
25 changes: 25 additions & 0 deletions gguf-py/gguf/constants.py
@@ -138,6 +138,7 @@ class MODEL_ARCH(IntEnum):
    COMMAND_R = auto()
    DBRX = auto()
    OLMO = auto()
    ARCTIC = auto()


class MODEL_TENSOR(IntEnum):
@@ -180,6 +181,7 @@ class MODEL_TENSOR(IntEnum):
    SSM_A = auto()
    SSM_D = auto()
    SSM_OUT = auto()
    FFN_NORM_EXP = auto()
Collaborator:

Since the actual numbers associated with the enum values of MODEL_TENSOR don't really matter (their names from TENSOR_NAMES are used instead in GGUF), maybe FFN_NORM_EXP could be placed right before FFN_GATE_EXP, a bit like FFN_NORM is right before FFN_GATE, for consistency.

If this is changed, it should also be placed similarly in TENSOR_NAMES and MODEL_TENSORS[MODEL_ARCH.ARCTIC] in gguf-py/gguf/constants.py, as well as in the llm_tensor enum, the LLM_TENSOR_NAMES mapping, and the llama_layer struct (and maybe the LLM_ARCH_ARCTIC case in llm_load_tensors?) in llama.cpp.
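
To make the point concrete: only the formatted name string from TENSOR_NAMES ends up in the GGUF file, so the numeric position of the enum member is cosmetic. A small illustration, assuming the gguf-py package from this branch is importable as gguf:

    from gguf.constants import MODEL_TENSOR, TENSOR_NAMES

    # The GGUF tensor name is built from TENSOR_NAMES; the IntEnum value of
    # MODEL_TENSOR.FFN_NORM_EXP never appears in the written file.
    print(TENSOR_NAMES[MODEL_TENSOR.FFN_NORM_EXP].format(bid=0))  # blk.0.ffn_norm_exps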

Collaborator Author:

I changed the order as requested, but in the llama_layer struct the order is different, so I didn't touch it. In llm_load_tensors I think it was already in the requested order.



MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
@@ -215,6 +217,7 @@ class MODEL_TENSOR(IntEnum):
    MODEL_ARCH.COMMAND_R: "command-r",
    MODEL_ARCH.DBRX: "dbrx",
    MODEL_ARCH.OLMO: "olmo",
    MODEL_ARCH.ARCTIC: "arctic",
}

TENSOR_NAMES: dict[MODEL_TENSOR, str] = {
@@ -257,6 +260,7 @@ class MODEL_TENSOR(IntEnum):
    MODEL_TENSOR.SSM_A: "blk.{bid}.ssm_a",
    MODEL_TENSOR.SSM_D: "blk.{bid}.ssm_d",
    MODEL_TENSOR.SSM_OUT: "blk.{bid}.ssm_out",
    MODEL_TENSOR.FFN_NORM_EXP: "blk.{bid}.ffn_norm_exps",
}

MODEL_TENSORS: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
@@ -725,6 +729,27 @@ class MODEL_TENSOR(IntEnum):
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,
    ],
    MODEL_ARCH.ARCTIC: [
        MODEL_TENSOR.TOKEN_EMBD,
        MODEL_TENSOR.OUTPUT_NORM,
        MODEL_TENSOR.OUTPUT,
        MODEL_TENSOR.ROPE_FREQS,
        MODEL_TENSOR.ATTN_NORM,
        MODEL_TENSOR.ATTN_Q,
        MODEL_TENSOR.ATTN_K,
        MODEL_TENSOR.ATTN_V,
        MODEL_TENSOR.ATTN_OUT,
        MODEL_TENSOR.ATTN_ROT_EMBD,
        MODEL_TENSOR.FFN_GATE_INP,
        MODEL_TENSOR.FFN_NORM,
        MODEL_TENSOR.FFN_GATE,
        MODEL_TENSOR.FFN_DOWN,
        MODEL_TENSOR.FFN_UP,
        MODEL_TENSOR.FFN_GATE_EXP,
        MODEL_TENSOR.FFN_DOWN_EXP,
        MODEL_TENSOR.FFN_UP_EXP,
        MODEL_TENSOR.FFN_NORM_EXP,
    ],
    # TODO
}

66 changes: 64 additions & 2 deletions gguf-py/gguf/tensor_mapping.py
@@ -370,6 +370,64 @@ class TensorNameMap:
"model.layers.{bid}.out_proj",
"backbone.layers.{bid}.mixer.out_proj",
),

}

    # architecture-specific block mappings
    arch_block_mappings_cfg: dict[MODEL_ARCH, dict[MODEL_TENSOR, tuple[str, ...]]] = {
        MODEL_ARCH.ARCTIC: {
            MODEL_TENSOR.TOKEN_EMBD: (
                "model.embed_tokens",
            ),
            MODEL_TENSOR.OUTPUT_NORM: (
                "model.norm",
            ),
            MODEL_TENSOR.OUTPUT: (
                "lm_head",
            ),
            MODEL_TENSOR.ATTN_NORM: (
                "model.layers.{bid}.input_layernorm",
            ),
            MODEL_TENSOR.ATTN_Q: (
                "model.layers.{bid}.self_attn.q_proj",
            ),
            MODEL_TENSOR.ATTN_K: (
                "model.layers.{bid}.self_attn.k_proj",
            ),
            MODEL_TENSOR.ATTN_V: (
                "model.layers.{bid}.self_attn.v_proj",
            ),
            MODEL_TENSOR.ATTN_OUT: (
                "model.layers.{bid}.self_attn.o_proj",
            ),
            MODEL_TENSOR.FFN_GATE_INP: (
                "model.layers.{bid}.block_sparse_moe.gate",
            ),
            MODEL_TENSOR.FFN_NORM: (
                "model.layers.{bid}.residual_layernorm",
            ),
            MODEL_TENSOR.FFN_GATE: (
                "model.layers.{bid}.residual_mlp.w1",
            ),
            MODEL_TENSOR.FFN_DOWN: (
                "model.layers.{bid}.residual_mlp.w2",
            ),
            MODEL_TENSOR.FFN_UP: (
                "model.layers.{bid}.residual_mlp.w3",
            ),
            MODEL_TENSOR.FFN_GATE_EXP: (
                "layers.{bid}.feed_forward.experts.w1",
            ),
            MODEL_TENSOR.FFN_DOWN_EXP: (
                "layers.{bid}.feed_forward.experts.w2",
            ),
            MODEL_TENSOR.FFN_UP_EXP: (
                "layers.{bid}.feed_forward.experts.w3",
            ),
            MODEL_TENSOR.FFN_NORM_EXP: (
                "model.layers.{bid}.post_attention_layernorm",
            ),
        },
    }

    mapping: dict[str, tuple[MODEL_TENSOR, str]]
@@ -383,12 +441,16 @@ def __init__(self, arch: MODEL_ARCH, n_blocks: int):
            self.mapping[tensor_name] = (tensor, tensor_name)
            for key in keys:
                self.mapping[key] = (tensor, tensor_name)
        if arch in self.arch_block_mappings_cfg:
            block_mappings = self.arch_block_mappings_cfg[arch]
Collaborator:

This means architecture-specific block mappings can't partially override the common mappings (they have to totally re-define everything)?

Maybe this is fixable by adding the common mappings first to self.mapping, then the architecture-specific mappings?

So maybe using the union operator for dicts would be appropriate here

if arch in self.arch_block_mappings_cfg:
    block_mappings = self.block_mappings_cfg | self.arch_block_mappings_cfg[arch]

But that's only supported since Python 3.9, and gguf-py targets python = ">=3.8"

In this case, using {**x, **y} instead of x | y would be compatible with Python versions older than 3.9, and would also make a new dict with the content of x augmented/overridden by y. But the new syntax is clearer in my opinion.

After that, the architecture-specific mappings for MODEL_ARCH.ARCTIC should be simpler (since they won't need to include duplicates of the common mappings).
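
For reference, the two merge spellings behave identically for this use case; a standalone illustration with made-up string keys (not the actual MODEL_TENSOR entries):

    common = {
        "FFN_NORM": ("model.layers.{bid}.post_attention_layernorm",),
        "FFN_GATE": ("model.layers.{bid}.mlp.gate_proj",),
    }
    arch_specific = {
        "FFN_NORM": ("model.layers.{bid}.residual_layernorm",),
    }

    merged_39 = common | arch_specific       # union operator, Python 3.9+
    merged_38 = {**common, **arch_specific}  # equivalent, also works on Python 3.8

    # The architecture-specific entry overrides the common one; untouched keys survive.
    assert merged_39 == merged_38
    assert merged_39["FFN_NORM"] == ("model.layers.{bid}.residual_layernorm",)
    assert merged_39["FFN_GATE"] == ("model.layers.{bid}.mlp.gate_proj",)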

Collaborator Author:

So the idea is to keep only "conflicting" block mappings in architecture-specific mappings and "non-conflicting" mappings in general mappings? I think using dict.update() is a better idea then. Mappings for ARCTIC arch would be shortened to:

    # architecture-specific block mappings
    arch_block_mappings_cfg: dict[MODEL_ARCH, dict[MODEL_TENSOR, tuple[str, ...]]] = {
        MODEL_ARCH.ARCTIC: {
            MODEL_TENSOR.FFN_NORM: (
                "model.layers.{bid}.residual_layernorm",
            ),
            MODEL_TENSOR.FFN_NORM_EXP: (
                "model.layers.{bid}.post_attention_layernorm",
            ),
        },
    }

while in the TensorNameMap init we would only have to add:

        if arch in self.arch_block_mappings_cfg:
            self.block_mappings_cfg.update(self.arch_block_mappings_cfg[arch])

What do you think?

Collaborator:

> So the idea is to keep only "conflicting" block mappings in architecture-specific mappings and "non-conflicting" mappings in general mappings?

Yes, exactly.

> What do you think?

I think using dict.update() would be good. My proposed approach would have made a copy of the dict, but you're right, updating in-place would work too and would be better, since the original block_mappings_cfg isn't used later on (I think?).

I agree with using dict.update() for this.

Collaborator Author:

OK, done

        else:
            block_mappings = self.block_mappings_cfg
        for bid in range(n_blocks):
-           for tensor, keys in self.block_mappings_cfg.items():
+           for tensor, keys in block_mappings.items():
                if tensor not in MODEL_TENSORS[arch]:
                    continue
                # TODO: make this configurable
-               n_experts = 60
+               n_experts = 128
                for xid in range(n_experts):
                    tensor_name = TENSOR_NAMES[tensor].format(bid = bid, xid = xid)
                    self.mapping[tensor_name] = (tensor, tensor_name)