
Cannot train new tokens while finetuning #31

Open
sandeep-krutrim opened this issue Apr 8, 2024 · 1 comment


sandeep-krutrim commented Apr 8, 2024

I am trying to add new tokens to the tokenizer and train them during finetuning. For that, the model's embedding matrix has to be resized. For standard BERT architectures this is implemented (a working comparison snippet is included after the traceback below), but for nomic-bert the call fails. The snippet and the full traceback:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, BertConfig, BertModel, BertForMaskedLM

# nomic-bert-2048 uses the bert-base-uncased tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = AutoModelForMaskedLM.from_pretrained("nomic-ai/nomic-bert-2048", trust_remote_code=True)
model.resize_token_embeddings(len(tokenizer))
```

```
NotImplementedError                       Traceback (most recent call last)
Cell In[4], line 6
      3 # Load pre-trained BERT model for masked language modeling
      4 #model = BertForMaskedLM.from_pretrained('bert-base-uncased')
      5 model = AutoModelForMaskedLM.from_pretrained("nomic-ai/nomic-bert-2048", trust_remote_code=True)
----> 6 model.resize_token_embeddings(len(tokenizer))
      7 #model = AutoModelForMaskedLM.from_pretrained("albert/albert-large-v2")
      8 #model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")

File /disk1/sandeep/miniconda3/envs/nomic3/lib/python3.9/site-packages/transformers/modeling_utils.py:1786, in PreTrainedModel.resize_token_embeddings(self, new_num_tokens, pad_to_multiple_of)
   1761 def resize_token_embeddings(
   1762     self, new_num_tokens: Optional[int] = None, pad_to_multiple_of: Optional[int] = None
   1763 ) -> nn.Embedding:
   1764     """
   1765     Resizes input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
   1766
   (...)
   1784         torch.nn.Embedding: Pointer to the input tokens Embeddings Module of the model.
   1785     """
-> 1786     model_embeds = self._resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
   1787     if new_num_tokens is None and pad_to_multiple_of is None:
   1788         return model_embeds

File /disk1/sandeep/miniconda3/envs/nomic3/lib/python3.9/site-packages/transformers/modeling_utils.py:1800, in PreTrainedModel._resize_token_embeddings(self, new_num_tokens, pad_to_multiple_of)
   1799 def _resize_token_embeddings(self, new_num_tokens, pad_to_multiple_of=None):
-> 1800     old_embeddings = self.get_input_embeddings()
   1801     new_embeddings = self._get_resized_embeddings(old_embeddings, new_num_tokens, pad_to_multiple_of)
   1802     if hasattr(old_embeddings, "_hf_hook"):

File /disk1/sandeep/miniconda3/envs/nomic3/lib/python3.9/site-packages/transformers/modeling_utils.py:1574, in PreTrainedModel.get_input_embeddings(self)
   1572     return base_model.get_input_embeddings()
   1573 else:
-> 1574     raise NotImplementedError

NotImplementedError:
```
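
For comparison, the same resize works on a standard BERT checkpoint, which is what I mean by "implemented for standard BERT architectures" (a minimal sketch; the added token strings are just placeholders):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("google-bert/bert-base-uncased")

# Placeholder tokens just to demonstrate the resize
tokenizer.add_tokens(["<new_tok_1>", "<new_tok_2>"])
model.resize_token_embeddings(len(tokenizer))  # works: BertForMaskedLM implements get/set_input_embeddings
print(model.get_input_embeddings().weight.shape)  # vocab size grows by the number of added tokens
```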

zanussbaum (Collaborator) commented

This seems to be an issue with how Hugging Face expects the embedding layer to be exposed. I'll spend some time later this week to see if this makes sense to add, but otherwise you can probably update the embedding layer manually, following their code: https://github.com/huggingface/transformers/blob/v4.39.3/src/transformers/modeling_utils.py#L1778
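
For anyone hitting this in the meantime, a rough manual resize along the lines of that linked helper might look like the sketch below. The attribute paths (`model.bert.embeddings.word_embeddings` and the location of the MLM output head) are assumptions about the remote-code model, not confirmed by the repo; run `print(model)` to verify them before relying on this.

```python
import torch.nn as nn
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["<new_tok_1>", "<new_tok_2>"])  # placeholder new tokens

model = AutoModelForMaskedLM.from_pretrained("nomic-ai/nomic-bert-2048", trust_remote_code=True)

# Assumed location of the input embedding; verify with print(model)
old_emb = model.bert.embeddings.word_embeddings
old_num, dim = old_emb.weight.shape
new_num = len(tokenizer)

# Build a larger embedding, randomly init it, then copy over the pretrained rows
new_emb = nn.Embedding(new_num, dim, dtype=old_emb.weight.dtype, device=old_emb.weight.device)
new_emb.weight.data.normal_(mean=0.0, std=0.02)
new_emb.weight.data[:old_num] = old_emb.weight.data
model.bert.embeddings.word_embeddings = new_emb

# If the MLM output projection is not tied to the input embedding, it needs the same treatment
model.config.vocab_size = new_num
```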
