pretrained_word_embeddings

This repository shows how to load and aggregate pretrained word embeddings in PyTorch, e.g., from ELMo, BERT, or XLNet.
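
For example, here is a minimal sketch of loading token-level BERT embeddings with the HuggingFace transformers library (an assumption for illustration; the script in this repository may use a different loader):

# Minimal sketch (not the repository's own code): load token-level BERT
# embeddings with HuggingFace transformers (assumes transformers >= 4.x).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "i dont care wether it provides free wifi or not"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states: (batch_size, num_tokens, hidden_size)
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)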

Setup

Use

python elmo_bert_xlnet_layer.py

Alignment from BERT/XLNet tokens to original words

Usually we want word-level embeddings from BERT/XLNet models, but a single word may be split into multiple sub-word tokens by the BERT/XLNet tokenizer. In that case, we obtain word embeddings by aligning the BERT/XLNet output tokens back to the original words.

For example, the sentence

"i dont care wether it provides free wifi or not"

can be tokenized as

['i', 'dont', 'care', 'wet', '##her', 'it', 'provides', 'free', 'wi', '##fi', 'or', 'not'].
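
Here is a minimal sketch of how such a tokenization and a token-to-word alignment map can be produced, assuming a WordPiece tokenizer from the transformers library (the exact splits depend on the vocabulary of the chosen model):

# Minimal sketch (not the repository's own code): tokenize word by word and
# record which original word each sub-word token came from.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
words = "i dont care wether it provides free wifi or not".split()

tokens, token_to_word = [], []
for word_idx, word in enumerate(words):
    for sub in tokenizer.tokenize(word):
        tokens.append(sub)
        token_to_word.append(word_idx)

print(tokens)
# e.g. ['i', 'dont', 'care', 'wet', '##her', 'it', 'provides', 'free', 'wi', '##fi', 'or', 'not']
print(token_to_word)
# e.g. [0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9]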

We provide three types of alignment:

  • 'ori': the token-level output embeddings of BERT/XLNet are used as-is to represent the input sentence, except that the embeddings of special tokens such as '[CLS]' and '[SEP]' are dropped.
  • 'first': the embedding of the first token of each word is used as the word embedding.
  • 'avg': the embeddings of all tokens of each word are averaged to form the word embedding (a pooling sketch follows this list).
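
Here is a minimal pooling sketch of the 'first' and 'avg' strategies (not the repository's own code), given token embeddings and the token-to-word map from the previous sketch:

# Minimal sketch: pool sub-word token embeddings into word embeddings.
# token_embeddings: (num_tokens, hidden_size); token_to_word: list of word indices.
import torch

def pool_word_embeddings(token_embeddings, token_to_word, mode="avg"):
    num_words = max(token_to_word) + 1
    word_embeddings = []
    for w in range(num_words):
        idx = [i for i, wi in enumerate(token_to_word) if wi == w]
        if mode == "first":
            word_embeddings.append(token_embeddings[idx[0]])
        else:  # "avg"
            word_embeddings.append(token_embeddings[idx].mean(dim=0))
    return torch.stack(word_embeddings)  # (num_words, hidden_size)

# Example with random embeddings for the 12-token sentence above:
token_to_word = [0, 1, 2, 3, 3, 4, 5, 6, 7, 7, 8, 9]
token_embeddings = torch.randn(len(token_to_word), 768)
print(pool_word_embeddings(token_embeddings, token_to_word, "first").shape)  # torch.Size([10, 768])
print(pool_word_embeddings(token_embeddings, token_to_word, "avg").shape)    # torch.Size([10, 768])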

1. alignment is 'ori'

[figure: alignment example with 'ori']

2. alignment is 'first'

[figure: alignment example with 'first']

3. alignment is 'avg'

[figure: alignment example with 'avg']
