
llama version in Minillm #218

Open
kaizizzzzzz opened this issue May 6, 2024 · 11 comments

Comments

@kaizizzzzzz

Is it Llama1 or Llama2? Thx

@t1101675
Contributor

The distilled models and the experiments in our paper are based on LLaMA-1.

@kaizizzzzzz
Author

Would it be easy to use this repo to KD LLaMA2?

@t1101675
Contributor

Yes. We have implemented model parallelism and SFT for LLaMA2. The KD scripts can easily be adapted from the LLaMA-1 ones.

@kaizizzzzzz
Author

It seems that there is no need to modify the source code to adapt to LLaMA2; simply changing the script is enough?

@t1101675
Contributor

Exactly.

@kaizizzzzzz
Author

Hello Yuxian, I'm a little curious about the MiniLLM process and the datasets it uses, and want to check my understanding. I have two questions.

  1. There are two datasets used for MiniLLM; using GPT2 as an example:
PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/"
LM_DATA_DIR="${BASE_PATH}/processed_data/openwebtext/gpt2/512/10M/"

The first one is also used for the SFT and KD baseline methods, but the second dataset is only for MiniLLM. I'm not familiar with KD, and the explanation in the paper confused me. It says:

> Input: Conditional generation dataset D consisting of prompts and ground-truth responses
> Pre-training corpus **_D_PT_** consisting of long-document plain texts
> A teacher model with output distribution p
> An initial student model **_pre-trained on D_PT_**, with the output distribution q_θ0
> Learning rate η; Batch size M; Clipping Threshold ε

In my understanding, do these two D_PT stand for different things? The first D_PT is used for calculating the language modeling loss L_PT = −E_{d∼D_PT} log q_θ(d), which is openwebtext/gpt2 here. The second D_PT is the data the model was actually pre-trained on. For models like GPT2 or LLaMA, the pre-training data is private, so here we use other corpora like openwebtext and the RoBERTa corpus to calculate L_PT. Is this only because we don't have the real pre-training data, and would it be better if we did have it? Is my understanding correct? Another confusing point: since the pre-training data used here is different from the model's actual pre-training data, when we use MiniLLM, do we need to train from scratch using the pre-training data here? Or can we keep this difference and just use the released GPT2 and LLaMA?

  2. I saw the hyperparameter setting for epochs is 10. The dataset is big and I don't have enough GPUs for that many epochs; does a smaller number of epochs still work, such as 1 epoch instead?

Sorry for these two long questions; I'd really appreciate your responses, thanks!
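For reference, the language modeling loss L_PT discussed in question 1 can be sketched in plain Python with toy distributions (this is an illustration of the formula, not the actual MiniLLM implementation):

```python
import math

def lm_regularization_loss(probs, tokens):
    """L_PT = -(1/|d|) * sum_t log q_theta(d_t | d_<t).

    probs[t] is the student's predicted distribution over the vocab
    at position t; tokens[t] is the actual next token at that position.
    """
    return -sum(math.log(p[tok]) for p, tok in zip(probs, tokens)) / len(tokens)

# Toy example: vocab of 3, sequence of 2 next-token predictions.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
tokens = [0, 1]
loss = lm_regularization_loss(probs, tokens)
# -(ln 0.7 + ln 0.8) / 2 ≈ 0.29
```

In MiniLLM this term is added to the distillation objective as a regularizer, with the expectation over D_PT approximated by mini-batches of openwebtext (or RoBERTa-corpus) documents.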

@t1101675
Contributor

  1. We use openwebtext simply because the pre-training data of GPT2 is not available. GPT2 is pre-trained on WebText, which is generally assumed to share a similar distribution with openwebtext. The RoBERTa corpus is a subset of LLaMA's pre-training corpus, which ensures D_PT does not introduce extra knowledge beyond pre-training. I think it would not make much difference to use the actual pre-training corpus. When we use MiniLLM, we just use the released GPT2 and LLaMA (D_PT can be treated as a regularization).
  2. The SFT baselines should be trained for about 10 epochs before they reach the best performance. The total training steps of MiniLLM are controlled by --total-iters 5000, which corresponds to 6 or 7 epochs. I think 1 epoch is not enough for the models to achieve the performance in our paper. (NOTE: this epoch argument refers to epochs of the instruction data, PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/", not openwebtext. Actually, openwebtext is trained for less than an epoch.)
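The relation between --total-iters and epochs over the instruction data is simple arithmetic; the sketch below uses hypothetical numbers for the dataset size and effective batch size, just to illustrate how a 5000-step budget can land in the 6-7 epoch range mentioned above:

```python
def iters_to_epochs(total_iters, dataset_size, effective_batch_size):
    """Convert a total step budget into epochs over the instruction data."""
    steps_per_epoch = dataset_size / effective_batch_size
    return total_iters / steps_per_epoch

# Hypothetical numbers: ~11k prompts with an effective batch size of 14
# would put 5000 iterations at roughly 6-7 epochs.
epochs = iters_to_epochs(5000, 11_000, 14)
```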

@kaizizzzzzz
Author

kaizizzzzzz commented May 13, 2024

Thanks, that makes sense!

BTW, I used LoRA for the SFT of the student and teacher models (the step to get the initialization models later used in MiniLLM), due to the GPU constraint on full-parameter SFT. I still train for 10 epochs; will using LoRA for the SFT hugely affect the performance of MiniLLM?

I have just finished the LoRA SFT of llama2-1.1B, and in /results/llama2/train/sft there are 10 folders, each storing the model for one epoch. So I just choose the optimal one based on the 'rougeL score' as the final SFT model?

@kaizizzzzzz
Author

Hello Yuxian, would you mind also sharing a link to the RoBERTa dataset you used, before processing? I'm training MiniLLM for llama2, and I saw there are two questions about the RoBERTa dataset. I tried to download those sub-datasets and combine them myself, but I'm afraid I made some mistakes in the process. The original dataset files you shared don't include RoBERTa, and I want to process it with llama2's tokenizer.

For now I'm just using LLaMA-1's processed RoBERTa data for training MiniLLM on llama2. Since there is only a little difference between llama1's and llama2's tokenizers, I think the influence is small, but it would be better if you could share the RoBERTa dataset link. Thanks!

@t1101675
Contributor

> Thanks, that makes sense!
>
> BTW, I used LoRA for the SFT of the student and teacher models (the step to get the initialization models later used in MiniLLM), due to the GPU constraint on full-parameter SFT. I still train for 10 epochs; will using LoRA for the SFT hugely affect the performance of MiniLLM?
>
> I have just finished the LoRA SFT of llama2-1.1B, and in /results/llama2/train/sft there are 10 folders, each storing the model for one epoch. So I just choose the optimal one based on the 'rougeL score' as the final SFT model?

We haven't tried using LoRA for MiniLLM. I guess it would not affect the performance much. Choosing the final model based on the 'rougeL score' is fine.
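Picking the best epoch folder by rougeL could be scripted roughly like this; the eval.json filename and its layout are assumptions for illustration, not the repo's actual output format:

```python
import json
import pathlib

def pick_best_checkpoint(sft_dir, metric="rougeL"):
    """Scan per-epoch checkpoint folders and return the one with the best
    metric. Assumes each folder holds a hypothetical 'eval.json' file
    mapping metric names to scores."""
    best_dir, best_score = None, float("-inf")
    for epoch_dir in sorted(pathlib.Path(sft_dir).iterdir()):
        eval_file = epoch_dir / "eval.json"
        if not eval_file.is_file():
            continue
        score = json.loads(eval_file.read_text())[metric]
        if score > best_score:
            best_dir, best_score = epoch_dir, score
    return best_dir, best_score
```

For example, pointing it at /results/llama2/train/sft would return the epoch folder whose eval file reports the highest rougeL.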

@t1101675
Contributor

> Hello Yuxian, would you mind also sharing a link to the RoBERTa dataset you used, before processing? I'm training MiniLLM for llama2, and I saw there are two questions about the RoBERTa dataset. I tried to download those sub-datasets and combine them myself, but I'm afraid I made some mistakes in the process. The original dataset files you shared don't include RoBERTa, and I want to process it with llama2's tokenizer.
>
> For now I'm just using LLaMA-1's processed RoBERTa data for training MiniLLM on llama2. Since there is only a little difference between llama1's and llama2's tokenizers, I think the influence is small, but it would be better if you could share the RoBERTa dataset link. Thanks!

It would take some time for us to get the RoBERTa dataset ready. We construct it simply by merging those sub-datasets and tokenizing them. Since the dataset is used for regularization and only a small subset of the data is actually used in training (less than an epoch), small differences in how the sub-datasets are merged will not make a great difference.
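The merge-and-tokenize step described above could look roughly like this; the one-document-per-line file format and the `tokenize` callable (a stand-in for e.g. the LLaMA-2 tokenizer's encode method) are assumptions, not the repo's actual processing pipeline:

```python
def merge_and_tokenize(sub_dataset_files, tokenize):
    """Concatenate plain-text sub-datasets and tokenize each document.

    `tokenize` is a stand-in for the target model's tokenizer (e.g. the
    LLaMA-2 tokenizer); the assumed file format is one document per
    non-empty line.
    """
    token_docs = []
    for path in sub_dataset_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    token_docs.append(tokenize(line))
    return token_docs
```

Because the corpus only serves as a regularizer and is consumed for less than an epoch, the exact merge order of the sub-datasets should matter little, as noted above.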
