Some clarifying questions #3

Open · kirk86 opened this issue Feb 18, 2024 · 9 comments

kirk86 commented Feb 18, 2024

Hi @kazemnejad,

Thanks for making things reproducible.

If you don't mind me asking a few quick questions:
In the data I see entries like {"values": [1, 0, 1, 0, 0, 1], "answer": 1, "cat": 6}. What does cat represent? Is it a category or something else?

If I remember correctly, somewhere in the paper it was stated that the models used are mostly decoder-only. So, to add new models, one has to add a config file in configs/models and then the actual model in the models/ directory?

Finally, for some data like SCAN there are only 16K training samples, but for others like the arithmetic tasks there are 100K. Why this difference? Is there any particular reason?

If one wants to generate the data, I suppose it suffices to call dataset_builders/make_..._dataset.py, right?

kazemnejad (Collaborator) commented Feb 19, 2024

Hi @kirk86.

Thanks for reaching out. I'd be more than happy to help :)

In the data I see entries like {"values": [1, 0, 1, 0, 0, 1], "answer": 1, "cat": 6}. What does cat represent? Is it a category or something else?

Generally in the code, when we have the cat/category field, we treat it as the length bucket of that data instance. We later use this value when we plot model performance as a function of instance length.
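As a rough illustration (not code from the repo; the file name parity_train.jsonl is just a placeholder), this is how one could tally instances per length bucket using the cat field:

```python
import json
from collections import Counter

# Count how many instances fall into each length bucket ("cat" field).
bucket_sizes = Counter()
with open("parity_train.jsonl") as f:  # placeholder file name
    for line in f:
        instance = json.loads(line)
        bucket_sizes[instance["cat"]] += 1

for bucket, count in sorted(bucket_sizes.items()):
    print(f"length bucket {bucket}: {count} instances")
```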

If I remember correctly, somewhere in the paper it was stated that the models used are mostly decoder-only. So, to add new models, one has to add a config file in configs/models and then the actual model in the models/ directory?

Exactly! There is actually an example of adding a new model at https://github.com/kazemnejad/pt_hf_base?tab=readme-ov-file#adding-a-new-model. Just as a reminder, this is mostly a research codebase; I tried to make it as clean as possible, but be prepared to find some inconsistencies here and there :)
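For a rough idea of what such a model file might contain, here is a generic decoder-only skeleton. It is only a sketch: the actual base classes, registration hooks, and hyperparameters in the codebase differ, so treat every name below as hypothetical.

```python
# Hypothetical sketch of a new model file (e.g. models/tiny_decoder.py); the
# repo's real models plug into its own base classes and config system.
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        seq_len = input_ids.size(1)
        # Causal mask: each position may only attend to earlier positions (decoder-only).
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device), diagonal=1
        )
        hidden = self.blocks(self.embed(input_ids), mask=causal_mask)
        return self.lm_head(hidden)  # per-token logits over the vocabulary
```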

Finally, for some data like SCAN there are only 16K training samples, but for others like the arithmetic tasks there are 100K. Why this difference? Is there any particular reason?

Since SCAN is already a well-established dataset with its splits available (Lake et al., 2018), we don't generate it from scratch; we rather use the already available files for it, which contain 16K examples.

If one wants to generate the data, I suppose it suffices to call dataset_builders/make_..._dataset.py, right?

Yes, one can use those scripts to generate the dataset.
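For intuition only (the real make_..._dataset.py builders differ in details such as splits and bucketing), a parity instance in the format shown above could be generated along these lines:

```python
import json
import random

def make_parity_instance(length: int) -> dict:
    # Random bit sequence; the answer is its parity, and "cat" is the length bucket.
    values = [random.randint(0, 1) for _ in range(length)]
    return {"values": values, "answer": sum(values) % 2, "cat": length}

# Write a small JSON-lines file (file name and sizes are placeholders).
with open("parity_train.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_parity_instance(random.randint(2, 20))) + "\n")
```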

P.S. I've added a brief walk-through of the codebase to the readme. You might find it useful: https://github.com/McGill-NLP/length-generalization?tab=readme-ov-file#code-structure

kirk86 (Author) commented Feb 26, 2024

Hi @kazemnejad,

Sorry to bother you.
I came across an error when using roberta, gpt, and gpt-neo for the sequence classification task (e.g. parity), which is related to the following line.

For some reason the variable self.label_list is None, so enumerating over it raises an error.

In the logs I see some messages like the following:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

The configs used for the run are the following:

# configs: ['configs/robertaArc_cls_base.jsonnet', 'configs/models/pe_none.jsonnet', 'configs/data/s2s_parity.jsonnet', 'configs/sweep.jsonnet', 'configs/hp_base.jsonnet', 'configs/final.jsonnet']

Any ideas what might be going wrong?

On another note, in the notebooks I couldn't find the test accuracy of these models; it's only shown in terms of mean rank. Which variable is it stored in on wandb? Is it pred/test_acc_overall?

If I go to the respective experiments directory, I see there's a file named log_Seq2SeqAnalyzer_test.json, and in it there's the variable pred/test_acc_overall. Does that represent the model's accuracy on the test dataset?

kazemnejad (Collaborator) commented:

Hey @kirk86

I came across an error when using roberta, gpt, and gpt-neo for the sequence classification task (e.g. parity), which is related to the following line.

In our paper, we only focus on sequence-to-sequence tasks. The classification tasks were considered in our early exploration, and that's why you can find their residue in the code. However, they were left out for the rest of the project. Unfortunately, I don't think they're readily usable at the current stage of the codebase.

If I go to the respective experiments directory, I see there's a file named log_Seq2SeqAnalyzer_test.json, and in it there's the variable pred/test_acc_overall. Does that represent the model's accuracy on the test dataset?

We were primarily interested in the per-bucket accuracy of these models. But yes, pred/test_acc_overall represents the overall accuracy of the model across all test length buckets. You can learn more about how these are computed here:
https://github.com/McGill-NLP/length-generalization/blob/main/src/analyzers/seq2seq_analyzer.py
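Conceptually (the actual Seq2SeqAnalyzer linked above is more involved), the overall and per-bucket numbers relate roughly like this:

```python
from collections import defaultdict

def accuracies(predictions, targets, buckets):
    # Group exact-match results by length bucket, then aggregate.
    per_bucket = defaultdict(list)
    for pred, gold, bucket in zip(predictions, targets, buckets):
        per_bucket[bucket].append(pred == gold)
    overall = sum(sum(v) for v in per_bucket.values()) / sum(len(v) for v in per_bucket.values())
    per_bucket_acc = {b: sum(v) / len(v) for b, v in per_bucket.items()}
    return overall, per_bucket_acc  # overall corresponds to pred/test_acc_overall

print(accuracies(["1", "0", "1"], ["1", "1", "1"], [4, 8, 8]))
```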

kirk86 (Author) commented Mar 1, 2024

Thanks for the reply.

In our paper, we only focus on sequence-to-sequence tasks. The classification tasks were considered in our early exploration, and that's why you can find their residue in the code.

In Figure F.5 there's a classification task. Did you solve it as seq2seq or as regular classification (I'm assuming the latter)?

kazemnejad (Collaborator) commented:

For all tasks in the paper, we only consider their seq2seq form.

kirk86 (Author) commented Mar 4, 2024

Thanks again for your reply.
If you don't mind me asking something else: are the results reported in the paper from a single model or from multiple models? For instance, in Figure 3 we can't tell which model is used. We presume it's either the best-performing model across tasks or some sort of aggregation? But which model is it from the following: [T5, roberta, gpt2, gpt-neo, something else]?

kazemnejad (Collaborator) commented:

As explained in Sec. 3, we report the results over three seeds. Please note that since these models are trained from scratch, the choice among [T5, roberta, gpt2, gpt-neo, something else] doesn't really matter (i.e. no pretrained weights are used). But if you're interested in the base architecture, our Transformer models are based on the T5 code: https://github.com/McGill-NLP/length-generalization/blob/main/src/models/custom_t5_decoder_only.py
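To illustrate what "from scratch" means here (this sketch uses the stock encoder-decoder T5 class from transformers for brevity, not the repo's custom decoder-only variant, and the sizes are made up):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Build a model from a config only: the weights are randomly initialized,
# i.e. no pretrained checkpoint is loaded.
config = T5Config(d_model=512, num_layers=6, num_heads=8, d_ff=2048)
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()):,} randomly initialized parameters")
```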

kirk86 (Author) commented Mar 4, 2024

Thanks for the explanations,

these models are trained from scratch, the choice among [T5, roberta, gpt2, gpt-neo, something else] doesn't really matter (i.e. no pretrained weights are used)

I understand that the model is trained from scratch, but does that also mean the T5 tokenizer is trained from scratch for each task as well, or did you simply use AutoTokenizer.from_pretrained('T5')?

kazemnejad (Collaborator) commented:

We don't train the tokenizer from scratch; rather, we use the original T5 tokenizer.
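In other words, something along these lines (the exact checkpoint name is an assumption; only the tokenizer comes from a pretrained checkpoint, while the model weights stay random):

```python
from transformers import AutoTokenizer

# Reuse the pretrained T5 tokenizer; no model weights are loaded here.
tokenizer = AutoTokenizer.from_pretrained("t5-base")  # checkpoint name assumed
print(tokenizer("1 0 1 0 0 1").input_ids)
```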
