Some clarifying questions #3

Open · kirk86 opened this issue Feb 18, 2024 · 9 comments

kirk86 commented Feb 18, 2024

Hi @kazemnejad,

Thanks for making things reproducible.

If you don't mind me asking a few quick questions:
In the data I see entries like {"values": [1, 0, 1, 0, 0, 1], "answer": 1, "cat": 6}. What does cat represent? Is it a category or something else?

If I remember correctly, somewhere in the paper it was stated that the models used are mostly decoder-only. So, to add new models, one has to add a config file in configs/models and then the actual model in the models/ directory?

Finally, for some data like SCAN there are only 16K training samples, but for others like the arithmetic tasks there are 100K. Why this difference? Is there any particular reason?

If one wants to generate the data, I suppose it suffices to call dataset_builders/make_..._dataset.py, right?

kazemnejad (Collaborator) commented Feb 19, 2024

Hi @kirk86.

Thanks for reaching out. I'd be more than happy to help :)

In the data I see entries like {"values": [1, 0, 1, 0, 0, 1], "answer": 1, "cat": 6}. What does cat represent? Is it a category or something else?

Generally in the code, when we have the cat/category field, we treat it as the length bucket of that data instance. We later use this value when we plot model performance as a function of instance length.
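As a rough illustration (not code from the repo; the file name parity_train.jsonl is just a placeholder), this is how one could tally instances per length bucket using the cat field:

```python
import json
from collections import Counter

# Count how many instances fall into each length bucket ("cat" field).
bucket_sizes = Counter()
with open("parity_train.jsonl") as f:  # placeholder file name
    for line in f:
        instance = json.loads(line)
        bucket_sizes[instance["cat"]] += 1

for bucket, count in sorted(bucket_sizes.items()):
    print(f"length bucket {bucket}: {count} instances")
```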

If I remember correctly, somewhere in the paper it was stated that the models used are mostly decoder-only. So, to add new models, one has to add a config file in configs/models and then the actual model in the models/ directory?

Exactly! There is actually an example of adding a new model at https://github.com/kazemnejad/pt_hf_base?tab=readme-ov-file#adding-a-new-model. Just as a reminder, this is mostly a research codebase; I tried to make it as clean as possible, but be prepared to find some inconsistencies here and there :)
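For a rough idea of what such a model file might contain, here is a generic decoder-only skeleton. It is only a sketch: the actual base classes, registration hooks, and hyperparameters in the codebase differ, so treat every name below as hypothetical.

```python
# Hypothetical sketch of a new model file (e.g. models/tiny_decoder.py); the
# repo's real models plug into its own base classes and config system.
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        seq_len = input_ids.size(1)
        # Causal mask: each position may only attend to earlier positions (decoder-only).
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=input_ids.device), diagonal=1
        )
        hidden = self.blocks(self.embed(input_ids), mask=causal_mask)
        return self.lm_head(hidden)  # per-token logits over the vocabulary
```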

Finally, for some data like SCAN there are only 16K training samples, but for others like the arithmetic tasks there are 100K. Why this difference? Is there any particular reason?

Since SCAN is already a well-established dataset with its splits available (Lake et al., 2018), we don't generate it from scratch; we rather use the already available files for it, which contain 16K examples.

If one wants to generate the data, I suppose it suffices to call dataset_builders/make_..._dataset.py, right?

Yes, one can use those scripts to generate the dataset.
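For intuition only (the real make_..._dataset.py builders differ in details such as splits and bucketing), a parity instance in the format shown above could be generated along these lines:

```python
import json
import random

def make_parity_instance(length: int) -> dict:
    # Random bit sequence; the answer is its parity, and "cat" is the length bucket.
    values = [random.randint(0, 1) for _ in range(length)]
    return {"values": values, "answer": sum(values) % 2, "cat": length}

# Write a small JSON-lines file (file name and sizes are placeholders).
with open("parity_train.jsonl", "w") as f:
    for _ in range(1000):
        f.write(json.dumps(make_parity_instance(random.randint(2, 20))) + "\n")
```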

P.S. I've added a brief walk-through of the codebase to the readme. You might find it useful: https://github.com/McGill-NLP/length-generalization?tab=readme-ov-file#code-structure

kirk86 (Author) commented Feb 26, 2024

Hi @kazemnejad,

Sorry to bother you.
I came across an error when using roberta, gpt, and gpt-neo for the sequence classification task (e.g. parity), which is related to the following line.

For some reason the variable self.label_list is None, so enumerating over it raises an error.

In the logs I see some messages like the following:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

The configs used for the run are the following:

# configs: ['configs/robertaArc_cls_base.jsonnet', 'configs/models/pe_none.jsonnet', 'configs/data/s2s_parity.jsonnet', 'configs/sweep.jsonnet', 'configs/hp_base.jsonnet', 'configs/final.jsonnet']

Any ideas what might be going wrong?

On another note, in the notebooks I couldn't find the test accuracy of these models; it's only shown in terms of mean rank. Which variable is it stored in on wandb? Is it pred/test_acc_overall?

If I go to the respective experiments directory, I see there's a file named log_Seq2SeqAnalyzer_test.json, and in it there's the variable pred/test_acc_overall. Does that represent the model's accuracy on the test dataset?

kazemnejad (Collaborator) commented:

Hey @kirk86

I came across an error when using roberta, gpt, and gpt-neo for the sequence classification task (e.g. parity), which is related to the following line.

In our paper, we only focus on sequence-to-sequence tasks. The classification tasks were considered in our early exploration, and that's why you can find their residue in the code. However, they were left out for the rest of the project. Unfortunately, I don't think they're readily usable at the current stage of the codebase.

If I go to the respective experiments directory, I see there's a file named log_Seq2SeqAnalyzer_test.json, and in it there's the variable pred/test_acc_overall. Does that represent the model's accuracy on the test dataset?

We were primarily interested in the per-bucket accuracy of these models. But yes, pred/test_acc_overall represents the overall accuracy of the model across all test length buckets. You can learn more about how these are computed here:
https://github.com/McGill-NLP/length-generalization/blob/main/src/analyzers/seq2seq_analyzer.py
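Conceptually (the actual Seq2SeqAnalyzer linked above is more involved), the overall and per-bucket numbers relate roughly like this:

```python
from collections import defaultdict

def accuracies(predictions, targets, buckets):
    # Group exact-match results by length bucket, then aggregate.
    per_bucket = defaultdict(list)
    for pred, gold, bucket in zip(predictions, targets, buckets):
        per_bucket[bucket].append(pred == gold)
    overall = sum(sum(v) for v in per_bucket.values()) / sum(len(v) for v in per_bucket.values())
    per_bucket_acc = {b: sum(v) / len(v) for b, v in per_bucket.items()}
    return overall, per_bucket_acc  # overall corresponds to pred/test_acc_overall

print(accuracies(["1", "0", "1"], ["1", "1", "1"], [4, 8, 8]))
```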

kirk86 (Author) commented Mar 1, 2024

Thanks for the reply.

In our paper, we only focus on sequence-to-sequence tasks. The classification tasks were considered in our early exploration, and that's why you can find their residue in the code.

In Figure F.5 there's a classification task. Did you solve it as seq2seq or as regular classification (I'm assuming the latter)?

kazemnejad (Collaborator) commented:

For all tasks in the paper, we only consider their seq2seq form.

kirk86 (Author) commented Mar 4, 2024

Thanks again for your reply.
If you don't mind me asking something else: are the results reported in the paper from a single model or from multiple models? For instance, in Figure 3 we can't tell which model is used. We presume it's either the best-performing model across tasks or some sort of aggregation? But which model is it from the following: [T5, roberta, gpt2, gpt-neo, something else]?

kazemnejad (Collaborator) commented:

As explained in Sec. 3, we report the results over three seeds. Please note that since these models are trained from scratch, the choice among [T5, roberta, gpt2, gpt-neo, something else] doesn't really matter (i.e. no pretrained weights are used). But if you're interested in the base architecture, our Transformer models are based on the T5 code: https://github.com/McGill-NLP/length-generalization/blob/main/src/models/custom_t5_decoder_only.py
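To illustrate what "from scratch" means here (this sketch uses the stock encoder-decoder T5 class from transformers for brevity, not the repo's custom decoder-only variant, and the sizes are made up):

```python
from transformers import T5Config, T5ForConditionalGeneration

# Build a model from a config only: the weights are randomly initialized,
# i.e. no pretrained checkpoint is loaded.
config = T5Config(d_model=512, num_layers=6, num_heads=8, d_ff=2048)
model = T5ForConditionalGeneration(config)
print(f"{sum(p.numel() for p in model.parameters()):,} randomly initialized parameters")
```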

kirk86 (Author) commented Mar 4, 2024

Thanks for the explanations,

these models are trained from scratch, the choice among [T5, roberta, gpt2, gpt-neo, something else] doesn't really matter (i.e. no pretrained weights are used)

I understand that the model is trained from scratch, but does that also mean the T5 tokenizer is trained from scratch for each task as well, or did you simply use AutoTokenizer.from_pretrained('T5')?

kazemnejad (Collaborator) commented:

We don't train the tokenizer from scratch; rather, we use the original T5 tokenizer.
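In other words, something along these lines (the exact checkpoint name is an assumption; only the tokenizer comes from a pretrained checkpoint, while the model weights stay random):

```python
from transformers import AutoTokenizer

# Reuse the pretrained T5 tokenizer; no model weights are loaded here.
tokenizer = AutoTokenizer.from_pretrained("t5-base")  # checkpoint name assumed
print(tokenizer("1 0 1 0 0 1").input_ids)
```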
