Seq2seq: failure when evaluating while training #1522

Open
Futyn-Maker opened this issue May 7, 2023 · 8 comments · May be fixed by #1533
Comments

@Futyn-Maker

Describe the bug
When training Seq2Seq-type models with evaluation during training enabled, a pandas error (ValueError: All arrays must be of the same length) occurs during evaluation in Google Colab (free plan).
To Reproduce
Steps to reproduce the behavior:

import torch
from simpletransformers.seq2seq import Seq2SeqModel


def main(args):
    # train_df / eval_df (with "input_text" and "target_text" columns) and the
    # metric functions count_matches, accuracy_score, f1_score are defined
    # elsewhere in the script.
    model_args = {
        "do_lower_case": True,
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "max_seq_length": max([len(token) for token in train_df["target_text"].tolist()]),
        "train_batch_size": 256,
        "num_train_epochs": 5,
        "save_eval_checkpoints": False,
        "save_model_every_epoch": False,
        "evaluate_during_training": True,
        "evaluate_during_training_verbose": True,
        "use_multiprocessing": False,
        "save_best_model": False,
        "max_length": max([len(token) for token in train_df["input_text"].tolist()]),
        "save_steps": -1,
    }
    model = Seq2SeqModel(
        encoder_decoder_type="bart",
        encoder_decoder_name="facebook/bart-base",
        args=model_args,
        use_cuda=torch.cuda.is_available(),
    )
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)
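
For completeness, this is roughly the shape of data the snippet above assumes; the column names come from the script itself, while the rows below are made-up placeholders, not the actual lemmatisation data:

import pandas as pd

# Placeholder frames only: the real data comes from the
# transformer-lemmatiser-ruthenian project and is not shown here.
train_df = pd.DataFrame(
    {
        "input_text": ["running", "mice"],
        "target_text": ["run", "mouse"],
    }
)
eval_df = pd.DataFrame(
    {
        "input_text": ["went"],
        "target_text": ["go"],
    }
)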

Expected behavior
Training and evaluation complete without failures.

Screenshots
Not applicable

Desktop (please complete the following information):

  • OS: Windows 11 (but in fact running in Google Colab)

Additional context

Here are the reduced logs:

2023-05-06 17:56:43.337391: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading (…)lve/main/config.json: 100% 1.72k/1.72k [00:00<00:00, 8.86MB/s]
Downloading pytorch_model.bin: 100% 558M/558M [00:25<00:00, 21.6MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 1.29MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 875kB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 1.56MB/s]
100% 79032/79032 [00:21<00:00, 3710.01it/s]
Epoch 1 of 5:   0% 0/5 [00:00<?, ?it/s]
Running Epoch 0 of 5:   0% 0/309 [00:00<?, ?it/s]
Epochs 1/5. Running Loss:   10.1010:   0% 0/309 [00:03<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

Epochs 1/5. Running Loss:   10.1010:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   1% 2/309 [00:03<08:38,  1.69s/it]
...
Epochs 1/5. Running Loss:    0.0396: 100% 309/309 [02:17<00:00,  2.26it/s]
  0% 0/10011 [00:00<?, ?it/s]
  0% 1/10011 [00:31<86:57:41, 31.27s/it] (some strange deadlock here)
100% 10011/10011 [01:14<00:00, 135.21it/s]
Epoch 1 of 5:   0% 0/5 [03:59<?, ?it/s]
Traceback (most recent call last):
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 56, in <module>
    main(args)
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 45, in main
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 450, in train_model
    global_step, training_details = self.train(
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 1005, in train
    report = pd.DataFrame(training_progress_scores)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 664, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
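
The ValueError itself is pandas refusing to build a DataFrame from a dict whose lists have different lengths; per the traceback, this is what happens to training_progress_scores when one of its metric lists ends up shorter than the others. A minimal standalone illustration of the same failure (metric names and values below are made up):

import pandas as pd

# A dict of metric-name -> list of per-evaluation values, as
# training_progress_scores is; if one list is missing an entry,
# pd.DataFrame raises the error from the traceback above.
scores = {
    "global_step": [309, 618],
    "eval_loss": [0.12, 0.09],
    "matches": [512],  # one entry short
}
pd.DataFrame(scores)  # ValueError: All arrays must be of the same length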
@The-One-Who-Speaks-and-Depicts

The same problem; it repeats almost everywhere under the same conditions. OSes: Ubuntu and Windows 11.

@Moustafa-Banbouk

Hey @The-One-Who-Speaks-and-Depicts and @Futyn-Maker, any luck in solving this issue? I am facing the same error during model training.

@The-One-Who-Speaks-and-Depicts

@Moustafa-Banbouk I have been experiencing this for a year or so, with no ideas. I just switched the validation off in the args and called it a day.

@Futyn-Maker
Author

I just switched the validation off in args, and called it a day.

Same here for now; it didn't really interfere with the project I was working on at the time, but I consider it an extremely critical bug.

@DamithDR
Contributor

DamithDR commented Jun 19, 2023

@The-One-Who-Speaks-and-Depicts @Futyn-Maker @Moustafa-Banbouk Can you try disabling multiprocessing using:

use_multiprocessing = False
use_multiprocessing_for_evaluation = False
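
For anyone trying this, the two flags go into the same model_args dict as in the original snippet (a sketch; the other args are unchanged and elided here):

model_args = {
    # ... other args from the original snippet ...
    "evaluate_during_training": True,
    # suggested workaround: disable multiprocessing for both the training
    # and the evaluation data preparation
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
}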

@The-One-Who-Speaks-and-Depicts

@DamithDR At least in my case it works; I created a PR.

@Futyn-Maker @Moustafa-Banbouk /fyi

@DamithDR
Contributor

DamithDR commented Jun 20, 2023

@The-One-Who-Speaks-and-Depicts
Glad that it helped :)
About the PR: I think this issue only reproduces on servers that have multiple GPUs. The real issue is in the Seq2SeqDataset class, where it initiates a pool of processes to get the sample list. A proper fix will have to look into this area.
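
Roughly the pattern being described, for context; this is only an illustration of the kind of Pool-based preprocessing in question, not the library's actual code, and the sequential branch is the path the use_multiprocessing=False workaround falls back to:

from multiprocessing import Pool

def encode_example(example):
    # tokenize/encode a single (input_text, target_text) pair
    return example

def build_examples(data, use_multiprocessing, process_count=4, chunksize=500):
    if use_multiprocessing:
        # parallel path: a pool of worker processes maps over the samples
        with Pool(process_count) as pool:
            return list(pool.imap(encode_example, data, chunksize=chunksize))
    # sequential fallback when multiprocessing is disabled
    return [encode_example(example) for example in data]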

@The-One-Who-Speaks-and-Depicts

@DamithDR I had this issue on my laptop, and on a server where I used only one GPU.
