Seq2seq: failure when evaluating while training #1522

Open
Futyn-Maker opened this issue May 7, 2023 · 8 comments · May be fixed by #1533
Comments

@Futyn-Maker

Describe the bug
When training Seq2Seq-type models with evaluation during training enabled, a pandas error (ValueError: All arrays must be of the same length) occurs during evaluation in Google Colab (free plan).
To Reproduce
Steps to reproduce the behavior:

import torch
from simpletransformers.seq2seq import Seq2SeqModel


def main(args):
    # train_df / eval_df (with "input_text" and "target_text" columns) and the
    # metric functions count_matches, accuracy_score, f1_score are defined
    # elsewhere in the script.
    model_args = {
        "do_lower_case": True,
        "reprocess_input_data": True,
        "overwrite_output_dir": True,
        "max_seq_length": max([len(token) for token in train_df["target_text"].tolist()]),
        "train_batch_size": 256,
        "num_train_epochs": 5,
        "save_eval_checkpoints": False,
        "save_model_every_epoch": False,
        "evaluate_during_training": True,
        "evaluate_during_training_verbose": True,
        "use_multiprocessing": False,
        "save_best_model": False,
        "max_length": max([len(token) for token in train_df["input_text"].tolist()]),
        "save_steps": -1,
    }
    model = Seq2SeqModel(
        encoder_decoder_type="bart",
        encoder_decoder_name="facebook/bart-base",
        args=model_args,
        use_cuda=torch.cuda.is_available(),
    )
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)
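
For completeness, this is roughly the shape of data the snippet above assumes; the column names come from the script itself, while the rows below are made-up placeholders, not the actual lemmatisation data:

import pandas as pd

# Placeholder frames only: the real data comes from the
# transformer-lemmatiser-ruthenian project and is not shown here.
train_df = pd.DataFrame(
    {
        "input_text": ["running", "mice"],
        "target_text": ["run", "mouse"],
    }
)
eval_df = pd.DataFrame(
    {
        "input_text": ["went"],
        "target_text": ["go"],
    }
)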

Expected behavior
Training and evaluation complete without failures.

Screenshots
Not applicable

Desktop (please complete the following information):

  • OS: Windows 11 (but in fact running in Google Colab)

Additional context

Here are the reduced logs:

2023-05-06 17:56:43.337391: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading (…)lve/main/config.json: 100% 1.72k/1.72k [00:00<00:00, 8.86MB/s]
Downloading pytorch_model.bin: 100% 558M/558M [00:25<00:00, 21.6MB/s]
Downloading (…)olve/main/vocab.json: 100% 899k/899k [00:00<00:00, 1.29MB/s]
Downloading (…)olve/main/merges.txt: 100% 456k/456k [00:00<00:00, 875kB/s]
Downloading (…)/main/tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 1.56MB/s]
100% 79032/79032 [00:21<00:00, 3710.01it/s]
Epoch 1 of 5:   0% 0/5 [00:00<?, ?it/s]
Running Epoch 0 of 5:   0% 0/309 [00:00<?, ?it/s]
Epochs 1/5. Running Loss:   10.1010:   0% 0/309 [00:03<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

Epochs 1/5. Running Loss:   10.1010:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   0% 1/309 [00:03<18:27,  3.60s/it]
Epochs 1/5. Running Loss:   10.5836:   1% 2/309 [00:03<08:38,  1.69s/it]
...
Epochs 1/5. Running Loss:    0.0396: 100% 309/309 [02:17<00:00,  2.26it/s]
  0% 0/10011 [00:00<?, ?it/s]
  0% 1/10011 [00:31<86:57:41, 31.27s/it] (some strange deadlock here)
100% 10011/10011 [01:14<00:00, 135.21it/s]
Epoch 1 of 5:   0% 0/5 [03:59<?, ?it/s]
Traceback (most recent call last):
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 56, in <module>
    main(args)
  File "/content/transformer-lemmatiser-ruthenian/seq2seq.py", line 45, in main
    model.train_model(train_df, eval_data=eval_df, matches=count_matches, accuracy=accuracy_score, f1=f1_score)
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 450, in train_model
    global_step, training_details = self.train(
  File "/usr/local/lib/python3.10/dist-packages/simpletransformers/seq2seq/seq2seq_model.py", line 1005, in train
    report = pd.DataFrame(training_progress_scores)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 664, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 493, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 118, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/usr/local/lib/python3.10/dist-packages/pandas/core/internals/construction.py", line 666, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
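
The ValueError itself is pandas refusing to build a DataFrame from a dict whose lists have different lengths; per the traceback, this is what happens to training_progress_scores when one of its metric lists ends up shorter than the others. A minimal standalone illustration of the same failure (metric names and values below are made up):

import pandas as pd

# A dict of metric-name -> list of per-evaluation values, as
# training_progress_scores is; if one list is missing an entry,
# pd.DataFrame raises the error from the traceback above.
scores = {
    "global_step": [309, 618],
    "eval_loss": [0.12, 0.09],
    "matches": [512],  # one entry short
}
pd.DataFrame(scores)  # ValueError: All arrays must be of the same length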
@The-One-Who-Speaks-and-Depicts

The same problem; it repeats almost everywhere under the same conditions. OSes: Ubuntu and Windows 11.

@Moustafa-Banbouk

Hey @The-One-Who-Speaks-and-Depicts and @Futyn-Maker, any luck in solving this issue? I am facing the same error during model training.

@The-One-Who-Speaks-and-Depicts

@Moustafa-Banbouk I have been experiencing this for a year or so, with no ideas. I just switched the validation off in the args and called it a day.

@Futyn-Maker
Author

I just switched the validation off in args, and called it a day.

Same here for now; it didn't really interfere with the project I was working on at the time, but I consider it an extremely critical bug.

@DamithDR
Contributor

DamithDR commented Jun 19, 2023

@The-One-Who-Speaks-and-Depicts @Futyn-Maker @Moustafa-Banbouk Can you try disabling multiprocessing using:

use_multiprocessing = False
use_multiprocessing_for_evaluation = False
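
For anyone trying this, the two flags go into the same model_args dict as in the original snippet (a sketch; the other args are unchanged and elided here):

model_args = {
    # ... other args from the original snippet ...
    "evaluate_during_training": True,
    # suggested workaround: disable multiprocessing for both the training
    # and the evaluation data preparation
    "use_multiprocessing": False,
    "use_multiprocessing_for_evaluation": False,
}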

@The-One-Who-Speaks-and-Depicts

@DamithDR At least in my case it works; I created a PR.

@Futyn-Maker @Moustafa-Banbouk /fyi

@DamithDR
Contributor

DamithDR commented Jun 20, 2023

@The-One-Who-Speaks-and-Depicts
Glad that it helped :)
About the PR: I think this issue only reproduces on servers that have multiple GPUs. The real issue is in the Seq2SeqDataset class, where it initiates a pool of processes to get the sample list. A proper fix will have to look into this area.
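
Roughly the pattern being described, for context; this is only an illustration of the kind of Pool-based preprocessing in question, not the library's actual code, and the sequential branch is the path the use_multiprocessing=False workaround falls back to:

from multiprocessing import Pool

def encode_example(example):
    # tokenize/encode a single (input_text, target_text) pair
    return example

def build_examples(data, use_multiprocessing, process_count=4, chunksize=500):
    if use_multiprocessing:
        # parallel path: a pool of worker processes maps over the samples
        with Pool(process_count) as pool:
            return list(pool.imap(encode_example, data, chunksize=chunksize))
    # sequential fallback when multiprocessing is disabled
    return [encode_example(example) for example in data]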

@The-One-Who-Speaks-and-Depicts

@DamithDR I had this issue on my laptop, and on a server where I used only one GPU.
