save_steps not working, checkpoint getting generated on every epoch more than once #1542

Open
guruprasaad123 opened this issue Jul 28, 2023 · 2 comments

Comments


guruprasaad123 commented Jul 28, 2023

Describe the bug
We are trying not to save the model's checkpoint too frequently because we are running low on storage. Our idea is to save a checkpoint every 4 or 5 epochs. Kindly check the code below, which we use to accomplish this task.

import math
import os
from multiprocessing import cpu_count

from simpletransformers.classification import ClassificationModel, ClassificationArgs

SAVE_EVERY_N_EPOCHS = 4
train_batch_size = 8
steps_per_epoch = math.floor(len(train_df) / train_batch_size)  # batches per epoch
save_steps = steps_per_epoch * SAVE_EVERY_N_EPOCHS  # checkpoint every 4 epochs

print(f'save_steps : {save_steps}')

args = {
    "output_dir": os.path.join(output_dir,"outputs"),
    "cache_dir": os.path.join(output_dir, "cache_dir"),
    "fp16": True,
    "train_batch_size": train_batch_size,
    "num_train_epochs": 16,

    "save_steps": save_steps,
    "save_model_every_epoch": False,
    "overwrite_output_dir": True,
    "reprocess_input_data": False,
    "evaluate_during_training": True,
    "evaluate_during_training_verbose":True,

    "process_count": cpu_count() - 2 if cpu_count() > 2 else 1,
    "n_gpu": 1,
}

# Optional model configuration
model_args = ClassificationArgs(**args)

# Create a ClassificationModel
model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    args=model_args
) 

# Train the model
model.train_model(train_df, eval_df=test_df)

To Reproduce
Steps to reproduce the behavior:

  1. Copy the code above into a file, train.py
  2. Run the file with python3 train.py

Expected behavior
We expected the process to generate a checkpoint only after the 4th epoch, 8th epoch, and so on. Instead, it is generating checkpoints on every epoch, and additionally every 2000 steps.
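
For reference, here is what the snippet above works out to numerically, assuming len(train_df) is 40500 (the figure quoted later in this thread); nothing in this calculation produces checkpoints at multiples of 2000:

import math

train_len = 40500  # assumed from the inline comment in the second snippet below
train_batch_size = 8
SAVE_EVERY_N_EPOCHS = 4

steps_per_epoch = math.floor(train_len / train_batch_size)  # 5062
save_steps = steps_per_epoch * SAVE_EVERY_N_EPOCHS  # 20248

# Steps at which checkpoints would be expected over 16 epochs:
total_steps = steps_per_epoch * 16
print(list(range(save_steps, total_steps + 1, save_steps)))
# [20248, 40496, 60744, 80992]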

Screenshots
the process generates checkpoints like the following:

outputs
|----checkpoint-4000
|----checkpoint-10000
|----best_model
|----checkpoint-5063-epoch-1
|----checkpoint-6000
|----checkpoint-10126-epoch-2
|----checkpoint-8000
|----checkpoint-2000
|----eval_results.txt
|----training_progress_scores.csv

Desktop (please complete the following information):

  • Linux

Additional context
None

DamithDR (Contributor) commented

@guruprasaad123 I am wondering whether this is correct:

SAVE_EVERY_N_EPOCHS = 4
train_batch_size = 8
steps_per_epoch = math.floor(len(train_df) / train_batch_size)
save_steps = steps_per_epoch * SAVE_EVERY_N_EPOCHS

Basically, save_steps is more or less equal to len(train_df)/2, right?

So if you do the calculation with your len(train_df), and look at the logic in the library, it seems the behavior could be correct, right?

[screenshot of the checkpoint-saving logic, taken from classification_model.py]
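
For readers who cannot see the screenshot, the condition in question is essentially a modulo check on the global training step. The sketch below is a paraphrase under that assumption, not the library's verbatim code, and the helper name is illustrative:

import os

def maybe_save_checkpoint(global_step, save_steps, output_dir):
    # Paraphrased periodic-save condition; the real logic lives in the
    # training loop of classification_model.py.
    if save_steps > 0 and global_step % save_steps == 0:
        checkpoint_dir = os.path.join(output_dir, f"checkpoint-{global_step}")
        print(f"would save a checkpoint to {checkpoint_dir}")
        return True
    return False

maybe_save_checkpoint(2000, 20248, "outputs")   # no save at step 2000
maybe_save_checkpoint(20248, 20248, "outputs")  # save at step 20248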

guruprasaad123 (Author) commented

@DamithDR I believe not, because I took the code directly from the official documentation itself and produced the results with the code below:

import math
import os
from multiprocessing import cpu_count

from simpletransformers.classification import ClassificationModel, ClassificationArgs

SAVE_EVERY_N_EPOCHS = 4
train_batch_size = 8
steps_per_epoch = math.floor(len(train_df) / SAVE_EVERY_N_EPOCHS) # train_df -> 40500
save_steps = steps_per_epoch * SAVE_EVERY_N_EPOCHS # save_steps -> 40400
print(f'save_steps : {save_steps}')

args = {
    "output_dir": os.path.join(output_dir,"outputs"),
    "cache_dir": os.path.join(output_dir, "cache_dir"),
    "fp16": True,
    "train_batch_size": train_batch_size,
    "num_train_epochs": 16,

    "save_steps": save_steps,
    "save_model_every_epoch": False,
    "overwrite_output_dir": True,
    "reprocess_input_data": False,
    "evaluate_during_training": True,
    "evaluate_during_training_verbose":True,

    "process_count": cpu_count() - 2 if cpu_count() > 2 else 1,
    "n_gpu": 1,
}

# Optional model configuration
model_args = ClassificationArgs(**args)

# Create a ClassificationModel
model = ClassificationModel(
    'bert',
    'bert-base-uncased',
    num_labels=num_labels,
    args=model_args
) 

# Train the model
model.train_model(train_df, eval_df=test_df)

Source: simpletransformers documentation, Tips and Tricks

Results:

[screenshots: checkpoints directory listing and training progress]

Conclusion

As I ran the training script for 4 epochs, it was expected to create a checkpoint only after the 4th epoch, which isn't the case here. Please let me know if I am wrong.
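
One detail worth noting when comparing the two snippets in this thread: they compute steps_per_epoch with different divisors (train_batch_size in the first, SAVE_EVERY_N_EPOCHS in the second), which yields very different save_steps values. A quick comparison, again assuming len(train_df) is 40500:

import math

train_len = 40500  # per the inline comment in the second snippet

# First snippet: divide by the batch size.
save_steps_by_batch = math.floor(train_len / 8) * 4  # 20248

# Second snippet (as quoted from the docs): divide by SAVE_EVERY_N_EPOCHS.
save_steps_by_epochs = math.floor(train_len / 4) * 4  # 40500

print(save_steps_by_batch, save_steps_by_epochs)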
