Early Stopping tf.keras Crashes #4027

Open
AllardJM opened this issue Mar 8, 2024 · 0 comments

AllardJM commented Mar 8, 2024

I am using SageMaker and Horovod with TensorFlow Keras. The error I am seeing suggests that when the rank 0 process stops due to early stopping, the other processes continue and then crash when they try to communicate with the stopped process.

I am using Keras model.fit() with an EarlyStopping callback that is added only on rank 0:

    if hvd.rank() == 0:
        callbacks.append(EarlyStopping(monitor="val_factorized_top_k/top_10_categorical_accuracy", patience=2, mode='max', verbose=1, restore_best_weights=True, start_from_epoch=1))

......

    tf_model.fit(
        interactions.batch(batchsize),
        epochs=epochs,
        callbacks=callbacks,
        validation_data=val_ds,
        verbose=1 if hvd.rank() == 0 else 0,
    )

How can the early stopping decision be communicated to the other processes to avoid this? Is there anything else that needs to be done to keep the model in sync when rank 0 runs validation and potentially stops?

Error:

    AlgorithmError: UnknownError: ExitCode 1 ErrorMessage "tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllreduce_device_/job:localhost/replica:0/task:0/device:CPU:0}} Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message. [Op:HorovodAllreduce]
    2024-03-08 02:22:51.376469: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at mpi_ops.cc:497 : UNKNOWN: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
    Traceback (most recent call last)
    File "/usr/local/lib/python3.9/
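
For anyone reading along, here is a small framework-free sketch of why the stop decision has to be identical on every rank. It is only an illustration under stated assumptions: `early_stop_epoch` and `averaged_history` are hypothetical helpers written for this example, the metric values are made up, and the patience logic is a simplified stand-in rather than the Keras implementation. Each rank applies the same patience rule; on its own local metric the ranks pick different stop epochs, while on a metric averaged across ranks they pick the same one. Horovod's Keras integration provides `hvd.callbacks.MetricAverageCallback` for this kind of averaging, but whether registering it together with EarlyStopping on all ranks resolves this particular crash has not been confirmed in this thread.

```python
from statistics import mean


def early_stop_epoch(history, patience, mode="max"):
    """Return the epoch index at which patience runs out, or None.

    Simplified stand-in for patience-based early stopping on a single
    metric series; this is not the Keras implementation.
    """
    best = None
    wait = 0
    for epoch, value in enumerate(history):
        if best is None:
            improved = True
        elif mode == "max":
            improved = value > best
        else:
            improved = value < best
        if improved:
            best, wait = value, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


def averaged_history(per_rank_history):
    """Average each epoch's metric across ranks, as an allreduce-average would."""
    return [mean(values) for values in zip(*per_rank_history)]


# Hypothetical per-rank validation metrics, one value per epoch.
rank0 = [0.10, 0.20, 0.19, 0.18, 0.17]
rank1 = [0.10, 0.15, 0.22, 0.21, 0.20]

# Decisions made on local metrics can disagree, so one rank would leave
# the collective operations before the other.
local_stops = [early_stop_epoch(h, patience=2) for h in (rank0, rank1)]

# On the averaged series every rank evaluates identical numbers and
# therefore reaches the same decision at the same epoch.
shared = averaged_history([rank0, rank1])
synced_stops = [early_stop_epoch(shared, patience=2) for _ in (rank0, rank1)]

print("local:", local_stops)
print("synced:", synced_stops)
```

The setup in the post above is the extreme case of the first scenario: only rank 0 has a stopping rule at all, so the other ranks never stop and keep issuing allreduce calls after rank 0 has finished.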

AllardJM added the bug label Mar 8, 2024