Early Stopping tf.keras Crashes #4027

AllardJM · 2024-03-08T03:16:15Z

I am using SageMaker and Horovod with TensorFlow Keras. The error I am seeing suggests that when the rank 0 process stops due to early stopping, the other processes continue and then crash when they try to communicate with the stopped process.

I am calling fit() on the Keras model with an EarlyStopping callback that is added only on rank 0:

    if hvd.rank() == 0:
        callbacks.append(EarlyStopping(monitor="val_factorized_top_k/top_10_categorical_accuracy", patience=2, mode='max', verbose=1, restore_best_weights=True, start_from_epoch=1))

......

    tf_model.fit(
        interactions.batch(batchsize),
        epochs=epochs,
        callbacks=callbacks,
        validation_data=val_ds,
        verbose=1 if hvd.rank() == 0 else 0,
    )

How can the early stopping decision be communicated to the other processes to avoid this? Is there anything else that needs to be done to ensure that the model stays in sync when rank 0 runs validation and potentially stops?

Error:

    AlgorithmError: UnknownError: ExitCode 1 ErrorMessage "tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllreduce_device_/job:localhost/replica:0/task:0/device:CPU:0}} Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message. [Op:HorovodAllreduce]
    2024-03-08 02:22:51.376469: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at mpi_ops.cc:497 : UNKNOWN: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
    Traceback (most recent call last)
    File "/usr/local/lib/python3.9/
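
To illustrate the question above, here is a minimal pure-Python sketch of why the stop decision has to be identical on every rank. It does not use Horovod or TensorFlow. `EarlyStopper` and `stop_epoch` are hypothetical, simplified stand-ins for Keras `EarlyStopping` with `mode='max'` (they ignore `restore_best_weights` and `start_from_epoch`), and the metric values are made up. When each rank decides from its own metric, the ranks stop at different epochs, which is the situation the error message describes. When every rank decides from the same cross-rank average, they stop together.

```python
# Pure-Python simulation; no Horovod or TensorFlow required.
# EarlyStopper is a hypothetical, simplified model of early stopping
# with mode='max'. It is not the Keras implementation.

class EarlyStopper:
    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.wait = 0

    def update(self, metric):
        """Return True when training should stop after this epoch."""
        if metric > self.best:
            self.best = metric
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def stop_epoch(metrics, patience=2):
    """Index of the epoch at which the stopper fires, or None if it never does."""
    stopper = EarlyStopper(patience)
    for epoch, metric in enumerate(metrics):
        if stopper.update(metric):
            return epoch
    return None


# Made-up validation metrics, one list per rank (illustration only).
per_rank = [
    [0.10, 0.20, 0.19, 0.18, 0.17, 0.16],  # rank 0
    [0.10, 0.20, 0.23, 0.22, 0.21, 0.20],  # rank 1
]

# Each rank decides from its own metric: the stop epochs can differ.
local_stops = [stop_epoch(m) for m in per_rank]

# Every rank decides from the same cross-rank average: the stop epochs agree.
averaged = [sum(vals) / len(vals) for vals in zip(*per_rank)]
synced_stops = [stop_epoch(averaged) for _ in per_rank]

print("local decisions: ", local_stops)
print("synced decisions:", synced_stops)
```

In a real Horovod job, the averaging step would presumably be handled by `hvd.callbacks.MetricAverageCallback`, which Horovod's Keras documentation says must be placed in the callback list before metric-based callbacks. Under that assumption, `EarlyStopping` would be registered on every rank rather than only on rank 0, and every rank would need to run validation so that the monitored metric exists everywhere. This has not been verified against the setup in this issue.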

AllardJM added the bug label Mar 8, 2024
