I am using SageMaker and Horovod with TensorFlow Keras. The error I am seeing suggests that when the rank 0 process stops because of early stopping, the other processes keep running and then crash when they try to communicate with the stopped process.
I am using keras.fit() with a callback that is added only on rank 0:
    if hvd.rank() == 0:
        callbacks.append(EarlyStopping(monitor="val_factorized_top_k/top_10_categorical_accuracy", patience=2, mode='max', verbose=1, restore_best_weights=True, start_from_epoch=1))
How can the early-stopping decision be communicated to the other processes to avoid this? Is there anything else that needs to be done to keep the model in sync when rank 0 runs validation and potentially stops?
Error:
AlgorithmError: UnknownError: ExitCode 1 ErrorMessage "tensorflow.python.framework.errors_impl.UnknownError: {{function_node __wrapped__HorovodAllreduce_device_/job:localhost/replica:0/task:0/device:CPU:0}} Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message. [Op:HorovodAllreduce]
2024-03-08 02:22:51.376469: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at mpi_ops.cc:497 : UNKNOWN: Horovod has been shut down. This was caused by an exception on one of the ranks or an attempt to allreduce, allgather or broadcast a tensor after one of the ranks finished execution. If the shutdown was caused by an exception, you should see the exception in the log before the first shutdown message.
Traceback (most recent call last)
File "/usr/local/lib/python3.9/
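To illustrate why the ranks fall out of step, here is a small pure-Python simulation, not Horovod code. The `EarlyStopper` class, the `epochs_run` helper, and the metric values are all hypothetical and only mimic the patience logic of Keras `EarlyStopping` with `mode='max'` (it ignores `start_from_epoch` and `restore_best_weights`). It compares the setup described above (stopper on rank 0 only) with a setup where every rank has the stopper and reads the same cross-rank average of the metric. In Horovod's Keras API, that averaging is what `hvd.callbacks.MetricAverageCallback()` provides when it is listed before the metric-based callbacks. Whether this matches the actual training script is an assumption, since validation must then run on every rank.

```python
from statistics import mean


class EarlyStopper:
    """Hypothetical stand-in for Keras EarlyStopping with mode='max'."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.wait = 0

    def should_stop(self, value):
        # Reset the counter on improvement, otherwise count a stale epoch.
        if value > self.best:
            self.best = value
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def epochs_run(per_rank_metrics, patience, stopper_ranks, average_across_ranks):
    """Return how many epochs each rank completes before it stops."""
    num_ranks = len(per_rank_metrics)
    num_epochs = len(per_rank_metrics[0])
    stoppers = {rank: EarlyStopper(patience) for rank in stopper_ranks}
    completed = [num_epochs] * num_ranks  # ranks without a stopper run to the end
    stopped = set()
    for epoch in range(num_epochs):
        local = [metrics[epoch] for metrics in per_rank_metrics]
        shared = mean(local)  # what an allreduce-average would hand to every rank
        for rank, stopper in stoppers.items():
            if rank in stopped:
                continue
            value = shared if average_across_ranks else local[rank]
            if stopper.should_stop(value):
                completed[rank] = epoch + 1
                stopped.add(rank)
    return completed


# Made-up validation metrics for two ranks over six epochs.
metrics = [
    [0.10, 0.20, 0.19, 0.18, 0.17, 0.16],  # rank 0
    [0.10, 0.15, 0.22, 0.21, 0.20, 0.19],  # rank 1
]

# Setup from the question: only rank 0 has the stopper and sees its own metric.
only_rank0 = epochs_run(metrics, patience=2, stopper_ranks=[0], average_across_ranks=False)

# Alternative: every rank has the stopper and sees the averaged metric.
all_ranks = epochs_run(metrics, patience=2, stopper_ranks=[0, 1], average_across_ranks=True)

print(only_rank0)  # [4, 6]: rank 0 exits while rank 1 keeps issuing collectives
print(all_ranks)   # [5, 5]: both ranks stop after the same epoch
```

In the first case rank 0 finishes two epochs before rank 1, which is the situation the "Horovod has been shut down" message describes. In the second case every rank evaluates the same value with the same patience, so all of them reach the stop decision in the same epoch.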