I have followed the example in the Horovod Spark Estimator Keras notebook: https://learn.microsoft.com/en-us/azure/databricks/_extras/notebooks/source/deep-learning/horovod-spark-estimator-keras.html

Everything works fine, but the notebook doesn't mention how to restore the KerasModel or KerasEstimator from a checkpoint to resume training. I have checked the Horovod API reference (https://horovod.readthedocs.io/en/latest/api.html#) but found nothing useful.
Here is what I have tried:
After my initial run of

```python
keras_model = keras_estimator.fit(train_df).setOutputCols(['predict'])
```

all the checkpoint data were saved under the DBFS location `work_dir`. To restore from that checkpoint, this is what I did:

```python
checkpoint = ModelCheckpoint(work_dir + '/runs/' + run_id + '/checkpoint.tf', monitor='loss')
callbacks_list = [checkpoint]
```
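For context, this is roughly how that callback feeds into the estimator. It's a simplified sketch: the model architecture, optimizer, loss, column names, and `num_proc` below are illustrative placeholders rather than my exact configuration; `work_dir` and `run_id` are taken from the paths that appear in the error log further down.

```python
# Simplified sketch of the estimator setup. The model, optimizer, loss, columns and
# num_proc are illustrative placeholders, not my exact configuration. work_dir and
# run_id match the DBFS paths visible in the error log below.
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint
from horovod.spark.keras import KerasEstimator
from horovod.spark.common.store import Store

work_dir = '/dbfs/horovod_spark_estimator/b2691623-bc61-45f0-b0d3-c08d147d1291'
run_id = 'keras_1698166317'          # run id produced by the initial training run
store = Store.create(work_dir)       # same store / work_dir as the first run

# Placeholder model -- stands in for the network trained in the first run
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Checkpoint callback pointed at the checkpoint written by the first run
checkpoint = ModelCheckpoint(work_dir + '/runs/' + run_id + '/checkpoint.tf', monitor='loss')
callbacks_list = [checkpoint]

keras_estimator = KerasEstimator(
    num_proc=2,
    store=store,
    model=model,
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss='mse',
    feature_cols=['features'],
    label_cols=['label'],
    callbacks=callbacks_list,
    batch_size=32,
    epochs=5,
    verbose=1,
)
```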
Btw, I also tried a function called `_load_model_from_checkpoint` (`keras_estimator._load_model_from_checkpoint(run_id)`), and `model = horovod.tensorflow.keras.load_model(work_dir + '/runs/' + run_id + '/checkpoint.tf/')`. That did reload the model, but I couldn't resume training the reloaded model with `model.fit`.
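For completeness, the reload attempt looked roughly like this. It's a sketch: the in-memory numpy data and the `fit()` arguments are illustrative placeholders, not my actual pipeline.

```python
# Sketch of the reload-and-resume attempt. The numpy data and fit() arguments are
# placeholders; the point is that hvd.load_model does bring the model back (with its
# optimizer wrapped in a Horovod DistributedOptimizer), but fit() on the reloaded
# model is where I got stuck.
import numpy as np
import horovod.tensorflow.keras as hvd

hvd.init()

# work_dir and run_id as defined in the sketch above
ckpt_path = work_dir + '/runs/' + run_id + '/checkpoint.tf/'
model = hvd.load_model(ckpt_path)

# Toy in-memory data just to exercise fit()
x = np.random.rand(256, 10).astype('float32')
y = np.random.rand(256, 1).astype('float32')
model.fit(x, y, batch_size=32, epochs=1)
```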
So, what I did was to create the same KerasEstimator, with the checkpoint callback and store from above, and call fit again:

```python
keras_model = keras_estimator.fit(train_df).setOutputCols(['predict'])
```

But I got this error:

```
[1,0]<stderr>:2023-10-25 15:45:52.693371: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]<stderr>:2023-10-25 15:45:52.693371: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
[1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,0]<stderr>:2023-10-25 15:45:52.859643: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1,1]<stderr>:2023-10-25 15:45:52.859642: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[1,1]<stderr>:2023-10-25 15:45:52.918050: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,1]<stderr>:2023-10-25 15:45:52.918081: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[1,0]<stderr>:2023-10-25 15:45:52.921956: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,0]<stderr>:2023-10-25 15:45:52.921983: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[1,1]<stderr>:2023-10-25 15:45:53.775935: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,0]<stderr>:2023-10-25 15:45:53.782831: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,1]<stderr>:2023-10-25 15:45:53.833283: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,1]<stderr>:2023-10-25 15:45:53.833313: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[1,0]<stderr>:2023-10-25 15:45:53.841841: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,0]<stderr>:2023-10-25 15:45:53.841874: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[1,0]<stderr>:2023-10-25 15:45:55.508106: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,0]<stderr>:2023-10-25 15:45:55.508147: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
[1,0]<stderr>:2023-10-25 15:45:55.508168: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (0828-195106-3fr7yeq8-10-125-12-137): /proc/driver/nvidia/version does not exist
[1,1]<stderr>:2023-10-25 15:45:55.510473: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/lib:
[1,1]<stderr>:2023-10-25 15:45:55.510504: W tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: UNKNOWN ERROR (303)
[1,1]<stderr>:2023-10-25 15:45:55.510524: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (0828-195106-3fr7yeq8-10-125-12-137): /proc/driver/nvidia/version does not exist
[1,0]<stderr>:2023-10-25 15:45:55.516080: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
[1,0]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]<stderr>:2023-10-25 15:45:55.518183: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
[1,1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1,1]<stderr>:/databricks/python/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
[1,1]<stderr>: self._filesystem = pyarrow.localfs
[1,0]<stderr>:/databricks/python/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
[1,0]<stderr>: self._filesystem = pyarrow.localfs
[1,0]<stderr>:WARNING:tensorflow:From /databricks/python/lib/python3.9/site-packages/horovod/spark/keras/util.py:71: unbatch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
[1,0]<stderr>:Instructions for updating:
[1,0]<stderr>:Use `tf.data.Dataset.unbatch()`.
[1,1]<stderr>:WARNING:tensorflow:From /databricks/python/lib/python3.9/site-packages/horovod/spark/keras/util.py:71: unbatch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
[1,1]<stderr>:Instructions for updating:
[1,1]<stderr>:Use `tf.data.Dataset.unbatch()`.
[1,1]<stderr>:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0013s vs `on_train_batch_end` time: 0.0428s). Check your callbacks.
[1,0]<stderr>:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0023s vs `on_train_batch_end` time: 0.0991s). Check your callbacks.
[1,1]<stderr>:/databricks/python/lib/python3.9/site-packages/petastorm/tf_utils.py:378: UserWarning: Running multiple iterations over make_petastorm_dataset is not recommend for performance issue. Use Reader's num_epochs contructor arguments to set number of iterations,or use tf.data.Dataset's cache() function to cache data of first iteration beforecalling 'repeat' method of Datset class.
[1,1]<stderr>: warnings.warn(_RESET_READER_WARN, category=UserWarning)
[1,0]<stderr>:/databricks/python/lib/python3.9/site-packages/petastorm/tf_utils.py:378: UserWarning: Running multiple iterations over make_petastorm_dataset is not recommend for performance issue. Use Reader's num_epochs contructor arguments to set number of iterations,or use tf.data.Dataset's cache() function to cache data of first iteration beforecalling 'repeat' method of Datset class.
[1,0]<stderr>: warnings.warn(_RESET_READER_WARN, category=UserWarning)
[1,0]<stderr>:2023-10-25 15:46:50.671958: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at save_restore_v2_ops.cc:284 : NOT_FOUND: /dbfs/horovod_spark_estimator/b2691623-bc61-45f0-b0d3-c08d147d1291/runs/keras_1698166317/checkpoint.tf/variables/variables_temp/part-00000-of-00001.data-00000-of-00001; No such file or directory
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
[1,0]<stderr>: return _run_code(code, main_globals, None,
[1,0]<stderr>: File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
[1,0]<stderr>: exec(code, run_globals)
[1,0]<stderr>: File "/databricks/python/lib/python3.9/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 52, in <module>
[1,0]<stderr>: main(codec.loads_base64(sys.argv[1]), codec.loads_base64(sys.argv[2]))
[1,0]<stderr>: File "/databricks/python/lib/python3.9/site-packages/horovod/spark/task/mpirun_exec_fn.py", line 45, in main
[1,0]<stderr>: task_exec(driver_addresses, settings, 'OMPI_COMM_WORLD_RANK', 'OMPI_COMM_WORLD_LOCAL_RANK')
[1,0]<stderr>: File "/databricks/python/lib/python3.9/site-packages/horovod/spark/task/__init__.py", line 61, in task_exec
[1,0]<stderr>: result = fn(*args, **kwargs)
[1,0]<stderr>: File "/databricks/python/lib/python3.9/site-packages/horovod/spark/keras/remote.py", line 263, in train
[1,0]<stderr>: history = fit(model, dm, steps_per_epoch,
[1,0]<stderr>: File "/databricks/python/lib/python3.9/site-packages/horovod/spark/keras/util.py", line 41, in fn
[1,0]<stderr>: return model.fit(dm.train_data(),
[1,0]<stderr>: File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-9c693e0f-b633-4b2d-bc74-15b6e705c45c/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
[1,0]<stderr>: raise e.with_traceback(filtered_tb) from None
[1,0]<stderr>: File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-9c693e0f-b633-4b2d-bc74-15b6e705c45c/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
[1,0]<stderr>: tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
[1,0]<stderr>:tensorflow.python.framework.errors_impl[1,0]<stderr>:.NotFoundError: {{function_node __wrapped__MergeV2Checkpoints_device_/job:localhost/replica:0/task:0/device:CPU:0}} /dbfs/horovod_spark_estimator/b2691623-bc61-45f0-b0d3-c08d147d1291/runs/keras_1698166317/checkpoint.tf/variables/variables_temp/part-00000-of-00001.data-00000-of-00001; No such file or directory [Op:MergeV2Checkpoints]
[1,0]<stderr>:2023-10-25 15:46:50.817589: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
```
Can anyone please help? Thank you!