Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
We hit an issue after running TensorFlow training with Horovod, in both CPU and GPU execution. The TF SavedModel cannot be loaded outside a Horovod environment because a HorovodAllreduce op appears to be saved into the model unexpectedly.
To reproduce: run the following script, which trains a simple Keras model from the test case and saves it:
# test.py
import horovod.tensorflow as hvd
import tensorflow as tf
import keras
import numpy as np

hvd.init()

initial_lr = 0.1 * hvd.size()
opt = tf.keras.optimizers.Adam()
opt = hvd.DistributedOptimizer(opt)


def linear_multiplier(epoch):
    return epoch


model = keras.models.Sequential()
model.add(keras.layers.Dense(2, input_shape=(3,)))
model.add(keras.layers.RepeatVector(3))
model.add(keras.layers.ThresholdedReLU(0.5))
model.compile(loss=keras.losses.mean_squared_error,
              optimizer=opt,
              metrics=[keras.metrics.categorical_accuracy],
              experimental_run_tf_function=False)
x = np.random.random((10, 3))
y = np.random.random((10, 3, 2))

train_history = model.fit(x,
                          y,
                          steps_per_epoch=5,
                          epochs=20)

# test that the metrics average is being respected
loss_metrics = train_history.history["loss"]
loss_metrics_tensor = tf.convert_to_tensor(
    loss_metrics, dtype=tf.float32)
expected_loss_metrics_tensor = hvd.broadcast(
    loss_metrics_tensor, root_rank=0)

if hvd.rank() == 0:
    tf.saved_model.save(model, "test_space/hvd_saved_model_2")
Run it with python test.py.
Then load the model in a process where Horovod has not been imported:
# test_2.py
import tensorflow as tf
tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_2")
Run it with python test_2.py.
It fails with:
Traceback (most recent call last):
  File "/home/chzhu/test_space/test_tf_saved_model.py", line 3, in <module>
    tf.saved_model.load("/home/chzhu/test_space/hvd_saved_model_1")
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 828, in load
    result = load_partial(export_dir, None, tags, options)["root"]
  File "/home/chzhu/test_space/env_310/lib/python3.10/site-packages/tensorflow/python/saved_model/load.py", line 961, in load_partial
    raise FileNotFoundError(
FileNotFoundError: Op type not registered 'HorovodAllreduce' in binary running on chzhu-ld4.linkedin.biz. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
You may be trying to load on a different device from the computational device. Consider setting the `experimental_io_device` option in `tf.saved_model.LoadOptions` to the io_device such as '/job:localhost'.
Note: Reverting to Horovod 0.26 or switching to tf.keras.optimizers.legacy resolves this issue, but we would like to keep using the latest Horovod.
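To check a saved model for this problem without loading it, a rough sketch like the following can help. It is not part of Horovod or TensorFlow: the helper name references_op is hypothetical. It assumes the export directory contains the standard saved_model.pb file and that op type names are stored in it as plain byte strings, so a raw byte search is enough for a quick check.

```python
from pathlib import Path


def references_op(export_dir, op_name="HorovodAllreduce"):
    """Return True if the serialized SavedModel mentions op_name.

    Hypothetical helper, not a Horovod or TensorFlow API.
    Assumption: the graph is stored in <export_dir>/saved_model.pb and
    op type names appear there as raw UTF-8 bytes. The check is a plain
    substring search, so it can also match the string inside node names.
    """
    pb_path = Path(export_dir) / "saved_model.pb"
    return op_name.encode("utf-8") in pb_path.read_bytes()
```

Calling references_op("test_space/hvd_saved_model_2") on the model saved by test.py should show whether the Horovod op name ended up in the serialized model, without needing TensorFlow or Horovod installed.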
supercharleszhu changed the title from "Saved model not portable with HorovodAllReduceOps" to "Tensorflow Saved model not portable with HorovodAllReduceOps" on Mar 11, 2024
supercharleszhu changed the title from "Tensorflow Saved model not portable with HorovodAllReduceOps" to "Tensorflow Saved model not portable with HorovodAllReduce Ops" on Mar 11, 2024
supercharleszhu changed the title from "Tensorflow Saved model not portable with HorovodAllReduce Ops" to "Tensorflow Saved model not portable with latest tf.keras.optimizers" on Mar 15, 2024