Resolve TF saved model not portable issue with tf.keras.optimizers #4031

Open · wants to merge 1 commit into master

Conversation

@supercharleszhu (Contributor) commented Mar 15, 2024

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #4028
Root Cause: the self._allreduce_grad will be treated as a tf.function when we initialize the optimizer outside tf.function(code). Since all the concrete function will be saved into saved_model.pb (see code here), it will cause the HorovodAllRuduce Op to be saved into the graph, which might not be registered in other environment.

When using tf.keras.optimizers.legacy, this path is never reached: when model.fit is called, optimizer.minimize calls _compute_gradients, which is not overridden by the compute_gradients method of DistributedOptimizer, so the distributed optimizer is not taking effect at all (see the sketch below).
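For context, the override mismatch on the legacy path boils down to this toy sketch (hypothetical classes, not Horovod's actual ones; scaling by 0.5 stands in for the allreduce):

```python
class LegacyOptimizer:
    def minimize(self, grads_fn):
        # model.fit goes through this private path on the legacy optimizer.
        return self._compute_gradients(grads_fn)

    def _compute_gradients(self, grads_fn):
        return grads_fn()

class DistributedWrapper(LegacyOptimizer):
    # Only the public method is overridden, so minimize() never reaches it:
    # self._compute_gradients still resolves to the base class.
    def compute_gradients(self, grads_fn):
        return [g * 0.5 for g in super()._compute_gradients(grads_fn)]

opt = DistributedWrapper()
print(opt.minimize(lambda: [2.0, 4.0]))  # [2.0, 4.0] -- the wrapper's override is skipped
```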

Resolution

We don't need to explicitly register allreduce_grad as a tf.function: in graph mode it is traced inside the outer tf.function, so the Horovod ops are not saved as a separate concrete function. A minimal illustration of this tracing behavior follows.
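Sketch of the TensorFlow behavior this relies on (plain TF, no Horovod): a regular Python function called inside a tf.function is inlined into the outer graph and never gets a concrete function of its own.

```python
import tensorflow as tf

def inner(x):
    # Plain Python function: no tf.function decorator, so no concrete
    # function of its own is ever created or serialized.
    return x * 2.0

@tf.function
def outer(x):
    return inner(x) + 1.0  # inner is traced inline into outer's graph

print(outer(tf.constant(3.0)))  # tf.Tensor(7.0, ...)
# Only `outer` owns a ConcreteFunction, so only its graph would be
# attached to a SavedModel that captures it.
```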

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.


Unit Test Results

  568 files (−156), 568 suites (−156), 6h 41m 28s ⏱️ (−1h 24m 39s)
  887 tests (±0): 692 ✅ (−76), 195 💤 (+76), 0 ❌ (±0)
  12 738 runs (−3 471): 8 641 ✅ (−2 712), 4 097 💤 (−759), 0 ❌ (±0)

Results for commit 36ea180. Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…


Unit Test Results (with flaky tests)

  593 files (−295), 593 suites (−295), 6h 50m 12s ⏱️ (−2h 8m 29s)
  887 tests (±0): 691 ✅ (−77), 195 💤 (+76), 0 ❌ (±0), 1 🔥 (+1)
  13 081 runs (−7 158): 8 926 ✅ (−4 863), 4 154 💤 (−2 296), 0 ❌ (±0), 1 🔥 (+1)

For more details on these errors, see this check.

Results for commit 36ea180. Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

@EnricoMi (Collaborator) commented Mar 21, 2024

Needs #4033 and #4035.

@supercharleszhu (Contributor, Author)

@EnricoMi Has the CI been fixed? It seems the above two PRs have been checked in.

Successfully merging this pull request may close these issues:

Tensorflow Saved model not portable with latest tf.keras.optimizers