Resolve TF saved model not portable issue with tf.keras.optimizers #4031

Open · wants to merge 1 commit into master

Conversation

@supercharleszhu (Contributor) commented Mar 15, 2024

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #4028
Root Cause: the self._allreduce_grad will be treated as a tf.function when we initialize the optimizer outside tf.function(code). Since all the concrete function will be saved into saved_model.pb (see code here), it will cause the HorovodAllRuduce Op to be saved into the graph, which might not be registered in other environment.

When using tf.keras.optimizers.legacy, this path is never reached: when model.fit is called, optimizer.minimize calls _compute_gradients, which is not overridden by the compute_gradients method of DistributedOptimizer, so the distributed optimizer is not taking effect at all (see the sketch below).
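For context, the override mismatch on the legacy path boils down to this toy sketch (hypothetical classes, not Horovod's actual ones; scaling by 0.5 stands in for the allreduce):

```python
class LegacyOptimizer:
    def minimize(self, grads_fn):
        # model.fit goes through this private path on the legacy optimizer.
        return self._compute_gradients(grads_fn)

    def _compute_gradients(self, grads_fn):
        return grads_fn()

class DistributedWrapper(LegacyOptimizer):
    # Only the public method is overridden, so minimize() never reaches it:
    # self._compute_gradients still resolves to the base class.
    def compute_gradients(self, grads_fn):
        return [g * 0.5 for g in super()._compute_gradients(grads_fn)]

opt = DistributedWrapper()
print(opt.minimize(lambda: [2.0, 4.0]))  # [2.0, 4.0] -- the wrapper's override is skipped
```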

Resolution

We don't need to explicitly register allreduce_grad as a tf.function: in graph mode it is traced inside the outer tf.function, so the Horovod ops are not saved as a separate concrete function. A minimal illustration of this tracing behavior follows.
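Sketch of the TensorFlow behavior this relies on (plain TF, no Horovod): a regular Python function called inside a tf.function is inlined into the outer graph and never gets a concrete function of its own.

```python
import tensorflow as tf

def inner(x):
    # Plain Python function: no tf.function decorator, so no concrete
    # function of its own is ever created or serialized.
    return x * 2.0

@tf.function
def outer(x):
    return inner(x) + 1.0  # inner is traced inline into outer's graph

print(outer(tf.constant(3.0)))  # tf.Tensor(7.0, ...)
# Only `outer` owns a ConcreteFunction, so only its graph would be
# attached to a SavedModel that captures it.
```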

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.


Unit Test Results

  568 files (−156), 568 suites (−156), 6h 41m 28s ⏱️ (−1h 24m 39s)
  887 tests (±0): 692 ✅ (−76), 195 💤 (+76), 0 ❌ (±0)
  12 738 runs (−3 471): 8 641 ✅ (−2 712), 4 097 💤 (−759), 0 ❌ (±0)

Results for commit 36ea180. Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…


Unit Test Results (with flaky tests)

  593 files (−295), 593 suites (−295), 6h 50m 12s ⏱️ (−2h 8m 29s)
  887 tests (±0): 691 ✅ (−77), 195 💤 (+76), 0 ❌ (±0), 1 🔥 (+1)
  13 081 runs (−7 158): 8 926 ✅ (−4 863), 4 154 💤 (−2 296), 0 ❌ (±0), 1 🔥 (+1)

For more details on these errors, see this check.

Results for commit 36ea180. Comparison against base commit 9f88e1d.

This pull request skips 76 tests.
test.parallel.test_mxnet1.MX1Tests ‑ test_gluon_trainer
test.parallel.test_mxnet1.MX1Tests ‑ test_gpu_required
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_allreduce_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allgather_cpu_gpu_error
test.parallel.test_mxnet1.MX1Tests ‑ test_horovod_grouped_allreduce_cpu_gpu_error
test.parallel.test_tensorflow.TensorFlowTests ‑ test_gpu_required
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_fused_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_grad_gpu
test.parallel.test_tensorflow.TensorFlowTests ‑ test_horovod_allgather_variable_size_fused_gpu
…

@EnricoMi (Collaborator) commented Mar 21, 2024

Needs #4033 and #4035.

@supercharleszhu (Contributor, Author)

@EnricoMi Has the CI been fixed? It seems the above two PRs have been checked in.

Successfully merging this pull request may close these issues:

Tensorflow Saved model not portable with latest tf.keras.optimizers