TF_NUM_INTEROP_THREADS=1 speeds up training time of tensorflow2_keras_mnist.py. Why?
#4022
Unanswered
AlexisEspinosaGayosso asked this question in Q&A
In our HPC centre, the Slurm scheduler assigns CPU cores and GPUs to the job. To run any GPU code optimally, there is a list of recommended best practices for the srun options in order to achieve the best possible CPU core and GPU placement and binding. Whether or not the recommended srun practices are used seems to play a huge role in the training speed of tensorflow2_keras_mnist.py, but they need to be combined with a thread-controlling variable: TF_NUM_INTEROP_THREADS=1.
The main question I'm asking in this post is: what is the role of TF_NUM_INTEROP_THREADS?
For you to better understand why I'm asking this, let me quickly take you through a couple of tests that I have performed.
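Since the tests depend on which CPU cores Slurm actually hands to each task, a small check like the following can help confirm the placement from inside the Python process. This is only a minimal sketch using the standard library: the function names are hypothetical, `os.sched_getaffinity` is available on Linux only, and `SLURM_PROCID` and `SLURM_CPUS_PER_TASK` are assumed to be exported by Slurm (the latter only when `-c` is given).

```python
import os


def cpus_visible_to_process():
    """Return the sorted list of CPU ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: reflects the binding applied by the launcher (e.g. srun).
        return sorted(os.sched_getaffinity(0))
    # Fallback where sched_getaffinity is missing: every CPU in the machine.
    return list(range(os.cpu_count() or 1))


def placement_report():
    """Collect basic placement facts for this task into a dict."""
    cpus = cpus_visible_to_process()
    return {
        # These variables are set by Slurm and are None outside a Slurm job step.
        "slurm_procid": os.environ.get("SLURM_PROCID"),
        "slurm_cpus_per_task": os.environ.get("SLURM_CPUS_PER_TASK"),
        "visible_cpus": cpus,
        "n_visible_cpus": len(cpus),
    }


if __name__ == "__main__":
    print(placement_report())
```

Calling `placement_report()` once per task, early in the training script, should make it easy to compare runs submitted with and without `-c 8`.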
First successful execution (but slow)
The naive way of submitting (assuming you have exclusive access to 2 nodes with 8 GPUs each) is:
which allows the code to complete training and use the 16 GPUs in distributed training. (Note that, according to the Slurm documentation, not using -c really means a default of -c 1.)
But it runs SLOW:
Code crashes when using -c 8 but without an environment variable controlling the threads
I do not know the exact reason for the slowness, but I decided to try the recommended practice for the GPU nodes, which is to use the -c 8 option in the srun command to indicate that each task placement is going to "grab" a "core-space" of size 8 (a full chiplet). The recommendation itself says that this is a kind of "placement trick". But this trick comes with the side effect of providing 8 threads per task, unless the real number of threads provided to the code is controlled with an environment variable. Still, first I naively tried:
which crashes with a Bus error. (Other test codes have shown Segmentation Fault errors instead of this one.)
Part of the error message is:
A faster training: using -c 8 + TF_NUM_INTEROP_THREADS=1
As mentioned above, our documentation says that the "placement trick" needs to be combined with an environment variable that controls the real number of threads given to the code (which I failed to do in the faulty attempt above). So, again naively, I tried export OMP_NUM_THREADS=1 to control the number of threads, but this gave the same bus error. Clearly, OMP_NUM_THREADS does not control anything in this code. I then looked for an equivalent variable and found that TF_NUM_INTEROP_THREADS might be what I was looking for. So I tried:
and this combination indeed allows the code to run. And it runs much faster!:
So this indeed seems to be a better combination. But it's not clear to me what the role of this combination is (the use of -c 8 together with TF_NUM_INTEROP_THREADS=1).
Additional tests
It's important to say that the combination (without -c 8):
runs as slow as the first slow example shown above. (Note that, according to the Slurm documentation, not using -c really means a default of -c 1.)
I guess it's also important to say that using TF_NUM_INTEROP_THREADS=3 or higher also gives a crash with a bus error.
Back to the questions
1. What is the role of TF_NUM_INTEROP_THREADS? (For tensorflow2_keras_mnist.py and in general.)
2. Why does the combination TF_NUM_INTEROP_THREADS=1 + -c 8 give such a boost in the speed of training?
3. There is also TF_NUM_INTRAOP_THREADS. I tried this other variable, but it does not seem to have any effect on behaviour. Still, for completeness, I would like to ask: what would be the role of this other variable? (For tensorflow2_keras_mnist.py and in general.)
Additional comments
I instrumented the code with:
which prints the following for the slow run, consistent with not using -c, which really means an implicit -c 1:
And gives the following for the fast run, consistent with the use of -c 8:
(Note that this instrumentation queries the operating system; it is not a proper TensorFlow or Horovod query. So these are the CPU cores provided to the job step by the srun command.)
I also instrumented the code with:
which gives the following for BOTH cases:
I also instrumented the code with:
But the query always gives back a 0 for BOTH cases (also independently of additional tests setting TF_NUM_INTEROP_THREADS to different values or not setting it at all):
The TensorFlow documentation says it means:
So I was not able to check for the effect of different values of the environment variable on the query.
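Because the query does not reflect the environment variable, one workaround for bookkeeping is to record the environment values next to whatever TensorFlow reports. The sketch below is an assumption about how that could be done, not the instrumentation used above: the helper names are hypothetical, and the TensorFlow part uses `tf.config.threading.get_inter_op_parallelism_threads()` and `get_intra_op_parallelism_threads()`, skipping them if TensorFlow is not installed.

```python
import importlib.util
import os

# Thread-related variables discussed in this post.
THREAD_ENV_VARS = (
    "TF_NUM_INTEROP_THREADS",
    "TF_NUM_INTRAOP_THREADS",
    "OMP_NUM_THREADS",
)


def thread_env_settings(environ=None):
    """Return each thread variable as an int, or None if unset or not an integer."""
    environ = os.environ if environ is None else environ
    settings = {}
    for name in THREAD_ENV_VARS:
        raw = environ.get(name)
        try:
            settings[name] = int(raw) if raw is not None else None
        except ValueError:
            settings[name] = None
    return settings


def tf_threading_config():
    """Return TensorFlow's configured thread counts, or None without TensorFlow."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf

    return {
        "inter_op": tf.config.threading.get_inter_op_parallelism_threads(),
        "intra_op": tf.config.threading.get_intra_op_parallelism_threads(),
    }


if __name__ == "__main__":
    print("environment:", thread_env_settings())
    print("tensorflow config:", tf_threading_config())
```

Printing both from every task at least documents which settings each run was launched with, even when the TensorFlow query itself keeps returning 0.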