`TF_NUM_INTEROP_THREADS=1` speeds up training time of `tensorflow2_keras_mnist.py`. Why? #4022

AlexisEspinosaGayosso · 2024-02-08T09:39:41Z

In our HPC centre, the Slurm scheduler assigns CPU cores and GPUs to the job. To run any GPU code optimally, there is a list of recommended best practices for the srun options in order to achieve the best possible CPU core and GPU placement and binding. Whether or not these recommended srun practices are used seems to play a huge role in the training speed of tensorflow2_keras_mnist.py, but they need to be combined with a thread-controlling variable: TF_NUM_INTEROP_THREADS=1.

The main question I'm asking in this post is:

What is the role of the "number of threads for parallelism between independent operations" ruled by TF_NUM_INTEROP_THREADS?

For you to better understand why I'm asking this, let me quickly take you through a couple of tests that I have performed.

First successful execution (but slow)

The naive way of submitting (assuming you have exclusive access to 2 nodes with 8 GPUs each) is:

srun -N 2 -n 16 --gres=gpu:8 python3 tensorflow2_keras_mnist.py

which allows the code to complete training and use the 16 GPUs in distributed training. (Note that, according to the Slurm documentation, omitting -c means a default of -c 1.)

But it runs SLOW:

Epoch 1/24
31/31 [==============================] - 18s 20ms/step - loss: 0.6840 - accuracy: 0.7856 - lr: 0.0060
Epoch 2/24
31/31 [==============================] - 1s 19ms/step - loss: 0.1083 - accuracy: 0.9672 - lr: 0.0110
Epoch 3/24
31/31 [==============================] - 1s 19ms/step - loss: 0.0641 - accuracy: 0.9802 - lr: 0.0160
Epoch 4/24
31/31 [==============================] - 1s 19ms/step - loss: 0.0567 - accuracy: 0.9820 - lr: 0.0160
Epoch 5/24
31/31 [==============================] - 1s 19ms/step - loss: 0.0492 - accuracy: 0.9848 - lr: 0.0160
Epoch 6/24
31/31 [==============================] - 1s 20ms/step - loss: 0.0441 - accuracy: 0.9860 - lr: 0.0160
Epoch 7/24
31/31 [==============================] - 1s 18ms/step - loss: 0.0422 - accuracy: 0.9869 - lr: 0.0160
Epoch 8/24
31/31 [==============================] - 1s 20ms/step - loss: 0.0406 - accuracy: 0.9878 - lr: 0.0160
Epoch 9/24
31/31 [==============================] - 1s 19ms/step - loss: 0.0366 - accuracy: 0.9886 - lr: 0.0160
Epoch 10/24
31/31 [==============================] - 1s 20ms/step - loss: 0.0373 - accuracy: 0.9885 - lr: 0.0160
Epoch 11/24
31/31 [==============================] - 1s 20ms/step - loss: 0.0394 - accuracy: 0.9876 - lr: 0.0160
Epoch 12/24
31/31 [==============================] - 1s 18ms/step - loss: 0.0313 - accuracy: 0.9901 - lr: 0.0160
Epoch 13/24
31/31 [==============================] - 1s 16ms/step - loss: 0.0326 - accuracy: 0.9899 - lr: 0.0160
Epoch 14/24
31/31 [==============================] - 1s 17ms/step - loss: 0.0257 - accuracy: 0.9920 - lr: 0.0160
Epoch 15/24
31/31 [==============================] - 1s 20ms/step - loss: 0.0241 - accuracy: 0.9917 - lr: 0.0160
Epoch 16/24
31/31 [==============================] - 1s 17ms/step - loss: 0.0254 - accuracy: 0.9919 - lr: 0.0160
Epoch 17/24
31/31 [==============================] - 1s 18ms/step - loss: 0.0203 - accuracy: 0.9930 - lr: 0.0160
Epoch 18/24
31/31 [==============================] - 1s 17ms/step - loss: 0.0260 - accuracy: 0.9922 - lr: 0.0160
Epoch 19/24
31/31 [==============================] - 1s 18ms/step - loss: 0.0230 - accuracy: 0.9923 - lr: 0.0160
Epoch 20/24
31/31 [==============================] - 1s 17ms/step - loss: 0.0229 - accuracy: 0.9927 - lr: 0.0160
Epoch 21/24
31/31 [==============================] - 1s 18ms/step - loss: 0.0222 - accuracy: 0.9931 - lr: 0.0160
Epoch 22/24
31/31 [==============================] - 1s 18ms/step - loss: 0.0206 - accuracy: 0.9936 - lr: 0.0160
Epoch 23/24
31/31 [==============================] - 0s 15ms/step - loss: 0.0206 - accuracy: 0.9932 - lr: 0.0160
Epoch 24/24
31/31 [==============================] - 1s 16ms/step - loss: 0.0261 - accuracy: 0.9916 - lr: 0.0160

Code crashes when using `-c 8` but without an environment variable controlling the threads

I do not know the exact reason for the slowness, but I decided to try the recommended practice for the GPU nodes, which is to pass -c 8 to srun so that each task placement "grabs" a "core-space" of size 8 (a full chiplet). The recommendation itself describes this as a "placement trick". The trick has the side effect of making 8 threads available to each task, unless the real number of threads given to the code is controlled with an environment variable. Still, I first naively tried:

srun -N 2 -n 16 -c 8 --gres=gpu:8 python3 tensorflow2_keras_mnist.py

which crashes with a Bus error. (Other testing codes have shown Segmentation Fault errors instead of this one.)

Part of the error message is:

/software/containers/modules-long/shpc_registry/tensorflow/rocm5.6-tf2.12/bin/python3: line 7: 126256 Bus error               singularity ${SINGULARITY_OPTS} exec ${SINGULARITY_COMMAND_OPTS} -B $moduleDir/99-shpc.sh:/.singularity.d/env/99-shpc.sh /software/containers/sif/shpc_registry/tensorflow/rocm5.6-tf2.12/sha256:96cc3e467aaafa12c558251350ef45a5d46e1d2f7e6e91e4b9cf50334e81736c.sif /usr/bin/python3 "$@"

A faster training: using `-c 8` + `TF_NUM_INTEROP_THREADS=1`

As mentioned above, our documentation says that the "placement trick" needs to be combined with an environment variable that controls the real number of threads given to the code (which I failed to do in the faulty attempt above). So, again naively, I tried export OMP_NUM_THREADS=1 to control the number of threads, but this gave the same bus error. Apparently OMP_NUM_THREADS does not control the threads that matter in this code. I then looked for an equivalent variable and found that TF_NUM_INTEROP_THREADS might be what I was looking for. So I tried:

export TF_NUM_INTEROP_THREADS=1
srun -N 2 -n 16 -c 8 --gres=gpu:8 python3 tensorflow2_keras_mnist.py

and this combination indeed allows the code to run. And it runs much faster:

Epoch 1/24
31/31 [==============================] - 8s 17ms/step - loss: 0.7409 - accuracy: 0.7633 - lr: 0.0060
Epoch 2/24
31/31 [==============================] - 0s 12ms/step - loss: 0.1184 - accuracy: 0.9645 - lr: 0.0110
Epoch 3/24
31/31 [==============================] - 0s 11ms/step - loss: 0.0734 - accuracy: 0.9766 - lr: 0.0160
Epoch 4/24
31/31 [==============================] - 0s 11ms/step - loss: 0.0584 - accuracy: 0.9813 - lr: 0.0160
Epoch 5/24
31/31 [==============================] - 0s 10ms/step - loss: 0.0541 - accuracy: 0.9839 - lr: 0.0160
Epoch 6/24
31/31 [==============================] - 0s 10ms/step - loss: 0.0451 - accuracy: 0.9857 - lr: 0.0160
Epoch 7/24
31/31 [==============================] - 0s 10ms/step - loss: 0.0454 - accuracy: 0.9859 - lr: 0.0160
Epoch 8/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0430 - accuracy: 0.9872 - lr: 0.0160
Epoch 9/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0411 - accuracy: 0.9872 - lr: 0.0160
Epoch 10/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0395 - accuracy: 0.9875 - lr: 0.0160
Epoch 11/24
31/31 [==============================] - 0s 8ms/step - loss: 0.0408 - accuracy: 0.9875 - lr: 0.0160
Epoch 12/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0391 - accuracy: 0.9877 - lr: 0.0160
Epoch 13/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0338 - accuracy: 0.9899 - lr: 0.0160
Epoch 14/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0281 - accuracy: 0.9904 - lr: 0.0160
Epoch 15/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0282 - accuracy: 0.9911 - lr: 0.0160
Epoch 16/24
31/31 [==============================] - 0s 7ms/step - loss: 0.0250 - accuracy: 0.9922 - lr: 0.0160
Epoch 17/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0232 - accuracy: 0.9923 - lr: 0.0160
Epoch 18/24
31/31 [==============================] - 0s 7ms/step - loss: 0.0239 - accuracy: 0.9920 - lr: 0.0160
Epoch 19/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0240 - accuracy: 0.9920 - lr: 0.0160
Epoch 20/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0252 - accuracy: 0.9923 - lr: 0.0160
Epoch 21/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0253 - accuracy: 0.9917 - lr: 0.0160
Epoch 22/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0222 - accuracy: 0.9924 - lr: 0.0160
Epoch 23/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0235 - accuracy: 0.9926 - lr: 0.0160
Epoch 24/24
31/31 [==============================] - 0s 9ms/step - loss: 0.0204 - accuracy: 0.9932 - lr: 0.0160

So this indeed seems to be a better combination. But it is not clear to me what role this combination plays (-c 8 together with TF_NUM_INTEROP_THREADS=1).
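To put a number on the difference between the two logs, the per-epoch lines can be parsed. The following is only a minimal sketch: the helper name mean_step_ms and the regular expression are my own, and only the first three epochs of each run (copied from the logs above) are included as sample input. Epoch 1 is skipped by default because it includes start-up cost.

```python
import re
from statistics import mean

# Matches the per-epoch summary line printed by Keras, e.g.
# "31/31 [=====] - 1s 19ms/step - loss: ..." and captures the ms/step value.
STEP_RE = re.compile(r"-\s+\d+s\s+(\d+)ms/step")


def mean_step_ms(log_text, skip_first=True):
    """Return the mean ms/step over the epoch lines found in a Keras log."""
    steps = [int(m.group(1)) for m in STEP_RE.finditer(log_text)]
    if skip_first and len(steps) > 1:
        steps = steps[1:]  # drop epoch 1, which includes start-up cost
    return mean(steps)


# First three epochs of each run, copied from the logs in this post.
slow_log = """
31/31 [==============================] - 18s 20ms/step - loss: 0.6840 - accuracy: 0.7856 - lr: 0.0060
31/31 [==============================] - 1s 19ms/step - loss: 0.1083 - accuracy: 0.9672 - lr: 0.0110
31/31 [==============================] - 1s 19ms/step - loss: 0.0641 - accuracy: 0.9802 - lr: 0.0160
"""
fast_log = """
31/31 [==============================] - 8s 17ms/step - loss: 0.7409 - accuracy: 0.7633 - lr: 0.0060
31/31 [==============================] - 0s 12ms/step - loss: 0.1184 - accuracy: 0.9645 - lr: 0.0110
31/31 [==============================] - 0s 11ms/step - loss: 0.0734 - accuracy: 0.9766 - lr: 0.0160
"""

slow = mean_step_ms(slow_log)
fast = mean_step_ms(fast_log)
print(f"slow: {slow:.1f} ms/step, fast: {fast:.1f} ms/step, ratio: {slow / fast:.2f}x")
```

Feeding the complete 24-epoch logs to the same function would give the comparison over the whole run.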

Additional tests

It's important to say that the combination (without -c 8):

export TF_NUM_INTEROP_THREADS=1
srun -N 2 -n 16 --gres=gpu:8 python3 tensorflow2_keras_mnist.py

runs as slowly as the first slow example shown above. (Note that, according to the Slurm documentation, omitting -c means a default of -c 1.)

It is also worth saying that TF_NUM_INTEROP_THREADS=3 or higher also crashes with a bus error.

Back to the questions

What is the role of the "number of threads for parallelism between independent operations" ruled by TF_NUM_INTEROP_THREADS? (For tensorflow2_keras_mnist.py and in general.)
Why does the combination of TF_NUM_INTEROP_THREADS=1 and -c 8 give such a boost in training speed?
When looking into thread-related environment variables I also read about TF_NUM_INTRAOP_THREADS. I tried this other variable, but it does not seem to have any effect on the behaviour. Still, for completeness, I would like to ask: what is the role of this other variable? (For tensorflow2_keras_mnist.py and in general.)
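To make the inter-op/intra-op distinction in these questions concrete, here is a toy analogy in plain Python. It is not TensorFlow's implementation and says nothing about why the crash or the speed-up happens. It only illustrates the documented meaning of the two settings: one pool runs independent operations side by side (inter-op), and another pool is used inside a single operation to split its work (intra-op). The names op_sum_squares and run_graph are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def op_sum_squares(values, intra_pool):
    """One 'operation' that splits its own work across the intra-op pool."""
    chunks = [values[i::2] for i in range(2)]  # two interleaved halves
    partials = intra_pool.map(lambda chunk: sum(v * v for v in chunk), chunks)
    return sum(partials)


def run_graph(inputs, inter_threads, intra_threads):
    """Run one independent operation per input list.

    inter_threads: how many independent operations may run at the same time.
    intra_threads: how many threads a single operation may use internally.
    """
    with ThreadPoolExecutor(inter_threads) as inter_pool, \
         ThreadPoolExecutor(intra_threads) as intra_pool:
        futures = [inter_pool.submit(op_sum_squares, x, intra_pool) for x in inputs]
        return [f.result() for f in futures]


# With inter_threads=1 the two operations run one after the other,
# but each of them can still use both intra-op threads.
print(run_graph([[1, 2, 3], [4, 5, 6]], inter_threads=1, intra_threads=2))
```

The results are the same for any pool sizes; only the scheduling changes, which is the part the two environment variables are meant to control.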

Additional comments

I instrumented the code with:

import os
# Cores the OS allows this process to run on (0 = the calling process; Linux only)
cpu_affinity = os.sched_getaffinity(0)
print(f"Process rank {hvd.rank()}, local rank {hvd.local_rank()}: Assigned CPU Cores: {cpu_affinity}")

which prints the following for the slow run, consistent with omitting -c, which implies -c 1:

Process rank 0, local rank 0: Assigned CPU Cores: {0}
Process rank 1, local rank 1: Assigned CPU Cores: {8}
Process rank 2, local rank 2: Assigned CPU Cores: {16}
Process rank 3, local rank 3: Assigned CPU Cores: {24}
Process rank 4, local rank 4: Assigned CPU Cores: {32}
Process rank 5, local rank 5: Assigned CPU Cores: {40}
Process rank 6, local rank 6: Assigned CPU Cores: {48}
Process rank 7, local rank 7: Assigned CPU Cores: {56}
Process rank 8, local rank 0: Assigned CPU Cores: {0}
Process rank 9, local rank 1: Assigned CPU Cores: {8}
Process rank 10, local rank 2: Assigned CPU Cores: {16}
Process rank 11, local rank 3: Assigned CPU Cores: {24}
Process rank 12, local rank 4: Assigned CPU Cores: {32}
Process rank 13, local rank 5: Assigned CPU Cores: {40}
Process rank 14, local rank 6: Assigned CPU Cores: {48}
Process rank 15, local rank 7: Assigned CPU Cores: {56}

And it prints the following for the fast run, consistent with the use of -c 8:

Process rank 0, local rank 0: Assigned CPU Cores: {0, 1, 2, 3, 4, 5, 6, 7}
Process rank 1, local rank 1: Assigned CPU Cores: {8, 9, 10, 11, 12, 13, 14, 15}
Process rank 2, local rank 2: Assigned CPU Cores: {16, 17, 18, 19, 20, 21, 22, 23}
Process rank 3, local rank 3: Assigned CPU Cores: {24, 25, 26, 27, 28, 29, 30, 31}
Process rank 4, local rank 4: Assigned CPU Cores: {32, 33, 34, 35, 36, 37, 38, 39}
Process rank 5, local rank 5: Assigned CPU Cores: {40, 41, 42, 43, 44, 45, 46, 47}
Process rank 6, local rank 6: Assigned CPU Cores: {48, 49, 50, 51, 52, 53, 54, 55}
Process rank 7, local rank 7: Assigned CPU Cores: {56, 57, 58, 59, 60, 61, 62, 63}
Process rank 8, local rank 0: Assigned CPU Cores: {0, 1, 2, 3, 4, 5, 6, 7}
Process rank 9, local rank 1: Assigned CPU Cores: {8, 9, 10, 11, 12, 13, 14, 15}
Process rank 10, local rank 2: Assigned CPU Cores: {16, 17, 18, 19, 20, 21, 22, 23}
Process rank 11, local rank 3: Assigned CPU Cores: {24, 25, 26, 27, 28, 29, 30, 31}
Process rank 12, local rank 4: Assigned CPU Cores: {32, 33, 34, 35, 36, 37, 38, 39}
Process rank 13, local rank 5: Assigned CPU Cores: {40, 41, 42, 43, 44, 45, 46, 47}
Process rank 14, local rank 6: Assigned CPU Cores: {48, 49, 50, 51, 52, 53, 54, 55}
Process rank 15, local rank 7: Assigned CPU Cores: {56, 57, 58, 59, 60, 61, 62, 63}

(Note that this instrumentation queries the operating system; it is not a proper TensorFlow or Horovod query. So these are the CPU cores provided to the job step by the srun command.)
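The same OS-level check can be combined with the thread-related environment variables, so each task reports in one line what it was given. This is a sketch using only the standard library; the function name placement_report is hypothetical, and SLURM_PROCID is the per-task rank that srun exports (a "?" is printed when running outside Slurm).

```python
import os

# Thread-related variables mentioned in this post.
THREAD_VARS = ("TF_NUM_INTEROP_THREADS", "TF_NUM_INTRAOP_THREADS", "OMP_NUM_THREADS")


def placement_report(environ=None):
    """Collect the cores the OS allows and the thread-related variables."""
    environ = os.environ if environ is None else environ
    # sched_getaffinity exists on Linux only; report None elsewhere.
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = None
    report = {"cores": cores, "n_cores": None if cores is None else len(cores)}
    for name in THREAD_VARS:
        report[name] = environ.get(name, "<unset>")
    return report


if __name__ == "__main__":
    rank = os.environ.get("SLURM_PROCID", "?")
    print(f"task {rank}: {placement_report()}")
```

Saved as a standalone script, it can be launched with the same srun options as the training script to compare the placements without starting TensorFlow.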

I also instrumented the code with:

# Counts TensorFlow's logical CPU devices (by default one per process), not physical cores
cpu_logical_devices = tf.config.experimental.list_logical_devices('CPU')
num_cpu_cores = len(cpu_logical_devices)
print(f"Process rank {hvd.rank()}, local rank {hvd.local_rank()}: TensorFlow sees {num_cpu_cores} CPU cores")

which gives the following for BOTH cases:

Process rank 0, local rank 0: TensorFlow sees 1 CPU cores
Process rank 1, local rank 1: TensorFlow sees 1 CPU cores
Process rank 2, local rank 2: TensorFlow sees 1 CPU cores
Process rank 3, local rank 3: TensorFlow sees 1 CPU cores
Process rank 4, local rank 4: TensorFlow sees 1 CPU cores
Process rank 5, local rank 5: TensorFlow sees 1 CPU cores
Process rank 6, local rank 6: TensorFlow sees 1 CPU cores
Process rank 7, local rank 7: TensorFlow sees 1 CPU cores
Process rank 8, local rank 0: TensorFlow sees 1 CPU cores
Process rank 9, local rank 1: TensorFlow sees 1 CPU cores
Process rank 10, local rank 2: TensorFlow sees 1 CPU cores
Process rank 11, local rank 3: TensorFlow sees 1 CPU cores
Process rank 12, local rank 4: TensorFlow sees 1 CPU cores
Process rank 13, local rank 5: TensorFlow sees 1 CPU cores
Process rank 14, local rank 6: TensorFlow sees 1 CPU cores
Process rank 15, local rank 7: TensorFlow sees 1 CPU cores

I also instrumented the code with:

# Print the configured number of inter-op parallelism threads
inter_op_threads = tf.config.threading.get_inter_op_parallelism_threads()
print(f"Process rank {hvd.rank()}, local rank {hvd.local_rank()}: Inter-op Parallelism Threads: {inter_op_threads}")

But the query always returns 0 in BOTH cases (and also in additional tests with TF_NUM_INTEROP_THREADS set to different values or not set at all):

Process rank 0, local rank 0: Inter-op Parallelism Threads: 0
Process rank 1, local rank 1: Inter-op Parallelism Threads: 0
Process rank 2, local rank 2: Inter-op Parallelism Threads: 0
Process rank 3, local rank 3: Inter-op Parallelism Threads: 0
Process rank 4, local rank 4: Inter-op Parallelism Threads: 0
Process rank 5, local rank 5: Inter-op Parallelism Threads: 0
Process rank 6, local rank 6: Inter-op Parallelism Threads: 0
Process rank 7, local rank 7: Inter-op Parallelism Threads: 0
Process rank 8, local rank 0: Inter-op Parallelism Threads: 0
Process rank 9, local rank 1: Inter-op Parallelism Threads: 0
Process rank 10, local rank 2: Inter-op Parallelism Threads: 0
Process rank 11, local rank 3: Inter-op Parallelism Threads: 0
Process rank 12, local rank 4: Inter-op Parallelism Threads: 0
Process rank 13, local rank 5: Inter-op Parallelism Threads: 0
Process rank 14, local rank 6: Inter-op Parallelism Threads: 0
Process rank 15, local rank 7: Inter-op Parallelism Threads: 0

The TensorFlow documentation describes this setting as follows:

Determines the number of threads used by independent non-blocking operations. 0 means the system picks an appropriate number.

So I was not able to check for the effect of different values of the environment variable on the query.
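A possible explanation, stated here as an assumption and not verified on this build, is that the getter only reports a value set through the Python API and that the environment variable is read elsewhere inside TensorFlow. If so, one way to make the query reflect the variable is to copy it into the config explicitly before TensorFlow runs any operation. The sketch below does that; threads_from_env and apply_thread_settings are hypothetical helper names, while the two set_*_parallelism_threads calls are the documented tf.config.threading setters.

```python
import os


def threads_from_env(name, environ=None):
    """Return the integer value of an environment variable, or 0 if unset or invalid."""
    environ = os.environ if environ is None else environ
    try:
        value = int(environ.get(name, "0"))
    except ValueError:
        return 0
    return max(value, 0)


def apply_thread_settings(tf, environ=None):
    """Copy the TF_NUM_* variables into TensorFlow's threading config.

    `tf` is the imported tensorflow module. Call this before TensorFlow
    executes any operation; the setters raise an error after initialization.
    A value of 0 is left alone, so TensorFlow keeps picking its own default.
    """
    inter = threads_from_env("TF_NUM_INTEROP_THREADS", environ)
    intra = threads_from_env("TF_NUM_INTRAOP_THREADS", environ)
    if inter:
        tf.config.threading.set_inter_op_parallelism_threads(inter)
    if intra:
        tf.config.threading.set_intra_op_parallelism_threads(intra)
    return inter, intra


# Intended use near the top of the training script (requires TensorFlow):
#   import tensorflow as tf
#   apply_thread_settings(tf)
#   print(tf.config.threading.get_inter_op_parallelism_threads())
```

Whether setting the value this way behaves the same as the environment variable with respect to the bus error is a separate question that this sketch does not answer.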
