Multi Node Multi GPU TensorFlow 2.0 Distributed Training Example

Romeo Kienzler (1stname.lastname at ch.ibm.com)

Jerome Nilmeier (lastname at us.ibm.com)


This example ports the TensorFlow distributed training example to run on Satori.

Prerequisites if you are not yet running TensorFlow 2.0

Install TensorFlow 2.0 as described here. IMPORTANT: this procedure might change, so please ask John or Chris. It is important that you name your environment wmlce-ea.

Commands to run this example

  1. Log in to the Satori login node

  2. wget

  3. chmod 755

  4. wget

  5. chmod 755

  6. bsub -W 3:00 -q normalx -x -n 8 -gpu "num=4" -R "span[ptile=4]" -I "while (true) do ls > /dev/null; done" (replace the -n value with a number less than or equal to 256)

  7. Log in again in a new shell

  8. nodes=`bjobs |grep 4*node |awk -F"*" '{print $2}' |awk -F"." '{print $1}'`

  9. echo $nodes |python
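Step 8 extracts the worker host names from the bjobs output. As a rough illustration of what that awk pipeline does, here is an equivalent in plain Python; the sample bjobs EXEC_HOST field below is made up for illustration and will differ on your cluster:

```python
def parse_nodes(bjobs_line):
    """Mimic: awk -F'*' '{print $2}' | awk -F'.' '{print $1}'
    A field like '4*node0123.cm.cluster' becomes 'node0123'."""
    nodes = []
    for field in bjobs_line.split():
        if "*" in field:
            host = field.split("*")[1]        # drop the '4*' slot count
            nodes.append(host.split(".")[0])  # drop the domain suffix
    return nodes

# hypothetical EXEC_HOST field for a 2-node, 4-GPUs-per-node job
sample = "4*node0123.cm.cluster 4*node0456.cm.cluster"
print(parse_nodes(sample))  # -> ['node0123', 'node0456']
```

The resulting list of node names is what step 9 pipes into the Python launcher script.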

Wait until training starts. To observe what's happening, open a new terminal on each of your worker nodes and run:

watch -n 0.1 nvidia-smi

What’s going on here?

All scripts running on all nodes start a service component that communicates with the other scripts in the background to perform parameter averaging.
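The parameter averaging mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the actual service code: each worker holds its own copy of the parameters, and an all-reduce step replaces every copy with the element-wise mean across all workers.

```python
def average_parameters(worker_params):
    """All-reduce by averaging: every worker ends up holding the
    element-wise mean of all workers' parameter vectors."""
    n_workers = len(worker_params)
    n_params = len(worker_params[0])
    averaged = [
        sum(w[i] for w in worker_params) / n_workers
        for i in range(n_params)
    ]
    # every worker receives the same averaged vector
    return [list(averaged) for _ in worker_params]

# three workers, each holding a 2-element parameter vector
params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(average_parameters(params))  # -> [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
```

In the real example this exchange happens over the network between the service components on the worker nodes, once per training step.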

See for more information

Code and input data repositories for this example


Useful references