Multi Node Multi GPU TensorFlow 2.0 Distributed Training Example

This is a port of the TensorFlow example to run on Satori.

Prerequisites if you are not yet running TensorFlow 2.0

Install TensorFlow 2.0 as described here. IMPORTANT: this might change, so please ask John or Chris. It is important that you name your environment wmlce-ea.

Commands to run this example

  1. Log in to a Satori login node

  2. wget

  3. chmod 755

  4. wget

  5. chmod 755

  6. bsub -W 3:00 -q normalx -x -n 8 -gpu "num=4" -R "span[ptile=4]" -I "while (true) do ls > /dev/null; done" (replace 2586 with a number less than or equal to 256)

  7. Log in to a new shell

  8. nodes=`bjobs |grep 4*node |awk -F"*" '{print $2}' |awk -F"." '{print $1}'`

  9. echo $nodes |python
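Step 9 pipes the node names into a Python script. That script is not shown here, so the following is only a sketch of what such a launcher might do, assuming it builds one TF_CONFIG value per worker (TF_CONFIG is the environment variable that TensorFlow's multi-worker strategies read to discover the cluster). The function names build_tf_configs and read_hosts, the port 12345, and the sample host names are hypothetical and not taken from the original example.

```python
import io
import json


def read_hosts(stream):
    """Read whitespace-separated host names, as produced by `echo $nodes`."""
    return stream.read().split()


def build_tf_configs(hosts, port=12345):
    """Return one TF_CONFIG JSON string per host.

    The port is a hypothetical choice; any free port on the nodes would do.
    """
    workers = ["{}:{}".format(host, port) for host in hosts]
    configs = []
    for index in range(len(workers)):
        config = {
            # Every worker sees the same full list of peers ...
            "cluster": {"worker": workers},
            # ... but each one gets its own task index.
            "task": {"type": "worker", "index": index},
        }
        configs.append(json.dumps(config))
    return configs


# Hypothetical node names standing in for the output of step 8.
sample_hosts = read_hosts(io.StringIO("node0001 node0002\n"))
for host, tf_config in zip(sample_hosts, build_tf_configs(sample_hosts)):
    print(host, tf_config)
```

A real launcher would additionally have to start the training script on each host with its own TF_CONFIG exported; how the original script does that is not specified here.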

Wait until training starts. To observe what is happening, open new terminals on your worker nodes and run:

watch -n 0.1 nvidia-smi
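If you would rather log GPU activity than watch it interactively, the same information can be read from nvidia-smi in CSV form. The sketch below is not part of the original example; it assumes nvidia-smi is on the PATH and that the queried fields come back as plain numbers (some GPUs report values such as [N/A], which this simple parser does not handle). The function names are hypothetical.

```python
import subprocess

# Ask nvidia-smi for one CSV line per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Turn lines like '0, 97, 15109' into a list of dictionaries."""
    stats = []
    for line in text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append({
            "gpu": int(index),
            "util_percent": int(util),
            "memory_mib": int(mem),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_stats(output)
```

Calling query_gpu_stats() in a loop on a worker node gives a rough utilization trace that can be written to a file and compared across nodes.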

What’s going on here?

All scripts running on all nodes start a service component, which communicates with the other scripts in the background for parameter averaging.
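To make "parameter averaging" concrete, here is a minimal sketch of the arithmetic only, in plain Python. It does not reproduce the communication code of the actual training script, which is not shown here; the function name average_parameters and the two sample workers are hypothetical.

```python
def average_parameters(worker_params):
    """Element-wise mean of per-worker parameter vectors (lists of floats)."""
    num_workers = len(worker_params)
    # zip(*...) groups the i-th parameter of every worker together.
    return [sum(values) / num_workers for values in zip(*worker_params)]


# Two hypothetical workers, each holding its own copy of three parameters.
worker_params = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
averaged = average_parameters(worker_params)
print(averaged)  # [2.0, 3.0, 4.0]
```

After the averaging step every worker continues from the same values, which keeps the model copies on the different nodes in sync.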

