Distributed Deep Learning
The following libraries provide the communication functions needed for distributed training, primarily allreduce and broadcast (a short sketch of these two collectives follows this list).
IBM Spectrum MPI: Classic tool for distributed computing. Still commonly used for distributed deep learning.
NVIDIA NCCL: NVIDIA's GPU-to-GPU communication library. Since NCCL 2, inter-node communication is also supported.
IBM DDL: Provides a topology-aware allreduce, capable of optimally dividing communication across hierarchies of fabrics and of using different communication protocols at different levels of the hierarchy. When WMLCE is installed, all related frameworks come with IBM DDL support; you do not have to compile additional software packages, only modify your training scripts to use the required distributed deep learning APIs.
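To make these collectives concrete, here is a minimal sketch of allreduce and broadcast in plain MPI terms, assuming mpi4py and NumPy are available (the buffers and script name are placeholders, not part of any of the libraries above): allreduce combines each worker's local gradient into a global sum that every worker receives, and broadcast sends rank 0's initial weights to all other workers. Spectrum MPI, NCCL and IBM DDL provide the same operations, optimized for GPUs and the network topology.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# allreduce: every worker contributes its local gradient and receives the global sum
local_grad = np.random.rand(4).astype(np.float32)
summed_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed_grad, op=MPI.SUM)
avg_grad = summed_grad / comm.Get_size()

# broadcast: rank 0 sends its initial weights to every other worker
weights = np.arange(4, dtype=np.float32) if comm.Get_rank() == 0 else np.empty(4, dtype=np.float32)
comm.Bcast(weights, root=0)

Such a script is started with an MPI launcher, e.g. mpirun -np 4 python allreduce_sketch.py.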
The following integrations into the deep learning frameworks enable distributed training on top of such communication libraries:
TensorFlow Distribution Strategies. TensorFlow's native distribution methods (a minimal sketch follows this list).
IBM DDL. Provides integrations into common frameworks, including a TensorFlow operator that integrates IBM DDL with TensorFlow and a similar integration for PyTorch.
Horovod [Sergeev et al. 2018]. Provides integration libraries for common frameworks that enable distributed training on top of common communication libraries; IBM DDL can be used as the backend of the Horovod implementation.
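To illustrate the first option, below is a minimal sketch of TensorFlow's native MirroredStrategy (single node, all visible GPUs), assuming TensorFlow 2.x; the toy Keras model is a placeholder. The strategy replicates the model on each GPU and all-reduces the gradients automatically.

import tensorflow as tf

# One model replica per visible GPU; gradient allreduce is handled by the strategy.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit(...) then runs the replicated training loop.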
IBM DDL - Documentation and Tutorial:
IBM DDL APIs for a better integration
Examples:
PyTorch
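As an illustration, below is a minimal sketch of how a PyTorch training script is typically adapted for Horovod (which, built as described below, can use DDL or NCCL as its allreduce backend). The model, data and hyperparameters are placeholders, and one GPU per process is assumed.

import torch
import horovod.torch as hvd

hvd.init()                                  # one process per GPU
torch.cuda.set_device(hvd.local_rank())     # pin this process to its local GPU

model = torch.nn.Linear(100, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are all-reduced across all workers at each step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start every worker from the same initial model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    optimizer.zero_grad()
    x = torch.randn(32, 100).cuda()
    loss = model(x).pow(2).mean()           # placeholder loss
    loss.backward()
    optimizer.step()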
How to get Horovod with DDL? Follow the instructions below (steps 0 - 2 are optional if you have already installed WMLCE):
0. Add the ppc64le conda channel for WMLCE
conda config --prepend channels \
https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
1. Create a Conda virtual environment
conda create --name horovod python=3.6
2. Install WMLCE (TensorFlow, PyTorch, DDL, etc.)
conda install powerai
3. Install the packages to build Horovod
conda install gxx_linux-ppc64le=7.3.0 cffi cudatoolkit-dev
4. Install Horovod with the DDL backend
HOROVOD_CUDA_HOME=$CONDA_PREFIX HOROVOD_GPU_ALLREDUCE=DDL pip install horovod --no-cache-dir
or, with direct NCCL support (recommended for PyTorch):
env HOROVOD_CUDA_HOME=$CONDA_PREFIX HOROVOD_NCCL_HOME=$CONDA_PREFIX HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install --no-cache-dir horovod
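Once the build completes, a distributed job is started through a launcher rather than plain python; for example, horovodrun -np 4 python train.py runs four worker processes on one node, and WMLCE additionally provides the ddlrun launcher for the DDL backend (see the IBM DDL documentation above for the exact options on your cluster).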
The original IBM DDL paper can be found at: https://arxiv.org/pdf/1708.02188.pdf