MIT Satori User Documentation
Satori Basics
What is Satori?
How can I get an account?
Getting help?
Satori Login
Web Portal Login
SSH Login
Satori Portal
[1] Accessing the Portal
[2] Using the Portal
[3] Using Jupyter Notebooks
[4] Viewing and Accessing Notebooks
Starting up on Satori
Getting Your Account
Shared HPC Clusters
Logging in to Satori
The Satori Portal
Setting up Your Environment
Transferring Files
Using scp or rsync
Satori Portal File Explorer
Types of Jobs
Running Interactive Jobs
Running Batch Jobs
Project Groups
Creating project groups
Using Anaconda on Satori
[1] Using Anaconda
[2] Creating and activating conda environments
[3] Setting up conda channels
[4] Searching for and installing conda packages
[5] Listing the contents of your conda environment
[6] Leaving your conda environment
Training for faster onboarding in the system HW and SW architecture
Running your AI training jobs on Satori using Slurm
A Note on Exclusivity
Interactive Jobs
Batch Scripts
Monitoring Jobs
Canceling Jobs
Scheduling Policy
Batch Queue Policy
Queue Policies
Running jobs in series
Note on Pytorch 1.4
Troubleshooting
IBM Watson Machine Learning Community Edition (WML-CE) and Open Cognitive Environment (Open-CE)
[1] Install Anaconda
[2] WML-CE and Open-CE: Setting up the software repository
[3] WML-CE and Open-CE: Creating and activating conda environments (recommended)
[4] WML-CE: Installing all frameworks at the same time
[5] WML-CE: Testing ML/DL frameworks (Pytorch, TensorFlow etc) installation
Controlling WML-CE release packages
Additional conda channels
The WML-CE Supplementary channel is available at: https://anaconda.org/powerai/.
The WML-CE Early Access channel is available at: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/.
Distributed Deep Learning
IBM Large Model Support (LMS)
Julia on Satori
Getting started
Getting help?
A simple batch script example
Recipe for running single GPU, single threaded interactive session with CUDA aware MPI
Running a multi-process julia program somewhat interactively
An example of installing https://github.com/clima/climatemachine.jl on Satori
R on Satori
Getting Started with R
Installing Packages
A Simple Batch Script Example
R and Python
Running R in a container
Using MPI and CUDA on Satori
Getting started
Compiling
Submitting a batch script
Batch script header
Assigning GPUs to MPI ranks
Running the MPI program within the batch script
A complete example SLURM batch script
Using alternate MPI builds
Satori Public Datasets
Example machine learning LSF jobs
A single node, 4 GPU Keras example
A single node, 4 GPU Caffe example
A multi-node, pytorch example
A multi-node, pytorch example with the horovod conda environment
Satori Howto Video Sessions
Installing WMLCE on Satori
Pytorch with DDL on Satori
Tensorflow with DDL on Satori
Jupyterlab with SSH Tunnel on Satori
Satori Public Datasets
Singularity for Satorians
Fast start
Other notes
Interactive Allocation:
Non interactive / batch mode
Relion Cryoem for Satorians
Prerequisites
Quick start
Other notes
Copying larger files and large file sets
Using mrsync
Using Aspera for remote file transfer to Satori cluster
FAQ
Tips, tricks and questions
How can I see disk usage?
Where should I put world or project shared datasets?
How can I create custom Jupyter kernels for the Satori web portal?
Steps to create a kernel
How do I set up a basic conda environment?
System software queries
What Linux distribution version am I running?
What Linux kernel level am I running?
What software levels are installed on the system?
System hardware queries
What is my CPU configuration?
How much RAM is there on my nodes?
What SMT mode are my nodes in?
What CPU governor is in effect on my nodes?
What are the logical IDs and UUIDs for the GPUs on my nodes?
What is the IBM model of my system?
Which logical CPUs belong to which socket?
Questions about my jobs
How can I establish which logical CPU IDs my process is bound to?
Can I see the output of my job before it completes?
I have a job waiting in the queue, and I want to modify the options I had selected
I have submitted my job several times, but I get no output
How do I set a time limit on my job?
Can I make a job’s startup depend on the completion of a previous one?
How do I select a specific set of hosts for my job?
How do I deselect specific nodes for my job?
My job’s runtime environment is different from what I expected
I want to know precisely what my job’s runtime environment is
Portal queries
I see no active sessions in My Interactive Sessions?
How do I build a Singularity image from scratch?
Set up to run Docker in ppc64le mode on an x86 machine
Run Docker in ppc64le mode on an x86 machine to generate an image for Satori
Import new Docker hub image into Singularity on Satori
Using Singularity instead of Docker
Green Up Hackathon IAP 2020
Tutorial Examples
Pytorch Style Transfer
Description
Commands to run this example
Code and input data repositories for this example
Useful references
Neural network DNA
Description
Commands to run this example
Code and input data repositories for this example
Useful references
Pathology Image Classification Transfer Learning
Description
Commands to run this example
Code and input data repositories for this example
Useful references
Multi Node Multi GPU TensorFlow 2.0 Distributed Training Example
Description
Prerequisites if you are not yet running TensorFlow 2.0
Commands to run this example
What’s going on here?
Code and input data repositories for this example
Useful references
WMLCE demonstration notebooks
Description
Commands to run this example
Code and input data repositories for this example
Useful references
Finding clusters in high-dimensional data using tSNE and DB-SCAN
Description
Commands to run this example
Code and input data repositories for this example
Useful references
BigGAN-PyTorch
Description
Commands to run this example
Code and input data repositories for this example
Useful references
Measuring Resource Use
Integrated energy use profiling
Description
Commands to run this example
Code and input data repositories for this example
Useful references
Profiling code with nvprof
Description
Commands to run the examples
Useful references
Getting help on Satori
Email help
Slack
Slack or orcd-help-satori@mit.edu
Tips and Tricks
Acceptable Use and Code of Conduct
Acceptable Use Guidelines
Code of Conduct
Green Up IAP 2020 Material
Tutorial Examples
Pytorch Style Transfer
Neural network DNA
Pathology Image Classification Transfer Learning
Multi Node Multi GPU TensorFlow 2.0 Distributed Training Example
WMLCE demonstration notebooks
Finding clusters in high-dimensional data using tSNE and DB-SCAN
BigGAN-PyTorch
Measuring Resource Use
Integrated energy use profiling
Profiling code with nvprof