MIT Satori User Documentation
  • Satori Basics
    • What is Satori?
    • How can I get an account?
    • Getting Help
  • Satori Login
    • Web Portal Login
    • SSH Login
  • Satori Portal
    • [1] Accessing the Portal
    • [2] Using the Portal
    • [3] Using Jupyter Notebooks
    • [4] Viewing and Accessing Notebooks
  • Starting up on Satori
    • Getting Your Account
    • Shared HPC Clusters
    • Logging in to Satori
    • The Satori Portal
    • Setting up Your Environment
    • Transferring Files
      • Using scp or rsync
      • Satori Portal File Explorer
    • Types of Jobs
      • Running Interactive Jobs
      • Running Batch Jobs
    • Project Groups
      • Creating project groups
  • Using Anaconda on Satori
    • [1] Using Anaconda
    • [2] Creating and activating conda environments
    • [3] Setting up conda channels
    • [4] Searching for and installing conda packages
    • [5] Listing the contents of your conda environment
    • [6] Leaving your conda environment
  • Training for faster onboarding on the system HW and SW architecture
  • Running your AI training jobs on Satori using Slurm
    • A Note on Exclusivity
    • Interactive Jobs
    • Batch Scripts
      • Monitoring Jobs
      • Canceling Jobs
      • Scheduling Policy
      • Batch Queue Policy
      • Queue Policies
      • Running jobs in series
      • Note on PyTorch 1.4
  • Troubleshooting
  • IBM Watson Machine Learning Community Edition (WML-CE) and Open Cognitive Environment (Open-CE)
    • [1] Install Anaconda
    • [2] WML-CE and Open-CE: Setting up the software repository
    • [3] WML-CE and Open-CE: Creating and activating conda environments (recommended)
    • [4] WML-CE: Installing all frameworks at the same time
    • [5] WML-CE: Testing ML/DL framework (PyTorch, TensorFlow, etc.) installation
      • Controlling WML-CE release packages
      • Additional conda channels
    • The WML CE Supplementary channel is available at: https://anaconda.org/powerai/.
    • The WML-CE Early Access channel is available at: https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access/.
  • Distributed Deep Learning
  • IBM Large Model Support (LMS)
  • Julia on Satori
    • Getting started
    • Getting Help
    • A simple batch script example
    • Recipe for running single GPU, single threaded interactive session with CUDA aware MPI
    • Running a multi-process julia program somewhat interactively
    • An example of installing https://github.com/clima/climatemachine.jl on Satori
  • R on Satori
    • Getting Started with R
    • Installing Packages
    • A Simple Batch Script Example
    • R and Python
    • Running R in a container
  • Using MPI and CUDA on Satori
    • Getting started
    • Compiling
    • Submitting a batch script
      • Batch script header
      • Assigning GPUs to MPI ranks
      • Running the MPI program within the batch script
    • A complete example SLURM batch script
    • Using alternate MPI builds
  • Satori Public Datasets
  • Example machine learning LSF jobs
    • A single node, 4 GPU Keras example
    • A single node, 4 GPU Caffe example
    • A multi-node PyTorch example
    • A multi-node PyTorch example with the horovod conda environment
  • Satori Howto Video Sessions
    • Installing WMLCE on Satori
    • PyTorch with DDL on Satori
    • TensorFlow with DDL on Satori
    • JupyterLab with SSH Tunnel on Satori
  • Singularity for Satorians
    • Fast start
    • Other notes
    • Interactive allocation
    • Non-interactive / batch mode
  • RELION Cryo-EM for Satorians
    • Prerequisites
    • Quick start
    • Other notes
  • Copying larger files and large file sets
  • Using mrsync
  • Using Aspera for remote file transfer to Satori cluster
  • FAQ
    • Tips, tricks and questions
      • How can I see disk usage?
      • Where should I put world or project shared datasets?
      • How can I create custom Jupyter kernels for the Satori web portal?
        • Steps to create a kernel
      • How do I set up a basic conda environment?
      • System software queries
        • What Linux distribution version am I running?
        • What Linux kernel level am I running?
        • What software levels are installed on the system?
      • System hardware queries
        • What is my CPU configuration?
        • How much RAM is there on my nodes?
        • What SMT mode are my nodes in?
        • What CPU governor is in effect on my nodes?
        • What are the logical IDs and UUIDs for the GPUs on my nodes?
        • What is the IBM model of my system?
        • Which logical CPUs belong to which socket?
      • Questions about my jobs
        • How can I establish which logical CPU IDs my process is bound to?
        • Can I see the output of my job before it completes?
        • I have a job waiting in the queue, and I want to modify the options I had selected
        • I have submitted my job several times, but I get no output
        • How do I set a time limit on my job?
        • Can I make a job’s startup depend on the completion of a previous one?
        • How do I select a specific set of hosts for my job?
        • How do I deselect specific nodes for my job?
        • My job’s runtime environment is different from what I expected
        • I want to know precisely what my job’s runtime environment is
      • Portal queries
        • I see no active sessions in My Interactive Sessions?
      • How do I build a Singularity image from scratch?
        • Set up to run Docker in ppc64le mode on an x86 machine
        • Run Docker in ppc64le mode on an x86 machine to generate an image for Satori
        • Import new Docker hub image into Singularity on Satori
        • Using Singularity instead of Docker
  • Green Up Hackathon IAP 2020
    • Tutorial Examples
      • PyTorch Style Transfer
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
      • Neural network DNA
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
      • Pathology Image Classification Transfer Learning
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
      • Multi Node Multi GPU TensorFlow 2.0 Distributed Training Example
        • Description
        • Prerequisites if you are not yet running TensorFlow 2.0
        • Commands to run this example
        • What’s going on here?
        • Code and input data repositories for this example
        • Useful references
      • WMLCE demonstration notebooks
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
      • Finding clusters in high-dimensional data using t-SNE and DBSCAN
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
      • BigGAN-PyTorch
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
    • Measuring Resource Use
      • Integrated energy use profiling
        • Description
        • Commands to run this example
        • Code and input data repositories for this example
        • Useful references
      • Profiling code with nvprof
        • Description
        • Commands to run the examples
        • Useful references
  • Getting help on Satori
    • Email help
    • Slack
    • Slack or orcd-help-satori@mit.edu
    • Tips and Tricks
  • Acceptable Use and Code of Conduct
    • Acceptable Use Guidelines
    • Code of Conduct
© Copyright 2024, MIT Satori Project.