Introducing Slurm

On all of the cluster systems (except Nobel and Tigressdata), you run programs by submitting scripts to the Slurm job scheduler. A Slurm script must do three things: (1) prescribe the resource requirements for the job, (2) set the environment and (3) specify the work to be carried out in the form of shell commands.

Below is a sample Slurm script for running a Python code:

#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7
conda activate pytools-env

python myscript.py

The first line of a Slurm script specifies the Unix shell to be used. This is followed by a series of #SBATCH directives which set the resource requirements and other parameters of the job. While many directives are optional, a Slurm script is required to set the number of nodes, number of tasks and time. The script above requests 1 CPU-core and 4 GB of memory for 1 minute of run time. The necessary changes to the environment are made by loading the anaconda3 environment module and activating a particular Conda environment. Lastly, the work to be done, which is the execution of a Python script, is specified in the final line.

See below for information about the correspondence between tasks and CPU-cores. If your job fails to finish before the specified time limit then it will be killed. You should use an accurate value for the time limit but include an extra 20% for safety.

A job script named job.slurm is submitted to the Slurm scheduler with the sbatch command:

$ sbatch job.slurm

The job must be submitted to the scheduler from the head node of a cluster. The scheduler will queue the job, where it will remain until it has sufficient priority to run on a compute node. Depending on the nature of the job and the available resources, the queue time can range from seconds to many days. When the job finishes, the user will receive an email (provided the --mail-type directives are included). To check the status of queued and running jobs, use the following command:

$ squeue -u <YourNetID>

To see the expected start times of your queued jobs:

$ squeue -u <YourNetID> --start

See Slurm scripts for Python, R, MATLAB, Julia and Stata.

 

Useful SLURM Commands

  • sbatch <slurm_script> - Submit a job (e.g., sbatch job.slurm)
  • salloc - Request an interactive allocation (see below)
  • srun - Parallel job launcher (the Slurm analog of mpirun)
  • scancel <jobid> - Cancel a job (e.g., scancel 2534640)
  • squeue - Show all jobs in the queue
  • squeue -u <username> - Show the queued jobs of a specific user (e.g., squeue -u aturing)
  • squeue --start - Report the expected start time of pending jobs
  • squeue -j <jobid> - Show the nodes allocated to a running job
  • snodes - Show properties of the cluster nodes (e.g., maximum memory)
  • sshare - Show cluster shares by group
  • sprio - Show the priority assigned to queued jobs
  • slurmtop - Text-based view of cluster nodes
  • scontrol show config - View default parameter settings

Download a command summary: PDF
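
As a quick illustration of the most common commands, a typical cycle might look like the following (the job ID shown is only an example):

$ sbatch job.slurm
Submitted batch job 2534640
$ squeue -u <YourNetID>          # check the status of the job in the queue
$ scancel 2534640                # cancel the job if something is wrong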

 

Serial Jobs

Serial jobs use only a single CPU-core. This is in contrast to parallel jobs which use multiple CPU-cores simultaneously.

#!/bin/bash
#SBATCH --job-name=slurm-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
Rscript myscript.R

Slurm scripts are more or less shell scripts with some extra parameters to set the resource requirements:

  • --nodes=1 - specify one node
  • --ntasks=1 - request one task (each task is allocated one CPU-core by default)
  • --time - set the run time limit, here 1 minute. Common formats are HH:MM:SS and DAYS-HH:MM:SS (see the examples below)
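
The --time directive accepts several formats; for example, both of the following are valid (the values are only illustrative):

#SBATCH --time=00:01:00          # 1 minute (HH:MM:SS)
#SBATCH --time=2-12:00:00        # 2 days and 12 hours (DAYS-HH:MM:SS)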

The other settings configure automated emails. You can delete these lines if you prefer not to receive emails.

As a Slurm job runs, unless you redirect output, a file named slurm-<jobid>.out will be produced in the directory where the sbatch command was run. You can use cat, less or any text editor to view it. The file contains the output your program would have written to a terminal if run interactively.
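
If you prefer a different name or location for this file, you can set it explicitly with the --output directive (as the job array example later on this page does), where %j expands to the job ID:

#SBATCH --output=myjob-%j.out    # write output to myjob-<jobid>.out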

If you are new to the HPC clusters or Slurm then see this tutorial for running your first job. Read this page to learn where to write the output files of your jobs.

 

Interactive Allocations with salloc

The head node of a cluster can only be used for very light interactive work using up to 10% of the machine (CPU-cores and memory) for up to 10 minutes. This is strictly enforced because violations of this rule can often adversely affect the work of other users. Intensive interactive work must be carried out on the compute nodes using the salloc command. To work interactively on a compute node with 1 CPU-core and 4 GB of memory for 20 minutes, use the following command:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00

As with batch jobs, interactive allocations go through the queuing system. This means that when the cluster is busy you will have to wait before the allocation is granted. You will see output like the following:

salloc: Pending job allocation 32280311
salloc: job 32280311 queued and waiting for resources
salloc: job 32280311 has been allocated resources
salloc: Granted job allocation 32280311
salloc: Waiting for resource configuration
salloc: Nodes della-r4c1n13 are ready for job
[aturing@della-r4c1n13 ~]$

Once the allocation is granted, you will be placed in a shell on a compute node where you can begin working interactively. Note that your allocation will be terminated when the time limit is reached. You can use the exit command to end the session and return to the head node at any time.

To request a node with a GPU:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --gres=gpu:1

If you are working with graphics, add the --x11 option to enable X11 forwarding.
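
For example, an interactive session with one GPU and X11 forwarding would combine the two options:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --gres=gpu:1 --x11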

 

Considerations

Some things to think about:

  • Make sure your Slurm script loads any dependencies or makes any needed path changes (e.g., if you need python3, run module load anaconda3).
  • Make sure you call your executable with its full path or cd to the appropriate directory.
  • If you call the executable directly, rather than through an interpreter such as python or Rscript or a launcher such as srun, make sure it has execute (+x) permissions (see the sketch after this list).
  • Think about file systems. Different ones are useful for different things, have different sizes, and they don't all talk to each other (e.g., /scratch is local to a specific node while /network/scratch/<YourNetID> is networked to the entire cluster; this has implications for temporary files and data).
  • /home has a default 10 GB quota on most clusters and should be used mostly for code and the packages needed to run tasks. On the Tigress clusters there is a shared GPFS filesystem and Adroit has scratch storage. You can request a quota increase for your /home directory if necessary for larger packages (see this page).
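
As a minimal sketch of the points above about paths and permissions (the script name and location here are hypothetical):

$ chmod +x /home/<YourNetID>/myproject/run_analysis.sh   # grant execute permission
$ /home/<YourNetID>/myproject/run_analysis.sh            # call it with its full path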
 

Large Memory Serial Jobs

One advantage of using the HPC clusters over your laptop or workstation is the large amount of RAM available per node. You can run a serial job with hundreds of gigabytes of memory, for example, which can be very useful for working with a large data set. To find out how much memory each node has, run the snodes command and look at the MEMORY column, which is given in megabytes.

#!/bin/bash
#SBATCH --job-name=slurm-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=100G               # memory per node (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7
conda activate micro-env

python myscript.py

The example above runs a Python script using 1 CPU-core and 100 GB of memory. In all Slurm scripts you should use an accurate value for the required memory but include an extra 20% for safety. For more on job memory see this page.
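
One way to estimate the memory requirement is to inspect a previous, completed run. Assuming job accounting is enabled on your cluster, something like the following should report the peak memory usage (MaxRSS) and elapsed time of a finished job (the job ID is only an example):

$ sacct -j 2534640 --format=JobID,MaxRSS,Elapsed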

 

Multithreaded Jobs

Some software, such as the linear algebra routines in NumPy and MATLAB, is able to use multiple CPU-cores via libraries that have been written using shared-memory parallel programming models like OpenMP, pthreads or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node with each thread using one CPU-core.

Below is an appropriate Slurm script for a multithreaded job:

#!/bin/bash
#SBATCH --job-name=multithread   # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:15:00          # maximum time needed (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module purge
module load matlab/R2019a

matlab -nodisplay -nosplash -r for_loop

In the script above the cpus-per-task parameter is used to tell Slurm to run the single task using four CPU-cores. In general, as cpus-per-task increases, the execution time of the job decreases while the queue time increases and the parallel efficiency decreases. The optimal value of cpus-per-task must be determined empirically. Here are some examples of multithreaded codes: C++, Python and MATLAB. A list of external resources for learning more about OpenMP is here.
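
Note that not every threaded library reads OMP_NUM_THREADS. A cautious sketch is to set the common thread-count variables to the Slurm value; the extra variables below are assumptions about which math library your code was built against, not something the script above requires:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK        # Intel MKL (used by some NumPy builds)
export OPENBLAS_NUM_THREADS=$SLURM_CPUS_PER_TASK   # OpenBLAS builds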

IMPORTANT: Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU-cores. Using a value of cpus-per-task greater than 1 for a code that has not been parallelized will not improve its performance. Instead you will waste resources and have a lower priority for your next job submission.

 

Multinode Jobs

Many scientific codes use a form of distributed-memory parallelism based on MPI (Message Passing Interface). These codes are able to use multiple CPU-cores on multiple nodes simultaneously. For example, the script below uses 32 CPU-cores on each of 2 nodes:

#!/bin/bash
#SBATCH --job-name=multinode     # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=32     # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load intel intel-mpi

srun /home/aturing/.local/bin/lmp_mpi

IMPORTANT: Only codes that have been explicitly written to run in parallel can take advantage of multiple cores on multiple nodes. Using a value of --ntasks greater than 1 for a code that has not been parallelized will not improve its performance. Instead you will waste resources and have a lower priority for your next job submission.

IMPORTANT: The optimal value of --nodes and --ntasks for a parallel code must be determined empirically. As these quantities increase, the parallel efficiency tends to decrease and queue times increase. The parallel efficiency is the serial execution time divided by the product of the parallel execution time and the number of tasks. For example, if a code takes 100 minutes to run in serial and 5 minutes using 32 tasks, its parallel efficiency is 100 / (5 × 32) = 62.5%. If multiple nodes are used then in most cases one should try to use all of the CPU-cores on each node.

You should try to use all the cores on one node before requesting additional nodes. That is, it is better to use one node and 32 cores than two nodes and 16 cores per node.

MPI jobs are composed of multiple processes running across one or more nodes. The processes coordinate through point-to-point and collective operations. More information about running MPI jobs is in Compiling and Running MPI Jobs.

The Princeton HPC clusters are configured to use srun which is the Slurm analog of mpirun or mpiexec.

 

Multinode, Multithreaded Jobs

Many codes combine multithreading with multinode parallelism using a hybrid OpenMP/MPI approach. Below is a Slurm script appropriate for such a code:

#!/bin/bash
#SBATCH --job-name=hybrid        # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=8      # total number of tasks per node
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module purge
module load intel intel-mpi

srun /home/aturing/.local/bin/mdrun_mpi

The script above allocates 2 nodes with 32 CPU-cores per node (8 tasks per node, each with 4 CPU-cores). For a hybrid OpenMP/MPI code, it would launch 8 MPI processes per node, and when an OpenMP parallel directive is encountered, each process would execute the work using 4 CPU-cores. For a simple C++ example of this see this page. For more details about the SBATCH options see this page.

As discussed above, the optimal values of nodes, ntasks-per-node and cpus-per-task must be determined empirically. Many codes that use the hybrid OpenMP/MPI model will run sufficiently fast on a single node.

 

Job Arrays

Job arrays are used for running the same job a large number of times with only slight differences between the jobs. For instance, let's say that you need to run 100 jobs, each with a different seed value for the random number generator. Or maybe you want to run the same analysis script on data for each of the 50 states in the USA. Job arrays are the best choice for such cases.

Below is an example Slurm script where there are 5 jobs in the array:

#!/bin/bash
#SBATCH --job-name=array-job     # create a short name for your job
#SBATCH --output=slurm-%A.%a.out # STDOUT file
#SBATCH --error=slurm-%A.%a.err  # STDERR file
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --array=0-4              # job array with index values 0, 1, 2, 3, 4
#SBATCH --mail-type=all          # send email on job start, end and fault
#SBATCH --mail-user=<YourNetID>@princeton.edu

echo "My SLURM_ARRAY_JOB_ID is $SLURM_ARRAY_JOB_ID."
echo "My SLURM_ARRAY_TASK_ID is $SLURM_ARRAY_TASK_ID"
echo "Executing on the machine:" $(hostname)

python myscript.py

The key line in the Slurm script above is:

#SBATCH --array=0-4

In this example, the Slurm script will run five jobs. Each job will have a different value of SLURM_ARRAY_TASK_ID (i.e., 0, 1, 2, 3, 4). The value of SLURM_ARRAY_TASK_ID can be used to differentiate the jobs within the array. One can either pass SLURM_ARRAY_TASK_ID to the executable as a command-line parameter or reference it as an environment variable. For instance, the first few lines of myscript.py might look like this:

import os
idx = int(os.environ["SLURM_ARRAY_TASK_ID"])
parameters = [2.5, 5.0, 7.5, 10.0, 12.5]
myparam = parameters[idx]
# execute the rest of the script using myparam

For an R script you can use:

idx <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))
parameters <- c(2.5, 5.0, 7.5, 10.0, 12.5)
myparam <- parameters[idx + 1]
# execute the rest of the script using myparam

For Julia:

idx = parse(Int64, ENV["SLURM_ARRAY_TASK_ID"])
parameters = (2.5, 5.0, 7.5, 10.0, 12.5)
myparam = parameters[idx + 1]
# execute the rest of the script using myparam
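
Alternatively, instead of reading the environment variable inside the script, you can pass the index on the command line and read it from the script's arguments (e.g., sys.argv in Python). The last line of the Slurm script would then look like this:

python myscript.py $SLURM_ARRAY_TASK_ID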

Each job in the array produces its own output and error files, named using the overall job ID (%A) and the individual task ID (%a) as specified by the --output and --error directives above (e.g., slurm-<jobid>.0.out through slurm-<jobid>.4.out for this example).

You can set the array numbers to any arbitrary set of numbers and ranges, for example:

#SBATCH --array=0,100,200,300,400,500
#SBATCH --array=0-24,42,56-99
#SBATCH --array=1-1000

Note that it is normal to see (QOSMaxJobsPerUserLimit) listed in the NODELIST(REASON) column of squeue output for job arrays. It indicates that you can only have a certain number of jobs actively queued. Just wait and all the jobs of the array will run. Use the qos command to see the limits. A maximum number of simultaneously running tasks from the job array may be specified using a "%" separator. For example "--array=0-15%4" will limit the number of simultaneously running tasks from this job array to 4.

To see the limit on the number of jobs in an array:

$ ssh della
$ scontrol show config | grep Array
MaxArraySize      = 2501

Each job in the array will have the same values for nodes, ntasks, cpus-per-task, time and so on. This means that job arrays can be used to handle everything from serial jobs to large multi-node cases.

See the Slurm documentation for more on job arrays.

 

Running Multiple Jobs in Parallel as a Single Job

In general one should use job arrays for this task, but in some cases different executables need to run simultaneously. In the example below all the executables are the same but this is not required. If we have, for example, three jobs and we want to run them in parallel as a single Slurm job, we can use the following script:

#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=3                # node count
#SBATCH --ntasks=3               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7

srun -N 1 -n 1 python demo.py 0 &
srun -N 1 -n 1 python demo.py 1 &
srun -N 1 -n 1 python demo.py 2 &
wait

Since we want to run the jobs in parallel, we place the & character at the end of each srun command so that each job runs in the background. The wait command serves as a barrier, keeping the overall job running until all of the background jobs are complete. Since by default the individual srun commands cannot share nodes, we need to request three nodes and three tasks, one for each srun. In the execution commands we then distribute the resources by giving each srun one task on one node.

Do not use this method if the tasks running within the overall job are expected to have significantly different execution times. Doing so would leave processors idle until the longest task finishes. Job arrays should always be used in place of this method when possible.

If you want to run a large number of short jobs while reusing the same Slurm allocation then see this example.

For more see "MULTIPLE PROGRAM CONFIGURATION" on this page.

 

GPUs

GPUs are available on Tiger, Adroit and Traverse. On Tiger and Traverse there are four GPUs on each GPU-enabled compute node. To use GPUs in a job add an SBATCH statement with the --gres option:

#!/bin/bash
#SBATCH --job-name=mnist         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=6        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7
conda activate tf2-gpu

python myscript.py

To use, for instance, four GPUs per node the appropriate line would be:

#SBATCH --gres=gpu:4

IMPORTANT: Only codes that have been explicitly written to run on GPUs can take advantage of GPUs. Adding the --gres option to a Slurm script for a CPU-only code will not speed up the execution time, but it will waste resources, increase your queue time and lower the priority of your next job submission. Furthermore, some codes are only written to use a single GPU. Do not request multiple GPUs unless your code can use them.

For additional tips on effectively using GPUs and to monitor GPU utilization on TigerGPU see this page.

On Adroit, use --gres=gpu:tesla_v100:1 to run on the V100 GPUs instead of the older K40c GPUs.
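
As a quick sanity check that a GPU was actually allocated to your job, you can run nvidia-smi on the compute node, for example from an interactive allocation:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:05:00 --gres=gpu:1
$ nvidia-smi                     # lists the GPU(s) visible to this allocation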

More information about GPUs and GPU programming at Princeton is here. For OpenACC see this material.

 

Efficiency Reports

If you include the appropriate SBATCH mail directives in your Slurm script then you will receive an email after each job finishes. Below is a sample report:

Job ID: 670018
Cluster: adroit
User/Group: ceisgrub/pres
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 22.73% of 00:00:22 core-walltime
Job Wall-clock time: 00:00:22
Memory Utilized: 1.41 MB
Memory Efficiency: 0.14% of 1.00 GB

The report provides information about run time, CPU usage, memory usage and so on. You should inspect these values to determine whether you are using the resources properly. Your queue time is in part determined by the amount of resources you are requesting. Your fairshare value, which in part determines the priority of your next job, decreases in proportion to the amount of resources requested over the previous 30 days.
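
If the seff utility (a Slurm contributed tool) is available on your cluster, a similar report can be generated on demand for a completed job, using the job ID from the example above:

$ seff 670018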

 

Why Won't My Job Run?

See this page to learn about job priority.

 

More Slurm Resources

See this page for a comprehensive list of external resources for learning Slurm.

 

Complete HPC Guide

For more on the Princeton HPC systems: Getting Started with the Research Computing Clusters