On all of the cluster systems (except Nobel and Tigressdata), you run programs by submitting scripts to the Slurm job scheduler. A Slurm script must do three things: (1) prescribe the resource requirements for the job, (2) set the environment and (3) specify the work to be carried out in the form of shell commands.
Below is a sample Slurm script for running a Python script:
#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7
conda activate pytools-env

python myscript.py
The first line of a Slurm script specifies the Unix shell to be used. This is followed by a series of #SBATCH directives which set the resource requirements and other parameters of the job. While many directives are optional, a Slurm script is required to set the number of nodes, number of tasks and time. The script above requests 1 CPU-core and 4 GB of memory for 1 minute of run time. The necessary changes to the environment are made by loading the anaconda3 environment module and activating a particular Conda environment. Lastly, the work to be done, which is the execution of a Python script, is specified in the final line.
See below for information about the correspondence between tasks and CPU-cores. If your job fails to finish before the specified time limit then it will be killed. You should use an accurate value for the time limit but include an extra 20% for safety.
A job script named job.slurm is submitted to the Slurm scheduler with the sbatch command:
$ sbatch job.slurm
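If the script is accepted, sbatch prints the ID that was assigned to the job (the ID below is hypothetical):

Submitted batch job 1234567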
The job must be submitted to the scheduler from the head node of a cluster. The scheduler will queue the job, where it will remain until it has sufficient priority to run on a compute node. Depending on the nature of the job and the available resources, the queue time can vary from seconds to many days. When the job finishes, the user will receive an email (if the mail directives are set as above). To check the status of queued and running jobs, use the following command:
$ squeue -u <YourNetID>
To see the expected start times of your queued jobs:
$ squeue -u <YourNetID> --start
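Typical squeue output looks like the following, where ST is the job state (PD for pending, R for running); the exact columns depend on the cluster configuration and the entries here are hypothetical:

JOBID    PARTITION  NAME   USER     ST  TIME  NODES  NODELIST(REASON)
1234567  cpu        myjob  aturing  R   2:32  1      della-r3c1n5
1234568  cpu        myjob  aturing  PD  0:00  1      (Priority)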
See Slurm scripts for Python, R, MATLAB, Julia and Stata.
Useful Slurm Commands
| Command | Description |
| sbatch <slurm_script> | Submit a job (e.g., sbatch job.slurm) |
| salloc | Interactive allocation (see below) |
| srun | Parallel job launcher (Slurm analog of mpirun) |
| scancel <jobid> | Cancel a job (e.g., scancel 2534640) |
| squeue | Show all jobs in the queue |
| squeue -u <username> | Show jobs in the queue for a specific user (e.g., squeue -u aturing) |
| squeue --start | Report the expected start time for pending jobs |
| squeue -j <jobid> | Show the nodes allocated to a running job |
| snodes | Show properties of the nodes of a cluster (e.g., maximum memory) |
| sshare | Show the cluster shares by group |
| sprio | Show the priority assigned to queued jobs |
| slurmtop | Text-based view of cluster nodes |
| scontrol show config | View default parameter settings |
Download a command summary: PDF
Serial Jobs
Serial jobs use only a single CPU-core. This is in contrast to parallel jobs which use multiple CPU-cores simultaneously.
#!/bin/bash
#SBATCH --job-name=slurm-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge

Rscript myscript.R
Slurm scripts are more or less shell scripts with some extra parameters to set the resource requirements:
- --nodes=1 - specify one node
- --ntasks=1 - claim one task (by default each task gets 1 CPU-core)
- --time - claim a time allocation, here 1 minute. The full format is DAYS-HOURS:MINUTES:SECONDS; the shorter HH:MM:SS form used in the script above is also accepted
The other settings configure automated emails. You can delete these lines if you prefer not to receive emails.
As a Slurm job runs, unless you redirect output, a file named slurm-<jobid>.out will be produced in the directory where the sbatch command was run. You can use cat, less or any text editor to view it. The file contains the output your program would have written to a terminal if run interactively.
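To watch the output while the job is still running, you can follow the file (the job ID below is hypothetical):

$ tail -f slurm-1234567.out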
If you are new to the HPC clusters or Slurm then see this tutorial for running your first job. Read this page to know where to write the output files of your jobs.
Interactive Allocations with salloc
The head node of a cluster can only be used for very light interactive work, using up to 10% of the machine (CPU-cores and memory) for up to 10 minutes. This is strictly enforced because violations of this rule adversely affect the work of other users. Intensive interactive work must be carried out on the compute nodes using the salloc command. To work interactively on a compute node with 1 CPU-core and 4 GB of memory for 20 minutes, use the following command:
$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00
As with batch jobs, interactive allocations go through the queuing system. This means that when the cluster is busy you will have to wait before the allocation is granted. You will see output like the following:
salloc: Pending job allocation 32280311
salloc: job 32280311 queued and waiting for resources
salloc: job 32280311 has been allocated resources
salloc: Granted job allocation 32280311
salloc: Waiting for resource configuration
salloc: Nodes della-r4c1n13 are ready for job
[aturing@della-r4c1n13 ~]$
After the wait time, you will be placed in a shell on a compute node where you can begin working interactively. Note that your allocation will be terminated when the time limit is reached. You can use the exit command to end the session and return to the head node at any time.
To request a node with a GPU:
$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --gres=gpu:1
If you are working with graphics, add the --x11 option to enable X11 forwarding.
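For example, to combine X11 forwarding with the interactive allocation above:

$ salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --x11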
Considerations
Some things to think about:
- Make sure your Slurm script loads any dependencies or sets any needed paths (e.g., if you need python3, add module load anaconda3)
- Make sure you call your executable with its full path or cd to the appropriate directory.
- If you call the executable directly, rather than through a launcher or interpreter such as srun, python or Rscript, make sure it has execute (+x) permissions (see the example after this list).
- Think about file systems. Different ones are useful for different things, have different sizes, and they don't all talk to each other (e.g., /scratch is local to a specific node, while /network/scratch/<YourNetID> is visible across the entire cluster; this has implications for temporary files and data).
- /home has a default 10 GB quota on most clusters and should be used mostly for code and the packages needed to run tasks. On the Tigress clusters there is a shared GPFS filesystem and Adroit has scratch storage. You can request a quota increase for your /home directory if necessary for larger packages (see this page).
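For instance, a minimal sketch covering the execute-permission and full-path points above (the program path is hypothetical):

$ chmod +x /home/aturing/myprogram
$ /home/aturing/myprogram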
Large Memory Serial Jobs
One advantage of using the HPC clusters over your laptop or workstation is the large amount of RAM available per node. You can run a serial job with hundreds of GB of memory, for example, which can be very useful for working with a large data set. To find out how much memory each node has, run the snodes command and look at the MEMORY column, which is in megabytes.
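The standard sinfo command reports similar information if you prefer it; the format string below is just one illustrative choice:

$ sinfo -N -o "%N %m"    # one line per node: node name and memory in MB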
#!/bin/bash
#SBATCH --job-name=slurm-test    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=100G               # memory per node (4G per cpu-core is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7
conda activate micro-env

python myscript.py
The example above runs a Python script using 1 CPU-core and 100 GB of memory. In all Slurm scripts you should use an accurate value for the required memory but include an extra 20% for safety. For more on job memory see this page.
Multithreaded Jobs
Some software, like the linear algebra routines in NumPy and MATLAB, is able to use multiple CPU-cores via libraries that have been written using shared-memory parallel programming models like OpenMP, pthreads or Intel Threading Building Blocks (TBB). OpenMP programs, for instance, run as multiple "threads" on a single node with each thread using one CPU-core.
Below is an appropriate Slurm script for a multithreaded job:
#!/bin/bash
#SBATCH --job-name=multithread   # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:15:00          # maximum time needed (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module purge
module load matlab/R2019a

matlab -nodisplay -nosplash -r for_loop
In the script above the cpus-per-task parameter is used to tell Slurm to run the single task using four CPU-cores. In general, as cpus-per-task increases, the execution time of the job decreases while the queue time increases and the parallel efficiency decreases. The optimal value of cpus-per-task must be determined empirically. Here are some examples of multithreaded codes: C++, Python and MATLAB. A list of external resources for learning more about OpenMP is here.
IMPORTANT: Only codes that have been explicitly written to use multiple threads will be able to take advantage of multiple CPU-cores. Using a value of cpus-per-task greater than 1 for a code that has not been parallelized will not improve its performance. Instead you will waste resources and have a lower priority for your next job submission.
Multinode Jobs
Many scientific codes use a form of distributed-memory parallelism based on MPI (Message Passing Interface). These codes are able to use multiple CPU-cores on multiple nodes simultaneously. For example, the script below uses 32 CPU-cores on each of 2 nodes:
#!/bin/bash
#SBATCH --job-name=multinode     # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=32     # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load intel intel-mpi

srun /home/aturing/.local/bin/lmp_mpi
IMPORTANT: Only codes that have been explicitly written to run in parallel can take advantage of multiple cores on multiple nodes. Using a value of --ntasks greater than 1 for a code that has not been parallelized will not improve its performance. Instead you will waste resources and have a lower priority for your next job submission.
IMPORTANT: The optimal value of --nodes and --ntasks for a parallel code must be determined empirically. As these quantities increase, the parallel efficiency tends to decrease and queue times increase. The parallel efficiency is the serial execution time divided by the product of the parallel execution time and the number of tasks. If multiple nodes are used then in most cases one should try to use all of the CPU-cores on each node.
MPI jobs are composed of multiple processes running across one or more nodes. The processes coordinate through point-to-point and collective operations. More information about running MPI jobs is in Compiling and Running MPI Jobs.
The Princeton HPC clusters are configured to use srun which is the Slurm analog of mpirun or mpiexec.
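As a minimal sketch of the typical workflow, where the source file hello_mpi.c and the exact module names are assumptions that vary by cluster:

$ module load intel intel-mpi        # load a compiler and an MPI library
$ mpicc hello_mpi.c -o hello_mpi     # compile with the MPI compiler wrapper
$ sbatch job.slurm                   # the Slurm script launches it with: srun ./hello_mpi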
Multinode, Multithreaded Jobs
Many codes combine multithreading with multinode parallelism using a hybrid OpenMP/MPI approach. Below is a Slurm script appropriate for such a code:
#!/bin/bash
#SBATCH --job-name=hybrid        # create a short name for your job
#SBATCH --nodes=2                # node count
#SBATCH --ntasks-per-node=8      # total number of tasks per node
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

module purge
module load intel intel-mpi

srun /home/aturing/.local/bin/mdrun_mpi
The script above allocates 2 nodes with 32 CPU-cores per node. It launches 8 MPI processes per node, and when an OpenMP parallel directive is encountered, each process executes the work using 4 CPU-cores. For a simple C++ example of this see this page. For more details about the SBATCH options see this page.
As discussed above, the optimal values of nodes, ntasks-per-node and cpus-per-task must be determined empirically. Many codes that use the hybrid OpenMP/MPI model will run sufficiently fast on a single node.
Job Arrays
Job arrays are used for running the same job a large number of times with only slight differences between the jobs. For instance, let's say that you need to run 100 jobs, each with a different seed value for the random number generator. Or maybe you want to run the same analysis script on data for each of the 50 states in the USA. Job arrays are the best choice for such cases.
Below is an example Slurm script where there are 5 jobs in the array:
#!/bin/bash
#SBATCH --job-name=array-job     # create a short name for your job
#SBATCH --output=slurm-%A.%a.out # STDOUT file
#SBATCH --error=slurm-%A.%a.err  # STDERR file
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --array=0-4              # job array with index values 0, 1, 2, 3, 4
#SBATCH --mail-type=all          # send email on job start, end and fault
#SBATCH --mail-user=<YourNetID>@princeton.edu

echo "My SLURM_ARRAY_JOB_ID is $SLURM_ARRAY_JOB_ID."
echo "My SLURM_ARRAY_TASK_ID is $SLURM_ARRAY_TASK_ID"
echo "Executing on the machine:" $(hostname)

python myscript.py $SLURM_ARRAY_TASK_ID
The first few lines of myscript.py might look like this:
import sys

idx = int(sys.argv[-1])  # get the value of SLURM_ARRAY_TASK_ID
parameters = [2.5, 5.0, 7.5, 10.0, 12.5]
myparam = parameters[idx]

# execute the rest of the script using myparam
For an R script you can use:
args <- commandArgs(TRUE)
idx <- as.numeric(args[1])
parameters <- c(2.5, 5.0, 7.5, 10.0, 12.5)
myparam <- parameters[idx + 1]

# execute the rest of the script using myparam
Job arrays produce output files named with both the overall job ID and the individual task ID. You can set the array indices to any arbitrary set of numbers, so you can process a subset of a larger list by referencing the value of $SLURM_ARRAY_TASK_ID. For example:
#SBATCH --array=0,100,200,300,400,500

./myprogram $SLURM_ARRAY_TASK_ID
This snippet shows a six-task array that passes increments of 100 to the program. Each sub-job can then start processing a data frame at row 0, 100, 200, 300, 400 or 500. If each task handles 100 rows and they run in parallel, together they would cover 600 rows.
Note that it is normal to see (QOSMaxJobsPerUserLimit) listed in the NODELIST(REASON) column of squeue output for array jobs. It indicates that you can only have a certain number of jobs actively queued. Just wait and all the jobs of the array will run.
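If you want to limit concurrency yourself, the --array syntax also accepts a throttle that caps how many tasks run simultaneously. For example, the following directive allows at most 10 of the 100 tasks to run at once:

#SBATCH --array=0-99%10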
Each job in the array will have the same values for nodes, ntasks, cpus-per-task, time and so on. This means that job arrays can be used to handle everything from serial jobs to large multi-node cases.
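Individual tasks of an array can be cancelled by appending the task ID to the job ID (the IDs below are hypothetical):

$ scancel 1234567_3    # cancel only task 3 of the array
$ scancel 1234567      # cancel the entire array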
To see the limit on the number of jobs in an array:
# ssh della
$ scontrol show config | grep Array
MaxArraySize            = 2501
Running Multiple Jobs in Parallel as a Single Job
In general one should use job arrays for this task, but in some cases different executables need to run simultaneously. In the example below all the executables are the same but this is not required. If we have, for example, three jobs and we want to run them in parallel as a single Slurm job, we can use the following script:
#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=3                # node count
#SBATCH --ntasks=3               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7

srun -N 1 -n 1 python demo.py 0 &
srun -N 1 -n 1 python demo.py 1 &
srun -N 1 -n 1 python demo.py 2 &
wait
Since we want to run the jobs in parallel, we place the & character at the end of each srun command so that each one runs in the background. The wait command serves as a barrier, keeping the overall job alive until all of the job steps are complete. Since srun job steps cannot share nodes by default, we need to request three nodes and three tasks, one for each srun. In the execution commands we then distribute the resources by giving each srun one task on one node.
Do not use this method if the tasks running within the overall job are expected to have significantly different execution times. Doing so would result in idle processors until the longest task finished. Job arrays should always be used in place of this method when possible.
If you want to run a large number of short jobs while reusing the same Slurm allocation then see this example.
For more see "MULTIPLE PROGRAM CONFIGURATION" on this page.
GPUs
GPUs are available on Tiger, Adroit and Traverse. On Tiger and Traverse there are four GPUs on each GPU-enabled compute node. To use GPUs in a job add an SBATCH statement with the --gres option:
#!/bin/bash
#SBATCH --job-name=mnist         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=6        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:01:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

module purge
module load anaconda3/2020.7
conda activate tf2-gpu

python myscript.py
To use, for instance, four GPUs per node the appropriate line would be:
#SBATCH --gres=gpu:4
IMPORTANT: Only codes that have been explicitly written to run on GPUs can take advantage of GPUs. Adding the --gres option to a Slurm script for a CPU-only code will not speed up the execution time but it will waste resources, increase your queue time and lower the priority of your next job submission. Furthermore, some codes are written to use only a single GPU. Do not request multiple GPUs unless your code can use them.
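To confirm that your job actually sees the GPU(s) it requested, you can add a diagnostic command such as nvidia-smi to the Slurm script before launching your code; in most configurations Slurm limits the visible devices to those allocated to the job:

nvidia-smi    # list the GPUs visible to this job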
For additional tips on effectively using GPUs and to monitor GPU utilization on TigerGPU see this page.
On Adroit, use --gres=gpu:tesla_v100:1 to run on the V100 GPUs instead of the older K40c GPUs.
More information about GPUs and GPU programming at Princeton is here. For OpenACC see this material.
Efficiency Reports
If you include the appropriate SBATCH mail directives in your Slurm script then you will receive an email after each job finishes. Below is a sample report:
Job ID: 670018
Cluster: adroit
User/Group: ceisgrub/pres
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:00:05
CPU Efficiency: 22.73% of 00:00:22 core-walltime
Job Wall-clock time: 00:00:22
Memory Utilized: 1.41 MB
Memory Efficiency: 0.14% of 1.00 GB
The report provides information about run time, CPU usage, memory usage and so on. You should inspect these values to determine if you are using the resources properly. Your queue time is in part determined by the amount of resources you are requesting. Your fairshare value, which in part determines the priority of your next job, is decreased in proportion to the amount of resources requested in the previous 30 days.
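A similar report can usually be generated on demand for a completed job with the contributed seff utility, assuming it is installed on the cluster (the job ID below is hypothetical):

$ seff 1234567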
Why Won't My Job Run?
See this page to learn about job priority.
More Slurm Resources
See this page for a comprehensive list of external resources for learning Slurm.
Complete HPC Guide
For more on the Princeton HPC systems: Getting Started with the Research Computing Clusters