This page explains how to request memory in Slurm scripts and how to deal with common errors involving CPU and GPU memory. Note that "memory" on this website always refers to RAM and never storage space.
A common error to encounter when running jobs on the HPC clusters is
srun: error: tiger-i23g11: task 0: Out Of Memory
srun: Terminating job step 3955284.0
slurmstepd: error: Detected 1 oom-kill event(s) in step 3955284.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
This error indicates that your job tried to use more memory (RAM) than was requested by your Slurm script. By default, on most clusters, you are given 4 GB per CPU-core by the Slurm scheduler. If you need more or less than this then you need to explicitly set the amount in your Slurm script. The most common way to do this is with the following Slurm directive:
#SBATCH --mem-per-cpu=8G # memory per cpu-core
An alternative directive to specify the required memory is
#SBATCH --mem=2G # total memory per node
How do you know how much memory to request? For a simple code, one can look at the data structures that are used and calculate it by hand. For instance, a code that declares an array of 1 million elements in double precision will require 8 MB, since a double occupies 8 bytes. For other cases, such as a pre-compiled executable or a code that dynamically allocates memory during execution, estimating the memory requirement is much harder. In some cases an estimate can be obtained by running the code on a laptop or workstation and watching it with the Linux command htop -u $USER or, on a Mac, with Activity Monitor, which is found in /Applications/Utilities. If using htop, look at the RES column for the process of interest.
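The hand calculation for the array example can be sketched in a few lines of Python. The helper name array_memory_bytes is hypothetical and only illustrates the arithmetic; it counts the array data alone and ignores any overhead from the program itself.

```python
import struct

def array_memory_bytes(num_elements: int, bytes_per_element: int) -> int:
    """Return the memory needed to hold a dense array, in bytes."""
    return num_elements * bytes_per_element

# Size of a C double on this platform (8 bytes on common systems).
DOUBLE_SIZE = struct.calcsize("d")

needed = array_memory_bytes(1_000_000, DOUBLE_SIZE)
print(f"{needed / 1e6:.0f} MB")  # 1 million doubles -> 8 MB
```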
Checking the Memory Usage of a Running Job
Use the squeue -u $USER command to get the hostname of the compute node that the job is running on (see the rightmost column labeled "NODELIST(REASON)"). Then ssh to this node: ssh <hostname> (e.g., ssh tiger-i19g1). Finally, run htop -u $USER which will produce output like this:
   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
176776 aturing    21   1 4173M 3846M 13604 R 98.2  2.0  0:36.32 python myscript.py
The RES column shows the memory usage of the job. In this case it is using 3846M or 3.846 GB. To exit htop press Ctrl+C. Run the exit command to leave the compute node and return to the login node.
Another strategy to estimate the required memory is to start with the default 4 GB per CPU-core and run the job. If it runs successfully, look at the email report (see below) and adjust the required memory as needed. If it fails with an out-of-memory error, double the memory requirement and re-submit. Continue this procedure until the job runs successfully, then use the email report to set the value more accurately.
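As a rough illustration of the doubling procedure, the sketch below lists the successive values one might try for a single-core job. The function name doubling_schedule and the 192 GB node limit are assumptions chosen for the example; the loop simply stops before a request would exceed what one node can provide.

```python
def doubling_schedule(start_gb: int, node_limit_gb: int) -> list[int]:
    """List successive per-core memory requests, doubling each time,
    without exceeding the memory available on one node."""
    requests = []
    current = start_gb
    while current <= node_limit_gb:
        requests.append(current)
        current *= 2
    return requests

# Starting from the 4 GB default on a hypothetical 192 GB node:
for gb in doubling_schedule(4, 192):
    print(f"#SBATCH --mem-per-cpu={gb}G")
```

Each printed line is a candidate directive to try after the previous value failed with an out-of-memory error.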
To receive email reports, add the following lines to your Slurm script:
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-type=fail         # send email if job fails
#SBATCH --mail-user=<YourNetID>@princeton.edu
Be sure to substitute your NetID for <YourNetID> above.
Below is an example email report from Slurm:
Job ID: 4281297
Cluster: tiger2
User/Group: aturing/math
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:09
CPU Efficiency: 97.18% of 00:01:11 core-walltime
Job Wall-clock time: 00:01:11
Memory Utilized: 1.89 GB
Memory Efficiency: 47.37% of 4.00 GB
We see from the report that the job used only 1.89 GB but requested 4 GB, resulting in a memory efficiency of only 47.37%. In this case the correct action would be to decrease the requested memory by using something like #SBATCH --mem-per-cpu=3G. It is wise to request somewhat more memory than you actually need, for safety, since the job will fail if it runs out of memory. However, it is important not to request an excessive amount because that makes it harder for the job scheduler to schedule the job, which results in longer queue times.
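The adjustment described here can be sketched as a small Python script that reads the two memory lines of a report and proposes a new request. The helper names and the 20% safety margin are assumptions made for illustration, not a site policy, and the parser only handles reports that give both values in GB for a single-core job.

```python
import math
import re

REPORT = """\
Memory Utilized: 1.89 GB
Memory Efficiency: 47.37% of 4.00 GB
"""

def parse_memory_gb(report: str) -> tuple[float, float]:
    """Extract (used_gb, requested_gb) from a report that gives both in GB."""
    used = re.search(r"Memory Utilized:\s*([\d.]+)\s*GB", report)
    requested = re.search(r"Memory Efficiency:.*of\s*([\d.]+)\s*GB", report)
    if used is None or requested is None:
        raise ValueError("memory lines not found or not reported in GB")
    return float(used.group(1)), float(requested.group(1))

def suggest_request_gb(used_gb: float, margin: float = 0.2) -> int:
    """Round used memory plus a safety margin up to a whole number of GB."""
    return math.ceil(used_gb * (1.0 + margin))

used_gb, requested_gb = parse_memory_gb(REPORT)
print(f"#SBATCH --mem-per-cpu={suggest_request_gb(used_gb)}G")
```

With the values from the example report, 1.89 GB plus 20% rounds up to 3 GB, which matches the directive suggested above.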
Another way to see the memory usage of a completed job is to use the seff command:
$ seff <JobID>
The JobID can be obtained from the Slurm output file in the directory where the job was run (e.g., seff 4281297). Note that this command will produce inaccurate values when used on actively running jobs.
Memory per Cluster
To find out how much memory there is per node on a given cluster, use the snodes command and look at the MEMORY column which lists values in units of MB. Note that some of the nodes may not be available to you since they were purchased by certain groups or departments.
The CPU nodes of Tiger have 192 or 768 GB of memory, leading to 4.8 or 19.2 GB per CPU-core, respectively, while the GPU nodes have 256 GB, or 9.1 GB per CPU-core. Perseus nodes have 128 GB, or 4.6 GB per CPU-core. Traverse nodes have 250 GB, or 1.95 GB per hardware thread. On Della, it varies from 128 to 190 GB, or 4.6 to 5.9 GB per CPU-core. Members of the physics group have access to additional nodes with 380 GB of memory. Della also has a few high-memory nodes that belong to CSML but are available to all users when not in use. There is one node with 1.51 TB, ten nodes with 3 TB and three with 6.15 TB. These may only be used for jobs that cannot be run on the regular nodes. The Adroit node adroit-h11g1 has 770 GB of RAM, 40 CPU-cores and four V100 GPUs. The Intel nodes on Stellar have 768 GB, or 8 GB per core.
Note that you can request much more memory than the per CPU-core value, up to the total memory of a node. For instance, you could request 1 CPU-core and 50 GB of memory on any of the nodes mentioned above. Be aware that snodes uses the convention that 1 MB equals 1024 kilobytes (binary). If you are requesting all the memory of a node then you have to take this into account when specifying the value. The simple solution is to specify the value in megabytes (e.g., #SBATCH --mem=192000M).
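The unit pitfall can be seen with a short calculation. The sketch below assumes that Slurm interprets the G suffix as 1024 M and that a nominal 192 GB node is listed with 192000 in the MEMORY column; both are assumptions for illustration, so check the actual snodes output for your cluster.

```python
def slurm_gigabytes_to_megabytes(gigabytes: int) -> int:
    """Convert a request in G to M, assuming 1 G = 1024 M."""
    return gigabytes * 1024

node_memory_mb = 192_000  # hypothetical MEMORY value for a nominal 192 GB node
request_mb = slurm_gigabytes_to_megabytes(192)

print(request_mb)                   # 196608
print(request_mb > node_memory_mb)  # True: the request exceeds the node
```

Under these assumptions a request of 192G asks for more than the node has, which is why giving the value directly in megabytes is the safer choice.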
In summary, if you request too little memory then your job will fail with an out-of-memory (OOM) error. If you request an excessive amount then the job will run successfully but you may have to wait slightly longer than necessary for it to start. Use Slurm email reports and seff to set the requested memory for future jobs. In doing so be sure to request slightly more memory than you think you will need for safety.
Memory Usage versus Time
See the memory_profiler Python package to monitor memory usage over time as well as line-by-line in your Python script. It is available via conda or pip. The MAP profiler can also be used to measure the memory usage as a function of time.
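For a quick check that needs no extra packages, the standard library module tracemalloc can report the peak memory allocated through Python's own allocators. This is a different tool from memory_profiler and MAP, shown here only as a minimal sketch; the helper peak_memory_bytes and the stand-in workload build_buffer are hypothetical names.

```python
import tracemalloc

def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

def build_buffer(n_bytes):
    # Stand-in workload: allocate a zero-filled buffer of n_bytes.
    return bytearray(n_bytes)

_, peak = peak_memory_bytes(build_buffer, 10 * 1024 * 1024)
print(f"peak: {peak / 1024**2:.1f} MiB")
```

Memory allocated by compiled libraries outside Python's allocators may not be counted, so treat the result as a lower bound on what the job needs.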
Just as a CPU has its own memory so does a GPU. GPU memory is much smaller than CPU memory. For instance, each GPU on the TigerGPU cluster has only 16 GB of memory compared to the 256 GB available to the CPU-cores. The Traverse nodes have 32 GB per GPU which is the same as the V100 node of Adroit.
Requesting more GPU memory than what is available will result in an error. Here is an example for PyTorch:
Traceback (most recent call last):
  File "mem.py", line 8, in <module>
    y = torch.randn(N, dtype=torch.float64, device=torch.device('cuda:0'))
RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)
srun: error: tiger-i23g11: task 0: Exited with exit code 1
srun: Terminating job step 3955266.0
There are no Slurm directives for specifying the GPU memory. In the event of an out-of-memory (OOM) error, one must modify the application script or the application itself to resolve the error. When training neural networks, the most common cause of out-of-memory errors on the GPU is using too large a batch size.
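One way to act on this is to retry with a smaller batch size whenever the GPU runs out of memory. The sketch below is framework-agnostic: run_with_smaller_batches and fake_step are hypothetical names, and the check for the text "out of memory" is based on the RuntimeError message in the PyTorch example above. A real training script may need additional cleanup between attempts.

```python
def run_with_smaller_batches(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each GPU OOM error.

    Returns the (result, batch_size) pair of the first successful call.
    """
    while True:
        try:
            return step(batch_size), batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory error.
            if "out of memory" not in str(err):
                raise
            # Give up once the smallest allowed batch size has failed.
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)

def fake_step(batch_size):
    # Stand-in for a training step: pretends batches above 64 do not fit.
    if batch_size > 64:
        raise RuntimeError("CUDA out of memory.")
    return "ok"

print(run_with_smaller_batches(fake_step, 512))  # ('ok', 64)
```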
Slurm email reports and seff say nothing about the amount of GPU memory being used by a job. To see this value one must SSH to the compute node where the job is running and run the nvidia-smi command. For more on this see the bottom of the TigerGPU Utilization page.
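Once on the compute node, the same information can be collected from a script. The sketch below uses the nvidia-smi query options --query-gpu and --format=csv,noheader,nounits, which report values in MiB; the helper names are hypothetical and the sample line is made up so that the parsing can be tried on a machine without a GPU.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) ints."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage

def query_gpu_memory():
    """Run nvidia-smi on the current node; requires an NVIDIA GPU and driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(out.stdout)

# Made-up sample output for one GPU, used here so the sketch runs anywhere.
sample = "9155, 16280\n"
for used, total in parse_gpu_memory(sample):
    print(f"{used} MiB of {total} MiB ({100 * used / total:.0f}%)")
```

On a GPU node, calling query_gpu_memory() returns one (used, total) pair per GPU visible on that node.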