Memory Allocation for Slurm Scripts Explained

This page explains how to request memory in Slurm scripts and how to deal with common errors involving CPU and GPU memory. Note that "memory" on the Research Computing website always refers to RAM and never storage space for files.

CPU Memory

A common error to encounter when running jobs on the HPC clusters is

srun: error: tiger-i23g11: task 0: Out Of Memory
srun: Terminating job step 3955284.0
slurmstepd: error: Detected 1 oom-kill event(s) in step 3955284.0 cgroup. Some of
your processes may have been killed by the cgroup out-of-memory handler.

This error indicates that your job tried to use more memory (RAM) than was requested in your Slurm script. By default, on most clusters, the Slurm scheduler gives you 4 GB per CPU-core. If you need more or less than this then you must explicitly set the amount in your Slurm script. The most common way to do this is with the following Slurm directive:

#SBATCH --mem-per-cpu=8G   # memory per cpu-core

An alternative directive to specify the required memory is

#SBATCH --mem=2G           # total memory per node
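
For context, here is a minimal sketch of a complete Slurm script that uses the first form of the directive; the job name, time limit and python command are placeholders to replace with your own:

#!/bin/bash
#SBATCH --job-name=myjob         # job name (placeholder)
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks=1               # total number of tasks
#SBATCH --cpus-per-task=1        # cpu-cores per task
#SBATCH --mem-per-cpu=8G         # memory per cpu-core
#SBATCH --time=01:00:00          # total run time limit (HH:MM:SS)

python myscript.py               # replace with the command that runs your code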

Open OnDemand

Select the amount of CPU memory when creating the session.


Estimating Memory Requirements

How do you know how much memory to request? For a simple code, one can look at the data structures that are used and calculate it by hand. For instance, a code that declares an array of 1 million elements in double precision will require 8 MB since a double requires 8 bytes. For other cases, such as a pre-compiled executable or a code that dynamically allocates memory during execution, estimating the memory requirement is much harder. Two approaches are described next.
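
To make the hand calculation concrete, the 8 MB figure from the example above can be checked with a quick command (using decimal megabytes, i.e., 1 MB = 10^6 bytes):

$ python -c "print(10**6 * 8 / 10**6, 'MB')"   # 1 million doubles at 8 bytes each
8.0 MB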

Checking the Memory Usage of a Running Job

The easiest way to see the memory usage of a job is to use the "jobstats" command on a given JobID:

$ jobstats 1234567

The JobID can be obtained by running the "shistory" command. Learn more about Slurm job statistics. One can also see how the memory usage changes over time using the Jobstats web interface. This is useful for detecting memory leaks where the memory steadily increases over time until the limit is reached and the job is canceled.
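
For example, a typical sequence on the command line might be (the JobID below is hypothetical):

$ shistory                 # list your recent jobs and their JobIDs
$ jobstats 1234567         # memory and CPU utilization summary for that job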

In some cases an estimate of the required memory can be obtained by running the code on a laptop or workstation and using the Linux command htop -u $USER or, on a Mac, the Activity Monitor, which is found in /Applications/Utilities. If using htop, look at the RES column for the process of interest.

When doing this on a Research Computing cluster one can use the stats.rc web interface.

This can also be done as follows. Use the squeue -u $USER command to get the hostname of the compute node that the job is running on (see the rightmost column labeled "NODELIST(REASON)"). Then ssh to this node: ssh <hostname> (e.g., ssh tiger-i19g1). Finally, run htop -u $USER which will produce output like this:

   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
176776 aturing    21   1 4173M 3846M 13604 R 98.2  2.0  0:36.32 python myscript.py

The RES column shows the memory usage of the job. In this case it is using 3846M or 3.846 GB. To exit htop press Ctrl+C. Run the exit command to leave the compute node and return to the login node.
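
Putting the steps above together, a session might look like this (the hostname is only an example; use the node reported by squeue):

$ squeue -u $USER          # find the node under NODELIST(REASON)
$ ssh tiger-i19g1          # connect to that compute node
$ htop -u $USER            # the RES column shows resident memory per process
$ exit                     # return to the login node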

Empirical Approach

The second strategy for estimating the required memory is to start with the default 4 GB per CPU-core and run the job. If it runs successfully then look at the email report (see below) and adjust the requested memory as needed. If it fails with an out-of-memory error then double the memory request and re-submit. Continue this procedure until the job runs successfully, then use the email report to set the value more accurately.
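
In practice the iteration might look like this, where the numbers are purely illustrative:

#SBATCH --mem-per-cpu=4G   # attempt 1: default value; job fails with an OOM error
#SBATCH --mem-per-cpu=8G   # attempt 2: double the request; job completes
#SBATCH --mem-per-cpu=6G   # future jobs: email report showed about 5 GB used per core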

To receive email reports, add the following lines to your Slurm script:

#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-type=fail         # send email if job fails
#SBATCH --mail-user=<YourNetID>@princeton.edu

Be sure to substitute your NetID for <YourNetID> above.

Below is an example email report from Slurm:

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
  NetID/Account: aturing/math
       Job Name: myjob
          State: RUNNING
          Nodes: 1
      CPU Cores: 4
     CPU Memory: 20GB (5GB per CPU-core)
  QOS/Partition: medium/cpu
        Cluster: della
     Start Time: Sun Jun 26, 2022 at 1:34 PM
       Run Time: 1-01:18:59 (in progress)
     Time Limit: 2-23:59:00
                              Overall Utilization
================================================================================
  CPU utilization  [|||||||||||||||||||||||||||||||||||||||||||||||97%]
  CPU memory usage [|||||||||||||||                                31%]
                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      della-i13n7: 4-02:20:54/4-05:15:58 (efficiency=97.1%)
  CPU memory usage per node - used/allocated
      della-i13n7: 6.0GB/19.5GB (1.5GB/4.9GB per core of 4)
                                     Notes
================================================================================
  * For additional job metrics including metrics plotted against time:
    https://mydella.princeton.edu/pun/sys/jobstats  (VPN required off-campus)

We see from the report that the job used only 6.0 GB of the 20 GB (5 GB per CPU-core) that it requested, giving a memory efficiency of only about 31%. In this case the correct action would be to decrease the requested memory by using something like #SBATCH --mem-per-cpu=2G. It is wise to request more memory than you actually need for safety. Remember that the job will fail if it runs out of memory. However, it is important not to request an excessive amount because that makes it harder for the job scheduler to schedule the job, which results in longer queue times.

In summary, if you request too little memory then your job will fail with an out-of-memory (OOM) error. If you request an excessive amount then the job will run successfully but you may have to wait slightly longer than necessary for it to start. Use Slurm email reports and jobstats to set the requested memory for future jobs. In doing so be sure to request slightly more memory than you think you will need for safety.
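
To make the adjustment concrete, one rule of thumb (an assumption here, not an official policy) is to take the measured per-core usage from the report and add roughly 30% headroom:

$ python -c "print(round(1.5 * 1.3, 2), 'GB per core')"   # 1.5 GB used per core plus ~30% headroom
1.95 GB per core

Rounding up gives the #SBATCH --mem-per-cpu=2G value suggested above.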

Memory per Cluster

To find out how much memory there is per node on a given cluster, use the snodes command and look at the MEMORY column which lists values in units of MB. You can also use the shownodes command. Note that some of the nodes may not be available to you since they were purchased by certain groups or departments.
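
For example, on a login node:

$ snodes                   # the MEMORY column gives per-node memory in MB
$ shownodes                # another view of the nodes and their resources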

Cluster                  Memory per Node (GB)    Memory per CPU-core (GB)
Adroit (skylake)         384                     12 (384/32)
Adroit (broadwell)       128                     4.6 (128/28)
Adroit (V100 node)       770                     19.3 (770/40)
Adroit (A100 node)       1000                    31.3 (1000/32)
Della (broadwell)        128                     4.6 (128/28)
Della (cascade)          190                     5.9 (190/32)
Della (GPU)              768                     6 (768/128)
Della (physics)          380                     9.5 (380/40)
Della (large memory)     1510, 3080 or 6150      31.5 (1510/48), 32.1 (3080/96) or 64.1 (6150/96)
Stellar (Intel)          768                     8 (768/96)
Stellar (AMD)            512                     4 (512/128)
Stellar (GPU)            512                     4 (512/128)
Stellar (bigmem)         4000                    31.2 (4000/128)
Tiger                    192 or 768              4.8 (192/40) or 19.2 (768/40)
Traverse                 250                     1.95 (250/128)

The Adroit node adroit-h11g1 has 770 GB of RAM, 40 CPU-cores and four V100 GPUs. Each V100 GPU has 32 GB of memory. Members of the physics group on Della have access to additional nodes with 380 GB of memory. Della also has a few high-memory nodes that belong to CSML but are available to all users when not in use. There is one node with 1.51 TB, ten nodes with 3 TB and three with 6.15 TB. These may only be used for jobs that cannot be run on the regular nodes. Your jobs will land on these nodes if you request more memory than is available on the regular nodes. Traverse has 32 CPU-cores with 4 hardware threads per CPU-core.

Note that you can request much more memory than the per CPU-core value, up to the total memory of a node. For instance, you could request 1 CPU-core and 50 GB of memory on any of the nodes mentioned above. Be aware that snodes uses the convention that 1 MB is equal to 1024 kilobytes (binary). If you are requesting all the memory of a node then you have to take this into account when specifying the value. The simple solution is to specify the value in megabytes (e.g., #SBATCH --mem=192000M). If you fail to do this then you will encounter the following message:

sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available

You cannot enter non-integer values for the memory (e.g., 4.2G is not allowed). Doing so will produce the following error:

sbatch: error: Invalid --mem specification

The solution to the error above is to use integer values (e.g., 4200M).
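
Both points can be expressed as directives; the values below are only examples:

#SBATCH --mem-per-cpu=4200M   # integer number of megabytes instead of 4.2G
#SBATCH --mem=192000M         # roughly all of the memory of a 192 GB node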

Memory Usage versus Time

One can see the memory history of a job using the stats.rc web interface. See the Python packages memray or memory_profiler to monitor memory usage over time as well as line by line in your Python script. The MAP profiler can also be used to measure memory usage as a function of time.
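
As a rough sketch, memray can be driven from the command line as follows; the output filename is arbitrary and the exact options should be confirmed in the memray documentation:

$ memray run -o memray_out.bin myscript.py   # record allocations while the script runs
$ memray flamegraph memray_out.bin           # produce an HTML flame graph of memory usage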

GPU Memory

Just as a CPU has its own memory, so does a GPU. GPU memory is much smaller than CPU memory. For instance, each GPU on the Traverse cluster has only 32 GB of memory compared to the 250 GB available to the CPU-cores. See the GPU Computing page for an introduction to GPUs.

When an application tries to allocate more GPU memory than what is available, it will result in an error. Here is an example for PyTorch:

Traceback (most recent call last):
  File "mem.py", line 8, in <module>
    y = torch.randn(N, dtype=torch.float64, device=torch.device('cuda:0'))
RuntimeError: CUDA out of memory. Tried to allocate 8.94 GiB (GPU 0; 15.90 GiB total
capacity; 8.94 GiB already allocated; 6.34 GiB free; 0 bytes cached)
srun: error: tiger-i23g11: task 0: Exited with exit code 1
srun: Terminating job step 3955266.0

There are no Slurm directives for specifying the GPU memory. When a job allocates a GPU, all of the GPU memory is available to the job. On Della, each GPU provides either 40 or 80 GB. In the event of an out-of-memory (OOM) error, one must modify the application script or the application itself to resolve the error. When training neural networks, the most common cause of out-of-memory errors on the GPU is using a batch size that is too large.

GPU memory as a function of time for a running or completed job can be viewed via the stats.rc web interface. As a second method for jobs that are actively running, one can SSH to the compute node and run the nvidia-smi command. For more on this see the GPU Computing page.
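
For an actively running job, the second method might look like this (the hostname is only an example; use the node reported by squeue):

$ squeue -u $USER          # find the compute node running the job
$ ssh tiger-i23g11         # connect to that compute node
$ nvidia-smi               # the Memory-Usage column shows GPU memory in use
$ exit                     # return to the login node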

Getting Help

If you encounter any difficulties with CPU or GPU memory then please send an email to [email protected] or attend a help session.