Effective Usage

It is easy to accidentally waste resources on the HPC clusters. Make sure you read the Top 10 Mistakes to Avoid on the Research Computing Clusters. Additional guidelines that promote effective usage are given below.

Slurm

The more resources you request, the longer your job will spend in the queue waiting for those resources to become available. Try to specify only your minimum requirements. Here are the key pieces:

  • Number of CPU-cores
  • Amount of time required to run the job
  • Amount of memory (RAM) needed
  • Number of GPUs (if any)
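
As a rough illustration, the four items above map onto Slurm directives as in the sketch below. The job name and all of the values are assumptions chosen for the example, not recommendations for any particular code.

```shell
#!/bin/bash
#SBATCH --job-name=example       # hypothetical job name
#SBATCH --nodes=1                # number of nodes
#SBATCH --ntasks=4               # total number of tasks
#SBATCH --cpus-per-task=1        # CPU-cores per task
#SBATCH --mem-per-cpu=4G         # memory per CPU-core
#SBATCH --time=01:00:00          # run time limit (HH:MM:SS)
# A GPU job would also carry a line such as: #SBATCH --gres=gpu:1

# Replace this placeholder with the real command, e.g. srun ./my_program
echo "Job ${SLURM_JOB_ID:-unset} started in $(pwd)"
```

Such a script would typically be submitted with sbatch.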

For parallel codes, carry out a scaling analysis as described on the Choosing the Number of Nodes, CPU-cores and GPUs page.
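
One common quantity in a scaling analysis is the parallel efficiency, the speed-up on N CPU-cores divided by N. The sketch below computes it as a percentage; the function name and the timings are made up for illustration, and the referenced page may define its analysis differently.

```shell
# Hypothetical helper: parallel efficiency in percent.
#   $1 = run time on 1 CPU-core, $2 = run time on N CPU-cores, $3 = N
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { printf "%.0f\n", 100 * t1 / (tn * n) }'
}

# Made-up timings in seconds, for illustration only.
parallel_efficiency 100 30 4
parallel_efficiency 100 16 8
```

With these made-up timings the efficiency falls as cores are added, which is the kind of trend that suggests requesting fewer CPU-cores.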

Time

The longer the requested run time limit, the longer your queue time. Make sure you choose an accurate value, but include some extra time for safety since the job will be killed if it does not complete before the run time limit.
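
One way to pick a value is to time a representative run and add a margin. The sketch below does that arithmetic; the helper name and the 20% default margin are assumptions for illustration, not a site policy.

```shell
# Hypothetical helper: turn a measured run time (in minutes) into a padded
# Slurm time limit. The 20% default margin is an assumption, not a site rule.
padded_time_limit() {
    local minutes="$1"
    local margin_pct="${2:-20}"
    # Integer arithmetic, rounded up to the next whole minute.
    local padded=$(( (minutes * (100 + margin_pct) + 99) / 100 ))
    printf '%02d:%02d:00\n' $(( padded / 60 )) $(( padded % 60 ))
}

# A test run that took 100 minutes suggests this directive:
echo "#SBATCH --time=$(padded_time_limit 100)"
```

For the 100-minute example this prints a limit of 02:00:00.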

Memory

For most jobs the default memory per CPU-core will be a reasonable choice (4 GB on all clusters except Stellar, where it is 8 GB). Requesting excess memory can cause your job to spend longer than necessary in the queue. Read more about allocating memory.
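
If you prefer to state the memory request explicitly, a fragment like the following could be used. The values are illustrative and assume one CPU-core per task, so the total memory of the job scales with the number of tasks.

```shell
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=4G         # 16 CPU-cores x 4 GB = 64 GB for the whole job
```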

Use the Minimum Number of Nodes

You should make every effort to use all of the CPU-cores on a node before requesting an additional node. Always specify the number of nodes in your Slurm script, since specifying ntasks > 1 without specifying the number of nodes may cause your job to be split across nodes. A job that is split across nodes can prevent other jobs that require all the CPU-cores of a node from running. So use directives such as the following:

#SBATCH --nodes=1
#SBATCH --ntasks=16

 

Files

Use /scratch/gpfs

Make sure that you use the local /scratch/gpfs filesystem for job input and output files. Do not use /tigress or /projects. See the Data Storage page for a complete discussion of this.
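
A minimal sketch of preparing a working directory for a job is shown below. The helper name and the per-user layout of scratch root, user name, job name are assumptions; check the Data Storage page for the actual paths on your cluster.

```shell
# Hypothetical helper: create a per-job working directory under scratch.
# Assumes a layout of <scratch root>/<user>/<job name>.
make_job_dir() {
    local root="$1" user="$2" name="$3"
    local dir="${root}/${user}/${name}"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example use inside a job script (paths and program are assumptions):
#   workdir=$(make_job_dir /scratch/gpfs "$USER" my_job)
#   cd "$workdir" && srun ./my_program > output.log
```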

Number of Files

Our filesystems perform best with megabyte-size files that do not exceed hundreds of thousands in number. Run the checkquota command to see the limits on the number of files you can store. If you have a large number of files, please use a command like tar to combine them into a single archive.
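
As a sketch of how this might look in practice, the helpers below count the files in a directory and pack it into one compressed archive. The function names and the directory name in the usage comments are hypothetical.

```shell
# Hypothetical helpers; the function names are illustrative only.

# Count the regular files under a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Pack a directory into <directory>.tar.gz next to the original.
archive_dir() {
    local dir="${1%/}"
    tar -czf "${dir}.tar.gz" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Example use (the directory name is an assumption):
#   count_files results
#   archive_dir results      # writes results.tar.gz
#   tar -tzf results.tar.gz  # list the contents without extracting
```

The archive counts as a single file toward the limit, and its contents can be listed or extracted later with tar.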