Job Priority




Like many other schedulers, Slurm assigns each job a priority number. To list all jobs along with their priorities, use:

$ squeue -o "%.18i %Q %.9q %.8j %.8u %.10a %.2t %.10M %.10L %.6C %R" | more

How is the priority of a job determined?

Several weighted factors determine this value. The sprio -w command shows the factors currently in use along with their weights. For Della:

$ sprio -w
  Weights                          1      10000      10000      10000       6000    CPU=1,Mem=1

Each component does the following:

AGE - this value increases while a job that is ready to run waits in the queue for resources. It stops increasing after 10 days, which is a configurable limit.

FAIRSHARE - this is a measure of how much the user and/or group has used the cluster over the past 30 days. Run the "sshare -l" command to see the actual values, where LevelFS is used as a multiplier that raises or lowers the value according to past usage.

JOBSIZE - this is based on the number of cores requested, with more cores giving higher priority. This matters because wide jobs would be starved if they were not given a higher priority. Smaller jobs can then be backfilled while the wide jobs wait for resources to be freed.

QOS - run the qos command to see the weights for the quality of service that the job is using. This factor is based almost entirely on the time requested, and in most cases jobs requesting shorter times are given the highest priority.
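
The way the factors above combine can be sketched as a weighted sum, which is how Slurm's multifactor priority plugin works in outline: each factor is a number between 0 and 1 and is multiplied by its weight. This is a simplified sketch, not the scheduler's actual code. The function names age_factor and job_priority and the sample factor values are hypothetical, the site and TRES terms are ignored, and it is an assumption that the Della weights 10000, 10000, 10000 and 6000 correspond to AGE, FAIRSHARE, JOBSIZE and QOS in that order.

```bash
# Sketch only: weighted-sum priority, ignoring the site and TRES terms.
# ASSUMPTION: the weights map to AGE, FAIRSHARE, JOBSIZE, QOS in this order.
W_AGE=10000
W_FAIRSHARE=10000
W_JOBSIZE=10000
W_QOS=6000
MAX_AGE_DAYS=10   # the age factor stops growing after this many days

# age_factor DAYS_WAITING -> value between 0 and 1, capped at MAX_AGE_DAYS
age_factor() {
    awk -v d="$1" -v max="$MAX_AGE_DAYS" \
        'BEGIN { f = d / max; if (f > 1) f = 1; printf "%.2f\n", f }'
}

# job_priority AGE FAIRSHARE JOBSIZE QOS (each factor between 0 and 1)
job_priority() {
    awk -v a="$1" -v f="$2" -v j="$3" -v q="$4" \
        -v wa="$W_AGE" -v wf="$W_FAIRSHARE" -v wj="$W_JOBSIZE" -v wq="$W_QOS" \
        'BEGIN { printf "%d\n", wa*a + wf*f + wj*j + wq*q + 0.5 }'
}

# Hypothetical job: waited 5 days, fairshare 0.5, small job, short QOS.
job_priority "$(age_factor 5)" 0.5 0.1 1.0
```

With these made-up inputs the QOS term contributes 6000 and the age term 5000, which illustrates why a short job that has waited a while can outrank a long job that was just submitted.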

To watch the top jobs waiting to run, use this command (all on one line):

$ watch -n 30 -d 'squeue --start --format="%.7i %.7Q %.7q %.15j %.12u %.10a %.20S %.6D %.5C %R" --sort=S --states=PENDING | grep -v "N/A" | head -20'

In many instances you will find lower priority jobs that run before higher priority ones. This happens because of the resources requested, when those resources will become free, and the decisions the scheduler makes as a result. It is also why an accurate time limit is so important: the closer the requested time is to the actual run time for all jobs, the better the scheduler can fit jobs together and the more accurate its scheduling becomes.
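
The reason a lower priority job can start first comes down to one comparison. As a rough sketch, and not Slurm's actual backfill algorithm, a job can be backfilled when the idle resources are enough for it and its time limit ends before the higher priority job is expected to start. The function can_backfill and its arguments are hypothetical, and the sketch assumes the idle resources already fit the job.

```bash
# can_backfill JOB_LIMIT_MIN MINUTES_UNTIL_TOP_JOB_STARTS
# Succeeds (exit status 0) when the job's time limit ends before the
# higher priority job is expected to start, so running it delays nothing.
# ASSUMPTION: the currently idle resources are enough for the job.
can_backfill() {
    [ "$1" -le "$2" ]
}

# Suppose the top job is expected to start in 120 minutes.
for limit in 60 180; do
    if can_backfill "$limit" 120; then
        echo "limit=${limit}min: can backfill"
    else
        echo "limit=${limit}min: must wait"
    fi
done
```

A job that requests 180 minutes but would finish in 50 is treated as a 180-minute job here, which is why an overestimated time limit costs backfill opportunities.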

Run the following command to see how many shares your group has as well as your fairshare value: 

$ sshare

The fairshare varies between 0 and 1. A value of 0.5 means that you are getting what you should. Less than 0.5 means you have consumed more resources than your share while greater than 0.5 means you have consumed less. If you are not running jobs then your fairshare will increase and eventually reach 1.
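
The reading of the fairshare value described above can be written as a small helper. This is only an illustration of the thresholds in the text; the function name interpret_fairshare and the wording of the messages are hypothetical.

```bash
# interpret_fairshare VALUE -> one-line reading of a fairshare value (0 to 1)
interpret_fairshare() {
    awk -v f="$1" 'BEGIN {
        if (f < 0 || f > 1) {
            print "out of range"
        } else if (f < 0.5) {
            print "over-served: used more than your share"
        } else if (f > 0.5) {
            print "under-served: used less than your share"
        } else {
            print "on target: usage matches your share"
        }
    }'
}

interpret_fairshare 0.12
interpret_fairshare 0.93
```

On a cluster, the input would be the value read from the FairShare column of the sshare output.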

When running jobs, remember to only request what you need (i.e., CPU-cores, CPU memory, GPUs and time). Allocating excess resources will hurt the job priority of your subsequent jobs.


When will my jobs start running?

To get an estimate of the start time of your jobs use this command:

$ squeue -u $USER --start

This command reports only on pending jobs, so information about running jobs is not shown. Slurm considers only three pending jobs at a time per user, so you will not see estimated start times for more than three jobs.


What are the meanings of the values in NODELIST (REASON)?

The squeue -u $USER command will show the state of all your queued and running jobs. For queued jobs, the rightmost column indicates the reason the job is not running. The most common reasons include:

  1. (Resources) - The necessary combination of nodes/CPUs/GPUs/memory for your job is not available.
  2. (Priority) - Other jobs in the queue have a higher priority and will therefore be scheduled first.
  3. (Dependency) - The job will not start until the dependency specified by you is satisfied.
  4. (QOS<something>) - This tells you which limit the job is exceeding in the particular QOS. For example, QOSGrpCpuLimit means that the jobs running in that QOS (e.g., long) are using all of the allotted resources as set by the GrpTRES value. In this case, simply wait and your job will run. Run the qos command to see the limits. The number of "procs" or CPU-cores in use per QOS is displayed at the bottom of the output. Note that "Grp" refers to the QOS and not to your research group.
  5. (ReqNodeNotAvail, Reserved for maintenance) - If the time limit of your job extends into the maintenance period then the scheduler will not start the job until after maintenance since the job would have to be killed when the clusters go down. No action is needed. The job will run as expected after maintenance. In some cases it may be useful to lower the requested time limit so that the job completes before the scheduled downtime.

For the complete list of codes, see the job reason codes in the Slurm documentation. Note that for running jobs, the rightmost column of the command above gives the name(s) of the node(s) that the job is running on.
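
When many jobs are queued, it can help to summarize the reasons rather than read them row by row. The following is a minimal sketch that counts your pending jobs per reason using the squeue reason field (%r). The function name pending_reasons is hypothetical.

```bash
# pending_reasons [USER] -> number of pending jobs per reason, most common first
# -h drops the header, -t PENDING keeps only queued jobs, %r prints the reason.
pending_reasons() {
    squeue -u "${1:-$USER}" -h -t PENDING -o "%r" | sort | uniq -c | sort -rn
}

# Usage on a cluster login node:
#   pending_reasons
```

The block only defines the function, so it is safe to paste into a shell startup file on a machine without Slurm.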


Why do I see "(ReqNodeNotAvail, Reserved for maintenance)" when I submit a job close to downtime?

The Research Computing clusters and storage systems are unavailable due to routine maintenance from approximately 6 AM to 2 PM on the second Tuesday of every month (please mark your calendar). If the runtime limit of your job submission extends into the maintenance period then the scheduler will not start the job until after maintenance is complete. This is because the job would have to be killed when the clusters go offline. Submitted jobs will have a status of "PD" for pending with the reason being "(ReqNodeNotAvail, Reserved for maintenance)". Leave your job in the queue and it will run as expected after the maintenance period. In some cases it may be useful to lower the requested runtime limit so that the job can finish before the clusters go offline at 6 AM on the second Tuesday of the month.
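
To decide whether lowering the time limit is worthwhile, one can compute how much time remains before the downtime. The sketch below does only the arithmetic. The function name max_limit_minutes, the default 15-minute safety margin, and the date in the usage comment are hypothetical, and the usage comment assumes GNU date.

```bash
# max_limit_minutes NOW_EPOCH MAINT_START_EPOCH [MARGIN_MIN]
# Prints the largest time limit in minutes that still ends MARGIN_MIN minutes
# before maintenance begins, or 0 when there is no room left.
# The body runs in a subshell ( ... ) so its variables do not leak out.
max_limit_minutes() (
    now=$1
    maint=$2
    margin=${3:-15}
    minutes=$(( (maint - now) / 60 - margin ))
    [ "$minutes" -gt 0 ] || minutes=0
    echo "$minutes"
)

# Usage with GNU date (the date below is only a placeholder):
#   max_limit_minutes "$(date +%s)" "$(date -d '2025-01-14 06:00' +%s)"
```

If the result is larger than the time your job needs, requesting a limit at or below that number lets the job start before maintenance instead of after it. The scontrol show reservation command may also list the maintenance reservation and its start time.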


Is it true that some users have priority on certain clusters?

Yes. This is true on every large cluster. Most of the funding for the clusters comes from individual research groups and entire academic departments. Members of these groups get priority when a given cluster is operating at capacity. When there are idle nodes, all users are equal. Run the sshare command to see which groups have the highest priority.


Why can I only run two short jobs at once?

If you try to run multiple jobs with a time limit of less than 1 hour each, you will find that at most two of them run while the others are queued with "QOSMaxJobsPerUserLimit". The explanation is that the jobs are landing in the test queue where the maximum number of jobs is either 1 or 2 per user depending on the cluster. Run the "qos" command to see this. The solution is to set the time limit of your jobs to 62 minutes and then you will be able to run more jobs simultaneously since the jobs will land in a different QOS. Outside of this case you should always use an accurate value for the time limit (with some extra time included for safety).
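
As a small illustration of the workaround above, the helper below converts a limit in minutes to the hours:minutes:seconds form accepted by the --time option. The function name to_slurm_time and the script name job.slurm in the usage comment are hypothetical.

```bash
# to_slurm_time MINUTES -> HH:MM:SS string suitable for the --time option
to_slurm_time() {
    printf '%02d:%02d:00\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

to_slurm_time 62    # prints 01:02:00

# Usage on a cluster (job.slurm is a placeholder script name):
#   sbatch --time="$(to_slurm_time 62)" job.slurm
```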


Is it okay to run serial jobs on TigerCPU or Stellar?

Not really. TigerCPU and Stellar were designed for parallel jobs that require multiple nodes. Serial and single-node jobs on TigerCPU are assigned to the serial partition. Jobs on Stellar that allocate less than half of a node (<48 cores) are assigned to the serial partition. The scheduler on these clusters has been configured to give jobs in the serial partition the lowest priority. This means that jobs in the serial partition will only run when the scheduler cannot start a multinode job. In some cases, squeue will report the reason that the small job is pending as (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). This indicates that the required resources are being used for large jobs. Your serial or single-node job will eventually run, however. If your code is GPU-enabled then it is perfectly fine to run serial or single-node GPU jobs on TigerGPU. There is no priority penalty in such cases.

Della is designed to handle a range of job sizes including small jobs.