Job Priority

Introduction

The Slurm scheduler works much like many other schedulers: it simply assigns a priority number to each job. To see all jobs with their associated priorities, use:

$ squeue -o "%.18i %Q %.9q %.8j %.8u %.10a %.2t %.10M %.10L %.6C %R" | more
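To limit this listing to your own jobs, the same format string can be combined with a user filter, for example:

$ squeue -u $USER -o "%.18i %Q %.9q %.8j %.8u %.10a %.2t %.10M %.10L %.6C %R"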

How is the priority of a job determined?

A few factors determine this value. The command sprio -w shows the current factors being used along with their associated weights. For TigerGPU:

$ sprio -w
  JOBID PARTITION   PRIORITY    SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
  Weights                          1       1000      10000      10000       1000       8000

Let us explain what each component does.

AGE - this value increases the longer a job that is ready to run sits in the queue waiting for resources. It accrues for a maximum of 10 days, which is a configurable limit.

FAIRSHARE - this gets complicated, but it is essentially a measure of how much the user and/or group has been using the cluster over the past 30 days. See the "sshare -l" command for the actual values, where LevelFS acts as a multiplier that either boosts or reduces the value according to past usage.

JOBSIZE - this is essentially the number of cores requested, with more cores giving higher priority. This matters because wide jobs would otherwise be starved, while smaller jobs can be backfilled as resources are waiting to be freed.

QOS - use the qos command to see the weights for the various quality-of-service levels. Which QOS a job falls under is based almost entirely on the time requested, and in most cases jobs requesting shorter times are given the highest priority here (see the examples below).
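To see these components in practice, the following commands may help (a sketch; the exact columns and fields depend on the Slurm version, sprio only reports on pending jobs, and the sacctmgr line is an alternative if the qos command is not available):

$ sshare -l -u $USER
$ sprio -l -u $USER
$ sacctmgr show qos format=Name,Priority,MaxWall

The first shows the long fairshare report including the LevelFS column, the second shows the per-factor priority breakdown of your pending jobs, and the third lists the priority weight and time limit of each QOS.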

Using the command (all on one line):

$ watch -n 30 -d 'squeue --start --format="%.7i %.7Q %.7q %.15j %.12u %.10a %.20S %.6D %.5C %R" --sort=S --states=PENDING | egrep -v "N/A" | head -20'

One can then watch the top jobs waiting to run and begin to see how this works. In many instances you will find lower-priority jobs that run before higher-priority ones. This is all due to the requested resources, when those resources will be freed, and the decisions the scheduler is making in this regard. This is also why an accurate time limit is so important: the more accurate the requested times are across all jobs, the better the scheduler can fit and schedule them.
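Since the time limit plays such a large role, it is worth choosing it carefully in the Slurm script. A minimal sketch, where the job name and time value are placeholders (request somewhat more than the job is expected to need rather than the maximum allowed):

#!/bin/bash
#SBATCH --job-name=myjob       # placeholder name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=02:30:00        # a bit more than the expected run time, not the queue maximum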

Run the following command to see how many shares your group has as well as your fairshare value: 

$ sshare

To see the fairshare value of each user with a running or queued job use this command (all on one line):

$ join -1 4 -2 2 -o 2.7,1.4 <(squeue | sort -k 4) <(sshare -a | awk 'NF>=7' | grep -v class | sort -k 2) | sort -r | uniq | cat -n

When running jobs, remember to request only what you need.

 

When will my jobs start running?

To get an estimate of the start time of your jobs use this command:

$ squeue -u $USER --start

Note that this command will only report on pending jobs. It will ignore running jobs.

 

What are the meanings of the values in NODELIST (REASON)?

The squeue -u $USER command will show the state of all your queued and running jobs. For queued jobs, the rightmost column indicates the reason the job is not running. The most common reasons include:

  1. (Resources) - The necessary combination of nodes/CPUs/GPUs/memory for your job is not available.
  2. (Priority) - Other jobs in the queue have a higher priority and will therefore be scheduled first.
  3. (Dependency) - The job will not start until the dependency you specified is satisfied (see the example below).
  4. (QOS<something>) - This tells you which limit the job is exceeding in the particular QOS. Run the qos command for more.
  5. (ReqNodeNotAvail, Reserved for maintenance) - If the time limit of your job extends into the maintenance period then the scheduler will not start the job until after maintenance since the job would have to be killed when the clusters go down. No action is needed. The job will run as expected after maintenance. In some cases it may be useful to lower the requested time limit so that the job completes before the scheduled downtime.

For the complete list of codes see this page. Note that for running jobs, the rightmost column of the command above gives the node name(s) that the job is running on.
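As a side note on item 3 above, a dependency is specified when the job is submitted, for example (a sketch; the job ID and script name are placeholders):

$ sbatch --dependency=afterok:123456 job.slurm

Here the new job will only start after job 123456 has completed successfully.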

 

Why do I see "(ReqNodeNotAvail, Reserved for maintenance)" when I submit a job close to downtime?

The Research Computing clusters and storage systems are unavailable due to routine maintenance from approximately 6 AM to 2 PM on the second Tuesday of every month (please mark your calendar). If the runtime limit of your job submission extends into the maintenance period then the scheduler will not start the job until after maintenance is complete since the job would have to be killed when the clusters go down at 6 AM. In this event the job status is "PD" for pending with the reason being "(ReqNodeNotAvail, Reserved for maintenance)". No action is needed. Leave your job in the queue and it will run as expected after maintenance. In some cases it may be useful to lower the requested runtime limit so that the job completes before the scheduled downtime of 6 AM on the second Tuesday of the month.
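One way to lower the requested runtime limit of a job that is already in the queue is with scontrol (a sketch; the job ID and time are placeholders, and regular users can typically only decrease a limit, not increase it):

$ scontrol update JobId=123456 TimeLimit=04:00:00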

 

Is it true that some users have priority on certain clusters?

Yes, this is true on almost every cluster. Most of the funding for the clusters comes from individual research groups, departments and institutes. Members of these groups get priority when a given cluster is operating at capacity. When there are idle nodes, all users are equal. Run the sshare command to see which groups have the most shares. For instance, on Tiger sshare shows that astro, cbe, chem, eac, geo, geoclim, mae and mueller all have large shares. Most users are in the public account, which gets a small number of shares. This leads to lower effective job priorities for these users when the cluster is busy.
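Note that by default sshare only shows the associations for your own user. To browse the shares of every account on the cluster, add the -a flag (the output is long, so a pager helps):

$ sshare -a | less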

 

Is it okay to run serial jobs on TigerCPU or Stellar?

Not really. TigerCPU and Stellar were designed for parallel jobs that require multiple nodes. The scheduler on these clusters has been configured to give serial or single-node jobs the lowest priority. In some cases, squeue will classify the reason that the small job is pending as (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). This indicates that the required resources are being used for large jobs. Your serial or single-node job will eventually run, however. If your code is GPU-enabled then it is perfectly fine to run serial or single-node GPU jobs on TigerGPU. There is no priority penalty in such cases.
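For reference, a single-GPU job on TigerGPU might be requested with directives along these lines (a sketch; the values are placeholders):

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00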

Della is ideal for a range of job sizes including small jobs. If you only have an account on Tiger or Stellar and you want to run several small jobs then please write to cses@princeton.edu to request an account on Della. Be sure to explain the situation.