Job Priority


Introduction

The Slurm scheduler, like many other schedulers, works by assigning a priority number to each job. To see all jobs with their associated priorities one can use:

$ squeue -o "%.18i %Q %.9q %.8j %.8u %.10a %.2t %.10M %.10L %.6C %R" | more

How is the priority of a job determined?

There are a few factors which determine this value. The command sprio -w will show you the current factors being used along with the associated weights. For TigerGPU:

$ sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
        Weights                               1       1000      10000      10000       1000       8000

Each component is described below.

AGE - this value increases the longer a job that is ready to run sits in the queue waiting for resources. It accrues for a maximum of 10 days, which is a configurable limit.

FAIRSHARE - this gets complicated but is essentially a measure of how heavily the user and/or group has been using the cluster over the past 30 days. See the "sshare -l" command for actual values, where the LevelFS column acts as a multiplier that boosts or lowers the priority based on past usage.

JOBSIZE - this is based on the number of cores requested, with more cores giving higher priority. This matters because wide jobs would be starved if not given a boost, while smaller jobs can be backfilled as resources wait to be freed.

QOS - run the qos command to see the weights for the various quality-of-service levels. This component is based almost entirely on the requested time limit, and in most cases jobs requesting shorter times receive the highest priority here.
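Putting the weights together: Slurm's multifactor priority plugin computes the priority as a weighted sum of the normalized factors, each between 0.0 and 1.0. A minimal sketch using the TigerGPU weights shown above (the factor values here are made up for illustration; real ones come from sprio):

```shell
# Weighted sum used by Slurm's multifactor priority plugin (sketch).
# The factor values below are hypothetical; real values come from `sprio`.
awk 'BEGIN {
    age = 0.25; fairshare = 0.60; jobsize = 0.10; partition = 1.0; qos = 0.50
    priority = 1000*age + 10000*fairshare + 10000*jobsize + 1000*partition + 8000*qos
    printf "%d\n", priority    # prints 12250
}'
```

Note how the large FAIRSHARE and JOBSIZE weights mean that recent usage and job width dominate the final number.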

Using the command (all on one line):

$ watch -n 30 -d 'squeue --start --format="%.7i %.7Q %.7q %.15j %.12u %.10a %.20S %.6D %.5C %R" --sort=S --states=PENDING | egrep -v "N/A" | head -20'

One can then watch the top jobs waiting to run and begin to see how the scheduling works. In many instances you will find lower-priority jobs that run before higher-priority ones. This comes down to the requested resources, when those resources will free up, and the decisions the scheduler makes as a result. It is also why an accurate time limit is so important: the more accurate the requested time limits are across all jobs, the better the scheduler can fit jobs together.

Run the following command to see how many shares your group has as well as your fairshare value: 

$ sshare

When running jobs remember to only request what you need.
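As a rough illustration of the LevelFS value reported by "sshare -l" (assuming Slurm's Fair Tree algorithm, where LevelFS is normalized shares divided by normalized usage): a group that has recently used twice its entitled share of the cluster gets a LevelFS below 1, which lowers the fairshare component of its jobs' priority. The numbers below are hypothetical:

```shell
# LevelFS sketch (Fair Tree): normalized shares / normalized usage.
# Values are hypothetical; see `sshare -l` for your group's actual numbers.
awk 'BEGIN {
    norm_shares = 0.25   # fraction of the cluster the group is entitled to
    norm_usage  = 0.50   # fraction of recent usage attributable to the group
    printf "%.1f\n", norm_shares / norm_usage    # prints 0.5 (< 1: over-served)
}'
```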

 

When will my jobs start running?

To get an estimate of the start time of your jobs use this command:

$ squeue -u $USER --start

This command only reports on pending jobs, so information about running jobs will not be shown. Slurm considers only three pending jobs at a time per user, so you will not see estimated start times for more than that number of jobs.

 

What are the meanings of the values in NODELIST (REASON)?

The squeue -u $USER command will show the state of all your queued and running jobs. For queued jobs, the rightmost column indicates the reason the job is not running. The most common reasons include:

  1. (Resources) - The necessary combination of nodes/CPUs/GPUs/memory for your job is not available.
  2. (Priority) - Other jobs in the queue have a higher priority and will therefore be scheduled first.
  3. (Dependency) - The job will not start until the dependency specified by you is satisfied.
  4. (QOS<something>) - This tells you which limit the job is exceeding in the particular QOS. For example, QOSGrpCpuLimit means that the jobs running in that QOS (e.g., long) are using all of the allotted resources as set by the GrpTRES value. In this case, simply wait and your job will run. Run the qos command to see the limits; the number of "procs" or CPU-cores in use per QOS is displayed at the bottom of the output. Note that "Grp" refers to the QOS as a whole and not to your research group.
  5. (ReqNodeNotAvail, Reserved for maintenance) - If the time limit of your job extends into the maintenance period then the scheduler will not start the job until after maintenance since the job would have to be killed when the clusters go down. No action is needed. The job will run as expected after maintenance. In some cases it may be useful to lower the requested time limit so that the job completes before the scheduled downtime.

For the complete list of codes see this page. Note that for running jobs, the rightmost column of the command above gives the node name(s) that the job is running on.

 

Why do I see "(ReqNodeNotAvail, Reserved for maintenance)" when I submit a job close to downtime?

The Research Computing clusters and storage systems are unavailable due to routine maintenance from approximately 6 AM to 2 PM on the second Tuesday of every month (please mark your calendar). If the runtime limit of your job submission extends into the maintenance period then the scheduler will not start the job until after maintenance is complete since the job would have to be killed when the clusters go down at 6 AM. Submitted jobs will have a status of "PD" for pending with the reason being "(ReqNodeNotAvail, Reserved for maintenance)". Leave your job in the queue and it will run as expected after maintenance. In some cases it may be useful to lower the requested runtime limit so that the job completes before the scheduled downtime of 6 AM on the second Tuesday of the month.
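To pick a time limit that ends before the downtime, you can compute the hours remaining until the maintenance window begins. A sketch assuming GNU date (the dates below are hypothetical; substitute your actual submit time and the next second Tuesday at 6 AM):

```shell
# Hours between a (hypothetical) submit time and a 6 AM downtime start.
# Assumes GNU date, as found on typical Linux clusters.
submit=$(date -d "2024-06-10 06:00" +%s)
downtime=$(date -d "2024-06-11 06:00" +%s)
echo $(( (downtime - submit) / 3600 ))    # prints 24
```

Setting --time to less than this number of hours lets the job start before maintenance.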

 

Is it true that some users have priority on certain clusters?

Yes. This is true on every large cluster. Most of the funding for the clusters comes from individual research groups and entire academic departments. Members of these groups get priority when a given cluster is operating at capacity. When there are idle nodes then all users are equal. Run the sshare command to see which groups have the largest shares. For instance, on Tiger, sshare shows that astro, cbe, chem, eac, geo, geoclim, mae and mueller all have large shares. Most users are in the public account which gets a small number of shares. This leads to lower effective job priorities for these users when the cluster is busy.

 

Why can I only run two short jobs at once?

If you try to run multiple jobs with a time limit of less than 1 hour each, you will find that at most two of them run while the others are queued with "QOSMaxJobsPerUserLimit". What's happening is that the jobs are landing in the test queue where the maximum number of jobs is 2 per user. Run the "qos" command to see this. However, this is the one scenario where users are allowed to lie to the scheduler. Just set the time limit of your jobs to 62 minutes and you will be able to run more jobs at once since the jobs will land in a different QoS. Outside of this case you should always use an accurate value for the time limit. Be sure to include extra time for safety since the job will be killed if it does not finish before the limit is reached.
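For example, a Slurm script header along these lines would land the job outside the test queue (the job name and executable are placeholders; adjust the resource requests to your job):

```shell
#!/bin/bash
#SBATCH --job-name=myjob        # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:02:00         # 62 minutes, just over the 1-hour test cutoff

srun ./myprog                   # placeholder executable
```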

 

Is it okay to run serial jobs on TigerCPU or Stellar?

Not really. TigerCPU and Stellar were designed for parallel jobs that require multiple nodes. The scheduler on these clusters has been configured to give serial or single-node jobs the lowest priority. In some cases, squeue will classify the reason that the small job is pending as (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). This indicates that the required resources are being used for large jobs. Your serial or single-node job will eventually run, however. If your code is GPU-enabled then it is perfectly fine to run serial or single-node GPU jobs on TigerGPU. There is no priority penalty in such cases.

Della is designed to handle a range of job sizes including small jobs. If you only have an account on Tiger or Stellar and you want to run several small jobs then please write to cses@princeton.edu to request an account on Della. Be sure to explain the situation.