- When will my job start running?
- What are the meanings of the values in NODELIST (REASON)?
- Why do I see "(ReqNodeNotAvail, Reserved for maintenance)" when I submit a job close to downtime?
- Is it true that some users have priority on certain clusters?
- Why can I only run two short jobs at once?
- Is it okay to run serial jobs on TigerCPU or Stellar?
The Slurm scheduler, like many other schedulers, works by assigning a priority number to each job. To see all jobs with their associated priorities, use:
$ squeue -o "%.18i %Q %.9q %.8j %.8u %.10a %.2t %.10M %.10L %.6C %R" | more
How is the priority of a job determined?
There are a few factors which determine this value. The command sprio -w will show you the current factors being used along with the associated weights. For TigerGPU:
$ sprio -w
     JOBID PARTITION  PRIORITY      SITE       AGE FAIRSHARE   JOBSIZE PARTITION       QOS
   Weights                             1      1000     10000     10000      1000      8000
Here is what each component does.
AGE - this value increases while a job that is ready to run sits in the queue waiting for resources. It stops increasing after 10 days, which is a configurable limit.
FAIRSHARE - this gets complicated, but it is basically a measure of how much the user and/or group has used the cluster over the past 30 days. See the "sshare -l" command for the actual values, where LevelFS is used as a multiplier to raise or lower the value based on past usage.
JOBSIZE - this is basically the number of cores requested, with more cores giving a higher priority. This matters because wide jobs would be starved if they were not given a higher priority. Smaller jobs can then be backfilled while resources are waiting to be freed.
QOS - use the qos command to see the weights for the quality of service the job is using. This is based almost entirely on the time requested, and in most cases jobs requesting shorter times are given the highest priority here.
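To make the weights concrete, the sketch below combines them the way Slurm's multifactor priority plugin does: each factor is normalized to a value between 0.0 and 1.0, multiplied by its weight, and the products are summed. The weights are the TigerGPU values quoted above. The function names and the factor values passed in are hypothetical, the age factor is assumed to ramp linearly up to the 10-day limit, and the SITE term and any nice adjustment are ignored.

```shell
# Weights from the `sprio -w` output for TigerGPU quoted in the text.
W_AGE=1000
W_FAIRSHARE=10000
W_JOBSIZE=10000
W_PARTITION=1000
W_QOS=8000

# age_factor DAYS_WAITING
# Assumption: linear ramp that saturates at 1.0 after 10 days.
age_factor() {
    awk -v d="$1" 'BEGIN { f = d / 10; if (f > 1) f = 1; print f }'
}

# estimate_priority AGE FAIRSHARE JOBSIZE PARTITION QOS
# Each argument is a normalized factor between 0.0 and 1.0.
estimate_priority() {
    awk -v a="$1" -v f="$2" -v j="$3" -v p="$4" -v q="$5" \
        -v wa="$W_AGE" -v wf="$W_FAIRSHARE" -v wj="$W_JOBSIZE" \
        -v wp="$W_PARTITION" -v wq="$W_QOS" \
        'BEGIN { printf "%.0f\n", wa*a + wf*f + wj*j + wp*p + wq*q }'
}

# Hypothetical job: waited 5 days, fairshare 0.25, jobsize 0.125,
# partition 1.0, QOS 0.5.
estimate_priority "$(age_factor 5)" 0.25 0.125 1.0 0.5   # prints 9250
```

With these weights, FAIRSHARE and JOBSIZE can each contribute up to 10000 while AGE is capped at 1000, so waiting in the queue alone rarely overtakes a job with a better fairshare value. On the cluster, sprio -l lists the weighted contribution of each factor for pending jobs.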
Using the command (all on one line):
$ watch -n 30 -d 'squeue --start --format="%.7i %.7Q %.7q %.15j %.12u %.10a %.20S %.6D %.5C %R" --sort=S --states=PENDING | grep -v "N/A" | head -n 20'
One can then watch the top jobs waiting to run and begin to see how this works. In many instances you will find lower priority jobs that run before higher priority ones. This is due to the resources requested, when those resources will become free, and the decisions the scheduler makes in this regard. This is also why an accurate time limit is so important. The more accurate the time limits of all jobs are, the better the scheduler can fit jobs together and the more accurate its scheduling becomes.
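One way to arrive at an accurate time limit is to start from the measured runtime of a similar earlier job and add a modest safety margin. The helper below is a minimal sketch of that arithmetic. The function name and the 20% default margin are assumptions for illustration, not site policy.

```shell
# padded_time_limit ELAPSED_SECONDS [MARGIN_PERCENT]
# Prints an HH:MM:SS limit: measured runtime plus a safety margin
# (default 20%, an arbitrary choice), rounded up to a whole minute.
# The function body runs in a subshell so its variables stay local.
padded_time_limit() (
    elapsed=$1
    margin=${2:-20}
    total=$(( elapsed + (elapsed * margin + 99) / 100 ))
    minutes=$(( (total + 59) / 60 ))
    printf '%02d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
)

padded_time_limit 5400   # a 90-minute run; prints 01:48:00
```

The result can be used directly in a job script as #SBATCH --time=01:48:00. The elapsed time of a finished job can be looked up with sacct -j JOBID --format=Elapsed.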
Run the sshare command to see how many shares your group has as well as your fairshare value.
When running jobs remember to only request what you need.
When will my job start running?
To get an estimate of the start time of your jobs, use this command:
$ squeue -u $USER --start
This command only reports on pending jobs, so information about running jobs is not shown. Slurm only considers three pending jobs per user at a time, so you will not see estimated start times for more than three of your jobs.
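When many jobs are pending, it can help to list only those that already have an estimate. The sketch below assumes squeue is called with --noheader (-h) and the format "%i %S", which prints the job ID followed by the expected start time, or N/A when no estimate exists yet. The function name is hypothetical.

```shell
# jobs_with_estimate: read "JOBID START_TIME" lines on stdin and keep
# only the jobs whose start time has been estimated (not N/A).
jobs_with_estimate() {
    awk '$2 != "N/A" { print $1, $2 }'
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" --start -h -o "%i %S" | jobs_with_estimate
fi
```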
What are the meanings of the values in NODELIST (REASON)?
The squeue -u $USER command will show the state of all your queued and running jobs. For queued jobs, the rightmost column indicates the reason the job is not running. The most common reasons include:
- (Resources) - The necessary combination of nodes/CPUs/GPUs/memory for your job is not available.
- (Priority) - Other jobs in the queue have a higher priority and will therefore be scheduled first.
- (Dependency) - The job will not start until the dependency specified by you is satisfied.
- (QOS<something>) - This tells you which limit the job is exceeding in the particular QOS. For example, QOSGrpCpuLimit means that the jobs running in that QOS (e.g., long) are using all of the allotted resources as set by the GrpTRES value. In this case, simply wait and your job will run. Run the qos command to see the limits. The number of "procs" or CPU-cores in use per QOS is displayed at the bottom of the output. Note that "Grp" refers to the QOS and not to your research group.
- (ReqNodeNotAvail, Reserved for maintenance) - If the time limit of your job extends into the maintenance period then the scheduler will not start the job until after maintenance since the job would have to be killed when the clusters go down. No action is needed. The job will run as expected after maintenance. In some cases it may be useful to lower the requested time limit so that the job completes before the scheduled downtime.
For the complete list of codes see this page. Note that for running jobs, the rightmost column of the command above gives the node name(s) that the job is running on.
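The reasons listed above can be turned into a small lookup for your own pending jobs. This is a sketch only: the function name and the wording of the hints are made up here, and it assumes squeue's %r format field, which prints the reason without the surrounding parentheses.

```shell
# explain_reason REASON: print a one-line hint for a pending-job reason.
explain_reason() {
    case "$1" in
        Resources)        echo "waiting for nodes/CPUs/GPUs/memory to become free" ;;
        Priority)         echo "other jobs have higher priority and go first" ;;
        Dependency)       echo "waiting for a job dependency to be satisfied" ;;
        QOS*)             echo "a QOS limit has been reached; run qos to see the limits" ;;
        ReqNodeNotAvail*) echo "time limit may overlap a reservation such as maintenance" ;;
        *)                echo "see the complete list of reason codes" ;;
    esac
}

# Only query the scheduler when squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -h -t PENDING -o "%i %r" |
    while read -r jobid reason; do
        printf '%s: %s\n' "$jobid" "$(explain_reason "$reason")"
    done
fi
```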
Why do I see "(ReqNodeNotAvail, Reserved for maintenance)" when I submit a job close to downtime?
The Research Computing clusters and storage systems are unavailable due to routine maintenance from approximately 6 AM to 2 PM on the second Tuesday of every month (please mark your calendar). If the runtime limit of your job submission extends into the maintenance period then the scheduler will not start the job until after maintenance is complete. This is because the job would have to be killed when the clusters go offline. Submitted jobs will have a status of "PD" for pending with the reason being "(ReqNodeNotAvail, Reserved for maintenance)". Leave your job in the queue and it will run as expected after the maintenance period. In some cases it may be useful to lower the requested runtime limit so that the job can finish before the clusters go offline at 6 AM on the second Tuesday of the month.
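To judge whether a job can finish before the next downtime, you first need the date of the second Tuesday. The sketch below computes it for a given month. It assumes GNU date (for the -d option), and the function name is hypothetical.

```shell
# second_tuesday YEAR MONTH: print the date of the second Tuesday.
# MONTH is given as two digits (01-12). Requires GNU date.
second_tuesday() {
    # Weekday of the 1st of the month: 1 = Monday ... 7 = Sunday.
    dow=$(date -d "$1-$2-01" +%u)
    # Days from the 1st to the first Tuesday, plus one more week.
    day=$(( (2 - dow + 7) % 7 + 8 ))
    printf '%s-%s-%02d\n' "$1" "$2" "$day"
}

second_tuesday 2024 01   # prints 2024-01-09
```

A job submitted on the Monday evening before that date with a 24-hour time limit would stay pending until after maintenance, while the same job with a limit short enough to end before 6 AM could start right away.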
Is it true that some users have priority on certain clusters?
Yes. This is true on every large cluster. Most of the funding for the clusters comes from individual research groups and entire academic departments. Members of these groups get priority when a given cluster is operating at capacity. When there are idle nodes, all users are equal. Run the sshare command to see which groups have the most shares. For instance, on Tiger, sshare shows that astro, cbe, chem, eac, geo, geoclim, mae and mueller all have large shares. Most users are in the public account, which gets a small number of shares. This leads to lower effective job priorities for these users when the cluster is busy.
Why can I only run two short jobs at once?
If you try to run multiple jobs with a time limit of less than 1 hour each, you will find that at most two of them run while the others are queued with "QOSMaxJobsPerUserLimit". The explanation is that the jobs are landing in the test QOS, where the maximum number of running jobs is either 1 or 2 per user depending on the cluster. Run the "qos" command to see this. The solution is to set the time limit of your jobs to 62 minutes. You will then be able to run more jobs simultaneously since the jobs will land in a different QOS. Outside of this case you should always use an accurate value for the time limit (with some extra time included for safety).
Is it okay to run serial jobs on TigerCPU or Stellar?
Not really. TigerCPU and Stellar were designed for parallel jobs that require multiple nodes. Serial and single-node jobs on TigerCPU are assigned to the serial partition. Jobs on Stellar that allocate less than half of a node (<48 cores) are assigned to the serial partition. The scheduler on these clusters has been configured to give jobs in the serial partition the lowest priority. This means that jobs in the serial partition will only run when the scheduler cannot start a multinode job. In some cases, squeue will report the reason that the small job is pending as (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions). This indicates that the required resources are being used for large jobs. Your serial or single-node job will eventually run, however. If your code is GPU-enabled then it is perfectly fine to run serial or single-node GPU jobs on TigerGPU. There is no priority penalty in such cases.
Della is designed to handle a range of job sizes including small jobs.