stats.rc provides detailed metrics about running and completed jobs

Research Computing provides several utilities for examining job behavior, both for Slurm jobs and for work done through OnDemand. These tools can be used to check performance and troubleshoot issues.

jobstats

Use the "jobstats" command to see various job metrics:

$ jobstats 1234567
================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
  NetID/Account: aturing/math
       Job Name: sys_logic_ordinals
          State: COMPLETED
          Nodes: 2
      CPU Cores: 48
     CPU Memory: 256GB (5.3GB per CPU-core)
           GPUs: 4
  QOS/Partition: della-gpu/gpu
        Cluster: della
     Start Time: Fri Mar 4, 2022 at 1:56 AM
       Run Time: 18:41:56
     Time Limit: 4-00:00:00
                              Overall Utilization
================================================================================
  CPU utilization  [|||||                                          10%]
  CPU memory usage [|||                                             6%]
  GPU utilization  [||||||||||||||||||||||||||||||||||             68%]
  GPU memory usage [|||||||||||||||||||||||||||||||||              66%]
                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
      della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
  Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%
  CPU memory usage per node - used/allocated
      della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
      della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
  Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)
  GPU utilization per node
      della-i14g2 (GPU 0): 65.7%
      della-i14g2 (GPU 1): 64.5%
      della-i14g3 (GPU 0): 72.9%
      della-i14g3 (GPU 1): 67.5%
  GPU memory usage per node - maximum used/total
      della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
      della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
      della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
      della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)
                                     Notes
================================================================================
  * This job requested 5.3 GB of memory per CPU-core. Given that the overall
    CPU memory usage was only 6%, please consider reducing your CPU memory
    allocation for future jobs. This will reduce your queue times and make the
    resources available for other users. For more info:
    https://researchcomputing.princeton.edu/support/knowledge-base/memory
  * This job completed while only needing 19% of the requested time which
    was 4-00:00:00. For future jobs, please decrease the value of the --time
    Slurm directive. This will lower your queue times and allow the Slurm
    job scheduler to work more effectively for all users. For more info:
      https://researchcomputing.princeton.edu/support/knowledge-base/slurm
  * For additional job metrics including metrics plotted against time:
    https://mydella.princeton.edu/pun/sys/jobstats  (VPN required off-campus)

Use the command "shistory" to view your recent job IDs.
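
For example, a typical workflow (a sketch; the job ID below is a placeholder) is to list your recent jobs with shistory and then pass a job ID of interest to jobstats:

$ shistory -u $USER    # list your recent jobs and their job IDs
$ jobstats 1234567     # replace 1234567 with one of your job IDs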

Want to use the Jobstats job monitoring platform at your institution? See the Jobstats GitHub repo.

Jobstats Web Interface

Detailed job statistics as a function of time can be viewed for running and completed Slurm jobs:

  1. Browse to the Jobstats web interface for your cluster, e.g., https://mydella.princeton.edu/pun/sys/jobstats (you must be on the campus network or on the VPN from off-campus).
  2. For stats.rc, in the upper right corner, click on the dropdown arrow to replace "Last 6 hours" with "Last 7 days". If your job is older than 7 days then you will need to increase the time window.
  3. Enter the job ID in the "Slurm JobID" text box in the upper left and press the Enter/Return key. The job data should then display. To find the job ID, use the command "shistory -u $USER". This command can also be used to obtain the exact time range of the job if needed.

You can adjust the time range of the plots using your mouse by clicking and dragging over the range of interest. Data is captured every 30 seconds.

stats.rc

Here are the job-level metrics that are available using stats.rc:

  • CPU Utilization
  • CPU Memory Utilization
  • GPU Utilization
  • GPU Memory Utilization
  • GPU Temperature
  • GPU Power Usage

Below are the node-level metrics that are available:

  • CPU Percentage Utilization
  • Total Memory Utilization
  • Average CPU Frequency Over All CPUs
  • NFS Stats
  • Local Disk R/W
  • GPFS Bandwidth Stats
  • Local Disk IOPS
  • GPFS Operations per Second Stats
  • Infiniband Throughput
  • Infiniband Packet Rate
  • Infiniband Errors

Eleven of the seventeen metrics above are node-level. This means that if multiple jobs are running on the node then it will not be possible to disentangle the data. To use these metrics to troubleshoot jobs, the job should allocate the entire node. This can be done by using the --exclusive directive or by requesting all of the available CPU memory. To see the amount of memory available on each node, run the "snodes" command and look at the MEMORY column, which is in units of megabytes. You should enter the value in your Slurm script in megabytes (e.g., #SBATCH --mem=192000M).
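
As a rough sketch (the job name, core count, time limit, and executable below are placeholders), a batch script that allocates an entire node might look like this:

#!/bin/bash
#SBATCH --job-name=whole-node     # placeholder job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8         # placeholder; set to the number of cores your code uses
#SBATCH --time=01:00:00           # placeholder time limit
#SBATCH --exclusive               # allocate the entire node
# Alternatively, omit --exclusive and request all of the CPU memory instead,
# using the MEMORY value reported by snodes (in megabytes), e.g.:
# #SBATCH --mem=192000M

srun ./my_program                 # placeholder executable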


OnDemand via MyAdroit, MyDella, MyStellar

Common applications such as Jupyter, MATLAB, RStudio, Stata and others can be run in your web browser via MyAdroit, MyDella, or MyStellar. To view the job statistics of an active or completed job, follow these steps:

Active Jobs

  1. In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Active Jobs".
  2. Find your job in the list. If you don't see it then make sure the blue button in the upper right reads "Your Jobs" instead of "All Jobs". If "All Jobs" is selected then type your NetID in the "filter" text box in the upper right corner to find your jobs.
  3. Once you have found your job, click on the icon with the right-pointing arrow or angled bracket. You should then see two panels, namely, "Job CPU Utilization" and "Job CPU Memory Utilization".
  4. Click on the blue "Detailed Metrics" link for more metrics.

Completed Jobs

  1. In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Completed Jobs".
  2. Find your job in the list. Click on the job ID, which will be a blue hyperlink (e.g., 39852043).
  3. Examine the CPU and GPU metrics in the pop-up window.
OnDemand Job Stats

Slurm email reports

By adding the following lines to your Slurm batch script (and entering your NetID), you will receive an efficiency report via email (the output of jobstats) upon completion of the job:

#SBATCH --mail-type=end
#SBATCH --mail-user=<YourNetID>@princeton.edu
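
For context, a minimal batch script using these directives might look like the following sketch (the job name, resources, and executable are placeholders):

#!/bin/bash
#SBATCH --job-name=myjob                        # placeholder job name
#SBATCH --ntasks=1
#SBATCH --time=01:00:00                         # placeholder time limit
#SBATCH --mail-type=end                         # send an email when the job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu   # replace <YourNetID> with your NetID

srun ./my_program                               # placeholder executable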

reportseff

reportseff is a wrapper around sacct that provides more flexible option parsing, simpler options, and cleaner, colored output. On Princeton RC systems, multi-node and GPU utilization can be displayed, similar to jobstats.

An example reportseff output showing color and GPU utilization

To see all of the jobs in a directory containing Slurm output files:

$ reportseff
           JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  rfmix-40506885  COMPLETED    03:31:09   58.7%    66.8%    56.3%
  rfmix-40506887  COMPLETED    03:14:13   53.9%    66.6%    56.1%
  rfmix-40506939  COMPLETED    03:47:37   63.2%    68.4%    56.3%
  rfmix-40506941  COMPLETED    03:26:04   57.2%    68.6%    56.0%
  rfmix-40507145  COMPLETED    03:12:42   53.5%    66.3%    56.2%
  rfmix-40507147  COMPLETED    03:24:52   56.9%    68.0%    56.2%
  rfmix-40507195  COMPLETED    03:38:25   60.7%    69.2%    56.3%
  rfmix-40507196  COMPLETED    03:04:54   51.4%    65.6%    56.1%
  rfmix-40507393  COMPLETED    03:35:50   60.0%    64.6%    56.2%
  rfmix-40507394  COMPLETED    03:36:48   60.2%    69.4%    56.0%

To see all of your jobs from the last week:

$ reportseff -u $USER

     JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  40835250   TIMEOUT     02:05:16  104.4%    88.5%     1.8%  
  40835253  COMPLETED    00:15:48   26.3%    96.4%    10.4%  
  40835254  COMPLETED    00:11:58   19.9%    96.9%    10.3%  
  40835255  COMPLETED    04:14:14   21.2%    87.6%    36.2%
  40835256  COMPLETED    00:09:07   15.2%    94.9%     9.2%  
  40835257  COMPLETED    05:07:26   25.6%    85.6%    43.3%
  40835258  COMPLETED    01:28:42   7.4%     83.7%    16.9%  
  40835259  COMPLETED    00:03:13   5.4%     86.6%    10.2%  
  40835260  COMPLETED    01:25:07   7.1%     84.4%    16.9%  

To find lines containing 'output:' in the Slurm output files in the current directory for jobs that timed out or failed in the last 4 days:

$ reportseff --since 'd=4' --state TO,F --format jobid | xargs grep output: 

Using reportseff at Princeton

To use reportseff on a Research Computing cluster like Della, run these commands:

$ ssh <YourNetID>@della.princeton.edu
$ module load anaconda3/2024.2
$ conda create --name rseff-env python=3.11 -y
$ conda activate rseff-env
(rseff-env) $ pip install reportseff
(rseff-env) $ reportseff --help

After creating the Conda environment, you can use the tool directly without activating the environment:

$ /home/$USER/.conda/envs/rseff-env/bin/reportseff --help

Learn more about Conda environments. You can also create an alias in your .bashrc file:

alias rseff='/home/$USER/.conda/envs/rseff-env/bin/reportseff'

Learn more about aliases and shell functions.
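
As a sketch, an equivalent shell function for your .bashrc (using the same environment path as the alias above) that forwards its arguments to reportseff would be:

rseff() {
    /home/$USER/.conda/envs/rseff-env/bin/reportseff "$@"
}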