Job Stats

Research Computing provides several utilities for examining the behavior of both Slurm jobs and work done through OnDemand. These tools can be used to check performance and troubleshoot issues.

 

stats.rc.princeton.edu

Detailed job statistics can be viewed for running and completed Slurm jobs:

  1. Browse to stats.rc.princeton.edu (you must be on the campus network or connected to the VPN from off-campus).
  2. In the upper right corner, click on the dropdown arrow to replace "Last 6 hours" with "Last 7 days". If your job is older than 7 days then you will need to increase the time window further.
  3. Enter the job ID in the "Slurm JobID" text box in the upper left and press the Enter/Return key. The job data should then display. To find the job ID, use the command "shistory -u $USER". This command can also be used to obtain the exact time range of the job if needed.
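
If the time window has to be set by hand, the start and end times of the job are needed. The sketch below is one possible way to look them up. It swaps in the standard Slurm command sacct rather than the site-specific shistory, and the function name job_time_range is hypothetical (it is not an existing utility):

```shell
# Hypothetical helper: print the ID, start time, end time and elapsed time
# of a job so that the stats.rc time window can be chosen to cover it.
# Usage: job_time_range <jobid>
job_time_range() {
    # -j: job to query
    # -X: report the job allocation only (no individual job steps)
    # -n: suppress the header line
    # -P: pipe-delimited output, which is easy to read or parse
    sacct -j "$1" -X -n -P --format=JobID,Start,End,Elapsed
}
```

On a cluster login node, "job_time_range 1234567" would print a single pipe-delimited line for that job. A job that is still running has no end time yet, so the End field will not contain a timestamp.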

You can adjust the time range of the plots using your mouse by clicking and dragging over the range of interest. Data is captured every 30 seconds on stats.rc.

Here are the metrics that are available using stats.rc:

  • CPU Utilization
  • CPU Memory Utilization
  • GPU Utilization
  • GPU Memory
  • GPU Temperature
  • GPU Power Usage
  • CPU Percentage Utilization
  • Total Memory Utilization
  • Average CPU Frequency Utilization
  • NFS Stats (for all jobs running on the node)
  • Local Disc R/W (for all jobs running on the node)
  • Local IOPS (for all jobs running on the node)

 

OnDemand via MyAdroit/MyDella/MyStellar

Common applications such as Jupyter, MATLAB, RStudio, Stata and others can be run in your web browser via MyAdroit, MyDella or MyStellar. To view the job statistics of an active or completed job, follow these steps:

Active Jobs

  1. In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Active Jobs".
  2. Find your job in the list. If you don't see it then make sure the blue button in the upper right reads "Your Jobs" instead of "All Jobs". If "All Jobs" is selected then type your NetID in the "filter" text box in the upper right corner to find your jobs.
  3. Once you have found your job, click on the icon with the right-pointing arrow or angled bracket. You should then see two panels, namely, "Job CPU Utilization" and "Job CPU Memory Utilization".
  4. Click on the blue "Detailed Metrics" link for more metrics.

Completed Jobs

  1. In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Completed Jobs".
  2. Find your job in the list. Click on the job ID, which will be a blue hyperlink (e.g., 39852043).
  3. Examine the CPU and GPU metrics in the pop-up window.

 

jobstats

Use the "jobstats" command to see various job metrics:

$ jobstats 1234567
================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
  NetID/Account: aturing/math
       Job Name: myjob
          State: RUNNING
          Nodes: 1
      CPU Cores: 4
     CPU Memory: 20GB (5GB per CPU-core)
  QOS/Partition: medium/cpu
        Cluster: della
     Start Time: Sun Jun 26, 2022 at 1:34 PM
       Run Time: 1-01:18:59 (in progress)
     Time Limit: 2-23:59:00

                              Overall Utilization
================================================================================
  CPU utilization  [|||||||||||||||||||||||||||||||||||||||||||||||97%]
  CPU memory usage [|||||||||||||||                                31%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      della-i13n7: 4-02:20:54/4-05:15:58 (efficiency=97.1%)

  CPU memory usage per node - used/allocated
      della-i13n7: 6.0GB/19.5GB (1.5GB/4.9GB per core of 4)

                                     Notes
================================================================================
  * For additional job metrics including metrics plotted against time:
    https://mydella.princeton.edu/pun/sys/jobstats  (VPN required off-campus)

Use the command "shistory" to view your recent job IDs.
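
The efficiency value in the "CPU utilization per node" line is the ratio of the two durations printed before it (CPU time used/run time). As a worked check, the minimal sketch below converts both durations to seconds and recomputes the percentage. The function names slurm_to_seconds and cpu_efficiency are hypothetical helpers written for this illustration; they are not part of jobstats:

```shell
# Convert a Slurm-style duration ([D-]HH:MM:SS) to seconds.
slurm_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1 * 86400 + $2 * 3600 + $3 * 60 + $4   # D-HH:MM:SS
        else         print $1 * 3600 + $2 * 60 + $3                # HH:MM:SS
    }'
}

# Efficiency in percent (one decimal place) from the two durations.
# Usage: cpu_efficiency <cpu_time_used> <run_time>
cpu_efficiency() {
    used=$(slurm_to_seconds "$1")
    total=$(slurm_to_seconds "$2")
    awk -v u="$used" -v t="$total" \
        'BEGIN { if (t > 0) printf "%.1f\n", 100 * u / t; else print "n/a" }'
}

# Values taken from the example output above:
# 354054 s used out of 364558 s available.
cpu_efficiency 4-02:20:54 4-05:15:58   # prints 97.1
```

The result matches the efficiency=97.1% reported in the example output.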

 

Slurm email reports

By adding the following lines to your Slurm batch script (and entering your NetID) you will receive an efficiency report via email (which is the output of jobstats) upon completion of the job:

#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu

 

reportseff

reportseff is a wrapper around sacct that provides more complex option parsing, simpler options, and cleaner, colored outputs.  On Princeton RC systems, multi-node and GPU utilization can be displayed, similar to jobstats.

[Figure: an example reportseff output showing color and GPU utilization]

To see all of the jobs in a folder containing Slurm output files:

$ reportseff
           JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  rfmix-40506885  COMPLETED    03:31:09   58.7%    66.8%    56.3%
  rfmix-40506887  COMPLETED    03:14:13   53.9%    66.6%    56.1%
  rfmix-40506939  COMPLETED    03:47:37   63.2%    68.4%    56.3%
  rfmix-40506941  COMPLETED    03:26:04   57.2%    68.6%    56.0%
  rfmix-40507145  COMPLETED    03:12:42   53.5%    66.3%    56.2%
  rfmix-40507147  COMPLETED    03:24:52   56.9%    68.0%    56.2%
  rfmix-40507195  COMPLETED    03:38:25   60.7%    69.2%    56.3%
  rfmix-40507196  COMPLETED    03:04:54   51.4%    65.6%    56.1%
  rfmix-40507393  COMPLETED    03:35:50   60.0%    64.6%    56.2%
  rfmix-40507394  COMPLETED    03:36:48   60.2%    69.4%    56.0%

To see all of your jobs from the last week:

$ reportseff -u $USER

     JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  40835250   TIMEOUT     02:05:16  104.4%    88.5%     1.8%  
  40835253  COMPLETED    00:15:48   26.3%    96.4%    10.4%  
  40835254  COMPLETED    00:11:58   19.9%    96.9%    10.3%  
  40835255  COMPLETED    04:14:14   21.2%    87.6%    36.2%
  40835256  COMPLETED    00:09:07   15.2%    94.9%     9.2%  
  40835257  COMPLETED    05:07:26   25.6%    85.6%    43.3%
  40835258  COMPLETED    01:28:42   7.4%     83.7%    16.9%  
  40835259  COMPLETED    00:03:13   5.4%     86.6%    10.2%  
  40835260  COMPLETED    01:25:07   7.1%     84.4%    16.9%  
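
A table like the one above can be filtered to find jobs that requested far more memory than they used. The following is a minimal sketch, assuming the reportseff output is plain, uncolored text with the column headers shown above. The function name low_mem_jobs is hypothetical:

```shell
# Hypothetical helper: read reportseff-style text on stdin and print the
# job ID and MemEff of every job whose memory efficiency is below the
# given threshold (in percent).
# Usage: <reportseff output> | low_mem_jobs <threshold>
low_mem_jobs() {
    awk -v limit="$1" '
        # Header line: remember which column holds MemEff.
        $1 == "JobID" {
            for (i = 1; i <= NF; i++) if ($i == "MemEff") col = i
            next
        }
        # Data line: only consider values that look like "12.3%".
        col && NF >= col && $col ~ /^[0-9.]+%$/ {
            eff = $col
            sub(/%/, "", eff)
            if (eff + 0 < limit + 0) print $1, $col
        }
    '
}
```

For example, piping the table above through "low_mem_jobs 20" would list every job except 40835255 and 40835257, whose MemEff values (36.2% and 43.3%) are above the threshold.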

To find lines with 'output:' in jobs in the current directory which have timed out or failed in the last 4 days:

$ reportseff --since 'd=4' --state TO,F --format jobid | xargs grep output: