Research Computing provides several utilities for examining job behavior, both for Slurm jobs and for work done through OnDemand. These tools can be used to check performance and troubleshoot issues.
Detailed job statistics can be viewed for running and completed Slurm jobs:
- Browse to https://stats.rc.princeton.edu, which covers all clusters (you must be on the campus network, or on the VPN if off-campus), and enter the JobID.
- For stats.rc, in the upper right corner, click the dropdown arrow and change the time range from "Last 6 hours" to "Last 7 days". If your job is older than 7 days, increase the time window further.
- Enter the job id in the "Slurm JobID" text box in the upper left. Press the Enter/Return key. The job data should then display. To find the job id, use the command "shistory -u $USER". This command can also be used to obtain the exact time range of the job if needed.
You can adjust the time range of the plots using your mouse by clicking and dragging over the range of interest. Data is captured every 30 seconds on stats.rc.
Here are the metrics that are available using stats.rc:
- CPU Utilization
- CPU Memory Utilization
- GPU Utilization
- GPU Memory
- GPU Temperature
- GPU Power Usage
- CPU Percentage Utilization
- Total Memory Utilization
- Average CPU Frequency Utilization
- NFS Stats (for all jobs running on the node)
- Local Disk R/W (for all jobs running on the node)
- Local IOPS (for all jobs running on the node)
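Each panel on stats.rc is built from the samples taken every 30 seconds, and summary figures are aggregates of those samples. A purely illustrative sketch of that aggregation, using made-up GPU-utilization values (not real data from stats.rc):

```python
# Hypothetical GPU-utilization samples, one per 30-second interval.
# stats.rc stores real samples like these; the numbers here are invented.
samples = [92.0, 95.0, 88.0, 0.0, 91.0, 94.0]

# Summary statistics of the kind a dashboard panel might report.
mean_util = sum(samples) / len(samples)
peak_util = max(samples)

print(f"mean GPU utilization: {mean_util:.1f}%")
print(f"peak GPU utilization: {peak_util:.1f}%")
```

A long stretch of near-zero samples in an otherwise busy job often indicates idle time worth investigating, which is why inspecting the plots over the full job window is more informative than a single average.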
Common applications such as Jupyter, MATLAB, RStudio, Stata and others can be run in your web browser via MyAdroit, MyDella or MyStellar. To view the job statistics of an active or completed job, follow the steps below. For an active job:
- In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Active Jobs".
- Find your job in the list. If you don't see it then make sure the blue button in the upper right reads "Your Jobs" instead of "All Jobs". If "All Jobs" is selected then type your NetID in the "filter" text box in the upper right corner to find your jobs.
- Once you have found your job, click on the icon with the right-pointing arrow or angled bracket. You should then see two panels, namely, "Job CPU Utilization" and "Job CPU Memory Utilization".
- Click on the blue "Detailed Metrics" link for more metrics.
For a completed job:
- In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Completed Jobs".
- Find your job in the list. Click on the job ID, which will be a blue hyperlink (e.g., 39852043).
- Examine the CPU and GPU metrics in the pop-up window.
Use the "jobstats" command to see various job metrics:
$ jobstats 1234567

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
  NetID/Account: aturing/math
       Job Name: myjob
          State: RUNNING
          Nodes: 1
      CPU Cores: 4
     CPU Memory: 20GB (5GB per CPU-core)
  QOS/Partition: medium/cpu
        Cluster: della
     Start Time: Sun Jun 26, 2022 at 1:34 PM
       Run Time: 1-01:18:59 (in progress)
     Time Limit: 2-23:59:00

                              Overall Utilization
================================================================================
  CPU utilization  [|||||||||||||||||||||||||||||||||||||||||||||||  97%]
  CPU memory usage [|||||||||||||                                    31%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      della-i13n7: 4-02:20:54/4-05:15:58 (efficiency=97.1%)

  CPU memory usage per node - used/allocated
      della-i13n7: 6.0GB/19.5GB (1.5GB/4.9GB per core of 4)

                                     Notes
================================================================================
  * For additional job metrics including metrics plotted against time:
    https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)
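The per-node CPU efficiency in the output above is the CPU time used divided by the total CPU time available (run time multiplied by the number of cores). A short sketch reproducing that arithmetic from the sample numbers; the `slurm_seconds` helper is ours for illustration, not part of jobstats:

```python
def slurm_seconds(t: str) -> int:
    """Convert a Slurm duration such as '1-01:18:59' or '03:31:09' to seconds."""
    days = 0
    if "-" in t:
        d, t = t.split("-")
        days = int(d)
    h, m, s = (int(x) for x in t.split(":"))
    return (days * 24 + h) * 3600 + m * 60 + s

cores = 4
run_time = slurm_seconds("1-01:18:59")   # elapsed wall time of the job
cpu_time = slurm_seconds("4-02:20:54")   # total CPU time used across all cores

# Efficiency = CPU time used / (cores x wall time), as a percentage.
efficiency = 100 * cpu_time / (cores * run_time)
print(f"efficiency={efficiency:.1f}%")   # matches the 97.1% in the sample output
```

An efficiency well below 100% usually means the job could run on fewer cores, or that some portion of the run is serial.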
Use the "shistory" command to view your recent job IDs.
By adding the following lines to your Slurm batch script (replacing <YourNetID> with your NetID), you will receive an efficiency report by email (the output of jobstats) when the job completes:
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
reportseff is a wrapper around sacct that handles the complex option parsing for you, offering simpler options and cleaner, colored output. On Princeton RC systems, it can also display multi-node and GPU utilization, similar to jobstats.
To see all of the jobs in a directory containing Slurm output files:
$ reportseff
         JobID      State    Elapsed  TimeEff   CPUEff   MemEff
rfmix-40506885  COMPLETED   03:31:09    58.7%    66.8%    56.3%
rfmix-40506887  COMPLETED   03:14:13    53.9%    66.6%    56.1%
rfmix-40506939  COMPLETED   03:47:37    63.2%    68.4%    56.3%
rfmix-40506941  COMPLETED   03:26:04    57.2%    68.6%    56.0%
rfmix-40507145  COMPLETED   03:12:42    53.5%    66.3%    56.2%
rfmix-40507147  COMPLETED   03:24:52    56.9%    68.0%    56.2%
rfmix-40507195  COMPLETED   03:38:25    60.7%    69.2%    56.3%
rfmix-40507196  COMPLETED   03:04:54    51.4%    65.6%    56.1%
rfmix-40507393  COMPLETED   03:35:50    60.0%    64.6%    56.2%
rfmix-40507394  COMPLETED   03:36:48    60.2%    69.4%    56.0%
To see all of your jobs from the last week:
$ reportseff -u $USER
   JobID      State    Elapsed  TimeEff   CPUEff   MemEff
40835250    TIMEOUT   02:05:16   104.4%    88.5%     1.8%
40835253  COMPLETED   00:15:48    26.3%    96.4%    10.4%
40835254  COMPLETED   00:11:58    19.9%    96.9%    10.3%
40835255  COMPLETED   04:14:14    21.2%    87.6%    36.2%
40835256  COMPLETED   00:09:07    15.2%    94.9%     9.2%
40835257  COMPLETED   05:07:26    25.6%    85.6%    43.3%
40835258  COMPLETED   01:28:42     7.4%    83.7%    16.9%
40835259  COMPLETED   00:03:13     5.4%    86.6%    10.2%
40835260  COMPLETED   01:25:07     7.1%    84.4%    16.9%
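A low TimeEff value means the job used only a small fraction of its requested time limit, which is a hint that you can request less wall time and get scheduled sooner. A sketch that scans output like the above for such jobs; the column layout is assumed from the sample (this is text parsing, not a reportseff API):

```python
# A few rows of reportseff-style output, copied from the sample above.
sample = """\
   JobID      State    Elapsed  TimeEff  CPUEff  MemEff
40835253  COMPLETED   00:15:48    26.3%   96.4%   10.4%
40835255  COMPLETED   04:14:14    21.2%   87.6%   36.2%
40835258  COMPLETED   01:28:42     7.4%   83.7%   16.9%
"""

flagged = []
for line in sample.splitlines()[1:]:          # skip the header row
    job_id, state, elapsed, time_eff, cpu_eff, mem_eff = line.split()
    if float(time_eff.rstrip("%")) < 25:      # arbitrary 25% threshold
        flagged.append(job_id)
        print(f"{job_id}: TimeEff {time_eff} - consider a shorter time limit")
```

The 25% threshold here is arbitrary; pick whatever cutoff matches your workflow. The same pattern works for CPUEff or MemEff to spot over-requested cores or memory.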
To find lines containing 'output:' in the output files of jobs in the current directory that timed out or failed in the last 4 days:
$ reportseff --since 'd=4' --state TO,F --format jobid | xargs grep output: