Research Computing provides several utilities for examining job behavior, both for Slurm jobs and for work done through OnDemand. These tools can be used to check performance and troubleshoot issues.
Detailed job statistics can be viewed for running and completed Slurm jobs using stats.rc:
- Browse to https://stats.rc.princeton.edu (you need to be on the campus network or on the VPN from off-campus).
- In the upper right corner, click on the dropdown arrow to replace "Last 6 hours" with "Last 7 days". If your job is older than 7 days then you will need to increase the time window.
- Enter the job id in the "Slurm JobID" text box in the upper left. Press the Enter/Return key. The job data should then display. To find the job id, use the command "shistory -u $USER". This command can also be used to obtain the exact time range of the job if needed.
You can adjust the time range of the plots using your mouse by clicking and dragging over the range of interest. Data is captured every 30 seconds on stats.rc.
Here are the metrics that are available using stats.rc:
- CPU Utilization
- CPU Memory Utilization
- GPU Utilization
- GPU Memory
- GPU Temperature
- GPU Power Usage
- CPU Percentage Utilization
- Total Memory Utilization
- Average CPU Frequency Utilization
- NFS Stats
- Local Disc R/W
- Local IOPS
OnDemand via MyAdroit/MyDella/MyStellar
Common applications such as MATLAB, Jupyter, RStudio, Stata and others can be run in your web browser via MyAdroit or MyDella. To view the job statistics of a running or completed job, follow these steps:
- In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Active Jobs".
- Find your job in the list. If you don't see it then make sure the blue button in the upper right reads "Your Jobs" instead of "All Jobs". If "All Jobs" is selected then type your NetID in the "filter" text box in the upper right corner to find your jobs.
- Once you have found your job, click on the icon with the right-pointing arrow or angled bracket. You should then see two panels, namely, "Job CPU Utilization" and "Job CPU Memory Utilization".
- Click on the blue "Detailed Metrics" link for more metrics.
Slurm email reports and seff
By adding the following lines to your Slurm batch script (and entering your NetID) you will receive an efficiency report via email upon completion of the job:
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
Below is a sample email report:
Job ID: 670018
Cluster: adroit
User/Group: aturing/math
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 05:17:21
CPU Efficiency: 92.73% of 05:42:14 core-walltime
Job Wall-clock time: 05:42:14
Memory Utilized: 2.50 GB
Memory Efficiency: 62.5% of 4.00 GB
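The two efficiency figures in the sample report can be reproduced with simple ratios. The following is a minimal Python sketch, assuming that CPU efficiency is the CPU time used divided by the core-walltime (number of cores times wall-clock time) and that memory efficiency is the memory used divided by the memory requested. The function names are illustrative and are not part of any Research Computing tool, and only the plain HH:MM:SS time format shown in the sample is handled.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert a time string of the form HH:MM:SS to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return (hours * 60 + minutes) * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, cores: int) -> float:
    """Percentage of the available core-walltime that was spent on the CPU."""
    core_walltime = cores * hms_to_seconds(wall_clock)
    return 100.0 * hms_to_seconds(cpu_utilized) / core_walltime


def memory_efficiency(used_gb: float, requested_gb: float) -> float:
    """Percentage of the requested memory that the job actually used."""
    return 100.0 * used_gb / requested_gb


# Values taken from the sample report for job 670018.
cpu_pct = cpu_efficiency("05:17:21", "05:42:14", cores=1)
mem_pct = memory_efficiency(2.50, 4.00)

print(f"CPU efficiency: {cpu_pct:.2f}%")
print(f"Memory efficiency: {mem_pct:.1f}%")
```

A memory efficiency well below 100% suggests that less memory could be requested for similar jobs, while a low CPU efficiency on a multi-core job suggests that some of the allocated cores sat idle.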
You can also view this report on the command line with the "seff" command, passing the job id as the first argument:
$ ssh <YourNetID>@adroit.princeton.edu
$ seff 670018
Job ID: 670018
Cluster: adroit
User/Group: aturing/math
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 05:17:21
CPU Efficiency: 92.73% of 05:42:14 core-walltime
Job Wall-clock time: 05:42:14
Memory Utilized: 2.50 GB
Memory Efficiency: 62.5% of 4.00 GB
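If you want to collect these reports for several jobs, the "seff" output can be read from a script. The sketch below is one possible approach in Python, assuming the output keeps the "Key: value" layout shown above. The helpers parse_seff and run_seff are hypothetical names introduced here for illustration. run_seff only works on a machine where "seff" is installed, such as a cluster login node, so the example parses the sample text instead of calling it.

```python
import subprocess


def parse_seff(report: str) -> dict:
    """Split each 'Key: value' line of a seff report into a dictionary entry."""
    fields = {}
    for line in report.splitlines():
        # Split on the first colon only, so times such as 05:17:21 stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def run_seff(job_id: str) -> dict:
    """Run seff for one job and parse its output (requires seff on the PATH)."""
    result = subprocess.run(
        ["seff", job_id], capture_output=True, text=True, check=True
    )
    return parse_seff(result.stdout)


# Sample output copied from the report for job 670018.
sample = """Job ID: 670018
Cluster: adroit
User/Group: aturing/math
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 05:17:21
CPU Efficiency: 92.73% of 05:42:14 core-walltime
Job Wall-clock time: 05:42:14
Memory Utilized: 2.50 GB
Memory Efficiency: 62.5% of 4.00 GB"""

report = parse_seff(sample)
print(report["State"])
print(report["CPU Efficiency"])
```

On a login node, calling run_seff("670018") would return the same kind of dictionary for a real job.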
Use the command "shistory -u $USER" to view your recent job ids.