stats.rc provides detailed metrics about running and completed jobs. Research Computing provides several utilities for examining job behavior, for both Slurm jobs and work done through OnDemand. These tools can be used to check performance and troubleshoot issues.

jobstats

Use the "jobstats" command to see various job metrics:

$ jobstats 1234567

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 1234567
  NetID/Account: aturing/math
       Job Name: sys_logic_ordinals
          State: COMPLETED
          Nodes: 2
      CPU Cores: 48
     CPU Memory: 256GB (5.3GB per CPU-core)
           GPUs: 4
  QOS/Partition: della-gpu/gpu
        Cluster: della
     Start Time: Fri Mar 4, 2022 at 1:56 AM
       Run Time: 18:41:56
     Time Limit: 4-00:00:00

                              Overall Utilization
================================================================================
  CPU utilization  [|||||                                                  10%]
  CPU memory usage [|||                                                     6%]
  GPU utilization  [||||||||||||||||||||||||||||||||||                     68%]
  GPU memory usage [|||||||||||||||||||||||||||||||||                      66%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      della-i14g2: 1-21:41:20/18-16:46:24 (efficiency=10.2%)
      della-i14g3: 1-18:48:55/18-16:46:24 (efficiency=9.5%)
  Total used/runtime: 3-16:30:16/37-09:32:48, efficiency=9.9%

  CPU memory usage per node - used/allocated
      della-i14g2: 7.9GB/128.0GB (335.5MB/5.3GB per core of 24)
      della-i14g3: 7.8GB/128.0GB (334.6MB/5.3GB per core of 24)
  Total used/allocated: 15.7GB/256.0GB (335.1MB/5.3GB per core of 48)

  GPU utilization per node
      della-i14g2 (GPU 0): 65.7%
      della-i14g2 (GPU 1): 64.5%
      della-i14g3 (GPU 0): 72.9%
      della-i14g3 (GPU 1): 67.5%

  GPU memory usage per node - maximum used/total
      della-i14g2 (GPU 0): 26.5GB/40.0GB (66.2%)
      della-i14g2 (GPU 1): 26.5GB/40.0GB (66.2%)
      della-i14g3 (GPU 0): 26.5GB/40.0GB (66.2%)
      della-i14g3 (GPU 1): 26.5GB/40.0GB (66.2%)

                                     Notes
================================================================================
  * This job requested 5.3 GB of memory per CPU-core. Given that the overall
    CPU memory usage was only 6%, please consider reducing your CPU memory
    allocation for future jobs. This will reduce your queue times and make the
    resources available for other users. For more info:
    https://researchcomputing.princeton.edu/support/knowledge-base/memory

  * This job completed while only needing 19% of the requested time which was
    4-00:00:00. For future jobs, please decrease the value of the --time Slurm
    directive. This will lower your queue times and allow the Slurm job
    scheduler to work more effectively for all users. For more info:
    https://researchcomputing.princeton.edu/support/knowledge-base/slurm

  * For additional job metrics including metrics plotted against time:
    https://mydella.princeton.edu/pun/sys/jobstats (VPN required off-campus)

Use the command "shistory" to view your recent job IDs.
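Notes like those above typically translate into tighter Slurm directives in your submission script. Below is a minimal sketch for the sample job; the values are illustrative, scaled from the roughly 19-hour run time and ~335MB-per-core usage that jobstats reported, with some headroom added:

#!/bin/bash
#SBATCH --job-name=sys_logic_ordinals
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --gres=gpu:2        # 2 GPUs per node (4 total)
#SBATCH --mem-per-cpu=1G    # reduced from 5.3G; only ~335MB per core was used
#SBATCH --time=24:00:00     # reduced from 4-00:00:00; the run took ~19 hours

Resubmitting with right-sized values like these shortens your queue times, as the notes explain.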
Jobstats Web Interface

Detailed job statistics as a function of time can be viewed for running and completed Slurm jobs. Browse to one of the pages below (you must be on the campus network or on the VPN from off-campus) and enter the JobID:

https://myadroit.princeton.edu/pun/sys/jobstats
https://mydella.princeton.edu/pun/sys/jobstats
https://mystellar.princeton.edu/pun/sys/jobstats

or, for all clusters, use https://stats.rc.princeton.edu

For stats.rc, in the upper right corner, click on the dropdown arrow to replace "Last 6 hours" with "Last 7 days". If your job is older than 7 days then you will need to increase the time window. Enter the job ID in the "Slurm JobID" text box in the upper left and press the Enter/Return key. The job data should then display. To find the job ID, use the command "shistory -u $USER". This command can also be used to obtain the exact time range of the job if needed.

You can adjust the time range of the plots with your mouse by clicking and dragging over the range of interest. Data is captured every 30 seconds.

Here are the job-level metrics that are available using stats.rc:

CPU Utilization
CPU Memory Utilization
GPU Utilization
GPU Memory Utilization
GPU Temperature
GPU Power Usage

Below are the node-level metrics that are available:

CPU Percentage Utilization
Total Memory Utilization
Average CPU Frequency Over All CPUs
NFS Stats
Local Disk R/W
Local Disk IOPS
GPFS Bandwidth Stats
GPFS Operations per Second Stats
Infiniband Throughput
Infiniband Packet Rate
Infiniband Errors

Eleven of the seventeen metrics above are node-level. This means that if multiple jobs are running on the node then it will not be possible to disentangle the data. To use these metrics to troubleshoot jobs, the job should allocate the entire node. This can be done by using the --exclusive directive or by requesting all of the available CPU memory (a sketch of the relevant directives appears after the OnDemand instructions below). To see the amount of memory available on each node, run the "snodes" command and look at the MEMORY column, which is in units of megabytes. You should enter the value in your Slurm script in megabytes (e.g., #SBATCH --mem=192000M).

Want to use the Jobstats job monitoring platform at your institution? See the Jobstats GitHub repo.

OnDemand via MyAdroit, MyDella, MyStellar

Common applications such as Jupyter, MATLAB, RStudio, Stata and others can be run in your web browser via MyAdroit, MyDella or MyStellar. To view the job statistics of an active or completed job, follow these steps:

Active Jobs

1. In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Active Jobs".
2. Find your job in the list. If you don't see it then make sure the blue button in the upper right reads "Your Jobs" instead of "All Jobs". If "All Jobs" is selected then type your NetID in the "filter" text box in the upper right corner to find your jobs.
3. Once you have found your job, click on the icon with the right-pointing arrow or angled bracket. You should then see two panels, namely "Job CPU Utilization" and "Job CPU Memory Utilization".
4. Click on the blue "Detailed Metrics" link for more metrics.

Completed Jobs

1. In the OnDemand main menu (which has an orange bar at the top), choose "Jobs" then "Completed Jobs".
2. Find your job in the list. Click on the job ID, which will be a blue hyperlink (e.g., 39852043).
3. Examine the CPU and GPU metrics in the pop-up window.
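As noted in the Jobstats Web Interface section, node-level metrics are only meaningful when your job has the node to itself. Here is a minimal sketch of the relevant Slurm directives; the core count is illustrative and the 192000M figure is the example snodes value from above:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32          # illustrative; match the node's core count from snodes
#SBATCH --exclusive          # allocate the entire node ...
##SBATCH --mem=192000M       # ... or uncomment this line instead to request all of its memory
#SBATCH --time=01:00:00

Either approach keeps other jobs off the node, so the node-level plots in stats.rc can be attributed entirely to your job.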
Slurm email reports

By adding the following lines to your Slurm batch script (and entering your NetID) you will receive an efficiency report via email (which is the output of jobstats) upon completion of the job:

#SBATCH --mail-type=end
#SBATCH --mail-user=<YourNetID>@princeton.edu

reportseff

reportseff is a wrapper around sacct that provides more capable option parsing, simpler options, and cleaner, colored output. On Princeton RC systems, multi-node and GPU utilization can be displayed, similar to jobstats.

To see all of the jobs in a directory containing Slurm output files:

$ reportseff
           JobID      State     Elapsed  TimeEff  CPUEff  MemEff
  rfmix-40506885  COMPLETED    03:31:09    58.7%   66.8%   56.3%
  rfmix-40506887  COMPLETED    03:14:13    53.9%   66.6%   56.1%
  rfmix-40506939  COMPLETED    03:47:37    63.2%   68.4%   56.3%
  rfmix-40506941  COMPLETED    03:26:04    57.2%   68.6%   56.0%
  rfmix-40507145  COMPLETED    03:12:42    53.5%   66.3%   56.2%
  rfmix-40507147  COMPLETED    03:24:52    56.9%   68.0%   56.2%
  rfmix-40507195  COMPLETED    03:38:25    60.7%   69.2%   56.3%
  rfmix-40507196  COMPLETED    03:04:54    51.4%   65.6%   56.1%
  rfmix-40507393  COMPLETED    03:35:50    60.0%   64.6%   56.2%
  rfmix-40507394  COMPLETED    03:36:48    60.2%   69.4%   56.0%

To see all of your jobs from the last week:

$ reportseff -u $USER
     JobID      State     Elapsed  TimeEff  CPUEff  MemEff
  40835250    TIMEOUT    02:05:16   104.4%   88.5%    1.8%
  40835253  COMPLETED    00:15:48    26.3%   96.4%   10.4%
  40835254  COMPLETED    00:11:58    19.9%   96.9%   10.3%
  40835255  COMPLETED    04:14:14    21.2%   87.6%   36.2%
  40835256  COMPLETED    00:09:07    15.2%   94.9%    9.2%
  40835257  COMPLETED    05:07:26    25.6%   85.6%   43.3%
  40835258  COMPLETED    01:28:42     7.4%   83.7%   16.9%
  40835259  COMPLETED    00:03:13     5.4%   86.6%   10.2%
  40835260  COMPLETED    01:25:07     7.1%   84.4%   16.9%

To find lines containing "output:" in the jobs in the current directory which have timed out or failed in the last 4 days:

$ reportseff --since 'd=4' --state TO,F --format jobid | xargs grep output:

Using at Princeton

To use reportseff on a Research Computing cluster like Della, run these commands:

$ ssh <YourNetID>@della.princeton.edu
$ module load anaconda3/2024.2
$ conda create --name rseff-env python=3.11 -y
$ conda activate rseff-env
(rseff-env) $ pip install reportseff
(rseff-env) $ reportseff --help

After creating the Conda environment, you can use the tool directly without activating the environment:

$ /home/$USER/.conda/envs/rseff-env/bin/reportseff --help

Learn more about Conda environments. You could also make an alias in your .bashrc file:

alias rseff='/home/$USER/.conda/envs/rseff-env/bin/reportseff'

Learn more about aliases and shell functions.
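A shell function works as well and, unlike an alias, lets you bake in default options. Here is a hypothetical sketch for your .bashrc; the rseff name and environment path simply mirror the alias above, and the --since 'd=7' default is illustrative:

# Run reportseff from its Conda environment, defaulting to the last 7 days;
# any additional arguments are passed straight through.
rseff() {
    /home/$USER/.conda/envs/rseff-env/bin/reportseff --since 'd=7' "$@"
}

With this in place, rseff -u $USER shows your jobs from the past week, and rseff --state F narrows the listing to failed jobs.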