Intel VTune Profiler is a serial and parallel profiler for collecting performance statistics about your code. VTune can profile code written in C, C++, C#, Fortran, Java, and Assembly. VTune is designed for shared-memory machines, so code that uses MPI and/or OpenMP can be profiled as long as it is confined to a single node.

Initial Setup

Set up the VTune environment by loading the VTune module: "module load intel-vtune/oneapi"

Build your application as you normally would, but also turn on compiler debug symbols. This is typically done by adding the -g option to the icc, gcc, mpicc, ifort, etc. command, and it enables source-level profiling. It is recommended to also use release-build optimization flags (e.g. -O3, -xAVX), so that your effort goes into regions the compiler has not already optimized.

Serial Usage with the GUI

Do not use this approach for jobs that run longer than a few minutes. Instead, submit the job to the scheduler and view the results in the GUI (see the section below).

After loading the VTune module, start the GUI from the command line: "vtune-gui"

If this is the first time you have run VTune, click "New Project" and choose a location to store the analysis output (/home/$USER/intel/vtune/projects is the default location). You will be taken to the Configure Analysis tab of your project.

Choose the application that you built during the initial setup (e.g. ~/sample_code.exe) and enter any application parameters you wish to use.

On the right side of the Configure Analysis tab, under Performance Analysis, choose Hotspots. It is recommended to start with basic hotspots and move to more advanced analyses only if necessary.

Click the blue Start button at the bottom of the tab. Your application now runs in the background while VTune collects data. How long this takes depends on your application; VTune should not add a noticeable amount of overhead.
VTune will then finalize the results and display a summary page. Assuming the application was compiled with the -g flag, the Top Hotspots list should point out the most time-consuming functions/subroutines of your program. You can see more information by clicking on the Bottom-up or Top-down Tree tab. Double-clicking a line will bring you to the application's source code and show CPU usage on a per-source-line basis. This points you to the areas that should be the focus of your optimization efforts.

Serial/Parallel Usage Through the Scheduler

The instructions here detail how to run your program under the VTune collector on a remote compute node and then finalize and visualize the results in the GUI on the head node.

Build your application with compiler debug symbols (-g) and release-build optimization flags (e.g. -O3, -xAVX), exactly as described under Initial Setup.

Set up your Slurm submission script as you normally would, with the following changes:

If it is not already in your .bashrc/.cshrc, add the line "module load intel-vtune" before your application run statement.

Ensure that your job lands entirely on a single node, because VTune is not capable of collecting across multiple nodes. Set -N 1 and use the -n flag to request the number of CPUs.

Modify the application run statement so that the "vtune" executable runs your application. For example, to collect hotspots for a code named "mpi_code.exe" and put the results in the (new) local directory "new_dir_name", you would use: srun vtune -r new_dir_name -collect hotspots ./mpi_code.exe

Submit your job to the scheduler, e.g.
"sbatch vtune_slurm_script.cmd"

When your job has finished running, type "vtune-gui new_dir_name" on the head node from the directory one level above new_dir_name. This launches the GUI, finalizes your collector results, and shows a summary page.

The CPU Usage Histogram shows how many CPUs were in use simultaneously. The target value should equal the number of processors requested with the -n flag in your Slurm submission script. You can see more information by clicking on the Bottom-up or Top-down Tree tab. Double-clicking a line will bring you to the application's source code and show CPU usage on a per-source-line basis. This points you to the areas that should be the focus of your optimization efforts.

Additional Information

Intel VTune Profiler User Guide: https://www.intel.com/content/www/us/en/docs/vtune-profiler/user-guide
CPU Metrics Defined: https://software.intel.com/en-us/vtune-amplifier-help-cpu-metrics-reference
Intel Software Developer's Manual: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-…
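
Example Submission Script

As a minimal sketch of the workflow described under "Serial/Parallel Usage Through the Scheduler", the shell snippet below writes a Slurm submission script to vtune_slurm_script.cmd. The module name, the srun line, and the file names are taken from the text above. The task count (-n 4) and the time limit (-t 00:30:00) are placeholder values chosen only for illustration, so replace them with values that suit your job.

```shell
# Write a Slurm submission script for a single-node VTune hotspots run.
# The quoted 'EOF' keeps the script text literal (no shell expansion here).
cat > vtune_slurm_script.cmd <<'EOF'
#!/bin/bash
# Keep the whole job on one node; VTune collects on a single node only.
#SBATCH -N 1
# Number of CPUs (tasks) to request. 4 is an assumed example value.
#SBATCH -n 4
# Wall-time limit. 00:30:00 is an assumed example value.
#SBATCH -t 00:30:00

# Load VTune unless it is already loaded from your .bashrc/.cshrc.
module load intel-vtune

# Run the application under the VTune collector.
# Results go to the (new) directory new_dir_name.
srun vtune -r new_dir_name -collect hotspots ./mpi_code.exe
EOF

echo "Wrote vtune_slurm_script.cmd"
```

The generated file can then be submitted with "sbatch vtune_slurm_script.cmd". The new_dir_name directory must not already exist when the job starts, since VTune is asked to create it for the results.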