Profiling with Arm MAP

Arm MAP is a graphical and command-line profiler for serial, multi-threaded, and parallel applications written in C, C++, and Fortran. Its easy-to-use, low-overhead interface makes it a good first choice for profiling serial, OpenMP, MPI, and hybrid OpenMP/MPI codes. It also supports CUDA applications. Version 20.0.1 can profile some Python scripts that call compiled code. See the documentation for MAP.

Follow these steps to use MAP:

  1. Log into the machine or the head node of the cluster with X11-forwarding enabled (e.g., ssh -X). You may also consider using TurboVNC.
  2. Build your application as you normally would, but also turn on compiler debug symbols. This is typically done by adding the -g option to the compiler command (icc, gcc, mpicc, ifort, etc.) and enables source-level profiling. Keep your release-build optimization flags (e.g., -O3, -xHost, -march=native) so that the profile reflects the code you actually run and your tuning effort goes to regions the compiler has not already optimized.
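
As a rough illustration of step 2, the sketch below assembles a compile line that keeps release optimization and adds -g. The helper name profile_build_cmd and the file names main.c, main.f90, and a.out are placeholders, not part of MAP. The function only prints the command so you can inspect it before running it yourself.

```shell
# Print (do not run) a compile command with debug symbols plus optimization.
# profile_build_cmd is a hypothetical helper; the file names are placeholders.
profile_build_cmd() {
  local compiler="$1"
  shift
  # -g adds the debug symbols MAP uses for source-level profiling;
  # -O3 keeps the build representative of a release build.
  echo "$compiler -g -O3 $*"
}

profile_build_cmd gcc -march=native -o a.out main.c
profile_build_cmd mpicc -o a.out main.c
```

Any extra flags you normally use (for example -xHost with the Intel compilers) are passed through unchanged after -g -O3.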


Tiger, Della, Adroit, and Perseus

  1. Non-MPI jobs (serial or OpenMP)
    • Prepare your Slurm script as you normally would. That is, request the appropriate resources for the job (nodes, tasks, CPUs, wall time, etc.). The addition of MAP should have a negligible impact on the wall clock time.
    • Precede your executable with the map executable along with the flag --profile. For example, if your executable is a.out and you need to give it the command-line argument input.file: "/usr/licensed/bin/map --profile ./a.out input.file"
  2. MPI jobs (including hybrid MPI/OpenMP)
    • Prior to submitting your Slurm script, load the necessary MPI modules, then run the following script once: /usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/map/wrapper/build_wrapper. This will create a wrapper library and some symbolic links in $HOME/.allinea/wrapper.
    • In your Slurm submission script, add the following line so that it points to your newly created .so file: export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-<machine-name> where <machine-name> is the name of the head node (for example, tigercpu or della5).
    • Precede your executable with the map executable along with the flag --profile. For example, if your executable is a.out and you need to give it the command-line argument input.file: "/usr/licensed/bin/map --profile ./a.out input.file".
  3. Once the job is complete, a .map file will be created in your working directory.  Start the MAP GUI on the head node: /usr/licensed/bin/map. Then select "Load Profile Data File" and choose the new .map file.
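
For the non-MPI case in step 1, a job script might look like the sketch below. The resource values, the file name job.slurm, and the OMP_NUM_THREADS setting are illustrative assumptions. Only the map --profile line is taken from the steps above. The script is written with a here-document so the whole example can be pasted into a shell.

```shell
# Write a hypothetical Slurm script for an OpenMP job profiled with MAP.
# Resource requests are placeholders; adjust them for your own job.
cat > job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=map-omp       # short name for the job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # a single task for a non-MPI code
#SBATCH --cpus-per-task=4        # cpu-cores for the OpenMP threads
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)

# Match the thread count to the cores Slurm allocated
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

/usr/licensed/bin/map --profile ./a.out input.file
EOF

# Submit with: sbatch job.slurm
```

For a serial code, set --cpus-per-task=1 and drop the OMP_NUM_THREADS line.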

Below is a sample Slurm script for an MPI code that uses a GPU:

#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:02:00          # total run time limit (HH:MM:SS)

module purge
module load intel/18.0/64
module load intel-mpi/intel/2018.3/64

export ALLINEA_LICENSE_FILE=/usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/Licence.default
/usr/licensed/bin/map --profile srun $HOME/.local/bin/lmp_tigerGpu -sf gpu -sf intel -sf omp -in in.melt.gpu
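
The MPI steps also call for exporting ALLINEA_MPI_WRAPPER in the job script. A minimal sketch of how that export could be made more defensive is shown here. It looks in $HOME/.allinea/wrapper for a file whose name starts with libmap-sampler-pmpi- and exports the first match. find_wrapper is a hypothetical helper, and the glob assumes the naming described in the MPI steps. If build_wrapper left several matching files or links, check which one you want rather than relying on the first.

```shell
# Locate the wrapper library that build_wrapper is expected to have created.
# find_wrapper is a hypothetical helper, not part of MAP.
find_wrapper() {
  local dir="$1" f
  for f in "$dir"/libmap-sampler-pmpi-*; do
    # If nothing matches, the pattern stays literal and -e is false
    if [ -e "$f" ]; then
      echo "$f"
      return 0
    fi
  done
  return 1
}

wrapper_dir="$HOME/.allinea/wrapper"
if wrapper=$(find_wrapper "$wrapper_dir"); then
  export ALLINEA_MPI_WRAPPER="$wrapper"
else
  echo "No wrapper found in $wrapper_dir; run build_wrapper first" >&2
fi
```

Placing these lines before the map --profile command lets the job report a missing wrapper instead of silently running without one.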



Here is an example for a specific code that uses MPI and GPUs:

$ ssh -X <YourNetID>
$ module load map/20.0.1 openmpi/gcc/3.1.4/64 cudatoolkit/10.2
$ export MPICC=$(which mpicc)
$ map

Once the GUI opens, click on "Profile". A window whose title begins with "Run (on" will appear. Fill in the needed information and then click on "Run". Your code will run, and the profiling information will appear when it finishes. Choose "Stop and Analyze" if the code is running for too long.


GPU Codes

According to the MAP user guide, when compiling CUDA kernels, do not generate debug information for device code (the -G or --device-debug flag), as this can significantly impair runtime performance. Use -lineinfo instead, for example:

nvcc device.cu -c -o device.o -g -lineinfo -O3
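
One way to guard against accidentally passing the device-debug flag, for instance in a build script, is a small check like the sketch below. check_nvcc_flags is a hypothetical helper, not an nvcc or MAP feature. It only inspects the flag list you hand it.

```shell
# Reject nvcc flag lists that enable device-side debug information.
# check_nvcc_flags is a hypothetical helper for use in a build script.
check_nvcc_flags() {
  local flag
  for flag in "$@"; do
    case "$flag" in
      -G|--device-debug)
        echo "warning: $flag generates device debug info; use -lineinfo instead" >&2
        return 1
        ;;
    esac
  done
  return 0
}

# The flags from the nvcc example above pass the check
check_nvcc_flags -g -lineinfo -O3 && echo "flags ok"
```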


Nobel and Tigressdata

On a cluster (Tiger, Della, Adroit, Perseus), use this option only if your job is likely to schedule and complete quickly, since you must wait for it to finish before you can analyze any results. The MAP GUI will build and submit your job to the scheduler for you. If the job will not run quickly, it is best to follow the directions above and use the scheduler manually.

  1. Start MAP: “/usr/licensed/bin/map”
  2. If this is the first time you are running MAP on a given machine, MAP will be configured with the default values for that system.  You may need to change these for your application, as described below.
  3. In the opening window select the “Profile” button.
  4. Select your Application, Arguments, Input File, and Working Directory as appropriate. 
  5. If this is an MPI code, check the MPI box.
    • Adjust the Number of Processes (total number of processes for the entire job), the number of nodes, and the number of processes per node. The number of processes should equal the number of nodes multiplied by the number of processes per node.
    • For Tiger, Della, Adroit, and Perseus, the implementation should be "SLURM (generic)". Typically there is no need to change the implementation, nor is there a need for any srun arguments. If the implementation is something else, click Change, select SLURM (generic) from the MPI/UPC Implementation drop-down menu, then click OK.
    • For Nobel and Tigressdata, the implementation should be "OpenMPI". If the implementation is something else, click Change, select OpenMPI from the MPI/UPC Implementation drop-down menu, then click OK.
  6. If this is an OpenMP job, check the OpenMP box.
    • Adjust the Number of OpenMP threads as appropriate.
    • OpenMP applications require an additional change. Click on the Options button (near the bottom left). This will open another window. Select "Job Submission" from the left-hand menu. Then change the "Submission template file:" field to "/usr/licensed/ddt/templates/slurm-openmp.qtf". This should be reset to slurm-default.qtf for all non-OpenMP applications.
  7. If running on a cluster, check "Submit to Queue" and click on Parameters.
    • Choose a Wall Clock Limit.  The addition of MAP should have a negligible impact on the wall clock time of your application.
    • If you want an email notification at the beginning (begin) or end (end) of a job, or when the job aborts (fail), change the default to suit your preference (e.g., all).
    • There is no need to change the Email address unless you do not have a Princeton address. If you don't, please specify your email address here.
    • Click OK.
  8. On Nobel and Tigressdata, click on Run; otherwise, click on Submit.
  9. On a cluster, MAP will submit the job to the scheduler, and the job will run when resources are available. Profiling statistics will not be available until the job is finished. Click the “Stop and Analyze” button at the top right to end the job immediately.
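
The consistency rule in step 5 (number of processes = nodes × processes per node) can be checked with shell arithmetic before filling in the dialog. This is only a sketch. total_procs is a hypothetical helper, and the values 2 and 16 are placeholders.

```shell
# Compute the total process count expected by the MAP run dialog.
# total_procs is a hypothetical helper; arguments are nodes and processes per node.
total_procs() {
  echo $(( $1 * $2 ))
}

nodes=2
procs_per_node=16
echo "Number of Processes: $(total_procs "$nodes" "$procs_per_node")"
```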