Performance profiler for parallel and GPU codes

Arm MAP is a graphical and command-line profiler for serial, multithreaded, parallel and GPU-enabled applications written in C, C++ and Fortran. It has an easy-to-use, low-overhead interface. See the documentation for MAP.

Follow these steps to use MAP:

  1. Connect to the login node of the cluster with X11 forwarding enabled (e.g., ssh -X). You may also consider using TurboVNC.
  2. Build your application as you normally would, but also turn on compiler debug symbols. This is typically done by adding the -g option to the icc, gcc, mpicc, ifort, etc., command, which enables source-level profiling. It is also recommended to keep the release-build optimization flags (e.g., -O3, -xHost, -march=native) so that effort is spent optimizing regions not already addressed by the compiler; see the example below.
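
For example, compiling with debug symbols while keeping release optimizations might look like the following (the compilers and flags shown are illustrative; use whichever apply to your code):

gcc -g -O3 -march=native -o mycode mycode.c      # GNU compiler
icc -g -O3 -xHost -o mycode mycode.c             # Intel compiler
mpicc -g -O3 -o mycode mycode.c                  # MPI wrapper compiler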

Latest Version

The latest version of MAP can be made available by running this command:

module load map/24.0

Tiger, Della, Adroit, and Stellar

MAP and DDT are part of the Arm Forge package. To see the available versions, use this command: module avail map. To load the latest module, run: module load map/24.0

  1. Non-MPI jobs (serial or OpenMP)
    • Prepare your Slurm script as you normally would. That is, request the appropriate resources for the job (nodes, tasks, CPUs, walltime, etc.). The addition of MAP should have a negligible impact on the wall clock time.
    • Precede your executable with the map executable along with the flag --profile. For example, if your executable is a.out and you need to give it the command-line argument input.file then use: /usr/licensed/bin/map --profile ./a.out input.file. A sample Slurm script for a serial or OpenMP job is shown after this list.
  2. MPI jobs (including hybrid MPI/OpenMP)
    • See a demo for LAMMPS
    • Prior to submitting your Slurm script, load the necessary MPI modules, then run the following script once: /usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/map/wrapper/build_wrapper. This will create a wrapper library and some symbolic links in $HOME/.allinea/wrapper.
    • In your Slurm submission script, set the following environment variable to point to your newly created .so file: export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-<machine-name>.princeton.edu.so, where <machine-name> is the name of the head node (e.g., tigercpu, della5).
    • Precede your executable with the map executable along with the flag --profile. For example, if your executable is a.out and you need to give it the command-line argument input.file then use: /usr/licensed/bin/map --profile ./a.out input.file
  3. Once the job is complete, a .map file will be created in your working directory. Start the MAP GUI on the head node: /usr/licensed/bin/map. Then select "Load Profile Data File" and choose the new .map file.
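
Below is a minimal sketch of a Slurm script for a non-MPI (serial or OpenMP) job profiled with MAP; the job name, resources, executable (a.out) and input file are placeholders to adapt to your own application:

#!/bin/bash
#SBATCH --job-name=map-serial    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)

module purge
module load map/24.0

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # only needed for OpenMP codes

/usr/licensed/bin/map --profile ./a.out input.file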

Below is a sample Slurm script for an MPI code that uses a GPU:

#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:02:00          # total run time limit (HH:MM:SS)
module purge
module load intel/18.0/64/18.0.3.222
module load intel-mpi/intel/2018.3/64
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-tigergpu.princeton.edu.so
export ALLINEA_LICENSE_FILE=/usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/Licence.default
/usr/licensed/bin/map --profile srun $HOME/.local/bin/lmp_tigerGpu -sf gpu -sf intel -sf omp -in in.melt.gpu

Traverse

Here is an example for a specific code that uses MPI and GPUs:

$ ssh -X <YourNetID>@traverse.princeton.edu
$ module load map/20.0.1 openmpi/gcc/3.1.4/64 cudatoolkit/10.2
$ export MPICC=$(which mpicc)
$ map

Once the GUI opens, click "Profile". A window titled "Run (on traverse.princeton.edu)" will appear. Fill in the needed information and then click "Run". Your code will run and the profiling information will then be displayed. Choose "Stop and Analyze" if the code runs for too long.

GPU Codes

According to the MAP user guide, when compiling CUDA kernels, do not generate debug information for device code (the -G or --device-debug flag), as this can significantly impair runtime performance. Use -lineinfo instead, for example:

nvcc device.cu -c -o device.o -g -lineinfo -O3
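
If the device code is part of an MPI program, the host files can still be compiled and linked with full debug symbols. A minimal sketch, assuming an MPI host source main.cpp and that the cudatoolkit module sets CUDA_HOME (file names and paths are placeholders):

mpicxx -g -O3 -c main.cpp -o main.o                           # host code with debug symbols
mpicxx -g -O3 main.o device.o -o a.out -L$CUDA_HOME/lib64 -lcudart   # link against the CUDA runtime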

There are No Compilers on the Compute Nodes

MAP generates and compiles a small MPI wrapper library at run time using mpicc (or the compiler named in the MPICC environment variable). On clusters whose compute nodes do not provide compilers, this step fails, producing the errors shown later in this section. For reference, here is the MPI configuration reported by the compiler wrapper and by ompi_info:

$ module load openmpi/gcc/4.1.2
$ mpicxx --showme
g++ -I/usr/local/openmpi/4.1.2/gcc/include -pthread -L/usr/local/openmpi/4.1.2/gcc/lib64 -L/usr/lib64 -Wl,-rpath -Wl,/usr/local/openmpi/4.1.2/gcc/lib64 -Wl,-rpath -Wl,/usr/lib64 -Wl,--enable-new-dtags -lmpi_cxx -lmpi
$ ompi_info
                 Package: Open MPI mockbuild@42bbe3ce599c42a79674836eda18c320
                          Distribution
                Open MPI: 4.1.2
  Open MPI repo revision: v4.1.2
   Open MPI release date: Nov 24, 2021
                Open RTE: 4.1.2
  Open RTE repo revision: v4.1.2
   Open RTE release date: Nov 24, 2021
                    OPAL: 4.1.2
      OPAL repo revision: v4.1.2
       OPAL release date: Nov 24, 2021
                 MPI API: 3.1.0
            Ident string: 4.1.2
                  Prefix: /usr/local/openmpi/4.1.2/gcc
 Configured architecture: x86_64-redhat-linux-gnu
          Configure host: 42bbe3ce599c42a79674836eda18c320
           Configured by: mockbuild
           Configured on: Thu Mar  3 17:50:13 UTC 2022
          Configure host: 42bbe3ce599c42a79674836eda18c320
  Configure command line: '--build=x86_64-redhat-linux-gnu'
                          '--host=x86_64-redhat-linux-gnu'
                          '--program-prefix=' '--disable-dependency-tracking'
                          '--prefix=/usr/local/openmpi/4.1.2/gcc'
                          '--exec-prefix=/usr/local/openmpi/4.1.2/gcc'
                          '--bindir=/usr/local/openmpi/4.1.2/gcc/bin'
                          '--sbindir=/usr/local/openmpi/4.1.2/gcc/sbin'
                          '--sysconfdir=/etc'
                          '--datadir=/usr/local/openmpi/4.1.2/gcc/share'
                          '--includedir=/usr/local/openmpi/4.1.2/gcc/include'
                          '--libdir=/usr/local/openmpi/4.1.2/gcc/lib64'
                          '--libexecdir=/usr/local/openmpi/4.1.2/gcc/libexec'
                          '--localstatedir=/var' '--sharedstatedir=/var/lib'
                          '--mandir=/usr/local/openmpi/4.1.2/gcc/man'
                          '--infodir=/usr/local/openmpi/4.1.2/gcc/share/info'
                          '--disable-static' '--enable-shared' '--with-sge'
                          '--enable-mpi_thread_multiple' '--enable-mpi-cxx'
                          '--with-cma' '--sysconfdir=/etc/openmpi/4.1.2/gcc'
                          '--with-esmtp' '--with-slurm' '--with-pmix=/usr'
                          '--with-libevent=/usr'
                          '--with-libevent-libdir=/usr/lib64'
                          '--with-hwloc=/usr' '--with-ucx=/usr'
                          '--with-hcoll=/opt/mellanox/hcoll'
                          '--without-verbs'
                          'LDFLAGS=-Wl,-rpath,/usr/local/openmpi/4.1.2/gcc/lib64
                          -Wl,-z,noexecstack'
                Built by: mockbuild
                Built on: Thu Mar  3 18:01:18 UTC 2022
              Built host: 42bbe3ce599c42a79674836eda18c320
              C bindings: yes
            C++ bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the gfortran compiler and/or Open
                          MPI, does not support the following: array
                          subsections, direct passthru (where possible) to
                          underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 8.5.0
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /usr/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.2)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.2)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.2)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.2)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.2)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.2)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.2)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.2)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.2)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v4.1.2)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.2)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.2)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.2)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.2)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.2)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.2)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA ras: gridengine (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.2)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.2)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.2)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.2)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.2)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: hcoll (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.2)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.2)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component
                          v4.1.2)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.2)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.2)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.2)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v4.1.2)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.2)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v4.1.2)

The Slurm script is:

#!/bin/bash
#SBATCH --job-name=cxx_mpi       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks-per-node=4      # number of tasks per node
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=1G         # memory per cpu-core (4G is default)
#SBATCH --time=00:20:10          # total run time limit (HH:MM:SS)

module purge
module load map/24.0
module load openmpi/gcc/4.1.2

map --profile srun ./hello_world_mpi

When the code is run on a compute node, the following errors appear:

getsebool:  SELinux is disabled
Warning: unrecognised style "CDE"
Linaro Forge 24.0.2 - Linaro MAP
MAP: Unable to automatically generate and compile a MPI wrapper for your system. Please start Linaro Forge with the MPICC environment variable set to the C MPI compiler for the MPI version in use with your program.
MAP: 
MAP: /usr/licensed/linaro/forge/24.0.2/map/wrapper/build_wrapper: line 433: [: argument expected
MAP: No mpicc command found (tried mpixlc_r mpxlc_r mpixlc mpxlc mpiicc mpcc mpicc mpigcc mpgcc mpc_cc)
MAP: 
MAP: Unable to compile MPI wrapper library (needed by the Linaro Forge sampler). Please set the environment variable MPICC to your MPI compiler command and try again.

Or with Intel MPI:

Warning: unrecognised style "CDE"
Linaro Forge 24.0.2 - Linaro MAP
MAP: Unable to automatically generate and compile a MPI wrapper for your system. Please start Linaro Forge with the MPICC environment variable set to the C MPI compiler for the MPI version in use with your program.
MAP: 
MAP: Attempting to generate MPI wrapper using $MPICC ('/opt/intel/oneapi/mpi/2021.7.0/bin/mpiicc').../usr/licensed/linaro/forge/24.0.2/map/wrapper/build_wrapper: line 237: /opt/intel/oneapi/mpi/2021.7.0/bin/mpiicc: No such file or directory
MAP: 
MAP: /bin/sh: /opt/intel/oneapi/mpi/2021.7.0/bin/mpiicc: No such file or directory
MAP: Error: Couldn't run '/opt/intel/oneapi/mpi/2021.7.0/bin/mpiicc -E /tmp/tmpl9wuc5yw.c' for parsing mpi.h.
MAP:        Process exited with code 127.
MAP: fail
MAP: /usr/licensed/linaro/forge/24.0.2/map/wrapper/build_wrapper: line 433: [: argument expected
MAP: 
MAP: Unable to compile MPI wrapper library (needed by the Linaro Forge sampler). Please set the environment variable MPICC to your MPI compiler command and try again.

When running on the head node:

Warning: unrecognised style "CDE"
Linaro Forge 24.0.2 - Linaro MAP
MAP: (Message repeated 2 times.)
MAP: 
MAP: No debug symbols were loaded for the glibc library.
MAP: It is recommended you install the glibc debug symbols.
Profiling             : mpirun /home/aturing/.local/bin/lmp_intel -sf intel -in in.melt
Linaro Forge sampler  : preload (Express Launch)
MPI implementation    : Auto-Detect (Intel MPI (MPMD))
* number of processes : 32
* number of nodes     : 1
* MPI wrapper         : preload (precompiled mpich3-gnu-64) (Express Launch)
Linaro Forge sampler: CUPTI failed to enable kernel activity monitoring - error code 15
Linaro Forge sampler: CUPTI failed to enable kernel activity monitoring - error code 15
…
Linaro Forge sampler: CUPTI failed to enable memcpy activity monitoring - error code 15
MAP: Processes 0-31: 
MAP: 
MAP: The Linaro Forge sampler failed to initialize.
MAP: 
MAP: [Detaching after vfork from child process 3727988]
MAP: [Detaching after vfork from child process 3728168]
MAP: [Detaching after vfork from child process 3728256]
MAP: [Detaching after vfork from child process 3728577]
MAP: 0
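
A sketch of one possible workaround, following the same approach as the MPI-job instructions earlier on this page: pre-build the MPI wrapper on the login node, where the compilers are available, and point MAP at it from the Slurm script. The exact paths and wrapper file name depend on the cluster and Forge version:

# On the login node: load the MPI module and point MPICC at its compiler wrapper
module load openmpi/gcc/4.1.2
export MPICC=$(which mpicc)

# Run the Forge wrapper builder once; it places the library in $HOME/.allinea/wrapper
/usr/licensed/linaro/forge/24.0.2/map/wrapper/build_wrapper

# In the Slurm script, use the pre-built wrapper instead of compiling one at run time
export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-<machine-name>.princeton.edu.so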

Nobel and Tigressdata

On a cluster (Tiger, Della, Adroit, Stellar), use this option only if your job is likely to be scheduled and to complete quickly, since you will have to wait for it to finish before you can analyze any results. The MAP GUI will build and submit your job to the scheduler for you. If the job will not run quickly, it is best to follow the directions above and use the scheduler manually.

  1. Start MAP: “/usr/licensed/bin/map”
  2. If this is the first time you are running MAP on a given machine, MAP will be configured with the default values for that system.  You may need to change these for your application, as described below.
  3. In the opening window select the “Profile” button.
  4. Select your Application, Arguments, Input File, and Working Directory as appropriate. 
  5. If this is an MPI code, check the MPI box.
    • Adjust the Number of Processes (total number of process for the entire job), the number of nodes, and the number of processes per node.  The number of processes should equal the number of nodes multiplied by the number of processes per node.  
    • For Tiger, Della, Adroit, and Stellar the implementation should be "SLURM (generic)". Typically there is no need to change the implementation, nor is there a need for any srun arguments. If the implementation is something else, click Change, then select SLURM (generic) from the MPI/UPC Implementation drop-down menu and click OK.
    • For Nobel and Tigressdata the implementation should be "OpenMPI". If the implementation is something else, click Change, then select OpenMPI from the MPI/UPC Implementation drop-down menu and click OK.
  6. If this is an OpenMP job, check the OpenMP box.
    • Adjust the Number of OpenMP threads as appropriate.
    • OpenMP applications require an additional change.  Click on the Options button (near the bottom left).  This will open another window.  Select "Job Submission" from the left hand menu.  Then change the "Submission template file:" field to "/usr/licensed/ddt/templates/slurm-openmp.qtf".  This should be re-set to slurm-default.qtf for all non-OpenMP applications. 
  7. If running on a cluster check "Submit to Queue", and click on Parameters.
    • Choose a Wall Clock Limit.  The addition of MAP should have a negligible impact on the wall clock time of your application.
    • If you wish to receive an email notification at the beginning (begin) or end (end) of a job, or when the job aborts (fail), change the default to suit your preference (e.g., all).
    • There is no need to change the Email address unless you do not have a Princeton address. If you don't, please specify your email address here.
    • Click OK.
  8. On Nobel and Tigressdata, click on Run; otherwise, click on Submit.
  9. MAP will submit the job to the scheduler (on a cluster) and run when ready.  Profiling statistics will not be available until the job is finished.  Click the “Stop and Analyze” button at the top right to end the job immediately.