Performance profiler for parallel and GPU codes

Arm MAP is a graphical and command-line profiler for serial, multithreaded, parallel and GPU-enabled applications written in C, C++ and Fortran. It has an easy-to-use, low-overhead interface. See the documentation for MAP.

Follow these steps to use MAP:

1. Connect to the login node of the cluster with X11 forwarding enabled (e.g., `ssh -X`). You may also consider using TurboVNC.
2. Build your application as you normally would, but also turn on compiler debug symbols. This is typically done by adding the `-g` option to the `icc`, `gcc`, `mpicc`, `ifort`, etc., command, which enables source-level profiling. It is recommended to keep the release-build optimization flags (e.g., `-O3`, `-xHost`, `-march=native`) so that tuning effort is spent on regions not already addressed by compiler optimizations.

## Latest Version

The latest version of MAP can be made available by running this command:

```
module load map/24.0
```

## Tiger, Della, Adroit and Stellar

MAP and DDT are part of the Arm Forge package. To see the available versions, use `module avail map`. To load the latest module, run `module load map/24.0`.

## Non-MPI jobs (serial or OpenMP)

1. Prepare your Slurm script as you normally would. That is, request the appropriate resources for the job (nodes, tasks, CPUs, walltime, etc.). The addition of MAP should have a negligible impact on the wall-clock time.
2. Precede your executable with the `map` executable and the `--profile` flag. For example, if your executable is `a.out` and you need to give it the command-line argument `input.file`:

   ```
   /usr/licensed/bin/map --profile ./a.out input.file
   ```

A sketch of a complete Slurm script for this case appears below, after the MPI instructions.

## MPI jobs (including hybrid MPI/OpenMP)

See a demo for LAMMPS.

1. Prior to submitting your Slurm script, load the necessary MPI modules, then run the following script once: `/usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/map/wrapper/build_wrapper`. This will create a wrapper library and some symbolic links in `$HOME/.allinea/wrapper`.
2. In your Slurm submission script, point MAP at the newly created `.so` file:

   ```
   export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-<machine-name>.princeton.edu.so
   ```

   where `<machine-name>` is the name of the head node, for example `tigercpu` or `della5`.
3. Precede your executable with the `map` executable and the `--profile` flag. For example, if your executable is `a.out` and you need to give it the command-line argument `input.file`, then use:

   ```
   /usr/licensed/bin/map --profile ./a.out input.file
   ```

4. Once the job is complete, a `.map` file will be created in your working directory. Start the MAP GUI on the head node with `/usr/licensed/bin/map`, then select "Load Profile Data File" and choose the new `.map` file.
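For reference, here is a minimal sketch of a complete Slurm script for the non-MPI (OpenMP) case above. The job name, module, executable (`a.out`) and argument (`input.file`) are placeholders; adjust the resources to match your application.

```bash
#!/bin/bash
#SBATCH --job-name=map-omp       # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # a single task for a non-MPI run
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)

module purge
# load the same modules used to build the code, e.g.:
# module load intel/18.0/64/18.0.3.222

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# MAP samples the run and writes a .map file to the working directory
/usr/licensed/bin/map --profile ./a.out input.file
```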
Below is a sample Slurm script for an MPI code that uses a GPU:

```bash
#!/bin/bash
#SBATCH --job-name=myjob         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is default)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --time=00:02:00          # total run time limit (HH:MM:SS)

module purge
module load intel/18.0/64/18.0.3.222
module load intel-mpi/intel/2018.3/64

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-tigergpu.princeton.edu.so
export ALLINEA_LICENSE_FILE=/usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/Licence.default

/usr/licensed/bin/map --profile srun $HOME/.local/bin/lmp_tigerGpu -sf gpu -sf intel -sf omp -in in.melt.gpu
```

## Traverse

Here is an example for a specific code that uses MPI and GPUs:

```
$ ssh -X <YourNetID>@traverse.princeton.edu
$ module load map/20.0.1 openmpi/gcc/3.1.4/64 cudatoolkit/10.2
$ export MPICC=$(which mpicc)
$ map
```

Once the GUI opens, click on "Profile". A window with the title "Run (on traverse.princeton.edu)" will appear. Fill in the needed information and then click on "Run". Your code will run and the profiling information will then appear. Choose "Stop and Analyze" if the code runs for too long.

## GPU Codes

According to the MAP user guide, when compiling CUDA kernels do not generate debug information for device code (the `-G` or `--device-debug` flag), as this can significantly impair runtime performance. Use `-lineinfo` instead, for example:

```
nvcc device.cu -c -o device.o -g -lineinfo -O3
```
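The same principle extends to a mixed host/device build: `-g` everywhere for source-level profiling, `-lineinfo` (not `-G`) for the device code, and release optimization throughout. A minimal sketch, where `device.cu` and `host.c` are placeholder file names:

```bash
# Device code: line info for the profiler, no device debug (-G), full optimization
nvcc -c device.cu -o device.o -g -lineinfo -O3

# Host code: debug symbols plus release optimization
mpicc -c host.c -o host.o -g -O3

# Link against the CUDA runtime (a -L<path-to-cuda>/lib64 flag may also be
# needed, depending on your environment)
mpicc host.o device.o -o a.out -lcudart
```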
## Nobel and Tigressdata

On a cluster (Tiger, Della, Adroit, Stellar), use this option only if your job will likely schedule and complete quickly, as you will have to wait for it to finish before you can analyze any results. The MAP GUI will build and submit your job to the scheduler for you. If the job will not run quickly, it is best to follow the directions above to use the scheduler manually.

1. Start MAP: `/usr/licensed/bin/map`. If this is the first time you are running MAP on a given machine, MAP will be configured with the default values for that system. You may need to change these for your application, as described below.
2. In the opening window select the "Profile" button.
3. Select your Application, Arguments, Input File, and Working Directory as appropriate.
4. If this is an MPI code, check the MPI box. Adjust the Number of Processes (the total number of processes for the entire job), the number of nodes, and the number of processes per node. The number of processes should equal the number of nodes multiplied by the number of processes per node.
   - For Tiger, Della, Adroit, and Stellar the implementation should be "SLURM (generic)". Typically there is no need to change the implementation, nor is there a need for any srun arguments. If the implementation is something else, click Change, select SLURM (generic) from the MPI/UPC Implementation drop-down menu, then click OK.
   - For Nobel and Tigressdata the implementation should be "OpenMPI". If the implementation is something else, click Change, select OpenMPI from the MPI/UPC Implementation drop-down menu, then click OK.
5. If this is an OpenMP job, check the OpenMP box and adjust the Number of OpenMP Threads as appropriate. OpenMP applications require an additional change: click the Options button (near the bottom left). This will open another window. Select "Job Submission" from the left-hand menu, then change the "Submission template file:" field to `/usr/licensed/ddt/templates/slurm-openmp.qtf`. This should be reset to `slurm-default.qtf` for all non-OpenMP applications.
6. If running on a cluster, check "Submit to Queue" and click on Parameters.
   - Choose a Wall Clock Limit. The addition of MAP should have a negligible impact on the wall-clock time of your application.
   - If you wish to receive an email notification at the beginning (begin) or end (end) of a job, or when the job aborts (fail), change the default to suit your preference (e.g., all). There is no need to change the email address unless you do not have a Princeton address; if you don't, specify your email address here.
   - Click OK.
7. On Nobel and Tigressdata, click on Run; otherwise, click on Submit. MAP will submit the job to the scheduler (on a cluster) and it will run when ready. Profiling statistics will not be available until the job is finished. Click the "Stop and Analyze" button at the top right to end the job immediately.
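To tie the manual (non-GUI) workflow together, here is a condensed end-to-end sketch for an MPI code using the paths given above; `main.c`, `a.out`, `input.file` and `myrun.map` are stand-ins. Passing a finished `.map` file to `map` on the command line typically opens it directly; if that does not work on your system, use "Load Profile Data File" as described above.

```bash
# 1. Build with debug symbols plus release optimization (source-level profiling)
mpicc -g -O3 -o a.out main.c

# 2. One-time setup for MPI codes: build the MPI wrapper library
/usr/licensed/ddt/ddt18.0.2/rhel7/x86_64/map/wrapper/build_wrapper

# 3. Inside the Slurm job script: point MAP at the wrapper and run under the sampler
export ALLINEA_MPI_WRAPPER=$HOME/.allinea/wrapper/libmap-sampler-pmpi-<machine-name>.princeton.edu.so
/usr/licensed/bin/map --profile srun ./a.out input.file

# 4. After the job completes, open the resulting .map file on the login node
/usr/licensed/bin/map myrun.map   # substitute the actual .map file name
```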