
Curie advanced user guide - VI-HPS



1. Hardware topology

[Figure: hwloc view of a Curie fat node — Machine (128GB), NUMANodes of 32GB, sockets with their 24MB L3 cache and cores.] Hardware topology of a Curie fat node.

The hardware topology is the organization of cores, processors (sockets) and memory in a node. The previous image was created with hwloc. You can have access to hwloc on Curie with the command: module load hwloc (a short usage sketch follows this page).

Definitions

We define here some vocabulary:
- Binding: a Linux process can be bound (or "stuck") to one or many cores. It means a process and its threads can run only on a given selection of cores. For example, a process which is bound to a socket on a Curie fat node can run on any of the 8 cores of that processor.
- Affinity: it represents the policy of resource management (cores and memory) for processes.
- Distribution: the distribution of MPI processes describes how these processes are spread across the cores, sockets or nodes.

On Curie, the default behaviour for distribution, affinity and binding is managed by SLURM, more precisely by the ccc_mprun command.

Process distribution

We present here some examples of MPI process distributions:
- block (or round): this is the standard distribution. From the SLURM manpage: "The block distribution method will distribute tasks to a node such that consecutive tasks share a node." For example, consider an allocation of two nod…
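As a quick check of the topology described above, hwloc provides the lstopo and hwloc-bind tools. This is only a minimal sketch: the module name comes from the text above, while the exact options shown here are standard hwloc usage and not taken from this guide (./a.out is a placeholder binary).

module load hwloc
# Text rendering of the node topology: sockets, caches, cores and PU numbering.
lstopo --no-io
# Optionally run a command bound to the first socket to see binding in action.
hwloc-bind socket:0 -- ./a.out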
2. …Elapsed time limit in seconds / Standard output (%I is the job id) / Project ID / Disable default SLURM binding (end of the #MSUB header started on the previous page), then:

mpirun -bind-to-core -np 32 ./a.out

Socket binding: the process and its threads can run on all cores of a socket.

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -bind-to-socket -np 32 ./a.out

You can specify the number of cores to assign to an MPI process:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -bind-to-socket -cpus-per-proc 4 -np 8 ./a.out

Here we assign 4 cores per MPI process.

Manual process management

BullxMPI gives the possibility to manually assign your processes through a hostfile and a rankfile. An example:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=no…
3. [curie1342:19946] odls:default:fork binding child [[40080,1],6] to socket 3 cpus 88888888
[curie1342:19946] odls:default:fork binding child [[40080,1],7] to socket 3 cpus 88888888
[curie1342:19946] odls:default:fork binding child [[40080,1],0] to socket 0 cpus 11111111
[curie1342:19946] odls:default:fork binding child [[40080,1],1] to socket 0 cpus 11111111
[curie1342:19946] odls:default:fork binding child [[40080,1],2] to socket 1 cpus 22222222

In the following paragraphs we present the different possibilities of process distribution and binding. These options can be mixed when possible.

Remark: the following examples use a whole Curie fat node. We reserve 32 cores with #MSUB -n 32 and #MSUB -x in order to have all the cores and to do what we want with them. These are only examples for simple cases; in other cases there may be conflicts with SLURM.

Process distribution

Block distribution by core:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -bycore -np 32 ./a.out

Cyclic distribution by socket:

#!/bin/bash
#MSUB -r MyJob_Par…
4. Total flpins: 100056592
MFLOPS: 1944.826

If you want more precision, you can contact us or visit the PAPI website.

VampirTrace / Vampir

VampirTrace is a library which lets you profile your parallel code by taking traces during the execution of the program. We present here an introduction to VampirTrace / Vampir.

Basics

First you must compile your code with the VampirTrace compilers. In order to use VampirTrace, you need to load the vampirtrace module:

bash-4.00$ module load vampirtrace
bash-4.00$ vtcc -c prog.c
bash-4.00$ vtcc -o prog.exe prog.o

Available compilers are:
- vtcc: C compiler
- vtc++, vtCC and vtcxx: C++ compilers
- vtf77 and vtf90: Fortran compilers

To compile an MPI code, you should type:

bash-4.00$ vtcc -vt:cc mpicc -g -c prog.c
bash-4.00$ vtcc -vt:cc mpicc -g -o prog.exe prog.o

For the other languages you have:
- vtcc -vt:cc mpicc: MPI C compiler
- vtc++ -vt:cxx mpic++, vtCC -vt:cxx mpiCC and vtcxx -vt:cxx mpicxx: MPI C++ compilers
- vtf77 -vt:f77 mpif77 and vtf90 -vt:f90 mpif90: MPI Fortran compilers

By default, the VampirTrace wrappers use the Intel compilers. To change to another compiler, you can use the same method as for MPI:

bash-4.00$ vtcc -vt:cc gcc -O2 -c prog.c
bash-4.00$ vtcc -vt:cc gcc -O2 -o prog.exe prog.o

To profile an OpenMP or a hybrid OpenMP/MPI application, you should add the corresponding OpenMP option for the compiler (a combined sketch follows this page):

bash-4.00$ vtcc -openmp -O2 -c prog.c
bash-4.00$ …
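Putting the pieces above together, here is a minimal sketch for instrumenting a hybrid MPI + OpenMP C code with the VampirTrace wrappers. prog.c and prog.exe are placeholder names; -vt:cc and the Intel -openmp flag are the options introduced above, combined here as an illustration rather than copied from the guide.

module load vampirtrace
# MPI wrapper selected with -vt:cc, OpenMP enabled with the compiler's -openmp flag
vtcc -vt:cc mpicc -openmp -O2 -c prog.c
vtcc -vt:cc mpicc -openmp -O2 -o prog.exe prog.o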
5. cpu_bind=MASK - curie1139, task  5  5 [18761]: mask 0x40404040 set
cpu_bind=MASK - curie1139, task  0  0 [18756]: mask 0x1010101 set
cpu_bind=MASK - curie1139, task  1  1 [18757]: mask 0x10101010 set
cpu_bind=MASK - curie1139, task  6  6 [18762]: mask 0x8080808 set
cpu_bind=MASK - curie1139, task  4  4 [18760]: mask 0x4040404 set
cpu_bind=MASK - curie1139, task  3  3 [18759]: mask 0x20202020 set
cpu_bind=MASK - curie1139, task  2  2 [18758]: mask 0x2020202 set
cpu_bind=MASK - curie1139, task  7  7 [18763]: mask 0x80808080 set

We can see here that the MPI rank 0 process is launched over the cores 0, 8, 16 and 24 of the node. These cores are all located on the node's first socket.

Remark: with the -c option, SLURM will try to gather the cores as well as possible to get the best performance. In the previous example, all the cores of an MPI process will be located on the same socket.

Another example:

ccc_mprun -n 1 -c 32 -E '--cpu_bind=verbose' ./a.out
cpu_bind=MASK - curie1017, task  0  0 [34710]: mask 0xffffffff set

We can see the process is not bound to a core and can run over all the cores of a node.

BullxMPI

BullxMPI has its own process management policy. To use it, you first have to disable SLURM's process management policy by adding the directive #MSUB -E '--cpu_bind=none'. You can then use the BullxMPI launcher mpirun:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MS…
6. [Screenshot (continued): Cube result browser panes for the Scalasca example run — metric tree (point-to-point and collective communication, bytes transferred, computational imbalance), MPI call tree and per-process statistics.] Scalasca window.

If you need more information, you can contact us.

Scalasca and Vampir

Scalasca can generate an OTF trace file in order to visualize it with Vampir. To activate traces, you can add the -t option to scalasca when you…
7. #MSUB -o example_%I.o           # Standard output. %I is the job id
#MSUB -e example_%I.e           # Error output. %I is the job id
#MSUB -A paxxxx                 # Project ID
#MSUB -E '-x curie[1000-1003]'  # Exclude 4 nodes: curie1000 to curie1003
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

MPI

Embarrassingly parallel jobs and MPMD jobs

- An embarrassingly parallel job is a job which launches independent processes. These processes need few or no communications.
- An MPMD job is a parallel job which launches different executables over the processes. An MPMD job can be parallel with MPI and can do many communications.

These two concepts are separate, but we present them together because the way to launch them on Curie is similar. A simple example was already given in the Curie info page. In the following example, we use ccc_mprun to launch the job (srun can be used too). We want to launch bin0 on the MPI rank 0, bin1 on the MPI rank 1 and bin2 on the MPI rank 2. We first have to write a shell script which describes the topology of our job (an alternative form is sketched after this page):

cat launch_exe.sh
#!/bin/bash
if [ ${SLURM_PROCID} -eq 0 ]; then
  ./bin0
fi
if [ ${SLURM_PROCID} -eq 1 ]; then
  ./bin1
fi
if [ ${SLURM_PROCID} -eq 2 ]; then
  ./bin2
fi

We can then launch our job with 3 processes:

ccc_mprun -n 3 ./launch_exe.sh

The script launch_exe.sh must have execute permission. When ccc_mprun launches the job, it will initialize some environment variables. Among them, SLURM_PROCID defines th…
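An equivalent way to write launch_exe.sh, shown here only as a sketch (bin0/bin1/bin2 are the placeholder binaries from the example above), is a single case statement on SLURM_PROCID:

#!/bin/bash
# Select the binary according to the MPI rank given by SLURM.
case ${SLURM_PROCID} in
  0) ./bin0 ;;
  1) ./bin1 ;;
  2) ./bin2 ;;
esac

Remember to give the script execute permission (chmod +x launch_exe.sh) before launching it with ccc_mprun -n 3 ./launch_exe.sh.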
8. …#MSUB -T 1800               # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -np 32 ./a.out

Note: in this example, the BullxMPI process management policy can only be effective on the 32 cores allocated by SLURM. The default BullxMPI process management policy is:
- the processes are not bound;
- the processes can run on all cores;
- the default distribution is block by core and by node.

The option -report-bindings gives you a report about the binding of processes before the run:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -report-bindings -bind-to-socket -cpus-per-proc 4 -np 8 ./a.out

And here is the output:

mpirun -report-bindings -bind-to-socket -cpus-per-proc 4 -np 8 ./a.out
[curie1342:19946] odls:default:fork binding child [[40080,1],3] to socket 1 cpus 22222222
[curie1342:19946] odls:default:fork binding child [[40080,1],4] to socket 2 cpus 44444444
[curie1342:19946] odls:default:fork binding child [[40080,1],5] to socket 2 cpus 44444444
9. …before the ccc_mprun command. The form of the corresponding environment variable is OMPI_MCA_xxxxx, where xxxxx is the parameter:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 32            # Number of tasks to use
#MSUB -T 1800          # Elapsed time limit in seconds
#MSUB -o example_%I.o  # Standard output. %I is the job id
#MSUB -e example_%I.e  # Error output. %I is the job id
#MSUB -A paxxxx        # Project ID
set -x
cd ${BRIDGE_MSUB_PWD}
export OMPI_MCA_mpi_show_mca_params=all
ccc_mprun ./a.out

Optimizing with BullxMPI

You can try these parameters in order to optimize BullxMPI:

export OMPI_MCA_mpi_leave_pinned=1

This setting improves the bandwidth for communications if the code uses the same buffers for communications during the execution.

export OMPI_MCA_btl_openib_use_eager_rdma=1

This parameter optimizes the latency for short messages on the Infiniband network, but the code will use more memory.

Be careful: these parameters are not set by default. They can influence the behaviour of your codes.

Debugging with BullxMPI

Sometimes BullxMPI codes can hang in a collective communication for large jobs. If you find yourself in this case, you can try this parameter:

export OMPI_MCA_coll=^ghc,tuned

This setting disables the optimized collective communications; it can slow down your code if it uses many collective operations.

Process distribution, affinity and binding

Introduction…
10. …in the following section.

Intel

- -opt-report: generates a report which describes the optimisations, on stderr (-O3 required).
- -ip, -ipo: inter-procedural optimizations (single- and multi-file). The command xiar must be used instead of ar to generate a static library from objects compiled with the -ipo option.
- -fast: default high optimisation level (-O3 -ipo -static). Careful: this option is not allowed with MPI, because the MPI context needs to call some libraries which only exist in dynamic form, which is incompatible with the -static option. You need to replace -fast by -O3 -ipo.
- -ftz: considers all the denormalized numbers (like INF or NAN) as zeros at runtime.
- -fp-relaxed: mathematical optimisation functions; leads to a small loss of accuracy.
- -pad: makes the modification of the memory positions operational (ifort only).

There are some options which allow the use of specific instruction sets of Intel processors in order to optimize the code (see the sketch after this page). These options are compatible with most Intel processors; the compiler will try to generate these instructions if the processor allows it:
- -xSSE4.2: may generate Intel SSE4 Efficient Accelerated String and Text Processing instructions, Intel SSE4 Vectorizing Compiler and Media Accelerator, Intel SSSE3, SSE3, SSE2 and SSE instructions.
- -xSSE4.1: may generate Intel SSE4 Vectorizing Compiler and Media Accelerator instructions for Intel processors. May generate Intel SSSE3…
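As a hedged illustration of the options listed above, a typical optimized build of an MPI Fortran code could look like the sketch below. File names are placeholders, the mpif90 wrapper is assumed to sit on top of the Intel compilers (the default on Curie according to this guide), and -O3 -ipo is used instead of -fast as recommended; xiar replaces ar for the static library built from -ipo objects.

# Compile with inter-procedural optimization and an optimization report on stderr
mpif90 -O3 -ipo -opt-report -c solver.f90
mpif90 -O3 -ipo -opt-report -c main.f90
# Static library from -ipo objects: xiar instead of ar
xiar rcs libsolver.a solver.o
# Final (dynamic) link, compatible with the MPI libraries
mpif90 -O3 -ipo -o prog.exe main.o -L. -lsolver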
11. …is only one. You can increase or reduce the size of the buffer (your code will then also use more memory). To change the size, you have to set an environment variable:

export VT_BUFFER_SIZE=64M
ccc_mprun ./prog.exe

In this example the buffer is set to 64 MB. We can increase the maximum number of flushes:

export VT_MAX_FLUSHES=10
ccc_mprun ./prog.exe

If the value of VT_MAX_FLUSHES is 0, the number of flushes is unlimited.

By default, VampirTrace will first store profiling information in a local directory (/tmp) of the process. These files can be very large and fill the directory. You should change this local directory to another location (a combined sketch follows this page):

export VT_PFORM_LDIR=${SCRATCHDIR}

There are more VampirTrace variables which can be used; see the User Manual for more precision.

Vampirserver

Traces generated by VampirTrace can be very large, and Vampir can be very slow if you want to visualize these traces. Vampir provides Vampirserver: it is a parallel program which uses CPU computing to accelerate the Vampir visualization. Firstly, you have to submit a job which will launch Vampirserver on Curie nodes:

cat vampirserver.sh
#!/bin/bash
#MSUB -r vampirserver        # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o vampirserver_%I.o   # Standard output. %I is the job id
#MSUB -e vampirserver_%I.e   # Error output. %I is the job id
ccc_mprun vngd

module load vamp…
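A minimal submission-script sketch combining the three VampirTrace variables discussed above (buffer size, flush count and local trace directory). The MSUB header values and prog.exe are placeholders reused from the other examples in this guide, not a prescribed configuration.

#!/bin/bash
#MSUB -r vt_trace              # Request name
#MSUB -n 32                    # Number of tasks to use
#MSUB -T 1800                  # Elapsed time limit in seconds
#MSUB -o vt_trace_%I.o         # Standard output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
export VT_BUFFER_SIZE=64M            # larger trace buffer per process
export VT_MAX_FLUSHES=10             # allow up to 10 flushes to disk
export VT_PFORM_LDIR=${SCRATCHDIR}   # keep intermediate traces out of /tmp
ccc_mprun ./prog.exe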
12. …some regions of memory than others. This is due to the fact that all memory regions are not physically on the same bus.

[Figure: NUMA node — Curie hybrid node.]

In this picture, we can see that if a piece of data is in the memory module 0, a process running on the second socket (like the 4th process) will take more time to access the data. We can introduce the notion of local data vs. remote data: in our example, if we consider a process running on socket 0, a piece of data is local if it is in the memory module 0, and remote if it is in the memory module 1. We can then deduce the reasons why tuning the process affinity is important (a way to inspect the NUMA layout is sketched after this page):

- Data locality improves performance. If your code uses shared memory (like pthreads or OpenMP), the best choice is to group your threads on the same socket: the shared data should be local to the socket and, moreover, the data will potentially stay in the processor's cache.
- System processes can interrupt your process running on a core. If your process is not bound to a core or to a socket, it can be moved to another core or socket; in this case, all the data of this process have to be moved with it, which can take some time.
- MPI communications are faster between processes which are on the same socket. If you know that two processes communicate a lot, you can bind them to the same socket.
- On Curie hybrid nodes, the GPUs are connected to buses which are local to a socket. Processes can take a longer tim…
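To see the local/remote memory layout discussed above for yourself, you can query the NUMA configuration of a node. This is a sketch only: it assumes the standard Linux numactl utility is available on the node (which this guide does not state); lstopo from hwloc gives equivalent information.

numactl --hardware   # lists the NUMA nodes, their memory sizes and the inter-node distances
numactl --show       # shows the NUMA/CPU binding of the current shell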
13. vtcc -openmp -O2 -o prog.exe prog.o

Then you can submit your job. Here is an example of submission script:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 32            # Number of tasks to use
#MSUB -T 1800          # Elapsed time limit in seconds
#MSUB -o example_%I.o  # Standard output. %I is the job id
#MSUB -e example_%I.e  # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./prog.exe

At the end of the execution, the program generates many profiling files:

bash-4.00$ ls
a.out  a.out.0.def.z  a.out.1.events.z  a.out.otf

To visualize those files, you must load the vampir module:

bash-4.00$ module load vampir
bash-4.00$ vampir a.out.otf

[Screenshot: Vampir Trace View window of an example OTF trace, with the per-process timeline and the function legend.] Vampir window.

If you need more information, you can contact us.

Tips

VampirTrace allocates a buffer to store its profiling information. If the buffer is full, VampirTrace will flush the buffer to disk. By default, the size of this buffer is 32MB per process and the maximum number of flushes…
14. …046163

To get the available hardware counters, you can type the papi_avail command. This library can also retrieve the MFLOPS of a certain region of your code:

program main
  implicit none
  include 'f90papi.h'
  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 100
  double precision, dimension(size,size) :: A, B, C
  integer :: i, j, n
  ! PAPI variables
  integer :: retval
  real(kind=4) :: proc_time, mflops, real_time
  integer(kind=8) :: flpins
  ! Init PAPI
  retval = PAPI_VER_CURRENT
  call PAPIf_library_init(retval)
  if (retval .NE. PAPI_VER_CURRENT) then
    print *, 'PAPI library_init', retval
  endif
  call PAPIf_query_event(PAPI_FP_INS, retval)
  ! Init matrix
  do i = 1, size
    do j = 1, size
      C(i,j) = real(i*j, 8)
      B(i,j) = 0.1*j
    end do
  end do
  ! Setup counter
  call PAPIf_flips(real_time, proc_time, flpins, mflops, retval)
  ! DAXPY
  do n = 1, ntimes
    do i = 1, size
      do j = 1, size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do
  ! Collect the data into the variables passed in
  call PAPIf_flips(real_time, proc_time, flpins, mflops, retval)
  ! Print results
  print *, 'Real time:    ', real_time
  print *, 'Proc time:    ', proc_time
  print *, 'Total flpins: ', flpins
  print *, 'MFLOPS:       ', mflops
end program main

and the output:

bash-4.00$ module load papi/4.1.3
bash-4.00$ ifort -I${PAPI_INC_DIR} papi_flops.f90 ${PAPI_LIBS}
bash-4.00$ ./a.out
Real time:  6.1250001E-02
Proc time:  5.1447589E-02
15. Contents

- 1 Curie's advanced usage manual
- 2 Optimization
  - 2.1 Compilation options
    - 2.1.1 Intel
      - 2.1.1.1 Intel Sandy Bridge processors
    - 2.1.2 GNU
- 3 Submission
  - 3.1 Choosing or excluding nodes
- 4 MPI
  - 4.1 Embarrassingly parallel jobs and MPMD jobs
  - 4.2 BullxMPI
    - 4.2.1 MPMD jobs
    - 4.2.2 Tuning BullxMPI
    - 4.2.3 Optimizing with BullxMPI
    - 4.2.4 Debugging with BullxMPI
- 5 Process distribution, affinity and binding
  - 5.1 Introduction
    - 5.1.1 Hardware topology
    - 5.1.2 Definitions
    - 5.1.3 Process distribution
    - 5.1.4 Why is affinity important for improving performance
    - 5.1.5 CPU affinity mask
  - 5.2 SLURM
    - 5.2.1 Process distribution
      - 5.2.1.1 Curie hybrid node
    - 5.2.2 Process binding
  - 5.3 BullxMPI
    - 5.3.1 Process distribution
    - 5.3.2 Process binding
    - 5.3.3 Manual process management
- 6 Using GPU
  - 6.1 Two sequential GPU runs on a single hybrid node
- 7 Profiling
  - 7.1 PAPI
  - 7.2 VampirTrace / Vampir
    - 7.2.1 Basics
    - 7.2.2 Tips
    - 7.2.3 Vampirserver
    - 7.2.4 CUDA profiling
  - 7.3 Scalasca
    - 7.3.1 Standard utilization
    - 7.3.2 Scalasca and Vampir
    - 7.3.3 Scalasca + PAPI
  - 7.4 Paraver
    - 7.4.1 Trace generation
    - 7.4.2 Converting traces to Paraver format
    - 7.4.3 Launching Paraver

Curie's advanced usage manual

If you have suggestions or remarks, please contact us: hotline.tgcc@cea.fr

Optimization

Compilation options

Compilers provide many options to optimize a code. These options are described…
16. …SSE3, SSE2 and SSE instructions.
- -xSSSE3: may generate Intel SSSE3, SSE3, SSE2 and SSE instructions for Intel processors.
- -xSSE3: may generate Intel SSE3, SSE2 and SSE instructions for Intel processors.
- -xSSE2: may generate Intel SSE2 and SSE instructions for Intel processors.
- -xHost: this option applies one of the previous options depending on the processor on which the compilation is performed. This option is recommended for optimizing your code.

None of these options is used by default. The SSE instructions use the vectorization capability of Intel processors.

Intel Sandy Bridge processors

Curie thin nodes use the latest Intel processors, based on the Sandy Bridge architecture. This architecture provides new vectorization instructions called AVX, for Advanced Vector eXtensions. The option -xAVX allows the generation of code specific to the Curie thin nodes. Be careful: a code generated with the -xAVX option runs only on Intel Sandy Bridge processors; otherwise you will get this error message:

Fatal Error: This program was not built to run in your system. Please verify that both the operating system and the processor support Intel(R) AVX.

Curie login nodes are Curie large nodes with Nehalem EX processors. AVX codes can be generated on these nodes through cross-compilation, by adding the -xAVX option (a short sketch follows this page). On a Curie large node, the -xHost option will not generate AVX code. If you need to compile with -xHost, or if the installation requires som…
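A short sketch of the cross-compilation described above, building an AVX binary for the thin nodes from a Nehalem login node; prog.f90 and prog_avx.exe are placeholder names.

ifort -O3 -xAVX -c prog.f90            # AVX code is generated even though the login node cannot run it
ifort -O3 -xAVX -o prog_avx.exe prog.o
# Running prog_avx.exe on the login node would fail with the "Intel(R) AVX" fatal error quoted above;
# submit it to the thin-node partition instead.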
17. …_VISIBLE_DEVICES=${SLURM_PROCID}
# the first process will see only the first GPU and the second process will see only the second GPU
if [ ${SLURM_PROCID} -eq 0 ]; then
  ./bin_1 > job_${SLURM_PROCID}.out
fi
if [ ${SLURM_PROCID} -eq 1 ]; then
  ./bin_2 > job_${SLURM_PROCID}.out
fi

To work correctly, the two binaries have to be sequential (not using MPI). Then run your script, making sure to submit two MPI processes with 4 cores per process:

cat multi_jobs_gpu.sh
#!/bin/bash
#MSUB -r jobs_gpu
#MSUB -n 2                      # 2 tasks
#MSUB -N 1                      # 1 node
#MSUB -c 4                      # each task takes 4 cores
#MSUB -q hybrid
#MSUB -T 1800
#MSUB -o multi_jobs_gpu_%I.out
#MSUB -e multi_jobs_gpu_%I.out
set -x
cd ${BRIDGE_MSUB_PWD}
export OMP_NUM_THREADS=4
ccc_mprun -E '--wait=0' -n 2 -c 4 ./launch_exe.sh
# --wait=0 tells SLURM not to kill the job if one of the two processes terminates before the other

So your first process will be located on the first CPU socket and the second process will be on the second CPU socket (each socket is linked with a GPU):

ccc_msub multi_jobs_gpu.sh

Profiling

PAPI

PAPI is an API which allows you to retrieve hardware counters from the CPU. Here is an example in Fortran which gets the number of floating point operations of a matrix DAXPY:

program main
  implicit none
  include 'f90papi.h'
  integer, parameter :: size = 1000
  integer, parameter :: ntimes = 10
  double precision, dimension(size,…
18. …a (end of the request name of the cyclic-by-socket script started on the previous page)
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -bysocket -np 32 ./a.out

Cyclic distribution by node:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -N 16
#MSUB -x                     # Require exclusive nodes
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -bynode -np 32 ./a.out

Process binding

No binding:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding

mpirun -bind-to-none -np 32 ./a.out

Core binding:

#!/bin/bash
#MSUB -r MyJob_Para          # Request name
#MSUB -n 32                  # Number of tasks to use
#MSUB -x                     # Require an exclusive node
#MSUB -T 1800                # Elapsed time limit in seconds
#MSUB -o example_%I.o        # Standard output. %I is the job id
#MSUB -A paxxxx              # Project ID
#MSUB -E '--cpu_bind=none'   # Disable default SLURM binding
…
19. …job id (end of the #MSUB -o line of the previous script)
#MSUB -e example_%I.e   # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
export SCAN_MPI_LAUNCHER=ccc_mprun
scalasca -analyze ccc_mprun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00$ ls
epik_...

To visualize those files, you can type:

bash-4.00$ scalasca -examine epik_...

[Screenshot: Cube 3 browser (epik_helmholtz..._sum summary cube, on titane007) showing the metric tree, the call tree / flat view and the per-process pane.]…
20. …cess a GPU if this process is not on the same socket as the GPU. By default, the distribution is block by core: the MPI rank 0 is then located on the first socket and the MPI rank 1 is on the first socket too. The majority of GPU codes will assign GPU 0 to MPI rank 0 and GPU 1 to MPI rank 1; in this case, the bandwidth between MPI rank 1 and GPU 1 is not optimal. If your code does this, in order to obtain the best performance you should:
- use the block:cyclic distribution;
- if you intend to use only 2 MPI processes per node, reserve 4 cores per process with the directive #MSUB -c 4. The two processes will then be placed on two different sockets.

Process binding

By default, processes are bound to a core. For multi-threaded jobs, processes create threads; these threads will be bound to the assigned core. To allow these threads to use other cores, SLURM provides the option -c to assign several cores to a process:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 8             # Number of tasks to use
#MSUB -c 4             # Assign 4 cores per process
#MSUB -T 1800          # Elapsed time limit in seconds
#MSUB -o example_%I.o  # Standard output. %I is the job id
#MSUB -A paxxxx        # Project ID
export OMP_NUM_THREADS=4
ccc_mprun ./a.out

In this example, our hybrid OpenMP/MPI code runs on 8 MPI processes and each process will use 4 OpenMP threads. We give here an example of the output with the verbose option for binding:

ccc_mprun ./a.out…
21. …e current MPI rank.

BullxMPI

MPMD jobs

BullxMPI (or OpenMPI) jobs can be launched with the mpirun launcher. In this case, we have other ways to launch MPMD jobs (see the embarrassingly parallel jobs section). We take the same example as in the embarrassingly parallel jobs section. There are then two ways of launching MPMD scripts:

- We don't need launch_exe.sh anymore; we can launch the job directly with the mpirun command:

mpirun -np 1 ./bin0 : -np 1 ./bin1 : -np 1 ./bin2

- In launch_exe.sh, we can replace SLURM_PROCID by OMPI_COMM_WORLD_RANK:

cat launch_exe.sh
#!/bin/bash
if [ ${OMPI_COMM_WORLD_RANK} -eq 0 ]; then
  ./bin0
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 1 ]; then
  ./bin1
fi
if [ ${OMPI_COMM_WORLD_RANK} -eq 2 ]; then
  ./bin2
fi

We can then launch our job with 3 processes:

mpirun -np 3 ./launch_exe.sh

Tuning BullxMPI

BullxMPI is based on OpenMPI, and it can be tuned with parameters. The command ompi_info -a gives you the list of all parameters and their descriptions (a short sketch follows this page):

curie50$ ompi_info -a
...
MCA mpi: parameter "mpi_show_mca_params" (current value: <none>, data source: default value)
         Whether to show all MCA parameter values during MPI_INIT or not (good for reproducibility of MPI jobs, for debug purposes). Accepted values are all, default, file, api and environment - or a comma-delimited combination of them.

These parameters can be modified with environment variables, set…
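Before exporting an OMPI_MCA_* variable, it can be useful to check the parameter's current value and description. A small sketch using the ompi_info command introduced above (the grep filter is just a convenience, not a BullxMPI feature; mpi_leave_pinned is the parameter discussed in the optimization section of this guide):

ompi_info -a | grep mpi_leave_pinned   # inspect the parameter before overriding it
export OMPI_MCA_mpi_leave_pinned=1     # then set it for the next ccc_mprun/mpirun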
22. …e tests (like the autotools' configure), you can submit a job which will compile on the Curie thin nodes.

GNU

There are some options which allow the usage of specific instruction sets of Intel processors in order to optimize the code behaviour (a short sketch follows this page). These options are compatible with most Intel processors; the compiler will try to use these instructions if the processor allows it:
- -mmmx / -mno-mmx: switches on or off the usage of said instruction set
- -msse / -mno-sse: idem
- -msse2 / -mno-sse2: idem
- -msse3 / -mno-sse3: idem
- -mssse3 / -mno-ssse3: idem
- -msse4.1 / -mno-sse4.1: idem
- -msse4.2 / -mno-sse4.2: idem
- -msse4 / -mno-sse4: idem
- -mavx / -mno-avx: idem (for the Curie thin nodes partition only)

Submission

Choosing or excluding nodes

SLURM provides the possibility to choose or exclude any nodes in the reservation for your job.

To choose nodes:

#!/bin/bash
#MSUB -r MyJob_Para             # Request name
#MSUB -n 32                     # Number of tasks to use
#MSUB -T 1800                   # Elapsed time limit in seconds
#MSUB -o example_%I.o           # Standard output. %I is the job id
#MSUB -e example_%I.e           # Error output. %I is the job id
#MSUB -A paxxxx                 # Project ID
#MSUB -E '-w curie[1000-1003]'  # Include 4 nodes: curie1000 to curie1003
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun ./a.out

To exclude nodes:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 32            # Number of tasks to use
#MSUB -T 1800          # Elapsed time limit in seconds…
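As a hedged example of the GNU options above, compiling for the thin-node partition with AVX enabled; file names are placeholders and gfortran is assumed to be the GNU Fortran compiler available in your environment.

gfortran -O3 -mavx -c prog.f90        # AVX vectorization: thin nodes only
gfortran -O3 -mavx -o prog.exe prog.o
# For nodes without AVX you could fall back to, e.g., -msse4.2:
# gfortran -O3 -msse4.2 -o prog_sse.exe prog.o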
23. …e to access a GPU which is not connected to their socket.

[Figure: NUMA node — Curie hybrid node with GPUs.]

For all these reasons, it is better to know the NUMA configuration of the Curie nodes (fat, hybrid and thin). In the following sections, we will present some ways to tune the process affinity for your jobs.

CPU affinity mask

The affinity of a process is defined by a mask. A mask is a binary value whose length is defined by the number of cores available on a node. For example, Curie hybrid nodes have 8 cores, so the binary mask value will have 8 digits. Each digit is 0 or 1: the process will run only on the cores which have 1 as value. A binary mask must be read from right to left. For example, a process which runs on the cores 0, 4, 6 and 7 will have 11010001 as its affinity binary mask.

SLURM and BullxMPI use these masks, but converted into hexadecimal numbers (a small helper is sketched after this page):

- To convert a binary value to hexadecimal:

echo 'obase=16;ibase=2;11010001' | bc
D1

- To convert a hexadecimal value to binary:

echo 'obase=2;ibase=16;D1' | bc
11010001

The numbering of the cores is the "PU" number in the output of hwloc.

SLURM

SLURM is the default launcher for jobs on Curie; SLURM manages the processes even for sequential jobs. We recommend you to use ccc_mprun. By default, SLURM binds processes to a core. The distribution is block by node and by core. The option -E '--cpu_bind=verbose' for ccc_mprun gives you a report about the binding of pr…
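A tiny helper wrapping the two bc conversions above, convenient when reading the cpu_bind masks printed by SLURM. This is only a sketch: the function names are mine and are not part of the Curie environment.

# binary mask -> hexadecimal (as used by SLURM/BullxMPI)
bin2hex() { echo "obase=16;ibase=2;$1" | bc; }
# hexadecimal mask -> binary (read from right to left, one digit per core)
hex2bin() { echo "obase=2;ibase=16;$1" | bc; }

bin2hex 11010001   # prints D1
hex2bin D1         # prints 11010001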
24. …es, each with 8 cores. A block distribution request will distribute those tasks to the nodes with tasks 0 to 7 on the first node and tasks 8 to 15 on the second node.

[Figure: Block distribution by core.]

- cyclic by socket: from the SLURM manpage, "the cyclic distribution method will distribute tasks to a socket such that consecutive tasks are distributed over consecutive sockets (in a round-robin fashion)". For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by socket request will distribute those tasks to the sockets with tasks 0, 2, 4, 6 on the first socket and tasks 1, 3, 5, 7 on the second socket. In the following image, the distribution is cyclic by socket and block by node.

[Figure: Cyclic distribution by socket.]

- cyclic by node: from the SLURM manpage, "the cyclic distribution method will distribute tasks to a node such that consecutive tasks are distributed over consecutive nodes (in a round-robin fashion)". For example, consider an allocation of two nodes, each with 2 sockets of 4 cores. A cyclic distribution by node request will distribute those tasks to the nodes with tasks 0, 2, 4, 6, 8, 10, 12, 14 on the first node and tasks 1, 3, 5, 7, 9, 11, 13, 15 on the second node. In the following image, the distribution is cyclic by node and block by socket.

[Figure: Cyclic distribution by node.]

Why is affinity important for improving performance?

Curie nodes are NUMA (Non Uniform Memory Access) nodes. This means that it will take longer to access…
25. …etrieve only 3 hardware counters at the same time on Curie. The syntax is:

export EPK_METRICS=PAPI_FP_OPS:PAPI_TOT_CYC

Paraver

Paraver is a flexible performance visualization and analysis tool that can be used to analyze MPI, OpenMP, MPI+OpenMP, hardware counter profiles, operating system activity and many other things you may think of. In order to use the Paraver tools, you need to load the paraver module:

bash-4.00$ module load paraver
bash-4.00$ module show paraver
/usr/local/ccc_users_env/modules/development/paraver/4.1.1:
module-whatis   Paraver
conflict        paraver
prepend-path    PATH /usr/local/paraver-4.1.1/bin
prepend-path    PATH /usr/local/extrae-2.1.1/bin
prepend-path    LD_LIBRARY_PATH /usr/local/paraver-4.1.1/lib
prepend-path    LD_LIBRARY_PATH /usr/local/extrae-2.1.1/lib
module load     papi
setenv          PARAVER_HOME /usr/local/paraver-4.1.1
setenv          EXTRAE_HOME /usr/local/extrae-2.1.1
setenv          EXTRAE_LIB_DIR /usr/local/extrae-2.1.1/lib
setenv          MPI_TRACE_LIBS /usr/local/extrae-2.1.1/lib/libmpitrace.so

Trace generation

The simplest way to activate the MPI instrumentation of your code is to dynamically load the library before execution. This can be done by adding the following line to your submission script (a full sketch follows this page):

export LD_PRELOAD=${LD_PRELOAD}:${MPI_TRACE_LIBS}

The instrumentation process is managed by Extrae and also needs a configuration file in XML format. You will have t…
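A minimal submission-script sketch for Extrae-based trace generation as described above. The MSUB header is the usual one from this guide, prog.exe is a placeholder, and the file name extrae_config_file.xml anticipates the configuration file discussed in the trace-generation section; treat the whole script as an assumption-laden sketch rather than a recipe.

#!/bin/bash
#MSUB -r paraver_trace        # Request name
#MSUB -n 32                   # Number of tasks to use
#MSUB -T 1800                 # Elapsed time limit in seconds
#MSUB -o paraver_trace_%I.o   # Standard output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
module load paraver
export LD_PRELOAD=${LD_PRELOAD}:${MPI_TRACE_LIBS}   # preload the Extrae MPI wrappers
export EXTRAE_CONFIG_FILE=extrae_config_file.xml     # Extrae XML configuration file
ccc_mprun ./prog.exe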
26. …ir (end of: module load vampir)
ccc_msub vampirserver.sh

When the job is running, you will obtain this output:

ccc_mpp
USER  ACCOUNT  BATCHID  NCPU  QUEUE  PRIORITY  STATE  RLIM   RUN/START  SUSP  OLD  NAME          NODES
toto  genXXX   234481   32    large  210332    RUN    30.0m  13m              13m  vampirserver  curie1352

ccc_mpeek 234481
Found license file: /usr/local/vampir-7.3/bin/lic.dat
Running 31 analysis processes... (abort with Ctrl-C or vngd: shutdown)
Server listens on: curie1352:30000

In our example, the Vampirserver master node is curie1352 and the port to connect to is 30000. You can then launch Vampir on a front node. Instead of clicking on "Open", click on "Remote Open":

[Screenshot: Vampir "Connect to Server" dialog (Server: curie1352, Port: 30000).] Connecting to Vampirserver.

Fill in the server and the port: you will be connected to Vampirserver. Then you can open an OTF file and visualize it.

Notes:
- You can ask for as many processors as you want: it will be faster if your profiling files are big. But be careful, it consumes your computing time.
- Don't forget to delete the Vampirserver job after your analysis.

CUDA profiling

VampirTrace can collect profiling data from CUDA programs. As previously, you have to replace the compilers by the VampirTrace wrappers; the NVCC compiler should be replaced by vtnvcc (a short sketch follows this page). Then, when you run your program, you have to set an environment variable:

export …
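Sketch of the CUDA profiling workflow just described. kernel.cu and prog.exe are placeholder names; vtnvcc and VT_CUDARTTRACE are the wrapper and variable named in this guide's CUDA profiling section, and the exact option set accepted by the wrapper is an assumption.

module load vampirtrace
vtnvcc -c kernel.cu           # nvcc replaced by the VampirTrace wrapper
vtnvcc -o prog.exe kernel.o
export VT_CUDARTTRACE=yes     # enable tracing of the CUDA runtime calls
ccc_mprun ./prog.exe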
27. …launch the run. Here is the previous script, modified:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 32            # Number of tasks to use
#MSUB -T 1800          # Elapsed time limit in seconds
#MSUB -o example_%I.o  # Standard output. %I is the job id
#MSUB -e example_%I.e  # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
scalasca -analyze -t mpirun ./prog.exe

At the end of the execution, the program generates a directory which contains the profiling files:

bash-4.00$ ls
epik_...

You can visualize those files as previously. To generate the OTF trace files, you can type:

bash-4.00$ ls
epik_...
bash-4.00$ elg2otf epik_...

It will generate an OTF file under the epik_... directory. To visualize it, you can load Vampir:

bash-4.00$ module load vampir
bash-4.00$ vampir epik_.../a.otf

Scalasca + PAPI

Scalasca can retrieve hardware counters with PAPI. For example, if you want to retrieve the number of floating point operations:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 32            # Number of tasks to use
#MSUB -T 1800          # Elapsed time limit in seconds
#MSUB -o example_%I.o  # Standard output. %I is the job id
#MSUB -e example_%I.e  # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
export EPK_METRICS=PAPI_FP_OPS
scalasca -analyze mpirun ./prog.exe

Then the number of floating point operations will appear in the profile when you visualize it. You can r…
28. …ne' (end of the #MSUB header of the previous script: Request name, Number of tasks to use, Require an exclusive node, Elapsed time limit in seconds, Standard output (%I is the job id), Project ID, Disable default SLURM binding)

hostname > hostfile.txt
echo "rank 0=${HOSTNAME} slot=0,1,2,3"     >  rankfile.txt
echo "rank 1=${HOSTNAME} slot=8,10,12,14"  >> rankfile.txt
echo "rank 2=${HOSTNAME} slot=16,17,22,23" >> rankfile.txt
echo "rank 3=${HOSTNAME} slot=19,20,21,31" >> rankfile.txt
mpirun -hostfile hostfile.txt -rankfile rankfile.txt -np 4 ./a.out

In this example there are several steps:
- You have to create a hostfile (here hostfile.txt) where you put the hostnames of all the nodes your run will use.
- You have to create a rankfile (here rankfile.txt) where you assign to each MPI rank the cores it can run on. In our example, the process of rank 0 will have the cores 0, 1, 2 and 3 as affinity, etc. Be careful: the numbering of the cores is different from the hwloc output; on a Curie fat node, the eight first cores are on the first socket, etc.
- You can then launch mpirun, specifying the hostfile and the rankfile.

Using GPU

Two sequential GPU runs on a single hybrid node

To launch two separate sequential GPU runs on a single hybrid node, you have to set the environment variable CUDA_VISIBLE_DEVICES, which selects the GPUs you want. First, create a script to launch the binaries:

cat launch_exe.sh
#!/bin/bash
set -x
export CUDA…
29. …o add the next line to your submission script:

export EXTRAE_CONFIG_FILE=extrae_config_file.xml

All the details about how to write a config file are available in Extrae's manual, which you can reach at ${EXTRAE_HOME}/doc/user-guide.pdf. You will also find many examples of scripts in the ${EXTRAE_HOME}/examples/LINUX file tree. You can also add some manual instrumentation in your code to add specific user events; this is mandatory if you want to see your own functions in the Paraver timelines.

If the trace generation succeeds, during the computation you will find a directory set-0 containing some .mpit files in your working directory. You will also find a TRACE.mpits file which lists all these files.

Converting traces to Paraver format

Extrae provides a tool named mpi2prv to convert the .mpit files into a .prv file which will be read by Paraver. Since it can be a long operation, we recommend you to use the parallel version of this tool, mpimpi2prv. You will need fewer processes than previously used to compute. An example script is provided below:

bash-4.00$ cat rebuild.sh
#MSUB -r merge
#MSUB -n 8
#MSUB -T 1800
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun mpimpi2prv -syn -e path_to_your_binary -f TRACE.mpits -o file_to_be_analysed.prv

Launching Paraver

You now just have to launch:

paraver file_to_be_analysed.prv

As Paraver may ask for high memory and CPU usage, it may be better to launch it through a submission script (do not forget then to activa…
30. …ocesses before the run:

ccc_mprun -E '--cpu_bind=verbose' -q hybrid -n 8 ./a.out
cpu_bind=MASK - curie7054, task  3  3 [3534]: mask 0x8 set
cpu_bind=MASK - curie7054, task  0  0 [3531]: mask 0x1 set
cpu_bind=MASK - curie7054, task  1  1 [3532]: mask 0x2 set
cpu_bind=MASK - curie7054, task  2  2 [3533]: mask 0x4 set
cpu_bind=MASK - curie7054, task  4  4 [3535]: mask 0x10 set
cpu_bind=MASK - curie7054, task  5  5 [3536]: mask 0x20 set
cpu_bind=MASK - curie7054, task  7  7 [3538]: mask 0x80 set
cpu_bind=MASK - curie7054, task  6  6 [3537]: mask 0x40 set

In this example, we can see that process 5 has 20 as hexadecimal mask, i.e. 00100000 as binary mask: the 5th process will run only on core 5.

Process distribution

To change the default distribution of processes, you can use the option -E '-m' of ccc_mprun. With SLURM you have two levels of process distribution: node and socket.

- Node block distribution:
ccc_mprun -E '-m block' ./a.out
- Node cyclic distribution:
ccc_mprun -E '-m cyclic' ./a.out

By default, the distribution over the sockets is block. In the following examples for the socket distribution, the node distribution is block.

- Socket block distribution:
ccc_mprun -E '-m block:block' ./a.out
- Socket cyclic distribution:
ccc_mprun -E '-m block:cyclic' ./a.out

Curie hybrid node

On a Curie hybrid node, each GPU is connected to a socket (see the previous picture). It will take longer for a process to ac…
31. …export VT_CUDARTTRACE=yes
ccc_mprun ./prog.exe

Scalasca

Scalasca is a set of software which lets you profile your parallel code by taking traces during the execution of the program. This software is a kind of parallel gprof, with more information. We present here an introduction to Scalasca.

Standard utilization

First you must compile your code by adding the Scalasca tool before your call to the compiler. In order to use Scalasca, you need to load the scalasca module:

bash-4.00$ module load scalasca
bash-4.00$ scalasca -instrument mpicc -c prog.c
bash-4.00$ scalasca -instrument mpicc -o prog.exe prog.o

or, for Fortran:

bash-4.00$ module load scalasca
bash-4.00$ scalasca -instrument mpif90 -c prog.f90
bash-4.00$ scalasca -instrument mpif90 -o prog.exe prog.o

You can compile OpenMP programs:

bash-4.00$ scalasca -instrument ifort -openmp -c prog.f90
bash-4.00$ scalasca -instrument ifort -openmp -o prog.exe prog.o

You can profile hybrid programs:

bash-4.00$ scalasca -instrument mpif90 -openmp -O3 -c prog.f90
bash-4.00$ scalasca -instrument mpif90 -openmp -O3 -o prog.exe prog.o

Then you can submit your job. Here is an example of submission script:

#!/bin/bash
#MSUB -r MyJob_Para    # Request name
#MSUB -n 32            # Number of tasks to use
#MSUB -T 1800          # Elapsed time limit in seconds
#MSUB -o example_%I.o  # Standard output. %I is the jo…
32. …size) :: A, B, C
  integer :: i, j, n
  ! PAPI variables
  integer, parameter :: max_event = 1
  integer, dimension(max_event) :: event
  integer :: num_events, retval
  integer(kind=8), dimension(max_event) :: values
  ! Init PAPI
  call PAPIf_num_counters(num_events)
  print *, 'Number of hardware counters supported: ', num_events
  call PAPIf_query_event(PAPI_FP_INS, retval)
  if (retval .NE. PAPI_OK) then
    event(1) = PAPI_TOT_INS
  else
    ! Total floating point operations
    event(1) = PAPI_FP_INS
  endif
  ! Init matrix
  do i = 1, size
    do j = 1, size
      C(i,j) = real(i*j, 8)
      B(i,j) = 0.1*j
    end do
  end do
  ! Set up counters
  num_events = 1
  call PAPIf_start_counters(event, num_events, retval)
  ! Clear the counter values
  call PAPIf_read_counters(values, num_events, retval)
  ! DAXPY
  do n = 1, ntimes
    do i = 1, size
      do j = 1, size
        A(i,j) = 2.0*B(i,j) + C(i,j)
      end do
    end do
  end do
  ! Stop the counters and put the results in the array values
  call PAPIf_stop_counters(values, num_events, retval)
  ! Print results
  if (event(1) .EQ. PAPI_TOT_INS) then
    print *, 'TOT Instructions: ', values(1)
  else
    print *, 'FP Instructions:  ', values(1)
  endif
end program main

To compile, you have to load the PAPI module:

bash-4.00$ module load papi/4.1.3
bash-4.00$ ifort -I${PAPI_INC_DIR} papi.f90 ${PAPI_LIBS}
bash-4.00$ ./a.out
Number of hardware counters supported
FP Instructions: 10…
33. …te the -X option in ccc_msub). For analyzing your data, you will need some configuration files, available in Paraver's browser under the ${PARAVER_HOME}/cfgs directory.

[Screenshot: Paraver main window with the trace window browser, timelines (MPI call activity, MPI call duration, instructions per cycle, I/O) and the What/Where pane showing MPI_Irecv / MPI_Waitall durations.] Paraver window.
