IBM High Performance Computing Toolkit: MPI Tracing/Profiling
Note: The C header file (mpt.h) is used when the user needs to configure the library. Please see Section 6 for details.

2 System and Software Requirements

The currently supported architectures, operating systems, and required software are:

- AIX on Power (32-bit and 64-bit): IBM Parallel Environment (PE) for AIX program product and its Parallel Operating Environment (POE)
- Linux on Power (32-bit and 64-bit): IBM Parallel Environment (PE) for Linux program product and its Parallel Operating Environment (POE)
- Blue Gene/L: System Software version 3
- Blue Gene/P

3 Compiling and Linking

The trace library uses the debugging information stored within the binary to map the performance information back to the source code. To use the library, the application must be compiled with the -g option. Consider turning optimization off, or using a lower optimization level (-O2, -O1), when linking with the MPI profiler/tracer: high optimization levels can affect the correctness of the debugging information and may also affect the call-stack behavior.

To link the application with the library, add three options to your command line: the option -L<path to libraries>, where <path to libraries> is the path where the libraries are located; the option -lmpitrace, which must appear before the MPI library (-lmpich) in the linking order; and the option -llicense to link the license library.
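For illustration, on a generic system the compile and link steps might look like the sketch below; the wrapper name mpicc and the install path are placeholders, and concrete per-platform examples are given in Sections 3.1 through 3.4:

    mpicc -g -c mpi_test.c
    mpicc -o mpi_test mpi_test.o -L/path/to/libraries -lmpitrace -llicense -lm

The key point is the ordering: -lmpitrace must come before the MPI library so that its wrapper versions of the MPI routines are picked up at link time.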
[Figure 1: Peekperf, showing the source view of a Fortran application (swim) annotated with MPI_Irecv, MPI_Isend, and MPI_Waitall profiling entries.]

5.2 Viz File

In addition to the mpi_profile.<taskid> files, the library may also generate mpi_profile_<taskid>.viz files (XML format) that can be viewed using Peekperf, as shown in Figure 1.

5.3 Trace File

The library will also generate a file called single_trace. The Peekview utility can be used inside Peekperf, or independently, to display this trace file, as shown in Figure 2.
    struct MT_envstruct {
        int mpirank;
        int xCoord;
        int yCoord;
        int zCoord;
        int xSize;
        int ySize;
        int zSize;
        int procid;
        int ntasks;
        double clockHz;
        int nmpi;
    };

- MT_tracebufferstruct: This data structure is used together with the MT_get_tracebufferinfo utility function. It holds information about how many events are recorded (number_events) and about the memory space (total/used/available, in MBytes) used for tracing.

    struct MT_tracebufferstruct {
        int number_events;
        double total_buffer;   /* in terms of MBytes */
        double used_buffer;
        double free_buffer;
    };

- MT_callerstruct: This data structure holds the caller's information for an MPI function. It is used with the MT_get_callerinfo utility function. The information includes the source file path, source file name, function name, and line number in the source file.

    struct MT_callerstruct {
        char *filepath;
        char *filename;
        char *funcname;
        int lineno;
    };

- MT_memorystruct (Blue Gene/L only): Since the memory space per compute node on Blue Gene/L is limited, this data structure is used with the MT_get_memoryinfo utility function to provide memory usage information.

    struct MT_memorystruct {
        unsigned int max_stack_address;
        unsigned int min_stack_address;
        unsigned int max_heap_address;
    };

6.3 Utility Functions

- long long MT_get_mpi_counts(int): The integer passed in is the MPI ID, and the number of call counts for this MPI function is returned. The MPI ID can be one of the IDs listed in Table 2. A usage sketch follows below.
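As a minimal sketch of calling this function (assuming the mpt.h header from Table 1 and the SEND_ID constant from Table 2; the printf formatting is illustrative only):

    #include <stdio.h>
    #include "mpt.h"   /* MT_ utility functions and MPI ID constants */

    /* Print how many times MPI_Send has been called so far on this task. */
    void report_send_count(void)
    {
        long long sends = MT_get_mpi_counts(SEND_ID);
        printf("MPI_Send calls so far: %lld\n", sends);
    }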
- double MT_get_mpi_bytes(int): Similar to MT_get_mpi_counts, this function returns the accumulated size of data transferred by the MPI function.

- double MT_get_mpi_time(int): Similar to MT_get_mpi_counts, this function returns the accumulated time spent in the MPI function.

- double MT_get_avg_hops(void): The distance between two processors p and q with physical coordinates (x_p, y_p, z_p) and (x_q, y_q, z_q) is calculated as

    Hops(p, q) = |x_p - x_q| + |y_p - y_q| + |z_p - z_q|

We measure the AverageHops for all communications on a given processor as

    AverageHops = (sum_i Hops_i x Bytes_i) / (sum_i Bytes_i)

where Hops_i is the distance between the processors for the ith MPI communication and Bytes_i is the size of the data transferred in this communication.

Table 2: MPI IDs

    COMM_SIZE_ID          COMM_RANK_ID        SEND_ID
    SSEND_ID              RSEND_ID            BSEND_ID
    ISEND_ID              ISSEND_ID           IRSEND_ID
    IBSEND_ID             SEND_INIT_ID        SSEND_INIT_ID
    RSEND_INIT_ID         BSEND_INIT_ID       RECV_INIT_ID
    RECV_ID               IRECV_ID            SENDRECV_ID
    SENDRECV_REPLACE_ID   BUFFER_ATTACH_ID    BUFFER_DETACH_ID
    PROBE_ID              IPROBE_ID           TEST_ID
    TESTANY_ID            TESTALL_ID          TESTSOME_ID
    WAIT_ID               WAITANY_ID          WAITALL_ID
    WAITSOME_ID           START_ID            STARTALL_ID
    BCAST_ID              BARRIER_ID          GATHER_ID
    GATHERV_ID            SCATTER_ID          SCATTERV_ID
    SCAN_ID               ALLGATHER_ID        ALLGATHERV_ID
    REDUCE_ID             ALLREDUCE_ID        REDUCE_SCATTER_ID
    ALLTOALL_ID           ALLTOALLV_ID
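As a worked example with made-up numbers: if a task sends one 100-byte message over 2 hops and one 300-byte message over 4 hops, then AverageHops = (2 x 100 + 4 x 300) / (100 + 300) = 1400 / 400 = 3.5; on average, each byte travels 3.5 hops.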
8 Contacts

- I-Hsin Chung (ihchung@us.ibm.com): comments, corrections, or technical issues
- David Klepacki (klepacki@us.ibm.com): IBM High Performance Computing Toolkit licensing and distributions
For example, setting TRACEBACK_LEVEL=1 tells the library to save addresses starting not with the location of the MPI call (level 0) but with its parent in the call chain (level 1).

- SWAP_BYTES: The event trace file is binary, so it is sensitive to byte order. For example, Blue Gene/L is big endian, while your visualization workstation is probably little endian (e.g., x86). The trace files are written in little-endian format by default. If you use a big-endian system for graphical display (examples are Apple OS X, AIX p-series workstations, etc.), set this environment variable when you run your job:

    export SWAP_BYTES=no   (bash)
    setenv SWAP_BYTES no   (csh)

This will result in a trace file in big-endian format.

- TRACE_SEND_PATTERN (Blue Gene/L and Blue Gene/P only): In either profiling or tracing mode, there is an option to collect information about the number of hops for point-to-point communication on the torus. This feature can be enabled by setting an environment variable:

    export TRACE_SEND_PATTERN=yes   (bash)
    setenv TRACE_SEND_PATTERN yes   (csh)

When this variable is set, the wrappers keep track of how many bytes are sent to each task, and a binary file send_bytes.matrix is written during MPI_Finalize that lists how many bytes were sent from each task to all other tasks. The file contains the matrix entries in row order: D_00, D_01, ..., D_0n, D_10, ..., D_ij, ..., D_nn.
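As an illustrative sketch, a small C program could read the matrix back for post-processing; it assumes the entries are stored as doubles in the row order shown above and that the file was written in this machine's byte order (see SWAP_BYTES):

    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: read send_bytes.matrix for an n-task job and print one entry.
       Assumes n*n doubles in row order (D_00, D_01, ..., D_nn) with no
       byte swapping needed on this machine. */
    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <ntasks>\n", argv[0]);
            return 1;
        }
        int n = atoi(argv[1]);
        double *D = malloc((size_t)n * n * sizeof *D);
        FILE *fp = fopen("send_bytes.matrix", "rb");
        if (!D || !fp || fread(D, sizeof *D, (size_t)n * n, fp) != (size_t)n * n) {
            fprintf(stderr, "failed to read matrix\n");
            return 1;
        }
        fclose(fp);
        printf("bytes sent from task 0 to task 1: %.0f\n", D[0 * n + 1]);
        free(D);
        return 0;
    }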
IBM High Performance Computing Toolkit: MPI Tracing/Profiling
User Manual

Advanced Computing Technology Center
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598

April 4, 2008

Contents

1 Overview
2 System and Software Requirements
3 Compiling and Linking
  3.1 AIX on Power
  3.2 Linux on Power
  3.3 Blue Gene/L
  3.4 Blue Gene/P
4 Environment Variables
5 Output
  5.1 Plain Text File
  5.2 Viz File
  5.3 Trace File
6 Configuration
  6.1 Configuration Functions
  6.2 Data Structures
  6.3 Utility Functions
  6.4 Example
7 Final Note
  7.1 Overhead
  7.2 Multi-Threading
8 Contacts

1 Overview

This is the documentation for the IBM High Performance Computing Toolkit MPI Profiling/Tracing library. The library collects profiling and tracing data for MPI programs. The library file names and their usage are shown in Table 1.

Table 1: Library file names and usage

    libmpitrace.a    library file for both C and Fortran applications
    mpt.h            C header file
    int MT_trace_event(int id)
    {
        now = MT_get_time();
        MT_get_environment(&env);

        /* get MPI function call distribution */
        current_event_count = MT_get_mpi_counts(id);

        /* compare MPI function call distributions */
        comparison_result = compare_dist(prev_event_count, current_event_count);
        prev_event_count = current_event_count;

        if (comparison_result == 1)
            return 0;   /* stop tracing */
        else
            return 1;   /* start tracing */
    }

    int MT_output_trace(int rank)
    {
        if (rank < 32)
            return 1;   /* output performance data */
        else
            return 0;   /* no output */
    }

Figure 3: Sample code for MPI tracing configuration (abridged; declarations and the pattern-comparison helper are omitted).

7 Final Note

7.1 Overhead

The library implements wrappers that use the MPI profiling interface and have the form (a fleshed-out sketch appears at the end of this section):

    int MPI_Send(...)
    {
        start_timing();
        PMPI_Send(...);
        stop_timing();
        log_the_event();
    }

When event tracing is enabled, the wrappers save a time-stamped record of every MPI call for graphical display. This adds some overhead, about 1-2 microseconds per call. The event-tracing method uses a small buffer in memory, up to 3 x 10^4 events per task, and so it is best suited for short-running applications, or for time-stepping codes traced for just a few steps. To trace/profile large-scale applications, configuration may be required to improve scalability; please refer to Section 6 for details.

7.2 Multi-Threading

The current version is not thread-safe, so it should be used in single-threaded applications, or in applications where only one thread makes MPI calls. The wrappers could be made thread-safe by adding mutex locks around updates of static data, which would add some additional overhead.
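As a self-contained illustration of the wrapper pattern from Section 7.1 (a sketch, not the library's actual source; the time accumulator is a hypothetical stand-in for the library's internal bookkeeping):

    #include <mpi.h>

    static double send_time = 0.0;   /* hypothetical accumulator */

    /* Profiling-interface wrapper: intercept MPI_Send, time the matching
       PMPI_Send call, and (in the real library) log a time-stamped event. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();                  /* start_timing() */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        send_time += MPI_Wtime() - t0;            /* stop_timing(); log_the_event() */
        return rc;
    }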
This function returns statistical results (e.g., min, max, median, average) computed on primitive performance data (e.g., call counts, size of data transferred, time) for a specific MPI function or for all MPI functions. The data_type can be one of the data types listed in Table 3, and mpi_id can be one of the MPI IDs listed in Table 2, or ALLMPI_ID for all MPI functions.

Table 3: Data types

    COUNTS         BYTES    COMMUNICATIONTIME
    STACK          HEAP     MAXSTACKFUNC
    ELAPSEDTIME    AVGHOPS

- int MT_get_memoryinfo(struct MT_memorystruct *) (Blue Gene/L only): This function returns information about the memory usage on the compute node.

6.4 Example

In Figure 3, we rewrite the MT_trace_event and MT_output_trace routines with about 50 lines of code and use the default version of MT_output_text on Blue Gene/L. The MT_trace_event function automatically detects the communication pattern and shuts off the recording of trace events after the first instance of the pattern. Also, only MPI ranks less than 32 will output performance data at the end of program execution. As shown in the figure, utility functions such as MT_get_time and MT_get_environment help the user easily obtain the information needed to configure the library. In this example, MT_get_time returns the execution time spent so far, and MT_get_environment returns the process personality, including its physical coordinates and MPI rank.
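As a sketch of how MT_get_allresults might be called from a rewritten configuration function (the COUNTS and SEND_ID constants come from Tables 3 and 2; the return-value convention is an assumption):

    #include <stdio.h>
    #include "mpt.h"

    /* Sketch: summarize MPI_Send call counts across all tasks and report
       which ranks had the fewest and the most calls. */
    void report_send_spread(void)
    {
        struct MT_summarystruct s;
        MT_get_allresults(COUNTS, SEND_ID, &s);   /* return value ignored here */
        printf("fewest MPI_Send calls on rank %d, most on rank %d\n",
               s.min_rank, s.max_rank);
    }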
    maximum communication time = 4.039 sec for task 30

A per-task summary follows, listing for each task its taskid, torus coordinates (xcoord, ycoord, zcoord), procid, total communication time in seconds, and average hops. The first rows of the summary for tasks 0-31 look like:

    taskid  xcoord  ycoord  zcoord  procid  total_comm(sec)  avg_hops
       0       0       0       0       0        0.015          1.00
       1       1       0       0       0        4.039          1.00
       2       2       0       0       0        4.039          1.00
       ...

The file also lists the MPI tasks sorted by communication time.
The logical concept behind this performance metric is to measure how far, on average, each byte has to travel for communication. If the communicating processor pairs are close to each other in the torus coordinates, the AverageHops value will tend to be small.

- double MT_get_time(void): This function returns the time elapsed since MPI_Init was called.

- double MT_get_elapsed_time(void): This function returns the time between the calls to MPI_Init and MPI_Finalize.

- char *MT_get_mpi_name(int): This function takes an MPI ID and returns its name as a string.

- int MT_get_tracebufferinfo(struct MT_tracebufferstruct *): This function returns the size of the buffer currently used/free by the tracing/profiling tool.

- unsigned long MT_get_calleraddress(int level): This function returns the caller's address in memory.

- int MT_get_callerinfo(unsigned long caller_memory_address, struct MT_callerstruct *): This function takes the caller memory address obtained from MT_get_calleraddress and returns detailed caller information, including the path, the source file name, the function name, and the line number of the caller in the source file. A usage sketch follows below.

- void MT_get_environment(struct MT_envstruct *): This function returns the process's own environment information, including MPI rank, physical coordinates, dimensions of the block, total number of tasks, and CPU clock frequency.

- int MT_get_allresults(int data_type, int mpi_id, struct MT_summarystruct *): This function returns statistical results (e.g., min, max, median, average) on primitive performance data for a specific MPI function or for all MPI functions; the data_type values are listed in Table 3.
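A small sketch combining the two caller-related functions (level 0, the MPI call site itself, is assumed here; return codes and pointer validity are not checked):

    #include <stdio.h>
    #include "mpt.h"

    /* Sketch: resolve and print the source location of the current caller. */
    void print_caller(void)
    {
        unsigned long addr = MT_get_calleraddress(0);  /* level 0 = MPI call site */
        struct MT_callerstruct info;
        MT_get_callerinfo(addr, &info);
        printf("%s/%s: %s(), line %d\n",
               info.filepath, info.filename, info.funcname, info.lineno);
    }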
For some platforms, if the shared library liblicense.so is used, you may need to set the environment variable LD_LIBRARY_PATH to $IHPCT_BASE/lib (or lib64) to make sure the application finds the correct library at runtime.

3.1 AIX on Power

C example:

    CC = /usr/lpp/ppe.poe/bin/mpcc_r
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense

    mpitrace.ppe: mpi_test.c
            $(CC) -g -o $@ $< $(TRACE_LIB) -lm

Fortran example:

    FC = /usr/lpp/ppe.poe/bin/mpxlf_r
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense

    swim.ppe: swim.f
            $(FC) -g -o $@ $< $(TRACE_LIB)

3.2 Linux on Power

C example:

    CC = /opt/ibmhpc/ppe/poe/bin/mpcc
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense

    mpitrace: mpi_test.c
            $(CC) -g -o $@ $< $(TRACE_LIB) -lm

Fortran example:

    FC = /opt/ibmhpc/ppe/poe/bin/mpfort
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense

    statusesf_trace: statusesf.f
            $(FC) -g -o $@ $< $(TRACE_LIB)

3.3 Blue Gene/L

C example:

    BGL_INSTALL = /bgl/BlueLight/ppcfloor
    LIBS_RTS = -lrts.rts -ldevices.rts
    LIBS_MPI = -L$(BGL_INSTALL)/bglsys/lib -lmpich.rts -lmsglayer.rts $(LIBS_RTS)
    XLC_TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense
    XLC_RTS = blrts_xlc
    XLC_CFLAGS = -I$(BGL_INSTALL)/bglsys/include -g -O -qarch=440 -qtune=440 -qhot

    mpitrace_xlc.rts: mpi_test.c
            $(XLC_RTS) -o $@ $< $(XLC_CFLAGS) $(XLC_TRACE_LIB) $(LIBS_MPI) -lm

A Fortran example for Blue Gene/L follows.
6 Configuration

In this section we describe a more general way to make the tracing tool configurable, which allows users to focus on the interesting performance points. By providing a flexible mechanism to control which events are recorded, the library can remain useful even for very large-scale parallel applications.

[Figure 2: Peekview, displaying the time-stamped MPI event trace (e.g., MPI_Waitall events) along a per-task timeline.]

6.1 Configuration Functions

There are three functions that can be rewritten to configure the library. At runtime, the return values of these three functions decide what performance information is stored, which processes (MPI ranks) output performance information, and what performance information is output to files.

- int MT_trace_event(int): Whenever an MPI function that is profiled/traced is called, this function is invoked. The integer passed into this function is the ID number of the MPI function. The return value is 1 if the performance information should be stored in the buffer, otherwise 0. A sketch of a customized version follows below.

- int MT_output_trace(int): This function is called once in MPI_Finalize. The integer passed into this function is the MPI rank. The return value is 1 if the rank should output performance information, otherwise 0.
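For example, a rewritten MT_trace_event could restrict event recording to the first minute of the run; this is a sketch, with the 60-second threshold chosen arbitrarily:

    #include "mpt.h"

    /* Sketch: store events only during the first 60 seconds after MPI_Init.
       The id argument is the MPI ID of the intercepted function (Table 2). */
    int MT_trace_event(int id)
    {
        (void)id;                         /* trace every MPI function type */
        return MT_get_time() < 60.0;      /* 1 = store the event, 0 = skip */
    }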
Each entry D_ij is a double in C and represents the size of MPI data sent from rank i to rank j. This matrix can be used as input to external utilities that can generate efficient mappings of MPI tasks onto torus coordinates. The wrappers also provide the average number of hops for all flavors of MPI_Send. The wrappers do not track the message-traffic patterns in collective calls such as MPI_Alltoall; only point-to-point send operations are tracked. The AverageHops for all communications on a given processor is measured as

    AverageHops = (sum_i Hops_i x Bytes_i) / (sum_i Bytes_i)

where Hops_i is the distance between the processors for the ith MPI communication and Bytes_i is the size of the data transferred in this communication. The logical concept behind this performance metric is to measure how far, on average, each byte has to travel for communication. If the communicating processor pairs are close to each other in the torus coordinates, the AverageHops value will tend to be small.

5 Output

After building the binary executable and setting the environment, run the application as you normally would. For better control of the performance data collected and output, please refer to Sections 4 and 6.

5.1 Plain Text File

The wrapper for MPI_Finalize writes the timing summaries to files called mpi_profile.<taskid>. The mpi_profile.0 file is special: it contains a timing summary from each task. Currently, for scalability reasons, by default only four ranks generate plain text files.
- int MT_output_text(void): This function is called once inside MPI_Finalize. The user can rewrite this function to customize the performance data output, e.g., with user-defined performance metrics or data layout.

6.2 Data Structures

Each data structure described in this section is usually used with an associated utility function, described in Section 6.3, to provide information to the user when implementing the configuration functions described in Section 6.1.

- MT_summarystruct: This data structure holds statistics results, including MPI ranks and statistical values (e.g., min, max, median, average, and sum). The data structure is used together with the MT_get_allresults utility function.

    struct MT_summarystruct {
        int min_rank;
        int max_rank;
        int med_rank;
        void *min_result;
        void *max_result;
        void *med_result;
        void *avg_result;
        void *sum_result;
        void *all_result;
        void *sorted_all_result;
        int *sorted_rank;
    };

- MT_envstruct: This data structure is used with the MT_get_environment utility function. It holds the MPI process's own information, including the MPI rank (mpirank), the total number of MPI tasks (ntasks), and the total number of MPI function types that are profiled/traced (nmpi). For Blue Gene/L it also provides the process's environment information, including the x, y, z coordinates in the torus, the dimensions of the torus (xSize, ySize, zSize), the processor ID (procid), and the CPU clock frequency (clockHz). A usage sketch follows below.
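As a sketch of how a configuration function might use this structure (field names follow the definition shown with the other data structures; the output formatting is illustrative):

    #include <stdio.h>
    #include "mpt.h"

    /* Sketch: print this task's rank and Blue Gene/L torus coordinates. */
    void print_personality(void)
    {
        struct MT_envstruct env;
        MT_get_environment(&env);
        printf("rank %d of %d at torus (%d, %d, %d)\n",
               env.mpirank, env.ntasks, env.xCoord, env.yCoord, env.zCoord);
    }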
Event records are kept for MPI tasks 0-255, or for all MPI processes if there are 256 or fewer processes in MPI_COMM_WORLD. That should be enough to provide a good visual record of the communication pattern. If you want to save data from all tasks, you have to set this environment variable to yes:

    export TRACE_ALL_TASKS=yes   (bash)
    setenv TRACE_ALL_TASKS yes   (csh)

- TRACE_MAX_RANK: To provide more control, you can set MAX_TRACE_RANK. For example, if you set MAX_TRACE_RANK=2048, you will get trace data from 2048 tasks (0-2047), provided you actually have at least 2048 tasks in your job. By using the time-stamped trace feature selectively, both in time (trace_start/trace_stop) and by MPI rank, you can get good insight into the MPI performance of very large, complex parallel applications.

- OUTPUT_ALL_RANKS: For scalability reasons, by default only four ranks generate plain text files and events in the trace: rank 0 and the ranks with the minimum, median, and maximum MPI communication time (if rank 0 is itself one of the ranks with the minimum, median, or maximum MPI communication time, only three ranks generate plain text files and events in the trace). If plain text files and events should be output from all ranks, set this environment variable to yes:

    export OUTPUT_ALL_RANKS=yes   (bash)
    setenv OUTPUT_ALL_RANKS yes   (csh)

- TRACEBACK_LEVEL: In some cases there may be deeply nested layers on top of MPI, and you may need to profile functions higher up the call chain. You can do this by setting this environment variable; the default value is 0.
The wrappers can be used in two modes. The default value of this variable is yes, which collects both a timing summary and a time history of MPI calls suitable for graphical display. When this environment variable is set to yes, the library saves a record of all MPI events after MPI_Init until the application completes, or until the trace buffer is full. (By default this applies to MPI ranks 0-255, or to all MPI ranks if there are 256 or fewer processes in MPI_COMM_WORLD; you can change this by setting TRACE_ALL_TASKS or by using the configuration described in Section 6.)

Another method is to control the time-history measurement within the application by calling routines to start/stop tracing.

Fortran syntax:

    call mt_trace_start()
    ! ... do work + mpi ...
    call mt_trace_stop()

C syntax:

    void MT_trace_start(void);
    void MT_trace_stop(void);

    MT_trace_start();
    /* ... do work + mpi ... */
    MT_trace_stop();

C++ syntax:

    extern "C" void MT_trace_start(void);
    extern "C" void MT_trace_stop(void);

    MT_trace_start();
    // ... do work + mpi ...
    MT_trace_stop();

To use this control method, the environment variable needs to be disabled; otherwise it would trace all events:

    export TRACE_ALL_EVENTS=no   (bash)
    setenv TRACE_ALL_EVENTS no   (csh)

A complete minimal example of the start/stop method appears below.

- TRACE_ALL_TASKS: When saving MPI event records, it is easy to generate trace files that are just too large to visualize. To cut down on the data volume, the default behavior when you set TRACE_ALL_EVENTS=yes is to save event records from a limited set of tasks.
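Putting the start/stop controls together, a minimal C program could look like the sketch below (the barrier loop is a placeholder for real work; remember to run with TRACE_ALL_EVENTS=no):

    #include <mpi.h>

    void MT_trace_start(void);
    void MT_trace_stop(void);

    int main(int argc, char **argv)
    {
        int step;
        MPI_Init(&argc, &argv);

        /* ... setup communication, not traced ... */

        MT_trace_start();                      /* begin recording MPI events */
        for (step = 0; step < 5; step++)
            MPI_Barrier(MPI_COMM_WORLD);       /* placeholder for work + MPI */
        MT_trace_stop();                       /* stop recording */

        MPI_Finalize();                        /* summaries are written here */
        return 0;
    }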
Fortran example:

    BGL_INSTALL = /bgl/BlueLight/ppcfloor
    LIBS_RTS = -lrts.rts -ldevices.rts
    LIBS_MPI = -L$(BGL_INSTALL)/bglsys/lib -lmpich.rts -lmsglayer.rts $(LIBS_RTS)
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense
    BG_XLF = blrts_xlf
    FC_FLAGS = -I$(BGL_INSTALL)/bglsys/include -g -O

    statusesf_trace.rts: statusesf.f
            $(BG_XLF) -o $@ $< $(FC_FLAGS) $(TRACE_LIB) $(LIBS_MPI)

3.4 Blue Gene/P

C example:

    BGPHOME = /bgsys/drivers/ppcfloor
    CC = $(BGPHOME)/comm/bin/mpicc
    CFLAGS = -I$(BGPHOME)/comm/include -g -O
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense
    LIB1 = -L$(BGPHOME)/comm/lib -lmpich.cnk -ldcmfcoll.cnk -ldcmf.cnk
    LIB2 = -L$(BGPHOME)/runtime/SPI -lSPI.cna -lpthread -lrt
    LIB3 = -lgfortranbegin -lgfortran   # please read the NOTE

    mpitrace: mpi_test.c
            $(CC) -o $@ $< $(CFLAGS) $(TRACE_LIB) $(LIB1) $(LIB2) $(LIB3) -lm

NOTE: The C example uses mpicc, which currently is based on the GNU compiler. In order to accommodate the part of the tracing/profiling library that is written in Fortran, it is necessary to link the two GNU Fortran libraries (LIB3).

Fortran example:

    BGPHOME = /bgsys/drivers/ppcfloor
    CC = $(BGPHOME)/comm/bin/mpif77
    FFLAGS = -I$(BGPHOME)/comm/include -g -O
    TRACE_LIB = -L<path to libmpitrace.a> -lmpitrace -llicense
    LIB1 = -L$(BGPHOME)/comm/lib -lmpich.cnk -ldcmfcoll.cnk -ldcmf.cnk
    LIB2 = -L$(BGPHOME)/runtime/SPI -lSPI.cna -lpthread -lrt

    statusesf: statusesf.f
            $(CC) -o $@ $< $(FFLAGS) $(TRACE_LIB) $(LIB1) $(LIB2)

4 Environment Variables

- TRACE_ALL_EVENTS:
These are rank 0 and the ranks with the minimum, median, and maximum MPI communication time. To change this default setting, please refer to the OUTPUT_ALL_RANKS environment variable. An example of an mpi_profile.0 file is shown below:

    elapsed time from clock cycles using freq = 700.0 MHz

    MPI Routine       calls   avg. bytes   time(sec)
    MPI_Comm_size         1          0.0       0.000
    MPI_Comm_rank         1          0.0       0.000
    MPI_Isend            21      99864.3       0.000
    MPI_Irecv            21      99864.3       0.000
    MPI_Waitall          21          0.0       0.014
    MPI_Barrier          47          0.0       0.000

    total communication time = 0.015 seconds
    total elapsed time       = 4.039 seconds

    Message size distributions:

    MPI_Isend         calls   avg. bytes   time(sec)
                          3          2.3       0.000
                          1          8.0       0.000
                          1         16.0       0.000
                          1         32.0       0.000
                          1         64.0       0.000
                          1        128.0       0.000
                          1        256.0       0.000
                          1        512.0       0.000
                          1       1024.0       0.000
                          1       2048.0       0.000
                          1       4096.0       0.000
                          1       8192.0       0.000
                          1      16384.0       0.000
                          1      32768.0       0.000
                          1      65536.0       0.000
                          1     131072.0       0.000
                          1     262144.0       0.000
                          1     524288.0       0.000
                          1    1048576.0       0.000

(The MPI_Irecv distribution in this example is identical to the MPI_Isend distribution.)

    Communication summary for all tasks:

    minimum communication time = 0.015 sec for task 0
    median  communication time = 4.039 sec for task 20