
bullx DE User's Guide

1. Preface
5.1 …
5.1.1 MPI Analyser
5.1.2 Communication Matrices
5.1.3 Topology of the Execution Environment
5.1.4 Using profilecomm
5.1.5 profilecomm Data Analysis
5.1.6 Profilecomm Data …
5.1.7 Exporting a Matrix or an …
5.2 …
5.2.1 Scalasca Overview
5.2.2 … Usage
5.2.3 More Information
5.3 …
5.3.1 …
5.3.2 …
5.3.3 … Usage
Chapter 6 Analyzing Application Performance
6.1 …
6.1.1 High level …
6.1.2 Low level PAPI Interface
6.1.3 Collecting FLOP Counts on Sandy Bridge Processors
2. 6.2 Bull Performance Monitor (bpmon)
6.2.1 bpmon Reporting Mode
6.2.2 BPMON PAPI CPU Performance Events
6.2.3 … with the Bull Coherent Switch
6.3 Open SpeedShop
6.3.1 Open SpeedShop Overview
6.3.2 Open SpeedShop …
6.3.3 … Information
6.4 …
6.4.1 HPCToolkit Workflow
6.4.2 HPCToolkit Tools
6.4.3 … information about HPCToolkit
6.5 Bull Enhanced HPCToolkit
6.5.1 History …
6.5.2 Viewing Component …
6.5.3 HPCToolkit Wrappers
6.5.4 …
6.5.5 HPCToolkit Configuration Files
Chapter 7 …
7.1 …
7.2 Darshan
… Darshan Usage
3. Event selection fields: Clock, Interface, Buffer Occupation, Error Monitoring (Select, Event and Threshold columns).

Event descriptions:
No Event
A packet has been emitted
A packet has been emitted w idle latency 1
A flit has been emitted
Flow control or lack of credit on a flit waiting to be emitted
OB to LL flit0
OB to LL flit1
OB to LL flit2
OB to LL flit3
OB to LL flit0 and flit1
OB to LL flit2 and flit3
OB to LL flit0, 1, 2, 3
OB to LL flit0, 1, 2, 3 (VN0 traffic only)

Route Through (RO)

RO units are the direct routing path for messages from the two IOH modules to and between the IOH modules. The RO event type is monitored in the ROIC and ROCI units by setting fields in the PMRO0 or PMRO1 PME registers in the selected unit. The following event type can be monitored in RO (this descriptive information is in addition to the general description at the beginning of this section):

Interface: measures RO to OB traffic or ROIC to ROCI traffic.

Below is a depiction of the 4-bit PMRO PME register. Field description details can be found in the PMRO Event Configuration Register Description.

Interface Event
0 0   No Event
0 1   A packet has been emitted
1 0   A flit has been emitted
1 1   Flow control or lack of credit on a
4. 11000 1 11001 5 3 NCM 11010 CA2 HA2 11011 CA3 HA3 11100 Table 3 Map bullx DE User s Guide 7 Configuration Management Description Performance Monitor Configuration Registers Register Symbolic Real Address CSR Attribute Function Description Name Address for BCS 0 1 2 3 n 0 2 4 6 _ JRegisters that should be initialized Control and status Counter control and Status RW PTCTL 0000 FDnC 50043 1401 Interval timer control and Status PAIRO CNTO PMR 0000 5018 3 1406 RW Control and status PairO CounterO resource control and 31 0 status PAIRO 1 PMR 0000 502C 3 140 RW Control and status PairO Counter resource control and 31 0 status PAIRT CNTO PMR 0000 FDnC 5040 3 1410 RW Control and status CounterO resource control and 31 0 status PAIRT 1 PMR 0000 FDnC 505443 1415 Pairl Counter resource control and status 31 0 0000 FDnC 50083 1402 Initial value of timer low order bits PMINIT 31 0 PMINIT 44 32 0000 FDnC 500C 3 1403 Initial value of timer high order bits Registers that can be initialized depending on usage 0000 FDnC 501C 3 1407 Initial or current value PairO CounterO compare value max 31 0 count low order bits 0000 FDnC 5020 3 1408 Initial or current value CounterO compare value max 44 32 count high orde
5.
    /* Create the Event Set */
    if (PAPI_create_eventset(&EventSet) != PAPI_OK)
        handle_error(1);

    /* Add Total Instructions Executed to our Event Set */
    if (PAPI_add_event(EventSet, PAPI_TOT_INS) != PAPI_OK)
        handle_error(1);

    /* Start counting events in the Event Set */
    if (PAPI_start(EventSet) != PAPI_OK)
        handle_error(1);

    /* Defined in tests/do_loops.c in the PAPI source distribution */
    do_flops(NUM_FLOPS);

    /* Read the counting events in the Event Set */
    if (PAPI_read(EventSet, values) != PAPI_OK)
        handle_error(1);
    printf("After reading the counters: %lld\n", values[0]);

    /* Reset the counting events in the Event Set */
    if (PAPI_reset(EventSet) != PAPI_OK)
        handle_error(1);

    do_flops(NUM_FLOPS);

    /* Add the counters in the Event Set */
    if (PAPI_accum(EventSet, values) != PAPI_OK)
        handle_error(1);
    printf("After adding the counters: %lld\n", values[0]);

    do_flops(NUM_FLOPS);

    /* Stop the counting of events in the Event Set */
    if (PAPI_stop(EventSet, values) != PAPI_OK)
        handle_error(1);
    printf("After stopping the counters: %lld\n", values[0]);

Output:

    After reading the counters: 440973
    After adding the counters: 882256
    After stopping the counters: 443913

Note that PAPI_reset is called to reset the counters, because PAPI_read does not reset them. This lets the second value after a
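The counter arithmetic behind the three printed values can be followed without PAPI installed. The sketch below is a plain-shell toy model, not PAPI code: `counter` stands in for the running hardware counter, `value` for `values[0]`, and every simulated `do_flops` call is assumed to cost exactly 1000 events (real counts vary slightly from call to call, as the sample output shows).

```shell
#!/bin/sh
# Toy model of the low-level PAPI call sequence shown above.
# Assumption: this only mimics the counter arithmetic; it never calls PAPI.
counter=0   # stands in for the running hardware counter
value=0     # stands in for values[0]

do_flops()   { counter=$(( counter + 1000 )); }            # hypothetical fixed cost
papi_read()  { value=$counter; }                           # copy; counter keeps running
papi_reset() { counter=0; }                                # zero the counter
papi_accum() { value=$(( value + counter )); counter=0; }  # add into value, then zero
papi_stop()  { value=$counter; }                           # final copy of the counter

do_flops; papi_read;  after_read=$value
papi_reset
do_flops; papi_accum; after_accum=$value
do_flops; papi_stop;  after_stop=$value

echo "After reading the counters: $after_read"
echo "After adding the counters: $after_accum"
echo "After stopping the counters: $after_stop"
```

With a fixed cost per call, the model prints 1000, 2000 and 1000, the same pattern as the 440973, 882256 and 443913 reported above. Removing the `papi_reset` call makes the second value three times the first instead of twice.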
6. Note: You are advised to consult the Bull Support Web site for the most up-to-date product information, documentation, firmware updates, software fixes and service offers: http://support.bull.com

Intended Readers

This guide is intended for Application Developers of bullx supercomputer suite clusters.

Highlighting

The following highlighting conventions are used in this guide:

Bold        Identifies the following:
            • Interface objects such as menu names, labels, buttons and icons.
            • File, directory and path names.
            • Keywords to which particular attention must be paid.
Italic      Identifies references such as manuals or URLs.
monospace   Identifies portions of program codes, command lines, or messages displayed in command windows.
< >         Identifies parameters to be supplied by the user.
            Commands entered by the user.

WARNING: A Warning notice indicates an action that could cause damage to a program, device, system, or data.

Related Publications

Important: The Software Release Bulletin (SRB) delivered with your version of bullx supercomputer suite must be read first.

• Software Release Bulletin, 86 A2 91FK
• Documentation Overview, 86 A2 90FK
• Installation and Configuration Guide, 86 A2 74FK
• Pack Installation and Configuration Guide, 86 A2 75FK
• MC Administration Guide, 86 A2 76FK
• MC Monitoring Guide, 86 A2 77FK
• MC Power Management
7. Transaction Type msgclass opcode Transaction Type Mask msgclass opcode Buffer Occupation Threshold 010 0 0 Specific type All types 0 1088 Data Response No Event Captured tracker entry has been released 0 Unlock message sent No Event single ECC error NCXC single ECC error 2o x 1 0 INDR Non Data Response 0 0 INCS Non Coherent Standard 0 0 Non Coherent Bypass 1 Allocation of selected tracker entry has been captured Select CSI Tracker Buffer Select XCSI Tracker Buffer No Event Number of occupied entries is greater than threshold Number of occupied entries is equal to threshold 0 0 ID s 1 Specific ID 0 CSI to XCSI 1 XCSI to CSI Appendix A Performance Monitoring with BCS Counters Traffic Interface Error Transaction Type Transaction Type Mask Incoming packet Incoming RHNID msgclass opcode msgclass opcode RHNID Mask required for Traffic Events required for Traffic Events Selj Event Event 6 62 58 57 53 52 45 44 37 36 35 34 33 32 oo 00 1 i tiit 1 Ti 0 0 0 No Event x 1 double ECC error 1 x double ECC error 0 0 No Event 0 1 A packet has been emitted 1 0 JA flit has been emitted 1 1 jLack of credit on a flit waiting to be emitted 0 to O
8. 110000000
Unit Type Source    RO    10

The Unit Event Source can have the above value if both ROIC and ROCI are configured to provide the source of the count. Here are the three choices:

Unit Event Source   ROIC & ROCI   110000000
Unit Event Source   ROIC          100000000
Unit Event Source   ROCI          010000000

The syntax for the expert user who does not want any software tool help in defining an event is to provide the PMR and PMRO PME register contents:

    BCS_RO PMR 0x58000004 ROIC 2 ROCI 2

Interface Monitoring

You select the type of traffic needed (Interface Select) and the event (Interface Event). The Interface Monitoring fields are:

Interface Select (IS)
Interface Event (IE)

The definition will fill bits 3:0 of the PMRO PME register. The PMR for the chosen counter for this event should have the values shown in the RO Event Setup section, with Unit Event Source chosen as ROIC and ROCI. For example:

                                        Bits 3:0 in binary
    BCS_RO_Interface IS=ROB01 IE=LOC    1011

Where the set of exclusive Interface Select types and their abbreviations are:

RO to OB Flow 0            ROB0
RO to OB Flow 1            ROB1
RO to OB Flow 0 and 1      ROB01
ROIC to ROCI Flow 0 and 1

Where the set of exclusive Interface Event types and their abbreviations are:

A Packet has been Emitted                              PKT
A Flit has been Emitted                                FLT
Lack of Credit on a Flit Waiting to be Emitted         LOC

For Unit Event Source chosen as ROIC, here is an example:

    BCS_RO_ROIC_Interface IS=ROB01 IE=LOC    1011

For Unit E
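The expert-syntax PMR value can be cross-checked against the Unit Type Source and Unit Event Source bit strings. The sketch below rests on an inference, not on a register layout stated in this excerpt: it assumes the 2-bit Unit Type Source sits at bits 30:29 and the 9-bit Unit Event Source at bits 28:20, with the low bits fixed at 0x4 as in the guide's example. Under that assumption it reproduces the RO value 0x58000004. The function names are illustrative.

```shell
#!/bin/sh
# Convert a string of 0/1 characters to a decimal number.
bin2dec() {
    bits=$1
    value=0
    while [ -n "$bits" ]; do
        rest=${bits#?}              # everything after the first character
        digit=${bits%"$rest"}       # the first character
        value=$(( value * 2 + digit ))
        bits=$rest
    done
    echo "$value"
}

# Compose a PMR value from Unit Type Source and Unit Event Source.
# Assumption (inferred from the example values, not from a register map):
# type at bits 30:29, event source at bits 28:20, low bits fixed at 0x4.
pmr_value() {
    unit_type=$(bin2dec "$1")
    unit_source=$(bin2dec "$2")
    printf '0x%08X\n' $(( (unit_type << 29) | (unit_source << 20) | 4 ))
}

pmr_value 10 110000000   # RO, ROIC & ROCI: prints 0x58000004
```

The same assumed layout applied to the PE settings (Unit Type Source 00, Unit Event Source 111111110) gives 0x1FE00004, which agrees with the PE expert-syntax example elsewhere in the guide.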
9. http://icl.cs.utk.edu/projects/papi/wiki/PAPITopics:SandyFlops

The Floating Point Preset Events are as below:

PRESET Event    Description
PAPI_FP_INS     Count of Scalar Operations
PAPI_FP_OPS     same as above
PAPI_SP_OPS     Count of all Single Precision Operations
PAPI_DP_OPS     Count of all Double Precision Operations
PAPI_VEC_SP     Count of all Single Precision Vector Operations
PAPI_VEC_DP     Count of all Double Precision Vector Operations

The following table is from the website. It shows how single and double precision operations are computed, for total operations and for vector operations, from the raw event counts.

PRESET Event    Definition
PAPI_FP_INS     FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE + FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE
PAPI_FP_OPS     same as above
PAPI_SP_OPS     FP_COMP_OPS_EXE:SSE_FP_SCALAR_SINGLE + 4 * (FP_COMP_OPS_EXE:SSE_PACKED_SINGLE) + 8 * (SIMD_FP_256:PACKED_SINGLE)
PAPI_DP_OPS     FP_COMP_OPS_EXE:SSE_SCALAR_DOUBLE + 2 * (FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE) + 4 * (SIMD_FP_256:PACKED_DOUBLE)
PAPI_VEC_SP     4 * (FP_COMP_OPS_EXE:SSE_PACKED_SINGLE) + 8 * (SIMD_FP_256:PACKED_SINGLE)
PAPI_VEC_DP     2 * (FP_COMP_OPS_EXE:SSE_FP_PACKED_DOUBLE) + 4 * (SIMD_FP_256:PACKED_DOUBLE)

6.2 Bull Performance Monitor (bpmon)

The Bull Performance Monitor tool (bpmon) is a Linux command line single node performance monitorin
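As a worked example of the Sandy Bridge preset definitions in the table above, the sketch below combines three raw double-precision counts with the table's weights: 1 for scalar operations, 2 for 128-bit packed and 4 for 256-bit packed. The function names and the input counts are hypothetical, chosen only to make the arithmetic easy to follow.

```shell
#!/bin/sh
# PAPI_DP_OPS = scalar + 2 * packed128 + 4 * packed256 (weights from the table)
dp_ops() {
    scalar=$1; packed128=$2; packed256=$3
    echo $(( scalar + 2 * packed128 + 4 * packed256 ))
}

# PAPI_VEC_DP = 2 * packed128 + 4 * packed256 (vector part only)
vec_dp() {
    packed128=$1; packed256=$2
    echo $(( 2 * packed128 + 4 * packed256 ))
}

# Hypothetical raw counts: 1000 scalar, 200 SSE packed, 50 AVX packed.
dp_ops 1000 200 50   # prints 1600
vec_dp 200 50        # prints 600
```

The difference between the two results (1000 here) is the scalar count, which is how the share of vectorized double-precision work can be read from the two presets.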
10. …regionN>
    Enables time profiling of the selected code regions. Applies to the timing experiment only. Possible regions are:
    user     user code
    mpi      MPI functions
    io       POSIX I/O functions
    mpiio    MPI I/O functions

-s
    Prints the reports using a smart display (time as hours, minutes and seconds; other values as K(ilo), M(ega) or …).

--trace <tracelevel>
    Sets the level of detail of the profiling reports: 1 (basic), 2 (detailed) or 3 (advanced). Overrides the experiment-specific trace level set in configuration files.

-v
    Displays the version and exits.

4.4 Configuration

bullxprof behavior can be configured through command line options or via a configuration file. Options given as command line arguments override the options set in a configuration file.

The configuration files are considered in this order of priority:

• A configuration file specified by the BULLXPROF_CONF_FILE environment variable
• A file named bullxprof.conf located in the directory where the tool is launched from
• A file named bullxprof.conf located in $HOME/bullxprof
• A system-wide configuration file named bullxprof.conf located in BULLXPROF
• A system-wide core configuration file named bullxprof.core.conf located in $BULLXPROF_HOME/etc. It is highly recommended not to modify the content of this file unless the administrator fully understands the consequences of the changes.

The following parameters may be set in a user level
11. …Interface
    BCS_RO_ROCI_Interface

If the counts from all the BCSs are added together, then the syntax above is used as shown. However, a special variant of each performance event provides the capability to choose from which BCS the counts for an event will be collected. This is controlled in the event definition by noting which BCSs will collect counts for this event. Each BCS that will collect the count is noted by putting its number (0, 1, 2, 3) in the event name; up to three of the four BCSs may be listed:

    BCS[0][1][2][3]_PE_REM_Incoming_Traffic

For example, to get the count from BCS0:

    BCS0_PE_Incoming_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0

For example, to get the count from BCS1, BCS2 and BCS3:

    BCS123_PE_Incoming MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0

This can be especially useful in experiments where the performance analyst is evaluating a test program that references memory from one BCS to another, and wishes to collect separate counts from the BCS where the CPU is executing the test and from the BCS where the referenced memory is located.

PE Event Setup

For PE count events, the PMR for the chosen counter for this event should have the following settings, where Unit Event Source can have one of three values:

Counter Enable Source           local count enable   001
Counter Status Output Source    perfcon              000
Count Mode                      count events
12. application has to be compiled by replacing the regular MPI compilers with the wrappers provided by the tool. The wrappers to use depend on the module file loaded.

With the bullxmpi module files (darshan version bullxmpi gnu inst or darshan version bullxmpi intel inst):

• mpicc.darshan for C source files
• mpiCC.darshan or mpicxx for C++ source files
• mpif77.darshan for Fortran 77 source files
• mpif90.darshan for Fortran 90 source files

With the Intel MPI module file (darshan version intelmpi inst):

• mpiicc.darshan for C source files
• mpiicpc.darshan for C++ source files

Note: The MPI environment must be set up before using the Darshan wrappers.

7.2.4 Analyzing log files with Darshan utilities

Each time a Darshan-instrumented application is executed, it generates a single binary, portable log file summarizing the I/O activity of that application. The log file is placed in the directory pointed to by the DARSHAN_LOGPATH environment variable, with a name in the following format:

    <username>_<binary_name>_<job_ID>_<date>_<unique_ID>_<timing>.darshan.gz

The Darshan package provides a set of tools to help process and analyze the log files.

• darshan-job-summary.pl
  Generates a graphical summary of the I/O activity for a job, as in the following example:

    darshan-job-summary.pl carns_my-app
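Because the log name starts with the user name and the binary name, the logs of one application can be picked out of DARSHAN_LOGPATH with a simple pattern. The sketch below assumes the fields are joined by underscores, as in the format line above; `darshan_logs_for` is a hypothetical helper, not part of the Darshan package.

```shell
#!/bin/sh
# List the Darshan logs written for one user and one binary.
# Assumption: log names follow
#   <username>_<binary_name>_<job_ID>_<date>_<unique_ID>_<timing>.darshan.gz
# and live directly in $DARSHAN_LOGPATH (current directory if unset).
darshan_logs_for() {
    user=$1
    binary=$2
    dir=${DARSHAN_LOGPATH:-.}
    for f in "$dir"/"${user}_${binary}"_*.darshan.gz; do
        # An unmatched pattern is left as-is by the shell, so test for existence.
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
    return 0
}

# Example call; prints nothing when no matching log exists.
darshan_logs_for carns my-app
```

Each name printed this way can then be passed to darshan-job-summary.pl.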
13. 2.2 …
2.3 …
2.4 bullx DE Module Files
Chapter 3 Debugging Application with padb
3.1 …
3.2 …
3.3 padb with SLURM / bullx MPI …
3.4 …
3.5 More Information
Chapter 4 Application Analysis with bullxprof
4.1 Environment
4.2 …
4.3 Command line Options
4.4 …
4.5 Profiling …
4.5.1 Timing experiment
4.5.2 … experiment
4.5.3 …
4.5.4 …
Chapter 5 MPI Application …
14. 8 clone at
8 start_thread at
8 service_thread_start at
8 select at

The following example shows padb with the stack tree option:

    padb -O rmgr=slurm -tx 47136

    [0-1,3-8] (8 processes)
    main at pp_sndrcv_spbl.c:52
    PMPI_Finalize at
    ompi_mpi_finalize at
    barrier at
    opal_progress at
    opal_event_loop at
    poll_dispatch at
    poll at
    ThreadId: 2
    clone at
    start_thread at
    btl_openib_async_thread at
    poll at
    ThreadId: 3
    clone at
    start_thread at
    service_thread_start at
    select at
    [2] (1 processes)
    ThreadId: 1
    …
    ThreadId: 2
    clone at
    start_thread at
    btl_openib_async_thread at
    poll at
    ThreadId: 3
    clone at
    start_thread at
    service_thread_start at
    select at

These stacks are standard GDB stacks.

3.5 More Information

See http://padb.pittman.org.uk and the man page for more information.

Chapter 4 Application Analysis with bullxprof

bullxprof is a lightweight profiling tool which launches and profiles a specified program according to the chosen experiments, and dumps a profiling report onto the standard error output stream after the program's completion. bullxprof is well suited to a first analysis of an application, as it deliv
15. Appendix A Performance Monitoring with BCS Counters 83 Buffer Occupation Starvation mE Threshold Event Threshold ai s 86 84 NS Event Retry has occurred New retry has occurred Valid transaction seen in retry detection Short Ret ible lookup same twin address already in pipeline Ret back invalidate refused single ECC errors full conflict partial conflict W TID pool unavailable E TID pool unavailable Output channel not available No Event Start of new Starvation mechanism Starvation mechanism is active Number of starved transactions at start of mechanism is greater than threshold Number of starved transactions at start of mechanism is equal to threshold Buffer Occupation Buffer Select max size Threshold 0 0 0 0 0 Write Buffer 8 DCT 256 LOT 256 West TID pool 0 64 West TID pool 1 64 West TID pool 2 64 West TID pool 3 64 Sum of West TID pools 96 East TID pool 64 East NDR Virtual fifo 276 East SNP Virtual fifo 276 West HOM Virtual fifo 276 West SNP Virtual fifo 276 WSB 16 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 No Event 0 Read Request 0 A packet 1 flit has been emitted RT East to OB 1 0 packet 21 flit has been emitted RT West to OB 1 1 1 1 Specific opcode 0 0 0 0 opcod
16. 7.2.2 …
7.2.3 Compiling with Darshan
7.2.4 Analyzing log files with Darshan utilities
7.2.5 …
Chapter 8 Libraries and …
8.1 …
8.2 OTF (Open Trace Format)
8.3 …
8.3.1 CPUSETs …
8.3.2 CPUSETs management …
Appendix A Performance Monitoring with BCS Counters
Bull Coherent Switch Architecture
Performance Monitoring Architecture
Event Detection
Event Counting
Event Types
PE Event Types
NCMH Event Types
OB … Types
…
Event Counts and Counter Threshold
17. Destination NID Mask (DNIDM)
Request NID (RNID)
Request NID Mask (RNIDM)
Transaction Type MsgClass (MC)
Transaction Type OpCode (OC)
Transaction Type MsgClass Mask (MCM)
Transaction Type OpCode Mask (OCM)

The definition will fill bits 73:37 of the PMNC PME register. For example:

                                                                                       Bits 73:37 in binary
    BCS_NCMH_Traffic DNID=0 DNIDM=0 RNID=0 RNIDM=0 MC=DRS MCM=0xF OC=0x0 OCM=0x0           0000000000000000000001 110111 100000000
    BCS_NCMH_XQPI_QPI_Traffic DNID=0 DNIDM=0 RNID=0 RNIDM=0 MC=DRS MCM=0xF OC=0x0 OCM=0x0  1000000000000000000001 110111 100000000

The PMR for the chosen counter for this event should have the values shown in the NCMH Event Setup section.

LL Event Setup

The event interface from most LL blocks in the BCS chip was connected incorrectly, making many of the event selections non-functional. In fact, only one remains usable.

For the LL count events, the PMR for the chosen counter for this event should have the following settings, where Unit Event Source can have one of four values:

Counter Enable Source             local count enable               001
Counter Status Output Source      perfcon                          000
Count Mode                        count events
Counter Event Source              unit pme event                   000
Counter and Status Reset Source   no reset                         000
Compare Mode                      disabled                         00
Unit Event Source                 LLch0-3 & LLih0-1 & LLxh0-2      111111111
Unit Type Source                  LL                               01

The Unit Event Source can have the above value if LLch0-3, LLih0-1 and LLxh0-2 are configured to provide the
18. A component's configuration file may appear in one or more of the directories shown below. The enhanced HPCToolkit components look for their configuration files in the following directories, in the order shown:

• Directory /etc/bullhpctk, to provide system-wide default values for tools
• Directory $HOME/bullhpctk, to provide login-specific values for tools
• Directory named in the BHPCTK_CONF_DIR environment variable, to run scripts with custom configuration files

Bull will deliver a sample set of configuration files that can be copied into /etc/bullhpctk to provide system-wide default values for the components delivered with the enhanced HPCToolkit.

Configuration files contain labels that identify the argument being specified for the component. In some cases the same label, as well as a single-character shortcut for it, may be supported as a command line argument to the component. Each label found in the configuration file has a value, which specifies what the component uses for this argument.

As a component processes each of its configuration files found in the search path, it sets its value for each label to the value found in the file. Therefore, values found in files later in the search path normally override earlier ones.

Configuration files also contain a special label named lock. The value for this label is a comma separated list of the
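The "later files override earlier ones" rule can be sketched in a few lines. The example below is only an illustration: this excerpt does not show the real file syntax, so it assumes one `label=value` pair per line, and the labels `events` and `output` and the file names are hypothetical. The lock label is not modelled.

```shell
#!/bin/sh
# Print the value a component would end up with for one label after reading
# its configuration files in search-path order (later files win).
# Assumption: one "label=value" pair per line.
effective_value() {
    label=$1
    shift
    value=""
    for file in "$@"; do
        [ -r "$file" ] || continue
        found=$(sed -n "s/^${label}=//p" "$file" | tail -n 1)
        if [ -n "$found" ]; then
            value=$found
        fi
    done
    echo "$value"
}

# Demo with two throwaway files standing in for the system-wide and the
# login-specific configuration files.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
printf 'events=cycles\noutput=/tmp/sys\n' > "$demo_dir/system.conf"
printf 'events=instructions\n' > "$demo_dir/login.conf"

effective_value events "$demo_dir/system.conf" "$demo_dir/login.conf"   # prints instructions
effective_value output "$demo_dir/system.conf" "$demo_dir/login.conf"   # prints /tmp/sys
```

The login-specific file overrides `events`, while `output` keeps the system-wide default because no later file sets it.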
19. OCM=0xF NID=0 NIDM=0x00

Bull Cedoc, 357 avenue Patton, BP 20845, 49008 Angers Cedex 01, FRANCE
20. 00
Counter Event Source              unit event
Counter and Status Reset Source   no reset
Compare Mode                      disabled             00
Unit Event Source                 LoM0-3 & ReM0-3      111111110
Unit Type Source                  PE                   00

The Unit Event Source can have the above value if both LoM0-3 and ReM0-3 are configured to provide the source of the count. Here are the three choices:

Unit Event Source   LoM0-3 & ReM0-3   111111110
Unit Event Source   LoM0-3            111100000
Unit Event Source   ReM0-3            000011110

The syntax for the expert user who does not want any software tool help in defining an event is to provide the PMR and PMPE PME register contents:

    BCS_PE PMR 0x1FE00004 LOMH 0 0 0x7E0420 0 REMH 0 0 0x7E0420 0

Error Monitoring

You select the set of errors you wish to monitor. The definition will fill bits 6:0 of the PMPE PME register. The PMR for the chosen counter for this event should have the values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3. For example:

                             Bits 6:0 in binary
    BCS_PE_Error DSS         0000001
    BCS_PE_Error DSS DSD     0000011

Where the set of errors and their abbreviations are:

Directory SRAM Single ECC Error        DSS
Directory SRAM Double ECC Error        DSD
Directory LOT Single ECC Error         DLS
Directory DCT Single ECC Error         DDCS
Directory DLIT Single ECC Error        DDLS
Tracker Single ECC Error               TRS
Virtual Output FIFO Single ECC Error   VOFS

For Unit Event Source chosen as LoM0-3, here is an example:

    BCS_PE_LOM
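The two BCS_PE_Error examples suggest that each error abbreviation sets one bit of the 7-bit field. The sketch below builds the field from a list of abbreviations. Only the positions of DSS (bit 0) and DSD (bit 1) are confirmed by the examples; the positions given to DLS, DDCS, DDLS, TRS and VOFS are an assumption that simply follows the order of the abbreviation list, and `pe_error_bits` is a hypothetical helper name.

```shell
#!/bin/sh
# Build the 7-bit PMPE PME error field (bits 6:0) from error abbreviations.
# Confirmed by the guide's examples: DSS = bit 0, DSD = bit 1.
# Assumed (list order only): DLS = 2, DDCS = 3, DDLS = 4, TRS = 5, VOFS = 6.
pe_error_bits() {
    mask=0
    for name in "$@"; do
        case $name in
            DSS)  bit=0 ;;
            DSD)  bit=1 ;;
            DLS)  bit=2 ;;
            DDCS) bit=3 ;;
            DDLS) bit=4 ;;
            TRS)  bit=5 ;;
            VOFS) bit=6 ;;
            *) echo "unknown error abbreviation: $name" >&2; return 1 ;;
        esac
        mask=$(( mask | (1 << bit) ))
    done
    # Print the mask as a 7-character binary string, most significant bit first.
    out=""
    i=6
    while [ "$i" -ge 0 ]; do
        out="$out$(( (mask >> i) & 1 ))"
        i=$(( i - 1 ))
    done
    echo "$out"
}

pe_error_bits DSS        # prints 0000001
pe_error_bits DSS DSD    # prints 0000011
```

Both printed values match the bit strings shown for the corresponding events above.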
21. Output:

    /opt/bullxde/modulefiles/debuggers
    padb/3.2
    /opt/bullxde/modulefiles/utils
    OTF/1.8
    /opt/bullxde/modulefiles/profilers
    hpctoolkit/4.9.9_3111_Bull.2
    /opt/bullxde/modulefiles/perftools
    bpmon/1.0_Bull.1.20101208   papi/4.1.1_Bull.2   ptools/0.10.4_Bull.4.20101203
    /opt/bullxde/modulefiles/mpicompanions
    boost-mpi/1.44.0   mpianalyser/1.1.4   scalasca/1.3.2

Chapter 3 Debugging Application with padb

The padb tool is used to trace MPI process stacks for a running job. It is a job inspection tool used to examine and debug parallel programs, simplifying the process of gathering stack traces on compute clusters. padb supports a number of parallel environments and works out of the box on most clusters. It is an open source, non-interactive, command line, scriptable tool, licensed under the Lesser General Public License (http://www.gnu.org/licenses/lgpl.html) and intended for use by programmers and system administrators alike. It supports the RMS, SLURM and LSF batch schedulers. Bull has contributed to the project to support more resource managers, such as PBS Pro-MPD, SLURM-OpenMPI, LSF-MPD and LSF-OpenMPI.

However, padb will not diagnose problems with the wider environment, including the job launcher or runtime environment.

3.1 Installation

padb should be installed on LOGIN and COMPUTE type nodes. The following tools are pre-requir
22. …REMOTE DRAM

      0   11874471347   184298321   181306087   86188440   95116786
      1   11864491632   183240206   180310779   83212538   97097821
      2   11856905044   183105309   180369962   83542631   96827232
      3   11856505436   183098942   180344484   83470335   96873988
      4       3292691        5589        1032        367        528
      5        401016        2342         466        195        176
      6       2594262         981         217         50        121
      7        101785         594         150        147          0
      8   11848325273   182436818   179645809   83339429   96306262
      9   11895706265   182414051   179770963   81956916   97813529
     10   11861415833   183430836   180686147   82165023   98520942
     11   11867024890   183864157   181035165   84138310   96896833
     12             0           0           0          0          0
     13        254712        2169          06        138        203
     14     388438371        5205         664        286        220
     15       6051685        2067         933        839         93
    ALL   95325980242  1465907587  1443473264  668015644  775454734

    run-time completed, bpmon has terminated

6.2.1.2 Memory Usage Reporting

The second report type is a Memory Utilization Report built into bpmon. This report shows the percentage of memory references made to a socket other than the one where the core is executing. This report can also be repeated at a periodic rate. A command example with its output is shown below.

Run from Terminal 1:

    llclat 1 10 c 4

Run from Terminal 2:

    sudo bpmon --report memory --run-time 30

Output:

    Update in 30 seconds, ctrl-c to exit
    BPMON Memory Utilization Report
    CPU CPU Instruction Memory Read Local Remote
    Board
23. 10 1 and bigger than 10 bytes

5.1.3 Topology of the Execution Environment

The profilecomm module registers the topology of the execution environment, so that the machine and the CPU on which each process is running can be identified and, above all, the intra- and inter-machine communications made visible.

5.1.4 Using profilecomm

Using profilecomm involves two separate operations: data collection, and then analysis of the collected data. To be profiled by profilecomm, an application must be linked with the MPI Analyser library. profilecomm is disabled by default. To enable it, set the following environment variable:

    export MPIANALYSER_PROFILECOMM=1

When the application finishes, the results of the data collection are written to a file (mpiprofile.pfc by default). By default this file is saved in a format specific to profilecomm, but it is possible to save it in a text format. The readpfc command enables .pfc files to be read and analyzed.

5.1.4.1 profilecomm Options

Different options may be specified for profilecomm using the PFC_OPTIONS environment variable. For example:

    export PFC_OPTIONS="-f foo.pfc"

Some of the options that modify the behavior of profilecomm when saving the results in a file are listed below:

-f file, --filename file
    Saves the result in the file file instead of the default file (mpiprofile.txt for text format files and mpiprofile.pfc for profilecomm binary format files).
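The interaction between the default file names and the file-related options can be summarized as a small decision function. The sketch below only models the behavior described in this section; it does not read or write profilecomm data. It assumes the options are written with a leading dash (`-f`, `-t` for text format, `-b` for binary format), and `pfc_output_file` is a hypothetical helper name.

```shell
#!/bin/sh
# Given a PFC_OPTIONS string, print the name of the file profilecomm would
# be expected to write, following the defaults described above.
# Assumption: options are spelled -f <file>, -t (text) and -b (binary).
pfc_output_file() {
    format=bin
    file=""
    # Deliberate word splitting of the option string into $1, $2, ...
    set -- $1
    while [ $# -gt 0 ]; do
        case $1 in
            -t) format=text ;;
            -b) format=bin ;;
            -f) if [ $# -ge 2 ]; then file=$2; shift; fi ;;
        esac
        shift
    done
    if [ -n "$file" ]; then
        echo "$file"
    elif [ "$format" = text ]; then
        echo mpiprofile.txt
    else
        echo mpiprofile.pfc
    fi
}

pfc_output_file ""               # prints mpiprofile.pfc
pfc_output_file "-t"             # prints mpiprofile.txt
pfc_output_file "-t -f foo.txt"  # prints foo.txt
```

An explicit file name always wins; the format option only changes which default name applies.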
24. …11:28:37                    wallclock : 5.381077 sec
    stop : 09/14/12/11:28:42      %comm : 7.15
    gbytes : 9.64523e-01 total    gflop/sec : 0.00000e+00 total

    region : …   [ntasks] = 4

                  [total]     <avg>       min         max
    entries       4           1           1           1
    wallclock     21.517      5.37924     5.3785      5.38108
    user          25.47       6.3675      6.29        6.44
    system        0.88        0.22        0.16        0.26
    mpi           1.53893     0.384732    0.0103738   0.53211
    %comm                     7.14973     0.192783    9.89294
    gflop/sec     0           0           0           0
    gbytes        0.964523    0.241131    0.241112    0.241161

                     [time]         [calls]   <%mpi>   <%wall>
    MPI_Allreduce    0.769333       72        49.99    3.58
    MPI_Send         0.628268       637       40.83    2.92
    MPI_Barrier      0.0887964      432       5.77     0.41
    MPI_Bcast        0.048476       148       3.15     0.23
    MPI_Irecv        0.00139042     563       0.09     0.01
    MPI_Reduce       0.00099695     16        0.06     0.00
    MPI_Wait         0.000902604    560       0.06     0.00
    MPI_Gather       0.000289791    8         0.02     0.00
    MPI_Recv         0.000234257    74        0.02     0.00
    MPI_Comm_size    0.00013079     991       0.01     0.00
    MPI_Waitall      3.91998e-05    1         0.00     0.00
    MPI_Probe        3.63181e-05    3         0.00     0.00
    MPI_Comm_rank    3.54093e-05    232       0.00     0.00

Note: In the context of xPMPI, the user applications have to be recompiled. Hardware counters profiling is not supported by this integrated version of IPM.

mpiP is a lightweight profiling library for MPI
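The two percentage columns of the per-call table can be recomputed from the totals in the same report, which is a quick check when reading such output. The sketch below uses awk for the floating point arithmetic and takes its inputs from the report above (MPI total 1.53893 s, wallclock total 21.517 s, MPI_Allreduce time 0.769333 s); `percent` is a hypothetical helper name.

```shell
#!/bin/sh
# Share of <part> in <total>, as a percentage with two decimals.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

percent 0.769333 1.53893   # MPI_Allreduce share of MPI time: prints 49.99
percent 0.769333 21.517    # MPI_Allreduce share of total wallclock: prints 3.58
```

Both results agree with the values printed for MPI_Allreduce in the report.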
25. Guide, 86 A2 78FK
• MC Storage Guide, 86 A2 79FK
• MC InfiniBand Guide, 86 A2 80FK
• MC Ethernet Guide, 86 A2 82FK
• MC Security Guide, 86 A2 81FK
• bullx EP Administration Guide, 86 A2 88FK
• PFS Administration Guide, 86 A2 86FK
• MPI User's Guide, 86 A2 83FK
• bullx DE User's Guide, 86 A2 84FK
• bullx BM User's Guide, 86 A2 85FK
• MM Argos User's Guide, 86 A2 87FK
• Extended Offer Administration Guide, 86 A2 89FK
• bullx scs 4 R4 Documentation Portfolio, 86 AP 23PA
• scs 4 R4 Documentation Set, 86 AP 33PA

This list is not exhaustive. Useful documentation is supplied on the Resource & Documentation CD(s) delivered with your system. You are strongly advised to refer carefully to this documentation before proceeding to configure, use, maintain, or update your system.

Chapter 1 bullx Development Environment

The Bull Extreme Computing offer development environment relies on three sets of tools:

Linux OS development tools
    These tools come as part of the Linux distribution. They typically include GNU compilers and the gdb debugger, as well as profiling tools such as gprof, oprofile and valgrind. See the Linux OS documentation for more information on these tools.

bullx scs 4 Extended Offer tools
    These tools are third party products which are selected, validated in the bullx supercomputing suite environment, distributed and fully supported by B
26. In this example, REM Incoming Traffic from one BCS should be equal to the LOM Incoming Traffic from another BCS.

The command above gives the following output:

    BPMON Single Thread Event Results

    Event Description                                                                    Event Count
    BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=1 NIDM=0x01             448530917
    BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=1 NIDM=0x01      483451
    BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=0 NIDM=0x18             448466650
    BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=0 NIDM=0x18      476911

    Elapsed time: 76.024967 seconds

6.3 Open SpeedShop

This section describes the Open SpeedShop performance tool.

6.3.1 Open SpeedShop Overview

Open SpeedShop is an open source, multi-platform Linux performance tool which is initially targeted to support performance analysis of applications running on both single node and large scale IA64, IA32, EM64T, and AMD64 platforms. Open SpeedShop is explicitly designed with usability in mind and is intended for application developers and computer scientists. The base functionality includes:

• Sampling Experiments
• Support for Callstack Analysis
• Hardware Performance Counters
• MPI Profiling and Tracing
• I/O Profiling and Tracing
• Floating
27. …NIDM=0x00
    BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=0 NIDM=0x00      1063453

Here are the write results from BPMON, which also show the BCS performance events measured:

    BPMON Single Thread Event Results

    Event Description                                                                    Event Count
    BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=0x01 NIDM=0x01          11792759
    BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=0x01 NIDM=0x01   1940487514
    BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=0 NIDM=0x00             9609143
    BCS_PE_LOM_Incoming_Traffic MCM=0xF OC=RdInvOwn OCM=0xF NID=0 NIDM=0x00              1940484848

Total Memory Traffic for All BCSs Using Outgoing Traffic

This BPMON monitor setup collects all the reads and writes from the requesting nodes (using LOM events) and from the local nodes fulfilling the requests (using REM events), using Outgoing Traffic. As the example shows, the LOM event counts closely match the REM event counts. Read opcodes are counted by using a mask to get RdCur, RdCode and RdData from the HOM0 Message Class in one event. The write opcode, RdInvOwn, is counted separately in a different event.

The test generates reads in one test pass and writes in another. The test program generates about 500,000,000 remote memory requests per program instance, and four instances are executed.

Here are the read results from BPMON that also shows th
28. Saves the result in a text format file readable with any text editor or reader. This format is useful for debugging purposes, but it is not easy to use beyond 10 processes.

-b, -bin
    Saves the results in a profilecomm binary format file. This is the default format. The readpfc command is required to work with these files.
-s, -sync
    Synchronizes the processes during the time measurements. This option is set by default.
-ns
    Does not synchronize the processes during the time measurements.
-v32, -volumic32
    Use 32 bit volumic matrices. This can save memory when profiling an application with a large number of processes. A process must not send more than 4GBs of data to another process.
-v64, -volumic64
    Use 64 bit volumic matrices. This is the default behavior. It allows the profiling of processes which exchange more than 4GBs of data.

Examples

To profile the foo program and save the results of the data collection in the default file mpiprofile.pfc:

    MPIANALYSER_PROFILECOMM=1 srun -p my_partition -N 1 -n 4 ./foo

To save the results of the data collection in the foo.pfc file:

    MPIANALYSER_PROFILECOMM=1 PFC_OPTIONS="-f foo.pfc" srun -p my_partition -N 1 -n 4 ./foo

To save the results of the data collection in text format in the foo.txt file:

    MPIANALYSER_PROFILECOMM=1 PFC_OPTIONS="-t -f foo.txt" srun -p my_partition -N 1 -n 4 ./foo

5.1.4.2 Messages Size Partitions

pr
29. Socket Core Thread CPU Mhz Used Rate MIPS Bandwidth MBPS Loads Loads 0 0 0 0 0 2933 3 100 0 104 515 72 45 8 54 2 0 0 1 0 1 2933 3 100 0 104 516 83 46 05 54 05 0 0 2 0 2 2933 3 100 0 105 521 48 45 45 54 65 0 0 0 3 2933 3 100 0 105 519 10 45 6 54 4 0 al 0 0 4 1600 1 0 2 47 0 01 20 4 79 6 0 1 1 0 5 1609 1 0 05 173 0 01 85 9 14 1 0 1 2 0 6 1601 1 0 0 1 0 00 61 9 38 1 0 1 3 0 7 1600 8 0 0 514 0 04 n a 0 0 0 1 8 2933 3 100 0 104 514 00 45 65 54 4 0 0 1 1 9 2933 3 100 0 106 522 72 45 4 54 6 0 0 2 1 102933 3 100 0 105 520 98 45 2 54 85 0 0 3 1 11 2933 3 100 0 104 517 43 45 5 54 5 0 1 0 1 12 1613 7 0 05 0 0 00 0 al 1 1 13 1602 7 0 05 425 0 01 19 0 81 0 0 1 2 1 14 1637 4 0 05 16 0 00 6 4 93 6 0 1 3 1 15 1601 2 0 08 1492 0 02 92 9 7 1 Totals for 16 CPUs 36332 1 50 0 3505 4148 37 45 6 54 4

run time completed
bpmon has terminated

See the bpmon man page or help file for more information.

6.2.2 BPMON PAPI CPU Performance Events

The PAPI mechanism used by bpmon enables the review of both PAPI preset events and processor native events.

PAPI Preset Events

PAPI preset events are the same for all hardware platforms and are derived by addition or subtraction of native events. However, if the platform processor's native events do not support the information collection required, then some presets may not exist. PAPI preset events offer the safest source of information f
30. Software Application Supported BCS Monitoring
PE Event Setup
NCMH Event Setup
LL Event Setup
RO Event Setup
BCS Key Architectural Values
Message Class and Opcode Mapping
QPI and XQPI NodeID Maps
Configuration Management Description
Performance Monitor Configuration Registers
Event Configuration
BCS BPMON Usage Examples
Total Memory Traffic For All BCSs Using Incoming Traffic
Total Memory Traffic for All BCSs Using Outgoing Traffic
Memory Traffic For a Source and a Destination BCS Using Incoming Traffic

Preface

This guide describes the tools and libraries provided with bullx DE (Development Environment) that allow the development, testing and optimal use of application programs on Bull extreme computing clusters. In addition, various Open Source and proprietary tools are described
31. Transaction Type OpCode Mask (OCM), Lookup Directory Status (LST).

The definition will fill bits 64-20 of the PMPE_PME register. The PMR for the chosen counter for this event should have values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3.

For example (bits 64-20 in binary):

    BCS_PE_Lookup_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 LST=EXC+1S
    101110000011110000000000000000000000000000000

This counts DRS transaction types for all opcodes for the selected Lookup Directory Statuses.

Where the set of exclusive Lookup Directory Statuses and their abbreviations are:

    Exclusive State                   EXC
    Shared State and 3 Sharers        3S
    Shared State and 2 Sharers        2S
    Shared State and 1 Sharer         1S
    Invalid State                     INV

For Unit Event Source chosen as LoM0-3, here is an example:

    BCS_PE_LOM_Lookup_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 LST=EXC+1S
    101110000011110000000000000000000000000000000

For Unit Event Source chosen as ReM0-3, here is an example:

    BCS_PE_REM_Lookup_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 LST=EXC+1S
    101110000011110000000000000000000000000000000

Retry Monitoring

You select the Retry Event Type. You select Short or Long Retry type. You select the set of retries that you wish to monitor.

The Retry Monitoring fields are: Retry Type (TY), Retry Event (EV).

The definition will fill bits 74-65 of the PMPE_PME register. The PMR
32. communication matrix (matrices). Equivalent to -cp.

-n, --numeric-only
    Does not display volume matrices. This option cannot be used simultaneously with the volumic-only option.
-p, --pt2pt
    Displays point-to-point communication matrices.
-r, --rate, --throughput
    Displays messages rate and data rate matrices instead of communications matrices.
-s, --statistics
    Computes and displays some statistics regarding communications.
-S, --scalable
    Displays all scalable information, that is, all information whose size is independent of the number of processes. Useful when there is a great number of processes. Equivalent to -histT.
--square-matrices
    Displays the matrices containing the sum of the squared sizes of messages. These matrices are used for standard deviation computation and are of no use to end users. This option is mainly provided for debugging purposes.
-t, --calltable
    Displays the call table.
-T, --ct-total-only
    Displays only the Total column of the call table. By default, readpfc also displays one column for each process.
-v, --volumic-only
    Does not display numeric matrices. This option cannot be used simultaneously with the -n (numeric-only) option.

5.1.7 Exporting a Matrix or an Histogram

The communication matrices and the histograms can be exported in different formats that can be used by other software programs, for example spreadshee
33. configuration file

General Configuration File Options

app.functions.excluded = <string1, ..., stringN>
    Application functions to exclude from profiling.
    Example: app.functions.excluded = functionA, _func_
    Every function having one of the option's entries in its name will be ignored.
    Caution: must not be left blank when enabled.

app.functions.whitelist = <string1, ..., stringN>
    Exceptions to the excluded application functions list.
    Example: app.functions.whitelist = one_func_opt
    A function whose name is given as an entry of this option will not be ignored, even if it matches the app.functions.excluded option.
    Caution: must not be left blank when enabled.

app.modules.excluded = <string1, ..., stringN>
    Application source files to exclude from profiling.
    Example: app.modules.excluded = file1.c, file2.cpp
    Every source file having one of the option's entries in its name will be ignored.
    Caution: must not be left blank when enabled.

app.modules.whitelist = <string1, ..., stringN>
    Exceptions to the excluded application source file list.
    Example: app.modules.whitelist = file2.cpp
    A source file whose name is given as an entry of this option will not be ignored, even if it matches the app.modules.excluded option.
    Caution: must not be left blank when enabled.

app.libraries = <string1, ..., stringN>
    Comma separated list of shared libraries to include in the application profiling. The lib
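The interaction between the excluded and whitelist options described above can be sketched in Python. This is only an illustration of the rule as worded in this section (substring match for exclusions, exact name match for the whitelist); the function name `is_profiled` is hypothetical and is not part of bullxprof.

```python
def is_profiled(name, excluded, whitelist):
    """Return True when a function (or source file) name would still be profiled.

    Assumed reading of the rules above:
    - a whitelist entry matches the full name and always wins;
    - an excluded entry matches when it appears anywhere in the name.
    """
    if name in whitelist:
        return True
    return not any(entry in name for entry in excluded)


if __name__ == "__main__":
    # Values taken from the examples in this section.
    excluded = ["functionA", "_func_"]
    whitelist = ["one_func_opt"]
    for candidate in ["one_func_opt", "two_func_opt", "functionA_init", "main"]:
        print(candidate, is_profiled(candidate, excluded, whitelist))
```

With the example values, one_func_opt contains the excluded entry _func_ but remains profiled because it is whitelisted, whereas two_func_opt is ignored.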
34. features include:
• Display of the History Repository database. This provides a graphic display of all the files and directories in the history repository database.
• Context menu items to perform operations on tree objects. This allows the user to select one or more tree objects and perform some operation on the objects selected. The kinds of operations to be supported include:
    Opening files to see contents.
    Loading an experiment database into the hpcviewer perspective.
    Comparing the content of two selected files (a side by side display that highlights differences).
    Comparing all objects in two selected directories to provide a list of the files in those directories that are different, with the ability to see each file's differences by opening one of the files in that list.
    Import and Export tar files.
    Delete the selected projects and/or code passports.
    Merge Application files and System files. The objective of the merge utility is to create one application or system file for each bhpcrun <system_name> <rank> application or system file with the same content. Files with the same content will be merged into one file, and header information will be added to the merged files to track which process ranks contain the same content.
• Preference page to control the History Repository display. This provides controls that affect the History Repository Explorer View.
• Preference page to control the Grouping Opti
35. following counters are particularly interesting: PAPI_TOT_CYC (number of CPU cycles) and PAPI_FP_OPS (number of floating point operations). For more information on the display counters, use the papi_avail -d command.

More information about HPCToolkit

See:
• HPCToolkit web site at http://www.hpctoolkit.org for more information regarding HPCToolkit.
• HPCToolkit User's Manual at http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf for more detailed information, including Quick Start, FAQ and Troubleshooting HPCToolkit.

6.5 Bull Enhanced HPCToolkit

Bull Enhanced HPCToolkit is an application performance profiling tool for HPC users. It is based on the current HPCToolkit open source product, which was designed and built by Rice University, TX, USA. The Bull Enhanced HPCToolkit provides added value for users needing profile based performance analysis in order to optimize their running software applications.

See Section 6.4 HPCToolkit for more information about the HPCToolkit.

The Bull Enhanced HPCToolkit contains three main components:
1. History Component (see section 6.5.1)
2. Viewing Component (see section 6.5.2)
3. HPCToolkit Wrappers (see section 6.5.3)

6.5.1 History Component

The History Component provides a means to store information related to a test run in a repository. This facility allows the user to keep a history of test runs so
36. for the chosen counter for this event should have values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3.

For example (bits 74-65 in binary):

    BCS_PE_Short_Retry EV=NEW TY=ATOM      0000001010
    BCS_PE_Long_Retry EV=RET TY=FC+PC      1000110001

Where the set of exclusive Retry Events and their abbreviations are:

    Retry has Occurred                               RET
    New Retry has Occurred                           NEW
    Valid Transaction Seen in Retry Detection        VAL

Where the set of inclusive Short Retry Types and their abbreviations are:

    Impossible Lookup                                          IMP
    Atomicity (Same Twin Address already in Pipeline)          ATOM

Where the set of inclusive Long Retry Types and their abbreviations are:

    Back Invalidate Refused             BIR
    Single ECC Errors                   SEE
    Full Conflict                       FC
    Partial Conflict                    PC
    W TID Pool Unavailable              WTID
    E TID Pool Unavailable              ETID
    Output Channel not Available        OCNA

For Unit Event Source chosen as LoM0-3, here is an example:

    BCS_PE_LOM_Short_Retry EV=NEW TY=ATOM      0000001010
    BCS_PE_LOM_Long_Retry EV=RET TY=FC+PC      1000110001

For Unit Event Source chosen as ReM0-3, here is an example:

    BCS_PE_REM_Short_Retry EV=NEW TY=ATOM      0000001010
    BCS_PE_REM_Long_Retry EV=RET TY=FC+PC      1000110001

Starvation Monitoring

You select the Starvation Type. You select the Starvation Event. You select the Starvation Threshold if you choose Event 011 or 100; otherwise it is set to 0.

The Starvation Monito
37. id114525 7 27 58921 19 darshan gz

It will generate a multi-page PDF file based on the name of the input file.

• darshan-parser
  This tool generates a full, human readable dump of all information contained in a log file. The following example essentially converts the contents of the log file into a fully expanded text file:

      darshan-parser logfile > job-characterization.txt

  See http://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html#darshan_parser for a complete description of darshan-parser results.

• darshan-convert
  Converts an existing log file to the newest log format. If the output has a .bz2 extension, then it will be re-compressed in bz2 format rather than gz format. It also has command line options for anonymizing personal data and adding metadata annotation to the log header.

• darshan-diff
  Compares two Darshan log files and shows counters that differ.

• darshan-analyzer
  Walks an entire directory tree of Darshan log files and produces a summary of the types of access methods used in those log files.

• darshan-logutils
  This is a library rather than an executable, but it provides a C interface for opening and parsing Darshan log files. This is the recommended method for writing custom utilities, as darshan-logutils provides a relatively stable interface across different versions of Darshan and different log formats.

7.2.5 Darshan Limitations

darshan <version>_intelm
38. in order to use a tool or a set of tools. For instance, an environment can consist of a set of compatible products including a defined release of a FORTRAN compiler, a C compiler, a debugger and mathematical libraries. In this way you can easily reproduce trial conditions, or use only proven environments.

The Environment Modules package relies on module files to allow dynamic modification of a user's environment. Each module file contains the information needed to configure the shell for an application. Once the Modules package is initialized, the environment can be modified on a per-module basis using the module command, which interprets module files. Typically, module files instruct the module command to alter or set shell environment variables such as PATH, MANPATH, etc. Module files may be shared by many users on a system, and users may have their own collection to supplement or replace the shared module files.

Modules can be loaded and unloaded dynamically and atomically, in a clean fashion. All popular shells are supported, including bash, ksh, zsh, sh, csh and tcsh, as well as some scripting languages such as Perl.

Modules are useful in managing different versions of applications. Modules can also be bundled into metamodules that will load an entire suite of different applications.

Chapter 2 bullx DE User Environment 2 3 4

Using Modules

The following command gives the list of available modules on a cluster: opt modules vers
39. individual function call paths, and/or generating event traces recording individual runtime events. Scalasca allows switching between both options to occur without re-compiling or re-linking.

Summarization is particularly useful, as it presents an overview of performance behavior and of local metrics such as those derived from hardware counters. In addition, it can also be used to optimize the instrumentation for later trace generation.

When tracing is enabled, each process generates a trace file containing records for all the process-local events. Following program termination, Scalasca loads the trace files into main memory and analyzes them in parallel, using as many CPUs as have been used for the target application itself. During the analysis, Scalasca searches for characteristic patterns indicating wait states and related performance properties, classifies detected instances by category, and quantifies their significance. The result is a pattern analysis report similar in structure to the summary report, but enriched with higher level communication and synchronization inefficiency metrics.

5.2.2 Scalasca Usage

Using Scalasca consists of loading a module file which will set the different paths for binaries and libraries. The Scalasca package provides three module files:
• scalasca version bullxmpi gnu
  This module file is to be loaded to use Scalasca with applications c
40. library. The MPIANALYSER_LINK environment variable can be used to link the binary with mpianalyser.
• mpianalyser 1 2 preload: this module file sets the user environment for using mpianalyser without recompilation of your dynamically linked program. Note that this module file sets the LD_PRELOAD variable, which will affect any dynamically linked program as long as this module is loaded. It is highly recommended to unload this module immediately after your profiling session.

5.1.2 Communication Matrices

The profilecomm library collects the point-to-point communications and the collective communications separately. It also collects the number of messages and the volume that the sender and receiver have exchanged. Finally, the library builds 4 types of communication matrices:
• Communication matrix of the number of point-to-point messages
• Communication matrix of the volume (in bytes) of point-to-point messages
• Communication matrix of the number of collective messages
• Communication matrix of the volume (in bytes) of collective messages

The volume only indicates the payload of the messages.

In order to compute the standard deviation of messages size, two other matrices are collected. They contain the sum of squared messages sizes for point-to-point and for collective communications.

In order to obtain precise information about messages sizes, each numeric matrix can be s
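The reason for collecting the sum of squared message sizes can be shown with a short Python sketch: together with the number of messages and the total volume, it is enough to recover the standard deviation without storing individual message sizes. The sketch assumes a population standard deviation; this guide does not say which estimator readpfc actually uses, and the function name is hypothetical.

```python
import math


def message_size_stats(count, total_bytes, sum_squared_sizes):
    """Return (average size, standard deviation, variation coefficient).

    Uses Var(X) = E[X^2] - E[X]^2 on the three accumulated quantities.
    Assumption: population (not sample) standard deviation.
    """
    if count == 0:
        return 0.0, 0.0, 0.0
    mean = total_bytes / count
    # Guard against tiny negative values caused by floating-point rounding.
    variance = max(sum_squared_sizes / count - mean * mean, 0.0)
    std_dev = math.sqrt(variance)
    coef = std_dev / mean if mean else 0.0
    return mean, std_dev, coef


if __name__ == "__main__":
    # Two messages of 100 and 300 bytes between one sender and one receiver.
    sizes = [100, 300]
    stats = message_size_stats(len(sizes), sum(sizes), sum(s * s for s in sizes))
    print(stats)
```

Only three accumulators per matrix cell are needed, which keeps the memory cost independent of the number of messages exchanged.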
41. origin>/perf_db/callpath.xml
<project>/<code passport>/bhpcprof-mpi/<data origin>/perf_db/experiment-1.mdb
<project>/<code passport>/bhpcprof-mpi/<data origin>/perf_db/experiment.mt
<project>/<code passport>/bhpcprof-mpi/<data origin>/perf_db/experiment.xml

The performance database source files:
<project>/<code passport>/bhpcprof-mpi/<data origin>/perf_db/src/xxx
<project>/<code passport>/bhpcprof-mpi/<data origin>/perf_db/src/usr/xxx

Standard Error and Standard Output:
<project>/<code passport>/bhpcprof-mpi/<data origin>/stdout
<project>/<code passport>/bhpcprof-mpi/<data origin>/stderr

6.5.4 Test Case

Test cases are identified by a project name and a code passport name. The project name is provided by the user running the test, as a way to separate his tests from tests run by people on other projects. It will be provided by the user to all of the scripts run as part of the test case.

The code passport name represents a single test run by the user. It is possible to run the same test many times, which should create many code passports. When the same test is run many times, it should be possible to recognize that they are all different runs of the same test. For this reason, the user provides a string to the start script, which will be used to create a unique code passport name to be used for this
42. pfcplot for histograms. Like pfcplot, it can be used directly by users, but it is not user friendly. More details are available from the man page:

    man histplot

5.2 Scalasca

This section describes how to use the Scalasca performance analysis toolset.

5.2.1 Scalasca Overview

Scalasca (Scalable Performance Analysis of Large-Scale Applications) is an Open Source performance analysis toolset that has been specifically designed for use on large scale systems. It is also well adapted for small and medium scale HPC platforms.

Scalasca supports incremental performance analysis procedures that integrate runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. A distinctive feature is the ability to identify wait states that occur, for example, due to unevenly distributed workloads. Such wait states can lead to poor performance, especially when trying to scale communication-intensive applications to large processor counts.

The current version of Scalasca supports the performance analysis of applications based on the MPI, OpenMP and hybrid programming constructs (OpenMP and hybrid with restrictions) most widely used in highly scalable HPC applications written in C, C++ and Fortran, on a wide range of current HPC platforms. The user can choose between generating a summary report (profile) with aggregate performance metrics for
43. Figure: activity displayed by iotop (terminal output showing the Total DISK READ and Total DISK WRITE rates, followed by per-process DISK READ, DISK WRITE and COMMAND columns).

Please note that iotop needs root privileges to run.

See the iotop man page for usage information, and http://guichaz.free.fr/iotop for more details.

7.2 Darshan

Darshan is a scalable HPC I/O characterization tool. It is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with minimum overhead. Darshan can be used to investigate and tune the I/O behavior of complex HPC applications. In addition, Darshan's lightweight design makes it suitable for full-time deployment for workload characterization o
44. sent messages

    Volume                Volume of sent messages (bytes)
    Avg message size      Average size of messages (bytes)
    Std deviation         Standard deviation of messages size
    Variation coef.       Variation coefficient of messages size
    Frequency (msg/s)     Average frequency of messages (messages per second)
    Throughput (B/s)      Average throughput for sent messages (bytes per second)

5.1.5.8 Topology Section

This section shows the distribution of processes on nodes and processors. This distribution is displayed in two different ways: first, for each process, the node and the CPU in the node where it is running; secondly, the list of running processes for each node.

Example: 8 Processes Running on 2 Nodes

    Topology: 8 process on 2 hosts
    process   hostid   cpuid
    0         0        0
    1         0        1
    2         0        2
    3         0        3
    4         1        0
    5         1        1
    6         1        2
    7         1        3

    host   processes
    0      0 1 2 3
    1      4 5 6 7

5.1.6 Profilecomm Data Display Options

The following options can be used to display the data:

-a, --all
    Displays all the information. Equivalent to -ghimst.
-c, --collective
    Displays collective communication matrices.
-g, --topology
    Displays the topology of the execution environment.
-h, --header
    Displays the header of the profilecomm file.
-i, --histograms
    Displays messages size histograms.
-j, --joined
    Displays entire numeric matrices (i.e. not split). This is the default.
-J
    Displays numeric matrices split according to messages size.
-m, --matrix, --matrices
    Displays
45. source of the count. Here are the four choices:

    Unit Event Source   LLch0-3 & LLih0-1 & LLxh0-2      111111111
    Unit Event Source   LLch0-3                          111100000
    Unit Event Source   LLih0-1                          000011000
    Unit Event Source   LLxh0-2                          000000111

The syntax for the expert user who does not want any software tool help in defining an event is to provide the PMR and PMLL PME register contents:

    BCS_LLI PMR 0x3FF00004 LLCH 0 0x7E0420 LLIH 0 0x7E0420 LLXH 0 0x7E0420

Interface Monitoring

You select the type of OB to LL traffic needed. You select the Event.

The Interface Monitoring fields are: Interface Select (IS), Interface Event (IE).

The definition will fill bits 32-25 of the PMLL PME register. The PMR for the chosen counter for this event should have values shown in the LL Event Setup section, with Unit Event Source chosen as LLch0-3, LLih0-1 and LLxh0-2.

For example (bits 32-25 in binary):

    BCS_LL_Interface IS=OL0 1 IE=FLT

Where the set of exclusive Interface Select Types and their abbreviations are:

    OB to LL Flit 0 and 1
    OB to LL Flit 2 and 3
    OB to LL Flit 0, 1, 2 and 3
    OB to LL Flit 0, 1, 2 and 3 and VN0 Traffic Only
    LL to HD R
    LL to HD L
    LL to HD L Snoop Traffic Only
    LL to NC Flit 0 and 1
    LL to RO Flit 0 and 1
    LL to RO Flit 2 and 3
    LL to RO Flit 0, 1, 2 and 3
    LLC I to OBX or LLX to OBC I REM Flit 0 and 1
    LLC to OBC_LOM
    OB to L
46. to count. The definition will fill bits 33-30 of the PMNC_PME register.

For example (bits 33-30 in binary):

    BCS_NCMH_ECC_Error CXS          0001
    BCS_NCMH_ECC_Error CXS+XCS      0011

Where the set of inclusive ECC Error Types and their abbreviations are:

    QPI to XQPI (NCCX) Single ECC error      CXS
    XQPI to QPI (NCXC) Single ECC error      XCS
    QPI to XQPI Double ECC error             CXD
    XQPI to QPI Double ECC error             XCD

The PMR for the chosen counter for this event should have values shown in the NCMH Event Setup section.

Interface Monitoring

You select the QPI to Output Buffer path (NCCX to OB, NCXC to OB). You select the Event that you want to count.

The definition will fill bits 36-34 of the PMNC_PME register.

For example (bits 36-34 in binary):

    BCS_NCCX_OB PKT      001
    BCS_NCMH_OB FLT      110

Where the set of exclusive Interface Events and their abbreviations are:

    A Packet has been emitted                                PKT
    A Flit has been emitted                                  FLT
    Lack of credit on a Flit waiting to be emitted           LOC

The PMR for the chosen counter for this event should have values shown in the NCMH Event Setup section.

Traffic Monitoring

You select the direction (QPI to XQPI or XQPI to QPI). You select the destination node ID (DNID) and its DNID Mask. You select the requestor node ID (RHNID) and its RHNID Mask. You select the Transaction Type and Mask (msgclass, opcode).

The Traffic Monitoring fields are: Destination Node ID NID DNID
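The "inclusive" ECC Error Types in the ECC error examples of this section behave like a bit mask: CXS alone gives 0001 and CXS+XCS gives 0011. A minimal Python sketch of that composition follows. The bits for CXS and XCS come from the two examples; the bits assigned to CXD and XCD are an assumption based only on the order in which the types are listed, and the function name is hypothetical.

```python
# One bit per inclusive ECC error type within the 4-bit field.
ECC_ERROR_BITS = {
    "CXS": 0b0001,  # from the example "CXS 0001"
    "XCS": 0b0010,  # implied by the example "CXS+XCS 0011"
    "CXD": 0b0100,  # assumption: next bit, following the listing order
    "XCD": 0b1000,  # assumption: next bit, following the listing order
}


def ecc_error_field(selection):
    """OR together the bits of every type named in a 'A+B' style selection."""
    field = 0
    for name in selection.split("+"):
        field |= ECC_ERROR_BITS[name]
    return field


if __name__ == "__main__":
    for selection in ("CXS", "CXS+XCS", "CXD+XCD"):
        print(selection, format(ecc_error_field(selection), "04b"))
```

Exclusive fields, such as the Interface Events, do not combine this way: each abbreviation stands for one complete field value.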
47. to-point
numeric (number of messages)
       0    1.1k       0       0    1.1k
    1.1k       0       0       0    1.1k
       0       0       0    1.1k    1.1k
       0       0    1.1k       0    1.1k
volumic (Bytes)
         0    818.8k         0         0    818.8k
    818.8k         0         0         0    818.8k
         0         0         0    818.8k    818.8k
         0         0    818.8k         0    818.8k

If the file contains several partitions and the -J (split) option is set, then this command displays as many numeric matrices as there are partitions.

Example:

Point to point
numeric (number of messages)
  0 <= msg size < 1000
       0     800       0       0     800
     800       0       0       0     800
       0       0       0     800     800
       0       0     800       0     800
  1000 <= msg size < 1000000
       0     300       0       0     300
     300       0       0       0     300
       0       0       0     300     300
       0       0     300       0     300
  1000000 <= msg size
       0       0       0       0       0
       0       0       0       0       0
       0       0       0       0       0
       0       0       0       0       0
volumic (Bytes)
         0    818.8k         0         0    818.8k
    818.8k         0         0         0    818.8k
         0         0         0    818.8k    818.8k
         0         0    818.8k         0    818.8k

If the -r (rate) option is set, then the messages rate and data rate matrices are shown instead of the communications matrices. These rates are the average rates over the whole execution time, not the instantaneous rates.

Example:

Point to point
message rate (msg/s)
         0    118.2k         0         0    118.2k
    118.2k         0         0         0    118.2k
         0         0         0    118.2k    118.2k
         0         0    118.2k         0    118.2k
data rate (Bytes/s)
         0    88.01M         0         0    88.01M
    88.01M         0         0         0    88.01M
         0         0         0    88.01M    88.01M
         0         0    88.01M         0    88.01M

5.1.5.4 Collective Section

The collective section is equivalent to the point-to-point section, for collective communication matrices.

Example:

Collective
numeric (number of messages)
       0     102     202     102     406
     102       0
48. volumic communication matrix, according to the exported matrix.

-x object, --export object
    Exports a communication matrix or histogram specified by the object argument. Values for object are the following:

    help                List of available matrices and histograms.
    pn[part], np[part]  Point-to-point numeric communication matrix. The optional item part is the partition number for split matrices. If part is not set, the entire matrix (i.e. the sum of the split matrices) is exported.
    pv, vp              Point-to-point volumic communication matrix.
    cn[part]            Collective numeric communication matrix.
                        Collective volumic communication matrix.
    ph, hp              Point-to-point messages size histogram.
    ch, hc              Collective messages size histogram.
    th, ht              Total messages size histogram (collective and point-to-point).
    ah, ha              Both point-to-point and collective messages size histograms (all histograms).

Other options

-H, --help, --usage
    Displays help messages.
-q
    Does not display warning messages (error messages continue to be displayed).
-V, --version
    Displays the program version.

Examples

To display all the information available in the foo.pfc file, enter:

    readpfc -a foo.pfc

This will give information similar to that below:

Header Version Flags Header Elapsed World s umber Partiti num int num vol Topology 4 proces process 0 1 2 3 host pr 0 0 2 little endian Size 40 bytes
49. 0

    Message Class                     Name                        Message Class   Opcode
                                                                  Encoding        Encoding
    Non Coherent Standard (NCS) 4                                 0100            0001
                                      FERR                        0100            0011
                                      NcRdPtl                     0100            0100
                                      NcCfgRd                     0100            0101
                                      NcLTRd (unsupported)        0100            0110
                                      NcIORd                      0100            0111
                                      NcCfgWr                     0100            1001
                                      NcLTWr (unsupported)        0100            1010
                                      NcIOWr                      0100            1011
                                      NcMsgS                      0100            1100
                                      NcP2PS                      0100            1101

Table: Message Class and Opcode Mapping

QPI and XQPI NodeID Maps

The following NodeID maps represent the QPI NodeIDs used by the protocol internal to the mainboard, and the XQPI NodeIDs used by the protocol between mainboards.

QPI NodeID Map

Component Agent NID HA0 00001 NHM 0 Ubox 00010 1 00011 HA0 00101 NHM 1 Ubox 00110 1 00111 HA0 01001 2 01010 1 1 01011 01101 NHM 3 Ubox 01110 1 01111 O 00000 1 00100 10001 NCM 10010 BCS 1 10011 10101 10111

Table 2: NodeID Map

XQPI NodeID Map

    Component   Agent       NID
    BCS 0                   00000
                            00001
                NCM         00010
                CA2 HA2     00011
                CA3 HA3     00100
    BCS 1                   01000
                1 1         01001
                NCM         01010
                CA2 HA2     01011
                CA3 HA3     01100
    BCS 2                   10000
                1 1         10001
                NCM         10010
                CA2 HA2     10011
                CA3 HA3     10100
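The XQPI NodeID values listed above follow a regular pattern: the two high-order bits change with the BCS number and the three low-order bits identify the agent. The Python sketch below decodes a 5-bit NodeID on that basis. The bit split is an inference from the listed values, not a statement from this guide; agents whose names are not legible in the map are reported as "unlisted", and the function name is hypothetical.

```python
# Agent names that are legible in the XQPI NodeID map, keyed by the low 3 bits.
AGENT_BY_LOW_BITS = {
    0b010: "NCM",
    0b011: "CA2/HA2",
    0b100: "CA3/HA3",
}

BCS_MASK = 0x18    # two high-order bits of the 5-bit NodeID (assumed BCS number)
AGENT_MASK = 0x07  # three low-order bits (assumed agent index)


def decode_xqpi_nodeid(nid):
    """Split a 5-bit XQPI NodeID into (BCS number, agent name)."""
    if not 0 <= nid <= 0b11111:
        raise ValueError("XQPI NodeID must fit in 5 bits")
    bcs = (nid & BCS_MASK) >> 3
    agent = AGENT_BY_LOW_BITS.get(nid & AGENT_MASK, "unlisted")
    return bcs, agent


if __name__ == "__main__":
    for nid in (0b00010, 0b01011, 0b10100, 0b01000):
        print(format(nid, "05b"), decode_xqpi_nodeid(nid))
```

If this reading is correct, a NodeID mask of 0x18 in a traffic event would compare only the BCS part of the NodeID, which is consistent with the NID 0, NIDM 0x18 setting used in the BPMON usage examples.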
50. 0     100     202
     202       0       0       0     202
     102     100       0       0     202
volumic (Bytes)
         0    409.6k    421.6k    409.6k    1.241M
    12.04k         0         0       12k    24.04k
    421.6k         0         0         0    421.6k
    12.04k    409.6k         0         0    421.6k

5.1.5.5 Call table section

This section contains the call table. If the ct-total-only option is activated, only the total column is displayed.

Example:

Call table 0 Allgather 0 Allgatherv 0 Allreduce 2 Alltoall 0 Alltoallv 0 Bcast 200 Bsend 0 Gather 0 Gatherv 0 Ibsend 0 Irsend 0 Isend 0 0 0 0 0 0 0 0 k 0 0 0 0 N e N Reduce 20 Reduce scatter Rsend Scan Scatter Scatterv Send d Sendrecv Sendrecv replace Ssend Start N e N H

5.1.5.6 Histograms Section

This section contains the message sizes histograms. It shows the number of messages whose size is zero, between 1 and 9, between 10 and 99, and so on up to between 10^8 and 10^9 - 1, and greater than 10^9.

Histograms of msg sizes
    size     pt2pt    coll     total
    0        0        0        0
    1        800      6        806
    10       1.2k     6        1.206k
    100      1.2k     500      1.7k
    1000     1.2k     500      1.7k
    10^4     0        0        0
    10^5     0        0        0
    10^6     0        0        0
    10^7     0        0        0
    10^8     0        0        0
    10^9     0        0        0

5.1.5.7 Statistics Section

This section displays statistics computed by readpfc. These statistics are based on the information contained in the data collection file T
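The decade bins of the Histograms Section can be reproduced with a few lines of Python. This is a sketch of the binning rule as described there (one bin for empty messages, then one bin per power of ten, with everything from 10^9 upwards in the last bin); the function names are hypothetical and do not correspond to readpfc internals.

```python
from collections import Counter

LAST_BIN = 10 ** 9  # sizes of 10^9 bytes and above share the last bin


def histogram_bin(size):
    """Return the lower bound of the decade bin holding a message size.

    0 maps to the 0 bin, 1..9 to 1, 10..99 to 10, and so on.
    Integer arithmetic avoids floating-point log10 rounding issues.
    """
    if size < 0:
        raise ValueError("message size cannot be negative")
    if size == 0:
        return 0
    lower = 1
    while lower < LAST_BIN and lower * 10 <= size:
        lower *= 10
    return lower


def size_histogram(sizes):
    """Count messages per decade bin."""
    return Counter(histogram_bin(size) for size in sizes)


if __name__ == "__main__":
    print(sorted(size_histogram([0, 4, 8, 64, 512, 4096, 2 * 10 ** 9]).items()))
```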
51. 11

    Message Class                 Name                                Message Class   Opcode
                                                                      Encoding        Encoding
                                  InvItoE                             0000            1000
                                  AckCnfltWbI                         0000            1001
                                  WbMtoI                              0000            1100
                                  WbMtoE                              0000            1101
                                  WbMtoS                              0000            1110
                                  AckCnflt                            0000            1111
    Home Response (HOM1)          RspI                                0001            0000
                                  RspS                                0001            0001
                                  RspCnflt                            0001            0100
                                  RspCnfltOwn                         0001            0110
                                  RspFwd                              0001            1000
                                  RspFwdI                             0001            1001
                                  RspFwdS                             0001            1010
                                  RspFwdIWb                           0001            1011
                                  RspFwdSWb                           0001            1100
                                  RspIWb                              0001            1101
                                  RspSWb                              0001            1110
    Data Response (DRS)           DataC_(FEIMS)                       1110            0000
                                  DataC_(FEIS)_FrcAckCnflt            1110            0001
                                  DataC_(FEIS)                        1110            0010
                                  DataNc                              1110            0011
                                  WbIData                             1110            0100
                                  WbSData                             1110            0101
                                  WbEData                             1110            0110
                                  NonSnpWrData (unsupported)          1110            0111
                                  WbIDataPtl                          1110            1000
                                  WbEDataPtl                          1110            1010
                                  NonSnpWrDataPtl (unsupported)       1110            1011
    Non Data Response (NDR)       Gnt_Cmp                             0010            0000
                                  Gnt_FrcAckCnflt                     0010            0001
                                  CmpD                                0010            0100
                                  AbortTO (unsupported)               0010            0101
                                  Cmp                                 0010            1000
                                  FrcAckCnflt                         0010            1001
                                  Cmp_FwdCode                         0010            1010
                                  Cmp_FwdInvOwn                       0010            1011
                                  Cmp_FwdInvItoE                      0010            1100
    Non Coherent Bypass (NCB)     NcWr                                1100            0000
                                  WcWr                                1100            0001
                                  NcMsgB                              1100            1000
                                  PurgeTC (TKW)                       1100            1001
                                  IntLogical (NHM)                    1100            1001
                                  IntPhysical                         1100            1010
                                  IntPrioUpd                          1100            1011
                                  NcWrPtl                             1100            1100
                                  WcWrPtl                             1100            1101
                                  NcP2PB                              1100            1110
                                  DebugData                           1100            1111
                                  NcRd                                0100            000
52. (Register field depiction: NCXC to OB; Specific type / All types; No Event, NDR Non Data Response, NCS Non Coherent Standard, NCB Non Coherent Bypass, DRS Data Response; Specific ID / All IDs; Traffic Direction; Outgoing packet DNID; Outgoing DNID Mask)

LL and OB Event Types

The Link Layer (LL) is the interface between QPI, IOH, XQPI and the Protocol Engines and Routing Layer of the BCS. Output Buffers (OB) store and route messages from the Protocol Engines to the Link Layer.

LL event types are monitored in the LL units (LLCH, LLIH and LLXH) by setting fields in the PMLL0 or PMLL1 PME registers in the selected unit. Each unit consists of multiple instances: four in LLCH, two in LLIH and three in LLXH. Unlike the PE units, the LL unit instances need not have identical settings for their PME registers, as each instance is connected to a specific agent. OB event types are monitored in the appropriate LL unit.

The following event type can be monitored in LL. Descriptive information is in addition to the general description at the beginning of this section.

Interface: measure OB to LL traffic.

Below is a depiction of the 33-bit PMLL PME register. It is shown as a 32-bit packet and a 1-bit packet, as that is how it is read and written in Configuration Access mode using the
53. Error VOFS 1000000

For Unit Event Source chosen as ReM0-3, here is an example:
BCS_PE_REM_Error DDLS 0010000

Twin Lines Monitoring

You select the event of this type that you want to count. The definition fills bits 9:7 of the PMPE PME register. The PMR for the chosen counter for this event should have the values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3. For example:

Event                    Bits 9:7 (binary)
BCS_PE_Twin_Lines LDS    001
BCS_PE_Twin_Lines LM     010
BCS_PE_Twin_Lines LHO    011

The events and their abbreviations are:
Lookup to Directory SRAM (LDS)
Lookup Miss (LM)
Lookup Hit with one of the Twin Lines in non-I State (LHO)
Lookup Hit with both of the Twin Lines in non-I State (LHB)

For Unit Event Source chosen as LoM0-3, here is an example:
BCS_PE_LOM_Twin_Lines LM 010

For Unit Event Source chosen as ReM0-3, here is an example:
BCS_PE_REM_Twin_Lines LHB 100

Directory Active Levels Monitoring

You select the Directory Active Levels Threshold (0 to 31). You select the Active Levels Event (greater than or equal). The Directory Active Levels Monitoring field is Threshold (THR). The definition fills bits 16:10 of the PMPE PME register. The PMR for the chosen counter for this event should have the values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3. For example (bits 16:10 in binary):
BCS PE Direc
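The field positions quoted above (the Twin Lines event in bits 9:7 and the Directory Active Levels field in bits 16:10 of the PMPE PME register) can be illustrated with a small bit-packing sketch. This is only an illustration in Python and is not part of bpmon: the helper name set_field is hypothetical, the 7-bit value used for bits 16:10 is arbitrary, and only the two bit ranges and the Twin Lines codes come from the text.

```python
# Illustrative only: place field values into bit ranges of a register image.
# The bit ranges (9:7 and 16:10) and the Twin Lines codes come from the text;
# set_field and all variable names are hypothetical.

TWIN_LINES_CODES = {"LDS": 0b001, "LM": 0b010, "LHO": 0b011, "LHB": 0b100}


def set_field(register, high, low, value):
    """Return register with bits high..low (inclusive) replaced by value."""
    width = high - low + 1
    if not 0 <= value < (1 << width):
        raise ValueError("value %d does not fit in %d bits" % (value, width))
    mask = ((1 << width) - 1) << low
    return (register & ~mask) | (value << low)


# Twin Lines event LM goes into bits 9:7.
pme = set_field(0, 9, 7, TWIN_LINES_CODES["LM"])

# An arbitrary 7-bit value goes into the Directory Active Levels bits 16:10.
pme = set_field(pme, 16, 10, 0b0000101)

print(format(pme, "017b"))
```

Keeping the range check inside the helper means an oversized value fails loudly instead of silently overwriting a neighbouring field.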
54. IN 3 salloc Granted job allocation 47136 mpirun n 9 pp sndrcv spbl squeue JOBID PARTITION NAMEUSER ST TIME NODES NODELIST REASON 47136 Zeus bashsenglont R 24 47 3 inti 41 43 padb O rmgr slurm x 47136 ThreadId 1 main at pp sndrcv spbl c 52 PMPI Finalize at ompi mpi finalize at bullx DE User s Guide barrier at opal progress at opal event loop at 11 dispatch at 11 at ThreadId 2 lone at tart thread at tl openib async thread at oll at hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finalize at barrier at opal progress at opal event loop at poll dispatch at poll at ThreadId 2 lone at tart thread at tl openib async thread at oll at hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 47 PI Recv at ca pml obl recv at pal progress at tl openib component progress at WE T3 hreadId 2 lone at tart thread at tl openib async thread at oll at hreadId 3 lone at tart thread at ervice thread start at elect at hre
55. L Flit 0 (OL0)
OB to LL Flit 1 (OL1)
OB to LL Flit 2 (OL2)
OB to LL Flit 3
OB to LL Flit 0 and 1 (OL01)
OB to LL Flit 2 and 3 (OL23)
OB to LL Flit 0, 1, 2 and 3 (OL0123)
OB to LL Flit 0, 1, 2 and 3 and VN0 Traffic Only (OLV)

The exclusive Interface Event Types and their abbreviations are:
A Packet has been Emitted (PKT)
A Packet has been Emitted with Idle Latency (PIL)
A Flit has been Emitted (FLT)
Lack of Credit on a Flit Waiting to be Emitted (LOC)

For Unit Event Source chosen as LLch0-3, here is an example:
BCS LL Interface IS2 OLOT IE FLT 10001011

For Unit Event Source chosen as LLih0-1, here is an example:
BCS Interface IS2OLOT IE FLT 10001011

For Unit Event Source chosen as LLxh0-2, here is an example:
BCS LLXH Interface IS2OLO 1 IE FLT 10001011

CL23 10123 CLV LHR LHL LSNP LNO1 LRO1 LR23 LRO123 LOBX LOBC

RO Event Setup

Only internal Interface Traffic is measured. For the RO count events, the PMR for the chosen counter for this event should have the following settings, where Unit Event Source can have one of three values:

Counter Enable Source: local count enable (001)
Counter Status Output Source: perfcon (000)
Count Mode: count events (00)
Counter Event Source: unit event
Counter and Status Reset Source: no reset
Compare Mode: disabled (00)
Unit Event Source: ROIC &
56. LOMH events 118 96
Table A 5 Event Configuration Registers

A.8 BCS BPMON Usage Examples

Total Memory Traffic For All BCSs Using Incoming Traffic

This BPMON monitor setup collects all the reads and writes issued by the requesting nodes (using REM events) and handled by the local nodes fulfilling the requests (using LOM events), using Incoming Traffic. As the example shows, the REM event counts closely match the LOM event counts. Read opcodes are counted in a single event by using a mask that selects RdCur, RdCode and RdData from the HOM0 Message Class. The write opcode RdInvOwn is counted separately in a different event. The test generates reads in one test pass and writes in another. The test program generates about 500,000,000 remote memory requests per program instance, and four instances are executed.

Here are the read results from BPMON, which also show the BCS performance events measured:

BPMON Single Thread Event Results
Event Description                                          Event Count
BCS_PE_REM_Incoming_Traffic                                1978856781
  MC HOM0 MCM 0xF OC 0 OCM 0xC NID 0x01 NIDM 0x01
BCS_PE_REM_Incoming_Traffic                                1066138
  MC HOM0 MCM 0xF OC RdInvOwn OCM 0xF NID 0x01 NIDM 0x01
BCS_PE_LOM_Incoming_Traffic                                1976723675
  MC HOM0 MCM 0xF OC 0 OCM 0xC NID 0
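The way one OC/OCM pair can select several read opcodes while another selects exactly RdInvOwn can be sketched as a masked comparison. The sketch below rests on two assumptions that this excerpt does not confirm: an opcode is counted when it agrees with OC on every bit set in OCM, and the 4-bit encodings in ASSUMED_HOM0_OPCODES are illustrative values chosen for the demonstration, not values quoted from this guide. The function names are hypothetical.

```python
# Sketch of mask-based opcode selection.
# Assumption: an opcode is counted when (opcode & OCM) == (OC & OCM).
# Assumption: the 4-bit encodings below are illustrative, not quoted
# from this guide.
ASSUMED_HOM0_OPCODES = {
    "RdCur": 0b0000,
    "RdCode": 0b0001,
    "RdData": 0b0010,
    "RdInvOwn": 0b0100,
}


def opcode_matches(opcode, oc, ocm):
    """True when opcode agrees with oc on every bit that is set in ocm."""
    return (opcode & ocm) == (oc & ocm)


def selected(oc, ocm):
    """Names of the assumed opcodes selected by an OC/OCM pair."""
    return [name for name, code in ASSUMED_HOM0_OPCODES.items()
            if opcode_matches(code, oc, ocm)]


reads = selected(oc=0b0000, ocm=0xC)   # "OC 0 OCM 0xC": low two bits ignored
writes = selected(oc=0b0100, ocm=0xF)  # "OC RdInvOwn OCM 0xF": exact match
print(reads, writes)
```

Under these assumptions, a mask of 0xC compares only the two high opcode bits, which is why one event can cover a whole group of read opcodes, while a mask of 0xF requires all four bits to match.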
57. NSQ1000000 H trace T opt hpctk test cases MpiSpinWheels mpirun mca btl tcp self np 1 host bones x ROOT bynode display map
bhpcprof mpi ndemoproj
bhpcstop ndemoproj

The bhpcviewer Repository Perspective of the code passport MpiSpinWheels 20120713 143757 data created by the above example is displayed below.

[Figure 6-3. bhpcviewer Repository page. The Repository Explorer shows the demoproj project and, under the code passport MpiSpinWheels 20120713 143757, the bhpcprof mpi, bhpcrun and bhpcstruct entries. Each bhpcrun node entry holds the hpcrun, hpctrace and log files together with application, libraries, stderr, stdout, sys_type and variables.]

6.5.5 HPCToolkit Configuration Files

The enhanced HPCToolkit provides configuration files that control the execution of each component in the package. Each enhanced HPCToolkit component uses a configuration file named xxx.conf, where xxx is the tool name
58. Outgoing Traffic BCS PE Tracker Traffic BCS PE LOM Tracker Traffic BCS PE REM Tracker Traffic BCS PE Lookup Traffic BCS PE Lookup Traffic BCS PE REM Lookup Traffic BCS PE Short Retry BCS PE LOM Short Retry BCS PE REM Short Retry BCS PE Long Retry BCS PE LOM Long Retry BCS PE REM Long Retry BCS PE Starvation BCS PE LOM Starvation BCS PE REM Starvation BCS PE Buffer Occupation BCS PE LOM Buffer Occupation BCS PE REM Buffer Occupation BCS PE Interface RT East BCS PE LOM Interface East Appendix A Performance Monitoring with BCS Counters 89 90 BCS PE REM Interface RT East BCS PE Interface RT West BCS PE LOM Interface RT West BCS PE REM Interface RT West BCS PE Tx Request BCS PE LOM Tx Request BCS PE REM Tx Request BCS PE Tx Response BCS PE LOM Tx Response BCS PE REM Tx Response BCS NCMH BCS NCMH Buffer Occupation BCS NCMH Tx GPI Alloc BCS NCMH Alloc BCS NCMH Tx GPI Release BCS NCMH Release BCS NCMH Lock Message BCS NCMH Unlock Message BCS NCMH Lock Message Latency BCS NCMH ECC Error BCS NCMH NCCX OB BCS NCMH NCXC OB BCS NCMH Traffic BCS NCMH Traffic BCS LL BCS LL Interface BCS LL Interface BCS LL Interface BCS LL LLXH Interface BCS BCS RO Interface BCS RO
59. Point Exception Analysis

In addition, Open SpeedShop is designed to be modular and extensible. It supports several levels of plug-ins, which allow users to add their own performance experiments.

Open SpeedShop Usage

Using Open SpeedShop consists of loading a module file, which sets the paths for binaries and libraries and the environment variables required for proper usage. The Open SpeedShop package provides two module files:

• openspeedshop version bullxmpi
  Load this module file to use Open SpeedShop with applications compiled with bullxMPI or any OpenMPI-based MPI implementation.
• openspeedshop version intelmpi
  Load this module file to use Open SpeedShop with applications compiled with Intel MPI.

This integrated version of Open SpeedShop has been configured to use the offline mode of operation, which links the performance data collection modules with your application and collects the performance data you specify.

6.3.3 More Information

See the documentation available from http://www.openspeedshop.org for more details on using Open SpeedShop.

Convenience commands are provided as a very simple syntax and an easier way to invoke the offline functionality:
http://www.openspeedshop.org/wp/wp-content/uploads/2013/03/OSSQuickStartGuide2012.pdf

Man pages are available for the Open SpeedShop invocation com
60. Readpfc output

The main feature of readpfc is to display the information contained in the seven different sections of a profilecomm file. These are:

• Header
• Point to point
• Collective
• table
• Histograms
• Statistics
• Topology

Note: The header, histograms, statistics and topology sections are not included in the output when the t text text format options are used.

5.1.5.2 Header Section

Displays the information contained in the header of a profilecomm file. The most interesting fields are:

• Elapsed Time: indicates the length of the data collection
• World size: indicates the number of processes
• Number of partitions: indicates the number of partitions
• Partitions limits: indicates the list of size limits for the message partitions (only used if there are several partitions)

The other fields are of less interest to end users but are used internally by readpfc.

Example:

Header
Version 2
Flags little endian
Header size 40 bytes
Elapsed time 9303 us
World size 4
Number of partitions 3
Partitions limits 1000 1000000
num_intsz 4 bytes 32 bits
num_volsz 8 bytes 64 bits

5.1.5.3 Point to Point Communications Section

For point to point communication matrices use the following. The number of communication messages is displayed first, then the volume. If either the numericonly or volumiconly option is used, then only the corresponding matrix is displayed.

Point
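The relationship between the "Number of partitions" and "Partitions limits" fields in the example header (three partitions, two limits) can be sketched as a lookup that maps a message size to a partition index. This is a hypothetical illustration rather than readpfc code: the function name partition_index is invented, and the choice that a message whose size equals a limit falls into the upper partition is an assumption, since the excerpt does not say which side of a limit is inclusive.

```python
import bisect

# Limits taken from the example header: "Partitions limits 1000 1000000".
PARTITION_LIMITS = [1000, 1000000]


def partition_index(message_size, limits=PARTITION_LIMITS):
    """Return the 0-based partition a message of this size falls into.

    Assumption: a size equal to a limit belongs to the upper partition.
    """
    return bisect.bisect_right(limits, message_size)


# Two limits split message sizes into three partitions.
number_of_partitions = len(PARTITION_LIMITS) + 1

for size in (10, 5000, 2000000):
    print(size, partition_index(size))
```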
61. Software Application Supported BCS Monitoring Events

In this section, the set of BCS performance monitoring events is described. Each performance event is named, the syntax for requesting it is defined, and the abbreviations of the fields that must be used by name, together with the contents of those fields, are defined. The message classes and their opcodes are used as defined in Section A.6. Some simplifications are made in this description of the supported performance monitoring events. Therefore, a user who relies only on this syntax to describe events does not have access to the full capability of BCS performance monitoring.

A list of all performance events is presented here, in the order defined in this section. As defined, they collect counts from all the BCSs in the node.

BCS PE
BCS_PE_Error
BCS_PE_LOM_Error
BCS_PE_REM_Error
BCS_PE_Twin_Lines
BCS_PE_LOM_Twin_Lines
BCS_PE_REM_Twin_Lines
BCS_PE_Directory_Active_Levels
BCS_PE_LOM_Directory_Active_Levels
BCS_PE_REM_Directory_Active_Levels
BCS_PE_Directory_Access_Event
BCS_PE_LOM_Directory_Access_Event
BCS_PE_REM_Directory_Access_Event
BCS_PE_Incoming_Traffic
BCS_PE_LOM_Incoming_Traffic
BCS_PE_REM_Incoming_Traffic
BCS_PE_Outgoing_Traffic
BCS_PE_LOM_Outgoing_Traffic
BCS_PE_REM
62. aced in all major units. Two events can be decoded per cycle in each block. All events are then centralized in the Performance Monitoring Central Counter block (PMCC), implemented in the Non-Coherent Manager Unit (NCMH). The PMCC consists of four counters.

Event Detection

Each unit has two blocks containing the Performance Monitoring Event register (PME), which can be independently programmed to detect and forward different events. These blocks are named PMxx0 and PMxx1, where xx is the unit identifier, and their events are referred to as event0 and event1 respectively. This two-block construct allows two similar events in the same unit to be selected and sent to the counter blocks: for example, a target event such as a directory access with a specific state as one event, and a reference event of all directory read accesses as a second event.

Event Counting

All unit event outputs are collected in the central counter block located in the NCMH unit. Here the events are selected as inputs to the four counters. Each counter is controlled by a Performance Monitoring Resource Control and Status register (PMR). Events from PMxx0 are hardwired to the event selection for counter0 of each counter pair; events from PMxx1 are hardwired to the event selection for counter1 of each counter pair. This is important to keep in mind when trying to combine events from different units into one counter.

Event Types

This i
63. adId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finalize at barrier at opal progress at opal event loop at poll dispatch at poll at hreadId 2 lone at tart thread at tl openib async thread at GLEN at Wu hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finalize at barrier at opal progress at H Chapter 3 Debugging Application with padb 9 opal event loop at poll dispatch at poll at ThreadId 2 lone at tart thread at tl openib async thread at oll at hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finalize at barrier at opal progress at opal event loop at poll dispatch at poll at ThreadId 2 lone at tart thread at tl openib async thread at oll at hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finaliz
64. ank
Average time: The average time spent in the functions of the group.
Percentage of MPI-IO time: The percentage of time spent in the MPI region.
Percentage of walltime: The percentage of walltime.
Max MPI-IO volume (rank): The maximum volume of data processed in the group and the candidate process.
Min MPI-IO volume (rank): The minimum volume of MPI-IO data processed in the group and the candidate process.
Total volume: Total volume of data processed, in MB.
Average volume: Average volume of data processed, in MB.
MPI-IO bandwidth: Volume of data processed per second, in MB/s.

The detailed report, produced when the trace level is set to 2, gives the following information for each MPI-IO function:

Min time (rank): The minimum time spent executing the MPI-IO function and the candidate process rank.
Max time (rank): The maximum time spent executing the MPI-IO function and the candidate process rank.
Average time: The average time spent in the MPI-IO function region.
Percentage of the MPI-IO time for the MPI-IO function.
walltime: Percentage of the walltime for the MPI-IO function.

Chapter 5 Application Profiling

5.1 MPI Analyser

This section describes how to use the MPI Analyser profiling tool.

5.1.1 MPI Analyser Overview

mpianalyser is a profiling tool developed by Bull for its own MPI implementation. This i
65. ank
The minimum time spent in the functions of the group and the candidate process rank.
The average time spent in the functions of the group.
The percentage of time spent in the region.
The percentage of walltime.
The maximum number of messages exchanged in the group and the candidate process.
The minimum number of messages exchanged in the group and the candidate process.
The total number of messages exchanged in the group.
The average number of messages exchanged in the group.
Number of messages exchanged per second.
Total volume of data exchanged, in MB.
Average volume of data exchanged, in MB.
Volume of data exchanged per second, in MB/s.

The detailed report, produced when the trace level is set to 2, gives the following information for each function:

Min time (rank): The minimum time spent executing the MPI function and the candidate process rank.
Max time (rank): The maximum time spent executing the MPI function and the candidate process rank.
Average time: The average time spent in the function region.
Percentage of the MPI time for the MPI function.
walltime: Percentage of the walltime for the MPI function.

experiment

Sequential Program

For a sequential program, the summary report produced when the trace level is set to 1 gives the following information:

Total IO time
Total IO read time
Total IO read volume
Total IO read bandwidth
Total IO write time
Total time spent
66. applications. Because it only collects statistical information about MPI functions, mpiP generates considerably less overhead and much less data than tracing tools. All the information captured by mpiP is task-local. It only uses communication during report generation, typically at the end of the experiment, to merge results from all of the tasks into one output file.

Note: In the context of xPMPI, user applications do not have to be recompiled.

At the end of the run, mpiP generates a mpiP report file in the current directory (the default). We suggest changing this default to a directory of your choice by setting the MPIP environment variable.

See http://mpip.sourceforge.net/#mpiP_Output for a complete description of the results.

Should you want to influence the mpiP runtime and customize the generated report, more options are available through the MPIP environment variable. See http://mpip.sourceforge.net/#Runtime_Configuration

xPMPI Configuration

The combination of tools can be managed with a configuration file indicating which tools are activated and their order of execution.

####################################
# xPMPI configuration file
####################################
module mpiP
module ipm

The keyword module declares that the tool is activated. The tools are chained in their order of declaration.

A default configuration fil
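The rule stated above (each module line activates a tool, and tools are chained in their order of declaration) can be sketched with a few lines of Python. This is not the xPMPI parser: the function name parse_xpmpi_conf is hypothetical, and treating "#" as the start of a comment is an assumption made for the illustration.

```python
# Hypothetical reader for an xPMPI-style configuration file.
# Assumption: "#" starts a comment; only "module <name>" lines matter.

def parse_xpmpi_conf(text):
    """Return the activated tool names in their order of declaration."""
    modules = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split()
        if parts[0] == "module" and len(parts) >= 2:
            modules.append(parts[1])
    return modules


SAMPLE_CONF = """\
# xPMPI configuration file
module mpiP
module ipm
"""

print(parse_xpmpi_conf(SAMPLE_CONF))
```

A list is used rather than a set so that the declaration order, which determines how the tools are chained, is preserved.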
67. ask MCM Transaction Type OpCode Mask OCM Tracker Output State TOS Tracker Output Response Type Received TOR 94 bullx DE User s Guide The definition will fill bits 64 20 of PMPE PME register The PMR for the chosen counter for this event should have values shown in the PE Event Setup section with Unit Event Source chosen as LoMO 3 and ReMO 3 For example Bits 64 20 in binary BCS PE Tracker Traffic MC2 DRS MCM OxF OC 0x0 OCM O0x0 TOS dnSnp TOR Rsp Wb RspSWb 10111000001 1 110000000000000000000000000000000 This counts DRS transaction types for all opcodes for the selected Tracker Output state For Unit Event Source chosen as LoMO 3 here is an example BCS_PE_LOM_Tracker_Traffic MC DRS MCM OxF OC 0x0 OCM 0x0 TOS dnSnp TOR Rsp Wb RspSWb 10111000001 1110000000000000000000000000000000 For Unit Event Source chosen as ReMO 3 here is an example BCS PE REM Tracker Traffic MC DRS MCM OxF OC 0x0 OCM 0x0 TOS dnSnp TOR Rsp Wb RspSWb 10111000001 1110000000000000000000000000000000 Lookup Response Traffic Identification Monitoring There are four cases of Traffic Identification Events For this case Lookup Response 11 is chosen The defaulted fields are Node ID NID 00000 NID NID Mask 00000 NIDM Tracker Output State 00 TOS Tracker Output Response Type Received 0000000000 TOR The filled fields are Direction 11 Transaction Type MsgClass MC Transaction Type OpCode OC Transaction Type MsgClass Mask MCM
68. ation time profiling
hwc: hardware metrics profiling
mpi: MPI functions profiling
io: POSIX I/O functions profiling
mpiio: MPI I/O functions profiling

force
Forces the application profiling when multithreading is detected. This version of bullxprof does not support multithreading. By default, bullxprof stops running when multithreading (OpenMP, pthread) is detected within the profiled binary.

h help
Displays the help message.

list
Prints the list of functions that can be instrumented.

libs <lib1.so libN.so>
List of shared libraries to include in the application profiling. bullxprof does not profile shared libraries by default. The full path name of the library is needed if the library path is not in the LD_LIBRARY_PATH environment variable.

m metrics <metric1 metricN>
Enables profiling of the selected metrics. Applies to the hwc experiment only. Possible metric values are:
flops: consumed GFLOPS
ibc: Instructions by Cycles
cmr: Cache Miss Rate
clr: Cache Line Reuse

output <mode1 modeN>
Determines the report output mode. Possible output modes are stdout, file and csv.
stdout causes reports to be dumped on standard error stream.
file causes reports to be created as files in a directory named bullxprof YYYYMMDD HHMM SLURM JOB ID.
csv causes reports to be created as CSV files in a directory named bullxprof YYYYMMDD HHMM SLURM JOB ID.

R region <region1
69. b/experiment-1.mdb
<project>/<code passport>/bhpcprof/<data origin>/perf_db/experiment.mt
<project>/<code passport>/bhpcprof/<data origin>/perf_db/experiment.xml

The performance database source files:
<project>/<code passport>/bhpcprof/<data origin>/perf_db/src/xxx
<project>/<code passport>/bhpcprof/<data origin>/perf_db/src/usr/xxx

Standard Error and Standard Output:
<project>/<code passport>/bhpcprof/<data origin>/stdout
<project>/<code passport>/bhpcprof/<data origin>/stderr

Hotplot Component bhpcprof-mpi

This component is a wrapper around the HPCToolkit hpcprof-mpi component. It provides an interface that can be used to add value to a performance database. The bhpcprof-mpi wrapper performs these actions:

• Collect the information from the code passport that would normally be used by hpcprof-mpi to build a performance database
• Optionally call a user-provided command script to allow the user to modify the set of data to be passed to hpcprof-mpi
• Execute the hpcprof-mpi component to build a performance database as an XML file intended to be displayed by the GUI viewer
• Use the Passport Library to write the performance database created by hpcprof-mpi to the specified code passport in the designated History Repository location

The performance database and supporting files:
<project>/<code passport>/bhpcprof-mpi/<data
70. begin
MPI_File_write_ordered_end
MPI_File_get_type_extent
MPI_File_set_atomicity
MPI_File_get_atomicity
MPI_File_sync
MPI_File_set_errhandler
MPI_File_get_errhandler

4.5 Profiling reports

This section details the information contained in the different profiling reports.

4.5.1 Timing experiment

Sequential Program

For a sequential program, the summary report produced when the trace level is set to 1 gives the following information:

process walltime: The overall execution time of the program.
time: The execution time spent in a region.
percentage: The percentage of walltime spent in a region.

The detailed report, produced when the trace level is set to 2, gives the following information:

region: The region the function belongs to.
number of calls: Number of times the function was called by the program.
exclusive time: Time spent exclusively in the function, without inner function calls.
percentage: Percentage of walltime spent in the function.

Parallel Program

In an MPI context, the summary report produced when the trace level is set to 1 gives the following information:

process walltime: The overall execution time of the program.
number of processes: Number of MPI processes.
Comm/compute ratio: Ratio of the time spent communicating to the time spent computing.

And, per region (ALL, USER, MPI, MPI-IO), the following information:

Min time (rank): The minimum time spent in the region executing the funct
71. bullx scs 4 RA
bullx DE User's Guide

REFERENCE 86 2 84FK 02

The following copyright notice protects this book under Copyright laws, which prohibit such actions as, but not limited to, copying, distributing, modifying, and making derivative works.

Copyright © Bull SAS 2014
Printed in France

Trademarks and Acknowledgements

We acknowledge the rights of the proprietors of the trademarks mentioned in this manual. All brand names and software and hardware product names are subject to trademark and/or patent protection. Quoting of brand and product names is for information purposes only and does not represent trademark misuse.

Software
January 2014

Bull Cedoc
357 avenue Patton
BP 20845
49008 Angers Cedex 01
FRANCE

The information in this document is subject to change without notice. Bull will not be liable for errors contained herein, or for incidental or consequential damages in connection with the use of this material.

Table of Contents
Intended Readers
Highlighting
Related Publications
Chapter 1 bullx Development Environment
Chapter 2 bullx DE User
2.1
72. d for the following reasons:

• To provide a solid foundation for cross-platform performance analysis tools
• To present a set of standard definitions for performance metrics on all platforms
• To provide a standard API among users, vendors and academics

PAPI supplies two interfaces:

• A high-level interface, for simple measurements
• A low-level interface, programmable, adaptable to specific machines, and linking the measurements

PAPI should only be used by specialists interested in optimizing scientific programs. These specialists can focus on code sequences using PAPI functions. PAPI tools are all open source tools.

High-level PAPI Interface

The high-level API provides the ability to start, stop and read the counters for a specified list of events. It is particularly well suited to programmers who need simple event measurements using PAPI preset events. Compared with the low-level API, the high-level API is easier to use and requires less setup (fewer additional calls). However, this ease of use comes at the cost of a somewhat higher overhead and a loss of flexibility.

Note: Earlier versions of the high-level API are not thread safe. This restriction has been removed with PAPI 3.

Below is a simple code example using the high-level API:

#include <papi.h>
#define NUM_FLOPS 10000
#define NUM_EVENTS 1
main()
{
int Events[NUM_EVENTS] = {PAPI_TOT_INS};
lo
73. dding the counters to be approximately twice as large as the first value after reading the counters.

For more details, please refer to the PAPI man pages and documentation, which are installed with the product in the /usr/share directory.

Collecting FLOP Counts on Sandy Bridge Processors

Floating Point OPerations (FLOP) performance events are very sensitive to the machine type. The focus here is the Sandy Bridge processor. Here are some general insights:

1. Users think in terms of how many computing operations are done, as a count of how many numbers are added, subtracted, compared, multiplied or divided.
2. Hardware engineers think in terms of how many instructions are executed that add, subtract, compare, multiply or divide.

Three types of operations are provided on these machines:

1. Scalar: one operand per register
2. Packed in a 128-bit register: 4 single-precision numbers or 2 double-precision numbers
3. Packed in a 256-bit register: 8 single-precision numbers or 4 double-precision numbers

The FLOP performance events collected by PAPI are influenced by these three types of operations. The performance events count one for each instruction, regardless of the number of operations done. To compensate for this, PAPI has defined several presets that compute the number of FLOPs the user expects by collecting several performance events and multiplying each one by the proper constant. The PAPI Wiki has a very interesting page that goes into great detail on this topic
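The compensation described above (one count per instruction, multiplied by the number of operations each instruction performs) can be worked through with a short sketch. The operand counts per register type come from the list above. The instruction counts in the sample are made-up numbers, and the function and dictionary names are hypothetical; this does not reproduce the actual PAPI preset definitions, which this excerpt does not list.

```python
# Operations performed by one instruction, per register type and precision,
# as listed in the text above.
OPS_PER_INSTRUCTION = {
    ("scalar", "single"): 1,
    ("scalar", "double"): 1,
    ("packed128", "single"): 4,
    ("packed128", "double"): 2,
    ("packed256", "single"): 8,
    ("packed256", "double"): 4,
}


def flop_count(instruction_counts):
    """Weight each instruction count by its operations per instruction."""
    return sum(OPS_PER_INSTRUCTION[kind] * count
               for kind, count in instruction_counts.items())


# Made-up instruction counts for a double-precision kernel.
sample_counts = {
    ("scalar", "double"): 1000,
    ("packed128", "double"): 500,
    ("packed256", "double"): 250,
}

raw_instructions = sum(sample_counts.values())
print(raw_instructions, flop_count(sample_counts))
```

In this made-up sample the three instruction classes each contribute the same number of operations even though their instruction counts differ by a factor of four, which is the gap between the hardware view and the user view described above.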
74. dicate where optimization benefits can be achieved.

Attribute costs very precisely. HPCToolkit is unique in its ability to associate measurements in the context of dynamic calls, loops and inlined code.

6.4.1 HPCToolkit Workflow

The HPCToolkit design principles led to the development of a general methodology, resulting in a workflow that is organized around four different capabilities:

Measurement of performance metrics during the execution of an application
Analysis of application binaries to reveal the program structure
Correlation of dynamic performance metrics with the structure of the source code
Presentation of performance metrics and associated source code

[Figure 6-1. HPCToolkit Workflow: compile and link, call stack profile, interpret profile, correlate with source (hpcprof)]

As shown in the workflow diagram above, the application is first compiled and linked for a production run using full optimization. The application is then launched with the hpcrun measurement tool, which uses statistical sampling to produce a performance profile. Thirdly, hpcstruct is invoked; this tool analyzes the application binaries to recover information about files, functions, loops and inlined code. Fourthly, hpcprof is used to combine performance measurements with information about the program structure to produce a performance database. Finally, it is possible to examine the performance databa
75. e at barrier at opal progress at opal event loop at 11 dispatch at poll at hreadId 2 lone at tart thread at tl openib async thread at oll at hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finalize at barrier at opal progress at opal event loop at poll dispatch at poll at hreadId 2 lone at tart thread at tl openib async thread at OLIN at WS hreadId 3 lone at tart thread at ervice thread start at elect at hreadId 1 ain at pp sndrcv spbl c 52 PI Finalize at ompi mpi finalize at barrier at opal progress at 10 bullx DE User s Guide tuBudoououdud oon tuB8udoououd ooon H O0 00 do o J J J OY OY Ow OY OY O OY OY OY OY OY OY O1 O1 O1 O1 O1 O1 O1 O1 O1 O1 O1 O1 Oi HAA RRR RB as ads tuBudoouodud oonouon H 8 0pal event loop at 8 poll dispatch at 8 poll at 8 ThreadId 2 8 clone at 8 start thread at 8 btl openib thread at 8 poll at 8 ThreadId 3
76. e BCS performance events measured:

BPMON Single Thread Event Results
Event Description                                          Event Count
BCS_PE_REM_Outgoing_Traffic                                1976316475
  MC HOM0 MCM 0xF OC 0 OCM 0xC NID 0x00 NIDM 0x00
BCS_PE_REM_Outgoing_Traffic                                10 23535
  MC HOM0 MCM 0xF OC RdInvOwn OCM 0xF NID 0x00 NIDM 0x00
BCS_PE_LOM_Outgoing_Traffic                                1975865466
  MC HOM0 MCM 0xF OC 0 OCM 0xC NID 0x01 NIDM 0x01
BCS_PE_LOM_Outgoing_Traffic                                1021035
  MC HOM0 MCM 0xF OC RdInvOwn OCM 0xF NID 0x01 NIDM 0x01

Here are the write results from BPMON, which also show the BCS performance events measured:

BPMON Single Thread Event Results
Event Description                                          Event Count
BCS_PE_REM_Outgoing_Traffic                                9663484
  MC HOM0 MCM 0xF OC 0 OCM 0xC NID 0x00 NIDM 0x00
BCS_PE_REM_Outgoing_Traffic                                1941802417
  MC HOM0 MCM 0xF OC RdInvOwn OCM 0xF NID 0x00 NIDM 0x00
BCS_PE_LOM_Outgoing_Traffic                                9217576
  MC HOM0 MCM 0xF OC 0 OCM 0xC NID 0x01 NIDM 0x01
BCS_PE_LOM_Outgoing_Traffic                                1941799879
  MC HOM0 MCM 0xF OC RdInvOwn OCM 0xF NID 0x01 NIDM 0x01

Memory Traffic For a Source and a Destination BCS Using Incoming Traffic

This BPMON monitor setup collects all the reads and writes from the requesting node on BCS0 (using REM events) and the local node fulfilling the requests on BCS3 (using LOM events), u
77. e is installed in the following location:
/opt/bullxde/mpicompanions/xPMPI/etc/xpmpi.conf

A user-defined configuration file can be specified with the PNMPI_CONF environment variable:
export PNMPI_CONF=<path to user defined configuration file>

xPMPI Usage

Using xPMPI consists of loading a module file. The environment is then set so that the tool can intercept MPI function calls without changing the regular launch process of the application.

Do not forget to unload the module file to disable the use of xPMPI after a profiling session.

Chapter 6 Analyzing Application Performance

Different tools are available to monitor the performance of your application, to help identify problems, and to highlight where performance improvements can be made. These include:

• PAPI, an open source tool
• Bull Performance Monitor (bpmon), a Linux command line single node performance monitoring tool, which uses the PAPI interface to access the hardware performance event counters of most processors
• HPCToolkit, an open source tool based on PAPI and included in the bullx supercomputer suite delivery
• Bull Enhanced HPCToolkit, based on the current HPCToolkit; it provides added value for HPC users needing profile-based performance analysis in order to optimize their running software applications
• Open SpeedShop, an open source multi-platform Linux performance tool

6.1 PAPI

6.1.1 Performance is use
78. e responses during Tracker phases: Snoop Snoopy Nodes, Snoop Directory Nodes, and Read Memory for cache-to-cache transfers.
   d. Lookup Response Traffic: directory status during Read Access of IPT (In Process Table) or SRAM directories. Shared and Exclusive State events can act as indicators of program affinity.
5. Transaction Latency: measure latency for Read, Write, or Snoop transactions based on the opcode, dependent upon an opcode mask.
6. Starvation: measure starvation starts, duration, or number of starved transactions versus a threshold.
7. Retry: measure initial retries, all retries, and all transactions that enter the Retry Detection stage; select between Short (early detection) or Long (detection at end of pipeline), and one, some, or all Retry types.
8. Directory Access: measure Read and Update accesses to the SRAM directory, or to both IPT and SRAM. In ReM the directories comprise the ILD; in LoM the directories comprise the ELD.
9. Directory Levels: measure level occupation at or greater than a specific threshold. Used in association with the timer and multiple runs at different thresholds to make a histogram of occupation.
10. Twin Lines: measure different types of SRAM Directory look-ups related to entries that contain a Twin Line address (defined as a pair of addresses that differ by one specific address bit, which allows for sharing of the directory entry).

A depiction of the 119-bit PMPE PME register follows. It i
79. e system we are running on
• Collect dynamic libraries used by the application
• Execute the hpcrun component to execute a test case
• Collect and store the performance profile data generated from a single invocation of hpcrun on one or more nodes
• Use the Passport Library to write the performance profile to the specified code passport in the designated History Repository location:

  The performance profile:
    <project>/<code passport>/bhpcrun/<data origin>/<test case name>-xxx.hpcrun
    <project>/<code passport>/bhpcrun/<data origin>/<test case name>-xxx.hpctrace
    <project>/<code passport>/bhpcrun/<data origin>/<test case name>-xxx.log
  The application executable location:
    <project>/<code passport>/bhpcrun/<data origin>/application
  The dynamic libraries:
    <project>/<code passport>/bhpcrun/<data origin>/libraries
  Standard Error and Standard Output:
    <project>/<code passport>/bhpcrun/<data origin>/stdout
    <project>/<code passport>/bhpcrun/<data origin>/stderr
  Environment information for the system:
    <project>/<code passport>/bhpcrun/<data origin>/5 type
    <project>/<code passport>/bhpcrun/<data origin>/variables

The user may also provide optional scripts to perform tasks at specified points during the execution of the bhpcrun script. The optional prologue script will be executed by bhpcrun prior to execution of t
80. ed based upon the unit type (PE, LL)
2. compare mode or no comparison: select maximum compare, compare then update, or no comparison mode
3. reset source for counter and status: select partner's compare or overflow status, partner's event, or nothing as the reset
4. source of counter events: select PME event, partner's status, or clock
5. count mode: count events, or clocks after event
6. destination of counter status output: select PERFCON or partner
7. counter enable source: local (by PERFCON or timer), partner's status, or disabled
8. reset counter and clear status bits

Using Bull's tools, the user has no capability to use the Interval Timer or Compare mechanisms.

A depiction of the 32-bit PMR register follows. Field description details can be found in the PMR Configuration Register Description.

[Figure: 32-bit PMR register layout. Fields shown: Counter Status Output, Counter Enable, Counter and Status Reset, Counter Compare Mode, Counter Event Source, Unit Event Source. Recoverable encodings: Counter Status Output 000 = status reported in PERFCON (alternative: status reported to Counter Partner); Count Mode 00 = count events, 01 = count clocks after event; Counter Event Source: Unit PME register event, 001 = Counter Partner status, 010 = Clock; Reset Source: Counter Partner's status, Counter Partner's incoming event.]
81. ed openSSH, pdsh, Perl, and gdb.

3.2 Features

The stack trace generation operation mode is supported.

3.3 padb with SLURM / bullx MPI

Bull has developed specific features to support the combination of SLURM and bullx MPI environments. Specifically, applications compiled with bullx MPI libraries should be launched using the mpirun command (the OpenMPI launch command) within a resource managed by SLURM, using the salloc command. Some examples of job launching command combinations are shown below.

Example 1
  salloc -w host1,host2 mpirun -n 16 ompi_appli

Example 2
  salloc -w host1,host2
  salloc: Granted job allocation XXXX
  $ mpirun -n 16 ompi_appli

Example 3
  salloc -w host1,host2
  salloc: Granted job allocation XXXX
  $ srun -n 1 mpirun -n 16 ompi_appli

Example 4
  salloc -N 3
  salloc: Granted job allocation XXXX
  $ srun -n 1 mpirun -n 16 ompi_appli

3.4 Using padb

Synopsis
  padb -O rmgr=slurm -x -t -a jobid

  -x      Get processes stacks
  -t      Tree based output for stack traces
  -a      All jobs for this user
  jobid   Job Id obtained by the slurm squeue command

An environment variable can be set for the Resource Manager, for example:
  export PADB_RMGR=slurm
The padb command synopsis then becomes simpler, as shown:
  padb -x -t -a jobid

Examples

A short example is shown below.
  $ salloc -p Zeus
82. ed on all possible target nodes where the test will be executed. The bhpcstruct wrapper performs these actions:
• Collect special metrics from the program structure to create the program summary
• Execute the hpcstruct component to create the program structure
• Collect information to create the program environment
• Use the Passport Library to write program structure, metrics, and environment information to the specified code passport in the designated History Repository location:

  The scope tree produced by hpcstruct:
    <project>/<code passport>/bhpcstruct/<system>/<test case name>.hpcstruct
  The executable of the test case:
    <project>/<code passport>/bhpcstruct/<system>/exec/<test case name>
  Standard Error and Standard Output:
    <project>/<code passport>/bhpcstruct/<system>/stdout
    <project>/<code passport>/bhpcstruct/<system>/stderr

6.5.3.5 Parallel Manager Component (bhpcrun)

This component is a wrapper around the HPCToolkit hpcrun component. For applications, bhpcrun must be installed on all possible target nodes where the test will be executed. bhpcrun must preserve the node name, process name, and MPI rank (for MPI processes) used during sample collection, so that abnormal samples can be tied back to the node and/or process on which they occurred. The bhpcrun wrapper performs these actions:
• Collect environment information for th
83. el manager component uses a configuration file named bhpcrun.conf. A hypothetical configuration file for this component could look something like this:

  User login level configuration for bhpcrun
  name democonf
  events PAPI 1000000 PAPI TOT INS1000000
  hpcargs v 2
  testcase opt hpctk test cases MpiSpinWheels
  testargs
  maxruntime 01 00 00

6.5.5.3 HOTPLOT Configuration File (bhpcprof.conf)

The hotplot application uses a configuration file named bhpcprof.conf. A hypothetical configuration file for this component looks something like this:

  User login level configuration for bhpcprof
  name democonf
  include home hpctk pmhistrep <project> <cpp> bhpcprof <data origin> perf db
  hpcargs v 2

Chapter 7 I/O Profiling

This chapter describes I/O profiling tools.

7.1 Iotop

Iotop is a lightweight top-like tool that shows the I/O activity on disk of running processes.

Figure 7-1 [Iotop screenshot: per-process list with PRIO, USER, and DISK columns]
84. er of cycles from a Start Event to a Stop Event. As a single measurement isn't useful, the average latency is measured by counting the latencies of all target transactions and dividing that by the number of target transactions. The counter definition above is the definition of target transactions for this example.

A pair of counters is required to accumulate the total latency time. PAIR0_CNT0 is set up to create a signal that lasts for the duration of the transaction. The start event of the transaction (for example, Lock sent to NCMH) is the Event Source; the stop event (Unlock sent) is programmed as the Event Source input to the Partner counter and is used by CNT0 as the reset source. The compare register for this counter is initialized with one, and the compare output is sent to the partner as the Status Output.

Set up the NCMH event registers for a Lock Latency transaction: Event0 is the Lock, Event1 is the Unlock. The monitoring event is requested by BCS NCMH Lock Message Latency:

  Pair PME0, bits 29:28 (in binary): BCS NCMH Lock Message 01 (Lock Latency Event: Lock message sent, 01)
  Pair PME1, bits 29:28 (in binary): BCS NCMH Unlock Message 10

Set up PMCC for the Interval Timer or Local Count Enable method of running the monitor. Collect the results by reading the counter PMD registers. Note that PAIR0_CNT0 is not read, as it is not interesting.

The PAIR0_CNT0 PMR for this event should have the fol
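The averaging step described in this excerpt (total latency cycles divided by the number of target transactions) can be illustrated with a minimal Python sketch. The function name and the sample readings are hypothetical; they stand in for the values read from the counter PMD registers and are not taken from the guide.

```python
def average_latency_cycles(total_latency_cycles, transaction_count):
    """Average latency, in cycles, per target transaction.

    total_latency_cycles: sum of the latencies of all target transactions
                          (assumed to come from the cycle-counting counter).
    transaction_count:    number of target transactions
                          (assumed to come from the transaction counter).
    """
    if transaction_count <= 0:
        raise ValueError("no target transactions were counted")
    return total_latency_cycles / transaction_count


# Hypothetical readings, for illustration only.
print(average_latency_cycles(1_250_000, 5_000))
```

Guarding against a zero transaction count matters in practice, since a run in which the monitored event never fires would otherwise fail with a division error.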
85. ers information that helps target the program's potential hotspots.

4.1 Environment

It is highly recommended to use the module file provided, so that the environment is set correctly before using the tools (see Section 2.4 bullx DE Module Files).
1. Load the bullxprof module file:
     module load bullxprof/<version>
2. Load the MPI (bullx MPI, OpenMPI, or Intel MPI) environment when profiling an MPI parallel program.

The PAPI environment needed for hwc profiling is automatically loaded by the bullxprof module file. When profiling an Intel-compiled application, the Intel environment must be loaded before the MPI environment.

4.2 Usage

The bullxprof command line is launched as follows:
  bullxprof [bullxprof options] program [prog args]

In a parallel context, you can use bullxprof along with mpirun or srun as follows:
  mpirun [mpirun args] bullxprof [bullxprof args] program [program args]
  srun [srun args] bullxprof [bullxprof args] program [program args]

4.3 Command Line Options

bullxprof can be configured at run time with the following command line switches:

  -d, --debug <debuglevel>
      Sets the tool's verbosity level: off, 1 (low), 2 (medium), and 3 (high).
  -e, --experiments <exp1,exp2,...,expN>
      Determines which profiling experiments will be run. Possible experiments are timing applic
86. es 0 The response to the Type field message has been received: DataC for Read, for Write, and last Snp for Snoop.

Non Coherent Manager Unit

manages non-coherent transactions through and interfaces. NCMH event types are monitored by setting fields in the PMNC0 or PME registers. The following event types can be monitored in NCMH. Descriptive information is in addition to the general description at the beginning of this section.
1. Interface: measure non-coherent traffic from NC to QPI or
2. Buffer Occupation: measure occupation of or Tracker buffers.
3. Error: measure ECC errors in NC register files.
4. Traffic Identification: two choices for traffic direction, QPI to and to QPI. Traffic identification can be made using the outgoing mask (enabled, updated) and the incoming mask (enabled, RHNID), in addition to Transaction Type.
5. Transaction Latency: measure latency of selected transactions from the tracker.
6. Lock Latency: measure latency of Lock transactions.

A depiction of the 74-bit PMNC PME register follows. It is shown in 32-bit packets, as that is how it is read and written in Configuration Access mode using the BCS CSR. Field description details can be found in the PMNC Event Configuration Register Description.

Transaction
87. executing IO functions
  Total time spent executing IO read-like functions
  Total volume of data read (MB)
  Total volume of data read in a second (MB/s)
  Total time spent executing IO write-like and close functions
  Total IO write volume: Total volume of data written (MB)
  Total IO write bandwidth: Total volume of data written in a second (MB/s)

The detailed report, produced when the trace level is set to 2, gives the following information for each POSIX IO function:
  Calls: Total number of calls for the IO function
  Execution time: Time spent executing the IO function
  Percentage: Percentage of walltime

Parallel Program

In a parallel context, the summary report, produced when the trace level is set to 1, gives information about three (3) groups of POSIX IO functions:
  Read: read-like functions (e.g. read, readv, etc.)
  Write: write-like (e.g. write, pwrite, etc.) and close functions
  Total: All POSIX IO functions

For each group of POSIX IO functions, the summary report gives the following information:
  Max IO time (rank): The maximum time spent in the functions of the group, and the candidate process rank
  Min IO time (rank): The minimum time spent in the functions of the group, and the candidate process rank
  Average time: The average time spent in the functions of the group
  Percentage of IO time: The percentage of time spent in the MPI region
  Percentage of walltime: The percentage of walltime
  Ma
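To make the per-group summary fields concrete, here is a minimal Python sketch of how the maximum, minimum, and average IO time, and a percentage of walltime, could be derived from per-rank timings. It is an illustration only: the function name, the input layout (a mapping from rank to seconds), and the choice to express the average time as a percentage of walltime are assumptions, not bullxprof's actual implementation.

```python
def summarize_io_times(io_time_by_rank, walltime):
    """Summarize per-rank IO times for one group of POSIX IO functions.

    io_time_by_rank: dict mapping MPI rank -> seconds spent in the group
                     (hypothetical input layout).
    walltime:        total wall-clock time of the run, in seconds.
    """
    max_rank = max(io_time_by_rank, key=io_time_by_rank.get)
    min_rank = min(io_time_by_rank, key=io_time_by_rank.get)
    average = sum(io_time_by_rank.values()) / len(io_time_by_rank)
    return {
        "max_io_time": (io_time_by_rank[max_rank], max_rank),
        "min_io_time": (io_time_by_rank[min_rank], min_rank),
        "average_time": average,
        # Assumption: the percentage is taken on the average time.
        "percentage_of_walltime": 100.0 * average / walltime,
    }


# Hypothetical timings for three ranks and a 10-second run.
summary = summarize_io_times({0: 1.5, 1: 2.5, 2: 2.0}, walltime=10.0)
print(summary)
```

Reporting the rank alongside the extreme values is what makes the summary actionable, since it points at the process to inspect first.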
88. f large systems.

Darshan Usage

Using Darshan consists of loading a module file, which sets the different paths for binaries and libraries. The user is also reminded to set the DARSHAN_LOGPATH variable to the directory where Darshan's log files should be located.

Darshan instruments applications via either compile-time wrappers for static executables or dynamic library preloading for dynamic executables.

The Darshan package provides several module files, described below:

• The following module files are to be loaded to use Darshan with applications compiled with bullx MPI (or any OpenMPI-based MPI implementation) and using GNU compilers:

  darshan/<version>_bullxmpi_gnu_noinst
    Intended for dynamically linked binaries; it prepends the Darshan library to the LD_PRELOAD environment variable. No recompilation of the user application is needed.

  darshan/<version>_bullxmpi_gnu_inst
    For use with static executables; the application must be recompiled with the provided Darshan wrappers.

• The following module files are to be loaded to use Darshan with applications compiled with bullx MPI (or any OpenMPI-based MPI implementation) and using Intel compilers:

  darshan/<version>_bullxmpi_intel_noinst
    Intended for dynamically linked binaries; it prepends the Darshan library to the LD_PRELOAD environment variable. No recompilation of the user application is needed. The Intel comp
89. flit waiting to be emitted
  RO to flow0 (2 flits)
  RO to OB flow1 (2 flits)
  RO to OB flow0 and flow1 (4 flits)
  to ROCI flow0 and flow1 (unused in

4 Event Counts and Counter Threshold Comparisons

There are four Performance Monitor Counters, each comprising a counter and a data storage register, the Performance Monitoring Data register (PMD). Counting is enabled by selecting a Counter Enable source: either Local Enable, Interval Timer, or the counter's partner. It is important to note that Local Enable and Interval Timer are controlled by the global registers (PERFCON and PTCTL) and are mutually exclusive, meaning that all counters making this selection will receive the same enable source. For example, one cannot choose Local Enable for one counter and Interval Timer for another.

Each PMD can be compared with its own Performance Monitoring Compare register (PMC). There are two comparison modes: maximum compare and compare then update. In maximum compare mode, the PMC is loaded with an initial value and a notification occurs when the PMD reaches this value. In compare then update mode, the PMC is loaded each time the PMD exceeds the PMC value.

Each PM Counter is controlled by a Performance Monitoring Resource Control and Status register (PMR). The fields that carry out the actions described above are listed below:
1. unit selection for events, or no event: select the units whose events are to be monitor
90. for experienced application programmers and tool developers who need fine-grained measurements and control of the PAPI interface. Unlike the high-level interface, it allows both PAPI preset and native event measurements.

The low-level API makes it possible to get information about the executable and the hardware, and to set options for multiplexing and overflow handling. Compared with the high-level API, the low-level API increases efficiency and functionality.

An Event Set is a user-defined group of hardware events (preset or native) which together provide meaningful information. The users specify the events to be added to the Event Set, and attributes such as the counting domain (user or kernel), whether or not the events are to be multiplexed, and whether the Event Set is to be used for overflow or profiling. PAPI manages other Event Set settings, such as the low-level hardware registers to use, the most recently read counter values, and the Event Set state (running / not running).

Following is a simple code example using the low-level API. It applies the same technique as the high-level example.

  #include <papi.h>
  #include <stdio.h>
  #include <stdlib.h>   /* needed for exit() */
  #define NUM_FLOPS 10000

  int main(void)
  {
      int retval, EventSet = PAPI_NULL;
      long long values[1];

      /* Initialize the PAPI library */
      retval = PAPI_library_init(PAPI_VER_CURRENT);
      if (retval != PAPI_VER_CURRENT) {
          fprintf(stderr, "PAPI library init error!\n");
          exit(1);
91. g tool which uses the PAPI interface to access the hardware performance event counters of most processors. It is possible to monitor a single thread or the entire system with bpmon. The set of events that can be measured depends on the underlying processor. In general, bpmon gives access to all processor-specific performance events.

bpmon can monitor the performance of the application or of the node(s). Command execution performance can be monitored by bpmon. For example, the command below gives the following output.

bpmon Syntax

  bpmon INSTRUCTIONS RETIRED LLC MISSES MEM LOAD RETIRED L3 MISS MEM UNCORE RETIRED LOCAL DRAM MEM UNCORE RETIRED REMOTE DRAM opt hpctk test cases llclat S 4 i 256 200 o r

  Run a single copy of the test on the current thread
  Started Timing Reads
  Command is Reads with Range 200 MB and Stride 256 B with Average Time 63.533 ns
  Elapsed Time of Run of Current Thread is 37.880739086

  BPMON Single Thread Event Results
  Event Description                     Event Count
  INSTRUCTIONS RETIRED                  10807933019
  LLC MISSES                            537361852
  MEM LOAD RETIRED L3 MISS              536834525
  MEM UNCORE RETIRED LOCAL DRAM         536834304
  MEM UNCORE RETIRED REMOTE DRAM        67

  Elapsed time 37.893312 seconds

6.2.1 bpmon Reporting Mode

For all or a subset of node processors, bpmon pr
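Raw event counts such as those in the example output are usually easier to interpret as ratios. The following Python sketch derives two such ratios from the counts shown above; it is only an illustration, and the function names and the choice of metrics (last-level cache misses per thousand instructions, and the share of DRAM accesses served remotely) are assumptions rather than anything bpmon itself reports.

```python
# Event counts copied from the example bpmon output above.
counts = {
    "INSTRUCTIONS RETIRED": 10807933019,
    "LLC MISSES": 537361852,
    "MEM UNCORE RETIRED LOCAL DRAM": 536834304,
    "MEM UNCORE RETIRED REMOTE DRAM": 67,
}


def misses_per_kilo_instruction(misses, instructions):
    """Cache misses per 1000 retired instructions."""
    return 1000.0 * misses / instructions


def remote_dram_fraction(local, remote):
    """Fraction of DRAM-served loads that came from remote memory."""
    return remote / (local + remote)


mpki = misses_per_kilo_instruction(
    counts["LLC MISSES"], counts["INSTRUCTIONS RETIRED"]
)
remote = remote_dram_fraction(
    counts["MEM UNCORE RETIRED LOCAL DRAM"],
    counts["MEM UNCORE RETIRED REMOTE DRAM"],
)
print(f"LLC misses per 1000 instructions: {mpki:.2f}")
print(f"Remote DRAM fraction: {remote:.2e}")
```

For this run the remote fraction is negligible, which is consistent with a test whose memory accesses are served almost entirely by local DRAM.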
92. he hpcrun script, and the optional epilogue script will be executed as the last step in the bhpcrun script. The optional data script will be executed just after the hpcrun script has completed, but prior to the move of the profile data into the History Repository, allowing the user to manipulate the profile data prior to its insertion. In addition, a maximum run time value can be provided to limit the execution time of the bhpcrun test run.

Hotplot Component (bhpcprof)

This component is a wrapper around the HPCToolkit hpcprof component. It provides an interface that can be used to add value to a performance database. The bhpcprof wrapper performs these actions:
• Collect the information from the code passport that would normally be used by hpcprof to build a performance database
• Optionally call a user-provided command script to allow the user to modify the set of data to be passed to hpcprof
• Execute the hpcprof component to build a performance database as an XML file intended to be displayed by the GUI viewer
• Use the Passport Library to write the performance database created by hpcprof to the specified code passport in the designated History Repository location:

  The performance database and supporting files:
    <project>/<code passport>/bhpcprof/<data origin>/perf_db/callpath.xml
    <project>/<code passport>/bhpcprof/<data origin>/perf_d
93. his section is divided into two or three sub-sections:
• General statistics section: contains statistics for the whole application
• Per process average section: contains averages per process
• Messages sizes partitions section: displays the distribution of messages among the partitions. This section is only present if there are several partitions.
• For each statistic, we distinguish point-to-point communications from collective communications.

Example

General statistics
Total time: 0.009303s (0:00:00.009303)
                      pt2pt         coll          total
Messages count        4400          1012          5412
Volume                3.2752MB      2.10822MB     5.38342MB
Avg message size      744B          2.08322kB     995B
Std deviation         1216.4        1989.1        1488.4
Variation coef.       1.6341        0.95481       1.4963
Frequency msg/s       472.966k      108.782k      581.748k
Throughput B/s        352.06MB/s    226.62MB/s    578.68MB/s

Per process average
                      pt2pt         coll          total
Messages count        1100          253           1353
Volume                818.8kB       527.054kB     1.34585MB
Frequency msg/s       118.241k      27.1955k      145.437k
Throughput B/s        88.015MB/s    56.654MB/s    144.67MB/s

Messages sizes partitions
                      pt2pt count    coll count     total count
0 <= sz < 1000        3.2e+03 73%    5.1e+02 51%    3.7e+03 69%
1000 <= sz < 1000000  1.2e+03 27%    5e+02 49%      1.7e+03 31%
1000000 <= sz         0 0%           0 0%           0 0%

The message sizes partitions should be examined first.

Where:
  Total time: Total execution time between MPI_Init and MPI_Finalize
  Messages count: Number of
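The derived rows of the general statistics table appear to follow directly from the message count, the volume, the standard deviation, and the total time. The Python sketch below recomputes them under that assumption; the function name `derived_statistics` and the reading of MB as 10^6 bytes are assumptions made for illustration, not part of profilecomm. With the pt2pt figures from the example, it reproduces the printed values to the displayed precision.

```python
def derived_statistics(count, volume_bytes, std_deviation, total_time_s):
    """Recompute the derived statistics rows from the raw ones.

    The relationships used here are assumed from the example figures:
    average size = volume / count, variation coefficient = std / average,
    frequency = count / time, throughput = volume / time.
    """
    avg_size = volume_bytes / count
    return {
        "avg_message_size": avg_size,                # bytes
        "variation_coef": std_deviation / avg_size,  # dimensionless
        "frequency": count / total_time_s,           # messages per second
        "throughput": volume_bytes / total_time_s,   # bytes per second
    }


# pt2pt column of the example: 4400 messages, 3.2752 MB, std deviation 1216.4,
# total time 0.009303 s.
pt2pt = derived_statistics(4400, 3.2752e6, 1216.4, 0.009303)
for name, value in pt2pt.items():
    print(f"{name}: {value:.4f}")
```

A variation coefficient above 1, as in the pt2pt column, indicates that message sizes are widely spread around the average, which is one reason to look at the message sizes partitions first.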
94. ile path

  <project>          string; identifies the user or group running the test; provided by the user when bhpcstart is run
  <code passport>    <simple passport> yyyymmdd hhmmss; the timestamp is added when the passport is created
  <simple passport>  string; identifies the application and/or test being run; provided by the user when bhpcstart is run
  <test tool>        string; name of the tool that generated the test results (bhpcstruct, bhpcrun, bhpcprof, bhpcprof-mpi)
  <data origin>      <system> <rank>
  <system>           string; system generating the test results
  <rank>             string; MPI rank of the process generating the test results; not present if not an MPI job
  <file path>        string; file or directory pathname relative to <data origin>, often just a simple file name
  <pathname>         string; path to a file outside of the repository; may be absolute or relative

History Repository Environment Variables

The Bull HPCToolkit extension uses an environment variable to define the location of the History Repository. The environment variable BHPCTK ROOT must be set to the path name of the repository root. In this release, it is a requirement that the repository root path be locally accessible from all nodes used in the test run. The environment variable allows multiple repositories on the same system; it also allows multiple users to share the same repository. REPO is used by the passport library to locate the History Repository when applica
95. ilers environment, followed by the bullxmpi environment, must be loaded before loading this module file. Please use the compilervars.sh script provided by Intel to load the Intel compilers environment.

  darshan/<version>_bullxmpi_intel_inst
    For use with static executables; the application must be recompiled with the provided Darshan wrappers.

• The following module files are to be loaded to use Darshan with applications compiled with Intel MPI:

  darshan/<version>_intelmpi_noinst
    Intended for dynamically linked binaries; it prepends the Darshan library to the LD_PRELOAD environment variable. No recompilation of the user application is needed. The Intel compilers environment, followed by the Intel MPI environment, must be loaded before loading this module file. Please use the compilervars.[c]sh script provided by Intel to load the Intel compilers environment, and mpivars.[c]sh to load the Intel MPI environment.

  darshan/<version>_intelmpi_inst
    For use with static executables; the application must be recompiled with the provided Darshan wrappers.

7.2.2 Darshan log files

Before using Darshan, the location of the tool-generated traces has to be set. This can be done by setting the DARSHAN_LOGPATH environment variable to an existing location:

  export DARSHAN_LOGPATH=<path to logs>

7.2.3 Compiling with Darshan

To allow trace generation with Darshan the
96. ion 341 6
  ------ /opt/modules/3.1.6/modulefiles ------
  dot  module-info  null  module-cvs  modules  use.own
  ------ /opt/modules/modulefiles ------
  oscar-modules/1.0.3(default)

Modules available to the user are listed under the line /opt/modules/modulefiles.

The command to load a module is:
  module load <module name>

The command to verify the loaded modules list is:
  module list

When using the avail command, some modules may be marked (default):
  module avail
These are the modules that are loaded when the user does not specify a module version number. For example, the following commands are the same:
  module load configuration
  module load configuration/2

The module unload command unloads a module. The module purge command clears all the modules from the environment:
  module purge

It is not possible to load modules that include different versions of intel cc or intel fc at the same time, because they cause conflicts.

2.4 bullx DE Module Files

The bullx Development Environment provides module files for all the embedded tools, which help to configure the user's environment (see Sections 2.2 and 2.3). The following command loads the bullx DE main module:
  module load bullxde

Loading this module makes the tool modules available; these can be listed by using the module avail command, as shown in the example below.

Example
  module avail
97. ion and the candidate process rank
  Max time (rank): maximum time spent in the region executing the function, and the candidate process rank
  Average time: The average time spent in the function
  Percentage: Percentage of walltime spent in the region

The detailed report, produced when the trace level is set to 2, gives a region report with the following information for each function:
  Min time (rank): minimum time spent in the region executing the function, and the candidate process rank
  Max time (rank): maximum time spent in the region executing the function, and the candidate process rank
  Average time: The average time spent in the function
  Region: Percentage of the time spent in the region for the function
  Walltime: Percentage of the walltime for the function

4.5.2 HWC experiment

The hwc experiment computes hardware metrics using one or multiple PAPI hardware counters. The metric computation is limited by the availability of the underlying PAPI counters. A selected metric might not be displayed when the PAPI hardware counters needed for its computation are not available. In that case, a message is logged into the bxprof.err file created in the bullxprof launch directory.

Sequential Program

For a sequential program, the summary report, produced when the trace level is set to 1, gives the global value of the user-selected HW metrics. The detailed report produced
98. l flow graph, and then uses interval analysis to identify loop nests within the control flow. It combines this information with compiler-generated line map information in a way that allows HPCToolkit to correlate the samples associated with machine instructions to the program's procedures and loops. This correlation is possible even in the presence of optimizations such as inlining, loop transformations such as fusion, and compiler-generated loops from scalarization of Fortran 90 array operations or array copies induced by Fortran 90's calling conventions.

hpcprof
  hpcprof correlates the raw profiling measurements from hpcrun with the source code abstractions produced by hpcstruct. hpcprof generates high-level metrics in the form of a performance database called the Experiment database, which uses the Experiment XML format, for use with hpcviewer.

hpcprof-flat
  The flat-view version of hpcprof; it correlates measurements from hpcrun-flat with the program structure produced by hpcstruct.

hpcproftt
  hpcproftt correlates flat profile metrics with either source code structure or object code, and generates textual output suitable for a terminal. hpcproftt also generates textual dumps of profile files.

hpcprof-mpi
  hpcprof-mpi correlates, in parallel, the call path profiling metrics produced by hpcrun with the source code structure created by hpcstruct. It produces an Experiment database for use with the hpcviewer or hpctraceviewer tool. hpcprof-mpi is especially designed f
99. lic Real Address CSR Attribute Function Description Name 1 8 310 pts PAIRO_CNT1_PMD ae FDnC_503C E 140F Current value PairO Counter current count high order 44 32 bits CNTO 0000 FDnC 504C 3 1413 Current value Pair CounterO current count low order 31 0 bits PAIRT 0000 FDnC 5050 3 1414 Current value Pair CounterO current count high order 44 32 bits PAIRT 1 0000 506043 1418 Current value Pairl Counter current count low order 31 0 bits PAIRT CNTI 0000 FDnC 5064 3 1419 Current value Pair Counter current count high order bits AE E E PMTIM 31 0 0000 501031404 Ro O 44 32 _ 0000 5014 3 1405 Table A 4 Performance Monitor Configuration Registers Event Configuration Registers Register Symbolic Real Address CSR Address Name essen man 2 i 0 1 2 3 k 0 4 8 C v lLCH PMLLI 0000 FDni 2000 k800 LLCH events Event bits 31 0 u_LLCH PMLLI 0000 FDni 2004 0 k801 LLCH events Event bit 32 for Inst O 1 i 4 5 0 4 Heim Mano er iens eer pu u HPM 0000_FDni_2004 FDni_2004 1 RW LH events events Event bit 32 mum umi i 8 9 A k 0 4 8 o 0000 0000 s 0004 2K01 RW i enis ve 22 112 bull
100. lowing settings:
  Counter Enable Source: local count enable / timer (001)
  Counter Status Output Source: partner (001)
  Count Mode: count events (00)
  Counter Event Source: unit event
  Counter and Status Reset Source: partner's incoming event (010)
  Compare Mode: max compare (01)
  Unit Event Source: ncmh (000000001)
  Unit Type Source: PE (00)

The PAIR0_CNT1 PMR for this event should have the following settings:
  Counter Enable Source: local count enable / timer (001)
  Counter Status Output Source: perfcon (000)
  Count Mode: count events (00)
  Counter Event Source: partner status (001)
  Counter and Status Reset Source: no reset
  Compare Mode: disabled (00)
  Unit Event Source: same as CNT0
  Unit Type Source: PE (00)

The PAIR0_CNT0 PMC for this event should have the Compare value set to 1.

PAIR0_CNT1 is set up to count cycles for the duration of the transaction (the sum of the latencies of all target transactions). The partner status (the comparison of the PAIR0_CNT0 PMD with the value 1) is the Event Source. Note that the Unit Event Source is set up for one of the PE units, but it is not being used as the Counter Event Source for this counter; it is being used by the partner as a reset source (remember the hard link between event0/counter0 and event1/counter1).

ECC Error Monitoring

You select the ECC errors you want
101. mand openss and every convenience script. Extensive information about how to use the Open SpeedShop experiments, and how to view the performance information in informative ways, is provided here:
  http www openspeedshop org wp wp content uploads 2013 04 OpenSpeedShop 202 User Manual v13 pdf

6.4 HPCToolkit

HPCToolkit provides a set of profiling tools to help improve the performance of the system. These tools perform profiling operations on executables and display information in a user-friendly way. An important advantage of HPCToolkit over other profiling tools is that it does not require the use of compile-time profiling options or re-linking of the executable.

Note: In this chapter, the term executable refers to a Linux program file in ELF (Executable and Linking Format) format.

HPCToolkit is designed to:
• Work at binary level to ensure language independence. This enables HPCToolkit to support the measurement and analysis of multi-lingual codes using external binary-only libraries.
• Profile instead of adding code instrumentation. Sample-based profiling is less intrusive than code instrumentation, and uses a modest data volume.
• Collect and correlate multiple performance metrics. Typically, performance problems cannot be diagnosed using only one type of event.
• Compute derived metrics to help analysis. Derived metrics, such as the bandwidth used for the memory, often provide insights that will in
102. mation

For a full workflow example and more about application performance analysis, see http://www.vi-hps.org/upload/material/tw11/Scalasca.pdf

For more information on Scalasca concepts and projects, see http://www.scalasca.org

5.3 xPMPI

xPMPI is a framework allowing the use of multiple tools. PMPI is the MPI profiling layer defined by the MPI standard to allow the interception of MPI function calls. By definition, only one tool can intercept a function and forward the call to the real implementation library. xPMPI acts as a PMPI multiplexer: it intercepts the MPI function calls and forwards each call to a chain of patched PMPI tools.

5.3.1 Supported tools

xPMPI allows the combination of the following PMPI tools:

IPM

IPM is a portable profiling tool for parallel codes. It provides a low-overhead profile of the performance aspects and resource utilization in a parallel program. Communication, computation, and I/O are the primary focus. At the end of a run, IPM dumps a text-based report in which aggregate wallclock time, memory usage, and flops are reported, along with the percentage of wallclock time spent in MPI calls, as shown in the following example:

##IPMv0.983####
# command : TF (completed)
# host : dakar1/x86_64_Linux    mpi_tasks : 4 on 1 nodes
# start : 09/14/12
103. mote memory and two for Local memory:

1. BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=1 NIDM=0x01 counts the number of CPU reads that are satisfied from a Remote node.
2. BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=1 NIDM=0x01 counts the number of CPU writes that are satisfied from a Remote node.
3. BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=0 NIDM=0x18 counts the number of CPU reads that are satisfied from the Local node.
4. BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=0 NIDM=0x18 counts the number of CPU writes that are satisfied from the Local node.

Command example:

bpmon -e
  BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=1 NIDM=0x01
  BCS_PE_REM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=1 NIDM=0x01
  BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=0 OCM=0xC NID=0 NIDM=0x18
  BCS_PE_LOM_Incoming_Traffic MC=HOM0 MCM=0xF OC=RdInvOwn OCM=0xF NID=0 NIDM=0x18
  llclat r 200 1 1 S r 200 4 1 o r S

llclat r 200 1 1 S r 200 4 1 o r S is the command being measured. This test generates 128M L3 Cache Read Misses. Only this workload must run on the system under test, so that the measurement results can be related to the workload, as BCS events cannot be limited to a specific process in the way that the CPU events can
104. n BUF WB 0 00000000000001 1

where the set of exclusive Buffer Names and their abbreviations are:

Write Buffer: WB
DCT: DCT
LOT: LOT
West TID Pool 0: WT0
West TID Pool 1: WT1
West TID Pool 2: WT2
West TID Pool 3: WT3
Sum of West TID Pools: WTA
East TID Pool: ETP
East NDR Virtual FIFO: ENDR
East SNP Virtual FIFO: ESNP
West HOM Virtual FIFO: WHOM
West SNP Virtual FIFO: WSNP
WSB: WSB

For Unit Event Source chosen as LoM0-3, here is an example:
BCS_PE_LOM_Buffer_Occupation BUF=WT0 THR>7    001100000011101

For Unit Event Source chosen as ReM0-3, here is an example:
BCS_PE_REM_Buffer_Occupation BUF=WT0 THR>7    001100000011101

Interface Monitoring

You select the direction of packet/flit flow. Then you can count the number of flits emitted. The definition will fill bits 106:105 of the PMPE_PME register. The PMR for the chosen counter for this event should have the values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3. For example:

Event                        Bits 106:105 (binary)
BCS_PE_Interface RT East     01
BCS_PE_Interface RT West     10

BCS_PE_Interface RT East counts the number of flits that have been emitted RT East to OB.
BCS_PE_Interface RT West counts the number of flits that have been emitted RT West to OB.

For Unit Event Source chosen as LoM0-3, here is an example:
BCS_PE_LOM_Interface RT East    01
BCS_PE_LOM_Interface RT West    10

For Unit Event Sou
105. n areas. A virtualization layer has been added so that it becomes possible to split a machine in terms of CPUs. The main motivation of this patch is to give the Linux kernel full administration capabilities concerning CPUs. CPUSETs are rigidly defined, and a process running inside this predefined area will not be able to run on other parts of the system. This is useful for:

• Creating sets of CPUs on a system, and binding applications to them.
• Providing a way of creating sets of CPUs inside a set of CPUs, so that a system administrator can partition a system among users, and users can further partition their partition among their applications.

Typical Usage of CPUSETs

• CPU-bound applications: Many applications (as is often the case for cluster apps) used to have a "one process on one processor" policy, using sched_setaffinity() to define this. But what if several such apps have to run at the same time? This can be done by creating a CPUSET for each app.
• Critical applications: Processors inside strict areas may not be used by other areas. Thus, a critical application may be run inside an area with the knowledge that other processes will not use its CPUs. This means that other applications will not be able to lower its reactivity. This can be done by creating a CPUSET for the critical application, and another for all the other tasks.

Bull CPUSETs

CPUSETs are integrated in the standard Linux kernel. However, the Bull kernel includes the f
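The "one process on one processor" policy mentioned above can be illustrated with a minimal sketch. It assumes a Linux system and uses Python's standard os.sched_getaffinity() and os.sched_setaffinity() wrappers around the system call; the helper name pin_to_first_allowed_cpu is hypothetical and is not part of any Bull tool. The process can only be pinned to a CPU that its current CPUSET already allows.

```python
import os


def pin_to_first_allowed_cpu(pid=0):
    """Restrict `pid` (0 = calling process) to one CPU of its allowed set.

    Hypothetical helper for illustration only. Returns the CPU set
    before and after the call.
    """
    allowed = os.sched_getaffinity(pid)   # CPUs the process may use now
    target = min(allowed)                 # lowest-numbered allowed CPU
    os.sched_setaffinity(pid, {target})   # bind the process to that CPU
    return allowed, os.sched_getaffinity(pid)


if __name__ == "__main__":
    before, after = pin_to_first_allowed_cpu()
    print("allowed before:", sorted(before))
    print("allowed after: ", sorted(after))
    os.sched_setaffinity(0, before)       # restore the original mask
```

Running several such applications at once with hard-coded CPU numbers would make them compete for the same processors, which is the situation CPUSETs are meant to avoid.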
106. nSnp: Undef, RspI, RspS, RspCnflt, RspIWb, RspSWb, RspFwdI, RspFwdS, RspFwdIWb, RspFwdSWb
homeReq: Undef, DataC, DataC S F, DataC_E M

0 0  snSnp: snoop Snoopy nodes
0 1  dnSnp: snoop Directory nodes
1 0  access memory

[Figure: register field layout for Traffic Identification events. Fields: Transaction Type (msgclass, opcode; required for all Traffic ID Events), Transaction Type Mask (msgclass, opcode), Node ID (DNID for outgoing, RHNID for incoming), NID Mask, Lookup Directory Status, Tracker Output. Directions: Incoming, Outgoing, Lookup Response, Tracker Output.]

Lookup Directory Status: No Event; Exclusive State; Shared State and 3 sharers; Shared State and 2 sharers; Shared State and 1 sharers; Invalid State

Node ID: Specific NID; All NIDs; CA (for default configuration); Ubox (for default configuration)

Transaction Type: Specific type; All types

Message classes:
HOM: Home Request
HOM: Home Response & Writes (commands only)
NDR: Non Data Response
SNP: Snoop
NCS: Non Coherent Standard
NCB: Non Coherent Bypass
DRS: Data Response
SPC: Special Control

Lookup Response
107. ng long values[NUM_EVENTS];

/* Start counting events */
if (PAPI_start_counters(Events, NUM_EVENTS) != PAPI_OK)
    handle_error(1);

/* Defined in tests/do_loops.c in the PAPI source distribution */
do_flops(NUM_FLOPS);

/* Read the counters */
if (PAPI_read_counters(values, NUM_EVENTS) != PAPI_OK)
    handle_error(1);
printf("After reading the counters: %lld\n", values[0]);

do_flops(NUM_FLOPS);

/* Add the counters */
if (PAPI_accum_counters(values, NUM_EVENTS) != PAPI_OK)
    handle_error(1);
printf("After adding the counters: %lld\n", values[0]);

/* double a b c b 10000 times */
do_flops(NUM_FLOPS);

/* Stop counting events */
if (PAPI_stop_counters(values, NUM_EVENTS) != PAPI_OK)
    handle_error(1);
printf("After stopping the counters: %lld\n", values[0]);

Output:

After reading the counters: 441027
After adding the counters: 891959
After stopping the counters: 443994

Note that the second value (after adding the counters) is approximately twice as large as the first value (after reading the counters). This is because PAPI_read_counters resets the counters and leaves them running; PAPI_accum_counters then adds the current counter value into the values array.

6.1.2 Low-level PAPI Interface

The low-level PAPI interface manages hardware events in user-defined groups called Event Sets. It is particularly well designed
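The read, accumulate and stop behaviour described in the note above can be reproduced with a toy model. This is only a sketch of the semantics: ToyCounter and its methods are hypothetical names, not the PAPI API, and the work amounts are arbitrary.

```python
class ToyCounter:
    """Toy stand-in for one running hardware counter (not the PAPI API)."""

    def __init__(self):
        self.count = 0

    def work(self, n):
        # The application performs n counted operations.
        self.count += n

    def read(self, values):
        # Like PAPI_read_counters: copy the count, then reset and keep running.
        values[0] = self.count
        self.count = 0

    def accum(self, values):
        # Like PAPI_accum_counters: add the count to the array, then reset.
        values[0] += self.count
        self.count = 0

    def stop(self, values):
        # Like PAPI_stop_counters: store the final count in the array.
        values[0] = self.count
        self.count = 0


def demo(flops=100):
    """Run the same three-phase sequence as the C example; return the prints."""
    counter, values, seen = ToyCounter(), [0], []
    counter.work(flops)
    counter.read(values)
    seen.append(values[0])      # after reading
    counter.work(flops)
    counter.accum(values)
    seen.append(values[0])      # after adding: previous value + new count
    counter.work(flops)
    counter.stop(values)
    seen.append(values[0])      # after stopping: only the last phase
    return seen


if __name__ == "__main__":
    print(demo())
```

With equal work in each phase, the second value is twice the first and the third equals the first, which mirrors the 441027 / 891959 / 443994 pattern in the sample output.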
108. nt types are monitored in the PE units (ReM and LoM) by setting fields in the PMPE0 or PMPE1 PME registers in the selected unit. Each unit consists of four instances, which must have identical settings for their PME registers. For example, if you have chosen to monitor an event using the PME register in ReM, all four PMPE0 PME registers in ReM must have the same value. In the cases where only one instance event is to be used, such as measuring average latency, the event registers should still be set up the same for all instances, with the counter control registers selecting only one instance.

The following event types can be monitored in the PE. Descriptive information is in addition to the general description above.

1. Interface: measure traffic from a PE block to OB. Can choose either West (Caching Agent side) or East (Home Agent side).
2. Buffer Occupation: the size of the buffer is in parentheses.
3. Errors: measure directory, Tracker and Virtual Output FIFO ECC errors.
4. Traffic Identification: four choices for traffic direction:
   a. Incoming Traffic: incoming traffic can be identified by mask-enabled Request or Home Node ID (RHNID), in addition to Transaction Type.
   b. Outgoing Traffic: outgoing traffic can be identified by mask-enabled Destination Node ID, in addition to Transaction Type.
   c. Tracker Output Traffic: measur
109. ofilecomm allows the numeric matrices to be split according to the size of the messages. This feature is activated by setting the PFC_PARTITIONS environment variable. By default there is only one partition, i.e. the numeric matrices are not split.

The PFC_PARTITIONS environment variable must be of the form [partitions][:limits], in which partitions represents the number of partitions and limits is a comma-separated list of sorted numbers representing the size limits in bytes. If limits is not set, profilecomm uses the built-in default limits for the requested partition number.

Example 1: 3 partitions using the default limits (1000, 1000000)

export PFC_PARTITIONS="3:"

Example 2: 3 partitions using user-defined limits (in this case, the partition number can be safely omitted)

export PFC_PARTITIONS="3:500,1000"
export PFC_PARTITIONS=":500,1000"

Note: profilecomm supports a maximum of 10 partitions only.

5.1.5 profilecomm Data Analysis

To analyze data collected with profilecomm, the readpfc command and other tools (including spreadsheets) can be used. The main features of readpfc are the following:

• Displaying the data contained in profilecomm files
• Exporting communication matrices in standard file formats

5.1.5.1 readpfc syntax

readpfc [options] [file]

If file is not specified, readpfc reads the default file (mpiprofile.pfc) in the current directory.
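How a PFC_PARTITIONS value maps a message size to a partition can be sketched as follows. This is an illustration only, not profilecomm code: the function names are hypothetical, the colon between the partition count and the limits is an assumption, and so is the choice to place a message whose size equals a limit in the upper partition.

```python
from bisect import bisect_right

MAX_PARTITIONS = 10  # maximum stated in the note above


def parse_pfc_partitions(value):
    """Parse an assumed 'partitions:limits' string into (count, limits).

    An empty limits list means the built-in defaults would be used.
    """
    count_part, _, limits_part = value.partition(":")
    limits = [int(x) for x in limits_part.split(",") if x.strip()]
    if limits != sorted(limits):
        raise ValueError("limits must be sorted")
    # With N limits there are N + 1 partitions, so the count may be omitted.
    count = int(count_part) if count_part.strip() else len(limits) + 1
    if limits and count != len(limits) + 1:
        raise ValueError("partition count does not match the limits")
    if not 1 <= count <= MAX_PARTITIONS:
        raise ValueError("unsupported number of partitions")
    return count, limits


def partition_index(size, limits):
    """Return the 0-based partition for a message of `size` bytes."""
    return bisect_right(limits, size)


if __name__ == "__main__":
    count, limits = parse_pfc_partitions("3:500,1000")
    for size in (10, 500, 2000):
        print(size, "bytes -> partition", partition_index(size, limits))
```

With the limits 500 and 1000, the sketch places a 10-byte message in the first partition and a 2000-byte message in the third.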
110. ol and principally show the processes that are part of each group.

• Updates to take advantage of information in the performance database.

Syntax

To run the enhanced Bull HPCToolkit viewer application, use the bhpcviewer command:

bhpcviewer

See: The bhpcviewer application Help menu, and then Bull Extensions Manual, for more information about the bhpcviewer.

[Figure: screenshot of the history repository tree, showing projects and code passports such as IMB EXT P200 E2 G4 20120516 194322 and MpiSpinWheels P200 E2 G4 20120516 182002, with component directories (bhpcprof mpi, bhpcrun, bhpcstruct) and their stderr and stdout files; the Help menu opens BullHpcviewerExtensions.html.]

Bull Extensions to Rice Hpcviewer Package

This manual pro
111. ollowing additional CPUSET features:

Migration: Change on the fly the execution area for a whole set of processes (for example, to give more resources to a critical application). When you change the CPU list of a CPUSET, all the processes that belong to the CPUSET will be migrated to stay inside the CPU list, if and as necessary.

Virtualization: Translate the masks of CPUs given to sched_setaffinity() so they stay inside the set of CPUs. With this mechanism, processors are virtualized for the use of sched_setaffinity() and /proc information. Thus, any former application using this system call to bind processes to processors will work with virtual CPUs without any change. A new file is added to each CPUSET, in the CPUSET file system, to allow a CPUSET to be virtualized or not.

8.3.2 CPUSETs management tools

The ptools package provides a set of tools to help create, manage and delete CPUSETs:

• pcreate and pexec to create a CPUSET
• pdestroy to destroy a CPUSET
• pls to list the existing CPUSETs
• pshell to launch a shell within an environment created with pcreate or pexec
• pplace and passign to control the placement of processes on CPUs

See: The tools man pages for more details on their usage.

Appendix A. Performance Monitoring with BCS Counters

The performance monitoring implemented in the BCS chip provides a means for measuring system performance and de
112. ompiled with bullxMPI (or any based MPI implementation) and using GNU compilers.

• scalasca version bullxmpi intel: This module file is to be loaded to use Scalasca with applications compiled with bullxMPI (or any based MPI implementation) and using Intel compilers.
• scalasca version intelmpi: This module file is to be loaded to use Scalasca with applications compiled with Intel MPI and Intel compilers.

To be able to use Scalasca with an application, the first step is to recompile the application to get it instrumented. In addition to an almost automatic approach using compiler-inserted instrumentation, semi-automatic (POMP) and manual instrumentation approaches are also supported. Manual instrumentation can be used either to augment automatic instrumentation with region or phase annotations, which can improve the structure of analysis reports, or if other instrumentations fail.

Once the application is instrumented, the next steps are execution measurement collection and analysis, and analysis report examination.

Use the scalasca command with appropriate action flags to instrument application object files and executables, analyze execution measurements, and interactively examine measurement/analysis experiment archives.

Note: The PDT-based source-code instrumentation is not supported by this integrated version of Scalasca.

http://www.vi-hps.org/upload/material/tw11/Scalasca.pdf

More Infor
113. omplished by providing a project and a simple code passport name (one without a date time stamp).

• Setting an existing code passport to be the current one is done by providing the project name and the full code passport name, including the date time stamp.

Once the bhpcstart script is run, all other scripts that reference the current project and code passport only require the project name of the repository name.

Stop Component (bhpcstop)

When the bhpcstop wrapper is run, it clears the current code passport name for the input project, to stop future scripts from putting more data into this code passport.

Clean Component (bhpcclean)

When the bhpcclean wrapper is run, it removes the hpcrun metrics data collected from a previous run of the bhpcrun script. A user may wish to do this after creating a code passport and running the bhpcrun script, on finding that bhpcrun used incorrect parameters or that the wrong versions of software were installed on some of the systems. A user must run this script before being allowed to rerun bhpcrun. This is necessary because another run of bhpcrun, when there is already data collected, would cause the test case to contain invalid data. To present consistent data, all of the information must come from the same test run.

Compilation Component (bhpcstruct)

This component is a wrapper around the HPCToolkit hpcstruct component. For MPI applications bhpcstruct must be install
114. ons for the new views. This provides controls that affect the Grouped Metrics View and the Raw Metrics View.

• Grouped Metrics view: The idea behind the grouped metrics view is that, in any large run, some of the processes will behave differently from the others. The approach is to separate the processes into groups that showed similar behavior. The analyst can then decide that one group is running correctly and another incorrectly. After the grouping, the user has a few sets of processes that behaved differently from one another. This view only needs to present one set of data for each group, and the analyst only needs to compare the performance differences between the groups, not between all the processes.
• Raw Metrics view: This view shows the raw metric values for all of the processes at one program scope.
• Additional Grouping Features: The grouping tool features include:
  Algorithm to provide an initial optimum number of groups: By default, the grouping mechanism uses an algorithm that chooses the optimum number of groups. Alternatively, the user may specify the number of groups.
  Automatic hotspot detection: This helps the analyst focus on the program scopes that are of the most value to analyze, and highlights them using different colors that may be chosen by the user.
  Grouping properties view: The grouping properties are the results of the grouping to
115. or analyzing and attributing measurements from large-scale executions.

hpcviewer

hpcviewer presents the Experiment database produced by hpcprof or hpcprof-mpi, so that the user can quickly and easily view the performance databases generated.

6.4.2.5 Display Counters

The hpcrun tool uses the hardware counters as parameters. To know which counters are available for your configuration, use the papi_avail command. The hpcrun and hpcrun-flat tools will also give this information.

Vendor string and code   : GenuineIntel (1)
Model string and code    : 32 1
CPU Revision             : 0.000000
CPU Megahertz            : 1600.000122
CPU's in this Node       : 6
Nodes in this System     : 1
Total CPU's              : 6
Number Hardware Counters : 12
Max Multiplex Counters   : 32

The following correspond to fields in the PAPI_event_info_t structure.

Name          Code        Avail  Deriv  Description (Note)
PAPI_TOT_CYC  0x8000003b  Yes    No     Total cycles
PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
PAPI_L1_ICM   0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM   0x80000002  Yes    Yes    Level 2 data cache misses
PAPI_FSQ_INS  0x80000064  No     No     Floating point square root instructions
PAPI_FNV_INS  0x80000065  No     No     Floating point inverse instructions
PAPI_FP_OPS   0x80000066  Yes    No     Floating point operations

Of 103 possible events, 60 are available, of which 17 are derived.

The
116. or users who are not expert on the processor's native events. bpmon allows users to generate a list of available PAPI preset events, from which the event counts to be used can be chosen.

PAPI Processor Native Events

bpmon allows the user to generate a list of the processor's native events supported by PAPI. The user can then review the list and choose which ones to use.

See: Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2 (document order number 253669) for details of the performance events available for Intel processors.

6.2.3 BPMON with the Bull Coherent Switch

The Bull Performance Monitor tool (BPMON) includes the ability to report performance monitor events from the Bull Coherent Switch (BCS). The BCS is the Bull hardware that interfaces memory traffic between the four mainboard sockets and the next mainboard in multi-mainboard bullx supernode systems. These performance events provide an insight into the non-uniform memory architecture (NUMA) related behavior of the system.

The BCS capability is provided by adding a BCS component to the PAPI used with bpmon, and a BCS driver to provide an interface to the BCS hardware performance monitor.

The BCS performance monitor can collect counts for up to four BCS events simultaneously. Here is an example using the Traffic Identification performance event. Four Incoming Traffic events are collected, two for Re
117. other labels found in this configuration file. When a component encounters this special label, it locks the values provided with each of the labels in the list. If a label's value has been locked, the component is prevented from replacing it with a value found in a later configuration file.

Most components also support command line arguments, which follow the same rules described above for configuration file labels. The value provided in a command line argument replaces a configuration file value, unless that value was locked in one of the configuration files.

The lock directive provides an environment in which administrators can set configuration values for specific arguments in the etc bullhpctk xxx conf files that users cannot override (assuming that users have only read access to the config files in). If a directive is found that tries to change a locked value, the component prints a warning but continues to run using the value set prior to when it was locked.

6.5.5.1 Compilation Component Configuration File (bhpcstruct.conf)

The compilation component uses a configuration file named bhpcstruct.conf. A hypothetical configuration file for this component could look something like this:

User login level configuration for bhpcstruct
name democonf
hpcargs v 2
testcase opt hpctk test cases MpiSpinWheels
lock testcase

6.5.5.2 Parallel Manager Configuration File (bhpcrun.conf)

The parall
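The locking rules described above can be sketched as a small resolver. This is an illustrative model, not the Bull implementation: the function name resolve, the layer names, and the dictionary representation of a configuration file are all assumptions.

```python
def resolve(config_layers, cli_args=None):
    """Merge configuration layers in order, honoring lock directives.

    config_layers: list of (name, settings_dict, locked_labels) tuples,
    read from first (e.g. system-wide) to last (e.g. user-level).
    cli_args: dict of command line values, applied last.
    Returns (final_values, warnings).
    """
    values, locked, warnings = {}, set(), []

    def apply(settings, source):
        for label, value in settings.items():
            if label in locked:
                # A locked label keeps the value it had when it was locked.
                warnings.append(
                    f"{source}: '{label}' is locked, keeping {values.get(label)!r}"
                )
            else:
                values[label] = value

    for name, settings, lock_list in config_layers:
        apply(settings, name)
        locked.update(lock_list)   # lock after this file's values are set

    apply(cli_args or {}, "command line")
    return values, warnings


if __name__ == "__main__":
    layers = [
        ("system", {"testcase": "A", "hpcargs": "-v 2"}, ["testcase"]),
        ("user", {"testcase": "B", "hpcargs": "-v 3"}, []),
    ]
    final, notes = resolve(layers, {"testcase": "C", "name": "democonf"})
    print(final)
    for note in notes:
        print("warning:", note)
```

In this run the system-level testcase value survives both the user file and the command line, while the unlocked hpcargs value is replaced, and each rejected change produces a warning without stopping the run.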
118. ovides two reporting modes.

6.2.1.1 Processor Performance Reporting

Processor performance reporting lists a set (or sets) of performance events in tables, with one row per processor specified and the different performance events in columns. This can be set to repeat the reporting at regular intervals, as shown in the example below:

Experiment to measure L3 Cache Performance on each Processor without using Uncore Events
INSTRUCTIONS_RETIRED measures Total Instructions Executed
LLC_MISSES measures L3 Cache Misses
MEM_LOAD_RETIRED:L3_MISS measures L3 Data Cache Load Misses
MEM_UNCORE_RETIRED:LOCAL_DRAM measures L3 Data Cache Load Misses Satisfied from Local DRAM
MEM_UNCORE_RETIRED:REMOTE_DRAM measures L3 Data Cache Load Misses Satisfied from Remote DRAM
run time 30
event INSTRUCTIONS_RETIRED LLC_MISSES MEM_LOAD_RETIRED:L3_MISS MEM_UNCORE_RETIRED:LOCAL_DRAM MEM_UNCORE_RETIRED:REMOTE_DRAM
report event

A command example with its output is shown below:

Run from Terminal 1:
llclat 1 10 c 4

Run from Terminal 2:
sudo bpmon -c /opt/bullxde/perftools/bpmon/share/doc/bpmon/examples/1l13crw

Update in 30 seconds, ctrl-c to exit

BPMON CPU Event Results
CPU  INSTRUCTIONS_RETIRED  LLC_MISSES  MEM_LOAD_RETIRED:L3_MISS  MEM_UNCORE_RETIRED:LOCAL_DRAM  MEM_UNCORE_RETIRED
119. pcode Mask. The Buffer Occupation Monitoring fields are:

Transaction Type MsgClass (MC)
Transaction Type OpCode (OC)
Transaction Type MsgClass Mask (MCM)
Transaction Type OpCode Mask (OCM)

The definition will fill bits 27:9 of the PMNC_PME register. For example:

Event                                                   Bits 27:9 (binary)
BCS_NCMH_Tx_QPI_Alloc MC=DRS MCM=0xF OC=0 OCM=0         0101110111100000000
BCS_NCMH_Tx_XQPI_Alloc MC=DRS MCM=0xF OC=0 OCM=0        0111110111100000000
BCS_NCMH_Tx_QPI_Release MC=DRS MCM=0xF OC=0 OCM=0       1001110111100000000
BCS_NCMH_Tx_XQPI_Release MC=DRS MCM=0xF OC=0 OCM=0      1011110111100000000

The PMR for the chosen counter for this event should have the values shown in the NCMH Event Setup section. This is set up to count NCMH Transactions.

Lock Monitoring

Two ways are available to use the Lock Latency event:

1. As a counter, to count lock messages, and/or
2. As a timer, to accumulate the time that Locks are closed.

To set up the counter capability, you select one of the two counters listed below (the count results are expected to be the same). The definition will fill bits 29:28 of the PMNC_PME register. For example:

Event                      Bits 29:28 (binary)
BCS_NCMH_Lock_Message      01
BCS_NCMH_Unlock_Message    10

The PMR for the chosen counter for this event should have the values shown in the NCMH Event Setup section.

There are a number of different latency measurements that can be taken in the PE and NCMH units. A single measurement is taken by counting the numb
120. pi_noinst and darshan <version>_intelmpi_inst will not produce instrumentation for Fortran executables. They only work with C and C++ executables.

Chapter 8. Libraries and Other Tools

This chapter describes Boost libraries and other tools.

8.1 Boost

Boost is a collection of high-quality C++ libraries intended to be widely useful and usable across a broad spectrum of applications. Boost libraries are fully compliant with the C++ standard library and offer efficient means to manipulate threads, regular expressions, filesystem operations, smart pointers, strings, mathematical graphs, and many others.

Boost contains two types of libraries:

• Header-only libraries: These libraries are fully defined and implemented within C++ header files (.hpp files). Compiling an application with these libraries consists in telling the compiler where to find the Boost header files, with the -I compilation option. In the context of bullx DE, loading the Boost module automatically makes the Boost header files visible to the compiler through the CPATH environment variable.
• Shared or static libraries: To compile with these libraries, the compiler must be told where to find the libraries. In the context of bullx DE, the BOOST_LIB environment variable can be used to indicate the Boost libraries, as shown in the following example.

Compiling with Boost shared or static libraries:

g++ source.cpp -L$BOOST_LIB -lboost
121. plit into several matrices according to the size of the messages. The number of partitions and the size limits can be defined through the PFC_PARTITIONS environment variable.

In a point-to-point communication, the sender and receiver of each message are clearly identified; this results in a well-defined position in the communication matrix. In a collective communication, the initial sender(s) and final receiver(s) are identified, but the path of the message is unknown. The profilecomm library disregards the real path of the messages. A collective communication is shown as a set of messages sent directly by the initial sender(s) to the final receiver(s).

Execution Time

The measured execution time is the maximum time interval between the calls to MPI_Init and MPI_Finalize for all the processes. By default, the processes are synchronized during measurements. However, if necessary, the synchronization may be bypassed using an option of the profilecomm library.

Call Table

The call table contains the number of calls for each profiled function of each process. For collective communications, since a call generates an unknown number of messages, the values indicated in the call table do not correspond to the number of messages.

Histograms

profilecomm collects two message size histograms, one for point-to-point and one for collective communications. Each histogram contains the number of messages for sizes 0, 1 to 9, 10 to 99, 100 to 999, ..., 10^8 to
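The decade-based size classes listed above (0, 1 to 9, 10 to 99, 100 to 999, and so on) can be sketched as follows. This is only an illustration of the binning rule, not profilecomm code; the function names are hypothetical, and because the upper end of the last class is cut off in the text, the sketch leaves the top class uncapped.

```python
from collections import Counter


def histogram_bin(size):
    """Return the size class of a message: 0 for empty messages,
    otherwise the number of decimal digits (1 for 1-9, 2 for 10-99, ...)."""
    if size < 0:
        raise ValueError("message size cannot be negative")
    return 0 if size == 0 else len(str(size))


def build_histogram(sizes):
    """Count messages per size class, returned as a plain dict."""
    return dict(Counter(histogram_bin(s) for s in sizes))


if __name__ == "__main__":
    point_to_point = [0, 5, 12, 100, 999, 4096]
    for size_class, count in sorted(build_histogram(point_to_point).items()):
        print("class", size_class, "->", count, "message(s)")
```

Two such histograms would be kept side by side, one fed with point-to-point message sizes and one with collective message sizes.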
122. ponse (11). For this case, Incoming (00) is chosen.

The defaulted fields are:
Lookup Directory Status: 00000 (LST)
Tracker Output State: 00 (TOS)
Tracker Output Response Type Received: 0000000000 (TOR)

The filled fields are:
Direction: 00
Transaction Type MsgClass (MC)
Transaction Type OpCode (OC)
Transaction Type MsgClass Mask (MCM)
Transaction Type OpCode Mask (OCM)
Node ID (NID)
NID Mask (NIDM)

The definition will fill bits 64:20 of the PMPE_PME register. The PMR for the chosen counter for this event should have the values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3. For example (bits 64:20 in binary):

BCS_PE_Incoming_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0
001110000011110000000000000000000000000000000

This counts DRS transaction types, for all opcodes, for all RHNIDs.

For Unit Event Source chosen as LoM0-3, here is an example:
BCS_PE_LOM_Incoming_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0
001110000011110000000000000000000000000000000

For Unit Event Source chosen as ReM0-3, here is an example:
BCS_PE_REM_Incoming_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0
001110000011110000000000000000000000000000000
123. r bits 0000_FDnC_5030 3_140C Initial or current value: Pair0 Counter compare value or max count, low order bits (31:0). PAIR0_CNT1: Initial or current value: Pair0 Counter compare value or max count, high order bits (44:32). PAIR1 0000_FDnC_5044 3_1411 Initial or current value: Pair1 Counter0 compare value or max count, low order bits (31:0). Initial or current value: Pair1 Counter0 compare value or max count, high order bits. 0000_FDnC_5000 3_1400 RW. 0000_FDnC_5034 3_140D. 0000_FDnC_5048 3_1412 (44:32). PAIR1_CNT1 0000_FDnC_5058 3_1416 Initial or current value: Pair1 Counter compare value or max count, low order bits (31:0). Initial or current value: Pair1 Counter compare value or max count, high order bits. PAIR1_CNT1 0000_FDnC_505C 3_1417 (44:32).

Registers that are read and can be cleared:

Current value: Pair0 Counter0 current count, low order bits. Current value: Pair0 Counter0 current count, high order bits. 000 5038 3_140 RW Current value: Pair0 Counter current count, low order. CNT0 0000_FDnC_5024 3_1409 (31:0). PAIR0_CNT0_PMD 0000_FDnC_5028 3_140A (44:32). CNT1 PMD 0 Register Symbo
124. racker Output State: 00 (TOS)
Tracker Output Response Type Received: 0000000000 (TOR)

The filled fields are:
Direction: 01
Transaction Type MsgClass (MC)
Transaction Type OpCode (OC)
Transaction Type MsgClass Mask (MCM)
Transaction Type OpCode Mask (OCM)
Node ID (NID)
NID Mask (NIDM)

The definition will fill bits 64:20 of the PMPE_PME register. The PMR for the chosen counter for this event should have the values shown in the PE Event Setup section, with Unit Event Source chosen as LoM0-3 and ReM0-3. For example (bits 64:20 in binary):

BCS_PE_Outgoing_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0
011110000011110000000000000000000000000000000

This counts DRS transaction types, for all opcodes, for all DNIDs.

For Unit Event Source chosen as LoM0-3, here is an example:
BCS_PE_LOM_Outgoing_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0
001110000011110000000000000000000000000000000

For Unit Event Source chosen as ReM0-3, here is an example:
BCS_PE_REM_Outgoing_Traffic MC=DRS MCM=0xF OC=0x0 OCM=0x0 NID=0 NIDM=0
001110000011110000000000000000000000000000000

Tracker Output Traffic Identification Monitoring

There are four cases of Traffic Identification Events. For this case, Tracker Output (10) is chosen.

The defaulted fields are:
Node ID: 00000 (NID)
NID Mask: 00000 (NIDM)
Lookup Directory Status: 00000 (LST)

The filled fields are:
Direction: 10
Transaction Type MsgClass (MC)
Transaction Type OpCode (OC)
Transaction Type MsgClass M
125. rary full path name is needed if the library path name is not in the LD_LIBRARY_PATH environment variable.
Example: app libraries libfoo.so path to libbar.so
Caution: must not be left blank when enabled.

bullxprof debug=<number>
Sets the tool's verbosity level: 0 (off), 1 (low), 2 (medium) and 3 (high).

bullxprof experiments=<exp1,...,expN>
Determines which profiling experiments are to be activated. Possible experiments are timing, hwc, mpi, io and mpiio.

bullxprof smartdisplay=<0|1>
Prints the reports using a smart display (time as hours, minutes, seconds; other values as) when value is 1. Disabled otherwise.

bullxprof output=<mode1,...,modeN>
Determines the report production output mode. Possible output modes are stdout, file and csv.
stdout causes reports to be dumped on the standard error stream.
file causes reports to be created as files in a directory named bullxprof YYYYMMDD HHMM SLURM_JOB_ID.
csv causes reports to be created as CSV files in a directory named bullxprof YYYYMMDD HHMM SLURM_JOB_ID.

timing experiment Configuration File Options

bullxprof timing tracelevel=<number>
timing Experiment reports specific level of detail: 1 (basic), 2 (detailed) and 3 (advanced).

bullxprof timing user threshold=<float>
Enables the display of user function statistics when the percentage of user region time is over the given value. Set to 0 to di
126. rce chosen as ReMO 3 here is an example BCS PE REM Interface RT East 01 BCS PE REM Interface RT West 10 Transaction Monitoring You select the Event Request or Response You select the Transaction Type Then you select the Opcode and Opcode Mask The Buffer Occupation Monitoring fields are Transaction Type TY OpCode OC OpCode Mask OCM The definition will fill bits 118 107 of PMPE PME register The PMR for the chosen counter for this event should have values shown in the PE Event Setup section with Unit Event Source chosen as LoMO 3 and ReMO 3 For example bullx DE User s Guide Bits 118 107 in binary BCS PE Tx Request TY2 Write OC WbMtol OCM OxF 010101001111 BCS PE Tx Response TY2Snoop OC SnplnvOwn OCM OxF 101011001111 For Unit Event Source chosen as LoMO 3 here is an example BCS PE LOM Tx Request TY2 Write OC WbMtol OCM OxF 10101001111 BCS PE LOM Tx 5 0 101011001111 For Unit Event Source chosen as ReMO 3 here is an example BCS PE REM Tx Request TY 2Write OC2 WbMtol OCM OxF 010101001111 BCS PE REM Tx Response TY2Snoop OC SnplnvOwn OCM OxF 101011001111 This is setup to count PE Transactions Appendix A Performance Monitoring with BCS Counters 99 NCMH Event Setup For the NCMH count events the PMR for the chosen counter for this event should have the following settings Counter Enable Source local count enable 001 Counter S
127. reports specific level of detail 1 basic 2 detailed and 3 advanced bullxprof timing mpiio threshold lt float gt Enables the display of MPI I O function statistics when percentage of MPI I O region time is over the given value Set to O to disable this feature bullxprof mpiio functions lt function 1 functionN gt bullx DE User s Guide The list of profiled MPI IO functions Supported values are selected from the following values MPI File open MPI_File_close MPI File delete MPI File set size MPI File preallocatePI File get size MPI File get group MPI File get amode File set info MPI File get info MPI File set view MPI File get view MPI File read at MPI File read at all MPI File write at MPI File write at all MPI File iread at MPI File iwrite at MPI File read File read all MPI File write MPI File write all MPI File iread MPI File iwrite MPI File seek MPI File get position MPI File get byte offset MPI File read shared MPI File write shared MPI File iread shared MPI File iwrite shared MPI File read ordered MPI File write ordered MPI File seek shared MPI File get position shared MPI File read at all begin MPI File read at all end MPI File write at all begin MPI File write at all end MPI File read all begin MPI File read all end MPI File write all begin MPI File write all end MPI File read ordered begin MPI File read ordered end MPI File write ordered
128. ring fields are Starvation Type TY Starvation Event EV The definition will fill bits 89 75 of PMPE PME register The PMR for the chosen counter for this event should have values shown in the PE Event Setup section with Unit Event Source chosen as LoMO 3 and ReMO 3 For example Bits 89 75 in binary BCS PE Starvation TY2Snoop EV ACT 001000000000010 BCS_PE_Starvation TY WrReq EV_THR gt 3 011000000011011 Where the set of exclusive Starvation Events and their abbreviations are Start of New Starvation Mechanism EV_STR Starvation Mechanism is Active EV_ACT Threshold Comparison Including the threshold amount EV_THR For Unit Event Source chosen as LoMO 3 here is an example BCS PE LOM Starvation TY2W rReq EV THR 3 011000000011011 For Unit Event Source chosen as ReMO 3 here is an example BCS PE REM Starvation TY2W rReq EV THR 3 011000000011011 Buffer Occupation Monitoring You select the Buffer Select choose the buffer to monitor You select comparison Event greater than or equal You select occupation Threshold The Buffer Occupation Monitoring fields are Buffer Select BUF Threshold THR The definition will fill bits 104 90 of PMPE PME register The PMR for the chosen counter for this event should have values shown in the PE Event Setup section with Unit Event Source chosen as LoMO 3 and ReMO 3 For example Bits 104 90 in binary BCS PE Buffer Occupation BUF2WTO THR 7 001100000011101 PE Buffer Occupatio
129. s a non-intrusive tool which displays data from counters that were logged while the application ran.

mpianalyser uses the PMPI interface to analyze the behavior of MPI programs. profilecomm is a part of mpianalyser and is dedicated to MPI application profiling. It has been designed to be:
• Light: it uses few resources and so does not slow down the application.
• Easy to run: it is used to characterize the communications in a program. Communication matrices are constructed with it.

profilecomm is a post-mortem tool, which does not allow on-line monitoring. Data is collected as long as the program is running. At the end of the program, data is written into a file for future analysis.

readpfc is a tool with a command line interface which handles the data that has been collected. Its main uses are the following:
• Display the data collected
• Export communication matrices in a format that can be used by other applications

Data Collected
The profilecomm module provides the following information:
• Communication matrices
• Execution time
• of calls of MPI functions
• Message size histograms
• Topology of the execution environment

Environment
The user environment can be set to use mpianalyser through the provided module files (see Section 2.4, bullx DE Module Files):
• mpianalyser 1 2 link: this module file sets the user environment for linking an MPI binary with the mpianalyser s
130. s a general description of event types. Any differences or additions in the units are addressed in later sections.
1. Interface: measure BCS internal traffic from the selected unit to a destination unit. Details about message type are not available at this level of measurement.
2. Buffer Occupation: measurement of buffer occupation at or greater than a specified threshold. Used in association with the timer and multiple runs at different thresholds to make a histogram of occupation.
3. Errors: measure double and single ECC errors.
4. Traffic Identification: measure various events in the life of a transaction based on the traffic direction and the transaction type (message class and opcode, dependent upon a mask). Incoming and Outgoing directions are with respect to the unit being monitored.
5. Latency: measure latency for selected message sequences, often dependent upon a mask.

PE Event Types
LoM (Local space Manager) is responsible for ensuring coherency for local addresses. It behaves as a Home Agent on representing the Home Agents of all the other modules, and as a Caching Agent on representing the Caching Agents of the local module.
ReM (Remote space Manager) is responsible for ensuring coherency on remote addresses. It behaves as a Home Agent on representing the Home Agents of all the other modules, and as a Caching Agent on representing the Caching Agents of the local module.
Protocol Engine PE eve
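The Buffer Occupation event type described in this section only counts how often occupation is at or above one threshold, so a histogram has to be assembled from several runs at different thresholds. The following Python sketch shows one possible way to do that post-processing. The function name and the sample counts are hypothetical, and the sketch assumes that every run observes a comparable workload, so that counts taken at successive thresholds can be subtracted from each other.

```python
def occupation_histogram(counts_at_or_above):
    """Convert 'occupation >= threshold' counts into per-bucket counts.

    counts_at_or_above maps a threshold to the number of timer ticks
    (or samples) during which buffer occupation was >= that threshold.
    Hypothetical input: one entry per run, each run with its own threshold.
    """
    thresholds = sorted(counts_at_or_above)
    histogram = {}
    for i, threshold in enumerate(thresholds):
        at_or_above = counts_at_or_above[threshold]
        # Count measured at the next higher threshold; the highest
        # threshold has nothing above it, so its bucket is open-ended.
        if i + 1 < len(thresholds):
            above = counts_at_or_above[thresholds[i + 1]]
        else:
            above = 0
        # Clamp at zero to absorb small run-to-run differences.
        histogram[threshold] = max(at_or_above - above, 0)
    return histogram


# Made-up counts for thresholds 0 to 3 (illustration only).
sample_counts = {0: 1000, 1: 400, 2: 150, 3: 20}
print(occupation_histogram(sample_counts))
```

Each bucket covers occupations from its threshold up to, but not including, the next measured threshold; the last bucket covers everything at or above the highest threshold.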
131. s shown in 32 bit packets as that is how it is read and written in Configuration Access mode using the BCS CSR Field description details can be found in the PMPE Event Configuration Register Description bullx DE User s Guide x 2k x x KK x x OX KKK KK ox x x 0X X X X x x 1 x Traffic Identification Directory Tracker Output Access Directory Active Levels Twin Lines Error Monitoring Response Type Received Event Threshold Event Event Event 17 i2 11 10 0 20 r0 0 0 0 0 0 0 0 0550 000000 0 0 0 0 0 0 0 NoEvent X X X X X x 1 Directory SRAM single ECC error X X X X X 1 x SRAM double ECC error X X X X 1 x x Directory LOT single ECC error X X X 1 x x x Directory DCT singe ECC error X X 1 x x x x Directory DLIT singe ECC error X 1 x x x x x Tracker singe ECC error 1 x x x x x x Virtual Output FIFO singe ECC error No Event Lookup to Directory SRAM Lookup miss Lookup hit with one of the twin lines in non l state Lookup hit with both of the twin lines in non l state No Event 0 1 The number of active levels is greater than threshold 1__0 The number of active levels is equal to the threshold Event Directory SRAM update access Directory SRAM read access Directory IPT or SRAM update access Directory IPT or SRAM read access Transaction elected to access the pipeline No Event snSnp or d
132. sable this feature bullxprof timing region lt region1 regionN gt bullx DE User s Guide Enables time profiling of the selected code region Possible regions are user user code mpi MPI functions io POSIX I O functions mpiio MPI I O functions hwc experiment Configuration File Options bullxprof hwc tracelevelz number hwc experiment reports specific level of detail 1 basic 2 detailed and 3 advanced bullxprof hwe metrics lt metric metricN gt Enables profiling of the selected metrics Possible metric values are flops consumed GFLOPS ibe Instructions by Cycles Cache Miss Rate in clr Cache Line Reuse mpi experiment Configuration File Options bullxprof mpi tracelevel lt number gt mpi experiment reports specific level of detail 1 basic 2 detailed and 3 advanced bullxprof timing mpi threshold lt float gt Enables the display of MPI function statistics when percentage of MPI region time is over the given value Set to to disable this feature bullxprof mpi functions lt function1 functionN gt The list of profiled MPI functions Supported values are selected from the following values MPI_Allgather MPI Allgatherv MPI_Allreduce MPI_Alltoall MPL Alltoally MPI Barrier MPI_Bcast MPI_Bsend MPI_Bsend_init Cancel MPI Cart create MPI_Cart_sub MPI_Comm_create MPI Comm dup MPI Comm free MPI Comm split _ Comm compare MPI Finali
133. se with an interactive viewer called hpcviewer.

6.4.2 HPCToolkit Tools
The tools included in the HPCToolkit are:

6.4.2.1 hpcrun
hpcrun uses event-based sampling to measure program performance. Sample events correspond to periodic interrupts induced by an interval timer or overflow of hardware performance counters, measuring events such as cycles, instructions executed, cache misses, and memory bus transactions. During an interrupt, hpcrun attributes samples to calling contexts to form call path profiles. To accurately measure code from "black box" vendor compilers, hpcrun uses on-the-fly binary analysis to enable stack unwinding of fully optimized code without compiler support, even code that lacks frame pointers and uses optimizations such as tail calls. hpcrun stores sample counts and their associated calling contexts in a calling context tree (CCT).

hpcrun-flat, the flat-view version of hpcrun, measures the execution of an executable by a statistical sampling of the hardware performance counters to create flat profiles. A flat profile is an IP histogram, where IP is the instruction pointer.

6.4.2.2 hpcstruct
hpcstruct analyzes the application binary to determine its static program structure. Its goal is to recover information about procedures, loop nests, and inlined code. For each procedure in the binary, hpcstruct parses its machine code, identifies branch instructions, builds a contro
134. sing Incoming Traffic As the example shows the REM event counts closely match the LOM events counts The test used generates reads for one test pass and writes for another test pass The test program generates about 500 000 000 remote memory requests per program instance and one instance is executed Here are the read results from BPMON that also shows the BCS performance events measured Appendix A Performance Monitoring with BCS Counters 115 116 BPMON Single Thread Event Results Event Description Event Count BCS0 PE REM Incoming Traffic 496006785 MC HOMO MCM 0xF OC 0 OCM 0xC NID 0x01 NIDM 0x01 BCSO_PE_REM_Incoming_Traffic 246838 MC HOMO MCM 0xF OC RdInvOwn OCM 0xF NID 0x01 NIDM 0x01 BCS3_PE_LOM_Incoming_Traffic 494996140 MC HOMO MCM 0xF OC 0 OCM 0xC NID 0 NIDM 0x00 BCS3_PE_LOM_Incoming_Traffic 221481 MC HOMO MCM 0xF OC RdInvOwn OCM 0xF NID 0 NIDM 0x00 Here are the write results from BPMON that also shows the BCS performance events measured BPMON Single Thread Event Results ription Event Count Incoming_Traffic 3485584 CM 0xF OC 0 OCM 0xC NID 0x01 NIDM 0x01 Incoming_Traffic 489939668 CM 0xF OC RdInvOwn OCM 0xF NID 0x01 NIDM 0x01 Incoming_Traffic 2502476 CM 0xF OC 0 OCM 0xC NID 0 NIDM 0x00 OM Incoming Traffic 489917358 CM 0xF OC RdInvOwn
135. tScript format eps Export in Encapsulated PostScript format svg Export in Scalable Vector Graphics format fig xfig Export in xfig format epslatex Export in LaTex and Encapsulated PostScript format pslatex Export in LaTex format and PostScript inline pstex Export in Tex format and PostScript inline The available values are the following NJ mportant When using epslatex two files are written xxx tex and xx eps The filename indicated in the option is the name of the Latex file logscale base Uses a logarithmic color scale Default value for logarithm basis is 10 this basis can be modified using the base argument This option is only relevant when exporting in a graphical format nogrid Does not display the grid on a graphical representation of the matrix o file output file Specifies the file name for an export file The default filenames are out csv out mm out dat out ps out svg out fig or out tex according to export format This option is only available with the option palette pa Uses a personalized colored palette This option is only relevant when exporting in a graphical format This palette must be compatible with the defined function of gnuplot for instance palette 0 white 1 red 2 black or palette 0 0000 1 2 ff0000 title fitle Uses a personalized title a graphical display The default title is Point to point collective numeric
136. tatus Output Source perfcon 000 Count Mode count events Counter Event Source unit event Counter and Status Reset Source no reset Compare Mode disabled 00 Unit Event Source ncmh 000000001 Unit Type Source PE 00 The syntax for the expert user that does not wish any software tool help in defining an event is to provide the PMR and PMNC PME register contents BCS_NCMH PMR 0x00100004 NCMH 0 0x7E0420 0 Buffer Occupation Monitoring You select Tracker Buffer or the Tracker Buffer You select the Threshold 0 to 63 You select comparison Event greater than or equal The Buffer Occupation Monitoring fields are QPI Tracker Buffer QPI_Tracker Tracker Buffer XQPI_Tracker To the field name is appended the comparison event type gt or and the Threshold amount as shown in the example below The definition will fill bits 8 O of PANC_PME register For example Bits 8 0 in binary BCS NCMH Buffer Occupation GPI gt 3 1 001111101 BCS NCMH Buffer 3 100001101 BCS NCMH Buffer gt 0 000000001 The PMR for the chosen counter for this event should have values shown above Transaction Monitoring You select the Event Allocate or Release You Select the Buffer Tracker or Tracker Then you select the Transaction Type Msgclass Msgclass Mask Opcode and O
137. tch back and forth between these perspectives. This provides a nice way to organize somewhat unrelated information.

Figure 6-2. bhpcviewer Bull Extensions Manual page

HPCToolkit Wrappers
The primary purpose of a wrapper command or script is to run another command or script. Wrappers provide pre- and post-processing, as well as support for configuration control of the arguments of both the wrapper and the script it runs. Often, the input and output files for the wrappers are obtained from, or written to, the History Repository. The bhpcstruct, bhpcrun, bhpcprof and bhpcprof-mpi wrappers can be invoked, along with bhpcstart, bhpcstop and bhpcclean.

Command line help
Each of the Bull Enhanced HPCToolkit command line wrappers will generate a help message summarizing the tool's usage, arguments and options. To display the help information for the wrappers, enter <wrapper name> -h or <wrapper name> --help.

Start Component bhpcstart
When the bhpcstart wrapper is run, it sets the environment used by a test case. A test case consists of running several scripts, each of which collects some of the data related to the test. When the test is finished, the bhpcstop script should be run. This wrapper can be used to create a new code passport or to set an existing code passport to be the current one used by other scripts.
• Creating a new one is acc
138. tecting bottlenecks caused by hardware or software This Appendix describes some of the ways that the Performance Monitoring PM resources can be programmed to obtain some basic measurements A 1 Bull Coherent Switch Architecture To be able to create monitoring experiments the user must have some understanding of the BCS architecture The BCS units are e Remote Space Manager REM and Local Space Manager LOM collectively referred to as the Protocol Engine PE e Layer QPI IOH XQ PI LLCH and collectively referred to as LL Output Buffering blocks to OBC OBI and are considered to be part of the appropriate LL unit for the purposes of Performance Monitoring collectively referred to as OB e Manager Unit NCMH e Route Through IOH to GPI GPHo IOH ROIC and collectively referred to as RO Figure A 1 shows a schematic representation of the BCS units with their performance monitoring blocks and connections PMLLO PMLLO PMLL1 PMLL1 PMLLO PMLL1 PMLLO PMLL1 PMLLO PMLL1 Figure BCS Architecture for performance monitoring blocks and connections Appendix A Performance Monitoring with BCS Counters 79 2 Performance Monitoring Architecture Performance Monitoring as supported by BPMON and Bull s PAPI enhancement is composed of two parts e event detection event counting Event detection logic is pl
139. tes the program structure to the code passport 4 One launches an application with the parallel manager component bhpcrun which in turn invokes the classic hpcrun tool to execute the binary with statistical sampling bhpcrun collects performance profiles from the one or more nodes on which the binary was executed and adds them to the code passport It also collects environment information about the executable on that system This includes the executables size and build date plus the environment variables that were set and list of dynamic libraries used by the executable on that node 5 One invokes the hotplot component bhpcprof or bhpcprof mpi which in turn invokes the classic hpcprof or hpcprof mpi tool to correlate the performance data with the source structure creating a performance database This database is then added to the code passport 6 One invokes the stop component bhpcstop with a project name The last code passport name file is deleted for the project Chapter 6 Analyzing Application Performance 67 7 sample bash script test case to run an MPI MpiSpinWheels job opt hpctk test cases MpiSpinWheels is displayed below export BHPCTK REPO ROOT home hpctk pmhistrep bhpcstart ndemoproj MpiSpinWheels bhpcstruct ndemoproj T opt hpctk test cases MpiSpinWheels mpirun mca btl tcp self np 8 x SBHPCTK REPO ROOT host sulu bones bynode display map bhpcrun ndemoproj e PAPI TOT CYCG1000000 e PAPI TOT I
140. test case. The unique name is created by the passport manager by appending a date/time stamp to the user-provided string. The passport manager also keeps track of the current code passport (string plus date/time stamp) being used for each project. This allows scripts run after the bhpcstart script to get the code passport name being used for the current test from the passport manager, so the user does not need to provide it to any other scripts run for the test case. When the bhpcstop script is run, it clears the current code passport name to stop future scripts from putting more data into this code passport. The user needs to create a new code passport, or set an existing one to be current again, before running additional scripts.

Test run work flow
The work flow is similar to that of the classical Toolkit; however, the input and output files for the Toolkit components are obtained from, or written to, a code passport, as outlined below:
1. One must initialize the BHPCTK_ROOT environment variable with the path name of the History Repository (repository root).
2. One invokes the start component, bhpcstart, with a project name and a simple or full code passport name. A code passport is created if a partial name is entered, and the last code passport name file is created for the project.
3. One invokes the compilation component, bhpcstruct, which in turn invokes the classic hpcstruct tool to perform binary analysis. bhpcstruct wri
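The code passport bookkeeping described above (a unique name built from a user-provided string plus a date/time stamp, remembered per project until the stop script clears it) can be sketched in a few lines of Python. This is an illustration only, not the passport manager's implementation: the class and function names are hypothetical, and the `_YYYYMMDD_HHMMSS` stamp layout is an assumption, since the guide does not give the exact format.

```python
from datetime import datetime


def make_passport_name(user_string, now=None):
    """Append a date/time stamp to the user-provided string.

    The '_YYYYMMDD_HHMMSS' layout is assumed for illustration only.
    """
    now = now or datetime.now()
    return f"{user_string}_{now.strftime('%Y%m%d_%H%M%S')}"


class PassportTracker:
    """Hypothetical stand-in for the 'current code passport' bookkeeping."""

    def __init__(self):
        self._current = {}  # project name -> current code passport name

    def start(self, project, user_string, now=None):
        # Like bhpcstart: create a unique name and make it current.
        name = make_passport_name(user_string, now)
        self._current[project] = name
        return name

    def current(self, project):
        # Later scripts look the name up instead of asking the user for it.
        return self._current.get(project)

    def stop(self, project):
        # Like bhpcstop: clear the current name so no more data is added.
        self._current.pop(project, None)


tracker = PassportTracker()
print(tracker.start("demoproj", "MpiSpinWheels"))
print(tracker.current("demoproj"))
tracker.stop("demoproj")
print(tracker.current("demoproj"))  # None once the test case is stopped
```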
141. that they can be enhanced with added value, viewed, or compared at a later time. This component consists of the following parts:
• History Repository
• History Repository Environment Variables
• Passport Library
• Passport Manager Application

History Repository
The History Repository is a database whose entries are code passports from many different test runs. Each execution of the user's program, which may occur across multiple nodes, results in one code passport in the History Repository. Data in the History Repository is stored in a file structure which is grouped first by project and then by code passports within a project. A code passport contains all of the results from running a single test, including environment information (such as compiler version, compilation platform, surrounding software distributions), program structure information, and performance information (including raw performance profiles and performance databases).

A repository name represents a set of data within a repository. This set may be a single file, many files, or even all of the files in the repository. The fields in a repository name support glob-style pattern matching to provide a friendly way to specify the desired set of repository files.

History Repository naming convention
repo name gt project code passport gt lt test tool gt lt data origin gt lt f
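Glob-style matching on repository names, as mentioned above, can be illustrated with Python's standard `fnmatch` module. The sketch below is not the Passport Library itself: the `/` separator between fields, the field values and the file names are all assumptions made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical repository names: project / code passport / tool / origin / file.
# The "/" separator and every value below are assumptions for illustration.
REPOSITORY_FILES = [
    "demoproj/MpiSpinWheels_20130507/hpcrun/sulu/profile.dat",
    "demoproj/MpiSpinWheels_20130507/hpcstruct/sulu/app.hpcstruct",
    "demoproj/MpiSpinWheels_20130508/hpcrun/bones/profile.dat",
    "otherproj/Solver_20130601/hpcrun/sulu/profile.dat",
]


def select_files(pattern, names=REPOSITORY_FILES):
    """Return the names matching a glob-style pattern.

    Note: unlike shell globbing, fnmatch's '*' also matches '/'.
    """
    return [name for name in names if fnmatchcase(name, pattern)]


# All hpcrun data for every code passport of the 'demoproj' project.
for name in select_files("demoproj/*/hpcrun/*"):
    print(name)
```

A pattern of `*` alone selects every file in the repository, which corresponds to the "all of the files" case mentioned in the text.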
142. time 9303 us ize 4 of partitions 3 ons limits 1000 1000000 Sz 4 bytes 32 bits Sz 8 bytes 64 bits S on 1 hosts hostid cpuid 0 0 0 1 0 2 0 3 ocesses 12 3 Chapter 5 Application Profiling 37 5 1 7 2 38 e display a point to point numerical communication matrix readpfc pn foo pfc Point to point numeric number of messages 0 1 1k 0 0 1 1k 1 1k 0 0 0 1 1k 0 0 0 1 1 0 0 1 1k 0 1 1k e To export the collective volumic communication matrix in CSV format in the default file readpfc x cv foo pfc e To export the first part small messages of point to point numerical communication matrices in PostScript format in the foo ps file 5 readpfc x 0 f ps o foo ps foo pfc ls foo ps pfcplot histplot and gnuplot The pfcplot script converts matrices into graphic using gnuplot It is generally used readpfc but can be used directly by the user who wants more flexibility The matrix must be exported with the f gnuplot option to be read by pfeplot For more details enter man pfcplot Users who have particular requirements can invoke gnuplot directly To do this the matrix must be exported with gnuplot format or with CSV format choosing space as the separator gt mportant Due to the limitations of gnuplot one null line and one null column are added to the exported matrix in gnuplot format Histplot is the equivalent of
143. tions that use it are run Passport Library This library provides an to manage the History Repository and the information found in the code passports stored within the repository The library is responsible for reading the environment variable BHPCTK ROOT to find out where the repository is located Passport Manager Application This application is a utility that can be used to access the data in a history repository Data in the history repository is stored in a file structure which is grouped first by project and then by code passports within a project A code passport contains all of the results from running a single test The Passport Manager tools are accessed with the command bhpcpm bullx DE User s Guide 6 5 2 Note Usage bhpcpm ACTION repository name OPTION lt pathname gt A required ACTION field is used to specify the desired function An optional OPTION field is used along with the ACTION to achieve the desired result Both fields can be entered with a and a single letter or a and a word An asterisk is a wild card used for all occurrences of an item To display the help information for the Passport Manager Application enter bhpcpm h or bhpcpm help Viewing Component The enhanced Bull HPCToolkit viewer bhpcveiwer adds new features to the Rice University GUI based hpcviewer which currently displays the contents of the performance database New bhpcveiwer
144. tory Active Levels THR 12 0110001 BCS PE Directory Active Levels THR 12 0110010 For Unit Event Source chosen as LoMO 3 here is an example BCS PE LOM Directory Active Levels THR 12 0110001 For Unit Event Source chosen as ReMO 3 here is an example BCS PE REM Directory Active Levels THR 12 0110010 Directory Access Monitoring You select the directory access type to count The definition will fill bits 19 17 of PMPE PME register The PMR for the chosen counter for this event should have values shown in the PE Event Setup section with Unit Event Source chosen as LoMO 3 and ReMO 3 For example Bits 19 17 in binary BCS PE Directory Access Event DSU 001 BCS PE Directory Access Event DIR 100 Where the set of exclusive events and their abbreviations are Directory SRAM Update Access DSU bullx DE User s Guide Directory SRAM Read Access DSR Directory IPT or SRAM Update Access DIU Directory IPT or SRAM Read Access DIR Transaction Elected to Access the Pipeline TEA For Unit Event Source chosen as LoMO 3 here is an example BCS PE LOM Directory Access Event TEA 101 For Unit Event Source chosen as ReMO 3 here is an example BCS PE REM Directory Access Event DSR 010 Incoming Traffic Identification Monitoring There are four cases of Traffic Identification Events This is the first The Traffic Identification Direction is selected by setting bits 64 63 Incoming 00 Outgoing 01 Tracker Output 10 Lookup Res
145. ts Three formats are available CSV Comma Separated Values MatrixMarket not available for histogram exports and gnuplot It is also possible to have a graphical display of the matrix or the histogram which is better for matrices with a large number of elements Obviously it is also possible to include the graphics in a report Seven graphic formats are available PostScript Encapsulated PostScript SVG xfig EPSLaTeX PSLaTeX and PSTeX All these formats are vectorial which means the dimensions of the graphics can be modified if necessary Figure 5 1 example of a communication matrix Point to point and collective messages size histograms collective ses pt2pt 3 o o o 7 gt 10000 100000 1 06 1 07 1 08 1 09 Message size Figure 5 2 An example of a histogram Chapter 5 Application Profiling 35 5 1 7 1 36 Options The following options may be used when exporting matrices csv separator sep Modifies CSV delimiter Default delimiter is comma u n Some software programs prefer a semicolon format format format Chooses export format Default format is CSV Comma Separated Values help Lists available export formats csv Export in CSV format mm market MatrixMarket Export in MatrixMarket format gp gnuplot Export in a format used by pfcplot so that a graphical display of the matrix can be produced ps postscript Export in Pos
146. ull. They include Intel compilers and profiler tools, DDT from Allinea and TotalView from RogueWave (parallel debuggers), as well as Vampir. See the bullx Extended Offer Administration Guide for details regarding the installation and configuration of these third-party products for the development environment as part of the extended offer.

bullx DE Development Environment
bullx DE is a component of the bullx supercomputer suite. It includes a collection of Open Source tools that help users to develop, execute, debug, analyze and profile HPC parallel applications. This guide describes the use of the tools and libraries provided with bullx DE.

Chapter 2 DE User Environment

2.1 bullx DE Installation Path
The tools and libraries for the bullx Development Environment are installed under /opt/bullxde. This directory contains the following sub-directories:
debuggers: Contains bullx DE core offer tools for debugging applications.
mpicompanions: Contains tools and libraries used alongside bullx MPI.
perftools: Contains basic tools to help tune application performance or to read performance counters for a running application.
profilers: Contains application profilers.
utils: Contains utilities used by other tools.
modulefiles: Contains bullx DE tools module files.

2.2 Environment Modules
bullx DE uses Environment Modules to customize dynamically your shell environment
147. vent Source chosen as ROCI here is an example BCS RO Interface IS ROBO T IE LOC 1011 Appendix A Performance Monitoring with BCS Counters 105 A 6 BCS Key Architectural Values Message Class and Opcode Mapping Any Opcodes not explicitly defined are reserved for future use Opcodes listed as unsupported have been found to be unsupported in the current version of the BCS Other Opcodes may also be unsupported anyone wishing to discover them is directed to the Intel Protocol Specification Likewise a NHM or TWK designation means that the Opcode is only valid for that platform Once again the designation is not exhaustive the assumption being that a user who is counting events based upon Opcodes has the knowledge to be doing so or access to documentation that would interpret it Also NcMsgB and NcMsgS contain six and ten message types respectively which cannot be differentiated for performance monitoring Message Class Name Message Opcode Class Encoding SnpCur 0011 0000 SnpCode 0011 0001 SnpData 0011 0010 Snoop SNP 3 SnplnvOwn 0011 0100 SnplnvWbMtol or SnplnvXtol 0011 0101 SnplnvitoE 0011 1000 PrefetchHint unsupported 0011 1111 RdCur 0000 0000 RdCode 0000 0001 RdData 0000 0010 NonSnpRd unsupported 0000 0011 0000 0100 Inv WbMtol InvXtol 0000 0101 Home Request EvctCln NHM 0000 0110 HM 0 NonSnpWr unsupported 0000 01
148. vides a description of the Bull Hpcviewer package, which is delivered as part of the Bull hpctoolkit package.

Bull has developed two sets of enhancements related to the Rice HPCToolkit tools. The first set of enhancements provides a set of wrappers that can be used to run the performance data collection tools (hpcstruct, hpcrun and hpcprof) delivered by Rice. The wrappers' primary purpose is to collect the results of the performance tools and other information about the systems where they were run, and to save this information in a history repository. The final result of running these wrappers is a performance database, which is also saved in the history repository.

The second set of enhancements provides a Bull-developed viewer with a GUI interface to both the history repository and the performance data stored in the performance databases found in the history repository. The Bull viewer incorporates all of the features delivered with the Rice hpcviewer, and adds new features which provide access to what has been preserved in the history repository, plus new tools to analyze the performance data. It is the second set of enhancements described above that this manual describes.

The Bull viewer, and the Rice hpcviewer that it is built on top of, are implemented as Eclipse RCP applications. This framework provides the capability to have multiple perspectives within a single application window. With this framework it is very easy to swi
149. when the trace level is set to 2 gives the user-selected HW metrics values for each function. The report is dumped metric by metric.

Parallel Program
In a context, the summary report produced when the trace level is set to 1 gives the following information for each user-selected HW metric:
Min Value (rank): The minimum count of the event for the overall program and the candidate process rank.
Max Value (rank): The maximum count of the event for the overall program and the candidate process rank.
Average: The average count of the event for the overall program.
Total: The cumulated count of the event for the overall program.

MPI experiment
The summary report produced when the trace level is set to 1 gives information about four (4) groups of MPI functions:
Point to Point: Send/Receive-like MPI functions (MPI_Send, MPI_Sendrecv, etc.)
Collective: Collective MPI functions (e.g. MPI_Alltoall, MPI_Reduce, etc.)
Synchronization: Barrier and MPI Wait-like functions
All: All MPI functions

For each group of functions, the summary report gives the following information: Max time (rank), Min time (rank), Average time, Percentage of MPI, Percentage of walltime, Max message count (rank), Min message count (rank), Total message count, Average message count, Message rate, Total volume, Average volume, Bandwidth.
Max time (rank): The maximum time spent in the functions of the group and the candidate process r
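The Min Value (rank), Max Value (rank), Average and Total fields of the hardware counter summary report described above are simple reductions over per-process counts. The Python sketch below shows how such a summary could be derived from a mapping of MPI rank to event count. The function name, the dictionary layout and the sample numbers are hypothetical; this is not how bullxprof itself computes or stores its report.

```python
def summarize_metric(count_by_rank):
    """Reduce per-rank event counts to the four summary fields.

    count_by_rank maps an MPI rank to the event count measured on that
    rank (hypothetical input layout, for illustration only).
    """
    min_rank = min(count_by_rank, key=count_by_rank.get)
    max_rank = max(count_by_rank, key=count_by_rank.get)
    total = sum(count_by_rank.values())
    return {
        "min": (count_by_rank[min_rank], min_rank),  # value and candidate rank
        "max": (count_by_rank[max_rank], max_rank),  # value and candidate rank
        "average": total / len(count_by_rank),
        "total": total,
    }


# Made-up counts for a three-process run.
summary = summarize_metric({0: 120, 1: 80, 2: 100})
for field, value in summary.items():
    print(field, value)
```

When several ranks share the minimum or maximum value, this sketch reports the first one encountered; the guide does not say which rank the real report would pick in that case.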
150. x DE User s Guide Register Symbolic Real Address CSR Address Name u_ROCI PMRO1 0000_FDn7_CC24 1_F309 ROCI events Event bits 3 0 u_NCMH PMNCO 0000 FDnC 6000 3_1800 NL NCMH events EventO bits 31 0 u_NCMH PMNCO 0000 FDnC 6004 3 1801 NCMH events EventO bits 63 32 73 64 NCMH PMNCI 0000 FDaC 7000 3 1C00 rw NCMH events Eventl bits 31 0 0000 FDnC 7004 3 1 1 RW NCMH events Event bits 63 32 u NCMH PMNCI 0000 FDnC 7008 3 1CO2 ps NCMH events 64 o JferBCS 0 1 2 5 n M3 5 7 mmm EN i 0 1 2 3 k 0 4 8 C aco 63 32 95 64 REM PMPEO 0000 FDni 300C 4 kCO3 REMH events 118 96 REMH u 0000 FDni 3800 4 REMH events Event bits 31 0 REMH u 0000 FDni 3804 4 1 RW REMH events Event bits 63 32 0000 FDni 3808 4 kEO2 uM REMH events 95 64 0000 FDni 380C REMH events Event bits 118 96 NEN 1 4 5 6 7 k 0 4 8 C skcoo 8 ll 63 32 95 64 LOMH u 1 0000 FDni 300C 5 kCO3 118 96 u_LOMH u_LOM PMPE1 0000 FDni 3800 5 Event bits 31 0 u_LOMH u_LOM PMPE1 0000 FDni 3804 5 kEOI RW LOMH events Event bits 63 32 LOMH u LOM PMPE1 0000 FDni 3808 5 kEO2 RW LOMH events Event bits 95 64 LOMH u LOM PMPE1 0000 FDni 380C 5
151. x IO volume (rank): the maximum volume of IO data processed in the group and the candidate process. Min IO volume (rank): the minimum volume of IO data processed in the group and the candidate process. Total volume: total volume of data processed, in MB. Average volume: average volume of data processed, in MB. Bandwidth: volume of data processed in a second, in MB/s. The detailed report produced when the trace level is set to 2 gives a report for with the following information for each POSIX IO function. Min time (rank): the minimum time spent executing the IO function and the candidate process rank. Max time (rank): the maximum time spent executing the IO function and the candidate process rank. Average time: the average time spent in the IO function region. Percentage of the IO time for the IO function. Percentage of the for the IO function. bullx DE User's Guide 4 5 5 experiment The summary report produced when the trace level is set to 1 gives information about three (3) groups of MPI IO functions. Read: MPI_File_read-like functions. Write: MPI_File_write-like functions. Total: all MPI IO functions. For each group of MPI IO functions, the summary report gives the following information. Max MPI IO time (rank): the maximum time spent in the functions of the group and the candidate process rank. Min MPI IO time (rank): the minimum time spent in the functions of the group and the candidate process r
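The volume and bandwidth fields described above reduce to simple arithmetic, sketched below in Python. This is an illustration and not bullxprof code: the function name `io_volume_summary`, the per-rank byte counts, the use of a single elapsed time for the bandwidth, and the choice of 1 MB = 2**20 bytes are all assumptions; the guide does not state how bullxprof defines them.

```python
MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def io_volume_summary(bytes_by_rank, elapsed_seconds):
    """Compute IO volume figures from hypothetical per-rank byte counts.

    bytes_by_rank: dict mapping a rank (int) to the bytes it processed.
    elapsed_seconds: time assumed to be spent in IO for the whole group.
    """
    if not bytes_by_rank:
        raise ValueError("at least one rank is required")
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")

    # Candidate ranks: the ranks with the largest and smallest volume.
    max_rank = max(bytes_by_rank, key=bytes_by_rank.get)
    min_rank = min(bytes_by_rank, key=bytes_by_rank.get)
    total_mb = sum(bytes_by_rank.values()) / MB

    return {
        "max_volume_mb": (bytes_by_rank[max_rank] / MB, max_rank),
        "min_volume_mb": (bytes_by_rank[min_rank] / MB, min_rank),
        "total_volume_mb": total_mb,
        "average_volume_mb": total_mb / len(bytes_by_rank),
        "bandwidth_mb_per_s": total_mb / elapsed_seconds,
    }


if __name__ == "__main__":
    # Two ranks with made-up volumes and a made-up elapsed time.
    print(io_volume_summary({0: 2 * MB, 1: 6 * MB}, elapsed_seconds=4.0))
```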
152. xxxx o executable See http://www.boost.org for more details. Chapter 8 Libraries and Other Tools 75 8.2 OTF (Open Trace Format) OTF is a library used by tools like Scalasca to generate traces in the OTF format. The OTF package also contains additional tools to help process OTF trace files:
• otfmerge: converter program of the OTF library
• otfmerge-mpi: MPI version of otfmerge
• otfaux: appends snapshots and statistics to existing OTF traces at given break time stamps
• vtf2otf: converts VTF3 trace files to OTF format
• otf2vtf: converts OTF trace files to VTF format
• otfdump: converts OTF traces, or parts of them, into a human-readable long version
• otf(de)compress: compression program for single OTF files
• otfconfig: shows parameters of the OTF configuration
• otfprofile: generates a profile of a trace in LaTeX or CSV format
• otfshrink: creates a new OTF file that only includes specified processes
• otfinfo: program to get basic information about a trace
See:
• /opt/bullxde/utils/OTF/share/doc/OTF/otftools.pdf for documentation on OTF tool usage
• www.tu-dresden.de/zih/otf for more details
76 bullx DE User's Guide 8 3 5 is a collection of tools that help create and manage CPUSETs. 8.3.1 CPUSETs CPUSETs are lightweight objects in the Linux kernel that enable users to partition their multiprocessor machine by creating executio
153. ze MPI_Gather MPI_Gatherv MPI_Get_count MPI_Graph_create MPI_Ibsend MPI_Init MPI_Intercomm_create MPI_Intercomm_merge MPI_Iprobe MPI_Irsend MPI_Isend MPI_Issend MPI_Pack MPI_Probe MPI MPI init MPI_Reduce MPI_Reduce_scatter MPI_Request_free MPI_Rsend MPI_Rsend_init MPI_Scan MPI_Scatter MPI_Scatterv MPI_Send MPI_Send_init MPI_Sendrecv MPI_Sendrecv_replace MPI_Ssend MPI_Ssend_init MPI_Test MPI_Testall MPI_Testany MPI_Testsome MPI_Start MPI_Startall MPI_Unpack MPI_Wait MPI_Waitall MPI_Waitany MPI_Waitsome. io experiment Configuration File Options. bullxprof io tracelevel number: the io experiment reports a specific level of detail (1: basic, 2: detailed, and 3: advanced). bullxprof timing io threshold <float>: enables the display of POSIX function statistics when the percentage of POSIX I/O region time is over the given value. Set to 0 to disable this feature. Chapter 4 Application Analysis with bullxprof 17 18 bullxprof io functions <function1 functionN>: the list of profiled IO functions. Supported values are selected from the following: open close creat creat64 dup dup2 dup3 lseek lseek64 open64 pipe pread pread64 pwrite pwrite64 read readv sync fsync fdatasync write writev. mpiio experiment Configuration File Options. bullxprof mpiio tracelevel <number>: mpiio experiment
