1. [hpcviewer screenshot: WALLCLOCK (usec) Sum and Mean metric columns showing costs attributed to inlined procedures]
The Problem of Scaling
[figure: parallel efficiency versus number of CPUs, ideal efficiency versus actual efficiency; note: higher is better]
Goal: Automatic Scaling Analysis
• Pinpoint scalability bottlenecks
• Guide user to problems
  - quantify the magnitude of each problem
• Diagnose the nature of the problem
Challenges for Pinpointing Scalability Bottlenecks
• Parallel applications
  - modern software uses layers of libraries
  - performance is often context dependent
  [figure: skeleton of a climate code showing its component models (ocean, etc.)]
• Monitoring
  - bottleneck nature: computation? data movement? synchronization?
2. [hpcviewer screenshot (continued): REALTIME (usec) Sum (I) and Sum (E) metric columns]
(OpenMP: A Challenge for Tools, continued)
• Runtime support is necessary for tools to bridge the gap
Challenges for OpenMP Node Programs
• Typically, tools present an implementation-level view of OpenMP threads
  - asymmetric threads: master thread, worker threads
  - run-time frames are interspersed with user code
• Hard to understand the relationship to program structure
• Hard to understand causes of idleness
  - serial sections
  - load imbalance in parallel regions
  - waiting for critical sections or locks
OMPT: An OpenMP Tools API
• Goal: a standardized tool interface for OpenMP
  - prerequisite for portable tools
  - missing piece of the OpenMP language standard
• Design objectives
  - enable tools to measure and attribute costs to application source and the runtime system
  - support low-overhead tools based on asynchronous sampling
    • attribute to user-level calling contexts
    • associate a thread's activity at any point with a descriptive state
  - minimize overhead if the OMPT interface is not in use
3. (Detailed HPCToolkit Documentation, continued)
  - HPCToolkit and MPI
  - HPCToolkit Troubleshooting
    • why don't I have any source code in the viewer?
    • hpcviewer isn't working well over the network: what can I do?
  - Installation guide
Getting OMPT-enhanced Intel OpenMP
• Currently a prototype open-source project
  - https://code.google.com/p/ompt-intel-openmp
• Soon will be provided to Intel for integration in their runtime
• Getting the prototype: clone the git repository with the code
  git clone https://code.google.com/p/ompt-intel-openmp
  cd ompt-intel-openmp
  git checkout ompt-support-14x
  cd itt/libomp_oss
  make
  - the resulting runtime with OMPT support will be in the exports directory
Using HPCToolkit
• Adjust your compiler flags
  - if you want full attribution to source, add the -g flag after any optimization flags
• See what sampling triggers are available on your platform
  - hpcrun -L
  - if your system's login nodes are different, you need to run this command on your compute nodes
Collecting Performance Data
• Collecting traces
  - use a time-based sample source when collecting a trace (CPUTIME, REALTIME, PAPI_TOT_CYC)
  - use the -t option to hpcrun
• Measuring threads
  - use REALTIME to profile threads; otherwise you miss when they sleep
  - need to use HPCRUN_IGNORE_THREAD to ignore OpenMP/MPI helper threads
• Measuring an MPI job using hpcrun
  - change mpiexec -np 4 your_program arguments
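A minimal sketch of this measurement step, assembled from the hpcrun options quoted on this page; the program name and the 1000-microsecond sampling period are illustrative, and the exact event@period spelling should be checked against hpcrun's own help on your system:

  # list the sample sources available on this machine (run on a compute node)
  hpcrun -L

  # profile and trace a dynamically linked program using a time-based sample source
  hpcrun -e REALTIME@1000 -t ./your_program arguments

The second command writes a measurements directory that the later hpcstruct, hpcprof, and hpcviewer stages of the workflow consume.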
4. For Your Reference: Getting and Using HPCToolkit
Getting HPCToolkit
• Open-source software
  - see hpctoolkit.org for pointers
  - see hpctoolkit.org for instructions to download and build
• Three different pieces of HPCToolkit
  - hpctoolkit-externals: source code available in an svn repository on Google Code
  - hpctoolkit: source code available in an svn repository on Google Code
    • OMPT support is still in a branch:
      svn co http://hpctoolkit.googlecode.com/svn/branches/hpctoolkit-ompt
  - hpcviewer and hpctraceviewer user interfaces
    • binary packages for your laptop, workstation, or cluster: http://hpctoolkit.org/download/hpcviewer
    • Linux, Mac, and Windows binaries
    • source code available (a Java Eclipse RCP project)
• Useful external library
  - PAPI, for measuring hardware counters: http://icl.cs.utk.edu/papi
Detailed HPCToolkit Documentation
• http://hpctoolkit.org/documentation.html
• User manual: http://hpctoolkit.org/manual/HPCToolkit-users-manual.pdf
  - Quick start guide: essential overview that almost fits on one page
  - Using HPCToolkit with statically linked programs: a guide for using hpctoolkit on BG/Q and Cray platforms
  - The hpcviewer and hpctraceviewer user interfaces
  - Effective strategies for analyzing program performance with HPCToolkit: analyzing scalability, waste, multicore performance, ...
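A sketch of fetching the OMPT-capable HPCToolkit sources named above; the checkout URL is the one given on this page, and everything beyond it (the target directory name, building externals first, consulting hpctoolkit.org for configure options) is only an outline, not the project's verbatim procedure:

  # check out the branch that carries OMPT support
  svn co http://hpctoolkit.googlecode.com/svn/branches/hpctoolkit-ompt hpctoolkit-ompt

  # build hpctoolkit-externals first, then configure hpctoolkit against it;
  # see hpctoolkit.org for the authoritative build instructions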
5. [hpcviewer screenshot (continued): par_coarsen.c source pane ("if (CF_marker[i] > 0) ... set to be a C-pt") and Calling Context View with OMP_IDLE and REALTIME metrics for hypre_BoomerAMGCoarsenFalgout, hypre_BoomerAMGCoarsen, and loops in par_coarsen.c]
Assessing Variability (Demo)
• AMG2006: 4 MPI ranks x 8 OpenMP threads (+ 3 helper threads)
[hpcviewer screenshots: plot graph of the REALTIME (usec) metric for hypre_BoomerAMGCoarsenFalgout across process.thread pairs, and a Calling Context View with OMP_IDLE and REALTIME columns]
6. [hpcviewer screenshot (continued): local_tree_build.F90 source pane ("first pass: only add lrefine=1 blocks to tree; second pass: add the rest of the blocks; loop through all processors") and Calling Context View from flash down through amr_refine_derefine, amr_morton_process, find_surrblks, local_tree_build, and pmpi_sendrecv_replace; FLASH white dwarf, IBM BG/P, weak scaling 256 -> 8192; annotation: significant scaling losses (96% of the scalability loss) caused by passing data around a ring of processors]
Improved FLASH Scaling of AMR Setup
7. [hpctraceviewer screenshot (continued): DRTM trace view with a call-path depth view showing monitor_main, main, DRTM_Simulation, DRTM_Task_time, and region_root]
Other HPCToolkit Capabilities
• Performance analysis of GPU-accelerated code
  - Milind Chabbi, Karthik Murthy, Michael Fagan, and John Mellor-Crummey. Effective Sampling-Driven Performance Tools for GPU-Accelerated Supercomputers. SC13, Nov. 2013, Denver, Colorado, USA.
• Data-centric performance analysis
  - Xu Liu and John Mellor-Crummey. A Tool to Analyze the Performance of Multithreaded Programs on NUMA Architectures. PPoPP '14, Feb. 2014, Orlando, Florida, USA.
  - Xu Liu and John Mellor-Crummey. A Data-centric Profiler for Parallel Programs. SC13, Nov. 2013, Denver, Colorado, USA.
Ongoing Work and Future Plans
• Ongoing work
  - refining support for OMPT in HPCToolkit and the OpenMP runtime
  - refining measurement, analysis, and attribution for optimized code
  - general multithreaded models, e.g., TBB, Cilk Plus
  - improving scalability of hpctraceviewer and its server
• Plans
  - enhanced performance analysis of GPU-accelerated code: sampling-based measurement on emerging NVIDIA GPUs
  - resource-centric performance analysis, e.g., bandwidth: I/O, communication, memory
  - refined data-centric analysis: GUI to attribute costs to data
  - measurement and analysis for exascale
  - automated analysis to deliver insights
8. [hpcviewer screenshot (continued): Calling Context View with REALTIME (usec) Sum (I) and Sum (E) metrics for hypre_PCGSetup, hypre_BoomerAMGSetup, hypre_BoomerAMGCoarsenFalgout, the loop at par_amg_setup.c:1319, hypre_BoomerAMGBuildCoarseOperator, hypre_BoomerAMGBuildInterp, hypre_BoomerAMGCreateS, hypre_BoomerAMGCoarseParms, and hypre_ParVectorInitialize]
A Recipe for Tuning MPI+OpenMP
• In priority order
  - get the large-scale MPI parallelization right
    • if processes are blocked, performance will be lost
  - get the OpenMP threading right
    • if threads are blocked, performance will be lost
  - get the node performance details right
    • assess memory hierarchy performance (TLB, cache)
    • assess pipeline performance (graduated instructions)
Putting it all Together: DRTM
• DRTM code: 48 MPI ranks x 6 OpenMP threads per rank (+ 3 helper threads)
[hpctraceviewer screenshot: DRTM trace, time range 1018.444s to 1021.505s, rank range 1.7 to 47.3]
9. [hpctraceviewer screenshot residue (continued): depth view and summary view panes]
OpenMP: A Challenge for Tools
• Large gap between threaded programming models and their implementations
[hpcviewer screenshot: LULESH-OMP; source pane shows "compute the hourglass modes" with "#pragma omp parallel for firstprivate(numElem, hourg)"; Calling Context View shows monitor_begin_thread -> __kmp_launch_worker -> __kmp_launch_thread -> __kmp_invoke_task_func -> outlined parallel loops from CalcKinematicsForElems, CalcHourglassControlForElems, IntegrateStressForElems, and CalcMonotonicQGradientsForElems, plus __kmp_join_barrier and __kmp_launch_monitor]
• User-level calling context for code in OpenMP parallel regions and tasks executed by worker threads is not readily available
10. [HPCToolkit workflow diagram: compile & link -> profile execution (hpcrun) -> binary analysis (hpcstruct) -> interpret profile / correlate with source (hpcprof, hpcprof-mpi) -> database -> presentation (hpcviewer, hpctraceviewer)]
• For dynamically linked executables, e.g., on Linux, compile and link as you usually do: nothing special needed
  - note: OpenMP currently requires a special enhanced runtime for tools, to be added at link time or program launch
HPCToolkit Workflow
• Measure execution unobtrusively
  - launch optimized application binaries
    • dynamically linked: launch with hpcrun; arguments control monitoring
  - collect statistical call path profiles of events of interest
[workflow diagram, as above]
Call Path Profiling
• Measure and attribute costs in context
  - sample timer or hardware counter overflows
  - gather calling context using stack unwinding
• Call path sample: the instruction pointer plus the chain of return addresses, assembled into a calling context tree
• Overhead proportional to sampling frequency, not call frequency
HPCToolkit Workflow
[workflow diagram, as above]
11. [workflow diagram residue]
Code-centric Analysis with hpcviewer
[hpcviewer screenshot: Godunov AMR example with source pane (PatchGodunov.cpp, PolytropicPhysics.cpp), view controls, metric display, and navigation pane; the calling context view exposes loops, function calls in full context, and code inlined from AMR.cpp (AMR::run, AMR::timeStep, AMRLevelPolytropicGas::advance, BoxLayout::size, AMRLevelPolytropicGas::computeDt, AMRLevelPolytropicGas::postTimeStep)]
12. [figure (continued): AMR setup time in seconds versus number of cores (0 to 16000), comparing standard and custom surr_blks construction; graph courtesy of Anshu Dubey, U. Chicago]
Understanding Temporal Behavior
• Profiling compresses out the temporal dimension
  - temporal patterns, e.g., serialization, are invisible in profiles
• What can we do? Trace call path samples
  - sketch
    • N times per second, take a call path sample of each thread
    • organize the samples for each thread along a time line
    • view how the execution evolves left to right
  - what do we view?
    • assign each procedure a color
    • view a depth slice of an execution
  [figure: processes versus time]
Trace View of FLASH3 @ 256 PE (Demo)
• Time-centric analysis: load imbalance among threads appears as different lengths of colored bands along the x axis
[hpctraceviewer screenshot: FLASH trace, time range 72.44s to 89.33s, rank range 27 to 76; cross hair on a DCMF collective Allreduce]
13. [hpcviewer screenshot (continued): local_tree_build.F90 source pane and Calling Context View, with 256- and 8192-core WALLCLOCK metrics from flash down through amr_refine_derefine, find_surrblks, local_tree_build, and pmpi_sendrecv_replace]
Scalability Analysis
• Difference call path profiles from two executions
  - different number of nodes
  - different number of threads
• Pinpoint and quantify scalability bottlenecks within and across nodes
[hpcviewer screenshot: FLASH white dwarf, IBM BG/P, weak scaling 256 -> 8192; Driver_initFlash.F90 and local_tree_build.F90 source panes]
14. [hpctraceviewer screenshot residue (continued): __kmp_wait_sleep and __kmp_x86_pause frames; depth view, summary view, and mini map panes]
Blame Shifting: Analyze Thread Performance
• Problem: a thread is idle waiting for work. Approach: undirected blame shifting, which blames the working threads for not shedding enough parallelism to keep all threads busy.
• Problem: a thread is idle waiting for a mutex. Approach: directed blame shifting, which blames the holder of the mutex for the idleness of threads waiting for the mutex.
  (Tallent & Mellor-Crummey, PPoPP 2009; Tallent, Mellor-Crummey & Porterfield, PPoPP 2010; Liu, Mellor-Crummey & Fagan, ICS 2013)
Blame-shifting Metrics for OpenMP
• OMP_IDLE: attribute idleness to insufficiently parallel code being executed by other threads
• OMP_MUTEX: attribute waiting for locks to the code holding the lock
  - attribute to the lock release as a proxy
• Measuring these metrics requires sampling using a time-based sample source (REALTIME, CPUTIME, PAPI_TOT_CYC)
Blame Shifting with AMG2006 (Demo)
• AMG2006: 4 MPI ranks x 8 OpenMP threads (+ 3 helper threads)
[hpcviewer screenshot: par_coarsen.c and par_csr_matop.c source panes, including the comment "Heuristic: C-pts don't interpolate from neighbors that influence them"]
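A sketch of collecting these blame-shifting metrics, patterned on the hpcrun invocation quoted later in these slides; the binary name is illustrative, and treating OMP_MUTEX as an hpcrun event spelled exactly like OMP_IDLE is an assumption about the prototype runtime:

  # undirected blame shifting: charge idleness to insufficiently parallel code
  mpiexec -np 4 hpcrun -e REALTIME@1000 -e OMP_IDLE -t ./amg2006 arguments

  # directed blame shifting: charge lock waiting to the lock holder (event name assumed)
  mpiexec -np 4 hpcrun -e REALTIME@1000 -e OMP_MUTEX -t ./amg2006 arguments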
15. (Tuning Recipe for MPI+OpenMP III, continued)
• Tools
  - use hpcviewer to assess node performance at the call path, function, and loop levels
16. (Performance Analysis Challenges, continued)
  - multi-level memory hierarchy
  - result: the gap between typical and peak performance is huge
• Complex applications present challenges
  - measurement and analysis
  - understanding behaviors and tuning performance
• Multifaceted performance concerns
  - computation, data movement, communication, I/O
What Users Want
• Multi-platform, programming-model-independent tools
• Accurate measurement of complex parallel codes
  - large, multi-lingual programs
  - heterogeneous parallelism within and across nodes
  - optimized code: loop optimization, templates, inlining
  - binary-only libraries, sometimes partially stripped
  - complex execution environments: dynamic binaries on clusters, static binaries on supercomputers, batch jobs
• Effective performance analysis
  - insightful analysis that pinpoints and explains problems
    • correlate measurements with code for actionable results
    • support analysis at the desired level
      - intuitive enough for application scientists and engineers
      - detailed enough for library developers and compiler writers
• Scalable to large jobs
Outline
• Overview of Rice's HPCToolkit
• Pinpointing scalability bottlenecks
  - scalability bottlenecks on large-scale parallel systems
  - scaling on multicore processors
• Understanding temporal behavior
• Assessing variability across ranks and threads
• Understanding threading performance
  - blame shifting
• A tuning strategy
17. http://hpctoolkit.org/slides/hpctoolkit-og15.pdf
Performance Analysis of MPI+OpenMP Programs with HPCToolkit
John Mellor-Crummey, Department of Computer Science, Rice University
Rice Oil & Gas HPC Workshop, March 2015
http://hpctoolkit.org
Acknowledgments
• Project team
  - research staff: Laksono Adhianto, Mike Fagan, Mark Krentel
  - students: Milind Chabbi, Karthik Murthy
  - recent alumni: Xu Liu (William and Mary, 2014), Nathan Tallent (PNNL, 2010)
• Current funding
  - DOE Office of Science ASCR X-Stack PIPER award
  - Intel
  - BP
Challenges for Computational Scientists
• Rapidly evolving platforms and applications
  - architecture
    • rapidly changing multicore microprocessor designs
    • increasing architectural diversity: multicore, manycore, accelerators
    • increasing scale of parallel systems
  - applications
    • transition from MPI everywhere to threaded implementations
    • enhance vector parallelism
    • augment computational capabilities
• Computational scientists need to
  - adapt to changes in emerging architectures
  - improve scalability within and across nodes
  - assess weaknesses in algorithms and their implementations
• Performance tools can play an important role as a guide
Performance Analysis Challenges
• Complex node architectures are hard to use efficiently
  - multi-level parallelism: multiple cores, ILP, SIMD, accelerators
18. (Tuning Recipe for MPI+OpenMP I, continued)
  - consider communication frequency and volume
    • avoid excessive fine-grain messages
  - avoid serialization
  - make sure that parallelism is available on the node as well, for use with OpenMP
• Use asynchronous communication primitives where possible
  - make computation asynchrony tolerant
    • overlap communication with computation
• Tools
  - use hpcviewer to look for performance and scaling bottlenecks
    • issues apparent within a single execution
    • comparative analysis of multiple executions (strong or weak scaling)
  - use hpctraceviewer to understand MPI parallelization
Tuning Recipe for MPI+OpenMP II
Get the OpenMP threading right
• Employ OpenMP where appropriate
  - avoid fine-grain parallel regions and loop nests
  - barriers at the end of loops and regions can be costly
  - consider how load will be balanced between threads
• Consider OpenMP tasking for functional parallelism
• Tools
  - use hpcviewer and hpctraceviewer to examine threading performance
    • the summary view can help you assess idleness
Tuning Recipe for MPI+OpenMP III
Get the node performance right
• Use hpcrun to profile your code using hardware performance counters (see the sketch after this page)
  - measure resource stalls and compare them with instruction and cycle counts
  - measure the memory hierarchy performance
    • caches and TLB
  - assess vector vs. scalar code
    • vectors are an opportunity to accelerate your code
  - see the HPCToolkit manual for how to compute useful waste metrics
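One way to follow this last recipe step with hpcrun and PAPI; the specific PAPI preset events and sampling periods below are illustrative assumptions (check hpcrun -L and papi_avail for what your platform actually supports), not measurements prescribed by the slides:

  # cycles and graduated instructions, plus TLB and last-level cache misses
  hpcrun -e PAPI_TOT_CYC@3000000 -e PAPI_TOT_INS@3000000 \
         -e PAPI_TLB_DM@50000 -e PAPI_L2_TCM@100000 ./your_program arguments

In hpcviewer, derived metrics such as cycles per instruction or the waste metrics described in the HPCToolkit manual can then be computed from these columns.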
19. (HPCToolkit Workflow, continued)
• Analyze the binary with hpcstruct: recover program structure
  - analyze machine code, line map, debugging information
  - extract loop nesting and identify inlined procedures
  - map transformed loops and procedures to source
[workflow diagram, as above]
HPCToolkit Workflow
• Combine multiple profiles
  - multiple threads, multiple processes, multiple executions
• Correlate metrics to static and dynamic program structure
[workflow diagram, as above]
HPCToolkit Workflow
• Presentation
  - explore performance data from multiple perspectives
    • rank order by metrics to focus on what's important
    • compute derived metrics to help gain insight, e.g., scalability losses, waste, CPI, bandwidth
  - graph thread-level metrics for contexts
  - explore evolution of behavior over time
[workflow diagram, as above]
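A sketch of the analysis and correlation stages just described; the hpcstruct usage matches the example given later in these slides, while the hpcprof flags for naming the structure file and the source tree (-S, -I) are assumptions about this version's command line rather than text from the presentation:

  # recover program structure from the optimized binary
  hpcstruct your_app                       # writes your_app.hpcstruct

  # correlate measurements with source and structure, producing a database for hpcviewer
  hpcprof -S your_app.hpcstruct -I /path/to/your/source/+ hpctoolkit-your_app-measurements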
20. (OMPT design objectives, continued)
  - features that may increase overhead are optional
  - define an interface for trace-based performance tools
  - don't impose an unreasonable development burden
    • on runtime implementers
    • on tool developers
OpenMP Tools API Status
• April 2014: OpenMP TR2
  - "OMPT: An OpenMP Tools Application Programming Interface for Performance Analysis," Alexandre Eichenberger (IBM), John Mellor-Crummey (Rice), Martin Schulz (LLNL), et al. http://openmp.org/mp-documents/ompt-tr2.pdf
  - major step toward having a tools API added to the OpenMP standard
• OMPT implementations
  - IBM; Intel prototype; LLVM coming
• Next steps
  - transition the OMPT prototype into Intel for use with their production OpenMP runtime
  - propose OMPT additions to the language standard
Analyzing MPI+OpenMP with OMPT (Demo)
• AMG2006: 4 MPI ranks x 8 OpenMP threads (+ 3 helper threads)
[hpctraceviewer screenshot: AMG2006 trace, time range 0.0s to 7.021s, ranks 0 to 3; call paths include monitor_main, main, hypre_PCGSetup, hypre_BoomerAMGSetup, hypre_BoomerAMGCreateS, __kmp_fork_barrier, __kmp_hyper_barrier_release, and __kmp_wait_sleep]
21. (Challenges for Pinpointing Scalability Bottlenecks, continued)
  - pragmatic constraints: acceptable data volume; low perturbation for use in production runs
Performance Analysis with Expectations
• You have performance expectations for your parallel code
  - strong scaling: linear speedup
  - weak scaling: constant execution time
• Put your expectations to work
  - measure performance under different conditions, e.g., different levels of parallelism or different inputs
  - express your expectations as an equation
  - compute the deviation from expectations for each calling context
    • for both inclusive and exclusive costs
  - correlate the metrics with the source code
  - explore the annotated call tree interactively
Pinpointing and Quantifying Scalability Bottlenecks
[formula: scaling-loss coefficients for analysis of weak scaling; a reconstruction follows this page]
Scalability Analysis Demo
• Code: University of Chicago FLASH
• Simulation: white dwarf detonation
• Platform: Blue Gene/P
• Experiment: 8192 vs. 256 processors
• Scaling type: weak
[figures: Orszag-Tang MHD vortex, Rayleigh-Taylor instability, helium burning on neutron stars, magnetic Rayleigh-Taylor; courtesy of the FLASH Team, University of Chicago]
Scalability Analysis of FLASH (Demo)
[hpcviewer screenshot: FLASH white dwarf, IBM BG/P, weak scaling 256 -> 8192; Driver_initFlash.F90 and local_tree_build.F90 source panes]
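The formula on the "Pinpointing and Quantifying Scalability Bottlenecks" slide did not survive extraction. The following is a hedged reconstruction of the kind of expression HPCToolkit's scalability analyses use, derived from the expectations stated above rather than copied from the slide. Let $C_P(c)$ be the (inclusive or exclusive) cost of calling context $c$ in a run on $P$ processors and $T_Q$ the total cost of the run on $Q > P$ processors. Weak scaling predicts $C_Q(c) = C_P(c)$ and strong scaling predicts $C_Q(c) = (P/Q)\,C_P(c)$, so the fraction of the larger run lost to poor scaling in context $c$ can be written as
\[
  L_{\mathrm{weak}}(c) \;=\; \frac{C_Q(c) - C_P(c)}{T_Q},
  \qquad
  L_{\mathrm{strong}}(c) \;=\; \frac{C_Q(c) - \tfrac{P}{Q}\,C_P(c)}{T_Q}.
\]
Annotating every node of the calling context tree with $L(c)$ is what lets hpcviewer rank contexts by how much scaling loss they cause.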
22. (Collecting Performance Data, continued)
  - to mpiexec -np 4 hpcrun -e REALTIME@1000 -e OMP_IDLE -t your_program arguments
Digesting your Performance Data
• Use hpcstruct to reconstruct program structure
  - e.g., hpcstruct your_app creates your_app.hpcstruct
• Correlate measurements to source code
  - hpcprof: use on a workstation to analyze data from modest runs
  - hpcprof-mpi: use on a cluster's compute nodes to analyze data in parallel from lots of nodes/threads
Analysis and Visualization
• Use hpcviewer to open the resulting database
  - warning: the first time you graph any data, it will pause to combine info from all threads into one file
• Use hpctraceviewer to explore traces
  - warning: the first time you open a trace database, the viewer will pause to combine info from all threads into one file
• Try out our user interfaces before collecting your own data
  - example performance data at http://hpctoolkit.org/examples.html
Monitoring Large Executions
• Collecting performance data on every node is typically not necessary
• Can improve the scalability of data collection by recording data for only a fraction of processes
  - set the environment variable HPCRUN_PROCESS_FRACTION
  - e.g., to collect data for 10% of your processes, set HPCRUN_PROCESS_FRACTION=0.10 (see the sketch after this page)
Tuning Recipe for MPI+OpenMP I
Get the large-scale MPI parallelization right first
• Use an appropriate domain decomposition
  - balance load
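Putting the commands on this page together for a larger job; the launcher, process counts, binary name, and bash syntax are illustrative, and running hpcprof-mpi under the job launcher with the same -S flag as hpcprof is an assumption about this version rather than text from the slides:

  # record data for roughly 10% of the MPI processes of a large run
  export HPCRUN_PROCESS_FRACTION=0.10
  mpiexec -np 1024 hpcrun -e REALTIME@1000 -e OMP_IDLE -t ./your_program arguments

  # post-process on the cluster: recover structure, then analyze the measurements in parallel
  hpcstruct your_program
  mpiexec -np 32 hpcprof-mpi -S your_program.hpcstruct \
          hpctoolkit-your_program-measurements   # measurements directory name may include a job id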
23. (Outline, continued)
• Putting it all together: analyze an execution of a DRTM code, 48 MPI ranks x 6 OpenMP threads
• Ongoing work and future plans
• For your reference: getting and using HPCToolkit
Rice University's HPCToolkit
• Employs binary-level measurement and analysis
  - observe fully optimized, dynamically linked executions
  - support multi-lingual codes with external binary-only libraries
• Uses sampling-based measurement
  - avoid instrumentation: controllable overhead, minimize systematic error, and avoid blind spots
  - enable data collection for large-scale parallelism
• Collects and correlates multiple derived performance metrics
  - diagnosis typically requires more than one species of metric
• Associates metrics with both static and dynamic context
  - loop nests, procedures, inlined code, calling context
• Supports top-down performance analysis
  - identify costs of interest and drill down to causes: up and down call chains, and over time
HPCToolkit Workflow
[workflow diagram: compile & link -> profile execution (hpcrun) -> binary analysis (hpcstruct) -> interpret profile / correlate with source (hpcprof, hpcprof-mpi) -> database -> presentation (hpcviewer, hpctraceviewer)]
HPCToolkit Workflow
[workflow diagram, as above]