Contents

1. double start, finish, middle;
   middle = 0.0;
   try {
       InitKeLP(argc, argv);
       KelpConfig(CONTIG_MSG_IN_PLACE, TRUE);
       int L, M, N, chk_freq, niter, reps, si, sj, gi, gj;
       double eps;
       cmdLine(argc, argv, L, M, N, eps, niter, chk_freq, reps, si, sj, gi, gj);
       Region3 domain(1, 1, 1, N, N, N);

       // Print header information
       OUTPUT("rb3D run on P = " << mpNodes() << " nodes with N = " << N << endl);
       OUTPUT("Processors geometry: " << gi << " x " << gj << " x " << mpNodes()/(gi*gj)
              << "; Blocking factors: " << si << " x " << sj << endl);
       OUTPUT("Iter: " << niter << "  Reps: " << reps << endl);
       if (chk_freq < niter)
           OUTPUT("Convg check every " << chk_freq << " Iterations");

       // Allocate space for the local part of the problem.
       // Use the dock library to help with the partitioning.
       const Region3 STRIP_V(1, 1, 1, gi, gj, mpNodes()/(gi*gj));
       Processors3 P(STRIP_V);
       OUTPUT(P << endl);
       Decomposition3 T(domain);
       T.distribute(BLOCK1, BLOCK1, BLOCK1, P, SputnikTimes);
       OUTPUT(T << endl);
       T.addGhost(1);
       Manhattan grid(T);
       IrregularGrid3 rhs(T);

       // Initialize the local grid
       InitGrid(grid, rhs);
       IrregularGrid3* U = &grid;
       double stop = 1.0;
       double* times    = new double[reps];
       double* timesLoc = new double[reps];
       const int RED = 0, BLK = 1;

       // Do the computation
2. DecompositionX(const RegionX& R) : _domain(R) { }
   DecompositionX(const DecompositionX& D);
   DecompositionX& operator=(const DecompositionX& D);

   // simple access functions
   const PointX& distributionRules() const { return _distType; }
   const RegionX& domain() const { return _domain; }
   int domainEmpty() const { return _domain.empty(); }
   int domainLower(const int dim) const { return _domain.lower(dim); }
   int domainUpper(const int dim) const { return _domain.upper(dim); }
   int domainExtents(const int dim) const { return _domain.extents(dim); }
   int domainSize() const { return _domain.size(); }

   // query functions about the virtual processor array
   int pLower(const int dim) const { return _Map.lower(dim); }
   int pUpper(const int dim) const { return _Map.upper(dim); }
   int pExtents(const int dim) const { return _Map.extents(dim); }
   int pMap(const PointX& P) const { return _Map(P); }

   // query functions about the global index domain
   int pIndex(const int dim, const int gIndex) const;
   int pOwner(const PointX& P) const;
   RegionX pRegion(const RegionX& R) const;

   const XObjectX& operator()(const int i) const { return FloorPlanX::operator()(i); }
   const XObjectX& operator()(const PointX& P) const { return (*this)(pMap(P)); }

   void setowner(const int i, const int proc) { FloorPlanX::setowner(i, proc); }
   void setowner(const PointX& P, const int proc) { FloorPlanX::s
3.     cout << "TEST MODE ON\n";
       if (maxThr[0] != 0 && maxThr[1] != 0)
           cout << "MAX THREADS SET INDIVIDUALLY\n";
       else
           cout << "MAX THREADS SET AS A GROUP\n";
   }
   else if (testSMPS && (numThreads > 0 || numThr[0] > 0) && maxThreads == 0) {
       cout << "TEST MODE ON\n";
       if (numThr[0] != 0 && numThr[1] != 0)
           cout << "SPECIFIC NUM THREADS SET INDIVIDUALLY\n";
       else
           cout << "SPECIFIC NUM THREADS SET AS A GROUP\n";
   }
   else {
       cout << "FIXED NUMBER OF THREADS\n";
       if (numThr[0] != 0 && numThr[1] != 0)
           cout << "SPECIFIC NUM THREADS SET INDIVIDUALLY\n";
       else
           cout << "SPECIFIC NUM THREADS SET AS A GROUP\n";
   }

   k = MPI_Get_processor_name(procName, &j);
   if (!k)
       cout << "\tname of SMP " << myid << ": " << procName << endl;

   int bestRun;
   double bestTime;
   double final;
   double* bestTimes   = new double[nodes];
   double* staticTimes = new double[nodes];
   int*    bestRuns    = new int[nodes];
   FILE* stats;

   MPI_Barrier(MPI_COMM_WORLD);

   // MPI problem: only the root process gets argv correctly;
   // must broadcast the values to the other processes.
   MPI_Bcast(&testSMPS,   1, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Bcast(&hetero,     1, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Bcast(&numThreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Bcast(&maxThreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Bcast(&maxThr,     2, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Bcast(&numThr,     2,
4. I have not only introduced a method for improved performance, but reduced the complexity of doing multi-tier programming. Most importantly, Sputnik targets heterogeneous clusters. As clusters of multiprocessors age, unless a cluster is completely replaced at high cost (as opposed to being upgraded), being able to utilize an entire cluster, without having an entire node or parts of several nodes remain idle, is desirable. No programmer, researcher, or owner of the cluster will want to waste precious time on a costly, high-maintenance piece of computing hardware. To that end, Sputnik does irregular partitionings and regulates the number of OpenMP threads that are used within each node.

   I.C Organization of the Thesis

   This thesis discusses the structure of, and problems associated with, programming a heterogeneous cluster of multiprocessors. Chapter 2 presents background information on multiprocessors, clusters of multiprocessors, and tools for programming both. Chapter 3 introduces a variety of methods for programming clusters of multiprocessors and also discusses a variety of experiments that I ran, which led up to my conclusion that the API I had in mind would work. Chapter 4 discusses the most important aspect of this thesis and my research: the theory of operation of the Sputnik Model. In addition, chapter 4 also looks at a few implementation details. Chapter 5 discusses the results of the Sputnik API in an application study. Chapter
5.             bestTime = timings[i];
               bestRun = i;
           }
           i++;
       }
       MPI_Barrier(MPI_COMM_WORLD);
       bestTimes[myid] = bestTime;
       bestRuns[myid]  = bestRun;
       for (i = 0; i < nodes; i++) {
           MPI_Bcast((void*)&bestTimes[i], 1, MPI_DOUBLE, i, MPI_COMM_WORLD);
           MPI_Bcast((void*)&bestRuns[i],  1, MPI_INT,    i, MPI_COMM_WORLD);
       }
       if (myid == 0) {
           stats = fopen("StatPut", "w");
           for (i = 0; i < nodes; i++) {
               fprintf(stats, "%f\n", bestTimes[i]);
               cout << "bestRuns node " << i << ": " << bestRuns[i] << endl;
               cout << "bestTimes node " << i << ": " << bestTimes[i] << endl;
           }
           for (i = 0; i < maxThreads + 1; i++)
               fprintf(stats, "Time for %d threads = %f\n", i, timings[i]);
           fclose(stats);
       }
       MPI_Barrier(MPI_COMM_WORLD);
       testSMPS = 0;
       omp_set_num_threads(bestRuns[myid]);
       final = SputnikMain(argc, argv, testSMPS, bestTimes);
   }
   else if (testSMPS && (numThreads > 0 || numThr[0] > 0) && maxThreads == 0) {
       if (numThr[0] != 0) {
           omp_set_num_threads(numThr[myid]);
           cout << "SETTING ID THREADS " << myid << ": " << numThr[myid] << endl;
       }
       else {
           omp_set_num_threads(numThreads);
           cout << "SETTING ID THREADS " << myid << ": " << numThreads << endl;
       }
       MPI_Barrier(MPI_COMM_WORLD);
       bestTimes[myid] = SputnikMain(argc, argv, testSMPS, NULL);
       MPI_Barrier(MPI_COMM_WORLD);
       for
6.     for (i = 0; i < nodes; i++)
           MPI_Bcast((void*)&bestTimes[i], 1, MPI_DOUBLE, i, MPI_COMM_WORLD);
       if (myid == 0)
           for (i = 0; i < nodes; i++)
               cout << "bestTimes node " << i << ": " << bestTimes[i] << endl;
       MPI_Barrier(MPI_COMM_WORLD);
       testSMPS = 0;
       final = SputnikMain(argc, argv, testSMPS, bestTimes);
   }
   else {   // testSMPS == 0
       if (numThr[0] != 0)
           omp_set_num_threads(numThr[myid]);
       else
           omp_set_num_threads(numThreads);
       if (hetero) {
           if (myid == 0)
               cout << "TIMES TO PARTITION FOR:\n";
           stats = fopen("StatPut", "r");
           for (j = 0; j < nodes; j++) {
               fscanf(stats, "%lf\n", &staticTimes[j]);
               if (myid == 0)
                   cout << staticTimes[j] << " seconds\n";
           }
           fclose(stats);
           final = SputnikMain(argc, argv, testSMPS, staticTimes);
       }
       else {
           // Run unaltered
           final = SputnikMain(argc, argv, testSMPS, NULL);
       }
   }
   MPI_Barrier(MPI_COMM_WORLD);
   cout << "Final " << myid << ": " << final << endl;
   cout << "Done" << endl;
   cout.flush();
   MPI_Barrier(MPI_COMM_WORLD);

   Appendix C: Source code to redblack3D with Sputnik

   C.A  rb.F

   c **********************************************************
         subroutine rb7rrelax(u, ul0, ul1, ul2, uh0, uh1, uh2)
         integer ul0, ul1, uh0, uh1, ul2, uh2
         double precision u(ul0:uh0, ul1:uh1, ul2:uh2)
   c
   c     perform 7-point red-black relaxation for Poisson's
   c     equation with h = 1.0
   c     Originally written by Stephen J. Fink
   c     Modified b
7.             u(i,j,k) = c*(
       2            u(i-1,j,k) + u(i+1,j,k) +
       3            u(i,j-1,k) + u(i,j+1,k) +
       4            u(i,j,k-1) + u(i,j,k+1) +
       5            c2*rhs(i,j,k))
                 end do
              end do
           end do
         end do
       end do
   C$OMP END DO NOWAIT
   C$OMP END PARALLEL
         return
         end

   C.B  rb3D.C

   // **********************************************************************
   // rb3D.C
   // Program that solves Poisson's equation on a unit cube using Red-Black
   // ordering. (We should never solve Poisson's equation this way; this
   // kernel is intended to be used in multigrid.)
   //
   // This version uses a modified custom MotionPlan to send contiguous
   // messages where possible. It also optimizes communication still
   // further, using the Manhattan class to avoid communicating corner and
   // edge ghost cells; the solver uses a 7-point stencil, so there is no
   // need to send these extra points.
   //
   // If you use this code as a starting point for another application and
   // you need the corner or edge points, do not use the Manhattan class;
   // use an IrregularGrid3 instead. The code may be easily modified to
   // this end:
   //   Replace Manhattan by IrregularGrid3, and be sure to uncomment the
   //   code that sets up the Mover and MotionPlan objects.
   //   Uncomment the following Mover member function calls: execute().
   //   Comment out the following Manhattan member function calls:
   //   fillGhost(), Optimize().
   //
   // Original 2D code was written by Stephen J. Fink.
   // Extensively modified for benchmarking by Scott B. Baden, Department of Com
8. I increased the problem size because I was not finding good scaling with N = 761. Therefore, for the case with 64 and 32 threads, I had N = 949. For 128/64 and 128/96 threads, I used N = 1163. As seen from the timings in table V.7, and especially from the speedups in table V.8, Sputnik performed well with large numbers of threads per system as well, showing more than 34% improvement after repartitioning for either the 64/32 threads case or the 128/64 threads case. As expected, when the number of threads per node narrows, and as the number of threads in total grows, the speedup gains decline, indicated by only a 6.77% improvement with the 128/96 threads case.

   Figure V.4: Important redblack3D timings with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library.

   Threads           Original Compute     New Total             New Compute
   balder   aegir    balder    aegir      balder     aegir      balder     aegir
   64       32       57.1      99.3       74.6       74.6085    71.5       71.8537
   128      64       59.6      108        83.2697    83.7       80.3       75.5464
   128      96       65.6      73.5       72.9498    72.7       65.5       68.841

   Table V.7: Complete redblack3D timings for large numbers of threads per system.

   V.E.4 Anomalies

   There were some strange
9. ...J. and S. B. Baden. "Run-time Data Distribution for Block-Structured Applications on Distributed Memory Computers." CSE Technical Report Number CS94-386, September 1994.

   Foster, Ian and Nicholas T. Karonis. "A Grid-Enabled MPI: Message Passing in Heterogeneous Distributed Computing Systems." SC '98 Conference, Orlando, FL, Nov. 1998.

   Gatlin, K. S. and L. Carter. "Architecture-Cognizant Divide and Conquer Algorithms." SC '99 Conference, Portland, OR, Nov. 1999.

   Kesselman, C. et al. <http://www.globus.org>

   Gropp, W. W. and E. L. Lusk. "A Taxonomy of Programming Models for Symmetric Multiprocessors and SMP Clusters." Proc. Programming Models for Massively Parallel Computers, October 1995, Berlin, pp. 2-7.

   Hill, J. M. D., P. I. Crumpton, and D. A. Burgess. "The Theory, Practice, and a Tool for BSP Performance Prediction Applied to a CFD Application." PRG-TR-4-1996, Oxford University Computing Laboratory, 1996.

   IBM. PowerPC 604e User Manual. <http://www.chips.ibm.com/products/powerpc/chips/604e/604eUM_book.pdf>

   Kleiman, S., D. Shah, and B. Smaalders. Programming with Threads. SunSoft Press.

   Kohn, S. R. "A Parallel Software Infrastructure for Dynamic Block-Irregular Scientific Calculations." Ph.D. dissertation, University of California at San Diego, La Jolla, CA, 1995.

   Lauden, J. and D. Lenowski. "The SGI Origin: A ccNUMA Highly Scalable Server." ISCA.

   Lawrence Livermore National Labs. "Usin
10. Sputnik API if the experiment was successful. After seeing that the program compiled and appeared to run successfully, I had all of the elements in place to write the Sputnik API and begin gathering data on its effectiveness.

    2. KeLP Web Page: <http://www.cse.ucsd.edu/groups/hpcl/scg/kelp>

    Chapter IV: The Sputnik Model and Theory of Operation

    IV.A Introduction

    Using the experience gained from experiments with hand-modifying various combinations of MPI, KeLP, OpenMP, and Pthreads programs, I extended the API and functionality of KeLP in a new set of routines called Sputnik. These routines that I implemented perform two steps that work in tandem to achieve efficiency on heterogeneous clusters. Although the Sputnik Model allows, in theory, for any sort of shared-memory multiprocessor and any sort of optimizations, the Sputnik API has been written with a specific focus. The API has been written with two of the many possible optimizations, which are validated as good optimizations for the redblack3D benchmark in the next chapter. The order of events for the Sputnik Model, on which the API is based, is:

    1. ClusterDiscovery: Runs the kernel of the program repeatedly on each separate node to determine the timings and relative performance, changing program parameters over several runs to determine the configuration that achieves optimal performance.

    2. ClusterOptimization: Using the parameters which the pro
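A minimal sketch of how these two steps fit together is shown below. It is illustrative only: RunKernel stands in for the user-supplied kernel (the role SputnikMain plays in the Sputnik API), and the driver's name and argument list are assumptions for this sketch, not the library's actual interface.

    #include <mpi.h>
    #include <omp.h>

    double RunKernel(int argc, char** argv, int testing, double* nodeTimes); // user-supplied kernel

    // Stage 1 (ClusterDiscovery): time the kernel at each thread count on this node.
    // Stage 2 (ClusterOptimization): re-run with the best thread count and a
    // partition derived from every node's best time.
    double discoverAndOptimize(int argc, char** argv, int myid, int nodes, int maxThreads)
    {
        double best = 1.0e30;
        int bestThreads = 1;
        for (int t = 1; t <= maxThreads; t++) {
            omp_set_num_threads(t);
            double elapsed = RunKernel(argc, argv, 1, 0);
            if (elapsed < best) { best = elapsed; bestThreads = t; }
        }
        double* bestTimes = new double[nodes];
        bestTimes[myid] = best;
        for (int i = 0; i < nodes; i++)                     // share every node's best time
            MPI_Bcast(&bestTimes[i], 1, MPI_DOUBLE, i, MPI_COMM_WORLD);
        omp_set_num_threads(bestThreads);
        return RunKernel(argc, argv, 0, bestTimes);         // final, repartitioned run
    }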
11. best times for each node. This way, not only does each node run with an optimal number of threads per node (it may not be the maximum available), but also with an optimal division of work:

    time_i = SputnikMain(int argc, char** argv, bestTimes)

    Appendix B: Source Code to Sputnik

    B.A  DecompositionX.h.m4

    // *********************************************************************
    // DecompositionX.h.m4
    // Author: Stephen Fink; modified for Sputnik by Sean Peisert
    // Class DecompositionX represents a distributed index space with a
    // regular block decomposition
    // *********************************************************************

    #include "ArrayX.h"
    #include "ProcessorsX.h"
    #include "GhostPlanX.h"
    #include "menagerie.h"

    #define BLOCK1 1
    #define BLOCK2 2

    // *********************************************************************
    // Class DecompositionX is a first-class dynamic template
    // *********************************************************************
    class DecompositionX : public GhostPlanX {
        RegionX _domain;       // Global region of DecompositionX
        PointX  _distType;     // Distribution directive in each dimension
        ArrayX<int> _Map;      // maps virtual proc array to 1-d floorplan
    public:
        // constructors and destructors
        DecompositionX();
        DecompositionX(ArrayIndexArguments) : _domain(PointX(1), PointX(ArrayIndices))
12. combining both technologies to form a highly tuned piece of parallel code can be daunting. For this reason, using a library with multi-tier support built in can make programming an application significantly easier than using its component technologies, MPI and Pthreads, by hand, without sacrificing performance. I will discuss various programming libraries and methods in the next chapter.

    Chapter III: Heterogeneous Multi-Tier Programming

    III.A Previous Work

    Significant progress has already been made in programming homogeneous multi-tier machines, but the issue is still an open problem. Software APIs have been developed which can specifically address a multi-tier machine. In the simplest approach, at least one vendor has implemented MPI on their machines so that if a message is to go to another processor on-node, it gets converted into a native shared-memory call. If it goes off-node, it gets converted into a native messaging-layer call. Finally, if it goes off-machine, it is sent via TCP/IP. A standardized technology exists based on this concept, called Multiprotocol Active Messages, and additionally Sun has implemented similar technology in their version of MPI that runs on the Sun Enterprise 6500 and Enterprise 10000 servers [42]. Other vendors are working on their own implementations [16, 17].

    III.A.1 Multi-Protocol MPI

    Multi-protocol active messages are a technology that has been developed to allow different mes
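The three-way dispatch just described can be pictured with a small sketch. This is conceptual pseudocode only; the function and type names are invented for illustration and do not correspond to any vendor's MPI internals.

    // Hypothetical helpers, declared only so the sketch is self-contained.
    enum Locality { SAME_NODE, SAME_MACHINE, REMOTE };
    Locality localityOf(int destRank);
    void sharedMemoryCopy(int destRank, const void* buf, int bytes);
    void nativeMessagingSend(int destRank, const void* buf, int bytes);
    void tcpIpSend(int destRank, const void* buf, int bytes);

    // Conceptual multi-protocol send path.
    void multiProtocolSend(int destRank, const void* buf, int bytes)
    {
        switch (localityOf(destRank)) {
        case SAME_NODE:    sharedMemoryCopy(destRank, buf, bytes);    break; // on-node: shared memory
        case SAME_MACHINE: nativeMessagingSend(destRank, buf, bytes); break; // off-node: native messaging layer
        case REMOTE:       tcpIpSend(destRank, buf, bytes);           break; // off-machine: TCP/IP
        }
    }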
13. directives around did not produce any parallel speedup at all. Second, it turned out that only when the program's built-in cache tiling mechanism was disabled (by using two specific options, si 1 and sj 1) did the program produce any scaling as well. And then, as I mentioned before, it performed several times worse than KeLP (which uses MPI exclusively) alone. Again, since the repartitioning increased the utilization of the cluster and the speed of the program, I did not feel that this affected the validity of my results. I am confident that experiments on a true commodity cluster of multiprocessors, such as Sputnik was designed for, will resolve the OpenMP scaling and speedup issues.

    Chapter VI: Conclusions and Future Work

    The results that Sputnik produced with the application study of redblack3D indicate strongly that repartitioning the load, as part of the ClusterOptimization step discussed in chapter IV, to balance out the time that each Origin2000 system spends running, so that both nodes finish at the same time, works. The speedup over running unoptimized with uniform partitioning, though not completely linear, is good, and works well for medium-sized problems to very large ones. The most dramatic results came with the example of 48 threads on balder and 24 threads on aegir, where the repartitioning revealed a speedup of 35.7%. This speedup is actually 2.1% better than the equations predict for the theoretical results. This is shown despite OpenMP
14. equal two-dimensional blocks. Thus, each block, or Region, would have one third of the dataset, which would in turn get assigned to a cluster where each node gets only one MPI process. However, a question this thesis addresses is: What happens if each node is not equal in processing power? What if, instead of the processing power of the cluster being as evenly divided as the data is in figure IV.1, node 0 is twice as powerful as the other two nodes, and therefore can run in 10 seconds whatever nodes 1 or 2 can run in 20 seconds? In this case, the partitioning of the dataset should look like the one in figure IV.2. By modifying the DOCK library so that the domain is partitioned non-uniformly, according to how well each node really performs, rather than uniformly, the program can be run on the cluster more optimally. In the previous chapter, I discussed this partitioning scheme for use with the redblack3D version that runs with hand-tuned MPI and Pthreads, and described

    Figure IV.2: Two-dimensional dataset partitioned so that node 0 gets twice as much data to work with as either node 1 or node 2.

    how the relative power of a node can be determined by comparing the inverse of its time with the sum of the inverses of the times, forming a ratio. This ratio is multiplied by the total amount of work available to determine the size of the block to give to a particular node.

    IV.E API Des
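The rule just described is easy to state in code. The sketch below is illustrative only (the helper name computePartition and its arguments are assumptions, not part of the DOCK or Sputnik API); it turns per-node timings into row counts for a one-dimensional strip decomposition.

    #include <vector>

    // Given each node's kernel time, assign rows in proportion to 1/time_i,
    // i.e. rows_i = (power_i / total power) * total rows.
    std::vector<int> computePartition(const std::vector<double>& times, int totalRows)
    {
        double totalPower = 0.0;
        for (size_t i = 0; i < times.size(); i++) totalPower += 1.0 / times[i];

        std::vector<int> rows(times.size());
        int assigned = 0;
        for (size_t i = 0; i < times.size(); i++) {
            rows[i] = static_cast<int>((1.0 / times[i]) / totalPower * totalRows);
            assigned += rows[i];
        }
        // Round-off: hand any leftover rows out one at a time.
        for (size_t i = 0; assigned < totalRows; i = (i + 1) % rows.size()) {
            rows[i]++; assigned++;
        }
        return rows;
    }
    // Example: times {10, 20, 20} and 400 rows gives roughly {200, 100, 100}.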
15. of optimizations, more than just repartitioning or adjusting the number of threads.

    - Working with many other different types of applications.

    Contact Information

    The author can be contacted at the email address peisert@sdsc.edu. This thesis can be downloaded in full, in PDF format, at http://www.sdsc.edu/~peisert.

    Appendix A: User's Guide

    A.A Introduction

    Sputnik runs in two stages. In the first stage, the routines gather information about how the program actually runs on each multiprocessor node. Once it gets timings for each separate node, it calculates the fraction of the workload that each multiprocessor node should run. This ratio looks like this:

        (1 / time_i) / (sum over all nodes k of 1 / time_k) = work_i / work_total

    in a cluster with N multiprocessor nodes. Additionally, the first stage determines the optimal number of OpenMP threads to run per node. This particular optimization, of the ideal number of threads per node to run, is only one possible optimization that we could be doing. Other possible optimizations might include tiling for cache and making predictions about dynamic optimizations that the program might need after the original partitioning. In the second stage, the routines partition the problem based on the calculated fractions and finally make a final run of the program at the peak speed possible based on the chosen partition. Sputnik expands upon KeLP1 by adding to the API two new functions and one new technology, Ope
16. ...on balder and varying numbers of threads on aegir
    V.5  Complete redblack3D timings with 48 threads on balder and varying numbers of threads on aegir
    V.6  Speedup and predicted timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir
    V.7  Complete redblack3D timings for large numbers of threads per system
    V.8  Speedup and predicted timings for redblack3D with large numbers of threads per system

    LIST OF FIGURES
    I.1   Diagram of a heterogeneous cluster of multiprocessor nodes
    II.1  Diagram of a multiprocessor
    II.2  Diagram of a distributed-memory machine
    II.3  Diagram of an SGI Origin2000 (courtesy of SGI's Origin2000 and Onyx2 Performance Tuning and Optimization Guide, IRIX 6.5)
    II.4  Diagram of a cluster of symmetric multiprocessors
    III.1 Hierarchy of software layers for KeLP2
    III.2 redblack3D with heterogeneous partitioning using hand-coded MPI and Pthreads; N = 300, PE0 has 2 threads and PE1 and PE2 have 1 each
    III.3 OpenMP Fork-Join Example
    III.4 OpenMP scalability test code
    III.5 OpenMP Scaling Test Timings on an SGI Origin2000 with 250 MHz Processors
    III.6 OpenMP Scaling Test Speedup for an SGI Origin2000 with 250 MHz Processors
17. optimal number of threads for each node. When it has found the optimal times, it writes them to a file called stats and makes a final run of the program using the optimal number of threads per node and the optimal decomposition.
    - The hetero argument is ignored.

    2. mpirun -np 2 mpiprog testSMPS 1 maxthr0 10 maxthr1 20
    - With a maximum number of 10 threads on node 0 and 20 threads on node 1, Sputnik will test each node with the kernel of the program to determine the optimal number of threads for each node. When it has found the optimal times, it writes them to a file called stats and makes a final run of the program using the optimal number of threads per node and the optimal decomposition.
    - The hetero argument is ignored.

    3. mpirun -np 2 mpiprog testSMPS 0 hetero 0 numthreads 10
    - Sputnik runs on 2 nodes with exactly 10 threads per node, with homogeneous decomposition.

    4. mpirun -np 2 mpiprog testSMPS 0 hetero 0 numthr0 10 numthr1 20
    - Sputnik runs on 2 nodes with exactly 10 threads on node 0 and 20 threads on node 1, with homogeneous decomposition. Use of this set of options is comparable to running with KeLP2.

    5. mpirun -np 2 mpiprog testSMPS 1 hetero 0 numthr0 10 numthr1 20
    - Sputnik runs on 2 nodes with exactly 10 threads on node 0 and 20 threads on node 1, but with heterogeneous decomposition based on a single test run made before the final run.

    6. mpirun -np 2 mpipr
18. that helped shape who I am and bring me to this point. And for a lot of fun. And to Alex, Eric, Kent, Laura, P.J., Steve, and Tage, too. I have such fantastic friends.

    My girlfriend Cathy, who has given me inspiration when things proved to be most frustrating elsewhere.

    Ms. Jean Isaac, with whom I had my first computer class in the First Grade at St. Mark's School in Marin County, California, and who inspired me to explore computers more.

    Mrs. Judy Farrin, my great friend and teacher, who made computers and chemistry at Redwood High School in Marin County, California, fun and interesting by believing in me and letting me explore on my own. And for letting me blow things up in the chemistry lab.

    Dr. Paul H. J. Kelly, Dr. Jarmo Rantakokko, and Dr. Daniel Shalit, for their help and friendship, especially when things were going very awry.

    Uppsala University in Sweden, for generously allowing me use of the Yggdrasil DEC Alpha cluster.

    The National Center for Supercomputing Applications (NCSA) and their extremely knowledgeable and helpful staff, for the extremely generous allocation of computing time on the Silicon Graphics Origin2000 machines there and their help in using them. When I couldn't find any other machines to use, NCSA heroically bailed me out. In particular, I would like to thank Faisal Saied, Louis Hoyenga, Susan John, Yiwu Chen, Roy Heimbach, Scott Koranda, Dave McWilliams, and Michael Pflugmacher.

    UCSD and the UCSD Computer Sci
19. 6 presents my conclusions, along with future work possibilities. The appendices contain source code and present a user's guide to the Sputnik API.

    Chapter II: Clusters of Multiprocessors

    II.A Multiprocessors

    A multiprocessor is a machine with two or more processors that all share the same main memory, as shown in Figure II.1. Some multiprocessors contain processors that are equidistant from the main memory. This is referred to specifically as a symmetric multiprocessor, or SMP, which has uniform memory access, or UMA. By contrast, some multiprocessors, including the SGI Origin2000, have internal networks which cause the processors to have non-uniform memory access, or NUMA, because the amount of time it takes for a processor to retrieve memory from two other processors might differ. The SGI Origin2000 actually has a NUMA variation called cache-coherent NUMA, or ccNUMA. Typically, although main memory is shared on a multiprocessor, each processor in the node is separated from main memory by one, two, or sometimes even three levels of individual cache.

    Multiprocessors are starkly different from their vector supercomputer and massively parallel multicomputer ancestors. First, by definition, a multiprocessor uses shared memory, whereas in a multicomputer all processors have their own separate main memories, making them distributed memory (also called shared-nothing). The SGI Origin2000, the principal machine used in obtaining the re
20. ClusterOptimization component appears to work. This can certainly be extended in the future to function well with problems that may benefit not only from a partitioning or balancing of the problem, but possibly from other optimizations, including cache tiling or focusing specific sections of the program on specific machines that have unique characteristics from which the sections would benefit.

    The idea could certainly also be adapted to work in a dynamic environment as well, where, instead of sampling just once at the beginning, testing and sampling could happen continuously throughout the run of the program to optimally execute long-running programs, tuning throughout the run of the program.

    The Model might also be brought to address nodes of multiple architectures in the same cluster. For example, if we define our cluster to be an SGI/Cray T3E, an SGI/Cray T90, and an IBM SP connected by a high-bandwidth, low-latency network, we will have a phenomenally heterogeneous cluster, or PHC, as opposed to a slightly heterogeneous cluster, or SHC. A problem that is able to make use of all of these machines and their unique characteristics would be rare, but it is entirely possible that a program might have some loops that are easily vectorizable and should best be directed to the T90, and parts of the program that simply should be farmed out to as many processors as possible on the T3E and SP. A possible step for the ClusterDiscovery stage w
21. ...I Fortran and C

                                     balder        aegir
    Processors                       256           128
    Main Memory                      128 GB        64 GB
    Peak Theoretical Performance     128 GFLOPS    64 GFLOPS

    Table V.1: Specifications for the two Origin2000 machines, balder and aegir.

    Although I experimented with a variety of environment variables in many possible combinations, I found the optimal settings for DSM_MIGRATION to be ALL_ON and DSM_PLACEMENT to be ROUND_ROBIN. The system software configurations that I used, and their versions, are shown in table V.2. The two Origin2000 machines were connected by an SGI Gigabyte System Network (GSN) interconnect that supports a maximum bandwidth of 800 MB per second and has a theoretical latency of less than 30 microseconds. Experimental results showed that the actual latency might be much closer to 140 microseconds and the bandwidth less than 100 MB per second.

                         Version    Special Flags
    Operating System     IRIX 6.5
    MIPSpro f77          7.3.1m     -mp -O3 -mips4 -r10000 -64
    MIPSpro CC           7.3.1m     -mp -lmpi -lm -lftn -lcomplex -O3 -r10000 -64
    KeLP                 1.3a

    Table V.2: Software configurations.

    V.C Predicted Results

    Generically, the optimal speedup can be computed from the following equations, where T is total time. For all cases, both with heterogeneous and homogeneous partitioning, the total time for the program is only as fast as the slowest node:

        T = T_slowestnode    (V.1)

    Thus, for a cluster, if one node runs faster than the
22.           MPI_INT, 0, MPI_COMM_WORLD);

    if (maxThr[0] != 0 && maxThr[1] != 0)
        maxThreads = maxThr[myid];
    int maxMax = maxThreads;

    // Get the biggest number of threads on all nodes
    for (i = 0; i < nodes; i++)
        if (maxMax < maxThr[i])
            maxMax = maxThr[i];

    if (testSMPS == 1 && maxThreads > 0) {
        double* timings = new double[maxMax + 1];

        // Initialize values to zero. We can't use zero threads,
        // so we go all the way up to maxThreads.
        for (i = 0; i <= maxMax; i++)
            timings[i] = 0.0;

        int gotit = 0;
        int* a_gotit = new int[nodes];
        int* a_min   = new int[nodes];
        int* a_max   = new int[nodes];
        int* a_iter  = new int[nodes];
        int minMin = 1;
        for (i = 0; i < nodes; i++) {
            bestTimes[i] = 0.0;
            bestRuns[i]  = 0;
            a_gotit[i] = 0;
            a_min[i]   = 0;
            a_max[i]   = 0;
        }

        i = 1;
        // Find the max and min times that the best is in between
        while (i <= maxMax) {
            cout << "PE " << myid << " Running with " << i << " threads\n";
            MPI_Barrier(MPI_COMM_WORLD);
            a_iter[myid] = i;
            for (j = 0; j < nodes; j++) {
                MPI_Bcast((void*)&a_iter[j], 1, MPI_INT, j, MPI_COMM_WORLD);
                assert(a_iter[j] == i);
            }
            omp_set_num_threads(i);

            // Initial run
            timings[i] = SputnikMain(argc, argv, testSMPS, NULL);

            // Store the best time for the first run, or a lower time
            if ((timings[i] < bestTime || i == 1) && !gotit && i <= maxThr[myid]) {
                bestRun = i;
                bestTime = tim
23. Figure III.5: OpenMP Scaling Test Timings on an SGI Origin2000 with 250 MHz Processors.

    Figure III.6: OpenMP Scaling Test Speedup for an SGI Origin2000 with 250 MHz Processors.

    Threads   Node 0    Node 1    Speedup
    32        1.8800    0.2417    7.8182
    16        1.8867    0.2611    7.2263
    8         1.8807    0.3280    5.7342
    4         1.8733    0.5378    3.4833
    2         1.8768    0.9297    2.0186
    1         1.8677    1.7462    1.0696

    Table III.2: OpenMP Scaling Test Speedup for an SGI Origin2000 with 250 MHz Processors.

    had an advantage in ease of programming.

    III.C.3 KeLP and OpenMP

    Finally, because MPI, one of the component technologies that KeLP is built on, functioned properly with OpenMP, I assumed that KeLP would function properly with OpenMP. However, prior to adding an API to KeLP directly, I decided to test both KeLP and OpenMP's compatibility, and also the efficiency of programs built using both technologies. I went back to rb3D this time, but instead of starting with a hand-coded MPI and Pthreads version, I began with a KeLP version, then added OpenMP directives and code to have the two interact, which would later be replaced by calls directly to the new
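The actual scalability test appears as figure III.4 of the thesis and is not reproduced in this excerpt; the following is only a minimal sketch of a test in the same spirit, timing one OpenMP-parallel smoothing sweep at several thread counts. The array size and loop body are arbitrary choices for illustration.

    #include <omp.h>
    #include <vector>
    #include <cstdio>

    // Time one parallel sweep over a large array with the current thread count.
    static double runOnce(std::vector<double>& a)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 1; i < (long)a.size() - 1; i++)
            a[i] = 0.5 * (a[i - 1] + a[i + 1]);   // simple 1-D smoothing step
        return omp_get_wtime() - t0;
    }

    int main()
    {
        std::vector<double> a(1 << 24, 1.0);
        for (int threads = 1; threads <= 32; threads *= 2) {
            omp_set_num_threads(threads);
            std::printf("%2d threads: %f seconds\n", threads, runOnce(a));
        }
        return 0;
    }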
24. ...R. B. Frost, and D. Shalit. KeLP User Guide, Version 1.3. Department of Computer Science and Engineering, University of California, San Diego, September 25, 1999.

    Baden, S. B. and S. J. Fink. "The Data Mover: A Machine-independent Abstraction for Managing Customized Data Motion." LCPC '99, August 1999.

    Baden, S. B. and S. J. Fink. "Communication overlap in multi-tier parallel algorithms." SC98, Orlando, FL, Nov. 1998.

    Baden, S. B. and S. J. Fink. "A Programming Methodology for Dual-Tier Multicomputers." Submitted for publication.

    Bader, David A. and Joseph JaJa. "SIMPLE: A Methodology for Programming High Performance Algorithms on Clusters of Symmetric Multiprocessors (SMPs)." Tech. Rep. CS-TR-3798, Univ. of Maryland Inst. for Advanced Computer Studies, Dept. of Computer Sci., Univ. of Maryland, May 1997.

    Barney, Blaise M. "POSIX Threads Programming." Maui High Performance Computing Center, October 29, 1998.

    Becker, D. J., T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawake, and C. V. Packer. "Beowulf: A Parallel Workstation for Scientific Computation."

    Cappello, F. and O. Richard. "Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model." PACT '99, July 20, 1999.

    Carter, L. "Single Node Optimization." NPACI Parallel Computing Ins
25. UNIVERSITY OF CALIFORNIA, SAN DIEGO

    A Programming Model for Automated Decomposition on Heterogeneous Clusters of Multiprocessors

    A thesis submitted in partial satisfaction of the requirements for the degree Master of Science in Computer Science

    by

    Sean Philip Peisert

    Committee in charge:
    Professor Scott B. Baden, Chair
    Professor Larry Carter
    Professor Jeanne Ferrante
    Professor Sidney Karin

    2000

    Copyright Sean Philip Peisert, 2000. All rights reserved.

    The thesis of Sean Philip Peisert is approved, and it is acceptable in quality and form for publication on microfilm.

    University of California, San Diego
    2000

    Dedicated to my parents and grandparents, who raised me wonderfully and gave me everything I needed to figure out how to succeed in life, including my Opa, who is no longer here to see me present this thesis, but who first set me in front of a typewriter, an experience without which I would probably never have taken to the computer, much less the supercomputer.

    "It is an old maxim of mine which states that once you have eliminated the impossible, whatever remains, however improbable, must be the truth."
    Sherlock Holmes, The Sign of Four

    TABLE OF CONTENTS
    Signature Page
    Dedication
    Epigraph
    Table of Contents
    List of Tables
    List of Figures
26. // We actually do it twice: once with communication and once without.
    if (mpNodes() > 1 && testSMPS) {
        for (int k = 0; k < reps; k++) {
            STATS_RESET();
            STATS_START(STATS_ITERATION);
            if (k == 0)
                start = MPI_Wtime();
            for (int i = 1; i <= niter; i++) {
                // Exchange boundary data with neighboring processors
                U->fillGhost();
                // Perform the local smoothing operation on the RED points
                ComputeLocal(U, si, sj, RED, rhs);
                // Exchange boundary data with neighboring processors
                U->fillGhost();
                // Perform the local smoothing operation on the BLK points
                ComputeLocal(U, si, sj, BLK, rhs);
            }
            if (k == 0) {
                finish = MPI_Wtime();
                middle = finish - start + middle;
            }
            STATS_STOP(STATS_ITERATION);
            times[k] = STATS_TIME(STATS_ITERATION);
        }
        if (testSMPS)
            cout << "SPUTNIK TIME WITH COMMUNICATION: " << middle << endl;
        middle = 0.0;
        for (int k = 0; k < reps; k++) {
            STATS_RESET();
            STATS_START(STATS_LOCAL);
            if (k == 0)
                start = MPI_Wtime();
            for (int i = 1; i <= niter; i++) {
                // Perform the local smoothing on the RED points
                ComputeLocal(U, si, sj, RED, rhs);
                // Perform the local smoothing on the BLACK points
                ComputeLocal(U, si, sj, BLK, rhs);
            }
            STATS_STOP(STATS_LOCAL);
            timesLoc[k] = STATS_TIME(STATS_LOCAL);
            if (k == 0) {
                finish = MPI_Wtime();
                middle = finish - start + middle;
            }
27. ad something to do with memory distribution as well as thread distribution, having all threads spawned on the same processor. On the Origin2000, threads and memory can be distributed throughout the system. There may be processors spawned on one part and memory placed on another using OpenMP. Despite memory and thread migration and round-robin memory distribution, having all the OpenMP threads realize good memory access speed does not seem to be trivial [59].

    Due to the fact that the program did improve with greater numbers of threads, the thread-distribution idea could be discounted. What was left was memory distribution, and this was never solved. I decided in the end, though, that since Sputnik demonstrated the results of my thesis (that performance results could be analyzed, and that action could be taken; in the case of Sputnik, adjusting the number of threads per system and repartitioning the dataset according to processing power), my thesis had been proven.

    The Sputnik API is currently designed to work with a commodity cluster with a good MPI and OpenMP implementation. It will take more work on other vendors' systems, with other vendors' implementations of OpenMP, to determine exactly what the cause of the problems with OpenMP on the Origin2000 is.

    Another problem, presumably related to the OpenMP issues as well, had to do with scaling. First, it turned out that the loop that was most obviously the one to put the OpenMP
28. ads and hetero arguments, and that it should test the speed of each individual multiprocessor node before making a final run with the optimal number of threads per node. If equal to the integer 0, the hetero argument is checked. If hetero is equal to 1, then the stats file is read for its timing information, and threads are allocated to the nodes based on the timings from stats and the value of maxthreads. If hetero is equal to 0, then the stats file and maxthreads are ignored, and numthreads threads per node are used, rather than reading the timings from the stats file. All other conditions will produce an error.

    2. hetero: Whether or not to run heterogeneously, based on the stats file, if the multiprocessor nodes are not tested individually.

    3. maxthreads: If given, the maximum number of threads to be used on each node. If not given, a default value is used.

    4. numthreads: The number of threads per node to allocate if testSMPS is 0 (false).

    A.C Examples of Command-Line Options

    An example of each possible running mode with the Sputnik command-line arguments follows. Each example runs an MPI program, mpiprog, with a certain number of threads on two multiprocessor nodes.

    1. mpirun -np 2 mpiprog testSMPS 1 maxthreads 10
    - With a maximum number of 10 threads on both nodes, Sputnik will test each node with the kernel of the program to determine the
29. anomalies encountered. In fact, all of the results show that Sputnik gives a speedup within 5.2% of the theoretical optimum, and the majority of the results are within 2%. Some, in fact, are up to 2.3% better than the theoretical optimum. Figure V.5 shows graphically how close Sputnik really comes to perfect speedup with the available optimizations.

    An application written with KeLP can be converted to Sputnik very easily, as long as a good OpenMP implementation exists and the kernel that one is trying to parallelize can, in fact, be taken and optimized well by OpenMP. As noted by other researchers, this is not always possible [21, 44, 45]. This makes code for heterogeneous clusters almost as easy to program as for homogeneous clusters by using Sputnik instead of KeLP, which was one of my primary goals.

    Following the Sputnik Model, the Sputnik API library could certainly be adapted to work with different component technologies. For instance, instead of KeLP1 and OpenMP, it could be built on top of one of the already existing multi-tier APIs described in my related work section. The intent would be to continue to make developing Sputnik-based scientific code supporting heterogeneous clusters of multiprocessors even easier than Sputnik currently provides, while still achieving at least the speedup that was demonstrated on the Origin2000s at NCSA.

    Regardless of the component technologies used, however, the idea of a ClusterDiscovery and
30. before, the program only runs as fast as the slowest system. Therefore, the time for the slowest system is also the time for the whole program. The goal of Sputnik is to make the program run faster. A side benefit of this is that it also increases machine utilization, by not having a system or systems remain idle while waiting for the slower system or systems to catch up. In table V.3 we can see the original, unmodified run with 16 threads on aegir causing balder to remain idle for up to 40 seconds while waiting for aegir to catch up.

    Figure V.1: Important redblack3D timings with 32 threads on balder and varying numbers of threads on aegir.

    As can be seen clearly from figure V.2, Sputnik does provide a speedup over the version of the program without the heterogeneous Sputnik partitioning. As expected, the speedup tapers off as the numbers of threads per system become closer together on the pair of Origin2000s. Not only does Sputnik provide a speedup, but, as shown, it demonstrates improved system utilization, because one system does not remain idle for nearly as long as it originally had. At its best, Sputnik shows a 34.9% speedup when 32 threads are used on balder and 16 thread
31. blocks. Block decomposition facilitates the implementation of features including tiling for cache and repartitioning. The Sputnik Model is different from the Sputnik API. In the Sputnik Model, I discuss how an API like KeLP can discover resources and act appropriately to refine the program to run heterogeneously. In my API, I have made specific choices and assumptions about what to implement. Although, as I have mentioned, many possible optimizations could have been made, I specifically chose repartitioning and thread adjustment as methods for optimizing for heterogeneous clusters.

    IV.D.2 Heterogeneous Partitioning

    KeLP provides a mechanism for doing automated decomposition of a Grid object into Domain objects, and further into FloorPlan objects, which include assignments of each Region to processors. This mechanism is contained in a library called DOCK (DecOmposition Classes for KeLP). Among other uses, DOCK will take a Grid and partition it into equal-size Regions, with slight tolerance and variation when the number of Regions to divide the Grid into does not evenly divide the number of columns or rows (indices) to be partitioned. Since a Region or Grid can be multi-dimensional, DOCK can partition in multiple dimensions as well. Roughly, the one-dimensional partitioning of a two-dimensional Grid into three two-dimensional Regions might look like the partitioning in figure IV.1.

    Figure IV.1: Two-dimensional dataset partitioned into three
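The uniform case just described can be written down in a few lines. The helper below is an illustrative sketch only, not the DOCK library's code: it splits N rows into P near-equal contiguous strips, spreading any remainder one extra row at a time, which is the "slight tolerance and variation" mentioned above.

    #include <utility>
    #include <vector>

    // Split rows 1..N into P contiguous strips of nearly equal size.
    // Returns (lowRow, highRow) for each of the P Regions.
    std::vector<std::pair<int, int> > uniformStrips(int N, int P)
    {
        std::vector<std::pair<int, int> > strips;
        int base = N / P, extra = N % P, low = 1;
        for (int p = 0; p < P; p++) {
            int rows = base + (p < extra ? 1 : 0);   // first 'extra' strips get one more row
            strips.push_back(std::make_pair(low, low + rows - 1));
            low += rows;
        }
        return strips;
    }
    // Example: uniformStrips(10, 3) yields (1,4), (5,7), (8,10).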
32. d method for many current cutting-edge clusters of multiprocessors, including the IBM Blue Horizon machine at the San Diego Supercomputer Center and ASCI Blue Pacific at Lawrence Livermore National Labs [37, 62].

    ...finely tuned implementation of MPI for running optimally on their hardware, but there are freely available versions of MPI as well, including MPICH [7]. Of further interest beyond MPI, however, was to be able to use a message-passing library that supports both heterogeneity and block decomposition, and that also assists in hiding most of the details of heterogeneity, especially data decomposition. KeLP is a library that supports these requirements and has its communication mechanisms built on top of MPI [10].

    The experiments that led up to my combination of OpenMP (a thread library, easier to use than Pthreads, that is programmed with compiler directives and API calls) and KeLP follow. The first question I attempted to answer in the following experiments was whether or not the idea of repartitioning the data helped address the heterogeneous cluster. I did this by first working with a benchmark built with MPI and Pthreads, combined to work on a multi-tier machine. The second question I wanted to answer was whether MPI and OpenMP, the two underlying technologies that I ultimately wanted to use, were interoperable. At the same time, I wanted to assess the scalability of OpenMP on the Origin2000. Finally, I tested KeLP and OpenMP tog
33. d well. In this case, I varied the number of threads on aegir from 24 up to 48. Table V.5 shows the complete list of timings, as does figure V.3. The important timings, showing only the slowest system from each run, are indicated in figure V.4. The speedup, which is very good, is shown in figure V.5 and table V.6. In this case, the results for running with 48 threads on balder and 24 threads on aegir are even slightly better than the case with 32 and 16, showing 35.7% speedup. As with the runs with a 32-thread maximum per system, these runs also differed from the predicted runs somewhat, but to a lesser extent. In the cases when I ran with 48 threads per system, at best Sputnik produced results 1.67% better than predicted, and at worst it produced results 1.03% worse than predicted.

    Figure V.3: Complete timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library.

    V.E.3 Large numbers of threads per system

    With the very large numbers of threads per Origin2000 system
34. ded MPI and Pthreads; N = 300, PE0 has 2 threads and PE1 and PE2 have 1 each.

    III.C.2 MPI and OpenMP

    Following successful results with MPI and Pthreads, I combined MPI and OpenMP in a single program. The idea, if successful, would make it easier for programmers to create multi-tier programs, since OpenMP is inherently easier to program with than Pthreads, due to its higher-level constructs. I was not certain, however, whether thread binding in OpenMP and compatibility with MPI would function properly on the Origin2000 system. Also, based on the fork-join model that OpenMP is built on, as opposed to the parked-threads concept that I discussed earlier in relation to Pthreads, I was not sure that the MPI/OpenMP combination would work with all of the different kernels I wanted it to. A fork-join model has the threads fork at the beginning of the parallel region, declared with compiler directives, and join at the end. Thus, if the parallel region is called many times, there might be a significant amount of overhead involved in creating and cleaning up threads. Because of the way some kernels are written, I suspected that there might be problems with this fork-join model. For instance, a tiny amount of computation in a Fortran kernel, which is itself inside many nested C loops, would cause a problem, because the machine would be inefficient due to the cost of forks and joins at the beginning of the parallel region inside the Fortran code with each itera
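The overhead concern can be seen in a small sketch (illustrative only; the loop bounds and body are arbitrary): entering the parallel region inside the outer loop forks and joins a thread team on every iteration, whereas hoisting the region outside pays that cost once.

    #include <omp.h>

    // One red-black sweep per outer iteration (a 1-D analogue of redblack3D).
    void relaxForkEachTime(double* u, int n, int iters)
    {
        for (int it = 0; it < iters; it++) {        // fork/join on every iteration
            #pragma omp parallel for
            for (int i = 2; i < n - 1; i += 2)      // "red" points
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
            #pragma omp parallel for
            for (int i = 1; i < n - 1; i += 2)      // "black" points
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }
    }

    void relaxForkOnce(double* u, int n, int iters)
    {
        #pragma omp parallel                        // fork once; reuse the team
        for (int it = 0; it < iters; it++) {
            #pragma omp for
            for (int i = 2; i < n - 1; i += 2)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
            #pragma omp for
            for (int i = 1; i < n - 1; i += 2)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);
        }
    }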
35. e processes even need to communicate with each other. One wants to make sure, for instance, that the processes or threads are as close as possible to the others that they communicate with. Whereas a multi-tier API can ensure an order to the distribution of the processes or threads, a scattering of MPI processes cannot. Culler [42], Fink and Baden [24, 26], and Fink [23] employ higher-dimensional partitionings to solve this problem.

    Finally, and most importantly, although multiprotocol MPI can use all the processors, and thus make some heterogeneous clusters function as if they are homogeneous, this is only true when all processors are the same speed. If the processors are all different speeds, then merely utilizing all the processors and the shared-memory hardware is not enough. One needs to make sure that slower processors are processing less data. Although multiprotocol MPI is reasonable for a programmer to use to quickly run existing MPI-based parallel software on a cluster of multiprocessors and realize improved performance, by starting from the ground up with a multi-tier program written in Sputnik, or migrated from KeLP to Sputnik, the programmer can avoid the problems that I have just mentioned.

    III.A.2 Multi-Tier APIs

    SIMPLE

    SIMPLE is an API that is based on two lower-level technologies [14] and primarily addresses collective communication. One is a message-passing layer and the other is an SMP node layer. The principle req
36. e node has the same percentage of work to be done, relative to the total amount of work, as the power of the node is, relative to the total power of the cluster. Thus

        work_i / work_total = P_i / P_total = R_i    (V.9)

    and so, finally, the amount of work we give to each node is

        work_i = R_i * work_total    (V.10)

    In the case of round-off issues, some work chunks will have 1 added to them to make sure all data is accounted for. Plugging this back into our original equation for predicting the new time,

        T_i,new = T_i,orig * (new amount of data for node i) / (original amount of data for node i)    (V.11)
                = T_i,orig * work_i / work_i,orig    (V.12)
                = T_i,orig * (R_i * work_total) / work_i,orig    (V.13)
                = T_i,orig * (P_i / P_total) * (work_total / work_i,orig)    (V.14)

    With the original uniform partitioning, work_i,orig = work_total / N, and the power of node i is P_i = 1 / T_i,orig, so

        T = T_i,orig * (1 / T_i,orig) / (sum over k = 0..N-1 of 1 / T_k,orig) * N    (V.15, V.16)
          = N / (sum over k = 0..N-1 of 1 / T_k,orig)    (V.17)

    V.D Experiments

    The purpose of each of the experiments that I ran was to determine how close to the optimal time the run could come by repartitioning with the Sputnik library, regardless of the number of threads actually used or the size of the problem. The experiment is designed to establish, artificially, various levels of heterogeneity, in order to detect any sensitivity in Sputnik. The results show
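As a worked instance of the final prediction (the numbers here are chosen only for illustration, not taken from the thesis's measurements): with N = 2 nodes whose original times under uniform partitioning are T_0,orig = 40 s and T_1,orig = 60 s, equation (V.17) predicts T = 2 / (1/40 + 1/60) = 48 s, so after repartitioning both nodes are expected to finish together in about 48 seconds instead of being held to the slower node's 60 seconds.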
37. ed that Sputnik was insensitive to this parameter and so the degree of heterogeneity as long as no node is less than half as fast as any other node is not relevant As I have said Sputnik can find the optimal number of OpenMP threads per system to use Alternatively these can be manually and individually set Principally I manually set the amount of OpenMP threads per system so as to focus more on finding the right partition than finding the right number of threads per system I ran with many different configurations of threads per system however so in effect I was able to manually try to find the optimal number of threads per system The reason for this is that using a system with 128 256 threads per system one might have to do 30 or more runs to find the optimal number of threads per system and the time on the Origin2000 s was not available Additionally do to scaling issues I did not use the full number of proces sors on each Origin2000 By manually setting the number of threads I created a virtual cluster The virtual cluster had the effect of simulating a heterogeneous cluster since the full Origin2000 cluster could not be used and no other commodity 47 clusters as the Sputnik API was designed for were available To that end I ran with several different values First I fixed the size of the problem to N 761 761x761x761 unknowns and the number of threads on balder to 32 I then ran redblack3D once for each diff
38. edup for an SGI Origin2000 with 250 MHz IP TOCESSOLS A ce eats eee sank doe eA ee a ed he Nes bt ee e 31 IV 1 Two dimensional dataset partitioned into three equal two dimensional E A E cod Se a ee i 38 IV 2 Two dimensional dataset partitioned so that node 0 gets twice as much data to work with as either node 1 or node 2 39 V 1 Important redblack3D timings with 32 threads on balder and vary ing numbers of threads on aegir 49 V 2 Speedup for redblack3D with 32 threads on balder and varying num bers of threads on aegir using the Sputnik library 50 V 3 V 4 V 5 A l Complete timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir using the Sputnik library Important redblack3D timings with 48 threads on balder and vary ing numbers of threads on aegir using the Sputnik library Speedup for redblack3D with 48 threads on balder and varying num bers of threads on aegir using the Sputnik library Hierarchy of software layers for Sputnik Xl 52 ACKNOWLEDGEMENTS I would like to thank the following people and institutions for their generous sup port and encouragement during my life and the research and writing process of this thesis All my friends especially my best friend Stephen Shore for many enlight ening discussions about life the universe everything and beyond for the past twelve years
39. efore the solution is to partition the dataset in a way that the nodes would each finish their runs at the same time 4 The problem of how to partition the dataset to do this was then created 5 The solution to the partitioning I decided was to find the relative speeds of each multiprocessor node and calculate the fraction of the power of the whole cluster that each individual multiprocessor node had Then one would assign the same fraction of the dataset to a node as the fraction of power of the node has in the whole cluster Power for a node is defined to be the inverse of the time that a node takes to run a benchmark relative to the sum of the inverses of the timings of the same benchmark on every node in the cluster III B 2 Requirements As I have discussed existing programs that use a hybrid of a message passing and shared memory model as tools to program multi tier machines do not work effectively on heterogeneous machines The feasibility studies I made and how they were modified in some cases from existing programs to support heterogeneous clusters follow in the next section An MPI based messaging library was an obvious choice to use as the inter node messaging component The reason is that MPI is standardized across all of the multiprocessor platforms with each vendor creating their own implementation adhering to the MPI standard 6 Not only does each vendor specifically have a The currently recommende
40. ence and Engineering department for five and a half great years of college and graduate education and a very good time The San Diego Supercomputer Center SDSC and the National Part nership for Advanced Computing Infrastructure NPACI and so many people there I certainly can t list every person who has been so wonderful in their friend ship and support but some of the people who have been there most have been Jeanne Aloia Harry Ammons Mike Bailey Chaitan Baru Gracie Cheney Parsons Cheryl Converse Rath Sandy Davey Richard Frost Jon Genetti Anke Kamrath Sid Karin Greg Johnson Dan Jonsson Amit Majumdar Yuki Marsden Jen nifer Matthews Steve Mock Reagan Moore Arcot Rajasekar Wayne Schroeder Teri Simas Allan Snavely Ken Steube Shawn Strande Peggy Wagner and Bill Zamora Without my first internship at SDSC doing MacSupport I never would have become involved with supercomputers and be where I am today Everyone in the PCL especially Shelly Strout for her friendship and mak ing writing a thesis more fun and helping to keep me sane My fantastic professors especially Professor Larry Carter with his en lightening challenging interactive classes that made algorithms and performance programming fun Finally Professor Scott Baden my friend and thesis advisor who gave me the opportunity to explore parallel and scientific computation and write a Master s thesis Despite many other life experiences I had never done any
41. ent in sending large frequent communications over a network with high con tention and varying latencies He proposes that collective MPI operations such as MPI Reduce might well first reduce within each SMP node then within each MPP and finally across MPPs 28 Non Uniform 2 D Grid Partitioning Crandall investigates different partitioning schemes for heterogeneous com puting 19 20 Unlike the Sputnik API which does a one dimensional decompo sition Crandall s work suggests a multi dimensional decomposition using a variety of different schemes including block strip and Fair Binary Recursive Decomposi tion The advantage of this work over a plain one dimensional strip decomposi tion is that the cache miss rate can theoretically be improved In a simple block decomposition for instance by adjusting the dimensions of the block s edges one can attempt to fit the rows or columns depending on the programming language 21 used in cache thus improving the performance by reducing the number of expen sive cache misses occurring Also the amount of data that needs to be transmitted but not necessarily the amount of sends and receives can be constrained Crandall claims that this trade off can lead to an overall savings in communication time Whereas strip decomposition might be extremely straightforward irregular blocks can be much more complicated however Crandall also works with a decomposition advisory sy
42. er D Dalrymple St Peter s College Oxford University England XIV ABSTRACT OF THE THESIS A Programming Model for Automated Decomposition on Heterogeneous Clusters of Multiprocessors by Sean Philip Peisert Master of Science in Computer Science University of California San Diego 2000 Professor Scott B Baden Chair Clusters of multiprocessor nodes are becoming common in scientific com puting As a result of the expandability of clusters faster nodes are frequently added and older nodes are gradually removed making the cluster heterogeneous As heterogeneity increases traditional methods for programming clusters of mul tiprocessors become less optimal because they do not account for the fact that a cluster will only run as fast as the slowest node Sputnik is a programming methodology and software library that addresses the problem of heterogeneity on a dedicated cluster of multiprocessors Sputnik uses a two stage process for running applications on a cluster of multiprocessors The first stage assesses the relative performance of each node by running the program individually on each node determining from the run times both the performance and application specific optimization Using the timings obtained from stage one the second stage partitions the dataset non uniformly according to the relative speed of each node All future runs of the program use the optimal partitionings and number of threads per node Sputnik is imp
43. erent number of threads for aegir that I wanted to test with For aegir I started with 16 threads and ran again with 20 24 28 and 32 The reason that I started with 16 is that I made the decision ahead of time that if one system was less than half as powerful as the other that it probably would not even be worth using the slower system Therefore I scaled my problem from one half of the fixed number of threads on balder up to the same number of threads on aegir After doing the experiments with 32 threads on balder I ran with the identical problem size with 48 threads on balder this time scaling from 24 to 30 36 42 and 48 Finally I ran just a few very large problem sizes with up to 128 threads on balder and 96 on aegir as I will discuss below In the tables and graphs that follow I include many different types of times Although one can compare the original and new total times communication plus computation to get the most realistic real world times comparing the computation times alone shows the results better The reason for this as I will show in a few limited runs is that when times are observed when communication is measured for homogeneously partitioned runs the times for both nodes will be nearly identical The reason for this is that they need to synchronize at various points and so both maintain a sort of lock step with each other at certain barriers Even though the times are similar however they are both very high Th
processor, which inspired this thesis because it uses distributed shared memory. Each node on the Origin2000 consists of two processors, which have locally shared memory. The nodes are all connected together in a complex structure to achieve 128 to 256 processors per system; a diagram of this is shown in figure II.3. This large system, since it has distributed shared memory, can be used to simulate a multiprocessor, although one could argue that an Origin2000 itself is really a collection of tightly coupled SMPs. Therefore, since in these test cases I used a large part of the Origin2000, and a node is only a small part of the system, in this results chapter I will refer specifically to a "system" where I have referred interchangeably to either a node or a multiprocessor in previous chapters. Likewise, since before I ran with one MPI process per multiprocessor node, here I will run with one MPI process per Origin2000 system. Unlike an SMP, this distributed shared memory system is not a UMA machine. Instead, it is a NUMA derivative called cache-coherent non-uniform memory access, or ccNUMA.

I obtained my results by performing experiments on the two machines, balder and aegir, shown in Table V.1.

   Processor Type and Clock Speed    250 MHz MIPS R10000
   Cycle Time                        4.0 ns
   Processor Peak Performance        500 MFLOPS
   L1 Cache Size                     32 KB
   L2 Cache Size                     4 MB
   Operating System                  IRIX 6.5
   Compilers and Linkers             Native SGI

Table V.1: Specifications for the two Origin2000 machines, balder and aegir
together, to make sure that by using KeLP instead of MPI (even though KeLP is built on top of MPI) there were no new incompatibilities or performance bugs when used with OpenMP.

III.C Multi-Tier Experiments

III.C.1 MPI and Pthreads

In order to determine whether a heterogeneous API for KeLP might work, I decided to use a hand-coded multi-tier program in a simulated heterogeneous hardware environment. To do this, I used a piece of software called Red-Black 3D, hereafter referred to as rb3D or redblack3D [23, 11]. According to its own documentation, the rb3D program is a 3D iterative solver that solves Poisson's equation using the Gauss-Seidel method with red-black ordering. The program is described in more detail in chapter 5. This particular implementation of the rb3D algorithm is hand-coded using MPI and Pthreads. The implementation partitions the data twice: once for the node level and once for the processor level. MPI is used to pass messages between multiprocessors, and Pthreads are used to communicate within a multiprocessor node via shared memory. The Pthread model is different from the one OpenMP uses in many ways, but one important issue is what happens to the threads during the run of the program, between iterations. Pthreads uses "parked" threads, meaning that instead of the threads being destroyed between iterations or sections of the program, they are temporarily parked for future use. The benefit of
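To make the red-black update concrete, here is a minimal sketch of one red-black Gauss-Seidel sweep for Poisson's equation; it is an illustration only, not the rb3D source, and the names (u, rhs, N, h) are assumptions.

   // Sketch of one red-black Gauss-Seidel sweep on an N^3 grid for
   // Poisson's equation (del^2 u = rhs, grid spacing h). Only points
   // of one color are updated per sweep. Illustrative, not rb3D code.
   #include <vector>

   void rbSweep(std::vector<double>& u, const std::vector<double>& rhs,
                int N, double h, int color /* 0 = red, 1 = black */)
   {
       auto idx = [N](int i, int j, int k) { return (i * N + j) * N + k; };
       for (int i = 1; i < N - 1; ++i)
           for (int j = 1; j < N - 1; ++j)
               for (int k = 1; k < N - 1; ++k)
                   if ((i + j + k) % 2 == color)   // one color per sweep
                       u[idx(i,j,k)] = (u[idx(i-1,j,k)] + u[idx(i+1,j,k)] +
                                        u[idx(i,j-1,k)] + u[idx(i,j+1,k)] +
                                        u[idx(i,j,k-1)] + u[idx(i,j,k+1)] -
                                        h * h * rhs[idx(i,j,k)]) / 6.0;
   }

Because all same-colored points depend only on points of the other color, each sweep parallelizes cleanly, which is what makes the red-black ordering attractive for both Pthreads and OpenMP implementations.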
etowner(pMap(P), proc); }
   void setDomain(const RegionX& D) { _domain = D; }

   //------------------------------------------------------------
   // distribution methods
   //------------------------------------------------------------
   void distribute(const PointX& D, const ProcessorsX& P, double *Times);
   void distribute(ArrayIndexArguments, const ProcessorsX& P, double *Times)
      { distribute(PointX(ArrayIndices), P, Times); }
   void distribute(const PointX& D, double *Times)
      { distribute(D, ProcessorsX(), Times); }
   void distribute(ArrayIndexArguments, double *Times)
      { distribute(PointX(ArrayIndices), ProcessorsX(), Times); }

private:
   // simple access functions
   int procExtents(int dim) const { return _Map.extents(dim); }
   int procLower(int dim) const { return _Map.lower(dim); }

   // distribution functions
   void distributeBlock1(int dim, double *Times);
   void distributeBlock2(int dim);
};
#endif

B.B DecompositionX.C.m4

B.B.1 distribute

/****************************************************************
 * void DecompositionX::distribute(const PointX& D,
 *                                 const ProcessorsX& P)
 * Distribute a decomposition across the logical processor array.
 ****************************************************************/
void DecompositionX::distribute(const PointX& D, const ProcessorsX& P,
                                double *Times)
{
   // initialize the _Map array
   _Map.resize(P.region());
   resize(_Map.size());
   int i = 0;
   for_point_X(p, _Map)
      _Map(p) = i++;
   end_for
   if (domainEmpty())
      return;
   for (int dim = 0; dim < NDIM; dim++) {
are clearly demonstrating the fact that a program can run only as fast as its slowest node. It is more interesting and informative, however, to see the times for computation exclusively.

V.E Results of redblack3D

V.E.1 Up to 32 threads per system

   Threads     Original Compute      New Total            New Compute
   (aegir)     balder     aegir      balder     aegir     balder     aegir
   16          47.7       87.7       66.415     65.7      65         60.4183
   20          48.4       74.4       63.0848    63.8      61.3       58.0324
   24          51         62.7       58.4       59.5019   56         55.4033
   28          50         55.6       54.409     53.8      50.3       51.4442
   32          51.2       48.7       51.6       52.5732   48         50.5865

Table V.3: Complete redblack3D timings (seconds) with 32 threads on balder and varying numbers of threads on aegir

   Threads           New Compute           Predicted    Speedup    Theoretical
   balder   aegir    balder     aegir                              Speedup
   32       16       65         60.4138    61.7916      1.3492     1.4193
   32       20       61.3       58.0324    58.6476      1.2137     1.2686
   32       24       56         55.4033    56.2480      1.1196     1.1147
   32       28       50.3       51.4442    52.6515      1.0808     1.056
   32       32       48         50.5865    49.9187      1.012      1.0257

Table V.4: Speedup and predicted timings for redblack3D with 32 threads on balder and varying numbers of threads on aegir

Figure V.1, table V.3, and table V.4 show the results for the runs of redblack3D with 32 threads on balder and 16 to 32 threads on aegir. The important numbers to compare are the slowest of the original times for computation against the slowest of the new times for computation. This is because, as mentioned
filling Curves," January 10, 1995.

Baden, S. B., RedBlack 3D, 1999.

Ridge, D., D. Becker, P. Merkey, and T. Sterling, "Beowulf: Harnessing the Power of Parallelism in a Pile of PCs."

Schopf, J. M., and F. Berman, "Performance Prediction Using Intervals," UCSD CSE Dept. Technical Report CS97-541, May 1997.

Schopf, J. M., and F. Berman, "Performance Prediction in Production Environments," UCSD CSE Dept. Technical Report CS97-558, September 1997.

San Diego Supercomputer Center (SDSC), <http://www.sdsc.edu>.

SGI/Cray, Models of Parallel Computation.

SGI/Cray, Origin2000 and Onyx2 Performance Tuning and Optimization Guide (IRIX 6.5), <http://techpubs.sgi.com/library/tpl/cgi-bin/browse.cgi?coll=0650&db=bks&cmd=toc&pth=/SGI_Developer/Or0n2_PfTune>.

Simon, H. D., and S. H. Teng, "How Good is Recursive Bisection?" SIAM J. S. C., 1995.

Smallen, S., W. Cirne, J. Frey, F. Berman, R. Wolski, M. H. Su, C. Kesselman, S. Young, and M. Ellisman, "Combining Workstations and Supercomputers to Support Grid Applications: The Parallel Tomography Experience," October 12, 1999.

Sun Microsystems, Inc., Sun Servers, <http://www.sun.com/servers>.

Sun Microsystems, Inc., UltraSPARC-II Products, <http://www.sun.com/microelectronics/UltraSPARC-II/index.html>.

[69] Wolski, R., N. Spring, and C. Peterson, "Implementing a Performance Forecasting System for Metacomputing: The
g, ASCI Blue Pacific, <http://www.llnl.gov/asci/platforms/bluepac>.

Lawrence Livermore National Labs, The ASCI Program, <http://www.llnl.gov/asci>.

Grimshaw, A., et al., Legion, <http://legion.virginia.edu>.

Lenoski, D., J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S. Lam, "The Stanford DASH Multiprocessor," Stanford University, March 1992.

Lim, B. H., P. Heidelberger, P. Pattnaik, and M. Snir, "Message Proxies for Efficient, Protected Communication on SMP Clusters," Proc. Third Int'l Symp. on High-Performance Computer Architecture, San Antonio, TX, Feb. 1997, IEEE Computer Society Press, pp. 116-27.

Lumetta, S. S., A. M. Mainwaring, and D. E. Culler, "Multi-protocol active messages on a cluster of SMPs," in Proc. SC97, Nov. 1997.

Majumdar, A., "Parallel Monte Carlo Photon Transport," NPACI Parallel Computing Training, 1999.

May, J., B. de Supinski, B. Pudliner, S. Taylor, and S. Baden, "Final Report: Programming Models for Shared Memory Clusters," Lawrence Livermore National Labs, 99-ERD-009, January 13, 2000.

May, J., and B. R. de Supinski, "Experience with Mixed MPI/Threaded Programming Models," Lawrence Livermore National Labs, UCRL-JC-133213.

Mitchell, N., L. Carter, J. Ferrante, and K. Hogstedt, "Quantifying the Multi-Level Nature
program estimates to be optimal from ClusterDiscovery, decomposes the data non-uniformly, based on the relative powers of the nodes.

3. Runs the program one last time, using the optimizations and decompositions from ClusterDiscovery and ClusterOptimization.

IV.B Sputnik Model

IV.B.1 ClusterDiscovery

ClusterDiscovery performs an estimation. It searches the parameter space somewhat intelligently for the available optimizations and seeks a performance gain in the program it is optimizing. ClusterDiscovery works transparently to the user. Since one of the goals of the Sputnik Model is to allow the user to program as if the cluster is homogeneous, ClusterDiscovery runs through the possible optimizations automatically and finds the best optimizations, and the timings for those runs, using the optimizations. The kernel runs inside a kind of shell, so that Sputnik has access to running it whenever it needs to. Not only does ClusterDiscovery save the user from manually searching for all the optimal configuration parameters, but there is no firm limit on the number of permutations that can be searched, since the search all happens transparently inside the shell (a sketch of such a loop appears below). Optimizations could include adjusting the number of OpenMP threads (as is done in the Sputnik API), cache tiling, sending vectorizable code to a vector computer in the cluster and parallelizable code to an MPP in the cluster, and a huge number of other possible variations
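A minimal sketch of the kind of discovery shell described here, under stated assumptions: the runKernel callback and maxThreads bound are hypothetical stand-ins, and this is an illustration of the idea, not Sputnik's actual implementation (the real SputnikGo appears in Appendix B).

   // Hedged sketch of a discovery loop: try thread counts on this node,
   // keep the best time. runKernel and maxThreads are hypothetical names.
   #include <omp.h>

   double discoverBestThreads(double (*runKernel)(), int maxThreads,
                              int& bestThreads)
   {
       double bestTime = 1.0e30;
       bestThreads = 1;
       for (int t = 1; t <= maxThreads; t *= 2) {  // search powers of two
           omp_set_num_threads(t);
           double time = runKernel();              // time one kernel run
           if (time < bestTime) {
               bestTime = time;
               bestThreads = t;
           }
       }
       return bestTime;
   }

Each node runs this loop independently, so a fast node and a slow node can settle on different thread counts, and the best times feed the repartitioning stage.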
repeatedly on each individual multiprocessor node, varying the number of threads, until the optimal number of threads per node is reached. The kernel runs in a kind of shell that Sputnik creates, using SputnikMain, described below.

3. Run the kernel with the optimal number of threads per node.

In order to use the Sputnik library, there are three principal changes that need to be made to the KeLP code, in addition to adding OpenMP directives. First, a new function needs to be defined by the programmer: SputnikMain. Second, all calls to the distribute function from the user code need to have an additional argument added. Finally, the main Sputnik function, SputnikGo, needs to be called by the programmer from within main.

A.E.1 SputnikMain

SputnikMain is the code that is called over and over again while trying to find the optimal number of threads per multiprocessor node. It is not just the kernel of the program, but is also everything that the program needs to call before running the kernel, such as the initialization of values in arrays. SputnikMain returns a double. The value of that double should be the time it takes for the kernel to run. For example:

double SputnikMain(int argc, char **argv, double *SputnikTimes)
{
   double start, finish;

   // <declarations, initializations>

   start = MPI_Wtime();    // start timing the kernel

   // call the kernel function

   finish = MPI_Wtime();   // finish timing

   return finish - start;
Network Weather Service," Proceedings of Supercomputing, 1997.
   i = 0;
   for_point_X(p, _Map)
      dimOffset = p(dim) - PLower;
      if (Times != NULL && P > 1) {
         // heterogeneous case: strip widths come from ceilings[]
         if (i > 0)
            low = aHigh[i-1] + 1;
         else
            low = domainLower(dim);
         ceiling = ceilings[i];
         aHigh[i] = low + ceiling - 1;
         i++;
      } else
         // homogeneous case: uniform strips of width `ceiling`
         low = domainLower(dim) + ceiling * dimOffset;
      setlower(_Map(p), dim, low);
      setupper(_Map(p), dim, MIN(low + ceiling - 1, domainUpper(dim)));
   end_for
}

B.C Sputnik.h

void SputnikGo(int argc, char **argv);
double SputnikMain(int argc, char **argv, int testSMPS,
                   double *SputnikTimes);

B.D Sputnik.C

#include <iostream.h>
#include "Sputnik.h"
#include <omp.h>
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

#define def_numThreads 10
#define def_maxThreads 0
#define def_testSMPS   0
#define def_hetero     0

void SputnikGo(int argc, char **argv)
{
   int numThreads, hetero, maxThreads, testSMPS;
   int maxThr[2], numThr[2];
   int i, j, k;
   int myid, nodes;
   char procName[80];

   MPI_Comm_rank(MPI_COMM_WORLD, &myid);
   MPI_Comm_size(MPI_COMM_WORLD, &nodes);

   if (myid == 0) {
      testSMPS   = def_testSMPS;
      hetero     = def_hetero;
      numThreads = def_numThreads;
      maxThreads = def_maxThreads;
      maxThr[0] = maxThr[1] = 0;
      numThr[0] = numThr[1] = 0;

      for
The API is designed so that the main routine of a program is moved mostly to a user-defined routine called SputnikMain. The real main does initialization and calls a routine called SputnikGo. SputnikGo acts as a kind of shell that calls SputnikMain over and over to determine the optimal number of threads per node, and then makes the final run with the optimized configuration. The repartitioning, one of the primary features of the Sputnik API, is a modification of the distribution functions in the DOCK library. Although no new functions are added, the distribution functions are mostly rewritten to support non-uniform partitioning.

IV.F Assumptions and Limitations

There are assumptions this API makes. First, it assumes that no node is more than twice as fast as another node. This assumption also helps to ensure that communication time does not overwhelm a particular node because it is doing so much less computation than another. Second, because the API runs the entire kernel on each separate node, it assumes that the speed of the node will not change when the node is given only a portion of the entire computational domain to run, as is done after the repartitioning for the final run. Finally, the repartitioning that Sputnik does is only one-dimensional. Although DOCK, and therefore KeLP, support multi-dimensional decomposition, for simplicity Sputnik does not. One reason to support multi-dimensional decomposition
      // If the time goes up, we have passed the best thread count
      if (timings[i] > bestTime && gotit == 0) {
         a_min[myid] = i / 4;
         a_max[myid] = i;
         gotit = 1;
         a_gotit[myid] = 1;
      }
      // keep increasing the number of threads
      if (i * 2 > maxMax && i != maxMax) {
         if (gotit == 0) {
            // 2nd highest power of 2 before maxMax
            a_min[myid] = (i / 2 > maxMax) ? maxMax : i / 2;
            a_max[myid] = (i > maxMax) ? maxMax : i;
         }
         i = maxMax;
      } else
         i *= 2;
   }

   // Check to see if everyone's time went up
   int allgotit = 0;
   for (j = 0; j < nodes; j++) {
      MPI_Bcast((void *)&a_gotit[j], 1, MPI_INT, j, MPI_COMM_WORLD);
      allgotit += a_gotit[j];
   }

   if (a_max[myid] > maxThr[myid]) {
      a_max[myid] = maxThr[myid];
      a_min[myid] = maxThr[myid] / 2;
      gotit = 1;
      a_gotit[myid] = 1;
   }

   minMin = a_min[myid];
   maxMax = a_max[myid];
   for (j = 0; j < nodes; j++) {
      MPI_Bcast((void *)&a_min[j], 1, MPI_INT, j, MPI_COMM_WORLD);
      if (a_min[j] < minMin)
         minMin = a_min[j];
   }
   for (j = 0; j < nodes; j++) {
      MPI_Bcast((void *)&a_max[j], 1, MPI_INT, j, MPI_COMM_WORLD);
      if (a_max[j] > maxMax)
         maxMax = a_max[j];
   }
   if (minMin <= 0)
      minMin = 1;
   i = minMin;

   // Step through from the min to the max to find the best
   while (i <= maxMax) {
      if (timings[i] != 0.0) {
         i++;
         continue;
      }
      cout << "PE " << myid << ": Running with " << i << " threads\n";
      omp_set_num_threads(i);
      timings[i] = SputnikMain(argc, argv, testSMPS, NULL);
      if (timings[i] < bestTime && i <= maxThr[myid])
B. Sputnik Model ............................................ 34
   1. ClusterDiscovery ...................................... 34
   2. ClusterOptimizer ...................................... 35
   3. Running ............................................... 35
C. Sputnik API .............................................. 36
   1. Goals ................................................. 36
D. Decomposition ............................................ 37
   1. Block Decomposition ................................... 37
   2. Heterogeneous Partitioning ............................ 37
E. API Design ............................................... 39
F. Assumptions and Limitations .............................. 40

V Validation ................................................ 41
   A. Introduction .......................................... 41
      1. Red-Black 3D ....................................... 41
   B. Clusters .............................................. 42
   C. Predicted Results ..................................... 44
   D. Experiments ........................................... 46
   E. Results of redblack3D ................................. 48
      1. Up to 32 threads per system ........................ 48
      2. Up to 48 threads per system ........................ 51
      3. Large numbers of threads per system ................ 52
      4. Anomalies .......................................... 53

VI Conclusions and Future Work .............................. 56

Contact Information ......................................... 59

Appendices

A. Sputnik .................................................. 60
   A. Introduction .......................................... 60
   B. Major Changes in Usage from KeLP1 ..................... 61
   C. Examples of Command-Line Options ...................... 62
   D. Example of Intra-Node Parallelism ..................... 64
in the entire parameter space of possible optimizations for scientific code. This shell is run separately on each node in the cluster, so that each node is optimized individually, with a distinct parameter set, and timing results are returned to the shell for those optimizations. One does not need to know anything about the characteristics of each node in the cluster, which may or may not even be multiprocessors, prior to running a program written using the Sputnik Model.

IV.B.2 ClusterOptimizer

The ClusterOptimizer uses the optimizations found in the ClusterDiscovery stage, and the best timings of each node, to decompose the computation of the problem according to the performance of each node in the cluster. A node discovered to have better performance than other nodes will therefore work on a larger chunk of the problem. Depending on the size of the problem as a whole, the cache sizes, and the amount of communication taking place, there are a variety of different decomposition schemes available, which are described below. The important thing, however, is to make sure that, whichever decomposition scheme is used, computation is balanced out so that each node finishes at the same time. Finally, one must make sure that, after decomposition, communication does not overwhelm computation. To the degree that the original problem does not have this issue, and we insist that no node is slower than half the speed of the fastest
E. Sputnik Usage ............................................ 65
   1. SputnikMain ........................................... 66
F. Sputnik Implementation ................................... 67

B Source Code to Sputnik .................................... 70
   A. DecompositionX.h.m4 ................................... 70
   B. DecompositionX.C ...................................... 72
      1. distribute ......................................... 72
      2. distributeBlock1 ................................... 73
   C. Sputnik.h ............................................. 75
   D. Sputnik.C ............................................. 75

C Source code to redblack3D with Sputnik .................... 85
   A. ....................................................... 85
   B. ....................................................... 87

Bibliography ................................................ 93

LIST OF TABLES

II.1  Growth of connections in a crossbar-switched machine
III.1 redblack3D MFLOPS rates with heterogeneous partitioning using hand-coded MPI and Pthreads (N = 300; PE0 has 2 processors, and PE1 and PE2 have 1 each)
III.2 OpenMP scaling test speedup for an SGI Origin2000 with 250 MHz processors
V.1   Specifications for the two Origin2000 machines, balder and aegir
V.2   Software configurations
V.3   Complete redblack3D timings with 32 threads on balder and varying numbers of threads on aegir
V.4   Speedup and predicted timings for redblack3D with 32 threads
level 1 or level 2 caches of varied sizes is certainly also possible. The emphasis is, however, that ClusterDiscovery and ClusterOptimization are separate processes that can function in tandem easily and are not just limited to multiprocessors. Other architectures and possible optimizations are beyond the scope of this thesis and are not addressed in this incarnation of the API that I have developed.

Finally, I make certain assumptions about the condition of the cluster:

1. Multiprocessor nodes are connected by a uniform, local, dedicated network.

2. The program running has dedicated access to the cluster hardware.

3. The cluster is set up to have many more processors per node, on average, than nodes in the whole cluster. A cluster with many nodes of very few processors (including ASCI Blue Pacific) may be better off with a single-tiered approach (including MPI), because the shared memory aspect of Sputnik will be much less relevant.

4. No node in the cluster runs less than half as fast as any other node in the cluster.

5. The problem does not fit entirely into memory cache on any of the processors.

I.B Heterogeneity

I take the concept of some of the existing APIs for programming clusters of multiprocessors one step further by directly supporting heterogeneous clusters. Acting on the assumption that clusters of multiprocessors are the immediate future of high-performance parallel computing, I decided that programming clusters
on top of the KeLP infrastructure to handle irregular decomposition and data motion. It enables code to be written for a heterogeneous cluster as if the cluster were homogeneous. Sputnik can run scientific applications on a heterogeneous cluster faster, and with improved utilization, compared to a nearly identical program written in KeLP alone. Experimental results from a pair of SGI Origin2000s indicate that Sputnik can improve the run time of an iterative solver for Poisson's equation by 35 percent.

Chapter I

Introduction

I.A Motivation

The computer hardware used for scientific and technical computing is continually evolving. In the most recent generations of supercomputing hardware, there have been vector supercomputers made by companies including Cray, as well as multicomputer-style massively parallel processors (MPPs) made by many companies, including Cray, IBM, Intel, Hewlett-Packard, and SGI. Clusters of multiprocessors, however, are increasing in popularity, replacing older mainframes. As a result of the mass production of the components used, the monetary costs of purchasing clusters of multiprocessors are dropping, and therefore use of the technology has been spreading from business computing to scientific and technical computing, replacing vector supercomputers and multicomputer MPPs in science as they replaced mainframes in industry. Unfortunately, although multiprocessors and multiprocessor clusters are attractive
finely grained communication might be inappropriate for the current iteration of the API, because extremely tight communication imposes a kind of synchronization on the program that might negate the speedup that Sputnik can provide. The API is packaged as a C++ library and is built on top of the KeLP infrastructure [23, 35]. The library allows users to write scientific programs that run effectively on heterogeneous clusters of multiprocessors, where the component nodes may all run at different speeds. It is intended as something of a proof of concept about what steps are needed to make heterogeneous clusters run more efficiently. I also present a broader theory, of which the API is merely a subset. The broader theory, the Sputnik Model, is not limited to multiprocessors, stencil-type applications, or just two optimization techniques.

This thesis introduces a two-stage process for optimizing performance on a heterogeneous cluster. Though Sputnik has been targeted for multiprocessors, some or all of the nodes may be uniprocessors. The first stage, the ClusterDiscovery stage, performs a resource discovery to understand how the application, or perhaps a part of the application, runs on each individual node in the cluster. The second stage, ClusterOptimization, makes specific optimizations based on what the first stage has discovered. Depending on the hardware and the type of problem, there can be many possible types of optimizations.
      switch (D(dim)) {
         case BLOCK1:
            distributeBlock1(dim, Times);
            break;
         case BLOCK2:
            distributeBlock2(dim);
            break;
         default:
            break;
      }
   }

   // do processor assignments
   for_point_X(p, _Map)
      setowner(_Map(p), P(p));
   end_for
}

B.B.2 distributeBlock1

/****************************************************************
 * void DecompositionX::distributeBlock1(int dim)
 *
 * In a BLOCK1 distribution, each processor gets exactly
 * ceiling(N/P) elements. If P doesn't divide N, this will
 * result in a load imbalance.
 ****************************************************************/
void DecompositionX::distributeBlock1(int dim, double *Times)
{
   int N = domainExtents(dim);
   int P = pExtents(dim);
   int PLower = pLower(dim);
   int dimOffset, low;
   int ceiling;
   int i;
   double tTotal = 0.0;
   double *invTimes = new double[P];
   int *ceilings = new int[P];
   int *aHigh = new int[P];

   if (Times != NULL && P > 1) {
      // Get the inverses of the total times
      for (i = 0; i < P; i++) {
         tTotal += 1.0 / Times[i];
         invTimes[i] = 1.0 / Times[i];
      }
      // Get the ratios, and even it out if necessary
      for (i = 0; i < P; i++) {
         ceilings[i] = floor((invTimes[i] / tTotal) * N);
         if (N % P != 0)
            ceilings[i] += 1;
      }
   } else
      ceiling = (N % P == 0) ? N / P : N / P + 1;

   _distType(dim) = BLOCK1;
perform partitioning, and have a previously generated file containing the timing results prepared, the program can partition based on any arbitrary data. The idea is that the program would be run on each individual multiprocessor, the times would be recorded, in order, and put into the file. Then, in the future, the program could be run without re-running the resource-discovery part of the program.

The results of this implementation are shown in figure III.2 and table III.1. As can be seen, with the exception of the first trial, the results are positive: the heterogeneous version performs better than the homogeneous version, and close to the theoretical best. Essentially, the code works by first being told that it has three nodes and that one node can run the program twice as fast as the other two. Instead of breaking the problem into three equal partitions, one for each multiprocessor, as would normally happen, the program divides up the data such that node 0 gets half of the work, since it has half of the combined processors of the cluster, and nodes 1 and 2 each get a quarter of the work. Nodes 1 and 2 can then work on their parts of the workload without spawning off extra threads, and can communicate with each other and with node 0 using MPI. Node 0 first spawns two threads; it performs shared memory communication with Pthreads to communicate between the two threads, and uses MPI to communicate with nodes 1 and 2. In this manner, the program has been able to be sped up
memory may or may not necessarily be physically shared on the hardware. This means that one processor might be able to access the memory of a processor on a completely different machine, because the software has been built to allow that style of access. This is called non-uniform memory access, or NUMA. One of the principal challenges in the future development of multiprocessors is determining what an optimal configuration is. Due to the nature of the construction of a multiprocessor, it is possible that putting too many processors in one multiprocessor could cause congestion on the bus. Additionally, too many threads all trying to perform read and write accesses to the same place in memory can create a bottleneck, due to each process having to wait for the other processes' mutex locks to unlock before they can go and set their own mutex lock [21, 44, 45].

II.B Multicomputers

In parallel programs where one processor works on data and then needs to exchange data with another processor, there must be communication between processors. In a shared-nothing architecture (a distributed memory machine without shared memory), the only way to do this is to pass messages between the processors. In message passing, one processor communicates with other processors through a basic set of message-passing primitives, including send, receive, broadcast, scatter, and gather. Using send and receive, one processor sends a message across the communications network.
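For illustration, here is a generic MPI point-to-point exchange using the send and receive primitives just described; this is standard MPI, not code from the thesis.

   // Generic MPI send/receive example (not from the thesis): rank 0
   // sends an array of doubles to rank 1, which receives it.
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank;
       double buf[100] = {0};
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       if (rank == 0) {
           MPI_Send(buf, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
       } else if (rank == 1) {
           MPI_Recv(buf, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);
           printf("rank 1 received 100 values\n");
       }
       MPI_Finalize();
       return 0;
   }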
multiprocessors are also sometimes called hierarchical machines, or multi-tier machines, because one level of communication is possible within the node (generally shared memory) and one level of communication is possible between the nodes themselves. The two tiers of processor-level and node-level communication are why they are considered hierarchical. It is certainly possible to take the model beyond two tiers as well, to three or more, by connecting a two-tiered cluster of multiprocessors to another two-tiered cluster of multiprocessors, thus creating a three-tiered cluster of multiprocessors, or a cluster of clusters of multiprocessors. Clusters of more than two tiers are outside the scope of this thesis.

Clusters of multiprocessors are attractive because the manufacturers are able to scale the systems to large numbers of processors much more easily and inexpensively than if they tried to solve the problem of networking every processor together, as in a multicomputer, or sharing memory between all processors, as in a multiprocessor.

[Figure II.3: Diagram of an SGI Origin2000, courtesy of SGI's Origin2000 and Onyx2 Performance Tuning and Optimization Guide (IRIX 6.5); the panels show 4-, 8-, 16-, 32-, and 64-processor systems.]

[Figure II.4: Diagram of a cluster of symmetric multiprocessors: SMP nodes 0-3 connected by a network hub or switch.]
[Figure II.1: Diagram of a multiprocessor: four processors, each with an L1 and L2 cache, connected by a bus to a shared main memory.]

results presented in this thesis might be considered a hybrid between the multiprocessor and the multicomputer, because it uses distributed shared memory. Second, the multiprocessors that compete in today's market with modern multicomputers are typically built partially from commodity parts, for the purpose of reducing cost by means of increased economies of scale. A recent multiprocessor incarnation from IBM, for instance, forms the basis of each node comprising the ASCI Blue Pacific machine at Lawrence Livermore National Labs and Blue Horizon at the San Diego Supercomputer Center [37, 62]. These machines use hundreds of IBM's PowerPC processors, a chip family that powers all modern Apple Macintosh computers [33, 37, 38]. Similarly, the chip inside Sun's Enterprise servers, the SPARC, also powers every Sun workstation that comes off the line [67, 68]. The large IBM systems, despite being massively parallel, are still essentially large clusters of multiprocessors. As discussed below, many multicomputers are built using expensive, specialty communications hardware, significantly more complex than that of a multiprocessor.

In a multiprocessor, parallel programs can typically be run either by using message passing across the bus
OpenMP. Further, Sputnik modifies the existing KeLP distribution by adding new partitioning algorithms to the distribute and distributeBlock1 functions in DOCK's DecompositionX class. The goal of the API, designed to be consistent with the original KeLP1 aims of making scientific programming easier while improving overall performance, has resulted in an API that requires only a few minor modifications to an existing KeLP1 program to work in Sputnik.

[Figure A.1: Hierarchy of software layers for Sputnik.]

A good progression for writing a Sputnik program would be first to write the serial program, then modify it to be a KeLP1 program, then add OpenMP directives, and then finally modify the KeLP1 program to be a Sputnik program using the Sputnik API calls.

A.B Major Changes in Usage from KeLP1

The two new routines are as follows; they are described in detail later in this chapter:

void SputnikGo(int argc, char **argv);
double SputnikMain(int argc, char **argv, double *SputnikTimes);

Sputnik accepts more arguments than those that may just be passed into the application or to MPI. A typical MPI program, called mpiprog in this example, is started like this on four nodes:

mpirun -np 4 mpiprog

Sputnik takes four additional arguments:

1. testSMPS: If equal to the integer 1, this tells the program that it should pay attention to the maxthreads argument and ignore the numthreads
only run as fast as the slowest component node. Thus, to the degree that any node finishes early and idles while waiting, we under-utilize the hardware: nodes are forced to idle rather than spending their time productively computing. Ideally, therefore, we want all nodes to finish at the same time.

Though the techniques for programming multiprocessors and homogeneous clusters of multiprocessors have been explored in detail and have achieved some level of sophistication, programming heterogeneous clusters of multiprocessors for scientific computation is still a difficult challenge [8, 9, 10, 11, 12, 13, 14, 23, 24, 25, 26, 27, 35, 42]. Existing software technologies may be used to program heterogeneous clusters of multiprocessors; however, the process of doing so while still achieving good performance through load balancing can be extremely difficult.

The goal of my research presented in this thesis is to investigate a way to enable scientific programs to run faster and effectively utilize a heterogeneous cluster of multiprocessors, while allowing the user to write the program as if it were running on a homogeneous cluster.

This thesis introduces an API called Sputnik, designed to assist in achieving greater performance and utilization on a heterogeneous cluster for certain types of applications. Current applications with which Sputnik has been proven to function well include stencil-based programs. Applications with extremely
node, there should not necessarily be any inherent restrictions on which type of decomposition to use to partition the data for the heterogeneous cluster.

IV.B.3 Running

Given the optimal parameters and partitioning, we know enough to run the program on a heterogeneous cluster, and we can expect it to utilize the cluster and perform significantly better. Because the results of ClusterDiscovery are saved on disk, future runs of the program on the same cluster will not have to re-discover the cluster each time and can simply run with the optimal settings.

IV.C Sputnik API

IV.C.1 Goals

The Sputnik API is based on the Sputnik Model. It implements a specific subset of the ideas generalized in the Model:

1. ClusterDiscovery: Runs the kernel of the program repeatedly on each separate node to determine the timings and relative performance, varying the number of threads per node to determine the optimal number of threads per node.

2. ClusterOptimization: Using the parameters which the program estimates to be optimal from ClusterDiscovery, it decomposes the data non-uniformly, based on the relative powers of the nodes.

3. Runs the program one last time, using the optimizations and decompositions from ClusterDiscovery and ClusterOptimization.

The first optimization that Sputnik performs is the determination of the optimal number of OpenMP threads per node to run with. The optimal number of threads might be equal to the number of processors
of multiprocessors was a problem worth investigating. I decided that I could build my API either on top of two existing APIs, one to handle message passing between nodes and one to handle shared memory communication within a node, or on top of one existing API that already supported multi-tier communication.

In choosing a basis for my API, I decided to use KeLP1 as my message-passing layer because, unlike MPI and PVM, it has built-in support for data description and general blocked decomposition, as well as transparent message-passing communication. This makes the job of the programmer using the API much easier for complex parallel programming [23, 35]. Such levels of abstraction have been shown to come without a performance penalty [11]. The question, then, was what technology would best take advantage of the shared memory architecture for intra-node communication. I use OpenMP as my shared memory layer to handle intra-node parallelization because, at the moment, it is the easiest technology to use and is an emerging standard as an alternative to Pthreads. My API is built on top of KeLP and OpenMP; KeLP itself is built on top of MPI. OpenMP can be based on a variety of sub-technologies, depending on the particular vendor's implementation: the Origin2000, which I run on, uses SGI's sprocs, though the IBM SP systems build OpenMP on top of Pthreads [48]. Hopefully, by extending KeLP1's API instead of simply MPI
mpirun -np 2 mpiprog -testSMPS 0 -hetero 1 -numthreads 10

   Sputnik runs on 2 nodes with exactly 10 threads per node, with heterogeneous decomposition based on the times stored in the stats file.

7. mpirun -np 2 mpiprog -testSMPS 0 -hetero 1 -numthr0 10 -numthr1 20

   Sputnik runs on 2 nodes with exactly 10 threads on node 0 and 20 threads on node 1, with heterogeneous decomposition based on the times stored in the stats file.

A.D Example of Intra-Node Parallelism

Additionally, OpenMP directives must be put around the loops that the programmer wishes to parallelize. The directives function in either C/C++ or Fortran kernels and can be as simple as those used in this code from a KeLP version of Red-Black 3D:

!$OMP PARALLEL PRIVATE(jj,ii,k,j,i,jk)
      do jj = ul(1)+1, uh(1)+1, sj
      do ii = ul(0)+1, uh(0)+1, si
!$OMP DO SCHEDULE(STATIC)
      do k = ul(2)+1, uh(2)+1
         do j = jj, min(jj+sj-1, uh(1)+1)
            jk = mod(j+k, 2)
            do i = ii+jk, min(ii+jk+si-1, uh(0)+1), 2
               u(i,j,k) = c *
     2            (u(i-1,j,k) + u(i+1,j,k) + u(i,j-1,k)
     3            + u(i,j+1,k) + u(i,j,k-1) + u(i,j,k+1)
     4            + c2 * rhs(i,j,k))
            end do
         end do
      end do
!$OMP END DO
      end do
      end do
!$OMP END PARALLEL

The code above scales well and, with the OpenMP directives in place, can exhibit good parallelism, depending on the size of the problem.

A.E Sputnik Usage

Sputnik works in this manner:

1. Initialize MPI and KeLP.

2. Set the number of threads on each multiprocessor node and run the kernel of the program
on, since half the number of processors are working on the problem. As can be seen from the charts in figure III.5, figure III.6, and table III.2, for at least up to 8 threads the kernel scales very well, but for this size of problem it does not improve significantly with 16 or 32 threads. Scaling is good but not great, but MPI and OpenMP are shown by the results to interoperate without any problems.

After I ran experiments to determine whether MPI and OpenMP, as well as C++ and Fortran OpenMP directives, would coexist and function correctly, I set out to obtain timings to see how well OpenMP would parallelize scientific code, and whether OpenMP would be close to Pthreads in terms of efficiency, since it already

   for (j = 64; j > 0; j = j / 2) {
      if (myid == 0) {
         omp_set_num_threads(j);
         start = MPI_Wtime();
#pragma omp parallel shared(numthreads)
         {
            if (0 == omp_get_thread_num())
               numthreads = omp_get_num_threads();
#pragma omp for schedule(static)
            for (i = 0; i < LONG; i++)
               arr[i] = arr[i-1] * i * 1.0001;
         }
         finish = MPI_Wtime();
      } else {
         start = MPI_Wtime();
         numthreads = omp_get_num_threads();
         for (i = 0; i < LONG; i++)
            arr[i] = arr[i-1] * i * 1.0001;
         finish = MPI_Wtime();
      }
      cout << "PE " << myid << ": threads " << numthreads
           << ", time " << finish - start << ", j = " << j << "\n";
      MPI_Barrier(MPI_COMM_WORLD);
   }

Figure III.4: OpenMP scalability test code
is if Sputnik will be running on a cluster with a huge number of nodes, so that memory and cache tiling could be built into the decomposition and message-passing communication could be made automatically more efficient. Since the initial release of the API has focused on much smaller clusters, with 2-4 nodes, I assumed that one-dimensional decomposition was adequate and the possible downsides would be negligible. The Sputnik API also requires that the application be written in C++, at least as a wrapper, though the kernel(s) of the program may be written in C, C++, or Fortran and linked in. Finally, Sputnik depends on the cluster having a thread-safe implementation of MPI installed, as well as OpenMP for both Fortran and C/C++, and on all technical requirements for both MPI and OpenMP programs being adhered to.

The validation of these results and of the model appears in the next chapter.

Chapter V

Validation

V.A Introduction

Redblack3D was the program I chose to run on a pair of SGI Origin2000 supercomputers at the National Center for Supercomputing Applications (NCSA) to determine the success of the performance aspects of the Sputnik API [48, 47]. Whether it succeeds also in its goal of allowing the user to program a heterogeneous cluster as if it were homogeneous is much harder to measure, although there are not very many changes that have to be made, and I estimate that someone familiar with OpenMP could make
would be to recognize this and direct the ClusterOptimizer to divide not necessarily just the data or the computation as a whole, but to separate the problem into specific tasks that could be assigned to each unique hardware architecture according to the machines' specialties.

I would like the readers of this thesis to take away the following points:

1. Sputnik is an API which demonstrates that heterogeneous clusters of multiprocessors need not be difficult to program. As clusters of multiprocessors appear to be the near-term future of supercomputing, ways are needed to address the evolution of these machines. Sputnik is one of those ways.

2. Sputnik is an API which demonstrates that programs need not waste any available processing power on a heterogeneous cluster of multiprocessors. By adjusting the number of threads per multiprocessor and repartitioning the problem so that each multiprocessor node is time-balanced, the program can run as if the entire cluster were homogeneous.

3. Sputnik can be extended in the future, as I intend to do myself in research with the San Diego Supercomputer Center, in many ways, including but not limited to:

   (a) Running on a variety of vendor platforms.

   (b) Performing dynamic optimizations.

   (c) Running on clusters of varying architectures, including vector machines and MPPs, not just multiprocessors varying in speed, memory, network, or cache size.

   (d) Performing varying types
via heterogeneous partitioning.

   Original (MFLOPS)   Balanced (MFLOPS)   Theoretical Balanced (MFLOPS)   Speedup (%)   Theoretical Speedup (%)
   136.978             173.233             182.637333                      26.47         33.33

Table III.1: redblack3D MFLOPS rates with heterogeneous partitioning using hand-coded MPI and Pthreads. N = 300; PE0 has 2 processors, and PE1 and PE2 have 1 each.

To analyze what the theoretical best possible speedup might be through heterogeneous decomposition, consider N to be the total amount of data in the problem. In the homogeneous run, each processor gets N/3 data. Node 0 finishes in time T, while nodes 1 and 2 finish in time 2T, because they run only half as fast. So two processors are wasted for a time of T. If the data is repartitioned optimally, so that node 0 gets N/2 data and nodes 1 and 2 get N/4 data each, then the time that each node would take is T x (N/2)/(N/3) = (3/2)T, assuming optimal efficiency (the equations for obtaining the new times are described in chapter 5); the speedup is then 2T/(3/2)T = 4/3, or 33.33 percent. Although the timings did not indicate optimal efficiency, they approached it, and at least gave indications not only that MPI and Pthreads worked well together but, more importantly, that the repartitioning concept is valid.

[Figure III.2: redblack3D with heterogeneous partitioning using hand-coded MPI and Pthreads; the chart compares MFLOPS for the Original, Balanced, and Theoretical Balanced runs.]
      ReportTimings(times, timesLoc, reps, chk_freq, niter, N,
                    si, sj, gi, gj);
   }
   catch (KelpErr& ke) {
      ke.abort();
   }
   return middle;
}

Bibliography

[1] Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," 1993 Conference on Programming Models for Massively Parallel Computers.

[2] Alpern, B., and L. Carter, "Towards a Model for Portable Performance: Exposing the Memory Hierarchy," 1992.

[3] Alpern, B., L. Carter, E. Feig, and T. Selker, "The Uniform Memory Hierarchy Model of Computation," IBM Watson Research Center, November 2, 1992.

[4] Ammon, J., "Hypercube Connectivity with CC-NUMA Architecture," SGI/Cray white paper, 1999.

[5] Anglano, C., J. Schopf, R. Wolski, and F. Berman, "Zoom: A Hierarchical Representation for Heterogeneous Applications," UCSD CSE Technical Report CS95-451, January 5, 1995.

[6] Argonne National Laboratories, MPI: The Message Passing Interface Standard, <http://www-unix.mcs.anl.gov/mpi>.

[7] Argonne National Laboratories, MPICH: A Portable Implementation of MPI, <http://www-unix.mcs.anl.gov/mpi/mpich>.

[8] Baden, S. B., "Tradeoffs in hierarchical algorithm design on multi-tier architectures."

[9] Baden, S. B., "Programming Abstractions for Dynamically Partitioning and Coordinating Localized Scientific Calculations Running on Multiprocessors," SIAM J. Sci. Stat. Comput., Vol. 12, No. 1, pp. 145-157, January 1991.

[10] Baden, S. B.,
powerful class of machines for parallel programmers to use the KeLP-style technology on. Whereas KeLP1 was an interface on top of the MPI [6] technology underneath, KeLP2 was an interface on top of both MPI and Pthreads, as shown in Figure III.1. As described previously, the idea was that although MPI could in theory be used for both inter- and intra-node communication, this is not the optimal communication method, because MPI does not utilize the shared memory hardware. Instead, the designers of KeLP2 decided that MPI would be used for inter-node communication and Pthreads would communicate within an SMP node, using the fast, fine-grained shared memory accesses that Pthreads are particularly good for. KeLP2 adds support for the multi-tier nature of a cluster of SMPs; if the SMP nodes are different, however, one sees inefficiencies. The program will only run as fast as the slowest node, and so if one node finishes in 13 seconds and another in 5, the program will take 13 seconds to run. Although KeLP2 is a very good multi-tier implementation, at the time of writing the Sputnik API it was not running on my platform of choice, the Origin2000, so I built the Sputnik API on top of KeLP1 (MPI) instead of KeLP2 (multi-tier, with MPI and Pthreads).

[Figure III.1: Hierarchy of software layers for KeLP2: the user program sits on KeLP2, which sits on MPI and Pthreads.]

III.A.3 Heterogeneity Work

Grid-Enabled MPI

In Grid-Enabled MPI, Foster attempts to reduce the communication time inherent
The principal downside of a cluster of multiprocessors is that it is difficult to program. Memory is only shared between processors within a given multiprocessor node. Therefore, a shared memory program cannot be used across the entire cluster. Message passing can be used across the entire cluster, but this does not take advantage of the unique shared memory hardware that exists within a multiprocessor, which is what often makes communication in a multiprocessor fast. So message passing throughout the entire cluster is not optimal either.

The solution currently being adopted for programming clusters of multiprocessors is to use a dual-tier approach, combining message passing and shared memory programming, as if blending multicomputer and multiprocessor programming techniques, respectively. In this manner, within a multiprocessor, intra-node communication can be achieved through shared memory (Pthreads, OpenMP), and between multiprocessor nodes, inter-node communication can be achieved through message passing (e.g., MPI and PVM) [15, 50, 52, 51, 6]. One can either hand-write the code that combines one method from each paradigm (as sketched below) or use a library that specifically supports multi-tier machines, including KeLP2 or SIMPLE [35, 23, 14]. Since programming a piece of software with MPI in mind can easily double the size of the code, and programming and debugging in Pthreads can be a challenging task as well, the prospect of
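A minimal sketch of the dual-tier style (generic code, not from the thesis): one MPI rank per node handles the inter-node tier, and an OpenMP parallel loop uses the processors within each node.

   // Generic dual-tier sketch: MPI between nodes, OpenMP within a node.
   // Not taken from the thesis's sources.
   #include <mpi.h>
   #include <omp.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, nodes;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nodes);

       double local = 0.0;
   #pragma omp parallel for reduction(+:local)   // intra-node: shared memory
       for (int i = rank; i < 1000000; i += nodes)
           local += 1.0 / (1.0 + i);

       double total = 0.0;                        // inter-node: message passing
       MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
       if (rank == 0)
           printf("total = %f\n", total);
       MPI_Finalize();
       return 0;
   }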
processor to every other processor directly. As the number of processors p grows, the number of connections grows as

   p(p - 1) / 2                                              (II.1)

With 7 processors, for example, a crossbar already needs 7 x 6 / 2 = 21 connections, as Table II.1 shows.

   Number of Processors    Number of Connections
   2                       1
   3                       3
   4                       6
   5                       10
   6                       15
   7                       21

Table II.1: Growth of connections in a crossbar-switched machine

II.C Clusters of Multiprocessors

A cluster of multiprocessors is the composite of two distinct hardware technologies for building parallel machines: multicomputers and multiprocessors. A cluster of multiprocessors is a possible solution to the lack of expandability of a lone multiprocessor and the expensive nature of expanding a multicomputer. This solution has been adopted in high-end server lines and even supercomputers from nearly all major computer hardware manufacturers, including Sun, IBM, HP, and Compaq (DEC). SGI has created a machine called the Origin2000, which uses shared memory but has a very advanced hypercube interconnection structure to support scalability of up to 256 processors per Origin2000. An image of what the Origin2000's architecture looks like can be seen in figure II.3. Some of the computers with the highest theoretical peak speed in the world, such as ASCI Red at Sandia National Labs (Intel), ASCI Blue Pacific at Lawrence Livermore National Labs (IBM), and the Blue Horizon machine at the San Diego Supercomputer Center (IBM), are all essentially clusters of multiprocessors [33, 62]. Clusters of
 * Department of Computer Science and Engineering,
 * University of California, San Diego
 ****************************************************************/

#include "Sputnik.h"
#include "j3D.h"
#include "XArray3.h"
#include "Grid3.h"
#include "Mover3.h"
#include "timer.h"
#include "manhattan.h"
#include <omp.h>
#include <mpi.h>

extern int testSMPS;

void cmdLine(int argc, char **argv, int& L, int& M, int& N, double& eps,
             int& niter, int& chk_freq, int& reps, int& si, int& sj,
             int& gi, int& gj);

void ReportTimings(double *times, double *timesLoc, int reps, int chk_freq,
                   int niter, int N, int si, int sj, int gi, int gj);

void InitGrid(IrregularGrid3& grid)
{
   grid.fill(1.0);
   grid.assignGhost(0.0);
}

void ComputeLocal(IrregularGrid3& grid, int si, int sj, const int color,
                  IrregularGrid3& rhs);

/****************************************************************
 * main
 * main takes one argument, N, the number of points along each axis.
 ****************************************************************/
int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);
   SputnikGo(argc, argv);
   MPI_Finalize();
   return 0;
}

double SputnikMain(int argc, char **argv, int testSMPS,
                   double *SputnikTimes)
{
(int arg = 1; arg < argc; arg++) {
         if (!strcmp("-testSMPS", argv[arg]))
            testSMPS = atoi(argv[++arg]);
         else if (!strcmp("-hetero", argv[arg]))
            hetero = atoi(argv[++arg]);
         else if (!strcmp("-numthreads", argv[arg]))
            numThreads = atoi(argv[++arg]);
         else if (!strcmp("-maxthreads", argv[arg]))
            maxThreads = atoi(argv[++arg]);
         else if (!strcmp("-maxthr0", argv[arg]))
            maxThr[0] = atoi(argv[++arg]);
         else if (!strcmp("-maxthr1", argv[arg]))
            maxThr[1] = atoi(argv[++arg]);
         else if (!strcmp("-numthr0", argv[arg]))
            numThr[0] = atoi(argv[++arg]);
         else if (!strcmp("-numthr1", argv[arg]))
            numThr[1] = atoi(argv[++arg]);
      }

      cout << "INPUTS:" << endl;
      cout << "\ttestSMPS " << testSMPS << endl;
      cout << "\thetero " << hetero << endl;
      cout << "\tnumthreads " << numThreads << endl;
      cout << "\tmaxthreads " << maxThreads << endl;
      cout << "\tmaxthreads for SMP 0 " << maxThr[0] << endl;
      cout << "\tmaxthreads for SMP 1 " << maxThr[1] << endl;
      cout << "\tnumthreads for SMP 0 " << numThr[0] << endl;
      cout << "\tnumthreads for SMP 1 " << numThr[1] << endl;

      for (i = 0; i < nodes; i++) {
         k = MPI_Get_processor_name(procName, &j);
         if (!k)
            cout << "\tname of SMP " << i << ": " << procName << endl;
      }
   }

   if (testSMPS == 1 && maxThreads > 0) {
      cout <<
rest, the fastest node is wasting time while it waits for the slower nodes to finish. Thus, the wasted time can be computed as follows, where N is the total number of nodes and the times for the nodes are indexed from 0 to N-1:

   T_i = \mathrm{TimeOnNode}_i                                            (V.2)

   T_{\mathrm{wasted}} = \sum_{i=0}^{N-1} \left( \max_k T_k - T_i \right)  (V.3)

Therefore, the solution is to repartition the data. After optimal repartitioning, the program will run in a time between that of the faster and slower nodes:

   T_{\mathrm{optimal},i} = T_{i,\mathrm{orig}} \cdot \frac{\text{new amount of data for node } i}{\text{original amount of data for node } i}   (V.4)

Therefore, the speedup will be:

   \mathrm{Speedup} = \frac{\max_i T_{i,\mathrm{orig}}}{T_{\mathrm{optimal}}}   (V.5)

This speedup is based on a repartitioning of the dataset. Where in a homogeneously partitioned run each node would be assigned 1/N of the work, in a heterogeneously partitioned run the work assigned is a bit more complex. If the node originally ran in time T_{i,orig}, then first we want to find out the speed of this node relative to all the others and to the total. We can do this by assigning a power P_i to the node, equal to the inverse of the time, and then use it to produce a power ratio R_i:

   P_i = \frac{1}{T_{i,\mathrm{orig}}}                                     (V.6)

   P_{\mathrm{totalcluster}} = \sum_{k=0}^{N-1} P_k = \sum_{k=0}^{N-1} \frac{1}{T_{k,\mathrm{orig}}}   (V.7)

   R_i = \frac{P_i}{P_{\mathrm{totalcluster}}} = \frac{1/T_{i,\mathrm{orig}}}{\sum_{k=0}^{N-1} 1/T_{k,\mathrm{orig}}}   (V.8)

The goal of the software is to assign a percentage of the total work of the problem to a node such that the
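As an illustration only (not the thesis's actual code), the following sketch computes the power ratios of equations V.6-V.8 and converts them into per-node row counts for a one-dimensional decomposition; the names (partitionRows, times, totalRows) are hypothetical.

   // Hedged sketch: convert per-node kernel times into power ratios
   // (eqs. V.6-V.8) and then into per-node strip widths for a 1-D
   // decomposition of `totalRows` rows. Names are illustrative.
   #include <vector>

   std::vector<int> partitionRows(const std::vector<double>& times,
                                  int totalRows)
   {
       const int n = static_cast<int>(times.size());
       double pTotal = 0.0;                    // P_totalcluster (eq. V.7)
       for (double t : times)
           pTotal += 1.0 / t;
       std::vector<int> rows(n);
       int assigned = 0;
       for (int i = 0; i < n; ++i) {
           double ratio = (1.0 / times[i]) / pTotal;   // R_i (eq. V.8)
           rows[i] = static_cast<int>(ratio * totalRows);
           assigned += rows[i];
       }
       rows[n - 1] += totalRows - assigned;    // rounding remainder to last node
       return rows;
   }

For two nodes with times of 10 and 20 seconds, the ratios are 2/3 and 1/3, so the faster node receives twice as many rows, and both finish at roughly the same time.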
in the node, but may be less if the problem is not large enough to warrant the overhead and costs of shared memory communication for that many threads. More threads would not speed up communication and may in fact slow it down, because two or more threads would be competing for one processor's time and memory.

The second optimization that Sputnik performs is decomposition. One of the appealing aspects of using KeLP, aside from the fact that it is based on MPI, which has a standardized interface across all major parallel platforms, is that it supports block decomposition.

IV.D Decomposition

IV.D.1 Block Decomposition

Block decomposition allows for manipulating certain types of data in a much easier way than with standard C++ datatypes. Rather than describing data in terms of C++ arrays and having to program complex MPI communications, in KeLP one can describe the domain of the data (called a Grid) and multidimensional blocks within the domain (called Regions). Rather than forcing the application programmer to write long sequences of loops to pass ghost cells (boundary conditions) between the blocks on different processors, KeLP handles this automatically, with built-in functions to expand and shrink any given Region (a simplified sketch of the idea follows). Further, a Region can be intersected with another Region and, since it is merely a subset of the overall Grid, it can overlap with another Region and cover a domain which may, as a whole, be very irregular, with neat individual
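To make the Region idea concrete, here is a simplified one-dimensional stand-in; this is not KeLP's actual API, only an illustration of how growing a block and intersecting it with a neighbor yields the ghost-cell strip to communicate.

   // Simplified stand-in for KeLP-style Regions (NOT the KeLP API):
   // grow a block for ghost cells, then intersect with a neighbor.
   #include <algorithm>

   struct Region1 {
       int lo, hi;                                  // inclusive index bounds
       Region1 grow(int g) const { return {lo - g, hi + g}; }
       Region1 intersect(const Region1& o) const {
           return {std::max(lo, o.lo), std::min(hi, o.hi)};
       }
       bool empty() const { return lo > hi; }       // empty if lo > hi
   };

   // Usage: the grown region of one block, intersected with its
   // neighbor, is exactly the strip of cells to exchange.
   //   Region1 mine{0, 99}, next{100, 199};
   //   Region1 halo = mine.grow(1).intersect(next);   // {100, 100}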
threads are used on aegir. This speedup is good, though still 7 percent less than the theoretical best for the thread configuration with 32 threads on balder and 16 threads on aegir.

The New Compute timings for each system should be close to identical. As the timings from table V.3 indicate, the speeds are not perfectly identical. The reasons for this are not completely clear, but there are a number of possible explanations. First, speed on the Origin2000 depends highly on thread scheduling and memory placement, both of which seem to be massive issues affecting performance and can vary widely, even within the run of a program, as in this case. Since memory is stored throughout the entire machine, within the two-processor nodes, a poor distribution could certainly affect the timings and cause the kind of variation that we see in table V.3. Because these timings were done in a dedicated environment, competition for processor, memory, or network bandwidth is not an issue.

[Figure V.2: Speedup for redblack3D with 32 threads on balder and varying numbers of threads on aegir (16-32), using the Sputnik library; measured speedup is plotted against the theoretical speedup.]

Based on the equations presented above, we can predict, from the original timings and the number of threads per system that we are using, the theoretical, or predicted, time for a re-partitioned
List of Figures ............................................. x
Acknowledgements ............................................ xii
Vita, Publications, and Fields of Study ..................... xiv
Abstract of the Thesis ...................................... xv

I Introduction .............................................. 1
   A. Motivation ............................................ 1
   B. Heterogeneity ......................................... 5
   C. Organization of the Thesis ............................ 6

II Clusters of Multiprocessors .............................. 8
   A. Multiprocessors ....................................... 8
   B. Multicomputers ........................................ 11
   C. Clusters of Multiprocessors ........................... 13

III Heterogeneous Multi-Tier Programming .................... 16
   A. Previous Work ......................................... 16
      1. Multi-Protocol MPI ................................. 17
      2. Multi-Tier APIs .................................... 18
      3. Heterogeneity Work ................................. 20
   B. Heterogeneous Multi-Tier Programming .................. 21
      1. Problem ............................................ 21
      2. Requirements ....................................... 22
   C. Multi-Tier Experiments ................................ 23
      1. MPI and Pthreads ................................... 23
      2. MPI and OpenMP ..................................... 27
      3. KeLP and OpenMP .................................... 32
that connects the processors, or by interacting using their shared memory. The latter is often accomplished by way of a threads library such as Pthreads (short for POSIX Threads), a POSIX-compliant thread library that supports mutual exclusion on pieces of memory, so that one thread can hold exclusive access to a piece of memory while it is writing to it, and other threads see only the final value once the thread that holds the lock has finished writing [6, 15, 50, 51, 52]. The API calls for Pthreads are starkly different from MPI's. Instead of communication primitives, there are commands that do thread manipulation (e.g., create, fork, and join) and commands that do memory locking, so that no more than one thread has access to a critical section of code simultaneously (e.g., pthread_mutex_lock and pthread_mutex_unlock). A program running on a multiprocessor using shared memory can often achieve significantly better timing results, because it is specifically using the shared memory hardware that a multiprocessor is designed to use, rather than a congested processor bus. It is possible to simulate message passing with shared memory, and also possible to simulate shared memory with message passing. Machines that are called distributed shared memory machines are often examples of the latter. A distributed shared memory machine is one that has a library allowing shared-memory-style calls, even though the
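As a small illustration (generic POSIX code, not from the thesis), two threads increment a shared counter under a mutex so that their updates never interleave:

   /* Minimal, generic Pthreads example (not from the thesis): two
      threads update a shared counter inside a mutex-protected
      critical section. */
   #include <pthread.h>
   #include <stdio.h>

   static long counter = 0;
   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

   static void *work(void *arg)
   {
       for (int i = 0; i < 100000; i++) {
           pthread_mutex_lock(&lock);    /* enter critical section */
           counter++;
           pthread_mutex_unlock(&lock);  /* leave critical section */
       }
       return NULL;
   }

   int main(void)
   {
       pthread_t t1, t2;
       pthread_create(&t1, NULL, work, NULL);
       pthread_create(&t2, NULL, work, NULL);
       pthread_join(t1, NULL);
       pthread_join(t2, NULL);
       printf("counter = %ld\n", counter);  /* always 200000 */
       return 0;
   }

Without the lock, the two increments could interleave and lose updates, which is exactly the hazard the mutex primitives exist to prevent.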
87. saging protocols, depending on where the message is in the machine. If the message needs to be passed within an SMP, for instance, it is passed using shared memory. Within an MPP, the message is passed using a native messaging layer. Outside the MPP, between MPPs, the message is passed using TCP/IP. Lumetta finds, for instance, that the shared-memory protocol can achieve five times the bandwidth of the networking protocol. While multi-method or multi-protocol communication, especially when running across MPI, can theoretically run effectively on a heterogeneous cluster, there is also a strong possibility that it may underperform an API that has been specifically designed in a multi-tier fashion. The reason is that although the multi-protocol-capable API can use all of the processors, and can even take advantage of whatever special shared-memory and networking hardware exists in the cluster, it is entirely possible that the runtime system will distribute the MPI processes in a nonoptimal manner that requires significantly more communication than is necessary. An API that has been specifically designed to partition on two levels, mirroring the way the machine itself is built at the hardware level, can specify exactly how all the processes should be arranged. For example, the MPI processes may be scattered throughout the entire cluster when the program executes. Some processes might be able to share memory within the cluster, but there is no guarantee that those
88. stem, which functions like Sputnik in some respects, choosing an optimal decomposition based on pre-known aspects of the computational demands. Zoom: Anglano, Schopf, Wolski, and Berman investigated a method for describing heterogeneous applications in terms of structure, implementation, and data. The motivation is that not every machine in existence (e.g., multicomputers, vector supercomputers, multiprocessors) is adept at solving all problems. The Zoom representation attempts to solve this problem by allowing the user to abstract the program such that each particular segment in the abstraction can be sent to the machine, or class of machine, that it is best suited for [5]. Such a technology would be a fantastic improvement to Sputnik in the future. See the Proceedings of the Heterogeneous Computing Workshop 1995 and the Proceedings of the 1992 Heterogeneous Workshop (IEEE CS Press) for more information on related technology.

III.B Heterogeneous Multi-Tier Programming

III.B.1 Problem

1. As discussed previously, the goal of my research presented in this thesis is to find a way to make scientific programs run faster and improve utilization on heterogeneous clusters of multiprocessors, while still allowing the user to write the program as if it were running on a homogeneous cluster.

2. The problem is that such programs currently have imbalanced loads if homogeneous partitions are run on heterogeneous multiprocessor nodes.

3. Ther
89. tart. Essentially, everything that was in main can now be in SputnikMain, with the addition of timing calls. A bare-bones main might now look like this:

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);      // Initialize MPI
        InitKeLP(argc, argv);        // Initialize KeLP
        // Call Sputnik's main routine, which in turn will
        // then call SputnikMain()
        SputnikGo(argc, argv);
        MPI_Finalize();              // Close off MPI
        return 0;
    }

The call that sets the number of threads per node to use is actually made inside SputnikGo, and so does not need to be used by the programmer. The number of threads actually being used in a loop should be tested with the OMP_GET_MAX_THREADS call; a brief sketch of such a check follows this passage. In this way, the programmer can determine whether OpenMP is doing a good job of parallelizing the kernel that the programmer would like to speed up. A number of factors can affect OpenMP's parallelization, including striding of loops and data dependencies.

IV.F Sputnik Implementation

The overall process of what Sputnik does has been discussed previously in this thesis; I will discuss the specifics here. Sputnik has many command-line options, but in the following section I will discuss just the part where we test the strength of the multiprocessor nodes (testSMPS = 1), which is the most important and unique part of Sputnik's functionality. After SputnikGo is called, probably from within main, the program follows something similar
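As a small illustration of checking how many threads OpenMP actually grants to a parallel region (a sketch assuming a C++ translation unit compiled with OpenMP support; this is not code from the thesis):

    #include <omp.h>
    #include <cstdio>

    int main()
    {
        // Upper bound on the threads the next parallel region may use
        printf("max threads available: %d\n", omp_get_max_threads());

        #pragma omp parallel
        {
            // Inside the region, ask how many threads are actually running
            #pragma omp single
            printf("threads in this region: %d\n", omp_get_num_threads());
        }
        return 0;
    }

Comparing the two numbers (and the measured loop time) is one way to see whether the runtime is really parallelizing the kernel as expected.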
90. the modifications to a program that already runs in KeLP in an hour. As already discovered in several papers, however, some scientific codes optimize better than others [21, 44, 45].

V.A.1 Red-Black 3D

Redblack3D is a scientific program written using Sputnik, which is itself composed of MPI, KeLP, and OpenMP. It is mostly in C++, except for the kernel of the program, which is written in Fortran and linked in with the rest of the code. The OpenMP directives are in the Fortran part of the code. The program solves Poisson's equation, a differential equation, using the Gauss-Seidel method [41, 42]. Each run was done with one MPI process per Origin2000 and varying numbers of OpenMP threads within each system. The program alternates between running a kernel on the black points and on the red points, which are arranged in a three-dimensional grid. The reason for this is that redblack3D operates by doing a 7-point stencil calculation (right, left, front, back, top, bottom, and the point itself) and needs to make sure that the values adjacent to a point do not change within a given sweep. So the grid alternates, with every other point being a red point and all the others black, with no red point orthogonally adjacent to another red point (diagonal adjacency is fine); a small code sketch of this coloring appears below. The entire source code for redblack3D and the kernel, both modified with the Sputnik API, are in Appendices B and C.

V.B Hardware

The Origin2000 is somewhat different than the typical multiproc
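To illustrate the red/black coloring described in section V.A.1 above, here is a minimal sketch (not the thesis kernel, which is Fortran and appears in Appendix C) of one relaxation sweep over the red points of an n×n×n grid, selecting points by the parity of i+j+k. The array layout, names, and the sign convention on the right-hand side are assumptions for illustration.

    #include <vector>

    // Relax only the "red" points ((i+j+k) even) of the interior of an n*n*n grid.
    // u and rhs are flat arrays of size n*n*n in row-major order.
    void red_sweep(std::vector<double>& u, const std::vector<double>& rhs,
                   int n, double h)
    {
        const double c  = 1.0 / 6.0;
        const double c2 = h * h;
        auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };

        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j)
                for (int k = 1; k < n - 1; ++k)
                    if ((i + j + k) % 2 == 0)       // red points only
                        // Standard Poisson Gauss-Seidel update from the six
                        // orthogonal neighbors (the thesis kernel may use a
                        // different sign convention for rhs).
                        u[idx(i, j, k)] = c * (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                                             + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                                             + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]
                                             - c2 * rhs[idx(i, j, k)]);
    }

Because every orthogonal neighbor of a red point is black, none of the values read in this sweep are modified by it, which is exactly why the red and black sweeps can each be parallelized safely.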
91. thesis, I make one specific application study with two different types of optimizations: repartitioning the amount of data each node works on, and adjusting the number of threads that run on each node. This thesis does not attempt to solve the problem of scheduling heterogeneous clusters of multiprocessors that are networked over a heavily trafficked wide-area network, including grids. Such problems are best solved through different methods, including dynamic load balancing by use of Globus, Legion, the Network Weather Service, or AppLeS [30, 39, 66, 69]. Blending one or more of these technologies with my API is beyond the scope of this thesis. The thesis also does not attempt to tackle the problem of machines with a deeper hierarchy than two tiers (processor and node); a deeper hierarchy might be a cluster of clusters. Additionally, this thesis and the API it presents focus specifically on clusters of multiprocessors. They do not address the added level of detail of what happens when several completely different architecture types are clustered together, such as vector and multicomputer-style MPP supercomputers, where parts of the computation run better on different architectures. Optimizations, including the ones I have just mentioned, in addition to many others, are possible optimizations that could be done in ClusterOptimization. A process whereby the ClusterOptimizer does nothing more than vary the tiling of the problem to fit in leve
92. thing quite like this. It's been a unique, fun, life-changing experience that has given me not just an interest in scientific computing, but an interest and enthusiasm towards academia and research in general. This work was supported by UC MICRO program award number 99-007 and Sun Microsystems.

VITA

March 23, 1976 — Born, Marin County, California
1993–1996 — Intern, Autodesk Inc.
1996–1997 — Freelance Writer, TidBITS
1997 — Study Abroad, Oxford University, England
1997–1998 — Programming Intern, Pixar Animation Studios
1999 — Programming Intern, Dolby Laboratories Inc.
1999 — B.A., University of California, San Diego; Minor in Organic Chemistry
1996–2000 — Research Assistant, University of California, San Diego
2000 — Programming Intern, San Diego Supercomputer Center (SDSC)
2000 — M.S., University of California, San Diego
1994–Present — Computer Consultant
2000–Present — Chief Operating Officer, Dyna-Q Corporation

PUBLICATIONS

"Simulating Neurotransmitter Activity on Parallel Processors." S. Peisert, S. Mock, and S. B. Baden. UCSD Research Review 1999, Poster Session, February 26, 1999.

FIELDS OF STUDY

Major Field: Computer Science
Studies in Computer Graphics — Dr. Michael J. Bailey
Studies in High Performance and Scientific Computing — Professor Scott B. Baden
Studies in George Gershwin — Dr. Elizabeth B. Strauchen, Wolfson College, Oxford University, England
Studies in Russian Literature — Dr. Rog
93. things that occurred in the course of gathering results for redblack3D modified with the KeLP/OpenMP Sputnik API libraries. All of them appear to stem from anomalies either with OpenMP in general, or perhaps with the specific OpenMP implementation on the Origin2000 machines. Specifically, I found that MPI processes appear to run significantly faster than OpenMP threads. For example, when I ran with one MPI process with between 2 and 64 threads spawned by that MPI process, it ran significantly slower than if I ran with between 2 and 64 MPI processes with 1 OpenMP thread per process.

[Figure V.5: Speedup for redblack3D with 48 threads on balder and varying numbers of threads on aegir, using the Sputnik library; the plot compares measured speedup against the theoretical speedup as the number of threads on aegir varies from 24 to 48.]

Table V.8: Speedup and predicted timings for redblack3D with large numbers of threads per system.

    Threads            | New Compute          | Predicted | Speedup | Theoretical
    balder    aegir    | balder     aegir     |           |         | Speedup
    62        32       | 71.5       71.8537   | 72.5068   | 1.3820  | 1.3695
    128       64       | 80.3       75.5464   | 76.8115   | 1.345   | 1.4060
    128       96       | 65.5       68.841    | 69.3257   | 1.0677  | 1.0602

Despite speaking with the NCSA consultants, who in turn contacted SGI engineers, as well as performing a variety of experiments myself, the precise cause of this was never solved. It was speculated that this, too, h
94. this is that there is less time overhead from continually creating and destroying threads. The downside is that the threads are using system resources when other system processes might need a spare moment on a CPU. Instead of a simple static uniform partitioning, where each of the N nodes is assigned the fraction 1/N of the work, my version uses two partitioning schemes. The first is an optimized, non-uniform partitioning that bases the partitioning on the total number of processors, as opposed to the number of nodes. It calculates the total number of processors available on all nodes, P, the total number of nodes, N, and how many processors each individual node has, p_0, ..., p_{N-1}. It then assigns work chunks of the total data set W equal to the fraction of processing power each node has: w_i = W * (p_i / P) for i = 0, ..., N-1; a small sketch of this calculation follows below. Therefore, as long as everything in the machine is equal (especially the processing power of each individual processor) except the number of processors in a node, the problem workload will remain balanced and the problem will run optimally fast for the configuration of the cluster of multiprocessors. One cannot be guaranteed an optimal problem as in the first partitioning scheme, however. More than likely, speeds of processors will differ, speeds of memory and networks will differ, and cache sizes and other important machine characteristics will differ. Therefore, by having the user issue the proper flag to the program to initiate balanced non-unifor
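A minimal sketch of the proportional assignment just described (not taken from the Sputnik source): given the processor count of each node, it splits W units of work so that node i receives roughly W * p_i / P, with any remainder handed to the last node so the chunks sum to W. The function and variable names are illustrative.

    #include <vector>
    #include <cstdio>

    // Return chunk sizes w[0..N-1] with w[i] ~ W * procs[i] / P and sum(w) == W.
    std::vector<long> partition_by_processors(long W, const std::vector<int>& procs)
    {
        long P = 0;
        for (int p : procs) P += p;                 // total processors across all nodes

        std::vector<long> w(procs.size());
        long assigned = 0;
        for (size_t i = 0; i + 1 < procs.size(); ++i) {
            w[i] = W * procs[i] / P;                // proportional share (rounded down)
            assigned += w[i];
        }
        w.back() = W - assigned;                    // last node absorbs the remainder
        return w;
    }

    int main()
    {
        // Example: three nodes with 8, 4, and 2 processors sharing 1400 units of work
        std::vector<long> w = partition_by_processors(1400, {8, 4, 2});
        for (size_t i = 0; i < w.size(); ++i)
            std::printf("node %zu gets %ld units\n", i, w[i]);   // 800, 400, 200
        return 0;
    }

The second scheme described in the text goes further by measuring each node empirically rather than trusting processor counts alone.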
95. tion of the C loops. An example of this is shown in Figure III.3. Finally, there was the issue of Fortran and C OpenMP compatibility, since there were separate OpenMP directives recognized by each language's compiler. I wrote a small microkernel to test OpenMP scaling, as well as C/Fortran OpenMP and MPI compatibility, on the Origin2000, as shown in Figure III.4. The microkernel was written to avoid compiler optimization and precomputation as much as possible. The program runs two heavyweight processes. One process keeps running just the inner loop for every iteration:

    for (i = 0; i < LONG; i++)
        arr[i] = arr[i-1] * i * 1.0001;

[Figure III.3: OpenMP fork/join example.]

    for (int i = 0; i < 40000; i++)
        times[i] = kernel(x, y, z);

    double kernel(double x, double y, double z)
    {
        // Every time that kernel() is called, the following pragma is
        // encountered as well. If kernel() is called 40,000 times, as shown
        // in this example, the overhead of generating threads will be
        // incurred 40,000 times.
        #pragma omp parallel shared(x, y, z)
        {
            #pragma omp for schedule(static)
            for (int i = 0; i < MAX_INT; i++) {
                /* <mathematical calculations> */
            }
        }
    }

The other process, for each iteration of the outer loop, runs with a different number of threads, set with OpenMP commands. The program starts at 64 threads and halves the number of threads each time through, all the way down to 1; a sketch of this pattern appears after this passage. If the program scales well, the time should double with each iteratio
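The halving pattern just described could look roughly like the following sketch (illustrative only; the timer choice, loop body, and variable names are assumptions, not the thesis microkernel):

    #include <omp.h>
    #include <cstdio>

    int main()
    {
        const long N = 50000000;

        for (int nthreads = 64; nthreads >= 1; nthreads /= 2) {
            omp_set_num_threads(nthreads);          // request this many threads
            double t0 = omp_get_wtime();

            #pragma omp parallel for schedule(static)
            for (long i = 0; i < N; i++) {
                // placeholder work; the real microkernel does floating-point math
                volatile double x = i * 1.0001;
                (void)x;
            }

            double elapsed = omp_get_wtime() - t0;
            // With perfect scaling, elapsed should roughly double as threads halve.
            printf("%2d threads: %.3f s\n", nthreads, elapsed);
        }
        return 0;
    }

Plotting the reported times against the thread count gives a quick picture of how well OpenMP is scaling on the node.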
96. titioned and balanced run of the program. At its worst, the runs with a maximum of 32 threads per system are 5.19% worse than the predicted results. At best, the actual runs are 2.29% better than the predicted results. Again, this variance is presumably due to the same conditions that cause a discrepancy between the timings of the nodes after their workloads have been rebalanced.

V.E.2 Up to 48 threads per system

Table V.5: Complete redblack3D timings with 48 threads on balder and varying numbers of threads on aegir.

    Threads   | Original Compute     | New Total            | New Compute
    (aegir)   | balder     aegir     | balder     aegir     | balder     aegir
    24        | 37.6       62.9      | 50.6       49.8617   | 43.8       46.3449
    30        | 37.6       50.1      | 45.498     45.5      | 43.4       42.687
    36        | 36.1       43.5      | 42.2992    42.3      | 37.8       39.675
    42        | 36.5       38.4      | 40.5622    40.7      | 36.8       35.4458
    48        | 37.4       34.3      | 41.4       42.3218   | 33.3       36.2054

Table V.6: Speedup and predicted timings for redblack3D with 48 threads on balder and varying numbers of threads on aegir.

    Threads            | New Compute          | Predicted | Speedup | Theoretical
    balder    aegir    | balder     aegir     |           |         | Speedup
    48        24       | 43.8       46.3449   | 47.0655   | 1.3572  | 1.3364
    48        30       | 43.4       42.687    | 42.9592   | 1.1544  | 1.1662
    48        36       | 37.8       39.675    | 39.4560   | 1.0964  | 1.1025
    48        42       | 36.8       35.4458   | 37.4259   | 1.0435  | 1.0260
    48        48       | 33.3       36.2054   | 35.7830   | 1.0330  | 1.0452

As with the runs with a maximum of 32 threads per system, runs with a maximum of 48 threads per system also scale
97. titute, August 1998.

Crandall, P. E., and M. J. Quinn. "A Decomposition Advisory System for Heterogeneous Data-Parallel Processing." Proceedings of the Third International Symposium on High Performance Distributed Computing.

Crandall, P. E., and M. J. Quinn. "Non-Uniform 2-D Grid Partitioning for Heterogeneous Parallel Architectures." Proceedings of the 9th International Parallel Processing Symposium, 1995.

de Supinski, B. R., and J. May. "Benchmarking Pthreads Performance." Lawrence Livermore National Laboratory, UCRL-JC-133263.

Donaldson, S., J. M. D. Hill, and D. B. Skillicorn. "BSP Clusters: High Performance, Reliable and Very Low Cost." PRG-TR-5-98, Oxford University Computing Laboratory, 1998.

Fink, S. J. "A Programming Model for Block-Structured Scientific Calculations on SMP Clusters." Ph.D. Dissertation, UCSD CSE Department, June 1998.

Fink, S. J., S. B. Baden, and S. R. Kohn. "Efficient Run-Time Support for Irregular Block-Structured Applications." Journal of Parallel and Distributed Computing, 1998.

Fink, S. J., S. B. Baden, and S. R. Kohn. "Flexible Communication Mechanisms for Dynamic Structured Applications." IRREGULAR '96.

Fink, S. J., and S. B. Baden. "Runtime Support for Multi-Tier Programming of Block-Structured Applications on SMP Clusters." ISCOPE '97, December 1997.

Fink, S
98. to organizations with a need for large-scale parallel computation, well-established techniques used to program multicomputer MPPs and vector machines are not always the optimal techniques to program multiprocessors or multiprocessor clusters. Further, since one of the appealing aspects of clusters of multiprocessors is that many of their components can be built from readily available commercial hardware, such as Sun, IBM, or SGI multiprocessor workstations, it can be cost-effective to add in or swap out systems in the cluster at will, replacing old components gradually. The result is that what was originally a homogeneous cluster of multiprocessors can easily become heterogeneous over time with the addition of newer systems with different processor speeds, numbers of processors, memory sizes, cache sizes, and network speeds, an example of which is shown in Figure 1.1. [Figure 1.1: Diagram of a heterogeneous cluster of multiprocessor nodes — nodes with varying numbers of processors connected by a network hub or switch.] In a heterogeneous cluster, a uniform partitioning of data across the cluster is not optimal, because some of the nodes will finish before others, leaving parts of the cluster idle until the slower nodes terminate. This problem can generally be stated to say that a cluster will o
99. to this outline in pseudo-code (a self-contained sketch of the same search follows after the outline):

    // Until we have hit the user-defined limit on the maximum number of
    // threads, or until the time we get by increasing the number of threads
    // is higher (worse performance) than with the lower number of threads,
    // keep increasing the number of threads and running the kernel of the
    // program, as contained in SputnikMain, without communication. This way
    // we can get the individual timings for each multiprocessor node.
    while (i < MAX_THREADS &&
           time[last iteration] < time[second-to-last iteration]) {
        omp_set_num_threads(i);
        // By passing in NULL for the times, we are telling the routine inside
        // the DecompositionX class not to do any special modifications.
        time[i] = SputnikMain(int argc, char **argv, NULL, i);
        i = i * 2;
    }

    // Start from the iteration before the best we found in the previous loop.
    // During the first loop we move quickly toward the optimal solution by
    // doubling i each time; this time we only increment by 1, starting with
    // the best estimate from the first while loop.
    while (time[last iteration] < time[second-to-last iteration]) {
        omp_set_num_threads(i);
        time[i] = SputnikMain(int argc, char **argv, NULL, i);
        i++;
    }

    // Set the optimal number of threads. Each node may have a
    // different optimal value.
    omp_set_num_threads(optimal number);

    // This time, pass in the best times and let the DecompositionX
    // routines do partitioning based on the
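The outline above can be made concrete with a small self-contained sketch of the same search strategy. This is an illustration, not the Sputnik source: run_kernel stands in for the timed call to SputnikMain, the refinement is assumed to scan upward from the best doubling estimate, and all names and timing details are assumptions.

    #include <omp.h>
    #include <cmath>
    #include <cstdio>
    #include <map>

    // Stand-in for the node-local kernel timed by Sputnik; returns elapsed seconds.
    static double run_kernel(int nthreads)
    {
        omp_set_num_threads(nthreads);
        double t0 = omp_get_wtime();
        double sum = 0.0;
        #pragma omp parallel for reduction(+ : sum) schedule(static)
        for (long i = 1; i < 20000000; i++)
            sum += std::sqrt((double)i);
        (void)sum;
        return omp_get_wtime() - t0;
    }

    int main()
    {
        const int MAX_THREADS = 64;
        std::map<int, double> timings;   // thread count -> measured time

        // Phase 1: double the thread count until timings stop improving.
        int best = 1;
        timings[best] = run_kernel(best);
        for (int i = 2; i <= MAX_THREADS; i *= 2) {
            timings[i] = run_kernel(i);
            if (timings[i] >= timings[best]) break;   // got worse: stop doubling
            best = i;
        }

        // Phase 2: refine linearly around the best doubling estimate.
        for (int i = best + 1; i <= MAX_THREADS; i++) {
            timings[i] = run_kernel(i);
            if (timings[i] >= timings[best]) break;   // no further improvement
            best = i;
        }

        printf("optimal thread count: %d (%.3f s)\n", best, timings[best]);
        return 0;
    }

In Sputnik, each node would run this search independently and then report its best timing so the decomposition routines can rebalance the workload accordingly.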
100. ture of Tiling Interactions." LCPC 1997.

National Center for Supercomputing Applications (NCSA) at the University of Illinois, Urbana-Champaign, part of The Alliance. <http://www.ncsa.uiuc.edu>, <http://www.uiuc.edu>, <http://www.ncsa.edu>.

NCSA. Silicon Graphics Origin2000. <http://www.ncsa.uiuc.edu/SCD/Hardware/Origin2000>.

Nguyen, T., M. M. Strout, L. Carter, and J. Ferrante. "Asynchronous Dynamic Load Balancing of Tiles." SIAM '99.

OpenMP Architecture Review Board. OpenMP FAQ. <http://www.openmp.org/index.cgi?faq>.

OpenMP Architecture Review Board. OpenMP Fortran Application Program Interface 1.0, Oct. 1997.

OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface 1.0, Oct. 1998.

Patterson, D., and J. Hennessy. Computer Architecture: A Quantitative Approach, 2nd Ed. Morgan Kaufmann.

Peisert, S., S. Mock, and S. Baden. "Simulating Neurotransmitter Activity on Parallel Processors." UCSD Graduate Research Conference, 1999.

Pfister, G. F. In Search of Clusters: The Coming Battle in Lowly Parallel Computing. Prentice Hall PTR, 1995.

Pilkington, J. R., and S. B. Baden. "Partitioning with Spacefilling Curves." CSE Technical Report CS94-349, March 1994.

Pilkington, J. R., and S. B. Baden. "Dynamic Partitioning of Non-Uniform Structured Workloads with Space
101. twork, while the other receives it. In broadcast and scatter methods, one processor sends a message to all other processors in the network. In the gather method, all processors send to a single processor in the network. It is clear that as more messages are passed, the more congested the network becomes, and the more complex the solutions needed to solve the problem of building a low-latency, high-bandwidth network. An example of a distributed memory machine can be seen in Figure II.2. An extremely basic example of a simple distributed memory machine might use a bus to pass messages. The problem with this particular design is that the bus the messages travel on is often a bottleneck, and the design would therefore not scale well to larger numbers of processors due to competing demands on the bus. For this reason, large machines typically use a more scalable interconnect. The advantage of topologies like the crossbar, the hypercube, or the toroidal mesh (as used in the SGI/Cray T3E) is that they make for an extremely fast network and are very expandable to large numbers of processors. Unfortunately, toroidal meshes and crossbars are among the technologies that are very expensive to construct. [Figure II.2: Diagram of a distributed memory machine — processors with L1 and L2 caches and main memory, joined by an interconnection network.] Using a crossbar switch, for instance, would involve connecting every
102. uirement is that the message-passing library implements a collection of eleven message-passing primitives and the node library implements three shared-memory primitives. These combine to implement a final set of eight SIMPLE primitives: barrier, reduce, broadcast, allreduce, alltoall, alltoallv, gather, and scatter. Although MPI (including MPICH) works as a messaging layer, the architects of SIMPLE discovered that the Internode Communication Library (ICL) provided superior performance. They also decided to use Pthreads as the SMP node layer, although, as they describe, a faster library, possibly something vendor-specific, might work even better. The calls in the SIMPLE library allow the programmer to work across multiple SMPs seamlessly, instead of having to partition the dataset twice manually: a primary partition to determine which data segment runs on each SMP, and a secondary partition to determine which part of each data segment runs on each individual processor of each SMP.

KeLP2. Expanding upon the KeLP1 API, which already made programming SMPs easier with support for region calculus and data motion, among other parallel programming abstractions, KeLP2 supports the unique multi-tier structure of clusters of SMPs [23, 26, 12, 8]. While KeLP1 aids programmers in making parallel code easier to write without suffering performance penalties, KeLP2 does the same thing for multi-tier machines. Essentially, KeLP2 opened up a whole new and more
103. y Scott B. Baden for 3D RB; converted to 3D blocked-for-cache RB ordering.

    c ******************************************************************
    c     Smooth the red points
    c ******************************************************************
          subroutine rb7rrelax(u, ul0, ul1, ul2, uh0, uh1, uh2, si, sj, rhs)
          integer ul0, ul1, uh0, uh1, ul2, uh2, si, sj
          double precision u(ul0:uh0, ul1:uh1, ul2:uh2)
          double precision rhs(ul0:uh0, ul1:uh1, ul2:uh2)
          double precision c, h, c2
          integer i, j, k, ii, jj, jk

          c  = 1.0d0 / 6.0d0
          h  = 1.0d0
          c2 = h * h

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(jj, ii, k, j, i, jk)
    !$OMP DO SCHEDULE(STATIC)
          do jj = ul1+1, uh1-1, sj
            do ii = ul0+1, uh0-1, si
              do k = ul2+1, uh2-1
                do j = jj, min(jj+sj-1, uh1-1)
                  jk = mod(j+k, 2)
                  do i = ii+jk, min(ii+jk+si-1, uh0-1), 2
                    u(i,j,k) = c *
         2              ( u(i-1,j,k) + u(i+1,j,k)
         3              + u(i,j-1,k) + u(i,j+1,k)
         4              + u(i,j,k-1) + u(i,j,k+1)
         5              - c2 * rhs(i,j,k) )
                  end do
                end do
              end do
            end do
          end do
    !$OMP END DO NOWAIT
    !$OMP END PARALLEL
          return
          end

    c     Smooth the black points
          subroutine rb7brelax(u, ul0, ul1, ul2, uh0, uh1, uh2, si, sj, rhs)
          integer ul0, ul1, uh0, uh1, ul2, uh2, si, sj
          double precision u(ul0:uh0, ul1:uh1, ul2:uh2)
          double precision rhs(ul0:uh0, ul1:uh1, ul2:uh2)
          double precision c, c2, h
          integer i, j, k, ii, jj, jk

          c  = 1.0d0 / 6.0d0
          h  = 1.0d0
          c2 = h * h

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(jj, ii, k, j, i, jk)
    !$OMP DO SCHEDULE(STATIC)
          do jj = ul1+1, uh1-1, sj
            do ii = ul0+1, uh0-1, si
              do k = ul2+1, uh2-1
                do j = jj, min(jj+sj-1, uh1-1)
                  jk = 1 - mod(j+k, 2)
                  do i = ii+jk, min(ii+jk+si-1, uh0-1), 2
