
S. Sriram's Thesis - Electrical Engineering & Computer Sciences


Contents

1. 5.4 Synchronization ..... 116
   5.4.1 Synchronization protocols ..... 116
   5.4.2 The synchronization graph Gs ..... 118
   5.5 Formal problem statement ..... 122
   5.6 Removing redundant synchronizations ..... 124
   5.6.1 The independence of redundant synchronizations ..... 125
   5.6.2 Removing redundant synchronizations ..... 126
   5.6.3 Comparison with Shaffer's approach ..... 128
   5.7 Making the synchronization graph strongly connected ..... 131
   5.7.1 Adding edges to the synchronization graph ..... 133
   5.7.2 Insertion of delays ..... 137
   5.8 Computing buffer bounds from Gs and Gipc ..... 141
   5.9 Resynchronization ..... 142
   5.10 Summary ..... 144
   6 Extensions ..... 147
   6.1 The Boolean Dataflow model ..... 147
   6.1.1 Scheduling ..... 148
   6.2 Parallel implementation on shared memory machines ..... 152
   6.2.1 General strategy ..... 152
   6.2.2 Implementation on the OMA ..... 155
   6.2.3 Improved mechanism .....
2. [Figure 5.6: (a) A multi-resolution QMF filter bank used to illustrate the benefits of removing redundant synchronizations; (b) the precedence graph for (a); (c) a self-timed two-processor parallel schedule for (a); (d) the initial synchronization graph for (c). Dashed arrows denote synch edges; solid arrows denote internal edges.]

…mation, together with the invocation counts specified by q, we obtain the precedence relationships specified by the graph of Fig. 5.6(b), in which the i-th invocation of actor N is labeled N_i, and each edge e specifies that invocation snk(e) requires data produced by invocation src(e) exactly delay(e) iteration periods after the iteration period in which the data is produced. A self-timed schedule for Fig. 5.6(b), obtained from Hu's list-scheduling method [Hu61] (described in Chapter 1, Section 1.2), is specified in Fig. 5.6(c), and the synchronization graph that corresponds to the IPC graph of Fig. 5.6(b) and Fig. 5.6(c) is shown in Fig. 5.6(d). All of the dashed edges in Fig. 5.6(d) are synchronization edges. If we apply Shaffer's method, which considers only those synchronization edges that do not have delay, we can eliminate the need for explicit synchronization along only one of the 8 synchronization edges, the edge (A, B). In contrast, if we apply RemoveRedundantSynchs, we can detect the redundancy of (A, B) as well as four additional…
3. …k) ≥ end(src(ẽ), k − delay(ẽ)). The above relation is identical to Eqn (5.5), and this proves the theorem. QED.

The above theorem motivates the following definition.

Definition 5.1: If G1 = (V, E_int ∪ E_s1) and G2 = (V, E_int ∪ E_s2) are synchronization graphs with the same vertex set, we say that G1 preserves G2 if for every e such that e ∈ E_s2 − E_s1, we have ρ_G1(src(e), snk(e)) ≤ delay(e).

Thus, Theorem 5.1 states that the synchronization constraints of (V, E_int ∪ E_s1) imply the synchronization constraints of (V, E_int ∪ E_s2) if (V, E_int ∪ E_s1) preserves (V, E_int ∪ E_s2).

Given an IPC graph G_ipc and a synchronization graph G_s such that G_s preserves G_ipc, suppose we implement the synchronizations corresponding to the synchronization edges of G_s. Then the iteration period of the resulting system is determined by the maximum cycle mean of G_s, MCM(G_s). This is because the synchronization edges alone determine the interaction between processors; a communication edge without synchronization does not constrain the execution of the corresponding processors in any way.

5.5 Formal problem statement

We refer to each access of the shared memory synchronization variable sv(e) by src(e) and snk(e) as a synchronization access to shared memory. If synchronization for e is implemented using UBS, then we see that on average 4 synchronization accesses are required for e in each iteration period, while BB…
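Definition 5.1 reduces to one shortest-path query per edge: G1 preserves G2 exactly when every synchronization edge of G2 that is absent from G1 is covered in G1 by a path of no greater total delay. A minimal sketch of that check (Python chosen for illustration; edges are assumed stored as (src, snk, delay) triples, and g1_edges should contain both the internal and synchronization edges of G1):

```python
import heapq

def rho(edges, nodes, x, y):
    """Minimum path delay from x to y, with edge 'delay' as the weight
    (Dijkstra applies because delays are non-negative); inf if no path."""
    adj = {v: [] for v in nodes}
    for src, snk, d in edges:
        adj[src].append((snk, d))
    dist = {v: float("inf") for v in nodes}
    dist[x] = 0
    pq = [(0, x)]
    while pq:
        dx, u = heapq.heappop(pq)
        if dx > dist[u]:
            continue
        for v, d in adj[u]:
            if dx + d < dist[v]:
                dist[v] = dx + d
                heapq.heappush(pq, (dist[v], v))
    return dist[y]

def preserves(g1_edges, g2_sync_edges, nodes):
    """True if synchronization graph G1 preserves G2 (Definition 5.1)."""
    g1_set = set(g1_edges)
    return all(
        rho(g1_edges, nodes, s, t) <= d
        for (s, t, d) in g2_sync_edges
        if (s, t, d) not in g1_set
    )
```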
4. …+1 tokens must be produced on e before the k-th execution of v can begin. Thus the actor src(e) must have completed its (k − delay(e) + 1)-th execution before v can begin its k-th execution. The +1's arise because we define start(v, k) for k ≥ 0 rather than k > 0; this is done for notational convenience.

Any schedule that satisfies all the precedence constraints specified by edges in an HSDFG G is also called an admissible schedule for G [Reit68]. A valid execution of an HSDFG corresponds to a set of firing times start(v, k) that correspond to an admissible schedule; i.e., a valid execution respects all data precedences specified by the HSDFG. For the purposes of the techniques presented in this thesis, we are only interested in the precedence relationships between actors in the HSDF graph.

In a general HSDFG, one or more pairs of vertices can have multiple edges connecting them in the same direction. Such a situation often arises when a multirate SDF graph is converted into a homogeneous HSDFG. Multiple edges between the same pair of vertices in the same direction are redundant as far as precedence relationships are concerned. Suppose there are multiple edges from vertex v_i to v_j, and amongst these edges the edge that has minimum delay has delay equal to d_min. Then if we replace all these edges by a single edge with delay equal to d_min, it is easy to verify that this single edge maintains the precedence con…
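Because only the minimum-delay edge between an ordered vertex pair constrains precedence, the replacement can be done in a single pass. A small illustrative sketch (edge representation assumed, not taken from the thesis):

```python
def collapse_parallel_edges(edges):
    """Replace multiple edges between the same ordered vertex pair
    with a single edge carrying the minimum delay d_min."""
    best = {}
    for src, snk, delay in edges:
        key = (src, snk)
        if key not in best or delay < best[key]:
            best[key] = delay
    return [(s, t, d) for (s, t), d in best.items()]

print(collapse_parallel_edges([("A", "B", 2), ("A", "B", 0), ("B", "C", 1)]))
# [('A', 'B', 0), ('B', 'C', 1)]
```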
5. [Ols90] G. J. Olsder, J. A. C. Resing, R. E. De Vries, and M. S. Keane, "Discrete Event Systems with Stochastic Processing Times," IEEE Transactions on Automatic Control, Vol. 35, No. 3, pp. 299-302, March 1990.

[Ols89] G. J. Olsder, "Performance Analysis of Data-driven Networks," in Systolic Array Processors: Contributions by Speakers at the International Conference on Systolic Arrays, edited by J. McCanny, J. McWhirter, and E. Swartzlander Jr., Prentice Hall, New York, pp. 33-41, 1989.

[Ous94] J. K. Ousterhout, An Introduction to Tcl and Tk, Addison-Wesley Publishing, Redwood City, CA, 1994.

[Pap90] G. M. Papadopoulos, "Monsoon: A Dataflow Computing Architecture Suitable for Intelligent Control," Proceedings of the 5th IEEE International Symposium on Intelligent Control, 1990.

[Parhi91] K. Parhi and D. G. Messerschmitt, "Static Rate-optimal Scheduling of Iterative Data-flow Programs via Optimum Unfolding," IEEE Transactions on Computers, Vol. 40, No. 2, pp. 178-194, February 1991.

[Patt90] D. A. Patterson and J. L. Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990.

[Peter81] J. L. Peterson, Petri Net Theory and the Modelling of Systems, Prentice-Hall Inc., 1981.

[Pin95a] J. Pino, S. Ha, E. A. Lee, and J. T. Buck, "Software Synthesis for DSP Using Ptolemy," Journal of VLSI Signal Processing, Vol. 9, No. 1, January 1995.

[Pin95b] J. L. Pino, S. S. Bhattacharyya, and…
6. …Journal of VLSI Signal Processing, No. 6, 1993.

[Bhat94] S. S. Bhattacharyya and E. A. Lee, "Memory Management for Dataflow Programming of Multirate Signal Processing Algorithms," IEEE Transactions on Signal Processing, Vol. 42, No. 5, May 1994.

[Bier89] J. Bier and E. A. Lee, "Frigg: A Simulation Environment for Multiprocessor DSP System Development," Proceedings of the International Conference on Computer Design, pp. 280-283, Boston, MA, October 1989.

[Bier90] J. Bier, S. Sriram, and E. A. Lee, "A Class of Multiprocessor Architectures for Real-Time DSP," VLSI DSP IV, ed. H. Moscovitz, IEEE Press, November 1990.

[Bils94] G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete, "Static Scheduling of Multi-Rate and Cyclo-Static DSP Applications," VLSI Signal Processing VII, IEEE Press, 1994.

[Blaz87] J. Blazewicz, "Selected Topics in Scheduling Theory," in Surveys in Combinatorial Optimization, North-Holland Mathematics Studies, 1987.

[Bork88] S. Borkar et al., "iWarp: An Integrated Solution to High Speed Parallel Computing," Proceedings of the Supercomputing 1988 Conference, Orlando, Florida, 1988.

[Bray84] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, Kluwer Academic Publishers, 1984.

[Bryant86] R. E. Bryant, "Graph-Based Algorithms for Boolean Function Manipulation," IEEE Transactions on Computers, C-3…
7. Note that writing the control token c once to shared memory is enough, since the same shared location can be read by all processors requiring the value of c.

[Figure 6.6: Transaction order corresponding to the TRUE and FALSE branches; each branch outcome has its own bus access order for subgraph 1 (proc 1 executes the TRUE subgraph, proc 2 executes G and the Branch actor, proc 3 executes H).]

For the OMA architecture, our proposed strategy is to switch between these two access orders at run time. This is enabled by the preset feature of the transaction controller (Chapter 3, Section 3.4.2). Recall that the transaction controller is implemented as a presettable schedule counter that addresses a memory containing the processor IDs corresponding to the bus access order. To handle conditional constructs, we derive two bus access lists corresponding to each path in the program, and the processor that determines the branch condition (processor 2 in our example) forces the controller to switch between the access lists by loading the schedule counter with the appropriate value (address 7 in the bus access schedule of Fig. 6.7). Note from Fig. 6.7 that there are two points where the schedule counter can be set: one is at the completion of the TRUE branch, and the other is a jump into the FALSE branch. The branch into the FALSE path is best taken care of by processor 2, si…
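The run-time switch can be mimicked in software: a schedule counter steps through a flat bus-access list, and the branch-deciding processor presets the counter to the start address of the list matching the branch outcome. A minimal sketch (list layout and addresses are illustrative, not the actual OMA schedule memory image):

```python
# Flat bus-access schedule: TRUE-branch list followed by FALSE-branch list.
# Entry 7 is assumed, purely for illustration, to start the FALSE list.
ACCESS_SCHEDULE = ["P1", "P2", "P3", "P1", "P2", "P3", "P1",  # TRUE branch
                   "P2", "P3", "P1", "P2"]                    # FALSE branch
FALSE_BRANCH_ADDR = 7

class ScheduleCounter:
    def __init__(self, schedule):
        self.schedule = schedule
        self.addr = 0

    def preset(self, addr):
        # The branch-deciding processor loads the counter with a new address.
        self.addr = addr

    def next_grant(self):
        proc = self.schedule[self.addr]
        self.addr = (self.addr + 1) % len(self.schedule)
        return proc

counter = ScheduleCounter(ACCESS_SCHEDULE)
branch_is_true = False  # outcome computed at run time by processor 2
if not branch_is_true:
    counter.preset(FALSE_BRANCH_ADDR)
print([counter.next_grant() for _ in range(4)])  # grants follow the FALSE list
```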
8. In [Bhat95a] we formally establish that the resynchronization problem is NP-hard by deriving a polynomial-time reduction from the classic minimal set covering problem, which is known to be NP-hard [Garey79], to the pairwise resynchronization problem. The complexity remains the same whether we consider a general resynchronization problem that also attempts to insert edges within SCCs, or a restricted version that only adds feedforward edges between SCCs; the Resynchronize procedure in [Bhat95a] restricts itself to the latter, because in this case it is simpler to ensure that the estimated throughput is unaffected by the added edges.

Although the correspondence that we establish between the resynchronization problem and set covering shows that the resynchronization problem probably cannot be attacked optimally with a polynomial-time algorithm, the correspondence allows any heuristic for set covering to be adapted easily into a heuristic for the pairwise resynchronization problem, and applying such a heuristic to each pair of SCCs in a general synchronization graph yields a heuristic for the general (not just pairwise) resynchronization problem [Bhat95a]. This is fortunate, since the set covering problem has been studied in great depth, and efficient heuristic methods have been devised for it [Corm92].

[Figure 5.14: An example of resynchronization; (a), (b) show the new edge added relative to the existing synch edges.]

For a certain class of IPC g…
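The standard greedy set-covering heuristic (repeatedly pick the candidate that covers the most uncovered elements) is the kind of method that carries over: for pairwise resynchronization, the elements would be the original synchronization edges and each candidate set the edges a candidate resynchronization edge can subsume. A generic sketch, not the Resynchronize procedure of [Bhat95a]; the instance data is made up:

```python
def greedy_set_cover(universe, candidates):
    """candidates: dict mapping candidate name -> set of covered elements.
    Returns a list of candidate names that together cover the universe."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the candidate covering the most still-uncovered elements.
        best = max(candidates, key=lambda c: len(candidates[c] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # remaining elements cannot be covered
        chosen.append(best)
        uncovered -= gain
    return chosen

# Toy instance: sync edges s1..s4; each candidate resynchronization
# edge r_i subsumes a subset of them.
print(greedy_set_cover(
    {"s1", "s2", "s3", "s4"},
    {"r1": {"s1", "s2"}, "r2": {"s2", "s3"}, "r3": {"s3", "s4"}},
))  # e.g. ['r1', 'r3']
```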
9. Thus the OT schedule is essentially an ST schedule with the added transaction order constraints specified by O. After an FS schedule is obtained using the execution time estimates, the transaction order is obtained from the start-time function σ of the FS schedule: we simply set the transaction order to O = (v_1, v_2, v_3, ..., v_{2K−1}, v_{2K}), where σ(v_1) ≤ σ(v_2) ≤ ... ≤ σ(v_{2K}). The transaction order can therefore be determined by sorting the set of communication actors S ∪ R according to their start times σ. Fig. 3.1 shows an example of how such an order could be derived from a given fully static schedule; this FS schedule corresponds to the HSDFG and schedule illustrated in Chapter 1, Fig. 1.2.

[Figure 3.1: One possible transaction order derived from the fully static schedule.]

The transaction order is enforced at run time by a controller implemented in hardware. The main advantage of ordering inter-processor transactions is that it allows us to restrict access to communication resources statically, based on the communication pattern determined at compile time. Since communication resources are typically shared between processors, run-time contention for these resources is eliminated by ordering processor accesses to them; this results in an efficient IPC mechanism at low hardware cost. We have built a prototype four-pro…
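Deriving O is then a one-line sort over the communication actors' compile-time start times. A small sketch (actor names and σ values are made up):

```python
# Communication actors (sends S and receives R) paired with their start
# times sigma(v) from the fully static schedule.
comm_actors = [
    ("s1", 2), ("r1", 5), ("s2", 7), ("r2", 9),
    ("s3", 12), ("r3", 14),
]

# Transaction order O: communication actors sorted by sigma(v).
transaction_order = [name for name, _ in sorted(comm_actors, key=lambda a: a[1])]
print(transaction_order)  # ['s1', 'r1', 's2', 'r2', 's3', 'r3']
```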
10. …architecture when the number of such control constructs is small, a reasonable assumption for most DSP algorithms.

Finally, we presented techniques for minimizing synchronization costs in a self-timed implementation that can be achieved by systematically manipulating the synchronization points in a given schedule; the IPC graph construct was used for this purpose. The techniques proposed include determining when certain synchronization points are redundant, transforming the IPC graph into a strongly connected graph, and then sizing buffers appropriately such that checks for buffer overflow by the sender can be eliminated. We also outlined a technique we call resynchronization, which introduces new synchronization points in the schedule with the objective of minimizing the overall synchronization cost.

The work presented in this thesis leads to several open problems and directions for further research. Mapping a general BDF graph onto the OMA, to make best use of our ability to switch between bus access schedules at run time, is a topic that requires further study. Techniques for multiprocessor scheduling of BDF graphs could build upon the quasi-static scheduling approach, which restricts itself to certain types of dynamic constructs that need to be identified (for example, as conditional constructs or data-dependent iterations) before scheduling can proceed. Assumptions regarding statistics of the Boolean tokens (e.g., the prop…
11. …cally in Fig. 1.3(c).

[Figure 1.3: Steps in a self-timed scheduling strategy; (a) HSDFG, (c) self-timed implementation schematic with send and receive actors on Proc 1 and Proc 2.]

An ST strategy is robust with respect to changes in execution times of actors, because sender-receiver synchronization is performed at run time. Such a strategy, however, implies higher IPC costs compared to the fully static strategy, because of the need for synchronization (e.g., using semaphore management). In addition, the ST strategy faces arbitration costs: the FS schedule guarantees mutually exclusive access of shared communication resources, whereas shared resources need to be arbitrated at run time in the ST schedule. Consequently, whereas IPC in the FS schedule simply involves reading and writing from shared memory (no synchronization or arbitration needed), implying a cost of a few processor cycles for IPC, the ST strategy requires on the order of tens of processor cycles, unless special hardware is employed for run-time flow control. We discuss in detail how this overhead arises in a shared bus multiprocessor configuration in Chapter 3. Run-time flow control allows variations in execution times of tasks; in addition, it also simplifies the compiler software, since the compiler no longer needs to perform detailed timing analysis and does not need to adjust the execution of proces…
12. …techniques for SDF graphs to the BDF model.

6.1.1 Scheduling

Buck presents techniques for statically scheduling BDF graphs on a single processor; his methods attempt to generate a sequential program, without a dynamic scheduling mechanism, using if-then-else and do-while control constructs where required. Because of the inherent undecidability of determining deadlock behaviour and bounded memory usage, these techniques are not always guaranteed to generate a static schedule even if one exists; a dynamically scheduled implementation (where a run-time kernel decides which actors to fire) can be used when a static schedule cannot be found in a reasonable amount of time.

[Figure 6.2: (a) Conditional (if-then-else) dataflow graph with SWITCH and SELECT actors; the branch outcome is determined at run time by actor B. (b) Graph representing data-dependent iteration; the termination condition for the loop is determined by actor D.]

Automatic parallel scheduling of general BDF graphs is still an unsolved problem. A naive mechanism for scheduling graphs that contain SWITCH and SELECT actors is to generate an Acyclic Precedence Graph (APG), similar to the APG generated for SDF graphs discussed in Section 1.2.1, for every possible assignment of the Boolean-valued control tokens in the BDF graph. For example, the if-then-else graph in Fig. 6.2(a) could have two different APGs, shown in Fig. 6.3, and the APGs thus obtained can be scheduled individual…
13. …G and vertices x, y in G, we define ρ_G(x, y) to be equal to the path delay of a minimum-delay path from x to y if there exist one or more paths from x to y, and equal to ∞ if there is no path from x to y. If G is understood, then we may drop the subscript and simply write ρ in place of ρ_G.

By a subgraph of (V, E), we mean the directed graph formed by any V′ ⊆ V together with the set of edges {e ∈ E | src(e), snk(e) ∈ V′}. We denote the subgraph associated with the vertex subset V′ by subgraph(V′). We say that (V, E) is strongly connected if for each pair of distinct vertices x, y, there is a path directed from x to y and there is a path directed from y to x. We say that a subset V′ ⊆ V is strongly connected if subgraph(V′) is strongly connected. A strongly connected component (SCC) of (V, E) is a strongly connected subset V′ ⊆ V such that no strongly connected subset of V properly contains V′. If V′ is an SCC, then when there is no ambiguity we may also say that subgraph(V′) is an SCC. If C1 and C2 are distinct SCCs in (V, E), we say that C1 is a predecessor SCC of C2 if there is an edge directed from some vertex in C1 to some vertex in C2; C1 is a successor SCC of C2 if C2 is a predecessor SCC of C1. An SCC is a source SCC if it has no predecessor SCC, and an SCC is a sink SCC if it has no successor SCC. An edge e is a feedforward edge of (V, E) if it is not contained in an SCC, or, equivalently, if i…
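These definitions map directly onto standard SCC machinery; the sketch below (an illustrative Kosaraju-style decomposition, not from the thesis) labels each vertex with its SCC and then classifies edges as feedforward whenever their endpoints lie in different SCCs:

```python
from collections import defaultdict

def scc_ids(nodes, edges):
    """Kosaraju's algorithm: map each node to the id of its SCC."""
    adj, radj = defaultdict(list), defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    order, seen = [], set()

    def dfs_fwd(u):
        seen.add(u)
        for v in adj[u]:
            if v not in seen:
                dfs_fwd(v)
        order.append(u)

    for u in nodes:
        if u not in seen:
            dfs_fwd(u)

    comp, label = {}, 0

    def dfs_rev(u):
        comp[u] = label
        for v in radj[u]:
            if v not in comp:
                dfs_rev(v)

    for u in reversed(order):
        if u not in comp:
            dfs_rev(u)
            label += 1
    return comp

def feedforward_edges(nodes, edges):
    """Edges not contained in any SCC."""
    comp = scc_ids(nodes, edges)
    return [(u, v) for u, v in edges if comp[u] != comp[v]]

print(feedforward_edges("abc", [("a", "b"), ("b", "a"), ("b", "c")]))
# [('b', 'c')]  -- the only edge crossing SCC boundaries
```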
14. …Prentice Hall, 1988.

[Lam88] M. Lam, "Software Pipelining: An Effective Scheduling Technique for VLIW Machines," Proceedings of the SIGPLAN 1988 Conference on Programming Language Design and Implementation, pp. 318-328, June 1988.

[Lam89] M. Lam, A Systolic Array Optimizing Compiler, Kluwer Academic Publishers, Norwell, Massachusetts, 1989.

[Lamp86] L. Lamport, "The Mutual Exclusion Problem: Parts I and II," Journal of the ACM, Vol. 33, No. 2, pp. 313-348, April 1986.

[Laps91] P. D. Lapsley, Host Interface and Debugging of Dataflow DSP Systems, M.S. Thesis, Electronics Research Laboratory, University of California, Berkeley, CA 94720, December 1991.

[Lauw90] R. Lauwereins, M. Engels, J. A. Peperstraete, E. Steegmans, and J. Van Ginderdeuren, "GRAPE: A CASE Tool for Digital Signal Parallel Processing," IEEE ASSP Magazine, Vol. 7, No. 2, April 1990.

[Law76] E. L. Lawler, Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York, pp. 65-80, 1976.

[Lee86] E. A. Lee, A Coupled Hardware and Software Architecture for Programmable DSPs, Ph.D. Thesis, Department of EECS, University of California, Berkeley, May 1986.

[Lee87] E. A. Lee and D. G. Messerschmitt, "Static Scheduling of Synchronous Dataflow Programs for Digital Signal Processing," IEEE Transactions on Computers, February 1987.

[Lee88a] E. A. Lee, "Programmable DSP Architectures: Part I," IEEE Acoustics, Speech,…
15. …redundant synchronization edges from a synchronization graph.

5.6.1 The independence of redundant synchronizations

The following theorem establishes that the order in which we remove redundant synchronization edges is not important; therefore, all the redundant synchronization edges can be removed together.

Theorem 5.2: Suppose that G = (V, E_int ∪ E_s) is a synchronization graph, e1 and e2 are distinct redundant synchronization edges in G (i.e., these are edges that could be individually moved to E_int), and G′ = (V, E_int ∪ (E_s − {e1})). Then e2 is redundant in G′. Thus both e1 and e2 can be moved into E_int together.

Proof: Since e2 is redundant in G, there is a path p ≠ (e2) in G, directed from src(e2) to snk(e2), such that

Delay(p) ≤ delay(e2).   (5.7)

Similarly, there is a path p′ ≠ (e1), contained in both G and G′, that is directed from src(e1) to snk(e1) and that satisfies

Delay(p′) ≤ delay(e1).   (5.8)

Now, if p does not contain e1, then p exists in G′ and we are done. Otherwise, p is of the form

p = (y1, ..., y_{k−1}, e1, y_{k+1}, ..., ym);

define the path obtained by replacing e1 with p′:

p″ = (y1, ..., y_{k−1}, p′, y_{k+1}, ..., ym).

Clearly p″ is a path from src(e2) to snk(e2) in G′. Also,

Delay(p″) = Delay(p) − delay(e1) + Delay(p′)
          ≤ Delay(p)     (from Eqn (5.8))
          ≤ delay(e2)    (from Eqn (5.7)). QED.

Theorem 5.2 tells us that we can avoid impleme…
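Theorem 5.2 justifies testing every synchronization edge against the full graph and deleting all the redundant ones in one pass: an edge e is redundant when some other path from src(e) to snk(e) has total delay at most delay(e). A minimal sketch of such a pass (function names hypothetical; this is not the thesis's exact RemoveRedundantSynchs pseudocode):

```python
import heapq

def min_path_delay(edges, nodes, x, y, skip=None):
    """Minimum total delay over paths from x to y, optionally ignoring one
    edge (so an edge cannot be used to justify its own redundancy)."""
    adj = {v: [] for v in nodes}
    for e in edges:
        if e is not skip:
            adj[e[0]].append((e[1], e[2]))
    dist = {v: float("inf") for v in nodes}
    dist[x] = 0
    pq = [(0, x)]
    while pq:
        dx, u = heapq.heappop(pq)
        if dx > dist[u]:
            continue
        for v, d in adj[u]:
            if dx + d < dist[v]:
                dist[v] = dx + d
                heapq.heappush(pq, (dist[v], v))
    return dist[y]

def remove_redundant_synchs(internal_edges, sync_edges, nodes):
    """Return only the synchronization edges that must be implemented."""
    all_edges = internal_edges + sync_edges
    kept = []
    for e in sync_edges:
        src, snk, delay = e
        if min_path_delay(all_edges, nodes, src, snk, skip=e) > delay:
            kept.append(e)  # no alternate path covers e, so keep it
    return kept
```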
16. …shared I/O resources, much as we order access to the shared bus and memory.

We also experimented with application of the ordered memory access idea to run-time parameter control. By run-time parameter control we mean controlling parameters in the DSP algorithm (gain of some component, bit rate of a coder, pitch of synthesized music sounds, etc.) while the algorithm is running in real time on the hardware. Such a feature is obviously very useful and sometimes indispensable. Usually one associates such parameter control with an asynchronous user input: the user changes a parameter (ideally by means of a suitable GUI) on his or her computer; this change causes an interrupt to occur on a processor, and the interrupt handler then performs the appropriate operations that cause the parameter change that the user requested.

For the OMA architecture, however, unpredictable interrupts are not desirable, as was noted earlier in this chapter; on the other hand, shared I/O and IPC are relatively inexpensive owing to the OT mechanism. To exploit this trade-off, we implemented the parameter control in the following fashion. The S-56X host handles the task of accepting user interrupts: whenever a parameter is altered, the DSP56000 on the S-56X card receives an interrupt, and it modifies a particular location in its memory (call it M). The OMA board, on the other hand, receives the contents of M on every schedule period, whether M was actually modified or not.
17. …1991.

[Hu61] T. C. Hu, "Parallel Sequencing and Assembly Line Problems," Operations Research, Vol. 9, 1961.

[Huis93] J. A. Huisken et al., "Synthesis of Synchronous Communication Hardware in a Multiprocessor Architecture," Journal of VLSI Signal Processing, Vol. 6, pp. 289-299, 1993.

[Kala93] A. Kalavade and E. A. Lee, "A Hardware-Software Codesign Methodology for DSP Applications," IEEE Design and Test, Vol. 10, No. 3, pp. 16-28, September 1993.

[Karp66] R. M. Karp and R. E. Miller, "Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing," SIAM Journal of Applied Math, Vol. 14, No. 6, November 1966.

[Karp78] R. M. Karp, "A Characterization of the Minimum Cycle Mean in a Digraph," Discrete Mathematics, Vol. 23, 1978.

[Koh90] W. Koh, A Reconfigurable Multiprocessor System for DSP Behavioral Simulation, Ph.D. Thesis, Memorandum No. UCB/ERL M90/53, Electronics Research Laboratory, University of California, Berkeley, June 1990.

[Koh75] W. H. Kohler, "A Preliminary Evaluation of Critical Path Method for Scheduling Tasks on Multiprocessor Systems," IEEE Transactions on Computers, pp. 1235-1238, December 1975.

[Kung87] S. Y. Kung, P. S. Lewis, and S. C. Lo, "Performance Analysis and Optimization of VLSI Dataflow Arrays," Journal of Parallel and Distributed Computing, Vol. 4, pp. 592-618, 1987.

[Kung88] S. Y. Kung, VLSI Array Processors, Englewood Cliffs, NJ:
18. …Ho, "Performance Evaluation of Asynchronous Concurrent Systems using Petri Nets," IEEE Transactions on Software Engineering, Vol. SE-6, No. 5, pp. 440-449, September 1980.

[Ram72] C. V. Ramamoorthy, K. M. Chandy, and M. J. Gonzalez, "Optimal Scheduling Strategies in Multiprocessor Systems," IEEE Transactions on Computers, February 1972.

[Reit68] R. Reiter, "Scheduling Parallel Computations," Journal of the Association for Computing Machinery, October 1968.

[Renf81] M. Renfors and Y. Neuvo, "The Maximum Sampling Rate of Digital Filters Under Hardware Speed Constraints," IEEE Transactions on Circuits and Systems, Vol. CAS-28, No. 3, March 1981.

[Ritz92] S. Ritz, M. Pankert, and H. Meyr, "High Level Software Synthesis for Signal Processing Systems," Proceedings of the International Conference on Application Specific Array Processors, Berkeley, August 1992.

[Sark89] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, Research Monographs in Parallel and Distributed Computing, Pitman, London, 1989.

[Sha89] P. L. Shaffer, "Minimization of Interprocessor Synchronization in Multiprocessors with Shared and Private Memory," International Conference on Parallel Processing, 1989.

[Schw85] D. A. Schwartz and T. P. Barnwell III, "Cyclo-Static Solutions: Optimal Multiprocessor Realizations of Recursive Algorithms," VLSI Signal Processing II, IEEE Special Publications, pp. 117-12…
19. Thus the OMA processors never see a user-created interrupt; they in essence update the parameter corresponding to the value stored in M in every iteration of the dataflow graph. Since reading in the value of M costs two instruction cycles, the overhead involved in this scheme is minimal. An added practical advantage of the above scheme is that the tcl/tk [Ous94] based GUI primitives that have been implemented in Ptolemy for the S-56X (see "CG56 Domain" in Volume 1 of [Ptol94]) can be directly used with the OMA board for parameter control purposes.

3.7 Application examples

3.7.1 Music synthesis

The Karplus-Strong algorithm is a well-known approach for synthesizing the sound of a plucked string. The basic idea is to pass a noise source into a feedback loop containing a delay, a low-pass filter, and a multiplier with a gain of less than one. The delay determines the pitch of the generated sound, and the multiplier gain determines the rate of decay. Multiple voices can be generated and combined by implementing one feedback loop for each voice and then adding the outputs from all the loops. If we want to generate sound at a sampling rate of 44.1 kHz (compact disc sampling rate), we can implement 7 voices on a single processor in real time using the blocks from the Ptolemy DSP96000 code generation library (CG96). These 7 voices consume 370 instruction cycles out of the 380 instruction cycles available per sample period. Using…
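The loop structure just described (a noise burst feeding a delay line whose output is low-pass filtered, scaled by a gain below one, and fed back) is easy to prototype offline. A minimal single-voice sketch in Python (parameters illustrative; the thesis implementation ran as DSP96000 assembly blocks from the Ptolemy library):

```python
import numpy as np

def karplus_strong(pitch_hz, duration_s, fs=44100, gain=0.996):
    """One plucked-string voice: delay line + 2-tap low-pass + feedback gain."""
    n_delay = int(fs / pitch_hz)                 # delay length sets the pitch
    buf = np.random.uniform(-1, 1, n_delay)      # initial noise excitation
    out = np.empty(int(fs * duration_s))
    for i in range(len(out)):
        out[i] = buf[i % n_delay]
        # Two-point average acts as the low-pass filter in the loop;
        # gain < 1 sets the rate of decay of the note.
        nxt = 0.5 * (buf[i % n_delay] + buf[(i + 1) % n_delay])
        buf[i % n_delay] = gain * nxt
    return out

# Multiple voices: one feedback loop per voice, outputs summed.
chord = sum(karplus_strong(f, 2.0) for f in (220.0, 277.2, 329.6))
```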
20. …components, which are then decimated by a factor of two. The low-pass component is then decomposed again into low-pass and high-pass components, and this process proceeds recursively. The synthesis bank performs the complementary operation of upsampling, filtering, and combining the high-pass and low-pass components; this process is again performed recursively to reconstruct the input signal. Fig. 3.18(a) shows a block diagram of a synthesis filter bank followed by an analysis bank.

QMF filter banks are designed such that the analysis bank cascaded with the synthesis bank yields a transfer function that is a pure delay, i.e., has unity response except for a delay between the input and the output. Such filter banks are also called perfect reconstruction filter banks, and they find applications in high-quality audio compression: each frequency band is quantized according to its energy content and its perceptual importance. Such a coding scheme is employed in the audio portion of the MPEG standard.

We implemented a perfect reconstruction QMF filter bank to decompose audio from a compact disc player into 15 bands. The synthesis bank was implemented together with the analysis part. There are a total of 36 multirate filters of 18 taps each. This is shown hierarchically in Fig. 3.18(a). Note that delay blocks are required in the first 13 output paths of the analysis bank to compensate for the delay through successive stages of the ana…
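The recursive band-splitting reduces to one primitive: a two-channel perfect-reconstruction stage (analysis filtering plus decimation, mirrored by upsampling plus synthesis filtering). A toy sketch using the simplest (Haar) PR filter pair, just to show the structure; the thesis design used 18-tap multirate filters, not these:

```python
import numpy as np

h0 = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass analysis
h1 = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass analysis
g0 = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass synthesis
g1 = np.array([-1.0, 1.0]) / np.sqrt(2)  # high-pass synthesis

def analyze(x):
    """One analysis stage: filter, then decimate by two."""
    return np.convolve(x, h0)[::2], np.convolve(x, h1)[::2]

def synthesize(lo, hi):
    """Complementary stage: upsample by two, filter, and combine."""
    up = lambda s: np.repeat(s, 2) * np.tile([1.0, 0.0], len(s))
    return np.convolve(up(lo), g0) + np.convolve(up(hi), g1)

x = np.random.randn(64)
y = synthesize(*analyze(x))
# The cascade is a pure one-sample delay: y[n] == x[n-1].
assert np.allclose(y[1:len(x) + 1], x)
```

Recursing `analyze` on the low-pass output (and `synthesize` in reverse order) yields the multi-band tree; the compensating delays mentioned above arise because deeper branches accumulate more filter delay than shallow ones.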
21. …(i.e., snk(e1) = src(e2), snk(e2) = src(e3), ..., snk(e_{n−1}) = src(e_n)). We say that the path p = (e1, e2, ..., en) contains each e_i and each subsequence of (e1, e2, ..., en); p is directed from src(e1) to snk(en); and each member of {src(e1), src(e2), ..., src(en), snk(en)} is on p. A path that is directed from a vertex to itself is called a cycle, and a fundamen…

Given two arbitrary sets S1 and S2, we define the difference of these two sets by S1 − S2 = {s ∈ S1 | s ∉ S2}, and we denote the number of elements in a finite set S by |S|. Also, if r is a real number, then we denote the smallest integer that is greater than or equal to r by ⌈r⌉. For elaboration on any of the graph-theoretic concepts presented in this section, we refer the reader to Cormen, Leiserson, and Rivest [Corm92].

2.2 Schedule notation

To model execution times of actors and to perform static scheduling, we associate an execution time t(v) (a non-negative integer) with each actor v in the HSDFG; t(v) assigns an execution time to each actor v (the actual execution time can be interpreted as t(v) cycles of a base clock). Inter-processor communication costs are represented by assigning execution times to the send and receive actors. The values t(v) may be set equal to execution time estimates when exact execution times are not available, in which case results of the computations that make use of these values (e.g., the iteration period T) are compile-time estimates. Recall that…
22. …the synchronization edges in G_s are the ones that are finally implemented; it is therefore advantageous to calculate the self-timed buffer bound B_fb as a final step, after all the transformations on G_s are complete, instead of using G_ipc itself to calculate these bounds. This is because addition of the edges in the Convert-to-SC-graph and Resynchronize steps may reduce these buffer bounds. It is easily verified that removal of edges cannot change the buffer bounds in Eqn (5.1) as long as the synchronizations in G_ipc are preserved. Thus, in the interest of obtaining minimum possible shared buffer sizes, we compute the bounds using the optimized synchronization graph. The following theorem tells us how to compute the self-timed buffer bounds from G_s.

Theorem 5.4: If G_s preserves G_ipc, and the synchronization edges in G_s are implemented, then for each feedback communication edge e in G_ipc, the self-timed buffer bound of e, B_fb(e) (an upper bound on the number of data tokens that can be present on e), is given by

B_fb(e) = ρ_{G_s}(snk(e), src(e)) + delay(e).

Proof: By Lemma 5.1, if there is a path p from snk(e) to src(e) in G_s, then

start(src(e), k) ≥ end(snk(e), k − Delay(p)).

Taking p to be an arbitrary minimum-delay path from snk(e) to src(e) in G_s, we get

start(src(e), k) ≥ end(snk(e), k − ρ_{G_s}(snk(e), src(e))).

That is, src(e) cannot be more than ρ_{G_s}(snk(e), src(e)) iterations "ahead" of snk(e). Thus there can ne…
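Theorem 5.4 turns buffer sizing into one reverse shortest-path query per feedback communication edge. A small sketch reusing the min_path_delay helper from the earlier RemoveRedundantSynchs sketch (names hypothetical):

```python
def self_timed_buffer_bound(sync_graph_edges, nodes, comm_edge):
    """B_fb(e) = rho_Gs(snk(e), src(e)) + delay(e), per Theorem 5.4.
    sync_graph_edges: (src, snk, delay) triples of the optimized
    synchronization graph Gs; comm_edge: a feedback communication edge."""
    src, snk, delay = comm_edge
    rho_back = min_path_delay(sync_graph_edges, nodes, snk, src)
    return rho_back + delay
```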
23. …the actor ordering step, and determining when each …

Assignment and ordering performed at run time would in general lead to a more flexible implementation, because a dynamic strategy allows for run-time variations in computation load and for operations that display data dependencies; the overhead involved in such a strategy, however, is usually prohibitive, and real-time performance guarantees are difficult to achieve. Lee and Ha [Lee89] define two scheduling strategies that perform the assignment and ordering steps at compile time: fully static and self-timed. We use the same terminology in this thesis.

1.2.1 Fully static schedules

In the fully static (FS) strategy, the exact firing time of each actor is also determined at compile time. Such a scheduling style is used in the design of systolic array architectures [Kung88], for scheduling VLIW processors [Lam88], and in high-level VLSI synthesis of applications that consist only of operations with guaranteed worst-case execution times [DeMich94]. Under a fully static schedule, all processors run in lock step: the operation each processor performs on each clock cycle is predetermined at compile time and is enforced at run time, either implicitly by the program each processor executes (perhaps augmented with nops, or idle cycles, for correct timing), or explicitly by means of a program sequencer, for example. A fully static schedule of a simple HSDFG G is illustrated in Fig. 1.1. The FS sch…
24. …vious section support this argument: the dataflow architectures in Section 1.3.1 use dynamic scheduling but pay a high hardware cost, which makes them unsuited for embedded applications. In the case of the NEC dataflow chips, parallelism is mainly derived through pipelined execution. The dataflow model of execution that is implemented in hardware in this chip, although elegant, is of limited use for the …

… of a programmable systolic array, as opposed to a dedicated array designed for one specific application. Processors are arranged in a linear array and communicate with their neighbors through FIFO queues. Programs are written for this computer in a language called W2 [Lam88]. The Warp project also led to the iWarp design [Bork88], which has a more elaborate inter-processor communication mechanism than the Warp machine. An iWarp node is a single VLSI component composed of a computation engine and a communication engine; the latter consists of a crossbar and data routing mechanisms. The iWarp nodes can be connected in various single- and two-dimensional topologies, and point-to-point message-passing type communication is supported.

1.3.3 Multiprocessor DSP architectures

In this section we discuss multiprocessors that make use of multiple off-the-shelf programmable DSP chips. The SMART architecture [Koh90] is a reconfigurable bus design comprised of AT&T DSP32C processors and custom VLSI components for routing data between processo…
25. [Figure 3.8: Presettable counter implementation; (a) schedule RAM (containing the access list of processor IDs) on the shared address and data busses, with decode, count-up, and bus-grant latch logic; (b) the preset operation of the counter.]

…dependency in their execution. For example, a dataflow graph with a conditional construct will in general require a different access schedule for each outcome of the conditional. Two different SDF graphs are executed in this case, depending on the branch outcome, and the processor that determines the branch outcome can also be assigned the task of presetting the counter, making it branch to the access list of the appropriate SDF subgraph. The access controller behaves as in Fig. 3.8(b). We discuss the use of this presettable feature in detail in Chapter 6.

3.4.3 Host interface

The function of the host interface is to allow downloading programs onto the OMA board, controlling the board, setting parameters of the application being run, and debugging from a host workstation. The host for the OMA board connects to the shared bus through the Xilinx chip via one of the shared bus connectors. Since part of the host interface is configured inside the Xilinx, different hosts (32-bit, 16-bit) with different handshake mechanisms can be used with the board. The host that is being used…
26. …D.C. If we set the period to ⌈T⌉, then the right-hand sides of the system of inequalities in (4.6) are integers, and the Bellman-Ford algorithm yields integer solutions for σ(v). This is because the weights on the edges of the constraint graph, which are equal to the RHS of the difference constraints, are integers if the period is an integer; consequently, the shortest paths calculated on the constraint graph are integers. Thus S′ = {σ′(v), ⌈T⌉} is a valid fully static schedule. QED.

Remark: Claim 4.1 essentially states that an FS schedule can be modified by skewing the relative starting times of processors so that the resulting schedule has iteration period less than T + 1; the resulting iteration period lies within one time unit of its lower bound for the specified processor assignment and actor ordering. It is possible to unfold the graph and generate a fully static schedule with average period exactly T, but the resulting increase in code size is usually not worth the benefit of at most one time unit decrease in the iteration period. Recall that a time unit is essentially the clock period; therefore, one time unit can usually be neglected.

For example, the static schedule corresponding to Fig. 4.1 has T_FS = 11 > T = 57/9 units. Using the procedure outlined in Claim 4.1, we can skew the starting times of processors in the schedule S to obtain a schedule S′, as shown in (4.5), that has a period equal to ⌈57/9⌉ = 7 units.
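The integrality argument rests on solving a system of difference constraints by shortest paths on a constraint graph. A generic sketch of that step (constraint encoding illustrative; with integer right-hand sides the returned starting times are integers, which is exactly the property used above):

```python
def solve_difference_constraints(nodes, constraints):
    """Solve x_v - x_u <= w for all (u, v, w) via Bellman-Ford from a
    virtual source; returns None if the constraint graph has a negative
    cycle (i.e., the system is infeasible)."""
    dist = {v: 0 for v in nodes}  # virtual source: 0-weight edge to every node
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in constraints:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist  # integer inputs yield integer solutions
    return None  # still relaxing after |V| passes: negative cycle

# Toy instance: sigma(b)-sigma(a) <= 3, sigma(c)-sigma(b) <= -2, ...
sigma = solve_difference_constraints(
    "abc", [("a", "b", 3), ("b", "c", -2), ("a", "c", 2)]
)
print(sigma)  # {'a': 0, 'b': 0, 'c': -2}
```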
27. …E[MCM(G)], and others where E(T) < E[MCM(G)]. If the execution times of actors are all bounded, t_min ≤ t(v) ≤ t_max for all v ∈ V (e.g., if all actors have execution times uniformly distributed in some interval [a, b]), then we can say the following:

MCM(G_max) ≥ E(T) ≥ MCM(Ḡ) ≥ MCM(G_min),   (4.7)

where G_max = (V, E) is the same as G except the random actor execution times are replaced by their upper bounds t_max; similarly, G_min = (V, E) is the same as G except the random actor execution times are replaced by their lower bounds t_min (and Ḡ replaces them with their mean values).

Equation (4.7) summarizes the useful bounds we know for the expected value of the iteration period for graphs that contain actors with random execution times. It should be noted that good upper bounds on E(T) are not known. Rajsbaum and Sidi propose upper bounds for exponentially distributed execution times [Rajs94]; these upper bounds are typically more than twice the exact value of E(T), and hence not very useful in practice. We attempted to simplify the Markov chain model (i.e., reduce the number of states) for the self-timed execution of a stochastic HSDFG by representing such an execution by a set of self-timed schedules of deterministic HSDFGs between which the system makes transitions randomly. This representation reduces the number of states of the Markov chain to the number of different deterministic graphs that arise from the stochastic HSDFG. We we…
28. …MIMD principle of Dietz, Zaafrani, and O'Keefe, which is a combined hardware and software solution to reducing run-time synchronization overhead [Dietz92]. In this approach, a shared memory MIMD computer is augmented with hardware support that allows arbitrary subsets of processors to synchronize precisely with respect to one another, by executing a synchronization operation called a barrier. If a subset of processors is involved in a barrier operation, then each processor in this subset will wait at the barrier until all other processors in the subset have reached the barrier. After all processors in the subset have reached the barrier, the corresponding processes resume execution in exact synchrony.

In [Dietz92], the barrier mechanism is applied to minimize synchronization overhead in a self-timed schedule with hard lower and upper bounds on the task execution times. The execution time ranges are used to detect situations where the earliest possible execution time of a task that requires data from another processor is guaranteed to be later than the latest possible time at which the required data is produced. When such an inference cannot be made, a barrier is instantiated between the sending and receiving processors. In addition to performing the required data synchronization, the barrier resets to zero the uncertainty between the relative execution times of the processors that are involved in the barrier, and thus enhances the potenti…
29. …MIT Tagged-Token Dataflow Architecture," IEEE Transactions on Computers, Vol. C-39, No. 3, March 1990.

[Arvi91] Arvind, L. Bic, and T. Ungerer, "Evolution of Dataflow Computers," Advanced Topics in Dataflow Computing, Prentice Hall, 1991.

[Bacc92] F. Baccelli, G. Cohen, G. J. Olsder, and J. P. Quadrat, Synchronization and Linearity, John Wiley & Sons Inc., New York, pp. 103-154, 1992.

[Bacc92] F. Baccelli and Z. Liu, "On a Class of Stochastic Recursive Sequences Arising in Queueing Theory," The Annals of Probability, Vol. 20, No. 1, pp. 350-374.

[Barr91] B. Barrera and E. A. Lee, "Multirate Signal Processing in Comdisco's SPW," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Toronto, April 1991.

[Ben91] A. Benveniste and G. Berry, "The Synchronous Approach to Reactive and Real-Time Systems," Proceedings of the IEEE, Vol. 79, No. 9, pp. 1270-1282, 1991.

[Bhat95a] S. S. Bhattacharyya, S. Sriram, and E. A. Lee, "Optimizing Synchronization in Multiprocessor Implementations of Iterative Dataflow Programs," ERL Technical Report UCB/ERL M95/2, University of California, Berkeley, CA 94720, January 5, 1995.

[Bhat95b] S. S. Bhattacharyya, S. Sriram, and E. A. Lee, "Resynchronization for Embedded Multiprocessors," Draft XXX, to be made into an ERL memo.

[Bhat93] S. S. Bhattacharyya and Edward A. Lee, "Scheduling Synchronous Dataflow Graphs for Efficient Looping,"…
30. …s5, r5, s6, r6), the resulting OT schedule evolves as shown in Fig. 4.3. Notice that enforcing this schedule introduces idle time (hatched rectangles); as a result, T_OT, the average iteration period for the OT schedule, is 10 units, which is (as expected) larger than the iteration period of the ideal ST schedule, T_ST = 9 units, but is smaller than T_FS = 11 units.

[Figure 4.3: Schedule evolution when the transaction order of Fig. 3.1 is enforced; hatched rectangles indicate idle time due to the ordering constraint (T_OT = 10).]

In general, T_FS ≥ T_OT ≥ T_ST: the ST schedule only has assignment and ordering constraints; the OT schedule has the transaction ordering constraints in addition to the constraints in the ST schedule; whereas the FS schedule has exact timing constraints that subsume the constraints in the ST and OT schedules. The question we would like to answer is: is it possible to choose the transaction ordering more intelligently than the straightforward one (obtained by sorting) chosen in Fig. 3.1? As a first step towards determining how such a best possible access order might be obtained, we attempt to model the self-timed execution itself, and try to determine the precise effect (e.g., increase in the iteration period) of adding transaction ordering constraints. Note again that, as the…
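One way to quantify that effect is to simulate both executions: a self-timed run in which an actor waits only for its processor and its input data, and an ordered-transaction run in which each communication actor additionally waits for its predecessor in the global order O. A simplified sketch (same-iteration data edges only; it assumes the transaction order is consistent with the schedule, and all names and times are made up):

```python
from collections import defaultdict, deque

def avg_period(proc_actors, exec_time, data_edges, txn_order=None, iters=200):
    """Estimate the average iteration period of a self-timed schedule by
    unrolling `iters` iterations of the constraint graph and computing
    start times as longest paths; txn_order optionally chains all
    communication-actor instances in ordered-transaction fashion."""
    edges = []  # (pred_node, succ_node); node = (actor, iteration)
    for k in range(iters):
        for acts in proc_actors.values():
            for a, b in zip(acts, acts[1:]):           # same-processor order
                edges.append(((a, k), (b, k)))
            if k + 1 < iters:                           # processor wraps around
                edges.append(((acts[-1], k), (acts[0], k + 1)))
        for s, r in data_edges:                         # same-iteration data
            edges.append(((s, k), (r, k)))
    if txn_order:
        chain = [(a, k) for k in range(iters) for a in txn_order]
        edges += list(zip(chain, chain[1:]))            # global bus order
    succ, indeg = defaultdict(list), defaultdict(int)
    nodes = {(a, k) for acts in proc_actors.values()
             for a in acts for k in range(iters)}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    start = {n: 0.0 for n in nodes}
    ready = deque(n for n in nodes if indeg[n] == 0)
    while ready:
        u = ready.popleft()
        fin = start[u] + exec_time[u[0]]
        for v in succ[u]:
            start[v] = max(start[v], fin)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    ends = [max(start[(a, k)] + exec_time[a] for a in exec_time)
            for k in range(iters)]
    mid = iters // 2  # skip the start-up transient
    return (ends[-1] - ends[mid]) / (iters - 1 - mid)
```

Running it once with txn_order=None and once with the sorted order O gives estimates of T_ST and T_OT for the same assignment and ordering.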
31. The main contribution of this thesis is to present techniques for minimizing synchronization and inter-processor communication overhead in statically (i.e., compile-time) scheduled multiprocessors, where the program is derived from a dataflow graph specification. The strategy is to model the run-time execution of such a multiprocessor, to determine how processors communicate and synchronize, and then to use this information to optimize the final implementation.

1.1 The Synchronous Dataflow model

1.1.1 Background

Dataflow is a well-known programming model in which a program is represented as a directed graph, where the vertices (or actors) represent computation and the edges (or arcs) represent FIFO (first-in-first-out) queues that direct data values from the output of one computation to the input of another. Edges thus represent data precedences between computations. Actors consume data (or tokens) from their inputs, perform computation on them (fire), and produce a certain number of tokens on their outputs. Programs written in high-level functional languages such as pure LISP, and in dataflow languages such as Id and Lucid, can be directly converted into dataflow graph representations; such a conversion is possible because these languages are designed to be free of side effects, i.e., programs in these languages are not allowed to contain global variables or data structures, and functions in these languages cannot modify their arguments [Ack82]. A…
32. …a product (AND function) of the Boolean variables in the dataflow graph and the complements of these Booleans. Nested conditionals in parallel branches of the graph necessitate bus grants that are enabled by a product function; a similar need arises when bus grants must be reordered based on values of the Boolean variables. Thus, in general, we need to implement an annotated bus access list of the form {(c1, ProcID), (c2, ProcID), ...}: each bus access is annotated with a Boolean-valued condition c, indicating that the bus should be granted to the processor corresponding to ProcID when c evaluates to TRUE. Here c could be an arbitrary product function of the Booleans b1, b2, ..., bn in the system and the complements of these Booleans (e.g., a product such as c = b1 · b̄2, where the bar over a variable indicates its complement). This scheme is implemented as shown in Fig. 6.9.

[Figure 6.9: A bus access mechanism that selectively masks bus grants based on values of control tokens that are evaluated at run time. A schedule RAM on the shared address and data busses contains the access list as <Condition> <ProcID> pairs; a schedule counter steps through it, and a signal derived from the condition field indicates whether to mask the current BG before the BG lines are decoded and enabled.]

The schedule memory now contains two fields corresponding to each bus access: <Condition> <Pro…
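In software form, the annotated list is just (condition, processor) pairs, where each condition is a product term over the current Boolean control tokens, and a grant is masked whenever its condition evaluates FALSE. A small illustrative sketch (field encoding assumed, not the actual OMA schedule-RAM format):

```python
# Each entry: (list of (boolean_name, required_value) literals, proc_id).
# An empty literal list means the grant is unconditional.
ACCESS_LIST = [
    ([], "P1"),
    ([("b1", True)], "P2"),                 # grant P2 only if b1 is TRUE
    ([("b1", False), ("b2", True)], "P3"),  # product term: ~b1 AND b2
]

def next_grants(booleans):
    """Walk the annotated access list once, masking grants whose
    condition (a product of literals) evaluates FALSE."""
    for literals, proc in ACCESS_LIST:
        if all(booleans[name] == val for name, val in literals):
            yield proc

print(list(next_grants({"b1": False, "b2": True})))  # ['P1', 'P3']
```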
33. …a snooping cache controller, for example, to maintain cache coherence; the cost of such hardware usually makes it prohibitive in embedded scenarios. Thus, for the embedded signal processing applications that we are focussing on, we argue that caches do not have a significant role to play, and we claim that the OT approach discussed previously provides a cost-effective solution for minimizing IPC overhead in implementing self-timed schedules.

3.2.1 Using the OT approach

The OT strategy, we recall, operates on the principle of determining at compile time the order in which processor communications occur, and enforcing that order at run time. For a shared bus implementation, this translates into determining the sequence of shared memory (or, equivalently, shared bus) accesses at compile time, and enforcing this predetermined order at run time. This strategy therefore involves no run-time arbitration: processors are simply granted the bus according to the pre-determined access order. When a processor obtains access to the bus, it performs the necessary shared memory transaction and releases the bus; the bus is then granted to the next processor in the ordered list.

The task of maintaining ordered access to shared memory is done by a central ordered transaction controller. When the processors are downloaded with code, the controller too is loaded with the pre-determined access order list. At run time, the controller simply grants bus acces…
34. …actors in an HSDFG are executed essentially infinitely. Each firing of an actor is called an invocation of that actor. An iteration of the HSDFG corresponds to one invocation of every actor in the HSDFG. A schedule specifies processor assignment, actor ordering, and firing times of actors, and these may be done at compile time or at run time, depending on the scheduling strategy being employed.

To specify firing times, we let the function start(v, k) ∈ Z+ represent the time at which the k-th invocation of the actor v starts. Correspondingly, the function end(v, k) ∈ Z+ represents the time at which the k-th execution of the actor v completes, at which point v produces data tokens at its output edges. Since we are interested in the k-th execution of each actor for k = 0, 1, 2, 3, ..., we set start(v, k) = 0 and end(v, k) = 0 for k < 0 as the initial conditions. If the k-th invocation of an actor v takes exactly t(v) time units, then end(v, k) = start(v, k) + t(v).

…tal cycle is a cycle of which no proper subsequence is a cycle. If p = (e1, e2, ..., en) is a path in an HSDFG, then we define the path delay of p, denoted Delay(p), by

Delay(p) = Σ_{i=1..n} delay(e_i).

Since the delays on all HSDFG edges are restricted to be non-negative, it is easily seen that between any two vertices x, y ∈ V, either there is no path directed from x to y, or there exists a (not necessarily unique) minimum-delay path between x and y. Given an HSDFG…
35. …are embedded applications, becoming increasingly important, for which programmability is in fact indispensable: set-top boxes capable of recognizing a variety of audio/video formats and compression standards, multimedia workstations that are required to run a variety of different multimedia software products, programmable audio/video codecs, etc.

The generalization of such a multiprocessor chip is one that has a collection of programmable processors as well as custom hardware on a single chip. Mapping applications onto such an architecture is then a hardware/software codesign problem. The problems of inter-processor communication and synchronization are identical to the homogeneous multiprocessor case. In this thesis, when we refer to a multiprocessor, we will imply a heterogeneous architecture that may be comprised of different types of programmable processors and may include custom hardware elements too. All the techniques we present here apply to such a general system architecture.

Why study application-specific parallel processing in the first place, instead of applying the ideas in general-purpose parallel systems to the specific application? The reason is that general-purpose parallel computation deals with a user-programmable computing device. Computation in embedded applications, however, is usually one-time programmed by the designer of that embedded system (a digital cellular radio handset, for example) and is not mea…
36. …by a number alongside the bullet, as in Fig. 1.1(a). …the functional or behavioural level, and for synthesis from such a high-level specification to a software description (e.g., a C program), a hardware description (e.g., VHDL), or a combination thereof. The descriptions thus generated can then be compiled down to the final implementation (e.g., an embedded processor or an ASIC).

One of the reasons for the popularity of such dataflow-based models is that they provide a formalism for block-diagram based visual programming, which is a very intuitive specification mechanism for DSP; moreover, the expressivity of the SDF model sufficiently encompasses a significant class of DSP applications, including multirate applications that involve upsampling and downsampling operations. An equally important reason for employing dataflow is that such a specification exposes parallelism in the program. It is well known that imperative programming styles such as C and FORTRAN tend to over-specify the control structure of a given computation, and compilation of such specifications onto parallel architectures is known to be a hard problem. Dataflow, on the other hand, imposes minimal data dependency constraints in the specification, potentially enabling a compiler to detect parallelism. The same argument holds for hardware synthesis, where it is important to be able to exploit concurrency. The SDF model has also proven useful for compiling DSP applications on single proces…
37. …controller can be implemented with very simple hardware.

3.3.2 A modified design

In the design proposed above, processor-to-processor communication occurs through a central shared memory: two transactions (one write and one read) must occur over the shared bus. This situation can be improved by distributing the shared memory among processors, as shown in Fig. 3.3, where each processor is assigned shared memory in the form of hardware FIFO buffers. Writes to each FIFO are accomplished through the shared bus; the sender simply writes to the FIFO of the processor to which it wants to send data, by using the appropriate shared memory address.

[Figure 3.3: Modified design; DSP96002 processors with local memories and FIFO buffers on the shared bus, schedule information, the TA line, and shared memory.]

Use of a FIFO implies that the receiver must know the exact order in which data is written into its input queue. This, however, is guaranteed by the ordered transaction strategy. Thus, replacing a RAM (random access memory) based shared memory with distributed FIFOs does not alter the functionality of the design. The sender need only block when the receiving queue is full, which can be accomplished in hardware by using the Transfer Acknowledge (TA) signal on the DSP96002 (a device can insert an arbitrary number of wa…
38. …dependent conditional branch, for example, but it is of course too simple to capture many real scenarios. Dataflow graphs where actors have such random execution times have been studied by Olsder et al. [Ols89][Ols90] in the context of modeling data-driven networks (also called wavefront arrays) [Kung87a], where the multiply operations in the array display data-dependent execution times. The authors show that the behaviour of such a system can be described by a discrete-time Markov chain. The idea behind this, briefly, is that such a system is described by a state space consisting of a set of state vectors s_k; entries in each vector s_k represent the k-th starting time of each actor, normalized with respect to one (any arbitrarily chosen) actor:

s_k = (0, start(v2, k) − start(v1, k), start(v3, k) − start(v1, k), ..., start(vn, k) − start(v1, k)).

The normalization with respect to actor v1 in the above case is done to make the state space finite; the number of distinct values that the vector s_k as defined above can assume is shown to be finite in [Ols90]. The states of the Markov chain correspond to each of the distinct values of s_k. The average iteration period, which is defined as

T̄ = lim_{K→∞} start(v, K) / K,

can then be derived from the stationary distribution of the Markov chain. There are several technical issues involved in this definition of the average iteration period: how do we know the limit exist…
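Even without constructing the Markov chain explicitly, T̄ for a small stochastic HSDFG can be estimated by simulating the self-timed (max-plus) start-time recurrence, start(v, k) = max over input edges (u, v) with delay d of end(u, k − d), with execution times drawn from the given distributions. A toy sketch (graph, distribution, and node order are illustrative):

```python
import random

# Edges of a strongly connected HSDFG: (src, snk, delay).
EDGES = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]
# NODES listed in an order consistent with the zero-delay edges, so each
# iteration can be resolved in one pass.
NODES = ["A", "B", "C"]

def sample_time(v):
    return random.uniform(1.0, 3.0)  # random execution times t(v)

def average_period(iters=20000):
    """Estimate T-bar = lim start(v, K)/K for actor A by simulation."""
    end = {v: [0.0] * iters for v in NODES}
    start_a = 0.0
    for k in range(iters):
        for v in NODES:
            s = 0.0
            for u, w, d in EDGES:
                if w == v and k - d >= 0:
                    s = max(s, end[u][k - d])
            if v == "A":
                start_a = s
            end[v][k] = s + sample_time(v)
    return start_a / (iters - 1)

print(average_period())
```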
39. …e2) is the same as snk(e1), we get

start(snk(e2), k) ≥ end(snk(e1), k − delay(e2)).

Causality implies end(v, k) ≥ start(v, k), so we get

start(snk(e2), k) ≥ start(snk(e1), k − delay(e2)).   (5.4)

Substituting Eqn (5.3) in Eqn (5.4),

start(snk(e2), k) ≥ end(src(e1), k − delay(e2) − delay(e1)).

Continuing along p in this manner, it can easily be verified that

start(snk(en), k) ≥ end(src(e1), k − delay(en) − delay(e_{n−1}) − ... − delay(e1)),

that is, start(snk(en), k) ≥ end(src(e1), k − Delay(p)). QED.

Proof of Theorem 5.1: If ẽ ∈ E_s1 ∩ E_s2, then the synchronization constraint due to the edge ẽ holds in both graphs. But for each ẽ ∈ E_s2 − E_s1, we need to show that the constraint due to ẽ,

start(snk(ẽ), k) ≥ end(src(ẽ), k − delay(ẽ)),   (5.5)

holds in G1, provided ρ_G1(src(ẽ), snk(ẽ)) ≤ delay(ẽ), which implies there is at least one path p = (e1, e2, ..., en) from src(ẽ) to snk(ẽ) in G1 (src(e1) = src(ẽ) and snk(en) = snk(ẽ)) such that Delay(p) ≤ delay(ẽ). From Lemma 5.1, existence of such a path p implies

start(snk(en), k) ≥ end(src(e1), k − Delay(p)),

that is,

start(snk(ẽ), k) ≥ end(src(ẽ), k − Delay(p)).   (5.6)

If Delay(p) ≤ delay(ẽ), then

end(src(ẽ), k − Delay(p)) ≥ end(src(ẽ), k − delay(ẽ)).

Substituting this in Eqn (5.6), we get start(snk…
40. …equivalent to the classical multiprocessor scheduling problem for an Acyclic Precedence Graph (APG); the acyclic precedence graph is obtained from the given HSDFG by eliminating all edges with delays on them (edges with delays represent dependencies across iterations) and replacing multiple edges that are directed between the same two vertices in the same direction with a single edge. This replacement is done because such multiple edges represent identical precedence constraints; these edges are taken into account individually during buffer assignment, however.

[Figure 1.2: Fully static schedule on five processors; (a) execution times, (b) static schedule (Proc 1: A, B, F; etc.), (c) fully static execution over time.]

Optimal multiprocessor scheduling of an acyclic graph is known to be NP-hard [Garey79], and a number of heuristics have been proposed for this problem. One of the earliest, and still popular, solutions to this problem is list scheduling, first proposed by Hu [Hu61]. List scheduling is a greedy approach: whenever a task is ready to run, it is scheduled as soon as a processor is available to run it. Ta…
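A minimal list scheduler in the spirit of Hu's greedy rule, using longest-path levels as priorities (all task data is made up; this is a generic sketch, not the thesis's scheduler):

```python
def list_schedule(tasks, deps, exec_time, n_procs):
    """Greedy list scheduling of an APG: repeatedly run the highest-
    priority ready task on the earliest-free processor.
    deps: {task: set of predecessor tasks}."""
    succs = {t: [u for u in tasks if t in deps[u]] for t in tasks}
    level = {}  # priority: longest execution-time path to any sink

    def lvl(t):
        if t not in level:
            level[t] = exec_time[t] + max((lvl(u) for u in succs[t]), default=0)
        return level[t]

    proc_free = [0.0] * n_procs
    finish, schedule = {}, []
    unscheduled = set(tasks)
    while unscheduled:
        ready = [t for t in unscheduled if all(p in finish for p in deps[t])]
        t = max(ready, key=lvl)                        # highest level first
        p = min(range(n_procs), key=lambda i: proc_free[i])
        start = max(proc_free[p],
                    max((finish[q] for q in deps[t]), default=0.0))
        finish[t] = start + exec_time[t]
        proc_free[p] = finish[t]
        schedule.append((t, p, start))
        unscheduled.remove(t)
    return schedule

print(list_schedule(
    ["A", "B", "C", "D"],
    {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}},
    {"A": 2, "B": 3, "C": 1, "D": 2},
    n_procs=2,
))
```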
41. …even worse, a processor, when granted the bus, may not complete its transaction immediately, thus blocking all other processors from accessing the bus. This problem would not arise in normal arbitration schemes, because independent shared memory accesses would be dynamically reordered. We will quantify these performance issues in the next chapter, where we show that, when reasonably good estimates of actor execution times are available, forcing a run-time access order does not in fact sacrifice performance significantly.

3.3 Design of an Ordered Memory Access multiprocessor

3.3.1 High-level design description

We chose Motorola DSP96002 processors for the OMA prototype. Although the OMA architecture can be built around any programmable DSP that has built-in bus arbitration logic, the DSP96002 is particularly suited to our design because of its dual bus architecture and bus arbitration mechanism; in addition, these processors are powerful DSPs with floating point capability [Moto89]. A high-level block diagram of such a system is depicted in Fig. 3.2. Each DSP96002 is provided with a private memory that contains its program code; this local memory resides on one of the processor busses (the A bus). The alternate B busses of all processors are connected to the shared bus, and shared memory resides on the shared bus. The transaction controller grants access to processors using the bus grant (BG) lines on the processor. A proces…
42. …execution time of actors. Also, intuitively, we expect an ordered transaction schedule to be more sensitive to changes in execution times than an unconstrained ST schedule. In this section we attempt to formalize these notions by exploring the effect of changes in execution times of actors on the throughput achieved by a static schedule.

Compile-time estimates of actor execution times may be different from their actual values at run time, due to errors in estimating execution times of actors that otherwise have fixed execution times, and due to actors that display run-time variations in their execution times (because of conditionals or data-dependent loops within them, for example). The first case is simple to model, and we will show in Section 4.6.1 how the throughput of a given self-timed schedule changes as a function of actor execution times. The second case is inherently difficult: how do we model run-time changes in execution times due to data dependencies, or due to events such as error handling, cache misses, and pipeline effects? In Section 4.6.2 below, we briefly discuss a very simple model for such run-time variations: we assume actors have random execution times according to some known probability distribution. We conclude that analysis of even such a simple model for the expected value of the throughput is often intractable, and we discuss efficiently computable upper and lower bounds for the expected throughput.

4.6.1 Det…
four processors on the OMA board, we implemented 28 voices in real time. The hierarchical block diagram for this is shown in Fig. 3.16. The resulting schedule is shown in Fig. 3.17. The makespan for this schedule is 377 instruction cycles, which is just within the maximum allowable limit of 380. This schedule uses 15 IPCs and is therefore not communication intensive. Even so, a higher IPC cost than the 3 instruction cycles the OMA architecture affords us would not allow this schedule to execute in real time at a 44.1 KHz sampling rate, because there is only a 3-instruction-cycle margin between the makespan of this schedule and the maximum allowable makespan. To schedule this application we employed Hu-level scheduling along with manual assignment of some of the blocks.

Figure 3.16. Hierarchical specification of the Karplus-Strong algorithm in 28 voices.

Figure 3.17. Four-processor schedule for the Karplus-Strong algorithm in 28 voices (377 instruction cycles). Three processors are assigned 8 voices each; the fourth (Proc 1) is assigned 4 voices along with the noise source.

3.7.2 QMF filter bank

A Quadrature Mirror Filter (QMF) bank consists of a set of analysis filters used to decompose a signal (usually audio) into frequency bands, and a bank of synthesis filters used to reconstruct the decomposed signal [Vai93]. In the analysis bank, a filter pair is used to decompose the signal into high-pass and low-pass
host needs to be dedicated to spoon-feeding the OMA processor board, and limits other tasks that could potentially run on the host.

3.4.6 Shared memory

Space for two shared memory modules is provided, so that up to 512K x 32 bits of shared static RAM can reside on board. The memory must have an access time of 25ns to achieve zero-wait-state operation.

3.4.7 Connecting multiple boards

Several features have been included in the design to facilitate connecting together multiple OMA boards. The connectors on either end of the shared bus are compatible, so that boards may be connected together in a linear fashion (Fig. 3.12). As mentioned before, the shared data bus goes to the right-side connector through the Xilinx chip. By configuring the Xilinx to short the external and internal shared data busses, processors on different boards can be made to share one contiguous bus. Alternatively, busses can be cleaved on the Xilinx chip, with communication between busses implemented on the Xilinx via an asynchronous mechanism (e.g., read and write latches synchronized by full and empty flags). This concept is similar to the idea used in the SMART processor array [Koh90], where the processing elements are connected to a switchable bus: when the bus switches are open, processors are connected only to their neighbors, forming a linear processor array, and when the switches are closed, processors are connected
implies that the actor v_2 in turn never gets enabled to fire, which in turn implies that there must be an edge (v_3, v_2) that never contains data. In this manner we can trace a path p = ((v_n, v_{n-1}), ..., (v_3, v_2), (v_2, v_1)), for n <= |V|, back from v_1 that never contains data on its edges after a certain sequence of firings of actors in G. Since G contains only |V| actors, p must visit some actor twice, and hence must contain a cycle C. Since the edges of p do not contain data, C is a delay-free cycle. QED.

Definition 4.2: A schedule S is said to be deadlocked if, after a certain finite time, at least one processor blocks on a buffer-full or buffer-empty condition and stays blocked.

If the specified schedule is deadlock-free, then the corresponding IPC graph is deadlock-free. This is because a deadlocked IPC graph would imply that a set of processors depend on data from one another in a cyclic manner, which in turn implies a schedule that displays deadlock.

Lemma 4.3: The asymptotic iteration period for a strongly connected IPC graph G, when actors execute as soon as data is available at all inputs, is given by

    T = \max_{\text{cycle } C \text{ in } G} \frac{\sum_{v \in C} t(v)}{\mathrm{Delay}(C)}    (4.3)

Note that Delay(C) > 0 for an IPC graph constructed from an admissible schedule. This result has been proved in so many different contexts [Kung87a, Peter81, Ram80, Reit68, Renf81] that we avoid presenting another proof of this fact here. The quotient in Eqn. (4.3) is called the cycle mean of the cycle C, and the maximum over all cycles in G is called the maximum cycle mean of G, denoted MCM(G).
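A minimal sketch of evaluating Eqn. (4.3) by binary search on the cycle mean, using a Bellman-Ford-style positive-cycle test; this is one of several standard methods (it is not the O(|V||E|log(|V|DT)) algorithm of [Law76]), and the graph encoding is an assumption of the sketch.

    def mcm(vertices, edges, t, eps=1e-6):
        """Maximum cycle mean of an IPC graph (sketch).
        edges: list of (u, v, delay); t[v]: execution time of actor v.
        A cycle's mean is (sum of t on the cycle) / (sum of delays)."""
        def positive_cycle(lam):
            # Under edge weight t(u) - lam*delay, a positive-weight cycle
            # exists iff some cycle has mean greater than lam.
            dist = {v: 0.0 for v in vertices}
            for _ in range(len(vertices)):
                changed = False
                for (u, v, d) in edges:
                    w = t[u] - lam * d
                    if dist[u] + w > dist[v] + eps:
                        dist[v] = dist[u] + w
                        changed = True
                if not changed:
                    return False
            return True   # still relaxing after |V| passes => positive cycle

        lo, hi = 0.0, sum(t.values())      # a cycle mean cannot exceed total work
        while hi - lo > eps:
            mid = (lo + hi) / 2
            if positive_cycle(mid):
                lo = mid
            else:
                hi = mid
        return hi

    # Example: processor cycle A -> E -> A with one delay; MCM = (3+4)/1 = 7.
    print(mcm(["A", "E"], [("A", "E", 0), ("E", "A", 1)], {"A": 3.0, "E": 4.0}))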
in the context of retiming when the assignment of actors to processors is fixed beforehand. In this case, the objective is to retime the input graph so that the number of communication edges that have nonzero delay is maximized, and the conversion is performed to constrain the set of possible retimings in such a way that an integer linear programming formulation can be developed. The technique generates two dummy vertices that are connected by an edge; the sink vertices of the original graph are connected to one of the dummy vertices, while the other dummy vertex is connected to each source. It is easily verified that, in a self-timed execution, this scheme requires at least four more synchronization accesses per graph iteration than the method that we have proposed. We can obtain further relative savings if we succeed in detecting one or more beneficial resynchronization opportunities. The effect of Zivojnovic's retiming algorithm on synchronization overhead is unpredictable, since on one hand a communication edge becomes easier to make redundant when its delay increases, while on the other hand the edge becomes less useful in making other communication edges redundant, since the path delay of all paths that contain the edge increases.

5.8 Computing buffer bounds from G_ipc and G_s

After all the optimizations are complete, we have a final synchronization graph G_s = (V, E_int ∪ E_s) that preserves G_ipc. Since the synchronization edges
schedule is explicitly designed such that successive iterations in the HSDFG overlap. Obviously, overlapped schedules often achieve a lower iteration period than blocked schedules. In Fig. 1.1, for example, the iteration period for the blocked schedule is 3 units, whereas it is 2 units for the overlapped schedule. One might

(b) we simply discard the timing information that is not required, and only retain the processor assignment and the ordering of actors on each processor as specified by the FS schedule (Fig. 1.3(c)). Each processor is assigned a sequential list of actors, some of which are send and receive actors, which it executes in an infinite loop. When a processor executes a communication actor, it synchronizes with the processor(s) it communicates with. Exactly when a processor executes each actor depends on when, at run time, all input data for that actor are available, unlike the fully static case, where no such run-time check is needed. Conceptually, the processor sending data writes data into a FIFO buffer and blocks when that buffer is full; the receiver, on the other hand, blocks when the buffer it reads from is empty. Thus flow control is performed at run time. The buffers may be implemented using shared memory, or using hardware FIFOs between processors. In a self-timed strategy, processors run sequential programs and communicate when they execute the communication primitives embedded in their programs, as shown schematically
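As an illustration of this blocking discipline, the following minimal Python sketch models one communication edge as a bounded FIFO between two threads; the buffer size and the stand-in computations are arbitrary assumptions of the sketch.

    import queue, threading

    # One communication edge: send blocks when the buffer is full,
    # receive blocks when it is empty, so flow control is purely run-time.
    fifo = queue.Queue(maxsize=4)   # buffer bound chosen for illustration

    def sender():
        for k in range(16):
            data = k * k            # stand-in for the sending actor's output
            fifo.put(data)          # blocks if the buffer is full

    def receiver():
        for k in range(16):
            data = fifo.get()       # blocks if the buffer is empty
            print("iteration", k, "received", data)

    t1 = threading.Thread(target=sender)
    t2 = threading.Thread(target=receiver)
    t1.start(); t2.start(); t1.join(); t2.join()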
significant amount of research on general-purpose high-performance parallel computers. These employ expensive and elaborate interconnect topologies, memory, and Input/Output (I/O) structures. Such strategies are unsuitable for embedded DSP applications, as we discussed earlier. In this section we discuss some application-specific parallel architectures that have been employed for signal processing, and contrast them with our approach.

1.3.1 Dataflow DSP architectures

There have been a few multiprocessors geared towards signal processing that are based on the dataflow architecture principles of Dennis [Denn80]. Notable among these are the Hughes Data Flow Multiprocessor [Gau85], the Texas Instruments Data Flow Signal Processor [Grim84], and the AT&T Enhanced Modular Signal Processor [Bloch86]. The first two perform the processor assignment step at compile time (i.e., tasks are assigned to processors at compile time, and tasks assigned to a processor are scheduled on it dynamically); the AT&T EMSP performs even the assignment of tasks to processors at runtime. Each one of these machines employs elaborate hardware to implement dynamic scheduling within processors, and employs expensive communication networks to route tokens generated by actors assigned to one processor to tasks on other processors that require these tokens. In most DSP applications, however, such dynamic scheduling is unnecessary, since compile-time predictability makes
that make use of the special characteristics of the applications they target in order to optimize for the particular metrics that are important for that specific application. These techniques adopt a design methodology that tailors the hardware and software implementation to the particular application. Some examples of such embedded computing systems are in robot controllers [Sriv92] and real-time speech recognition systems [Stolz91]; in consumer electronics such as future high-definition television sets, compact disk players, electronic music synthesizers, and digital audio systems; and in communication systems such as digital cellular phones and base stations, compression systems for video phones and video conferencing, etc.

The idea of using multiple processing units to execute one program has been present from the time of the very first electronic computers in the nineteen-forties. Parallel computation has since been the topic of active research in computer science. Whereas parallelism within a single processor has been successfully exploited (instruction-level parallelism), the problem of partitioning a single user program onto multiple such processors is yet to be satisfactorily solved. Instruction-level parallelism includes techniques such as pipelining (employed in traditional RISC processors), vectorization, VLIW (very large instruction word), and superscalar execution; these techniques are discussed in detail by Patterson and Hennessy
times of actors on the performance of statically scheduled hardware is inherently difficult to quantify, because these variations could occur due to a large number of factors (conditional branches or data-dependent loops within an actor, error handling, user interrupts, etc.), and because these variations could have a variety of different characteristics, from being periodic, to being dependent on the input statistics, to being completely random. As a result, thus far we have had to resort to statements like: for a static scheduling strategy to be viable, actors must not show significant variations in execution times. In this section we point out the issues involved in determining the effects of variations in execution times of actors.

A very simple model for actors with variable execution times is to assign to each actor an execution time that is a random variable t(v) with a discrete probability distribution (p.d.f.); successive invocations of each actor are assumed statistically independent, execution times of different actors are assumed independent, and the statistics of the random execution times are assumed to be time-invariant. Thus, for example, an actor A could have execution time t_1 with probability p, and execution time t_2 with probability 1 - p. The model is essentially that A flips a coin each time it is invoked to decide what its execution time should be for that invocation. Such a model could describe a data-dependent actor.
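A minimal Monte Carlo sketch of this coin-flip model (the values t1, t2, and p are arbitrary assumptions); for a trivial graph with a single actor the expected iteration period is just the mean execution time, which is what the simulation recovers, while for full graphs only bounds on the expected throughput are usually tractable.

    import random

    def simulate_avg_iteration_period(iters=100000, t1=2.0, t2=5.0, p=0.3):
        """Two-point execution-time model: time t1 with probability p,
        t2 with probability 1-p, independent across invocations."""
        total = 0.0
        for _ in range(iters):
            total += t1 if random.random() < p else t2
        return total / iters

    print(simulate_avg_iteration_period())   # ~ p*t1 + (1-p)*t2 = 4.1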
(Fig. 5.3(b)). Since the addition of this edge introduces a new cycle in the graph, it has the potential to reduce the estimated throughput; to prevent such a reduction, B_f(e) must be chosen to be large enough so that the maximum cycle mean remains unchanged upon adding the reverse edge with B_f(e) delays. Sizing buffers optimally, such that the maximum cycle mean remains unchanged, has been studied by Kung, Lewis and Lo in [Kung87], where the authors propose an integer linear programming formulation of the problem, with the number of constraints equal to the number of fundamental cycles in the HSDFG (potentially an exponential number of constraints).

An efficient, albeit suboptimal, procedure to determine B_f is to note that if

    B_f(e) \ge \left\lceil \frac{\sum_{v \in V} t(v)}{\mathrm{MCM}(G_{ipc})} \right\rceil

holds for each feedforward edge e, then the maximum cycle mean of the resulting graph does not exceed MCM(G_ipc). Then, a binary search on B_f(e) for each feedforward edge, while computing the maximum cycle mean at each search step and ascertaining that it is less than MCM(G_ipc), results in a buffer assignment for the feedforward edges. Although this procedure is efficient, it is suboptimal because the order in which the edges e are chosen is arbitrary and may affect the quality of the final solution. As we will see in section 5.7, however, imposing such a bound B_f is a naive approach for bounding buffer sizes, because such a bound entails an added synchronization cost. In section 5.7 we show that there is a better approach to bounding buffer sizes.
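The following sketch renders the binary-search procedure in Python, reusing the mcm() routine from the sketch after Lemma 4.3; the initial safe upper bound and the numerical tolerance are assumptions of this sketch.

    def size_feedforward_buffers(vertices, edges, t, ff_edges):
        """Suboptimal buffer sizing (sketch). For each feedforward edge
        e = (u, v) we add a reverse edge (v, u) carrying B_f(e) delays and
        binary-search the smallest B_f(e) that keeps the maximum cycle mean
        at its original value. The order of ff_edges is arbitrary, which is
        exactly why the result may be suboptimal."""
        base = mcm(vertices, edges, t)
        bounds = {}
        for (u, v, d) in ff_edges:
            hi = int(sum(t.values()) / base) + 1   # safe bound from the inequality
            lo = 1
            while lo < hi:
                mid = (lo + hi) // 2
                trial = edges + [(v, u, mid)]      # reverse edge with mid delays
                if mcm(vertices, trial, t) <= base + 1e-6:
                    hi = mid                       # still meets the MCM target
                else:
                    lo = mid + 1
            bounds[(u, v)] = hi
            edges = edges + [(v, u, hi)]           # keep this bound in place
        return bounds

More delays on the reverse edge can only lower the means of the cycles it creates, so the feasibility test is monotone in B_f(e) and binary search is valid.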
Corm92: T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, The MIT Press and the McGraw-Hill Book Company, sixth printing, Chapter 25, pp. 542-543, 1992.

Cun79: R. Cunninghame-Green, Minimax Algebra, Lecture Notes in Economics and Mathematical Systems, Vol. 166, Springer-Verlag, 1979.

deGroot92: S. M. H. de Groot, S. Gerez, and O. Herrmann, "Range-Chart-Guided Iterative Data-Flow Graph Scheduling," IEEE Transactions on Circuits and Systems, pp. 351-364, May 1992.

DeMich94: G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, New Jersey, 1994.

Denn80: J. B. Dennis, "Dataflow Supercomputers," IEEE Computer Magazine, Vol. 13, No. 11, November 1980.

Dietz92: H. G. Dietz, A. Zaafrani, and M. T. O'Keefe, "Static Scheduling for Barrier MIMD Architectures," Journal of Supercomputing, Vol. 5, No. 4, 1992.

Durr91: R. Durrett, Probability: Theory and Examples, Wadsworth & Brooks/Cole Statistics/Probability Series, Pacific Grove, CA, 1991.

Filo93: D. Filo, D. C. Ku, C. N. Coelho Jr., and G. De Micheli, "Interface Optimization for Concurrent Systems Under Timing Constraints," IEEE Transactions on Very Large Scale Integration, Vol. 1, No. 3, September 1993.

Gajs94: D. J. Gajski, F. Vahid, S. Narayan, and J. Gong, Specification and Design of Embedded Systems, Prentice Hall, Englewood Cliffs, New Jersey, 1994.

Garey79: M. R. Garey and D. S. Johnson,
Vol. C-35, No. 8, pp. 677-691, August 1986.

Buck93: J. T. Buck, Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model, Ph.D. Thesis, Memorandum No. UCB/ERL M93/69, Electronics Research Laboratory, University of California at Berkeley, September 1993.

Buck94: J. T. Buck, S. Ha, E. A. Lee, and D. G. Messerschmitt, "Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems," International Journal of Computer Simulation, January 1994.

Cam92: J. Campos and M. Silva, "Structural Techniques and Performance Bounds of Stochastic Petri Net Models," in Advances in Petri Nets 1993, pp. 325-349, edited by G. Rozenberg, Springer-Verlag, 1993.

Chase84: M. Chase, "A Pipelined Data Flow Architecture for Digital Signal Processing: The NEC mPD7281," IEEE Workshop on Signal Processing, November 1984.

Chao92: L. F. Chao and E. Sha, "Unfolding and Retiming Data-Flow DSP Programs for RISC Multiprocessor Scheduling," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, April 1992.

Chre83: P. Chretienne, "Timed Event Graphs: A Complete Study of their Controlled Executions," International Workshop on Timed Petri Nets, July 1985.

Cohen85: G. Cohen, D. Dubois, and J. Quadrat, "A Linear System Theoretic View of Discrete Event Processes and its Use for Performance Evaluation in Manufacturing," IEEE Transactions on Automatic Control, March 1985.
Figure 3.10. Processing element: DSP96002, local memory and local address-decode PAL on bus A, port B Host Interface (HI) on the shared bus, and Euro connector.

The DSP96002 processor has a Host Interface (HI) on each of its ports. The port B HI is memory-mapped to the shared bus, so that HI registers may be read and written from the shared bus. This feature allows a host to download code and control information into each processor through the shared bus. Furthermore, a processor, when granted the shared bus, may also access the port B HI of other processors. This allows processors to bypass the shared memory while communicating with one another, and to broadcast data to all processors. In effect, the HI on each processor can be used as a two-deep local FIFO, similar to the scheme in section 3.3.2, except that the FIFO is internal to each processor.

3.4.5 Xilinx circuitry

As mentioned previously, the XC3090 Xilinx chip is used to implement the transaction controller as well as a simple I/O interface. It is also configured to provide latches and buffers for addressing the Host Interface (HI) ports on the DSP96002 during bootup and downloading of code onto the processors. For this to work, the Xilinx is first configured to implement the bootup- and download-related circuitry, which consists of latches to drive the shared address bus and to access the
No. 8, June 1985.

Shen92: N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, "Graph Algorithms for Clock Schedule Optimization," in 1992 IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers (Cat. No. 92CH03183-1), Santa Clara, CA, pp. 132-136.

Sih91: G. C. Sih, Multiprocessor Scheduling to Account for Interprocessor Communication, Ph.D. Thesis, Department of EECS, University of California, Berkeley, April 1991.

Stolz91: A. Stolzle, A Real Time Large Vocabulary Connected Speech Recognition System, Ph.D. Thesis, Department of EECS, University of California, Berkeley, December 1991.

Sriv92: M. B. Srivastava, Rapid Prototyping of Hardware and Software in a Unified Framework, Ph.D. Thesis, Memorandum No. UCB/ERL M92/67, Electronics Research Laboratory, University of California, Berkeley, June 1992.

Thor86: Thor Tutorial, VLSI/CAD Group, Stanford University, 1986.

Vai93: P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice Hall, 1993.

Veig90: M. Veiga, J. Parera, and J. Santos, "Programming DSP Systems on Multiprocessor Architectures," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Albuquerque, April 1990.

Yao93: L. Yao and C. M. Woodside, "Iterative Decomposition and Aggregation of Stochastic Marked Graph Petri Nets," in Advances in Petri Nets 1993, pp. 325-349, edited by G. Rozenberg, Springer-Verlag, 1993.

Zaky89:
A. Zaky and P. Sadayappan, "Optimal Static Scheduling of Sequential Loops on Multiprocessors," Proceedings of the International Conference on Parallel Processing, Vol. 3, pp. 130-137, 1989.

Zivo94a: V. Zivojnovic, S. Ritz, and H. Meyr, "Retiming of DSP Programs for Optimum Vectorization," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, April 1994.

Zivo94b: V. Zivojnovic, H. Koerner, and H. Meyr, "Multiprocessor Scheduling with A Priori Node Assignment," VLSI Signal Processing VII, IEEE Press, 1994.

Zivo95: V. Zivojnovic, J. M. Velarde, C. Schlager, and H. Meyr, "DSPSTONE: A DSP-Oriented Benchmarking Methodology," Proceedings of ICSPAT, 1995.
C80 multi-DSP, Star Semiconductor's SPROC chip, Adaptive Solutions' CNAPS processor, etc. Multiple-processor DSPs are becoming popular for a variety of reasons. First, VLSI technology today enables one to stamp 4-5 standard DSPs onto a single die; this trend is only going to continue in the coming years. Such an approach is expected to become increasingly attractive because it reduces the testing time for the increasingly complex VLSI systems of the future. Second, since such a device is programmable, the tooling and testing costs of building an ASIC (application-specific integrated circuit) for each different application are saved by using such a device for many different applications; a situation that is going to be increasingly important in the future, with up to a tenfold improvement in integration. Third, although there has been reluctance in adopting automatic compilers for embedded DSP processors, such parallel DSP products make the use of automated tools feasible; with a large number of processors per chip, one can

tion in a cellular radio handset involves specific DSP functions such as speech compression, channel equalization, modulation, etc. Furthermore, embedded applications face very different constraints compared to general-purpose computation: non-recurring design costs, power consumption, and real-time performance requirements are a few examples. Thus it is important to study techniques that are application-specific and
We have not included IPC costs in this calculation, but these can be included in a straightforward manner by appropriately setting the execution times of the send and receive actors. The maximum cycle mean can be calculated in time O(|V| |E| log(|V| D T)), where D and T are such that delay(e) <= D for all e in E_ipc, and t(v) <= T for all v in V [Law76].

4.2 Execution time estimates

If we only have execution time estimates available instead of exact values, and we set t(v) in the previous section to be these estimated values, then we obtain the estimated iteration period by calculating MCM(G_ipc). Henceforth we will assume that we know the estimated throughput, calculated by setting the t(v) values to the available timing estimates. As mentioned in Chapter 1, for most practical scenarios we can only assume such compile-time estimates, rather than clock-cycle-accurate execution time estimates. In fact, this is the reason we had to rely on self-timed scheduling, and we proposed the ordered transaction strategy as a means of achieving efficient IPC despite the fact that we do not assume knowledge of exact actor execution times.

4.3 Ordering constraints viewed as edges added to G_ipc

The ordering constraints can be viewed as edges added to the IPC graph: an edge (v_i, v_j) with zero delay represents the constraint start(v_j, k) >= end(v_i, k). The ordering constraints can therefore be expressed as a set of edges added to the IPC graph.
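A small sketch of this view: the imposed transaction order becomes extra zero-delay edges between consecutive communication actors, after which the estimated iteration period under the ordered-transactions constraint is just mcm() (from the earlier sketch) of the augmented graph. The one-delay wrap-around edge, modeling the next iteration's first access following this iteration's last, is an assumption of this sketch.

    def add_transaction_order(edges, order):
        """Encode a transaction order O = (v1, ..., vm) as extra IPC-graph
        edges: zero-delay edges between consecutive communication actors,
        plus a one-delay edge closing the cycle for the next iteration."""
        extra = [(order[i], order[i + 1], 0) for i in range(len(order) - 1)]
        extra.append((order[-1], order[0], 1))   # wrap-around to iteration k+1
        return edges + extra

    # Usage: mcm(vertices, add_transaction_order(edges, order), t) estimates
    # the iteration period with the transaction order imposed.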
S implies 2 synchronization accesses per iteration period. We define the synchronization cost of a synchronization graph G_s to be the average number of synchronization accesses required per iteration period. Thus, if n_ff denotes the number of synchronization edges in G_s that are feedforward edges, and n_fb denotes the number of synchronization edges that are feedback edges, then the synchronization cost of G_s can be expressed as 4 n_ff + 2 n_fb. In the remainder of this paper we develop techniques that apply the results and the analysis framework developed in section 4.1 and sections 5.2-5.4 to minimize the synchronization cost of a self-timed implementation of an HSDFG, without sacrificing the integrity of any inter-processor data transfer or reducing the estimated throughput.

We will explore three mechanisms for reducing synchronization accesses. The first, presented in section 5.6, is the detection and removal of redundant synchronization edges, which are synchronization edges whose respective synchronization functions are subsumed by other synchronization edges, and thus need not be implemented explicitly. This technique essentially detects the set of edges that can be removed from E_s. In section 5.7, we examine the utility of adding additional synchronization edges to convert a synchronization graph that is not strongly connected into a strongly connected graph. Such a conversion allows us to implement all synchronization edges with BBS.
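The cost metric itself is straightforward to compute; in this sketch the feedforward/feedback classification is supplied by a caller-provided predicate, an assumption of the sketch (feedback edges are those contained in a cycle of G_s).

    def synchronization_cost(sync_edges, is_feedback):
        """Average synchronization accesses per iteration period:
        UBS-protocol (feedforward) edges cost 4 accesses each,
        BBS-protocol (feedback) edges cost 2, giving 4*n_ff + 2*n_fb."""
        n_fb = sum(1 for e in sync_edges if is_feedback(e))
        n_ff = len(sync_edges) - n_fb
        return 4 * n_ff + 2 * n_fb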
Figure 6.11. Quasi-static schedule for the data-dependent iteration graph of Fig. 6.2(b): subgraph A; k iterations of subgraphs B and D; subgraph C.

that A, B, C, and D of Fig. 6.2(b) are subgraphs rather than atomic actors. Such a quasi-static schedule can also be implemented in a straightforward fashion on the OMA architecture, provided that the data-dependent construct spans all the processors in the system. The bus access schedule corresponding to the iterated subgraph is simply repeated until the iteration construct terminates. The processor responsible for determining when the iteration terminates can be made to force the schedule counter to loop back until the termination condition is reached. This is shown in Fig. 6.12.

Figure 6.12. A possible access order list corresponding to the quasi-static schedule of Fig. 6.11: the bus access list for A, followed by the bus access list for the loop body (B and D), which the processor that determines the termination condition of the iteration can repeat by re-initializing the schedule counter, followed by the bus access list for C.

6.4 Summary

This chapter dealt with extensions of the ordered transactions approach to graphs with data-dependent control flow. We briefly reviewed the Boolean Dataflow model and the quasi-static approach to scheduling conditional and data-dependent iteration constructs. We then presented a scheme whereby the Ordered Memory Access board could
Table of Contents

1. INTRODUCTION ... 1
  1.1 The Synchronous Dataflow model ... 7
    1.1.1 Background ... 7
    1.1.2 Utility of dataflow for DSP ... 11
  1.2 Parallel scheduling ... 13
    1.2.1 Fully static schedules ... 15
    1.2.2 Self-timed schedules ... 19
    1.2.3 Execution time estimates and static schedules ... 21
  1.3 Application-specific parallel architectures ... 24
    1.3.1 Dataflow DSP architectures ... 24
    1.3.2 Systolic and wavefront arrays ... 25
    1.3.3 Multiprocessor DSP architectures ... 26
  1.4 Thesis overview: our approach and contributions ... 27
2. TERMINOLOGY AND NOTATIONS ... 33
  2.1 HSDF graphs and associated graph-theoretic notation ... 33
  2.2 Schedule notation ... 35
3. THE ORDERED TRANSACTION STRATEGY ... 39
  3.1 The Ordered Transactions strategy ... 39
  3.2 Shared bus architecture ... 42
    3.2.1 Using the OT approach
This is done by exploiting the memoryless property of the exponential distribution: when an actor fires, the state of the system at any moment does not depend on how long that actor has spent executing its function; the state changes only when that actor completes execution. The number of states for such a system is equal to the number of different valid token configurations on the edges of the dataflow graph, where by valid we imply any token configuration that can be reached by a sequence of firings of enabled actors in the HSDFG. This is also equal to the number of valid retimings [Lei91] that exist for the HSDFG. This number, unfortunately, is again exponential in the size of the HSDFG.

Analysis of such graphs with exponentially distributed execution times has been extensively studied in the area of stochastic Petri nets; in [Mur89], Murata provides a large and comprehensive list of references on Petri nets (315 in all), a number of which focus on stochastic Petri nets. There is a considerable body of work that attempts to cope with the state explosion problem. Some of these works attempt to divide a given Petri net into parts that can be solved separately (e.g., [Yao93]), some others propose simplified solutions when the graphs have particular structures (e.g., [Cam92]), and others propose approximate solutions for values such as the expected firing rate of transitions (e.g., [Hav91]). None of these methods are general enough
University of California at Berkeley [Buck94, Ptol94].

3.1 The Ordered Transactions strategy

In the OT strategy, we first obtain a fully static schedule using the execution time estimates, but we discard the precise timing information specified in the fully static schedule; as in the ST schedule, we retain the processor assignment and actor ordering on each processor as specified by the FS schedule. In addition, we also retain the order in which processors communicate with one another, and we enforce this order at run time. We formalize the concept of transaction order below.

Suppose there are k inter-processor communication points (s_1, r_1), (s_2, r_2), ..., (s_k, r_k), where each (s_i, r_i) is a send-receive pair in the FS schedule that we obtain as a first step in the construction of a self-timed schedule. Let R be the set of receive actors and S be the set of send actors, i.e., R = {r_1, r_2, ..., r_k} and S = {s_1, s_2, ..., s_k}. We define a transaction order to be a sequence O = (v_1, v_2, v_3, ..., v_{2k-1}, v_{2k}), where v_1, v_2, ..., v_{2k} in S ∪ R; each communication actor is present in the sequence O. We say a transaction order O as defined above is imposed on a multiprocessor if at run time the send and receive actors are forced to execute in the sequence specified by O. That is, if O = (v_1, v_2, v_3, ..., v_{2k-1}, v_{2k}), then imposing O means ensuring the constraints

    end(v_1, k) <= start(v_2, k), end(v_2, k) <= start(v_3, k), ..., end(v_{2k-1}, k) <= start(v_{2k}, k), for all k >= 0.
ability with the problem size, achieving the best possible speedup for a given number of processors, etc. Several numerical computation problems were found to fall into the RIA category: linear algebra, matrix operations, singular value decomposition, etc.; see [Kung88, Leigh92] for interesting systolic array implementations of a variety of different numerical problems. Only fairly regular computations can be specified in the RIA form; this makes the applicability of systolic arrays somewhat restrictive.

Wavefront arrays are similar to systolic arrays, except that processors are not under the control of a global clock. Communication between processors is asynchronous, or self-timed; handshake between processors ensures run-time synchronization. Thus processors in a wavefront array can be complex, and the arrays themselves can consist of a large number of processors without incurring the associated problems of clock skew and global synchronization. Again, similar to FS versus ST scheduling, the flexibility of wavefront arrays over systolic arrays comes at the cost of extra handshaking hardware. The Warp project at Carnegie Mellon University [Anna87] is an example

speech recognition applications based on artificial neural networks. The RAP system consists of several boards that are attached to a host workstation, and acts as a coprocessor for the host. The unidirectional pipelined ring topology employed for interprocessor communication
To construct a periodic parallel schedule, we must first determine the number of times q(N) that each actor N must be invoked in the periodic schedule. Systematic techniques to compute these values are presented in [Lee87]. Next, we must determine the precedence relationships between the actor invocations. In determining the exact precedence relationships, we must take into account the dependence of a given filter invocation on not only the invocation that produces the token that is consumed by the filter, but also on the invocations that produce the n preceding tokens, where n is the order of the filter. Such dependence can easily be evaluated with an additional dataflow parameter on each actor input that specifies the number of past tokens that are accessed [Prin91]. Using this infor

(1) It should be noted that some SDF-based design environments choose to forego parallelization across multiple invocations of an actor in favor of simplified code generation and scheduling. For example, in the GRAPE system, this restriction has been justified on the grounds that it simplifies inter-processor data management, reduces code duplication, and allows the derivation of efficient scheduling algorithms that operate directly on general SDF graphs without requiring the use of the acyclic precedence graph (APG) [Bil94].

Figure 5.6(b),(c): the precedence graph and the self-timed two-processor schedule (Proc 1 and Proc 2 rows of actor invocations).
al for subsequent timing analysis to eliminate the need for explicit synchronizations. The techniques of barrier MIMD do not apply to the problem that we address because they assume that a hardware barrier mechanism exists; they assume that tight bounds on task execution times are available; they do not address iterative, self-timed execution, in which the execution of successive iterations of the dataflow graph can overlap; and, even for non-iterative execution, there is no obvious correspondence between an optimal solution that uses barrier synchronizations and an optimal solution that employs decoupled synchronization checks at the sender and receiver end (directed synchronization). This last point is illustrated in Fig. 5.1. Here, in the absence of execution time bounds, an optimal application of barrier synchronizations can be obtained by inserting two barriers, as illustrated in Figure 5.1(c). However, the corresponding collection of directed synchronizations is not sufficient, since it does not guarantee that the data one actor requires from an actor across the barrier is available before the dependent actor begins execution.

In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of a dataflow graph. However, this work, like that of Dietz et al., does not allow the execution of successive iterations of the dataflow graph to overlap.
Figure 3.15. Steps required for downloading code (tcl script omaDoAll): load and run the host and I/O interrupt routines on the S-56X board, load the processors, and start all processors synchronously.

Each processor is programmed through its Host Interface via the shared bus. First, a monitor program (omaMon.lod), consisting of interrupt routines, is loaded and run on the selected processor. Code is then loaded into processor memory by writing address and data values into the HI port and interrupting the processor. The interrupt routine on the processor is responsible for inserting data into the specified memory location. The S-56X host forces different interrupt routines for specifying which of the three (X, Y, or P) memories the address refers to, and for specifying a read or a write to or from that location. This scheme is similar to that employed in downloading code onto the S-56X card [Ariel91].

Status and control registers on the OMA board are memory-mapped to the S-56X address space, and can be accessed to reset, reboot, monitor, and debug the board. Tcl scripts were written to simplify commands that are used most often; e.g., "change y:fff0 0x0" was aliased to "omareset". The entire code downloading procedure is executed by the tcl script omaDoAll (see Fig. 3.15).

A Ptolemy multiprocessor hardware target (Chapter 12, Section 2 in [Ptol94]) was written for the OMA board, for automatic partitioning, code generation,
and Signal Processing Magazine, October 1988.

Lee88b: E. A. Lee, "Recurrences, Iteration, and Conditionals in Statically Scheduled Block Diagram Languages," VLSI Signal Processing III, IEEE Press, 1988.

Lee89: E. A. Lee and S. Ha, "Scheduling Strategies for Multiprocessor Real-Time DSP," Globecom, Dallas, Texas, pp. 1279-1283, November 1989.

Lee90: E. A. Lee and J. C. Bier, "Architectures for Statically Scheduled Dataflow," Journal of Parallel and Distributed Computing, Vol. 10, pp. 333-348, December 1990.

Lee91: E. A. Lee, "Consistency in Dataflow Graphs," IEEE Transactions on Parallel and Distributed Systems, Vol. 2, No. 2, April 1991.

Lee93: E. A. Lee, "Representing and Exploiting Data Parallelism Using Multidimensional Dataflow Diagrams," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, Vol. 1, pp. 453-456, April 1993.

Lee95: E. A. Lee and T. M. Parks, "Dataflow Process Networks," Proceedings of the IEEE, May 1995.

Lei91: C. E. Leiserson and J. B. Saxe, "Retiming Synchronous Circuitry," Algorithmica, Vol. 6, No. 1, pp. 5-35, 1991.

Len92: D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, and J. Hennessy, "The Stanford Dash Multiprocessor," IEEE Computer, March 1992.

Lew81: H. R. Lewis and C. H. Papadimitriou, Elements of the Theory of Computation, Prentice Hall, 1982.

Liao94: G. Liao, G. R. Gao, E. Altman, a
ardware in this case. Thus, the metric for the optimizations we present in this chapter is the total number of accesses to shared memory that are needed for the purpose of synchronization in the final multiprocessor implementation of the self-timed schedule. This metric will be defined precisely in section 5.5.

5.2 Analysis of self-timed execution

We model synchronization in a self-timed implementation using the IPC graph model introduced in the previous chapter. As before, an IPC graph G_ipc = (V, E_ipc) is extracted from a given HSDFG G and multiprocessor schedule. Fig. 5.2 shows one such example, which we use throughout this chapter.

We will find it useful to partition the edges of the IPC graph in the following manner: E_ipc = E_int ∪ E_comm, where E_comm are the communication edges (shown dotted in Fig. 5.2(d)) that are directed from the send to the receive actors in G_ipc, and E_int are the internal edges that represent the fact that actors assigned to a particular processor (actors internal to that processor) are executed sequentially, according to the order predetermined by the self-timed schedule.

Figure 5.2. Self-timed execution: (a) execution time estimates; (d) the IPC graph, with send and receive actors, E_comm edges dotted, and E_int edges solid.

A communication edge e in E_comm in G_ipc represents two functions: (1) read
be used when such control constructs are included in the dataflow graph. In this scheme, bus access schedules are computed for each set of values that the control tokens in the graph evaluate to, and the bus access controller is made to select between these lists at run time, based on which set of values the control tokens actually take at any given time. This was also shown to be applicable to data-dependent iteration constructs. Such a scheme is feasible when the number of execution paths in the graph is small. We proposed another mechanism, based on masking of bus accesses depending on run-time values of control tokens, for handling the case when there are multiple conditional constructs in parallel.

CONCLUSIONS AND FUTURE DIRECTIONS

In this thesis we explored techniques that minimize inter-processor communication and synchronization costs in statically scheduled multiprocessors for DSP. The main idea is that communication and synchronization in statically scheduled hardware is fairly predictable, and this predictability can be exploited to achieve our aims of low-overhead parallel implementation at low hardware cost.

The first technique we looked at was the ordered transactions strategy, where the idea is to predict the order of processor accesses to shared resources and enforce this order at run time. We applied this idea to a shared bus multiprocessor, where the sequence of accesses to shared memory is pre-determined
by the IPC graph in Fig. 4.4.

Figure 4.4. The IPC graph for the schedule in Fig. 4.1 (the critical cycle and the send and receive actors are indicated).

The IPC graph has the same vertex set V as G, corresponding to the set of actors in G. The self-timed schedule specifies the actors assigned to each processor, and the order in which they execute. For example, in Fig. 4.1, processor 1 executes A and then E repeatedly. We model this in G_ipc by drawing a cycle around the vertices corresponding to A and E, and placing a delay on the edge from E to A. The delay-free edge from A to E represents the fact that the kth execution of A precedes the kth execution of E, and the edge from E to A with a delay represents the fact that the kth execution of A can occur only after the (k-1)th execution of E has completed. Thus, if actors v_1, v_2, ..., v_n are assigned to the same processor in that order, then G_ipc would have a cycle ((v_1, v_2), (v_2, v_3), ..., (v_{n-1}, v_n), (v_n, v_1)), with delay((v_n, v_1)) = 1, because v_1 is executed first. If there are P processors in the schedule, then we have P such cycles, one corresponding to each processor. The additional edges due to these constraints are shown dashed in Fig. 4.4.

As mentioned before, edges in G that cross processor boundaries after scheduling represent inter-processor communication. Communication actors (send and receive) are inserted for each such edge; these are shown in Fig. 4.1. The IPC graph has the same semantics as an HSDFG.
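The per-processor cycles are mechanical to construct; a minimal sketch (the data layout is an assumption of the sketch):

    def ipc_graph_processor_edges(proc_order):
        """Per-processor cycles of the IPC graph. proc_order maps each
        processor to the ordered list of actors it executes, e.g.
        {1: ['A', 'E']}. Consecutive actors get a zero-delay edge; the edge
        from the last actor back to the first carries one delay, since
        iteration k+1's first actor follows iteration k's last."""
        edges = []
        for actors in proc_order.values():
            for a, b in zip(actors, actors[1:]):
                edges.append((a, b, 0))               # same-iteration ordering
            edges.append((actors[-1], actors[0], 1))  # across iterations
        return edges

    # Example: processor 1 runs A then E repeatedly, as in Fig. 4.1.
    print(ipc_graph_processor_edges({1: ["A", "E"]}))
    # [('A', 'E', 0), ('E', 'A', 1)]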
cID> instead of the <ProcID> field alone that we had before. The <Condition> field encodes a unique product c_j associated with that particular bus access. In the OMA prototype, we can use 3 bits for <ProcID> and 5 bits for the <Condition> field. This would allow us to handle 8 processors and 32 product combinations of Booleans. There can be up to 3^n product terms in the worst case, corresponding to n Booleans in the system, because each Boolean b_i could appear in the product term as itself, or its complement, or not at all (corresponding to a "don't care"). It is unlikely that all 3^n possible product terms will be required in practice; we therefore expect such a scheme to be practical. The necessary product terms c_j can be implemented within the controller at compile time, based on the bus access pattern of the particular dynamic dataflow graph to be executed.

In Fig. 6.9, the flags b_1, b_2, ..., b_n are 1-bit memory elements (flip-flops) that are memory-mapped to the shared bus and store the values of the Boolean control tokens in the system. The processor that computes the value of each control token updates the corresponding b_i by writing to the shared memory location that maps to b_i. The product combinations c_1, c_2, ..., c_m are just AND functions of the b_i's and the complements of the b_i's (e.g., c_j could be b_1 AND (NOT b_2)). As the schedule counter steps through the bus access list, the bus grant is actually issued only if the condition associated with the current entry evaluates to TRUE.
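Behaviorally, the masking scheme can be sketched as follows; representing the conditions as Python callables stands in for the hardware AND gates c_j, and is purely illustrative.

    def run_bus_controller(access_list, booleans):
        """Condition-masked bus grants (behavioral sketch).
        access_list entries are (proc_id, condition); condition is a
        function of the Boolean flags (the memory-mapped flip-flops b_i).
        A grant is issued only when its condition evaluates true; otherwise
        the schedule counter simply advances to the next entry."""
        grants = []
        for proc_id, condition in access_list:
            if condition(booleans):
                grants.append(proc_id)   # assert the BG line for this processor
        return grants

    # Example: processor 2's slot is masked by the product term b1 * ~b2.
    access_list = [
        (1, lambda b: True),                      # unconditional access
        (2, lambda b: b["b1"] and not b["b2"]),   # c_j = b1 AND (NOT b2)
        (3, lambda b: b["b1"]),
    ]
    print(run_bus_controller(access_list, {"b1": True, "b2": False}))  # [1, 2, 3]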
Figure 3.13. Schematics hierarchy of the four-processor OMA architecture: per-processor units (proc_unit.sdl, containing the DSP96002 circuitry and local memory, 256ksram.sdl), the transaction controller (trans32.sdl), startup circuitry, connectors (conn.sdl), and interface modules (intmod.sdl).

Table 3.1. OMA board physical specs
  Dimensions: 30 cm x 17 cm
  Layers: 10 (including ground and Vcc planes)
  Number of components: 230 parts (plus 170 bypass capacitors)
  sdl code: 2800 lines
  Memory: 512K words shared, 256K words local

Figure 3.14. OMA prototype board (photograph).

3.5.2 Software interface

As discussed earlier, we use an S-56X card attached to a Sparc as a host for the OMA board. The Xilinx chip on the S-56X card is configured to provide 16 bits of data and 5 bits of address. We use the gdm [Laps91] software as an interface for the S-56X board; gdm is a debugger/monitor that has several useful built-in routines for controlling the S-56X board; for example, data can be written to and read from any location in the DSP56000 address space through function calls in gdm. Another useful feature of gdm is that it uses tcl, an embeddable, extensible, shell-like interpreted command language [Ous94]. Tcl provides a set of built-in functions, such as an expression evaluator, variables, control flow statements, etc., that can be executed via user commands typed at its textual interface or from a specified command file. Tcl can be extended with application-specific commands in
cessed so as to ensure sender-receiver synchronization. For edges not in E_s, however, no synchronization needs to be done before accessing the shared buffer. Sometimes we will also find it useful to introduce synchronization edges without actually communicating data between the sender and the receiver (for the purpose of ensuring finite buffers, for example), so that no shared buffers need to be assigned to these edges, but the corresponding synchronization protocol is invoked for these edges.

All optimizations that remove edges from E_s must respect the synchronization constraints implied by G_ipc. If we ensure this, then we only need to implement the synchronization protocols for the edges in E_s. We call the graph G_s = (V, E_int ∪ E_s) the synchronization graph. The graph G_s represents the synchronization constraints in G_ipc that need to be explicitly ensured, and the algorithms we present for minimizing synchronization costs operate on G_s. Before any synchronization-related optimizations are performed, E_s = E_comm, and hence G_s = G_ipc; but as we remove communication edges from E_s, G_s has fewer and fewer edges. Thus, removing communication edges from E_s can be viewed as removal of edges from G_s. Whenever we remove edges from G_s, we have to ensure, of course, that the synchronization graph G_s at that step respects all the synchronization constraints of G_ipc, because we only implement the synchronizations represented by the edges E_s in G_s. The following
cessor DSP board, called the Ordered Memory Access (OMA) architecture, that demonstrates the ordered transactions concept. The OMA prototype board utilizes shared memory and a single shared bus for IPC: the sender writes data to a particular shared memory location that is allocated at compile time, and the receiver reads that location. In this multiprocessor, a very simple controller on the board enforces the pre-determined transaction order at run time, thus eliminating the need for run-time bus arbitration or semaphore synchronization. This results in efficient IPC (comparable to the FS strategy) at relatively low hardware cost. As in the ST scenario, the OT strategy is tolerant of variations in execution times of actors, because the transaction order enforces correct sender-receiver synchronization; however, this strategy is more constrained than ST scheduling, which allows the order in which communication actors fire to vary at run time. The ordered transactions strategy therefore falls in between the fully static and self-timed strategies, in that, like the ST strategy, it is tolerant of variations in execution times, and, like the FS strategy, it has low communication and synchronization costs. These performance issues will be discussed quantitatively in the following chapter; the rest of this chapter describes the hardware and software implementation of the OMA prototype.

3.2 Shared bus architecture

The OMA architecture uses a single
chapter is a part of an ongoing research effort, in collaboration with Dr. Shuvra Bhattacharyya, a research scientist at Hitachi America Ltd.

The ordered transaction strategy can be viewed as a hardware approach that optimizes for IPC and synchronization overhead in statically scheduled multiprocessor implementations. The synchronization optimization techniques, on the other hand, operate at the level of a scheduled parallel program, by altering the synchronization structure of a given schedule to minimize the synchronization overhead in the final implementation.

cessor assignment and ordering constraints that a self-timed schedule specifies. In the context of the OT scheme, we use the IPC graph construction to determine a transaction order that makes optimal use of the actor execution time information available to us during the compilation or design phase. The idea is to predict the pattern of processor communications using the IPC graph, and then to use this pattern to determine the transaction order. We outline this procedure in Chapter 4. We also discuss the effect of run-time variations in execution times of actors on the performance of a multiprocessor executing a self-timed schedule.

Determining the pattern of processor communications is relatively straightforward in SIMD implementations. Techniques applied to systolic arrays in fact use the regular communication pattern to determine an optimal interconnect topology for a given algorithm.
chedule on five processors ... 80
Self-timed schedule ... 81
Schedule evolution when the transaction order of Fig. 3.1 is enforced ... 81
The IPC graph for the schedule in Fig. 4.1 ... 83
Transaction ordering constraints ... 89
Modified schedules ... 95
[garbled entry] ... 97
[garbled entry] ... 98
G_ipc with transaction ordering constraints represented as dashed lines ... 105
[garbled entry] ... 105
(a) An HSDFG; (b) a three-processor self-timed schedule for (a); (c) an illustration of execution under the placement of barriers ... 110
Self-timed execution ...
An IPC graph with a feedforward edge: (a) original graph; (b) imposing bounded buffers ... 115
x_p is an example of a redundant synchronization edge ... 124
An algorithm that optimally removes redundant synchronization edges ... 127
(a) A multi-resolution QMF filter bank used to illustrate the benefits of removing redundant synchronizations; (b) the precedence graph for (a); (c) a self-timed two-processor parallel schedule for (a); (d) the initial synchronization graph for (c) ...
We address optimization criteria in performing such a conversion, and we will show that the extra synchronization accesses required for such a conversion are always at least compensated by the number of synchronization accesses that are saved by the more expensive UBSs that are converted to BBSs. Finally, in section 5.9, we outline a mechanism, which we call resynchronization, for inserting synchronization edges in a way that(1)

(1) Note that in our measure of the number of shared memory accesses required for synchronization, we neglect the accesses to shared memory that are performed while the sink actor is waiting for the required data to become available, or the source actor is waiting for an empty slot in the buffer. The number of accesses required to perform these "busy-wait" or "spin-lock" operations is dependent on the exact relative execution times of the actor invocations. Since, in our problem context, this information is not generally available to us, we use the best-case number of accesses (the number of shared memory accesses required for synchronization assuming that IPC data on an edge is always produced before the corresponding sink invocation attempts to execute) as an approximation.

the number of original synchronization edges that become redundant exceeds the number of new edges added.

5.6 Removing redundant synchronizations

The first technique that we explore for
cle c such that c does not contain e, and since all cycles must have positive path delay (from Lemma 4.1), the path delay of such a path p must exceed delay(e). Thus, if e_0 satisfies the inequality in the if statement of RemoveRedundantSynchs, and p is a path from snk(e_0) to snk(e) such that Delay(p) = ρ_G(snk(e_0), snk(e)), then p cannot contain e. This observation allows us to avoid having to recompute the shortest paths after removing a candidate redundant edge from G_s.

From the definition of a redundant synchronization edge, it is easily verified that the removal of a redundant synchronization edge does not alter any of the minimum-delay path values (path delays). That is, given a redundant synchronization edge e_r in G_s, and two arbitrary vertices x, y in V, if we let G_s' = (V, E_int ∪ (E_s - {e_r})), then ρ_{G'}(x, y) = ρ_G(x, y). Thus, none of the

Function RemoveRedundantSynchs
Input: A synchronization graph G_s = (V, E_int ∪ E_s)
Output: The synchronization graph (V, E_int ∪ (E_s - E_r))
1. Compute ρ_G(x, y) for each ordered pair of vertices in G_s.
2. Do
3.   For each e in E_s
       For each output edge e_0 of src(e), except for e
         If delay(e_0) + ρ_G(snk(e_0), snk(e)) <= delay(e) Then
           E_r = E_r ∪ {e}
           Break  /* exit the innermost enclosing For loop */
         Endif
       Endfor
     Endfor
4. Return (V, E_int ∪ (E_s - E_r))

Figure 5.5. An algorithm that optimally removes redundant synchronization edges.
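A direct, runnable rendering of RemoveRedundantSynchs, with ρ computed once up front by Floyd-Warshall over delay weights (as argued above, it need not be recomputed after each removal); the edge and graph encodings are assumptions of this sketch.

    def remove_redundant_synchs(vertices, int_edges, sync_edges):
        """Remove redundant synchronization edges (sketch of Fig. 5.5).
        Edges are (src, snk, delay) tuples; returns the surviving
        synchronization edges."""
        INF = float("inf")
        all_edges = int_edges + sync_edges
        # rho[x][y]: minimum path delay from x to y over all edges.
        rho = {x: {y: (0 if x == y else INF) for y in vertices} for x in vertices}
        for (u, v, d) in all_edges:
            rho[u][v] = min(rho[u][v], d)
        for k in vertices:
            for i in vertices:
                for j in vertices:
                    if rho[i][k] + rho[k][j] < rho[i][j]:
                        rho[i][j] = rho[i][k] + rho[k][j]
        kept = []
        for e in sync_edges:
            src, snk, d = e
            # e is redundant if some other output edge e0 of src(e), followed
            # by a minimum-delay path to snk(e), has total delay <= delay(e).
            redundant = any(
                e0 is not e and e0[0] == src and e0[2] + rho[e0[1]][snk] <= d
                for e0 in all_edges
            )
            if not redundant:
                kept.append(e)
        return kept

    # Example: A and C share a processor (internal edge A->C); the sync
    # edge A->B is subsumed by the path A->C plus the sync edge C->B.
    V = ["A", "B", "C"]
    internal = [("A", "C", 0)]
    syncs = [("A", "B", 0), ("C", "B", 0)]
    print(remove_redundant_synchs(V, internal, syncs))   # [('C', 'B', 0)]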
and E. A. Lee, "A Hierarchical Multiprocessor Scheduling System for DSP Applications," to appear in IEEE Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, October 29 - November 1, 1995.

Pow92: D. B. Powell, E. A. Lee, and W. C. Newman, "Direct Synthesis of Optimized DSP Assembly Code from Signal Flow Block Diagrams," Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, San Francisco, March 1992.

Prin91: H. Printz, Automatic Mapping of Large Signal Processing Systems to a Parallel Machine, Ph.D. Thesis, Memorandum CMU-CS-91-101, School of Computer Science, Carnegie Mellon University, May 1991.

Prin92: H. Printz, "Compilation of Narrowband Spectral Detection Systems for Linear MIMD Machines," Proceedings of the International Conference on Application Specific Array Processors, Berkeley, August 1992.

Ptol94: Ptolemy design group, UC Berkeley, The Almagest, UC Berkeley, 1994.

Rab91: J. M. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, "Fast Prototyping of Datapath-Intensive Architectures," IEEE Design and Test of Computers, Vol. 8, No. 2, pp. 40-51, June 1991.

Rajs94: S. Rajsbaum and M. Sidi, "On the Performance of Synchronized Programs in Distributed Networks with Random Processing Times and Transmission Delays," IEEE Transactions on Parallel and Distributed Systems, Vol. 5, No. 9, September 1994.

Ram80: C. V. Ramamoorthy and G. S.
d at the control input. Actors that follow SDF semantics (i.e., that consume and produce a fixed number of tokens on their arcs) are clearly a subset of the set of allowed BDF actors: SDF actors simply do not have any control inputs. Two basic dynamic actors in the token flow model are the SWITCH and SELECT actors, shown in Fig. 6.1. The SWITCH actor consumes one Boolean-valued control token and another input token; if the control token is TRUE, the input token is copied to the output labelled T, otherwise it is copied to the output labelled F. The SELECT actor performs the complementary operation: it reads an input token from its T input if the control token is TRUE, otherwise it reads from its F input; in either case, it copies the token to its output.

Figure 6.1. BDF actors: SWITCH and SELECT.

Constructs such as conditionals and data-dependent iterations can easily be represented in a BDF graph, as illustrated in Fig. 6.2. The vertices A, B, C, etc. in Fig. 6.2 need not be atomic actors; they could also be arbitrary SDF graphs. A BDF graph allows SWITCH and SELECT actors to be connected in arbitrary topologies. Buck [Buck93] in fact shows that any Turing machine can be expressed as a BDF graph, and therefore the problems of determining whether such a graph deadlocks and whether it uses bounded memory are undecidable. Buck proposes heuristic solutions to these problems based on extensions of the
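The token-routing semantics of SWITCH and SELECT can be sketched in a few lines of Python; the stream encodings (iterators for the T and F inputs) are assumptions of the sketch.

    def switch(control, token):
        """BDF SWITCH: route the input token to the T or F output
        according to the Boolean control token."""
        return ("T", token) if control else ("F", token)

    def select(control, t_input, f_input):
        """BDF SELECT: consume a token from the T input if the control
        token is TRUE, else from the F input, and copy it to the output."""
        return next(t_input) if control else next(f_input)

    # If-then-else: a token takes the T branch when the control token is true.
    print(switch(True, 42))                                  # ('T', 42)
    print(select(True, iter([10, 11]), iter([20, 21])))      # 10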
d school sometimes tends to become. Of course, the Berkeley experience in general (the beautiful campus with great views of the San Francisco bay and the Golden Gate, the excellent library system, the cafes and the restaurants, the CD shops and the used book stores, student groups and cacophonic drummers on Sproul plaza, the Hateman and the Naked Guy) has left me with indelible memories, and a wealth of interesting stories to tell, and has also helped keep my efforts towards a Ph.D. in perspective.

Finally, I wish to thank my parents for all their support and belief in me, and my sister, who has a knack for boosting my morale during rough times. I dedicate this thesis to them.

INTRODUCTION

The focus of this thesis is the exploration of architectures and design methodologies for application-specific parallel systems for embedded applications in digital signal processing (DSP). The hardware model we consider consists of multiple programmable processors (possibly heterogeneous) and multiple application-specific hardware elements. Such a heterogeneous architecture is found in a number of embedded applications today: cellular radios, image processing boards, music sound cards, robot control applications, etc. In this thesis we develop systematic techniques aimed at reducing inter-processor communication and synchronization costs in such multiprocessors that are designed to be application-specific. The techniques
d synchronization overhead. Also, we briefly outlined the resynchronization procedure, which involves adding synchronization points in the schedule such that the overall synchronization costs are reduced. Details of resynchronization can be found in [Bhat95a] and [Bhat95b]. We demonstrated the relevance of our techniques through practical examples.

The input to our algorithm is an HSDFG and a parallel schedule for it. The output is an IPC graph G_ipc = (V, E_ipc), which represents buffers as communication edges; a strongly connected synchronization graph G_s = (V, E_int ∪ E_s), which represents synchronization constraints; and a set of shared memory buffer sizes {B_f(e) : e is an IPC edge in G_ipc}. Fig. 5.15 specifies the complete algorithm. A code generator can then accept G_ipc and G_s, allocate a buffer in shared memory for each communication edge e specified by G_ipc of size B_f(e), and generate synchronization code for the synchronization edges represented in G_s. These synchronizations may be implemented using the BBS protocol. The resulting synchronization cost is 2n, where n is the number of synchronization edges in the synchronization graph G_s that is obtained after all optimizations are complete.

Function MinimizeSynchCost
Input: An HSDFG G and a self-timed schedule S for this HSDFG
Output: G_ipc, G_s, and {B_f(e) : e is an IPC edge in G_ipc}
1. Extract G_ipc from G and S
2. G_s = G_ipc (each communication edge
dress decoding is done by the shared address decoder PAL (programmable array logic) chip. A central clock generator provides a common clock signal to all processing elements.

A Xilinx FPGA (XC3090) implements the transaction controller and a simple I/O mechanism, and is also used to implement latches and buffers during bootup, thus saving glue logic. A fast static RAM (up to 32K x 8) stores the bus access order in the form of processor identifications (IDs). The sequence of processor IDs is stored in this schedule RAM, and this determines the bus access order. An external latch is used to store the processor ID read from the schedule RAM. This ID is then decoded to obtain the processor bus grants. A subset of the 32 address lines connects to the Xilinx chip, for addressing the I/O registers and other internal registers. All 32 lines from the shared data bus are connected to the Xilinx.

The shared data bus can be accessed from the external connector (the right-side connector in Fig. 3.5) only through the Xilinx chip. This feature can be made use of when connecting multiple OMA boards: shared busses from different boards can be made into one contiguous bus, or they can be left disconnected, with communication between busses occurring via asynchronous bridges implemented on the Xilinx FPGAs. We discuss this further in section 3.4.7. Connectors on both ends of the board bring out the shared bus in its entirety. Both
e converted into an HSDF graph [Lee86]; this transformation may result in an exponential increase in the number of actors in the final HSDF graph (see [Pin95b] for an example of an SDF graph in which this blowup occurs). Such a transformation, however, appears to be necessary when constructing periodic multiprocessor schedules from multirate SDF graphs. There is some recent work on reducing the complexity of the HSDFG that results from transforming a given SDF graph, by applying graph clustering techniques to that SDF graph [Pin95b]. Since we are concerned with multiprocessor schedules in this thesis, we assume henceforth, unless we state otherwise, that we start with an application represented as a homogeneous SDF graph. This of course results in no loss of generality, because a multirate graph is converted into a homogeneous graph for the purposes of multiprocessor scheduling anyway. In Chapter 6 we discuss how the ideas that apply to HSDF graphs can be extended to graphs containing actors that display data-dependent behaviour (i.e., dynamic actors).

We note that an HSDFG is very similar to a marked graph in the context of Petri nets [Peter81]: transitions in the marked graph correspond to actors in the HSDFG, places correspond to edges, and the initial tokens (or initial marking) of the marked graph correspond to initial tokens (or delays) in HSDFGs. We will represent delays using bullets on the edges of the HSDFG; we indicate more than one delay on an edge
e cycle for loading the schedule counter; the total savings in this example is therefore two bus cycles.

One of the problems with the above approach is that it involves explicit enumeration of all possible combinations of Booleans, the complexity of which limits the size of problems that can be tackled with this approach. An implicit mechanism for representing all possible execution paths is therefore desirable. One such mechanism is the use of Binary Decision Diagrams (BDDs), which have been used to efficiently represent and manipulate Boolean functions for the purpose of logic minimization [Bryant86]. BDDs have been used to compactly represent large state spaces and to perform operations implicitly over such state spaces when methods based on explicit techniques are infeasible. One difficulty we encountered in applying BDDs to our problem of representing execution paths is that it is not obvious how precedence and ordering constraints can be encoded in a BDD representation. The execution paths corresponding to the various Boolean combinations can be represented using a BDD, but it is not clear how to represent the relative order between bus accesses corresponding to the different execution paths. We leave this as an area for future exploration.

6.3 Data-dependent iteration

A data-dependent iteration construct is shown in Fig. 6.2(b). A quasi-static schedule for such a construct may look like the one in Fig. 6.11. We are assuming
e ignored, since only intra-iteration dependencies are significant. Thus Shaffer's synchronization graph is acyclic. RemoveRedundantSynchs can be viewed as an extension of Shaffer's algorithm to handle self-timed, iterative execution of an HSDFG: Shaffer's algorithm accounts for self-timed execution only within a graph iteration, and in general it can be applied to iterative dataflow programs only if all processors are forced to synchronize between graph iterations.

5.6.4 An example

In this subsection we illustrate the benefits of removing redundant synchronizations through a practical example. Fig. 5.6(a) shows an abstraction of a three-channel, multi-resolution quadrature mirror (QMF) filter bank, which has applications in signal compression [Vai93]. This representation is based on the general (not homogeneous) SDF model, and accordingly each edge is annotated with the number of tokens produced and consumed by its source and sink actors. Actors A and F represent the subsystems that, respectively, supply and consume data to/from the filter bank system; B and C each represents a parallel combination of decimating high and low pass FIR analysis filters; D and E represent the corresponding pairs of interpolating synthesis filters. The amount of delay on the edge directed from B to E is equal to the sum of the filter orders of C and D. For more details on the application represented by Fig. 5.6(a), we refer the reader to [Vai93].
e is also a synchronization edge to begin with).
3. G_s ← Resynchronize(G_s)
4. G_s ← Convert-to-SC-graph(G_s)
5. G_s ← DetermineDelays(G_s)
6. G_s ← RemoveRedundantSynchs(G_s)
7. Calculate the buffer size B_fb(e) for each communication edge e in G_ipc:
   a. Compute ρ_Gs(src(e), snk(e))
   b. B_fb(e) = ρ_Gs(src(e), snk(e)) + delay(e)

Figure 5.15. The complete synchronization optimization algorithm.

pleted.

EXTENSIONS

The techniques of the previous chapters apply compile-time analysis to static schedules for HSDF graphs that have no decision-making at the dataflow graph level. In this chapter we consider graphs with data-dependent control flow. Recall that atomic actors in an HSDF graph are allowed to perform data-dependent decision-making within their body, as long as their input/output behaviour respects SDF semantics. We show how some of the ideas we explored previously can still be applied to dataflow graphs containing actors that display data-dependent firing patterns, and that are therefore not SDF actors.

6.1 The Boolean Dataflow model

The Boolean Dataflow (BDF) model was proposed by Lee [Lee91] and was further developed by Buck [Buck93] for extending the SDF model to allow data-dependent control actors in the dataflow graph. BDF actors are allowed to contain a control input, and the number of tokens consumed and produced on the arcs of a BDF actor can be a two-valued function of a token consume
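The token-flow behaviour of the two canonical BDF control actors, SWITCH and SELECT (Fig. 6.1), can be illustrated with a small executable model. The sketch below is our own (deques stand in for arcs); it shows how the number of tokens moved on each branch is a two-valued function of the control token, which is exactly the BDF extension of the SDF rate rules.

```python
# Toy model of the BDF SWITCH and SELECT actors (illustrative only).
from collections import deque

def switch(data_in, ctrl_in, true_out, false_out):
    # Route one data token per control token to the chosen output arc.
    while ctrl_in and data_in:
        (true_out if ctrl_in.popleft() else false_out).append(data_in.popleft())

def select(true_in, false_in, ctrl_in, data_out):
    # Consume one token from the input arc chosen by the control token.
    while ctrl_in:
        src = true_in if ctrl_in[0] else false_in
        if not src:
            break                     # chosen branch has no token yet
        ctrl_in.popleft()
        data_out.append(src.popleft())

data, ctrl = deque([10, 20, 30]), [True, False, True]
t_arc, f_arc, out = deque(), deque(), deque()
switch(data, deque(ctrl), t_arc, f_arc)   # t_arc: 10, 30; f_arc: 20
select(t_arc, f_arc, deque(ctrl), out)
print(list(out))                          # [10, 20, 30]
```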
e memory. After downloading code onto the processors and downloading the bus access order into the schedule RAM, the Xilinx chip is reconfigured to implement the ordered transaction controller and the I/O interface. Thus the process of downloading and running a program requires configuring the Xilinx chip twice.

There are several possible ways in which a Xilinx part may be programmed. For the OMA board, the configuration bitmap is downloaded byte-wise by the host Sun workstation through the S-56X card. The bitmap file, generated and stored as a binary file on a workstation, is read in by a function implemented in the gdm software (discussed in section 3.5, which describes the OMA software interface), and the bytes thus read are written into the appropriate memory location on the S-56X card. The DSP56K processor on the S-56X then strobes these bytes into the Xilinx configuration port on the OMA board. The user can reset and reconfigure the Xilinx chip from a Sun Sparc workstation by manipulating the Xilinx control pins, which is done by writing to a Xilinx configuration latch on the OMA board. Various configuration pins of the Xilinx chip are manipulated by writing different values into this latch.

We use two different Xilinx circuits (bootup.bit and normal.bit), one during bootup and the other during run time. The Xilinx configuration during bootup helps eliminate some glue logic that would otherwise be required to latch and decode address and data f
e plane to obtain the best possible schedule. The search involved in this process has a worst-case complexity exponential in the size of the input graph, although it appears that the complexity is manageable in practice, at least for small examples [Schw85].

1.2.2 Self-timed schedules

The fully static approach introduced in the previous section cannot be used when actors have variable execution times; the FS approach requires precise knowledge of actor execution times to guarantee sender-receiver synchronization. It is possible to use worst-case execution times and still employ an FS strategy, but this requires tight worst-case execution time estimates that may not be available to us. An obvious strategy for solving this problem is to introduce explicit synchronization whenever processors communicate. This leads to the self-timed scheduling (ST) strategy in the scheduling taxonomy of Lee and Ha [Lee89]. In this strategy, we first obtain an FS schedule using the techniques discussed in section 1.2, making use of the execution time estimates. After computing the FS schedule (Fig. 1.3

The heuristics mentioned above ignore communication costs between processors, which is often inappropriate in actual multiprocessor implementations. An edge of the HSDFG that crosses processor boundaries after the processor assignment step represents interprocessor communication (IPC), illustrated in Fig. 1.3(a). These communication points are usually im
ected onto a contiguous bus. Thus the SMART array allows formation of clusters of processors that reside on a common bus; these clusters then communicate with adjacent clusters. When we connect multiple OMA boards together we get a similar effect: in the shorted configuration, processors on different boards connect to a single bus, whereas in the cleaved configuration, the processors on each board reside on their own common bus, and neighboring boards communicate through an asynchronous interface. Fig. 3.12 illustrates the above scheme.

The highest 3 bits of the shared address bus are used as the board ID field. Memory, processors, host interface ports, configuration latches, etc., decode the board ID field to determine if a shared memory or host access is meant for them. Thus a total of 8 boards can be hooked onto a common bus in this scheme.

3.5 Hardware and software implementation

3.5.1 Board design

We used single-sided, through-hole printed circuit board technology for the OMA prototype. The printed circuit board design was done using the SIERA system developed in Professor Brodersen's group at Berkeley [Sriv92]. Under this system, a design is entered hierarchically using a netlist language called SDL (Structure Description Language). Geometric placement of components can be easily specified in the SDL netlist itself. A tiling feature is also provided to ease compact fitting of components. The SDL files were wr
ed arrays of processors that communicate locally, onto which a certain class of applications, specified in a mathematical form, can be systematically mapped. We discuss systolic arrays further in section 1.3.2.

The necessary elements in the study of application-specific computer architectures are: (1) a clearly defined set of problems that can be solved using the particular application-specific approach, (2) a formal mechanism for specification of these applications, and (3) a systematic approach for designing hardware from such a specification.

In this thesis, the applications we focus on are those that can be described by Synchronous Dataflow Graphs (SDF) [Lee87] and its extensions; we will discuss this model in detail shortly. SDF in its pure form can only represent applications that have no decision-making at the task level. Extensions of SDF, such as the Boolean dataflow (BDF) model [Lee91][Buck93], allow control constructs, so that data-dependent control flow can be expressed in such models. These models are significantly more powerful in terms of expressivity, but they give up some of the useful analytical properties that the SDF model has. For instance, Buck shows that it is possible to simulate any Turing machine in the BDF model [Buck93]. The

rather than at run time. Given such a schedule that is determined at compile time, we can extract information from it with a view towards optimizing the final implementation
ed by assigning execution times to the send and receive actors. Now we can substitute end(v, k) = start(v, k) + t(v) in Eqn. (4.1) to obtain

start(v_j, k) ≥ start(v_i, k - delay(v_i, v_j)) + t(v_i), for all (v_i, v_j) ∈ E_ipc.   (4.2)

In the self-timed schedule, actors fire as soon as data is available at all their input edges. Such an as-soon-as-possible (ASAP) firing pattern implies

start(v_j, k) = max{ start(v_i, k - delay(v_i, v_j)) + t(v_i) : (v_i, v_j) ∈ E_ipc }.

In contrast, recall that in the FS schedule we would force actors to fire periodically, according to start(v, k) = σ(v) + kT.

The IPC graph has the same semantics as a Timed Marked Graph in Petri net theory [Peter81][Ram80]: the transitions of a marked graph correspond to the nodes of the IPC graph, the places of a marked graph correspond to edges, and the initial marking of a marked graph corresponds to initial tokens on the edges. The IPC graph is also similar to Reiter's computation graph [Reit68]. The same properties hold for it, and we state some of the relevant properties here. The proofs are similar to the proofs for the corresponding properties of marked graphs and computation graphs in the references above.

Lemma 4.1: The number of tokens in any cycle of the IPC graph is conserved over all possible valid firings of actors in the graph, and is equal to the path delay of that cycle.

Proof: For each cycle C in the IPC graph
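The ASAP recursion above translates directly into a small simulator. The following Python sketch is our own illustration; edges are (src, snk, delay) triples over the IPC graph, terms with k - delay < 0 impose no constraint (their tokens are initially present), and the repeated sweep converges because start times only increase. It assumes the graph has no delay-free cycle (i.e., no deadlock).

```python
# Illustrative simulator for the ASAP (self-timed) firing-time recursion.
def asap_start_times(edges, exec_time, num_iters):
    actors = {v for (u, w, _) in edges for v in (u, w)}
    start = {}
    for k in range(num_iters):
        for v in actors:
            start[(v, k)] = 0
        changed = True
        while changed:                       # fixed point within iteration k
            changed = False
            for (u, w, d) in edges:
                if k - d >= 0:               # else: token initially present
                    t = start[(u, k - d)] + exec_time[u]
                    if t > start[(w, k)]:
                        start[(w, k)] = t
                        changed = True
    return start

s = asap_start_times([("A", "B", 0), ("B", "A", 1)], {"A": 2, "B": 3}, 4)
print([s[("A", k)] for k in range(4)])   # [0, 5, 10, 15]: period 5
```

For this two-actor example the observed period, 5, equals the cycle's total execution time divided by its delay, anticipating the maximum cycle mean result quoted below.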
ed estimated throughput arise because the conversion to a strongly connected graph must necessarily introduce one or more new fundamental cycles. In general, a new cycle may be delay-free, or its cycle mean may exceed that of the critical cycle in G_s. Thus, we may have to insert delays on the edges added by Convert-to-SC-graph. The location (edge) and magnitude of the delays that we add are significant, since they affect the self-timed buffer bounds of the communication edges, as shown subsequently in Theorem 5.4. Since the self-timed buffer bounds determine the amount of memory that we allocate for the corresponding buffers, it is desirable to prevent deadlock and decrease in estimated throughput in a way that the sum of the self-timed buffer bounds over all communication edges is minimized. In this section, we outline a simple and efficient algorithm for addressing this problem. Our algorithm pro-

Figure 5.11. A possible solution obtained by applying Convert-to-SC-graph to the example of Figure 5.10 (legend: new edges; synch edges).

duces an optimal result if G_s has only one source SCC or only one sink SCC; in other cases, the algorithm must be viewed as a heuristic. Fig. 5.12 outlines the restricted version of our algorithm that applies when the synchronization graph G_s has exactly one source SCC. Here, BellmanFord is assumed to be an algorithm that takes a synchronization graph Z as input and repeatedly applies the Bellman-Ford algor
ed memory. Special hardware for synchronization (barriers, semaphores implemented in hardware, etc.) would be prohibitive for the embedded multiprocessor machines for applications such as DSP that we are considering. Interfaces between hardware and software are typically implemented using memory-mapped registers in the address space of the programmable processor (again, a kind of shared memory), and synchronization is achieved using flags that can be tested and set by the programmable component; the same can be done by an interface controller on the hardware side [Huis93].

Under the model above, the benefits that our proposed synchronization optimization techniques offer become obvious. Each synchronization that we eliminate directly results in one less synchronization check, or, equivalently, one less shared memory access. For example, where a processor would have to check a flag in shared memory before executing a receive primitive, eliminating that synchronization implies there is no longer a need for such a check. This translates to one less shared memory read. Such a benefit is especially significant for simplifying interfaces between a programmable component and a hardware component: a send or a receive without the need for synchronization implies that the interface can be implemented in a non-blocking fashion, greatly simplifying the interface controller. As a result, eliminating a synchronization directly results in simpler h
edges of the IPC graph are potentially buffers of infinite size. However, from Lemma 4.1, every feedback edge (an edge that belongs to a strongly connected component, and hence to some cycle) can only have a finite number of tokens at any time during the execution of the IPC graph. We will call this constant the self-timed buffer bound of that edge, and for a feedback edge e we will represent this bound by B_fb(e). Lemma 4.1 yields the following self-timed buffer bound:

B_fb(e) = min{ Delay(C) : C is a cycle that contains e }.   (5.1)

Feedforward edges (edges that do not belong to any SCC) have no such bound on buffer size; therefore, for practical implementations, we need to impose a bound on the sizes of these edges. For example, Figure 5.3(a) shows an IPC graph where the communication edge (s, r) could be unbounded (when the execution time of A is less than that of B, for example).

Figure 5.3. An IPC graph with a feedforward edge: (a) original graph, (b) imposing bounded buffers.

In practice, we need to bound the buffer size of such an edge; we will denote such an imposed bound for a feedforward edge e by B_ff(e). Since the effect of placing such a restriction includes artificially constraining src(e) from getting more than B_ff(e) invocations ahead of snk(e), its effect on the estimated throughput can be modelled by adding a reverse edge that has m delays on it, where m = B_ff(e) - delay(e), to G_ipc (grey edge in Fig. 5
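Eqn. (5.1) can be evaluated directly by enumerating cycles. The sketch below is our own (it uses the networkx library and assumes at most one edge per ordered vertex pair); it returns the self-timed buffer bound of a feedback edge, and None for a feedforward edge.

```python
# Direct evaluation of Eqn. (5.1) by simple-cycle enumeration.
import networkx as nx

def self_timed_buffer_bound(edges, target):
    g = nx.DiGraph()
    for (u, v, d) in edges:
        g.add_edge(u, v, delay=d)
    best = None
    for cyc in nx.simple_cycles(g):
        pairs = list(zip(cyc, cyc[1:] + cyc[:1]))
        if (target[0], target[1]) in pairs:
            total = sum(g[u][v]["delay"] for (u, v) in pairs)
            best = total if best is None else min(best, total)
    return best          # None means the edge is feedforward (no bound)

edges = [("A", "B", 0), ("B", "A", 2), ("B", "C", 1)]
print(self_timed_buffer_bound(edges, ("A", "B", 0)))   # 2
print(self_timed_buffer_bound(edges, ("B", "C", 1)))   # None
```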
edule is schematically represented as a Gantt chart that indicates the processors along the vertical axis and time along the horizontal axis. The actors are represented as rectangles with horizontal length equal to the execution time of the actor. The left side of each actor in the Gantt chart corresponds to its starting time. The Gantt chart can be viewed as a processor-time plane; scheduling can then be viewed as a mechanism to tile this plane while minimizing total schedule length and idle time (empty spaces in the tiling process). Clearly, the FS strategy is viable only if actor execution time estimates are accurate and data-independent, or if tight worst-case estimates are available for these execution times. As shown in Fig. 1.1, two different types of FS schedules arise depending

actor fires such that all data precedence constraints are met. Each of these three tasks may be performed either at run time (a dynamic strategy) or at compile time (a static strategy). We restrict ourselves to non-preemptive schedules, i.e., schedules where an actor executing on a processor cannot be interrupted in the middle of its execution to allow another task to be executed. This is because preemption entails a significant implementation overhead and is therefore of limited use in embedded, time-critical applications. Lee and Ha [Lee89] propose a scheduling taxonomy based on which of the scheduling tasks are performed at compile time and w
3.3 Design of an Ordered Memory Access multiprocessor
3.3.1 High-level design description
3.3.2 A modified design
3.4 Design details of a prototype
3.4.1 Top-level design
3.4.2 Transaction order controller
3.4.2.1 Processor bus arbitration signals
3.4.2.2 A simple implementation
3.4.2.3 Presettable counter
3.4.3 Host interface
3.4.4 Processing element
3.4.5 Xilinx circuitry
3.4.5.1 I/O interface
3.4.6 Shared memory
3.4.7 Connecting multiple boards
3.5 Hardware and software implementation
3.5.1 Board design
3.5.2 Software interface
3.6 Ordered I/O and parameter control
3.7 Application examples
3.7.1 Mu
ens produced (consumed) by an actor on each output (input) edge is a fixed number that is known at compile time. The arcs in an SDF graph may contain initial tokens, which we also refer to as delays. Arcs with delays can be interpreted as data dependencies across iterations of the graph; this concept will be formalized in the following chapter. In an actual implementation, arcs represent buffers in physical memory.

DSP applications typically represent computations on an indefinitely long data sequence; therefore the SDF graphs we are interested in for the purpose of signal processing must execute in a nonterminating fashion. Consequently, we must be able to obtain periodic schedules for SDF representations, which can then be run as infinite loops using a finite amount of physical memory. Unbounded buffers imply a sample rate inconsistency, and deadlock implies that all actors in the graph cannot be iterated indefinitely. Thus, for our purposes, correctly constructed SDF graphs are those that can be scheduled periodically using a finite amount of memory. The main advantage of imposing restrictions on the SDF model over a general dataflow model lies precisely in the ability to determine whether or not an arbitrary SDF graph has a periodic schedule that neither deadlocks nor requires unbounded buffer sizes [Lee87]. The buffer sizes required to implement arcs in SDF graphs can be determined at compile time (recall that this

Footnote: SDF should not be conf
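The compile-time schedulability check mentioned above reduces to solving the balance equations of the graph. The following Python sketch is our own (it assumes a connected graph; deadlock detection is a separate check): a connected SDF graph admits a periodic, bounded-memory schedule only if q[src] x produced = q[snk] x consumed has a positive integer solution q, the repetitions vector.

```python
# Sketch of the SDF balance-equation (sample-rate consistency) check.
from fractions import Fraction
from math import lcm

def repetitions_vector(edges, actors):
    # edges: (src, snk, produced, consumed); assumes a connected graph.
    q = {actors[0]: Fraction(1)}
    frontier = [actors[0]]
    while frontier:                      # propagate rates along edges
        a = frontier.pop()
        for (u, v, p, c) in edges:
            for (x, y, rx, ry) in ((u, v, p, c), (v, u, c, p)):
                if x == a and y not in q:
                    q[y] = q[x] * rx / ry
                    frontier.append(y)
    for (u, v, p, c) in edges:
        if q[u] * p != q[v] * c:
            return None                  # rate-inconsistent: unbounded buffers
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(v * scale) for a, v in q.items()}

# A produces 2 tokens per firing, B consumes 3: q = {A: 3, B: 2}.
print(repetitions_vector([("A", "B", 2, 3)], ["A", "B"]))
```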
ep 1 need to be recalculated after removing a redundant synchronization edge in Step 3.

Observe that the complexity of the function RemoveRedundantSynchs is dominated by Step 1 and Step 3. Since all edge delays are non-negative, we can repeatedly apply Dijkstra's single-source shortest path algorithm (once for each vertex) to carry out Step 1 in O(|V|^3) time; a modification of Dijkstra's algorithm can be used to reduce the complexity of Step 1 to O(|V|^2 log|V| + |V||E|) [Corm92]. In Step 3, |E| is an upper bound for the number of synchronization edges, and in the worst case each vertex has an edge connecting it to every other member of V. Thus the time complexity of Step 3 is O(|V||E|), and if we use the modification to Dijkstra's algorithm mentioned above for Step 1, then the time complexity of RemoveRedundantSynchs is O(|V|^2 log|V| + |V||E| + |V||E|) = O(|V|^2 log|V| + |V||E|).

5.6.3 Comparison with Shaffer's approach

In [Sha89], Shaffer presents an algorithm that minimizes the number of directed synchronizations in the self-timed execution of an HSDFG, under the implicit assumption that the execution of successive iterations of the HSDFG are not allowed to overlap. In Shaffer's technique, a construction identical to our synchronization graph is used, except that there is no feedback edge connecting the last actor executed on a processor to the first actor executed on the same processor, and edges that have delay ar
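The shortest-path formulation behind RemoveRedundantSynchs can be prototyped compactly. The sketch below is our own, less optimized rendering of the same redundancy test (it reruns a shortest-path query per synchronization edge rather than precomputing the all-pairs path delays of Step 1), written with the networkx library: a synchronization edge is redundant if some other path from its source to its sink has total delay no greater than the edge's own delay, and, by the independence property of section 5.6.1, all redundant edges can be removed in one simultaneous pass.

```python
# Compact (unoptimized) rendering of the redundancy test.
import networkx as nx

def remove_redundant_synchs(internal_edges, synch_edges):
    redundant = set()
    for i, (u, v, d) in enumerate(synch_edges):
        g = nx.MultiDiGraph()
        # Build the graph without edge i, so the path found differs from it.
        for (a, b, w) in internal_edges + synch_edges[:i] + synch_edges[i+1:]:
            g.add_edge(a, b, weight=w)
        if g.has_node(u) and g.has_node(v) and nx.has_path(g, u, v):
            if nx.shortest_path_length(g, u, v, weight="weight") <= d:
                redundant.add(i)       # a cheaper-or-equal path exists
    return [e for i, e in enumerate(synch_edges) if i not in redundant]

internal = [("A", "B", 0), ("B", "C", 0)]
synch = [("A", "C", 0), ("C", "A", 0)]
print(remove_redundant_synchs(internal, synch))   # [('C', 'A', 0)]
```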
erministic case. Consider the IPC graph in Fig. 4.7, which is the same IPC graph as in Fig. 4.4, except that we have used a different execution time for actor H to make the example more illustrative. The numbers next to each actor represent execution times of the actors. We let the execution time of actor C be t(C) = t_C, and we determine the iteration period as a function of t_C. The

Figure 4.7. G_ipc: actor C has execution time t_C (constant over all invocations of C).

iteration period is given by MCM(G_ipc), the maximum cycle mean. The function T_ST(t_C) is shown in Fig. 4.8. When 0 ≤ t_C ≤ 1, the cycle

Figure 4.8. T_ST(t_C).

(A, s6, r6, E, A) is critical and the MCM is constant at 7; when 1 < t_C ≤ 9, the cycle (B, s1, r1, E, s4, r4, D, s3, r3, C, s5, r5, B) is critical, and since this cycle has two delays, the slope of T_ST(t_C) is one-half in this region; finally, when 9 < t_C, the cycle (C, s5, G, C) becomes critical, and the slope now is one, because that cycle has a single delay on it. Thus the effect of changes in execution times of each actor is piecewise linear, and the slope depends on the number of delays on the critical cycle that the actor lies on. The slope is at most one when the critical cycle containing the particular actor has a single delay on it. The iteration period is a convex function o
es, however, do not attempt to make use of the fact that the interprocessor communication pattern in a self-timed implementation is fairly predictable. In this thesis we explore techniques that optimize the parallel implementation of a self-timed schedule by performing compile-time analysis of the schedule to determine the pattern of processor communication and synchronization.

We assume we are given a self-timed schedule, i.e., we are given a processor assignment and actor ordering. This could be obtained using any of the techniques discussed in section 1.2. For the actual program examples that we discuss in this thesis, we use one of three scheduling methods that are currently available within the Ptolemy system: Sih's scheduling heuristics [Sih92], Hu's list scheduling heuristic, and manual scheduling.

The first technique we present is the Ordered Transaction (OT) strategy. In the OT strategy we impose additional constraints on the self-timed execution; in particular, we force an order in which processors communicate. Doing this allows us to eliminate arbitration and synchronization costs, like the fully static strategy, but enables us to retain some of the flexibility of the self-timed schedule, in that execution times of actors need only be estimates rather than values accurate down to the clock cycle. In Chapter 3 we discuss the OT strategy in detail and present the hardware design and implementation of a four-processor
esign flow. Analytical techniques can be used instead to reduce this estimation time; for example, Li and Malik [Li95] have proposed algorithms for estimating the execution time of embedded software. Their estimation technique, which forms a part of a tool called cinderella, consists of two components: (1) determining the sequence of instructions in the program that results in maximum execution time (program path analysis), and (2) modeling the target processor to determine how much time the worst-case sequence determined in step (1) takes to execute (micro-architecture modeling). The target processor model also takes the effect of instruction pipelines and cache activity into account. The input to the tool is a generic C program with annotations that specify the loop bounds, i.e., the maximum number of iterations that a loop runs for. Although the problem is formulated as an integer linear program (ILP), the claim is that practical inputs to the tool can be efficiently analyzed using a standard ILP solver. The advantage of this approach, therefore, is the efficient manner in which estimates are obtained, as compared to simulation.

It should be noted that the program path analysis component of the Li and Malik technique is in general an undecidable problem; therefore, for these techniques to function, the programmer must ensure that his or her program does not contain pointer references, dynamic data structures, recursion, etc., and must provide bounds o
Figure 5.7. The synchronization graph of Fig. 5.6(d) after all redundant synchronization edges are removed.
Figure 5.8. An algorithm for converting a synchronization graph that is not strongly connected into a strongly connected graph.
Figure 5.9. An illustration of a possible solution obtained by algorithm Convert-to-SC-graph.
Figure 5.10. The synchronization graph, after redundant synchronization edges are removed, induced by a four-processor schedule of a music synthesizer based on the Karplus-Strong algorithm.
Figure 5.11. A possible solution obtained by applying Convert-to-SC-graph to the example of Figure 5.10.
Figure 5.13. An example used to illustrate a solution obtained by algorithm DetermineDelays.
Figure 5.12. An algorithm for determining the delays on the edges introduced by algorithm Convert-to-SC-graph.
Figure 5.14. An example of resynchronization.
Figure 5.15. The complete synchronization optimization algorithm.
Figure 6.1. BDF actors SWITCH and SELECT
essors can be simulated using Frigg. This allows the functionality of the entire system to be verified by running actual programs on the processor simulators. We did not use this model for performance evaluation, however, because with just a four-processor system the cycle-by-cycle Frigg simulation was far too slow, even for very simple programs. A higher-level behavioral simulation would be more useful than a cycle-by-cycle simulation for the purposes of performance evaluation, although we did not attempt such high-level simulation of our designs.

The remainder of this chapter describes hardware and software design details of an OMA board prototype.

3.4 Design details of a prototype

A proof-of-concept prototype of the OMA architecture has been designed and implemented. The single printed circuit board design is comprised of four DSP96002 processors; the transaction controller is implemented on a Xilinx FPGA (Field Programmable Gate Array). The Xilinx chip also handles the host interface functions and implements a simple I/O mechanism. A hierarchical description of the hardware design follows.

3.4.1 Top-level design

This section refers to Fig. 3.5. At the top level, there are four processing element blocks that consist of the processor, local memory, local address decoder, and some glue logic. Address, data, and control busses from the PE blocks are connected to form the shared bus. Shared memory is connected to this bus; ad
essors communicate. A straightforward implementation of a self-timed schedule would require that, for each inter-processor communication (IPC), the sending processor ascertain that the buffer it is writing to is not full, and the receiver ascertain that the buffer it is reading from is not empty. The processors block (suspend execution) when the appropriate condition is not met. Such sender-receiver synchronization can be implemented in many ways, depending on the particular hardware platform under consideration: in shared memory machines, such synchronization involves testing and setting semaphores in shared memory; in machines that support synchronization in hardware (such as barriers), special synchronization instructions are used; and in the case of systems that consist of a mix of programmable processors and custom hardware elements, synchronization is achieved by employing interfaces that support blocking reads and writes.

In each type of platform, each IPC that requires a synchronization check
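The straightforward full/empty protocol just described can be sketched as follows; the Python sketch is our own (threads stand in for processors, a lock models atomic access to the shared count, and all names are illustrative). Note the four shared accesses per iteration: the sender and the receiver each read and write the token count.

```python
# Sketch of a full/empty (UBS-style) send/receive on one buffer.
import threading, time

BUF, count, lock = [None] * 4, 0, threading.Lock()

def send(value, wr):
    global count
    while True:
        with lock:
            if count < len(BUF):        # read count: buffer not full?
                BUF[wr % len(BUF)] = value
                count += 1              # write count
                return
        time.sleep(0)                   # full: block (suspend execution)

def receive(rd):
    global count
    while True:
        with lock:
            if count > 0:               # read count: token available?
                value = BUF[rd % len(BUF)]
                count -= 1              # write count
                return value
        time.sleep(0)                   # empty: block (suspend execution)

def producer():
    for k in range(10):
        send(k, k)

def consumer():
    print([receive(k) for k in range(10)])

a, b = threading.Thread(target=producer), threading.Thread(target=consumer)
a.start(); b.start(); a.join(); b.join()
```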
essy in [Patt90]. Architectures that employ multiple CPUs to achieve task-level parallelism fall into the shared memory, message passing, or dataflow paradigms. The Stanford DASH multiprocessor [Len92] is a shared memory machine, whereas the Thinking Machines CM-5 falls into the message passing category. The MIT Monsoon machine [Pap90] is an example of a dataflow architecture.

Although the hardware for the design of such multiple processor machines (the memory, interconnect network, I/O, etc.) has received much attention, software for such machines has not been able to keep up with the hardware development. Efficient partitioning of a general program (written in C, say) across a given set of processors arranged in a particular configuration is still an open problem. Detecting parallelism (given the overspecified sequencing in popular imperative languages like C), managing overhead due to communication and synchronization between processors, and the requirement of dynamic load balancing for some programs (an added source of overhead) make the partitioning problem for a general program hard.

If we turn away from general purpose computation to application-specific domains, however, parallelism is easier to identify and exploit. For example, one of the more extensively studied families of such application-specific parallel processors is the systolic array architecture [Kung88][Quin84][Rao85]; this architecture consists of regularly arrang
etain the processor assignment σ and the ordering of actors on each processor as specified by the fully static schedule, and discard the precise timing information it specifies. Although we may start out by setting start(v, 0) as in the fully static schedule, the subsequent start(v, k) values are determined at runtime based on the availability of data at the inputs of each actor; the average iteration period of a self-timed schedule is represented by T_ST. We analyze the evolution of a self-timed schedule further in Chapter 4.

THE ORDERED TRANSACTION STRATEGY

The self-timed scheduling strategy of Chapter 1 introduces synchronization checks when processors communicate; such checks permit variations in actor execution times, but they also imply run-time synchronization and arbitration costs. In this chapter we present a hardware architecture approach, called Ordered Transactions (OT), that alleviates some of these costs, and in doing so trades off some of the run-time flexibility afforded by the ST approach. The ordered transactions strategy was first proposed by Bier, Lee, and Sriram [Lee90][Bier90]. In this chapter we describe the idea behind the OT approach, and then we discuss the design and hardware implementation of a shared bus multiprocessor that makes use of this strategy to achieve low-cost interprocessor communication using simple hardware. The software environment for this board is provided by the Ptolemy system developed at the
etween communication actors. For example, the constraints

O = (s1, r1, s2, r2, s3, r3, s4, r4, s5, r5, s6, r6),

applied to the IPC graph of Fig. 4.4, are represented by the graph in Fig. 4.5.

Figure 4.5. Transaction ordering constraints applied to the IPC graph of Fig. 4.4 (legend: transaction ordering constraints; critical cycle).

If we call these additional ordering constraint edges E_OT (solid arrows in Fig. 4.5), then the graph (V, E_ipc ∪ E_OT) represents the constraints in the OT schedule as it evolves in Fig. 4.3. Thus the maximum cycle mean of (V, E_ipc ∪ E_OT) represents the effect of adding the ordering constraints. The critical cycle C of this graph is drawn in Fig. 4.5; it is different from the critical cycle in Fig. 4.4, because of the added transaction ordering constraints. Ignoring communication costs, the MCM is 9 units, which was also observed from the evolution of the transaction-constrained schedule in Fig. 4.3.

The problem of finding an optimal transaction order can therefore be stated as: determine a transaction order O such that the resultant constraint edges E_OT do not increase the MCM, i.e., MCM((V, E_ipc)) = MCM((V, E_ipc ∪ E_OT)).

4.4 Periodicity

We noted earlier that as the ST schedule in Fig. 4.2 evolves, it eventually settles into a periodic repeating pattern that spans two iterations of the dataflow graph. It can be shown that a ST schedule always settles down into a periodic execution pattern; in [Bacc92] the authors show that
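The problem statement above suggests a direct check for a candidate order. The sketch below is our own: it adds a zero-delay constraint edge between consecutive transactions and, as an assumption of this sketch, a single-delay edge from the last transaction back to the first (modeling the repetition of the order in every iteration), then compares maximum cycle means by cycle enumeration.

```python
# Sketch: does a candidate transaction order preserve the estimated MCM?
import networkx as nx

def max_cycle_mean(edge_list, exec_time):
    g = nx.DiGraph()
    for (u, v, d) in edge_list:
        if g.has_edge(u, v):
            g[u][v]["delay"] = min(g[u][v]["delay"], d)  # keep tighter edge
        else:
            g.add_edge(u, v, delay=d)
    best = 0.0
    for cyc in nx.simple_cycles(g):
        pairs = list(zip(cyc, cyc[1:] + cyc[:1]))
        delay = sum(g[u][v]["delay"] for (u, v) in pairs)
        if delay > 0:                    # a delay-free cycle means deadlock
            best = max(best, sum(exec_time[v] for v in cyc) / delay)
    return best

def order_preserves_mcm(ipc_edges, exec_time, order):
    e_ot = [(a, b, 0) for a, b in zip(order, order[1:])]
    e_ot.append((order[-1], order[0], 1))   # order repeats every iteration
    return (max_cycle_mean(ipc_edges + e_ot, exec_time)
            == max_cycle_mean(ipc_edges, exec_time))
```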
execution path in the graph, and we can estimate the time of occurrence of bus accesses corresponding to each combination from the quasi-static schedule. For example, bus accesses corresponding to one schedule period of the two execution paths in the quasi-static schedule of Fig. 6.6 may be marked along the time axis as shown in Fig. 6.10 (we have ignored the bus access sequence corresponding to subgraph 1 to keep the illustration simple).

The bus access schedules for each of the combinations can now be collapsed into one annotated list, as in Fig. 6.10; the fact that accesses for each combination are ordered with respect to time allows us to enforce a global order on the accesses in the collapsed bus access list. The bus accesses in the collapsed list are annotated with their respective Boolean condition.

Figure 6.10. Bus access lists for the c = TRUE and c = FALSE paths, and the annotated list corresponding to Fig. 6.6 (entries of the form (condition, processor), e.g., (c, Proc2), (c, Proc1), (c, Proc3), ...).
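The collapsing step is essentially a time-ordered merge of the per-combination access lists. A minimal Python sketch of our own (times, processor names, and conditions are hypothetical):

```python
# Merge per-combination access schedules into one annotated list.
import heapq

def annotated_list(schedules):
    # schedules: {condition: [(time, proc), ...]}, each list time-sorted
    merged = heapq.merge(*([(t, proc, cond) for (t, proc) in accs]
                           for cond, accs in schedules.items()))
    return [(cond, proc) for (t, proc, cond) in merged]

sched = {"c":  [(0, "Proc2"), (2, "Proc1"), (5, "Proc3")],
         "!c": [(1, "Proc2"), (3, "Proc1"), (4, "Proc3")]}
print(annotated_list(sched))
# [('c','Proc2'), ('!c','Proc2'), ('c','Proc1'),
#  ('!c','Proc1'), ('!c','Proc3'), ('c','Proc3')]
```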
execution times are equal to their compile-time estimates. O was calculated using t_C = 3 in section 4.5, and, sure enough, T_OT(t_C) = T_ST(t_C) when t_C = 3.

Figure 4.10. T_ST(t_C) and T_OT(t_C).

Modeling using random variables for the OT schedule can again be done as before, and since we have more constraints in this schedule, the expected iteration period will in some cases be larger than that for an ST schedule.

Figure 4.9. G_ipc with transaction ordering constraints represented as dashed lines.

4.7 Summary

In this chapter we presented a quantitative analysis of ST and OT schedules and showed how to determine the effects of imposing a transaction order on an ST schedule.

If the actual execution times do not deviate significantly from the estimated values, the difference in performance of the ST and OT strategies is minimal. If the execution times do in fact vary significantly, then even an ST strategy is not practical; it then becomes necessary to use a more dynamic strategy, such as static assignment or fully dynamic scheduling [Lee89], to make the best use of computing resources. Under the assumption that the variations in execution times are small enough so that an ST or an OT strategy is viable, we argue that it is in fact wiser to use the OT strategy rather than ST, because of the cheaper IPC of the OT strategy. This is because we can determine the transaction order O such that the orderi
f actor execution times.

Definition 4.4: A function f(x) is said to be convex over an interval (a, b) if, for every x1, x2 ∈ (a, b) and 0 < λ < 1,

f(λ x1 + (1 - λ) x2) ≤ λ f(x1) + (1 - λ) f(x2).

Geometrically, if we plot a convex function f(x) along x, a line drawn between two points on the curve lies above the curve (but it may overlap sections of the curve). It is easily verified geometrically that T_ST(t) is convex: since this function is piecewise linear with a slope that is positive and non-decreasing, a line joining two points on it must lie above (but may coincide with) the curve.

We can also plot T as a function of execution times of more than one actor, e.g., T_ST(t_A, t_B); this function will be a convex surface consisting of intersecting planes. Slices of this surface along each variable look like Fig. 4.8, which is a slice parallel to the t_C axis with the other execution times held constant (t_A = 3, t_B = 3, etc.).

The modelling described in this section is useful for determining how sensitive the iteration period is to fixed changes in execution times of actors, given a processor assignment and actor ordering. We observe that the iteration period increases linearly (with slope one at worst, and no change at all at best) when the execution time of an actor is increased beyond its compile-time estimate.

4.6.2 Modeling run-time variations in execution times

The effect of variations in execution
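The piecewise-linear, convex behaviour described in this section is easy to reproduce numerically. The sketch below is our own and sweeps t_C over a small hypothetical graph (not the graph of Fig. 4.7) whose three cycles produce slopes of 0, 1/2, and 1.

```python
# Sweep t_C and report the maximum cycle mean at each point.
import networkx as nx

def iteration_period(edges, exec_time):
    g = nx.DiGraph()
    for (u, v, d) in edges:
        g.add_edge(u, v, delay=d)
    return max(sum(exec_time[v] for v in c) /
               sum(g[u][w]["delay"] for (u, w) in zip(c, c[1:] + c[:1]))
               for c in nx.simple_cycles(g))

edges = [("X", "Y", 0), ("Y", "X", 1),   # constant cycle: mean 7
         ("C", "D", 0), ("D", "C", 2),   # two delays: mean (13 + t_C)/2
         ("C", "E", 0), ("E", "C", 1)]   # one delay:  mean 2 + t_C
for t_c in (0, 1, 5, 9, 12):
    print(t_c, iteration_period(edges, {"X": 3, "Y": 4, "D": 13,
                                        "E": 2, "C": t_c}))
# 7, 7, 9, 11, 14: flat until t_C = 1, slope 1/2 until 9, then slope 1
```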
f the semaphores are in their correct state and the processor gets the bus immediately upon request. This translates to 8 of the 380 instructions per processor in the example of Chapter 1, section 1.4, which considered processing samples of a high-quality audio signal at a sampling rate of 44 kHz on processors running on a 60 ns clock. Such a high cost of communication forces the scheduler to insert as few interprocessor communication nodes as possible, which in turn limits the amount of parallelism that can be extracted from the algorithm.

One solution to this problem is to send more than one data sample when a processor gets access to the bus; the arbitration and synchronization costs are then amortized over several data samples. A scheme to vectorize data in this manner has been proposed in [Zivo94], where the authors use retiming [Lei91] to move delays in the HSDFG such that data can be moved in blocks instead of one sample at a time. There are several problems with this strategy. First, retiming HSDFGs has to be done very carefully: moving delays across actors can change the initial state of the HSDFG, causing undesirable transients in the algorithm implementation. This can potentially be solved by including preamble code to compute the value of the sample corresponding to the delay when that delay is moved across actors. This, however, results in increased code size and other associated code generation complications. Second, the work of Zi
for the prototype is a DSP56000-based DSP board called the S-56X card, manufactured by Ariel Corp. [Ariel91]. The S-56X card is designed to fit into one of the Sbus slots in a Sun Sparc workstation; a user-level process can communicate with the S-56X card via a UNIX device driver. Thus the OMA board, too, can be controlled via the S-56X card by a user process running on the workstation. The host interface configuration is depicted in Fig. 3.9.

Unlike the DSP56000 processors, the DSP96002 processors do not have built-in serial ports, so the S-56X board is also used as a serial I/O processor for the OMA board. It essentially performs serial-to-parallel conversion of data, buffering of data, and interrupt management. The Xilinx on the OMA board implements the necessary transmit and receive registers and synchronization flags; we discuss the details of the Xilinx circuitry shortly. The S-56X card communicates with the Sparc Sbus using DMA (direct memory access).

A part of the DSP56000 bus and control signals are brought out of the S-56X card through another Xilinx FPGA (XC3040) on the S-56X. For the purpose of interfacing the S-56X board with the OMA board, the Xilinx on the S-56X card is configured to bring out 16 bits of data and 5 bits of address from the

Figure 3.9. Host interface: the XC3090 on the OMA board connects over a cable (5-bit address, 16-bit data) to the XC3040 and the DSP56000's SSI serial I/O port on the S-56X card.
g. The entire operation takes 1.0 ms; thus we achieve a speedup

Figure 3.18. (a) Hierarchical block diagram for a 15-band analysis and synthesis filter bank (with delay blocks feeding the synthesis stage). (b) Schedule on four processors using Sih's DL heuristic [Sih90].

of 3 over a single processor. This example is communication-intensive; the throughput is limited by the available bus bandwidth. Indeed, if all processors had independent access to the shared memory (if the shared memory were 4-ported, for example), we could achieve an ideal speedup of four, because each 256-point FFT is independent of the others except for data input and output.

For this example, data partitioning, shared memory allocation, scheduling, and tuning the assembly program were done by hand, using the 256-point complex FFT block in the Ptolemy CG96 domain as a building block. The Gantt chart for the hand-generated schedule is shown in Fig. 3.19.

Figure 3.19. Schedule for the FFT example (each processor reads 256 of the 1024 complex values and writes its result back to shared memory).

3.8 Summary

In this chapter we discussed the ideas behind the Ordered Transactions scheduling strategy. This strategy combines compile-time analysis of the IPC pattern with simple hardware support to minimize interprocessor communication overhead. We discussed the hardware design and implementation details of a prototype shared bus multiprocessor
gle shared bus and shared memory for inter-processor communication. This kind of shared memory architecture is attractive for embedded multiprocessor implementations owing to its relative simplicity and low hardware cost, and to the fact that it is moderately scalable: a fully interconnected processor topology, for example, would not only be much more expensive than a shared bus topology, but would also suffer from its limited scalability. Bus bandwidth limits scalability in shared bus multiprocessors, but for medium-throughput applications (digital audio, music, etc.) and the size of the machine we are considering, a shared bus is ideal. We propose to solve the scalability problem by using multiple busses and hierarchies of busses, for which the ideas behind the OMA architecture directly apply; we refer to Lee and Bier [Lee90] for how the OMA concept is extended to such hierarchical bus structures.

From Fig. 1.3 we recall that the self-timed scheduling strategy falls naturally into a message passing paradigm, which is implemented by the send and receive primitives inserted in the HSDFG. Accordingly, the shared memory in an architecture implementing such a scheduling strategy is used solely for message passing: the send primitive corresponds to writes to shared memory locations, and receive primitives correspond to reads from shared memory. Thus the shared memory is not used for storing shared data structures or for storing shared program code. I
granted only if the condition corresponding to that access evaluates to TRUE. Thus, if the entry ⟨c1⟩⟨Proc1⟩ appears at the head of the bus access list, and c1 = b1 AND (NOT b2), then processor 1 receives a bus grant only if the control token b1 is TRUE and b2 is FALSE; otherwise, the bus grant is masked, and the schedule counter moves up to the next entry in the list.

This scheme can be incorporated into the transaction controller in our existing OMA architecture prototype, since the controller is implemented on an FPGA. The product terms c1, c2, ..., ck may be programmed into the FPGA at compile time: when we generate programs for the processors, we can also generate the annotated bus access list (a sequence of ⟨Condition⟩⟨Proc ID⟩ entries) and a hardware description for the FPGA (in VHDL, say) that implements the required product terms.
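A behavioural model of this masked controller is straightforward. In the Python sketch below (ours; list entries, token names, and product terms are illustrative), the schedule counter advances past entries whose product term evaluates to FALSE.

```python
# Behavioural model of the masked transaction controller.
def product_term(term, tokens):
    # term such as [("b1", True), ("b2", False)] means b1 AND (NOT b2)
    return all(tokens[name] == want for (name, want) in term)

def next_grant(access_list, tokens, pos):
    while True:                          # assumes some entry matches
        term, proc = access_list[pos % len(access_list)]
        pos += 1
        if product_term(term, tokens):
            return proc, pos             # grant the bus to this processor

access = [([("b1", True), ("b2", False)], "Proc1"),
          ([("b1", False)], "Proc2"),
          ([], "Proc3")]                 # unconditional entry
print(next_grant(access, {"b1": True, "b2": False}, 0))   # ('Proc1', 1)
print(next_grant(access, {"b1": False, "b2": True}, 0))   # ('Proc2', 2)
```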
gth. Such a hierarchical scheme effectively handles nested control constructs (e.g., nested conditionals).

One important aspect of quasi-static scheduling is determining execution profiles of dynamic constructs. Ha [Ha92] studies this problem in detail and shows how one can determine optimal profiles for constructs such as conditionals, data-dependent iteration constructs, and recursion, assuming certain statistics are known about the run-time behaviour of these constructs. We will consider only the conditional and the iteration construct here.

We will assume that we are given a quasi-static schedule, obtained either manually or using Ha's techniques. We then explore how the techniques proposed in the previous chapters, for multiprocessors that utilize a self-timed scheduling strategy, apply when we implement a quasi-static schedule on a multiprocessor. First we propose an implementation of a quasi-static schedule on a shared memory multiprocessor, and then we show how we can implement the same program on the OMA architecture, using the hardware support provided in the OMA prototype for such an implementation.

6.2 Parallel implementation on shared memory machines

6.2.1 General strategy

A quasi-static schedule ensures that the pattern of processor availability is identical regardless of how the data-dependent construct executes at runtime: in the case of the conditional construct, this means that irrespective of which branch is act
he dataflow graph and its schedule that was introduced in Chapter 1 (Fig. 1.2), and that we enforce the transaction order of Fig. 3.1; we reproduce these for convenience in Fig. 4.1(a) and (b). If we observe how the schedule evolves as it is executed in a self-timed manner (essentially a simulation, in time, of when each processor executes the actors as-

Figure 4.1. Fully static schedule on five processors: (a) HSDFG G; (b) schedule and transaction order (execution times: A, B, F: 3 each; C, H: 2; D: 6; E: 4; G: 2; transaction order O = (s1, r1, s2, r2, s3, r3, s4, r4, s5, r5, s6, r6)).

signed to it), we get the unfolded schedule of Fig. 4.2: successive iterations of the HSDFG overlap in a natural manner. This is of course an idealized scenario where IPC costs are ignored; we do so to avoid unnecessary detail in the diagram, since IPC costs can be included in our analysis in a straightforward manner. Note that the ST schedule in Fig. 4.2 eventually settles into a periodic pattern consisting of two iterations of the HSDFG; the average iteration period under the self-timed schedule is 9 units.

The average iteration period, which we will refer to as T_ST for such an idealized (zero IPC cost) self-timed schedule, represents a lower bound on the iteration period achievable by any schedule that maintains the same processor assignment and actor ordering. This is because the only run-time constraint on p
he dataflow graph to overlap. It also avoids having to consider dataflow edges that have delay. The technique that we present for removing redundant synchronizations can be viewed as a generalization of Shaffer's algorithm to handle delays and overlapped, iterative execution, and we will discuss this further in section 5.6. The other major techniques that we present for optimizing synchronization, handling the feedforward edges of the synchronization graph (to be defined in section 5.4.2), discussed in section 5.7, and resynchronization, defined and addressed in section 5.9 and the appendix, are fundamentally different from Shaffer's technique, since they address issues that are specific to our

Figure 5.1. (a) An HSDFG. (b) A three-processor self-timed schedule for (a). (c) An illustration of execution under the placement of barriers.

more general context of overlapped, iterative execution.

As discussed in Chapter 1, section 1.2.2, a multiprocessor executing a self-timed schedule is one where each processor is assigned a sequential list of actors (some of which are send and receive actors), which it executes in an infinite loop. When a processor executes a commun
121. hen discard the timing informa tion and the restrictions of maintaining a processor availability profile Instead we only retain the assignment of actors to processors the order in which they execute and also under what conditions on the Boolean tokens in the system the actor should execute Synchronization between processors is done at run time whenever processors communicate This scheme is analogous to constructing a self timed schedule from a fully static schedule as discussed in section 1 2 2 Thus the quasi static schedule of Fig 6 4 can be implemented by the set of programs in Fig 6 5 for the three processors Here r 1 e2 Tas ry are the receive actors and Proc 1 Proc 2 Proc 3 A B D receive C fo lt send c S gt receive c Igo if c C if c E if c H receive r4 t send s4 else F G L else else send So K lt code for subgraph 1 gt receive r2 lt code for subgraph 1 gt J lt code for subgraph 1 gt Figure 6 5 Programs on three processors for the quasi static schedule of Fig 6 4 51 51 59 are the send actors The subscript c refers to actors that communi 153 cate control tokens The main difference between such an implementation and the self timed implementation we discussed in earlier chapters are the control tokens Whenever a conditional construct is partitioned across more than one processor the control token s that determine its beha
122. hich at run time we use the same terminology in this thesis To reduce run time computation costs it is advan tageous to perform as many of the three scheduling tasks as possible at compile time especially in the context of algorithms that have hard real time constraints Which of these can be effectively performed at compile time depends on the infor mation available about the execution time of each actor in the HSDFG For example dataflow computers first pioneered by Dennis Denn80 per form the assignment step at compile time but employ special hardware the token match unit to determine at runtime when actors assigned to a particular proces sor are ready to fire The runtime overhead of token matching and dynamic sched uling within each processor is fairly severe so much so that dataflow architectures have not been commercially viable even with expensive hardware support for dynamic scheduling performance of such computers has been unim pressive The performance metric of interest for evaluating schedules is the average iteration period T the average time it takes for all the actors in the graph to be executed once Equivalently we could use the throughput y i e the number of iterations of the graph executed per unit time as a performance metric Thus an optimal schedule is one that minimizes T In this thesis we focus on scheduling strategies that perform both processor assignment and actor ordering at compile time because
hould ideally be taken into account during scheduling. The methods of Lee and Ha, and also the prior research related to quasi-static scheduling that they refer to in their work, do not take this cost into account. Static multiprocessor scheduling applied to graphs with dynamic constructs, taking the costs of distributing control tokens into account, is thus an interesting problem for further study.

6.2.2 Implementation on the OMA

Recall that the OMA architecture imposes an order in which shared memory is accessed by the processors in the machine. This is done to implement the OT strategy, and is feasible because the pattern of processor communications in a self-timed schedule of an HSDF graph is in fact predictable. What happens when we want to run a program derived from a quasi-static schedule, such as the parallel program in Fig. 6.5, which was derived from the schedule in Fig. 6.4? Clearly, the order of processor accesses to shared memory is no longer predictable: it depends on the outcome of the run-time evaluation of the control token c. The quasi-static schedule of Fig. 6.4 specifies the schedules for the TRUE and FALSE branches of the conditional. If the value of c were always TRUE, then we can determine from the quasi-static schedule that the transaction order would be (s_c, r_c1, r_c2, s1, r1, ⟨access order for subgraph 1⟩), and if the value of c were always FALSE, the transaction order would be (s_c, r_c1, r_c2, s2, r2, ⟨access order for subgraph 1⟩)
how that by adding synchronization edges through a certain simple procedure, the synchronization graph can be transformed into a strongly connected graph, in a way that the overhead of implementing the extra synchronization edges is always compensated by the savings attained by being able to avoid the use of UBS. That is, our transformations ensure that the total number of synchronization accesses required per iteration period for the transformed graph is less than or equal to the number of synchronization accesses required for the original synchronization graph. Through a practical example, we show that this transformation can significantly reduce the number of required synchronization accesses. Also, we discuss a technique to compute the delay that should be added to each of the new edges added in the conversion to a strongly connected graph. This technique com-

Figure 5.7. The synchronization graph of Fig. 5.6(d) after all redundant synchronization edges are removed (legend: synch edges; internal edges).

putes the delays in a way that the estimated throughput of the IPC graph is preserved, with minimal increase in the shared memory storage cost required to implement the communication edges.

5.7.1 Adding edges to the synchronization graph

Fig. 5.8 presents our algorithm for transforming a synchronization graph that is not strongly connected into a strongly connected graph. This algorithm s
125. ication actor it synchronizes with the pro cessor s it communicates with Thus exactly when a processor executes each actor depends on when at run time all input data for that actor is available unlike the fully static case where no such run time check is needed In this chapter we use processor in slightly general terms a processor could be a programmable com ponent in which case the actors mapped to it execute as software entities or it could be a hardware component in which case actors assigned to it are imple mented and execute in hardware See Kala93 for a discussion on combined hard ware software synthesis from a single dataflow specification Examples of application specific multiprocessors that use programmable processors and some form of static scheduling are described in Bork88 Koh90 which were also dis cussed in Chapter 1 section 1 3 Inter processor communication between processors is assumed to take place via shared memory Thus the sender writes to a particular shared memory location and the receiver reads from that location The shared memory itself could be global memory between all processors or it could be distributed between pairs of processors as a hardware FIFO queues or dual ported memory for example Each inter processor communication edge in our HSDFG thus translates into a buffer of a certain size in shared memory Sender receiver synchronization is also assumed to take place by setting flags in shar
126. …signal remains asserted until an instruction that does not access the shared bus is executed. We can therefore use the BR signal to determine when a shared bus owner has completed its use of the shared bus (Fig. 3.6(a)).

Figure 3.6. Using processor bus arbitration signals for controlling bus access: (a) timing of bus request, bus acceptance, and bus release for one processor, with the bus grant deasserted any number of consecutive bus cycles after the bus is released; (b) the BR signals (BR0 through BRn, with grants BG0 through BGn) multiplexed onto a common BR line.

The rising edge of the BR line is used to detect when a processor releases the bus. To reduce the number of signals going from the processors to the controller, we multiplexed the BR signals from all processors onto a common BR signal. The current bus owner has its BR output enabled onto this common reverse signal; this provides sufficient information to the controller, because the controller only needs to observe the BR line from the current bus owner. This arrangement is shown in Fig. 3.6(b): the controller grants access to a processor by asserting the corresponding BG line, and it then waits for a rising edge on the reverse BR line. On receiving a positive-going edge on this line, it grants the bus to the next processor in its list.

3.4.2.2 A simple implementation

One straightforward implementation of the above functionality is to use a counter addressing a RAM that stores…
127. …simply chains together the source SCCs and, similarly, chains together the sink SCCs. The construction is completed by connecting the first SCC of the source chain to the last SCC of the sink chain with an edge that we call the sink-source edge. From each source or sink SCC, the algorithm selects a vertex that has minimum execution time to be the chain link corresponding to that SCC. Minimum execution time vertices are chosen in an attempt to minimize the amount of delay that must be inserted on the new edges to preserve the estimated throughput of the original graph. In section…

Function Convert-to-SC-graph
Input: A synchronization graph G that is not strongly connected.
Output: A strongly connected graph obtained by adding edges between the SCCs of G.
1. Generate an ordering C_1, C_2, …, C_m of the source SCCs of G, and similarly, generate an ordering D_1, D_2, …, D_n of the sink SCCs of G.
2. Select a vertex v_1 in C_1 that minimizes t(.) over C_1.
3. For i = 2, 3, …, m:
   Select a vertex v_i in C_i that minimizes t(.) over C_i.
   Instantiate the edge d_0(v_(i-1), v_i).
   End For
4. Select a vertex w_1 in D_1 that minimizes t(.) over D_1.
5. For i = 2, 3, …, n:
   Select a vertex w_i in D_i that minimizes t(.) over D_i.
   Instantiate the edge d_0(w_(i-1), w_i).
   End For
6. Instantiate the edge d_0(w_n, v_1).

Figure 5.8. An algorithm for converting a synchronization graph that is not strongly connected into a strongly connected graph.
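A minimal Python sketch of this chaining procedure, assuming the synchronization graph is a networkx DiGraph whose nodes carry an execution-time estimate in an attribute named "t"; the function name, the attribute name, and the use of networkx are our own illustration, not part of the thesis:

import networkx as nx

def convert_to_sc_graph(g):
    # Condense g into its DAG of SCCs; each condensation node records
    # the set of original vertices in its "members" attribute.
    cond = nx.condensation(g)
    members = dict(cond.nodes(data="members"))
    sources = [members[n] for n in cond if cond.in_degree(n) == 0]
    sinks = [members[n] for n in cond if cond.out_degree(n) == 0]
    # Chain link for each source/sink SCC: its minimum-execution-time vertex.
    pick = lambda scc: min(scc, key=lambda v: g.nodes[v]["t"])
    v = [pick(s) for s in sources]
    w = [pick(s) for s in sinks]
    new_edges = [(v[i - 1], v[i]) for i in range(1, len(v))]
    new_edges += [(w[i - 1], w[i]) for i in range(1, len(w))]
    new_edges.append((w[-1], v[0]))      # the sink-source edge
    g.add_edges_from(new_edges, delay=0)  # delays chosen later (section 5.7.2)
    return new_edges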
128. …reading and writing of data values into the buffer represented by that edge, and (2) synchronization between the sender and the receiver. As mentioned before, we assume the use of shared memory for the purpose of synchronization; the synchronization operation itself must be implemented using some kind of software protocol between the sender and the receiver. We discuss these synchronization protocols shortly.

5.2.1 Estimated throughput

Recall from Eqn. 4.3 that the average iteration period corresponding to a self-timed schedule with an IPC graph G_ipc is given by the maximum cycle mean of the graph, MCM(G_ipc). If we only have execution time estimates available instead of exact values, and we set the execution times t(v) of actors to be equal to these estimated values, then we obtain the estimated iteration period by computing MCM(G_ipc). Henceforth, we will assume that we know the estimated throughput (the reciprocal of the estimated iteration period), calculated by setting the t(v) values to the available timing estimates. In all the transformations that we present in the rest of the chapter, we will preserve the estimated throughput by preserving the maximum cycle mean of G_ipc, with each t(v) set to the estimated execution time of v. In the absence of more precise timing information, this is the best we can hope to do.

5.3 Strongly connected components and buffer size bounds

In dataflow semantics, the edges between actors represent infinite buffers. Accordingly, the…
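For small graphs, the maximum cycle mean can be computed by brute force, which is enough to check the throughput-preservation property of the transformations discussed here. The sketch below assumes the same networkx-based encoding as the earlier sketch (node attribute "t" for execution time, edge attribute "delay"); efficient algorithms exist as well, for example the one cited in section 5.7.2:

import networkx as nx

def max_cycle_mean(g):
    # MCM = max over simple cycles of (sum of t on cycle) / (sum of delay).
    best = 0.0
    for cyc in nx.simple_cycles(g):
        t_sum = sum(g.nodes[v]["t"] for v in cyc)
        d_sum = sum(g.edges[cyc[i], cyc[(i + 1) % len(cyc)]]["delay"]
                    for i in range(len(cyc)))
        if d_sum == 0:
            return float("inf")  # a delay-free cycle: the graph deadlocks
        best = max(best, t_sum / d_sum)
    return best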
129. …period before the positive-going edge of BR arrives, wait states may have to be inserted in the processor bus cycle to delay the positive edge of BR. We found that a 10-bit wide counter does not require any wait states, and it allows a maximum of 1024 processor IDs in the access order list.

Figure 3.7. Ordered Transaction Controller implementation: the schedule counter addresses the schedule RAM, which contains the access list (processor IDs); the decoded and latched RAM output produces the BG lines.

3.4.2.3 Presettable counter

A single bus access list implies that we can enforce only one bus access pattern at run time. In order to allow some run-time flexibility, we have implemented the OMA controller using a presettable counter. The processor that currently owns the bus can preset this counter by writing to a certain shared memory location. This causes the controller to jump to another location in the schedule memory, allowing multiple bus access schedules to be maintained in the schedule RAM, and allowing switching between them at run time depending on the outcome of computations in the program. The counter appears as an address in the shared memory map of the processors. The presettable counter mechanism is shown in Fig. 3.8.

An arbitrary number of lists may, in principle, be maintained in the schedule memory. This feature can be used to support algorithms that display data…
130. …wait states in the processor memory cycle by de-asserting the TA line. Whenever a particular FIFO is accessed, its Buffer Full line is enabled onto the TA line of the processors (Fig. 3.4). Thus a full FIFO automatically blocks the processor trying to write into it, and no polling needs to be done by the sender. A receiver read is local to a processor and does not consume shared bus bandwidth. The receiver can be made either to poll the FIFO empty line to check for an empty queue, or one can use the same TA signal mechanism to block processor reads from an empty queue. The TA mechanism will then use the local bus control signals (the local TA signal, the local address bus, etc.). This is illustrated in Fig. 3.4.

Use of such a distributed shared memory mechanism has several advantages. First, the shared bus traffic is effectively halved, because only writes need to go through the shared bus. Second, in the design of Fig. 3.2, when a processor that is granted the bus is delayed in completing its shared memory access, all other processors waiting for the bus get stalled; this does not happen for half the transactions in the modified design of Fig. 3.3, because receiver reads are local. Thus there is more tolerance to variations in the time at which a receiver reads data sent to it. Last, a processor can broadcast data to all or any subset of processors in the system by simultaneously writing to more than one FIFO buffer. Such broadcast…
131. …rithm discussed in pp. 94-97 of [Law76] to return the cycle mean of the critical cycle in Z; if one or more cycles exist that have zero path delay, then BellmanFord returns ∞. Details of this procedure can be found in [Bhat95a].

Fig. 5.13 illustrates a solution obtained from DetermineDelays. Here we assume that t(v) = 1 for each vertex v, and we assume that the set of communication edges is {e_1, e_2}. The grey dashed edges are the edges added by Convert-to-SC-graph. We see that MCM is determined by the cycle in the sink SCC of the original graph, and inspection of this cycle yields MCM = 4. The solution determined by DetermineDelays for Fig. 5.13 is one delay on e_1 and one delay on e_2. The resulting self-timed buffer bounds of e_1 and e_2 are respectively 1 and 2; the total buffer size for the communication edges is thus 3 (the sum of the self-timed buffer bounds).

DetermineDelays can be extended to yield heuristics for the general case in which the original synchronization graph G contains more than one source SCC.

Figure 5.13. An example used to illustrate a solution obtained by algorithm DetermineDelays (legend: new edges, synch edges).

Function DetermineDelays
Input: Synchronization graphs G = (V, E) and G', where G' is the graph computed by Convert-to-SC-graph when applied to G. The ordering of source SCCs generated in Step 1 of Convert-to-SC-graph is denoted C_1, C_2, …, C_m. …
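One standard way to realize such a cycle-mean computation (a sketch, not necessarily the exact procedure of [Law76]) is a binary search over candidate periods: for a candidate λ, reweight each edge as t(src(e)) − λ·delay(e); a positive-weight cycle exists exactly when λ is below the maximum cycle mean, and a zero-delay cycle with positive execution time stays positive for every λ, which corresponds to BellmanFord returning ∞. The edge encoding (u, v, t, delay) over vertices 0..n-1 is our own:

def has_positive_cycle(edges, n, lam):
    # Longest-path Bellman-Ford on weights w = t - lam*delay; if relaxation
    # still succeeds after n passes, a positive-weight cycle exists.
    dist = [0.0] * n
    for _ in range(n):
        changed = False
        for u, v, t, d in edges:
            w = t - lam * d
            if dist[u] + w > dist[v] + 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return False
    return True

def max_cycle_mean_lawler(edges, n, eps=1e-6):
    lo, hi = 0.0, sum(t for _, _, t, _ in edges) + 1.0  # hi exceeds the MCM
    while hi - lo > eps:
        lam = (lo + hi) / 2
        if has_positive_cycle(edges, n, lam):
            lo = lam
        else:
            hi = lam
    return lo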
132. …written in a modular fashion.

Figure 3.12. Connecting multiple boards: busses on different boards can be connected together to put more than four processors on a single bus, or processors can be kept on separate busses with a handshake between busses, which helps the scalability of the system.

The schematics hierarchy is shown in Fig. 3.13. The SIERA design manager DMoct was then used to translate the netlists into an input file acceptable to Racal, a commercial PCB layout tool, which was then used to autoroute the board in ten layers, including one Vcc and one Ground plane. The files corresponding to the traces on each layer (gerber files) were generated using Racal, and these files were then sent to Mosis for board fabrication. Component procurement was mainly done using the FAST service, but some components were bought directly from electronic component distributors. Table 3.1 lists the salient physical features of the board, and Fig. 3.14 shows a photograph of the board.

Figure 3.13. Schematics hierarchy of the four-processor OMA architecture (top_level.sdl root file; clock and reset generation; host-side connector; 512K x 32 SRAM global memory with expansion; Xilinx XC3090 shared bus circuitry; modules reset.sdl, clock.sdl, rightcon.sch, 256ksram.sdl, xilinx.sdl, four_proc.sdl, leftcon.sdl, trans32.sdl, left.sdl; shared bus connectors and address buffers).
133. …like the one proposed in [Schw85] will certainly find the schedule S' directly.

We set the transaction order O' to be the transaction order suggested by the modified schedule S', as opposed to the transaction order from S used in Fig. 3.1; that is, O' is the sequence of send and receive transactions read off from S'. Imposing the transaction order O', as in Fig. 4.6, results in a T_OT of 9 units, instead of the 10 that we get if the transaction order of Fig. 3.1 is used. Under the transaction order specified by S', T_OT is within one unit of T_ST; thus imposing the order O' ensures that the average period is within one unit of the unconstrained ST strategy. Again, unfolding may be required to obtain a transaction-ordered schedule that has a period exactly equal to T_ST, but the extra cost of a larger controller to enforce the transaction ordering outweighs the small gain of at most one unit of reduction in the iteration period. Thus, for all practical purposes, O' is the optimal transaction order. The optimality is in the sense that the transaction order O' we determine statically is the best possible one, given the timing information available at compile time.

4.6 Effects of changes in execution times

We recall that the execution times we use to determine the actor assignment and ordering in a self-timed schedule are compile-time estimates, and we have been stating that static scheduling is advantageous when we have reasonably good compile-time estimates of…
134. …style of computation, where we do not assume precise knowledge of the execution times of tasks, or even the availability of guaranteed worst-case execution times for operations.

The second technique we present in this thesis consists of efficient algorithms that minimize the overall synchronization activity in the implementation of a given self-timed schedule. A straightforward implementation of a self-timed schedule often includes redundant synchronization points, i.e., points where the objective of a certain set of synchronizations is guaranteed as a side effect of other synchronization points in the system. In Chapter 5 we present efficient polynomial-time algorithms for detecting and eliminating such redundant synchronization operations. We relate our techniques to the algorithm proposed by Shaffer [Sha89] for minimizing directed synchronizations in multiprocessors that are essentially executing self-timed schedules, and to the work described by Dietz et al. [Dietz92] for minimizing the number of barrier synchronizations in a multiprocessor that contains barrier synchronization hardware. It is also possible to reduce the overall synchronization cost of a self-timed implementation by adding synchronization points between processors that were not present in the schedule specified originally. In Chapter 5 we present a technique called resynchronization for systematically manipulating synchronization points in this manner. The work described in this…
135. …left and right side connectors follow the same format, so that multiple boards can be easily connected together. Shared control and address busses are buffered before they go off-board via the connectors, and the shared data bus is buffered within the Xilinx.

Figure 3.5. Top-level schematic of the OMA prototype (four processing elements with OnCE ports, BG decode PAL, schedule RAM, bidirectional buffers, left and right side connectors carrying the BG lines, 32-bit data and address busses with control signals such as R/W, TA, and BA, global clock, and XC3090 shared address decode).

The DSP96000 processors have on-chip emulation ("OnCE" in Motorola terminology) circuitry for debugging purposes, whereby a serial interface to the OnCE port of a processor can be used for in-circuit debugging. On the OMA board, the OnCE ports of the four processors are multiplexed and brought out as a single serial port; a host may select any one of the four OnCE ports and communicate with it through a serial interface. We discuss the design details of the individual components of the prototype system next.

3.4.2 Transaction order controller

The task of the transaction order controller is to enforce the predetermined bus access order at run time. A given transaction order determines the sequence of processor bus accesses that must be enforced…
136. …parallel conditional constructs more effectively.

Figure 6.8. Conditional constructs in parallel paths.

The main idea behind masking is to store an ID of a Boolean variable along with the processor ID in the bus access list. The Boolean ID determines whether a particular bus grant is enabled. This allows us to combine the access lists of all the nodes f_1 through f_k and g_1 through g_k. The bus grant corresponding to each f_i is tagged with the Boolean ID of the corresponding b_i, and an additional bit indicates that the bus grant is to be enabled when b_i is TRUE. Similarly, each bus grant corresponding to the access list of g_i is tagged with the ID of b_i, and an additional bit indicates that the bus grant must be enabled only if the corresponding control token has a FALSE value. At run time, the controller steps through the bus access list as before, but instead of simply granting the bus to the processor at the head of the list, it first checks that the control token corresponding to the Boolean ID field of the list is in its correct state. If it is in the correct state (i.e., TRUE for a bus grant corresponding to an f_i, and FALSE for a bus grant corresponding to a g_i), then the bus grant is performed; otherwise it is masked. Thus the run-time values of the Booleans must be made available to the transaction controller for it to decide whether or not to mask a particular bus grant. More generally, a particular bus grant should be enabled by…
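A behavioral sketch of this masking scheme; the entry format (processor ID, Boolean ID, expected value) is our own invention, with a Boolean ID of None marking an unconditional grant:

def masked_grant_sequence(access_list, booleans):
    # Yield the processors actually granted the bus, given the run-time
    # values of the control tokens in `booleans`.
    for proc_id, bool_id, expected in access_list:
        if bool_id is None or booleans[bool_id] == expected:
            yield proc_id  # grant performed
        # otherwise the grant is masked and the controller moves on

# f1's access is enabled when b1 is TRUE, g1's when b1 is FALSE:
access_list = [(2, None, None), (1, "b1", True), (3, "b1", False)]
print(list(masked_grant_sequence(access_list, {"b1": True})))  # [2, 1]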
137. …controlling bus access ... 56
Figure 3.7: Ordered Transaction Controller implementation ... 58
Figure 3.8: Presettable counter implementation ... 59
Figure 3.9: Host interfaces ... 61
Figure 3.10: Processing element ... 62
Figure 3.11: Xilinx configuration at run time ... 64
Figure 3.12: Connecting multiple boards ... 67
Figure 3.13: Schematics hierarchy of the four-processor OMA architecture ... 68
Figure 3.14: OMA prototype board photograph ... 69
Figure 3.15: Steps required for downloading code (tcl script omaDoAll) ... 70
Figure 3.16: Hierarchical specification of the Karplus-Strong algorithm in 28 voices ... 74
Figure 3.17: Four-processor schedule for the Karplus-Strong algorithm in 28 voices; three processors are assigned 8 voices each, and the fourth (Proc 1) is assigned 4 voices along with the noise source ... 75
Figure 3.18: (a) Hierarchical block diagram for a 15-band analysis and synthesis filter bank; (b) schedule on four processors using Sih's DL heuristic ...
Figure 3.19: …schedule for the FFT examples ... 78
Figure 4.1: Fully static s…
(Figures 4.2 through 5.8: captions truncated)
138. …models is the computation graph of Karp and Miller [Karp66]. In their seminal paper, Karp and Miller establish that their computation graph model is determinate, i.e., the sequence of tokens produced on the edges of a given computation graph is unique and does not depend on the order in which the actors in the graph fire, as long as all data dependencies are respected by the firing order. The authors also provide an algorithm that, based on topological and algebraic properties of the graph, determines whether the computation specified by a given computation graph will eventually terminate. Because of the latter property, computation graphs clearly cannot simulate all Turing machines, and hence are not as expressive as a general dataflow language like Lucid or pure LISP. Computation graphs provide some of the theoretical foundations for the SDF model.

Another model of computation relevant to dataflow is the Petri net model [Peter81, Mur89]. A Petri net consists of a set of transitions, which are analogous to actors in dataflow, and a set of places, which are analogous to arcs. Each transition has a certain number of input places and output places connected to it. Places may contain one or more tokens. A Petri net has the following semantics: a transition fires when all its input places have one or more tokens, and upon firing it produces a certain number of tokens on each of its output places.

A large number of different kinds of Petri net models have been proposed…
139. Also, since it is possible to simulate any Turing machine in one of these languages, questions such as deadlock (or, equivalently, terminating behaviour) and determining the maximum buffer sizes required to implement edges in the dataflow graph become undecidable. Several models based on dataflow with restricted semantics have been proposed; these models give up the descriptive power of general dataflow in exchange for proper… …The BDF model can therefore compute all Turing-computable functions, whereas this is not possible in the case of the SDF model. We discuss the Boolean dataflow model further in Chapter 6.

In exchange for the limited expressivity of an SDF representation, we can efficiently check conditions such as whether a given SDF graph deadlocks, and whether it can be implemented using a finite amount of memory. No such general procedures can be devised for checking the corresponding conditions (deadlock behaviour and bounded memory usage) for a computation model that can simulate any given Turing machine. This is because the problems of determining whether an arbitrary Turing machine halts (the halting problem), and determining whether it will use less than a given amount of memory (or tape), are undecidable [Lew81]; that is, no general algorithm exists to solve these problems in finite time.

In this thesis we will first focus on techniques that apply to SDF applications, and we will propose extensions to these techniques for applications that can be s…
140. …ly using a self-timed strategy, each processor now gets several lists of actors, one list for each possible assignment of the control tokens.

Figure 6.3. Acyclic precedence graphs corresponding to the if-then-else graph of Fig. 6.2: (a) corresponds to the TRUE assignment of the control token, (b) to the FALSE assignment.

The problem with this approach is that for a graph with n different control tokens there are 2^n possible distinct APGs, each corresponding to one execution path in the graph. Such a set of APGs can be compactly represented using the so-called Annotated Acyclic Precedence Graph (AAPG) of [Buck93], in which actors and arcs are annotated with the conditions under which they exist in the graph. Buck uses the AAPG construct to determine whether a bounded-length uniprocessor schedule exists. In the case of multiprocessor scheduling, it is not clear how such an AAPG could be used to explore scheduling options for the different values that the control tokens could take, without explicitly enumerating all possible execution paths.

The main work in parallel scheduling of dataflow graphs that have dynamic actors has been the quasi-static scheduling approach, first proposed by Lee [Lee88b] and extended by Ha [Ha92]. In this work, techniques have been developed that statically schedule standard dynamic constructs such as data-dependent conditionals, data-dependent iterations, and recursion. These constructs…
141. …alysis filter bank. There are 1010 instruction cycles of computation per sample period in this example. Using Sih's Dynamic Level (DL) scheduling heuristic, we were able to achieve an average iteration period of 366 instruction cycles, making use of 40 IPCs. The schedule that is actually constructed (Gantt chart of Fig. 3.18(b)) operates on a block of 512 samples, because that many samples are needed before all the actors in the graph fire at least once; this makes manual scheduling very difficult. We found that the DL heuristic performs close to 20% better than the Hu-level heuristic in this example, although the DL heuristic takes more than twice the time to compute the schedule compared to Hu-level.

3.7.3 1024-point complex FFT

For this example, the input data (1024 complex numbers) is assumed to be present in shared memory, and the transform coefficients are written back to shared memory. A single 96002 processor on the OMA board performs a 1024-point complex FFT in 3.0 milliseconds (ms). For implementing the transform on all four processors, we used the first stage of a radix-four decimation-in-frequency FFT computation, after which each processor independently performs a 256-point FFT. In this scheme, each processor reads all 1024 complex inputs at the beginning of the computation, combines them into 256 complex numbers on which it performs a 256-point FFT, and then writes back its result to shared memory using bit-reversed addressing…
142. …machine, the Ordered Memory Access architecture (OMA), that employs the OT idea to achieve low-cost inter-processor communication. The OMA is a shared-bus machine that uses shared memory for IPC. The order in which processors access shared memory for the purpose of communication is predetermined at compile time and is enforced by a bus controller on the board, resulting in a low-cost IPC mechanism without the need for explicit synchronization.

Reducing IPC costs is particularly important for parallel implementation of embedded real-time applications, because there is usually a premium on processor cycles in these situations. For example, if we process audio at a sample rate of 44 kHz on a multiprocessor that consists of processors with a 60-nanosecond cycle time, we have about 380 instructions on average per processor to complete all operations on each audio sample. IPC can potentially waste precious processor cycles, negating some of the benefits of using multiple processors.

In Chapter 4 we present a graph-theoretic scheme for modeling self-timed schedules, using a structure we call the IPC graph, that takes into account the pro… …cation domain for such a scheme. Indeed, the GF11 is employed mainly for scientific computation applications that have a very regular communication and computation structure (linear algebra, finite element methods, FFTs, etc.). Again, the techniques we explore in this thesis apply to an MIMD style…
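A worked check of the instruction budget quoted above, assuming the standard 44.1 kHz audio rate and one instruction per 60 ns cycle:

sample_period = 1 / 44.1e3           # about 22.7 microseconds per sample
instructions = sample_period / 60e-9
print(round(instructions))           # 378, i.e., roughly 380 as quoted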
143. …compile-time information can be made use of for simplifying the hardware in such interconnects is an interesting problem for further study.

In the techniques we proposed in Chapter 5 for minimizing synchronization costs, no assumptions regarding bounds on the execution times of actors in the graph were made. A direction for further work is to incorporate timing guarantees: for example, hard upper and lower execution time bounds, as Dietz, Zaafrani, and O'Keefe use in [Dietz92], and handling of a mix of actors, some of which have guaranteed execution time bounds and some of which have no such guarantees, as Filo, Ku, Coelho Jr., and De Micheli do in [Filo93]. Such guarantees could be used to detect situations in which data will always be available before it is needed for consumption by another processor. Also, execution time guarantees can be used to compute tighter buffer size bounds. As a simple example, consider Fig. 7.1; here the analysis of section 5.8…

Figure 7.1. An example of how execution time guarantees can be used to reduce buffer size bounds.

…throughput may be affected. We showed how such effects of run-time variations in execution times on the throughput of a given schedule can be quantified. The ordered transactions approach also extends to graphs that include constructs with data-dependent firing behaviour. We discussed how conditional constructs and data-dependent iteration constructs can be mapped to the OMA…
144. …Ptolemy system, and is very effective for the assembly code libraries, in which the primitives are written in the assembly language of the target processor (Ptolemy currently supports the Motorola 56000 and 96000 processors). The programmer can provide a good estimate for blocks written in such a library by counting the number of processor cycles each instruction consumes, or by profiling the block on an instruction set simulator.

It is more difficult to estimate execution times for blocks that contain control constructs, such as data-dependent iterations and conditionals, within their body, and when the target processor employs pipelining and caching. Also, it is difficult, if not impossible, for the programmer to provide reasonably accurate estimates of execution times for blocks written in a high-level language, as in the C code generation library in Ptolemy. The solution adopted in the GRAPE system [Lauw90] is to automatically estimate these execution times by compiling the block, if necessary, and running it by itself in a loop on an instruction set simulator for the target processor. To take into account data-dependent execution behaviour, different input data sets can be provided for the block during simulation. Either the worst-case or the average-case execution time is used as the final estimate. The estimation procedure employed by GRAPE is obviously time-consuming…

1.3 Application-specific parallel architectures

There has been…
145. …tion and execution of an algorithm from a block diagram specification. A simple heterogeneous multiprocessor target was also written in Ptolemy for the OMA and S-56X combination; this target generates DSP56000 code for the S-56X card and DSP96000 multiprocessor code for the OMA.

3.6 Ordered I/O and parameter control

We have implemented a mechanism whereby I/O can be done over the shared bus. We make use of the fact that I/O for DSP applications is periodic: samples, or blocks of samples, typically arrive at constant periodic intervals, and the processed output is again required by, say, a D/A convertor at periodic intervals. With this observation, it is in fact possible to schedule the I/O operations within the multiprocessor schedule, and consequently to determine when, relative to the other shared bus accesses due to IPC, the shared bus is required for I/O. This allows us to include bus accesses for I/O in the bus access order list. In our particular implementation, I/O is implemented as shared address locations that address the Tx and Rx registers in the Xilinx chip (section 3.4.5), which in turn communicate with the S-56X board; a processor accesses these registers as if they were a part of shared memory. It obtains access to these registers when the transaction controller grants it the shared bus; bus grants for the purpose of I/O are taken into account when constructing the access order list. Thus we order access to…
146. In section 5.7.2 we discuss the selection of delays for the edges introduced by Convert-to-SC-graph. It is easily verified that algorithm Convert-to-SC-graph always produces a strongly connected graph, and that a conversion to a strongly connected graph cannot be attained by adding fewer edges than the number of edges added by Convert-to-SC-graph. Fig. 5.9 illustrates a possible solution obtained by algorithm Convert-to-SC-graph. Here, the black dashed edges are the synchronization edges contained in the original synchronization graph, and the grey dashed edges are the edges that are added by Convert-to-SC-graph. The dashed edge labeled e is the sink-source edge.

Assuming the synchronization graph is connected, the number of feedforward edges n_f must satisfy n_f ≥ (n_c − 1), where n_c is the number of SCCs. This follows from the fundamental graph-theoretic fact that in a connected graph (V, E), |E| must be at least |V| − 1.

Figure 5.9. An illustration of a possible solution obtained by algorithm Convert-to-SC-graph (legend: new edges, synch edges, internal edges).

Now, it is easily verified that the number of new edges introduced by Convert-to-SC-graph is equal to (n_src + n_snk − 1), where n_src is the number of source SCCs and n_snk is the number of sink SCCs. Thus the number of synchronization accesses per iteration period, S+, that is required to implement the edges intr…
147. In a self-timed strategy, we can further ensure at compile time that each shared memory location is written by only one processor: one way of doing this is simply to assign distinct shared buffers to each of the send primitives (which is the scheme implemented in the Ptolemy environment); as a result, no atomic test-and-set instruction needs to be provided by the hardware.

Let us now consider the implementation of IPC in self-timed schedules on such a shared bus multiprocessor. The sender has to write into shared memory, which involves arbitration costs: it has to request access to the shared bus, and the access must be arbitrated by a bus arbiter. Once the sender obtains access to shared memory, it needs to perform a synchronization check on the shared memory location, to ensure that the receiver has read the data that was written in the previous iteration, in order to avoid overwriting previously written data. Such synchronization is typically implemented using a semaphore mechanism: the sender waits until a semaphore is reset before writing to a shared memory location, and upon writing that shared memory location it sets that semaphore (the semaphore could be a bit in shared memory, one bit for each send operation in the parallel schedule). The receiver, on the other hand, busy-waits until the semaphore is set before reading the shared memory location, and resets the semaphore after completing the read operation. It can easily be verified that this simple…
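A minimal model of this semaphore protocol, with shared memory represented by a Python dict; in a real system the two functions run on different processors, so the busy-wait loops below are only meaningful across concurrent executions (all names are ours):

shm = {"sem": 0, "buf": None}  # one buffer and one flag per send operation

def send(value):
    while shm["sem"] == 1:     # wait until the receiver has drained the buffer
        pass
    shm["buf"] = value         # write the data to the shared location
    shm["sem"] = 1             # then set the semaphore

def receive():
    while shm["sem"] == 0:     # busy-wait until data is available
        pass
    value = shm["buf"]
    shm["sem"] = 0             # reset the semaphore after completing the read
    return value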
148. …on all loops. Li and Malik's technique also depends on the accuracy of the processor model, although one can expect good models to eventually evolve for the DSP chips and microcontrollers that are popular in the market. The problem of estimating the execution times of blocks is central to our being able to effectively employ compile-time design techniques. This problem is an important area of research in itself; the strategies employed in Ptolemy and GRAPE, and those proposed by Li and Malik, are useful techniques, and we expect better estimation techniques to be developed in the future.

…magnitude occasionally occur, due to phenomena such as cache misses, interrupts, user inputs, or error handling. Consequently, tight worst-case execution time bounds cannot generally be determined for such operations; however, reasonably good execution time estimates can in fact be obtained for these operations, so that static assignment and ordering techniques are viable. For such applications, self-timed scheduling is ideal, because the performance penalty due to the lack of dynamic load balancing is overcome by the much smaller run-time scheduling overhead involved when static assignment and ordering are employed.

The estimates for the execution times of actors can be obtained by several different mechanisms. The most straightforward method is for the programmer to provide these estimates when he writes the library of primitive blocks. This strategy is used in the Ptolemy…
149. …since it computes the value of the control token c, whereas the branch after the TRUE path, which bypasses the access list of the FALSE branch, is best taken care of by processor 1, since processor 1 already possesses the bus at the time when the counter needs to be loaded. The schedule counter load operations are easily incorporated into the sequential programs of processors 1 and 2.

The mechanism of switching between bus access orders works well when the number of control tokens is small. But if the number of such tokens is large, then this mechanism breaks down, even if we can efficiently compute a quasi-static schedule for the graph. To see why this is so, consider the graph in Fig. 6.8, which contains k conditional constructs in parallel paths going from the input to the output. The functions f_i and g_i are assumed to be subgraphs that are assigned to more than one processor. In Ha's hierarchical scheduling approach, each conditional is scheduled independently; once scheduled, it is converted into an atomic node in the hierarchy, and a profile is assigned to it. Scheduling of the other conditional constructs can then proceed based on these profiles. Thus the scheduling complexity in terms of the number of parallel paths is O(k) if there are k parallel paths. If we implement the resulting quasi-static schedule in the…
150. …and V. K. Agarwal, "A Comparative Study of DSP Multiprocessor List Scheduling Heuristics," technical report, School of Computer Science, McGill University, 1994.

[Liao95] S. Liao, S. Devadas, K. Keutzer, S. Tjiang, and A. Wang, "Code Optimization Techniques for Embedded DSP Microprocessors," Proceedings of the 32nd Design Automation Conference, June 1995.

[Li95] Y. S. Li and S. Malik, "Performance Analysis of Embedded Software Using Implicit Path Enumeration," Proceedings of the 32nd Design Automation Conference, June 1995.

[Messer88] D. G. Messerschmitt, "Breaking the Recursive Bottleneck," in Performance Limits in Communication Theory and Practice, J. K. Skwirzynski, editor, Chapter 7, Kluwer Academic Publishers, 1988.

[Moll82] M. K. Molloy, "Performance Analysis Using Stochastic Petri Nets," IEEE Transactions on Computers, Vol. C-31, No. 9, September 1982.

[Moto89] DSP96002 IEEE Floating-Point Dual-Port Processor User's Manual, Motorola Inc., 1989.

[Moto90] DSP96000ADS Application Development System Reference Manual, Motorola Inc., 1990.

[Mouss92] F. Moussavi and D. G. Messerschmitt, "Statistical Memory Management for Digital Signal Processing," Proceedings of the 1992 IEEE International Symposium on Circuits and Systems, Vol. 2, pp. 1011-1014, May 1992.

[Mur89] T. Murata, "Petri Nets: Properties, Analysis and Applications," Proceedings of the IEEE, Vol. 77, pp. 39-58, January 1989.

[Ols90] …
151. …need not have explicit synchronization, whereas others require synchronization, which needs to be implemented using either the UBS protocol or the BBS protocol. All communication edges also represent buffers in shared memory. Thus we divide the set of communication edges as E_comm = E_s ∪ E_r, where the edges in E_s need explicit synchronization operations to be implemented, and the edges in E_r need no explicit synchronization. We call the edges in E_s synchronization edges.

Recall that a communication edge (v_i, v_j) of G_ipc represents the synchronization constraint

    start(v_j, k) ≥ end(v_i, k − delay((v_i, v_j))), for all k > delay((v_i, v_j)).    (5.2)

Thus, before we perform any optimization on synchronizations, E_comm = E_s and E_r is empty, because every communication edge represents a synchronization point. In the following sections, however, we describe how we can move certain edges from E_s to E_r, thus reducing the synchronization operations in the final implementation. At the end of our optimizations, the communication edges of the IPC graph fall into either E_s or E_r; at this point, the edges in E_s ∪ E_r in G_ipc represent buffer activity and must be implemented as buffers in shared memory, whereas the edges in E_s additionally represent synchronization constraints and are implemented using the UBS and BBS protocols introduced in the previous section. For the edges in E_s, the synchronization protocol is executed before the buffers corresponding to the communication edge are accessed…
152. …ordering constraints do not sacrifice performance if the execution times of actors are close to their estimates: the OT schedule with O' as the transaction order has an iteration period close to the minimum achievable period T_ST. Thus we make the best possible use of compile-time information when we determine the transaction order O'.

We also presented the complexities involved in modeling run-time variations in the execution times of actors; even highly simplified stochastic models are difficult to analyze precisely. We pointed out bounds that have been proposed in the Petri net literature for the value of the expected iteration period, and we concluded that, although a lower bound is available for this quantity for rather general stochastic models (using Jensen's inequality), tight upper bounds are still not known, except for the trivial upper bound obtained by computing the maximum cycle mean with the maximum execution times of the actors.

MINIMIZING SYNCHRONIZATION COSTS IN SELF-TIMED SCHEDULES

The previous three chapters dealt with the Ordered Transactions strategy, which is a hardware approach to reducing IPC and synchronization costs in self-timed schedules. In this chapter we present algorithms that minimize synchronization costs in the final implementation of a given self-timed schedule, and here we do not assume the availability of any hardware support for employing the OT approach. Recall that the self-timed scheduling strategy introduces synchronization checks whenever processors…
153. The following result, however, tells us that it is always possible to modify any given fully-static schedule so that it performs nearly as well as its self-timed counterpart. Stated more precisely:

Claim 4.1: Given a fully-static schedule S = {σ(v), T_FS}, let T_ST be the average iteration period for the corresponding ST schedule (as mentioned before, T_FS ≥ T_ST). Suppose T_FS > T_ST; then there exists a valid fully-static schedule S' that has the same processor assignment as S and the same order of execution of actors on each processor, but an iteration period of T_ST. That is, S' = {σ'(v), T_ST}, where if actors v_i and v_j are on the same processor and σ(v_i) > σ(v_j), then σ'(v_i) > σ'(v_j). Furthermore, S' is obtained by solving the following set of linear inequalities for σ':

    σ'(v_i) − σ'(v_j) ≤ T_ST × delay((v_i, v_j)) − t(v_i), for each edge (v_i, v_j) in G_ipc.

Proof: Let S' have a period equal to T. Then, under the schedule S', the kth starting time of actor v is given by

    start(v, k) = σ'(v) + kT.    (4.4)

Also, data precedence constraints imply, as in Eqn. 4.2,

    start(v_j, k) ≥ start(v_i, k − delay((v_i, v_j))) + t(v_i), for all (v_i, v_j) in E_ipc.    (4.5)

Substituting Eqn. 4.4 in Eqn. 4.5,

    σ'(v_j) + kT ≥ σ'(v_i) + (k − delay((v_i, v_j)))T + t(v_i), for all (v_i, v_j) in E_ipc.

That is,

    σ'(v_i) − σ'(v_j) ≤ T × delay((v_i, v_j)) − t(v_i), for all (v_i, v_j) in E_ipc.    (4.6)

Note that, by the construction of G_ipc, these constraints are automatically met if v_i and v_j…
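The inequalities of Claim 4.1 form a system of difference constraints, so a feasible set of offsets σ' can be computed with a Bellman-Ford style relaxation; a sketch with our own encoding, in which each IPC edge (u, v) with delay d contributes the constraint sigma[u] − sigma[v] ≤ T·d − t[u]:

def solve_offsets(edges, vertices, T, t):
    # edges: (u, v, delay) per IPC edge u -> v; t: execution time estimates.
    # Each constraint sigma[u] - sigma[v] <= T*delay - t[u] is a shortest-path
    # edge v -> u with that weight; distances from a virtual source solve it.
    sigma = {v: 0.0 for v in vertices}
    for _ in range(len(vertices)):
        changed = False
        for u, v, d in edges:
            w = T * d - t[u]
            if sigma[v] + w < sigma[u] - 1e-12:
                sigma[u] = sigma[v] + w
                changed = True
        if not changed:
            return sigma   # feasible offsets found
    return None            # negative cycle: T is below the achievable period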
154. …extending the OT strategy to models more powerful than SDF, e.g., the Boolean dataflow model. The strategy here is to assume that we have only a small number of control constructs in the SDF graph, and to explore techniques for this case. This allows us to expand the domain of applicability of our techniques without having to deal with the complexity of tackling the general BDF model. Finally, we present our conclusions in Chapter 7 and point out future directions for this research. In the following chapter (Chapter 2) we lay down some of the notation and terminology we follow in the remainder of the thesis.

TERMINOLOGY AND NOTATIONS

In this chapter we introduce terminology and definitions used in the remainder of the thesis. We also formalize the scheduling concepts that were presented intuitively in the previous chapter.

2.1 HSDF graphs and associated graph-theoretic notation

We represent an HSDFG by an ordered pair (V, E), where V is the set of vertices (actors) and E is the set of edges. We refer to the source and sink vertices of a graph edge e by src(e) and snk(e), and we denote the delay (the number of initial tokens) on e by delay(e). We say that e is an output edge of src(e), and that e is an input edge of snk(e). We will also use the notation (v_i, v_j) for an edge directed from v_i to v_j. A path in (V, E) is a finite, non-empty sequence (e_1, e_2, …), where each e_i is a member of E and snk(e_1) = src(e_2), snk(e_2) = src(e_3), …
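This notation maps directly onto a small data structure; the following encoding is ours, for illustration, and is what the other sketches in this document assume:

from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    snk: str
    delay: int = 0  # number of initial tokens on the edge

def is_path(edges):
    # A path requires snk(e_i) == src(e_(i+1)) for consecutive edges.
    return all(edges[i].snk == edges[i + 1].src for i in range(len(edges) - 1))

def path_delay(edges):
    return sum(e.delay for e in edges)  # Delay(p): total delay along a path

# Example: a two-edge path from A to C with one initial token on (A, B).
p = [Edge("A", "B", 1), Edge("B", "C")]
assert is_path(p) and path_delay(p) == 1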
155. The following theorem is useful to formalize the concept of when the synchronization constraints represented by one synchronization graph imply the synchronization constraints of another graph. This theorem provides a useful constraint for synchronization optimization, and it underlies the validity of the main techniques that we will present in this chapter.

Theorem 5.1: The synchronization constraints in a synchronization graph G_1 = (V, E_int ∪ E_s1) imply the synchronization constraints of the synchronization graph G_2 = (V, E_int ∪ E_s2) if the following condition holds: for every edge ε such that ε is in E_s2 but not in E_s1, we have ρ_G1(src(ε), snk(ε)) ≤ delay(ε). That is, for each edge ε that is present in G_2 but not in G_1, there is a minimum-delay path from src(ε) to snk(ε) in G_1 that has a total delay of at most delay(ε). Note that since the vertex sets of the two graphs are identical, it is meaningful to refer to src(ε) and snk(ε) as vertices of G_1, even though ε is in E_s2 and not in E_s1.

First we prove the following lemma.

Lemma 5.1: If there is a path p = (e_1, e_2, …, e_n) in G, then

    start(snk(e_n), k) ≥ end(src(e_1), k − Delay(p)).

Proof of Lemma 5.1: The following constraints hold along such a path p (as per Eqn. 4.1):

    start(snk(e_1), k) ≥ end(src(e_1), k − delay(e_1)).    (5.3)

Similarly,

    start(snk(e_2), k) ≥ end(src(e_2), k − delay(e_2)).

Noting that src(…
156. …initial transient, and the authors show how determining this pattern (essentially by simulation) leads to efficient loop parallelization. In [Zaky89], the authors propose a max-algebra [Cun79] based technique for determining the steady-state pattern in the VLIW program. In [Chre83], the author studies periodic firing patterns of transitions in timed Petri nets. The iterative algorithms for determining clock schedules in [Shen92] have convergence properties similar to the transients in self-timed schedules: their algorithm converges when an equivalent self-timed schedule reaches a periodic regime.

Returning to the problem of determining the optimal transaction order, one possible scheme is to derive the transaction order from the repeating pattern that the ST schedule settles into. That is, instead of using the transaction order of Fig. 3.1, if we enforce the transaction order that repeats over two iterations in the evolution of the ST schedule of Fig. 4.2, the OT schedule would mimic the ST schedule exactly, and we would obtain an OT schedule that performs as well as the ideal ST schedule and yet involves low IPC costs in practice. However, as pointed out above, the number of iterations that the repeating pattern spans depends on the critical cycles of G_ipc, and it can be exponential in the size of the HSDFG [Bacc92]. In addition, the transient region before the schedule settles into a repeating pattern can also be ex…
157. …ensures (a) above. Below we outline the mechanics of the two synchronization protocols defined so far.

BBS. In this mechanism, a write pointer wr(e) for e is maintained on the processor that executes src(e); a read pointer rd(e) for e is maintained on the processor that executes snk(e); and a copy of wr(e) is maintained in some shared memory location sv(e). The pointers rd(e) and wr(e) are initialized to zero and delay(e), respectively. Just after each execution of src(e), the new data value produced onto e is written into the shared memory buffer for e at offset wr(e); wr(e) is updated by the operation wr(e) <- (wr(e) + 1) mod B_fb(e), and sv(e) is updated to contain the new value of wr(e). Just before each execution of snk(e), the value contained in sv(e) is repeatedly examined until it is found to be not equal to rd(e); then the data value residing at offset rd(e) of the shared memory buffer for e is read, and rd(e) is updated by the operation rd(e) <- (rd(e) + 1) mod B_fb(e).

UBS. This mechanism also uses the read and write pointers rd(e) and wr(e), and these are initialized the same way; however, rather than maintaining a copy of wr(e) in the shared memory location sv(e), we maintain a count (initialized to delay(e)) of the number of unread tokens that currently reside in the buffer. Just after src(e) executes, sv(e) is repeatedly examined until its value is found to be less than B_fb(e); then the new data…
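A behavioral sketch of the two protocols for a single edge, with the shared buffer modeled as a Python list and the repeated examination of sv(e) written as busy-wait loops; B stands for the buffer size B_fb(e), and the class structure is ours:

class BBS:
    def __init__(self, B, delay):
        self.B, self.buf = B, [None] * B
        self.wr, self.rd = delay % B, 0
        self.sv = self.wr              # shared copy of the write pointer

    def send(self, value):             # on the processor executing src(e)
        self.buf[self.wr] = value
        self.wr = (self.wr + 1) % self.B
        self.sv = self.wr              # one shared write per iteration

    def recv(self):                    # on the processor executing snk(e)
        while self.sv == self.rd:      # examine sv(e) until new data exists
            pass
        value = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.B
        return value

class UBS(BBS):
    def __init__(self, B, delay):
        super().__init__(B, delay)
        self.sv = delay                # count of unread tokens instead

    def send(self, value):
        while self.sv >= self.B:       # wait for space in the buffer
            pass
        self.buf[self.wr] = value
        self.wr = (self.wr + 1) % self.B
        self.sv += 1                   # extra shared read-modify-write

    def recv(self):
        while self.sv == 0:            # wait for an unread token
            pass
        value = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.B
        self.sv -= 1
        return value

Counting the shared accesses to sv(e) in each iteration reproduces the costs used in the text: two for BBS, and on average four for UBS.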
158. …meant to be programmable by the end user. The computation in embedded systems is specialized; the computa… …In both these respects, the programmable core needs to be verified for correctness only once, and design changes can be made late in the design cycle by modifying the software program. Although verifying the embedded software to be run on a programmable part is also a hard problem, in most situations changes late in the design cycle, and indeed even after the system design is completed, are much easier and cheaper to make in the case of software than in the case of hardware.

Special processors are available today that employ an architecture and an instruction set tailored towards signal processing. Such software-programmable integrated circuits are called Digital Signal Processors (DSP chips, or DSPs for short). The special features that these processors employ are discussed by Lee in [Lee88a]. However, a single processor, even a DSP, often cannot deliver the performance requirement of some applications. In these cases, the use of multiple processors is an attractive solution, where both the hardware and the software make use of the application-specific nature of the task to be performed.

Over the past few years, several companies have been offering boards consisting of multiple DSP chips. More recently, semiconductor companies are offering chips that integrate multiple CPUs on a single die. Texas Instruments (the TMS320…
159. …implementing synchronization for all redundant synchronization edges, since the redundancies are not interdependent. Thus an optimal removal of redundant synchronizations can be obtained by applying a straightforward algorithm that successively tests the synchronization edges for redundancy in some arbitrary sequence; and since computing the weight of a shortest path in a weighted directed graph is a tractable problem, we can expect such a solution to be practical.

5.6.2 Removing redundant synchronizations

Fig. 5.5 presents an efficient algorithm, based on the ideas presented in the previous subsection, for optimal removal of redundant synchronization edges. In this algorithm, we first compute the path delay of a minimum-delay path from x to y for each ordered pair of vertices (x, y); here, we assign a path delay of ∞ whenever there is no path from x to y. This computation is equivalent to solving an instance of the well-known all-points shortest paths problem [Corm92]. Then we examine each synchronization edge e, in some arbitrary sequence, and determine whether or not there is a path from src(e) to snk(e) that does not contain e and that has a path delay that does not exceed delay(e). This check for redundancy is equivalent to the check that is performed by the if statement in…
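A sketch of this procedure, using Floyd-Warshall for the all-points computation; the Edge encoding from the terminology sketch is repeated here so the snippet stands alone, and the thesis's Fig. 5.5 procedure may differ in bookkeeping details:

from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    snk: str
    delay: int = 0

INF = float("inf")

def all_pairs_min_delay(vertices, edges):
    # rho[x][y] = path delay of a minimum-delay path from x to y.
    rho = {x: {y: (0 if x == y else INF) for y in vertices} for x in vertices}
    for e in edges:
        rho[e.src][e.snk] = min(rho[e.src][e.snk], e.delay)
    for k in vertices:
        for x in vertices:
            for y in vertices:
                if rho[x][k] + rho[k][y] < rho[x][y]:
                    rho[x][y] = rho[x][k] + rho[k][y]
    return rho

def remove_redundant_synchs(vertices, edges, synch_edges):
    rho = all_pairs_min_delay(vertices, edges)
    keep = []
    for e in synch_edges:
        # e is redundant if a path leaving src(e) through some other edge
        # reaches snk(e) with total delay at most delay(e).
        redundant = any(f.delay + rho[f.snk][e.snk] <= e.delay
                        for f in edges if f.src == e.src and f is not e)
        if not redundant:
            keep.append(e)
    return keep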
160. …to 9 units (Fig. 4.6). Note that the processor assignment and actor ordering in the schedule of Fig. 4.6 are identical to those of the schedule in Fig. 4.1. The values σ'(v) are: σ'(A) = 9, σ'(B) = 0, σ'(G) = 2, σ'(C) = 6, σ'(D) = 0, σ'(E) = 5, σ'(F) = 8, and σ'(H) = 3.

Figure 4.6. Modified schedule (Gantt chart over four processors).

Claim 4.1 may not seem useful at first sight: why not obtain a fully-static schedule that has a period T_ST to begin with, thus eliminating the post-processing step suggested in Claim 4.1? Recall that an FS schedule is usually obtained using heuristic techniques that are either based on blocked (non-overlapped) scheduling, which use critical-path-based heuristics [Sih91], or based on overlapped scheduling techniques that employ list scheduling heuristics [deGroot92, Lam88]. None of these techniques guarantees that the generated FS schedule will have an iteration period within one unit of the period achieved if the same schedule were run in a self-timed manner. Thus, for a schedule generated using any of these techniques, we might be able to obtain a gain in performance essentially for free by performing the post-processing step suggested in Claim 4.1. What we propose can therefore be added as an efficient post-processing step to existing schedulers. Of course, an exhaustive search procedure li…
161. …to an SOP representation with a minimal number of product terms. Suppose the expression c_1 + c_2 + … + c_l can be minimized into another SOP expression c'_1 + c'_2 + … + c'_p, where p ≤ l. We can then replace the segment (c_1, ProcID_x), (c_2, ProcID_x), …, (c_l, ProcID_x) of the annotated bus access list with an equivalent segment of the form (c'_1, ProcID_x), (c'_2, ProcID_x), …, (c'_p, ProcID_x). We can thus obtain a minimal set of contiguous appearances of a bus grant to the same processor.

Another optimization that can be performed is to combine annotated bus access lists with the switching mechanism of section 6.2.1. Suppose we have the following annotated bus access list:

    (b_1·b_2, ProcID_x), (b_1·b_3, ProcID_y), (b_1·b_4·b_5, ProcID_z).

Then, by factoring b_1 out, we can equivalently write the above list as

    b_1: (b_2, ProcID_x), (b_3, ProcID_y), (b_4·b_5, ProcID_z).

Now we can skip over all three accesses whenever the Boolean b_1 is FALSE, by loading the schedule counter and forcing it to increment its count by three, instead of evaluating each access separately and skipping over each one individually. This strategy reduces overhead, because it costs an extra bus cycle to disable a bus access when a condition corresponding to that bus access evaluates to FALSE; by skipping over three bus accesses that we know are going to be disabled, we save three idle bus cycles. There is an added cost of one…
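The pairwise merging that such a minimizer performs can be illustrated with the adjacency rule x·b + x·b' = x alone (a toy sketch, ours; ESPRESSO implements this and much stronger reductions). A product term is encoded as a frozenset of (Boolean ID, polarity) literals:

import itertools

def merge_adjacent(terms):
    terms = set(terms)
    changed = True
    while changed:
        changed = False
        for a, b in itertools.combinations(terms, 2):
            diff = a ^ b  # literals not shared by the two product terms
            if len(diff) == 2 and len({name for name, _ in diff}) == 1:
                # a and b differ only in one literal's polarity: merge them
                terms = (terms - {a, b}) | {a & b}
                changed = True
                break
    return terms

# (b1 AND b2) + (b1 AND NOT b2) simplifies to b1:
t = [frozenset({("b1", 1), ("b2", 1)}), frozenset({("b1", 1), ("b2", 0)})]
print(merge_adjacent(t))  # {frozenset({("b1", 1)})}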
162. …broadcast is not possible with a central shared memory. The modified design, however, involves a significantly higher hardware cost than the design proposed earlier. As a result, the OMA prototype was built around the central shared memory design and not the FIFO-based design. In addition, the DSP96002 processor has an on-chip host interface unit that can be used as a 2-deep FIFO; therefore, the potential advantage of using distributed FIFOs can still be evaluated to some degree by using the on-chip host interface, even in the absence of external FIFO hardware.

Figure 3.4. Details of the TA line mechanism (only one processor is shown): the shared TA line and shared address decode, together with the processor's local address decode, local busses, and local memory.

Simulation models were written for both of the above designs using the Thor hardware simulator [Thor86] under the Frigg multiprocessor simulation system [Bier89]. Frigg allows the Thor simulator to communicate with a timing-driven functional simulator for the DSP96002 processor, provided by Motorola Inc. The Motorola simulator also simulates input/output (I/O) operations on the pins of the processor, and Frigg interfaces the signals on the pins with the rest of the Thor simulation; as a result, the hardware associated with each processor (memories, address decoding logic, etc.) and the interaction between processors…
163. Figure 6.1: … ... 148
Figure 6.2: (a) Conditional (if-then-else) dataflow graph; the branch outcome is determined at run time by actor B. (b) Graph representing data-dependent iteration; the termination condition for the loop is determined by actor D ... 149
Figure 6.3: Acyclic precedence graphs corresponding to the if-then-else graph of Fig. 6.2; (a) corresponds to the TRUE assignment of the control token, (b) to the FALSE assignment ... 150
Figure 6.4: Quasi-static schedule for a conditional construct (adapted from …) ... 152
Figure 6.5: Programs on three processors for the quasi-static schedule of Fig. 6.4 ... 153
Figure 6.6: Transaction order corresponding to the TRUE and FALSE branches ...
Figure 6.7: Bus access list that is stored in the schedule RAM for the quasi-static schedule of Fig. 6.6; loading operation of the schedule counter conditioned on value of c is also shown ... 157
Figure 6.8: Conditional constructs in parallel paths ... 158
Figure 6.9: A bus access mechanism that selectively masks bus grants based on values of control tokens that are evaluated at run time ... 159
Figure 6.10: Bus access lists and the annotated list corresponding to Fig. 6.6 ... 161
Figure 6.11: Quasi-static schedule for the data-dependent iteration graph of Fig. 6.2(b) ...
(Figures 6.12 and 7.1: captions truncated)
164. …blocks are fine- to medium-grain in size; an atomic actor in the SDF graph may implement anything from a filtering function to a two-input addition operation. The final program is then automatically generated by concatenating code corresponding to the blocks in the program, in the sequence prescribed by a schedule. This approach is mature enough that there are commercial tools available today (for example, the SPW and COSSAP tools mentioned earlier) that employ this technique. Powerful optimization techniques have been developed for generating sequential programs from SDF graphs that optimize for metrics such as memory usage [Bhat94].

Scheduling is a fundamental operation that must be performed in order to implement SDF graphs on uniprocessors as well as on multiprocessors. Uniprocessor scheduling simply refers to determining the sequence of execution of the actors such that all precedence constraints are met and all the buffers between actors (corresponding to arcs) return to their initial states. We discuss the issues involved in multiprocessor scheduling next.

1.2 Parallel scheduling

We recall that in the execution of a dataflow graph, actors fire when a sufficient number of tokens are present at their inputs. The task of scheduling such a graph onto multiple processing units therefore involves assigning actors in the HSDFG to processors (the processor assignment step), ordering the execution of these actors on each processor…
165. …introduced by Convert-to-SC-graph is 2(n_src + n_snk − 1), while the number of synchronization accesses S− eliminated by Convert-to-SC-graph, by allowing the feedforward edges of the original synchronization graph to be implemented with BBS rather than UBS, equals 2·n_f. It follows that the net change (S+ − S−) in the number of synchronization accesses satisfies

    S+ − S− = 2(n_src + n_snk − 1) − 2·n_f,    (5.1)

and since n_f ≥ (n_src + n_snk − 1), we have S+ − S− ≤ 0. We have established the following result.

Theorem 5.3: Suppose that G is a synchronization graph, and G' is the graph that results from applying algorithm Convert-to-SC-graph to G. Then the synchronization cost of G' is less than or equal to the synchronization cost of G.

For example, without the edges added by Convert-to-SC-graph (the dashed grey edges in Fig. 5.9), there are 6 feedforward edges, which require 24 synchronization accesses per iteration period to implement. The addition of the 4 dashed edges requires 8 synchronization accesses to implement these new edges, but it allows us to use BBS instead of UBS for the original feedforward edges, which leads to a savings of 12 synchronization accesses for those edges. Thus the net effect achieved by Convert-to-SC-graph in this example is a reduction of the total number of synchronization accesses by 12 − 8 = 4. As another example, consider Fig. 5.10, which shows the synchronization graph topology after redundant synchronization edges are removed…
166. Figure 6.7. Bus access list that is stored in the schedule RAM for the quasi-static schedule of Fig. 6.6. The loading operation of the schedule counter, conditioned on the value of c, is also shown: if c is FALSE, processor 2 forces the controller to jump to the access list for the FALSE branch; if c is TRUE, processor 1 forces the controller to bypass the access list for the FALSE branch; the list ends with the access order for subgraph 1.

…manner stated in the previous section and employ the OMA mechanism above, we would need one bus access list for every combination of the Booleans b_1, b_2, …, b_k. This is because each f_i and g_i will have its own associated bus access list, which then has to be combined with the bus access lists of all the other branches to yield one list. For example, if all the Booleans b_i are TRUE, then all the f_i's are executed, and we get one access list; if b_1 is TRUE and b_2 through b_k are FALSE, then f_1 and g_2 through g_k are executed, which corresponds to another bus access list. This implies 2^k bus access lists, one for each combination of the f_i's and g_i's that execute, i.e., for each possible execution path in the graph.

6.2.3 Improved mechanism

Although the idea of maintaining separate bus access lists is a simple mechanism for handling control constructs, it can sometimes be impractical, as in the example above. We propose an alternative mechanism, based on masking, that handles parallel…
167. …ons because a particular Boolean token may be a don't-care for that bus access. For example, the first three bus accesses in Fig. 6.10 appear in both execution paths, because they are independent of the value of c. In the worst case, a bus access that is independent of all the Booleans will end up appearing in the bus access lists of all the Boolean combinations. If these bus accesses appear contiguously in the collapsed bus access sequence, we can combine them into one. For example, the entries (c, Proc 2) and (c', Proc 2) in the annotated schedule of Fig. 6.10 can be combined into a single Proc 2 entry, which is not conditioned on any control token. Consider another example: if we get contiguous entries (b_1'·b_2, Proc 3) and (b_1'·b_2', Proc 3) in the collapsed list, we can replace the two entries with the single entry (b_1', Proc 3). More generally, if the collapsed list contains a contiguous segment of the form (c_1, ProcID_x), (c_2, ProcID_x), (c_3, ProcID_x), …, (c_l, ProcID_x), we can write the whole contiguous segment as a single entry for ProcID_x, where the bus grant condition is the expression c_1 + c_2 + … + c_l, which is a sum-of-products (SOP) function of the Booleans in the system. We can now apply 2-level minimization to determine a minimal representation of each of these expressions. Such 2-level minimization can be done by using a logic minimization tool such as ESPRESSO [Bray84], which simplifies a given SOP expression into…
168. …oposed in the literature for modeling different types of systems. Some of these Petri net models have the same expressive power as Turing machines: for example, if transitions are allowed to possess inhibitor inputs (if a place corresponding to such an input to a transition contains a token, then that transition is not allowed to fire), then a Petri net can simulate any Turing machine (pp. 201 in [Peter81]). Others, depending on topological restrictions imposed on how places and transitions can be interconnected, are equivalent to finite state machines, and yet others are similar to SDF graphs. Some extended Petri net models allow a notion of time, to model execution times of computations. There is also a body of work on stochastic extensions of timed Petri nets that are useful for modeling uncertainties in computation times. We will touch upon some of these Petri net models again in Chapter 4. Finally, there are Petri nets that distinguish between different classes of tokens in the specification (colored Petri nets), so that tokens can have information associated with them. We refer to [Peter81, Mur89] for details on the extensive variety of Petri nets that have been proposed over the years.

The particular restricted dataflow model we are mainly concerned with in this thesis is the SDF (Synchronous Data Flow) model proposed by Lee and Messerschmitt [Lee87]. The SDF model poses restrictions on the firing of actors: the number of tok…
169. …ores the access order list in the form of processor IDs. We call this counter the schedule counter, and the memory that stores the processor IDs is called the schedule RAM. Decoding the output of the RAM provides the required BG lines. The counter is incremented at the beginning of a processor transaction by the negative-going edge of the common BR signal, and the output of the RAM is latched at the positive-going edge of BR, thus granting the bus to the next processor as soon as the current processor completes its shared memory transaction. The counter is reset to zero after it reaches the end of the list; i.e., the counter counts modulo the bus access list size. This is shown in Fig. 3.7. Incrementing the counter as soon as BR goes low ensures enough time for the counter outputs and the RAM outputs to stabilize. For a 33 MHz processor with zero wait states, the BR width is a minimum of 60 nanoseconds; thus the counter incrementing and the RAM access must both finish before this time. Consequently, we need a fast counter and fast static RAM for the schedule memory. The width of the counter determines the maximum allowable size of the access list: a counter width of n implies a maximum list size of 2^n; a wider counter, however, implies a slower counter. If, for a certain width, the counter (implemented on the Xilinx part in our case) turns out to be too slow (i.e., the output of the schedule memory will not stabilize at least one latch set-up per…
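Behaviorally, the schedule counter and schedule RAM reduce to a modulo counter indexing an ordered list. The following minimal software sketch (not the thesis hardware; the list contents are hypothetical) models one BR cycle per call:

    class TransactionController:
        def __init__(self, schedule_ram):
            self.ram = schedule_ram      # ordered list of processor IDs
            self.counter = 0             # the schedule counter

        def bus_released(self):
            """Model one BR edge: return the processor ID to grant the
            bus to next, then advance modulo the access list size."""
            grant = self.ram[self.counter]
            self.counter = (self.counter + 1) % len(self.ram)
            return grant

    ctrl = TransactionController([0, 2, 1, 3])
    print([ctrl.bus_released() for _ in range(6)])   # [0, 2, 1, 3, 0, 2]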
170. …ortion of TRUE values that a control token assumes during the execution of the schedule would be required for determining multiprocessor schedules for BDF graphs.

The OMA architecture applies the ordered transactions strategy to a shared bus multiprocessor. If the interprocessor communication bandwidth requirements for an application are higher than what a single shared bus can support, a more …

…yields a buffer size bound B_fb((A, B)) = 3, since 3 is the minimum path delay of a cycle that contains (A, B). However, if t(A) and t(B), the execution times of actors A and B, are guaranteed to be equal to the same constant, then it is easily verified that a buffer size of 1 will suffice for (A, B). Systematically applying execution time guarantees to derive lower buffer size bounds appears to be a promising direction for further work.

Another interesting problem is applying the synchronization minimization techniques to graphs that contain dynamic constructs. Suppose we schedule a graph that contains dynamic constructs using a quasi-static approach (or a more general approach, if one becomes available). Is it still possible to employ the synchronization optimization techniques we discussed in Chapter 5? The first step to take would be to obtain an IPC graph equivalent for the quasi-static schedule that has a representation for the control constructs that a processor may execute as a part of the quasi-static schedule. If we can show that…
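The self-timed buffer bound used in the (A, B) example above is the minimum total delay over cycles containing the edge, i.e., delay(e) plus the smallest-delay path from snk(e) back to src(e). A minimal sketch under a hypothetical graph encoding:

    import heapq

    def min_delay_path(edges, src, dst):
        """Dijkstra on delay weights; edges: {(u, v): delay}."""
        adj = {}
        for (u, v), d in edges.items():
            adj.setdefault(u, []).append((v, d))
        dist, heap = {src: 0}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            for v, w in adj.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return float("inf")

    def buffer_bound(edges, e):
        """B_fb(e): minimum path delay over cycles that contain e."""
        (u, v) = e
        return edges[e] + min_delay_path(edges, v, u)

    # e.g. a two-actor cycle A -> B (delay 0), B -> A (delay 3):
    edges = {("A", "B"): 0, ("B", "A"): 3}
    print(buffer_bound(edges, ("A", "B")))   # 3, as in the example above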
171. …our case these commands correspond to the debugging monitor commands implemented in gdm, as well as commands specific to the OMA. Another useful feature of tcl is the scripting facility it provides: sequences of commands can be conveniently integrated into scripts, which are in turn executed by issuing a single command. Some functions specific to the OMA hardware that have been compiled into gdm are the following:

    omaxiload <fileName.bit>        : load OMA Xilinx with the configuration specified by <fileName.bit>
    omapboot <fileName.lod> <proc>  : load bootstrap monitor code into the specified processor
    omapload <fileName.lod> <proc>  : load a DSP96002 .lod file into the specified processor
    schedload <accessOrder>         : load the OMA bus access schedule memory

These functions use existing gdm functions for reading and writing values to the DSP56000 memory locations that are mapped to the OMA board host interface. The sequence of commands needed to download code onto the OMA board and run it is summarized in Fig. 3.15:

    proc omaDoAll {} {
        xload                          ;# configure S-56X board Xilinx
        omareset                       ;# reset OMA board
        omaxiload bootup.bit           ;# configure OMA Xilinx for booting procs
        foreach i {0 1 2 3} {
            omapboot omaMon.lod $i     ;# load bootstrap monitor routine
            omapload ${i}code.lod $i   ;# load code (0code.lod, 1code.lod, etc.)
        }
        schedload access_order.data    ;# load bus access schedule into schedule memory
        omaxiload moma1.bit            ;# reconfigure OMA Xilinx to implement the Transaction Controller
    }
172. …pecified essentially as SDF, but augmented with a limited number of control constructs, and hence fall into the BDF model. SDF has proven to be a useful model for representing a significant class of DSP algorithms; several DSP tools have been designed based on the SDF and closely related models. Examples of commercial tools based on SDF are the Signal Processing Worksystem (SPW), developed by Comdisco Systems (now the Alta Group of Cadence Design Systems) [Pow92, Barr91], and COSSAP, developed by Cadis in collaboration with Meyr's group at Aachen University [Ritz92]. Tools developed at various universities that use SDF and related models include Ptolemy [Pin95a], the Warp compiler [Prin92], DESCARTES [Ritz92], GRAPE [Lauw90], and the Graph Compiler [Veig90]. The SDF model is popular because it has certain analytical properties that are useful in practice; we will discuss these properties and how they arise in the following section. The property most relevant for this thesis is that it is possible to effectively exploit parallelism in an algorithm specified in SDF by scheduling computations in the SDF graph onto multiple processors at compile or design time.

…ties that facilitate formal reasoning about programs specified in these models and are useful in practice, leading to simpler implementation of the specified computation in hardware or software. One such restricted model, and in fact one of the earliest graph-based computation mode…
173. …ple protocol guarantees correct sender-receiver synchronization, and even though the semaphore bits have multiple writers, no atomic test-and-set operation is required of the hardware. In summary, the operations of the sender are: request bus, wait for arbitration, busy-wait until the semaphore is in the correct state, write the shared memory location if the semaphore is in the correct state, and then release the bus. The corresponding operations for the receiver are: request bus, wait for arbitration, busy-wait on the semaphore, read the shared memory location if the semaphore is in the correct state, and release the bus. The IPC costs are therefore due to bus arbitration time and due to semaphore checks. Such overhead consumes on the order of tens of instruction cycles if no special hardware support is employed for IPC. In addition, semaphore checks consume shared bus bandwidth.

An example of this is a four-processor DSP56000-based shared bus system designed by Dolby Labs for digital audio processing applications. In this machine, processors communicate through shared memory, and a central bus arbiter resolves bus request conflicts between processors. When a processor gets the bus, it performs a semaphore check and continues with the shared memory transaction if the semaphore is in the correct state; it explicitly releases the bus after completing the shared memory transaction. A receive and a send together consume 30 instruction cycles even i…
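Abstracting away the bus arbitration, the semaphore discipline just summarized is a one-bit full/empty handshake. A minimal sketch (Python threads standing in for processors; all names are hypothetical) of the busy-wait behavior:

    import threading

    class SharedSlot:
        def __init__(self):
            self.full = False        # the semaphore bit
            self.value = None        # the shared memory location

        def send(self, v):
            while self.full:          # busy-wait: slot must be empty
                pass
            self.value = v
            self.full = True          # mark data as valid

        def receive(self):
            while not self.full:      # busy-wait: slot must be full
                pass
            v = self.value
            self.full = False         # free the slot for the sender
            return v

    slot = SharedSlot()
    t = threading.Thread(target=lambda: slot.send(42))
    t.start()
    print(slot.receive())   # 42
    t.join()

In hardware, each spin of these while-loops is a shared bus transaction, which is exactly the bandwidth cost the ordered transactions strategy eliminates.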
174. …plemented using send and receive primitives that make use of the processor interconnect hardware. These primitives then have an execution cost associated with them that depends on the multiprocessor architecture and hardware being employed. Fully static scheduling heuristics that take communication costs into account include [Sark89, Sih91, Prin91].

Computations in the HSDFG, however, are iterated essentially infinitely. The blocked scheduling strategies discussed thus far ignore this fact, and thus pay a penalty in the quality of the schedule they obtain. Two techniques that enable blocked schedules to exploit inter-iteration parallelism are unfolding and retiming. The unfolding strategy schedules J iterations of the HSDFG together, where J is called the blocking factor; thus the schedule in Fig. 1.1(b) has J = 1. Unfolding often leads to improved blocked schedules (pp. 78-100 of [Lee86], [Parhi91]), but it also implies a factor of J increase in program memory size, and also in the size of the scheduling problem, which makes unfolding somewhat impractical. Retiming involves manipulating delays in the HSDFG to reduce the critical path in the graph. This technique has been explored in the context of maximizing clock rates in synchronous digital circuits [Lei83], and has been proposed for improving blocked schedules for HSDFGs (cutset transformations in [Lee86] and [Hoang93]).

Fig. 1.1(c) illustrates an example of an overlapped schedule. Such a…
175. …ponential. Consequently, the memory requirements for the controller that enforces the transaction order can be prohibitively large in certain cases; in fact, even for the example of Fig. 4.2, the doubling of the controller memory that such a strategy entails may be unacceptable. We therefore restrict ourselves to determining and enforcing a transaction order that spans only one iteration of the HSDFG; in the following section we show that there is no sacrifice in imposing such a restriction, and we discuss how such an optimal transaction order is obtained.

4.5 Optimal order

In this section we show how to determine an order O on the IPCs in the schedule such that imposing O yields an OT schedule that has an iteration period within one unit of that of the ideal ST schedule (T_ST ≤ T_OT < T_ST + 1). Thus, imposing the order we determine results in essentially no loss in performance over an unrestrained schedule, and at the same time we get the benefit of cheaper IPC. Our approach to determining the transaction order O is to modify a given fully static schedule so that the resulting FS schedule has an iteration period equal to the ceiling of T_ST, and then to derive the transaction order from that modified schedule. Intuitively, it appears that for a given processor assignment and ordering of actors on processors, the ST approach always performs better than the FS or OT approaches (T_ST ≤ T_OT, T_FS), simply because it allows successive iterations to overlap. The followi…
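Once the modified fully static schedule is in hand, extracting the transaction order reduces to sorting the communication actors by their scheduled start times over one iteration. A minimal sketch, assuming a hypothetical schedule encoding and the naming convention that sends and receives are called s1, r1, s2, r2, ...:

    def transaction_order(fs_schedule):
        """fs_schedule: {actor_name: start_time}. Returns the IPC
        (send/receive) actors sorted by start time; this list, repeated
        every iteration, is the order O loaded into the controller."""
        ipc = [a for a in fs_schedule if a[0] in ("s", "r")]
        return sorted(ipc, key=lambda a: fs_schedule[a])

    fs = {"A": 0, "s1": 2, "B": 0, "r1": 3, "s2": 4, "r2": 5}
    print(transaction_order(fs))   # ['s1', 'r1', 's2', 'r2']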
176. …ques presented in this thesis apply to DSP algorithms that involve simple control structure; the precise domain of applicability of these techniques will be formally stated shortly.

Applications in signal processing and image processing require large computing power and have real-time performance requirements. The computing engines in such applications tend to be embedded, as opposed to general purpose. Custom VLSI implementations are usually preferred in such high-throughput applications. However, custom approaches have the well-known problems of long design cycles (the advances in high-level VLSI synthesis notwithstanding) and low flexibility in the final implementation. Programmable solutions are attractive when one can afford to give up some processing power to the inefficiencies in the automatic tools. In addition, new techniques are being researched to make the process of automatically mapping a design onto multiple processors more efficient; this thesis is also an attempt in that direction. This situation is analogous to how logic designers have embraced automatic logic synthesis tools in recent years: logic synthesis tools and VLSI technology have improved to the point that the chip area saved by manual design over automated design is not worth the extra design time involved; one can afford to waste a few gates, just as one can afford to waste processor cycles to compilation inefficiencies in a multiprocessor DSP. Finally, there…
177. …r i = 1, 2, ..., m−1, e_i denotes the edge instantiated by Convert-to-SC-graph from a vertex in C_i to a vertex in C_{i+1}; the sink-to-source edge instantiated by Convert-to-SC-graph is denoted e_0.
Output: non-negative integers d_0, d_1, ..., d_{m−1}, such that the estimated throughput when delay(e_i) = d_i, 0 ≤ i ≤ m−1, equals the estimated throughput of G.

1. X = G with the delay on each of e_0, e_1, ..., e_{m−1} set to be infinite.
2. lambda_max = BellmanFord(X)   /* compute the maximum cycle mean of X */
3. Delta = ceiling( (sum over x in V of t(x)) / lambda_max ), an upper bound on the delay required for any edge.
4. For i = 0, 1, ..., m−1
5.     delta_i = MinDelay(X, e_i, lambda_max, Delta)
6.     fix the delay on e_i to be delta_i
7. End For
8. Return delta_0, delta_1, ..., delta_{m−1}.

Function MinDelay(X, e, lambda, B)
Input: a synchronization graph X, an edge e in X, a positive real number lambda, and a positive integer B.
Output: assuming X with delay(e) = B has estimated throughput no less than 1/lambda, determine the minimum d in {0, 1, ..., B} such that the estimated throughput of X with delay(e) = d is no less than 1/lambda.
Perform a binary search in the range [0, B] to find the minimum value of r in {0, 1, ..., B} such that BellmanFord(X with delay(e) = r) returns a value less than or equal to lambda. Return this minimum value of r.

Figure 5.12: An algorithm for determining the delays on the edges introduced by algorithm Convert-to-SC-graph.

…and more than one sink SCC. For example, if e_1, e_2, ... denote edges that were instantia…
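A minimal executable sketch of MinDelay follows (hypothetical graph encoding: {(u, v): delay} plus execution times {v: t(v)}). The feasibility test uses the standard reduction: the maximum cycle mean is at most lam iff the graph with edge weights lam*delay(e) − t(src(e)) has no negative cycle, which Bellman-Ford detects; this plays the role of the BellmanFord routine in Fig. 5.12.

    def no_negative_cycle(nodes, edges, times, lam):
        dist = {v: 0.0 for v in nodes}       # implicit zero-weight source
        for _ in range(len(nodes) - 1):
            for (u, v), d in edges.items():
                w = lam * d - times[u]
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        return all(dist[u] + lam * d - times[u] >= dist[v]
                   for (u, v), d in edges.items())

    def min_delay(nodes, edges, times, e, lam, B):
        """Smallest d in [0, B] such that delay(e) = d keeps the
        estimated throughput at least 1/lam (binary search)."""
        lo, hi = 0, B
        while lo < hi:
            mid = (lo + hi) // 2
            trial = dict(edges)
            trial[e] = mid
            if no_negative_cycle(nodes, trial, times, lam):
                hi = mid
            else:
                lo = mid + 1
        return lo

Increasing the delay on e can only lower the cycle means of cycles through e, so feasibility is monotonic in d and the binary search is valid.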
178. …r technique for bounding buffer sizes; this technique achieves bounded buffer sizes by transforming the graph into a strongly connected graph through the addition of a minimal number of additional synchronization edges. Thus, in our final algorithm, we will not in fact find it necessary to use or compute these bounds.

5.4 Synchronization model

5.4.1 Synchronization protocols

We define two basic synchronization protocols for a communication edge, based on whether or not the length of the corresponding buffer is guaranteed to be bounded from the analysis presented in the previous section. Given an IPC graph G, and a communication edge e in G, if the length of the corresponding buffer is not bounded, that is, if e is a feedforward edge of G, then we apply a synchronization protocol called unbounded buffer synchronization (UBS), which guarantees that (a) an invocation of snk(e) never attempts to read data from an empty buffer, and (b) an invocation of src(e) never attempts to write data into the buffer unless the number of tokens in the buffer is less than some pre-specified limit B_ff(e), which is the amount of memory allocated to the buffer, as discussed in the previous section.

On the other hand, if the topology of the IPC graph guarantees that the buffer length for e is bounded by some value B_fb(e) (the self-timed buffer bound of e), then we use a simpler protocol, called bounded buffer synchronization (BBS), that only explicitly e…
179. …raphs, formally defined in [Bhat95b], a provably optimum resynchronization can be obtained using a procedure similar to pipelining. This procedure, however, leads to an implementation that in general has a larger latency than the implementation we start out with; the resynchronization procedure as outlined in [Bhat95a] can, in general, lead to implementations with increased latency. Latency is measured as the time delay between when an input data sample is available and when the corresponding output is generated. In [Bhat95b] we show how we can modify the resynchronization procedure to trade off synchronization cost with latency. Optimal latency-constrained resynchronization, however, is again shown to be NP-hard. The work on resynchronization is very much ongoing research, a brief outline of which we have presented in this section.

5.10 Summary

We have addressed the problem of minimizing synchronization overhead in self-timed multiprocessor implementations. The metric we use to measure synchronization cost is the number of accesses made to shared memory for the purpose of synchronization per schedule period. We used the IPC graph framework introduced in the previous chapter to extend an existing technique (detection of redundant synchronization edges for noniterative programs) to the iterative case. We presented a method for the conversion of the synchronization graph into a strongly connected graph, which again results in reduce…
180. …A possible access order list corresponding to the quasi-static schedule … 165
An example of how execution time guarantees can be used to reduce buffer size bounds … 168

ACKNOWLEDGEMENTS

I have always considered it a privilege to have had the opportunity of pursuing my Ph.D. at Berkeley. The time I have spent here has been very fruitful, and I have found the interaction with the exceptionally distinguished faculty and the smart set of colleagues extremely enriching. Although I will not be able to acknowledge all the people who have directly or indirectly helped me during the course of my Ph.D., I wish to mention some of the people who have influenced me most during my years as a graduate student.

First and foremost, I wish to thank Professor Edward Lee, my research advisor, for his valuable support and guidance, and for having been a constant source of inspiration for this work. I really admire Professor Lee's dedication to his research, and I have learned a lot from his approach of conducting research. I also thank Professors Pravin Varaiya and Henry Helson for serving on my thesis committee. I thank Professor David Messerschmitt for his advice; I have learned from him both in the classroom as well as through his insightful and humorous "when I was at Bell Labs" stories at our Friday afternoon post-seminar get-togethers.
181. …rced at run time. We refer to this sequence of bus accesses by the term bus access order list. Since the bus access order list is program dependent, the controller must possess memory into which this list is downloaded after the scheduling and code generation steps are completed, and when the transaction order that needs to be enforced is determined. The controller must step through the access order list, and must loop back to the first processor ID in the list when it reaches the end. In addition, the controller must be designed to effectively use the bus arbitration logic present on-chip, to conserve hardware.

3.4.2.1 Processor bus arbitration signals

We use the bus grant (BG) signal on the DSP chip to allow a processor to perform a shared bus access, and we use the bus request (BR) signal to tell the controller when a processor completes its shared bus access. Each of the two ports on the DSP96002 has its own set of arbitration signals; the BG and BR signals are the most relevant for our purposes, and these signals are relevant only for the processor port connected to the shared bus. As the name suggests, the BG line, which is an input to the processor, must be asserted before a processor can begin a bus cycle; the processor is forced to wait for BG to be asserted before it can proceed with the instruction that requires access to the bus. Whenever an external bus cycle needs to be performed, a processor asserts its BR signal, and this s…
182. …re able to use this idea to determine an upper bound for E(T); however, this bound also proved to be too loose in general, hence we omit the details of this construction here.

4.6.3 Implications for the OT schedule

Intuitively, an OT schedule is more sensitive to variations in execution times: even though the computations performed using the OT schedule are robust with respect to execution time variations (the transaction order ensures correct sender-receiver synchronization), the ordering restriction makes the iteration period more dependent on execution time variations than the ideal ST schedule. This is apparent from our IPC graph model: the transaction ordering constraints add additional edges E_OT to G_ipc. The IPC graph with transaction ordering constraints (represented as dashed arrows) is shown in Fig. 4.9; we use the transaction order O = (s1, r1, s3, r3, s2, r2, s4, r4, s6, r6, s5, r5) determined in section 4.5, and again communication times are not included. The curve of the iteration period as a function of the execution time variation is now different, and is plotted in Fig. 4.8. Note that the curve for the OT schedule (solid) is above the corresponding curve for the unconstrained ST schedule (dashed); this shows precisely what we mean by an OT schedule being more sensitive to variations in the execution times of actors. The optimal transaction order O we determined ensures that the transaction constraints do not sacrifice throughput (ensures T_OT = T_ST) when actor…
183. …reducing synchronization overhead is removal of redundant synchronization edges from the synchronization graph, i.e., finding a minimal set of edges that need explicit synchronization. Formally, a synchronization edge is redundant in a synchronization graph G if its removal yields a synchronization graph that preserves G. Equivalently, from Definition 5.1, a synchronization edge e is redundant in the synchronization graph G if there is a path p, p not equal to (e), in G directed from src(e) to snk(e), such that Delay(p) ≤ delay(e). Thus, the synchronization function associated with a redundant synchronization edge comes for free, as a by-product of other synchronizations. Fig. 5.4 shows an example of a redundant synchronization edge: here, before executing actor D, the processor that executes {A, B, C, D} does not need to synchronize with the processor that executes {E, F, G, H}, because, due to the synchronization edge x1, the corresponding invocation of F is guaranteed to complete before each invocation of D is begun. Thus, x2 is redundant in Fig. 5.4 and can be removed from the set of synchronization edges.

[Figure 5.4: x2 is an example of a redundant synchronization edge. Dashed arrows denote synchronization edges; solid arrows denote internal edges.]

It is easily verified that the path p = ((F, G), (G, H), x1, (B, C), (C, D)) is directed from src(x2) to snk(x2), and has a path delay of zero, which is equal to the delay on x2. In this section we develop an efficient algorithm to optimally remove…
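The per-edge redundancy test defined above (a path other than e itself from src(e) to snk(e) with total delay at most delay(e)) can be sketched as follows. The graph encoding is hypothetical, and for simplicity the candidate paths avoid e entirely; a full implementation along the lines of RemoveRedundantSynchs must also account for the interaction between removals.

    import heapq

    def min_delay(edges, src, dst, skip):
        """Minimum-delay path from src to dst, ignoring edge 'skip';
        edges: {(u, v): delay}. Dijkstra on non-negative delays."""
        adj = {}
        for (u, v), d in edges.items():
            if (u, v) != skip:
                adj.setdefault(u, []).append((v, d))
        dist, heap = {src: 0}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                return d
            for v, w in adj.get(u, []):
                if d + w < dist.get(v, float("inf")):
                    dist[v] = d + w
                    heapq.heappush(heap, (d + w, v))
        return float("inf")

    def is_redundant(edges, e):
        (u, v) = e
        return min_delay(edges, u, v, skip=e) <= edges[e]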
184. …redundant synchronization edges. Thus, RemoveRedundantSynchs reduces the number of synchronizations from 8 down to 3, a reduction of 62%. Fig. 5.7 shows the synchronization graph of Fig. 5.6(d) after all redundant synchronization edges are removed. It is easily verified that the synchronization edges that remain in this graph are not redundant; explicit synchronizations need only be implemented for these edges.

5.7 Making the synchronization graph strongly connected

In section 5.4.1 we defined two different synchronization protocols: bounded buffer synchronization (BBS), which has a cost of 2 synchronization accesses per iteration period, and can be used whenever the associated edge is contained in a strongly connected component of the synchronization graph; and unbounded buffer synchronization (UBS), which has a cost of 4 synchronization accesses per iteration period. We pay the additional overhead of UBS whenever the associated edge is a feedforward edge of the synchronization graph.

One alternative to implementing UBS for a feedforward edge e is to add synchronization edges to the synchronization graph so that e becomes encapsulated in a strongly connected component; such a transformation would allow e to be implemented with BBS. However, extra synchronization accesses will be required to implement the new synchronization edges that are inserted. In this section we s…
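The conversion this section develops can be sketched, at the level of the underlying idea, as follows (a hypothetical illustration, not the thesis algorithm verbatim; networkx is assumed available for the SCC computation, and the graph is assumed to have more than one SCC): chain the source SCCs together, chain the sink SCCs together, and add one edge from a sink SCC back to a source SCC, so that every feedforward edge lands inside a strongly connected graph and can use BBS instead of UBS.

    import networkx as nx

    def convert_to_sc(g):
        cond = nx.condensation(g)                      # DAG of SCCs
        sources = [n for n in cond if cond.in_degree(n) == 0]
        sinks = [n for n in cond if cond.out_degree(n) == 0]
        pick = lambda scc: next(iter(cond.nodes[scc]["members"]))
        new_edges = []
        for a, b in zip(sources, sources[1:]):         # chain source SCCs
            new_edges.append((pick(a), pick(b)))
        for a, b in zip(sinks, sinks[1:]):             # chain sink SCCs
            new_edges.append((pick(a), pick(b)))
        new_edges.append((pick(sinks[-1]), pick(sources[0])))  # sink -> source
        g.add_edges_from(new_edges)
        return new_edges

Note that the number of added edges is (n_src − 1) + (n_snk − 1) + 1 = n_src + n_snk − 1, consistent with the cost accounting of Theorem 5.3.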
185. …removed, that results from a four-processor schedule of a synthesizer for plucked-string musical instruments in seven voices, based on the Karplus-Strong technique. This algorithm was also discussed in Chapter 3, as an example application that was implemented on the ordered memory access architecture prototype. This graph contains n = 6 synchronization edges (the dashed edges), all of which are feedforward edges, so the synchronization cost is 4n = 24 synchronization accesses per iteration period. Since the graph has one source SCC and one sink SCC, only one edge is added by Convert-to-SC-graph, and adding this edge reduces the synchronization cost to 2n + 2 = 14, a 42% savings. Fig. 5.11 shows the topology of a possible solution computed by Convert-to-SC-graph on this example; here, the dashed edges represent the synchronization edges in the synchronization graph returned by Convert-to-SC-graph.

[Figure 5.10: The synchronization graph, after redundant synchronization edges are removed, induced by a four-processor schedule of a music synthesizer based on the Karplus-Strong algorithm. Dashed arrows denote synchronization edges; solid arrows denote internal edges.]

5.7.2 Insertion of delays

One issue remains to be addressed in the conversion of a synchronization graph G into a strongly connected graph G': the proper insertion of delays, so that G' is not deadlocked, and does not have lower estimated throughput than G. The potential for deadlock and reduc…
186. …rithm. An interesting architecture in this context is the GF11 machine built at IBM [Beet85]. The GF11 is an SIMD machine in which processors are interconnected using a Benes network, which allows the GF11 to support a variety of different interprocessor communication topologies, rather than the fixed topology (e.g., a mesh or a ring) that most systolic arrays support. Benes networks are non-blocking; i.e., they can provide connections from all the network inputs to the network outputs simultaneously, according to any specified permutation. These networks achieve the functional capability of a full crossbar switch with much simpler hardware. The drawback, however, is that in a Benes network, computing the switch settings needed to achieve a particular permutation involves a somewhat complex algorithm [Leigh92]. In the GF11, this problem is solved by precomputing the switch settings based on the program to be executed on the array, and reconfiguring the Benes network at runtime based on these predetermined switch settings.

The OT strategy is along similar lines as the static network configuration strategy employed in the GF11. The GF11 strategy, however, applies to an FS scheduling scenario: the communication network and the processors run synchronously, and the compiler must determine what the network and the processors do at each clock cycle. As we discussed earlier, this restriction narrows down the appli…

…In Chapter 6 we discuss ideas for extendi…
187. …rocessor onto the cable connected to the OMA (see Fig. 3.9). In addition, the serial I/O port (the SSI port) is also brought out, for interfacing with I/O devices such as A/D and D/A convertors. By making the DSP56000 write to appropriate memory locations, the 5 bits of address and 16 bits of data going into the OMA may be set and strobed for a read or a write to or from the OMA board. In other words, the OMA board occupies certain locations in the DSP56000 memory map; host communication is done by reading and writing to these memory locations.

3.4.4 Processing element

Each processing element (PE) consists of a DSP96002 processor, local memory, address buffers, a local address decoder, and some address decoding logic. The circuitry of each processing element is very similar to the design of the Motorola 96000 ADS (Application Development System) board [Moto90]. The local address, control, and data busses are brought out into a 96-pin euroconnector, following the format of the 96ADS. This connector can be used for local memory expansion; we have used it for providing a local I/O interface to the processing element, as an alternative to using the shared bus for I/O. Port A of the processor forms the local bus, connecting to local memory and the address decoding PAL. Each PE also contains address buffers and logic to set up the bootup mode upon reset and powerup. Port B of the processor is connected to the shared bus.
188. …rocessors that the ST schedule imposes is due to data dependencies: each processor executes the actors assigned to it (including the communication actors) according to the compile-time-determined order. An actor at the head of this ordered list is executed as soon as data is available for it. Any other schedule that maintains the same processor assignment and actor ordering, and respects data precedences in G, cannot result in an execution where actors fire earlier than they do in the idealized ST schedule. In particular, the overlap of successive iterations of the HSDFG in the idealized ST schedule ensures that T_ST ≤ T_FS in general.

The ST schedule allows reordering among IPCs at run time. In fact, we observe from Fig. 4.2 that once the ST schedule settles into a periodic pattern, IPCs in successive iterations are ordered differently.

[Figure 4.2: Self-timed schedule (Gantt chart of the actors on processors Proc 1 through Proc 5).]

In the first iteration, the order in which IPCs occur is indeed (s1, r1, s2, r2, s3, r3, s4, r4, s5, r5, s6, r6). Once the schedule settles into a periodic pattern, the order alternates between (s3, r3, s1, r1, s2, r2, s4, r4, s6, r6, s5, r5) and (s1, r1, s3, r3, s4, r4, s2, r2, s5, r5, s6, r6). In contrast, if we impose the transaction order in Fig. 3.1, which enforces the order (s1, r1, s2, r2, s3, r3, s4, r4, …
189. …rom the S-56X host. This configuration allows the host to read and write from any of the HI ports of the processors, and also to access the schedule memory and the shared memory on board. The run-time configuration on the Xilinx consists of the transaction controller, implemented as a presettable counter. The counter can be preset through the shared bus. It addresses an external fast RAM (8 nanosecond access time) that contains the processor IDs corresponding to the bus access schedule. The output from the schedule memory is externally latched and decoded to yield the bus grant (BG) lines (Fig. 3.7). A schematic of the Xilinx configuration at run time is given in Fig. 3.11. This configuration is for I/O with an S-56X (16-bit data) host, although it can easily be modified to work with a 32-bit host.

[Figure 3.11: Xilinx configuration at run time. The schematic shows the transaction controller and I/O interface: the shared address and data busses; the Tx and Rx registers with their RxEmpty/TxFull flags; the 16-bit presettable counter; the schedule RAM; the BG latch and decode logic yielding BG0-BG3; and the host address and data busses.]

3.4.5.1 I/O interface

The S-56X board reads data from the Transmit (Tx) register and writes into the Receive (Rx) register on the Xilinx. These registers are memory-mapped to the shared bus, such that an…
190. …rs. Clusters of processors may be connected onto a common bus, or may form a linear array with neighbor-to-neighbor communication. This allows the multiprocessor to be reconfigured depending on the communication requirements of the particular application being mapped onto it. Scheduling and code generation for this machine is done by the McDAS compiler [Hoang93].

The DSP3 multiprocessor [Shive92] was built at AT&T, and is comprised of DSP32C processors connected in a mesh configuration. The mesh interconnect is implemented using custom VLSI components for data routing. Each processor communicates with four of its adjacent neighbors through this router, which consists of input and output queues and a crossbar that is configurable under program control. Data packets contain headers that indicate the ID of the destination processor.

The Ring Array Processor (RAP) system [Morg92] uses TI DSP320C30 processors connected in a ring topology. This system is designed specifically for …

…image processing applications that this part has been designed for, the order in which each processor executes instructions assigned to it can potentially be fixed at compile time without loss in parallelism. Systolic arrays normally employ a fully static strategy. As we discussed before, RIA specifications and the primarily SIMD approach used in systolic array mapping techniques restrict their domain of applicability. The approach taken by these techniques is…
191. …s, and how do we show that the limit is in fact the same for all actors, assuming that the HSDFG is strongly connected? These questions are fairly non-trivial, because the random process start(v, k) may not even be stationary. These questions are answered rigorously in [Bacc92], where it is shown that

    lim (k -> infinity) start(v, k) / k = T, for all v in V, almost surely.

Thus, the limit T is in fact a constant, almost surely. The problem with such exact analysis, however, is the very large state space that can result. We found that for an IPC graph similar to Fig. 4.4, with certain choices of execution times, and assuming that only one execution time is random (taking two different values based on a weighted coin flip), we could get several thousand states for the Markov chain; a graph with more vertices leads to an even larger state space. The upper bound on the size of the state space is exponential in the number of vertices (exponential in |V|). Solving for the stationary distribution of such Markov chains would require solving a set of linear equations equal in number to the number of states, and is highly compute-intensive. Thus we conclude that this approach has limited use in determining the effects of varying execution times: even for unrealistically simple stochastic models, computation of exact solutions is prohibitive. If we assume that all actors have exponentially distributed execution times, then the system can be analyzed using continuous-time Markov chains [Moll82]…
192. …s an HSDFG, and its execution models the execution of the corresponding self-timed schedule. The following definitions are useful to formally state the constraints represented by the IPC graph. Time is modelled as an integer that can be viewed as a multiple of a base clock. Recall that the function start(v, k) represents the time at which the k-th execution of actor v starts in the self-timed schedule, and the function end(v, k) represents the time at which the k-th execution of the actor v ends (and v produces data tokens at its output edges); we set start(v, k) = 0 and end(v, k) = 0 for k < 0 as the initial conditions. The start(v, 0) values are specified by the schedule. Recall from Definition 2.2 that, as per the semantics of an HSDFG, each edge (v_i, v_j) of G_ipc represents the following data dependence constraint:

    start(v_j, k) >= end(v_i, k - delay((v_i, v_j))),
    for all (v_i, v_j) in E_ipc, for all k >= delay((v_i, v_j)).    (4.1)

The constraints in Eqn. 4.1 are due both to communication edges (representing synchronization between processors) and to edges that represent the sequential execution of actors assigned to the same processor. Also, to model the execution times of actors, we associate an execution time t(v) with each vertex of the IPC graph; t(v) assigns a positive integer execution time to each actor v (which can be interpreted as t(v) cycles of a base clock). Inter-processor communication costs can be represent…
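Eqn. 4.1 can be simulated directly: the k-th firing of an actor starts as soon as, for every input edge, the (k − delay)-th firing of the source actor has ended. A minimal sketch (hypothetical encoding; self-loop edges with one delay model the sequential execution of each actor, and the actor order is assumed consistent with the zero-delay edges):

    def simulate(edges, t, order, iterations):
        """edges: {(u, v): delay}; t: {actor: execution time};
        order: actors listed consistently with zero-delay edges."""
        start = {}

        def end(v, k):
            return 0 if k < 0 else start[(v, k)] + t[v]

        for k in range(iterations):
            for v in order:
                preds = [(u, d) for (u, w), d in edges.items() if w == v]
                start[(v, k)] = max([end(u, k - d) for u, d in preds] + [0])
        return start

    edges = {("A", "B"): 0, ("B", "A"): 2, ("A", "A"): 1, ("B", "B"): 1}
    s = simulate(edges, {"A": 3, "B": 4}, ["A", "B"], 4)
    print(s[("B", 3)] - s[("B", 2)])   # 4: the schedule settles to its period

The observed per-iteration spacing converges to the maximum cycle mean of the graph, which is exactly the asymptotic iteration period discussed in this chapter.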
193. …s, and due to probabilistic memory access speedup due to cache hits. In signal processing and other real-time applications, however, there is a stringent requirement for deterministic performance guarantees, as opposed to probabilistic speedup. In fact, the unpredictability in task execution times introduced by the use of caches may be a disadvantage for static scheduling techniques that utilize compile-time estimates of task execution times to make scheduling decisions (recall the discussion in section 1.2.3 on techniques for estimating task execution times). In addition, due to the deterministic nature of most signal processing problems (and also many scientific computation problems), shared data can be deterministically prefetched, because information about when particular blocks of data are required by a particular processor can often be predicted by a compiler. This feature has been studied in [Mouss92], where the authors propose memory allocation schemes that exploit the predictability of the memory access patterns in DSP algorithms; such a smart allocation scheme alleviates some of the memory bandwidth problems associated with high-throughput applications.

Processors with caches can cache semaphores locally, so that busy-waiting can be done locally to the processor, without having to access the shared bus, hence saving the bus bandwidth normally expended on semaphore checks. Such a procedure, however, requires special hardware…
194. …s must be identified in a given dataflow graph, either manually or automatically, before Ha's techniques can be applied. These techniques make the simplifying assumptions that the control tokens for different dynamic actors are independent of one another, and that each control stream consists of tokens that take TRUE or FALSE values randomly, and are independent and identically distributed (i.i.d.) according to statistics known at compile time. Such a quasi-static scheduling approach clearly does not handle a general BDF graph, although it is a good starting point for doing so.

Ha's quasi-static approach constructs a blocked schedule for one iteration of the dataflow graph. The dynamic constructs are scheduled in a hierarchical fashion: each dynamic construct is scheduled on a certain number of processors, and is then converted into a single node in the graph and assigned a certain execution profile. A profile of a dynamic construct consists of the number of processors assigned to it and the schedule of that construct on the assigned processors; the profile essentially defines the shape that a dynamic actor takes in the processor-time plane. When scheduling the remainder of the graph, the dynamic construct is treated as an atomic block, and its profile is used to determine how to schedule the remaining actors around it; the profile helps in tiling actors in the processor-time plane, with the objective of minimizing the overall schedule len…
195. …s to be executed immediately after v (this ensures that the processor assignment constraints are respected), then there is an edge (v, v') in G_ipc. The relations in Eqn. 4.6 represent a system of |E_ipc| inequalities in |V| unknowns (the quantities sigma(v)). The system of inequalities in Eqn. 4.6 is a difference constraint problem, which can be solved in polynomial time (O(|E_ipc| |V|)) using the Bellman-Ford shortest path algorithm [Law76, Corm92]. The details of this approach are well described in [Corm92]; the essence of it is to construct a constraint graph that has one vertex for each unknown sigma(v). Each difference constraint is then represented by an edge between the vertices corresponding to the unknowns, and the weight on that edge is set equal to the right-hand side of the difference constraint. A dummy vertex is added to the constraint graph, and zero-weight edges are added from the dummy vertex to each of the remaining vertices in the constraint graph. Then, setting the value of sigma(v) to be the weight of the shortest path from the dummy vertex to the vertex that corresponds to sigma(v) in the constraint graph results in a solution to the system of inequalities, if indeed a solution exists. A feasible solution exists if and only if the constraint graph does not contain a negative-weight cycle [Corm92], which is equivalent to the following condition:

    (sum over v on C of t(v)) / Delay(C) <= T_FS, for every cycle C in G_ipc…
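The constraint-graph construction just described is short enough to state executably. A minimal sketch (hypothetical encoding; the dummy source is modeled by initializing all distances to zero, which is equivalent to its zero-weight edges):

    def solve_difference_constraints(unknowns, constraints):
        """constraints: list of (x, y, w), meaning sigma(x) - sigma(y) <= w.
        Returns {v: sigma(v)}, or None if infeasible (negative cycle)."""
        dist = {v: 0 for v in unknowns}
        edges = [(y, x, w) for (x, y, w) in constraints]   # edge y -> x, weight w
        for _ in range(len(unknowns) - 1):                 # Bellman-Ford passes
            for u, v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        if any(dist[u] + w < dist[v] for u, v, w in edges):
            return None                                    # negative-weight cycle
        return dist

    # sigma(B) - sigma(A) <= 2 and sigma(A) - sigma(B) <= -1:
    print(solve_difference_constraints(["A", "B"],
                                       [("B", "A", 2), ("A", "B", -1)]))
    # {'A': -1, 'B': 0}, which satisfies both constraints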
196. …s to processors according to this list, granting access to the next processor in the list when the current bus owner releases the bus. Such a mechanism is robust with respect to variations in the execution times of the actors: the functionality of the system is unaffected by poor estimates of these execution times, although the real-time performance obviously suffers. If we are able to perform accurate compile-time analysis, then each processor obtains access to the shared bus whenever it needs it. No arbitration needs to be done, since there is no contention for the bus. In addition, no semaphore synchronization needs to be performed, because the transaction ordering constraints respect the data precedences in the algorithm: when a processor accesses a shared memory location, and is correspondingly allowed access to it, the data accessed by that processor is certain to be valid. As a result, in the ideal scenario, a shared bus access takes no more than a single read or write cycle on the processor, and the overall cost of communicating one data sample is two or three instruction cycles.

The performance of this scheme depends on how accurately the execution times of the actors are known at compile time. If these compile-time estimates are reasonably accurate, then an access order can be obtained such that a processor gains access to shared memory whenever it needs to. Otherwise, a processor may have to idle until it gets a bus grant, or…
197. …schedule evolves in a self-timed manner in Fig. 4.2, it eventually settles into a periodic repeating pattern that spans two iterations of the dataflow graph, and the average iteration period T_ST is 9. We would like to determine these properties of self-timed schedules analytically.

4.1 Inter-processor communication graph (G_ipc)

In a self-timed strategy, a schedule S specifies the actors assigned to each processor (including the IPC actors send and receive), and specifies the order in which these actors must be executed. At run time, each processor executes the actors assigned to it in the prescribed order. When a processor executes a send, it writes into a certain buffer of finite size, and when it executes a receive, it reads from a corresponding buffer; it checks for buffer overflow on a send and for buffer underflow on a receive, and it blocks (suspends execution) when it detects one of these conditions.

We model a self-timed schedule using an HSDFG G_ipc = (V, E_ipc), derived from the original SDF graph G = (V, E) and the given self-timed schedule. The graph G_ipc, which we will refer to as the inter-processor communication modelling graph (or IPC graph, for short), models the fact that actors of G assigned to the same processor execute sequentially, and it models constraints due to inter…
198. …3.7.1 Music Synthesis … 73
3.7.2 QMF filter banks … 75
3.7.3 1024-point complex FFT … 76
3.8 Summary … 78
4 AN ANALYSIS OF THE OT STRATEGY … 79
4.1 Inter-processor Communication graph G_ipc … 82
4.2 Execution time estimates … 88
4.3 Ordering constraints viewed as edges added to G_ipc … 89
4.4 Periodicity … 90
4.5 Optimal order … 92
4.6 Effects of changes in execution times … 96
4.6.1 Deterministic case … 97
4.6.2 Modeling run-time variations in execution times … 99
4.6.3 Implications for the OT schedule … 104
4.7 Summary … 106
5 MINIMIZING SYNCHRONIZATION COSTS IN SELF-TIMED SCHEDULES … 107
5.1 Related work … 108
5.2 Analysis of self-timed execution … 112
5.2.1 Estimated throughput … 114
5.3 Strongly connected components and buffer size bounds …
199. …sks are assigned priorities, and among the tasks that are ready to run at any instant, the task with the highest priority is executed first. Various researchers have proposed different priority mechanisms for list scheduling [Adam74], some of which use critical path (CPM) based methods [Ram72, Koh75, Blaz87]; [Blaz87] summarizes a large number of CPM-based heuristics for scheduling.

…wonder whether overlapped schedules are fundamentally superior to blocked schedules with the unfolding and retiming operations allowed. This question is settled in the affirmative by Parhi and Messerschmitt [Parhi91]: the authors provide an example of an HSDFG for which no blocked schedule can be found, even allowing unfolding and retiming, that has a lower or equal iteration period than the overlapped schedule they propose. Optimal resource-constrained overlapped scheduling is of course NP-hard, although a periodic overlapped schedule in the absence of processor constraints can be computed efficiently and optimally [Parhi91, Gasp92]. Overlapped scheduling heuristics have not been as extensively studied as blocked schedules. The main work in this area is by Lam [Lam88] and deGroot [deGroot92], who propose a modified list scheduling heuristic that explicitly constructs an overlapped schedule. Another work related to overlapped scheduling is the cyclo-static scheduling approach proposed by Schwartz; this approach attempts to optimally tile the processor-tim…
200. …son, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, New York, 1979.
[Gasp92] F. Gasperoni and U. Schwiegelshohn, "Scheduling Loops on Parallel Processors: A Simple Algorithm with Close to Optimum Performance," International Conference on Vector & Parallel Processors, September 1992.
[Gau91] J. Gaudiot and L. Bic, Advanced Topics in Data-flow Computing, Prentice Hall, Englewood Cliffs, New Jersey, 1991.
[Gov94] R. Govindarajan, G. R. Gao, and P. Desai, "Minimizing Memory Requirements in Rate-Optimal Schedules," Proceedings of the International Conference on Application Specific Array Processors, San Francisco, August 1994.
[Grin88] C. M. Grinstead, "Cycle Lengths in A^k b," SIAM Journal on Matrix Analysis, October 1988.
[Ha92] S. Ha, Compile-Time Scheduling of Dataflow Program Graphs with Dynamic Constructs, Ph.D. Thesis, Memorandum No. UCB/ERL M92/43, April 1992.
[Hal91] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud, "The Synchronous Data Flow Programming Language LUSTRE," Proceedings of the IEEE, September 1991.
[Hal93] N. Halbwachs, Synchronous Programming of Reactive Systems, Kluwer Academic Publishers, 1993.
[Hav91] B. R. Haverkort, "Approximate Performability Analysis Using Generalized Stochastic Petri Nets," Proceedings of the Fourth International Workshop on Petri Nets and Performance Models, Melbourne, Australia, 1991, pp. 176–…
201. …sor attempts to perform a shared memory access when it executes a communication actor (either send or receive). If its BG line is asserted, it performs the access; otherwise it stalls and waits for the assertion. After a processor obtains access to the shared bus, it performs the shared memory operation and releases the bus. The transaction controller detects the release of the bus, and steps through its ordered list, granting the bus to the next processor in its list. The cost of transfer of one word of data between processors is 3 instruction cycles in the ideal case: two of these correspond to a shared memory write by the sender and a shared memory read by the receiver, and an extra instruction cycle is expended in bus release by the sender and bus acquisition by the receiver. Thus, for the example of Chapter 1, less than 1% of the available 380 instructions per sample are required per transaction. This is, of course, in the ideal scenario, where the sender and the receiver obtain access to the shared bus upon request. Such low-overhead interprocessor communication is obtained with the transaction controller providing the only additional hardware support.

[Figure 3.2: Block diagram of the OMA prototype: DSP96002 processors with local memories on a shared bus; the transaction controller, driven by schedule information, provides the bus grant lines and senses bus release.]

As described in a subsequent section, this…
202. …sors. Programmable digital signal processing chips tend to have special instructions, such as a single-cycle multiply-accumulate (for filtering functions), modulo addressing (for managing delay lines), and bit-reversed addressing (for FFT computation). DSP chips also contain built-in parallel functional units that are controlled from fields in the instruction, such as parallel moves from memory to registers combined with an ALU operation. It is difficult for automatic compilers to optimally exploit these features: executable code generated by commercially available compilers today utilizes one and a half to two times the program memory that a corresponding hand-optimized program requires, and results in two to three times higher execution time compared to hand-optimized code [Zivo95]. There has been some recent work on compilation techniques for embedded software targeted towards DSP processors and microcontrollers [Liao95]; it is still too early to determine the impact of these techniques on automatic compilation for large-scale DSP applications, however.

Block diagram languages based on models such as SDF have proven to be a bridge between automatic compilation and hand-coding approaches: a library of reusable blocks in a particular programming language is hand-coded, and this library then constitutes the set of atomic SDF actors. Since the library blocks are reusable, one can afford to carefully optimize and fine-tune them. The atomic bl…
203. …sors relative to one another, in order to ensure correct sender-receiver synchronization. Multiprocessor designs such as the Warp array [Ann87, Lam88] and the 2-D MIMD (Multiple Instruction Multiple Data) array of [Ziss87], that could potentially use fully static scheduling, still choose to implement such run-time flow control, at the expense of additional hardware, for the resulting software simplicity. Lam presents an interesting discussion of the trade-off involved between hardware complexity and ease of compilation that ensues when we consider dynamic flow control implemented in hardware versus static flow control enforced by a compiler (pp. 50-68 of [Lam89]).

1.2.3 Execution time estimates and static schedules

We assume we have reasonably good estimates of actor execution times available to us at compile time, to enable us to exploit static scheduling techniques; however, these estimates need not be exact, and execution times of actors may even be data-dependent. Thus, we allow actors that have different execution times from one iteration of the HSDFG to the next, as long as these variations are small or rare. This is typically the case when estimates are available for the task execution times, and actual execution times are close to the corresponding estimates with high probability, but deviations from the estimates of effectively arbitrary…

…ing; in fact, estimation turns out to be the most time-consuming step in the GRAPE d…
204. sseeeeseresseeessresseesse 157 6 2 4 Generating the annotated bus access list eee 161 6 3 Data dependent iteration sesssessessseeeseeeesseessesseesserssseeesseesseese 164 6 4 SUMAN ALY oorsee iite ian rE S EE A A E E R E 165 7 CONCLUSIONS AND FUTURE DIRECTIONG ccccssssssssssssessssseees 8 REFERENCES 00oooooooooo000000000000000000000000000000000000000000000000000000000000000000000000000000000 vi Figure 1 1 Figure 1 2 Figure 1 3 Figure 3 1 Figure 3 2 Figure 3 3 Figure 3 4 Figure 3 5 Figure 3 6 Figure 3 7 Figure 3 8 Figure 3 9 Figure 3 10 Figure 3 11 Figure 3 12 Figure 3 13 Figure 3 14 Figure 3 15 Figure 3 16 Figure 3 17 Figure 3 18 List of Figures Pully static schedule 4 ci020 ceo 2 eeu Cee Sea eel Ce 16 Fully static schedule on five processors ssseseseeeeeseeeseeeseeeeees 17 Steps in a self timed scheduling strategy 0 eeeceeeseceesseeeenteeeeees 20 One possible transaction order derived from the fully static schedule EAE E Gases otto S aarti as cater tania eee asl ek oh 41 Block diagram of the OMA prototype ceccceeescceesseeeesteeeeneeeees 49 Modified De SPOT ecinic a aeataeses 50 Details of the TA line mechanism only one processor is shown Ree tre TEN E E eet try rr Poe RTT rere SCT ree reer S 51 Top level schematic of the OMA prototype ccccceseeesteeeeteeees 54 Using processor bus arbitration signals for contro
205. static scheduling techniques viable Eliminating dynamic scheduling results in much simpler hardware without an undue performance penalty Another example of an application specific dataflow architecture is the NEC uPD7281 Chase84 which is a single chip processor geared towards image processing Each chip contains one functional unit multiple such chips can be connected together to execute programs in a pipelined fashion The actors are stat ically signed to a given processor are igned to each processor and actors 24 scheduled on it dynamically The primitives that this chip supports convolution bit manipulations accumulation etc are specifically designed for image process ing applications 1 3 2 Systolic and wavefront arrays Systolic arrays consist of processors that are locally connected and may be arranged in different topologies mesh ring torus etc The term systolic arises because all processors in such a machine run in lock step alternating between a computation step and a communication step The model followed is usually SIMD Single Instruction Multiple Data Systolic arrays can execute a certain class of problems that can be specified as Regular Iterative Algorithms RIA Rao85 systematic techniques exist for mapping an algorithm specified in a RIA form onto dedicated processor arrays in an optimal fashion Optimality includes metrics such as processor and communication link utilization scal
206. straints for all the edges that were directed from v to vje Thus a general HSDF graph may be prepro cessed into a form where the source and sink vertices uniquely identify an edge in the graph and we may represent an edge ee E by the ordered pair src e snk e The multiple edges are taken into account individually when buffers are assigned to the arcs in the graph As we discussed in section 1 2 1 in some cases i s advantageous to unfold a graph by a certain unfolding factor say u and schedule u iterations of the graph together in order to exploit inter iteration parallelism more effectively The unfold ed graph contains u copies of each actor of the original graph In this case and 6 are defined for all the vertices of the unfolded graph i e and o are defined for u invocations of each actor T pg is the iteration period for the unfolded graph 37 T and the average iteration period for the original graph is then ze In the remainder of this thesis we assume we are dealing with the unfolded graph and we refer only to the iteration period and throughput of the unfolded graph if unfolding is in fact employed with the understanding that these quantities can be scaled by the unfold ing factor to obtain the corresponding quantities for the original graph In a self timed scheduling strategy we determine a fully static schedule to v v Tps using the execution time estimates but we only r
207. t algorithms and techniques that reduce the rate at which processors must access shared memory for the purpose of synchronization in multiprocessor implementations of SDF programs One of the procedures we present for example detects when the objective of one synchronization operation is guaranteed as a side effect of other synchronizations in the system thus enabling us to eliminate such superfluous synchronization operations The optimization pro cedure that we propose can be used as a post processing step in any static schedul ing technique any one of the techniques presented in Chapter 1 section 1 2 for reducing synchronization costs in the final implementation As before we assume that good estimates are available for the execution times of actors and that these execution times rarely display large variations so that self timed scheduling is via ble for the applications under consideration If additional timing information is available such as guaranteed upper and lower bounds on the execution times of actors it is possible to use this information to further optimize synchronizations in the schedule However use of such timing bounds will be left as future work we mention this again in Chapter 7 This chapter is a part of ongoing research in collaboration with Dr Shuvra Bhattacharyya who is a Research Scientist at Hitachi America Ltd 5 1 Related work Among the prior art that is most relevant to this chapter is the barrier
208. t in Eqn 4 3 is called the cycle mean of the cycle C The entire quantity on the right hand side of Eqn 4 3 is called the maximum cycle mean of the strongly connected IPC graph G If the IPC graph contains more than one SCC then different SCCs may have different asymptotic iteration periods depending on their individual maximum cycle means In such a case the iteration period of the overall graph and hence the self timed schedule is the maximum over the maxi mum cycle means of all the SCCs of G because the execution of the schedule is constrained by the slowest component in the system Henceforth we will define the maximum cycle mean as follows Definition 4 3 The maximum cycle mean of an IPC graph G denoted by MCM Gipc is the maximal cycle mean over all strongly connected components 87 of G That is ipc fie l a MCM G _ v is on C Gipo cycle C in Gj Delay C Note that MCM G may be a non integer rational quantity We will use the term MCM instead of MCM G when the graph being referred to is clear from the context A fundamental cycle in Gipo whose cycle mean is equal to MCM is called a critical cycle of G Thus the throughput of the system of processors executing value a particular self timed schedule is equal to the corresponding a5 For example in Figure 4 4 Gip has one SCC and its maximal cycle mean is 7 time units This corresponds to the critical cycle B E
…is not contained in a cycle; an edge that is contained in at least one cycle is called a feedback edge.

Recall that a fully static schedule specifies a processor assignment, an actor ordering on each processor, and also the precise firing times of actors. We use the following notation for a fully static schedule.

Definition 2.1: A fully static schedule S for P processors specifies a triple S = {σ(v), t(v), T_FS}, where σ: V → {1, 2, …, P} is the processor assignment and T_FS is the iteration period. An FS schedule specifies the firing times start(v, k) of all actors, and since we want a finite representation for an infinite schedule, an FS schedule is constrained to be periodic:

$$start(v, k) = t(v) + k\,T_{FS};$$

t(v) is thus the starting time of the first execution of actor v, i.e. start(v, 0) = t(v). Clearly, the throughput for such a schedule is 1/T_FS. The σ(v) function and the t(v) values are chosen so that all data precedence constraints and resource constraints are met. We define precedence constraints as follows.

Definition 2.2: An edge e = (v_i, v_j) in an HSDFG (V, E) represents the data precedence constraint

$$start(v_j, k) \ge end(v_i, k - \mathrm{delay}(e)), \quad \forall k \ge \mathrm{delay}(e).$$

The above definition arises because each actor consumes one token from each of its input edges when it fires. Since there are already delay(e) tokens on each incoming edge e of actor v_j, the k-th invocation of v_j consumes the token produced by the (k − delay(e))-th invocation of v_i, which yields the above constraint.
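The precedence constraint of Definition 2.2 can be checked mechanically for a candidate fully static schedule. The sketch below (names hypothetical) verifies start(v_j, k) ≥ end(v_i, k − delay(e)) over a window of iterations; since the constraint is periodic, a finite window suffices in practice.

    # Check data precedence constraints for a fully static schedule (sketch).
    # exec_time: actor -> execution time; tv: actor -> first firing time t(v);
    # edges: (src, snk, delay) triples; T_fs: iteration period.
    def fs_schedule_is_valid(exec_time, tv, edges, T_fs, iterations=20):
        start = lambda v, k: tv[v] + k * T_fs
        end = lambda v, k: start(v, k) + exec_time[v]
        for (vi, vj, d) in edges:
            for k in range(d, iterations):
                if start(vj, k) < end(vi, k - d):  # Definition 2.2 violated
                    return False
        return True

    # Example: A feeds B with no delay; B starts as soon as A finishes.
    exec_time = {"A": 2, "B": 3}
    tv = {"A": 0, "B": 2}
    print(fs_schedule_is_valid(exec_time, tv, [("A", "B", 0)], T_fs=5))  # True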
…the data value produced onto e is written into the shared-memory buffer for e at offset wr(e); wr(e) is updated as in BBS, except that its new value is not written to shared memory, and the count in sv(e) is incremented. Just before each execution of snk(e), the value contained in sv(e) is repeatedly examined until it is found to be nonzero; then the data value residing at offset rd(e) of the shared-memory buffer for e is read, the count in sv(e) is decremented, and rd(e) is updated as in BBS.

Note that we are assuming that there is enough shared memory to hold a separate buffer of size B_ff(e) for each feedforward communication edge e of G_ipc, and a separate buffer of size B_fb(e) for each feedback communication edge e. When this assumption does not hold, smaller bounds on some of the buffers must be imposed, possibly for feedback edges as well as for feedforward edges, and in general this may require some sacrifice in estimated throughput. Note that whenever a buffer bound smaller than B_fb(e) is imposed on a feedback edge e, a protocol identical to UBS must be used. The problem of optimally choosing which edges should be subject to stricter buffer bounds when there is a shortage of shared memory, and the selection of these stricter bounds, is an interesting area for further investigation.

5.4.2 The synchronization graph G_s

As we discussed in the beginning of this chapter, some of the communication edges in G_ipc…
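A minimal Python sketch of the UBS handshake follows (threading-based, variable names hypothetical): the writer deposits data and increments the count sv(e); the reader busy-waits on a nonzero count before consuming. The finite buffer here is only for the demo; under UBS the writer never blocks on buffer fullness.

    # Unbounded-buffer synchronization (UBS) sketch with a shared count sv(e).
    import threading

    BUF_SIZE = 4
    buf = [None] * BUF_SIZE          # shared-memory buffer for edge e
    sv = 0                            # count of unread tokens on edge e
    wr = rd = 0                       # write/read offsets wr(e), rd(e)
    lock = threading.Lock()           # stands in for an atomic counter

    def src_write(value):
        global sv, wr
        buf[wr] = value               # deposit data at offset wr(e)
        wr = (wr + 1) % BUF_SIZE      # wr(e) updated locally, not in shared memory
        with lock:
            sv += 1                   # signal: one more unread token

    def snk_read():
        global sv, rd
        while True:                   # busy-wait until sv(e) is nonzero
            with lock:
                if sv > 0:
                    sv -= 1
                    break
        value = buf[rd]
        rd = (rd + 1) % BUF_SIZE
        return value

    src_write(42)
    print(snk_read())  # 42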
…etc. I have enjoyed many useful discussions with some of my friends and colleagues, in particular Alan Kamas (I have to mention his infectious sense of humor), Louis Yun, Wan-teh Chan, Rick Han, William Li, Tom Parks, Jose Pino, Brian Evans, Mike Williamson, Bilung Lee, and Asawaree Kalavade, who have made my innumerable hours in Cory Hall much more fun than they would have been otherwise. I will miss the corridor/elevator discussions, on topics ranging from the weather to Hindu philosophy, with Sriram Krishnan (the other Sriram), Jagesh Sanghavi, Rajeev Murgai, Shankar Narayanaswami (SKI), Angela Chuang, and Premal Buch; and so will I miss the discussions, reminiscences, and retelling of old tales with the sizable gang of graduate students in Berkeley and Stanford with whom I share my alma mater, IIT Kanpur: Vigyan, Adnan, Kumud, Sunil, Amit Narayan, Geetanjali, Sanjay, Vineet, Ramesh, to name a few. While at Berkeley I have met several people who have since become good friends: Juergen Teich, Raghuram Devarakonda, Amit Lal, Amit Marathe, Ramesh Gopalan, Datta Godbole, Satyajit Patwardhan, Aparna Pandey, Amar Kapadia. I thank them all for their excellent company; I have learned a lot from their talents and experiences as well. I also wish to thank my long-time friends Anurag, Ashish, Akshay, Anil, Kumud, Nitin, RD, and Sanjiv; our occasional get-togethers and telephone chats have always provided a welcome relief from the tedium that graduate…
…instantiated by Convert-to-SC-graph between the source SCCs (with a_i representing the i-th edge created), and similarly b_1, b_2, …, b_n denote the sequence of edges instantiated between the sink SCCs, then algorithm DetermineDelays can be applied with the modification that the sequence of edges it processes is (e_s, a_1, …, a_m, b_n, …, b_1), where e_s is the sink-source edge from Convert-to-SC-graph. Further details related to these issues can be found in [Bhat95a].

DetermineDelays and its variations have complexity O(|V| log₂(|V|)) [Bhat95a]. It is also easily verified that the time complexity of DetermineDelays dominates that of Convert-to-SC-graph, so the time complexity of applying Convert-to-SC-graph and DetermineDelays in succession is again O(|V| log₂(|V|)).

Although the issue of deadlock does not explicitly arise in algorithm DetermineDelays, the algorithm does guarantee that the output graph is not deadlocked, assuming that the input graph is not deadlocked. This is because, from Lemma 4.2, deadlock is equivalent to the existence of a cycle that has zero path delay, and is thus equivalent to an infinite maximum cycle mean. Since DetermineDelays does not increase the maximum cycle mean, it follows that the algorithm cannot convert a graph that is not deadlocked into a deadlocked graph.

Converting a mixed-grain HSDFG that contains feedforward edges into a strongly connected graph has been studied by Zivojnovic [Zivo94b]…
…determined at compile time and enforced at run time by a controller implemented in hardware. We built a prototype of this architecture, called the ordered memory access architecture, and demonstrated how we could achieve low-overhead IPC at low hardware cost for the class of DSP applications that can be specified as SDF graphs and for which good compile-time estimates of execution times exist. We also introduced the IPC graph model for modeling self-timed schedules. This model was used to show that we can determine a particular transaction order such that enforcing this order at run time does not sacrifice performance when actual execution times of tasks are close to their compile-time estimates. When actual running times differ from the compile-time estimates, the computation performed is still correct, but the performance may degrade.

…A more elaborate interconnect, such as a crossbar or a mesh topology, may be required. If the processors in such a system run a self-timed schedule, the communication pattern is again periodic, and we can predict this pattern at compile time. We can then determine the states that the crossbar in such a system cycles through, or we can determine the sequence of settings for the switches in the mesh topology. The fact that we can determine this information should make it possible to simplify the hardware associated with these interconnect mechanisms, since the associated switches need not be configured at run time. How exactly this could be done is a topic for further study.
…the Ordered Memory Access architecture, which uses the ordered transactions principle to statically assign the sequence of processor accesses to shared memory. External I/O and user control inputs can also be taken into account when scheduling accesses to the shared bus. We also discussed the software interface details of the prototype and presented some applications that were implemented on the OMA prototype.

AN ANALYSIS OF THE OT STRATEGY

In this chapter we systematically analyze the limits of the OT scheduling strategy. Recall that the ST schedule is obtained by first generating a fully static (FS) schedule {σ(v), t(v), T_FS} and then ignoring the exact firing times specified by the FS schedule; the FS schedule itself is derived using compile-time estimates of actor execution times. The OT strategy is essentially the self-timed strategy with added ordering constraints that force processors to communicate in an order predetermined at compile time. The questions we try to address in this chapter are: What exactly are we sacrificing by imposing such a restriction? Is it possible to choose a transaction order such that this penalty is minimized? What is the effect of variations of task (actor) execution times on the throughput achieved by a self-timed strategy and by an OT strategy?

The effect of imposing a transaction order on a self-timed schedule is best illustrated by the following example. Let us assume that we use t…
…the conditions we established for a synchronization operation to be redundant in section 5.6 holds for all execution paths in the quasi-static schedule, then we could identify redundant synchronization points in the schedule. It may also be possible to extend the strongly-connect and resynchronization transformations to handle graphs containing conditional constructs; these issues require further investigation.

REFERENCES

[Ack82] W. B. Ackerman, "Data Flow Languages," IEEE Computer Magazine, Vol. 15, No. 2, February 1982.

[Adam74] T. L. Adam, K. M. Chandy, and J. R. Dickson, "A Comparison of List Schedules for Parallel Processing Systems," Communications of the ACM, Vol. 17, No. 12, pp. 685-690, December 1974.

[Aik88] A. Aiken and A. Nicolau, "Optimal Loop Parallelization," Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, 1988.

[Alle87] R. Allen and K. Kennedy, "Automatic Transformation of Fortran Programs to Vector Form," ACM Transactions on Programming Languages and Systems, Vol. 9, No. 4, October 1987.

[Ambl92] A. L. Ambler, M. M. Burnett, and B. A. Zimmerman, "Operational Versus Definitional: A Perspective on Programming Paradigms," IEEE Computer Magazine, Vol. 25, No. 9, September 1992.

[Ariel91] User's Manual for the S-56X, Ariel Corporation, Highland Park, New Jersey, 1991.

[Arvi90] Arvind and R. S. Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE Transactions on Computers, Vol. 39, No. 3, March 1990.
…the firing times of transitions in a marked graph are asymptotically periodic. Interpreted in our notation, for any strongly connected HSDFG,

$$\exists K, N \;\text{such that}\; start(v, k+N) = start(v, k) + \mathrm{MCM}(G_{ipc}) \times N, \quad \forall v \in V, \; \forall k > K.$$

Thus, after a transient that lasts K iterations, the ST schedule evolves into a periodic pattern. The periodic pattern itself spans N iterations; we call N the periodicity. The periodicity depends on the number of delays in the critical cycles of G_ipc; it can be as high as the least common multiple of the numbers of delays in the critical cycles of G_ipc [Bacc92]. For example, the IPC graph of Fig. 4.4 has one critical cycle with two delays on it, and thus we see a periodicity of two for the schedule in Fig. 4.2. The transient region defined by K, which is 1 in Fig. 4.2, can also be exponential. The effect of transients followed by a periodic regime is essentially due to properties of longest paths in weighted directed graphs. These effects have been studied in the context of instruction scheduling for VLIW processors [Aik88][Zaky89], as-soon-as-possible firing of transitions in Petri nets [Chre83], and determining clock schedules for sequential logic circuits [Shen92]. In [Aik88] the authors note that if instructions in an iterative program for a VLIW processor (represented as a dependency graph) are scheduled in an as-soon-as-possible fashion, a pattern of parallel instructions emerges after an initial transient…
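This behaviour is easy to observe numerically. The sketch below (hypothetical helper, assuming a deadlock-free graph) simulates self-timed evolution via start(v, k) = max over input edges e of end(src(e), k − delay(e)), then prints start-time increments across a window of N = 2 iterations so the periodic regime is visible.

    # Simulate self-timed execution of an HSDFG and expose its periodicity.
    def self_timed_starts(exec_time, edges, iters):
        start = {v: [0.0] * iters for v in exec_time}
        for k in range(iters):
            changed = True
            while changed:            # fixed point over zero-delay dependencies
                changed = False
                for (u, v, d) in edges:
                    if k - d >= 0:
                        ready = start[u][k - d] + exec_time[u]
                        if ready > start[v][k]:
                            start[v][k] = ready
                            changed = True
        return start

    # One critical cycle with two delays: periodicity N = 2, MCM = 7/2, so
    # start(A, k + 2) - start(A, k) = MCM * N = 7 for all k.
    exec_time = {"A": 3, "B": 4}
    edges = [("A", "B", 0), ("B", "A", 2)]
    s = self_timed_starts(exec_time, edges, 10)
    print([round(s["A"][k + 2] - s["A"][k], 1) for k in range(8)])  # all 7.0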
…the number of tokens on C can only change when actors that are on C fire, because actors not on C remove and place tokens only on edges that are not part of C. If C = ((v_1, v_2), (v_2, v_3), …, (v_{n-1}, v_n), (v_n, v_1)) and any actor v_k (1 ≤ k ≤ n) fires, then exactly one token is moved from the edge (v_{k-1}, v_k) to the edge (v_k, v_{k+1}), where v_0 ≡ v_n and v_{n+1} ≡ v_1. This conserves the total number of tokens on C. QED.

Definition 4.1: An HSDFG G is said to be deadlocked if at least one of its actors cannot fire an infinite number of times in any valid sequence of firings of actors in G. Thus, in a deadlocked HSDFG, some actor v fires only a finite number of times and is never enabled to fire subsequently.

Lemma 4.2: An HSDFG G (in particular, an IPC graph) is free of deadlock if and only if it does not contain delay-free cycles.

Proof: Suppose there is a delay-free cycle C = ((v_1, v_2), (v_2, v_3), …, (v_{n-1}, v_n), (v_n, v_1)) in G, i.e. Delay(C) = 0. By Lemma 4.1, none of the edges (v_1, v_2), (v_2, v_3), …, (v_n, v_1) can contain tokens during any valid execution of G. Then each of the actors v_1, …, v_n has at least one input that never contains any data. Thus none of the actors on C is ever enabled to fire, and hence G is deadlocked. Conversely, suppose G is deadlocked, i.e. there is one actor v that never fires after a certain sequence of firings of actors in G. Thus, after this sequence of firings, there must be an input edge (v_i, v) of v that never contains data. This…
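Lemma 4.2 gives an immediate algorithmic test: an HSDFG is deadlock-free if and only if the subgraph formed by its zero-delay edges is acyclic. A sketch using a topological-sort check follows (helper names hypothetical).

    # Deadlock test per Lemma 4.2: look for a cycle among delay-free edges.
    def is_deadlocked(actors, edges):
        adj = {v: [] for v in actors}
        indeg = {v: 0 for v in actors}
        for (u, v, d) in edges:
            if d == 0:                # only delay-free edges can form a deadlock
                adj[u].append(v)
                indeg[v] += 1
        queue = [v for v in actors if indeg[v] == 0]
        seen = 0
        while queue:                  # Kahn's topological sort
            u = queue.pop()
            seen += 1
            for w in adj[u]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)
        return seen < len(actors)     # leftover nodes lie on a zero-delay cycle

    print(is_deadlocked(["A", "B"], [("A", "B", 0), ("B", "A", 1)]))  # False
    print(is_deadlocked(["A", "B"], [("A", "B", 0), ("B", "A", 0)]))  # True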
…these strategies appear to be most useful for a significant class of real-time DSP algorithms.

Figure 1.1. Fully static schedules for an HSDFG and its acyclic precedence graph: (b) a blocked schedule and (c) an overlapped schedule, differing in how successive iterations of the HSDFG are treated. Execution times of all actors are assumed to be one time unit (t.u.) in this example.

The FS schedule in Fig. 1.1(b) represents a blocked schedule: successive iterations of the HSDFG in a blocked schedule are treated separately, so that each iteration is completed before the next one begins. A more elaborate blocked schedule on five processors is shown in Fig. 1.2. The HSDFG is scheduled as if it executes for only one iteration, i.e. inter-iteration dependencies are ignored; this schedule is then repeated to get an infinite periodic schedule for the HSDFG. The length of the blocked schedule determines the average iteration period T. The scheduling problem is then to obtain a schedule that minimizes T, which is also called the makespan of the schedule. A lower bound on T for a blocked schedule is simply the length of the critical path of the graph, which is the longest delay-free path in the graph. Ignoring the inter-iteration dependencies when scheduling an HSDFG is…
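The critical-path lower bound mentioned above is a longest-path computation on the delay-free (acyclic) precedence graph; a sketch with a hypothetical example graph follows.

    # Critical path (longest delay-free path): lower bound on blocked-schedule T.
    from functools import lru_cache

    def critical_path(exec_time, edges):
        succ = {}
        for (u, v, d) in edges:
            if d == 0:                        # keep only intra-iteration precedences
                succ.setdefault(u, []).append(v)

        @lru_cache(maxsize=None)
        def longest_from(v):                  # DAG longest path by memoized DFS
            return exec_time[v] + max((longest_from(w) for w in succ.get(v, [])),
                                      default=0)

        return max(longest_from(v) for v in exec_time)

    # Hypothetical four-actor graph with unit execution times.
    exec_time = {"A": 1, "B": 1, "C": 1, "D": 1}
    edges = [("C", "A", 0), ("A", "B", 0), ("D", "B", 0)]
    print(critical_path(exec_time, edges))  # 3, from the path C -> A -> B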
…to use a large number of very simple processors to perform the computation, whereas the approach we follow in this thesis is to use a small number of powerful processors. This enables us to handle algorithms specified as dataflow graphs where the actors are tasks with a potentially large granularity; the parallelism we employ is therefore at the task level (functional parallelism). Such a strategy gives up some of the optimality properties that systolic array mapping techniques guarantee, in exchange for a larger application domain. Again, utilizing a number of pretested processor cores is economically more attractive than building a systolic array implementation from scratch.

Ideally, we would like to exploit the strategy of partitioning data among different processors (data parallelism) that systolic techniques employ, along with task-level parallelism. There has not been much work in this direction, although the work of Printz [Prin91] and the Multidimensional SDF model proposed by Lee in [Lee93] are two promising approaches for combining data and functional parallelism.

The multiple-DSP machines we discussed in the last section all employ some form of self-timed scheduling. Clearly, general-purpose parallel machines like the Thinking Machines CM-5 and the Stanford Dash multiprocessor can also be programmed using the self-timed scheduling style, since these machines provide mechanisms for run-time synchronization and flow control. These machines…
…get-togethers. I have also greatly enjoyed attending classes and discussions with Professors Avideh Zakhor, Jean Walrand, John Wawrzynek, and Robert Brayton. I thank Professors Michael Lieberman and Allan Lichtenberg for their support and encouragement during my first year as a graduate student.

During the course of my Ph.D. research I have had the opportunity to work closely with several fellow graduate students. In particular, I would like to mention Shuvra Bhattacharyya, in collaboration with whom some of the work in this thesis was done, and Praveen Murthy. Praveen and Shuvra are also close friends, and I have immensely enjoyed my interactions with them, both technical as well as non-technical (music, photography, tennis, etc.). I want to thank Phil Lapsley, who helped me with the DSP lab hardware when I first joined the DSP group; Soonhoi Ha, who helped me with various aspects of the scheduling implementation in Ptolemy; and Mani Srivastava, who helped me a great deal with printed circuit board layout tools and provided me with several useful tips that helped me design and prototype the 4-processor OMA board. I should mention Mary Stewart and Carol Sitea for helping me with reimbursements and other bureaucratic paperwork, Christopher Hylands for patiently answering my system-related queries, and Heather Levien for cheerfully helping me with the mass of graduate-division-related paperwork, deadlines, and formalities to be completed, etc.
…whichever branch is actually taken, the pattern of processor availability after the construct completes execution is the same. This has to be ensured by inserting idle time on processors when necessary. Fig. 6.4 shows a quasi-static schedule for a conditional construct. Maintaining the same pattern of processor availability allows static scheduling to proceed after the execution of the conditional; the data-dependent nature of the control construct can be ignored at that point.

Figure 6.4. Quasi-static schedule for a conditional construct (adapted from [Lee88b]): three processors execute either the code for f (TRUE branch) or the code for g (FALSE branch) after a SWITCH, with idle time inserted so that the pattern of processor availability is the same for both branches, after which subgraph 1 is scheduled.

In Fig. 6.4, for example, the scheduling of subgraph 1 can proceed independent of the conditional construct, because the pattern of processor availability after this construct is the same independent of the branch outcome (note that nops, i.e. idle processor cycles, have been inserted to ensure this). Multiprocessor implementation of a quasi-static schedule directly, however, implies enforcing global synchronization after each dynamic construct in order to ensure a particular pattern of processor availability. We therefore use a mechanism similar to the self-timed strategy: we first determine a quasi-static schedule using the methods of Lee and Ha, and then…
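A small sketch of the idle-time insertion step (names hypothetical): given per-processor finish times for each branch of a conditional, pad every processor to the elementwise maximum so that all branches leave the same availability pattern.

    # Equalize processor-availability patterns across conditional branches.
    def pad_with_nops(branch_finish_times):
        """branch_finish_times: list of dicts, one per branch, mapping
        processor -> time at which it becomes free in that branch."""
        procs = branch_finish_times[0].keys()
        target = {p: max(b[p] for b in branch_finish_times) for p in procs}
        idle = [{p: target[p] - b[p] for p in procs} for b in branch_finish_times]
        return target, idle       # idle[i][p] = nop time on proc p in branch i

    true_branch = {"proc1": 5, "proc2": 7, "proc3": 4}
    false_branch = {"proc1": 6, "proc2": 3, "proc3": 4}
    target, idle = pad_with_nops([true_branch, false_branch])
    print(target)  # {'proc1': 6, 'proc2': 7, 'proc3': 4}
    print(idle)    # nops per branch that make the availability patterns identical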
…produce and consume a fixed number of tokens, and these numbers are known at compile time. This allows us to obtain periodic schedules for SDF graphs such that the average rates of firing of actors are fixed relative to one another. We will not be concerned with synchronous languages in this thesis, although these languages have a close and interesting relationship with the dataflow models used for specification of signal processing algorithms [Lee95].

1.1.2 Utility of dataflow for DSP

As mentioned before, dataflow models such as SDF and other closely related models have proven to be useful for specifying applications in signal processing and communications, with the goal of both simulation of the algorithm and synthesis of implementations. … For SDF graphs, buffer sizes can be determined at compile time, which is not possible for a general dataflow model; consequently, buffers can be allocated statically, and the run-time overhead associated with dynamic memory allocation is avoided. The existence of a periodic schedule that can be inferred at compile time implies that a correctly constructed SDF graph entails no run-time scheduling overhead. An SDF graph in which every actor consumes and produces only one token from each of its inputs and outputs is called a homogeneous SDF graph (HSDFG). An HSDFG actor fires when it has one or more tokens on all its input edges; it consumes one token from each input edge when it fires, and produces one token on all its output edges when it completes execution. A general multirate SDF graph can always be converted into an equivalent HSDFG.
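The fixed relative firing rates come from the SDF balance equations: for each edge, (tokens produced per firing of src) × q(src) = (tokens consumed per firing of snk) × q(snk). A sketch that solves for the smallest integer repetitions vector q follows (hypothetical helper; assumes a connected, sample-rate-consistent graph and uses exact rational arithmetic).

    # Solve SDF balance equations for the repetitions vector q (sketch).
    from fractions import Fraction
    from math import lcm

    def repetitions_vector(actors, edges):
        """edges: (src, prod, snk, cons) -- src produces `prod` tokens per
        firing and snk consumes `cons`. Rates propagate by graph traversal."""
        q = {actors[0]: Fraction(1)}
        pending = True
        while pending:
            pending = False
            for (a, p, b, c) in edges:
                if a in q and b not in q:
                    q[b] = q[a] * p / c      # balance: q[a]*p == q[b]*c
                    pending = True
                elif b in q and a not in q:
                    q[a] = q[b] * c / p
                    pending = True
        scale = lcm(*(f.denominator for f in q.values()))
        return {v: int(f * scale) for v, f in q.items()}

    # Example: A produces 2 tokens per firing, B consumes 3 -> q = {A: 3, B: 2}.
    print(repetitions_vector(["A", "B"], [("A", 2, "B", 3)]))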
…enough to handle even a significant class of IPC graphs. Again, exponentially distributed execution times for all actors is clearly too crude an approximation of any realistic scenario for the computations involved in exact calculations to be worthwhile. As an alternative to determining the exact value of E(T), we discuss how to determine efficiently computable bounds for it.

Definition 4.5: Given an HSDFG G = (V, E) that has actors with random execution times, define G_ave = (V, E) to be an equivalent graph with actor execution times equal to the expected values of their execution times in G.

Fact 4.1 ([Durr91], Jensen's inequality): If f(x) is a convex function of x, then E(f(x)) ≥ f(E(x)).

In [Rajs94] the authors use Fact 4.1 to show that E(T) ≥ MCM(G_ave). This follows from the fact that T, viewed as a function of the execution times of each of the actors, is convex. This result is interesting because of its generality: it is true no matter what the statistics of the actor execution times are; even the various independence assumptions we made can be relaxed. One might wonder what the relationship between E(T) and E(MCM(G)) might be. We can again use Fact 4.1, along with the fact that the maximum cycle mean is a convex function of actor execution times, to show the following: E(MCM(G)) ≥ MCM(G_ave). However, we cannot say anything about E(T) in relation to E(MCM(G)); we were able to construct some graphs where E(T) > …
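The bound E(MCM(G)) ≥ MCM(G_ave) is easy to check empirically. The sketch below (all numbers illustrative) uses a toy graph with two cycles, one with one delay and one with two delays, so that MCM = max(tA, tB / 2) in closed form; the max of linear functions is convex, which is exactly where Jensen's inequality bites.

    # Empirical check of E[MCM(G)] >= MCM(G_ave) for a toy graph whose MCM
    # is max(tA, tB / 2).
    import random

    random.seed(0)
    N = 100_000
    samples = []
    for _ in range(N):
        tA = random.expovariate(1 / 4.0)   # mean execution time 4
        tB = random.expovariate(1 / 6.0)   # mean execution time 6
        samples.append(max(tA, tB / 2))    # MCM(G) for this realization

    e_mcm = sum(samples) / N
    mcm_ave = max(4.0, 6.0 / 2)            # MCM of the averaged graph G_ave
    print(f"E[MCM(G)] ~ {e_mcm:.2f} >= MCM(G_ave) = {mcm_ave:.2f}")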
…used with synchronous languages [Hal93][Ben91], e.g. LUSTRE, SIGNAL, and ESTEREL, which have very different semantics from SDF. Synchronous languages have been proposed for formally specifying and modeling reactive systems, i.e. systems that constantly react to stimuli from a given physical environment. Signal processing systems fall into the reactive category, and so do control and monitoring systems, communication protocols, man-machine interfaces, etc. In these languages, variables are possibly infinite sequences of data of a certain type. Associated with each such sequence is a conceptual (and sometimes explicit) notion of a clock signal. In LUSTRE, each variable is explicitly associated with a clock, which determines the instants at which the value of that variable is defined. SIGNAL and ESTEREL do not have an explicit notion of a clock. The clock signal in LUSTRE is a sequence of Boolean values, and a variable in a LUSTRE program assumes its nth value when its corresponding clock takes its nth TRUE value. Thus we may relate one variable with another by means of their clocks. In ESTEREL, on the other hand, clock ticks are implicitly defined in terms of the instants when the reactive system corresponding to an ESTEREL program receives (and reacts to) external events. All computations in synchronous languages are defined with respect to these clocks. In contrast, the term "synchronous" in the SDF context refers to the fact that SDF actors produce…
…never be more than ρ_G(snk(e), src(e)) tokens more than the initial number of tokens on e. Since the initial number of tokens on e was delay(e), the size of the buffer corresponding to e is bounded above by

$$B_{fb}(e) = \rho_G(\mathrm{snk}(e), \mathrm{src}(e)) + \mathrm{delay}(e). \quad \text{QED}$$

The quantities ρ_G(snk(e), src(e)) can be computed using Dijkstra's algorithm [Corm92] to solve the all-pairs shortest path problem on the synchronization graph in time O(|V|³).

5.9 Resynchronization

It is sometimes possible to reduce the total number of synchronization edges by adding new synchronization edges to a synchronization graph. We refer to the process of adding one or more new synchronization edges and removing the redundant edges that result as resynchronization. Fig. 5.14(a) illustrates this concept, where the dashed edges represent synchronization edges. Observe that if we insert the new synchronization edge (C, H), then two of the original synchronization edges, (B, G) and (E, J), become redundant, and the net effect is that we require one less synchronization edge to be implemented. In Fig. 5.14(b) we show the synchronization graph that results from inserting the resynchronization edge (C, H) (grey edge) into Fig. 5.14(a) and then removing the redundant synchronization edges that result. We refer to the problem of finding a resynchronization with the fewest number of final synchronization edges as the resynchronization problem. In…
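Since ρ_G here is a minimum path delay, an all-pairs computation over the edge delays yields every B_fb(e) at once. A Floyd-Warshall sketch follows (helper names hypothetical); it has the same cubic cost as repeated Dijkstra runs but is simpler to state.

    # Buffer bounds from minimum path delays:
    # B_fb(e) = rho(snk(e), src(e)) + delay(e).
    INF = float("inf")

    def buffer_bounds(actors, sync_edges):
        idx = {v: i for i, v in enumerate(actors)}
        n = len(actors)
        rho = [[INF] * n for _ in range(n)]
        for i in range(n):
            rho[i][i] = 0
        for (u, v, d) in sync_edges:
            rho[idx[u]][idx[v]] = min(rho[idx[u]][idx[v]], d)
        for k in range(n):                       # Floyd-Warshall on delay weights
            for i in range(n):
                for j in range(n):
                    if rho[i][k] + rho[k][j] < rho[i][j]:
                        rho[i][j] = rho[i][k] + rho[k][j]
        return {(u, v): rho[idx[v]][idx[u]] + d  # rho(snk, src) + delay(e)
                for (u, v, d) in sync_edges}

    edges = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2)]
    print(buffer_bounds(["A", "B", "C"], edges))
    # {('A', 'B'): 2, ('B', 'C'): 2, ('C', 'A'): 2}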
…the control token that determines the branch behaviour must be broadcast to all the processors that execute that construct. Thus, in Fig. 6.4, the value c, which is computed by Processor 2 (since the actor that produces c is assigned to Processor 2), must be broadcast to the other two processors. In a shared-memory machine, this broadcast can be implemented by allowing the processor that evaluates the control token (Processor 2 in our example) to write its value to a particular shared-memory location preassigned at compile time; the processor will then update this location once for each iteration of the graph. Processors that require the value of a particular control token simply read that value from shared memory, and the processor that writes the value of the control token needs to do so only once. In this way, actor executions can be conditioned upon the values of control tokens evaluated at run time. In the previous chapters we discussed synchronization associated with data transfer between processors. Synchronization checks must also be performed for the control tokens: the processor that writes the value of a token must not overwrite the shared-memory location unless all processors requiring the value of that token have in fact read the shared-memory location, and processors reading a control token must ascertain that the value they read corresponds to the current iteration rather than a previous iteration. The need for broadcast of control tokens creates additional communication overhead that should be taken into account.
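One common way to realize both checks with a single shared word is to tag the token with its iteration number. The sketch below is a hypothetical scheme, not the thesis's implementation: readers wait for the tag to reach their iteration, and the writer waits until all readers have caught up before overwriting.

    # Iteration-tagged control-token broadcast (sketch).
    import threading

    NUM_READERS = 2
    slot = {"iter": -1, "value": None}       # shared location for token c
    reads_done = [-1] * NUM_READERS          # last iteration each reader consumed
    lock = threading.Lock()

    def write_token(k, value):
        while True:                           # don't overwrite until all read k-1
            with lock:
                if all(r >= k - 1 for r in reads_done):
                    slot["iter"], slot["value"] = k, value
                    return

    def read_token(reader_id, k):
        while True:                           # wait for iteration k's value
            with lock:
                if slot["iter"] >= k:
                    reads_done[reader_id] = k
                    return slot["value"]

    write_token(0, True)
    print(read_token(0, 0), read_token(1, 0))  # True True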
…of Zivojnovic et al. does not apply uniformly to all HSDFGs: if there are tight cycles in the graph that need to be partitioned among processors, the samples simply cannot be vectorized [Messer88]. Thus the presence of a tight cycle precludes arbitrary blocking of data. Third, vectorizing samples leads to increased latency in the implementation; some signal processing tasks, such as interactive speech, are sensitive to delay, and hence the delay introduced due to blocking of data may be unacceptable. Finally, the problem of vectorizing data in HSDFGs into blocks, even with all the above limitations, appears to be fundamentally hard: the algorithms proposed by Zivojnovic et al. have exponential worst-case run times. Code generated currently by the Ptolemy system does not support blocking or vectorizing of data, for many of the above reasons.

Another possible solution is to use special hardware. One could provide a full interconnection network, thus obviating the need to go through shared memory; semaphores could be implemented in hardware; one could use multiported memories. Needless to say, this solution is not favourable because of cost, especially when targeting embedded applications. A general-purpose shared-bus machine (the Sequent Balance [Patt90], for example) will typically use caches between the processor and the shared bus. Caches lead to increased shared-memory bandwidth due to the averaging effect provided by block fetches…
…was found to be ideal for the particular algorithms that were to be mapped to this machine. The ring structure is similar to the SMART array, except that no processor ID is included with the data, and processor reads and writes into the ring are scheduled in a fully static fashion. The ring is used to broadcast data from one processor to all the others during one phase of the neural net algorithm, and is used to shift data from processor to processor in a pipelined fashion in the second phase.

The MUSIC system [Gunz92] uses Motorola DSP96000 processors and has been designed for neural network simulations and scientific simulations. An intelligent communication network, implemented on Xilinx gate arrays, broadcasts data generated by any processing element (PE) to all the other PEs. The PEs are arranged in a ring topology. This kind of broadcast mechanism is suited to the applications the MUSIC system targets: the outputs from one layer of a multilayer perceptron (a kind of neural net) are needed by all the neurons in the next layer, making broadcasting an ideal strategy when the different net layers are executed in a pipelined fashion. The molecular dynamics example the authors provide also benefits from this broadcast mechanism.

1.4 Thesis overview: our approach and contributions

We argued that the self-timed scheduling strategy is suited to parallel implementation for DSP. The multiprocessor architectures we discussed in the previous sections…
Any processor that possesses the bus may write to the Tx register or read from the Rx register. For a 16-bit host, two transactions are required to perform a read or write with the 32-bit Tx and Rx registers; the processors themselves need only one bus access to load or unload data from the I/O interface. Synchronization on the S-56X host side is done by polling status bits that indicate an Rx-empty flag (if true, the host performs a write; otherwise it busy-waits) and a Tx-full flag (if true, the host performs a read; otherwise it busy-waits). On the OMA side, synchronization is done by the use of the TA (transfer acknowledge) pin on the processors. When a processor attempts to read Rx or write Tx, the appropriate status flags are enabled onto the TA line, and wait states are automatically inserted in the processor bus cycle whenever the TA line is not asserted, which in our implementation translates to wait states whenever the status flags are false. Thus processors do not have the overhead of polling the I/O status flags; an I/O transaction is identical to a normal bus access, with zero or more wait states inserted automatically. The DSP56000 processor on the S-56X card is responsible for performing I/O with the actual (possibly asynchronous) data source, and acts as the interrupt processor for the OMA board, relieving the board of tasks such as interrupt servicing and data buffering. This, of course, has the downside that the S-56X…
