
Dataflow Design Tool - ODU Computer Science


Contents

1. Network with Com Controller. This architecture model assumes the processors are completely connected via communication paths. Unlike the Network without Com Controller option, each processing unit is paired with a communication (com) controller that handles the transfer of information after the processor sets up for the transfer. Thus the processors will not be burdened with the communication transfers to neighboring processors.

Network without Com Controller. This architecture model assumes the processors are completely connected via communication paths. Unlike the Network with Com Controller option, this model does not assume that each processing unit is paired with a communication (com) controller that handles the transfer of information after the processor sets up for the transfer. Thus each processor will be burdened with the communication transfers to neighboring processors.

Multiple Graph Strategy. Invokes the dialogue box shown in figure 8 to allow the user to choose a multiple graph execution strategy. (Figure 8: Dialogue box for selection of multiple graph strategy.) The user simply clicks on a graph and chooses to move it to the left for Parallel Execution or to the right for Time Multiplex Execution. The strategies are defined as follows:

Parallel Graph Execution. Multiple graph execution strategy where graphs are i…
2. (Figure 19: Single graph play window showing internal transitions associated with reading, processing, and writing data.) …be removed by the Delete Edge command or by the Undo command.

Delete Edge. Allows the user to delete a previously inserted control edge by prompting for the terminal and initial nodes, as done in adding a control edge. Control edges already present within the graph file cannot be deleted. A beep indicates success.

Undo. Deletes the most recently inserted control edge. Repeating the Undo command will continue to remove the most recently inserted control edges until none remain. A beep indicates success.

Slice. Zooms into the time interval defined by the left and right time cursors (vertical lines).

Previous Slice. Returns to the previously displayed time interval. After zooming into a time slice, the user can move left or right of the time interval by using the horizontal scroll bar.

Whole. Displays the entire SGP schedule over the schedule length.

Redraw. Refreshes the display without changing the current zoom or time cursor positions.

Reset. Refreshes the display by returning to the entire picture (removes zoom) and positions the left and right time cursors to the left and right margins, respectively.

5.1.2 Select Menu. The Select menu includes commands that enable the user to display only selected nodes, force the…
3. Figure 12: TCE dialogue box for each architecture model.
Figure 13: Dialogue box to select desired sink for TBIO calculations.
Figure 14: Dialogue box to set TBO value.
Figure 15: Dialogue box to set processor limit.
Figure 16: Dialogue box to change graph name.
Figure 17: Single graph play window. Shaded bars indicate task execution duration; unshaded bars indicate slack time.
Figure 18: Two ways to display information about a node.
Figure 19: Single graph play window showing internal transitions associated with reading, processing, and writing data.
Figure 20: Select Display Paths menu command to display all paths or just critical paths, or Display Circuits menu command to display same for circuits. Paths and circuits are denoted with gray shaded bars.
Figure 21: Choose Display command from Select menu to customize display of nodes within Graph Play windows.
Figure 22: Choose Jump by command from Select menu to choose nodes or events to jump to when moving time cursors with arrow keys.
Figure 23: Choose Scroll Step command from Select menu to set amount to increment when using horizontal scroll bar with a zoomed interval.
4. (Figure 56: Dataflow analysis of DFG of figure 55 with communication delays and Network with Com Controller model.) (Node information panel fields: Name, Priority, Max Instances, Latency, Read, Process, Write, Earliest Start, Latest Finish, Slack, Inputs, Outputs.) (Figure 57: Effect of edge delays on node slack time. Intra-iteration slack of node C is reduced by the C→F edge delay of 20 time units; inter-iteration slack of node E is reduced by the E→D edge delay of 10 time units.)

…communications processor. In this case the processor is responsible for the communications burden as well as the computing effort required by the algorithm tasks. The effect that edge delays under this model have on the dataflow analysis conducted under the previous models can be seen by examining figure 58. With the added communications effort, indicated by the increase in TCE to 1110 time units, TBO for a computed proces…
5. …paradigm has also been getting considerable attention in the areas of DSP and real-time systems. The commercial products offered today utilize the dataflow paradigm as a graphical programming language but do not incorporate dataflow analyses in designing a multiprocessing solution. Although there are many advantages to graphical programming, the full potential of the dataflow representation is lost by not utilizing it analytically as well. In the absence of the analysis and/or design offered by the software tool described in this paper, programmers must rely on approximate compile-time solutions (heuristics) or run-time implementations, which often utilize ad hoc design approaches. This paper describes the Dataflow Design Tool, which is capable of determining and evaluating the steady-state behavior of a class of computational problems for iterative, parallel execution on multiple processors. The computational problems must meet all the following criteria:

1. An algorithm decomposition into primitive operations, or tasks, must be known.
2. The algorithm task dependencies, preferably due to the inherent data dependencies, must be modeled by a directed graph.
3. The directed graph must be deterministic, as defined below.
4. The algorithm execution must be repetitive for an infinite input data stream.
5. The algorithm must be executed on identical processors.

When the directed graph is a result of inherent data dependencies with…
6. phase: time delay between input injection of two or more graphs
≺: partial ordering of tasks
PI: parallel interface bus
R: available processors
R*: calculated processor requirement; corresponds to theoretic lower limit
S: speedup
schedule length: minimum time to execute all tasks for given computation
SGP: single graph play
Shared Memory (No Contention): architecture model which assumes processors are fully connected to shared memory with enough paths to avoid contention
sink: graph output data stream
slack (float): maximum time token can be delayed at node without increasing delay in longest path
source: graph input data stream
SRE: single resource envelope
T: set of tasks
task: set of instructions as basic unit of scheduling
TBI: time between inputs
TBIO: time between input and output
TBIO*: lower bound time between input and output
TBO: time between outputs
TBO*: lower bound time between outputs
TCE: total computing effort
TGP: total graph play
Ti: ith task in T
time multiplex execution: execution of multiple graphs in parallel by controlling graph phasing
To: maximum time per token ratio for all graph recurrence loops
token: initial conditions and data values within graph
token bound: maximum number of tokens on edge
TRE: total resource envelope
U: utilization
VHSIC: very high speed integrated circuit

Abstract. The Dataflow Design Tool is a software tool for selecting a multiprocessor scheduling solution for a class of computational problems. The problems of interest are those that can be described with a dataflow graph and are intended to be executed repetitively on a set of identical processors. Typical applications include signal processing and control law problems…
7. Clicking on node E results in the inclusion of the E→C constraint, thereby eliminating the needless parallelism for a single iteration, as shown in figure 44. Even though the computing effort is smaller than before adding the E→C constraint, there is still some computing effort requiring four processors. Additional precedence constraints may exist that could effectively reallocate the computing effort requiring four processors to fill in the underutilized idle time, requiring only two processors. By using the time cursors, the user can locate the four-processor effort in the TGP (note cursor position in the TGP and TRE of fig. 44). (Figure 43: Dataflow schedule for desired number of processors (three, for example) and TBO of 314 time units may in fact require more processors (four in this case). User may wish to eliminate needless parallelism and fill in underutilized processor time. Figure shows intent to delay node C behind nodes B, D, or E.) Referring to the TGP of figure 44, one can determine that the likely candidate is to delay node D behind either node C or node B. Select i…
8. …Controller
10.2.2 Network without Communication Controller
10.3 Multiple Graph Models
10.3.1 Parallel Execution of Multiple Graphs
10.3.2 Time Multiplex Execution of Multiple Graphs
11 Future Enhancements
12 Concluding Remarks
Appendix: Graph Text Description
References

Figures
Figure 1: Dataflow graph example
Figure 2: Dataflow Design Tool information flow
Figure 3: Example of DESIGN.INI file
Figure 4: Design Tool main window
Figure 5: Dialogue box for opening graph file
Figure 6: Dialogue box for exiting Design Tool program
Figure 7: Dialogue box for selection of architecture model
Figure 8: Dialogue box for selection of multiple graph strategy
Figure 9: Choose Help menu in main window or press F1 to invoke on-screen Help window
Figure 10: About box
Figure 11: Metrics window…
9. (Node information panel fields: …Instances, Latency, Read, Process, Write, Earliest Start, Latest Finish, Slack, Inputs, Outputs.) (Figure 18: Two ways to display information about a node: (a) click on a node bar while holding down the <Shift> key; (b) move the left time cursor with the left and right arrow keys.)

…time, paths, circuits, and legends. A description of each command is given as follows:

Show Transitions. Turns on and off the display of internal transitions (read, process, and write setup time), as shown in figure 19.

Show Segments. Turns on and off the display of TBO-width segments (separated by dotted lines) that indicate the multiple computations for successive data packets shown by the TGP window. In fact, the TGP window can be constructed by superimposing the TBO-width segments.

Show Slack. Turns on and off the display of slack time by using unshaded bars.

Paths. Shows none of the paths, all of the paths, or just the critical paths within the graph by denoting member nodes with gray bars (fig. 20(a)).

Circuits. Shows none of the circuits, all of the circuits, or just the critical circuits within the graph by denoting member nodes with gray bars (fig. 20(b)).

Select Node. Allows the user to highlight selected nodes (bars) in gray by clicking on a bar or using the up and down arrow keys to obtain information on the selected node. Given a selected (gray shaded) node, al…
10. (Figure 67: TBO for a given graph under time multiplex strategy depends on phase delays. (b) New sequence and phasing between graphs with Graph 2 repeated twice.) (Figure 68: Phase delays can be altered with Phasing window to fill in idle time between graph transitions. Phase delays have been reduced to increase throughput and processor utilization in relation to figure 67.) (Figure 69: In time multiplex mode, updating graph file not only sets graph attributes, such as queue sizes, but imposes control edges with delay attributes around sources to control phase of graphs. (a) Graph entry tool after graphs have been updated; (b) edge delays on source control edges impose sequence and phasing portrayed in figure 68.)

12 Concluding Remarks. The Dataflow Design Tool, a software tool that provides automatic graph analysis of algorithm…
11. Report Documentation Page (Office of Management and Budget, Paperwork Reduction Project 0704-0188, Washington, DC 20503).
Report date: February 1996. Report type and dates covered: Technical Memorandum.
Title and subtitle: Dataflow Design Tool User's Manual. Funding numbers: WU 233-01-03-13.
Author: Robert L. Jones III.
Performing organization: NASA Langley Research Center, Hampton, VA 23681-0001. Performing organization report number: L-17451.
Sponsoring/monitoring agency: National Aeronautics and Space Administration, Washington, DC 20546-0001. Agency report number: NASA TM-4672.
Distribution/availability: Unclassified, Unlimited; Subject Category 61. Availability: NASA CASI, (301) 621-0390.
Abstract (maximum 200 words): The Dataflow Design Tool is a software tool for selecting a multiprocessor scheduling solution for a class of computational problems. The problems of interest are those that can be described with a dataflow graph and are intended to be executed repetitively on a set of identical processors. Typical applications include signal processing and control law problems. The software tool implements graph search algorithms and analysis techniques based on the dataflow paradigm. Dataflow analyses provided by the software are introduced and shown to effectively det…
12. …are required for iteration period of 108 time units.
Figure 63: Dataflow analysis of Graph 3 shows four processors are required for iteration period of 134 time units.
Figure 64: Parallel Graph Execution window summarizes processor requirements for all graphs intended to execute in parallel. Since it is assumed that phasing between graphs cannot be controlled in parallel graph strategy, total processor requirement is worst case, or sum of individual processor requirements.
Figure 65: Time multiplex strategy assumes phasing between graphs can be controlled by defining a finite delay between inputs to graphs.
Figure 66: Select Edit Graph Sequence command from Select menu in Multiple Graph window to edit periodic graph sequence. User can also replicate graphs so that they execute n times as often as other graphs within scheduling cycle. Figure shows Graph 2 has been replicated twice.
Figure 67: TBO for a given graph under time multiplex strategy depends on phase delays.
Figure 68: Phase delays can be altered with Phasing window to fill in idle time between graph transitions. Phase delays have been reduced to increase throughput and processor utilization in relation to figure 67.
Figure 69: In time multiplex mode, updating graph file not only sets graph attributes, such as queue sizes, bu…
13. …lower bound time between input and output; lower bound time between outputs; time between input and output; minimum time to execute all tasks for a given computation. (Figure 11: Metrics window.)

The number of Processors shown in the Metrics window will not necessarily equal the number of processors required for a cyclic dataflow schedule. The number of Processors shown here is the optimum number of processors for the current TBO setting, from equation (5), and is referred to as the calculated processor requirement. (Figure 12: TCE dialogue box for each architecture model: Shared Memory (No Contention), Network with Com Controller, and Network without Com Controller.) The actual processor requirement may be greater than the calculated requirement because of the partial ordering of tasks. The job of scheduling partially ordered tasks to processors is known to be NP-complete (ref. 4). This implies that an exhaustive search (rescheduling tasks with start times greater than the earliest start times given by the dataflow analysis) is required to find an optimum solution that achieves the timing criteria, e.g., minimu…
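The calculated processor requirement described above can be sketched as a one-line computation. This is only an illustrative reading of equation (5) (the ceiling of TCE over TBO), not the tool's actual code:

```python
import math

def calculated_processors(tce: int, tbo: int) -> int:
    """Calculated processor requirement (reading of eq. (5)):
    the optimum number of processors for the current TBO setting."""
    return math.ceil(tce / tbo)

# Values from the running example: TCE = 940, TBO = 314 time units.
print(calculated_processors(940, 314))  # -> 3
```

As the text notes, the actual requirement may exceed this value because of the partial ordering of tasks.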
14. …dataflow analysis via the Design Tool network model, with and without communication processors, will be demonstrated.

10.2.1 Network with Communication Controller. The Network with Com Controller model assumes that each processing element is paired with a communications processor (controller or state machine). In this way, the burden of transferring information between processors is removed from the processing elements. The effect of edge delays on the dataflow analysis conducted before can be seen by examining figure 56. With the computed number of processors set to three, TBO remains 314 time units. This is only because the total communication delay of 20 time units added to the recurrence loop results in a time/token ratio of 300 time units, still less than TCE/3 = ⌈940/3⌉ = 314 time units. Note that TCE remains 940 time units, since the communication effort does not consume processor time. The significant difference between this dataflow analysis and the previous one (Shared Memory, No Contention) is the finite gaps between the dependent node execution bars and the corresponding processor idle time. (Figure 55: Dataflow graph with communication delays modeled by edge delays.) It should be mentioned here that this model assumes that the communications required of all edges can occur in parallel. Thus the minimum time required to transfer results of a node with multiple outp…
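The argument above (the loop's time/token ratio of 300 staying below the computing-effort bound of 314, so TBO is unchanged) can be checked with a small sketch. The function below is an assumption: it models TBO's lower bound as the larger of the computing-effort bound and the worst recurrence-loop ratio, as the surrounding discussion suggests:

```python
import math

def tbo_lower_bound(tce: int, processors: int, loop_ratios: list) -> int:
    """TBO is bounded below by both the computing-effort bound
    ceil(TCE/R) and the largest recurrence-loop time/token ratio."""
    return max(math.ceil(tce / processors), max(loop_ratios, default=0))

# The 20-unit edge delay raises the loop ratio to 300, still below
# ceil(940/3) = 314, so the bound (and TBO) stays at 314.
print(tbo_lower_bound(940, 3, [300]))  # -> 314
```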
15. …decompositions was presented. The graph analysis is based on the dataflow paradigm, which is very good at exposing inherent parallelism within algorithms. The functionality of the Design Tool was defined and shown to facilitate the selection of graph-theoretic multiprocessing solutions. The addition of artificial data dependencies (control edges) was shown to be a viable technique for improving scheduling performance by reducing the processor requirements. The selection of an optimum solution is based on user-selected criteria (time between outputs, time between input and output, and number of processors) or trade-offs among these parameters when a solution optimizing all three cannot be found or may not exist. Case studies demonstrated the ability of the tool to optimize a dataflow-derived scheduling solution by using control edges, to model communication delays between synchronized tasks, and to accommodate multiple graph modeling. The dataflow graph actually describes all the pertinent run-time criteria, such as task instantiations, synchronization, communication, shared memory requirements, and iteration period. The dataflow graph used as input is updated by the Design Tool so that the run-time scheduling solution is represented by the dataflow graph. However, the dataflow model used and updated by the Design Tool at compile time does not imply that a dataflow graph implementation must be used at run time. If the user has a da…
16. …from Select menu to choose nodes or events to jump to when moving time cursors with arrow keys.

5.2 Total Graph Play Window. This section discusses the TGP window, offers some comments and observations, and defines the menu commands. The Total Graph Play window for the DFG of figure 1 is shown in figure 24 for a TBO of 314 time units. Just as in the SGP, task (node) names are shown vertically and time is represented along the horizontal axis. Node latencies are represented by the length of the red shaded bars. (Figure 23: Choose Scroll Step command from Select menu to set amount to increment when using horizontal scroll bar with a zoomed interval.) Comparing figure 24 with the TBO-width segments in figure 17, one can observe that the construction of the TGP from the ES-modulo-TBO mapping function is equivalent to superimposing the TBO-width segments onto the SGP. The numbers above the node bars indicate the relative data packet (data set) numbers of the scheduled task. That is, data packet n+1 denotes a data packet injected into the graph one TBO interval after data packet n. The overlapped execution of multiple node instantiations is represented by multiple rows for the same node, as is the case for node B, requiring two instantiations. Referring to the scheduled effort of nodes D and E, which form a recurrence loop (circuit), one can observe some idle time from the completion time of E to the ea…
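The superimposing construction described above can be sketched as follows. The function `tgp_slots`, its arguments, and the sample start times are hypothetical; the sketch only illustrates folding single-graph-play start times into one TBO-wide period and tagging each task with its relative data-packet number:

```python
def tgp_slots(sgp_starts: dict, latencies: dict, tbo: int) -> list:
    """Fold single-graph-play start times into one TBO-wide segment.
    Returns (node, offset, end, packet) tuples, where packet n is the
    data set injected n TBO intervals after packet 0."""
    slots = []
    for node, start in sgp_starts.items():
        packet, offset = divmod(start, tbo)  # packet n starts n*TBO later
        slots.append((node, offset, offset + latencies[node], packet))
    return sorted(slots, key=lambda s: s[1])

# Hypothetical earliest start times and latencies, TBO = 314:
starts = {"A": 0, "B": 400}
lats = {"A": 90, "B": 390}
print(tgp_slots(starts, lats, 314))  # -> [('A', 0, 90, 0), ('B', 86, 476, 1)]
```

A bar whose end exceeds TBO (node B here) is exactly the overlapped execution that the TGP shows as a second row for the same node.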
17. …graph sequence and phasing, which are portrayed and controlled by the Time Multiplex and Phasing windows described in section 6.4.2. This section will demonstrate the performance behavior of the individual graphs shown in figure 59 under a time multiplex graph scenario. The time multiplex performance of the three graphs, periodically executed with the sequence Graph 1, Graph 2, Graph 3, Graph 1, …, is shown in figure 65. By comparing the delays shown in the Phasing window (fig. 65(b)) with the TBIO values (fig. 65(a)), one can infer that the graphs are spaced out by the respective TBIO values (which are equal to the respective schedule lengths) such that the processing of graphs does not overlap. Thus the dashed lines in the resource envelope window actually indicate when the processing of one graph ends and that of another begins. Also, since neither graph is replicated (i.e., the system operates by a single sampling rate, in DSP terminology), the time between graph executions (TBO) equals the sum of the phase delays, 1350 time units in this case. The disablement of the TBO and Processors buttons is indicated by the grayed button text in figure 65(a). The calculated processor requirement shown next to the Processors button is one for each graph. This is expected, since in each case TCE is less than TBO. However, the system requires two processors to exploit the parallel concurrency in the individual graphs, with an overall processor u…
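The TBO relation stated above (no replicated graphs, so the time between executions of any one graph is the sum of all phase delays in the cycle) can be written directly. The individual delay values below are hypothetical, chosen only to total the 1350 time units cited in the text:

```python
def system_tbo(phase_delays: list) -> int:
    """With no graph replicated, the time between executions of any
    one graph equals the sum of all phase delays in the cycle."""
    return sum(phase_delays)

# Hypothetical per-graph phase delays (equal to each graph's TBIO):
print(system_tbo([500, 400, 450]))  # -> 1350
```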
18. …in previous sessions.

10 Case Studies. For the purposes of presenting and demonstrating the remaining Design Tool features, a few case examples will be discussed. First, optimization of the DFG example in figure 1 will be demonstrated by inserting control edges. Second, the effect of communication delays, modeled as edge delays, on the performance and displays previously presented will be demonstrated. And third, the capability of the Design Tool in modeling multiple graph execution scenarios will be demonstrated.

10.1 Optimization of Dataflow-Derived Schedule. The DFG example in figure 1 has the potential of having a speedup performance of 3 with three processors, as indicated by equations (5) and (6) and portrayed in figure 36. However, the precedence relationships given by the dataflow may not lend themselves to this theoretical solution in terms of requiring three processors at a TBO of 314 time units. The dataflow analysis provided by the tool only guarantees the scheduling solution with sufficient resources (processors). When resource requirements extend beyond resource availability, trade-offs may be necessary among R, TBO, and TBIO, in addition to optimization of the dataflow schedule with artificial precedence constraints. The inclusion of additional precedence constraints in the form of control edges may reduce the processor requirements of a DFG for a desired level of performance. Since the problem of findin…
19. …of Multiple Graphs
6.4.1 Parallel Execution Model Window
6.4.1.1 Overview
6.4.1.2 Display Menu
6.4.2 Time Multiplex Model Window
6.4.2.1 Display Menu
6.4.2.2 Select Menu
6.4.3 Phasing Window Overview
7 Measuring Graph-Theoretic Speedup Performance
7.1 Overview
7.2 Display Menu
8 Summarizing Dataflow Graph Attributes
8.1 Overview
8.2 Display Menu
9 Operating Points
9.1 Overview
9.2 Display Menu
10 Case Studies
10.1 Optimization of Dataflow-Derived Schedule
10.1.1 Initialization of Control Edges
10.1.2 Updating Dataflow Graph
10.2 Modeling Communication Delays
10.2.1 Network with Communication…
20. …sets
L: set of task latencies
latency: execution time of task represented by node
L_edge: latency of edge representing communication delay
LF: latest finish time of node
L_i: ith element in L; latency of ith task
L_loop: sum of node and edge latencies in graph recurrence loop
L_node: latency of node representing executable task
M: initial marking of graph
n_i: ith node in graph
Network with Com Controller: architecture model which assumes processors are fully connected via communication paths; each processing element is paired with a communication (com) controller, which handles transfer of information after processor sets up for transfer
Network without Com Controller: architecture model which assumes processors are fully connected via communication paths; model does not assume each processing unit is paired with a communication (com) controller which handles transfer of information
node: graph vertex; represents algorithm instructions or task to be executed
operating point: predicted performance and resource requirement characterized by TBO, TBIO, and R
O: schedule length
OE (output empty): number of initially empty output queue slots
OF (output full): number of initially full output queue slots
parallel execution: execution of multiple graphs in parallel without controlling graph phasing
phase, ≺, PI, R, R*, S, schedule length, SGP, Shared Memory (No Contention), sink, slack (float), source, SRE, T, task, TBI, TBIO, TBIO*, TBO, TBO*, TCE, TGP, Ti, time multiplex execution, To, token, token bound, TRE, U, VHSIC…
21. …the production of an input token by the source and the consumption of the corresponding output token by the sink is defined as the time between input and output, or TBIO. TBIO is frequently equivalent to the scheduling length, defined as the minimum time to execute all tasks for a given data set. However, when initial tokens are present, the scheduling length may be greater than TBIO. The TBIO metric, in relation to the time it would take to execute all tasks sequentially, can be a good measure of the parallel concurrency inherent within a DFG. If there are no initial tokens present in the DFG, TBIO can be determined by using the traditional critical path analysis, where TBIO is given as the sum of node latencies and data communication delays (in D, modeled by edge latency) contained in the critical path. When M defines initial tokens in the forward direction, the graph takes on a different behavior (ref. 5). This occurs in many signal processing and control algorithms, where initial tokens are expected to provide previous state information (history) or to provide delays within the algorithm. A general equation is used by the Design Tool to calculate the critical path, and thus TBIO, as a function of TBO when initial tokens are present along forward paths:

    TBIO = Σ(node ∈ critical path) L_node + Σ(edge ∈ critical path) L_edge − D_critical path × TBO    (1)

where L_node are the node latencies, L_edge are the edge latencies, and D_critical…
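Once the critical path is known, equation (1) can be evaluated directly. A minimal sketch, assuming D counts the initial tokens along the critical path; the function and argument names are illustrative, not from the manual:

```python
def tbio(node_latencies: list, edge_latencies: list,
         initial_tokens: int, tbo: int) -> int:
    """TBIO along the critical path per eq. (1): node and edge
    latencies summed, less one TBO interval per initial token."""
    return sum(node_latencies) + sum(edge_latencies) - initial_tokens * tbo

# Hypothetical critical path: three nodes, no edge delay,
# one initial token, TBO = 314 time units.
print(tbio([90, 390, 190], [], 1, 314))  # -> 356
```

With no initial tokens the expression reduces to the traditional critical-path sum mentioned in the text.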
22. …to section 5.1.1 for detailed explanations of each command. (Figure 28: Total Resource Envelope window displays processor utilization associated with Total Graph Play window.) (Figure 29: Select Utilization command from Display menu in Total Resource Envelope window to measure utilization of processors. Utilization depicted is associated with one TBO interval of 314 time units, as shown in figure 28. Readout: 4 processors 28.7%; 3 processors 70.7%; 2 processors 100.0%; 1 processor 100.0%; 0 processors 0.0%; computing effort 940; total utilization 74.8%.)

The Total Resource Envelope window provides an additional command that enables the user to measure processor utilization within a scheduling period:

Utilization. Invokes the window shown in figure 29, which displays processor utilization measurements. The measurements are based on the time interval defined by the current left and right time cursor positions.

6.2 Select Menu. The Select menu includes the commands (Jump by, Scroll Step) that enable the user to force the cursors to jump to selected nodes and/or events and set the time step for the horizontal scroll, just as in the graph play windows. Refer to section 5.1.2 for detailed explanation of each command.

6.3 Utilization Window Overview. The Utilization window dis…
23. …350 time units. (Figure 52: Select View command from File menu in main window to view graph file. Updated graph for three-processor design is shown. Note added control edges in middle window.) The graph text file (DFG.GTF), viewed in Notepad, contains node and edge records such as:

    Graph DFG
    Node A:  Read 10, Process 70,  Write 10, Inst 1; End Node
    Node B:  Read 10, Process 370, Write 10, Inst 2; End Node
    Node C:  Read 10, Process 70,  Write 10, Inst 1; End Node
    Node D:  Read 10, Process 170, Write 10, Inst 1; End Node
    Node E:  Read 10, Process 70,  Write 10, Inst 1; End Node
    Source Src; Sink Snk
    Edge Data:    Initial Src, Terminal …,  Tokens 0, Queue 1; End Edge
    Edge Data:    Initial …,   Terminal B,  Tokens 0, Queue 1; End Edge
    Edge Data:    Initial …,   Terminal C,  Tokens 0, Queue 2; End Edge
    Edge Data:    Initial C,   Terminal F,  Tokens 0, Queue 1; End Edge
    Edge Data:    Initial …,   Terminal D,  Tokens 0, Queue 1; End Edge
    Edge Data:    Initial B,   Terminal F,  Tokens 0, Queue 2; End Edge
    Edge Data:    Initial F,   Terminal Snk, Tokens 0, Queue 1; End Edge
    Edge Control: Initial E,   Terminal …,  Tokens …, Queue …; End Edge
    Edge Control: Initial B,   Terminal D,  Tokens 1, Queue 2; End Edge

…happens to decrease TBIO to 590 time units in this example, as shown in figure 53. The current updated graph design…
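A minimal reader for node and edge records of this general shape might look like the sketch below. This is an assumption-laden illustration, not the Design Tool's parser: it handles only NODE/EDGE records of KEY VALUE lines closed by END, and ignores SOURCE/SINK records:

```python
def parse_gtf(text: str):
    """Read NODE/EDGE records from a graph-text-style listing.
    Each record is a run of KEY VALUE lines ended by END NODE/END EDGE."""
    nodes, edges, record = {}, [], None
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        key = tokens[0].upper()
        if key == "NODE":
            record = {"kind": "node", "name": tokens[1]}
        elif key == "EDGE":
            record = {"kind": "edge", "type": tokens[1].upper()}
        elif key == "END" and record is not None:
            if record["kind"] == "node":
                nodes[record["name"]] = record
            else:
                edges.append(record)
            record = None
        elif record is not None:
            record[key.lower()] = " ".join(tokens[1:])
    return nodes, edges

sample = """NODE A
READ 10
PROCESS 70
WRITE 10
INST 1
END NODE
EDGE CONTROL
INITIAL B
TERMINAL D
TOKENS 1
QUEUE 2
END EDGE"""
nodes, edges = parse_gtf(sample)
print(nodes["A"]["process"], edges[0]["type"])  # -> 70 CONTROL
```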
24. 6.4.1 Parallel Execution Model Window

6.4.1.1 Overview. The Parallel Execution window displays the processor requirements and utilization performance for all graphs analyzed by the Parallel Execution model. An example for three graphs is shown in figure 32. This window is invoked by selecting the Show Parallel Execution command from the Window menu in the main window (section 3.2.3). The Total Computing Effort shown in the window will be the summation of the TCE values for all graphs. For each individual graph, the processor requirements are equal to the peak of the corresponding TRE curve, and processor utilization is calculated as before from equation (7). The total processor requirement is the sum of the individual processor requirements, calculated to be 16 in this example. The total processor utilization U_parallel model of the system is calculated by summing the individual graph speedup values (eq. (6)) and dividing by the total number of processors, as shown in equation (9):

    U_parallel model = (Σ_i S_i) / R_total    (9)

where S_i is the speedup of the ith graph and R_total is the total processor requirement. Further discussion of the Parallel Execution model and the calculation of total processor utilization by using a collection of example graphs is deferred until section 10.3. (Figure 32 readout: Parallel Graph Execution; Total Computing Effort 1950; Graph 1: Processors 4, Processor Utilization 100.0%; Graph 2: …)
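Equation (9) translates to a one-line computation. The speedup values in the example call are hypothetical; only the form of the calculation comes from the text:

```python
def parallel_utilization(speedups: list, total_processors: int) -> float:
    """Total processor utilization under the parallel-graph model,
    per eq. (9): summed per-graph speedups over the total requirement."""
    return sum(speedups) / total_processors

# Hypothetical per-graph speedups with a 16-processor total requirement.
print(parallel_utilization([3.0, 5.0, 4.0], 16))  # -> 0.75
```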
 .... 6
3.1.1 Graph Text File .... 6
3.1.2 Notes File .... 7
3.2 Main Program Overview .... 7
3.2.1 File Menu .... 7
3.2.2 Setup Menu .... 8
3.2.3 Window Menu .... 9
3.2.4 Help Menu .... 9
4. Metrics Window .... 9
4.1 Metrics Window Menus .... 12
4.1.1 Display Menu .... 12
4.1.2 Set Menu .... 12
4.2 Total Computing Effort .... 12
5. Graph Play .... 13
5.1 Single Graph Play Window .... 13
5.1.1 Display Menu .... 13
5.1.2 Select Menu .... 16
5.2 Total Graph Play Window .... 18
5.2.1 Display Menu .... 21
5.2.2 Select Menu .... 21
6. Measuring Concurrency and Processor Utilization .... 21
6.1 Display Menu .... 21
6.2 Select Menu .... 22
6.3 Utilization Window Overview .... 22
6.4 Portraying Processor Utilization
Control edges are distinguished from data edges by using blue text in the window display and an asterisk in the notes file. The display is updated automatically as the modeling parameters or characteristics change during the design session.

8.2 Display Menu

The Display menu includes commands that enable the user to select display options and save results.

Show: Invokes the dialogue box shown in figure 39 to allow the user to change the viewing options. Simply check or uncheck the boxes to show or hide the graph attributes, respectively.

save Results: Saves the graph information to the notes file. The name of the notes file can be defined with the Save Info command from the File menu in the main window (see section 3.1.2). Selecting this command will update the speedup performance data in the file, as shown in figure 40.

Figure 38. Graph Summary window displays DFG attributes associated with current dataflow schedule.

Figure 39. Select Show command from Display menu in Graph Summary window to select amount of information to display. (Check boxes shown: Instantiations, Latency, Output Empty/Full, Output Queue, Earliest Start, Latest Finish, Slack Time.)
END SOURCE: end of source object.

To specify a sink:

SINK name: specifies a sink with a unique name.
END SINK: end of sink object.

To specify an edge (edges must be specified following the NODE, SOURCE, and SINK statements):

EDGE type: type can be DATA or CONTROL.
INITIAL name: name of node producing tokens to edge.
TERMINAL name: name of node consuming tokens from edge.
TOKENS integer: number of initial tokens.
QUEUE integer: minimum FIFO queue size of edge.
DELAY integer: edge delay, used to model communication time.
END EDGE: end of edge object.

Note that the INITIAL and TERMINAL statements must precede the remaining EDGE statements.

Figure 4. Design Tool main window. (Menu summary shown in the figure: open, close, and view DFG file; name and view notes file; exit program; select architecture model; select multiple graph strategy; view operating point window, multiple graph windows, and desired metrics window; select color or black-and-white display; invoke on-line help and information on Design Tool.)

3.1.2 Notes File

A notes file is a file designated by the user, via the save Notes command, for the saving of performance results or personal notes during the design session. After creation, the file can be viewed at any time via t
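The statement grammar above is simple enough to read with a small scanner. The following is a hypothetical sketch, not the tool's actual file reader; it assumes one statement per line and ignores NODE, SOURCE, and SINK records.

```python
def parse_gtf_edges(text):
    """Collect EDGE records (DATA or CONTROL) from graph-text-file input.

    Keywords follow the manual: EDGE, INITIAL, TERMINAL, TOKENS, QUEUE,
    DELAY, END EDGE.  Parsing details here are assumptions.
    """
    edges, cur = [], None
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        key = words[0].upper()
        if key == "EDGE":
            # Defaults: no initial tokens, queue of one, no delay.
            cur = {"type": words[1].upper(), "tokens": 0, "queue": 1, "delay": 0}
        elif cur is not None and key in ("INITIAL", "TERMINAL"):
            cur[key.lower()] = words[1]
        elif cur is not None and key in ("TOKENS", "QUEUE", "DELAY"):
            cur[key.lower()] = int(words[1])
        elif cur is not None and key == "END":
            edges.append(cur)
            cur = None
    return edges
```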
Edge
Edge Data Initial A Terminal D Tokens 0 Queue 1 End Edge
Edge Data Initial D Terminal E Tokens 0 Queue 1 End Edge
Edge Data Initial E Terminal D Tokens 1 Queue 1 End Edge
Edge Data Initial E Terminal F Tokens 0 Queue 1 End Edge

Graph text file of figure 1 updated for three-processor schedule:

Graph DFG
Node A Read 10 Process 70 Write 10 Inst 1 End Node
Node B Read 10 Process 370 Write 10 Inst 2 End Node
Node C Read 10 Process 70 Write 10 Inst 1 End Node
Node D Read 10 Process 170 Write 10 Inst 1 End Node
Node E Read 10 Process 70 Write 10 Inst 1 End Node
Node F Read 10 Process 70 Write 10 Inst 1 End Node
Source Src TBI 314 End Source
Sink Snk End Sink
Edge Data Initial Src Terminal A Tokens 0 Queue 1 End Edge
EDGE CONTROL INITIAL E TERMINAL C TOKENS 0 QUEUE 1 END EDGE
EDGE CONTROL INITIAL B TERMINAL D TOKENS 1 QUEUE 2 END EDGE
Edge Data Initial A Terminal B Tokens 0 Queue 1 End Edge
Edge Data Initial B Terminal F Tokens 0 Queue 2 End Edge
Edge Data Initial F Terminal Snk Tokens 0 Queue 1 End Edge
Edge Data Initial A Terminal C Tokens 0 Queue 2 End Edge
Edge Data Initial C Terminal F Tokens 0 Queue 1 End Edge
Edge Data Initial A Terminal D Tokens 0 Queue 1 End Edge
Edge Data Initial D Terminal E Tokens 0 Queue 1 End Edge
Edge Data Initial E Terminal D Tokens 1 Queue 1 End Edge
NASA Technical Memorandum 4672

Dataflow Design Tool User's Manual

Robert L. Jones III
Langley Research Center, Hampton, Virginia

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001

February 1996

The use of trademarks or names of manufacturers in this report is for accurate reporting and does not constitute an official endorsement, either expressed or implied, of such products or manufacturers by the National Aeronautics and Space Administration.

Available electronically at the following URL address: http://techreports.larc.nasa.gov/ltrs/ltrs.html

Printed copies available from the following:
NASA Center for AeroSpace Information, 800 Elkridge Landing Road, Linthicum Heights, MD 21090-2934, (301) 621-0390
National Technical Information Service (NTIS), 5285 Port Royal Road, Springfield, VA 22161-2171, (703) 487-4650

Contents

Nomenclature .... ix
1. Introduction .... 1
2. Dataflow Graphs .... 2
2.1 Measuring and Constraining Parallelism .... 3
2.2 Run Time Memory Requirements .... 4
2.3 Control Edges .... 4
3. Dataflow Design Tool .... 5
3.1 File Input/Output
OK button will exit the program, whereas clicking Cancel will return to the previous state. By checking the Save Setup box, the program will remember the current graph file and automatically load it upon reexecution of the program.

Figure 5. Dialogue box for opening graph file.

Architecture Model: Invokes the dialogue box shown in figure 7 to allow the user to select a general model of the target architecture. The architectural models are defined as follows.

Figure 6. Dialogue box for exiting Design Tool program. (Box text: "This will end the session"; Save Setup check box; OK and Cancel buttons.)

Figure 7. Dialogue box for selection of architecture model. (Options: Shared Memory/No Contention; Network with Com Controller; Network without Com Controller.)

3.2.2 Setup Menu

The Setup menu includes commands that enable the user to select the architecture model and the type of multiple graph execution strategy. A description of each command is given as follows.

Shared Memory/No Contention: This architecture model assumes the processors are completely connected to shared memory with enough paths to avoid contention. In effect, this model provides an architecture-independent model that exposes the parallelism inherent within the algorithm, constrained only by the algorithm decomposition
Processors 7, Processor Utilization 85.7%. Graph 3: Processors 5, Processor Utilization 80.0%. Total Processors 16, Total Processor Utilization 87.5%.

Figure 32. Parallel Graph Execution window displays processor requirements and utilization under Parallel Execution model.

6.4.1.2 Display menu. The Display menu includes commands that enable the user to change display options and save results. The commands are defined below.

Show: Allows the user to change viewing options.

Save Results: Saves the utilization measurements shown in the window to the notes file. The name of the notes file can be defined with the Save Info command from the File menu in the main window. See section 3.2.1.

Redraw: Refreshes the display.

6.4.2 Time Multiplex Model Window

The Time Multiplex window portrays the processor requirements and utilization for the algorithm graphs analyzed with the Time Multiplex model. A sample display is shown in figure 33(a) for three graphs. The window displays the overall resource envelope due to the overlap of the resource envelopes of the individual graphs. The display portrays the processor utilization, which is dependent on the phase delays between graphs, for a single periodic scheduling cycle. The graph sequence and phasing determine the amount of overlap and thus the peak processor requirements for all time-multiplexed graphs. The dotted lines in figure 33(a) indicate the graph p
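The dependence of the peak processor requirement on the phase delays can be illustrated numerically. The sketch below uses assumed conventions (each envelope is a per-time-unit processor count over one scheduling cycle; names are illustrative, not from the tool):

```python
def overlapped_envelope(envelopes, delays, cycle):
    """Sum phase-shifted per-graph resource envelopes over one periodic
    scheduling cycle; return (total_envelope, peak_processors)."""
    total = [0] * cycle
    for env, delay in zip(envelopes, delays):
        for t, procs in enumerate(env):
            total[(t + delay) % cycle] += procs
    return total, max(total)

# Two identical graphs, each needing 2 processors for the first half of
# a 6-unit cycle.  Fully overlapped phasing doubles the peak; staggering
# the second graph by half a cycle places them back to back.
env = [2, 2, 2, 0, 0, 0]
_, peak_aligned = overlapped_envelope([env, env], [0, 0], 6)    # peak 4
_, peak_staggered = overlapped_envelope([env, env], [0, 3], 6)  # peak 2
```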
Multiple instantiations of a task are shown by creating multiple rows per task; this allows the bars to overlap.

Single Resource Envelope (SRE): A plot of the processor requirement for the SGP.

Total Resource Envelope (TRE): A plot of the processor requirement for the TGP.

Performance: Plots speedup versus processors, given by equation (6).

The following buttons, when clicked on, provide numerical data on DFG attributes and computing effort and allow the user to select a sink (for graphs with multiple sinks) to measure TBIO.

Graph Summary: Displays a window summarizing the DFG attributes (node names, latencies, earliest start, latest finish, instantiations, and FIFO queue sizes).

TCE: Invokes the dialogue box shown in figure 12, which shows a breakdown of computing effort. The TCE dialogue box is discussed in more detail in section 4.2.

TBIO: Invokes the dialogue box shown in figure 13 to select the desired sink in which to measure TBIO.

TBIO/Schedule Length: Toggles between displaying TBIO or the schedule length.

TBO: Allows the user to increment or decrement TBO or set it to a desired value.

Figure 13. Dialogue box to select desired sink for TBIO calculations.

Figure 14. Dialogue box to set TBO value.

Figure 15. Dialogue box to set processor limit.

Clicking on the button invokes the dialogue box shown in
analysis: Parallel Execution and Time Multiplex Execution. In the Parallel model, the phasing between the graphs is nondeterministic, whereas in the Time Multiplex model the phasing between the graphs is known a priori. Figure 31 portrays the differences between the two models and the effect on the total processor requirements. Since the Parallel Execution model assumes that the phasing of the graphs, and hence the overlap of the individual TRE's, cannot be controlled, the system must provide for the worst-case processor requirements, that is, the summation of the processor requirements of the individual graphs. In the Time Multiplex model, the overall processor requirement is a function of the overlap between graphs determined by the user. Thus, the determinism provided by the Time Multiplex model can result in fewer processors. There are two window displays for the multiple graph models, one for each model. The window displays and user interface to each are discussed in this section.

Figure 31 annotations (Graph 1, Graph 2, Graph 3): Parallel Execution, R_total = R1 + R2 + R3; Time Multiplex, R_total = max[R1(t) + R2(t) + R3(t)].

Figure 31. Parallel Execution model assumes no control over graph phasing, which requires worst-case processor design. In Time Multiplex Execution model, phasing between graphs is fixed to particular values by user. This deterministic phasing, and hence overlap between graphs, reduces processor requirements based on amount of graph overlap.
and since the TGP displays the superimposed TBO-width segments shown by the SGP window. There is no Show Slack command, since the display would become messy because of the overlapped task schedules shown within the TGP. The TGP window includes a command not offered by the SGP window, which is

Show Packets: Shows the relative data packet (data set) numbers above the associated node execution bars.

Figure 27. Single Resource Envelope window displays processor utilization associated with Single Graph Play window.

5.2.2 Select Menu

The Select menu includes commands that enable the user to display only selected nodes, force the cursors to jump to selected nodes and/or events, or set the time step for the horizontal scroll. Since the commands are functionally equivalent to the Select menu commands provided by the SGP window, see section 5.1.2 for a description of each command.

6. Measuring Concurrency and Processor Utilization

The Concurrency windows plot processor requirements and utilization over time for the DFG schedules. The plots are referred to as resource envelopes, and the area under the curve is equal to the computing effort required of the processors. The single resource envelope (SRE) shows the steady-state processor requirements of the DFG for the execution of a single computation or data packet. The SRE for the dataflow schedule of figu
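The relationship between a resource envelope and computing effort can be shown directly: stacking the task execution intervals gives the envelope, and the area under the curve equals the sum of the task latencies. A minimal sketch (the interval representation is assumed, not taken from the tool):

```python
def resource_envelope(intervals, horizon):
    """Processor count needed at each time unit, given task execution
    intervals [start, finish).  The area under this curve (its sum)
    equals the computing effort expended over the horizon."""
    env = [0] * horizon
    for start, finish in intervals:
        for t in range(start, finish):
            env[t] += 1
    return env

env = resource_envelope([(0, 3), (1, 4), (2, 3)], 5)
# env -> [1, 2, 3, 1, 0]; area sum(env) = 7 = total latency (3 + 3 + 1)
```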
Graph command from the Display menu in the Operating Point window. See also section 9. Invoking the update Graph command will prompt the user with the dialogue box shown in figure 51. In addition to reminding the user of the processor requirements, the dialogue box allows the user to accept or cancel the update to the graph. Refer to section 9 for the meaning of the Index 1 statement in the update Graph box. Selecting Yes will update the graph file DFG.GTF with the appropriate graph attributes.

Figure 51. Updating dataflow graph with design attributes. (Dialogue text: New Operating Point; Modify graph(s) for the Operating Point; Total Processors 3.)

The updated file can be viewed by selecting the View command from the File menu in the main window, as shown in figure 52 (approximately a three-page view). The updated attributes shown in figure 52, referenced to the appendix, can be compared with those of figure 47. Note that the required instantiations of node B (left window) have been set to 2, two control edges (middle window) have been added, the control edge B-D has been initialized with a single token, and data edges A-C and B-F require two queue slots each. The user may alter an updated graph as many times as required, overwriting the current design. For example, one may decide later that TBO may be sacrificed for an improvement in TBIO for the same number of processors (three in this case). Increasing TBO to
are referred to as control edges. The Design Tool allows the user to alter the dataflow schedule by choosing that a given task be delayed until the execution of another task. The Design Tool automatically models this additional precedence constraint as a control edge and initializes the edge with tokens, positive or negative, as needed to provide proper synchronization. That is, as a function of the new schedule, the precedence constraint may impose intra-iteration dependencies (for the same data set), which do not require an initial token. On the other hand, the precedence relationship may impose inter-iteration dependency (for different data sets), which requires initial tokens to occur.

Figure 2 contents: Task system: T, set of tasks; L, fixed task latencies; <, partial order on T; M0, initial state. Dataflow graph (DFG): nodes represent T; edges describe <; tokens indicate presence of data; initial marking M0. Performance bounds: schedule length; time between input and output (TBIO); minimum iteration period; time between outputs (TBO); slack; processor utilization. Run-time requirements: task instantiations; processor requirement; data buffers; artificial (control) edges. Graphical displays: Gantt chart task execution; single iteration (SGP); periodic execution (TGP); resource envelopes.

Figure 2. Dataflow Design Tool information flow.

3. Dataflow Design Tool

The dataflow paradigm p
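One plausible way to see when a control edge needs initial tokens is sketched below. This is an illustration of the idea only, not the tool's actual rule: in a periodic schedule with period TBO, a constraint that makes iteration k of the delayed task wait on iteration k - n of the other task needs n initial tokens, and an intra-iteration constraint (n = 0) needs none.

```python
from math import ceil

def control_edge_initial_tokens(producer_finish, consumer_start, tbo):
    """Hypothetical sketch: initial tokens for a control edge
    producer -> consumer.  Times are steady-state event times for
    iteration 0.  If the consumer already starts after the producer
    finishes, the dependency is intra-iteration (no token); each full
    period the producer lags adds one inter-iteration token."""
    return max(0, ceil((producer_finish - consumer_start) / tbo))

# Intra-iteration: consumer starts after producer finishes -> 0 tokens.
# Inter-iteration: producer finishes 250 units after the consumer's
# start with period 314 -> 1 initial token (the illustrative numbers
# here are assumptions, not values computed by the tool).
```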
Redraw: Refreshes the display window.

9. Operating Points

9.1 Overview

The term operating point is used to define a particular multiprocessing design that results in a level of performance and processor requirement, for example, TBO, TBIO, and R processors. The user may select multiprocessing schedules that have operating points within a plane (TBO x TBIO) that has a graph-theoretic lower bound (bottom left of the plane) defined by TBO_alb and TBIO_alb. The Operating Point window displays the plot of TBO versus TBIO with the number of required processors as a parameter. An example of an Operating Point window displaying four operating points associated with one to four processors is shown in figure 41. The dashed

Figure 40. Select save Results command from Display menu in Graph Summary window to update notes file with DFG attributes for current scheduling solution. (Notes file excerpt: DFG GRAPH SUMMARY; Total Computing Effort 940; Processing 820; Read/Write 120; Schedule Length 570; TBO 314; TBIO(Snk) 570; per-node name, latency, instantiation, and queue attributes.)

Figure 41. Operating Point window plots TBO versus TBIO. Number above each point represents number of processors required to achieve that level of performance. Subscript alb denotes absolute lower bound. (Window readout: TBO 334, TBIO 666, R 3; horizontal axis labeled TBIO_alb.)
black-and-white displays.

3.2.4 Help Menu

The Help menu allows the user to invoke the Windows Help program for on-screen help and information about the Dataflow Design Tool. A description of each command follows.

Help: Invokes the help window, as shown in figure 9. Pressing F1 also invokes the help window.

About Design Tool: Displays information about the tool, as shown in figure 10.

4. Metrics Window

The Metrics window displays the numerical performance characteristics of a graph and allows the user to invoke the graphical performance displays. The graph name is shown in the window title. A Metrics window, as shown in figure 11, is created for each graph in the graph file. Performance metrics include TCE (total computing effort), equal to the sum of all task latencies

Figure 9. Choose Help menu in main window or press F1 to invoke on-screen Help window. (Help contents: Design Tool overview; commands for the File, Setup, Window, and Help menus; glossary of defined terms; using Help.)

Figure 10. About box. (Dataflow Design Tool, Version 2.6, by Robert L. Jones, NASA Langley Research Center, Hampton, Virginia.)

lower
cursors to jump to selected nodes and/or events, or set the time step for the horizontal scroll bar.

Display: Invokes the dialogue box shown in figure 21, allowing the user to choose which nodes to include and the vertical ordering within the display. Double-click the node name shown in the list to toggle between x (show) and _ (don't show). Click once on a node and press the Top, Up, Down, or Bottom buttons to move its position relative to other nodes.

Jump by: Invokes the dialogue box shown in figure 22, allowing the user to choose which node or event to have the cursors jump by. Check the box for the desired node and/or event condition and select the node and/or event of interest.

Scroll Step: Invokes the dialogue box shown in figure 23, allowing the user to select the jump interval for the horizontal scroll bar. The range is 0 to 32767.

Figure 20. Select Display Paths menu command to display all paths or just critical paths, or Display Circuits menu command to display same for circuits. Paths and circuits are denoted with gray-shaded bars. ((a) Display Paths menu; (b) Display Circuits menu; Single Graph Play window, TIME 0 to 570, critical path shown.)

Figure 21. Choose Display command from Select menu to customize display of nodes within Graph Play windows.

Figure 22. Choose Jump by command
displays DFG attributes associated with current dataflow schedule .... 30

Figure 39. Select Show command from Display menu in Graph Summary window to select amount of information to display .... 31

Figure 40. Select save Results command from Display menu in Graph Summary window to update Notes file with DFG attributes for current scheduling solution .... 31

Figure 41. Operating Point window plots TBO versus TBIO. Number above each point represents number of processors required to achieve that level of performance. Subscript alb denotes absolute lower bound .... 32

Figure 42. Select Show command from Display menu in Operating Point window to select graph and option index to view .... 33

Figure 43. Dataflow schedule for desired number of processors (three, for example) and TBO = 314 time units may in fact require more processors (four in this case). User may wish to eliminate needless parallelism and fill in underutilized processor time. Figure shows intent to delay node C behind nodes B, D, or E .... 34

Figure 44. Imposing intra-iteration control edge E-C delays node C within its slack time so that TBIO is not increased. Rescheduling node C eliminates needless parallelism so same TBIO can be obtained with two processors rather than three .... 35

Figure 45. The Design Tool prevent
Processors: Invokes the dialogue box in figure 15 to set the calculated processor limit.

TBO: Invokes the dialogue box in figure 14 to set the desired TBO period.

Sink: Invokes the dialogue box in figure 13 to set the desired sink for TBIO calculations.

Graph Name: Invokes the dialogue box in figure 16 to allow the user to rename the graph for display purposes.

Figure 16. Dialogue box to change graph name.

4.2 Total Computing Effort

The total computing effort (TCE) value given by the Metrics window depends on the chosen architecture model defined in section 3. Figure 12 shows the dialogue box displayed for each of the three architecture models. The difference between the Shared Memory model and the Network with Com Controller model is in the interpretation of the graph attribute write time. For the network models, write time is assumed to represent setup time for the transfer of information and is denoted as such. The Network without Com Controller model displays the time spent communicating (defined by edge delays) because the processor will be burdened with the effort. Since the graph described by DFG.GTF does not define edge delays, the communication effort is shown to be zero.

5. Graph Play

The Graph Play windows provide Gantt charts describing the dataflow-derived task schedules of the algorithm graph. The single graph play (SGP) shows the steady-state time schedule of t
signal processing and control law problems. The software tool implements graph-search algorithms and analysis techniques based on the dataflow paradigm. Dataflow analyses provided by the software are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool provides performance optimization through the inclusion of artificial precedence constraints among the schedulable tasks. The user interface and tool capabilities are described. Examples are provided to demonstrate the analysis, scheduling, and optimization functions facilitated by the tool.

1. Introduction

For years, digital signal processing (DSP) systems have been used to realize digital filters, compute Fourier transforms, execute data compression algorithms, and run vast amounts of other computationally intensive algorithms. Today, both government and industry are finding that computational requirements, especially in real-time systems, are becoming increasingly challenging. As a result, many users are relying on multiprocessing to solve these problems. To take advantage of multiprocessor architectures, novel methods are needed to facilitate the mapping of DSP applications onto multiple processors. Consequently, the DSP market has exploded with new and innovative hardware and software architectures that efficiently exploit the parallelism inherent in many DSP applications. The dataflow para
ed along with the SGP window. All other Metrics windows will be minimized to icons, as shown beside the Parallel Execution window icon at the bottom of figure 61. Figure 61 also indicates that Graph 1 requires a four-processor schedule for an iteration period of 200 time units. This is an optimum four-processor solution for both TBIO and TBO, since there is no idle time in the periodic schedule. At this level of throughput (1/TBO), three data sets would execute simultaneously in the graph. Figure 62 shows the Metrics window, SGP window, and TGP window of Graph 2; eight processors are required at an iteration period of 108 time units. The TGP shows that five data sets would be processed in the graph simultaneously at this level of throughput. The dataflow analysis of Graph 3, portrayed in figure 63, indicates that four processors would suffice for an iteration period of 134 time units. Since the dataflow schedules of Graph 2 and Graph 3 have idle time, the solutions are not optimal. Nonetheless, a search for a better solution will not be shown here, since the intent of this discussion is to demonstrate the multiple graph analysis features.

Given the individual scheduling solutions portrayed in figures 61 to 63, the overall processor requirements and utilization are displayed by the Parallel Graph Execution window in figure 64. The total computing effort is computed as 1950 time units by summing the TCE values for all graphs
(refs. 7 to 9):

    T0 = max over all loops of [ (sum of L_node, node in loop) / (sum of tokens_edge, edge in loop) ]    (2)

Given a finite number of processors, the actual lower bound on the iteration period, or TBO, is given by

    TBO = max(T0, TCE/R)    (3)

where TCE is the total computing effort and R is the available number of processors. If communication effort (modeled by edge delays) is ignored, TCE can be calculated from the latencies as

    TCE = sum over i of L_i    (4)

and the theoretically optimum value of R for a given TBO period, referred to as the calculated R, can be computed as

    R = ceil(TCE/TBO)    (5)

where the ceiling function is applied to the ratio of TCE to TBO. Since every task executes once within an iteration period of TBO with R processors and takes TCE amount of time with one processor, speedup S can be defined by Amdahl's law as

    S = TCE/TBO    (6)

and processor utilization U, ranging from 0 to 1, can be defined as

    U = S/R    (7)

for a processor requirement R. By definition, the critical path does not contain slack; thus, critical path tokens will not wait on edges for noncritical path tokens (ideally). The inherent nature of dataflow graphs is to accept data tokens as quickly as the graph and available resources (processors and memory) will allow. When this occurs, the graph becomes congested with tokens waiting on the edges for processing, because of the finite resources available, without resulting in throughput above the graph-im
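Equations (3) to (7) chain together directly. The following is a minimal sketch (illustrative code, not part of the tool), exercised with the TCE = 940 and TBO = 314 values that appear for DFG.GTF elsewhere in the manual:

```python
from math import ceil

def tbo_actual(tce, processors, t0):
    """Equation (3): TBO = max(T0, TCE/R)."""
    return max(t0, tce / processors)

def calculated_processors(tce, tbo):
    """Equation (5): R = ceil(TCE/TBO)."""
    return ceil(tce / tbo)

def speedup(tce, tbo):
    """Equation (6): S = TCE/TBO."""
    return tce / tbo

def utilization(tce, tbo, processors):
    """Equation (7): U = S/R, in the range 0 to 1."""
    return speedup(tce, tbo) / processors

# DFG.GTF example: TCE = 940 and TBO = 314 give calculated R = 3
# and a utilization just under 1.0.
r = calculated_processors(940, 314)
u = utilization(940, 314, r)
```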
determine performance bounds, scheduling constraints, and resource requirements. The software tool provides performance optimization through the inclusion of artificial precedence constraints among the schedulable tasks. The user interface and tool capabilities are described. Examples are provided to demonstrate the analysis, scheduling, and optimization functions facilitated by the tool.

14. SUBJECT TERMS: Multiprocessing; Real-time processing; Scheduling theory; Graph-theoretic model; Graph search algorithms; Dataflow paradigm; Petri net; Performance metrics; Computer-aided design; Digital signal processing; Control law; Windows environment. 15. NUMBER OF PAGES: 67. 16. PRICE CODE.

17. SECURITY CLASSIFICATION OF REPORT: Unclassified. 18. SECURITY CLASSIFICATION OF THIS PAGE: Unclassified. 19. SECURITY CLASSIFICATION OF ABSTRACT: Unclassified. 20. LIMITATION OF ABSTRACT.

NSN 7540-01-280-5500. Standard Form 298 (Rev. 2-89). Prescribed by ANSI Std. Z39-18, 298-102.
(Window readout: TIME 0 to 1350; Computing Effort 1950; Total Utilization 72.2%; per-interval processor counts and utilizations.)

(b) Here delays are set to TBIO of each graph, shown in Phasing window, placing graphs back to back without processing overlap. Dashed lines in resource envelope denote phase delay between graphs.

Figure 65. Time multiplex strategy assumes phasing between graphs can be controlled by defining a finite delay between inputs to graphs.

Figure 66. Select Edit Graph Sequence command from Select menu in Multiple Graph window to edit periodic graph sequence. User can also replicate graphs so that they execute n times as often as other graphs within scheduling cycle. Figure shows Graph 2 has been replicated twice. (Graph Execution Sequence shown: Graph 2, 600; Graph 3, 450; Graph 2, 600.)

selection of control edges and static processor assignments for optimal or near-optimal scheduling solutions. Also planned are enhancements to the underlying model and control edge heuristics to permit the design of real-time multiprocessing applications for both hard and soft deadlines (ref. 13). For hard real-time modeling, the design would assume worst-case task latencies. It has been observed that under such assumptions run-time dataflow behavior may result in anomalous behavior, such as requiring more processors than indicated from the worst-case scenario (ref. 14). This is a resu
of Dataflow Programs. Proceedings of the 15th Annual International Symposium on Computer Architecture, IEEE, 1988, pp. 141-150.

Som, S.; Mielke, R. R.; and Stoughton, J. W.: Effects of Resource Saturation in Real-Time Computing on Data Flow Architectures. Twenty-Fifth Asilomar Conference on Signals, Systems & Computers, Volume 1, IEEE, 1991.

Som, S.; Obando, R.; Mielke, R. R.; and Stoughton, J. W.: ATAMM: A Computational Model for Real-Time Data Flow Architectures. Int. J. Mini & Microcomput., vol. 15, no. 1, 1993, pp. 11-22.

Stankovic, John A.; and Ramamritham, Krithi: What is Predictability for Real-Time Systems? Real-Time Syst., vol. 2, 1990, pp. 247-254.

Manacher, G. K.: Production and Stabilization of Real-Time Task Schedules. J. Assoc. Comput. Mach., vol. 14, no. 3, July 1967, pp. 439-465.

REPORT DOCUMENTATION PAGE (Form Approved)
figure 14. The minimum TBO value permissible is determined from equation (3) for the current calculated processor setting.

Processors: Allows the user to increment or decrement the calculated processor limit. Each time the calculated processors count is changed, TBO is set to the optimum value determined from equation (3).

4.1 Metrics Window Menus

The previous section presented the buttons used to invoke displays and set parameters. The Metrics window also provides two menus, Display and Set, as shown in figure 11, that can be used instead of the buttons. The commands for the Display and Set menus are described as follows.

4.1.1 Display Menu

The Display menu includes commands that enable the user to view and arrange the previously described window displays. The first six commands (Graph, TCE, Schedule length/TBIO, Graph Play, Concurrency, Performance) are equivalent to the button definitions given previously. The following three commands allow the user to refresh and arrange the displays currently on the screen.

Tile: Tiles the currently active windows invoked by the Metrics window.

Cascade: Cascades the currently active windows invoked by the Metrics window.

Reset: Refreshes the currently active windows invoked by the Metrics window.

4.1.2 Set Menu

The Set menu includes commands that enable the user to define the calculated processor value, set TBO, and change the graph name.
g this optimum solution is NP-complete and requires an exhaustive search, the Design Tool was developed to help the user find appropriate control edges when needed and to make trade-offs when the optimum solution cannot be found or does not exist (ref. 9). The solution for a particular TBO, TBIO, and R is ultimately application dependent. That is, one application may dictate that suboptimal graph latency (TBIO greater than the minimum TBIO) be traded for maximum throughput (1/TBO), while another application may dictate just the opposite. An application may also specify a control or signal processing sampling period (TBO) and a time lag between graph input and graph output (TBIO) that is greater than the lower bounds determined from graph analysis, possibly making it easier to find a scheduling solution. The use of the Design Tool for solving the optimum three-processor solution is presented in this section.

With reference to figure 43, node C has intra-iteration slack time that may be utilized by delaying node C without degradation in TBIO performance. Selecting the Add Edge command from the Display menu in the SGP window and clicking on the execution bar of node C immediately highlights the nodes independent of node C. These highlighted nodes are the only options for an intra-iteration control edge. By using the time cursors, one can easily determine that node C can be delayed behind node E without extending beyond its slack time.
ge Edge Data Initial E Terminal F Tokens 0 Queue 1 End Edge

Graph text file of figure 55, with communication delays modeled by edge delays:

Graph DFG
Node A Read 10 Process 70 Write 10 Inst 1 End Node
Node B Read 10 Process 370 Write 10 Inst 1 End Node
Node C Read 10 Process 70 Write 10 Inst 1 End Node
Node D Read 10 Process 170 Write 10 Inst 1 End Node
Node E Read 10 Process 70 Write 10 Inst 1 End Node
Node F Read 10 Process 70 Write 10 Inst 1 End Node
Source Src TBI 400 End Source
Sink Snk End Sink
Edge Data Initial Src Terminal A Tokens 0 Queue 1 End Edge
Edge Data Initial A Terminal B Tokens 0 Queue 1 DELAY 10 End Edge
Edge Data Initial B Terminal F Tokens 0 Queue 1 DELAY 25 End Edge
Edge Data Initial F Terminal Snk Tokens 0 Queue 1 End Edge
Edge Data Initial A Terminal C Tokens 0 Queue 1 DELAY 50 End Edge
Edge Data Initial C Terminal F Tokens 0 Queue 1 DELAY 20 End Edge
Edge Data Initial A Terminal D Tokens 0 Queue 1 DELAY 30 End Edge
Edge Data Initial D Terminal E Tokens 0 Queue 1 DELAY 10 End Edge
Edge Data Initial E Terminal D Tokens 1 Queue 2 DELAY 10 End Edge
Edge Data Initial E Terminal F Tokens 0 Queue 1 DELAY 15 End Edge

References

1. Mielke, R.; Stoughton, J.; Som, S.; Obando, R.; Malekpour, M.; and Mandala, B.: Algorithm to Architecture Mapping Model (ATAMM) Multicomputer Operati
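The graph text file format shown above is regular enough to parse mechanically. The sketch below is an illustrative reader for Node and Edge records, assuming fields appear whitespace-separated exactly as in the listing; the dictionary layout is my own convention, not the Design Tool's internal representation.

```python
# Minimal sketch of a parser for the graph text file format above.
# Field names (Node, Read, Process, Write, Edge, Initial, Terminal,
# Tokens, Queue, DELAY) come from the listing; everything else is an
# illustrative assumption.

def parse_graph_text(text):
    nodes, edges = {}, []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "Node":
            # e.g. "Node A Read 10 Process 70 Write 10 Inst 1 End Node"
            name = parts[1]
            fields = dict(zip(parts[2::2], parts[3::2]))
            nodes[name] = {k: int(v) for k, v in fields.items()
                           if v.isdigit()}
        elif parts[0] == "Edge":
            # e.g. "Edge Data Initial A Terminal B Tokens 0 Queue 1 DELAY 10 End Edge"
            fields = dict(zip(parts[2::2], parts[3::2]))
            edges.append({
                "initial": fields["Initial"],
                "terminal": fields["Terminal"],
                "tokens": int(fields.get("Tokens", 0)),
                "queue": int(fields.get("Queue", 1)),
                "delay": int(fields.get("DELAY", 0)),
            })
    return nodes, edges

example = """\
Node A Read 10 Process 70 Write 10 Inst 1 End Node
Edge Data Initial A Terminal B Tokens 0 Queue 1 DELAY 10 End Edge
"""
nodes, edges = parse_graph_text(example)
```

A reader built this way ignores records it does not recognize (Graph, Source, Sink), which keeps the sketch short while still recovering the node latencies and edge attributes used throughout the analysis.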
h the x(n) edge would equal the path latency leading to the x(n) edge. Likewise, the minimum time at which the token firing the nth iteration on the y(n-d1) edge could arrive from the source equals the path latency

[Figure 47. Dataflow graph attributes and timing are displayed in Graph Summary window. Edge added between nodes B and D requires initial token (OF 1).]

[Figure 48. Algorithm function example with inputs x(n), y(n-d1), and w(n-d2).]

leading to the y(n-d1) edge. However, since this token is associated with the (n-d1)th iteration, produced d1 TBO intervals earlier, the actual path latency referenced to the same iteration is reduced by the product of d1 and TBO. From this example it is easy to infer that the actual path latency along any path with a collection of d initial tokens is equal to the summation of the associated node latencies less the product of d and TBO. Thus, the critical path and TBIO are a function of TBO, given as the path from source to sink that maximizes equation (1), where d is the total number of initial tokens along the path. Although initial tokens defined by the algorithm functions tend to complicate the dataflow analysis, they become useful when the user attempts to optimize the dataflow schedule by introduci
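The path-latency rule just stated (sum of node latencies along a path, less d times TBO for the d initial tokens on that path, maximized over all source-to-sink paths) can be sketched directly. The path list below is illustrative, not taken from a figure in this report.

```python
# Sketch of equation (1): the effective latency of a source-to-sink
# path is its total node latency minus d * TBO, where d is the number
# of initial tokens along the path; TBIO is the maximum over paths.

def tbio(paths, tbo):
    # paths: list of (total_node_latency, initial_token_count) pairs
    return max(latency - d * tbo for latency, d in paths)

# Two hypothetical paths: one token-free with latency 570, and one
# carrying a single initial token with latency 800.
paths = [(570, 0), (800, 1)]
assert tbio(paths, tbo=314) == 570   # 800 - 314 = 486 < 570
assert tbio(paths, tbo=100) == 700   # 800 - 100 = 700 > 570
```

Note how the critical path changes with TBO: at a large iteration period the token-bearing path is heavily discounted, while at a small iteration period it dominates, which is exactly why TBIO must be treated as a function of TBO.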
hase delays. The sequence order and exact delays are portrayed in the Phasing window shown in figure 33(b). The Phasing window is used to define the phase between the graphs. The phase between two graphs G1 and G2 is determined by the amount of delay between injecting input data into G1 and G2. A given graph may be required to receive input data more than once in a single scheduling cycle and hence execute more often than others within a given sequence. Such a graph is said to be multiply instantiated, or replicated. In effect, this is equivalent to having multiple sampling rates within the multiple graph system. The resulting TBO for a given graph is the total amount of delay in a sequence cycle divided by the number of instantiations for the graph. In the example of figure 33, all graphs have the same TBO, which is the sum of the phase delays, 600 time units. Since TBO is a function of the phase delays, and the total processor requirement depends on the resulting overlaps of graphs, the TBO and Processors button and menu commands are disabled. These two parameters cannot be changed independently by the user. Instead, these parameters reflect the resulting time multiplex characteristics. As in measuring utilization for individual graphs, the overall processor utilization is displayed via the Utilization window shown in figure 33(c). The computing effort (area under the curve) is equal to the sum of all computing efforts for the individual gra
hat the graphs have been separated by the respective TBIO values. As before, the TBO values for Graph 1 and Graph 3 equal the total amount of phase delay in this new sequence.

[Figure 61. When more than one graph is in the graph file, each is given its own Metrics window and associated display windows. Minimizing the Metrics windows (Graphs 2 and 3 in this figure) hides all opened window displays pertaining to those graphs. Dataflow analysis of Graph 1 shows four processors are required for an iteration period of 200 time units.]

[Figure 62. Dataflow analysis of Graph 2 shows eight processors are required for an iteration period of 108 time units.]

[Figure 63. Dataflow analysis of Graph 3 shows four processors are required for an iteration period of 134 time units.]

Parallel Graph Execution window: Total Computing Effort 1950. Graph 1: Processors 4, Processor Utilization 100.0%. Graph 2: Processors 8, Processor Utilization 88.88%. Graph 3: Processors 4, Processor U
he Notes menu command. The following windows can save information to this file: Graph window, Performance window, Parallel Execution window, Time Multiplex window.

3.2 Main Program Overview

Upon invoking the Design Tool (DESIGN.EXE), the main window will appear at the top of the screen with a caption and menus as shown in figure 4. The menu commands provided by the main window are defined in this section.

3.2.1 File Menu

The File menu includes commands that enable the user to open a graph file, create and view a notes file for the current session, or exit the program. A description of each command is given as follows.

Open: Invokes the dialogue box shown in figure 5 to allow the user to select a graph file to open as input.

Close: Ends the current session for a particular graph file without exiting the program.

View File: Invokes the editor (e.g., NOTEPAD.EXE) specified in the DESIGN.INI file for displaying the current graph file.

Get Info: Shows information on the current graph file.

Save Info: Invokes a dialogue box to allow the user to specify a notes file in which to save information regarding the current design session.

Notes: Invokes the editor (e.g., NOTEPAD.EXE) specified in the DESIGN.INI file for viewing and updating the notes file with personal notes.

Exit: Exits the program. Upon exiting the program, the dialogue box shown in figure 6 will be displayed. Clicking the
he graph for a single computation, referred to as a data packet or data set. The tasks are shown scheduled at the earliest start times determined by the dataflow graph analysis. If tasks are scheduled this way and infinite resources are assumed, all the inherent parallelism present within the algorithm decomposition is exposed and limited only by the data precedences. The SGP shows the task schedule over a time axis equal to the schedule length. Task executions are represented by bars with lengths proportional to the task latencies. The SGP determines the minimum number of processors sufficient to execute the algorithm for the schedule length shown.

For digital signal processing and control law algorithms, the algorithm represented by the DFG is assumed to execute repetitively on an infinite input stream. In such instances, the user does not have to wait until the algorithm finishes the computations for a given data packet before starting on the next. Thus, it is of interest to determine a cyclic schedule that permits the simultaneous execution of multiple data packets, which exploits pipeline concurrency while assuring data integrity. For this purpose, the total graph play (TGP) shows the steady-state periodic schedule of the graph for multiple computations, or data packets, over a schedule cycle of period TBO, which is assumed to repeat indefinitely. The TGP is also constructed by assuming infinite resources so that the parallel
ial data. Such task systems can be described by a directed graph where nodes (vertices) represent the tasks and edges (arcs) describe the precedence relationship between the tasks. When the precedence constraints given by < are a result of the dataflow between the tasks, the directed graph is equivalent to a dataflow graph (DFG), as shown in figure 1. Special transitions called

[Figure 1. Dataflow graph example.]

sources and sinks are also provided to model the input and output data streams of the task system. The presence of data is indicated within the DFG by the placement of tokens. The DFG is initially in the state indicated by the marking M. The graph transitions through other markings as a result of a sequence of node firings. That is, when a token is available on every input edge of a node and sufficient resources are available for the execution of the task represented by the node, the node fires. When the node associated with task T fires, it consumes one token from each of its input edges, delays an amount of time equal to L, and then deposits one token on each of its output edges. Sources and sinks have special firing rules in that sources are unconditionally enabled for firing, and sinks consume tokens but do not produce any. By analyzing the DFG in terms of its critical path, critical circuit, dataflow schedule, and the token bounds within the graph, the performance characteristics a
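The firing rule just described can be sketched as a few lines of code. This is a minimal illustration of marked-graph semantics, not the Design Tool's implementation; the edge-naming scheme is an assumption for the example.

```python
# Minimal sketch of the DFG firing rule: a node is enabled when every
# input edge holds a token; firing consumes one token per input edge
# and deposits one per output edge. (Node latency is omitted here.)

def enabled(node, tokens, in_edges):
    return all(tokens[e] > 0 for e in in_edges[node])

def fire(node, tokens, in_edges, out_edges):
    assert enabled(node, tokens, in_edges)
    for e in in_edges[node]:
        tokens[e] -= 1
    for e in out_edges[node]:
        tokens[e] += 1

# Two-node chain A -> B with one token waiting on A's input edge.
in_edges  = {"A": ["src>A"], "B": ["A>B"]}
out_edges = {"A": ["A>B"],  "B": ["B>snk"]}
tokens = {"src>A": 1, "A>B": 0, "B>snk": 0}
fire("A", tokens, in_edges, out_edges)
assert tokens == {"src>A": 0, "A>B": 1, "B>snk": 0}
assert enabled("B", tokens, in_edges)
```

Firing A moves its token downstream and thereby enables B, which is the token-flow behavior the marking M and its successors capture.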
ile the Design Tool can accept input from the ATAMM graph entry tool developed for the AMOS system at the Langley Research Center. Updates to the graph file are made via dynamic data exchange (DDE) messages to the graph entry tool for a given design point (R, TBO, and TBIO). Changes to the graph topology due to added control edges appear in real time. The graph entry tool is responsible for writing the graph updates to the graph file.

The Design Tool also makes use of other files. An RTT file is created automatically for each graph file (GRAPHFILE.RTT) and contains performance information needed for follow-up design sessions of previously updated graphs. Two TMP files are also created, for processing paths (PATHS.TMP) and circuits (CIRCS.TMP) within the graph. An INI file (DESIGN.INI) stores (1) the graph file used in the last session, (2) the default graph file extension to search for when opening a file, (3) the location of the ATAMM graph entry tool, if used, and (4) the editor to be used to display the notes file discussed in section 3.1.2. An example of the INI file is shown in figure 3.

[DesignTool]
Extension=GTF
Model=0
Graph=D:\WIN16\ATAMM\DEMO\DFG.GRF
GraphTool=D:\WIN16\ATAMM\GRAPH\GRAPHGEN.EXE
Editor=C:\WINNT\NOTEPAD.EXE

Figure 3. Example of DESIGN.INI file. Written by Asa M. Andrews, CTA, Inc.

3.1.1 Graph Text File

The Design Tool allows the user to describe a dataflow graph with a te
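Assuming DESIGN.INI follows the standard Windows profile layout with a [DesignTool] section of key=value pairs (the exact section name and layout are my assumption from figure 3), it can be read with a stock INI parser:

```python
# Sketch of reading a DESIGN.INI-style file; the [DesignTool] section
# and key names follow figure 3, but the precise file layout is an
# assumption, not a documented specification.
import configparser

ini_text = """\
[DesignTool]
Extension=GTF
Model=0
Graph=D:\\WIN16\\ATAMM\\DEMO\\DFG.GRF
GraphTool=D:\\WIN16\\ATAMM\\GRAPH\\GRAPHGEN.EXE
Editor=C:\\WINNT\\NOTEPAD.EXE
"""

cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
assert cfg["DesignTool"]["Extension"] == "GTF"
```

A follow-up session would use the Graph entry to reopen the last graph file and the Editor entry to launch the notes-file viewer.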
in the problem, the directed graph is equivalent to a dataflow graph. Dataflow graphs are generalized models of computation capable of exposing inherent parallelism in algorithms ranging from fine to large grain. This paper assumes an understanding of both dataflow graph theory, as described by a Petri net marked graph, and the fundamental problem of task scheduling onto multiple processors. The Dataflow Design Tool is a Microsoft Windows application, and thus a working knowledge of Microsoft Windows (i.e., launching programs, using menus, window scroll bars) is also assumed.

In the context of this paper, graph nodes represent schedulable tasks and graph edges represent the data dependencies between the tasks. Because the data dependencies imply a precedence relationship, the tasks make up a partial order set. That is, some tasks must execute in a particular order, whereas other tasks may execute independently. When a computational problem or algorithm can be described with a dataflow graph, the inherent parallelism present in the algorithm can be readily observed and exploited. The deterministic modeling methods presented in this paper are applicable to a class of dataflow graphs where the times to execute tasks are assumed constant from iteration to iteration when executed on a set of identical processors. Also, the dataflow graph is assumed data independent; that is, any decisions present within the computational problem are conta
ined within the graph nodes rather than described at the graph level.

The dataflow graph provides both a graphical model and a mathematical model capable of determining run-time behavior and resource requirements at compile time. In particular, dataflow graph analysis can determine the exploitable parallelism, theoretical performance bounds, speedup, and resource requirements of the system. Because the graph edges imply data storage, the resource requirement specifies the minimum amount of memory needed for data buffers as well as the processor requirements. This information allows the user to match the resource requirements with resource availability. In addition, the nonpreemptive scheduling and synchronization of the tasks that are sufficient to obtain the theoretical performance are specified by the dataflow graph. This property allows the user to direct the run-time execution according to the dataflow firing rules (i.e., when tasks are enabled for execution) so that the run-time effort is simply reduced to allocating an idle processor to an enabled task (refs. 1 and 2). When resource availability is not sufficient to achieve optimum performance, a technique of optimizing the dataflow graph with artificial data dependencies, called control edges, is utilized.

An efficient software tool that applies the mathematical models presented is desirable for solving problems in a timely manner. A software tool developed for design and anal
initial token. Even though this initial token is determined automatically by the Design Tool, a brief explanation of inter-iteration dependency is provided in this section.

As discussed in section 2, many signal processing and control algorithms require previous state information (history or delays) within the algorithm. These delays can be modeled by initial tokens on edges. With reference to figure 48, the node output z(n) associated with the nth iteration is dependent on the current input x(n), input y(n-d1) provided by the (n-d1)th iteration, and input w(n-d2) produced by the (n-d2)th iteration. But the initial tokens necessary to obtain a desired algorithm function affect the permissible schedule of tasks for a desired iteration period. Implementation of this function would require d1 initial tokens on the y(n-d1) edge and d2 initial tokens on the w(n-d2) edge in order to create the desired delays. In such cases, the critical path, and thus TBIO, is also dependent on the iteration period TBO (ref. 5). For example, given that a node fires when all input tokens are available (if sufficient resources are present), the earliest time at which the node shown in figure 48 could fire would depend on the longest path latency leading to the x(n) or y(n-d1) edges. Assuming that the d1 and d2 tokens are the only initial tokens within the graph, the time for a token associated with the nth iteration to reac
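The delayed-input relationship of figure 48 can be sketched as an iteration loop, where the initial tokens stand in for the history values that do not yet exist on the first few iterations. The combining function f and the zero placeholders for initial tokens are illustrative assumptions.

```python
# Sketch of the figure 48 algorithm function: z(n) depends on the
# current input x(n) and on inputs delayed by d1 and d2 iterations.
# The d1/d2 initial tokens supply the missing history (here: 0).

def run(xs, ys, ws, d1, d2, f):
    zs = []
    for n in range(len(xs)):
        y_term = ys[n - d1] if n >= d1 else 0   # initial token until n >= d1
        w_term = ws[n - d2] if n >= d2 else 0   # initial token until n >= d2
        zs.append(f(xs[n], y_term, w_term))
    return zs

zs = run([1, 2, 3], [10, 20, 30], [100, 200, 300],
         d1=1, d2=2, f=lambda x, y, w: x + y + w)
assert zs == [1, 12, 123]
```

The first outputs depend only on the initial-token values, which is exactly why the number of initial tokens on an edge equals the iteration delay it models.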
ism inherent in the algorithm is exposed. The TGP determines the maximum number of processors sufficient to execute the algorithm periodically and, as mentioned in section 4, that number may be greater than the calculated number of processors given by equation (5). When processor requirements exceed processor availability, the Design Tool provides a technique of inserting artificial data dependencies, called control edges, to alter the dataflow-derived schedule in hopes of reducing the processor requirement. Insertion of control edges is explained in more detail later in this section and in section 10.1.

5.1 Single Graph Play Window

The single graph play window for the DFG of figure 1 is shown in figure 17. The task (node) names are shown vertically and time is represented along the horizontal axis. Node latencies are represented by the length of the red shaded bars. Slack time, defined as the maximum time a task can be delayed without degrading the TBIO performance or violating inter-iteration precedence relationships, is represented by unshaded bars (fig. 17(a)). Intra-iteration control edges can be inserted by utilizing the SGP window. It is often useful to observe the location of slack time displayed by the SGP and insert control edges to take advantage of the slack time while rescheduling nodes. Time measurements can be taken with the left and right cursors, displayed as vertical lines (fig. 17(b)). The left and righ
ive, more rescheduling options, corresponding to segments to the left of the node to reschedule, are available to the user.

10.1.2 Updating Dataflow Graph

A multiprocessor solution of the dataflow graph in figure 1 was designed in the previous section. The

[Figure 50. Operating Point plane for three-processor design.]

dataflow graph characterizes not only the partial ordering of tasks for correct results but also the communication, synchronization, and memory requirements at run time. As discussed in section 9, the Design Tool allows the user to update the dataflow graph file with the appropriate attributes that characterize a multiprocessor solution. This section presents the update procedures required for the three-processor design example.

Once the design is finalized as shown in the previous section, the user can review the performance plane (TBO versus TBIO) by selecting the show Operating Points command from the Window menu in the main window. Having done this, the Operating Point window in figure 50 will be displayed, showing the TBO (314 time units) and TBIO (626 time units) performance of the three-processor (R = 3) scheduling solution. Figure 50 shows that this design will result in optimum throughput (1/TBO) but suboptimal graph latency (TBIO greater than the minimum TBIO). The dataflow graph described by the file DFG.GTF can be updated with the design attributes summarized in figure 47 by selecting the update Gr
l nodes independent of it will be highlighted in yellow. Pressing the Enter key while holding down the Shift key will display the information box for the node, as shown in figure 18.

Legend: Displays the legend for the given display mode.

Add Edge: Allows the user to insert an intra-iteration control edge between two nodes. Selecting this command will display at the bottom of the window the following prompt for the terminal side of the edge:

Initial Node > Terminal Node

The terminal node (the node receiving data from the edge) is prompted for first, since the intent is to delay a particular node behind another. Point and click the left mouse button on the terminal node. The display will be updated to show all nodes independent of the terminal node highlighted in yellow; these highlighted nodes are the only options for the terminal node to create an intra-iteration control edge. At this point, the text display will prompt the user for the initial node:

Initial Node > N

Upon clicking the left mouse button on the initial node N, all displays will be automatically updated to show the new performance based on this newly inserted edge. This procedure can be canceled before selecting the initial node by pressing the Esc key. After the edge has been inserted, it can
lt of the nondeterministic overlap of computing effort required of independent tasks (both intra-iteration and inter-iteration dependencies). That is, when tasks finish earlier than the worst-case execution times (and they frequently will), predecessor tasks have the potential of starting earlier, altering the resource envelope predicted by the worst-case analysis and consuming processing power away from other tasks, which may result in missed deadlines. Static assignment of tasks (nodes) to processors with fixed start times will prevent this behavior, since the independent nodes will be prevented from moving and altering the resource envelope in a way that exceeds the worst-case processor requirements. However, such anomalies can also be avoided by inserting additional control edges that impose stability criteria (ref. 14). Incorporating a stability criteria algorithm similar to that of reference 14 would allow the Design Tool to not only determine control edges for increased performance but also to guarantee hard deadlines.

In the context of DSP systems, the Design Tool can support only a single sampling rate per graph. Many DSP algorithms require multiple sampling rates, which is equivalent to graph nodes consuming and depositing multiple tokens per firing as opposed to only one token. Enhancements are planned to the graph analysis techniques that will support multiple sampling rates within a DSP algorithm
m TBO and/or schedule length with only the calculated processor requirement. However, one cannot guarantee that a solution even exists when both TBO and R are held constant (ref. 9). In such cases, one must choose a heuristic that relaxes the criteria, fixing one parameter (e.g., processors) and allowing the other (e.g., TBO) to vary until a solution is found.

The graphical windows provided by the Design Tool are briefly described below. A more detailed description of each is provided in later sections. The windows can be invoked from the Metrics window by clicking on the buttons defined below.

Single Graph Play (SGP): A Gantt chart displaying the steady-state task schedule for a single computation. The chart is constructed by allowing tasks to start at the earliest possible time, referred to as the earliest start (ES) time, with infinite resources assumed. The chart is plotted with task names shown vertically, task execution duration given by bars, and a horizontal time axis equal to the schedule length.

Total Graph Play (TGP): A Gantt chart displaying the steady-state task schedule for multiple computations executed simultaneously over one scheduling period, which repeats indefinitely. The chart is constructed by allowing tasks to start at an earliest time equal to the ES times given by the SGP modulo TBO, with infinite resources assumed. The chart is plotted similar to the SGP except only over a TBO time interv
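The TGP construction described above reduces to one operation: each task's start time within the repeating TBO-wide cycle is its SGP earliest-start time taken modulo TBO. The ES values below are illustrative.

```python
# Sketch of the TGP construction: steady-state start time within the
# TBO-wide schedule cycle = earliest start (ES) time from the SGP,
# modulo TBO. The ES values here are hypothetical.

def tgp_starts(es_times, tbo):
    return {task: es % tbo for task, es in es_times.items()}

es = {"A": 0, "B": 90, "C": 90, "D": 190, "E": 290, "F": 480}
assert tgp_starts(es, tbo=200) == {
    "A": 0, "B": 90, "C": 90, "D": 190, "E": 90, "F": 80}
```

Tasks whose ES times fall in different TBO segments of the SGP (E and F here) wrap around and overlap tasks from other iterations, which is why the TGP can demand more processors than the calculated bound.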
nd . . . 41

Figure 54. Design Tool warns user when updating a dataflow graph for same number of processors will overwrite a previous design . . . 41

Figure 55. Dataflow graph with communication delays modeled by edge delays . . . 42

Figure 56. Dataflow analysis of DFG of figure 55 with communication delays and Network with Com Controller model . . . 42

Figure 57. Effect of edge delays on node slack time. Intra-iteration slack of node C is reduced by C < F edge delay of 20 time units. Inter-iteration slack of node E is reduced by E < D edge delay of 10 time units . . . 43

Figure 58. Dataflow analysis of DFG of figure 55 with communication delays and Network without Com Controller model . . . 44

Figure 59. Three graph examples used in demonstrating multiple graph analysis and design . . . 45

Figure 60. Capture of multiple graphs with ATAMM graph entry tool . . . 46

Figure 61. When more than one graph is in graph file, each is given its own Metrics window and associated display windows. Minimizing Metrics windows (Graphs 2 and 3 in this figure) hides all opened window displays pertaining to the graphs. Dataflow analysis of Graph 1 shows four processors are required for an iteration period of 200 time units . . . 46

Figure 62. Dataflow analysis of Graph 2 shows eight processors
nd resource requirements can be determined a priori. The Design Tool uses this dataflow representation of a task system and the graph-theoretic performance metrics presented herein. The Design Tool relies heavily on the dataflow graph for its functionality and interface. However, when the abstraction of representing a task dependency by an edge is used so often, one may adopt the terminology of saying a node executes on a processor, even though a node only represents task instructions that get executed. Nevertheless, depending on the context of the discussion, the terms node and task are used interchangeably in this paper.

2.1 Measuring and Constraining Parallelism

The two types of concurrency that can be exploited in dataflow algorithms can be classified as parallel and pipeline. Two graph-theoretic metrics are measured by the Design Tool as indicators of the degree of concurrency that may be exploited. The metrics are referred to as TBIO (time between input and output) and TBO (time between outputs) and reflect the degree of parallel and pipeline concurrency, respectively. Parallel concurrency is associated with the execution of tasks that are independent (no precedence relationship imposed by <). The extent to which parallel concurrency can be exploited depends on the number of parallel paths within the DFG and the number of resources available to exploit the parallelism. The elapsed time between
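For a token-free DFG, the TBIO metric just introduced is the latency of the critical (longest) source-to-sink path. A minimal sketch, with an illustrative graph and hypothetical latencies:

```python
# Sketch of computing TBIO for a token-free DFG as the longest
# source-to-sink path, via a recursive longest-path search on a DAG.
# Graph topology and latencies are illustrative.

def longest_path(graph, latency, node):
    succs = graph.get(node, [])
    if not succs:                      # sink reached
        return latency.get(node, 0)
    return latency.get(node, 0) + max(longest_path(graph, latency, s)
                                      for s in succs)

graph = {"Src": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
latency = {"A": 90, "B": 300, "C": 100, "D": 90}
assert longest_path(graph, latency, "Src") == 480   # Src-A-B-D
```

The parallel path through C (latency 190) never affects TBIO; it only affects how many processors are busy at once, which is the distinction between the TBIO and TBO metrics.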
ndependent; that is, there is no control over the graph phasing. This type of strategy requires more processors than if the phasing between graphs is controlled. Because the peak processor requirements within the system may overlap at a given time, a worst-case processor requirement must be utilized in the design.

Time Multiplex Execution: Multiple graph execution strategy where graphs are dependent on each other in that the phasing between graphs is controlled. This type of strategy can require fewer processors than if the phasing between graphs is not controlled. The intent is to phase the graphs in a way that idle time is filled in as processors migrate from graph to graph, but the peak processor requirement is limited to system availability.

3.2.3 Window Menu

The Window menu includes commands that enable the user to view the overall performance of the system based on a particular strategy, view a particular graph window, or draw in color or black and white. A description of each command is given as follows.

show Parallel Execution: Invokes the Parallel Graph window, displaying the parallel graph execution analysis.

show Time Multiplex Execution: Invokes the Time Multiplex Graph window, displaying the time multiplex graph execution analysis.

show Operating Points: Invokes the Operating Point window, displaying a plot of TBO versus TBIO with the required processors.

Draw in Color/BW: Toggles between color or bla
ng System Functional Specification. NASA CR-4339, 1990.

Hayes, P. J.; Jones, R. L.; Benz, H. F.; Andrews, A. M.; and Malekpour, M. R.: Enhanced ATAMM Implementation on a GVSC Multiprocessor. GOMAC 1992 Digest of Papers, Nov. 1992, p. 181.

Jones, Robert L.; Stoughton, John W.; and Mielke, Roland R.: Analysis Tool for Concurrent Processing Computer Systems. IEEE Proceedings of the Southeastcon '91, Volume 2, 1991.

Coffman, E. G., ed.: Computer and Job Shop Scheduling Theory. John Wiley & Sons, Inc., 1976.

Jones, Robert L., III: Design Tool for Multiprocessor Scheduling and Evaluation of Iterative Dataflow Algorithms. NASA TP-3491, 1995.

Som, Sukhamoy; Stoughton, John W.; and Mielke, Roland: Strategies for Concurrent Processing of Complex Algorithms in Data Driven Architectures. NASA CR-187450, 1990.

Lee, Edward Ashford: Consistency in Dataflow Graphs. IEEE Trans. Parallel & Distrib. Syst., vol. 2, no. 2, Apr. 1991, pp. 223-235.

Parhi, Keshab K.; and Messerschmitt, David G.: Static Rate-Optimal Scheduling of Iterative Data-Flow Programs Via Optimum Unfolding. IEEE Trans. Computers, vol. 40, no. 2, Feb. 1991, pp. 178-195.

Heemstra de Groot, Sonia M.; Gerez, Sabih H.; and Herrmann, Otto E.: Range-Chart-Guided Iterative Data-Flow Graph Scheduling. IEEE Trans. Circuits & Syst., vol. 39, no. 5, May 1992, pp. 351-364.

Culler, David E.: Resource Requirements o
ng artificial data dependencies. The options the user has for rescheduling a task are bounded by the number of task end times in a Total Graph Play diagram when the reschedule is a result of a new precedence constraint. As mentioned in section 5, the Total Graph Play is equivalent to the superposition of TBO-width segments dividing up a Single Graph Play diagram. Such an equivalent view of the steady-state dataflow schedule will be used to generalize the potential precedence relationships between nodes.

Figure 49 shows a generalized Single Graph Play diagram divided into segments of TBO width. The numbers above the segments refer to the relative iteration numbers, or data packet numbers, for each segment. The figure shows that the bounded rescheduling options can actually be divided into three regions. The three cases assume that the user wishes to reschedule node T behind the completion of one of three nodes: control edges created from nodes in the first region would require initialization with tokens (e.g., T3 < T would require three tokens); control edges created from nodes in the second region do not require initialization with tokens; and control edges created from nodes in the third region would require initialization with antitokens.

[Figure 49. Inter-iteration control edges may be initialized with tokens, depending on iteration number separation.]

If the user decided to impo
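The token-initialization rule behind figure 49 can be stated in one line: a control edge whose initial node fires for iteration n-k, delaying a terminal node firing for iteration n, must be primed with k initial tokens (k = 0 is the intra-iteration case; negative k would call for antitokens). A small sketch, with illustrative iteration numbers:

```python
# Sketch of the control-edge token rule suggested by figure 49: the
# number of initial tokens equals the iteration (data packet) number
# separation between the terminal and initial nodes of the edge.

def control_edge_tokens(initial_iter, terminal_iter):
    # Edge initial -> terminal: terminal's iteration n waits on
    # initial's iteration n - k, so k tokens prime the edge.
    return terminal_iter - initial_iter

assert control_edge_tokens(initial_iter=4, terminal_iter=5) == 1  # inter-iteration
assert control_edge_tokens(initial_iter=5, terminal_iter=5) == 0  # intra-iteration
```

The first case mirrors the B < D edge of the three-processor design, where one initial token expresses that node D of iteration n waits on node B of iteration n-1.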
ng node C, which imposes the C < D constraint, creates an artificial recurrence loop with a timing violation; that is, the time-per-token ratio of the loop C, D, E exceeds the current TBO of 314 time units. When such a situation arises, the Design Tool will warn the user by displaying the dialogue box shown in figure 45. Attempting the other option and imposing the B < D constraint produces the results shown in figure 46: a three-processor scheduling solution having optimum throughput. However, the solution is not optimum in terms of TBIO performance. Imposing B < D effectively delayed node D 76 time units. Before the B < D constraint was imposed, node D had 20 time units of slack. As a result of B < D, node D is now pushed 56 time units beyond its slack time, increasing TBIO above the minimum TBIO of 570 time units to 626 time units, as shown by the Metrics window of figure 46.

The Graph Summary window in figure 47 also displays the control edges added for the three-processor schedule, indicated by asterisks. With reference to the B < D control edge, OF 1, representing the presence of one initial token, characterizes the inter-iteration relationship required between B and D (TBO delay of 1) to assure the desired schedule in figure 46. This inter-iteration dependency is implied by the relative data packet numbers assigned to the node execution bars in figure 46. If data packet 2 represents the nth iteration of a n
nvokes the dialogue box shown in figure 42, which allows the user to choose the desired graph and index for viewing.

get Point: Displays the TBO, TBIO, and processor requirement for the operating point colored in blue. After the command is selected and before the dialogue box is closed, the command name changes to next Point.

next Point: This command is created in the menu after the get Point command is selected. TBO, TBIO, and processor requirement information for a different operating point is displayed each time this command is selected.

Redraw: Refreshes the display.

Figure 42. Select Show command from Display menu in Operating Point window to select graph and option index to view.

update Graph: Allows the user to update the dataflow graph file for the current operating point. A dialogue box will appear reminding the user of the total processor requirement for the operating point and allowing the user to cancel the update. A detailed discussion of updating the graph file is given in the next section (10.1). Since the graph files do not store performance information (only dataflow graph topology and attributes), TBO, TBIO, and R values are saved in a separate file (same file name as the graph file but with the extension rtt) created by the Design Tool. When the graph file is reopened, the information in this file is read so that the Operating Point window can be redrawn to show operating points defined
ode, notice that node D is enabled for execution by the (n-1)th iteration of node B.

Figure 44. Imposing intra-iteration control edge E < C delays node C within its slack time so that TBIO is not increased. Rescheduling node C eliminates needless parallelism so the same TBIO can be obtained with two processors rather than three.

ERROR: A timing violation has been detected in circuit C -> D -> E. The last control edge inserted will be deleted.

Figure 45. The Design Tool prevents insertion of artificial precedence relationships not permissible as steady-state schedules.

Figure 46. Imposing control edge B < D delays node D behind node B. Since nodes B and D belong to different iterations, the control edge imposes an inter-iteration dependency requiring initial tokens (one in this example). The Design Tool automatically calculates the appropriate number of initial tokens.

10.1.1 Initialization of Control Edges

The three-processor scheduling solution in the previous section required an inter-iteration control edge with an
oomed interval .......... 18
Figure 24. Total Graph Play window for DFG of figure 1 at TBO = 314 time units. Numbers over bars indicate relative data packet numbers .......... 19
Figure 25. Single Graph Play window view can be customized with Slice and Display menu commands. Left and right time cursors (vertical lines) are shown measuring processing time of task C, which begins at time 100 time units and has a duration of 70 time units .......... 19
Figure 26. Lower bound on TBO. TBO is limited inherently by recurrence loop of algorithm composed of nodes D and E .......... 20
Figure 27. Single Resource Envelope window displays processor utilization associated with Single Graph Play window .......... 21
Figure 28. Total Resource Envelope window displays processor utilization associated with Total Graph Play window .......... 22
Figure 29. Select Utilization command from Display menu in Total Resource Envelope window to measure utilization of processors. Utilization depicted is associated with one TBO interval of 314 time units as shown in figure 28 .......... 22
Figure 30. TRE window time cursors define time interval for utilization measurements .......... 23
Figure 31. Parallel Execution model assumes no control over graph phasing, which requires worst-case process
optimum value for TBO (TBOlb) from equation 3. For a given value of R, TBO may be changed to a value greater than or equal to TBOlb. When the schedule is altered, resulting in added control edges, the analysis is repeated to determine the new critical path, critical circuits, and modifications to the performance bounds.

The dataflow graph example shown in figure 1 is used to present the displays and capabilities of the tool. The format for the graph description file is described in section 3.1.1, and the complete graph text description used for figure 1 is provided in the appendix. The node latencies shown in figure 1 are interpreted generally as time units so that real time can be user interpreted. That is, if the clock used to measure or derive the task durations has a resolution of 100 usec, the latency of node A can be interpreted to be 9 msec. To maintain the resolution of time when applying the equations of section 2, the Design Tool always rounds (applies the ceiling function) to the next highest clock tick.

3.1 File Input/Output

The Design Tool takes input from a graph text file that specifies the topology and attributes of the DFG. The graph text file format is given in this section. Updates to the graph text file (e.g., GRAPHFILE.GTF) with design attributes and artificial dependencies are made directly by the Design Tool, with the original version saved as a backup (graphfile.bak). In addition to this graph text f
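The rounding rule can be made concrete with a short sketch (Python; the helper name is illustrative, not from the report). A measured duration is converted to whole clock ticks by dividing by the clock resolution and taking the ceiling, so durations are never truncated downward:

```python
import math

def to_ticks(duration_us, resolution_us=100):
    """Round a measured task duration up to whole clock ticks, as the
    Design Tool does before applying the equations of section 2."""
    return math.ceil(duration_us / resolution_us)

print(to_ticks(9000))   # 90 ticks: a 9 msec task at 100 usec resolution
print(to_ticks(9050))   # 91 ticks: rounds up rather than truncating
```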
or design. In Time Multiplex Execution model, phasing between graphs is fixed to particular values by user. This deterministic phasing, and hence overlap between graphs, reduces processor requirements based on amount of graph overlap .......... 24
Figure 32. Parallel Graph Execution window displays processor requirements and utilization under Parallel Execution model .......... 25
Figure 33. Time Multiplex window portrays processor requirements and utilization for algorithm graphs analyzed with Time Multiplex model .......... 26
Figure 34. Select save Results command from Display menu of Time Multiplex window to save contents of Utilization window. Utilization window must be opened to save Results .......... 27
Figure 35. Select Edit Graph Sequence from Select menu to define order in which graphs should execute within scheduling cycle. Replicating graph n times allows multiple sampling rates within graph system. That is, a given graph will execute n times more often within scheduling cycle than other graphs .......... 28
Figure 36. Performance window displays theoretical speedup performance of DFG. Shown is speedup limit associated with DFG of figure 1 .......... 29
Figure 37. Select save Results command from Display menu in Performance window to update Notes file with speedup data .......... 30
Figure 38. Graph Summary window
p values above the blocks.

Figure 36. Performance window displays theoretical speedup performance of DFG. Shown is speedup limit associated with DFG of figure 1.

Lines: Turns on or off major grid lines on the diagram for ordinate (speedup) values.

save Results: Saves the speedup performance information to the notes file. The name of the notes file can be defined by using the Save Info command from the File display in the main window (see section 3.1.2). Selecting this command will update the speedup performance data in the file, as shown in figure 37.

Redraw: Refreshes the display window.

8. Summarizing Dataflow Graph Attributes

8.1 Overview

The Graph Summary window, shown in figure 38, displays the DFG attributes and scheduling criteria for the current design or operating point (processors, TBO, and TBIO). The characteristics include:

NAME: node names
LATENCY: node latencies
ES: earliest start times
LF: latest finish times
SLACK: slack times
INST: maximum node instantiations
OE: output empty, the number of initially empty queue slots (memory buffers)
OF: output full, the number of initially full queue slots (memory buffers) due to initial tokens. Initial tokens are a result of either initial conditions (data values represented by data edges) or initial tokens on control edges required to impose inter-iteration dependencies
QUEUE: output queue size O
path is the total delay along the critical path (ref. 5). The critical path, defined as the path without slack, is the path that maximizes equation 1. Including edge latency as a model parameter provides a simple but effective means of modeling the cost of communicating data between nodes. This communication model assumes that nodes with multiple output edges can communicate the data for each edge simultaneously.

Of particular interest are the cases when the algorithm modeled by the DFG is executed repetitively for different data sets (data samples in DSP terminology). Pipeline concurrency is associated with the repetitive execution of the algorithm for successive data sets without waiting for the completion of earlier data sets. The iteration period, and thus throughput (inverse of the iteration period), is characterized by the metric TBO (time between outputs), defined as the time between consecutive consumptions of output tokens by a sink. Because of the consistency property of deterministic dataflow graphs, all tasks execute with period TBO (refs. 6 and 7). This implies that if input data are injected into the graph with period TBI (time between inputs), then output data will be generated at the graph sink with period TBO = TBI. The minimum graph-theoretic iteration period T due to recurrence loops is given by the largest ratio of loop time L_j to the initial tokens within the loop, D_loop_j, for all recurrence loops within the DFG (r
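The minimum iteration period just described can be sketched directly from its definition (Python; the function name and the (latency, tokens) pairing are illustrative). Each recurrence loop contributes the ratio of its total latency to its initial token count, and the largest ratio bounds TBO from below:

```python
def min_iteration_period(loops):
    """Graph-theoretic lower bound on the iteration period: the largest
    ratio of loop latency L_j to initial tokens D_loop_j over all
    recurrence loops; zero for a graph with no recurrence loops.
    `loops` is a list of (total_latency, initial_tokens) pairs."""
    return max((latency / tokens for latency, tokens in loops), default=0)

# Figure 1 has one recurrence loop, D-E, with latency 280 and 1 token:
print(min_iteration_period([(280, 1)]))   # 280.0
```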
ph at the top is the first graph in the sequence; the graph at the bottom is last. The Phasing window also displays the time delays between the graphs (phase delay). The delay time is the delay from when the previous graph in the sequence receives its input data to when the given graph receives input data. This delay can be changed by pressing the Delay button.

7. Measuring Graph-Theoretic Speedup Performance

7.1 Overview

The Performance window displays the number of processors versus speedup based on equation 6. The display automatically increases or decreases the abscissa each time the number of processors is changed from the Metrics window. Figure 36 shows the theoretical speedup for the DFG of figure 1. The speedup curve indicates that maximum speedup performance is attainable with four processors; additional processors will not result in any further speedup. This leveling off of performance is attributable to the recurrence loop (circuit) within the DFG. Without this circuit, the graph-theoretic speedup would continue to increase linearly with the addition of processors. However, this linear increase in speedup would ultimately break off because of operating system overhead, such as synchronization costs and interprocessor communication.

7.2 Display Menu

The Display menu includes commands that enable the user to select display options and save results.

Values: Turns on or off the display of the actual speedu
phs.

6.4.2.1 Display menu. The Display menu includes commands that enable the user to view processor utilization, show input injection intervals, and zoom into a time interval. For more information, select the Display menu command name.

Utilization: Displays the Utilization window of figure 33(c), which shows the overall processor utilization. See section 6.3 for further discussion of the Utilization window.

Show Segments: Displays dotted lines indicating the graph phase delays.

save Results: Saves the information shown in the Utilization window in the notes file (section 3.1.2), as shown in figure 34. In addition to naming the notes file from the Design Tool main window, the Utilization window must be opened (the Utilization window does the actual calculation of processor utilization when opened).

The following commands: Slice, Previous Slice, Whole,

(a) Total Resource Envelope shows processor requirement of three parallel executing graphs over the scheduling cycle. (b) Phase between graphs is controlled with Phasing window (graph sequence and phasing). (c) Overall processor utilization is displayed via Utilization window; the example reports a computing effort of 1950 and a total utilization of 81.27 percent.

Figure 33. Time Multiplex
plays the utilization of the processors and the computing effort (area under the envelope) for the interval defined by the left and right time cursors. Note that the computing effort shown in figure 29 for an entire TBO interval is equal to the TCE measurement (sum of all node latencies) given by the Metrics window in figure 11. The Utilization window is automatically updated, based on current cursor positions, each time the Utilization command is selected from the Display menu in the TRE window. An example is shown in figure 30. Not only is the total processor utilization measured, but the incremental processor utilization as well; that is, the utilization of 1 processor, 2 processors, 3 processors, and so on.

Figure 30. TRE window time cursors define time interval for utilization measurements. (b) Utilization within time interval is measured to be 88.6 percent. Area under curve within interval is 588 time units.

6.4 Portraying Processor Utilization of Multiple Graphs

This section describes the Concurrency window provided to analyze multiple graph execution under the multiple graph execution strategies. As discussed in section 3.2.2, the Design Tool provides two models for multiple graph an
point is still shown in green, whereas the new (possibly alternative) design point is shown in red with the information Point box. Selecting the update Graph command for a previously updated graph for the same number of processors will result in the dialogue box shown in figure 54, reminding the user that this design point already exists.

10.2 Modeling Communication Delays

Up to now, the dataflow model has assumed a shared-memory architecture without processor contention. This section will briefly discuss the effect of communication delays on the two network models discussed in section 3.2.2. The communication delays that are not negligible between synchronized nodes can be modeled with edge delays in the dataflow graph.

Figure 53. After dataflow graph has been updated, current design may be overwritten. Alternate design is shown with same number of processors. TBO increases to 350 time units and TBIO decreases to 590 time units. Subscript alb denotes absolute lower bound.

Update Graph: Operating Point already exists. Overwrite graph's modifications for the Operating Point? Total Processors: 3, Index: 1.

Figure 54. Design Tool warns user when updating a dataflow graph for same number of processors will overwrite a previous design.

To simplify the demonstration, the dataflow graph of figure 1 will be used, except that the edge delays shown in figure 55 are added. The
posed upper bound (refs. 10 and 11). However, when tokens wait on the critical path for execution because of token congestion within the graph, an increase in TBIO above the lower bound occurs. This increase in TBIO can be undesirable for many real-time applications. Therefore, to prevent saturation, constraining the parallelism that can be exploited becomes necessary. The parallelism in dataflow graphs can be constrained by limiting the input injection rate to the graph. Adding a delay loop around the source makes the source no longer unconditionally enabled (ref. 1). It is important to determine the appropriate lower bound on TBO for a given graph and number of resources.

(Footnote: The ceiling of a real number x, denoted ceil(x), is equal to the smallest integer greater than or equal to x.)

2.2 Run-Time Memory Requirements

The scheduling techniques offered in this paper are intended for modeling the periodic execution of algorithms. In many instances, the algorithms may execute indefinitely on an unlimited stream of input data; this is typically true for DSP algorithms. To achieve a high degree of pipeline concurrency, a task may be required to begin processing the next data sets before completing the execution of the current data set, resulting in multiple instantiations of a task. Multiple instantiations of a task require that a task execute on different processors simultaneously for different sequential data sets. System memo
rliest start of D, which shows up as slack in figure 25. Since this slack time is a result of inter-iteration dependencies, it is also a function of TBO; this can be demonstrated by reducing the TBO of the DFG by changing the number of processors to four. Doing this causes the Design Tool to apply an R of 4 in equation 3 and to find that TBO is limited by the recurrence loop composed of nodes D and E, having a time-per-token ratio of 280 time units, as shown in figure 26. At TBO = 280 time units, there is no longer idle (slack) time in the recurrence loop, since node D begins as soon as node E completes.

Before we discuss menu commands, one difference between the SGP and TGP windows involves insertion of control edges. As mentioned in the previous section, one can impose only intra-iteration control edges with the SGP window. However, with the TGP window, both intra- and inter-iteration control edges may be imposed. Intra- and inter-iteration edges (data or control) can be distinguished from one another by the fact that intra-iteration edges have no initial tokens, whereas inter-iteration edges do. Whether an imposed control edge has

Figure 24. Total Graph Play window for DFG of figure 1 at TBO = 314 time units. Numbers over bars indicate relative data packet numbers.

Figure 25. Single Graph Play window view can be c
re 17 is shown in figure 27. For the Shared Memory (No Contention) and the Network with Com Controller models, the SRE is equivalent to counting the number of overlapped execution bars within the SGP over the schedule length time interval. The SRE determines the minimum number of processors sufficient to execute the algorithm for a desired schedule length and TBIO. For the Network without Com Controller model, the SRE includes the effort required for communication as modeled by the edge delays.

The total resource envelope (TRE) shows the steady-state periodic processor requirements for multiple computations of data packets over a schedule cycle of period TBO, which is assumed to repeat indefinitely. The TRE for the dataflow schedule of figure 24 is shown in figure 28. The TRE determines the maximum number of processors sufficient to execute the algorithm periodically with period TBO. Like the SRE, the TRE is equivalent to counting the number of overlapped execution bars in the TGP when not using the Network without Com Controller model. Processor utilization measurements can be taken from the TRE window.

6.1 Display Menu

The Display menu includes the commands Slice, Previous Slice, Whole, Redraw, and Reset, which enable the user to zoom in, zoom out, or refresh the picture. These are functionally equivalent to the same commands provided in the graph play windows, so an explanation of each will not be given here. Refer instead
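The bar-counting interpretation of the SRE and TRE can be sketched as follows (Python; the function name and interval convention are illustrative assumptions):

```python
def resource_envelope(bars, horizon):
    """Processor demand at each time unit, obtained by counting the
    execution bars (half-open (start, finish) intervals) that overlap
    each instant -- the same counting the SRE/TRE windows perform."""
    envelope = [0] * horizon
    for start, finish in bars:
        for t in range(start, min(finish, horizon)):
            envelope[t] += 1
    return envelope

# The peak of the envelope is the processor requirement:
demand = resource_envelope([(0, 2), (1, 3)], 4)
print(demand, max(demand))   # [1, 2, 1, 0] 2
```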
resented in the previous section is useful for exposing inherent parallelism constrained only by the data precedences. Such a hardware-independent analysis can indicate whether a given algorithm decomposition has too little or too much parallelism early on in the development stage, before the user attempts to map the algorithm onto hardware. The Dataflow Design Tool (version 3.0), described in the remaining sections, analyzes dataflow graphs and applies the design principles discussed herein to multiprocessor applications. The software was written in C (Version 3.1 by Borland International, Inc.) and executes in Microsoft Windows or Windows NT (Version 3.1 by Microsoft Corporation). The software can be hosted on an i386/486 personal computer or a compatible type. The various displays and features are presented in this section. As a convention, menu commands are denoted with the symbol.

Figure 2 provides an overview of the input and output process flow of the Design Tool. After a DFG is loaded, the Design Tool will search the DFG for recurrence loops (circuits) and determine the minimum iteration period T by using equation 2, where T is zero if no circuits are present. TBO will initially be set to the largest task latency or T, whichever is larger. The calculated processor requirement R is initially given by equation 5. TBIO is determined from equation 1. Any changes to R will result in an update of the
ry requirements increase with the instantiation requirements of tasks, since multiply instantiated tasks must be redundantly allocated on multiple processors. For deterministic algorithms executing at constant iteration periods, the bound on the number of task instantiations can be calculated as

    Instantiations of T_i = ceil(L_i / TBO)     (8)

Even though the multiprocessor schedules determined by the Design Tool are periodic, it is important to determine whether the memory requirement for the data is bounded. However, just knowing that the memory requirement is bounded may not be enough. One may also wish to calculate the maximum memory requirements a priori. By knowing the upper bound on memory, the memory can be allocated statically at compile time to avoid the run-time overhead of dynamic memory management. Dataflow graph edges model a FIFO management of tokens migrating through a graph and thus imply physical storage of the data shared among tasks. Using graph-theoretic rules, the Design Tool is capable of determining the bound on memory required for the shared data as a function of the dataflow schedule.

2.3 Control Edges

When resource requirements for a given dataflow graph schedule are greater than resource availability, imposing additional precedence constraints (artificial data dependencies), thereby changing the schedule, is a viable way to improve performance (refs. 1, 5, and 12). These artificial data dependencies
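The instantiation bound of equation (8) can be applied directly (Python sketch; the figure-1 numbers below are for illustration only):

```python
import math

def instantiations(latency, tbo):
    """Upper bound on simultaneous instantiations of a task,
    per equation (8): ceil(L_i / TBO)."""
    return math.ceil(latency / tbo)

# Node B of figure 1 has latency 10 + 370 + 10 = 390 time units; at
# TBO = 314 the bound is 2 copies, while at TBO = 390 one suffices.
print(instantiations(390, 314), instantiations(390, 390))   # 2 1
```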
s. For each individual graph, the processor requirement is equal to the peak of the corresponding TRE curve, and processor utilization is calculated from equation 7. To provide for the worst-case overlap of the individual TREs, the total processor requirement is the sum of the individual processor requirements, calculated to be 16 in this example. As defined in section 6.4, the total processor utilization of the system is calculated from equation 9, where the individual graph speedup values are summed together and divided by the total number of processors. In this example, utilization of all processors is computed to be (800/200 + 750/108 + 400/134)/16 = 87.1 percent.

10.3.2 Time Multiplex Execution of Multiple Graphs

This section demonstrates the idea of deterministically controlling the phasing between graphs, which can produce scheduling solutions with fewer processors.

Figure 59. Three graph examples used in demonstrating multiple graph analysis and design.

The procedures involved in analyzing graphs with the Time Multiplex model are slightly different from the procedures discussed up to this point. As with the Parallel Execution model, a Metrics window is created for each graph. But as discussed in section 6.4.2, the user cannot independently control the TBO and processors (R) parameters for each graph. These parameters depend on the periodic
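The utilization arithmetic above, which follows equation (9), can be checked with a few lines (Python; the function name and the (TCE, TBO) pairing are illustrative):

```python
def total_utilization(graphs, total_processors):
    """Parallel Execution model: each graph's speedup is its computing
    effort TCE divided by its TBO; total utilization (equation 9) is
    the summed speedups over the total processor count."""
    return sum(tce / tbo for tce, tbo in graphs) / total_processors

# The three-graph example: (800/200 + 750/108 + 400/134) / 16
u = total_utilization([(800, 200), (750, 108), (400, 134)], 16)
print(round(u * 100, 1))   # 87.1 percent
```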
s insertion of artificial precedence relationships not permissible as steady-state schedules .......... 35
Figure 46. Imposing control edge B < D delays node D behind node B. Since nodes B and D belong to different iterations, control edge imposes inter-iteration dependency requiring initial tokens (one in this example). Design Tool automatically calculates appropriate number of initial tokens .......... 36
Figure 47. Dataflow graph attributes and timing are displayed in Graph Summary window. Edge added between nodes B and D requires initial token (OF = 1) .......... 37
Figure 48. Algorithm function example .......... 37
Figure 49. Inter-iteration control edges may be initialized with tokens depending on iteration number separations .......... 38
Figure 50. Operating Point plane for three-processor design .......... 39
Figure 51. Updating dataflow graph with design attributes .......... 39
Figure 52. Select View command from File menu in main window to view graph file. Updated graph for three-processor design is shown. Note added control edges in middle window .......... 40
Figure 53. After dataflow graph has been updated, current design may be overwritten. Alternate design is shown with same number of processors. TBO increases to 350 time units and TBIO decreases to 590 time units. Subscript alb denotes absolute lower bou
se the constraint Ta < T, the two nodes would belong to the same iteration i (an intra-iteration dependency). Thus, the control edge imposing Ta < T would not require an initial token.

If the user decided to impose the constraint T3 < T, the two nodes would not belong to the same iteration i but would instead be separated by three iterations. That is, the ith iteration of node T would be triggered by the (i-3)th iteration of node T3. This implies that during the transient state of execution, node T3 would fire three times (on three consecutive data sets) before node T fired even once. The only way this could happen is if the edge T3 < T had three initial tokens.

The last case may seem strange at first. If the user imposed the third constraint, the two nodes would again not belong to the same iteration i but would instead be separated by a single iteration. What is strange about this is that the ith iteration of node T would be triggered by the (i+1)th iteration of the other node, a data set injected into the graph in the future. This implies that during the transient state of execution, node T would have to throw away the first token received and not fire until receiving the second and subsequent tokens. This type of synchronization can be modeled with negative token counts. These negative token counts are referred to as antitokens. By permitting initial token counts to be negative as well as posit
solute lower bound lines indicate the absolute lower bound of the performance plane. The displayed plot is associated with only one graph and index at a time. The term index is used to distinguish different options for the same number of processors. The Design Tool can utilize multiple graph and index options when using graph files generated by the ATAMM Graph Entry Tool. The user can select the graph and option index to view by using the Show command from the Display menu. If the graph has multiple sinks, the TBIO metric is equal to the maximum TBIO for all paths within the graph.

The current undefined design point (the user has not updated the graph file for an operating point) is colored red; operating points already defined (updated graph file) are colored green; and a coincident undefined and defined point is colored magenta. When the user is searching the TBO, TBIO, R information box for a particular operating point (see the next Point command), the selected point is colored blue.

The Display menu commands Show, get Point, Redraw, and update Graph are defined next.

9.2 Display Menu

The Display menu includes commands that enable the user to select the graph and the index to view, get information (TBO, TBIO, and R) for an operating point, refresh the display, and update the graph file with the necessary modifications and characteristics to model or achieve the desired operating point.

Show: I
sor number of three is now 370 time units. One would expect the earliest start times shown by the SGP to remain the same, and the SRE would remain the same as well, except for the filling in of idle time due to the communication requirements. Since the TBO has changed, however, the TGP and TRE should be different. Referring to the TRE, one can see that even though the idle time has been filled in under this model, the processor requirement remains four, with slightly less utilization. This can be attributed to the fact that computing power is the product of TBO and the number of processors. Thus, computing power can be increased to meet the added computing effort of a problem by increasing TBO, processors, or both. In this case, the increase in TBO has accommodated the added effort incurred by the communication requirements, keeping the processor requirement at four. The one major difference in the two network models is seen in overhead, in this case measured to be 26.1 percent, due to the added 170 time units of communication effort.

10.3 Multiple Graph Models

The Design Tool provides two models of parallel graph execution. The first model, called the Parallel Execution model, is applicable when multiple independent algorithms are executing in the system simultaneously. The model assumes that the phasing of one algorithm graph with another is not known and cannot be controlled. The second model, called the Time Multiplex model, handles
t cursors are controlled with the left and right mouse buttons, respectively. The left cursor can also be controlled with the left and right arrow keys alone, and the right cursor in the same way while holding down the Shift key. There are also commands to enable the user to zoom into time intervals between the left and right cursors. Information can be obtained on any node by pointing to a node and clicking the left mouse button while holding the Shift key down, as shown in figure 18(a). Figure 18(b) shows information that can also be obtained on the event associated with the current left cursor position by pressing Shift-Enter. Moving the cursors with the keyboard automatically updates the information window.

5.1.1 Display Menu. The Display menu includes commands that enable the user to zoom into and view internal transitions, slack

Figure 17. Single graph play window. Shaded bars indicate task execution duration; unshaded bars indicate slack time. Callouts mark intra-iteration slack time and a TBO-width segment line; the left cursor is controlled by the left mouse button or the left and right arrow keys, and the right cursor by the right mouse button or the same keys with Shift.
t imposes control edges with delay attributes around sources to control phas .......... 53

Nomenclature

AMOS  ATAMM multicomputer operating system
ATAMM  algorithm to architecture mapping model
control edge  artificial data dependency added to dataflow graph to alter schedule
data packet  input, output, and intermediate computations for given iteration
data set
D  set of edge latencies representing communication delay between dependent tasks
D_critical path  total number of initial tokens in longest path
D_loop  total number of initial tokens within graph recurrence loop
DFG  dataflow graph
DSP  digital signal processing
d_i<j  communication delay between task i and task j
edge  represents data dependency (real or artificial) within directed graph; edge is implemented with FIFO queue or buffer
EF  earliest finish time of node
ES  earliest start time of node
FIFO  first in, first out
graph  graphical and mathematical model of algorithm decomposition where tasks are represented by nodes (or vertices) and data or control dependencies are represented by directed edges (or arcs)
GVSC  generic VHSIC spaceborne computer
index  operating point option for same number of active processors
instantiations  same task simultaneously executing in multiple processors for different data sets
L, latency; L_edge; LF; L_i; L_loop; L_node; M_0; N_j; Network with Com Controller; Network without Com Controller; node
taflow machine (hardware and/or software) implementation, the dataflow graph can be executed as is to achieve the solution predicted by the tool. Alternatively, the user may wish to implement a static scheduling solution with a heuristically chosen schedule. In this case, the static scheduling algorithms could be tailored to use the dataflow analysis (e.g., earliest start and slack times of nodes) as criteria to obtain a static solution that also assures data integrity (that precedence constraints are not violated).

NASA Langley Research Center
Hampton, VA 23681-0001
September 21, 1995

Appendix: Graph Text Description

Graph text file of figure 1:

Graph DFG
Node A  Read 10  Process 70   Write 10  Inst 1  End Node
Node B  Read 10  Process 370  Write 10  Inst 1  End Node
Node C  Read 10  Process 70   Write 10  Inst 1  End Node
Node D  Read 10  Process 170  Write 10  Inst 1  End Node
Node E  Read 10  Process 70   Write 10  Inst 1  End Node
Node F  Read 10  Process 70   Write 10  Inst 1  End Node
Source Src  TBI 0  End Source
Sink Snk  End Sink
Edge Data  Initial Src  Terminal A    Tokens 0  Queue 1  End Edge
Edge Data  Initial A    Terminal B    Tokens 0  Queue 1  End Edge
Edge Data  Initial B    Terminal F    Tokens 0  Queue 1  End Edge
Edge Data  Initial F    Terminal Snk  Tokens 0  Queue 1  End Edge
Edge Data  Initial A    Terminal C    Tokens 0  Queue 1  End Edge
Edge Data  Initial C    Terminal F    Tokens 0  Queue 1  End Edge
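The record-oriented format above is simple to read mechanically. A minimal reader for the Node records (Python sketch; the helper is hypothetical and deliberately ignores Source, Sink, and Edge records) recovers each node's total latency:

```python
import re

def parse_nodes(text):
    """Return {name: read + process + write} for every Node record in
    a graph text description like the listing above."""
    nodes = {}
    pattern = r"Node\s+(\w+)\s+Read\s+(\d+)\s+Process\s+(\d+)\s+Write\s+(\d+)"
    for m in re.finditer(pattern, text):
        name = m.group(1)
        read, process, write = (int(g) for g in m.group(2, 3, 4))
        nodes[name] = read + process + write
    return nodes

sample = "Node A Read 10 Process 70 Write 10 Inst 1 End Node " \
         "Node B Read 10 Process 370 Write 10 Inst 1 End Node"
print(parse_nodes(sample))   # {'A': 90, 'B': 390}
```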
…the OK button.

Reset Graph Delays — Resets the phase delays between graphs such that the graphs are separated by the respective scheduling-length times of the individual graphs.

Reset Graph Sequence — Resets the graph sequence such that each graph appears only once in the sequence cycle. The resulting sequence will reflect the order in which the graphs were saved in the graph file created by the ATAMM graph entry tool. The order of the sequence is depicted in the Phasing window.

The remaining Select menu commands (Jump by, Scroll Step) are identical to the commands provided by the graph play windows. See section 5.1.2 for definitions.

[Figure 35. Multiple Graph Execution Display window with the Graph Execution Sequence dialogue showing Graph_1 > 150, Graph_2 > 200, Graph_3 > 250. Select Edit Graph Sequence from the Select menu to define the order in which graphs should execute within the scheduling cycle. Replicating a graph n times allows multiple sampling rates within the graph system; that is, a given graph will execute n times more often within the scheduling cycle than the other graphs.]

6.4.3 Phasing Window Overview

The Phasing window (fig. 33(b)) determines the sequential ordering and delays of the graphs for the Time Multiplex model. The ordering of the graph sequence is represented by a top-down list of graph names. The gra…
…the cases where the graph phasing is known a priori and can be controlled deterministically. As you will see in this section, the processor requirements of the Parallel Execution model tend to be greater than those of the Time Multiplex model.

The three graphs shown in figure 59 will be used to demonstrate the Design Tool features that apply to both multiple graph models. At present, the multiple graph model features of the Design Tool can only be used with graph files generated by the ATAMM Graph Entry Tool, since the Graph Text File format only accommodates a single graph. Figure 60 shows the capture of the three graphs by the Graph Entry Tool. The user chooses which multiple graph model gets applied to which graphs by using the dialogue box shown in figure 8.

[Figure 58. Dataflow analysis of the DFG of figure 55 with communication delays and the Network without Com Controller model. The Utilization window shows 4 Processors 31.1%, 3 Processors 55.4%, 2 Processors 100.0%, 1 Processor 100.0%, 0 Processors 0.0%; Processing 820; Read/Setup 120; Communicating 170; Overhead 25.1%; TIME 0–605; Computing Effort 1060.]

10.3.1 Parallel Execution of Multiple Graphs

When multiple graphs are defined in a graph file generated by the ATAMM graph entry tool, the Design Tool creates a Metrics window for each graph. The first graph found in the file will be open…
…the graph. This prevents the source from being unconditionally enabled and instead injects input periodically, with period TBO, when desired. For the Time Multiplex model, the sources are updated with a network of edges that describe the desired graph sequence and phase. Figure 69(a) shows the resulting ATAMM Graph Entry Tool display after the graphs have been updated. Not apparent in the picture, because the edges overlap one another, are two edges forming a loop between Graph 1 and Graph 2 and between Graph 2 and Graph 3, as portrayed in figure 69(b). An initial token marks the edge incident on Graph 1 to indicate that it fires first. The firing of Graph 1 will encumber the token and deposit output tokens for node A and the source G2. However, the token destined for G2 will not appear for 250 time units, which imposes the desired phase delay between Graph 1 and Graph 2. After 250 time units, source G2 will encumber the token and deposit a token for source G3 and node N1, which takes 350 time units to arrive. The sequence of firings continues, delaying Graph 2 after G3 by 150 time units and Graph 1 after Graph 2 by 250 time units, returning the token to the initial marking and allowing the sequence to repeat indefinitely.

11. Future Enhancements

Extensions to the Design Tool planned in the near future include incorporating heuristics to automate the…

[Figure: Multiple Graph Execution window showing the Phasing (Time Multiplex) and Utilization panes.]
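The circulating-token behavior described above can be checked numerically. The sketch below, with the delays quoted in the passage (illustrative graph names), walks the token through one sequence cycle and recovers the firing times:

```python
# Sketch of the Time Multiplex firing sequence described above. Each entry
# pairs the graph (source) that fires with the edge delay before the token
# reaches the next source; the delays are the values quoted in the text.
sequence = [("Graph_1", 250), ("Graph_2", 350), ("Graph_3", 150), ("Graph_2", 250)]

def firing_times(sequence):
    t, events = 0, []
    for graph, delay in sequence:
        events.append((t, graph))
        t += delay
    return events, t  # t is the full cycle length

events, cycle = firing_times(sequence)
# events: Graph_1 at 0, Graph_2 at 250, Graph_3 at 600, Graph_2 at 750
# cycle: 1000 time units, after which the initial marking is restored
```

Note that the replicated Graph 2 fires at 250 and 750, i.e., every 500 time units, which matches the balanced-phase-delay behavior discussed for the three-processor solution.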
…Further discussions of the models implemented by the Design Tool are provided in section 10 for a few case studies. Enhancements planned for the tool are discussed in section 11.

2. Dataflow Graphs

A generalized description of a multiprocessing problem and how it can be modeled by a directed graph is presented. Such formalism is useful in defining the models and graph analysis procedures supported by the Design Tool. A computational problem can often be decomposed into a set of tasks to be scheduled for execution (ref. 4). If the tasks are not independent of one another, a precedence relationship will be imposed on the tasks in order to obtain correct computational results. A task system can be represented formally as a 5-tuple (T, <, L, D, M). The set T = {T1, T2, T3, …, Tn} is a nonempty set of n tasks to be executed, and < is the precedence relationship on T such that Ti < Tj signifies that Tj cannot execute until the completion of Ti. The set L = {L1, L2, L3, …, Ln} is a nonempty, strictly positive set of run-time latencies such that task Ti takes Li amount of time to execute. The set D = {d i<j, …} is a strictly positive set of latencies associated with each precedence relationship; a latency d i<j in D that is associated with the precedence Ti < Tj represents the time required to communicate the data from Ti to Tj. Finally, M is the initial state of the system, as indicated by the presence of initial tokens.
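As a concrete sketch of this formalism, the 5-tuple can be captured directly in code and used to compute the earliest start (ES) and earliest finish (EF) times that the Design Tool reports. The task names, latencies, and delays below are illustrative, not taken from the manual's example graphs:

```python
from collections import defaultdict

# Illustrative task system (T, <, L, D, M): three tasks with run-time
# latencies L_i, and a communication delay d_{i<j} on each precedence.
L = {"T1": 90, "T2": 390, "T3": 90}                        # run-time latencies
D = {("T1", "T2"): 20, ("T1", "T3"): 20, ("T2", "T3"): 0}  # delays d_{i<j}

def earliest_times(L, D):
    preds = defaultdict(list)
    for i, j in D:               # each key (i, j) encodes the precedence Ti < Tj
        preds[j].append(i)
    ES, EF = {}, {}
    def es(task):
        if task not in ES:
            # A task may start once every predecessor has finished and its
            # data has arrived (predecessor latency plus edge delay).
            ES[task] = max((es(p) + L[p] + D[(p, task)] for p in preds[task]),
                           default=0)
            EF[task] = ES[task] + L[task]
        return ES[task]
    for task in L:
        es(task)
    return ES, EF

ES, EF = earliest_times(L, D)
# ES: T1 = 0, T2 = 110, T3 = 500; EF: T1 = 90, T2 = 500, T3 = 590
```

The recursion simply expresses that ES of a task is the maximum over its predecessors of ES + latency + communication delay, with sources starting at time zero.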
[Figure 64. Parallel Graph Execution window summarizing processor requirements for all graphs intended to execute in parallel (Total Processors 16). Since it is assumed that phasing between graphs cannot be controlled in the parallel graph strategy, the total processor requirement is the worst case, or the sum of the individual processor requirements.]

…1800 time units. However, since Graph 2 produces output twice in this period, it has twice the throughput of the other graphs. Consequently, its TBO value is 900 time units. Whether a replicated graph produces output with a fixed period depends on the total phase delay that converges on the graph in a given sequence. With reference to the Phasing window of figure 67, notice that Graph 2 will receive input and produce output after 1050 time units (Graph 1 to Graph 2), then after 750 time units (Graph 3 to Graph 2), then after 1050 time units, and so on. This means that within a cycle of 1800 time units, Graph 2 actually has two TBO intervals. However, over the 1800-time-unit cycle, the average TBO interval for Graph 2 is 900 time units, the value always displayed in the Metrics window. A three-processor, dual-sampling-rate solution that fills in as much idle time as possible (any further overlap of graphs will require four or more processors) is shown in figure 68. The phase delays have been reduced such that the overall scheduling cycle is no…
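The averaging just described is simple arithmetic; sketched with the values quoted above:

```python
# Average TBO of the replicated graph: Graph 2's two output intervals in
# one 1800-time-unit cycle (values quoted in the text above).
intervals = [1050, 750]
cycle = sum(intervals)                  # 1800 time units
average_tbo = cycle / len(intervals)    # 900 time units, shown in the Metrics window
```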
…utilization of 72.2 percent. In this example, the area under the resource envelope, as indicated by the Computing Effort value shown in the Utilization window, is the sum of the individual graph TCE values.

Assume in the remaining discussions that this multiple graph problem is to be mapped onto a three-processor system and that one can arbitrarily define the sampling rates of the system. The intent would be to fully utilize (minimize the idle time of) the three processors in a way that maximizes overall system speedup. One can see from figure 65 that for an overall speedup of 1.44 (TCE/TBO = 1950/1350), there is significant idle time in the periodic dataflow schedule. By adjusting the relative phase delays between the graphs, the user can fill in this idle time by allowing just the right amount of graph overlap. For demonstration purposes, two different sampling rates can be allowed in the system by replicating Graph 2 twice. That is, Graph 2 will receive, process, and produce data twice as often as Graphs 1 and 3 within a cycle. To do this, click on the Edit Graph Sequence command from the Select menu in the Multiple Graph window to invoke the dialogue box shown in figure 66. Clicking on Graph 2, pressing the Replicate button, and then clicking on the Down button will produce the sequence shown in the dialogue list box. Figure 67 shows the resulting performance of the new sequence: Graph 1, Graph 2, Graph 3, Graph 2, Graph 1. Notice t…
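The speedup figure quoted above follows directly from the two quantities in the Utilization window:

```python
# Overall speedup as quoted above: total computing effort (TCE) divided
# by the time between outputs (TBO) gives the effective parallelism.
tce, tbo = 1950, 1350    # time units, from the example
speedup = tce / tbo      # about 1.44
```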
…customized with Slice and Display menu commands. Left and right time cursors (vertical lines) are shown measuring the processing time of task C, which begins at time 100 time units and has a duration of 70 time units.

[Figure 26. Total Graph Play window (TIME 0–280) showing the lower bound on TBO. TBO is limited inherently by the recurrence loop of the algorithm, composed of nodes D and E.]

…one or more initial tokens is schedule dependent. The Design Tool automatically calculates the proper number of initial tokens, if any, needed on a control edge at the time of insertion to model the proper synchronization between the nodes. Once a control edge is inserted with a determined number of initial tokens, the initial token count will not change as the schedule is altered by adding more control edges or changing TBO. Thus, the favored schedule resulting from the initialized control edges at one TBO period may not be favorable at a different TBO period.

Most of the features offered by the SGP window are also offered by the Total Graph Play window. Rather than redefining the shared functionality, only the added or missing features are discussed in this section. Refer to section 5.1.1 for detailed descriptions of the common menu commands.

5.2.1 Display Menu

The commands offered by the Display menu are equivalent to those of the SGP window, except for the following omissions. There is no Show Segments command…
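Returning to the TBO lower bound illustrated by figure 26 earlier in this passage: a recurrence loop bounds TBO from below by its total latency divided by the number of initial tokens it carries. The node times below are the read + process + write values from the appendix graph for nodes D and E; the single initial token is assumed for illustration:

```python
# Lower bound on TBO from a recurrence loop (TBO >= total loop latency
# divided by the initial tokens in the loop). Node times are the
# read + process + write values of the appendix graph; one initial
# token is assumed for illustration.
def tbo_lower_bound(loop_latencies, initial_tokens):
    return sum(loop_latencies) / initial_tokens

loop = [10 + 170 + 10,   # node D: read + process + write
        10 + 70 + 10]    # node E: read + process + write
bound = tbo_lower_bound(loop, initial_tokens=1)  # 280 time units
```

The 280-time-unit result matches the TIME 0–280 span shown in the figure 26 window.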
…output edges is equal to the maximum of all output edge delays. The processor idle time that occurs during communications, which may result in 0 percent processor utilization at times, is necessary to assure data integrity. There is nothing for a processor to do in this graph during this time, but maybe there is something for it to do elsewhere, especially in the multiple graph execution scenarios. Note that the overhead is the same as in the shared memory model.

The edge delays affect not only the calculation of earliest start times but also the calculation of inherent slack time. With reference to figure 57, it is apparent that the intra-iteration slack of node C, which was 300 time units before, has been reduced to 265 time units. Edge delay also affects the inter-iteration slack. The slack within the recurrence loop was 24 time units without the edge delays, and the slack of node E in figure 57 is now 14 time units. This slack can be calculated as the difference between the iteration period (TBO = 314 time units) and the total loop time including edge delay (300 time units). The difference is equal to the slack within the loop, in this case 14 time units.

10.2.2 Network without Communication Controller

The Network without Com Controller model assumes that each processing element is not paired with a…
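The inter-iteration slack arithmetic described above is worth stating explicitly, with the values quoted in the passage:

```python
# Inter-iteration slack from the passage above: the slack within a
# recurrence loop is the iteration period minus the total loop time
# (including edge delay).
tbo = 314                 # iteration period, time units
loop_time = 300           # total recurrence-loop time including edge delay
slack = tbo - loop_time   # 14 time units, the slack of node E in figure 57
```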
…now 1000 time units. Unlike the two-processor example, where Graph 2 produced output in an oscillating fashion due to the unbalanced phase delay, this solution has equal phase delays (500 time units) along both paths converging onto Graph 2. Hence, Graph 2 will always produce output every 500 time units in this solution. Noting the results in the Utilization window, the processor utilization has increased to 90 percent, with an overall speedup of 2.7 (2700/1000).

The dataflow analysis shown by this example predicts the performance and resource requirements of a multiple graph execution scenario, assuming that the graphs are phased as shown in figure 68. In the same way that graph edges describe the communication and synchronization required of partially ordered tasks, the graph edges can also be used to describe a multiple graph execution sequence. Just as edge delays can be used to model communication delays between synchronized tasks, edge delays can model the phase delays between synchronized graphs. To do this, the Design Tool imposes control edges with edge delay onto sources, which are normally unconditionally enabled. The control edges are added to the sources, along with the usual graph attributes, as a result of updating the graph, as discussed in section 10.1.2. For the Parallel Execution model, which is the model of choice for a single graph, sources are updated with a self-loop control edge with a delay equal to the desired TBO of…
…window portrays processor requirements and utilization for algorithm graphs analyzed with the Time Multiplex model.

[Figure 34. Select the Save Results command from the Display menu of the Time Multiplex window to save the contents of the Utilization window; the Utilization window must be opened to save the results. The screen shows the saved notes file MULTIGRF.TXT in Notepad, recording the time and date (Tue Jul 19 15:02:40 1994), the graph file (D:\WIN16\ATAMM\DEMO\multigrf.grf), graph totals, the Time Multiplex slot delays, and per-graph processor utilization, with Computing Effort 1950 and Total Utilization 81.2 percent.]

Redraw and Reset are identical to the commands provided by the graph play windows. See section 5.1.1 for definitions.

6.4.2.2 Select menu

The Select menu includes commands that enable the user to force the cursors to jump to selected events and to set the time step for the horizontal scroll bar.

Edit Graph Sequence — Invokes the dialogue box shown in figure 35, which allows the user to change the sequential ordering of graphs. Graphs can be replicated for multiple sampling rates by clicking on a graph name and pressing the Replicate button. The edited sequence and graph replications will be depicted in the Phasing window upon selecting…
…text file. The file may only describe a single graph. Updates to the file (node instantiations, queue sizes, input injection rate, and added control edges) for a given analysis or design are done automatically by the tool by using the Update Graph menu command within the Operating Point window defined in section 9. The format of the file is given below. Keywords are not case sensitive, items shown in brackets are optional, name specifies a character string with a maximum of 20 characters and no spaces, and integer specifies a number from 0 to 32767. Optional parameters that are omitted have a default value of zero. Blank lines separating statements are allowed. See the appendix for examples.

The first line in the file must be:

GRAPH name — specifies the name of the graph

Following the GRAPH statement, in any order, are:

To specify a node transition:

NODE name — specifies a node with a unique name
PRIORITY integer — task priority (for information only)
READ integer — time to read input data
PROCESS integer — time to process data
WRITE integer — time to write output data or set up for communication
INST integer — task instantiations
END NODE — end of node object

Note that the statements between NODE and END NODE may be in any order.

To specify a source transition:

SOURCE name — specifies a source with a unique name
TBI integer — time between inputs, i.e., the input injection period
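Putting the statements above together, a minimal file in this format might look as follows. The graph, node, and edge names and the time values are illustrative, and the edge syntax follows the appendix listing:

```
Graph Example
Node N1 Read 10 Process 100 Write 10 Inst 1 End Node
Source Src TBI 0 End Source
Sink Snk End Sink
Edge Data Initial Src Terminal N1 Tokens 0 Queue 1 End Edge
Edge Data Initial N1 Terminal Snk Tokens 0 Queue 1 End Edge
```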
…analysis is introduced. The software program, referred to hereafter as the Dataflow Design Tool or Design Tool, provides automatic and interactive analysis capabilities applicable to the design of a multiprocessing solution. The development of the Design Tool was motivated by a need to adapt multiprocessing computations to emerging very high speed integrated circuit (VHSIC), space-qualified hardware for aerospace applications. In addition to the Design Tool, a multiprocessing operating system based on a directed graph approach, called the ATAMM multicomputer operating system (AMOS), was developed. AMOS executes the rules of the algorithm to architecture mapping model (ATAMM) and has been successfully demonstrated on a generic VHSIC spaceborne computer (GVSC) consisting of four processors loosely coupled on a parallel interface (PI) bus (refs. 1 and 2). The Design Tool was developed not only for the AMOS and GVSC application development environment presented in references 1 and 3 but also for other potential dataflow applications. For example, information provided by the Design Tool could be used as scheduling constraints to aid heuristic scheduling algorithms.

A formal discussion of dataflow graph modeling is presented in section 2, along with definitions of graph-theoretic performance metrics. Sections 3 through 9 provide an overview of the user interface and the capabilities of the Dataflow Design Tool, version 3.0. Fur…
