Home

Algorithms for Mapping and Scheduling Real-Time

1. 86T 09 GIT 67 6cT LV oc 97 6cTU TV 059718 6 5 GE Su 8 0718 8Y0 8Z 042719 66607 VIT8G 069 L8 72096 2 99ugjsur T6T 0 627 0 TEZO PIST 00c 0 06 0 GIE 0 ESPERO 9T7 GI 59 74 06 2 946 6 666 GCE V y 9ouejsul Scr 991726 242 Lay 66 r1077 cv v9 608 CC leenman own AdO slow s eun 140 5 oum 5 oum s un own Sumesig Sup eo1q Sjure1jsuoo Surpeoxqqnnur sjure1jsuoo aue Sousa momum c ApwwAg zer YM ATI Sjure1jsuoo Aze YUM dT eugeseq ATI Purpsegg UM AE qua dI OIJ JOOJIBIA uo 530101380002 Ze SUISN oY JO SIYNSOY 9 40 CHAPTER 6 EXPERIMENTAL SETUP Chapter 7 Related works In this chapter we present several works which have been done in field of application mapping and scheduling We mainly focus on real time applications Each work listed approaches our problem a similar problem or approaches it from a different angle We are providing a quick overview of these approaches 10 describes how we can represent real time data driven applications using SDF graphs Also how computation of the throughput is done Unfortunately it does not cover the problem of mapping and scheduling 7 does not deal with the mapping or scheduling but it shows what needs to be considered if we modeled a real system with communication between processors and shared memories 9 is a grea
2. Figure 4 2 Schedule showing variables used by equation 4 3 Equation 4 4 ensures that actor i is executed only once at a time This equation is related to equation 4 1 Since actor i can be mapped only on one particular processor it can not be executing more then once in the same moment This would mean one processor would be handling two tasks at the same time which is not permitted in general 14 CHAPTER 4 BASELINE ILP gt Au S0 Es t lt 1 VjeJ te 0 Tmar 4 4 1 l Equation 4 5 makes sure that the amount of work done by actor during the periodic phase is equal to the actors repetition vector n multiplied by its WCET n d Mapping decision makes sure we are counting WCET only for decided mapping Counting the work done by actor during the periodic phase is done by calculating the difference of total work done before the period starts W t start t and total work done during the whole schedule W Tinax Tas Wi Tmaz Y Wilt start t m Ars dij VicT 4 5 t 0 jeJ We are able to count work done time while the actor i was executing W t by actor i in every time unit by counting the difference between number of started S t and finished E t executions of actor i in time t Difference between S t and E t can be in interval 0 1 if and only if the difference is equal to 1 the task is actually executing in the time t Equation 4 6 counts these differences and fills the ve
3. Figure A 17 Artifical instance 12 Appendix B Nomenclature API CP DAG FPS GPU HSDF ILP MCM MPSoC NoC SDF TDMA UAV WCET Application interface Constraint programming Directed acyclic graph Frames per second Graphics processing unit Homogeneous synchronous dataflow model Integer linear programming Maximum cycle mean Multiprocessor system on chip Network on chip Synchronous dataflow Time division multiple access Unmanned aerial vehicle Worst case execution time 65 66 APPENDIX B NOMENCLATURE Appendix Content of attached CD Java implementation L MappingRTStreamingApplicationsOnMPSoC LaTex Chapters Hron thesis 2009 pdf Hron thesis 2009 ps Hron thesis 2009 tex Images Makefile figures hyphen tex k336 thesis macros sty reference bib Figure C 1 Content of attached CD 67 68 APPENDIX C CONTENT OF ATTACHED CD
4. I declare that I worked out the presented thesis independently and I quoted all used sources of information in accord with Methodical instructions about ethical principles for writing academic thesis In Prague on May 11 2015 Abstract This thesis addresses the problem of mapping and scheduling of real time data driven applications on heterogenous multi core platforms We are proposing an ILP for mulation to solve this problem and introduce multiple performance upgrades for the formulation in usage of lazy constraints and a symmetry breaking algorithm Also we describe the pros and cons of two options and how to use lazy constraints We provide an implementation tutorial for lazy constraints with use of CPLEX Java API We are providing experiments for lazy constraints and the symmetry breaking algo rithm as well as for baseline ILP for comparison We experimentally show that the performance upgrade is up to factor 7 but based on results of real applications we are still proposing to work on the performance issues in the future Keywords mapping scheduling real time application heterogenous multi core platform lazy constraints symmetry breaking ILP CPLEX Java API Abstrakt Tato diplomov pr ce se zabyv problematikou mapov ni a rozvrhov ni aplikaci pro v d nych v re ln m ase na heterogen multiprocesorov platformy Pro vy e en probl mu je v pr ci navr ena formulace celo seln ho programov n
5. 1 1 A44 1 because the objective function can not be any better in terms of length of 26 CHAPTER 5 LAZY CONSTRAINTS A14 1 As 1 1 Aga 1 5 7 Figure 5 5 Initial mapping and symmetries for the instance the period The best case would be the actor with a changed mapping would fill the empty spaces in the schedule This would result in the same length of the schedule and periodic phase would remain the same length This is an example for the input matrix d containing WCET of actors and symmetry matrix Sym both shown in figure 5 4 A11 Az2 A21 Aga lt 3 5 8 Ara Agi A23 Aga lt 5 9 Aza Aas Ar A22 3 5 10 Aga Ari t A22 lt 3 5 11 We introduce three functions in Code 5 4 The function removeSymmetries ac cept decision variable matrix A which represents mapping and matrix of symmetries Sym as input The algorithm does not provide any output because the main purpose of the algorithm is for it to add the lazy constraints to the active model based on the input variables If the algorithm finds that the mapping is determined in index 4 7 1 at line 4 it starts searching if there are any symmetries for actor 4 present in the model If there is a symmetry line 6 between actors 4 and js the algorithm checks whether or not actor js is mapped to the symme
6. 2 D Figure A 4 Modem with buffer 51 al0 hil Da 1 ha 0 a6 in 92 APPENDIX A GRAPHS USED FOR BENCHMARKING n 1 2376 2376 1 Figure A 5 H 263 decoder with buffer n 1 2 3 4 PP 33 RK 5 1 O CA RD 10 52 Figure A 6 Artifical instance 1 54 APPENDIX A GRAPHS USED FOR BENCHMARKING 4X0 2 3 4 2 7 4 5 o 2 A A lt 4 8 7 8 8 7 Figure A 7 Artifical instance 2 su 5 7 4 4 75 5 3 Figure A 8 Artifical instance 3 59 56 APPENDIX A GRAPHS USED FOR BENCHMARKING n 1 1 1 14 1 14 Figure A 9 Artifical instance 4 Figure A 10 Artifical instance 5 58 APPENDIX A GRAPHS USED FOR BENCHMARKING n 1 1 1 13 2 1 13 3 7 6 6 CO DJ Figure A 11 Artifical instance 6 59 n 1 1 1 13 2 gt _ U 2 4 4 13 DI Y Figure A 12 Artifical instance 7 60 APPENDIX A GRAPHS USED FOR BENCHMARKING n 1 1 1 14 1 Figure A 13 Artifical instance 8 61 n 1 10 2 4 1 Figure A 14 Artifical instance 9 62 APPENDIX A GRAPHS USED FOR BENCHMARKING 8 Figure A 15 Artifical instance 10 n 4 6 2 5 1 6 oC 1 2 1 1 4 12 7 1 12 7 12 4 2 3 7 Figure A 16 Artifical instance 11 63 64 APPENDIX A GRAPHS USED FOR BENCHMARKING n 1 1 1 1 10 2 17 1
7. Sou st pr ce je tak n kolik vylep en dan ho algoritmu za elem zv en jeho v konu Tyto p stupy jsou porovn ny na z klad experiment z hlediska v konnu jednotliv ch al goritm Pr ce poskytuje stru n n vod jak pou vat lazy constraints pomoc CPLEX Java API a experiment ln ukazuje e je mo n a sedmkr t zrychlit navr en al goritmus Na z klad experiment s re ln mi aplikacemi navrhuji d le pracovat na probl mech spojen ch s v konnem Kl ov slova mapov n rozvrhov n aplikace prov d n v re ln m ase hetero gen multiprocesorov platformy lazy constraints odstra ov n symetri celo seln programov n CPLEX Java API ix Contents 1 Introduction 2 Background 2 1 Synchronous Dataflow Model 2 2 Homogeneous SDF Model 2 9 System Architecture 24 Mapping and Scheduling relation 3 Problem Formulation 4 Baseline ILP 4 1 Formulation se 4 2 Baseline ILP improvement were ah wA em ipe ie 4 3 Linearization of the baseline ILP 5 Lazy constraints 5 1 Introduction to lazy constraints 5 1 1 Lazy constraints 5 1 2 Lazy constraints callback EROR E REM 5 2 ILP with la y constraints rar hr he dere tei qe de cese an 5 3 Symmetry breaking e 2 boy t
8. 13 Several algorithms implemented in the SDF tool were also used to prepare the data for our experiments 6 CHAPTER 2 BACKGROUND t periodic transien phase phase ajajaja a p time Figure 2 5 Schedule showing transient and periodic phase 2 3 System Architecture The architecture of the system has been considered in several ways Firstly we need to point out that all the considered architectures were multiprocessor multicore systems Communication between the cores and processors have not been considered due to their calculation complexities We assume m cores of k different types If m k we consider it as a fully heterogenous system This means that all actors have a different WCET for each processor If m gt k then the system architecture is only partially heterogeneous which means all actors have the same WCET for the same type of processor e g graphics processing unit GPU 2 4 Mapping and Scheduling relation Mapping refers to assigning actors to processors This means each actor will have a dedicated processor during the application runtime while the static schedule will ensure the processor will be available for concrete actor execution at a particular time Scheduling means building a static schedule for every single actor in the applica tion During this we optimize particular criteria For example the overall length of schedule or just part of the schedule e g
9. 2 3 is holding information that actors and 4 are symmetrical on processors 2 and 3 index starting at 0 1100 061 0 1100 _ 0 1 0 0 di 0022 Sum 0 0 0 2 3 0022 0 0 23 Figure 5 4 Symmetries for matrix introduced on figure 5 2 5 3 2 Symmetry breaking using lazy constraints callback Since we are adding lazy constraints to active models based on their values from the first integer solution we are not able to add those constraints into the lazy constraints pool before the solver starts Therefore we need to use lazy constraints callback as we discussed in section 5 1 2 The main idea for this approach is to remove all branches of the ILP tree which contains symmetrical solutions from the one which had been found Lets say the integer solution will find a mapping which can be seen in figure 5 5 We are able to express this mapping by a number of equations 5 7 Using this mapping we are able to build an equation which forbids symmetrical mapping Constraints which break symmetries is mathematically described in equations 5 8 5 11 These equations forbid symmetrical mapping While the solution consists of mapping decisions introduced in figure 5 5 we are forbidding any other symmetrical mapping Symmetrical mapping would for example A4 1 Agi 1 1 A44 1 violate the lazy constraints 5 8 and 5 9 Also please note that it is forbidden to map two actors on the same processor e g 1 A21
10. breakSymm Ps A length 1 void breakColumnSymmetries A i j js f Expression currentMappingEx new Expression int c 0 int T T i 1 Sym i js 0 1 0 T i 1 Sym i js 1 1 T js 1 5 1111 1101 1 0 T js 1 Sym i js 0 1 0 o for int a 0 T length a for int b 0 T al length b 5 if T allb 1 currentMappingEx addTerm 1 Ala 1 b 11 Expression breakSymm new Expression if symmetryMapping Sym il js1 0 4 breakSymm addTerm 1 A i 1 Symlil js 1 1 1 1 5 4 CPLEX PARAMETERS AFFECTING PERFORMANCE 29 breakSymm addTerm 1 A js 1 Sym illjsl 11 1 1 else breakSymm addTerm 1 A i 1 Sym il js 0O 1 1 breakSymm addTerm 1 A js 1 Symli js 0 1 1 model addEqutionLe currentMappingEx breakSymm A length 1 Code 5 4 Functions used callback of CPLEX Java to break symmetries 5 4 CPLEX parameters affecting performance Disabling cuts Since we are focusing on the performance upgrade of the proposed baseline ILP we should also target the CPLEX itself We have found that several parameters can affect the CPLEX solver performance and solving time W believe that parameters can affect different ILP formulations in different ways To determine this we are proposing to use experimental testing for any particular formulation Cuts commented in Code 5 5 were not helpful and did not impro
11. 2 2 EA Sea eva are Zu 62 A 16 Artifical instance 11 2 63 A 17 Artifical instance 12 ccana caa ca daa bI ME E ea 64 C 1 Content of attached CD 67 List of Tables 4 1 4 2 6 1 6 2 6 3 6 4 Input variables 4x9 La X RU RH ER Decision variables Size of the real application instances used for benchmarking Specification of hardware used for experiments Results of experiments using Kepler server for calculation Results of the experiments using lazy constraints on MacBook Pro XV 12 13 32 34 35 39 xvl LIST OF TABLES Chapter 1 Introduction With an increasing number of real time streaming applications we must also keep in mind the importance of satisfying real time constrains Besides non real time applica tions which do not have any requirements there are two types of real time constraints First there are soft real time constraints with requirements to avoid missing deadline but some misses are tolerable The second type is hard real time constraints which do not tolerate misses of any deadline Missing deadlines on hard real time constraint applications would have catastrophic consequences Normally applications have re quirements for execution time but streaming applications care about throughput which stands for the amount of data produced by the applic
12. Online accessed 20 April 2015 J Zhu I Sander and A Jantsch Energy efficient streaming applications with guaranteed throughput on mpsocs In Proceedings of the 8th ACM international conference on Embedded software pages 119 128 ACM 2008 Appendix A Graphs used for benchmarking In this appendix we are providing a set of inputs A subset of them were used in our thesis for experiments Nodes in the graph represents different actors Their names are written in each of the nodes Production and consumption rates are present on the edge beside each actor In the middle of the edge can be present initial token in the parenthesis Also we must mention that if no number of production or consumption rate is present the rate is equal to 1 The repetition vector is always listed in the rectangle beside each graph 47 48 APPENDIX A GRAPHS USED FOR BENCHMARKING Figure A 1 Satellite receiver with buffer 5 RN W W N RD NX oo CO 14 CA 5 1 n 147 147 98 28 32 160 1 1 1 1 1 Figure A 2 Sample rate converter with buffer 49 50 APPENDIX A GRAPHS USED FOR BENCHMARKING CD D ETFRRRRRRRRRRRRN a5 re Cao render gt a7 subbinvO a8 subbinvl DA Figure A 3 MP3 decoder with buffer n 1 2 1 1 16 16 1 1 1 2 1 1 1 1 1 1 al5 forkl Joa
13. input variables nor decision variables to the callback class Code 5 3 shows how you can use the lazy constraints callback using the CPLEX Java API The code in this example is the same as the code introduced in Code 5 2 But as you can see we have more control in terms of adding the lazy constraints The condition line 16 in Code 5 3 which determines whether or not to add the lazy constraints in the active model line 17 in Code 5 3 does not need to be in close relation to the constraint itself 1 20 CHAPTER 5 LAZY CONSTRAINTS public class ClassUsingLazyConstraintsCallback void run cplex use new CustomLazyConstraintCallback if cplex solve printFeasibleSolution else printInfeasibleSolution private class CustomLazyConstraintCallback extends IloCplex LazyConstraintCallback Override protected void main throws IloException if equation left side gt equation right side this add IloRange cplex le left expr right expr equation 1 Code 5 3 Using lazy constraint callback in CPLEX Java API 5 2 ILP with lazy constraints Experiments described in chapter 6 were initially performed over the baseline ILP model introduced in chapter 4 Based on the results of these experiments we decided to improve the performance in terms of solving time because it was not possible to run realistic examples in a reasonable time Since the model contains a large number of equations and
14. section 5 4 but we did not have a good experience with it in terms of performance results These and further results can be found in Chapter 6 We have experimented with both approaches Lazy constraints callback was faster in smaller instances instances with shorter solving times On the other hand lazy constraints function was better in terms of performance in bigger slower solving times instances Our theory is that the solver overhead caused by maintaining the lazy constraints pool is more complex than the instance itself 5 3 Symmetry breaking Instances which are not meant for fully heterogenous systems can contain a lot of mapping symmetries This means that entirely different mappings can result in the same schedule because WCET of actors are the same for different processors It can be caused by processors which are the same type Figure 5 1 shows how symmetrical schedules can look like Because the actors a and a9 have the same WCET on processors p and pa the schedule does not change while we change the mapping of A14 to A15 1 This means if we change the mapping of all of the actors from processor p to processor and processors are the same type all actors have same WCET on both processors the length of the schedule must remain the same Symmetries can be seen in figure 5 2 Values which equal 0 i e d 0 means the actor i cannot be mapped to processor 7 In other words actor i can be mapped only to processo
15. time jobs on a heterogeneous multiprocessor In Proceedings of the 7th ACM IEEE international conference on Embedded software pages 57 66 ACM 2007 A Nelson K Goossens and B Akesson Dataflow formalisation of real time streaming applications on a composable and predictable multi processor soc Journal of Systems Architecture 2015 K Rosvall and I Sander A constraint based design space exploration framework for real time applications on mpsocs In Proceedings of the conference on Design Automation i Test in Europe page 326 European Design and Automation As sociation 2014 A K Singh M Shafique A Kumar and J Henkel Mapping on multi many core systems survey of current and emerging trends In Proceedings of the 50th Annual Design Automation Conference page 1 ACM 2013 45 46 110 11 12 13 14 BIBLIOGRAPHY S Stuijk T Basten M Geilen and H Corporaal Multiprocessor resource allo cation for throughput constrained synchronous dataflow graphs In Proceedings of the 44th annual Design Automation Conference pages 777 782 ACM 2007 Ilog cplex 11 0 user s manual cuts description http www eio upc es lceio manuals cplex 11 html usrcplex solveMIP14 html state to 20 4 2015 Ilog cplex 11 0 parameters reference manual http www eio upc es lceio manuals cplex 11 pdf refparameterscplex pdf state to 20 4 2015 Sdf3 tool homepage 2015 http www es ele tue nl sdf3
16. to take into account the mapping decisions for the processors and communication channels because it has a big impact on the schedule Heterogeneous platforms have become very popular for their high computational performance at low power in the past few years To evaluate our approach we used real application benchmarks provided by the SDF tool 13 and 2 CHAPTER 1 INTRODUCTION also propose several artificial benchmark instances Mapping and scheduling should not take too long as to avoid impacting the design time of application Our goal was to synthesize an optimal mapping and schedule for the application model in the form of synchronous dataflow SDF graph with the goal of maximizing throughput It hold a set of tasks precedence constraints and a worst case execution time for each processor where the actor is able to execute This thesis consists of multiple parts Chapter 2 provides the background in formation required to understand the contributions of the thesis In Chapter 3 we formulate the problem for this thesis We present an ILP formulation which will serve as the baseline formulation for comparison in Chapter 4 Use of this ILP formulation has proven to be ineffective in terms of performance solving time on real application models Hence we are proposing improvement for this ILP in Chapter 4 while also using lazy constraints A description and usage of lazy constraints with a symmetry breaking algorithm can be found in Chapt
17. 2 92 78 132 98 76 89 217 134 80 83 80 87 d 96 97 68 45 68 39 74 96 69 38 94 88 114 18 68 68 69 37 78 69 49 57 98 45 78 6 10 2 5 12 1 1 4 1 5 1 9 2 9 12 34 7 42 45 87 67 56 12 19 47 21 12 9 86 9 7 76 35 12 46 5 89 6 9 51 46 36 92 122 89 90 42 48 9 85 27 5 4 1 9 1 18 19 1 1 21 1 2 67 89 Figure 6 4 Input of fully heterogenous instance instance of MP3 decoder 6 3 contains the same instances from sections 6 3 and 6 4 which have been calculated on the Kepler server It is important to point out that the smaller difference between the Period and the actual objective function was the better performance upgrade lazy constraints provided This is caused by the fact that if we bind the objective function which minimizes the period and solution with the lowest possible period is found the solver can stop its execution The results of the first three experiments described in sections 6 3 6 4 and 6 5 are graphically displayed in figure 6 5 The chart is made from the real time world clock and is clearly shown that the lazy constraints increase the performance of the CPLEX solver with a factor at least 2 This resulted in a significant speedup Our conclusion for these experiments is that rewriting this particular formulation into CPLEX Java API did not increased the performance Generally the CPLEX Java API seems to slow things down But by using lazy constraints in Java which had significant performance upgrades in all instances which was ab
18. 54 10 19010 Graphs used in this thesis for benchmarking have been attached in Appendix A We have faced performance problems with the SDF benchmark instances due to the particular problems complexity At the beginning we had to set a reasonable solving time which the experiment should finish This deadline for completion was set after the first experiments at 2 days 48 hours This breakpoint was satisfied in most instances but instance of the H263 decoder were not solved by the proposed ILP in the given time We are introducing several results in this chapter If any tables containing results have the X sign instead of a proper time it means the solver did not finished within the 48 hour limit We also propose a CPU time in some cases The CPU time is the sum of times of every thread used by the solver and the solving time refers to the real time which humans can perceive This metric would become very relevant in the case the implementation of CPLEX would handle parallel processing a better way in using lazy constraints Current implementation stop the other threads while the callback is executing and needs to wait until the lazy constraints callback finishes 6 1 BENCHMARK INSTANCES n 12 1 1 16 16 1 1 12 1 1 4 141 1 0 PEES KO CoNo N MU o 2 5 ft 6 Figure 6 2 Modem with buffer 33 34 CHAPTER 6 EXPERIMENTAL SETUP 6 2 Hardware Several exp
19. Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Graphics and Interaction DIPLOMA THESIS ASSIGNMENT Student Bc Jakub Kop iva Study programme Open Informatics Specialisation Software Engineering Title of Diploma Thesis Algorithms for Mapping and Scheduling Real Time Streaming Applications on Heterogeneous Multi core Platforms Guidelines This assignment addresses mapping of streaming applications on multi core systems It deals with design of algorithms for automated mapping and scheduling Then on the basis of the existing algorithms and literature research a mapping and scheduling algorithm will be designed The expected outcome of the assignment is an implementation of the proposed methodology Specify the mapping and scheduling problem and review existing solutions Study the designed algorithm and justify choices made during the design phase Describe possible design adaptations considering the specific problem Implement selected designed algorithm s using ILP CPLEX solver Evaluate the quality of the proposed algorithm in terms of time and space complexity Compare the designed algorithm with existing algorithms using existing models of real applications from the SDF3 tool 2 The ILP based approach in 1 3 will serve as baseline for comparison Bibliography Sources 1 Jing Lin Srivatsa A Gerstlauer A Evans B L Heterogeneous multiprocessor ma
20. T for those actors the worst mapping decision Period is determined as length of the schedule if every actor runs in parallel and executes on the fastest resource available Since every actor is mapped only to one processor and no actor can run simultaneously Period lt Period P 4 9 Period gt maxi di ni 4 10 iel Period gt Period 4 11 LB 1 n Period max mini di d 4 12 Because ILP formulation targets the period schedule it does not care about the transient phase Therefore a schedule could start with 5 gt 0 which would mean that some tasks would have started executing before time t 0 which is obviously impossible That is why we introduce equation 4 13 4 3 Linearization of the baseline ILP In this section we show how non linear equations can be converted for usage in ILP frameworks All non linear equations 4 3 4 5 contain a multiplication of integer decision variables and binary decision variables Hence we can use linearization proposed in this section since the binary variable can contain only values 0 1 To use the linearization we need to define auxiliary variables X Y Z with usual indices where 7 7 stands for actor and processor respectively and t stands for time we will use the variable M which is a sufficiently big enough integer Equation 4 3 can be linearized by equations 4 14 4 17 Those formulations can be expressed by condition If actor 7 is mapped on processo
21. ack with all of the CPU threads available cplex setParam IloCplex IntParam Threads 4 Code 5 6 CPLEX cut parameters Chapter 6 Experimental Setup In this chapter we introduce instances which have been used for experimental testing Experiments were performed in several setups Baseline ILP was benchmarked in IBM OPL Studio as well as in Java implementation with the goal to compare these two implementations Then we introduce benchmark results for lazy constraints and lazy constraints callback as well as the symmetry breaking algorithm used in lazy constraints callback We define what instances were used for our experiments in section 6 1 where you will find several example graphs All graphs can be found in Appendix A We also introduce a table comparing real application sizes You can find the specification of the hardware used in the experiments in section 6 2 In sections 6 3 and 6 4 we compare the differences between an implementation of the baseline ILP introduced in section 4 1 We have implemented the baseline ILP in IBM OPL Studio and Java with use of CPLEX Java API Experiments in sections 6 5 and 6 6 introduce the results of the two different approaches using the lazy constraints and the symmetry breaking algorithm 6 1 Benchmark instances Benchmark instances used for comparison of implementation in Java were selected from the SDF tool Our goal was to experiment with commonly known instances and also with real
22. an be used and how they work Unfortunately it does not describe how to implement lazy constraints and what the possible options are This was described in this thesis Chapter 8 Conclusion and Future work Our main goal was to find an algorithm that can map actors to processors and build a static schedule in a reasonable time A successful mapping and scheduling algorithm was introduced in the form of an ILP formulation which had been implemented in IBM OPL Studio as well as in Java using CPLEX Java API as a point of comparison The baseline ILP had several issues which needed to be solved in order for it to be implemented into a CPLEX framework Linearization of non linear equations was briefly described in section 4 3 The complexity of this problem is due to the mapping and scheduling which has to be done simultaneously because every mapping decision can have an effect on the schedule We have described the needs for this in section 2 4 Also the complexity is given by size of the instances While the instances do not seem complex at first the SDF graph is complex because presence of the precedence constraints in the form of consumption and production rates The drawbacks of the ILP formulation are clear It has serious performance is sues To solve it we have implemented lazy constraints into the model and proposed a symmetry breaking algorithm While the lazy constraints have good results in most cases the symmetry breaking algorithm i
23. ance improvements The disadvantage of using the lazy constraints is that the CPLEX solver can only use one thread while running Or to be more specific more threads can be forced by parameter but we did not have any good results doing so It is probably caused by the callback itself because every other branching needed to be stopped before the callback had finished executing Figures 6 6 and 6 7 displays the values of solving times in real world clock time and CPU time respectively From those figures it is clear that in most cases lazy constraints save CPU time but in world clock time are still outperformed due to a lack of multithreading computation 38 CHAPTER 6 EXPERIMENTAL SETUP Baseline ILP OILP with Lazy constraints with Symmetry breaking with Lazy constraints multithreading 100 7 10 1 01 0 01 Y A d gt Ed ES S ES SS a a AS S S gt gt C e A 5 Figure 6 6 Solving time of second instance set in real time Baseline ILP ILP with Lazy constraints ILP with Symmetry breaking 100 CPU time s 01 54 S of S gt e gt S E E E Figure 6 7 Solving time of second instance set in CPU time 39 6 6 SYMMETRY BREAKING EXPERIMENTS 9776 6 06 80670 G61 G 698 0 CAES 046 SEN 266706 66006 Scr vc 67c 6 666 066711 9989 OT 92ugjsut
24. application models That way we can compare our approach with others in terms of performance We have found that the size of the SDF graph expressed by the number of nodes actors and edges channels do not have a significant effect on the problem s in stance complexity in terms of performance In many cases it seems like the rates are more important because it determines the length of the schedule value of Tmar which has a larger impact on solving time Rates corresponding to the size of the HSDF are listed in table 6 1 With more actors present in the HSDF the periodic phase of the schedule consists of more actors firing This results in a longer period and longer constraints present in the model 31 32 CHAPTER 6 EXPERIMENTAL SETUP For example instance of the H263 decoder which look small at first shown in figure 6 1 is more complicated to solve using the ILP formulation than instance of the modem in figure 6 2 1 n 12376 2376 1 N 9 a by Ba gt n 9 gt EN M e a ae oS 9 a a 2 Figure 6 1 H263 Decoder with buffer Table 6 1 Size of the real application instances used for benchmarking Number of actors Number of channels Instance name SDF HSDF SDF HSDF Satellite receiver 22 4515 74 18723 Sample rate converter 6 612 16 2654 MP3 decoder 13 13 S 37 Modem 16 48 54 170 H 263 decoder 4 4754 10 19010 MP3 decoder fully heterogenous 4 47
25. ation per unit of time For example a video decoder would usually have a minimal throughput constraint of 30 or 60 frames per second FPS To satisfy throughput we need to consider the hard real time constraints Under best case violation of these constrains it will get you a lower frame rate on your TV but real time applications started to affect our life even more than ever before These applications are used in the automotive industry for keeping a car between driving lines and other safety features While the car still has breaks aircrafts do not while in the air a violation of the real time constraints could have catastrophic consequences We also need to consider the military usage of real time streaming applications e g in unmanned aerial vehicle UAV drones where camera feeds are transported across huge distances through satellite connections This means there is a limited bandwidth and a necessity to use a video encoder and decoder These constraints are largely related to the application The problem in this thesis is to synthesize an optimal mapping of application tasks to processor cores and derive a static execution schedule We are considering real time streaming applications on a heterogeneous multiprocessor system on chip MPSoC platform while making sure that we satisfy the real time constraints For this purpose we are using an integer linear programming ILP formulation In terms of finding an optimal schedule we also need
26. ation starts executing Initial token is shown as dot on the edge in a graph Figure 2 3 Graph containing self edge at actors a and a The self edge works in a way where the actor a takes 1 token the initial token to start firing While it is executing it cannot start executing again because there are not enough tokens at each input channel Once the execution stops the actor produces another token on the self edge and is ready to fire again The actor a also produces one token through channel al a To fire as actor needs to fire at least twice as seen from figure 2 3 Initial tokens can also be combined with a back edge to model a finite buffer between actors Figure 2 4 displays how it is possible to model a finite FIFO buffer introduced in figure 2 2 The application starts with initial tokens on the back edge aa 95 Right before a starts firing it will consume one token from the back edge which means it is reserving space in the FIFO memory this ensures there is space 2 2 HOMOGENEOUS SDF MODEL 5 for the output to be placed in the buffer Once actor az starts firing it will consume tokens on the channel az After the firing has finished it produces the same amount of tokens it consumed for the firing on the back edge ag a which means it frees the buffer memory for the amount of data used for firing 1 1 1 5 yt Y Figure 2 4 Graph shows how is possible model finite buffer from fig
27. ause we do not expect for it to be violated very often The equation forbids multiple executions of actor in time t For example where all the actors have self edge k with consumption rate equal or lower to production rate and initial tokens greater than those rates lt and Cj Oz We need to mention that equations 5 2 5 5 were selected because they need to be linearized This means we need to have 4 constraints in the model for each of these three constraints according the linearization proposed in section 4 3 Y Aig Eilt dis Viel te 0 Ta 5 2 jeJ Sl E lt 1 Wed ze P 5 3 1 l The results of using 5 2 and 5 3 were not significant so we have moved our focus onto the second candidates Equations 5 4 and 5 5 were considered again because they contain variable t This variable has a huge effect on prolonging the equation We are assuming this equation wont be violated before the schedule is nearly complete because it strongly relates to the period which is at the end of the schedule Tmax W aa m D W t 2 start t m y Aij dis Wel 5 4 t 0 jeJ t W t Silta E t2 Vi I tE 0 Tmar 5 5 t2 0 Experiments were performed multiple times with those equations Firstly we tried adding only one equation as the lazy constraint There were performance improve ments but nothing that proved significant When we combined two lazy constraints equations it resulted in a s
28. but just the necessary usage of variables In addition they only use small instances of problems but are able to map and schedule multiple applications running simultaneously to satisfy the real time constraints on each of them 41 42 CHAPTER 7 RELATED WORKS A way to find energy efficient mapping and schedule is proposed in 14 How ever they use heuristics for both mapping and scheduling They also only consider homogeneous systems This is the same system architecture that we consider but we are more general with the architecture They consider two different power states for chips and memories They try to find a minimal energy consuming schedule while satisfying the throughput requirement It was done by slowing down the processors or by changing the size of the memories These calculations are done with usage of a ILP model The downside to their approach was that they were not able to ensure optimality of the schedule or the mapping because they used heuristics 6 deals with scheduling and mapping in a different manner They use time division multiple access TDMA to schedule the actors and build a static schedule from it They use CP for the scheduling Every time a partial schedule is found the maximum cycle mean MCM algorithm runs to check if this solution is still feasible Their objective function is to maximize processors utilization subject to minimum average throughput constraints This means they are targeting at differen
29. col 1 col 1 then 20 Symli li2 add j1 jad 21 Sym is 1 add X 1 7 end 22 Jatt end 23 i end 24 194 25 21 end return Symmetries Algorithm 1 Finding symmetries We propose an algorithm which is able to find all symmetries based on the input containing WCET of each actor on each processor As mentioned before the input 5 3 SYMMETRY BREAKING 25 can contain values equal to 0 i e dij 0 Algorithm 1 shows the basic implementation in pseudocode on how it is possible to find symmetries Becouse the size and complexity of the problem is based on building the schedule the algorithm which finds the symmetries in matrix d can be simple in terms of complexity Because we do not expect instances with a huge number of nodes or processors this algorithm will run only once before CPLEX begins its solving phase Our algorithm can find symmetries only when the WCET of actors involving symmetry mapping is exactly the same This is because we cannot be sure that the symmetrical mapping with different WCET will not affect the schedule Symmetry matrix for WCET matrix d introduced in figure 5 2 can result after using our algorithm 1 into a matrix which look likes the matrix Sym in figure 5 4 As you can see regarding algorithm 1 the Sym matrix containing indices of actors SYMi ip vector pair which means what processors are symmetrical to the actor respectively For example Syma3
30. ctor Wi t t Wilt Silto Viel te 0 Tmar 4 6 t2 0 Since the periodic phase can be repeated infinitely we are interested in the schedule of the transient phase and the schedule for one period therefore we can introduce equation 4 7 which makes sure there is only one start of periodic phase start t in the whole schedule Tmax start t 1 4 7 Objective function 4 8 is minimizing the period A period starts at time t 1 if and only if start t 1 Minimizing the period has the same effect as maximizing the throughput due to inverse function Throughput Plu Tmax minimize Period Tar ba start t 4 8 t 0 4 2 BASELINE ILP IMPROVEMENT 15 4 2 Baseline ILP improvement Part of this thesis is also an extension of the baseline ILP by equations introduced in this section The objective function has been bounded by two equations 4 9 and 4 11 Those equations dramatically decrease searching space of the solver which result in a greater solver performance We are able to calculate the period lower bound Period and period upper bound before the solver begins because we use only input variables for the calculation i e equations 4 10 and 4 12 are not part of the ILP model and ther results can be considered as input variables Period P is determined with the assumption that we have only one processor hence every actor needs to run sequentially and we assume the worst WCE
31. e mi eai euet uds 5 3 1 Algorithm to find symmetries 5 3 2 Symmetry breaking using lazy constraints callback 5 4 CPLEX parameters affecting performance 6 Experimental Setup 6 1 Benchmark instances 6 2 Hardware A he Sent doe va ne tst ter d cusa d 6 3 Baseline experiments using OPL 6 4 Baseline ILP 6 5 Baseline ILP with lazy 6 6 Symmetry breaking experiments 2 2222 a 7 Related works 8 Conclusion and Future work 1 Ww 11 11 15 15 17 17 17 19 20 22 23 25 29 31 31 34 34 34 35 36 41 43 xii CONTENTS Bibliography 45 A Graphs used for benchmarking 47 B Nomenclature 65 C Content of attached CD 67 List of Figures Zl erapbi da e a a 2 2 Example of instance which has finite memory buffer between actors 2 3 Graph containing self edge at actors aj and az 2 4 Graph shows how is possible model finite buffer from figure 2 2 2 5 Schedule showing transient and periodic phase 2 6 Multiprocessor system 4 1 SDF graph with indexes introduced in equation 4 2 4 2 Schedule showing variables used by equation 4 5 5 1 Symetrical schedules 5 2 Matrix of WCET wi
32. equations with multiplications of Aj and start t needed to be linearized Linearization of the ILP is described in section 4 3 In order to use this formulation we considered in section 2 3 we need to ensure the actor is mapped only on one processor j This is what equation 4 1 ensures gt Ajj 1 Viel 4 1 jeJ 11 12 CHAPTER 4 BASELINE ILP Table 4 1 Input variables Variable Variable meaning name I Set of actors J Set of processors K Set of channels between actors which set the precedence con straints containing P Number of produced tokens by the source actor on channel kek Number of consumed tokens by the destination actor on channel k K Oy Number of the initials tokens on channel k dij Worst case execution time for actor i I on Processor j J Imaz Time estimation of the transient phase and the length of one pe riod in the schedule Ni Repetition vector for actor i I Period P Period upper bound Period Period lower bound Based on the SDF model we are using introduced in section 2 1 we need to ensure that the target actor 7 can not be executed until it has all the data provided by the predecessor ts The indices in example SDF graph can be viewed in figure 4 1 In other words equation 4 2 ensure that tokens produced by channel source actor a until time plus initial tokens of channel k must be greater than all tokens consumed by the targe
33. er 5 In Chapter 6 we introduce an experimental setup with benchmarks which came from the proposed algorithms This chapter contains tables and charts that can easily be compared to the baseline ILP We are introducing related works which has been done in the field of mapping and scheduling for real time streaming applications in Chapter 7 The conclusion is described in Chapter 8 where we summarize what has been achieved in the thesis as well as future work suggestions Chapter 2 Background In this section we discuss the application model used as an input to our algorithms system architecture and the close relation between mapping and scheduling The application must satisfy real time requirements and between those require ments they must satisfy the throughput Throughput is basically a rate in which real time streaming applications can produce data for the output Throughput needs to be satisfied e g in order to produce a sufficient frame rate for a video streaming application 2 1 Synchronous Dataflow Model Synchronous dataflow models are used to describe real time applications A SDF model can be easily represented by a graph SDF graphs consists of nodes called actors a and directed edges called channels which represent the connections between actors where the data is transported from one actor to another The data is usually represented as tokens a batch of data with an exact size Tokens are transported through the channe
34. eriments used server Kepler based in the CTU in Prague Faculty of Elec tical Engineering and a Mac Book Pro laptop Specifications for both devices can be found in table 6 2 Table 6 2 Specification of hardware used for experiments Kepler server CPU Intel Xeon E5 2620 v2 2 1GHz Turbo Boost up to 2 6GHz 6 cores RAM 64 GB Laptop Mac Book Pro CPU Intel Core i5 2 4GHz Turbo Boost up to 3 1 GHz 4 cores RAM 8GB Experiments introduced in sections 6 3 6 4 and 6 5 were performed on the Kepler server After series of experiments the Kepler server had been allocated to different research and we tested the symmetries on a Mac Book Pro laptop Instances that used the Kepler server were unsolvable for the laptop in a reasonable time and we had to use another set of instances This was not an issue because the instances used on Kepler server did not contain any symmetries For the symmetry breaking algorithm a new set of instances were generated by the SDF and adapted to contain a different number of symmetries 6 3 Baseline experiments using IBM OPL Studio OPL Studio was used for its relatively fast implementation It served as our baseline benchmark which we used later on to compare our experiments Table 6 3 shows the results of experiments executed on the Kepler server while table 6 1 offers an overview of the complexity of the problem It seems like the size of the HSDF does matter in t
35. erms of problem complexity Size of the HSDF is related to the production and consumption rates of the channels The considered platform for instances listed in table 6 1 was partially heterogenous which means actors were able to fire on multiple processors not all of the processors were considered due to their performance issues The input matrix d for MP3 decoder instance can be seen in figure 6 3 with a different WCET All experiments using instances from table 6 1 were executed on the Kepler server specified in section 6 2 6 4 Baseline Java experiments To understand what the effect is of rewriting the same algorithm from the IBM OPL Studio to Java using CPLEX Java API we prepared the exactly same set of benchmarks for a Java implementation The results can be seen in table 6 3 6 5 BASELINE ILP WITH LAZY CONSTRAINTS 35 0 19 0 0 0 32 0 0 37 0 21 0 0 20 0 19 0 37 0 0 32 0 0 21 0 0 83 0 0 0 0 8 0 48 0 0 0 0 132 0 0 0 132 0 0 145 0 0 0 0 0 0 0 0 0 89 0 0 0 0 30 88 0 0 132 0 0 132 0 0 0 0 0 80 83 80 87 d 0 0 6 0 68 0 0 0 69 0 0 0 0 0 0 68 0 0 0 69 0 0 0 0 D 6 0 0 0 0 1 1 0 1 0 P 0 2 034 0 42 0 0 0 0 0 0 4 0 0 9 0 0 7T 0 O 12 0 5 0 0 0 0 0 36 0 0 0 0 42 0 0 0 0 5 4 1 0 1 0 0 1 1 0 0 2 0 0 Figure 6 3 Input of partially heterogenous instance instance of MP3 decoder Table 6 3 Results of experiments using Kepler server for calculation IBM OPL Studio Java CPLEX API Lazy constraints Ins
36. hing from a common ILP model to a model which uses the lazy constraint callback functions is easier than using the callback described in section 5 1 2 Because the CPLEX cares about the lifecycle of the lazy constraints we are freed from im plementing whether or not the constraints need to be added into the model CPLEX manages those decisions for us Constraints can be added into ILP model using func tion addLe addGe and addEq for operations gt and respectively Code showing how equation 5 1 can be inserted into a model is shown in Code 5 1 This can easily be converted and the CPLEX can use a lazy constraint for the same eguation as it is shown in Code 5 2 left expr lt right expr 5 1 cplex addLe left expr right expr eguation 1 DE k N o A Code 5 1 Add constraint to CPLEX model cplex addLazyConstraint IloRange cplex le left expr right expr equation 1 Code 5 2 Add lazy constraint to CPLEX lazy constraints pool Methods whose only function is adding lazy constraints has many pros The most important part is that the CPLEX can handle the lifecycle of the constraints by itself It usually is more effective than we would be doing it manually Mainly because the CPLEX can take the lazy constraints from the active model and put it back into the lazy constraints pool On the other hand it cannot decide whether it should add the constraints during certain ci
37. is would have a greater use in any industry by prolonging the battery life and also during an energetic crisis which the world is currently in Power efficiency is becoming an important design phase a is why heterogenous architectures are becoming so popular Bibliography 1 2 3 4 5 6 7 8 9 M Geilen Reduction techniques for synchronous dataflow graphs In Design Automation Conference 2009 DAC 09 46th ACM IEEE pages 911 916 July 2009 N K nny and S F T th A cutting plane method for solving harvest schedul ing models with area restrictions European Journal of Operational Research 228 1 236 248 2013 J Lin A Gerstlauer and B L Evans Communication aware heterogeneous multiprocessor mapping for real time streaming systems Journal of Signal Pro cessing Systems 69 3 279 291 2012 J Lin A Srivatsa A Gerstlauer and B L Evans Heterogeneous multipro cessor mapping for real time streaming systems In Acoustics Speech and Signal Processing ICASSP 2011 IEEE International Conference on pages 1605 1608 IEEE 2011 M Lukasiewycz and 5 Chakraborty Concurrent architecture and schedule op timization of time triggered automotive systems In Proceedings of the eighth IEEE ACM IFIP international conference on Hardware software codesign and system synthesis pages 383 392 ACM 2012 O Moreira F Valente and M Bekooij Scheduling multiple independent hard real
38. l jeJ te 0 Tmar 4 24 lt M 1 start t W t Viel jeJ te 0Tmar 4 25 Chapter 5 Lazy constraints Lazy constraints are constraints which can be added to ILP models during its solving process It is useful when the model consists of a large number of constraints and it is not plausible for those constraints to be violated The main motivation for using lazy constraints is to increase the performance in terms of solving time But do not get mistaken because you can increase the performance by using only one lazy constraint Also the solver can be slowed down if you use one badly chosen constraint This is because the overhead for checking violations of these constraints and the process of adding the constraints to the model during the runtime can be much higher than having the constraints in the model from the very beginning Lazy constraints can be implemented in two ways by using IBM CPLEX Java application interface API Both choices offers their pros and cons which we discuss further in this chapter in sections 5 1 5 4 5 1 Introduction to lazy constraints In this section we introduce lazy constraints usage in CPLEX Java API We decided to use Java API because of its multi platform usage We will take a look into the lazy constraints function in subsection 5 1 1 and lazy callback class in subsection 5 1 2 We consider both approaches because we would like to compare the difference between having full control over lazy con
39. l in FIFO order al al 1 2 Figure 2 1 Simple graph Each node is denoted by a worst case execution time WCET which is usually placed inside the node WCET represents the execution time in the worst case scenario i e actors cannot execute slower than WCET The channel is the connection between two actors A channel starts in source actor a and is directed to target actor A channel is denoted as production Pas a and consumption rate Ca a next to the source 4 CHAPTER 2 BACKGROUND and target actors respectively We should also introduce terms of firing which means firstly the actor atomically consumes a number of tokens from each incoming channel the number of tokens consumed is equal to the consumption rate on the particular incoming channel After the actor executes for its WCET and atomically produces a number of tokens which are equal to the production rate of each outgoing channel For example in order to fire actor az from figure 2 1 we need to fire the actor a at least twice This would ensure we have enough tokens for actor a for it to be ready to fire FIFO Finite memory with capacity 5 tokens 7 Figure 2 2 Example of instance which has finite memory buffer between actors A self edge with one initial token from actor aa to az ensures the actor will only fire once in a given moment can be seen in figure 2 3 The initial token is the amount of data present on the channel even before the applic
40. le to be solved in Java without usage of lazy constraints We assume the Java API can approach the branching and cutting strategies differently therefore the performance can vary in different instances 6 6 Symmetry breaking experiments In this section we introduce a new set of experiments which are tested in several ways The main purpose was to have baseline results which we can use and compare to our results for the symmetry breaking algorithm All the instances were generated by the SDF tool 6 6 SYMMETRY BREAKING EXPERIMENTS 37 Studio Java CPLEX API Lazy constraints 1000 100 X B 10 1 01 NT Q A L L K 59 a S w S 3 ES e SU amp 4 SS S gy e S Figure 6 5 Solving time of first instance set in real time Table 6 4 contains the results from our various experiments In the prepared set of instances we found that the lazy constraints method approach can increase the solver performance but only in terms of CPU time Also the experiments show that instances Modem Artificial instance 4 and Artificial instance 10 which does not contain any symmetries are slowed down by the symmetry breaking algorithm because the algorithm does not reduce the searching space The algorithm executes every time the solver finds an integer solution On the other hand instances Artificial instance 2 and Artificial instance 8 recorded CPU time perform
41. lightly im proved performance Based on the empirical results our theory behind that is that when we add one constraints e g 5 4 the other constraint 5 5 that uses variable W t is still present in the model This shows that the ILP needs both bounded equations to be present in the model 22 CHAPTER 5 LAZY CONSTRAINTS By combining three or more equations with different variables we recorded a lower performance We decided to add one more equation which will be strongly related to the objective function In expectation that the performance will increase Equation 5 6 is related to the objective function because it limits only one periodic phase start in hopes this relation can improve the performance We thought that the performance would not increase by adding constraints that would be violated in most cases Such as in equation 4 1 which ensures every actor will be mapped on one processor and equation 4 2 which ensures a correct firing of the actors regarding to the precedence constraints of the application model This was experimentally proven Tmax Y start t 1 5 6 The speedup of adding equation 5 6 as a lazy constraint with previously present lazy constraints 5 4 and 5 5 was significant An important thing to point out is that while CPLEX uses lazy constraints callback CPLEX computes only on one core by default We can force CPLEX to run on multiple cores by switching one CPLEX parameter This is introduced in
42. ncreases the performance only when a large number of symmetries are present in the input We assume this is because the over head of the lazy constraints callback and lack of multithreading Because the CPLEX needs to pause solving and wait until lazy constraints callback is finished every time the callback is invoked We had proven that fully heterogenous instance of MP3 decoder can be solved more effectively with the usage of lazy constraints We have accomplished a speedup with a factor of more than 7 On the other hand the symmetry breaking algorithm did not have a significant speedup We assume this was caused by the overhead of the lazy constraints callback together with the inability to use multiple threads effectively We have partially solved the problem in some cases but it still remains open for a large number of instances since we have not found solution for performance issues in general We have proven the problem is not trivial for real applications and it is not largely scalable for now Also the provided solution does not cover real architecture because it is missing essential components e g interprocessor communication and finite memories Hence we would suggest to continuing with further performance 43 44 CHAPTER 8 CONCLUSION AND FUTURE WORK upgrades We think it would be worth trying a heuristic brute force algorithm for mapping Once we have the mapping decided the ILP formulation can be simplified for equations which
43. needs to be linearized It is important to find a way to determine the mapping and static schedule for applications with communication such as in network on chip NoC Real systems have to communicate between processors which need to be scheduled This was done in 3 but since it is an extension of the ILP proposed in this thesis it would most probably suffer from the same performance issues as the baseline ILP As we have proven not all the real application models provided by the SDF would execute in a reasonable time using the baseline ILP formulation e g H 264 decoder with a buffer is too large for modern computers and the calculation take up a lot of memory and time Also the mapping of data to memories used in the architecture is an important extension of this Real systems have different types of memories with different sizes and access types Exploring this can enable applications to execute more efficiently Decisions about this are not trivial and does not necessary means if it fits in local memory use it as data storage The problem is that there is a lot of data sitting in a local memory wasting space before it is actually needed Things that are worth mentioning is that finding mapping and schedule in such a way that we minimize the energy consumption on the model with multiple power modes of processors and find out how those units need to be configured to ensure the real time constraints and minimize energy consumption Th
44. or int j2 0 j2 lt Aljs length j2 5 10 AT Aljs 32 1 11 12 12 13 14 if sMap j 15 breakCrossSymmetries A i j js sMap 16 else 17 breakColumnSymmetries A i j js sMap 18 19 20 21 22 24 2 void breakCrossSymmetries A i js sMap 5 27 Expression currentMappingEx new Expression 28 int c 0 29 int T A 30 T il Sym i js 0 1 0 T il Symlil js 1 0 32 T js L Sym i js 0 0 T js Sym il js 0 1 0 35 for int a 0 T length a 36 for int b 0 T al length b 5 I 37 if T a b 1 38 currentMappingEx addTerm 1 Ala 1 b 1 39 10 41 42 Expression breakSymm new Expression breakSymm addTerm 1 A i 1 Sym il js 0 1 1 15 breakSymm addTerm 1 A i 1 Symli js 1 1 1 o breakSymm addTerm 1 A js 1 Sym i js 0 1 breakSymm addTerm 1 A js 1 Symlil js 11 1 CHAPTER 5 LAZY CONSTRAINTS breakSymm addTerm 1 A i 1 j 1 1 model addEqutionLe currentMappingEx breakSymm A length 1 Expression breakSymm2 new Expression breakSymm2 addTerm 1 1 1 Symli js 0 1 1 breakSymm2 addTerm 1 A i 1 Symlil js 1 1 1 breakSymm2 addTerm 1 A js 1 Symlil js 0 1 1 breakSymm2 addTerm 1 A js 1 Symlilljs 1 1 1 breakSymm addTerm 1 A js 1 sMap 1 1 model addEqutionLe currentMappingEx
45. ow to use this information to remove symmetries from the solution 5 3 1 Algorithm to find symmetries Algorithm 1 takes input d which consists of the WCET of all of the actors assigned to a single processor In every step it compares two rows to find the symmetries 1000 0100 2 1000 all 0001 0001 0010 Figure 5 3 Symmetrical mapping 24 CHAPTER 5 LAZY CONSTRAINTS by freezing one row row of d matrix and compares the values with the other row row It is comparing two columns of these rows col and col If all those values in the selected rows and columns match the information that processors have the same WCET for those actors is recorded to symmetry matrix Sym at index SyM ow row and also diagonally symmetrical SyMpow2 row While we read the symmetrical matrix and Sym is to mean there symmetries between actors i and ig on the processors listed in values recorded to the matrix Sym at index i i5 1 Input di 2 Output Set of Symmetries S 3 21 0 4 for i d length do 51 row dfi 6 Ig 1 1 7 for i9 lt d length do 8 row d ta 9 J 0 10 for j lt row length do 11 col row ji row j1 12 if col 0 0 OR col 1 0 then 13 continue end 14 jo it 15 for ja lt row length do 16 col row ja row ja 17 if col 0 0 OR 0 then 18 continue end 19 if col 0 col 0
46. periodic phase 2 4 MAPPING AND SCHEDULING RELATION 7 General purpose Graphic processor processor Network processor DSP Figure 2 6 Multiprocessor system Mapping and scheduling are strongly connected one to another There can be a big difference in schedule if an actor is mapped to a different processor due to the possible difference in WCET or by overwhelming the processor by actors while other processors are free A schedule consists of a transient and a periodic phase as displayed in figure 2 5 While the transient phase is executed only once after the application has started the periodic phase is periodically executed to generate the output Inversion of the length of periodic phase Fa is called throughput and it represents the frequency the output generates by the application To generate an output an application usually needs to run through multiple states Each application state is also represented in the graph where channels contain different numbers of tokens in each state A period is the amount of time in which the application runs and data is generated on output ofthe application After the beginning of the periodic phase the application returns to the initial state of the period in terms that channels are holding the exact same amount of tokens and actors are in the same phase of execution period can start during firing of actor Repetition vector has been briefly int
47. pping for real time streaming systems Acoustics Speech and Signal Processing ICASSP 2011 IEEE International Conference on pp 1605 1608 22 27 May 2011 2 Stuijk Sander Marc Geilen and Twan Basten SDF3 SDF For Free Sixth International Conference on Application of Concurrency to System Design 2006 3 J Lin A Gerstlauer and B L Evans Communication aware heterogeneous multiprocessor mapping for real time steaming systems Journal of Signal Processing Systems Vol 69 no 3 2012 Diploma Thesis Supervisor Dr Benny Akesson MSc Valid until the end of the summer semester of academic year 2015 2016 g Ji ra CSc He d of Department prof Ing Pavel Ripka CSc Dean Prague March 24 2015 11 Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science and Engineering Master s Thesis Algorithms for Mapping and Scheduling Real Time Streaming Applications on Heterogeneous Multi core Platforms Bc Jakub Kopriva Supervisor Dr Benny Akesson MSc Study Programme Open Informatics Field of Study Software Engineering May 10 2015 iv Aknowledgements I would like to thank both my consultant Ing Premysl S cha Ph D and my super visor MSc Benny Akesson They were inspiring me and leading me through the master thesis Also I can not forget to thank all OF my family for supporting me during the studies vi vi Declaration
48. processor proc The schedule is divided into multiple time units Because we are using multiple types of processors we are proposing to use units which are universal in respect to the world clock e g nanoseconds We need to keep in mind that actors firing is nonpreemptive and channels with production rate P and consumption rate C constraints in our model are in terms of precedences As mentioned in section 2 1 minimizing the length of the peri odic phase is the same as maximizing the throughput because of inverse equation Throughput pzz is true Because the mapping determines the length of the schedule we are facing a com binatorial explosion For each mapping the schedule can be different and unique We are looking for an optimal mapping If we were to use brute force method on this problem we would get a total of J possible mapping decisions For example if we have 5 actors and 4 processors we would get total of 4 1024 possible mappings with possibility for 1024 different schedules with 1024 different period lengths Unfortunately mapping decisions are not the only problem Building the schedule is also difficult because of the constraints which are determined by consumption and production rates and the cyclic nature of the schedule and graph states 10 CHAPTER 3 PROBLEM FORMULATION Chapter 4 Baseline ILP formulation described in section 4 1 was introduced in 4 The formulation was not complete and it needed
49. r j e g A 1 then 16 CHAPTER 4 BASELINE ILP auxiliary variable X t E t d This means that the sum introduced in 4 17 is the same as the sum as in the original equation 4 3 Xij t Aij M Viel jeJ te 0 Tmar 4 14 Xi t gt M 1 E t di Viel j J te 0Tmar 415 Xo LM ARES MEL ges BE TV 051 S t Y X Viel 0 4 17 Another equation which needs to be linearized is equation 4 4 This is accom plished by constraints 4 18 4 21 The sum introduced in 4 21 counts the difference between and Ej if and only if the binary variable is equal to 1 ie if A 1 then Y t S t E t The original equation is again the same as 4 21 Y E lt Aj Viel j J tel 0Tmar 418 Y t Si t 3 Viel jeJ 0 7 4 19 Yo 2 Aig Si 80 1 Viel FE OT naz 420 lt 1 Viel jeJ 0 7 421 1 l Last but not least the equation which needed to be linearized is the constraint 4 5 Linearization of this equation is performed in the same way as in the examples before We are using the auxiliary variable Z which is filled by condition if start t 1 then Z t W t which can be seen in equations 4 22 4 25 where 4 22 represents the original equation W Tmaz Z t teT Ni NA dij Mel jeJ te 0 Tmar 4 22 jEJ lt start t M Viel jeJ te 0 Tmax 4 23 Z t gt M 1 start t Wt Vie
50. r j while gt 1 For example two mappings which can be symmetrical are shown in figure 5 3 The difference is that we have flipped the mapping of actors with different processors of the same type with exactly the same WCET as before In terms of scheduling there is practically no difference 5 3 SYMMETRY BREAKING 23 a a P a a time time Figure 5 1 Symetrical schedules ho ho Figure 5 2 Matrix of WCET with symmetries The schedule for A will most likely be the same as the schedule for A The solver can determine the execution of actors sequences with little difference and not violate the constraints while the length of the period remains the same This is because the model does not distinguish the processors in any other way than by execution times for particular actors This is also because communication between processors is not considered in this particular model Otherwise a mapping could result in a longer communication overhead while the other would not Since the symmetries are based on the WCET of actors assigned to specific pro cessors all of the symmetries can be determined from the input variables To do this we propose an algorithm which finds all the symmetries in section 5 3 1 Later in subsection 5 3 2 we discuss h
51. rcumstances e g based on the value of particular decision variables or a heuristic algorithm which uses the values of decision variables 5 1 INTRODUCTION TO LAZY CONSTRAINTS 19 5 1 2 Lazy constraints callback In order to use a more complex algorithm which will determine whether to add lazy constraints in the active model or not we need to use lazy constraints callback A callback is an abstract class in CPLEX Java API that can be extended Its main function can be overridden by our custom implementation This main function will run every time the integer solution is found Because we have access to all parameters input and decision variables we are able to decide whether or not to add the lazy constraints to the active model based on the values of the partial solution Adding constraints to the active model can effect the solution and its feasibility That means we need to be very careful when adding the constraints and our decision needs to be thoroughly justified Implementation of the lazy constraints callback is done by telling the CPLEX solver what class the CPLEX should consider when executing The CPLEX decides what particular class should be used as the lazy constraints callback by its classtype The best practice for lazy constraints callback is to have the implementing class of the lazy constraints callback as an inner class in Java Because you have access to the decision variables from the callback you do not have to pass these
52. roduced in section 2 1 A vector consists of the number of executions needed for each actor during one period A repetition vector expresses how many times an actor needs to fire to ensure the application will generate data for the applications output CHAPTER 2 BACKGROUND Chapter 3 Problem Formulation We have been given a SDF model to work with The model was briefly described in section 2 1 We also assume that analysis in terms of calculating the correct repetition vector was performed The repetition vector is also briefly described in section 2 1 and 2 4 With the SDF model given the matrix d consists of WCET for actors per possible mapping decision d WCET Matrix d relates to considered architectures If dij dij Vi E I ovre deg mand 1 2 m where n stands for total number of actors and m stands for number of resources processors cores we are considering fully or partially heterogenous architecture Our goal is to construct an optimal schedule as to minimize the length of the peri odic phase introduced in section 2 1 To do so we also need to determine mapping for the actors to processors The relation between mapping and scheduling was described in section 2 4 Output for our algorithm will be a static schedule with the optimal criterion for the length of the period In this schedule the actors will be statically mapped to particular processors In this thesis the variable A 1 means actor a is mapped to
53. straints compared to leaving the responsibility up to the framework itself Also the symmetry breaking algorithm introduced in section 5 3 cannot be achieved by the method proposed in subsection 5 1 1 5 1 1 Lazy constraints function Firstly we would like to show how CPLEX can handle the lifecycle of those con straints We only denote which constraints can be handled as lazy constraints and let CPLEX decide whether or not to add those constraints in the active model Con straints denoted as lazy are placed in a pool of lazy constraints and are pulled out only if they are violated in a specific last found integer solution The way CPLEX decides if the constraints are violated or not is simple After CPLEX finds an integer 17 k 18 CHAPTER 5 LAZY CONSTRAINTS solution it also checks if the lazy constraints placed in the pool have been violated or not by instating variables used in the constraint This can be done because the CPLEX has the values of all of the variables decision and input by the time it finds an integer solution After the lazy constraint is pulled from the pool it is placed in the active model and the model continues to calculate Of course the integer solution where the constraints have been violated is infeasible The solver can also add the lazy constraints to the active model and place it back into the lazy constraints pool if the constraints have not been used for bounding the objective function for a while Switc
54. t actor a until time t Cy S t Pe Ei Ox vt 0 Tmas 4 2 a a lt Figure 4 1 SDF graph with indexes introduced in equation 4 2 While we need to count the number of started executions 5 of actor i and the number of ended executions we are introducing equation 4 3 which anticipates that the firing of actors is not preemptive Actor i has to start firing in time when 4 1 ILP FORMULATION 13 Table 4 2 Decision variables Variable Variable meaning name Silt Number of started firings of actor 7 I up to time t 0 Tmax E t Number of ended firings of actor i I up to time t 0 T max Ais Aj 1 Actor i I is mapped to processor j J Aij 0 Otherwise start t start t 1 Start of periodic phase is in time t 1 t 0 Tmax start t 0 Otherwise W t How much time is actor i J executing up to time t 0 T max it has ended firing decreased of its WCET i e ts te WCET Figure 4 2 shows how the variables 5 and E change based on time t in the schedule example Y Aig Eilt dis Vie I t 0 Tmar 4 3 jeJ 5 0 2 5 0 4 S 0 6 S 0 8 0 1 E 0 3 E 0 5 E 0 7 SI 0 1 S 0 3 S 0 5 S 0 7 S 0 8 E 0 0 E 0 2 E 0 4 E 0 6 E 0 8 a S 4 1 S 4 2 S 4 3 S 4 3 E 4 0 E 4 1 E 4 2 E 4 3
55. t goal while using a different optimization criterion A description of how the mapping and scheduling can be done using only an ILP formulation is presented in 4 This formulation is introduced in section 4 1 Since the formulation seems reasonable we decided to use it as our baseline ILP Unfortunately they still omit a few things in their formulation They do not propose any quantifiers and their formulation is non linear It is easy to linearize these equations and we show how it can be done in 4 3 for usage in an ILP framework An interesting part about this paper is that they try to minimize the linear combination of several real time constraints such as the length of the periodic phase latency and cost They propose using pareto fronts as to find all suitable solutions Each solution has another impact on the throughput processor cost function and latency That paper has an extension 3 published later on where the ILP formulation is extended with communication aware mapping This means that we have a reference for our experiments which we can compare with The baseline ILP served as starting point for our research Articles which explain the basic usage of lazy constraints for area harvesting prob lem is introduced in 2 Lazy constraints are meant to reduce the computation time of the solver which is related to our work We used this article as a starting point for the lazy constraints It briefly describes how the lazy constraints c
56. t survey that includes both what has been done in the past and what the modern trends are with a brief introduction to the selected problems The survey also considers a lot of works for non real time systems which are not completely relevant to our problem Different models for streaming applications are proposed in 5 They assume the application is a directed acyclic graph DAG This is similar to HSDF which are not allowed to have cycles HSDF is a special case of SDF graph where every actor fires only once per graph iteration period This is a restriction compared to what we consider For the computation they considered the mapping and scheduling part with communication over FlexRay bus between multiple applications For computation they decided to use an ILP The disadvantages of their approach compared to this thesis is that they firstly decide the mapping and then maked the schedule for the feasible mapping This does not guarantee optimality at first place Moreover they only assume homogenous platforms A constraint programming CP approach is explained in 8 They also propose an algorithm for conversion from SDF graph to Rate Homogeneous SDF where every actor fires once but not every rate is equal to one In practical use this is the same as using a HSDF They also consider the communication by adding two actors that send and receive messages between processing units On the other hand they do not describe the whole CP model
57. tance name Solving time s Solving time 8 Solving time 5 Satellite receiver 2247 14 X X Sample rate converter 212 66 X X MP3 decoder 53 62 54 78 25 26 Modem 0 38 0 58 0 54 H 263 decoder X X X MP3 decoder Fully heterogenous X 2408 322 Also to further prove our point that the lazy constraints can speed up the solving process we have created a new fully heterogenous instance of a MP3 decoder with an input showen in figure 6 4 This instance is used in experiments using Java API and in experiments using the lazy constraints for comparison Also it is essential to mention that the instance is man made This is because the benchmark instance did not provide us with realistic input for particular architecture We are not certain what caused a huge difference in some instances and no change in others We assume the solver core implementation in Java and IBM OPL Studio can vary and the solver would choose a different branching approach with result of different solving time 6 5 Baseline ILP with lazy constraints Our goal was to make the ILP faster which was achieved by adding the lazy con straints introduced in sections 5 1 1 and 5 1 2 We have introduced two results table 36 CHAPTER 6 EXPERIMENTAL SETUP 19 1 32 21 97 92 21 37 37 32 21 32 41 20 18 19 36 37 37 18 20 32 37 36 21 43 84 83 83 85 28 66 85 92 48 57 94 45 136 132 148 132 196 132 145 139 145 165 178 212 113 245 87 67 56 98 89 134 156 192 174 80 83 123 89 13
58. th symmetries 5 3 Symmetrical mapping ooa do a 5 4 Symmetries for matrix introduced on figure 52 5 5 Initial mapping and symmetries for the instance 6 1 H263 Decoder with buffer 6 2 Modem with buffer 6 3 Input of partially heterogenous instance instance of MP3 decoder 6 4 Input of fully heterogenous instance instance of decoder 6 5 Solving time of first instance set in real time 6 6 Solving time of second instance set in real time 6 7 Solving time of second instance set in CPU time A 1 Satellite receiver with buffer A 2 Sample rate converter with buffer decoder with 4 Modem with buffer Ad H 263 decoder with buffer 6 Artifical instance1 2 AT ARICA Astana ware a ty Artifical instahce O dual ku xe Ee e e E A 9 Artifical instance 4 Artifical instance 5 AT Artifical instance D 2 A 12 Artifical instance7 Se e At a Aca Artifical TS BB 5 402 rs ne e rui xiii xiv LIST OF FIGURES A 14 Artifical instance9 61 A319 Artifical instance Il
59. to be fixed Originally the ILP did not contain any quantifiers This ILP formulation does solve the given problem but based on the experiments it suffers from poor scalability The results of this ILP can be found in Chapter 6 This ILP experiments will serve as a baseline for comparison with proposed improvements To help the ILP with performance problems we have extended this formulation for equations introduced in section 4 2 which slightly improves the ILP in terms of performance and more accurate schedule This is because the ILP was originally targeting only the periodic phase but as mentioned in sections 2 1 and 2 4 the transient phase is also important We are introducing table 4 1 which contains a description of the input variables and table 4 2 containing all decision variables used in the ILP formulation 4 1 Formulation Formulation is trying to minimize the periodic phase in order to maximize the through put T hroughput where Period refers to overall length of the period in time units In order to determine the length of the period we need to build a schedule for particular mapping Mapping and scheduling is strongly related to each other therefore we need to determine the mapping and schedule together Implementation of the formulation was not given and we had to implement it into the IBM CPLEX framework Equations described in this section are nonlinear In order to use those constraints in the ILP framework
60. trical processor 72 line 10 While the actors i and js are not mapped to the same processor it adds the equations which forbids this symmetrical mapping by creating the lazy constraints found on line 15 If those actors are mapped to the same processor it forbids the mapping of both actors to the other processor by adding the constraints referred to on line 17 The function addTerm c var will add term gt var to the expression while function add EqutionLe Expression left Expression right will add the equation left lt right to the model The algorithm select one mapped actor as active The equation for symmetry breaking was created as a copy of the current mapping without the active actor and its symmetries For example we selected as active Its symmetrical actor mapping is 42 2 so the equation will be 3 Ay 4 Then we need to add a symmetrical mapping to our A and Ag with one of those actors Hence the final equation will be A12 A21 A22 A44 Or A12 A21 Ai This created equation must be lower or equal to the number of mapped actors decreased by 1 5 8 SYMMETRY BREAKING 27 void removeSymmetries A Sym 2 dnt X new int A length AL LO length for int i 0 1 lt A length i 4 for int j 0 j lt Alil length j 5 if A i lj 1 6 for int js i 1 js lt Symlil length js 7 if Sym il js null 4 8 sMap 1 9 f
61. ure 2 2 Schedule of application from the graph in figure 2 4 can be viewed in figure 2 5 Every schedule for real time application consists of a transient phase and a periodic phase The schedule for a transient phase can vary from the schedule of periodic phase because during the transient phase the application is preparing enough input data tokens on the channels for periodic execution Each time the periodic phase executes it generates an output for the entire application In this thesis we use the term of repetition vector It consist of the number of firings needed in the periodic phase for each actor The repetition vector for our example application in figure 2 4 would be n 2 1 2 2 Homogeneous SDF Model A homogeneous synchronous dataflow HSDF model is a special type of SDF model where every actor is executed only once during a period This means a repetition vector of HSDF model holds only items equal to 1 The repetition vector is discussed further in section 2 4 HSDF and SDF models of the same application is mutually convertible by an algorithm although the complexity of those algorithms can be very high Depending on the algorithm used and the graph complexity we can see the exponential increase of graph size in terms of the number of nodes and edges it is described more in 1 The article is also provides an algorithm which lowers the graph complexity of the SDF HSDF conversion This algorithm is part of the SDF tool
62. variables we decided to go with lazy constraints This is also because we expected some of the constraints to be rarely violated or not essential to the model from the beginning while the solver starts By using lazy constraints we expect the solving time will be lowered since fewer constraints will need to be verified and considered during the solving process and before the first integer solution is found Even if the integer solution is found it does not mean the lazy constraints were violated We only add them if it is violated to keep the active model as small as possible We have chosen candidate equations 5 2 5 5 for the lazy constraints based on the assumption that equations which are not linear and contain a multiplication of decision variables Decision variables used like this gluts solver the solving time as 5 2 ILP WITH LAZY CONSTRAINTS 21 it increase with the number of equations Also we assume that equations which are related to the time equations containing variable t are overwhelming the solver This is because t 0 Tmar and Tmar can be a large number It grows with each applications repetition vector and actors WCET This would exponentially increase the number of equations in the entire model Equation 5 2 was selected as the lazy constraint because it fills vector S on the basis of values contained in vector E i e it copies values from E to S only at proper time index t d Equation 5 3 was chosen bec
63. ve speed up time All the parameters are disabling cuts in CPLEX More about CPLEX parameters can be found in ILOG CPLEX 11 0 Parameters Reference Manual 12 or in ILOG CPLEX 11 0 User s Manual 11 The value of those parameters are set at 0 by the default This means the CPLEX will decide if cuts should be generated or not There is no explanation in the documentation about how these decisions are made Documentation says we quote Setting the value to 0 zero the default indicates that the attempt to generate cuts should continue only if it seems to be helping cplex setParam IloCplex IntParam DisjCuts 1 2 cplex setParam IloCplex IntParam FracCuts 1 3 cplex setParam IloCplex IntParam LiftProjCuts 1 1 cplex setParam IloCplex IntParam MCFCuts 1 5 cplex setParam IloCplex IntParam MIRCuts 1 6 cplex setParam IloCplex IntParam ZeroHalfCuts 1 Code 5 5 CPLEX cut parameters Lazy constraints callback multithreading Since the default implementation of CPLEX uses only one thread for computation the usage of lazy constraints callback can be slowed down because it does not use the full potential of the machine This can be affected by setting the number of threads which CPLEX should use as you can see in Code 5 6 This parameter is also useful when you need to use only a part 30 CHAPTER 5 LAZY CONSTRAINTS of the capacity of the machine We have also experimented using lazy constraints callb

Algorithms for Mapping and Scheduling Real-Time

Contents

Download Pdf Manuals

Related Search

Related Contents