Hades — Fast Hardware Synthesis Tools and a - ETH E

1. [Figure F.8: Wotan Microprocessor on XC6216 FPGA] G Resources on the Web. This section lists a few useful URLs pointing to pages with information about Trianus and Hades as well as RC boards. These URLs are believed to be current. However, since the World Wide Web is an ever-changing environment, we also list AltaVista queries (http altavista digital com) below the URLs, which should return the most recent pointer to these pages. More URLs can also be found in the Bibliography. http www cs inf ethz ch cs group wirth projects cad tools hades lola trianus: Page of the Institute for Computer Systems, ETH Zürich, on CAD tools for hardware design. From this page the Trianus/Hades system can be downloaded. http www inf ethz ch publications diss html host inf ethz ch dissertations: Page of the Department of Computer Science, ETH Zürich, listing available disser
2. [Figure 5.5: Mapping of NOT Gate] negations at their inputs and output in the left columns, and their transformed representation in the XC layout editor in the right columns. [Table 5.1: Binary Operators (Left) and Their Mapping (Right), with columns AND, OR, XOR] As an example, Program 5.2 lists the code to map an AND gate. Note that since we traversed the data structure depth-first, the expressions in the left and right subtrees have already been mapped and are in their correct form. This is important, as the handling of negations would not work otherwise. Similar algorithms are used to map OR and XOR gates. Figure 5.6 presents the algorithm in graphical form. D-Latch. The latch has to be represented with a multiplexer where the output is fed back to the zero input. In Lola, the transformation can be described as x := LATCH(enable, data) => x := MUX(enable, x, data). As an XC6200 logic cell cannot have a combinational feedback loop, an additional buffer node has to be inserted between the output of the multiplexer and the feedback input. We
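The latch-to-multiplexer transformation described above can be illustrated with a small Python model. This is only a sketch: the function names `mux`, `latch_step` and `run_latch` are hypothetical, and the argument order `mux(sel, in0, in1)` is an assumption chosen so that the fed-back output sits on the zero input, as the text states.

```python
def mux(sel, in0, in1):
    """2-to-1 multiplexer: returns in0 when sel is 0, in1 when sel is 1 (assumed order)."""
    return in1 if sel else in0


def latch_step(x, enable, data):
    """One evaluation of x := MUX(enable, x, data).

    The current output x is fed back to the zero input, so the value is
    held while enable is 0 and follows data while enable is 1.
    """
    return mux(enable, x, data)


def run_latch(inputs, x=0):
    """Apply a sequence of (enable, data) pairs and record the output after each.

    The explicit variable x plays the role of the buffer node that breaks
    the combinational feedback loop in the mapped circuit.
    """
    trace = []
    for enable, data in inputs:
        x = latch_step(x, enable, data)
        trace.append(x)
    return trace
```

Feeding the model the pairs (1,1), (0,0), (0,1), (1,0) shows the output following `data` only while `enable` is 1 and holding its value otherwise.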
3. [Figure 5.20: Spreading of the Wave] with the current weight are processed (10), neighbor multiplexers to the north and east of d, before the weight is increased (4). Eventually the east neighbor multiplexer of cell (2,2) will be processed and the destination coming from s will be found (6). Now backtracing can start and, since the algorithm stored the direction from which it was coming in the Lee map (8), the path back to the destination can be constructed. Figure 5.21 shows the resulting route. [Figure 5.21: Resulting Route] 5.6.5 Interactive Routing and Scripting. The user can influence the router in various ways. By setting or clearing flags, the user can determine which routing resources are to be used. For example, when routing small types which cover more than four cells but should not use the length-4 FastLANE resources, their use can be disallowed. Also, the user can let the router use the bounding box of the placed type or expand the type to the maximum possible size prior to routing. This expansion step is also accomplished by using a topological sort of the type hierarchy, but this time the outermost type is expanded first. Or rather a
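The wave expansion and backtracing steps can be sketched in Python as a plain Lee-style maze router. This is a simplified illustration, not the Hades router: it assumes a flat grid with unit cost per step and four neighbors, whereas the text describes weighted expansion over multiplexers on three hierarchical levels. The name `lee_route` and the grid encoding (0 = free, 1 = blocked) are hypothetical.

```python
from collections import deque


def lee_route(grid, src, dst):
    """Return a shortest path from src to dst as a list of (row, col), or None.

    came_from plays the role of the Lee map: for every reached cell it stores
    the cell the wave arrived from, so the route can be backtraced.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {src: None}
    frontier = deque([src])

    # Wave expansion: process cells in order of increasing distance.
    while frontier:
        cell = frontier.popleft()
        if cell == dst:
            break
        r, c = cell
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):  # north, east, south, west
            nr, nc = r + dr, c + dc
            nxt = (nr, nc)
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and nxt not in came_from:
                came_from[nxt] = cell
                frontier.append(nxt)

    if dst not in came_from:
        return None  # unrouted net

    # Backtracing: follow the stored directions from dst back to src.
    path = []
    cell = dst
    while cell is not None:
        path.append(cell)
        cell = came_from[cell]
    path.reverse()
    return path
```

A router with cost classes would replace the single FIFO queue with one queue per weight; the backtrace step stays the same.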
4. [Figure 6.3: Mapper Circuit with Placement Hints] [Figure 6.4: Comparator Schema] Program 6.3: Comparing Two 5-Bit Characters. TYPE Comparator Place CONST N 5 Log LOG 2 N 1 1 IN x y N BIT OUT eql BIT eql x y VAR e Log N BIT BEGIN Xnor gates FOR i 0 N 1 DO e 0 i 7 x i y i END e 1 0 e 0 0 first level of And gates FOR i 1 2 DO e 1 i e 0 2 i 1 e 0 2 1 END e 2 0 e 1 0 second level e 2 1 e 1 1 e 1 2 e 3 0 e 2 0 e 2 1 third level eql e 3 0 END Comparator cells in an area of 4x5 cells with 45% utilization. The initial placement is routable and has a delay of 12 ns.
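The structure of the comparator in Program 6.3 (one XNOR gate per bit, followed by a tree of AND gates) can be modelled in Python. This is a sketch under assumptions: the operators in the listing did not survive extraction, so the XNOR and AND roles are inferred from its comments ("Xnor gates", "first level of And gates"), and the helper names `xnor`, `compare` and `to_bits` are hypothetical.

```python
def xnor(a, b):
    """1 when both bits are equal, 0 otherwise."""
    return 1 - (a ^ b)


def compare(x_bits, y_bits):
    """Return 1 if the two equally long bit vectors are identical, else 0.

    Level 0 holds one XNOR result per bit. Each further level ANDs
    neighbouring pairs; with an odd number of values the first one is
    passed through unchanged, mirroring e.1.0 := e.0.0 in the listing.
    """
    level = [xnor(a, b) for a, b in zip(x_bits, y_bits)]
    while len(level) > 1:
        odd = len(level) % 2
        nxt = [level[0]] if odd else []
        for i in range(odd, len(level), 2):
            nxt.append(level[i] & level[i + 1])
        level = nxt
    return level[0]


def to_bits(value, n=5):
    """Little-endian list of the n low bits of value."""
    return [(value >> i) & 1 for i in range(n)]
```

For N = 5 the loop produces levels of 5, 3, 2 and 1 values, which matches the three AND levels named in the listing.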
5. [Figure 2.5: Mux Implementation of the AND and XOR Functions] [Table 2.1: Truth Table for AND and XOR Functions] 2 Field Programmable Gate Arrays. the array's perimeter. The longer connections are termed FastLANE connections. This hierarchy of routing resources provides shorter signal delays, as the delays scale logarithmically with the distance between cells instead of linearly. All signals are unidirectional, i.e., they have one source only. Tri-state buses must be emulated by using multiplexers. On the first level of the routing hierarchy are neighbor connections (Figure 2.6). The signal source of a neighbor connection can be either the F output of the function unit or one of the three inputs from the other cells. For instance, the north output of a cell is MUX(sel0, sel1, F, SIn, EIn, WIn). Hence, a cell can simultaneously be used for function generation and routing. The selector signals sel0 and sel1 are determined by configuration bits. At 4x4 block boundaries, the neighbor connection can be driven by a length-4 FastLANE as well. [Figure 2.6: XC6200 Neighbor Routing] The
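How a multiplexer can realize AND and XOR, and how a neighbor output selects its source, can be checked with a short Python model. This is a sketch under assumptions: Figure 2.5 did not survive extraction, so the input assignments `mux(a, a, b)` and `mux(a, b, NOT b)` are one possible wiring rather than the one drawn in the figure, and the select-bit ordering in `north_out` is likewise assumed. All function names are hypothetical.

```python
def mux(sel, in0, in1):
    """2-to-1 multiplexer: in0 when sel is 0, in1 when sel is 1 (assumed order)."""
    return in1 if sel else in0


def and_via_mux(a, b):
    """a AND b: when a is 0 the result is a (that is, 0), otherwise it is b."""
    return mux(a, a, b)


def xor_via_mux(a, b):
    """a XOR b: when a is 0 the result is b, otherwise it is NOT b."""
    return mux(a, b, 1 - b)


def north_out(sel0, sel1, f, s_in, e_in, w_in):
    """4-to-1 source selection for a neighbor output.

    Models MUX(sel0, sel1, F, SIn, EIn, WIn); treating sel0 as the low
    select bit is an assumption.
    """
    return (f, s_in, e_in, w_in)[sel0 + 2 * sel1]
```

Because the function unit and the routing multiplexers are separate, a cell can compute `and_via_mux(a, b)` while `north_out` forwards an unrelated input, which is the simultaneous use for function generation and routing that the text mentions.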
6. Pattern Matcher with Hints (16 Patterns of 12 Characters Each); Wotan Instructions; Control Unit; Wotan Place & Route Times. List of Programs: Ripple Carry Adder Types in Lola; Ripple Carry Adder in Lola; Ripple Carry Adder in VHDL; Ripple Carry Adder in Verilog; Definition of Node; Definition of Object; Definition of Instance; Definition of Type; Definition of Wire; Message Broadcast; Lola Code for FPGA Control PAL; Overview of Mapping Algorithm; Mapping of AND Gate; Mapping of Latch; Input Variables and Scopes; Anonymous Expressions; Buried Inputs and Outputs; Overview of Placement Algorithm; Placement of Arrays I; Placement of Arrays II; Placement of Nodes I; Placement of Nodes II
7. iren ph1 IR 21 instruction fetch or load we phl IR 21 store RAMWE 0 we RAMWE 1 we RAMWE 2 we condition set IR 21 0 or cleared IR 21 1 cond C IR 20 N IR 19 Z IR 18 IR 21 return or previous was load store as0 IR 18 phl branch or return or previous was load store as IR 23 IR 22 cond IR 20 IR 19 phl IF Place I THEN FOR i 0 DataN 1 DO alu i AluXOff AluYOff 2 i END F Wotan Microprocessor 164 FOR i 0 AddrN 1 DO cu i AluXOff 24 Alu YOff 2 i END FOR i AddrN DataN 1 DO zeroes i AddrN AluXOff 24 Alu YOff 2 END Dx AluXOff 2 Alu YOff 4 Dy AluXOff 3 Alu YOff 2 Dz AluXOff 3 Alu YOff 2 DataN zero AluXOff 2 AluYOff 1 one AluXOff 23 AluYOff 1 C AluXOff 19 AluYOff 2 DataN N AluXOff 22 AluYOff 2 DataN Z AluXOff 23 AluYOff 2 DataN as0 AluXOff 25 Alu YOff 2 AddrN asl AluXOff 25 Alu YOff 1 ds AluXOff Alu YOff 1 shen AluXOff 1 Alu YOff 1 xor AluXOff 20 Alu YOff 4 and AluXOff 21 Alu YOff 3 or AluXOff 22 Alu YOff 2 pes AluXOff 23 Alu YOff 4 FOR i 0 DataN 1 DO IR 1 2 i 1 END iren 1 2 DataN LS 3 2 DataN 3 ph0 3 2 DataN 2 phl 3 2 DataN 1 ph2 3 2 DataN END END Wotan F 3 Layout Synthesis As can be seen from the Lola code most of the gates are placed manually This is necess
8. lt ee gt gt Expression 143 A Syntax of Lola IfStatement ForStatement UnitAssignment ParameterList Constructor PosAssignment Position Statement StatSequence InType InOutType OutType TvpeDeclaration ImportList Module 144 IF Relation THEN StatSequence ELSIF Relation THEN StatSequence ELSE StatSequence END FOR Identifier Expression Expression DO StatSequence END Identifier Selector ParameterList Expression Constructor P Expression Expression Identifier Selector Position Expression Expression P Position 1 Position J Assignment UnitAssignment PosAssignment IfStatement ForStatement Statement Statement Expression BIT Expression J TS OC Expression BIT TS OC TYPE Identifier C IdList CONST ConstDeclaration IN IdList InType Y INOUT IdList InOutType OUT IdList OutType 5 VAR VarDeclaration BEGIN StatSequence END Identifier IMPORT Identifier 1 Identifier 5 MODULE Identifier ImportList TypeDeclaration CONST ConstDeclaratio
9. [Table 5.5: Comparison to CALLAS; last row: Total 5003, 1408, 112735, 30618] [Table 5.6: Memory Consumption for Compiling PatternMatch (16 x 12); last row: Total without GC 2187] The memory requirements of the tools are moderate. During expansion of the Trianus data structure many nodes are allocated, resulting in quite substantial memory requirements of the final data structure. The pattern matcher application contains 3048 cells, which roughly corresponds to about 3500 TriBase Nodes. Additionally, the design contains about 3000 labels (TriBase Objects). Most memory requirements of the Hades back end can be attributed to the router. It allocates a Lee map for the wave expansion on three hierarchical levels for a chip of dimension 64x64; 8299 nets have to be routed and temporary storage has to be allocated. Still, even for an application filling most of an XC6216 FPGA, the 2 MB of memory are very moderate if compared to commercial tools. This statement is supported by the fact that we developed and used the Hades software on a Ceres-2 with only 4 MB of main memory. It is remarkable that a small and slow machine is sufficient to place and route a design for which the commercially available tools require a PC with a Pentium processor with 16 or, better, 32 MB of main memory (cf. Table 6.5). 5.11 Experiences with Our Programming Methodology and Oberon. 5.11.1 Defensive Programming Pays Off without the Costs. As was stated in Section 5.2, we used def
10. 6 Application and Evaluation Program 6 5 Complete Pattern Matcher I MODULE PatternMatch 115 type definitions as shown in Programs 6 2 6 3 and 6 4 CONST DataSize 5 PatternSize 4 NofPatterns 2 ResultSize 8 VAR input3 BuriedReg 8 input 3 LoadReg 8 in 32 BIT map Mapper the data stream we match against width of pattern data pattern size nof parallel comparators length of result vector uppermost input register lower input registers loadable input vector 1 map incoming data data PatternSize LoadReg DataSize the patterns we look for pat NofPatterns PatternSize BuriedReg DataSize the comparators cmp NofPatterns PatternSize Comparator eql i 0 pattern i matches egl NofPatterns PatternSize BIT patMatch NofPatterns BIT match BIT store result of previous match queue ResultSize 1 LoadReg 1 result ResultSize BIT shiftReg 32 BIT shift BIT BEGIN control logic Start shifting bv writing l into shiftReg Shift for 4 clock cycles by writing 1 I 2 match any pattern matches 3 result vector 4 control logic state machine control bit for shifting 5 1 I into shiftReg then stop Insert zeroes in shiftReg to allow for longer delays in circuit E g shiftReg 1 70 71 70 1 70 I gt every second clock cycle shifting is enabled shiftReg 31 REG 0 FOR i 0 30 DO shiftReg i RE
11. [Figure F.4: Control Unit Slice] instructions shown in Table F.1. Rows with an i indicate instructions with or without an immediate operand, and rows with an n indicate instructions which do or do not negate one operand. F.1.5 Sequencer. The state machine controlling the operation of the microprocessor has three phases (states), shown in Figure F.5. In phase 0 (ph0), ALU instructions and branches are executed and a new instruction is loaded into the instruction register (fetch). If the current instruction is a load or a store instruction, it is executed in phase 1 (ph1), during which memory is accessed. After phase 1, a new instruction is loaded in phase 2 (ph2). [Figure F.5: State Machine] F.2 Lola Code. In the following, the complete Lola HDL description of Wotan, together with placement code and more detailed explanations, is given. The ALU consists of 24 bit slices. Each slice contains an ALU with 2 inputs x and y and output z. The inputs come from the 8 registers. The first register R0 is special; the others [Table F.1: Wotan Instructions; columns Instruction, IR.23 to IR.18] are implemented as RegCells with two output muxes. Figure F.6 shows the i
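The three-phase sequencer described above can be written down as a small transition function in Python. This is a sketch of the state machine as the prose describes it, not a translation of the Lola code; the names `next_phase` and `phase_trace` are hypothetical, and the transition from ph2 back to ph0 is an assumption based on ph2 performing the fetch for the next instruction.

```python
def next_phase(phase, is_load_store):
    """Return the phase that follows `phase` for the current instruction.

    ph0: execute ALU/branch instructions and fetch; go to ph1 only for load/store.
    ph1: memory access of the load or store; always followed by ph2.
    ph2: fetch of the next instruction; return to ph0 (assumed).
    """
    if phase == 0:
        return 1 if is_load_store else 0
    if phase == 1:
        return 2
    if phase == 2:
        return 0
    raise ValueError("phase must be 0, 1 or 2")


def phase_trace(instructions):
    """List the phases visited for a sequence of instructions.

    Each element of `instructions` is True for a load/store and False for
    any other instruction.
    """
    trace = []
    for is_load_store in instructions:
        phase = 0
        while True:
            trace.append(phase)
            phase = next_phase(phase, is_load_store)
            if phase == 0:
                break
    return trace
```

Under this model an ALU or branch instruction occupies one phase and a load or store occupies three, so the length of the trace gives a rough cycle count for a program.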
12. E. Oertli, H. Eberle. Switcherland: An Interconnect for Workstations. Proc. 21st EUROMICRO 95 Conference, IEEE, 1995. R. Ohran. Lilith: A Workstation Computer for Modula-2. Dissertation 7646, ETH Zürich, 1984. PCI Special Interest Group. PCI Local Bus Specification, Revision 2.0. PCI Special Interest Group, Portland, 1993. C. Pfister. CALLAS: A Physical Design Framework for Configurable Array Logic. Dissertation 9940, ETH Zürich, 1992. Philips Semiconductors. TriMedia Data Sheet. 1995. D. V. Pryor, M. R. Thistle, N. Shirazi. Text Searching on Splash 2. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1993. QuickLogic. Very High Speed FPGAs Data Book. 1994. R. Razdan. PRISC: Programmable Reduced Instruction Set Computers. Dissertation, Harvard University, 1994. [RBS94] R. Razdan, K. Brace, M. D. Smith. PRISC Software Acceleration Techniques. Proc. Intl. Conference on Computer Design, 1994. [RS94] R. Razdan, M. D. Smith. High Performance Microarchitectures with Hardware Programmable Functional Units. Proc. 27th Annual IEEE/ACM Intl. Symposium on Microarchitecture, 1994. [RW92] M. Reiser and N. Wirth. Programming in Oberon: Steps Beyond Pascal and Modula. Addison-Wesley, 1992. [RS97] P. Roe, C. A. Szyperski. Lightweight Parametric Polymorphism for Oberon. Proc. Joint Modular Languages Conferen
13. [Figure 5.1: Hades Software Part within the Design Flow of Fig. 1.3] 5 Hades Software. 5.1 Problem Statement and Motivation. Digital circuit design is a difficult, error-prone task. As in any other engineering discipline, abstraction and modularization are the key to a successful completion of a system. Only through abstraction is it possible to intellectually manage a digital circuit containing hundreds to possibly millions of functional units. The aid of a computer during the design and implementation of a digital circuit is mandatory to manage the hierarchical data structures involved (see [Geh97] for a discussion of CAD frameworks). As was presented in Section 1.4, the term hardware synthesis describes the process of generating an implementation of a digital circuit based on a specification, which exists in the form of a schema or a program written in an HDL. Hardware synthesis is composed of logic synthesis and layout synthesis. The former is the problem of finding functional units (gates) which implement the specification. The latter is the problem of fitting the functional units into the available resources of the target device, be it custom logic, standard cells, or an FPGA. 5.1.1 Layout Synthesis. This chapter treats the subject of layout synthesis. Our starting point is a Trianus data structure, also called a netlist. This netlist has been generated by the Lola HDL compiler or a schema editor
14. 3.2.4 Types, Unit Assignments, and Modules. Type definitions are used to group related declarations and assignments together into an entity, i.e., a macro describing a hardware component, e.g., an adder. A type definition consists of an interface definition (input, input/output, and output signals) and local signal declarations and statements. Types can contain instances of other types, thus building a hierarchy. This is very useful for structuring large designs into smaller, reusable components. Types can also be generic, e.g., the word width of a circuit need not be known in advance. Types are instantiated with concrete parameters in variable declarations. Actual input signals are passed to an instance in a so-called unit assignment (see the example in Section 3.2.5). Signals defined in the OUT section of a type are visible in the scope of the composite variable and are accessed by means of selectors. E.g., if a type A has an OUT signal co, then, with a being an instance of type A, a.co can be used in an expression to access this signal. Modules are the textual unit of compilation. They contain type and variable declarations and statements. Types in modules may be exported and imported by other modules, hence allowing the construction of libraries. 3.2.5 Example: Ripple Carry Adder Circuit. The two types in Program 3.1 show the definition of a full adder and of a ripple carry adder of unspecified word width, which is based on the full adder. Progr
15. [Figure 6.5: Comparator Circuit without Placement Hints] Since the comparator circuit is so central to the size and performance of our pattern matcher, it is advantageous to optimize its layout manually. With the layout editor a good placement can be found quickly. It is shown in Figure 6.6: 9 cells in an area of 2x5 cells with 90% utilization. Luckily, the size of the tree structure is small enough that all AND gates fit into one column of the same height as the one containing the XNOR gates. Note that one cell remains free, which can be used to chain comparators together to form larger comparators, as will be done subsequently. The initial routing, shown on the left in Figure 6.6, is not optimal due to the routing scheduling algorithm, which routes the connection e.0.0 to e.3.0 before e.0.1 to e.1.1. A small routing script can be used to remedy this situation. The result is shown on the right in Figure 6.6. Note that only neighbor routing resources are used to route the type, since many comparators at different locations exist and the routing resources at these locations differ. [Figure 6.6: Comparator Circuit with Initial and Final Routing] The delay of the left circuit i
16. [Figure 5.10: Tree Example] In the next example, in Figure 5.11, an array of expression trees is placed. The expression tree itself is the one from Figure 5.10. The routing remains the same, as the basic structure is the same. Since the tree is as wide as tall, vertical array placement is used and z.1 comes to lie above z.0. TYPE Example2 N Forest IN a b c d e INI BIT OUT z N BIT BEGIN FOR i 20 N 1 DO zi ai x bi ci d i x e i END END Example2 [Figure 5.11: Array of Trees Example] The example in Figure 5.12 nicely illustrates our placement strategy, which uses the height of one subtree to determine the vertical position of the other subtree. The circuit grows in height from its leaves, and the largest gap between two subtrees occurs near the root, e.g., p.0. The first subtree of p.0 is comprised of all cells to the right of p.0, while the second subtree is comprised of the AND gate above p.0. Note that only neighbor routing resources are used, to preserve position independence of the resulting instance. It would be better to use the length-4 FastLANE connection to route the second argument of p.0 and p.1. The layout is far from optimal, but it is predictable and, most importantly, repeatable, i.e., a second invocation of the placer on the same input wi
17. spread use and availability of the Internet, searching through large amounts of data such as electronic news becomes a daily task for computer users. A text search application using a reconfigurable coprocessor board to find relevant articles, based on user profiles, in the daily news feed is presented in [GMN96]. 6.2.1 Problem Statement. In the following, we build a simple pattern matcher optimized for finding one or more patterns in a text stream composed of 8-bit characters. To reduce the space requirements and add tolerance to the search, these 8-bit characters are mapped to 5-bit characters using the mapping shown in Table 6.1. In hardware this results in a reduction in area, whereas in software nothing can be gained. 6.2.2 Software Solutions. In text searching it has to be determined whether one or more patterns occur in a stream of data, and if so, the patterns are to be located. For single patterns, software solutions use an algorithm such as [BM77, CLR90] to find a pattern. If multiple patterns are to be sought, a finite automaton can be used [ASU86, CLR90]. To implement the aforementioned mapping, the text stream has to be preprocessed using a mapping table. 6.2.3 Hardware Solution. In hardware, a pattern is compared to a stream of data using a comparator circuit which tests for Boolean equivalence of the bits representing the data and the pattern, respectively. Since parallelism is easily implemented in hardware, we can detect the occ
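The software approach outlined above (preprocess the stream with a mapping table, then search for each pattern) can be sketched in Python. Two assumptions should be stated loudly: Table 6.1 is not reproduced in this excerpt, so `build_mapping_table` uses a purely hypothetical 5-bit mapping (letters folded to 1..26, everything else to 0), and the search uses Python's built-in `bytes.find` rather than the Boyer-Moore or finite-automaton algorithms the text cites. The function names are hypothetical as well.

```python
def build_mapping_table():
    """Hypothetical stand-in for Table 6.1: map 8-bit characters to 5-bit codes.

    Upper- and lower-case letters share the codes 1..26; every other
    character maps to 0. All codes fit into 5 bits.
    """
    table = [0] * 256
    for i in range(26):
        table[ord('a') + i] = i + 1
        table[ord('A') + i] = i + 1
    return table


def find_patterns(text, patterns):
    """Return sorted (position, pattern) pairs for all matches in text.

    Both the text and the patterns are bytes objects. They are mapped with
    the same table first, so the comparison works on 5-bit codes and gains
    the tolerance (here: case insensitivity) introduced by the mapping.
    """
    table = build_mapping_table()
    mapped_text = bytes(table[b] for b in text)
    hits = []
    for pattern in patterns:
        mapped_pattern = bytes(table[b] for b in pattern)
        start = mapped_text.find(mapped_pattern)
        while start != -1:
            hits.append((start, pattern))
            start = mapped_text.find(mapped_pattern, start + 1)
    return sorted(hits)
```

The preprocessing pass is extra work on a byte-oriented machine, which is consistent with the remark that the 5-bit mapping gains nothing in software.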
18. wires: SET; (* wire directions which should be considered for the marking of map *) cost: INTEGER; (* current cost class for queue *) queue: ARRAY CostClasses OF Node (* circular priority queue of next positions in the Lee map *) END these wires, attach them to the correct source node, and insert them into the correct instance. This is necessary for the correct functioning of the router. See [Geh97] for a more detailed description of the extraction process. (2) Only types that have not yet been routed must be processed. This test is necessary, as it is possible to use the router incrementally, whereby certain types might have been routed during an earlier routing step. (3) As multiple instances of the same type are placed at different positions across the chip, not all instances have the same routing resources available at the same relative position. But since the instances are of the same type, they all must be routed the same way. Consider the four instances of the same type shown in Figure 5.18. In i1 we could use the length-4 FastLANE signal to route the output of cell s to the input of cell d, as done in i1r; however, in i2 this routing resource would not be available at the same position, as shown in i2r. Therefore, the router first has to check what routing resources are available when routing a type. This is determined by invoking procedure LegalHierarchies for each instance of the type to be routed. This procedure determines the position of the fir
19. 6.3.4 Discussion. When faced with choosing CAD tools for an FPGA, users typically have only one choice, namely the tools provided by the vendor of the FPGA. To satisfy all possible demands users might have, vendors provide tools with a plethora of features and options. However, they often neglect the quality of the underlying core algorithms which perform the real work in CAD tools. The above observation also holds true for the new XACT step 6000 tools for the XC6200 FPGA. Three different placement options, with more suboptions, are provided, with which a user can influence the quality of the resulting placement. As our experiment in Table 6.3 showed, even using the highest effort and allowing for transformations of the types does not result in a routable design on the first try. And after several ripup and reroute cycles, the resulting design could still not be used as a coprocessor application. Hence, the user has to give placement hints, which have to be back-annotated to the HDL source code or the schema. Since XACT uses, among others, a simulated annealing algorithm for placement, the quality of the produced result can vary drastically. The router sometimes runs slower, sometimes much faster, and the user might just be lucky and get a routable design, or he or she must iterate ten times and try out different options to achieve a good result. This is a very time-consuming method to develop a design. We believe that
20. BIT CPUDS BIT CPUBE 4 BIT XCRAMCE BIT XCRAMOE BIT XCRAMWE 4 BIT OUT RAMOE BIT RAMCE BIT RAMWE 4 BIT VAR XCSel RAMSel PortSel BIT select BIT BEGIN select CPUDS BoardAdr XCSel select A19 A18 RAMSel select A19 A18 PortSel select A19 A18 CPU read write CPU data strobe CPU byte enables 0 3 XC selects SRAM XC reads SRAM XC write enables 0 3 for SRAM RAM read enable RAM chip enable RAM write enable 0 3 00 xxx 01 xxx f 11 xxx for RAM access XC is NOT selected and CPU selects RAM and reads writes or CPU doesn t select RAM and XC reads writes RAMCE XCSel RAMSel PortSel XCRAMCE RAMOE XCSel RAMSel CPURW RAMSel PortSel XCRAMOE FOR i 0 3 DO RAMWE i 7 XCSel RAMSel CPUBE i CPURW RAMSel PortSel XCRAMWE i END END DecoderRAMCtrl E Hades RC Board Decoder 153 TYPE DecoderXCRW implemented in a PAL22V10 U9 IN Clk BIT BoardAdr BIT board is selected address lines needed for decoding A19 A18 A4 A3 A2 BIT CPURW BIT CPU read write CPUDS BIT CPU data strobe XCBusy BIT XC busy flag CPUDOIn BIT CPU D O input INOUT CPUDO TS CPU D O tri state output O
21. This introduces all kinds of special cases in the router during the spreading of the wave and also during backtracing. A more regular multiplexer arrangement, which might have used more silicon, would have been advantageous, at least from a software engineering point of view. But as is often the case in hardware design, certain features of the chip have to be fixed or bypassed in software. See also Section 5.7 for a further problem of the architecture whose solution was deferred to software. A second complication during the development of the router was the support for type-based routing. Although it speeds up the routing of multiple instances by a factor proportional to the number of instances of a type and is a worthwhile capability, the code to implement it is quite complicated and needs profound knowledge of the Trianus data structure. A resource of the XC6200 which we simply ignored is the magic routing resource [Xil96]. It lets one of two inputs of the cell be routed to the next switch directly, bypassing intermediate cells. Its use in type-based routing is questionable, as all instances of a type would be constrained to lie at the same position in a 4x4 block to make use of the magic resources. Apparently, the usefulness of the magic resources is limited, as more often it is the function output of the cell that has to be routed to the next switch quickly [Buc96], and not some input signal of a cell. In en
22. [Table 6.3: Pattern Matcher without Hints (2 Patterns of 4 Characters Each)] The following list contains explanations for the corresponding labels in Table 6.3. (1) This row lists the time needed by XACT to read the design files which were produced by Hades. The files are in the same format as the ones used by XACT itself. Hence, the delay encountered would be the same if a different tool was used to produce the files, such as a commercial schematics editor or an HDL compiler. (2) The Hades router does not support the Magic routing resources (cf. Section 2.3.2 and Section 5.6). The times for Hades are hence listed in the Without Magic rows. As XACT has an option to allow or disallow the use of the Magic resources, we list the times for both cases. (3) This row lists the number of unrouted nets, i.e., connections for which the router failed to find a path. (4) Hades preroutes all types by default. All instances of the same type have the same routing. In XACT the user can preroute the types, i.e., enforce the same routing on all instances of that type. In this row we list the routing results when types are prerouted. (5) This row lists the sum of map place an
23. a 1 3 Adder add 0 0 0 AddElem s 0 0 co 0 1 h 1 0 add 1 0 2 AddElem s 0 0 0 s 1 0 2 The use of the layout editor and placement hints is exemplified in the pattern matcher application in Section 6.2. 5.5.9 Floor Planner. In addition to the automatic placement algorithm, we developed a simple floor planner. It makes use of the placer but omits the last stage of placement, namely the placement of the top-level instances and expression trees. Hence, all types are placed by the placement algorithm, but top-level instances are placed by the user. Typically, the floor planner is used if a design is too big to be placed automatically. The user can preplace certain components using a drag-and-drop mechanism, then optionally adjust the placement of certain types, and finally back-annotate the Lola program with position statements. Then a new compilation and placement step will most likely yield a fully placed design. The floor planner was used during the layout of the Wotan microprocessor presented in Appendix F. 5.5.10 Discussion. As will be shown in Chapter 6, the simple approach used by our placement algorithm leads to very fast placement times at the cost of quality. However, for regular designs such as data paths, the array placement heuristic yields quite satisfactory results. The structure of the Lola code is reflected in the quality of the placement, in the sense that the algorithm performs quite well on small t
24. also require that the latch is named, that is, that the latch node is rooted in a signal node. This is necessary for ensuring the proper working of other tools such as the extractor and the checker. It is not a major restriction, but rather enforces a good design style. Figure 5.7 shows a graphical representation of the transformation step, and Program 5.3 lists the code necessary for doing the transformation. In the code we generate two new nodes and change one existing node into a new form. Program 5.2: Mapping of AND Gate. And prev points to previous node n points to current node IF n x fct TriBase Not & n y fct TriBase Not THEN Swap n x n y a b gt b a END nx n y OR n x ny OR nx n y IF n x fct TriBase Not & n y fct TriBase Not THEN IF prev fct TriBase Not THEN proper And a 7b gt a b upper part of Figure 5 6 n fct TriBase Not change n into Not n x fct TriBase Or change n x into Or N X y n y X eliminate Not on n y delete node n y superfluous node ELSE prev fct TriBase Not gt Nand Ca b gt a b lower part of Figure 5 6 prev fct TriBase Or change prev into Or prev y n y x eliminate Not on n y n N X X change n delete former nodes n n x n y END END Program 5.3: Mapping of Latch. Latch IF prev is not a label THEN report error ELSE BUF prev node NewNode TriBase Buf prev NIL MUXI BUF data node NewNode T
25. co q i cig END co c N 1 co END Counter [Figure 5.16: Counter Example] TYPE AddElem Full Adder IN x y ci BIT OUT s co BIT VAR h BIT BEGIN h x y s h ci co MUX h x ci END AddElem TYPE Adder N bit Adder IN x y EN BIT ci BIT OUT s N BIT co BIT VAR add N AddElem BEGIN add 0 x 0 y 0 ci FOR i 1 N 1 DO add i x i y i add i 1 co END FOR i 0 N 1 DO si add i s END END Adder [Figure 5.17: Adder Example] plied to the layout shown in Figures 5.16 and 5.17, for instance, the text shown in Program 5.12 is produced. This text can be copied into the Lola description such that subsequent compilation and placement steps of the program yield the desired placement. Note that the layout information for a type is only displayed once, as all instances have the same placement, e.g., c.0, c.1 and add.0, add.1. Program 5.12: Textual Layout Information. er Counter c 0 0 0 CountElem q 0 0 co 0 1 c 1 1 0 CountElem q 0 0 0 q 1 1 0 co 1 1
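The AddElem and Adder types listed above can be mirrored in a short Python model to check their logic. This is a sketch under assumptions: the operators of the Lola listing were lost in extraction, so the two exclusive-OR operations (h from x and y, s from h and ci) are inferred from the standard full-adder structure, and the argument order of `mux` is assumed. The function names are hypothetical.

```python
def mux(sel, in0, in1):
    """2-to-1 multiplexer: in0 when sel is 0, in1 when sel is 1 (assumed order)."""
    return in1 if sel else in0


def add_elem(x, y, ci):
    """Full adder following the shape of AddElem.

    h  = x XOR y            (assumed operator)
    s  = h XOR ci           (assumed operator)
    co = MUX(h, x, ci)      when x equals y the carry is x, otherwise it is ci
    """
    h = x ^ y
    s = h ^ ci
    co = mux(h, x, ci)
    return s, co


def adder(x_bits, y_bits, ci=0):
    """N-bit ripple carry adder on little-endian bit lists.

    Element i receives the carry output of element i - 1, as in the
    FOR loop of the Adder type. Returns (sum_bits, carry_out).
    """
    s_bits = []
    carry = ci
    for x, y in zip(x_bits, y_bits):
        s, carry = add_elem(x, y, carry)
        s_bits.append(s)
    return s_bits, carry
```

Using a multiplexer for the carry fits the multiplexer-based cells discussed elsewhere in the text, since the carry then needs no separate AND/OR network.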
26. we discuss the hardware involved, if any, and the software to program this hardware. The list is not exhaustive but presents one or two exponents of a particular approach to reconfigurable coprocessors and related synthesis software. If a project is not listed here, it does not mean that it is not relevant. The wealth of such projects simply makes a complete listing impractical. Steve Guccione's WWW list of custom computing machines lists over 50 entries and is growing steadily [Guc94]. 7.1 Custom Computers. The first category we discuss is that of custom computers. We define a custom computer to consist of several FPGAs with attached RAM. Custom computers, in contrast to reconfigurable coprocessors, typically use a host computer only for data management, i.e., input and output. Custom computers are used to implement large applications in hardware. 7.1.4 Programmable Active Memories. One of the first custom computers was implemented at the Paris Research Laboratory of Digital Equipment Corp. [BRV89]. The pioneering work of the PRL group in the late 80s and early 90s culminated in a paper first published in 1993 titled "Programmable Active Memories: Reconfigurable Systems Come of Age" [VBR96]. This group had extensive experience implementing successful high-performance applications on their custom computer. Perle 1 is the successor project to Perle 0 and consists of an array of 4x4 XC3090 FPGAs from Xilinx, representing about 100K logic gat
27. 2.4.1 CAL. The predecessor of the XC6200, the CAL (Configurable Array Logic) by Algotronix, is not available any more but is presented here for historical reasons [Alg90, Kea89]. It was one of the first fine-grained SRAM-based FPGAs, featuring 32 by 32 cells, each of which could implement any function of two inputs or a latch. Figure 2.9 shows the function unit of the chip. It had neighbor-to-neighbor connections just as the XC6200, but these were the only routing resources (Figure 2.6). [Figure 2.9: CAL Function Unit] The CAL was used in reconfigurable coprocessors such as the CHS2x4 from Algotronix [Alg91] and the Chameleon computer developed at the Institute for Computer Systems at ETH Zürich [Hee93, HP92]. The CAL may be regarded as a pioneering work on fine-grained architectures. Its main drawbacks were long propagation delays due to the lack of long connections and the presence of level-sensitive latches instead of edge-sensitive registers. 2.4.2 AT6000. The AT6000 architecture from Atmel is a slight variant of the Concurrent Logic CLi6000 architecture [Atm95]. It is used in a laboratory for a digital design course for computer science students at ETH Zürich [GLW94, Wir95]. The AT6000 is a fine-grained SRAM-based FPGA, although less fine-grained than the XC6200. Each cell has three inp
28. 5.8 Runtime System. A reconfigurable coprocessor is of little use if it cannot be accessed in a convenient way from the software side. In Hades, the coprocessor application is described in the form of a Lola program. Interaction between the host and the coprocessor is only possible through input and output signals defined in the Lola program. These signals can be implemented using IOBs, i.e. physical connections between the FPGA and the bus (cf. Section 4.5.1), or as logical inputs and outputs in the form of buried IOs (cf. Section 5.4). The latter is more flexible and the preferred way of interfacing with a coprocessor application in Hades. The advantages of buried IOs are the following: no wiring to IOBs is needed; the application is relocatable within the FPGA; the application is portable to different hardware platforms using the XC6200 FPGA. 5.8.1 Automatic Interface Generator. Hades features an automatic interface generator. It generates an Oberon module from a Trianus data structure. Such an interface module constitutes a driver module for the hardware application. The module contains variable definitions of interface objects representing Lola variables in the design. Each interface object has associated with it a map register value and a column position. Arrays of bits are translated into correspondingly sized basic types of the Oberon language. For instance, a bit vector of length 16 is represented by an interface object of type Int
29. 5 ns speed grade, DIP 24, 300 mil; Lattice GAL22V10B 7LP; 3 DIP 24 300 mil socket with 100 nF capacitor; 2 A L S541 J octal buffer and line driver with 3-state output; 4 A L S645 J octal bus transceiver with 3-state output; 1 A L S679 J 12-bit address comparator; 7 DIP 20 300 mil socket with 100 nF capacitor; 1 40 MHz oscillator; 2 330 Ohm resistor; 2 270 Ohm resistor; 4 10 kOhm resistor; 6 8 resistors in a DIP 16, 300 mil; 6 DIP 16 300 mil socket; 0.22 uF capacitor; 18 100 nF capacitor (if sockets with capacitors are not available); 47 uF capacitor; 96-pin Euroconnector, male, angled; 96-pin Euroconnector, female, straight; 32-pin connector, male, 100 mil pitch; A bit DIP switch N. E Hades RC Board Decoder. The following three programs list in full the Lola code for the PALs implementing the control interface on the Hades RC board. The first program, DecoderXCCtrl, lists the code for the XC6216 control PAL; the second, DecoderRAMCtrl, lists the code for the SRAM access controller; and the third, DecoderXCRW, lists the code for the communication port controller. TYPE DecoderXCCtrl; (* implemented in a PAL22V10 US *) IN Clk: BIT; BoardAdr: BIT; (* board is selected *) A19, A18, A4, A3, A2: BIT; (* address lines needed for decoding *) CPURW, CPUDS: BIT; (* CPU read/write, data strobe *) RESET: BIT; (* master reset *) OUT XCCS
30. [Figure 5.8: Mapping of Register] propagated upwards in the inclusion hierarchy. This is accomplished by a separate pass over all input variables of instances declared in a type, after these inner instances have been mapped. If an input variable of an inner instance refers to an input variable in the current instance, the latter must be duplicated. Program 5.4 illustrates this point. Note in the example that the carry output of the previous adder element, which is passed to the carry input of the next adder element, is an output variable and does not have to be duplicated. Anonymous Expressions. Another problem occurring with nested instances is that the actual parameter in a unit assignment can be an arbitrary expression, not only a signal name. These expressions are not anchored in a signal variable and are therefore anonymous in the current scope. Their roots are the respective input variables of the instances to which they are passed. Program 5.5 illustrates this problem. Again, as with the duplication of input variables explained above, these anonymous expressions are mapped during the same pass over all instances and their input variables. Global and Buried Inputs and Outputs. The mapper is also used for transferring positional information
31. D. Cliff, A. Thompson, N. Jakobi: Evolutionary Robotics at Sussex. Proc. Intl. Symposium on Robotics and Manufacturing, 1996. B. Heeb: Design of the Processor Board for the Ceres-2 Workstation. Technical Report Nr. 93, Dept. Informatik, ETH Zürich, 1988. B. Heeb: Debora: A System for the Development of Field Programmable Hardware and its Application to a Reconfigurable Computer. Dissertation, ETH Zürich, 1993. [HN91] [HP92] [HP96] [HW96] [Hig69] [Hof96] [IEE87] [ICS96] [Ise96] [IS95] [JG93] [Kar86] [Kea89] [KG89] [Kea96] [KB92] [KNS96] [KL70] B. Heeb and I. Noack: Hardware Description of the Workstation Ceres-3. Technical Report Nr. 168, Dept. Informatik, ETH Zürich, 1991. B. Heeb and C. Pfister: Chameleon: A Workstation of a Different Colour. 2nd Intl. Workshop on Field Programmable Logic and Applications, LNCS 705, Springer, 1992. J. L. Hennessy, D. A. Patterson: Computer Architecture: A Quantitative Approach. Second Edition, Morgan Kaufmann, 1996. J. P. Heron, R. F. Woods: Architectural Strategies for Implementing an Image Processing Algorithm on XC6000 FPGA. Proc. 6th Intl. Workshop on Field Programmable Logic and Applications, LNCS 1142, Springer, 1996. D. Hightower: A Solution to Line Routing Problems on the Continuous Plane. Proc. Design Automation Workshop, 1969. D. Hofmann: Minterms: A Program for Logic Minimization. 1996. IEEE Standard 1076-1987. IE
32. Kosper, D. Kunze, D. Lopresti, S. Lucas, R. Minnich, P. Olsen: SPLASH: A Reconfigurable Linear Logic Array. International Conference on Parallel Processing, 1990. D. E. Goldberg: Genetic Algorithms. Addison-Wesley, 1989. M. Gschwind, V. Salapura: A VHDL Design Methodology for FPGAs. Proc. 5th Intl. Workshop on Field Programmable Logic and Applications, LNCS 975, Springer, 1995. S. Guccione: List of FPGA-based Computing Machines. http www io com guccione HW list html, 1994. S. Guccione and M. Gonzalez: Classification and Performance of Reconfigurable Architectures. 5th Intl. Workshop on Field Programmable Logic and Applications, Springer, 1995. S. Guccione: Programming Fine-Grained Reconfigurable Architectures. Dissertation, University of Texas at Austin, 1995. J. Gutknecht: Do the Fish Really Need Remote Control? Proc. Joint Modular Languages Conference, 1997. B. Gunther, G. Milne, L. Narasimhan: Assessing Document Relevance with Run-Time Reconfigurable Machines. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996. J. D. Hadley, B. L. Hutchings: Design Methodologies for Partially Reconfigured Systems. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1995. M. Hanan, P. K. Wolff Sr., B. J. Agule: A Study of Placement Techniques. Journal of Design Automation and Fault-Tolerant Computing, Vol. 1, No. 1, Oct. 1976, pp. 28-61. I. Harvey, P. Husbands
33. Simplified. previously presented FPGAs, using representatives with equivalent gate capacity. Each feature is also evaluated with regard to coprocessor applications and suitability for hardware synthesis. We exclude the CAL architecture, and we list the XC4020E instead of the XC4028EX, as even the smallest EX device has a higher gate capacity than the other two FPGAs in the comparison. [Table 2.2: Comparison of Different FPGAs. Columns: XC6216, AT6010, XC4020E. Recoverable entries: number of cells (64x64 = 4096 for the XC6216, 80x80 = 6400 for the AT6010); cell complexity (simple, medium, complex); routing (hierarchical; single, local, express; single, double, long); bitstream format (public, proprietary, proprietary).] 2.5.1 Logic Cell. The XC6200 features a simple cell which can implement any two-input Boolean function or a multiplexer. Technology mapping is easy (cf. Section 5.4). The XC4000 has a very complex cell which can implement from two functions of 4 inputs each up to certain functions of 9
34. The task is to implement the circuit represented by the netlist on an instance of the XC6200 FPGA architecture. There are several degrees of freedom on how this circuit can be implemented. • Technology Mapping: how are the basic gates of the netlist mapped onto the available cells of the FPGA? • Placement: how are the mapped cells arranged on the rectangular grid of the FPGA? • Routing: how are the placed cells interconnected using the available wiring resources of the FPGA? 5.1.2 Intractability. The problems above are not independent from one another. Routing depends on the placement of cells, which in turn depends on the mapping of gates onto cells. In general, the problem of finding the optimum solution for any of the above tasks is NP-hard [Coo71, GJ79, Kar86]. Because of the intractability of solving these problems optimally, layout synthesis tools often use heuristics which approximate an optimal solution as closely as possible. (Aside: the same statement holds for most optimization problems; for example, logic synthesis is another NP-hard problem.) 5.1.3 Current Approaches. We give a short overview of current approaches in all of the three problem domains. A longer introduction can be found in [Pfi92]. All these problems are normally solved using heuristics, which are not guaranteed to find the global optimum. Technology Mapping. As was stated above, technology mapping is the task of assigning the function units in a netlist to the
35. Type-bound procedures are used to read and write the values. Writing a value only makes sense if the object represents a register. A write has no effect on combinational logic gates as long as the register is not used for constant generation in these gates [Xil96]. Program 5.18 shows an excerpt of the interface generator. Types are derived from a generic type Interface and represent the basic types of Oberon, such as CHAR, INTEGER, etc. Once an Oberon module is generated for a coprocessor application, the software programmer can use interface objects to safely interact with the application. The interface preserves type safety on the software side, as no low-level features of the Oberon language must be used to interact with the hardware part of the application. Obviously, type safety on the hardware part is non-existent, as untyped bit values are manipulated. The software programmer can augment the automatically generated interface with additional code, for instance to initialize an array of bytes with one procedure call. In Chapter 6, an interface for a pattern matcher application is automatically generated and used to steer the data flow to and from the application. 5.8.2 Future Work. Every computer has an operating system managing its resources, such as disks, graphics, and input/output devices. A coprocessor as defined in Section 1.1 is not managed by the operating system but by the programming language compiler, e.g. a floating-point coprocesso
36. Verilog code and Martin Radetzki for the VHDL code, • Peter Alfke for background information on the XC4000 and the FPGA market, • David Hofmann for the logic minimizer, • my proofreaders Stephan Gehring, Taylor Hutt, Cheryl Lins, and Nels Vander Zanden. I owe you a dinner in Silicon Valley. This work would have been impossible without the love and support of my beloved wife Irene. Thank you for everything during the past nine years. I love you. Acknowledgments. Kurzfassung. Abstract. 1 Introduction: 1.1 Coprocessors; 1.2 User Configurable Hardware; 1.3 Reconfigurable Coprocessors; 1.4 Hardware Synthesis; 1.5 Contributions; 1.6 Overview of Thesis. 2 Field Programmable Gate Arrays: 2.1 Background; 2.2 General Structure; 2.3 The Xilinx XC6200; 2.4 Other Architectures; 2.5 Evaluation. 3 Foundations: Lola and Trianus: 3.1 Hardware Description Languages; 3.2 The Hardware Description Language Lola; 3.3 [illegible]; 3.4 Discussion. 4 Hades Hardware: 4.1 Motivation; 4.2 Design Alternatives; 4.3 Overview of the Hades Reconfigurable Coprocessor; 4.4 Choice of Host Workstation; 4.5 Architecture of the Hades Board; 4.6 Constructing
37. XCOE: BIT; (* XC is selected, may drive pins *) XCAOE, XCDOE: BIT; (* XC may drive A/D buses *) XCReset, XCGClr: BIT; (* XC reset, global clear *) CPUAEN, CPUDEN: BIT; (* CPU drives the A/D bus *) XCStep: BIT; (* single stepping *) VAR select, write, XCSel, RAMSel, PortSel: BIT; oe: BIT; (* register *) BEGIN select CPUDS BoardAdr write select CPURW XCSel select A19 A18 (* 00 xxx *) RAMSel select A19 A18 (* 01 xxx *) PortSel select A19 A18 A4 (* 11 0xx *) XCCS XCSel (* 10 000 disable, 10 001 enable OE *) oe REG write A19 A18 A4 A3 A2 XCOE oe (* 10 010 *) XCReset write A19 A18 A4 A3 A2 RESET (* 10 011 *) XCGClr write A19 A18 A4 A3 A2 (* 10 100, generate one clock pulse high -> low -> high *) XCStep write A19 A18 A4 A3 A2 (* CPU uses data bus *) CPUDEN XCSel RAMSel PortSel (* CPU drives address bus *) CPUAEN XCSel RAMSel OOS (* XC may drive data bus *) XCDOE oe XCSel RAMSel PortSel CPURW (* XC may drive address bus *) XCAOE oe XCSel RAMSel END DecoderXCCtrl TYPE DecoderRAMCtrl; (* implemented in a PAL22V10, U10 *) IN Clk: BIT; BoardAdr: BIT; (* board is selected *) A19: BIT; A18: BIT; (* address lines needed for decoding *) CPURW
38. a conventional SRAM (cf. Section 2.3). Therefore, the coprocessor board looks like a memory board to the CPU. Accessing the XC6216 takes the same amount of time as accessing normal memory (240 ns). The board features a 16-bit wide address and a 32-bit wide data bus to access the XC6216 and the local memory (Data and Adr in Figure 4.4). 16-bit addresses and 32-bit data can address 256 KB of local memory. This memory could hold, for instance, three 320x256x8-bit images or 1.5 seconds of stereo CD-quality audio data. Although we could have used the XC6216's capability to generate the control signals needed by the CPU to access the XC6216 [Xil96], we decided against this option, as we did not have any software to generate the configuration bits nor hardware to test the decoding circuitry itself at the time. Instead, we use three 22V10 PALs for interface and control logic (Decoder and Control PALs in Figure 4.4). Two expansion ports are provided to allow for hardware extensions. Series resistors protect the FPGA pins connected to these ports from possible damage caused by high currents. 4.5.1 Host Interface. The Ceres-2 bus is clocked at 25 MHz. One memory cycle takes six clock cycles. The data bus is 32 bits wide; therefore, a peak throughput of 25/6 MHz × 4 bytes = 16.7 MB/s could be achieved. Using the CPU only, a transfer rate of 12.5 MB/s can be achieved in practice; by today's standards a truly antique value. Program 4.1 lists the Lola cod
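The figures quoted in this snippet follow from a few lines of arithmetic. The sketch below reproduces them in Python; the constant names are illustrative, and the CD audio parameters (44.1 kHz, two channels, 16 bits per sample) are an assumption, since the text only says "stereo CD-quality".

```python
# Host bus parameters as stated in the text.
BUS_CLOCK_HZ = 25_000_000      # Ceres-2 bus clock
CYCLES_PER_ACCESS = 6          # one memory cycle takes six clock cycles
BYTES_PER_ACCESS = 4           # 32-bit data bus

# Access time and peak throughput.
access_time_ns = CYCLES_PER_ACCESS / BUS_CLOCK_HZ * 1e9
peak_bytes_per_s = BUS_CLOCK_HZ / CYCLES_PER_ACCESS * BYTES_PER_ACCESS

# Local memory: 2^16 word addresses, each word 4 bytes wide.
local_memory_bytes = 2 ** 16 * BYTES_PER_ACCESS

# How much fits into that memory.
image_bytes = 320 * 256 * 1                   # 8 bits per pixel
images_that_fit = local_memory_bytes // image_bytes

cd_bytes_per_s = 44_100 * 2 * 2               # assumed: 44.1 kHz, stereo, 16 bit
audio_seconds = local_memory_bytes / cd_bytes_per_s

print(f"access time      : {access_time_ns:.0f} ns")
print(f"peak throughput  : {peak_bytes_per_s / 1e6:.1f} MB/s")
print(f"local memory     : {local_memory_bytes // 1024} KB")
print(f"320x256x8 images : {images_that_fit}")
print(f"CD audio         : {audio_seconds:.2f} s")
```

Under the stated audio assumption the memory holds slightly less than 1.5 seconds, so the figure in the text is a rounded value.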
39. and EEPROM-based FPGAs to do so an unlimited number of times. Programmability comes at a cost, however. The logic implemented in an FPGA is less dense (usually 10% of an MPGA) and also slower (2-10 times) than its gate array implementation [DeH96]. This is mainly due to the large amount of wiring resources needed (up to 85%) and the on-chip configuration store (up to 10%), leaving only a small fraction of the chip area for active circuitry (as little as 5%). The feature of reprogrammability makes FPGAs useful in several areas. • Replacement of glue logic: This continues the trend of other PLDs such as PALs and CPLDs. • Replacement of MPGAs: This leads to a reduction of costs, as a design can be produced in shorter time and design changes can be accommodated even when the circuit is already installed in a system. • Speed of design cycle: The FPGA can be programmed in the system and tested right away without lengthy simulation cycles. • Reconfigurable coprocessor: FPGAs can be programmed to implement parts of a time-consuming algorithm in hardware. • Logic emulation: Instead of simulating a netlist, FPGAs can directly implement a circuit. • Teaching: Students can implement complex circuits in a short time, verify them using real hardware, and upon completion delete the design and move on to the next assignment [GLW94, Wir95, Wir96b]. In 1995, the total size of the programmable logic mark
40. be routed manually, and the designer has control over which resources may or may not be used by the router. Integration. In software development, one means for higher productivity are integrated software development tools. Pioneered by Borland in their Turbo Pascal product, today's software tools are all integrated. This means that all steps, from the writing of the program code to the debugging of the program, can be performed within the same environment. This is not true for synthesis tools. Various tools from various vendors interoperate only through files. A schematic editor software produces an EDIF file, which has to be read by the place-and-route software. The output of HDL compilers is a netlist in the form of a file, which has to be read and parsed again by back-end tools. These translation steps are costly in terms of time and memory, as files have to be written and read again. Section 3.3 and [Geh97] treat this subject in more detail. Hades is based on the Trianus framework and achieves integration through using one central data structure. Transparency. Current hardware description languages and associated synthesis tools often try to abstract too much from the underlying hardware. This is done, among others, for economic reasons. To support many different devices and target technologies, it makes sense to abstract from them in one way or another. Reuse of design is an often-heard goal. In the software world, this is common practice. Pro
41. boundaries. [Figure 3.5: XC6200 Layout of AddElem] In addition to just displaying synthesized layouts, the layout editor can be used to create circuits from scratch. It is possible to set each cell's functionality separately using popup menus. Connections are drawn with the mouse or with the aid of popup menus. Groups of cells and interconnections can be combined into instances of a type (hard macro) and stored in libraries for later use. Thus, a circuit can be constructed interactively by plugging together instances of prefabricated and tested types. The editor is a type-based tool, i.e. when a design contains several instances of the same type, any change made to a single instance is broadcast to all instances of that same type. This feature allows for the rapid manual construction of bit-sliced designs. Furthermore, the editor can be used for floor planning by laying out empty instances and filling in functionality only later. Quick View Updates are Essential. The editor framework of Trianus supports quick view updates. Especially in design automation tools, where hundreds or thousands of gates and wires have to be drawn, it is crucial for interactive performance that local changes to a design only cause local screen updates. Also, when the de
42. circuit without having to connect the registers to I/O buffers. We have not made use of the fast reconfiguration speed of the XC6200 FPGA. In fact, any memory location in the XC6200 can be altered just as quickly as the user registers. Therefore, we studied the possibility of improving the performance and reducing the size requirements of the comparator circuits by applying this feature. Making Use of Reconfigurability. In Figure 6.4, the basic structure of a 5-bit comparator was shown. The data and the pattern registers are compared using an XNOR gate, which yields one if both inputs are the same. Note that the value in the pattern register does not change. Therefore, we can propagate this constant value through the XNOR gate and replace it with a buffer or an inverter. Figure 6.11 shows the two resulting circuits (shown on the right) when a constant zero or a constant one is present at one of the inputs of an XNOR gate (shown on the left). [Figure 6.11: Constant Propagation; panels normal0, fused0, normal1, fused1] We can implement these negations with no extra cost in the XC6200 cell and can therefore implement a two-bit comparator circuit using a single AND gate with appropriate negations on its inputs. Once the pattern is known, the buffer or inverter function in front of the AND gate can be programmed directly into the cell. Figure 6.12 shows
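The constant propagation described in this snippet can be illustrated with a small software model. This is a sketch in Python rather than a description of the configured cells; the names `xnor`, `specialize_bit` and `make_comparator` are hypothetical. It relies only on the identities XNOR(d, 1) = d (buffer) and XNOR(d, 0) = NOT d (inverter).

```python
def xnor(a, b):
    """Generic comparison gate: yields 1 if both inputs are the same."""
    return 1 - (a ^ b)


def specialize_bit(pattern_bit):
    """Propagate a constant pattern bit through the XNOR gate.

    A constant one leaves a buffer, a constant zero leaves an inverter.
    """
    if pattern_bit:
        return lambda d: d          # buffer
    return lambda d: 1 - d          # inverter


def make_comparator(pattern):
    """Build a comparator that is fixed to one pattern (a list of bits).

    The result is an AND over the buffered or inverted data bits, which
    mirrors an AND gate with appropriate negations on its inputs.
    """
    stages = [specialize_bit(p) for p in pattern]

    def matches(data):
        return int(all(stage(d) for stage, d in zip(stages, data)))

    return matches


def generic_compare(pattern, data):
    """Reference version that keeps the pattern in a register and uses XNOR gates."""
    return int(all(xnor(p, d) for p, d in zip(pattern, data)))
```

Changing the pattern in this model means rebuilding the comparator, which corresponds to reprogramming the cell functions instead of loading a pattern register.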
43. coord field in Program 5.13). This is done so that only one update per type has to be made to the Trianus data structure using a type broadcast. [Figure 5.19: Growing of Bounding Box; labels: bounding box of S and D, 25% wider and taller, full chip size] Program 5.17 summarizes the necessary steps to route a single net, and the next section explains the process with a small example. Example. The routing of some expression trees was already shown in Section 5.5.7. For a detailed example, consider Figure 5.20, where d has to be connected with s. In the following, numbers written in italics refer to the numbers in the comments of Program 5.17. First, the algorithm marks the elements of the Lee map as Free and then marks the wires to the left of s and below d as Used (not available for routing). Then the wires leaving s to the north and the remaining neighbor outputs of cell s (2) are marked as possible destinations with Dest. The algorithm starts to spread the wave at d, namely at the input multiplexer for the upper input (3). Possible sources feeding that multiplexer are the neighbor outputs of the cells to the north, south, and east of d, and the neighbor output of the switch to the west of d. The neighbor output of the south cell is already marked as Used and can therefore not be used. Additionally, the length-4 FastLANE outputs of the switches to the north, south, east, and west are possible sourc
44. cycles. The interface is synchronous, such that XCCS, CPURW, A, and DOut are all sampled on a rising clock edge. An access is initiated by XCCS being sampled low (start of T3 in our case). An access terminates if XCCS is sampled high at the second clock cycle. It can be extended by keeping XCCS low during that cycle, in which case the access terminates asynchronously as soon as XCCS goes high. We use these extended access cycles, as the CPUDS signal remains low during three full clock cycles (T3, T4, T5). Signal CPUDS goes low during T2, and accordingly does XCCS. XCCS is sampled synchronously at T3 together with the other aforementioned signals. This starts the access cycle. At the start of T4, XCCS is still sampled low, initiating an extended access cycle. The access is terminated asynchronously when XCCS goes high during the next T1 cycle. This is no problem, as there is enough time before XCCS could go low again during the next T2 cycle. When the FPGA is written (DOut -> DXCIn) during a write access, the bus is sampled synchronously at the start of T3. During a read access (DXCOut -> Din), the FPGA provides the data on the bus during the extended access cycle until XCCS goes high again (T4, T5, T1). It is sampled by the CPU before the end of T5. 4.5.4 Clocks. Three clocks are provided as global signals in the XC6216. The main clock used by the XC6216 (GClk) is the Ceres clock running at 25 MHz. This guarantees the correct ope
45. data for the Hades back-end modules. The router and the loader are by far the largest modules of the back end. Each constitutes more than half the size of the respective subtotals. This clearly is a sign for the additional complexity introduced with hierarchical routing and the architecture's irregularities and peculiarities. [Table 5.2: Hades Software Size. Columns: Lines, Object. Rows: XCMapper, XCPlacer, XCFloorPlanner, XCRouterBase, XCRouter, Hades Subtotal, XCDriver, XCBoard, XCLoaderBase, XCLoader, HadesInterface, HadesInterfaceGen, Subtotal, Total. Values not recoverable.] Table 5.3 lists the sizes of the Lola compiler front end, the Trianus front end (including the Lola compiler back end, the data structure, and the checker and editor frameworks), the editor and checker for the XC6200 FPGA, and the timing analyzer. It also lists the total size of the Trianus and Hades system and compares it to Version 0.3.5 of the XACT step Series 6000 software from Xilinx Development Corporation, Scotland. XACT is bigger by a factor of 2 and does not include a runtime system, a driver for a coprocessor board, or an architecture-independent framework. However, it features a placer using various algorithms such as min-cut, simulated annealing, and constructive placement. Also, the router supports the Magic routing resource and makes use of the
46. decoder chips for MPEG-encoded video streams and so-called multimedia processors (TriMedia [Phi95], MPact [Chr95]) processing long data streams such as continuous video or audio data. However, as with FPUs, a coming trend in recent months was the introduction of single-instruction multiple-data instruction sets in general-purpose CPUs, also called multimedia instructions. Examples are Intel's MMX for the Pentium, Sun's VIS for the SPARC, and Digital Equipment's media extensions for the Alpha. Typically, these instructions operate on 8 bytes or 4 half-words at a time. For an Add instruction, for instance, the carry path is broken up after every 8 bits. This way, 8 parallel adds can be executed with one instruction. By incorporating some crucial instructions for DSP applications into the CPU, it is possible to achieve the same processing throughput on a general-purpose CPU as with a special ASIC. It seems that as soon as there is a large enough user base, CPU manufacturers will include traditional coprocessor-like abilities in a CPU. The result is often much higher speed than with a conventional coprocessor, as the coupling between processing units is tighter and more parallelism between instruction pipelines can be exploited. The result is that it becomes increasingly more difficult to find applications for which it makes sense to build special-purpose hardware. 1.2 User Configurable Hardware. One consequence of t
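The effect of breaking the carry path after every 8 bits can be emulated in software on an ordinary 64-bit word. The sketch below is written in Python and does not use any of the instruction sets named above; it applies a common masking technique, and the names `packed_add_u8`, `pack` and `unpack` are illustrative.

```python
LANES = 8                         # eight byte-wide lanes in one 64-bit word
HIGH = 0x8080808080808080         # top bit of every lane
LOW = 0x7F7F7F7F7F7F7F7F          # lower seven bits of every lane


def packed_add_u8(a, b):
    """Add eight unsigned bytes in parallel, wrapping around within each lane.

    The lower seven bits of each lane are added normally; with the top bits
    masked off, a carry can reach bit 7 of its own lane but never the next
    lane. The top bit of each lane is then completed with XOR, which is an
    addition without carry-out.
    """
    low_sum = (a & LOW) + (b & LOW)
    return low_sum ^ ((a ^ b) & HIGH)


def pack(values):
    """Helper: pack eight byte values (lane 0 first) into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Helper: split a 64-bit word into its eight byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]
```

For example, `unpack(packed_add_u8(pack([255] * 8), pack([1] * 8)))` yields eight zeros, because the overflow of each lane is discarded instead of rippling into its neighbor.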
47. direct user control is better, since the produced result is the one expected. The XACT router can be influenced in several ways as well. Normally, of course, users will enable all routing resources to ensure the successful routing of a design, but it is not at all clear if types should be prerouted or not. For certain designs it results in faster routing with higher quality (less unrouted nets), but for other designs it can have the opposite effect. It seems that the user is only left with a trial-and-error approach, which is very time-consuming if the design cycle time is in the range of minutes. The main advantage of our approach is speed. Using Hades, a user can explore the design space much more quickly. It becomes possible to give placement hints in the design description and see the effect of it half a minute later. This enables a completely different style of constructing hardware, as it allows the user's knowledge about the design to enter the design cycle much more easily. The result is, we believe, that the design is finished in less total time than using smarter but slower tools. 6.4 Hades in the World. The Trianus and Hades systems are publicly available (cf. Appendix G). Two groups outside the Institute for Computer Systems used Hades in the past to design circuits for the XC6200 FPGA. Their work is presented subsequently. Virtual Computer Corporation will distribute the Trianus and Hades system with their XC6216-based PCI board as an
48. generated into another Trianus data structure. The expanded data structure can be passed to back-end tools such as a placer and router for further processing. As an example, consider the code for the adder in Program 3.2. The first two FOR loops in the body of module Add will generate, when interpreted, a number of registers for the x and y vectors and some negations and AND gates for the load-enable signals. Likewise, interpretation of the unit assignment of adder will connect the x and y vectors with the input vectors of an instantiated 8-bit adder, which is generated by interpreting the code in the generic Program 3.10: Message Broadcast. CONST (* message broadcast selectors *) SelAll = 0; (* to all nodes/wires *) SelType = 1; (* to all instances of type msg.type *) SelRect = 2; (* to all nodes/wires partially visible within msg.r *) SelVisible = 3; (* to all nodes/wires partially visible in open instances within msg.r *) SelTop = 4; (* to all nodes/wires completely within msg.r, not going into sub-instances *) SelPlaced = 5; (* to all placed nodes/wires *) TYPE MessageBase = RECORD END; NodeProc = PROCEDURE (node: Node; VAR msg: MessageBase); WireProc = PROCEDURE (wire: Wire; VAR msg: MessageBase); Message = RECORD (MessageBase) r: Rect; type: Type; doNode: NodeProc; (* called for each node *) doWire: WireProc (* called for each wire *) END; PROCEDURE Broadcast (inst: Instance; VAR msg: Message; sel: SHORT
49. in principle, the development of the mapping algorithms revealed many subtleties which were not accounted for in the beginning. In fact, after discovering a problem with the mapping of negations and binary gates in an earlier version, we rewrote the whole mapper from scratch using the approach described above. The regular structure of the XC6200 and its simple cell were very advantageous in the development of the mapping algorithm. A direct approach only makes sense for simple cells. In fact, our mapping algorithm does nearly nothing in terms of technology mapping, as most functions of the Lola description are directly implementable by a logic cell. Grouping multiple expressions into one cell is not possible, and we can refrain from using a graph-based approach to find a mapping of the compiled netlist onto the target cells. The complications that occurred during the development of the algorithm, and which were discussed in the preceding paragraphs, were founded more in the requirements of the Trianus data structure than in the architecture of the XC6200 FPGA. 5.5 Placer and Floor Planner. The placement algorithm used by the Hades placer is constructive and deterministic. It always produces the same output for the same input. The main objectives during its development were quick response and usability for interactive, iterative design. Using a stochastic approach such as simulated annealing [KGV83] was not viable due to t
50. input of dst insert wires into inst handle clock clear and constant signals separately mark all wires of src as destinations 1 w src wire WHILE w # NIL DO TriBase2 AbsWire w wu wv SetMap r wu wv w to Dest w w next END mark unused outputs of src as destinations 2 FOR dir North TO West DO IF r map su sv dir from gt Free THEN SetMap r su sv dir Dest END END calculate bounding rectangle into r minU r minV r maxU r maxV find path by wave expansion start from inputs of dst 3 FOR dir West4 TO North BY 1 DO QueuePos r du dv input dir 1 END LOOP INC r cost 4 IF r cost MaxCost THEN EXIT END GetPos r u v to tu tv newTo consider cells in cost class r cost 5 WHILE to TriBase Void DO GetMap r u v to mark IF mark from Dest THEN EXIT found it result in u v newTo to 6 ELSIF mark is free entry THEN spread 7 mark cell as visited with direction where we came from 8 SetMap r u v to newTo tu tv Spread r u v to 9 END GetPos r u v to tu tv newTo 10 END END done to TriBase Void IF done THEN generate path while backtracing 8 WHILE not at dst DO from to to newTo append information for new wire to r coord PositionWire r coord inst src u v from to get next entry GetMap r u v to mark newTo mark from END END
51. msg router (* send mark message to all wires which intersect with mark r *) TriBase.Broadcast msg router mod mark TriBase.SelRect END [5 Hades Software, page 90]
same column are routed before other nets. This scheduling policy achieves quite satisfactory results in that it routes the nets in much the same way as an experienced designer would. A good scheduling policy is crucial for the performance of the router. Experience with this simple policy was quite positive, especially since we did not schedule the nets at all in the beginning. We refrained from examining alternative policies, but as they are so crucial to the performance of the router, it might be worthwhile to try out different ones. A similar scheduling policy is also used by [DM95].
5.6.4 Routing a Net
In the following we explain the steps and data structures involved in routing a single net, that is, a connection from a source node to a destination node (also called terminals). As was said before, the router uses a Lee map to represent the routing resources. This is a two-dimensional matrix of four routing resource entries, where each entry represents a routing resource at a certain position in a certain direction, e.g. the length-4 FastLANE multiplexer in north direction in column 15, row 4. The indices of the matrix are the horizontal and vertical coordinates of the routing resource. A wave is spread in this map to find all possible paths between two terminals. Prior to routing, the map i
52. of arrays of expression trees rooted in signals. Instances have a predetermined size, as their type has already been placed. Instances themselves are placed on top of each other at the next free position that is found in the free-space bitmap. If the boundary of the chip is reached, the horizontal coordinate is incremented by the widest width encountered since the last reset of the vertical coordinate, and the vertical coordinate is reset. Hence, instances that are not part of an array are placed going from the bottom to the top and from the left to the right of the chip.
5.5.5 Placement of Expression Trees
Expression trees are rooted in an object representing a signal name. The algorithm proceeds from this root into the leaves of the tree (pre-order traversal). Since every node of a Trianus data structure occupies at most one physical cell in the XC6200 FPGA, the first node is placed at the current position. Then the first and second subtree of the tree are traversed recursively [MD95]. Several nodes are put into one cell whenever possible. For example, an expression of the form a REG x y uses a single cell. In doing this, the placement algorithm determines the geometrical position of the nodes and thus performs a task traditionally associated with a technology mapper. [5 Hades Software, page 75] When a node cannot be merged into the current cell, it is placed to the right of that cell if it is in the first subtree, or above that cell if it i
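The instance placement rule at the start of this excerpt (stack instances bottom to top, then move right by the widest instance of the finished column) can be sketched as follows. This is only an illustration in Python under simplifying assumptions: it ignores the free-space bitmap and any cells that are already occupied, and the function name `place_instances` and its parameters are hypothetical.

```python
def place_instances(sizes, chip_height):
    """Stack instances column by column.

    sizes: list of (width, height) per instance, in placement order.
    Returns the (x, y) position of the lower left corner of each instance.
    """
    x = y = 0
    widest = 0                      # widest instance since the last reset of y
    positions = []
    for width, height in sizes:
        if y > 0 and y + height > chip_height:
            x += widest             # chip boundary reached: start a new column
            y = 0
            widest = 0
        positions.append((x, y))
        y += height                 # the next instance goes on top of this one
        widest = max(widest, width)
    return positions

print(place_instances([(2, 3), (4, 3), (1, 3)], chip_height=8))
```

Because the rule depends only on the order and size of the instances, the same input always yields the same layout, which matches the deterministic, constructive character of the placer described in the text.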
53. of digital computing. Chips with a feature size of 0.35 µm running at 500 MHz are common nowadays, and the next shrink down to 0.25 µm is within sight.
• Improvements in the architecture of processors, such as multiple execution units and pipelining, helped to lower the ratio of clock cycles per instruction.
• The availability of fast and cheap memory made it possible to incorporate cache memory on processors (first and second level caches), larger main memory, and caches on disk controllers. The availability of larger main memory allowed a space-for-time tradeoff, resulting in faster execution of algorithms.
• Improvements in algorithms and data structures reduced the runtime of operations.
Still, this level of performance is barely sufficient for modern applications such as image and sound processing, compression and decompression, three-dimensional graphics rendering, speech recognition, and so forth. As computers become faster, new application domains are tackled that require more computational power, and programmers pay less attention to the efficient implementation of algorithms. Therefore, new ways of speeding up algorithms are still a strong driving force in hardware and software research. [1 Introduction, page 2]
1.4 Coprocessors
In a computer system there is normally one central processing unit (CPU) which can be programmed to execute a task. Computationally intensive tasks are usually accelerated through algorithmic and programming techniques such a
54. of number of statements to number of lines of code is roughly 75%. But this ratio varies between 65% and 120%, indicating different coding styles. Therefore, the number of statements is a better measure for source code complexity.
Comparison to CALLAS
Cuno Pfister and Beat Heeb implemented a system similar to Trianus/Hades for the Algotronix CAL architecture [Hee93, Pfi92]. Table 5.5 compares the back ends of the CALLAS and Hades systems. As can be clearly seen, the additional complexity of the XC6200 FPGA reflects itself in the complexity of the loader and the router software of Hades. In addition, the more complex but also more flexible Trianus data structure adds complexity to the algorithms. The increase in the number of statements and object code is nearly a factor of 4. These are quite high costs for the support of the supposedly moderately more complex XC6200 FPGA.
5.10.2 Memory Consumption
We measured the memory consumption of the Trianus front end and the Hades back end during the compilation of the big pattern matcher (16x12) described in Chapter 6. Table 5.6 summarizes the data. All numbers are in kilobytes and were measured under Windows Oberon Version 2.0 from the University of Linz. The total does not take into account that some of the memory consumed in previous stages might be recycled by the garbage collector. [5 Hades Software, page 103]
Table 5.5 (truncated), columns: Hades Stats, CALLAS Stats, Hades Obj, CALLAS Obj. First row: Mapper 42 1087
55. perform well in the first area, the RC should reside as closely as possible to the CPU, where it can execute, for instance, a time-critical loop of some algorithm. To perform well in the second area, the RC should be near the input/output ports or include these ports on the board. The general architecture of an RC, shown in Figure 1.2, consists of one or more FPGAs connected to local memory and an interface to the CPU. As will be seen in Chapter 6, the speed of this interface is essential for achieving good performance. Often a general-purpose connector, where extension boards can be mounted, is included as well. The connection between the system bus and the local memory of the RC shown in Figure 1.2 is not necessarily present in an RC. It is, however, very convenient to have, as the CPU can then directly access the board's local memory and does not have to go through the FPGA, which might need to be reconfigured to allow for that possibility.
[Figure 1.2: Typical Reconfigurable Coprocessor. Blocks: CPU, memory, system bus, RC board with FPGA(s), input, output, general-purpose connector, local memory, direct access to local memory]
In the past, most RCs have been realized as big external boards attached to the system bus via an additional interface [ACC95, ABD92, BRV89, Ber93, GHK90]. But as FPGAs get denser, modern RCs are realized as smaller extension boards [LSC96, Sha96] which can be plugged directly into the sy
56. process can be understood.
5.4 Mapper
A technology mapper is a program that transforms a hardware-independent description of a circuit into a form suitable for the technology available in the target device. In our case this means mapping a Trianus data structure produced by the Lola HDL compiler onto the XC6200 FPGA architecture. Note that although we refer to the Lola compiler as being the tool to produce the data structure, it can also be produced using a schematics editor (cf. Section 3.3.6). In a Trianus data structure (described in Section 3.3), the logic of a circuit consists of a list of named binary expression trees, where the name corresponds to variables in the Lola program. The expressions can be composed of unary or binary operators of Boolean logic (Not, And, Or, Exclusive-Or), constants (Zero, One), multiplexers, registers, SR latches, and latches. Except for the last two, a cell of the XC6200 can directly implement any of these operators. In fact, the match between the Lola constructs and the possible cell configurations of the XC6200 is almost perfect. One might suspect that Lola was designed for the XC6200, but this is not the case. Rather, the XC6200 was designed to be as flexible and simple as possible, as was Lola, and the resulting basic operators are very similar. A mapper for the XC6200 should therefore be quite easy to develop, as almost no work has to be performed to do the mapping. It is for this reason that we have chose
57. 154
F.2 Register Slice 155
F.3 Data Flow 155
F.4 Control Unit Slice 156
F.5 State Machine 156
F.6 ALU Slice Signals 157
F.7 ALU Slice 164
F.8 Wotan Microprocessor on XC6216 FPGA 166
List of Tables
2.1 Truth Table for AND and XOR Functions 12
2.2 Comparison of Different FPGAs 19
3.1 Lola Operators 22
3.2 Lola vs. VHDL vs. Verilog 25
3.3 fct Values and their Meaning 34
4.1 Memory Map of Hades Board (Address is Relative to a Base) 48
4.2 Worst Case Access Time to Local SRAM 52
5.1 Binary Operators (Left) and Their Mapping (Right) 64
5.2 Hades Software Size 101
5.3 Total Size of Trianus/Hades System 102
5.4 Oberon Compiler Size 102
5.5 Comparison to CALLAS 103
5.6 Memory Consumption for Compiling PatternMatch 16x12 103
6.1 Mapping of 8 Bit to 5 Bit Characters 109
6.2 Searching MODU: Throughput in KB/s 124
6.3 Pattern Matcher without Hints, 2 Patterns of 4 Characters Each 127
6.4 Pattern Matcher with Hints, 2 Patterns of 4 Characters Each
58. represents a decision node. OBDDs represent Boolean expressions in a memory-efficient and canonical form and are thus well suited for comparison. The main drawback of OBDDs is that their size heavily depends on the ordering of the input variables. The Trianus system, however, does not suffer from this potential problem in practice, as expressions are usually small and contain only few variables. The OBDD for the carry-out (co) signal of type AddElem from Program 3.1 is shown in Figure 3.6.
[Figure 3.6: OBDD for Carry Out of AddElem]
A circuit extractor constructs connectivity information from a design edited with the layout editor. This information is used by the circuit checker to verify the correctness of a circuit. A circuit extractor is needed because the layout editor does not keep connectivity information consistent when a design is being changed. Like other tools, it operates on types and calculates net connection information only for a type before propagating it to instances. The extractor and checker are also used to allow the mixing of manual and automatic routing and for checking the correctness of the routed result (see Section 5.6). [3 Foundations: Lola and Trianus, page 39]
Browser
The browser is used to translate a Trianus data structure back into textual information. It also serves to show the interface (hence its name) of Lola types and is an indispensable tool for browsing libra
59. shared configuration RAM in a length-4 FastLANE switch [Xil96]. The latter is a constraint in the routing architecture which the XC editor back end of Trianus does not support (cf. Section 5.6.7). In this table we also give the size of the object code for the Ceres workstation. [5 Hades Software, page 102]

Subsystem            Modules   Lines   Statements   Object   Ceres
Lola Compiler              2     750          946    15289   10184
Trianus Front End         18    7270         5789   127931   81388
XC Editor/Checker          9    5428         4406    96235   70020
Timing Analyzer            6    3630         3433    68217   50032
Support                    8    2469         1620    39679   25340
Subtotal                  43   19577        16194   347351  236964
...
Table 5.3: Total Size of Trianus/Hades System

It is interesting to note that the Trianus front end is quite large when compared to the rest of the system. This indicates that a lot of functionality in a CAD system can be made independent of the target architecture. To put the Trianus and Hades systems into perspective, Table 5.4 lists the size of the Oberon-2 compiler for the Intel i386 architecture, which is about a third of our tools.

Modules   Lines   Statements   Object
     10   11954         8705   165880
Table 5.4: Oberon Compiler Size

On a quantitative note, Intel object code size divided by the number of statements consistently gives a factor of about 22. Intel object code is roughly 40% larger than object code for the Ceres (National Semiconductor 32000). The ratio
60. the comparator from Figure 6.6 together with a pattern register on the left, and the same comparator after constant propagation on the right. The number of cells and the space required to implement a 5-bit comparator is reduced by a factor of three. The original comparator had a delay of 9.5 ns, while the compact comparator's delay is 5.7 ns. This is an improvement of just 3.8 ns, but the size advantage of the compact comparator is drastic. While the bounding box of the 16x12 pattern matcher shown in Figure 6.10 is 60x61 cells, the bounding box of the compact version is just 29x61 cells.
Pipelining
To reduce the delay for large pattern matchers, one can introduce pipeline registers in the comparator's eql chain. This approach, however, would require additional data registers to compensate for the additional delay in the eql chain. [6 Application and Evaluation, page 126]
[Figure 6.12: Conventional and Fused Comparators]
By using an AND gate tree instead of a chain, we can omit these delay registers. This way, the critical path for the 16x12 pattern matcher can be reduced from 137 ns to 45 ns, which again is caused by the mapper circuit.
6.2.13 Discussion
The pattern match application presented in this section was a proof-of-concept application. Performance of the circuit itself is quite good, even though the performan
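The benefit of replacing the AND chain by an AND tree can be made concrete by counting the two-input gates on the longest path. The sketch below is plain arithmetic in Python and says nothing about actual XC6200 gate or routing delays; the function names are hypothetical, and `n` stands for the number of signals being combined.

```python
def chain_depth(n):
    """Gates on the longest path when n signals are ANDed in a chain."""
    return max(n - 1, 0)

def tree_depth(n):
    """Gates on the longest path of a balanced tree of two-input ANDs."""
    return (n - 1).bit_length() if n > 1 else 0   # equals ceil(log2(n))

for n in (4, 12, 16):
    print(n, chain_depth(n), tree_depth(n))
```

The chain grows linearly with the number of inputs while the tree grows logarithmically, which is why the tree shortens the critical path without needing pipeline registers.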
61. the eql chain should be placed within the comparators, as indicated in Figure 6.6. The layout of Figure 6.8 is therefore optimized manually, and the Lola code is augmented with position statements using the back annotation capability described in Section 5.5.8. Based on this code, the Hades tools produce the layout shown in Figure 6.9. To the left there is a block containing the shift register (leftmost cells), the mapper (big block on top) and the input register (to the right of the shift register). Further to the right are the data registers. Then the patterns and comparators follow, which are packed together optimally. The AND gate of the eql chain is located in the lower right corner of each comparator. The result vector appears on the right.
[Figure 6.9: Pattern Matcher with Placement Hints]
Note that all registers needed by the software interface have the same pitch (distance between bits) and start in the same row. This ensures that t
62. than compilers need for translating software descriptions. Today's synthesis tools use stochastic algorithms to achieve their results, and the user's knowledge about a design can be brought into the development cycle only with difficulty. In this dissertation, a complete hardware description system was developed. It consists of a configurable coprocessor and corresponding layout synthesis software. The configurable coprocessor of Hades consists of a single XC6216 FPGA and local memory in the form of static RAM. The coprocessor is connected to a workstation by means of a memory card interface. The Hades software consists of a layout synthesis back end for the XC6200 architecture. Trianus, a framework for developing FPGA designs, serves as the front end to our software. The hardware description language Lola is used to describe the algorithms. The Hades software consists of
• a technology mapper,
• a deterministic and constructive placement algorithm which relies on placement hints from the user to achieve dense layouts,
• a maze-based router which can be influenced by the user in various ways,
• a generator of configuration information, and
• an interface generator which automatically generates a software interface for a hardware application.
The resulting system has
63. to the data structure of a type and never to that of an instance.
Type Definitions
The following sections describe the types occurring in a Trianus data structure as they are defined in the Oberon programming language [RW92]. In Trianus, Oberon's type extension is used for specializing behavior and describing the semantic difference between the types. When describing extended types, those inherited fields which have a different meaning than in the base type are listed in parentheses, and an explanatory comment is added.
Node
Node is the basic type for building a data structure describing a circuit. Its definition is shown in Program 3.5. The fct field describes the function of a node. For a pure node, this is one of the basic operators of Lola (e.g. not, and, mux, register). For extensions of Node, the function is listed in the respective paragraphs below. A node per se exists in the data structure only to describe operators and abstract syntax trees (e.g. the operator in type AddElem of the Add example of Program 3.1). See Figure 3.4 for a graphical representation of an operator and operand tree, and Section 3.3.4 for an explanation of the use of abstract syntax trees. [3 Foundations: Lola and Trianus, page 31]
Program 3.5: Definition of Node
  Node = POINTER TO NodeDesc;
  NodeDesc = RECORD
    fct: SHORTINT;    (* operator: Zero, Not, And, Reg, ... *)
    x, y: Node;       (* operands *)
    link: Node;       (* next in same instance *)
    outer: Instance   (* outer instance type
64. with Marco Sanvido, the author wrote a tool to produce CFG configuration files, which are used by the Xilinx XACT step Series 6000 software. Using this tool, it is possible to generate a design within the Trianus system and use the Xilinx software to place and route the design. This is useful for comparing the Hades and XACT back end tools. The converter makes use of two other data structure modules which are very similar: one stores associations between strings and strings, and one stores associations between integers and strings. Again, both modules could be merged if generic types were available in the Oberon language. [5 Hades Software, page 101]
5.10 Quantitative Issues
In this section we evaluate the complexity of the Trianus and Hades systems and compare them to the commercially available tools for the XC6200 FPGA from Xilinx. We also analyze the memory consumption of the tools when a large coprocessor application is compiled, placed and routed.
5.10.1 Code Complexity
In the following tables we list the number of source code lines (Lines), the number of statements as reported by the Analyzer tool (Statements), and the number of bytes of object code for the Intel i386 processor (Object). We believe that the number of statements is a better measure for source code complexity than the number of lines of code, as it is independent of the coding style and the presence of comment lines in the source code. Table 5.2 presents the
65. 0 0 pat00 HI.map 11 pat 0 0 val 0X
    (* similar for remaining patterns *)
    HI.map 0 1 5 0 31; HI.map 1 0 31; HI.InitDescriptor(data[0], "data0"); HI.map 7 data 0 val 0X
    (* similar for remaining data; unneeded interface objects manually removed *)
  END
END Init;
BEGIN Init
END PatternMatchInt.
[6 Application and Evaluation, page 123]
Program 6.9: PatternMatch Application
  (* PM = PatternMatchInt *)
  PROCEDURE Search(r: Files.Rider; pat: ARRAY OF ARRAY OF CHAR);
    VAR pos: LONGINT; i, j: INTEGER;
  BEGIN
    (* ensure that setting of map register via PM.in works for other interfaces as well *)
    ASSERT(HI.SubMap(PM.shiftReg, PM.in) & HI.SubMap(PM.result, PM.in), 100);
    i := 0; (* load the patterns *)
    WHILE i < LEN(pat, 0) DO PM.PutPat(i, pat[i]); INC(i) END;
    WHILE i < LEN(HI.pat, 0) DO PM.IgnorePat(i); INC(i) END;
    HI.SetMap(PM.in);
    PM.shiftReg.val := {0, 2, 4, 6}; (* only every second clock cycle *)
    WHILE ~r.eof DO
      (* read 4 characters, store them into input register, shift 4 times *)
      Files.ReadBytes(r, PM.in.val, 4); PM.in.Write; PM.shiftReg.Write;
      (* read 4 characters, store them into input register, shift 4 times *)
      Files.ReadBytes(r, PM.in.val, 4); PM.in.Write; PM.shiftReg.Write;
      (* read back result *)
      PM.result.Read;
      IF PM.result.val 0 7 ft THEN
        (* found something in prev. 8 characters:
           report position based on set bits in PM.result.val and file position *)
      END
    END
  END Search;
[6 Application and Evaluation, page 124]
marginal. Additionally, w
66. 00 can be useful for certain applications, but their absence in the XC6200 has its advantages as well, as it is not possible to destroy the chip with a faulty configuration. This can be useful both in education [GLW94] and research [HHC96, Tho96].
2.5.3 Coprocessor Suitability
So far, most reconfigurable coprocessors use either the XC3000 or the XC4000 series from Xilinx (cf. Chapter 7). Reconfiguration times, although faster in newer devices, are quite slow (ms) with regard to coprocessor applications, and partial reconfiguration is not possible. Also, reading user registers is not very fast (ms), and it is not possible to set them. Data-path-intensive circuits map well onto the logic cell, as does random logic. The proprietary bitstream format hinders the development of new tools by third-party vendors or universities. Such tools could give better support for coprocessor applications.
The AT6000 has relatively fast reconfiguration times and also supports partial reconfiguration. Having long been the only FPGA to support these features, the AT6000 and its cousins were a favorite among researchers for exploring partial reconfigurability [EH94, HH95, LD93, WH95]. The cell supports data-path-type applications well, while it is less suited for random logic. As with the XC4000, the proprietary bitstream format hinders the development of new tools.
The dedicated memory-mapped processor interface, the possibility of fast padless I/O and the fast reconfig
67. 4 IN [page 161] [F Wotan Microprocessor, page 162]
  AOE BIT
  INOUT A AddrN TS; D DataN TS
  OUT RAMWE 4 BIT; RAMOE BIT
  VAR IR DataN BIT
    alu DataN ALUslice Place
    cu AddrN PCslice Place
    Dx Dy Decoder Place
    Dz EnDecoder Place
    N Z C BIT
    ds im pcs shen shd and or xor neg BIT
    (* state machine *) LS ph0 ph1 ph2 BIT
    (* controls *) ren iren cond as0 as1 BIT
    zero one we BIT
    zeroes DataN AddrN BIT
  BEGIN
    (* instruction register and decoders *)
    FOR i 0 DataN 1 DO IR i REG iren D i END
    Dz ren IR 15 17   (* destination register select *)
    Dx IR 12 14       (* src register 1 select *)
    Dy IR 0 2         (* src register 2 select *)
    (* ALU slices *)
    zero 0; one 1
    alu 0 D 0 IR 0 cu 0 pc alu 23 r alu 1 so zero alu 1 r neg one
      ds im pcs shen shd and or xor neg Dz y Dx y Dy y
    FOR i 1 11 DO
      alu i D i IR i cu i pc alu i 1 so alu i 1 so alu i 1 r alu i 1 r alu i 1 co alu i 1 zo
        ds im pcs shen shd and or xor neg Dz y Dx y Dy y
    END
    FOR i 12 AddrN 1 DO
      alu i D i IR 11 cu i pc alu i 1 so alu i 1 so alu i 1 r alu i 1 r alu i 1 co alu i 1 zo
        ds im pcs shen shd and or xor neg Dz y Dx y Dy y
    END
    FOR i AddrN DataN 2 DO
      zeroes i AddrN 0
      alu i D i IR 11 zeroes i AddrN alu i 1 so alu i 1 so alu i 1 r alu i 1 r alu i 1 co alu i 1 zo
        ds im pcs shen shd
68. 5.13 Multiplexer Example 80
5.14 Shift Register Example 81
5.15 Parallel to Serial Converter 82
5.16 Counter Example 82
5.17 Adder Example 83
5.18 Routing Resource Conflicts 89
5.19 Growing of Bounding Box 91
5.20 Spreading of the Wave 93
5.21 Resulting Route 93
5.22 XC6200 Function Unit 96
5.23 Inversions on Inputs 96
6.1 Pattern Matcher 108
6.2 Mapper Circuit without Placement Hints 110
6.3 Mapper Circuit with Placement Hints 111
6.4 Comparator Schema 111
6.5 Comparator Circuit without Placement Hints 112
6.6 Comparator Circuit with Initial and Final Routing 113
6.7 Data and Pattern Registers 114
6.8 Pattern Matcher without Placement Hints 117
6.9 Pattern Matcher with Placement Hints 118
6.10 Large Pattern Matcher with Placement Hints 119
6.11 Constant Propagation 125
6.12 Conventional and Fused Comparators 126
Schema of Hades RC Board 146
Photograph of Hades RC Board 147
F.1 Floorplan of Wotan Microprocessor
69. 6 Textual Layout Information 84
Router Data Structure 86
Overview of Routing Algorithm I 87
Overview of Routing Algorithm II 88
Marking of Wires Running over Instances 89
Routing a Net 92
Interface Objects 99
Control Flow as Seen From Software 108
Mapping of 8 Bit to 5 Bit Characters 110
Comparing Two 5 Bit Characters 112
Loadable and Buried Registers 114
Complete Pattern Matcher I 115
Complete Pattern Matcher II 116
PatternMatch Software Interface I 121
PatternMatch Software Interface II 122
PatternMatch Application 123
[page xii]
Kurzfassung (German abstract, translated)
The advent of user-programmable hardware sparked an interest in configurable coprocessors, which can be used to accelerate time-critical parts of software by casting them into hardware. Applications that are executed on configurable coprocessors are described by means of hardware description languages or schematic entry systems. These descriptions are translated into logic gates, for which a layout must be found. Logic and layout synthesis are time-intensive processes for which today's hardware synthesis tools need up to four orders of magnitude more time
70. 6.1 lists a pseudocode description of this process. A more detailed description of the software part and the interface is given in Section 6.2.10. [6 Application and Evaluation, page 108]
[Figure 6.1: Pattern Matcher. Labels: Data, Patterns, Preprocess, Length of Pattern, Match]
Program 6.1: Control Flow as Seen From Software
  load patterns
  WHILE data available DO
    load 4 characters into input register
    perform 4 shift steps (comparisons happen)
    load 4 characters into input register
    perform 4 shift steps (comparisons happen)
    read back result
    report matches in 8 previous characters
  END
[6 Application and Evaluation, page 109]
6.2.5 Preprocessing
To be case insensitive, our pattern matcher works with 5-bit characters. Before we put them into the circuit, we preprocess the patterns according to the mapping shown in Table 6.1. For performance reasons, the data stream is preprocessed in the circuit itself, since the source of the data might be a network adapter, in which case the data should not have to pass through the CPU.
[Table 6.1: Mapping of 8 Bit to 5 Bit Characters (table contents not recoverable)]
Program 6.2 shows the logic equations implementing the mapping. We tabulated the bit m
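As a rough software model of the control flow in Program 6.1, the following Python sketch feeds the data through a four-character shift register in blocks of eight characters and collects the matches once per block. Several things here are assumptions made only for illustration: the 5-bit mapping simply keeps the low five bits of the character code (the actual mapping of Table 6.1 is not reproduced in this excerpt), patterns are four characters long, and the names `to5` and `search` are hypothetical.

```python
def to5(ch):
    """Assumed 5-bit mapping: low five bits, so 'A' and 'a' coincide."""
    return ord(ch) & 0x1F

def search(data, patterns, width=4):
    """Return (pattern index, start position) for every match in data."""
    mapped = [[to5(c) for c in p] for p in patterns]   # "load patterns"
    window = [0] * width          # shift register holding the last characters
    hits = []
    pos = 0
    while pos < len(data):        # "WHILE data available"
        result = []
        for offset, ch in enumerate(data[pos:pos + 8]):
            window = window[1:] + [to5(ch)]            # one shift step
            end = pos + offset + 1
            if end >= width:                           # window is filled
                for k, p in enumerate(mapped):
                    if window == p:                    # comparison happens
                        result.append((k, end - width))
        hits.extend(result)       # "read back result" once per 8 characters
        pos += 8
    return hits

print(search("xxABCDyy abcd", ["abcd"]))
```

In the hardware version all patterns are compared in parallel during each shift step; the inner loop over `mapped` only stands in for that parallelism.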
71. Coprocessor Board Using Xilinx's XC6200 FPGA: An Experience Report. Proc. 6th Intl. Workshop on Field Programmable Logic and Applications, LNCS 1142, Springer, 1996.
W. Luk, N. Shirazi, P. Cheung: Modelling and Optimising Run-Time Reconfigurable Systems. FPGAs for Custom Computing Machines '96, IEEE Computer Society Press, 1996.
P. Lysaght, J. Dunlop: Dynamic Reconfiguration of FPGAs. More FPGAs: Proc. 1993 Intl. Workshop on Field Programmable Logic and Applications, 1993.
L. M. Monier, J. Dion: Recursive Layout Generation. Proc. 16th Conference on Advanced Research in VLSI, IEEE Computer Society Press, 1995.
E. Mirsky, A. DeHon: MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
E. F. Moore: Shortest Path Through a Maze. Annals of the Computation Laboratory of Harvard University, Harvard University Press, 1959.
H. Mössenböck, N. Wirth: The Programming Language Oberon-2. Structured Programming, Vol. 12, No. 4, 1991.
Motorola: Fast Static RAM Databook, 1995.
P. Müller: Arithmetische Einheiten auf FPGAs (Arithmetic Units on FPGAs). Term Project, Institute for Integrated Systems, ETH Zürich, 1997.
National Semiconductor: Series 32000 Microprocessors Databook, 1988.
[Bibliography, page 174] Keys: OE95, Ohr84, PCI93, Pfi92, Phi95, PTS93, Qui94, Raz94
72. Culbertson, P. Kuekes, G. Snider: Plasma: An FPGA for Million Gate Systems. Proc. Intl. Symposium on Field Programmable Gate Arrays, ACM, 1996.
J. M. Arnold, D. A. Buell, E. G. Davis: Splash 2. Proc. 4th Annual ACM Symposium on Parallel Algorithms and Architectures, 1992.
P. M. Athanas, H. F. Silverman: Processor Reconfiguration Through Instruction-Set Metamorphosis. IEEE Computer, Vol. 26, No. 3, March 1993.
Atmel: Configurable Logic Design & Application Book, 1995.
P. Bertin, D. Roncin, J. Vuillemin: Introduction to Programmable Active Memories. Systolic Array Processors, Prentice Hall, 1989.
P. Bertin: Mémoires actives programmables: conception, réalisation et programmation (Programmable Active Memories: Conception, Realization and Programming). Dissertation, Paris University, 1993.
P. Bertin, H. Touati: PAM Programming Environments: Practice and Experience. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1994.
T. Blickle: Theory of Evolutionary Algorithms and Application to System Synthesis. Dissertation 11894, ETH Zürich, 1996.
R. S. Boyer, J. S. Moore: A Fast String Searching Algorithm. Communications of the ACM, Vol. 20, No. 10, October 1977.
[Bibliography, pages 168 and 169] Keys: BRA96, Bre96, BG95, Bre77, BFR92, BKV96, Bry86, Bry92, Buc96, BJL92, Cas96, Chr95, CKW95, CH96, Con96, Coo71, CLR90, C0093
Berkele
73. D list := list.next END
within the current type. Therefore, all instances contained in the type must be processed, and all input signals of those instances must be connected to the sources which are contained in the current type. (6) All nets to be routed are collected in a data structure rooted in the router data structure. The nets are first sorted according to a distance criterion discussed in Section 5.6.3 and are then routed using a Lee map algorithm, which is introduced in Section 5.1.3 and described in more detail in Section 5.6.4. The information on wires that need to be inserted into the Trianus data structure is stored in the router. After all nets have been routed, this information is broadcast to all instances of the current type using a type broadcast.
5.6.2 Finding the Nets
To find out which nets of a type need to be routed, all placed nodes of the type are traversed sequentially using the list rooted in type.y (cf. Program 3.7). If a node represents a gate, its x and y subtrees point to the source signals read by that node. If the position of the source node is different from the current node's position, the net is appended to a list of to-be-routed nets, which is rooted in the router data structure (field nets in Program 5.13). For each net, this list contains the source and destination nodes as well as the input to which the source has to be connected (e.g. the upper or the lower input of an AND gate).
5.6.3 Scheduling the Net
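The net-finding step of Section 5.6.2 can be sketched in a few lines of Python. The sketch is only an approximation of the idea: the class `Node` below is a hypothetical stand-in with a name, a cell position and two operand references, not the Trianus type of the same name, and `find_nets` is an assumed helper. A net is recorded whenever an operand sits in a different cell than the node reading it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    name: str
    pos: Tuple[int, int]             # cell the node was placed in
    x: Optional["Node"] = None       # first operand (source signal)
    y: Optional["Node"] = None       # second operand (source signal)

def find_nets(placed: List[Node]):
    """Collect (source, destination, input) for every connection that
    leaves the destination's cell and therefore needs routing."""
    nets = []
    for dst in placed:
        for inp in ("x", "y"):
            src = getattr(dst, inp)
            if src is not None and src.pos != dst.pos:
                nets.append((src.name, dst.name, inp))
    return nets

a = Node("a", (0, 0))
b = Node("b", (0, 1))
g = Node("g", (1, 0), x=a, y=b)      # gate reading a and b
r = Node("r", (1, 0), x=g)           # register merged into the same cell as g
print(find_nets([a, b, g, r]))
```

The connection from `g` to `r` produces no net because both nodes share one cell, which mirrors the position comparison described in the text.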
74. EE Standard VHDL Language Reference Manual. Institute for Electrical and Electronic Engineers, 1987.
Institute for Computer Systems: The Oberon Archive. ftp://ftp.inf.ethz.ch/pub/software/Oberon
C. Iseli: Spyder: A Reconfigurable Processor Development System. Dissertation 1476, EPF Lausanne, 1996.
C. Iseli, E. Sanchez: A C++ Compiler for FPGA Custom Execution Units Synthesis. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1995.
H. Johnson, M. Graham: High-Speed Digital Design: A Handbook of Black Magic. Prentice Hall, 1993.
R. M. Karp: Combinatorics, Complexity, and Randomness. Communications of the ACM, Vol. 29, No. 2, February 1986.
T. A. Kean: Configurable Logic: A Dynamically Programmable Cellular Architecture and its VLSI Implementation. Thesis CST-62-89, Univ. of Edinburgh, 1989.
T. A. Kean, J. Gray: Configurable Hardware: Two Case Studies of Micro-Grain Computation. Systolic Array Processors, Prentice Hall, 1989.
T. A. Kean, Xilinx Development Corporation, Scotland: Personal Communication, 1996.
T. A. Kean, I. Buchanan: The Use of FPGAs in a Novel Computing Subsystem. Proc. 1st Intl. ACM/SIGDA Workshop on FPGAs, ACM Press, 1992.
T. Kean, B. New, B. Slous: A Fast Constant Coefficient Multiplier for the XC6200. Proc. 6th Intl. Workshop on Field Programmable Logic and Applications, LNCS 1142, Springer, 1996.
B. Kernighan, S. Lin: An Efficient Heuristic Procedure for Par
75. ER BEGIN set pattern i FOR j 0 TO 3 DO write character into input register in val ORD pattern j HI SetMap in in Write shift once gt input register is mapped into data register 3 HI SetMap shiftReg shiftReg val 1 shiftReg Write read mapped value from data register 3 HI SetMap data 3 data 3 Read load mapped value into pattern register HI SetMapf patli j patli j val CHR ORD data 3 val MOD 32 patli j Write END END PutPat written by programmer 2 PROCEDURE IgnorePat i INTEGER VAR j INTEGER BEGIN load pattern i with unused pattern not occurring characters FOR j 0 TO 3 DO patli j val IgnoreChar HI SetMap patti j patli j Write END END IgnorePat continued in Program 6 8 6 Application and Evaluation 122 Program 6 8 PatternMatch Software Interface II continued from Program 6 7 automatically generated PROCEDURE Init BEGIN HI Load PatternMatch XC6Bits load bitstream IF HLres HI Done THEN error processing ELSE HI map 0 1 31 0 31 HI mapl1 0 0 31 HL InitDescriptor in in HI map 4 in val 0 HI map 0 1 8 0 31 HLmap 1 0 31 HI InitDescriptor result result HI map 17 result val 0X HI map 0 1 31 0 31 HLmap 1 0 0 31 HI InitDescriptor shiftReg shiftReg HI map 0 shiftReg val 0 HI map 0 1 5 0 31 HI map 1 0 31 HI InitDescriptor pat
76. F.2 Register Slice. The layout of one complete ALU slice is shown in Figure F.7. A data flow diagram of the ALU is shown in Figure F.3. An ALU instruction has two source registers (x, y) and one destination register (z). [Figure F.3: Data Flow] F.1.3 Control Unit. The control unit, consisting of the program counter and address generator circuitry, is 16 bits wide and is used to address external memory. The address used in a particular step is either the value of the program counter (for fetching the next instruction), the value from the instruction register (for jumps), or the value from the ALU (for return jumps). Figure F.4 shows two bits of the control unit. The program counter is shown on the left, and the multiplexers for the address selection are shown on the right. F.1.4 Decoders and Instructions. Control signals for the various multiplexers in the ALU and the program counter are generated by the decoding circuits below and above the ALU and the control unit. The decoding circuitry takes as inputs the instruction register holding the current instruction. F Wotan Microprocessor. Wotan implements the
77. FPGA and local memory, possibly with DMA capability. We now discuss these alternatives, presenting their advantages and disadvantages. 4.2.1 FPGA Attached to the CPU. Conceptually, the first alternative from the list above is the cleanest, as it most closely resembles the traditional definition of a coprocessor (cf. Figure 4.2). If the FPGA is attached to the CPU via a coprocessor interface (Alt. 1), then it can access memory only through the CPU. If it is attached to the memory-processor bus (Alt. 2), then it can access memory directly. Both alternatives have one major advantage, namely that the latency of data transfers between the CPU and the RC is as small as possible. Lower latency can only be achieved by incorporating the FPGA directly on the CPU chip [BRA96, DeH96, ECF96, Raz94]. It is a good setup when an RC is used to implement small statement sequences within a software loop, where just a few words of data need to be exchanged between the CPU and the RC. If required, the RC has access to main memory just as the CPU and can easily share data with a software application. 4 Hades Hardware. [Figure 4.2: FPGA Attached to the CPU (Two Alternatives)] But there are also several problems with the two alternatives. Alt. 1: If the FPGA is
78. G shiftReg i 1 END shift shiftReg 0 continued in Program 6 6 6 Application and Evaluation 116 Program 6 6 Complete Pattern Matcher II continued from Program 6 5 input register direct memory write from host input3 input 2 shift input3 q shift characters down FOR i 2 0 1 DO input i shift input i 1 q END map input 0 q strip input characters to 5 bits stream of input data data PatternSize 1 shift map out shift characters down FOR i 0 PatternSize 2 DO data i shift data i 1 q END instantiate pattern registers FOR j 0 NofPatterns 1 DO pattern matchers FOR i 0 PatternSize 2 DO compare data with pattern equal chain cmp j i data i q pat j i q eql j i eql j i 1 cmp j i eql END cmp j PatternSize 1 data PatternSize 1 q pat j PatternSize 1 q start eql chain eql j PatternSize 1 cmp j PatternSize 1 eql END patMatch 0 eql 0 0 FOR i 1 NofPatterns 1 DO Or gate patMatch i eql i 0 patMatch i 1 END match patMatch NofPatterns 1 Does any pattern match result queue queue O shift match FOR i 1 ResultSize 2 DO queue i shift queue i 1 q END END PatternMatch 6 Application and Evaluation 117 1 Variable in is only needed to make the variables input3 q and input 0 q input 2 q acces sible under one name cf Section 6 2 10 2 eql represents the AND gates that are used to link individual character comparators to
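The behaviour of the pattern matcher in Programs 6.5 and 6.6 can be approximated by a small software model. The sketch below is written in Python and makes several assumptions: the names PATTERN_SIZE, strip5, and match_stream are hypothetical, the pattern length of four characters is taken from the pattern registers in the software interface, and the hardware pipeline (shift registers, result queue) is collapsed into a sliding window.

```python
from typing import List

PATTERN_SIZE = 4  # assumed: four characters per pattern register set


def strip5(ch: str) -> int:
    """Strip a character to 5 bits, as the map stage of the circuit does."""
    return ord(ch) % 32


def match_stream(text: str, patterns: List[str]) -> List[bool]:
    """For each input character, report whether any pattern matches the
    last PATTERN_SIZE characters seen so far."""
    pats = [[strip5(c) for c in p] for p in patterns]
    window = [None] * PATTERN_SIZE          # models the data registers
    result = []
    for ch in text:
        window = window[1:] + [strip5(ch)]  # shift characters down
        # per-pattern equality chain (AND), then OR over all patterns
        result.append(any(window == p for p in pats))
    return result


hits = [i for i, m in enumerate(match_stream("a MODULE", ["MODU"])) if m]
lower_hits = [i for i, m in enumerate(match_stream("a module", ["MODU"])) if m]
```

Because only the low five bits of each character are compared, upper-case and lower-case letters are indistinguishable in this model, which is why both example inputs report a match at the same position.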
79. Hades: Fast Hardware Synthesis Tools and a Reconfigurable Coprocessor. Diss. ETH No. 12276. A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH (ETH ZURICH) for the degree of Doctor of Technical Sciences, presented by Stefan Hans Melchior Ludwig, Dipl. Informatik-Ing. ETH, born May 21, 1966, citizen of Schiers, Graubünden, accepted on the recommendation of Prof. Dr. N. Wirth, examiner, and Prof. Dr. H. Eberle, co-examiner. 1997. Stefan H. M. Ludwig, 1997. For Irene, Vanessa, and Cyril. For my parents. Acknowledgments. I would like to express my appreciation and gratitude to my advisor, Prof. Niklaus Wirth. His striving for simplicity and understandability is unparalleled. If Hades is fast, it is because of the constant fear of spending a cycle too much here or a byte too much there. He is a fabulous teacher, and it was a pleasure to work under his supervision. I thank Prof. Hans Eberle for being my co-examiner. His knowledge in hardware design was very welcome during the development of the Hades hardware, and his constant skepticism of the feasibility of reconfigurable coprocessors was a driving force behind this work. Many thanks go to my colleague, collaborator, and office mate for the past four years, Stephan Gehring, for his excellent Trianus framework, for the discussions and feedback, for the criticism, and for his will
80. Hades reconfigurable coprocessor consists of a single XC6216 FPGA and local memory in the form of static RAM. The coprocessor is attached via a memory-mapped interface to a workstation. The Hades software is a layout synthesis back end for the XC6200 architecture. The front end to our tools is Trianus, a framework for FPGA design. The hardware description language Lola is used to describe the algorithms. The Hades software is composed of: a technology mapper; a deterministic and constructive placement algorithm, which relies on placement hints given by the user to achieve dense layouts; a maze-running router, which can be influenced by the user in various ways; a configuration bitstream generator; and an interface generator, which generates a software interface to a hardware application automatically. The resulting system achieves very fast turnaround times for layout synthesis, on the order of seconds on contemporary hardware. The Hades software is at least an order of magnitude faster than commercially available tools for the same FPGA architecture. The fast turnaround times open up a new way for interactively designing hardware and effectively bring the designer's knowledge into the design cycle. 1 Introduction. Hades: Greek god, brother of Zeus, lord of the underworld, ruler of the dead, god of wealth. Ever since the conception of the first mechanical calculator (Abacus) several thousand years ago, m
81. INT send msg to all nodes and wires in inst sel is the broadcast selector. 3 Foundations: Lola and Trianus. Adder type shown in Program 3.1. Within that instance, 8 full adders (AddElem) are generated, each consisting of the data structure shown in Figure 3.4. Compilation combined with expansion is several orders of magnitude faster than with traditional VHDL or Verilog compilers, which notably also perform more work. See Section 6 for performance data. For a more detailed discussion of the translation and expansion steps, see [Geh97]. 3.3.5 XC6200 Layout Editor. The layout editor is one of many back-end tools in Trianus and is built upon the generic editor framework. It provides a low-level view of XC6200 designs as a matrix of cells with routing switches in between. Since the XC6200 FPGA features a very simple cell, its function is easily displayed by the editor as a gate, possibly in conjunction with a register. The design hierarchy and the signal names are also visualized to reflect the design structure. This almost schematic-like view helps the designer identify parts of a design quickly. A displayed design is manipulated using the mouse. Figure 3.5 shows an instance of an AddElem with three gates and the wires between the gates. Note that the carry is implemented using the half sum and a multiplexer instead of two AND gates and an OR gate. The squares represent logic cells, and the rectangles between the squares represent switches at 4x4
82. N ir z ci asl as0 BIT OUT pc co a BIT VAR a0 BIT BEGIN increment pc pe REG a ci co a ci a0 MUX asO ir z a MUX asl pc a0 IF Place 1 THEN pe 0 0 co 0 1 a 1 0 a0 1 I END END PCslice 3 8 decoder instr reg alu output carry in selectors pc carry out address instr reg or alu output instr reg alu output or pc TYPE Decoder Place IN a 3 BIT OUT y 8 BIT BEGIN y 0 a 2 a 1 a 0 000 y l a2 a 1 a 0 001 y 2 a 2 a 1 a 0 010 y 3 a 2 a 1 a 0 011 y 4 a 2 a 1 a 0 100 y 5 z a 2 a 1 a 0 101 y 6 a 2 a 1 a 0 110 y 7 a 2 a 1 a 0 111 IF Place 1 THEN FOR i 0 7 DO y i 2 1 0 END END END Decoder 3 8 decoder with enable TYPE EnDecoder Place IN en BIT a 3 BIT OUT v 8 BIT BEGIN y 0 a 2 al a0 en y 1 a 2 a 1 a 0 en 000 001 y 2 a 2 a 1 a 0 en 010 y 3 a 2 a 1 a 0 en 011 y 4 a 2 a 1 a 0 en y S a 2 a 1 a 0 en y 6 a 2 a 1 a 0 en y 7 a 2 a 1 a 0 en IF Place 1 THEN FOR i 0 7 DO yi 2 1 0 END END END EnDecoder CONST DataN 24 AddrN 16 100 101 110 111 Place 1 AluXOff 4 AluYOff
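The 3-to-8 decoder types in this listing (with and without enable) can be modelled at the gate level. The following Python sketch is only an illustration of the logic; the function name decoder and its argument order are assumptions, and the placement statements of the Lola source are not modelled.

```python
def decoder(a2: int, a1: int, a0: int, en: int = 1) -> list:
    """3-to-8 decoder: output i is the AND of the enable and the three
    address bits, each taken plain or negated according to the bits of i."""
    y = []
    for i in range(8):
        wanted = ((i >> 2) & 1, (i >> 1) & 1, i & 1)
        term = en
        for bit, w in zip((a2, a1, a0), wanted):
            term &= bit if w else 1 - bit   # negate the bit where i has a 0
        y.append(term)
    return y


selected = decoder(1, 0, 1)          # address 101 selects output 5
disabled = decoder(1, 0, 1, en=0)    # enable low forces all outputs to 0
```

With the enable fixed at 1 the function behaves like the plain Decoder type; passing en=0 corresponds to the EnDecoder variant with its enable input deasserted.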
83. NIL amp sig y NIL 110 PositionNode p sig u V to place sig and set sig id myWx Cell myHx Cell TriBase Buf TriBase Not merge negations and buffers into cell u v if possible TriBase And TriBase Or mapper guarantees next assertion ASSERT sig y fct TriBase Not 111 PositionNode p sig u v to place sig and set sig id IF sig x fct TriBase Not THEN x y x y place negation on first subtree into same cell u v PlaceArg p sig x u v 0 0 XCBase A myWx myHx ELSE place first subtree to the right at u Cell v PlaceArg p sig x u v Cell 0 XCBase A myWx myHx END place second subtree above at u v myHx PlaceArg p sig y u v 0 myHx XCBase B myWy myHy calculate this node s own width and height continued in Program 5 11 If a cell contains a register there may be a feedback from the output of the register to the 5 Hades Software 76 Program 5 11 Placement of Nodes II continued from Program 5 10 TriBase Reg try to put argument gate of REG into same cell as REG take special care in treating feedbacks x REG x y PositionNode p sig u v to place sig and set sig id regl sig y rlu u rlv v mapper guarantees enable one and regl y is gate not label ASSERT reg1 x TriGen one 112 ASSERT reg1 y fct IN TriBase BIT TriBase TS 113 to TriBase Void last gate of data is in same cell as register PlaceArg p regl y u v 0 0 XCBase Func myW
84. Serif represent labels in figures or identifiers in programs. Names followed by an apostrophe represent signals which are active low. [Figure 4.1: Hades Hardware Part within the Design Flow of Fig. 1.3] 4 Hades Hardware. 4.1 Motivation. Many reconfigurable coprocessor (RC) boards based on FPGAs exist today. Guccione lists over 50 different designs [Guc94]. In Chapter 7 we give an overview of related work. The architecture and structure of these boards are quite similar, and the community working in the field of custom computing has a good understanding of what a reconfigurable coprocessor should look like. Most boards contain one or more FPGAs and interface hardware which connects the board to a host computer. These boards may contain some local memory which can be accessed by the FPGA(s) and sometimes by the host. This memory is mostly used for caching to overcome the limited communication bandwidth between the host computer and the coprocessor board [Ber93]. One deficiency of most current systems is that they are accessed through a special interface over a system bus. Compared to the usually very efficient protocol on a processor-memory bus, the protocol on a system bus often introduces undesired communication overhead (multiplexed address and data
85. Type list type IF thisTvpe id f Routed THEN not yet routed 2 msg hierarchies allowedHier IF module ft thisType collect legal hierarchies for routing 3 msg doNode LegalHierarchies msg tvpe thisType init u v minU minV maxU maxV of msg TriBase Broadcast module msg TriBase SelType move thisType to coords of first instance thisType u msg u thisType v msg v END NewRouter r module thisType msg hierarchies IF module thisType THEN InitFrominst r module mark used wires RouteGlobals r route clock and clear signals ELSE mark all used wires which run over current type s instances 4 msg doNode MarkWires msg tvpe thisType msg router r TriBase Broadcast module msg TriBase SelType END collect all nets to be routed and store them into r RoutePlacedNodes r thisType continued in Program 5 15 5 Hades Software 88 Program 5 15 Overview of Routing Algorithm II continued from Program 5 14 process all objects declared in thisType obj thisType dsc WHILE obj NIL DO IF obj fct TriBase Inst THEN it is an instance next assertion is guaranteed by topological sort ASSERT obj id Routed 110 connect inputs in obj with sources in thisType 5 Routelnputs r obj TriBase Instance END obj obj next END process list of nets to route and send a type broadcast 6 RouteAndUpdate r undo move of thisType IF mod thisType THEN thisType u 0 thisType v 0 END EN
86. [Figure 3.3: Trianus Types and Lola Constructs. Type: composite or generic types (modules); Instance: instances of composite types; Object: signals of type BIT, TS, OC (incl. elements of array type); Wire: no corresponding construct; Node: operators (REG, etc.)] The possible values and their meaning for the function field (fct) which are relevant to Hades are listed in Table 3.3. Additional values are possible and are used for representing the abstract syntax tree of a Lola program (Section 3.3.4). As an example, we describe in Figure 3.4 the data structure generated for the AddElem type in the Add example from Program 3.1. A similar data structure exists for every instance of such an AddElem. The dsc (descender) list of an 8-bit instantiation adder of the generic Adder type contains 8 instances of AddElem (add.0, add.1, etc.), each containing the descender list (x, y, etc.) shown in Figure 3.4. [Table 3.3: fct Values and their Meaning] [Figure 3.4: Data St
87. UT XCRD 4 BIT XC decoded read signals 0 3 XCWR 4 BIT XC decoded read write signals 0 3 XCGo BIT XC go flag VAR select XCSel RAMSel PortSel ComSel BIT Port 4 BIT busy BIT register BEGIN select CPUDS BoardAdr XCSel select A19 A18 00 xxx RAMSel select A19 A18 OI xxx 10 101 ComSel select A19 A18 A4 A3 A2 11 0xx PortSel select A19 A18 A4 Port 0 PortSel A3 7 A2 11 000 Port 1 PortSel A3 A2 11 001 Port 2 PortSel A3 7 A2 11 010 Port 3 PortSel A3 A2 11 011 FOR i 0 3 DO XCRD i Port i CPURW XCWR i Porti CPURW END XCGo REG MUX ComSel CPURW XCGo CPUDOIn busy REG MUX ComSel CPURW busy XCBusy CPUDO ComSel CPURW busy END DecoderXCRW F Wotan Microprocessor F 1 Architecture and Principle of Operation F 1 1 Overview Wotan is a small microprocessor designed by N Wirth It contains a 24 bit wide data path realized as 24 ALU slices and a 16 bit wide address path The data path contains 8 registers and has support for a multiply divide step Figure F 1 shows the floorplan of Wotan which is implemented by the layout shown in Figure F 8 Address Bus Data Bus Decode Instruction Program Counter Register Address Generator Shift
88. a reconfigurable coprocessor. Further information on configuration store technology can be found in [BFR92]. Among different FPGAs, there is a wide architectural variety in the functionality of the logic cells and the structure of the routing network. If a logic cell provides only simple functionality (e.g., any function of two inputs), then the architecture is called fine-grained. If it provides rich functionality (e.g., any function of four or more inputs and one or several optional registers), then the architecture is called coarse-grained. More complex logic cells often also require a more complex routing network. 2.3 The Xilinx XC6200. As our first example of a commercial FPGA, we examine the XC6200 architecture from Xilinx [Xil96], the successor of the CAL architecture from Algotronix [Alg90, Kea89], which is described in Section 2.4. It is an SRAM-based FPGA with an array of identical fine-grained cells and a hierarchical routing network. Two novel features of the architecture are a processor interface and very fast reconfiguration times, which make the chip suitable for coprocessor applications. The first implementation of the architecture, the XC6216, consists of an array of 64 by 64 cells. 2.3.1 Logic Cell. An XC6200 logic cell implements any logic function of two inputs or a multiplexer, possibly followed by a register. As shown in Figure 2.4, the Dynamic Mux to the right of Y2 is the only multiplexer that is controlled by a dynamic s
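The claim that a multiplexer-based cell can realize any logic function of two inputs can be checked with a short enumeration. The Python sketch below uses a deliberately simplified model that is an assumption of this example, not the actual XC6200 function unit: one 2:1 multiplexer whose select input is a and whose two data inputs can each be wired to a, b, or their complements. The names mux, SOURCES, truth_table, and wirings are hypothetical.

```python
from itertools import product


def mux(sel, d0, d1):
    """2:1 multiplexer: returns d1 when sel is 1, else d0."""
    return d1 if sel else d0


# Data sources assumed to be available to the multiplexer inputs.
SOURCES = {
    "a": lambda a, b: a,
    "~a": lambda a, b: 1 - a,
    "b": lambda a, b: b,
    "~b": lambda a, b: 1 - b,
}


def truth_table(d0, d1):
    """Outputs for (a, b) = 00, 01, 10, 11 with a driving the select input."""
    return tuple(
        mux(a, SOURCES[d0](a, b), SOURCES[d1](a, b))
        for a, b in product((0, 1), repeat=2)
    )


# Map every reachable truth table to the (d0, d1) wiring that produces it.
wirings = {truth_table(d0, d1): (d0, d1) for d0, d1 in product(SOURCES, repeat=2)}

and_wiring = wirings[(0, 0, 0, 1)]   # a AND b
or_wiring = wirings[(0, 1, 1, 1)]    # a OR b
xor_wiring = wirings[(0, 1, 1, 0)]   # a XOR b
```

In this model the 16 possible wirings produce 16 distinct truth tables, so every two-input function (including the constants) is covered. For instance, AND falls out as "select a, pass a when a is 0, pass b when a is 1".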
89. ades software design cycle. It could be improved by implementing separate algorithms for special routing cases such as straight or L-shaped connections. The speed of wave expansion of the maze-running algorithm could be improved by letting the wave spread from both ends: from the single target and from all source points [DM95]. Currently, the router treats multi-point nets as several two-point nets. Research results [BKV96] show that a special treatment of multi-point nets can be advantageous. 8 Summary, Conclusions, and Outlook. Timing and Automatic Retiming. In general, layout synthesis tools should incorporate timing information to make better decisions about the placement of cells and the routing of nets. An algorithm for the automatic retiming (insertion of pipeline registers) could be useful during the evaluation of a circuit's performance. Approaches such as [LSC96] and [Tra95] show promising results. 8.6.3 Hardware/Software Co-Design Issues. A topic not covered and not even mentioned in this thesis, except in this section, is hardware/software co-design. This is a very active area of research and promises a new way of designing electronic systems. Design partitioning is done either automatically [Bli96] or by the designer, but with extensive tool support. This thesis provides means to describe the hardware part bottom-up and provides a sufficiently high-level abstraction of the hardware to the software programmer. However th
90. al semantics for each construct. The main advantage of this approach should be that a programmer does not have to learn a new language to describe the hardware. We consider this as mistaken, since the same language has different semantics depending on the context in which it is used. 7.4.3 nlc/Spyder. A similar approach as in Transmogrifier C is taken in the nlc project from EPF Lausanne, Switzerland [Ise96, IS95]. A C compiler was developed for generating netlists from C programs. One goal of this project was to support simulation as well as synthesis. That is, the same C code can be compiled using a normal compiler to obtain an executable simulation of the hardware design. Commercial tools are used to perform layout synthesis. 7.5 The Need for Better Tools. Most of the described systems sooner or later require the use of commercial place-and-route tools to produce the final layout of a design. This is mainly due to the fact that the bitstream format for the FPGAs used was not available. Hence, these systems suffer from limited interactivity, long run times, and practically no integration (cf. Section 5.1.4). Some groups would implement their own tools if the bitstream format were made public. Many groups report on the need to give manual hints to successfully place and route dense designs. The Teramac group successfully implemented a custom computer with fully automatic synthesis tools. The cost for this was the development of a new
91. alternative to the Xilinx tools [VCC97]. 6.4.1 Using Iris and Hades for DSP Algorithms. The DSP laboratory of Queen's University Belfast coupled Hades to their Iris synthesis framework [Tra95, TWM95, TW96]. Iris works on the building-block philosophy, where the designer can define digital signal processing blocks and perform synthesis on these circuits. At first glance, this methodology may not appear attractive, but DSP designers typically like to mix and match circuits using different number representations and clever circuit design techniques to achieve efficient FPGA solutions. Iris achieves this by enabling the extraction of parameterized expressions from complex VLSI processing elements and using these expressions to achieve functionally correct solutions for circuits built from these processors. Designers can quickly create and evaluate architectures that utilize existing hardware blocks. Previously, Iris generated structural, parameterized VHDL code, which was then synthesized using the Synopsys VHDL compiler [Syn92]. Compilation times were on the order of hours. Iris has since been closely integrated with Hades, resulting in a powerful system capable of quickly investigating an FPGA implementation. This integration was achieved by developing a Lola interface which allows Iris to produce Lola code for the algorithm to be realized. The systems are well matched, as there is considerable structure within Iris which Hades can preserve and qu
92. am 3 1 Ripple Carry Adder Types in Lola TYPE AddElem full adder IN x y ci BIT inputs data carry OUT s co BIT outputs Sum Carry VAR h BIT half sum BEGIN h x y XOR s h ci XOR co x y h ci two ANDs and one OR END AddElem TYPE Adder N generic N bit ripple carry adder IN x y INI BIT ci BIT inputs 2 data vectors amp carry OUT s INI BIT co BIT outputs sum vector k carrv VAR add N AddElem instantiate N full add elements BEGIN add 0 x O y 0 ci unit assignment add two bits FOR i 1 N 1 DO add i x 1 y i add i 1 co END FOR i 0 N 1 DO s i add i s END co add N 1 co END Adder Thev are used to construct an 8 bit adder circuit shown in Program 3 2 The use of several operators and constant signals is shown as well as the instantiation of a generic type and the use of unit assignments The example is split up into two programs to make comparison with 3 Foundations Lola and Trianus 24 the subsequent VHDL and Verilog descriptions easier We will refer to this example in this and later chapters Program 3 2 Ripple Carry Adder in Lola MODULE Add types as defined in Program 3 1 CONST Bits 8 IN Idx rd BIT INOUT D Bits TS VAR adder Adder Bits instantiate generic X y Bits BIT BEGIN FOR i 0 Bits 1 DO store D bus if not read x i REG rd 1dx D i into x if ldx y i REG rd 1dx D i into y if ldx END adder x y 0 unit assig
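The logic of the AddElem and Adder types in Program 3.1 can be mirrored in a short behavioural model. The Python sketch below is not Lola and does not model registers, the tri-state bus, or placement; the helper names (add_elem, add_elem_mux, adder, to_bits, from_bits) are assumptions of this example. The multiplexer variant of the carry reflects the remark on the layout editor figure that the carry can be built from the half sum and a multiplexer.

```python
def add_elem(x: int, y: int, ci: int):
    """Full adder: half sum, sum, and carry from two ANDs and one OR."""
    h = x ^ y
    s = h ^ ci
    co = (x & y) | (h & ci)
    return s, co


def add_elem_mux(x: int, y: int, ci: int):
    """Same function with the carry formed by a multiplexer on the half sum:
    if h is 1 the carry-in propagates, otherwise x (which equals y) decides."""
    h = x ^ y
    return h ^ ci, (ci if h else x)


def adder(xs, ys, ci=0):
    """N-bit ripple-carry adder over little-endian bit lists."""
    s, carry = [], ci
    for x, y in zip(xs, ys):
        bit, carry = add_elem(x, y, carry)
        s.append(bit)
    return s, carry


def to_bits(value: int, n: int = 8):
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))


total_bits, carry_out = adder(to_bits(200), to_bits(100))
total = from_bits(total_bits)   # 300 modulo 256, with the carry-out set
```

The carry chain in adder corresponds to the FOR loop that feeds each element's carry-out into the next element's carry-in.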
93. am of the Oberon System, which is used to find occurrences of a single pattern in a set of files. To avoid file directory operations, we merged the source code of the Trianus and Hades systems, which constitute a total of 1.1 MB, into one file. We measured the time to search the word MODU in this file. It occurs 659 times. The software solution using the Boyer-Moore algorithm achieves a throughput of 3055 KB/s. It transfers disk data using block reads and comes to within 70% of the disk read speed. A naive string search algorithm has a throughput of 1358 KB/s. It, too, transfers disk data using block reads. This indicates that the Boyer-Moore algorithm can skip over a large number of characters, eliminating unnecessary tests. In fact, for the pattern MODU, 98.5% of the comparisons fail on the first compared character. The software/hardware solution shown in Program 6.9 achieves a throughput of 712 KB/s. Thus, it is 4.3 times slower than Boyer-Moore. To see if this is mainly due to reading data word-wise, we used block transfers. With a performance of 756 KB/s, the increase was only Program 6.7: PatternMatch Software Interface I. MODULE PatternMatchint IMPORT HI Hadesinterface VAR in HI LIntDesc result HI CharDesc shiftReg HI LIntDesc pat ARRAY 2 4 OF HI CharDesc data ARRAY 4 OF HI CharDesc written by programmer 1 PROCEDURE PutPat i INTEGER pattern ARRAY OF CHAR VAR j INTEG
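The ratios quoted in this comparison follow directly from the measured throughputs. The short Python calculation below simply redoes the arithmetic; the variable names are chosen for this example.

```python
# Measured throughputs in KB/s, as quoted in the text.
boyer_moore = 3055
naive = 1358
hw_word_wise = 712
hw_block = 756

slowdown = boyer_moore / hw_word_wise                        # hardware vs. Boyer-Moore
block_gain = (hw_block - hw_word_wise) / hw_word_wise * 100  # percent gained by block transfers
bm_over_naive = boyer_moore / naive                          # benefit of skipping characters
```

The first ratio rounds to the factor of 4.3 stated above, and the gain from block transfers comes out at roughly six percent.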
94. an interrupt. The XC6216 can assert an interrupt line (INT7), and a software handler in the user application can react to this signal. 4.5.6 Local Memory. The 256 KB of local memory are implemented with eight 64K x 4 bit 15 ns SRAM chips [Mot95]. The chips have separate output and write enable signals, where the write enable overrides the output enable. The decoder derives separate write enable signals for each byte from the host's byte enable and from the XC6216's write enable signals. The high speed of 15 ns is needed to enable a 40 ns memory cycle (25 MHz clock) when accessing the SRAM. The maximum access time to local memory can be calculated as shown in Table 4.2, all values being worst case. The decoder PAL is included in the calculation, as the write and output enable signals for the SRAM are generated by that PAL. [Table 4.2: Worst-Case Access Time to Local SRAM. Columns: Signal, Delay (ns). Rows: Clock to Function Output, Neighbor to Pad, 22V10, SRAM Access, Pad to Neighbor, Register Set-Up, Clock Skew, Total.] An access time less than 40 ns is important such that it is possible to switch between reading and writing the SRAM in consecutive clock cycles. Higher speeds can be achieved when data is only read or only written: during reads, the RAMOE control signal is kept enabled and only the addresses are changed; during writes, the RAMWE control signal is toggled and address and data are changed. 4.6 Constructing the Board. In th
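The memory size and clock figures in this section are mutually consistent, as a quick calculation shows. The Python lines below are a sanity check only; the 32-bit data width is an inference that assumes all eight 4-bit chips are operated in parallel.

```python
chips = 8
words_per_chip = 64 * 1024     # 64K words per chip
bits_per_word = 4

total_bytes = chips * words_per_chip * bits_per_word // 8
data_width_bits = chips * bits_per_word    # assumed: all chips side by side

cycle_ns = 40
clock_mhz = 1000 / cycle_ns    # a 40 ns cycle corresponds to a 25 MHz clock
```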
95. and or xor neg Dz y Dx y Dy y END zeroes DataN 1 AddrN 0 alu DataN 1 D DataN 1 IR 11 zeroes DataN 1 AddrN alu DataN 2 so alu DataN 1 co alu DataN 2 r alu O so alu DataN 2 co alu DataN 2 zo F Wotan Microprocessor 163 ds im pcs shen shd and or xor neg Dz y Dx y Dy y status flags N REG ren alu DataN 1 z ALU z negative Z REG ren alu DataN 1 zo ALU z zero C REG ren alu DataN 1 co carry set PC and address generation cu O IR O alu O z iren asl as0 A 0 AOE cu 0 a FOR i 1 AddrN 1 DO cu i IR i alu i z cu 1 1 co asl as0 Ai AOE cu i a END FOR i 0 DataN 1 DO D i RAMOE alu i Y END control signals ds phl IR 21 data select load instruction im IR 22 immediate operand select pc for register file branch subroutine and link pes IR 23 IR 22 IR 21 IR 20 IR 19 IR 18 shen IR 23 IR 21 shift instruction shd IR 18 and IR 23 IR 19 IR 18 or IR 23 IR 19 IR 18 xor IR 23 IR 19 IR 18 neg IR 20 state machine LS IR 23 IR 22 load or store ph0 REGC iren phl REG phO LS load or store phase ph2 REG phl iren ph0 LS ph2 instruction reg enable ren ph0 IR 23 phl IR 21 reg enable ALU instr or load RAMOE
96. and routing resources. The wildcard register is used to configure regular structures which occur, for example, in every second column of every fourth row of the chip. The mask register is used to change only the relevant bits, for example, of a north multiplexer within a cell. For state access, the map register is used to map bits on the data bus to individual cells in a column, e.g., bit 0 of the data bus is mapped to the cell in row 1, bit 1 to the cell in row 3, etc. [CKW95] and [Xil96] treat this subject in more detail. 2 Field-Programmable Gate Arrays. 2.3.5 Summary. The XC6200 has a simple, regular structure with simple logic cells and a hierarchical, unidirectional routing network. A fast programming and access interface can be used for rapid partial reconfiguration and for accessing (reading and writing) user registers without using routing resources. On the downside, inversions on routing and logic multiplexers complicate the implementation of various software tools, and the magic routing resources break the symmetry of the other routing resources. 2.4 Other Architectures. In this section, we give an overview of three other FPGA architectures, namely the CAL by Algotronix, the XC4000EX by Xilinx, and the AT6000 by Atmel. There are many more FPGAs on the market, and discussing them all would be beyond the scope of this introduction. A more detailed, if somewhat older, presentation of different architectures can be found in [BFR92]
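The effect of the map register on state access can be pictured as a scatter and gather between data-bus bits and rows of a column. The Python sketch below is an abstract model under stated assumptions: the map is represented as a plain list of row numbers, and the functions scatter and gather are hypothetical names. The real register encoding is described in the Xilinx documentation and is not reproduced here.

```python
def scatter(bus_word: int, mapped_rows: list) -> dict:
    """Write access: bit i of the data bus goes to the cell in mapped_rows[i]."""
    return {row: (bus_word >> i) & 1 for i, row in enumerate(mapped_rows)}


def gather(column: dict, mapped_rows: list) -> int:
    """Read access: collect the mapped cells of a column into one bus word."""
    return sum(column[row] << i for i, row in enumerate(mapped_rows))


rows = [1, 3, 5, 7]                  # bit 0 -> row 1, bit 1 -> row 3, ...
column_state = scatter(0b1010, rows)
read_back = gather(column_state, rows)
```

The point of the mapping is that registers scattered over non-adjacent rows still appear to the host as one contiguous word.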
97. and set/reset latch operators. Note that the register is not a special signal type but an operator; it can appear anywhere in an expression. This has the advantage that enable and special clock signals can be associated with the register in an expression instead of in the declaration part. [Table 3.1: Lola Operators. not (negation); and (conjunction); or (disjunction); exclusive or; MUX (multiplexer); REG (register, optional clock and load enable); LATCH (latch with load enable); SR (latch with set and reset, both active low)] An expression can be composed of operators, variables, numeric values, and numeric expressions. If a variable is of type array, a selector may be used to specify an individual signal, e.g., a.i or a[i] specify the i-th signal of an array. If a variable is of composite type, a selector may be used to specify an output signal of that type, e.g., a.co would specify the carry-out signal of an adder instance. The control statements FOR and IF are typically used to iterate over an array variable and to treat special cases in generic types, respectively. Position statements can be used to annotate variable names with positional information. They make a Lola program target dependent, and their interpretation is left to the synthesis back end. They are typically used to give hints to a placement algorithm or to specify pin locations (see Section 5.5
98. and wires of the data structure. The broadcast invokes a procedure on each node to generate the configuration bits for a cell, and on each wire to generate the configuration bits for the routing multiplexers. The configuration bits are stored into an array which is a mirror image of the XC6200's configuration memory. This array of bytes can be stored into a file for later use or downloaded directly onto the Hades hardware. Figure 5.22 depicts the function unit of the XC6200. This is the same figure as in Section 2.3, repeated here for convenience. Inversion Compensation for Routing Multiplexers. The major problem during the development of the bitstream generator is the presence of inversions on routing multiplexers [Xil96]. We already mentioned this problem in Section 2.3. A process called inversion compensation has to be implemented and executed for each cell input to determine the polarity of the input signal. Once this is done, the multiplexers of the cell can be programmed such that the cell implements the desired function. For example, if the cell should implement F = a ∧ b, we showed in Section 2.3 that the upper part of Figure 5.23 implements the desired function, provided that the polarities of the input signals are correct. If, for instance, the a input passes from its source to the destination cell through an odd number of inverting routing multiplexers, the AND function must be implemented as shown in the lower part of Figure 5.23. The w
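The idea behind inversion compensation can be illustrated with a parity argument: a signal that crosses an odd number of inverting multiplexers arrives inverted, so the cell has to undo that inversion before applying its function. The Python sketch below is a simplified model, not the bitstream generator's algorithm; the functions route, arrives_inverted, and compensated_and are hypothetical, and the cell is modelled as plain Boolean logic rather than as multiplexer settings.

```python
def route(value: int, inverting_muxes: int) -> int:
    """Signal value at the cell input after the given number of inversions."""
    return value ^ (inverting_muxes % 2)


def arrives_inverted(inverting_muxes: int) -> bool:
    """Polarity of a cell input: inverted if the path has odd parity."""
    return inverting_muxes % 2 == 1


def compensated_and(n_a: int, n_b: int):
    """Build a cell function that yields a AND b at the sources, given how
    many inverting multiplexers lie on the paths of inputs a and b."""
    inv_a, inv_b = arrives_inverted(n_a), arrives_inverted(n_b)

    def cell(a_in: int, b_in: int) -> int:
        a = a_in ^ int(inv_a)   # undo the inversion picked up while routing
        b = b_in ^ int(inv_b)
        return a & b

    return cell


# Check every combination of path parities and source values.
ok = all(
    compensated_and(n_a, n_b)(route(a, n_a), route(b, n_b)) == (a & b)
    for n_a in range(4) for n_b in range(4)
    for a in (0, 1) for b in (0, 1)
)
```

In hardware the compensation is absorbed into the cell's multiplexer configuration instead of extra inverters, which is why the polarity of every input has to be known before the cell can be programmed.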
99. ankind has striven to speed up the brain straining task of calculating with numbers Since the introduction of the digital computer in the late 1930s and early 1940s the speed at which computations can be executed has increased by six orders of magnitude 10 additions s in 1946 vs 10 in 1996 Likewise power consumption and cost have decreased dramatically Gordon Moore s law was stated in 1962 and is still valid today It is not a law but a prediction saying that the number of transistors on a chip doubles every 18 months Corollaries to this prediction are that the speed of a circuit doubles every 18 months or that the same performance can be bought for half the price after 18 months As an example the Intel 8080 microprocessor introduced in 1975 consisted of 4 500 transistors The Pentium Pro introduced by the same vendor in 1995 contains 5 5 million transistors As a consequence of this development we are able to buy a computer today early 1997 clocked at 200 MHz with a 32 bit wide data path executing half a billion instructions per second with 32 MB of main memory and 2 GB of disk space for 3 000 and put it on our desktop The same machine would have been termed a Supercomputer only two decades ago This dramatic increase in computational power can be attributed to several factors e First and foremost technological advances in circuit manufacturing pushed clock speed and circuit density to levels not imaginable in the beginning
100. appings manually and used a logic minimization program [Hof96] based on the Quine-McCluskey method to find a minimal expression tree for each output bit. To reduce the resulting circuit, subexpressions such as t1x0xxxx were defined manually. Figure 6.2 shows the default placement of an instance of the Mapper type as obtained from Hades. The expression trees are spread out and the resulting instance is quite big (38 cells in an area of 8x14, with 34% utilization). The layout is routable but can be reduced in size manually. The input-to-output delay of the mapping function is 17.5 ns. Next, we preplace all output and variable signals with the help of position assignments to obtain the improved placement shown in Figure 6.3 (38 cells in an area of 5x15, with 51% utilization). The delay is increased to 21 ns. A further reduction in size can only be achieved if we break up longer expressions into named subexpressions, which we can then preplace with position assignments. Ideally, to have optimal control over the layout, each operator in Lola should have a name which can then be placed manually. However, this is too cumbersome and not needed in most cases. Since only one instance of the Mapper type exists in our application, the layout shown in Figure 6.2 is sufficiently dense for our purpose.
6.2.6 Comparators
Figure 6.4 shows the schema of a 5-bit comparator circuit. The data bits are compared to the pattern bits using XNOR gates. These are linked togethe
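The utilization figures quoted for the two Mapper placements follow directly from the cell count and the bounding box. A minimal sketch of that arithmetic, assuming utilization means used cells divided by the cells in the bounding box (the function name is hypothetical):

```python
def utilization(cells_used, width, height):
    """Percentage of the bounding box occupied by used cells."""
    return 100.0 * cells_used / (width * height)


# Default placement: 38 cells in an 8x14 box.
default_placement = utilization(38, 8, 14)
# Preplaced version: 38 cells in a 5x15 box.
improved_placement = utilization(38, 5, 15)
```

Rounded to whole percent, these reproduce the 34% and 51% stated for the two placements.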
101. ard is equipped with local memory. The data to be processed by the FPGA can be transferred from the source (main memory or IO board) into local memory on the RC in one large chunk using programmed IO or DMA. Hence, the RC and the CPU can pursue work concurrently after the transfer, and the RC is not loading the system bus with read requests. This setup is very common for RC boards nowadays. It decouples the RC from the rest of the system. Thereby, the control part of an application on the host side can remain small. Additionally, since the card has local memory, it can act as a smart input preprocessor, performing some filtering operations on incoming data before sending the data to the host [VBR96]. The data might be transferred to the RC via DMA from an IO board, or there might be a connector for hardware extensions included on the RC. As with the solution in Section 4.2.2, the hardware for this card is relatively easy to build due to the well-defined interface to the system bus. The DMA control in Figure 4.3 could be left out, as the FPGA with its local memory can act as a stand-alone computational unit which has to communicate with the host only rarely.
4.3 Overview of the Hades Reconfigurable Coprocessor
The Hades hardware consists of an extension card for the Ceres 2 workstation (cf. Section 4.4). It features one XC6216 FPGA and 256 KB of fast SRAM. An address decoder is realized with three 22V10 PALs [AMD95, Cyp95]. The card is accessed like
102. ary to achieve a dense and fast layout. Using the automatic placer of Hades without placement hints, one ALU slice has a bounding box of 20x12 cells with a utilization of 18%. The design does not fit into an XC6216. After manually optimizing the ALU slice, it has a bounding box of 24x2 with a utilization of 88%. The layout is shown in Figure F.7. The remaining logic is preplaced as well to ensure a routable design. Floor planning is essential with this design, as control signals such as the shift control signal shen need to be in the correct column. To get a correct, routable layout using the placer and router interactively, about 12 hours were needed. This includes the time to understand the design.
[Figure F.7: ALU Slice]
The quick response of the Hades tools was a prerequisite for trying out different placements of the ALU slice to see whether the resulting layout was routable or not. Table F.3 compares the performance of the Hades tools with the XACT step Series 6000 software from Xilinx. As already shown in Chapter 6, Hades is an order of magnitude faster. First, we let XACT place the design automatically. This took 5 minutes and resulted in a layout with components left unplaced. Then we placed the design by inserting hints into the Lola code. Note that this process still took much longer than when using Hades becau
103. as before END AddElem VAR x y sub BIT add AddElem BEGIN add x y sub sub add or subtract XOR is anonymous and only accessible through add y
Program 5.6: Buried Inputs and Outputs
MODULE Buried IMPORT Adders IN x y 2 BIT OUT s 2 BIT VAR a Adder 2 BEGIN a x y 0 FOR 1 0 1 DO s i a s 1 END END Buried buried inputs buried outputs ripple carry adder
[Figure 5.9: Buried Inputs and Outputs]
Various Mappings
In addition to the tasks described above, the mapper translates coordinates of Lola position statements, given in user coordinates (cell-based), into model coordinates used by the Trianus framework. The mapper also inserts a buffer when an output variable directly reads an input variable; hence an assignment of the form out := in is translated into out := BUF(in). Additionally, it inserts a buffer between a global output or tri-state signal and its corresponding gate. This is needed for a process called inversion compensation, which will be described in Section 5.7. So when out is assigned a gate expression over a and b, the mapper wraps that expression in BUF before assigning it to out.
5.4.2 Discussion
Although not a difficult problem
104. ation in the FPGA is driving them. Enable Output Enable. Reset the FPGA via the Reset signal. Clear the register in the FPGA via the GClr signal. Issue a single clock pulse. Asynchronous communication: a write to this region sets or clears the Go flag in the decoder, depending on the value of data bus bit 0. This flag can be read by the FPGA. A read from this region returns the value of the Busy flag on data bus bit 0. The Busy signal is driven by the FPGA.
9: Reserved.
10: General purpose read and write port (GPP). Individual read and write signals are generated from CPURW and the values of A[3:2]. This region is for port 0.
11: GPP 1.
12: GPP 2.
13: GPP 3.
14: Reserved.
4.5.2 XC6216 Interface
As shown in Figure 4.4, the west side (left) of the XC6216 FPGA is connected to the data bus and the east side (right) to the address bus. This is demanded by the pinout of the chip. The south side (bottom) connects to the control signals, and the north side (top) can be used freely via the expansion connectors. The XCCS chip select signal of the FPGA is asserted by the decoder whenever an access to the FPGA is made (address space 1 in Table 4.1). The XC6216's read/write signal is driven by the R/W signal of the Ceres bus. The XCOE global output enable of the XC6216 is driven by a register in the decoder (oe), which can be set or reset by writing to address regions 3 or 4. This is a last measure against the case
105. attached to the CPU via a coprocessor interface, then a speedup gained by using the FPGA could be offset by the lack of parallelism between the CPU and the FPGA, as the former must be used to move data in and out of the FPGA.
Alt. 1: The design of the RC is not portable to different CPUs, and not even to different memory systems using the same CPU.
Alt. 1: Such a setup would require the construction of a new processor board, a task which was beyond the scope of this work.
Alt. 2: If the FPGA is attached to the processor bus like normal memory, then arbitration logic must guarantee that only one master is active, and memory would not be available at all times due to refreshes. Also, modern memory systems tend to be very complicated in order to achieve the high speeds needed to fill processor cache lines quickly. Hence, they do not perform equally well when smaller amounts of data are transferred or when the memory access pattern is random.
Alt. 2: As in any multiprocessor system, concurrency between the RC part and the CPU part of an application is hindered by the additional competition on the memory bus. Depending on the access pattern and the presence of a second-level cache, this problem can be very severe, i.e., the memory bus can be saturated quickly. The speed of applications executed on fast CPUs is often dominated by the memory access time, not by the time of computation. That is, the CPU spends most of its time waiting for the memory system instead
106. on today's computers, very fast compilation times in the range of seconds. The Hades software is at least an order of magnitude faster than the tools available from the manufacturer for the same FPGA architecture. The fast compilation times open up a new kind of interactive hardware development and make it possible to bring the developer's knowledge into the design cycle effectively.
Abstract
The advent of Field Programmable Gate Arrays has spurred an interest in building reconfigurable coprocessors, which are used to accelerate the time-intensive parts of software by casting them into programmable hardware. Applications running on reconfigurable coprocessors are developed using hardware description languages or schematic capture systems. These descriptions are translated into logic gates for which a layout has to be found. Logic and layout synthesis is a time-consuming process, and turnaround times of traditional hardware synthesis tools are up to four orders of magnitude longer than those of software compilers. Current synthesis tools rely on stochastic algorithms to achieve their results, and the user's knowledge about a design can enter the design cycle only with difficulty. In the course of this thesis, a complete hardware description system has been developed. It consists of a reconfigurable coprocessor based on the Xilinx XC6200 FPGA architecture and corresponding layout synthesis tools. The
107. available cells of the FPGA. Consider the problem of mapping the full adder of Section 3.2.5 to the cells of the coarse-grained XC4000EX and the fine-grained AT6000 and XC6200, respectively. Ideally, the three-input, two-output full adder fits into one cell (CLB) of the XC4000EX and makes use of the fast carry logic. On the AT6000, the carry logic is implemented using NAND gates, a full adder taking up three logic cells and one routing cell. On the XC6200, an ideal implementation of a full adder uses two XOR gates and a multiplexer, hence three cells. Most current technology mappers use a graph-based approach pioneered by [Keu87] and further developed by [DGR87, FRV91, BFR92]. The netlist is represented in a canonical, technology-independent graph structure. The various possible cell configurations are also represented in a similar structure. By using graph matching, a minimal-cost cover of the netlist graph using cell configuration graphs is calculated. Obviously, as there are many ways to cover the netlist using subgraphs, this process can take a long time. Many different mappings have to be evaluated to find the best one, which is the result of a dynamic programming algorithm or some heuristic. The advantage of this approach is that it is technology independent. The same mapping algorithm can be used for various FPGA devices and also for other target technologies such as standard cells. To support a new architecture onl
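The three-cell full adder mentioned for the XC6200 (two XOR gates and a multiplexer) can be checked with a short Python sketch. The function names and the argument order of mux are assumptions made for this illustration and do not reflect the cell's actual pin naming.

```python
def mux(sel, in0, in1):
    """2:1 multiplexer: returns in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0


def full_adder(a, b, cin):
    """Full adder built from two XOR gates and one multiplexer."""
    p = a ^ b              # first XOR: propagate signal
    s = p ^ cin            # second XOR: sum bit
    cout = mux(p, a, cin)  # if a and b differ, carry is cin, else carry is a
    return s, cout
```

The multiplexer works as the carry logic because a equals b whenever the propagate signal is 0, so passing a through yields the generated carry.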
108. box spanned by the source and destination nodes (cf. Figure 5.19). If routing fails within this rectangle, it is enlarged to the size of the chip and a new attempt is made. This two-phase approach speeds up the routing of the top-level nets by a factor of two on average. A priority queue of points from which the wave will spread further is maintained. Each queue element is weighted according to the current direction of wave spreading and the cost of the routing resource, as was explained earlier. This priority queue ensures a breadth-first spreading strategy of the wave and also helps to limit the use of costly routing resources and bends (turns of the routing direction). For example, if the current routing resource is a neighbor north multiplexer, then further possible points of spreading are the next north multiplexer further south, the east multiplexer west of it, and the west multiplexer east of it. In this case, the spreading will continue with the south multiplexer, as the east and west multiplexers represent a change of direction, which is penalized with a higher cost. The spreading of the wave proceeds until an entry Dest is found. Now backtracing can start. By examining the entries in the Lee map, the path back to the destination node can be constructed out of wire segments. The segments themselves are not yet inserted into the Trianus data structure in the form of Wires, but the necessary information is attached to the router data structure
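The two-phase, cost-driven wave expansion described above can be sketched on a plain grid. This is a simplified illustration, not the Hades router: the grid model, the cost constants, and the names route and route_two_phase are assumptions, and real XC6200 routing resources are far more varied than unit grid steps.

```python
import heapq

STEP_COST = 1   # assumed cost of advancing one cell
BEND_COST = 2   # assumed penalty for a change of direction


def route(src, dst, blocked, region):
    """Cheapest path from src to dst inside region, or None.

    region is (x0, y0, x1, y1), inclusive. The wave spreads from a
    priority queue ordered by accumulated cost; bends cost extra.
    """
    x0, y0, x1, y1 = region
    best = {(src, None): 0}
    prev = {}
    counter = 0                       # tie-breaker for equal costs
    queue = [(0, counter, src, None)]
    while queue:
        cost, _, cell, direction = heapq.heappop(queue)
        if cell == dst:               # destination reached: backtrace
            path, state = [cell], (cell, direction)
            while state in prev:
                state = prev[state]
                path.append(state[0])
            return path[::-1]
        if cost > best.get((cell, direction), float("inf")):
            continue                  # stale queue entry
        for step in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            nxt = (cell[0] + step[0], cell[1] + step[1])
            if not (x0 <= nxt[0] <= x1 and y0 <= nxt[1] <= y1):
                continue
            if nxt in blocked:
                continue
            new_cost = cost + STEP_COST
            if direction is not None and step != direction:
                new_cost += BEND_COST
            if new_cost < best.get((nxt, step), float("inf")):
                best[(nxt, step)] = new_cost
                prev[(nxt, step)] = (cell, direction)
                counter += 1
                heapq.heappush(queue, (new_cost, counter, nxt, step))
    return None


def route_two_phase(src, dst, blocked, chip_w, chip_h):
    """Try the bounding box of src and dst first, then the whole chip."""
    box = (min(src[0], dst[0]), min(src[1], dst[1]),
           max(src[0], dst[0]), max(src[1], dst[1]))
    path = route(src, dst, blocked, box)
    if path is None:
        path = route(src, dst, blocked, (0, 0, chip_w - 1, chip_h - 1))
    return path
```

Restricting the first attempt to the bounding box keeps the wave small for the common case; only nets that cannot be routed there pay for a search over the full chip.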
109. buses arbitration. With the Hades RC board, we wanted to overcome this problem by providing a coprocessor board with local memory that looks like and behaves like conventional memory to the host computer, especially in terms of latency. The Hades board makes use of the memory interface of the XC6200 FPGA. At the time of writing, the Hades board for the Ceres 2 workstation together with the driver software is the only system presenting a memory card interface to the application, which allows fast access to the FPGA and the local memory. An architecturally similar board for the PCI bus [LSC96, VCC97] does not yet support access through memory operations, because the driver software makes it necessary to move data using slow I/O commands. This slows down communication with the RC considerably (cf. Chapter 6 for a quantitative analysis).
4.2 Design Alternatives
As described above, our reconfigurable coprocessor should consist of one or more FPGAs attached to local memory. We wanted to build a simple system for evaluating ideas, and as we only had access to two FPGA engineering samples, we opted for an RC with a single FPGA. There are several alternatives to implement such an RC:
• an FPGA attached directly to the CPU, with the possibility to directly access main memory
• an extension card connected to the system bus, containing an FPGA with DMA capability to access main memory
• an extension card connected to the system bus containing an
110. can only be developed successfully if the programmer can allocate memory at will and release it again simply by removing references to it.
5.11.3 Oberon 2 Language
Hades is written in Oberon 2 [MW91]. Type-bound procedures are only used in the interface generator to allow the programmer to extend the read and write methods with additional code. Dynamic arrays and the read-only export are used quite extensively, especially in the support modules and the router. Oberon 2 is an elegant language whose only downside is the lack of genericity for defining container data structures.
5.11.4 Oberon System
The Oberon System Version 4 is our host platform. Its availability on most computers and operating systems is beneficial to the spread and acceptance of the tools by other researchers and developers. The Oberon System makes for a very productive programming environment due to its fast compilation times and integrated environment. The system can be extended very easily, and new toolets (small tools) can be developed in a very short time. An example of such a toolet is the floor planner, which allows the manual placement of instances. It was developed in an afternoon. Another example is a viewer associated with an Oberon background task. The viewer displays information about the location of the mouse in a design, such as the name of the label, the function of the cell, or the cell coordinates. It was developed in half an hour. A drawback of the Oberon System Ve
111. cation domains. Or it could indicate that customers play it safe and go with the market leader. Table 2.2 and the following sections summarize the differences and similarities of the
[Figure 2.12: XC4000EX Function Unit (Simplified)]
[Figure 2.13: XC4000EX Routing (CLBs, programmable switch matrices, direct connects, doubles, singles)]
112. ce 1997.
M. Shand: PCI Pamette V1. http://www.research.digital.com/SRC/pamette, 1996.
G. Snider, P. Kuekes, W. B. Culbertson, R. J. Carter, A. S. Berger, R. Amerson: The Teramac Configurable Compute Engine. Proc. 5th Intl. Workshop on Field Programmable Logic and Applications, LNCS 975, Springer, 1995.
Synopsys Inc.: VHDL Compiler Reference Manual, 1992.
Texas Instruments: The TTL Data Book, 1987.
A. Thompson: An Evolved Circuit, Intrinsic in Silicon, Entwined with Physics. Proc. First Intl. Conference on Evolvable Systems: From Biology to Hardware, LNCS, Springer, 1996.
D. W. Trainor: An Architectural Synthesis Tool for VLSI Signal Processing Chips. Dissertation, Queen's University of Belfast, 1995.
D. W. Trainor, R. F. Woods, J. V. McCanny: Architectural Synthesis of an Image Processing Algorithm Using Iris. Proc. IEEE Workshop on VLSI Signal Processing, 1995.
TW96 TM91 VSC96 VCC97 VBR96 WAL93 WG92 Wir95 Wir96a Wir96b WH95 Woo96a Woo96b WCG96 WLH97 Xil96 ZK97 ZR95
D. W. Trainor, R. F. Woods: Architectural Synthesis and Efficient Circuit Implementation for Field Programmable Gate Arrays. Proc. 6th Intl. Workshop on Field Programmable Logic and Applications, LNCS 1142, Springer, 1996.
D. E. Thomas and P. Moorby: The Verilog Hardware Description Language. Kluwer Academic Publishers, 1991.
J. Villasenor, B. Schoner, K. N. Chia, C. Zapa
113. ce is not seen by the user. One reason is the inefficiency of the automatically generated interface. In the future, we intend to support the generation of inlinable interface procedures. In Oberon, this is possible through the use of code procedures. One drawback is that the interface generator becomes target-machine specific. The gain in performance justifies this, however. Another reason for the inefficiency is the lack of a memory-type interface on the PCI card. The availability of the necessary driver software, however, is only a matter of time.
6.3 Comparison to XACT step Series 6000
In this section, we compare the semi-automatic generation of the pattern matchers using the Hades software from Chapter 5 against the XACT step 6000 software, Version 0.3.5, from Xilinx Development Corporation, Scotland (simply called XACT in the following). We gratefully acknowledge the permission to perform this evaluation. We do not compare Trianus to commercial front-end tools, i.e., HDL compilers or schematic entry systems, as none were available to us. We want to point out, however, that VHDL compilers have reported runtimes in the order of minutes to hours to compile small modules of the size of the PatternMatch application [Woo96a, WLH97]. These tools are three to four orders of magnitude slower than the Lola compiler with the Trianus back end. The files used by XACT were produced by Hades using the CFG file converter tool. It generates design files in the in
114. cells. Therefore, these devices can be programmed easily, rapidly (in milliseconds), and an arbitrary number of times. No special programming machinery is required. By downloading a configuration bitstream to such a device, it is possible to adapt the device to a specific task in a couple of milliseconds. Typically, the contents of the SRAM remain unchanged during an application session. Chapter 2 gives a more detailed description of this technology.
1.3 Reconfigurable Coprocessors
The main drawback of special-purpose hardware as defined in Section 1.1 is that it is special purpose. Once built, it is not possible to change the hardware to accommodate slightly different needs. A CPU can be programmed to implement any algorithm, but an ASIC implementing an MPEG decoder will do only that. Another drawback is that, for economic reasons, it is not sensible to build special-purpose hardware for an algorithm that is executed only by one user. For example, if a time-consuming task is not popular enough to warrant the high design and manufacturing costs of an ASIC, the user who has to solve that task has no other options than optimizing the software code or buying a faster machine, both of which might not suffice to achieve the needed level of performance. With the advent of FPGA technology, however, it is possible to construct a general-purpose coprocessor which can be programmed for a specific task and then reprogrammed for another task within m
115. ces may only be declared in VAR sections. type points to the type structure of which this instance is an instantiation. E.g., add N AddElem in type Adder of the Add example.
Program 3.7: Definition of Instance
Instance = POINTER TO InstanceDesc;
InstanceDesc = RECORD (ObjectDesc)
  (* fct = Inst; x: list of unplaced nodes; y: list of placed nodes; mode = VAR *)
  dsc: Object; (* list of interface signals and local variables, instances *)
  open: BOOLEAN (* are contained objects visible? *)
END;
Type
Composite and generic types are represented by the Type type. For each instance of a generic type, a concrete composite type exists which contains the actual parameters. Type is derived from Instance. Its definition is shown in Program 3.8.
Program 3.8: Definition of Type
Type = POINTER TO TypeDesc;
TypeDesc = RECORD (InstanceDesc)
  (* fct = Typ, Module; mode = Expanded, Parameterized *)
  code: Object; (* syntax tree for interpretation *)
  marked: BOOLEAN; (* for export *)
  color: SHORTINT
END;
The name of a type is the same as the one in the Lola declaration. mode indicates whether it is a generic or a normal composite type. code points to the syntax tree, which is used for interpretation and for generation of expanded types out of generic ones. The marked field is TRUE when the type is marked for export. color is used by visualization tools. Since a Lola module is both a description and the only instantiation of a circuit, a type i
116. conventional memory, using address and data buses and read/write control signals. A schema of the coprocessor board is shown in Figure 4.4. The complete board layout and a picture of the board are shown in Appendix C.
4.4 Choice of Host Workstation
When we had to decide on a host computer for the Hades RC board, the choice was between a commercial architecture like a PC, Sun, or Macintosh, and the Ceres workstation, which was developed at the Institute for Computer Systems. Our Institute has a long tradition in building its own workstation hardware [Ebe87, Hee88, HN91, Ohr84]. The Ceres 1, Ceres 2, and Ceres 3 computers have been used since 1987 in research and education. An experimental workstation named Chameleon used an array of six CAL 1 [Alg90, Kea89] chips for a reconfigurable coprocessor and a single CAL 1 for all control logic [Hee93, HP92]. For a digital design course, a small extension board for the Ceres 3 (the machine used in education) based on the Concurrent Logic 6000 FPGA was developed [GLW94].
[Figure 4.3: FPGA with Local Memory on Extension Card (blocks: Processor Bus, CPU, Memory, Host Bridge, Input/Output, System Bus, DMA Control, Local Memory)]
Expansion Connectors 645 Transceiver Data 31 0 541 Buffer 679 D
117. d a pipelined ripple-carry adder with a delay of 10 ns were built. More details can be found in [Mul97].
6.5 Possible Future Applications
In this section, we present some possible future applications of the Hades RC board or variants of it. Many applications found in the literature could and should be implemented on the Hades RC board to evaluate its architecture and also the architecture of the XC6200 FPGA. They are not discussed in this section.
6.5.1 Switcherland Reconfigurable Coprocessor Node
One interesting application of an XC6200-based coprocessor board would be its incorporation into a network such as Switcherland [OE95]. The throughput of Switcherland is about 20 MB/s, which is well matched to the processing speed of an RC. A pipelined version of the pattern match application from Section 6.2 could be used to filter a data stream and detect certain patterns in packets passing by, for instance to gather statistical data. A packet can be processed by simply routing it via the coprocessor board instead of directly to its destination. Therefore, the presence of an RC in the network is completely transparent to an application.
6.5.2 Guard Evaluator for Active Oberon
In Active Oberon, tasks are synchronized using guards, which are evaluated by the scheduler [DR97, Gut97]. If a guard is asserted, a task can resume execution. This guard evaluation step can be quite costly when the guards depend on global variables of the system. One way to
118. d by the tools that generate the Jedec files (PAL fuse maps). Logic partitioning onto the three PALs and pin assignment were done manually. The Lola code for the complete decoder is listed in Program 4.1 and in Appendix E. A PAL burner was used to program the PALs based on the Jedec files.
4.6.3 Assembling the Board
After receiving the boards, we tested them for short circuits. The connections to the various chips, especially power and ground, were checked manually using a multimeter. During this testing, we found a net which was not correctly connected (cf. Section 4.6.4). After soldering the sockets for all chips and the connectors, the board was inserted into a Ceres 2, which booted without problems. One after the other, the line drivers, the decoder PALs (which functioned right away), and the SRAMs were added to the board. We tested the relevant signals of each newly added component. Finally, the FPGA was inserted as well and the first design was downloaded.
4.6.4 CAD Software Pitfalls
To reduce the wire length of nets, board routers perform a process called pin and gate swapping. Pins of a chip can be exchanged with each other if they implement the same function. For instance, the outputs of a transceiver chip (645) can be swapped as long as the inputs are swapped accordingly. Whether certain pins can be swapped or not is described in a device description file in the case of CadStar. Due to an error in our description of the 679 addre
119. d efficiently.
5.3 Overview
The Hades tools consist of a technology mapper using a direct approach (Section 5.4), a constructive placer and floorplanner (Section 5.5), a maze-running router (Section 5.6), a bitstream generator and loader (Section 5.7), a hardware monitor, and an interface generator (Section 5.8) for providing a software interface to a coprocessor application running on the FPGA. The tools use the Trianus framework [Geh97]. At all times, they use and preserve the hierarchical design information given by the Lola HDL description. The tools support an incremental design style by way of fast turnaround times and user control during all phases. The following sections discuss the various tools and algorithms used, in their natural sequence in the design flow as shown in Figure 5.1. Their complexity in source and object code is analyzed and put into perspective. The chapter concludes with a few observations on the programming methodology used and discusses possible improvements. The tools are evaluated in terms of speed in Chapter 6. When discussing an operation on a Trianus data structure in the following sections, we often list Oberon code to better illustrate an algorithm. This fairly low-level description is meant to serve as a tutorial on using the Trianus data structure and as an aid to the reader of our source code. Using and manipulating a Trianus data structure correctly is not trivial, and the more examples are given, the better this
120. d operations on it, supporting both the placer and the router. It is a simple linear list of node and coordinate pairs which is used to store temporary information about wires and nodes that are to be inserted into the TriBase data structure during a type broadcast (cf. Section 5.5 and Section 5.6). For efficiency reasons, a hash table accessed by the coordinate pairs is superimposed on the list to speed up the search for already inserted wires during routing. This hash table speeds up the routing of nets with high fanout by a factor of 3. Such a simple hash table could also be used to reduce the time needed to locate nodes and wires in the TriBase data structure, a problem already mentioned in Section 3.4.
5.9.2 Table Modules
A table module was implemented to store coordinates of already placed nodes. The table supports dynamic growth, i.e., when new data is inserted, the table grows in size when needed. For the implementation of the net sorting algorithm in the router, a similar container data structure with different contents was needed (cf. Section 5.6.3). Here, the lack of support for generic types in the Oberon language became apparent, as the two modules were identical except for the type they stored. Developments such as [RS97] will make the Oberon language more suitable for the development of general data structure container modules. Currently, the language and the Oberon System lack support for this.
5.9.3 Xilinx Software Interface
Together
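The idea of superimposing a hash table on a linear list of node and coordinate pairs can be sketched as follows. The class and method names are hypothetical and chosen for this illustration only; the original support module is written in Oberon 2 and its interface is not shown in this excerpt.

```python
class WireList:
    """Linear list of (node, coordinate) pairs with a hash index on top."""

    def __init__(self):
        self.items = []     # insertion-ordered list, as used by a broadcast
        self.by_coord = {}  # hash index: (x, y) -> nodes at that coordinate

    def insert(self, node, coord):
        """Append to the list and register the entry in the hash index."""
        self.items.append((node, coord))
        self.by_coord.setdefault(coord, []).append(node)

    def find(self, coord):
        """Hashed lookup of everything already inserted at coord."""
        return self.by_coord.get(coord, [])

    def find_linear(self, coord):
        """Reference lookup by scanning the whole list."""
        return [node for node, c in self.items if c == coord]
```

Both lookups return the same entries; the hashed version avoids scanning the whole list, which matters most for nets with high fanout where many wires have already been inserted.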
121. d route times. In each column, the best route time is taken (indicated in boldface). Best means the lowest time with the lowest number of unrouted nets.
(6) This row lists the speedup obtained when Hades is used instead of XACT.
(7) The Hades column lists the time for Lola compilation and the generation of the Trianus data structure from it. Only one route time is listed, as Hades always uses type-based routing and does not support the Magic routing resource.
(8) XACT has an effort option used during placement. Low stands for little effort and fast run time; high stands for high effort and long run time. As can be seen from the size of the bounding box, the placer packs the cells as closely as possible. It makes no attempt at placing the registers next to the point of usage. Consequently, the number of unrouted nets is very high for all routing options we tried. The routing rectangle used is that of the bounding box in this case. If a larger rectangle is given, the routing time increases dramatically.
(9) High uses the highest placement effort, and trans indicates that instances of the same type (comparators in our case) may be transformed, i.e., they do not all have to have the same placement as their type.
Evaluation
Hades stands out with exceptionally fast compile, map, and place times (less than a second). The router is very quick as well. The resulting placement was already shown in Figure 6.8. Not surprisingly, it is no
122. d two procedure calls are invoked. Additionally, one of the procedure calls checks its arguments for validity (precondition). With the Hades coprocessor board for the Ceres 2, the final procedure call accesses the board at the same cost as normal memory (cf. Section 4.5.1). With a prototype PCI board for the PC [LSC96], the final procedure call first stores the destination address into a latch on the PCI board and then accesses data on the board. Hence, accessing the PCI board is at least twice as expensive as accessing normal memory. Additionally, since IN and OUT commands of the Intel i386 CPU are used, which have latencies of 20 cycles, accessing the board costs at least 40 cycles. Compared to the 6 cycles on the Ceres 2, this is a relative difference of more than a factor of 6. If we inline the communication code in the main search loop, we can achieve the same speed on the Ceres 2 as with Boyer-Moore. Hence, on the Ceres 2, the method call overhead is the reason for the slow performance of the initial solution. On the PC, although inlining increases throughput by 33%, the hardware solution is still very slow compared to the Boyer-Moore algorithm. We anxiously await the new PCI board [VCC97] together with appropriate driver software which lets us access the board using direct memory accesses.
6.2.12 Improvements
Up to now, we have only made use of the processor interface to store the pattern registers into the
123. dated for by the XC layout editor. The same has to be done if both inputs of a gate before the register read the register's value. This can occur with enabled registers reading their negated value. Other cases should be considered as bad designs, but the software must handle these cases nonetheless. Figure 5.8 summarizes the mapping of registers.
Duplication of Input Variables
Input variables declared in type declarations can occur more than once in expression trees of signal assignments. Since the XC editor displays input variables at the point of use, they must be duplicated to allow for multiple displays of the same input variable name in a design. For instance, considering the Adder example shown in Program 5.4, the carry input variable ci occurs more than once in the expressions of type AddElem and must therefore be duplicated. For duplication, the aforementioned id field used for marking mapped nodes comes in handy. We can use it to determine whether an input variable is referenced more than once. For every further reference, a copy of the variable is created. A tricky point in the development of this duplication code is the presence of scopes. Since instances of other types can occur within a type, and input variables can be passed as parameters to other input variables in inner instances, the duplication of inner input variables has to be
124. des and wires. The next paragraph explains this process in more detail. The broadcast mechanism starts at an instance and iterates over all nodes and wires in that instance, proceeding recursively into the sub-instances contained in that instance. It invokes, if present, the procedures stored in the procedure variables doNode (for each node and extensions thereof) and doWire (for each wire encountered). It is possible to send a message (based on Message) to
• all nodes and wires,
• all instances of a certain type,
• all visible nodes and wires within a given rectangle, and
• all placed nodes and wires.

3.3.4 Lola Compiler Back End

A Lola program is translated by the Lola compiler into an abstract syntax tree [Wir96b], which can then be interpreted by a Lola interpreter to generate an expanded data structure representing the circuit. Compiling, type checking, and generating the syntax tree are very fast, as only a single pass over the source code is needed. No logic minimization is performed, since this would possibly break the correspondence between an instance and its type, violating an invariant in Trianus. The Trianus Lola back end translates the syntax tree produced by the Lola compiler into a syntax tree based on the Trianus data structure, which is the same as that used for representing a circuit, namely Node objects with special values in the function (fct) field. The data structure representing the syntax tree is interpreted and the circuit is expanded or
125. direction; if they are square or wider than high, they are placed adjacent to each other in the vertical direction. Instances with lower indices are placed first; therefore, an array is placed from left to right or from bottom to top. The latter is especially important, as the processor interface of the XC6200 allows columns of cells to be accessed efficiently. For multi-dimensional arrays, the algorithm alternates between vertical and horizontal placement when going from one dimension to the next. Programs 5.8 and 5.9 show this algorithm in detail.

If the array elements are not instances but signals representing expression trees, the distance between the elements is determined by the width or height of the trees. Arrays of signals

Program 5.8: Placement of Arrays I

PROCEDURE PlaceArray p obj u v dim lastDim VAR w h IF dim < lastDim THEN not last dimension REPEAT thisW 0 thisH 0 place array of next dimension recursive call PlaceArray p obj u v dim 1 lastDim thisW thisH p dir other direction p dir adjust u v w h according to thisW thisH p dir direction obj next in same dimension UNTIL obj NIL ELSE last dimension index 0 get pitch from first item IF obj IS TriBase Instance THEN if obj is higher than wide suggest horizontal placement IF obj h > obj w THEN p dir Horizontal pitch obj w 1 u p minU start with lowest u coord ELSE p dir Vertical pitch obj
126. duced (color display controller, Ethernet interface, audio/video interface),
• all maintenance can be accomplished at the Institute,
• the Oberon operating system was written by members of the Institute, and
• it has a simple bus protocol.

At the time we started our project, no RC board existed that used the XC6200. To gain experience with the Trianus/Hades tool set as soon as possible, availability of such a board was more important than performance or compatibility with commercial systems. Therefore, we chose the Ceres-2 as our target platform rather than a commercial architecture. Furthermore, we knew of other people working on a PCI board [LSC96]. Clearly, to gain widespread acceptance, a second-generation Hades RC should be implemented as a PCI card with corresponding driver software, or the Hades software should be ported to a commercial board. See Chapter 6 for a comparison of the PCI card [LSC96] with our Hades RC.

4.5 Architecture of the Hades Board

The Hades RC board shown in Figure 4.4 consists of a single XC6216 FPGA in a 299-pin grid array package [Xil96] and 256 KB of local memory implemented with fast SRAM (8 Motorola 64Kx4-bit SRAMs [Mot95]). The memory card interface is realized with three 22V10 PALs [AMD95, Cyp95], which generate the necessary control signals for the XC6216 and the SRAMs. See Appendix D for a complete list of hardware components used on the board. The programming and access interface to the XC6216 is that of
127. dware as input instead of a graphical one. Last but not least, the demands posed on a computer system are much smaller when text is processed instead of graphics.

3.2 The Hardware Description Language Lola

Lola (logic language) was designed by N. Wirth in 1992 as a simple, easily learned hardware description language for describing synchronous digital circuits. In addition to its use in a digital design course for second-year computer science students at ETH Zürich [GLW94], the Institute for Computer Systems uses it as an HDL for describing hardware designs in general and coprocessor applications in particular. The complete syntax is listed in Appendix A.

3.2.1 Overview

The purpose of Lola is to statically describe the structure and functionality of hardware components and of the connections between them. A Lola text (or program) is composed of declarations and statements. Statements consist of control statements and assignments. A program describes the hardware on the gate level in the form of signal assignments. Signals are combined using operators, thus forming expressions. These expressions can be assigned to other signals. Signals and the respective assignments can be grouped together into types. Types can be composed of instances of other types, thereby supporting a hierarchical design style. An instance of a type is a hardware component such as an adder. Types can be generic, e.g., parameter
128. e Fr
AT6000 Function Unit
AT6000 Routing Network
XC4000EX Function Unit (Simplified)
XC4000EX Routing (Simplified)
Different Views in Trianus
Lola and Trianus Part in Design Flow from Fig. 1.3
Trianus Types and Lola Constructs
Data Structure for AddElem Type
XC6200 Layout of AddElem
OBDD for Carry-Out of AddElem
Schema Showing an AddElem
Hades Hardware Part within the Design Flow of Fig. 1.3
FPGA Attached to the CPU: Two Alternatives
FPGA with Local Memory on Extension Card
Hades Reconfigurable Coprocessor
Interface Timing
Hades Software Part within the Design Flow of Fig. 1.3
Routing Channel
Wave Expansion (1, 2, 3, 4) with Resulting Route
XC6200 Cell Configurations without Registers
Mapping of NOT Gate
Mapping of AND Gate
Mapping of Latch
Mapping of Register
Buried Inputs and Outputs
Tree Example
Array of Trees Example
Selector Example
129. e design cycle. No translation steps are necessary when switching between different tools, and in particular no files need to be written and read again. Also, from a software engineering point of view, using only a single data structure instead of a multitude reduces system complexity and learning time. The data structure is based on the constructs offered by Lola (cf. Figure 3.3). Hierarchical information is available and maintained across all tools. Operators, signal names, types, instances, and modules exist in a Trianus data structure, as well as wires, which are used to connect operators. Other than basic geometric information (u, v, w, h, to in Program 3.5), no device-specific information is stored. Specifically, additional temporary data needed by a back-end tool, e.g., a placement algorithm, has to be managed by that tool itself.

Type-Based

One fundamental principle in Trianus is that all instances of a type have exactly the same relative placement information and the same wiring. That is, the placement and wiring of a type determine the placement and wiring of all its instances. This concept is ensured by the system and has to be ensured by all tools as well. Trianus provides algorithms (cf. Section 3.3.3) for distributing information from a type to all its instances. We call tools or algorithms type-based if they have this property of propagating information from a type to all its instances. It implies that an algorithm makes only direct changes
130. e eliminated the overhead caused by calling the methods of the interface objects by inlining the coprocessor communication code into the search loop. The resulting throughput was still only 950 KB/s. Although this value is 33% better than the solution using interface objects, it is still 3.2 times slower than Boyer-Moore. If we search for a pattern occurring frequently (4 white-space characters), which does not allow long skips, the throughput of Boyer-Moore drops to 1828 KB/s, while the hardware solution still has the same throughput. When we search for multiple patterns in the text using the hardware solution, we achieve the same throughput, as we can make use of the parallelism available in hardware. Table 6.2 summarizes the throughput values for pattern MODU and also lists the values obtained on the Ceres-2. There, disk read speed is very slow, and both Boyer-Moore and the hardware solution with inlined communication are bound by the disk read speed.

[Table 6.2: Searching MODU, Throughput in KB/s. Only the row labels "Disk Block Read" and "Disk Word Read" survive in the extracted text.]

Communication Bottleneck

Why is the hardware solution so slow? The main problem is the relatively high cost of communication. With the current software interface, an indirect procedure call (method) is used to transfer a value to and from the coprocessor. This is the cost that has to be paid to support extensibility and versatility of the interface during development. During each access, a method call an
131. e following, we give a description of the construction process of a printed circuit board. Experienced hardware designers may skip this section, but readers new to the field may find some interesting information herein. For a more detailed experience report, see also [Lud96].

4.6.1 Describing the Printed Circuit Board

Once the design of the RC was fixed on paper, CadStar [ZR95] was used to describe the board in a schema (see Figure B.1 in Appendix B). Based on this, a printed circuit board (PCB) with two signal layers and separate power and ground layers was defined. A Ceres-2 expansion board is an extended double Eurocard (233 x 220 mm), so there was ample space to place components. This was done manually, and care was taken that all dual inline packages were facing the same direction, such that power and ground pins were all at the same position, which eases debugging. We decided to use sockets for all chips, and no surface-mount technology was used. After successful placement, the board was routed automatically. A pin swap optimization step reduced wire length by more than 10% but also caused a problem on the final board, as described in Section 4.6.4. A PCB manufacturer produced four prototype boards.

4.6.2 Describing and Implementing the Decoder

The decoder's logic equations were defined in the Lola HDL (cf. Section 3.2). We simulated the decoder thoroughly before translating the Lola code into the CUPL language [Log91] use
132. e for the decoder chip controlling the XC6216 FPGA on the Hades RC board Subsequent sections refer to the variable names defined in this program Appendix E additionally lists the Lola code for the RAM control PAL and the code for the PAL implementing communication ports 4 Hades Hardware Program 4 1 Lola Code for FPGA Control PAL TYPE DecoderXCCtrl IN Clk BIT BoardAdr BIT A19 A18 A4 A3 A2 BIT CPURW CPUDS BIT RESET BIT OUT XCCS XCOE BIT XCAOE XCDOE BIT XCReset XCGCIr BIT CPUAEN CPUDEN BIT XCStep BIT VAR 47 PAL22V10 board is selected address lines needed for decoding CPU read write data strobe master reset XC is selected may drive pins XC may drive A D buses XC reset global clear CPU drives the A D bus single stepping select write XCSel RAMSel PortSel BIT oe BIT BEGIN select CPUDS BoardAdr write select CPURW XCSel select A19 A18 RAMSel select A19 A18 PortSel select A19 A18 A4 XCCS XCSel 107000 disable 107001 enable OE register 00 xxx Ol xxx 11 0xx oe REG write A19 A18 A4 A3 A2 XCOE oe 10 010 XCReset write A19 A18 A4 A3 A2 RESET 10 011 XCGCIr write A19 A18 AA A3 A2 10 100 generate one clock pulse high gt low high XCStep write A19 A18 A4 A3 AD CPU us
133. e latter style then describes the hardware on a higher level, as only the functional blocks and their interaction are described rather than the whole circuit.

5.2 Programming Methodology

The Hades software was developed using the Oberon-2 programming language [MW91] and the Oberon System [WG92]. We extensively used preconditions, assertions, and, less frequently, postconditions to enhance the quality of our code and to improve the localization of errors in the software. This proved very useful during the development of the Trianus framework and the Hades back end. When a precondition fails, it is the user of the service who failed to meet the contract. When a postcondition of the called service or an assertion (a check based on the assumption that the service performed its duty) fails, then it is the implementer of the service who is at fault. Additional convenient features of the Oberon-2 language and the Oberon System are checked array indices, NIL checks on pointers, and automatic garbage collection (mandatory for extensible systems). These features are suddenly of interest to the rest of the software world (one might even dare to say humanity) due to the advent and popularity of the Java language. All proved extremely useful, which might be evidenced by the lack of a runtime debugger. A few output statements and a comfortable post-mortem trap viewer are all that was necessary to develop the Hades software successfully an
134. e placement of arrays, instances, and expression trees after giving a few preliminary explanations on free space management.

5.5.2 Free Space Management

Free space on a chip is managed using a bitmap. When a single gate from the netlist is placed in a specific cell position, a free bit in the bitmap is sought at that position or in its neighborhood (4 cells in each direction). If no free bit (cell) is found, placement fails and the user must provide a placement hint. When an instance with a certain width and height is placed, a rectangular area of that size is sought in the bitmap using a simple search strategy in the horizontal and vertical directions. Again, if no space is found, placement fails. The positions occupied by an instance or a gate are marked in the bitmap. Overlaps of instances with other instances or with single gates are allowed and can be very useful to achieve a dense layout (cf. Section 6). A more clever way of managing free space would involve the use of quad trees [FB 72]. However, since the algorithm described here gives reasonable results, we did not pursue this further.

5.5.3 Placement of Arrays

The algorithm used for placing arrays is as follows: a sequence of instances is placed according to the width and height of the corresponding type. Remember that all instances of the same type share the same placement information. If instances are higher than wide, they are placed adjacent to each other in the horizontal
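The bitmap-based free space management of Section 5.5.2 can be sketched as follows. This is an illustration under stated assumptions, not the Hades code: the class and method names are hypothetical, the nearest-first search order in the neighborhood is an assumption (the text only gives the radius of 4 cells), and the sketch never lets placements overlap, although the text says overlaps are allowed.

```python
from typing import Optional, Tuple


class FreeSpaceMap:
    """Bitmap of occupied cells; True means the cell is taken."""

    def __init__(self, width: int, height: int) -> None:
        self.width = width
        self.height = height
        self.used = [[False] * width for _ in range(height)]

    def _is_free(self, u: int, v: int) -> bool:
        return 0 <= u < self.width and 0 <= v < self.height and not self.used[v][u]

    def place_gate(self, u: int, v: int, radius: int = 4) -> Optional[Tuple[int, int]]:
        """Claim (u, v) or a free cell at most `radius` cells away in each direction."""
        offsets = [(du, dv)
                   for du in range(-radius, radius + 1)
                   for dv in range(-radius, radius + 1)]
        offsets.sort(key=lambda d: abs(d[0]) + abs(d[1]))  # nearest first (assumed order)
        for du, dv in offsets:
            if self._is_free(u + du, v + dv):
                self.used[v + dv][u + du] = True
                return (u + du, v + dv)
        return None  # placement fails; the user would have to give a hint

    def place_instance(self, w: int, h: int) -> Optional[Tuple[int, int]]:
        """Claim the first free w-by-h rectangle found by a row-by-row scan."""
        for v in range(self.height - h + 1):
            for u in range(self.width - w + 1):
                cells = [(x, y) for y in range(v, v + h) for x in range(u, u + w)]
                if all(self._is_free(x, y) for x, y in cells):
                    for x, y in cells:
                        self.used[y][x] = True
                    return (u, v)
        return None
```

For example, on an 8x8 map `place_gate(3, 3)` returns `(3, 3)` the first time and a directly adjacent cell the second time; on a full map both methods return `None`, which corresponds to a failed placement.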
135. e routing algorithm. Its main problem is the long runtime, as the time required to spread a wave in all directions grows with the square of the distance. Despite this, the maze-running router is still the most popular routing algorithm due to its flexibility. It can be accelerated by several techniques so that the quadratic behavior affects the routing of only a few nets [Dio87, DM95]. The router described in Section 5.6 is a maze-running router.

Figure 5.3: Wave Expansion (1, 2, 3, 4) with Resulting Route

5.1.4 Problems with Today's CAD Tools

Speed

Many algorithms used in today's FPGA layout synthesis tools were adapted from packages used for ASIC design. There, utmost performance in both time and space of the resulting circuit is mandatory. Long runtimes (minutes to hours) of the tools are widely accepted and are also most often the only way to achieve a result in the target technology. This can be tolerated when the target technology's turnaround time is in the range of days to weeks. However, with the availability of a silicon foundry on the desktop in the form of an FPGA system, this approach is no longer viable. The engineer can now work with and develop hardware in the same way as is common in software development, namely using iteration and exploration. If the FPGA device can be programmed in a couple of microseconds, then it should be possible to synthesize the necessary configuration bits within seco
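The wave expansion of a maze-running router (Figure 5.3) can be illustrated with a small breadth-first search on a grid. This is a generic textbook sketch, not the Hades router of Section 5.6: the grid encoding (`#` for blocked cells) and the function names are assumptions made for the example.

```python
from collections import deque


def neighbors(cell, rows, cols):
    """Yield the up/down/left/right neighbors that lie on the grid."""
    r, c = cell
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < rows and 0 <= nc < cols:
            yield (nr, nc)


def lee_route(grid, src, dst):
    """Return a shortest list of cells from src to dst, or None if unroutable."""
    rows, cols = len(grid), len(grid[0])
    dist = {src: 0}              # wave number of every cell reached so far
    frontier = deque([src])
    while frontier and dst not in dist:
        cell = frontier.popleft()
        for nxt in neighbors(cell, rows, cols):
            if grid[nxt[0]][nxt[1]] != "#" and nxt not in dist:
                dist[nxt] = dist[cell] + 1
                frontier.append(nxt)
    if dst not in dist:
        return None
    # Backtrace: walk from dst to src along strictly decreasing wave numbers.
    path = [dst]
    while path[-1] != src:
        here = path[-1]
        path.append(next(n for n in neighbors(here, rows, cols)
                         if dist.get(n) == dist[here] - 1))
    return path[::-1]


grid = ["....",
        ".##.",
        "...."]
print(lee_route(grid, (1, 0), (1, 3)))
```

The number of cells labeled by the wave grows roughly with the square of the source-to-destination distance on an open grid, which is the runtime problem mentioned above.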
136. easier to use, as the pointer field being changed could simply be passed as a reference parameter. In a language without this feature, such as Java, an additional parameter would have to be passed indicating which field needs to be updated, or a function procedure would have to be used.

Constants One and Zero

To conserve memory space, the constants One (1) and Zero (0) occur only once as global variables in a compiled Trianus data structure. But since they must be represented as actual gates at several locations of the FPGA, a new node representing the constant must be generated for every occurrence in the expression tree. The pointer field pointing to the unique constant node is updated to point to a new copy of such a constant node.

Not

NOT operators that are associated with a signal name (label) or that feed a register occur as proper cells in the layout. All others can be merged into the operators with more than one input (see the NOT in expression y below). Two NOTs in sequence are replaced by a buffer. See Figure 5.5 for some code examples and their associated mappings.

And, Or, and Exclusive-Or

As can be seen in Figure 5.4, the XC layout editor does not allow the inputs and outputs of AND and OR gates to be inverted arbitrarily. For instance, an AND with two inverted inputs must be represented as a NOR gate. Table 5.1 lists the AND, OR, and XOR gates with all possible
137. ecode this address range. A problem resulting from this requirement is described in Section 4.6.4. The comparator's enable signal is the AV (address valid) signal of the Ceres bus (see Figure 4.5). The comparator will assert signal BoardAdr whenever a correct address is on the address bus and signal AV is asserted. The host is given priority over the XC6216 when accessing the address and data buses of the RC board. For this purpose, the decoder drives two output enable signals read by the XC6216, XCAOE and XCDOE, for the address and data buses, respectively. A user configuration in the XC6216 may not drive the address and data buses when the respective enable signals are not active.

Memory Map

The different memory regions selected by the decoder are shown in Table 4.1.

Table 4.1: Memory Map of Hades Board (Address is Relative to a Base)
No.  Address        Function
1    00000-3FFFF    FPGA configuration, register access
6    8000C-8000F    FPGA global clear (XCGClr)
8    80014-80017    Go, Busy flags
9    80018-8001F    reserved
(Rows 2 to 5 and 7 are not recoverable from the extracted text.)

The functionality of the different regions is used as follows:
1. Programming the XC6216 and access to the registers within the FPGA.
2. Access to on-board memory. Individual byte accesses are controlled by the byte enable signals BE[0..3] [Hee88] and RAMWE[0..3] (cf. Appendix E).
3. Disable Output Enable signal of the XC6216. This is useful to regain control over the buses if a faulty configur
138. [Figure 4.4: Hades Reconfigurable Coprocessor. Block diagram showing the Ceres-2 host workstation, address comparator (679), line drivers (645, 541), decoder and control PALs (22V10), the XC6216 FPGA, and 8 x 64Kx4 SRAM.]

We developed the Hades RC board for the Ceres-2 [Ebe87, Hee88], the second-generation Ceres workstation, which has the following characteristics:
• 1985-1988 technology
• National Semiconductor NS32532 CPU, 25 MHz (40 ns cycle time), with 512 bytes of shared instruction and data cache [NS88]
• proprietary bus with arbitration
• 4 or 8 MB DRAM
• 1 memory cycle equals 6 processor cycles (240 ns)
• memory transfer rate of 12.5 MB/s as seen by applications
• 80 MB hard disk
• disk transfer rate of 120 KB/s as seen by applications
• Oberon operating system with all source code available

The reasons for choosing the Ceres-2 as our host platform were as follows:
• the Hades RC board looks like a normal memory card to the Ceres-2
• the Ceres-2 has a 32-bit wide data bus, which delivers satisfactory performance for our purposes
• it is used in our group as the main computing platform
• its architecture is well understood
• numerous extension boards have already been pro
139. ed. If the grade is worse than before, however, the configuration is only accepted with a certain probability. This allows the algorithm to leave local minima and is the key to finding a placement which is close to the global minimum. After each evaluation step, the temperature is lowered, i.e., the area within which a cell may float around is made smaller. Simulated annealing gives very good results, but its problems are long runtimes and, due to its randomness, non-deterministic solutions. One placement run may give a completely different result from a previous one despite being produced from the same input. This can be a real problem when iterative design methods are used, or when device utilization becomes very high and manual support is needed. In these cases, non-deterministic placement algorithms make things harder for the designer than necessary.

Min-cut and simulated annealing yield the best results. Constructive placement is the fastest. Certain placement algorithms combine several of the above approaches and use the best result. This process, of course, takes even more runtime.

Routing

Routing is the process where placed cells on a rectangular grid are connected together using the available wires, also called routing resources. Taking the example from Section 3.2.5, the router has to connect the OR gate of the carry-out with the two AND gates, which in turn have to be connected to the XOR gate and the carry inpu
140. edure using the interface to communicate with the hardware. The precondition ensures that the registers have the same pitch and the same vertical position. The application then downloads the patterns using the utility procedure from the interface module. It sets the map register and initializes the shift register interface object. The stored value causes the circuit to take two clock cycles per character. Now text can be read from disk and written into the in register. A write into the shiftReg register causes 4 shift steps. This process is repeated for the next 4 characters. Finally, the result vector can be read and the matching positions can be reported.

6.2.11 Performance

What is the speed-limiting factor of a pattern matcher application? For the simple case of searching through a text, it is most likely the transfer rate from disk. For modern PCs, this transfer rate is between 3 and 8 MB/s. The 2x4 pattern matcher has a critical path of 45 ns; hence, it could support a throughput of 20 MB/s. The large pattern matcher, with a critical path of 137 ns, could still support a throughput of 7 MB/s. The hardware part of the pattern match application would therefore be fast enough to support the disk transfer rate of today's PCs. On the PC, the Oberon System achieves a disk transfer rate of 4370 KB/s when using block reads and 3204 KB/s when using word reads (4 characters at a time). To evaluate the performance a user experiences, we modified the Find progr
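The throughput figures in Section 6.2.11 follow from the critical path if one assumes that the circuit consumes one character (one byte) per period of that length. That assumption, the decimal megabyte, and the function name below are ours; the two path lengths and the disk rate are taken from the text.

```python
def max_throughput_mb_s(critical_path_ns: float, chars_per_period: float = 1.0) -> float:
    """Upper bound in MB/s (10**6 bytes) for a circuit with the given critical path."""
    periods_per_second = 1e9 / critical_path_ns
    return chars_per_period * periods_per_second / 1e6


small = max_throughput_mb_s(45)    # 2x4 pattern matcher
large = max_throughput_mb_s(137)   # large pattern matcher
print(round(small, 1), round(large, 1))   # 22.2 7.3

DISK_BLOCK_READ_KB_S = 4370        # PC disk rate with block reads, from the text
print(large * 1000 > DISK_BLOCK_READ_KB_S)   # True
```

Under this assumption the bounds come out at about 22 MB/s and 7.3 MB/s, which the text rounds to 20 MB/s and 7 MB/s; both exceed the measured disk rate.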
141. el terminated by a 330/270 Ω resistor pair. Likewise, the fast clock comes from the oscillator, goes to the FPGA and then to the expansion connector, and is terminated by a similar resistor pair.

4.5.5 Host-Coprocessor Communication

Normally, communication is implemented using direct register reads and writes through the XC6200's processor interface. This is the most flexible way, as it allows an arbitrary number of flags to be implemented. The pattern matcher application in Chapter 6 is an example using this method, where the state machine controlling the application is modified using register writes. Two additional simple schemes were devised to control communication between the host application and the application running on the coprocessor.

The first scheme uses two status bits in the decoder: one to initiate a computation (Go) and one to signal its completion (Busy). The Go flag is set or cleared by writing a one or zero to address space 8 (cf. Table 4.1). The Busy flag is inspected by reading from address space 8. Its value is determined by the XCBusy signal generated within the XC6216. These bits free up the data bus and allow for communication with the RC even when the data bus is used by an application, for instance to access local memory.

The second scheme enables independent operation of coprocessor and host by using interrupts. A computation is initiated with the Go flag, but its completion is signaled with
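The first scheme (Go/Busy polling) can be sketched from the host's point of view. Everything below is a simulation: `FakeBoard` is a hypothetical stand-in for the memory-mapped board, and the offset `0x80014` is our reading of the entry for address space 8 in Table 4.1, so it should be treated as an assumption.

```python
GO_BUSY_OFFSET = 0x80014  # assumed offset of address space 8 (Go/Busy), relative to the base


class FakeBoard:
    """Simulated board that stays busy for a fixed number of status reads."""

    def __init__(self, work_polls: int = 3) -> None:
        self._work_polls = work_polls
        self._remaining = 0

    def write(self, offset: int, value: int) -> None:
        if offset == GO_BUSY_OFFSET and value == 1:
            self._remaining = self._work_polls   # Go set: computation starts

    def read(self, offset: int) -> int:
        if offset != GO_BUSY_OFFSET:
            return 0
        busy = self._remaining > 0
        if busy:
            self._remaining -= 1                 # pretend some work got done
        return int(busy)


def run_computation(board: FakeBoard, max_polls: int = 1000) -> int:
    """Set Go, poll Busy until it clears, clear Go; return the number of busy polls."""
    board.write(GO_BUSY_OFFSET, 1)
    polls = 0
    while board.read(GO_BUSY_OFFSET):
        polls += 1
        if polls >= max_polls:
            raise TimeoutError("coprocessor still busy")
    board.write(GO_BUSY_OFFSET, 0)
    return polls


print(run_computation(FakeBoard(work_polls=3)))   # 3
```

Polling keeps the host occupied while the coprocessor works, which is the limitation the interrupt-based second scheme is meant to remove.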
142. ell might be placed where space is found; then the two sources feeding these inputs are placed to the right of the cell and to the right above. The placer described in Section 5.5 uses this approach.

Min-cut: this is a graph partitioning method trying to minimize the number of connections (net cut) between two subgraphs while keeping the sizes of the subgraphs approximately the same [KL70, Bre77, FM82, Kri84]. It starts with an arbitrary partition of the netlist and swaps gates between the partitions if this reduces the net cut. The swapping is repeated until no further reduction can be achieved. A linear-time algorithm using a good heuristic is presented in [FM82], and an improved version is presented in [Kri84]. The total runtime of the algorithm depends on the number of partitions the netlist is divided into.

Simulated annealing: this is the most popular and successful approach to placement [KGV83]. It is based on a model from physics. Annealing is the process by which molecules in a hot gas arrange themselves as the gas is cooled down. The arrangement minimizes the energy stored in the gas. Simulated annealing cools down the "gas" of cells. In the beginning, the cells float around freely and exchange positions with other cells. A grading function, e.g., based on the total length of wiring, is evaluated for a given configuration. If the grade for the current configuration is better than for the previous one, the current configuration is accept
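The accept/reject step of simulated annealing can be sketched for a toy problem: cells placed on a line, graded by total wire length. This is an illustration under assumptions, not a placer from the literature cited above. The exponential acceptance rule is the usual Metropolis criterion, which the text only describes as "a certain probability"; here the temperature only controls that probability and does not shrink the move range; all names and parameter values are hypothetical.

```python
import math
import random


def accept(delta: float, temperature: float, u: float) -> bool:
    """Accept improvements always; accept a worse grade with probability exp(-delta/T).

    `u` is a uniform random sample from [0, 1)."""
    return delta <= 0 or u < math.exp(-delta / temperature)


def wire_length(order, nets):
    """Grading function: sum of distances between connected cells on a line."""
    pos = {cell: i for i, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)


def anneal(order, nets, steps=2000, t0=5.0, cooling=0.995, seed=1):
    rng = random.Random(seed)   # fixed seed: repeated runs give the same placement
    current, best = list(order), list(order)
    temperature = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        candidate = current[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]   # exchange two cells
        delta = wire_length(candidate, nets) - wire_length(current, nets)
        if accept(delta, temperature, rng.random()):
            current = candidate
            if wire_length(current, nets) < wire_length(best, nets):
                best = current[:]
        temperature *= cooling   # lower the temperature after each evaluation
    return best


nets = [(0, 1), (1, 2), (2, 3), (3, 4)]
start = [0, 3, 1, 4, 2]
result = anneal(start, nets)
print(wire_length(start, nets), wire_length(result, nets))
```

Seeding the random number generator is one simple way to avoid the run-to-run differences that the text names as a drawback of randomized placement.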
143. ensive programming techniques to develop the Hades software. At first sight, checked preconditions and assertions in the program code may incur high runtime overheads if they are executed frequently. Also, index checks are believed by some programmers to slow down program execution considerably. To measure these effects, we compiled the whole Trianus and Hades system without index and assertion checks and compiled, placed, and routed the big pattern matcher application from Table 6.5. The difference in runtime was only 5%. Normally, the effect of caching on the performance of software is much larger than that of index and assertion checks. In conclusion, it is our belief that programming with assertions pays off. We have no quantitative measure for how much time was saved during testing, but the source of an error was much more quickly localized by knowing which precondition, assertion, or index check failed than by having to single-step through the code to find where a false calculation was made. Therefore, every programmer should enable index checks and insert assertions at crucial points in the code, provided that the programming language has semantics that allow for index checks (C does not).

5.11.2 Garbage Collection

For Oberon programmers an old hat, but for programmers switching from C and C++ to Java an enlightenment, is the availability of a garbage collector, i.e., of automatic memory reclamation. An extensible system
144. [Figure F.1: Floorplan of Wotan Microprocessor. Labeled blocks include Reg 0 to Reg 7, Add/Sub/Neg, Mul/Div Step, and ALU Logical Operations.]

F.1.2 Arithmetic Logic Unit

The register file consists of 8 registers, each 24 bits wide (Reg 0 to Reg 7 in Figure F.1). On an XC6216, there is not enough room to accommodate 32-bit wide registers in addition to the decoding circuitry. The registers can be loaded with a value coming from the data bus, the instruction register, or the program counter. Operations on the registers include addition, subtraction, shifting, AND, OR, exclusive-OR, negation, and support for multiplication and division steps. Since the XC6200 architecture does not have tri-state buses inside the chip, reading and writing the register file is accomplished through a series of multiplexers. One slice of three registers is shown in Figure F.2. Data lines run horizontally, and control lines run vertically. The multiplexers' select lines are driven by control lines from the decoding circuitry above and below the ALU. They determine which register is allowed to write its value to the x or y bus.

Figure
145. ere is no methodology or tool supporting the partitioning of an application i e to tell what should be realized in hardware and what in software A Syntax of Lola The following is a definition of the syntax of the Lola hardware description language It is given in EBNF notation extended Backus Naur form Identifier Letter Letter Digit 7 Integer Digit Digit Logic Value SOP BasicType BIT TS OC SimpleType BasicType Identifier ExpressionList ExpressionList Expression 1 Expression Type I P Expression SimpleType ConstDeclaration Identifier Expression VarDeclaration IdList Type IdList Identifier Identifier Selector Identifier Integer Expression Expression Factor Identifier Selector Logic Value Integer Factor C Expression MUX Expression Expression Expression SR Expression Expression LATCH Expression Expression REG Expression Expression Expression 5 Term Factor P DIV MOD Factor Expression Term Term Assignment Identifier Selector Condition Expression Condition Expression Relation Expression lt
146. es. Attached to each side of that array is 1 MB of SRAM, for a total of 4 MB of local storage. The host computer is a DEC 3000 workstation with a TURBOchannel bus interface capable of delivering 100 MB/s.

Software: Applications are described in Modula-2, Lisp, or C using proprietary tools. Placement is done by hand (or rather, by program statements). Partitioning the design onto the 16 available FPGAs is also done by hand. The commercial tools are only used to route the netlists of individual FPGAs and to generate the configuration bitstreams. We estimate the turnaround time of the tools to be in the range of tens of minutes.

7.1.2 Splash

Developed at the Supercomputing Research Center in Maryland, the Splash 1 and Splash 2 custom computers were among the earliest systems of their kind [GHK90, ABD92]. Splash 2 consists of up to 16 boards, each containing 16 Xilinx XC4010 FPGAs for computation, for a total of 2.5 million logic gate equivalents. The FPGAs are connected to each other via a serial path and to a 16x16 crossbar switch. In addition, each FPGA has access to 512 KB of SRAM. The host computer is a Sun SPARC 2. Data can be moved to and from Splash 2 at a rate of 50 MB/s.

Software: VHDL is used to write applications. Partitioning a design onto the multiple FPGAs is done manually. With a maximum of 256 FPGAs used for computation and 16 for communication, this is not a practical approach. Logic synthesis tools from Sy
147. es which can be connected to the upper input multiplexer of d. Therefore, seven elements are inserted into the priority queue of further wave spreading points: three elements with equal weight (namely the south output multiplexer to the north of d, the west output multiplexer to the east of d, and the east output of the switch to the west of d), and four entries with higher weight, since these resources are rarer than neighbor routing resources (namely the south output for length-4 FastLANE of the switch to the north, and the corresponding length-4 FastLANE outputs to the south, east, and west of d). The algorithm then proceeds by removing the first element from the priority queue (5), say the entry for the east output of the switch west of d, and by spreading the wave (7-9) at that position, since it is not a destination (6). Further entries will be put into the priority queue for all possible sources of that neighbor multiplexer at the switch, which are the north and south neighbor multiplexers in column 3, the east multiplexer in column 2, the length-4 FastLANE routing resources in east and west direction, and the length-16 FastLANE routing resource in east direction. All these entries have their according weights, but increased by one compared to the ones before, because these entries are farther away from d. The remaining entries

Program 5.17: Routing of a Net

PROCEDURE FindPath r src dst input inst VAR done route from src to
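The weighted wave expansion described above (cheapest entry removed first, weights growing with distance from d) behaves like a uniform-cost search on a graph of routing resources. The sketch below shows that general technique with Python's `heapq`; it is not a transcription of Program 5.17, and the node names and weights (1 for a neighbor hop, 3 for a long-line hop) are hypothetical.

```python
import heapq


def cheapest_route(graph, start, goal):
    """Return (total weight, path) of the cheapest route, or None if unreachable.

    `graph` maps a node to a list of (next_node, weight) pairs."""
    queue = [(0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)   # cheapest pending entry first
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph.get(node, ()):
            if nxt not in settled:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return None


# Expansion starts at the destination cell c4 and searches for the source c0.
resources = {
    "c4": [("c3", 1), ("c0", 3)],   # neighbor hop, or one rarer long-line hop
    "c3": [("c2", 1)],
    "c2": [("c1", 1)],
    "c1": [("c0", 1)],
}
print(cheapest_route(resources, "c4", "c0"))   # (3, ['c4', 'c0'])
```

Giving rarer resources a higher weight, as the text describes, makes the search prefer neighbor connections for short nets while still allowing a long line when it is cheaper overall.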
148. es and its implementation is sound and well tested 3 Foundations Lola and Trianus 26 Program 3 3 Ripple Carry Adder in VHDL entity AddElem is interface port X y ci in bit s co out bit end architecture behavior of AddElem is one implementation signal h bit begin h lt x xor y s lt h xor ci co lt x and y or h and ci end behavior entity Adder is interface generic n natural 4 port x y in bit vector n 1 downto 0 ci in bit s out bit vector n 1 downto 0 co out bit end architecture structure of Adder is one implementation component AddElem declare used entities port x y ci in bit interface must be repeated S co out bit end component signal c bit vector n downto 0 auxiliary carries begin c 0 lt cin gen for i in 0 to n 1 generate ae AddElem port map unit assignment x gt x 1 y gt y i ci gt c i s gt s i co gt c i l end generate cout lt c n end structure 3 Foundations Lola and Trianus 27 Program 3 4 Ripple Carry Adder in Verilog module AddElem x y ci s co input X y ci output s co wire s CO define output pin types wire h assign h x y assign s h ci assign co x amp y h amp ci endmodule module Adder x y ci s co input 7 0 x y input ci output 7 0 s output co wire 7 0 s define output pin types wire CO define output pin types wire 6 0 c auxil
149. es data bus CPUDEN XCSel RAMSel PortSel CPU drives address bus CPUAEN XCSel RAMSel XC may drive data bus XCDOE oe XCSel RAMSel PortSel CPURW XC may drive address bus XCAOE oe XCSel RAMSel END DecoderXCCtrl. 4 Hades Hardware. Line Drivers and Address Comparator. The host interface comprises four bidirectional line drivers (645 in Figure 4.4) for interfacing the 32-bit data bus, two unidirectional line drivers (541) for driving the 16-bit address bus, a 12-bit address comparator (679), and three PALs (22V10) realizing decoding and control logic. Five address lines (A19, A18, A4, A3, A2) are connected to the decoder directly, defining different regions in the address space as defined in Table 4.1. The listed addresses are relative to a base address, which is determined by the address comparator. The direction of the data line drivers is controlled by signal CPURW of the Ceres bus (see Figure 4.5 for timing information). The enable signals for the address and data line drivers are generated by the decoder (CPUAEN, CPUDEN). The address comparator requires the 12 address lines to be in a specific sequence, namely address lines that are one followed by lines that are zero when active. In our case, the RC board lies between FEC00000H and FECFFFFFH, so the highest 12 bits of the address to decode are FECH. The sequence A31..A25, A23..A20, A24 is used to d
150. esses a stream of regular data, such as a pixel bitmap, and applies simple operations to it, such as convolution, it can most likely be sped up using hardware. To make the hardware part of an application usable and accessible to the software, a driver must be written for it. Ideally, this driver should be generated automatically, such that a software programmer can simply invoke interface procedures to communicate with the hardware. In some cases, however, it is necessary that the driver module be adapted and extended to specific needs. Hades eases this task in that it can automatically generate a driver module which abstracts from the low-level hardware details of the application. The software programmer can then use this driver and augment it with more powerful interface procedures. 6.2 Pattern Matching Application. Pattern matching is an important application area of computers. Many applications of reconfigurable hardware exist in the field of image recognition and classification [CAC96, Guc95, VSC96]. These applications often run in phases, where the hardware is reconfigured between the phases, thus reusing the available silicon. 6 Application and Evaluation. One type of pattern matching is text searching [Ber93, PTS93, VBR96]. For example, a text editor has a function for searching for a pattern in a text; most operating systems provide a tool for searching a pattern in files (e.g., grep in Unix, Find in Oberon). And with the wide
151. et was 1.7 billion US dollars, of which the FPGA market was 716 million US dollars. Several different FPGA architectures from many vendors compete in that market. The leader is Xilinx with its 2000, 3000, 4000, 5200, and 6200 architectures. Other vendors include Actel, Altera (MAX, FLEX), Atmel/Concurrent Logic (6000), AT&T/Lucent (Orca), Lattice (ispLSI), Motorola/Pilkington (MPA), and Quicklogic (pASIC), all of them American. 2.2 General Structure. A field-programmable gate array consists of programmable logic cells containing function units and registers, a programmable routing network, and programmable input and output (I/O) cells. The routing network connects logic cells with each other and with the I/O cells. Figure 2.1 gives an overview of an FPGA. Figure 2.1: General FPGA Structure (labels: Logic Cell, Routing Network, I/O Cell). In many FPGAs, programmability is achieved through SRAM cells, which are intermixed on the chip with the logic cells and the routing network. SRAM-based FPGAs can be implemented using a standard CMOS process. Conceptually, the SRAM cell layer lies beneath the logic and routing layers, controlling switches in the latter (Figure 2.3). The switches, which are implemented as pass gates using n-type transistors, determine the funct
152. f first note for Program 5.14. 5.6.6 Ripup. If a net cannot be routed, it is necessary to rip up (unroute) certain nets to make room for a different route. This task is completely left to the user; no attempts are made to rip up nets automatically. The user can rip up the whole design, only the top level (all nets in the module), only instances of a selected type, or individual nets. Ripping up a type is an operation accomplished in two phases: in the first phase, a type broadcast is used to mark the wires that need to be deleted in the type itself and all its instances, and in the second phase, all marked wires in the design are removed. Ripup should only be used during interactive routing to find a routing schedule for the design. The schedule should then be recorded in a script and be associated with the design. 5.6.7 Discussion. The router is the most crucial piece of software in Hades. It relieves the user from the most cumbersome task required for layout synthesis, namely connecting gates. We extended the router of [Pfi92] to handle the newly available routing resources of the XC6200. Hence, we had a working router after a short time. However, it turned out to be quite a large piece of software, as all the special cases of the XC6200 routing architecture have to be considered. For instance, at a FastLANE switch, the output of the length-4 FastLANE multiplexer is dependent on the output of the neighbor multiplexer (shared routing multiplexer
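The two-phase ripup of a type described in Section 5.6.6 can be sketched as a mark-and-sweep pass. This is a minimal illustration in Python, not the Hades implementation: the `Wire` record, its `owner_type` field, and the function name `ripup_type` are hypothetical, and the type broadcast is reduced to a simple loop over all wires.

```python
from dataclasses import dataclass


@dataclass
class Wire:
    """Hypothetical routed wire; owner_type names the type it belongs to."""
    net: str
    owner_type: str
    marked: bool = False


def ripup_type(wires, type_name):
    """Rip up all wires of one type and return the wires that remain."""
    # Phase 1: mark every wire belonging to the type or one of its instances
    # (standing in for the type broadcast mentioned in the text).
    for wire in wires:
        if wire.owner_type == type_name:
            wire.marked = True
    # Phase 2: remove all marked wires from the design.
    return [wire for wire in wires if not wire.marked]
```

Separating marking from removal means the removal pass never has to know why a wire was selected, which is presumably what allows the same sweep to serve the other ripup granularities listed in the text.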
153. f the circuit This is accomplished by a utility procedure which produces a textual representation of the placement information contained in a displayed layout When ap 5 Hades Software 81 TYPE Examples N Shift Right Reg IN in BIT OUT out BIT VAR s N BIT BEGIN 5 0 REG in FOR i 1 N 1 DO si REG s i 1 END out S N 1 END Examples TYPE Example6 N Shift Left Reg IN in BIT OUT out BIT VAR s N BIT BEGIN sIN 11 REG in FOR i 0 N 2 DO si REG s i 1 END out 5 0 END Example6 pss 0 4 P t Wad zii Um Tu Figure 5 14 Shift Register Example 5 Hades Software 82 TYPE Example8 N ParallelToSerial IN in N BIT read BIT OUT out BIT VAR s N BIT BEGIN s N 1 REG read in N 1 FOR i 0 N 2 DO s i REG MUX read s i 1 in i END out 5 0 END Example8 07 dot ind i Ea ar ar Ha Figure 5 15 Parallel to Serial Converter TYPE CountElem Counter Element IN ci BIT OUT q co BIT BEGIN q REG q ci co q ci END CountElem TYPE Counter N N bit Up Counter IN ci BIT OUT q N BIT co BIT VAR c N CountElem BEGIN cO ci q 0 cO q FOR i 1 N 1 DO ci c i 1
154. fluence the algorithm in various ways (used routing resources, direction of routing). Adaptability: the algorithm must be adaptable to future routing architectures. A maze-running router finds a path between two cells (terminals) by considering all possible paths between the two and choosing the cheapest one. The cost of a path is a combination of distance and used routing resources. Rare routing resources cost more, e.g., a neighbor wire is cheaper than a length-4 FastLANE wire, which is cheaper than a length-16 FastLANE wire, etc. Also, paths with bends, i.e., changes of direction, cost more than straight paths. As the router enumerates the paths with incrementing costs, it will find the cheapest path first and does not consider more expensive paths which also connect the two terminals. Still, as all possible paths must be considered, the runtime is quadratic in the distance between the terminals. Various tricks can be used to limit this quadratic behavior, the effects of which are discussed in Chapter 6. 5.6.1 The Algorithm. The router takes as input a placed design and routes all connections between cells, also called nets. It finds a path from the source of a signal to all destinations of that signal. In the case of the XC6200, it finds a path from cells generating a signal to the cells using that signal at their inputs. It does this by routing connections point to point, i.e., nets with two terminals. If a net has more than two terminals, such as
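The cost-ordered path enumeration described above can be illustrated with a small priority-queue search on a plain grid. This is only a sketch under simplifying assumptions: every step costs 1, a change of direction adds a hypothetical `bend_cost`, and the XC6200-specific resources (neighbor wires, length-4 and length-16 FastLANEs) are not modeled. The function name `route` and its parameters are illustrative and not taken from Hades.

```python
import heapq
from itertools import count


def route(width, height, src, dst, blocked=frozenset(), bend_cost=1):
    """Return (cost, path) of the cheapest route from src to dst, or None."""
    moves = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    tie = count()  # tie-breaker so the heap never compares paths
    # Each entry: (cost so far, tie, cell, incoming direction, path).
    queue = [(0, next(tie), src, None, [src])]
    settled = set()
    while queue:
        cost, _, cell, direction, path = heapq.heappop(queue)
        if cell == dst:
            return cost, path  # entries leave the queue in order of cost
        if (cell, direction) in settled:
            continue
        settled.add((cell, direction))
        for move in moves:
            nxt = (cell[0] + move[0], cell[1] + move[1])
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if nxt in blocked:
                continue
            # A bend (change of direction) costs extra; the first step is free.
            bend = bend_cost if direction not in (None, move) else 0
            heapq.heappush(
                queue, (cost + 1 + bend, next(tie), nxt, move, path + [nxt])
            )
    return None
```

Because entries are removed in order of increasing cost, the first time the destination is taken from the queue its path is the cheapest one, which mirrors the behavior the text describes: more expensive paths connecting the same terminals are never examined.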
155. for instance the chip select signal. This feature makes it possible for a chip to generate its own control signals, thereby making external control logic obsolete. 2 Field Programmable Gate Arrays. Figure 2.7: XC6200 Length-4 FastLANEs. The input/output architecture is discussed in more detail in [Xil96]. 2.3.4 Programming Interface. From a host processor, the XC6200 FPGA can be accessed like a conventional SRAM using data, address, and control signals (Figure 2.8). Figure 2.8: XC6200 Logic Symbol (labels: Adr, Data, Ctrl, User I/O). Simple memory-mapped reads and writes using an up to 32-bit wide data bus are used to configure the chip and to access the values of cells. Using this fast interface at a clock rate of 33 MHz, it is possible to fully configure an XC6216 with 64 x 64 cells in 270 us. The chip can also be partially reconfigured, down to a single configuration bit. Applications of this partial reconfigurability will be discussed in Chapter 6. Special registers are provided for the quick configuration of multiple rows and columns of cells
156. found in a device description file to the global input and output signals The global input and output signals are normally implemented as pins of the FPGA In the XC6200 however input signals not occurring in the device description are implemented using buried input registers accessible through the coprocessor interface of the XC6200 Likewise outputs without a description are treated as buried outputs only accessible through the coprocessor interface Note that these do not have to be implemented as registers as the coprocessor interface can also read the state of the combinational output of a cell Program 5 6 shows an example of such buried IOs and Figure 5 9 shows the corresponding layout The interface generator described in Section 5 8 represents these buried inputs and outputs as interface variables to the software programmer 5 Hades Software 69 Program 5 4 Input Variables and Scopes TYPE AddElem IN x y ci BIT OUT s co BIT VAR hi BIT BEGIN h x y s h ci co MUX h x ci END AddElem TYPE Adder IN x y 8 BIT ci BIT OUT s 8 BIT co BIT VAR add 8 AddElem BEGIN add 0 x 0 y 0 ci FORi 1 7DO add i x i y i add i 1 co END FOR i 0 7 DO si add i s END END Adder first reference to x first reference to ci duplicate x ci x 0 ci have to be duplicated x i has to be duplicated co not Program 5 5 Anonymous Expressions TYPE AddElem IN x y ci BIT rest
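The AddElem type of Program 5.4 computes the carry with a multiplexer (`co` from `MUX h x ci`) rather than with the AND/OR form used in the VHDL and Verilog versions of Chapter 3. The following Python sketch checks that both forms agree. It assumes that Lola's MUX selects its second argument when the select input is 0 and its third when it is 1; the helper names `mux`, `add_elem`, and `adder` are illustrative.

```python
from itertools import product


def mux(sel, a, b):
    """Assumed MUX semantics: a when sel is 0, b when sel is 1."""
    return b if sel else a


def add_elem(x, y, ci):
    """One adder element: returns (sum bit, carry out)."""
    h = x ^ y
    s = h ^ ci
    co = mux(h, x, ci)  # equals (x AND y) OR (h AND ci)
    return s, co


def adder(x, y, ci, n=8):
    """Ripple-carry adder built from n chained adder elements."""
    total, carry = 0, ci
    for i in range(n):
        s, carry = add_elem((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry


# Exhaustive check of the single-bit element against integer addition.
for x, y, ci in product((0, 1), repeat=3):
    s, co = add_elem(x, y, ci)
    assert s + 2 * co == x + y + ci
```

When `h` is 0 the two operand bits are equal, so the carry is simply `x`; when `h` is 1 exactly one operand bit is set and the carry follows `ci`. That is why a single multiplexer is enough for the carry.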
157. g driver software for the respective host operating system. The downside, of course, is that such a setup is only possible in a research environment. Months Later. Just one and a half years later, we have gained enough experience to suggest different design alternatives. One would be to include a programmable clock chip. The other would be the use of synchronous SRAMs to implement local memory. Also, implementing the decoder in one large PAL like a MACH211 [AMD95] or an ispGAL [Lat96] would ease maintenance. Today, we would also feel confident to target a more sophisticated bus such as PCI, which would improve performance of data transfers to and from the host considerably, at the cost of more complicated driver software required by commercial operating systems. 5 Hades Software. In this chapter, we present the Hades software. It consists of layout synthesis software comprising a technology mapper, a placer, a floor planner, and a router. In addition, a bitstream generator, a loader, and a runtime system are provided (cf. Figure 5.1). The Hades software is based on the Lola hardware description language and the Trianus framework for digital circuit design (cf. Chapter 3 for an introduction). An overview of the Hades software is given in [GL96]. Figure 5.1 labels: Compiler, Technology Mapper, adjustments, Layout Synthesis, Place & Route, Layout Editor, Chip Programming, Download & Device Driver, Runtime
158. g layout of the automatic placement algorithm of two fundamental circuits: the binary up-counter shown in Figure 5.16 and the ripple-carry adder shown in Figure 5.17. Depending on the routing of the input signals, the default layout of the adder might be the desired one, although a routable layout with a bounding box of three by one cells can be accomplished manually. Most often, the desired bounding box of the counter should be two by one cells instead of one by two. Hence, the counter must be placed using hints. Of course, the layout and routing for such fundamental circuits should be defined once in the form of a library component, which can be imported into user programs. This concludes the presentation of the layouts produced by the placement algorithm. A set of larger examples can be found in Chapter 6 and Appendix F. There, a coprocessor application and a microprocessor are developed and placed with Hades, and the limitations of the algorithm when applied to large designs can be clearly seen. 5.5.8 Manual Placement and Back-Annotation of Position Information. When the design placement is not satisfactory, the layout editor is used to improve the placement manually. For instance, the placement of a type such as the one in Figure 5.12 might be optimized manually. Since most users would like this additional placement information to be associated with the description of the logic itself, it must be back-annotated into the Lola HDL description o
159. gether to form a comparator of size PatternSize. Such an AND chain is started at the highest character and flows down to the lowest character. 3. A result vector in the form of a FIFO queue is used to store the results of comparisons. This is an optimization to avoid polling the match variable by the software driver. 4. Variable result is only needed to make the variables patMatch[NofPatterns-1] and queue[0].q to queue[ResultSize-1].q accessible under one name (cf. Section 6.2.10). 5. A shift register suffices to implement the control logic of the pattern matcher (shift): the lowest bit of the register is connected to the load enable signals of the various registers. A one in the lowest bit of the shift register causes the data stream to advance by one position. Between ones, zeroes can be inserted if additional time is needed between shift steps to propagate signals. The Hades placer produces the layout shown in Figure 6.8. Note the long chain of registers in the lower right. This is the shift register, which is placed horizontally due to the fact that the lower register reads the output of the upper register. This effect on placement was already shown and explained in Figure 5.14 of Section 5.5.7. We see that the instances are placed close together, and hence the layout is not completely routable. As is shown in Table 6.3, this layout results in 38 unrouted nets.
160. ginally designed for hardware simulation. The dynamic aspects are defined by the way a simulator works. A hardware synthesizer must make an interpretation of the described constructs and map this into hardware with equivalent behavior. Some language features are solely used for simulation and cannot be implemented directly. Therefore, one often speaks of synthesizable VHDL or Verilog, which are subsets of the language definition. Compared to Lola, both languages are much more complicated and support more features. VHDL and Verilog support operator overloading, so the adder in Program 3.2 might be written as D := x + y, and the actual implementation of the adder is left to the hardware synthesizer. Also, as both languages are used for simulation, a signal may carry more values than just zero and one, which sometimes leads to code that can be simulated but not synthesized. In VHDL, the interface (entity) and the implementation (architecture) of a type are textually separated. An entity may have multiple architectures, e.g., an adder circuit may be implemented using a ripple-carry or a carry-lookahead scheme. The types from Program 3.1 would look like Program 3.3 in VHDL. Due to the separation of interface and implementation, the designer has the possibility to provide multiple implementations for the same interface. For libraries this is a welcome feature, but for application code this approach leads to verb
161. gineering, whatever is called magic should be met with suspicion. In our experience, almost all designs are routable after some iterations in the placement phase. Certain nets may need to be prerouted by the user, but almost always this solves a routing or performance problem satisfactorily. The quick response of the router helps in making an iterative design style viable. On a contemporary computer, the user seldom experiences delays longer than one minute (cf. Section 6). 5.7 Bitstream Generator and Loader. Once a design is successfully placed and routed, the configuration bits for the SRAM of the XC6200 can be generated. The configuration data of an FPGA is normally called bitstream, and we will use this term from now on. The XC6200 is the first FPGA architecture from Xilinx whose bitstream format is made public. Hence, it is possible for third-party vendors and universities to develop their own bitstream generator and associated drivers for downloading the bitstream to the FPGA. We developed a board-independent bitstream generator and a board-dependent driver module for the Hades coprocessor board. The separation into board-dependent and board-independent parts is important, as it allows us to port the Hades software onto a different XC6200 coprocessor board by only writing a new driver module. 5.7.1 Bitstream Generator Algorithm. The bitstream generator takes as input a fully placed and routed design. It issues a broadcast to all placed nodes
162. grams written in a high-level programming language are portable across different hardware architectures. In many cases, one is willing to sacrifice a factor of 1.5 to 2 in performance to gain portability. Likewise, hardware synthesis tools abstract from the target technology and map to them using elaborate algorithms. This results in slow tools, even when a lower-level design style is chosen. A good (or bad) example of this is the compilation of a structural VHDL description of a one-dimensional discrete cosine transform, taking three hours for just synthesizing the netlist. It can then be placed using hints and routed using the vendor's tools in just five minutes [WCG96, Woo96a]. We compiled the Lola version using Trianus in under a second. When hardware is described in behavioral style, current synthesis tools often fail to achieve good layout densities, or they fail to meet the performance requirements. Therefore, we believe that a structural, that is, a fairly low-level, description of the hardware is necessary. This should result in good performance and especially in transparency for the designer: What You Describe Is What Is Synthesized. Note that writing target-specific HDL code to achieve good performance is also recommended for VHDL, for instance by [GS95]. Using a structural style to describe the basic components, we can then compose our circuit using elements of libraries, much like we use procedures from modules to build software systems. Th
163. h 1 v p min V start with lowest v coord END ELSE simple object p dir Vertical pitch p pitch default pitch used by placer v p minV start with lowest v coord END continued in Program 5 9 5 Hades Software 74 Program 5 9 Placement of Arrays II continued from Program 5 8 REPEAT thisW 0 thisH 0 PlaceObject p obj u v thisW thisH IF index lt 1 amp obj IS TriBase Instance THEN adjust prediction for simple objects obj 1 sometimes has different size than obj 0 IF thisH gt thisW THEN p dir Horizontal pitch Max thisW p pitch ELSE p dir Vertical pitch Max thisH p pitch END END adjust u v w h according to thisW thisH p dir direction INC index obj next in same dimension UNTIL obi NIL END represent for instance an N bit wide loadable register or an N bit comparator circuit Such simple array elements most often fit into one cell and no type is declared for them For the special case where the first element with index 0 of such an array has a different size than the next one the algorithm adjusts the placement This guarantees that the whole array is placed in the desired manner Such a case could occur for the lowest bit of a counter which consists of a toggle register whereas other bits of the counter consist of an XOR gate with a register and an AND gate 5 5 4 Placement of Instances Instances are composed of other arrays of instances and
164. he high transistor count on an integrated circuit is the ability to produce configurable hardware. In configurable hardware, some transistors are wasted to implement configuration memory and routing switches instead of active logic circuitry. This memory is used to control connections between various transistors on the chip. The immediate advantage of configurable hardware is the ability to implement different functions in a chip at very low cost. It is no longer necessary to implement a digital circuit using a silicon process. Instead, configurable hardware brings the silicon foundry to the desktop. Traditionally, the term configurable hardware is associated with Programmable Logic Devices (PLDs) such as Programmable Array Logic (PAL). PALs consist of a programmable AND matrix and a fixed OR matrix. One great advantage of PALs is their predictable signal delays. The configuration memory of these devices is usually implemented by erasable ROM cells. They can be programmed many times, but one often needs special programming machinery, and the number of reprogramming steps is limited. Only recently have in-system programmable PALs been introduced [Lat96]. In 1985, Xilinx Inc. introduced a new class of programmable logic devices: the first commercial Field-Programmable Gate Array (FPGA). As the name suggests, FPGAs were conceived as a replacement for Mask-Programmable Gate Arrays (MPGAs). The configuration memory is implemented by static RAM (SRAM)
165. he map register only has to be set once for accessing these registers (cf. Section 6.2.10). The worst-case delay for the critical path of this circuit is 45 ns; the path runs from the input register through the mapper circuit to the data register. Our Hades coprocessor is clocked with the Ceres clock at 25 MHz; therefore, we need two cycles (80 ns) to meet this timing constraint. The shift register of the control logic should therefore be loaded with the pattern 1 0 1 0 1 0 1. 6.2.9 Large Pattern Matcher. The Lola code is written in a way such that, by changing the constants NofPatterns and PatternSize, we can produce a large pattern matcher circuit with 16 patterns, each of length 12. The fully placed and routed layout is shown in Figure 6.10. Its characteristic data are listed in Table 6.5 in the top left corner. The bottom row in Figure 6.10 contains the OR gates that are used to form the match variable, which indicates if any of the patterns matched. Figure 6.10: Large Pattern Matcher with Placement Hints. Since the last OR gate represents the first bit of the result vector, it i
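The timing argument above is simple arithmetic: a 25 MHz clock has a 40 ns period, so a 45 ns critical path needs two cycles (80 ns), and the control shift register therefore carries one zero between successive ones. The sketch below reproduces that reasoning in Python; the function names `cycles_needed` and `control_pattern` are illustrative only, and the number of shift steps (four) is chosen merely to match the pattern quoted in the text.

```python
import math


def cycles_needed(delay_ns, clock_mhz):
    """Number of whole clock cycles that cover a combinational delay."""
    period_ns = 1000.0 / clock_mhz
    return math.ceil(delay_ns / period_ns)


def control_pattern(cycles, steps):
    """Shift-register contents: one 1 per shift step, padded with zeros."""
    pattern = []
    for step in range(steps):
        if step:
            pattern.extend([0] * (cycles - 1))  # wait cycles between steps
        pattern.append(1)
    return pattern


cycles = cycles_needed(45, 25)          # 45 ns path, 25 MHz Ceres clock
print(cycles, control_pattern(cycles, 4))
```

With a faster critical path (40 ns or less at the same clock), `cycles_needed` would return 1 and the pattern would consist of ones only.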
166. he non-deterministic results produced and the long runtimes. Since we rely on the designer to give placement hints (one reason being that the designer will have to give placement hints anyway to achieve a satisfactory layout [WCG96]), we must use a placement algorithm that produces the same result for the same input. The approach taken by the Hades placer is similar to the one described in [MD95]. 5.5.1 The Algorithm. The placer takes as input a mapped Trianus data structure, sorts the type hierarchy topologically (innermost first), then places each type and propagates the placement information to its instances using a type broadcast. Each type is placed into an empty chip of equal size as the target chip. After the types and instances included therein have been placed, the output buffers inserted by the mapper are placed near the corresponding output pins, and all expression trees and instances occurring in the module are placed into the actual chip. The overall structure is given in pseudocode form in Program 5.7. Program 5.7: Overview of Placement Algorithm. PROCEDURE PlaceModule module TriBase InitID module unmark all nodes and wires msg doNode UpdateInstances used to copy placement information TriBase2 TopoSort module list topological sort WHILE list NIL DO type list type type id Placed mark type as placed NewPlacer p initialize placer data structure PlaceDescendants p type place all signals and instance
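The "innermost first" ordering of the type hierarchy is a post-order topological sort: a type may only be placed once every type it instantiates has been placed. The Python sketch below shows one way to obtain such an order. It is an illustration under assumptions, not the TopoSort procedure of Trianus: the hierarchy is given as a hypothetical dictionary `uses` mapping each type name to the types it instantiates.

```python
def placement_order(uses):
    """Return type names ordered so that inner types precede their users."""
    order, state = [], {}

    def visit(type_name):
        if state.get(type_name) == "done":
            return
        if state.get(type_name) == "active":
            raise ValueError("recursive type hierarchy at " + type_name)
        state[type_name] = "active"
        for inner in uses.get(type_name, ()):
            visit(inner)            # place everything this type contains first
        state[type_name] = "done"
        order.append(type_name)     # all inner types are already in the list

    for type_name in uses:
        visit(type_name)
    return order


hierarchy = {
    "Top": ["Adder", "Counter"],
    "Adder": ["AddElem"],
    "Counter": ["CountElem"],
    "AddElem": [],
    "CountElem": [],
}
print(placement_order(hierarchy))
```

The order is deterministic for a given input, which fits the requirement stated above that the placer produce the same result for the same input.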
167. hine. However, several two-input gates and the multiplexer function cannot be implemented this way. They store the inverted value of the function into the register; hence, they implement F = a op REG(F). Xilinx realized this problem only recently and removed these classes of functions from the data sheet. Inversions and Input/Output Blocks. The biggest problem with inversion compensation happens in input/output blocks (IOBs). While inversions on input signals to a cell can be compensated easily with the Y2 and Y3 multiplexers, there is no optional inversion possible in an IOB. Hence, it is not possible to compensate an inversion caused by the routing multiplexers, and with it the polarity of the signal on the pad to the outside world might not be correct [Xil96]. We solved this problem by requiring the presence of a buffer cell right next to the IOB (cf. Section 5.4). By configuring the buffer cell accordingly, a possible inversion can be compensated. Note, however, that this solution adds additional delay to all output signals. Another solution would be to invert the signal at its source, but this is not a satisfactory solution to the user, as the inverted value of what is expected would be read back during a state access through the processor interface. Also, if the source is connected to another IOB, it might even be impossible to compensate the inversion. It is interesting, or rather sad, to report that programmable inversions in the IO buffers
168. hitectures for Field Programmable Gate Arrays: A Case Study. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996. R. Woods, S. Ludwig, J. Heron, D. Trainor, S. Gehring. FPGA Synthesis on the XC6200 Using IRIS and Trianus/Hades (or from Heaven to Hell and Back Again). Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1997. Xilinx. The Programmable Logic Data Book. September 1996. R. Zimmermann, H. Käslin. Cell-Based Multilevel Carry-Increment Adders with Minimal AT- and PT-Products. To be published in IEEE Trans. on VLSI Systems. Zuken-Redac. CadStar for Windows. 1995. Curriculum Vitae. Stefan Hans Melchior Ludwig. May 21, 1966: born in Zürich, Switzerland; citizen of Schiers, Graubünden; son of Donat Dietegen Ludwig and Marlene Ludwig-Märki. 1985: Matura Typus B, Kantonsschule Freudenberg, Zürich. 1991: Diploma in Computer Science, Swiss Federal Institute of Technology Zurich (ETH Zürich). 1991-1997: Research and teaching assistant in the research group of Prof. Dr. N. Wirth, Institute for Computer Systems, Swiss Federal Institute of Technology Zurich (ETH Zürich). 1997: Member of research staff, Systems Research Center, Digital Equipment Corporation, Palo Alto, California.
169. hole process of inversion compensation is tricky and cumbersome to get right. There is no systematic structure to the presence or absence of inversions on routing multiplexers, so a large table for all possible cases has to be consulted. Again, a low-level detail of the hardware makes the software complicated. Figure 5.22: XC6200 Function Unit. Figure 5.23: Inversions on Inputs. Inversion Before the Register. A further complication arises from the inversion at the output of the RP Mux shown in Figure 5.22. If the cell should implement F = REG(a AND b), the output of the central multiplexer is inverted before being read by the register. Should the register see the values of an AND gate on its input, the central multiplexer must implement a NAND gate to compensate for the inversion at the RP Mux. The inversion on the RP Mux is also the reason that the cell cannot implement all possible functions of type F = a op REG(F). This type of cell would implement a Moore-type state mac
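The compensation rule for the register input can be checked with a truth table: if the RP Mux inverts whatever the central multiplexer produces, the central multiplexer has to be configured with the complement of the desired function. The Python sketch below illustrates this for the AND example from the text. The function names are illustrative, and the model covers only this single inversion, not the table of routing-multiplexer inversions mentioned above.

```python
from itertools import product


def rp_mux(value):
    """Model of the inverting RP Mux output in front of the register."""
    return 1 - value


def compensate(func):
    """Function the central multiplexer must implement so that the
    register sees func after the RP Mux inversion."""
    return lambda a, b: 1 - func(a, b)


def and_gate(a, b):
    return a & b


central = compensate(and_gate)   # this is a NAND gate

for a, b in product((0, 1), repeat=2):
    # The register input equals the intended AND for every input pair.
    assert rp_mux(central(a, b)) == and_gate(a, b)
```

Applying `compensate` to AND yields NAND, which agrees with the statement that the central multiplexer must implement a NAND gate when the register should see an AND.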
170. iary carries
// instantiate eight adder elements
AddElem bit0 (.x(x[0]), .y(y[0]), .s(s[0]), .ci(ci), .co(c[0]));
AddElem bit1 (.x(x[1]), .y(y[1]), .s(s[1]), .ci(c[0]), .co(c[1]));
AddElem bit2 (.x(x[2]), .y(y[2]), .s(s[2]), .ci(c[1]), .co(c[2]));
AddElem bit3 (.x(x[3]), .y(y[3]), .s(s[3]), .ci(c[2]), .co(c[3]));
AddElem bit4 (.x(x[4]), .y(y[4]), .s(s[4]), .ci(c[3]), .co(c[4]));
AddElem bit5 (.x(x[5]), .y(y[5]), .s(s[5]), .ci(c[4]), .co(c[5]));
AddElem bit6 (.x(x[6]), .y(y[6]), .s(s[6]), .ci(c[5]), .co(c[6]));
AddElem bit7 (.x(x[7]), .y(y[7]), .s(s[7]), .ci(c[6]), .co(co));
endmodule
3.3.1 Motivation and Structure. The Trianus project tries to improve performance of hardware design tools by tightly integrating them through one common data structure. It decomposes the tools into a front end and several back ends, all integrated through a framework. Trianus features a circuit checker, with which different representations of the same circuit can be checked for equivalence (e.g., a hand layout can be compared with a Lola specification), and a browser, which can extract a textual view from a layout or a schematic. The result is the Trianus framework for FPGA design [Geh97, GL96]. It consists of an architecture-independent front end for which several architecture-dependent back ends can be developed. The front end encapsulates common operations on design data (Section 3.3.2), an HDL interpreter back end (Section 3.3.4), a c
171. ickly translate into a layout. This allows a different design strategy to be employed, in that it removes the designer from the low-level design flow. For example, if the designer finds the required target performance has not been met at the circuit layout level, he or she can go back to Iris and apply some of the many circuit transformations available. The option of adding pipelining delays is particularly well suited for FPGA designs, as it can sometimes be implemented at no extra cost. The key issue is that circuit optimization is being performed at the algorithmic and architectural level, which is less time-consuming than varying placement and routing at the FPGA level. The resulting system is presented in more detail in [WLH97]. 6.4.2 Developing Arithmetic Circuits for the XC6200. During a term project at the Institute for Integrated Systems at ETH Zürich, P. Müller implemented various adder structures for the XC6200 FPGA. He described the adders in Lola and then used Hades to place and route the designs. Since adders are fundamental circuits which must be optimally placed, hints were used to achieve good placements. However, it was possible to write the Lola code in a parameterized fashion, such that adders of arbitrary sizes can be defined. The router was used interactively, and often single nets were prerouted to guide the routing of subsequent nets. A 32-bit carry-increment adder [ZK97] with a delay of 39 ns an
172. ignal X7. All other multiplexers are controlled by SRAM configuration bits, denoted by shaded boxes attached to the bottom. The signals on X1, X2, and X3 are determined by multiplexers selecting from eight possible inputs: any of the four neighboring cell outputs or any of the four length-4 FastLANE signals running along the cell (see Section 2.3.2).
[Figure 2.4: XC6200 Function Unit]
The feedback from Q to the Y2 and Y3 multiplexers in Figure 2.4 gives additional flexibility, for instance to implement the register of a counter F REG F 9 cin. The RP Mux can be used to protect the register. When a register is protected, it is only writable through the processor interface of the FPGA (Section 2.3.4). This can be useful for implementing constants or for writing parameter values into a circuit without having to connect the register to an I/O pad (also called padless I/O). To achieve higher transmission speeds, the multiplexers in Figure 2.4 have inverted outputs. These can be a problem for the CAD software. Not only do these inversions exist in the multiplexers within the cell, but also on routing multiplexers. A process called inversion compensation (cf. Section 5.7) is required to dete
173. iguration times of the XC6200 make this chip a first choice for coprocessor applications. This is not surprising, as the architecture is targeted at that market segment. It is not evident, however, whether the fine-grained architecture of the XC6200 can implement arithmetic circuits as efficiently as, for example, the XC4000 with its dedicated carry logic. However, [Mul97] presents a fast adder circuit (also discussed in Chapter 6), and [KNS96] recently described a constant multiplier with similar density and better performance than an XC4000 implementation. The logic cell lends itself to regular, pipelined, data-path-type applications. The availability of the bitstream format makes it possible to write new tools exploring various possibilities of generation and reconfiguration of hardware.
2.5.4 Deciding on an Architecture
The simple cell, regular routing architecture, processor interface, fast reconfiguration times, possibility of user register access, and availability of the bitstream format made the XC6200 a clear winner for implementing a reconfigurable coprocessor and associated software tools.
3 Foundations: Lola and Trianus
In this chapter, we present the foundations Hades is based on, namely the hardware description language Lola and the Trianus framework for digital circuit design with FPGAs. Lola is compared to the more popular languages VHDL and Verilog. The structure of Trianus and its data structures are explained in some detail to give the reade
174. illiseconds. Such a system is usually called a Custom Computing Machine (CCM), but we prefer the name Reconfigurable Coprocessor (RC), as it better characterizes the close coupling to a CPU. When compared with the use of hard-wired ASICs, the inclusion of a reconfigurable coprocessor in a computer system bears several advantages:
• The time-consuming part of an algorithm can be executed at the speed of hardware.
• The design implementing this part of an algorithm can be developed with the flexibility and turnaround time of software.
• The available FPGA hardware can be reused for various algorithms, thereby reducing circuitry that is otherwise unused in a system.
• The hardware can be adapted to changing requirements or new algorithms, as only the SRAM configuration data has to be generated anew.
• When specific parameters to an algorithm are known, the hardware can be specialized for those, thereby reducing the amount of logic due to constant propagation and achieving higher speeds.
• The cost and power consumption of a system are reduced, as one reconfigurable coprocessor can fulfill the tasks of several separate ASICs, provided that the tasks are separated in time.
There are two main application areas where RCs can be used to speed up an application: algorithms performing integer operations on large amounts of data (most DSP applications fall into this area) and the processing of input and/or output data streams. To
175. ing resources are intricate, if not astounding. A single cell has 12 input signals and generates 4 output signals. It has 4 direct connections to its neighboring cells and access to a dedicated carry chain. Between cells, switch matrices have access to 16 single, 8 double, and 24 quadruple-length lines, 16 long lines, and 8 global signals. This gives a total of 45 vertical and 32 horizontal lines near each cell. Figure 2.13 shows a simplified diagram of the routing resources. The switch matrix implements a sparse connection of the input and output signals. Apart from the switch matrix, two signals can be connected together at most cross points in Figure 2.13. Some of the longer lines can be used to implement tri-state buses; other signals are unidirectional in principle, although multiple sources can be connected to a signal. The XC4000 is programmed through a serial interface, or 8-bit parallel in the case of the XC4000EX. Partial reconfiguration and the setting of user registers are not possible. However, it is possible to read the state of all logic cells.
2.5 Evaluation
Comparing FPGA architectures against each other is difficult. Each one of them has its strengths in certain application domains and its weaknesses in others. A simple fact is that Xilinx's 3000, 4000, and 5200 architectures, all lookup-table-based coarse-grained FPGAs, hold roughly 70% of the FPGA market. This seems to indicate that these architectures are well suited for different appli
176. ingness to enhance or alter Trianus almost instantly. It was fun to work with you. Thanks to Immo Noack for many things, especially for being who you are. Thanks and acknowledgments go to:
• all of my colleagues at the Institute for Computer Systems for a stimulating and enjoyable working environment. Remember, there is only one true Ludwig of the day.
• Erwin Oertli for tips regarding the details of the Hades board and many other things.
• Marco Sanvido for the provision of a timing analyzer, Beat Heeb for helping with the Ceres 2, and Cuno Pfister for having produced CALLAS, a good tool to measure my own work against.
• Wolfgang Weck and Clemens Szyperski for discussions on various issues.
• Tom Kean for the chip and much more, Bill Wilkie for answering my endless questions and for never stopping to appreciate my corrections to the data sheet, and the remaining people at Xilinx Development Corp., Scotland, for their openness and generosity.
• Virtual Computer Corporation for distributing Hades with their board.
• Dr. Roger Woods and Jean Paul Heron of the DSP Laboratory, Queen's University of Belfast, for their willingness to work with Hades and for their humor.
• Patrick Müller and Reto Zimmermann for betting a term project on Hades and for succeeding.
• Chuck Thacker, Dave Conroy, and Mark Shand of the Digital Systems Research Center for discussions about FPGAs and high-performance memory systems.
• Monty Brekke and Steve Atkins for the
177. inputs. Technology mapping is complex, as the software has to decide what part of a circuit to combine and put into one cell. The AT6000 cell is a collection of special cases. Technology mapping is complex, as some basic functions such as the OR gate have to be constructed from several cells, or certain parts of a circuit have to be combined and put into one cell. All FPGAs are register rich (ratio of registers to logic gates). They are well suited for pipelined, data-path-intensive designs. In addition, the XC4000 with its high fan-in cells is good for implementing random logic. Also, the dedicated carry logic is a big plus for arithmetic circuits, and the distributed RAM capability is useful in many applications.
2.5.2 Routing Resources
The XC6200 has a regular, hierarchical routing structure with few special cases. The XC4000's abundant routing resources give good routability at the cost of a complicated software implementation, because of the many special cases to consider. Both the XC6200 and the XC4000 have routing resources which are independent of the chosen cell function. The absence of this feature is the biggest drawback of the AT6000, as the hardware synthesis software must decide at an early stage whether to use a cell for routing or for logic. If routing turns out to be impossible at a later time, it has to redo placement to free up some cells for routing. The tri-state buses in the XC4000 and the AT60
178. ion, since the necessary components are produced automatically. Manual intervention during layout synthesis is still state of the art. Hades supports a fast and interactive design cycle and lets the designer's knowledge enter this cycle. Interactive and incremental tools result in designs with the same or better quality than automatically generated designs, in the same or less time. A fast, easily usable interface to the reconfigurable coprocessor is essential for the performance of algorithms executed by it. In our opinion, reconfigurable computing has great potential. However, more work on interface issues is needed, and libraries of whole algorithms are to be built.
8.6 Outlook
We never knew as much about the subject of this thesis as now, at the end of the project. Therefore, we now know what we should have done differently and what worked out well. The following sections give some ideas on what could be done differently.
8.6.1 Hardware
As was seen in Chapter 6, the most pressing issue in the design of a reconfigurable coprocessor is the communication speed between the CPU and the FPGA. The Hades board has a memory card interface and has a relative speed advantage over the PCI board of a factor of 6. However, the Hades board was developed for obsolete host hardware. An Alpha CPU from Digital Equipment Corporation can be clocked at 500 MHz and can issue 2 instructions per cycle. During an acce
179. ionality of the cells and the direction of the signal flow (cf. Figure 2.2). The SRAM cells are implemented as 5- or 6-transistor cells. Because of the volatility of the SRAM cells, the configuration bits are loaded from an external storage device upon power-up, typically from a serial ROM. A serial or parallel programming interface is provided for that task. Protecting intellectual property in the presence of a ROM is problematic, as the ROM can be read out not only by the FPGA but also by using a probe. Therefore, FPGA vendors try to protect the intellectual property of their customers by keeping the format of the configuration bitstream secret.
[Figure 2.2: Pass Gate (an SRAM cell controlling a pass gate)]
SRAM-based FPGAs can be programmed an unlimited number of times. Depending on the size of the chip and the speed of the programming interface, the time needed for a full reconfiguration is between hundreds of microseconds and hundreds of milliseconds.
[Figure 2.3: Configuration Store (logic cell, routing cell, and configuration store)]
Other technologies for storing configuration information and implementing switches are based on antifuses [Act95, Qui94] and EEPROM cells [Alt96]. In this thesis, we focus on FPGAs based on SRAM cells, as they are the only alternative for implementing
180. ircuit checker (Section 3.3.6), a browser (Section 3.3.6), and a graphical user interface framework for editors. The separation into a front end and several back ends bears the advantage that, based on the framework, new back ends (for a new FPGA architecture, for instance) can efficiently be developed and integrated. For the user of the system, this results in a uniform interface and consistent behavior of the tools. For the programmer, the use of a shared front end and a common data structure reduces the amount of code to be written and tested when implementing a new back end. This reduction in code complexity results in a more reliable and smaller system. The name Trianus is a derivation from the name of the two-headed Roman god Janus and stems from the fact that the framework supports three different views onto the same circuit, namely a textual view by means of a Lola program, a schematic view, and a layout view in a layout editor. Figure 3.1 gives a graphical representation of possible transitions between views.
[Figure 3.1: Different Views in Trianus (layout, schematics, and HDL views around the shared data structure, connected by Extract, Compile, and Place & Route transitions)]
The three views in Trianus give users a choice between different representations of a design. Some prefer describing their design with a program, some with a schematic, and some with a layout. Using the circuit checker a
181. izable with the word width of a circuit.
3.2.2 Variables, Signals, and Assignments
Variables serve to give a name to an electrical signal (a wire or net), a group of signals, or a component. Each variable is either of a signal type, an array type, or a composite type. There are three basic signal types in Lola: BIT (single-source signal), TS (tri-state bus), and OC (open-collector bus). A signal carries the value zero (0) or one (1), or undefined in the case of a tri-state bus. Signals can be declared in one of four different declaration sections: IN (input), INOUT (input/output), OUT (output), and VAR (variable). Input signals may only be read; input/output signals are tri-state or open-collector buses and may be read and written; output and variable signals may be read and written. IN, INOUT, and OUT signals make up the interface of a type (cf. Section 3.2.4). INOUT and OUT signals are visible outside the scope they are declared in, whereas VAR signals are local to the scope. In an assignment var := exp, the expression exp defines the value of variable var. There may only be one assignment to a variable of BIT type. Variables of type TS may have multiple conditional assignments. Variables of type OC may have multiple assignments.
3.2.3 Operators, Expressions, Control and Position Statements
Table 3.1 lists the basic operators of Lola, which are used to combine signals. In addition to unary and binary operators, there are multiplexer, latch, register
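The assignment rules of Section 3.2.2 can be written down as a small check. The following is a minimal Python sketch, not the Lola compiler's actual implementation; it assumes one reading of the rules, namely that every assignment to a TS variable must be conditional, and the function name `check_assignments` and its data layout are hypothetical.

```python
def check_assignments(types, assignments):
    """Check assignment counts against the signal type rules.

    types:       maps a variable name to 'BIT', 'TS' or 'OC'
    assignments: list of (name, conditional) pairs in program order
    Returns a list of violation messages (empty if all rules hold).
    """
    count = {}
    errors = []
    for name, conditional in assignments:
        count[name] = count.get(name, 0) + 1
        kind = types[name]
        if kind == "BIT" and count[name] > 1:
            # Only one assignment is allowed to a BIT variable.
            errors.append(f"{name}: BIT variable assigned more than once")
        if kind == "TS" and not conditional:
            # Assumption: tri-state assignments must carry a condition.
            errors.append(f"{name}: TS assignment must be conditional")
        # OC variables may be assigned any number of times.
    return errors
```

A second assignment to a BIT variable is reported, while repeated assignments to TS and OC buses pass as long as the TS ones are conditional.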
182. k
8.1 What has been Accomplished
8.2 Hades Hardware
8.3 Hades Software
8.4 Lola and Trianus
8.5 Conclusions
8.6 Outlook
A Syntax of Lola
B Schema of Hades Coprocessor Board
C Photograph of Hades Coprocessor Board
D Components for a Hades Board
E Hades RC Board Decoder
F Wotan Microprocessor
F.1 Architecture and Principle of Operation
F.2 Lola Code
F.3 Layout Synthesis
G Resources on the Web
Bibliography
Curriculum Vitae
List of Figures
1.1 CPU and Coprocessor
1.2 Typical Reconfigurable Coprocessor
1.3 Hardware Synthesis Flow
2.1 General FPGA Structure
2.2 Pass Gate
2.3 Configuration Store
2.4 XC6200 Function Unit
2.5 Mux Implementation of the AND and XOR Functions
2.6 XC6200 Neighbor Routing
2.7 XC6200 Length 4 FastLANEs
2.8 XC6200 Logic Symbol
2.9 CAL Function Unit
183. l It is hence possible to use normal memory operations to read and write register values. On the board, local SRAM is used for fast storage of intermediate data. The board is integrated via its memory interface into the host operating system, which makes it easily and transparently accessible from within application code.
8.3 Hades Software
The Hades software implements a layout synthesis back end for the Trianus framework. It consists of a technology mapper, a constructive deterministic placement algorithm, a maze-running routing algorithm, a configuration bitstream generator, a driver for the Hades hardware, and an interface generator to make a hardware design accessible to software. For the same input, the placer produces the same layout. It places arrays constructively and almost always optimally. Expression trees are placed in a way to make them routable. The placer uses space generously and relies on hints given by the designer to achieve a satisfactory result. The router uses a maze-running routing algorithm which can be influenced in several ways. The shape of wave expansion, the routing resources used, and the sequence in which nets are routed can be set. The router is scriptable, automating these tasks in an iterative design cycle. The place and route software preserves and makes use of the hierarchical information provided by the Trianus front end. The software operates type based, that is, it places and routes 139 8 Summary Co
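The wave expansion at the heart of a maze-running router can be illustrated on a plain grid. The following is a minimal Python sketch of a textbook breadth-first maze router for a single two-pin net; it is not the Hades router, which additionally lets the user shape the wave and select routing resources. The function name `maze_route` and the boolean grid representation are assumptions made for illustration.

```python
from collections import deque


def maze_route(grid, src, dst):
    """Breadth-first wave expansion from src to dst.

    grid[r][c] is True where a cell is blocked. Returns a shortest
    path as a list of (row, col) cells, or None if dst is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}          # back-pointers for path retrace
    frontier = deque([src])
    while frontier:
        cell = frontier.popleft()
        if cell == dst:
            path = []
            while cell is not None:   # walk the back-pointers to src
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and not grid[nr][nc] and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                frontier.append((nr, nc))
    return None
```

Routing several nets in sequence would mark the cells of each finished path as blocked before the next call, which is why the order in which nets are routed influences the result.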
184. le for an RC application because registers in an array were arranged in a circle and could not have been accessed efficiently using the processor interface.
6.3.2 Small Pattern Matcher with Hints
In the second experiment, the input to the placement and routing algorithm was the Lola code annotated with placement hints. Essentially, placement is done by hand in the form of Lola position statements; thus only the router has to perform real work. Table 6.4 summarizes the results. As the placement was already performed for all cells except for the expression trees of the mapper, we used the default placement option for XACT, which was high. A run with low did not improve the runtime of the placement phase. To see how much workstation technology has improved in 8 years, we also conducted the experiment on Ceres 2. It features a National Semiconductor 32532 CPU clocked at 25 MHz and was developed in 1987. The PC was purchased in 1995, i.e., 8 years of technological improvement lie between the two machines.
[Table 6.4: Pattern Matcher with Hints (2 Patterns of 4 Characters Each); 248 cells and 779 nets; columns: Hades, XACT]
Evaluation
As can be seen, Hades performs very well compared to XACT, and it routes the design without retries o
185. ll instances of the outermost type. A single type can be routed, including all types contained in that type, to give the user the possibility to try out different placements for a type without having to route the whole design. Also, single nets can be routed. This can be useful for routing certain nets manually in order to impose an order on the use of routing paths. For high-performance designs, this feature is essential. Chapter 6 and [Mul97] present some examples. All routing commands can be recorded in a script. This script is a simple text which the user can edit. A playback command can be used to execute prerecorded routing steps. This feature is useful when the router has to be used iteratively to obtain a routed design. If some small change is made to the Lola program and the circuit is synthesized and placed again, the user would otherwise have to re-enter all routing commands manually. By using a script, which can be stored together with the Lola program, a design can be routed in the same way as in a previous step. Scripts serve the same function in the router as position assignments in the placer. The typical length of scripts is less than twenty commands. As the ultimate measure for guiding the router, the user can insert wires manually into the layout to enforce a specific routing. When switching from manual to automatic routing, an extraction and verification step is necessary to ensure the consistency of the Trianus data structure c
186. ll yield the same result. An additional compaction step such as the one used in [Pfi92] could be used to improve the placement, i.e., to reduce the height of the expression tree by two rows. The next example, shown in Figure 5.13, realizes the multiplexer function shown in Figure 5.12 with Lola's multiplexer construct and encoded selector signals. The resulting circuit is much more compact than the one shown in [Pfi92]. This is, of course, a direct result of the availability of the multiplexer in a single cell in the XC6200 architecture, a feature CAL lacked. The following example, shown in Figure 5.14, implements a left and a right shift register. The first register is placed according to the array placement heuristic, while the second, although similar in structure, is placed according to the expression tree heuristic. In the first example, subsequent array elements refer to previous array elements which have been placed already; hence the expression tree traversal is always only one level deep. In the second example, however, the placement of the first array element causes a tree traversal of depth N-1, as all subsequent array elements have not yet been placed and are referenced by previous ones. It is questionable whether it would be advantageous to prevent the placement of array elements in expression tree traversals and rely on the array placement to place these nodes. It might produce better placements only in certain cases. We have not investigated
187. lligence Laboratory, MIT, 1996.
E. Detjens, G. Gannot, R. Rudell, A. Sangiovanni-Vincentelli, A. Wang. Technology Mapping in MIS. Proc. IEEE Conference on Computer Aided Design, 1987.
J. Dion. Fast Printed Circuit Board Routing. Proc. 24th Design Automation Conference, ACM/IEEE, 1987. Also available as Digital Western Research Laboratory Report No. 88/1, 1988.
J. Dion, L. M. Monier. Contour: A Tile-Based Gridless Router. Digital Western Research Laboratory Report No. 95/3, 1995.
A. R. Disteli, P. Reali. Combining Oberon with Active Objects. Proc. Joint Modular Languages Conference, 1997.
C. Ebeling, D. C. Cronquist, P. Franklin. RaPiD: Reconfigurable Pipelined Datapath. Proc. 6th Intl. Workshop on Field-Programmable Logic and Applications, LNCS 1142, Springer, 1996.
H. Eberle. Development and Analysis of a Workstation Computer. Dissertation 8431, ETH Zürich, 1987.
J. G. Eldredge, B. L. Hutchings. Density Enhancement of a Neural Network Using FPGAs and Run-Time Reconfiguration. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1994.
C. M. Fiduccia, R. M. Mattheyses. A Linear-Time Heuristic for Improving Network Partitions. Proc. 19th Design Automation Conference, ACM/IEEE, 1982.
R. A. Finkel, J. L. Bentley. Quad Trees: A Data Structure for Retrieval on Composite Keys. Acta Informatica 4(1), 1974.
C. A. Fields. The Proper Use of Hierarchy in HDL-Based High Density FPGA Design. P
188. lso a subcircuit described in the form of a layout might be represented by an HDL description on a higher level. Figure 1.3 describes the flow from a schema or HDL program to the final circuit, which in this case is implemented using an FPGA. The dark shaded areas represent tools and hardware which are the subject of this thesis. A compiler translates an HDL program or a schema into a device-independent netlist representing the circuit. The netlist is mapped to the target device by a technology mapper. Technology mapping is a hard problem for FPGAs with complex cells (cf. Chapter 2), and it takes some time to decide which gates in a netlist are mapped into which cell in an FPGA. The resulting netlist is then placed and routed, i.e., the physical location on the FPGA for the gates in a netlist is determined (placement) and the gates are connected together with wires (routing). Both problems are NP-complete (no algorithm is known whose runtime is bounded by a polynomial function of the problem size; the known exact methods require exponential time) and require long run times to achieve good results. Commercial tools often use placement algorithms based on simulated annealing [KGV83]. This algorithm tries many different configurations until it reaches a solution. Once the design is placed and routed, the netlist can be converted into an SRAM configuration for the FPGA and finally be downloaded to the device. Netlists that cannot be placed or routed or that do not meet the timing cons
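To make the idea of annealing-based placement concrete, here is a minimal Python sketch. It is a generic textbook formulation, not the algorithm of any commercial tool or of Hades (whose placer is constructive and deterministic). The assumptions are: two-pin nets only, Manhattan wire length as the cost, exactly as many slots as gates, and a linear cooling schedule. All names (`wirelength`, `anneal_place`) are hypothetical.

```python
import math
import random


def wirelength(pos, nets):
    """Cost: sum of Manhattan distances over all two-pin nets."""
    return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
               for a, b in nets)


def anneal_place(gates, nets, slots, steps=2000, t0=5.0, seed=0):
    """Swap-based simulated annealing; returns (best placement, best cost)."""
    rng = random.Random(seed)
    pos = dict(zip(gates, slots))       # initial placement in slot order
    cost = wirelength(pos, nets)
    best, best_cost = dict(pos), cost
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9  # linear cooling, never exactly zero
        a, b = rng.sample(gates, 2)
        pos[a], pos[b] = pos[b], pos[a]  # propose: swap two gates
        new = wirelength(pos, nets)
        # Accept improvements always, worse moves with Boltzmann probability.
        if new <= cost or rng.random() < math.exp((cost - new) / t):
            cost = new
            if cost < best_cost:
                best, best_cost = dict(pos), cost
        else:
            pos[a], pos[b] = pos[b], pos[a]  # reject: undo the swap
    return best, best_cost
```

The many trial configurations mentioned above correspond to the loop iterations; the fixed seed makes a run repeatable, whereas production tools typically trade run time for quality through the number of steps and the cooling schedule.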
189. mmunication, 1996.
Chromatic Research. MPact Media Engine. 1995.
S. Churcher, T. Kean, B. Wilkie. The XC6200 FastMap™ Processor Interface. Proc. 5th Intl. Workshop on Field-Programmable Logic and Applications, LNCS 975, Springer, 1995.
D. A. Clark, B. L. Hutchings. The DISC Programming Environment. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
D. Conroy. Digital Systems Research Center, California. Personal Communication, 1996.
S. Cook. The Complexity of Theorem-Proving Procedures. Proc. Third Annual ACM Symposium on the Theory of Computing, 1971.
T. H. Cormen, C. E. Leiserson, R. L. Rivest. Introduction to Algorithms. The MIT Press, 1990.
W. B. Culbertson, T. Osame, Y. Otsuru, J. B. Schackleford, M. Tanaka. The HP Tsutsuji Logic Synthesis System. Hewlett-Packard Journal, August 1993.
Bibliography
CAC96 Cyp95 DeH96 IDGRS7 Dio87 DM95 DR97 ECF96 Ebe87 EH94 FM82 FB72 Fie95 Fou93 FRV91 Gal95
W. B. Culbertson, R. Amerson, R. J. Carter, P. Kuekes, G. Snider. Exploring Architectures for Volume Visualization on the Teramac Custom Computer. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
Cypress. Programmable Logic Data Book. 1995.
A. DeHon. Reconfigurable Architectures for General-Purpose Computing. Dissertation, A.I. Technical Report No. 1586, Artificial Inte
190. n IN IdList InType INOUT IdList InOutType OUT IdList OutType 5 VAR VarDeclaration CLOCK Expression BEGIN StatSequence END Identifier 77
B Schema of Hades Coprocessor Board
[Figure B.1: Schema of Hades RC Board]
C Photograph of Hades Coprocessor Board
[Figure C.1: Photograph of Hades RC Board]
D Components for a Hades Board
1 XC6216 FPGA, PGA 299
1 PGA socket, 299 pin
8 SRAM 64K x 4 bit with output enable, 15 ns, DIP 28, 300 mil, Motorola MCM 6209C P15
8 DIP 28, 300 mil socket with 100 nF capacitor
3 22V 10 7
191. n rOin Y1 sum a o Xor And BIT register bitslices Oth register special r R Regs 1 RegCell Place BEGIN pZ MUX pes z pc select alu output or pc so MUX ds pz din select alu output pc or data input shift MUX shd su sd shift up or down d MUX shen so shift shift select shiftr MUX shd hu hd rin MUX shen d shiftr ren en 0 shen r REG ren rin reg 0 Zero 0 rOin ys 0 r R O d zero rOin en 1 xs 1 ys 1 reg 1 FOR i 1 6 DO reg 2 7 R i d R i 1 x R i 1 y en i 1 xs i 1 ys i 1 END x R 6 x Y R 6 y x and y register bus Y1 MUX im Y ir select instruction or normal register F Wotan Microprocessor 160 y Y1 neg negate u x y half sum alu operations sum u ci a X y 0 X y Xor MUX xor sum u And MUX and Xor a z MUX or And 0 co MUX u x ci carry out zo z zi zero chain IF Place I THEN FOR i 0 6DOR 4 1 2 0 END so 0 0 shift O 1 d 1 0 rin 1 1 shiftr 2 0 ren 2 1 r 3 0 r0in 3 1 zero 4 1 x 16 0 Y 17 1 Y1 18 0 y 18 1 co 19 0 u 19 1 sum 20 0 Xor 20 1 a 21 0 And 21 1 0 22 0 z 22 1 pz 23 0 zo 23 1 END END ALUslice F Wotan Microprocessor incrementable program counter TYPE PCslice Place I
192. n Figure 6.6 is 10.4 ns and the delay of the right circuit is 9.5 ns.
6.2.7 Registers
Pattern registers are implemented with buried registers. These make use of the processor interface, a unique feature of the XC6200 FPGA. The patterns are directly loaded into the registers without passing through I/O buffers. The 8-bit and 5-bit data registers have a load enable signal which is used to implement the shift steps mentioned in Section 6.2.4. Data is loaded through the processor interface into the 32-bit (4 x 8 bit) input register. Again, no I/O buffers are needed to do that. If the data were copied directly from a network adapter, I/O buffers might be used to route the data to the input register. Program 6.4 shows the two Lola type definitions for buried and loadable registers. Note that we could have the technology mapper implement buried registers for us by simply declaring them as global inputs (cf. Section 5.4). But we would like to define a pattern register in a type such that an instance of such a type represents a register of a certain bit width. The array placement heuristic places the registers optimally, i.e., the bits vertically on top of each other. They are hence easily accessible through the processor interface and can be read or written in one access cycle. Figure 6.7 shows the resulting placement of an 8-bit data input register, a loadable 5-bit mapped data register, and a 5-bit pattern register.
6.2.8 Connecting Everything
N
193. n a direct approach to technology mapping rather than writing a generic mapper using the widely used graph matching technique described in [Keu87]. Normally, a mapper determines what logic gates are implemented by what cells; hence a preplacement step is performed. Since in our case the mapping is almost one to one, we defer this step to the placement phase described in Section 5.5. In the following, we will often use the terms instance and type. A type is a Trianus data structure describing a type definition in Lola. An expanded type is an instantiated type, i.e., a generic type with actual parameters. An instance is an actual hardware component of an expanded type, such as an adder element. See Section 3.2.4 and [Geh97] for further details on the difference between generic and expanded types.
5.4.1 The Algorithm
The mapper takes as input a Trianus data structure representing a compiled Lola module. It first initializes the id field of all nodes in the data structure. The id field is used to mark nodes which have been mapped already, such that already visited nodes are not visited again. The algorithm then sorts the type hierarchy topologically, with types not containing instances of other types occurring first. This is necessary for the correct treatment of an instance's input variables, as will be seen later. All instances of every expanded type are mapped in the sequence defined by the topological order. The mapping i
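The topological ordering of the type hierarchy can be sketched in a few lines. The following Python fragment is a minimal illustration using a depth-first post-order traversal; it is not the Trianus implementation, and the dictionary representation of the hierarchy as well as the name `topo_order` are assumptions made for the example.

```python
def topo_order(contains):
    """Order types so that leaf types come first.

    contains maps each type name to the list of types it instantiates.
    Returns a list in which every type appears after all types it contains.
    """
    order = []
    state = {}  # type name -> "active" (on the stack) or "done"

    def visit(t):
        if state.get(t) == "done":
            return
        if state.get(t) == "active":
            raise ValueError("type contains itself: " + t)
        state[t] = "active"
        for child in contains.get(t, ()):
            visit(child)          # contained types are emitted first
        state[t] = "done"
        order.append(t)

    for t in contains:
        visit(t)
    return order
```

With a hierarchy in which a hypothetical `Top` contains `Adder` and `Adder` contains `AddElem`, the element type is emitted before the adder and the adder before the top-level type, so a mapper walking the list has already processed every contained type when it reaches an instance of it.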
194. n a week. Trianus is a fast, device-independent framework for circuit design, offering a general yet simple data structure. A flexible broadcast mechanism is used to alter the data structure and to keep a consistent one-to-one correspondence between types and their instances. Type-based tools guarantee efficiency even for large designs. The layout editor is comfortable to use and provides immediate response for most operations. It can be used for manual layout and for floor planning of large designs. On the downside, Trianus shows quadratic run-time behavior on unfavorable input due to the simple implementation of the data structure (a linked list of nodes). This implementation will become a limiting factor of the data structure's performance as FPGA devices get bigger. A more efficient data structure for representing the geometric relationship of nodes would be advantageous, such as quad trees [FB72] or a simple hash table as presented in Section 5.9. Also, support for libraries and for the copying of data structure elements such as types is missing. The latter is a prerequisite for the implementation of libraries, which are essential for the development of reconfigurable coprocessor applications.
4 Hades Hardware
In this chapter, we present the hardware part of this thesis, the Hades reconfigurable coprocessor (cf. Figure 4.1). A tutorial-style introduction to the hardware is presented in [Lud96]. Throughout this chapter, names printed in Sans
195. n gates requires 30 minutes. This design would require about 25 XC4010 chips, for each of which the place and route time using commercial tools would be in that time frame. Synthesis, partitioning, placement, and routing are fully automatic, and hence the long compilation time for the whole application is acceptable.
7.2 Reconfigurable Coprocessors
This class of FPGA-based computers covers smaller devices which are used in close cooperation with a host computer. Our Hades RC falls into this category.
7.2.1 Chameleon
Built in 1992 at ETH Zürich, Chameleon is a workstation using Algotronix CAL FPGAs and a MIPS CPU [HP92, Hee93]. One CAL chip is used to implement the control logic of the workstation, such as the keyboard and mouse interface, the video controller, and the network interface. A 2x3 array of CAL chips is used to implement a custom computer. No local memory is attached to the FPGAs, but the main memory of the host computer can be used via the processor. Data transfers are relatively slow, as no DMA is supported.
Software
Applications are described in the Debora HDL [Hee93]. The CALLAS layout synthesis software is used to generate layouts. Manual improvement of the placement is necessary, but no position hints in the HDL are allowed. Instead, a match tool is used to propagate information from a previously generated layout into a new one. The tools are very fast, and design cycles in the order of tens of seco
196. nclusions and Outlook 140 the prototype of a circuit and then propagates this information to all components of that type in the design. The synthesis tools support a fast, interactive, and iterative design cycle. Hints in the Lola HDL description of a design can be used by the user to influence the produced result. The design cycle, starting with compiling Lola HDL code and ending with a finished layout, is very fast and usually takes under one minute on a Pentium-class PC. Compared to commercial tools, the Trianus/Hades software is a factor of 2 smaller, implements nearly the same functionality, requires at least a factor of 2 less memory, and produces the result faster by a factor of 10 (this does not include HDL compilation by commercial tools). 8.4 Lola and Trianus. Lola is a simple, easy-to-learn hardware description language for describing digital circuits on a structural level. Position statements are very useful to guide a placement algorithm to obtain a good layout. Datapaths can be described concisely, and the inclusion of hierarchy is essential for tackling large designs. Trianus is a fast, robust, device-independent framework for FPGA circuit design offering a general yet simple data structure that is efficiently and comfortably altered using a flexible iteration mechanism. It supports and maintains the hierarchical information available in Lola and provides algorithms for manipulating the data structure respecting the hierarchy. The la
197. nd a relatively fast VHDL compiler. 7.3 Reconfigurable Processors. A different approach to custom computing is taken by research groups who investigate the combination of FPGAs with CPU cores. Most of these projects were started only recently and few results are available. None of these projects have reported a working implementation. The potential seems to be very promising, however. 7.3.1 PRISC. One of the earliest works in this field is PRISC (Programmable Instruction Set Computer) from Harvard University, Cambridge, Massachusetts [Raz94, RS94]. It consists of a CPU augmented with a programmable function unit (PFU), which has the equivalent die area of 1 KB of cache memory. A compiler exists which analyzes the source code for operations that could be implemented in the PFU. A CPU with one PFU executes the SPEC integer benchmarks 22% faster. No compilation times are reported. 7.3.2 BRASS. The Berkeley Reconfigurable Architectures, Systems and Software [BRA96] project from the University of California, Berkeley consists of a reconfigurable CPU and a C compiler generating configurations for it, similar to PRISC. The project was started in 1996 and no concrete results have been presented yet. The hardware consists of the Garp processor, which contains a MIPS II core and an FPGA optimized for datapath applications. A modified C compiler is used to generate code and configuration bitstreams for Garp. Logic and layout synthesi
198. nd extractor, consistency between the three views can always be verified manually or automatically. By using only one common data structure for representing design data, intermediary file input and output as shown in Figure 1.3 is avoided. Hence the design flow shown in that figure can be simplified to the one shown in Figure 3.2. The shaded areas correspond to the Lola HDL and the Trianus software described in this chapter. [Figure 3.2: Lola and Trianus Part in Design Flow (from Fig. 1.3)] 3.3.2 Data Structures. In the following we give a brief overview of the basic data structure used for design representation in Trianus. We refrain from introducing every detail and leave out those aspects which are not relevant in this thesis's context. In later chapters we give more information on the data structures should the need arise. For a more detailed discussion of Trianus and its underlying design we refer the reader to [Geh97]. The central data structure describes a hardware circuit in a general, compact, device-independent manner. Generality and compactness of the data structure are important attributes, as they make it possible to keep the data in memory between different phases in th
199. nds or at least a few minutes. Sadly, with current commercial tools this is impossible [Fie95, Woo96a, WLH97]. During one day many design cycles can be performed in software development. Good programming language compilers have turnaround times in the range of seconds [Wir96a]. With modern operating system concepts such as dynamic linking and loading, a programmer can test a module a few seconds after having made a change to the program text. This statement is at least true for the Oberon System [WG92]. One driving force in the development of the Hades layout synthesis tools was to achieve the same level of speed for the development of hardware. As can be seen in Chapters 6 and 7, this has been achieved for all parts of the synthesis flow except for routing, which is the slowest part of our software. Especially HDL compilation, mapping, and placement are very fast. Interactivity. Interactive use of layout tools is mandatory to achieve high-quality layouts of FPGA circuits [Ber93, BT94], especially with the XC6200, as current placement algorithms perform poorly on fine-grained architectures [HW96, WCG96]. Therefore we strove for tools that the designer can influence at all levels. When describing a circuit with the Lola HDL, this is possible through position statements; during placement, the layout editor can be used to arrange cells and instances manually; and during routing, single nets can be routed before others. Nets can
200. nds to minutes are attainable. The fact that the control logic of Chameleon is described and synthesized using Debora and CALLAS proves the usefulness of the tools. CALLAS was a constant source of inspiration during the development of Hades. Hades' size is compared to that of CALLAS in Section 5.10. 7.2.2 PCI Pamette. A smaller Programmable Active Memory machine called PCI Pamette was developed at Digital's Systems Research Center in Palo Alto, California, and is the third generation of the Perle family [Sha96]. The Pamette is a PCI board featuring 4 Xilinx XC4010E FPGAs and 2 banks of SRAM, each with 128 KB. A fifth XC4010E implements the PCI interface, which allows for data throughputs close to the theoretical maximum of 133 MB/s. The board is used as a flexible programmable I/O device, for instance as a real-time data acquisition interface. Software: The Pamette is programmed with the same software methodology as the Perle machines from DEC PRL. Commercial tools are used for routing, and design cycles typically lie in the range of tens of minutes. 7.2.3 VCC's Reconfigurable Processing Unit. Virtual Computer Corp. manufactures a PCI board hosting a Xilinx XC6216 FPGA used as a reconfigurable coprocessor and a Xilinx XC4013E FPGA implementing the PCI interface [VCC97]. The card has two banks of SRAM, each with 256 KB. Software: Design software for that board includes our Trianus/Hades system as well as the Xilinx XACT step series 6000 a
201. next level of the routing hierarchy is composed of length-4 FastLANEs. They run across four cells and are connected together through switches which are spaced 4 cells apart (Figure 2.7). A length-4 FastLANE can be driven by another length-4 FastLANE or by the signals on the previous or next level. Accordingly, there are length-16 FastLANE signals driven by switches spaced 16 cells apart and chip-length FastLANE signals driven by switches located at the perimeter of the cell array. Four global signals (G1, G2, GClk, GClr) are provided for low-skew, low-delay signals such as register clear and clock. [Xil96] presents the routing architecture in more detail. It also describes the topology of the magic routing resources, which are not supported by our tools due to their irregular structure with respect to hierarchy. They allow the signal of the X2 or X3 input of a cell to be connected to two 4x4 switches. 2.3.3 Input/Output Blocks. Surrounding the array of cells, configurable input/output blocks (IOBs) are located at every cell location. An IOB can be configured to act as an input, an output, or a bidirectional driver controlled by a tri-state enable signal. Not every IOB is connected to a pad (padless IOB). A novel feature of the XC6200 is that padless IOBs can be connected to additional inputs of IOBs with pads. This gives additional flexibility for the routing of signals. Also, it is possible that an IOB can drive a control signal which goes into the array
202. nment with vectors x and y and constant zero FOR i 0 Bits 1 DO D i rd adder s i rd controlled tri state assignment END END Add 3.2.6 Compilation of Lola. In analogy to a high-level programming language compiler, the Lola compiler performs a syntax and type check on the program and then generates an abstract syntax tree representing it. An interpreter traverses this tree to generate an expanded data structure with signal and expression nodes (cf. Section 3.3). The expansion step is necessary to generate the required number of nodes described in FOR loops and IF statements and to evaluate position statements. This interpretation step, however, is different from the interpretation step in Verilog and VHDL, which is based on a simulation model to generate the actual hardware (cf. Section 3.2.7). The values of individual signals are not known to the Lola interpreter, as only the control statements are interpreted. After expansion, an optional simplification step may be executed on the resulting data structure. This simplification step propagates constants through the expression tree. An individual instance may thus look quite different from its original type definition. For instance, if the carry input is zero, the lowest full adder in Program 3.1 degenerates into a half adder. 3.2.7 Other HDLs. HDLs used in industry are Verilog and VHDL (Very High Speed Integrated Circuits HDL), both developed in the 1980s. Both languages were ori
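The simplification step described above, in which constants are propagated through the expression tree, can be sketched as follows. This is a minimal illustration and not the Lola compiler's code: the tuple-based tree, the function `simplify`, and the helper `full_adder` are hypothetical, and since Program 3.1 is not reproduced in this excerpt, the textbook full-adder equations are assumed.

```python
def simplify(e):
    """Propagate the constants 0 and 1 through a nested-tuple expression tree.

    Leaves are signal names (str) or constants (0, 1); inner nodes are
    ("not", a), ("and", a, b), ("or", a, b) or ("xor", a, b).
    """
    if not isinstance(e, tuple):
        return e
    op, *args = e
    args = [simplify(a) for a in args]
    if op == "not":
        (a,) = args
        return 1 - a if a in (0, 1) else ("not", a)
    a, b = args
    if op == "and":
        if a == 0 or b == 0:
            return 0
        if a == 1:
            return b
        if b == 1:
            return a
    elif op == "or":
        if a == 1 or b == 1:
            return 1
        if a == 0:
            return b
        if b == 0:
            return a
    elif op == "xor":
        if a == 0:
            return b
        if b == 0:
            return a
        if a == 1:
            return simplify(("not", b))
        if b == 1:
            return simplify(("not", a))
    return (op, a, b)


def full_adder(x, y, ci):
    """Textbook full adder (assumed form): returns (sum, carry_out) trees."""
    s = ("xor", ("xor", x, y), ci)
    co = ("or", ("and", x, y), ("and", ci, ("xor", x, y)))
    return s, co


# With the carry input tied to constant 0, only the half-adder terms remain.
s, co = full_adder("x", "y", 0)
half_sum, half_carry = simplify(s), simplify(co)
```

Here `half_sum` reduces to `("xor", "x", "y")` and `half_carry` to `("and", "x", "y")`, which is the half adder the text refers to.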
203. nopsys Inc. [Syn92] are used to compile the VHDL code into netlists. The Xilinx tools (ppr) are used to compile the final configuration bitstream. We have no reported performance numbers on compilation speed, but from discussions with other researchers, VHDL compilation is in the range of tens to hundreds of minutes and ppr is known to have long runtimes [Fie95]. We expect therefore that layout synthesis for Splash 2 is a rather lengthy process, allowing for only a few design iterations per day. 7.1.3 Teramac. Teramac [ACC95, ACC96, CAC96] of the Hewlett-Packard Laboratories in Palo Alto, California consists of 16 boards with a total of 1728 custom FPGAs (PLASMA) and 512 MB of RAM (64 independent 32-bit-wide banks). The system provides at least the equivalent of 1 million logic gates. The most important advantage of the PLASMA FPGA is that it has abundant routing resources, which allow for fully automatic placement and routing tools. Designs typically have clock rates of under 1 MHz. Communication with a host computer occurs via a SCSI interface and is thus limited to relatively low speeds. Software: Interestingly, the group developed their own FPGA because the place-and-route tools for commercially available FPGAs had unacceptable execution times for a custom computing machine. PLASMA is rich in routing resources and layout synthesis is very fast (3 seconds per FPGA). Compilation of a volume visualization design consisting of a quarter millio
204. nput, output and control signals of one ALU slice. [Figure F.6: ALU Slice Signals] RegCell: din: data input to register d (from input bus); loadD: register load enable; xin, yin: input to output muxes; wrX, wrY: output mux selectors (0: pass xin/yin input to x/y output, 1: feed register value to x/y). The register input din flows through a mux controlled by ds (data select: selecting either the ALU output or input from memory D), yielding so, and then through a shift mux controlled by shen (shen: shift enable, shd: shift down, sd, su: shift mux inputs from next higher or lower bit slice). The output d of the shift mux goes to the registers. A second shift path is used for multiply and divide instructions, in which register 0 (r) plays an exceptional role. The second shift mux has inputs hd and hu for shifting up or down. The ALU part implements inversion, exclusive and inclusive OR, and AND. Controls are neg (complement y input), xor (select XOR), or (select OR), and and (select AND). The y input is selected to be either the register output y (called Y) or to come from the IR register (immediate operand). ci and co are the carries, and zi and zo the chain of AND gates to determine whether all ALU outputs are zero. A multiplexer at the ALU output allows the PC value to be fed in for branch subroutine. The control unit c
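The RegCell behavior described above (a one-bit register with load enable, plus two output muxes that either pass the incoming bus value or drive the register value) can be mimicked in software. The following is only a behavioral sketch under our reading of the signal list; the class `RegCell` and its method names are hypothetical and do not reproduce the Lola source.

```python
class RegCell:
    """Behavioral model of one register bit with two output bus muxes."""

    def __init__(self):
        self.d = 0  # register contents

    def outputs(self, xin, yin, wrX, wrY):
        """Combinational part: selector 0 passes the bus, 1 feeds d."""
        x = self.d if wrX else xin
        y = self.d if wrY else yin
        return x, y

    def clock(self, din, loadD):
        """Sequential part: on a clock edge, latch din only if loadD is set."""
        if loadD:
            self.d = din


cell = RegCell()
cell.clock(din=1, loadD=1)                          # load a 1 into the register
x_drive = cell.outputs(xin=0, yin=0, wrX=1, wrY=0)  # d on x, yin passed to y
cell.clock(din=0, loadD=0)                          # load disabled: d is kept
x_pass = cell.outputs(xin=0, yin=1, wrX=0, wrY=0)   # both buses passed through
```

Chaining several such cells, with each cell's x and y outputs feeding the next cell's xin and yin, gives the simulated register buses: at most one cell per bus should assert its write selector.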
205. nt 6.3.3 Large Pattern Matcher with Hints. In our final experiment a big instance of the pattern matcher is generated. Table 6.5 lists the result for a pattern matcher with 16 patterns, each of which has a length of 12 characters. This results in a design with 3048 cells, filling most of the available space on an XC6216 FPGA (83% utilization within the bounding box). [Table 6.5: Pattern Matcher with Hints (16 Patterns of 12 Characters Each); 3048 cells and 8299 nets; columns: Hades, XACT 1, XACT 2] Evaluation: Again Hades is very fast, a factor of 8 faster than XACT. We made several runs using XACT, and two typical ones are listed. They only differ in the placement of the mapper circuit, which is non-deterministic. The effect on routing time is quite drastic, though. The second run takes twice as long as the first. Interestingly, prerouting the comparator type results in much slower routing speed in the first case and faster routing speed in the second case. As was seen earlier, routing only succeeds if the Magic routing resource is allowed, so it truly deserves its name, at least in combination with XACT. The speed of Hades on the Ceres-2 is still on par with XACT on the PC
206. of computing Con96 Alt 2 Another problem when lacking a dedicated coprocessor interface is that logic in the FPGA would have to be used to implement the communication signals between the CPU and the FPGA. This logic might be too slow when interfacing to a fast CPU. Despite these problems, the idea of an FPGA tightly coupled to the CPU looks attractive and we would like to pursue this in the future (cf. Chapter 8), specifically by using a slower and cheaper CPU with a simple memory system. 4.2.2 Extension Card with DMA Only. The second alternative from the list above is an RC board that contains an FPGA and DMA logic for quick data transfers from and to memory. Such a setup is shown in Figure 4.3 (the local memory shown has to be ignored for the discussion in this section). The DMA option allows an RC application to access main memory. One limitation of this setup, however, is that all data transfers have to use the system bus. If only small transfers are made, the control overhead degrades performance. Depending on the target system bus, such a card is relatively easy to build, as there is a well-defined interface. On the downside, building a board which can act as a bus master can be quite complex, since bus arbitration is needed and additional control signals have to be generated. 4.2.3 Extension Card with Local Memory. The third alternative overcomes the shortcomings of the solution above in that the RC bo
207. of the Hades software tools and put them into perspective against the commercially available tools for the XC6200 FPGA. In Appendix F we implement a small microprocessor on the XC6200. Hades was also used by other groups to develop libraries of arithmetic circuits and DSP algorithms. A short overview of their work is given. All timings in this chapter were measured on a Dell OptiPlex XL 5120 PC equipped with an Intel Pentium CPU running at 120 MHz with 256 KB of second-level cache and 32 MB of main memory. We used Windows 95 from Microsoft and Oberon for Windows V4 0 2 0 from the University of Linz. 6.1 Applications Running in Hardware. Coprocessor applications consist of a hardware part and a software part. The software implements the control part of the application, steering the data flow to and from the reconfigurable coprocessor. It also implements the user interface of the application, i.e., it makes the coprocessor accessible to the software programmer. The hardware typically implements that part of the whole application which accounts for most of its runtime. Good candidates for operations to implement in hardware are inner loops or whole procedures (subroutines) being executed many times. These operations should consist mainly of integer and bit operations and should contain only a small amount of control logic. Hardware in general is profitably used for highly parallel, repetitive applications of primitive operations. If an application proc
208. oftware interface is given in Sections 5.7 and 5.8. 4.8 Discussion. Adequate Testbed: The board has proved its value for implementing and testing coprocessor applications. See Chapter 6 for a discussion of an application using the board and a presentation of performance data. The performance of the Ceres-2 and its bus was adequate and competitive with a prototype PCI card we received from Xilinx [LSC96], which had to be accessed using slow input/output commands. A Software Person Can Build Hardware With Some Help: The Hades RC board was our first hardware project. It proved advantageous to have a simple, conservative design which was intellectually manageable and overseeable at all times. We spent more time learning the CAD tools than actually designing or constructing the board. With some help from hardware experts, it was possible for a software-oriented person to conceive, describe, and implement an FPGA-based coprocessor board in three months. Interfacing to the XC6200 FPGA: Interfacing to the XC6216 is quite easy. The control logic for generating the two necessary signals (CS and RW) of the synchronous interface is simple and easily built, depending on the complexity of the host bus protocol, of course. Operating System Issue: The availability of an operating system which we could change was an invaluable advantage. When discussing such issues with other researchers, we always heard complaints about the difficulties of writin
209. on a given problem. We need more experience with building actual applications, especially to see what support is needed in the runtime system to make this technology usable for software programmers. Lola is quite good for describing datapaths, i.e., regular logic, but descriptions of random logic and state machines tend to be illegible (cf. Program 6.2 and Section F). Tabular methods or truth tables might be better suited for this. A simple translator could be written to generate Lola code from such tables. The XC layout editor of Trianus has proven its value for experimenting with placement during the construction of an RC application. The speed of the tools is excellent and allows for an efficient, effective, and iterative design cycle. 7 Related Work. The idea of using FPGAs to build reconfigurable computers must occur to every engineer who hears about FPGAs for the first time. It is a compelling idea and seems to have many advantages and promises great speedups over conventional software running on general-purpose CPUs. When confronted with reality, however, euphoria quickly turns into disillusionment. The reason is the difficulty of programming such a system. In this chapter we present and discuss selected projects that are related to our work, either on the hardware side or on the software side. Both issues have received and still receive a great deal of attention by the research community and by commercial companies. For each project listed
210. one source and more than one destination, one connection after the other is routed. The router sorts the type hierarchy topologically (innermost first) and routes all nets in each type, propagating the resulting routing information to all instances of that type using a type broadcast. As the module itself is also a type, it will be routed by this process in the same manner. The router data structure is shown in Program 5.13 and the overall structure of the algorithm is given in Programs 5.14 and 5.15. In those programs, r represents a router data structure. The router makes extensive use of the broadcasting mechanism. The following list explains some of the more subtle points of the algorithm shown in Programs 5.14 and 5.15. The numbers in the list correspond to the numbers given in parentheses in the programs. 1. The user is allowed to insert wires into a design manually, for example to preroute certain nets. The extractor of the Trianus framework (cf. Section 3.3.6) is used to collect Program 5.13: Router Data Structure. Router = POINTER TO RECORD module, type: TriBase.Type; (* module and type which get routed *) coord: TriBase2.Coord; (* list of wire coordinates for type-based insertion *) nets: TriNetTables.Table; (* table of nets to be routed *) minU, minV, maxU, maxV: INTEGER; (* bounding rectangle *) hierarchies: SET; (* routing resources which are allowed for routing *) map: LeeMap; (* matrix of routing resources used for wave expansion *)
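The ordering described above (sort the type hierarchy topologically, innermost first, route each type once, then broadcast the result to every instance of that type) can be sketched as follows. This is an illustration only, not the code of Programs 5.14 and 5.15: `route_order`, `route_all`, the `uses` and `instances` dictionaries, and the type names are hypothetical, and Python's standard `graphlib` module stands in for the router's own sorting.

```python
from graphlib import TopologicalSorter


def route_order(uses):
    """Return the types so that every type comes after the types it contains.

    `uses` maps each type name to the set of types it instantiates.
    """
    return list(TopologicalSorter(uses).static_order())


def route_all(uses, instances, route_type):
    """Route each type once and copy the result to all of its instances."""
    routed = {}
    for t in route_order(uses):
        wires = route_type(t)               # route the nets of the type itself
        for inst in instances.get(t, ()):   # "type broadcast" to the instances
            routed[inst] = wires
    return routed


# Illustrative hierarchy: the module itself is just the outermost type.
uses = {
    "Module": {"Adder", "Reg"},
    "Adder": {"FullAdder"},
    "FullAdder": set(),
    "Reg": set(),
}
instances = {"FullAdder": ["fa0", "fa1"], "Adder": ["add0"], "Reg": ["r0"]}
calls = []


def route_type(t):
    calls.append(t)
    return f"wires({t})"


order = route_order(uses)
routed = route_all(uses, instances, route_type)
```

Each type is routed exactly once regardless of how many instances exist, which is the point of the type-based approach. The relative order of unrelated types (here `Reg` and `FullAdder`) is not fixed by the sort.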
211. onsists of 16 PC slices. Each slice contains a bit of the PC register and two muxes. The address output used to address memory is determined by the selectors as1 and as0 as shown in Table F.2. The control unit also contains the program counter. Its next value is also shown in Table F.2. [Table F.2: Control Unit] MODULE Wotan register d and simulated register buses x y TYPE RegCell Place data in x y bus in load register write bus x y IN din xin yin loadD wrX wrY BIT OUT x y BIT x y bus VAR d BIT register BEGIN d REG loadD din x MUX wrX xin d conditionally write d to x bus y MUX wrY yin d conditionally write d to y bus IF Place 1 THEN d 1 0 x 0 0 y 1 1 END END RegCell alu bit slice with 8 registers TYPE ALUslice Place CONST Regs 8 8 registers IN din ir pc BIT data in inst reg prog counter su sd hu hd BIT shift data up dn mult div up dn ci zi BIT carry in zero in data select immediate select pc select shift enable shift up dn ds im pcs shen shd BIT and or xor negate and or xor neg BIT register load enable output mux selectors en XS ys Regs BIT OUT Z Y so r co zo BIT r for register 0 VAR d X V u zero pz shift shiftr rin re
212. ools to define and construct this hardware, and a software interface which allows software programmers to make use of the available hardware accelerator. 1.6 Overview of Thesis. The following Chapter 2 introduces the technology of FPGAs, focusing on the target architecture of this thesis, the Xilinx XC6200. Chapter 3 presents the foundations of our Hades system, namely the hardware description language Lola and the Trianus framework for FPGA design. Chapter 4 introduces the Hades hardware, a coprocessor board based on the XC6200 FPGA. Chapter 5 introduces the Hades software, featuring a technology mapper, automatic place-and-route tools, a loader and driver for the coprocessor board, and a software interface presenting a coprocessor application to the software programmer as an accelerated library module. Chapter 6 shows the usefulness of Hades by presenting applications of the board and the software, and compares Hades to the commercial tool for the XC6200 FPGA. Chapter 7 presents related work. Chapter 8 summarizes the presented work, draws some conclusions, and presents ideas and suggestions for future work. 2 Field Programmable Gate Arrays. In this chapter we present the motivation for using field programmable gate arrays and describe various architectures. In particular, we present the Xilinx XC6200 FPGA and the reasons it was chosen as the target architecture for this thesis. 2.1 Background. FPGAs continue the trend of higher integ
213. or already routed instances, as it must take the routing resources used by the instance into account. For example, an adder type using length-4 FastLANEs in the vertical direction will be constrained to a specific vertical position with respect to the switches driving the length-4 FastLANE signals. This information must be gathered prior to placement. Currently it is gathered prior to routing. Another area where further improvement is possible is the inclusion of timing constraints. The placer might use timing information to determine which gates should be close to each other, although our simple approach already performs quite well as it places the gates in the order they are connected. The router might use timing constraints to sort the nets according to their criticality. Runtime systems and suitable languages for the development of reconfigurable coprocessor applications are two areas where much work is needed in the future and where considerable improvements must be made to bring this technology into the hands of software programmers. 6 Application and Evaluation. A large body of literature exists demonstrating the usefulness of configurable hardware to accelerate applications. The yearly IEEE Workshop on FPGAs for Custom Computing Machines is the main conference on this subject. In this chapter we use the Hades software to develop a coprocessor application running on the Hades reconfigurable coprocessor board. We present performance data
214. osity and more code has to be written. Another problem with this approach is that the declarations of interface and private signals are textually separated. When writing the implementation, the designer always has to consult the entity declaration to see what input and output variables are available. Verilog has no generic construct. One has to use a preprocessor which then generates the corresponding code. An 8-bit adder in Verilog is shown in Program 3.4. An interesting note is that, unlike in Lola, it is not possible in VHDL or in Verilog to use the output signal of an instance directly. In the Adder type we need an auxiliary variable c and a map statement to transfer the carry-out of the previous AddElem to the carry-in of the next one. This may seem like a trivial limitation, but it ultimately decides whether a language is handy or not. Table 3.2 compares Lola with VHDL and Verilog and lists some of the major differences. [Table 3.2: Lola vs. VHDL vs. Verilog; columns: Feature, Lola, VHDL, Verilog; rows: Generics, Overloading, Structural, Behavioral, Synthesizable, Position Hints, Standard, Fast Compiler, Inspired by] 3.3 Trianus. Trianus is the name of the companion project of Hades, carried out by Stephan Gehring [Geh97]. It serves as the base and foundation for the Hades software. The principal architecture was designed by S. Gehring and then fine-tuned based on experience with the Hades tools. It has proven to be a very robust base for Had
215. ow that the building blocks are defined, we connect them together and define the control logic which steers the data flow in the application. Programs 6.5 and 6.6 show the Lola program of the final pattern match application, which implements 2 pattern matchers, each of length 4 (2x4). This application, together with a software driver, is used for the performance analysis in Section 6.2.11. The following list explains some points in the code of Programs 6.5 and 6.6. The numbers in the list correspond to the numbers in parentheses given in the programs. Program 6.4: Loadable and Buried Registers. register of N bits loadable with control signal TYPE LoadReg N IN ld BIT d N BIT OUT q N BIT BEGIN FOR i 0 N 1 DO q i REG ld d i END END LoadReg buried register of N bits loadable with direct register write TYPE BuriedReg N OUT q N BIT BEGIN FOR i 0 N 1 DO q i REG q i END END BuriedReg [Figure 6.7: Data and Pattern Registers]
216. r or by the programmer in the form of a library (e.g., a digital signal processor, graphics coprocessor). A reconfigurable coprocessor (RC), however, is a resource changing its behavior when it is reprogrammed. Therefore a runtime system is needed to manage this resource in a way transparent to the user. The first paper analyzing these issues in detail is [Bre96]. In its current form, the runtime system of Hades is sufficient for the manual loading of an RC application and interaction through interface objects. It is the software programmer or the client of the RC application who initiates the loading of the necessary bitstream onto the RC. The runtime system should provide more support for these tasks. It has to be known what application is currently loaded on the coprocessor in order to know if a software request results Program 5.18: Interface Objects. TYPE Interface POINTER TO InterfaceDesc InterfaceDesc RECORD next Interface interface objects can be linked name ARRAY 32 OF CHAR name of variable in Lola map XCBoard MapRegister relevant bits col INTEGER column where this value resides in read value from column col using map register map PROCEDURE VAR i InterfaceDesc Read write value to column col using map register map PROCEDURE VAR i InterfaceDesc Write END Int POINTER TO CharDesc IntDesc RECORD InterfaceDesc val INTEGER value being read and written PROCEDURE VAR i In
217. [Figure 6.8: Pattern Matcher without Placement Hints] Ideally, the pattern registers should be placed right next to the comparators. Also, since the data registers are read by the comparators, they should be placed such that the same bit of the register and the comparator lie in the same row. The AND gates of
218. r background information for the remaining chapters. Further information on Lola can be found in [Wir95, Wir96b]. An overview of Trianus is given in [GL96] and detailed information in [Geh97]. 3.1 Hardware Description Languages. As in most engineering disciplines, traditional hardware design involves a graphical approach using diagrams. Today many hardware engineers still use schematic entry to describe their hardware designs. Hardware synthesis tools are used to translate these schematic drawings into netlists containing gates such as ANDs, ORs, and registers, and wires connecting these gates. The graphical description is well suited to describe the global signal flow in a digital system. However, a schema containing many components is hard to read and understand. Also, entering and altering a schema is often a tedious task, as it involves the placement of graphical components on the drawing plane and connecting these with wires. Carrying over the methodology used in software programming languages, a textual description of hardware is possible using hardware description languages (HDLs). These textual descriptions are precise and the semantics of the individual language constructs are well defined. Describing repetitive components is accomplished easily and descriptions can be parameterized with certain values such as the width of a bus. HDL compilers perform the same task as their graphical cousins but take a textual description of the har
219. r manually prerouting certain nets and without using Magic routing resources. XACT succeeds on the first try only when using the Magic routing resources. Note that using type-based routing does not improve the result much. Hades has a compile, place, and route design cycle that is a factor of 20 faster than place and route using the Xilinx tools. In fact, Hades completes the design in the same time as XACT places a preplaced netlist, i.e., no work by the placer has to be done except for placing the mapper circuit and checking the validity of the placement hints. Even on the slow Ceres-2, the design is completed faster than on the PC using XACT. This is a somewhat sad result, as it indicates that 8 years of technological advancement are annihilated by software. This is a clear case for the applicability of Reiser's law, which says that software is getting slower faster than hardware is getting faster. A factor of 20 makes a qualitative difference in the usage of the tools. If we include VHDL compilation times, this factor increases to about 100 to 1000, depending on the VHDL compiler used. Note that the turnaround time with XACT is below one minute, which should therefore be considered as fast, but when the user has to perform several iterations to find a routable placement, a turnaround time of 2 seconds is much better than one of 40 seconds. Such a fast design cycle increases productivity and reduces the time needed to find a satisfactory placeme
220. r with an AND gate tree which is the fastest way to implement a high fan in AND gate The last AND gate e 3 0 will have a value of one if and only if each data bit d i has the same value as its corresponding pattern bit p i Program 6 3 shows the Lola type for that comparator x and y are the bit vectors to be compared Figure 6 5 shows the placement of an instance of the Comparator type It is composed of three arrays and the resulting placement thanks to the array heuristic is quite satisfactory 9 6 Application and Evaluation 110 Program 6 2 Mapping of 8 Bit to 5 Bit Characters TYPE Mapper IN in 8 BIT OUT out 5 BIT VAR t1xOxxxx tIxxOxxx tIxZZxxx tlxxxOxx txxxx0ZZ tx I I xxxx numeric BIT BEGIN t1xOxxxx 1n 6 in 4 tIxxOxxx 1n 6 in 3 t1xZZxxx tIxOxxxx tIxxOxxx tIxxxOxx 1n 6 in 2 txxxx0ZZ in 2 in 1 in 0 tx11xxxx in 5 in 4 numeric in 6 tx11xxxx in 3 in 2 in 1 out 0 7in 7 tIxZZxxx tIxxxOxx in 1 in 0 numeric out l 7in 7 tIxZZxxx t1xxx0xx in 0 in 1 numeric out 2 7in 7 t1xZZxxx in 2 numeric out 3 7in 7 tlx0xxxx in 3 in 6 in 3 txxxx0ZZ numeric out 4 in 7 tx11xxxx 71n 3 in 2 in 1 in 6 in 4 in 3 txxxx0ZZ END Mapper Figure 6 2 Mapper Circuit without Placement Hints 6 Application and Evaluation 111
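The comparator described in this excerpt (a per-bit equality test followed by an AND-gate tree) can be modeled in a few lines. This is a minimal Python sketch, not the Lola type from Program 6.3; the names `and_tree` and `comparator` are illustrative assumptions, and the pairwise reduction only mimics the logarithmic depth of a balanced gate tree.

```python
from typing import List


def and_tree(bits: List[int]) -> int:
    """Reduce bits pairwise, level by level, like a balanced AND-gate tree."""
    if not bits:
        return 1  # the empty AND is conventionally true
    level = list(bits)
    while len(level) > 1:
        nxt = []
        # Pair up neighbours; each pair corresponds to one AND gate.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] & level[i + 1])
        # An odd element passes through to the next level unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def comparator(x: List[int], y: List[int]) -> int:
    """Return 1 if and only if every bit of x equals the matching bit of y."""
    equal_bits = [1 - (a ^ b) for a, b in zip(x, y)]  # XNOR per bit position
    return and_tree(equal_bits)
```

For 5-bit characters the tree has three levels, which is why a tree is preferred over a linear chain of AND gates when the fan-in is high.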
221. ration of the processor interface and also avoids the need for additional synchronization registers between the Ceres bus and the XC6216. The G1 signal is connected to a 40 MHz oscillator, which is mounted on a socket on the board. It can be exchanged with slower or faster oscillators. The G2 signal is connected to the XCStep output of the decoder PAL. It is used for single stepping and can thus be used to debug an application. A single step can be generated by writing to address space 7. The three clocks proved sufficient for most applications. Either the main clock GClk or the single-stepping mechanism on G2 was used. The fast clock on G1 was only used to test the performance of some basic circuits (counters and adders). If an application demands a slower clock, the main or the fast clock must be divided in the XC6216. This is, however, quite inconvenient, as it introduces additional logic which must be placed at a certain location inside the chip, interfering with other logic and complicating the placement task. In retrospect, a more sophisticated clocking scheme based on an external programmable clock generator would have been advantageous. Care was taken when routing the clock signals. It was ensured that they run in a single line across the board without branches, which would cause reflections [JG93]. The main clock GClk, coming from the Ceres-2 connector, first goes to the decoder PALs, then to the FPGA, then to the expansion connector, and is parall
222. ration and more flexibility in programmable logic devices. Whereas PALs and complex PLDs are used for the replacement of glue logic on a printed circuit board and the implementation of decoders and simple state machines, FPGAs were invented as an alternative to mask-programmable gate arrays (MPGAs). These are mainly used for the implementation of high-volume, complex logic chips in an electronic device for which no standard off-the-shelf solution exists. A mask-programmable gate array is, as the name indicates, an array of gates with fixed functionality, such as NAND gates, where part of the routing network, typically the last metal layer in the silicon process, is determined by the designer. This last layer of metal is manufactured by the gate array vendor in a fabrication facility. The problem with this approach is that the manufacturing process usually takes several days or weeks and that it is quite expensive. Errors in the design lead to a faulty chip and hence to a substantial increase in costs. It is also not easily possible to explore different design alternatives, except by using slow simulation. FPGAs try to alleviate these problems by making the gates and the routing network programmable. The designer of a circuit can program the functionality of the chip in the field by simply downloading configuration bits onto the FPGA. FPGAs have a clear advantage over MPGAs, as it is possible to create a new gate array within a few seconds and with SRAM
223. riBase Mux1 node n y MUX enable MUXI n fct TriBase Mux n y node END [Figure 5.6: Mapping of AND Gate] [Figure 5.7: Mapping of Latch] 5 Hades Software 67 SR Latch. The set/reset latch with active-low control signals has no equivalent in the XC6200 logic cell and is implemented using two cross-coupled NAND gates. Like the D latch, the SR latch has to be rooted in a signal name. We refrain from explaining the transformation in detail and just give the equivalent description in Lola: label := SR(set, reset) => label := ~(~(label * reset) * set). The newly produced expression tree has to be mapped again, as we might have introduced additional negations and AND gates which should be mapped to OR gates. One notable point is that the transfo
224. ribed in the last section guarantees a routable design at the expression and type level Of course the approach wastes some cells by leaving them unused In our experience the user can manually improve such a placed expression quite easily while still keeping it routable The responsiveness of our tools guarantees that this can be done efficiently The reader is referred to Chapter 6 and Appendix F for case studies on using the tools interactively 5 5 7 Examples In the following we present some short Lola code examples and the corresponding layout produced by the placement algorithm The routing is shown as well The first six examples are taken from Chapter 6 Section 1 3 of Pfi92 which served as examples for the placement algorithm for the CAL FPGA architecture Alg90 Kea89 The algorithm presented in that thesis uses several heuristics to ensure a routable placement We chose an even simpler place ment algorithm As can be seen the produced result is always routable at the level of the type Figure 5 10 shows a simple expression tree The tree structure of the code can clearly be seen in the layout The first argument of the XOR gate is to the right of it while the second argument is above TYPE Example Tree IN a b c d e BIT OUT z BIT BEGIN z v a b c d x e END Example1 al fe a E E s Q L I
225. ries. By using the browser, it is possible to obtain a textual description of a laid-out circuit or sub-circuit. This can sometimes be useful for viewing a circuit in a different representation, for instance during manual layout, when many wires are shown on the screen. Schematics Editor. As a third possible view onto a circuit, or as a third means for design entry, Trianus offers a schematics editor called Schem, together with a circuit extractor and checker. It too is based on the editor framework and is a type-based tool. Using the browser, it is possible to extract a textual view of a schematic and pass this text to the Lola compiler. The resulting data structure can then be placed and routed using Hades. Hence, Schem is a different input possibility to describe part of a coprocessor application. Figure 3.7 shows an AddElem in a schematic view. [Figure 3.7: Schema Showing an AddElem] 3.4 Discussion. Lola is a simple, easy-to-learn hardware description language for describing digital circuits on a structural level. Module libraries can be built with Lola, which in turn can be used to construct coprocessor applications on the basis of small, composable components. Position statements can be used to guide a placement algorithm to obtain a good layout. The language is small, synthesizable, and has clear semantics, such that it can be learned and put to work withi
226. rmation has to be done in a way that respects the XC editor's limitation given in Table 5.1. The second NOT in the expression above has to be mapped to the left subtree of the AND gate. Mapping it to the right subtree could not be represented by the XC editor. This is quite a subtle point, as one might naturally write down the expression above as label := ~(set * ~(label * reset)), which would lead to an erroneous expression tree. Multiplexer. The mapping of a multiplexer node is straightforward. A negation on the selector input of the multiplexer is mapped by eliminating the negation and swapping the two inputs, i.e., a := MUX(~sel: in0, in1) => a := MUX(sel: in1, in0). A negation occurring at the output of the multiplexer is propagated to the inputs. If an input is already negated, the negation is eliminated; otherwise a new negation is inserted: a := ~MUX(sel: in0, in1) => a := MUX(sel: ~in0, ~in1). Register. A register with an enable input is implemented using a multiplexer in front of the register, with the selector signal being the enable and one feedback path being the register's output: a := REG(enable: b) => a := REG(MUX(enable: a, b)). Since the enable signal might be negated, the generated multiplexer has to be mapped again to eliminate the negation and swap the inputs. When a register is reading directly from a signal name, an additional buffer between the register and the signal has to be inserted, as this is man
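The two multiplexer rewrite rules and the register-with-enable construction stated in this excerpt can be checked exhaustively. The following Python sketch is an illustration only; `mux`, `inv`, `check_mux_rules`, and `reg_with_enable` are hypothetical helper names, and the multiplexer convention (in0 for sel = 0, in1 for sel = 1) is taken from the Lola notation described elsewhere in the document.

```python
from itertools import product


def mux(sel: int, in0: int, in1: int) -> int:
    """MUX(sel: in0, in1): in0 when sel is 0, in1 when sel is 1."""
    return in1 if sel else in0


def inv(x: int) -> int:
    """Single-bit negation."""
    return 1 - x


def check_mux_rules() -> bool:
    """Verify both negation rules for every combination of input bits."""
    for sel, in0, in1 in product((0, 1), repeat=3):
        # Rule 1: a negated selector equals swapped data inputs.
        if mux(inv(sel), in0, in1) != mux(sel, in1, in0):
            return False
        # Rule 2: a negated output equals negated data inputs.
        if inv(mux(sel, in0, in1)) != mux(sel, inv(in0), inv(in1)):
            return False
    return True


def reg_with_enable(state: int, enable: int, b: int) -> int:
    """Next register value for a := REG(MUX(enable: a, b))."""
    # With enable = 0 the register feeds back its own output; with 1 it loads b.
    return mux(enable, state, b)
```

Because rule 2 toggles the negation on each input, an input that was already negated ends up without a negation, which matches the "eliminate, otherwise insert" wording of the text.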
227. rmine the polarity of input signals to a cell The reader should keep the presence of these inversions in mind when reading subsequent chapters All Boolean functions of two or less inputs can be implemented by a single multiplexer with optional inversions at its inputs Figure 5 4 in Chapter 5 shows all cell functions supported by Hades As an example we show the implementation of an AND and an XOR gate other functions can be implemented accordingly The true value of the function is taken at point C in Figure 2 4 hence the inversion on the output of the central multiplexer must be taken into account To implement F a A b we connect a to XI and X3 and b to X2 Y2 and Y3 select the inverted value of X2 and X3 respectivelv F a A b can be implemented by selecting the true value of X2 at Y2 To implement F a Q b we connect a to XI and b to X2 and X3 inverting X3 at Y3 Figure 2 5 shows the resulting circuit and Table 2 1 shows the truth table of the functions We use the Lola notation for describing a multiplexer Chapter 3 where MUX sel inO inl indicates that when sel is zero the multiplexer selects in0 and when it is one it selects in 2 3 2 Routing Network The routing network of the XC6200 consists of connections between neighboring cells and a hierarchy of longer connections between switches located at 4 and 16 cell boundaries and at 2 Field Programmable Gate Arrays 12
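The claim that every Boolean function of two or fewer inputs fits into a single multiplexer with optional input inversions can be verified by enumeration. The sketch below is a simplified Python model under stated assumptions: it uses an ideal multiplexer with the MUX(sel: in0, in1) convention from the text and ignores the XC6200 cell's output inversion and the X1/X2/X3 pin assignment; `SOURCES` and `reachable_functions` are hypothetical names.

```python
from itertools import product


def mux(sel: int, in0: int, in1: int) -> int:
    """MUX(sel: in0, in1): selects in0 when sel is 0 and in1 when sel is 1."""
    return in1 if sel else in0


# Signals that may drive a multiplexer input: a, b, or their inversions.
SOURCES = {
    "a": lambda a, b: a,
    "~a": lambda a, b: 1 - a,
    "b": lambda a, b: b,
    "~b": lambda a, b: 1 - b,
}


def reachable_functions() -> dict:
    """Map each reachable truth table to one (sel, in0, in1) assignment."""
    found = {}
    for sel, in0, in1 in product(SOURCES, repeat=3):
        # Truth table over (a, b) = (0,0), (0,1), (1,0), (1,1).
        table = tuple(
            mux(SOURCES[sel](a, b), SOURCES[in0](a, b), SOURCES[in1](a, b))
            for a, b in product((0, 1), repeat=2)
        )
        found.setdefault(table, (sel, in0, in1))
    return found
```

In this model, AND corresponds to MUX(a: a, b) and XOR to MUX(a: b, ~b), which is in line with the excerpt's description of feeding one signal to the selector and the other, optionally inverted, to the data inputs.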
228. roc 5th Intl Workshop on Field Programmable Logic and Applications LNCS 975 Springer 1995 P W Foulk Data Folding in SRAM Configurable FPGAs Proc IEEE Sympo sium on FPGAs for Custom Computing Machines IEEE Computer Society Press 1993 R J Francis J Rose Z Vranesic Chortle crf Fast Technology Mapping for Lookup Tabled Based FPGAs Proc 28th Design Automation Conference ACM IEEE 1991 J Galloway The Transmogrifier C Hardware Description Language and Com piler for FPGAs Proc IEEE Symposium on FPGAs for Custom Computing Ma chines IEEE Computer Society Press 1995 Bibliography 171 GJ79 Geh97 GLW94 GL 96 GHK90 Gol89 GS95 Guc94 GG 95 Guc95 Gut97 GMN96 HH95 HWA76 HHC96 Hee88 Hee93 M R Garev D S Johnson Computers and Intractabilitv A Guide to the Theorv of NP Completeness W H Fremann 1979 S Gehring An Integrated Framework for Structured Circuit Design with Field Programmable Gate Arrays Dissertation to appear ETH Z rich 1997 S Gehring S Ludwig N Wirth A Laboratory for a Digital Design Course Us ing FPGAs Proc 4th Intl Workshop on Field Programmable Logic and Applica tions LNCS 849 Springer 1994 S Gehring S Ludwig The Trianus System and its Application to Custom Com puting Proc 6th Intl Workshop on Field Programmable Logic and Applications LNCS 1142 Springer 1996 M Gokhale W Holmes A
229. routing-rich FPGA. For groups without this kind of resources, fast, interactive, integrated tools like the Trianus and Hades system provide a better way to achieve dense layouts quickly. 8 Summary, Conclusions, and Outlook. 8.1 What has been Accomplished. A project covering both disciplines of computer science and computer engineering is very interesting and challenging. In this thesis, we described the development of both hardware and software to build a complete system called Hades. A reconfigurable coprocessor based on the new Xilinx XC6200 FPGA architecture was developed, along with associated layout synthesis tools (place and route) and a rudimentary runtime system for coprocessor applications. No new algorithms were developed; no novel approaches in hardware design were invented. We made use of the available knowledge of algorithms and of the novel features the hardware provided, and combined these into a usable, efficient, reliable, and small system for experimenting with and implementing algorithms on a reconfigurable coprocessor. 8.2 Hades Hardware. The Hades hardware consists of a reconfigurable coprocessor board containing the new Xilinx XC6200 FPGA. The board is designed for the Ceres-2 workstation. It implements a memory card interface such that communication with the board uses the same protocol and has the same latency as normal DRAM memory. The configuration memory and cell states of the XC6200 FPGA are accessible via this interface as wel
230. rsion 4 is the lack of a reference implementation. When porting our tools from one platform to another, ever so often small adjustments had to be made. 5.12 Discussion. Hardware synthesis is a difficult problem and may never be as fast as software synthesis, i.e., object code generation. The layout problem alone is more difficult, as it is two-dimensional, whereas in software it is one-dimensional (placement of instructions in the instruction stream). Hence, it might always be necessary for designers to resort to low-level descriptions of hardware, including placement information, to achieve high performance and density. Similarly, in software, programmers sometimes have to resort to assembly language to reach higher levels of performance. To support this design style, the current Hades software provides for a very fast design cycle and lets the user exert tight control over the design process. As we develop good reusable libraries, fewer and fewer gates have to be placed manually using the layout editor. Data path elements should be preplaced and prerouted. In the end, only the placement of state machines and other control logic might need manual assistance. For this, a simulated annealing approach might be viable. For the placement of instances, a min-cut algorithm such as the one in [Kri84] should be implemented and evaluated. Once library components are used, the placer will be constrained in choosing the placement f
231. ructure for AddElem Type 3 Foundations Lola and Trianus 35 3.3.3 Algorithms. The core of Trianus provides various algorithms which operate on the data structures just described. The most important ones for the Hades tools are algorithms for placing nodes, which are used by the placement algorithm; inserting wires, which are used by the router; and broadcasts, which are used by all back-end tools such as the mapper, placer, router, and loader. The broadcasting mechanism is of special interest, as it is used to guarantee the aforementioned consistency between instances and their type. The relevant constants, types, and procedures are shown in Program 3.10. Depending on a selector, a message based on the base type Message is sent to the nodes and wires in a Trianus data structure, including all extensions of nodes. For instance, it is possible to send a message to all instances of the same type by using the selector SelType. This can be used to distribute placement and routing information to all instances of a type after the type has been placed and routed. Note: the terminology used herein for message sending is not to be confused with the one from object-oriented programming. The broadcast mechanism is a generic data structure iterator construct. It applies operations to the nodes and wires of a data structure. The nodes and wires do not react to the message sent to them; it is rather the iterator mechanism that invokes the operation on the no
232. s Once it is known which nets have to be routed the sequence in which the nets are routed has to be determined We choose a straightforward heuristic to determine this sequence Shorter nets are routed before longer nets and nets with the source and destination in the same row or the 5 Hades Software 89 IL FH An TL THA 10 0 L L 14 0 Figure 5 18 Routing Resource Conflicts Program 5 16 Marking of Wires Running over Instances Below some necessary type guards have been left out for brevity PROCEDURE MarkWire wire VAR mark r mark router TriBase2 AbsWire wire wu WV absolute position of wire DEC wu mark minU DEC wv mark minV wu wv relative position of wire to instance causing this broadcast IF In wire to r wires THEN only consider legal wires mark wire position relative to tvpe s u v SetMap r r type u wu r type v wv wire to Used END PROCEDURE MarkWires inst VAR msg ASSERT inst tvpe msg tvpe OR inst msg type 100 IF inst type msg type THEN it is an instance not the type itself minU minV contains absolute position of inst TriBase2 AbsNode inst mark minU mark minV bounding box of inst is stored into mark r mark r u mark minU mark r v mark minV mark r w inst w mark r h inst h mark doWire MarkWire mark router
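The first part of the ordering heuristic stated at the start of this excerpt (shorter nets are routed before longer ones) can be sketched as a simple sort. This Python fragment is an assumption-laden illustration: the net representation, the names `net_length` and `routing_order`, and the use of Manhattan distance as the length measure are not taken from the source, and the second criterion (same row or ...) is cut off in the excerpt and therefore not modeled.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def net_length(net: Dict) -> int:
    """Manhattan distance between source and destination (assumed metric)."""
    (su, sv), (du, dv) = net["src"], net["dst"]
    return abs(su - du) + abs(sv - dv)


def routing_order(nets: List[Dict]) -> List[Dict]:
    """Return the nets sorted so that shorter nets are routed first."""
    # sorted() is stable, so nets of equal length keep their original order.
    return sorted(nets, key=net_length)
```

Routing short nets first tends to leave the scarce long-distance resources free for the nets that really need them, which is presumably the motivation behind the heuristic.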
233. s type w bounding box width calculate bounding box type h bounding box height msg type type msg placer p propagate placement information stored in the placer msg placer to all instances of type msg type TriBase Broadcast module msg TriBase SelType list list next END NewPlacer p PreplaceOutputBufs p module PlaceDescendants p module UpdateType p copy placement information The interesting part of the placement algorithm is the placement of the signals and instances occurring in a type. This is accomplished by procedure PlaceDescendants. We use several heuristics, described below, to achieve a reasonable placement, which can then be improved using the layout editor and placement hints. The sequence of objects being placed is as follows: (1) instances and signals, with their associated expression tree, which have a position hint; (2) arrays of instances; (3) arrays of variables; (4) arrays of output and inout signals; (5) instances; (6) remaining signals; (7) anonymous expressions reachable only via an input signal of an inner instance. The reader may note the extensive use of the array property. In fact, the presence of arrays of signals and instances, and the availability of this information in the back-end tools, is the single most crucial piece of information. It makes a simple constructive approach usable, achieving a reasonably good placement on the first attempt. In the following we present th
234. s accomplished by invoking a procedure for every instance and for the type itself using a type broadcast (cf. Section 3.3.3). The mapper is the first of the back-end tools making use of this broadcasting mechanism. Recall that the type broadcast invokes the procedure stored in the doNode field of the message for all instances of the type stored in the type field, and finally for the type itself. The procedure used, MapType, calls MapDescendants, which traverses all signals defined in an instance or type and invokes a map procedure for them. After the broadcast, the signals occurring in the module itself are mapped. Finally, unused signals and instances are removed from the data structure. This could be done by the compiler, but was not considered during its development. It came as a natural addition to the mapping process, as every node is visited by the algorithm and its use can be determined. Program 5.1 presents the pseudocode for the process just described. Program 5.1: Overview of Mapping Algorithm. PROCEDURE MapModule(module); TriBase.InitID(module); (* unmark all nodes and wires *) TriBase2.TopoSort(module, list); (* topological sort *) msg.doNode := MapType; WHILE list # NIL DO msg.type := list.type; (* call msg.doNode for every instance of msg.type and for msg.type itself *) TriBase.Broadcast(module, msg, TriBase.SelType); list := list.next END; MapDescendants(module); RemoveUnused(module). We will now describe the necessary steps to actually map the opera
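The control flow of the mapping driver (topological sort of the types, then one type broadcast per type) can be modeled as follows. This is a hedged Python sketch, not the Oberon implementation: `TypeDesc`, `Instance`, `broadcast_type`, and `map_module` are hypothetical names, and the assumption that inner types are processed before the types that instantiate them is mine, since the excerpt does not state the sort direction.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9 or later
from typing import Callable, Dict, List


@dataclass
class TypeDesc:
    name: str
    uses: List[str] = field(default_factory=list)  # types instantiated inside


@dataclass
class Instance:
    name: str
    type_name: str


def broadcast_type(instances: List[Instance], types: Dict[str, TypeDesc],
                   type_name: str, do_node: Callable) -> None:
    """Invoke do_node for all instances of a type, and finally for the type."""
    for inst in instances:
        if inst.type_name == type_name:
            do_node(inst)
    do_node(types[type_name])


def map_module(types: Dict[str, TypeDesc], instances: List[Instance],
               do_node: Callable) -> None:
    """Visit the types in topological order, one type broadcast each."""
    # TopologicalSorter expects node -> predecessors; used types come first.
    graph = {t.name: set(t.uses) for t in types.values()}
    for type_name in TopologicalSorter(graph).static_order():
        broadcast_type(instances, types, type_name, do_node)
```

The callback plays the role of the doNode field: the nodes themselves stay passive, and the iterator decides which objects the operation is applied to.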
235. s caching, loop unrolling, and implementation in assembly language. Often, though, these techniques do not achieve the required speedup. If a large enough user base is interested in solving such a task quickly and there is enough economic interest to justify the investment, special-purpose hardware is built. This hardware solution of a task can consist of a board full of chips or of a single chip. The latter is called an Application-Specific Integrated Circuit (ASIC). The production of ASICs is expensive and time-consuming, but the resulting chips solve the task much more quickly than a general-purpose CPU. A coprocessor is an ASIC which aids the CPU by speeding up a task and usually requires the presence of the CPU in a system to function properly. Figure 1.1 shows a schematic view of such a system. The coprocessor is either closely coupled to the CPU with a dedicated interface, or it resides on an extension card attached to the system bus. [Figure 1.1: CPU and Coprocessor] Well-known examples of early coprocessors are Floating Point Units (FPUs) such as the Intel 8087 and the Motorola 68881. In the past years, as chip area became larger and cheaper, FPUs were integrated into the CPUs, resulting in even better performance. Coprocessors more common today are digital signal processors (DSPs), video
236. s in the second subtree Thus the first subtree is placed at the same vertical position as the root cell and the second subtree is placed at the same horizontal position with the vertical position increased by the first subtree s height In case of a multiplexer the second subtree is placed to the right of the root cell and above the first subtree and the third subtree is placed above the root cell and above the second subtree The figures at the end of this section illustrate the different placement strategies Program 5 10 and 5 11 show in more detailed form the actions taken when placing the various node types In that program p is a placer data structure which stores positional information for the placed nodes This information is used during a type broadcast at a later stage Program 5 10 Placement of Nodes I PROCEDURE PlaceArg p arg u v offU off V to VAR w h If arg is a gate or another expression tree it is placed at u offU v offV If it is an input label it is placed at the point of usage u v w and h return the size of the arg expression tree PROCEDURE PlaceNode p sig u v to VAR w h IF sig id Void THEN not yet placed my Wx 0 myWy 0 myHx 0 myHy 0 CASE sig fct OF TriBase BIT TriBase TS new root if already placed it is a leaf PlaceObject p sig TriBase Object u v myWx myHx myWx Max my Wx Cell myHx Max myHx Cell TriBase Zero TriBase One leaf cell ASSERT sig x
237. s initialized with the value Free, indicating that all routing resources are available. Then, already used routing resources, i.e., wires already present in the Trianus data structure, are marked in the Lee map with Used. Finally, all wires that are being sourced by the source node, and all neighbor outputs of the source node itself, are marked as possible destinations of the wave with the value Dest. The spreading of the wave in the Lee map can terminate as soon as a position marked as Dest is encountered. The spreading starts at the destination node, proceeding outwards in all directions until the bounding rectangle (cf. next paragraph) is encountered or an entry marked as Dest is found. The spreading has to start at the destination node, since we might have multiple target positions marked as Dest. For each routing resource which might be connected to the destination, all possible sources feeding that routing resource are marked as well, and the wave spreads from those points further on. To ensure that, once the source is found, a way back to the destination can be constructed, the routing resource from which the wave spread is entered into the map at the current position. To limit the number of map points visited, a bounding rectangle is used. If a type is routed, the size of this rectangle is the size of the type's bounding box (cf. Section 5.6.5). If the net is in the top-level module, the rectangle is made 1/4 larger on each side than the bounding
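The wave expansion described in this excerpt can be illustrated with a strongly simplified model. The Python sketch below assumes a plain grid with four-neighbor moves instead of the XC6200's routing resources, and the names `lee_route`, `FREE`, `USED`, and `DEST` are illustrative; it keeps the three ideas from the text: the wave starts at the destination node, each visited position records where the wave came from, and a bounding rectangle limits the search.

```python
from collections import deque
from typing import List, Optional, Tuple

FREE, USED, DEST = 0, 1, 2
Point = Tuple[int, int]


def lee_route(grid: List[List[int]], start: Point,
              bounds: Tuple[int, int, int, int]) -> Optional[List[Point]]:
    """Spread a wave from start (the destination node) to the first DEST cell.

    grid is indexed as grid[v][u]; bounds is (u0, v0, u1, v1), inclusive.
    Returns the cells from the DEST cell back to start, or None if unroutable.
    """
    u0, v0, u1, v1 = bounds
    came_from = {start: None}  # back pointers entered while the wave spreads
    queue = deque([start])
    while queue:
        pos = queue.popleft()
        if grid[pos[1]][pos[0]] == DEST:
            path = []
            while pos is not None:  # follow the back pointers to start
                path.append(pos)
                pos = came_from[pos]
            return path
        u, v = pos
        for nxt in ((u + 1, v), (u - 1, v), (u, v + 1), (u, v - 1)):
            nu, nv = nxt
            if not (u0 <= nu <= u1 and v0 <= nv <= v1):
                continue  # outside the bounding rectangle
            if nxt in came_from or grid[nv][nu] == USED:
                continue  # already reached, or resource already occupied
            came_from[nxt] = pos
            queue.append(nxt)
    return None
```

Starting at the destination means a single expansion can stop at whichever of several Dest entries is reached first, which is the reason the text gives for the reversed search direction.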
238. s is oriented towards datapaths in that it tries to preserve the hierarchical and structural information available in the design description. One of the project's goals is to have fast tools. 7.3.3 MATRIX. MATRIX [MD96, DeH96] from the Massachusetts Institute of Technology, Cambridge, is an array of programmable functional units operating on 8-bit operands and a dynamically programmable connection network. Data flow inside MATRIX can be steered by the application itself, and instructions for the functional units can flow alongside the data. The architecture can be used to implement systolic arrays, VLIW processors, microcoded processors, or any combination thereof. No compiler exists yet. 7.3.4 RaPiD. RaPiD from the University of Washington, Seattle, implements a reconfigurable pipelined data path. Similar to MATRIX, it has programmable functional units, but not an equally flexible connection network. It is optimized for systolic array operations. The functional units have floating-point capability, an ALU, and a multiplier. The connection network is augmented with pipeline registers, leading to efficient, small systolic array structures. No compiler exists yet. 7.4 High-Level Hardware Description. Currently there are three widely used ways to describe an application on a reconfigurable coprocessor or a custom computer. One uses schematic entry, one uses a hardware description language, and one uses a program written in a general-purpose high-le
239. s placed below the result FIFO queue in the rightmost column. The critical path through this circuit is 137 ns. It runs from a data register through the comparator, the eql chain, and the OR gates to the result queue. Clearly such a delay is unacceptable, although it still results in an aggregated performance of 1.4 billion character comparisons per second (10^9 / 137 × 16 × 12). Pipelining could be used to shorten the critical path of this circuit. Even so, the 7 MHz throughput rate of the circuit would suffice when reading data from a disk. 6.2.10 Software Interface. To make a hardware application accessible from software, a driver module is needed. Using the automatic software interface generator available in Hades, a programmer can construct a driver module quickly (cf. Section 5.8). Programs 6.7 and 6.8 list the module produced by the Hades interface generator for the 2x4 pattern matcher shown in Figure 6.9. The two procedures marked with (1) and (2) were written by the programmer to ease the task of downloading the patterns. Note that instead of implementing the mapping table for the patterns in software, we simply use the available mapping hardware in the coprocessor application to map the pattern characters to their 5-bit values. If needed, the software programmer can edit the generated code, for example to change the interface types of shiftReg and result to a set. The impact of this change can be seen in Program 6.9, which lists the search proc
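The throughput figures quoted for the pattern matcher follow directly from the critical path. The short Python calculation below reproduces them; the variable names are illustrative, and reading the factors 16 and 12 as the number of comparisons performed in parallel per clock cycle is an assumption, since the excerpt gives only the bare product.

```python
# Critical path of the pattern matcher as stated in the text.
critical_path_ns = 137

# One clock cycle cannot be shorter than the critical path.
clock_hz = 1e9 / critical_path_ns

# Factors quoted in the text; assumed to be comparisons done in parallel.
comparisons_per_cycle = 16 * 12

comparisons_per_second = clock_hz * comparisons_per_cycle

print(f"clock rate: {clock_hz / 1e6:.1f} MHz")
print(f"comparisons: {comparisons_per_second / 1e9:.1f} billion per second")
```

The clock rate of roughly 7.3 MHz matches the "7 MHz throughput rate" mentioned in the text, and the product rounds to the stated 1.4 billion comparisons per second.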
240. s used to represent the outermost scope of a program. E.g., type AddElem, type Adder(N) in Program 3.1, module Add in Program 3.2. Wire. The last type left to describe is Wire. It represents a physical connection in a circuit, whereas logical connections are represented by the operand tree of Node. For example, an OR gate reading from two AND gates, as shown in the full adder in Program 3.1, has two logical connections to the AND gates, represented by the x and y pointers in the Node data structure, while the actual connections made in an XC6200 FPGA, for instance using neighbor connections, would be represented by Wires. The definition of Wire is listed in Program 3.9. The specific values for from and to are determined by the respective back ends. The back end for the XC6200 FPGA, for instance, stores information on the routing multiplexers into from and to, e.g., from "Function Unit output" to "West neighbor routing multiplexer". Program 3.9: Definition of Wire. Wire = POINTER TO WireDesc; WireDesc = RECORD from, to: SHORTINT; (* physical source and destination *) u, v, w, h: INTEGER; (* physical location and dimension *) next: Wire; (* next in same net *) link: Wire; (* next in same instance *) outer: Instance; (* outer instance type *) id: INTEGER END; Summary and Example. Figure 3.3 summarizes the principal types of a Trianus data structure and shows the corresponding Lola constructs. Trianus Types Lola Construct
241. se some gates were left unplaced The router took about 40 times as much time and resulted in twice as many unrouted nets despite utilizing the Magic resources We tried the automatic ripup and reroute feature of XACT After 10 iterations which took nearly 3 hours there still were 52 unrouted nets This experience undermines our request for a fast interactive and iterative design cycle 1247 Cells and 3931 Nets Hades XACT high Map amp Place sj gt 035 ms Bounding Box 34x4 34x54 Roe f 231s 1976s Umwws 8 87 Route using Scrip 412s orke Unos Total RR Speedup of Hades Table F 3 Wotan Place amp Route Times 41 7s 1236 s 10060 s EE Clearly the requirement of our tools to specify the location of nearly every cell is an undesirable feature but it is necessary to achieve a compact routable and fast design The sophisticated placement algorithm of XACT fails to find a satisfactory solution and the user has to give hints as well Wotan has a critical path of 163 ns and should therefore run at about 6 MHz Since most instructions have two phases except for load and store it executes about 3 MIPS The resulting design is shown in Figure F 8 Note that the size of the bounding box in Table F 3 does not include the address drivers on the right side of the chip With these the bounding box would be 64x54 cells F Wotan Microprocessor 166
242. signer shifts the view onto a circuit, only those parts should be redrawn which enter the field of view; the unchanged portion of the screen should be moved using a bit-block transfer. It is a sad fact that many commercial layout editors simply redraw the whole screen when something is changed or the view is shifted. This results in slow screen updates and flicker, and it unnecessarily slows down the editing and layout process. 3.3.6 Other Tools. Trianus offers additional tools, of which only the circuit extractor is used by the Hades router. They are briefly described in the following sections. More detailed information can be found in [Geh97]. Checker and Extractor. Rather than synthesizing circuits, it is sometimes desirable or even necessary to hand-craft a circuit with the aid of a layout editor. Such manual circuit implementation is inherently error-prone. To support manual implementation, Trianus provides a circuit checker which compares a design with a circuit specification. The design may be entered with the layout editor, and the specification may be obtained through a Lola program. Mismatches between specification and implementation are detected and denote errors in the laid-out design, assuming a correct specification. The checker is type-based and therefore very fast. Matching is based on ordered binary decision diagrams (OBDDs) [Bry86, Bry92, Geh97]. Each variable can be either one or zero and thus
243. speed up this evaluation process is by using a second processor in a multiprocessor system. Another way would be to use an RC board which has access to main memory. The guards could then be evaluated by the coprocessor, and the scheduler running on the CPU could simply test a bit vector to determine which guard is asserted. 6.5.3 System Monitoring. An RC board which has access to the system bus can monitor the computer system and gather statistical information about bus traffic. This can be useful to analyze system performance and to find out where an application spends most of its time. At DEC SRC, for evaluation purposes, a PCI Pamette [Sha96] is used to generate and monitor traffic on the PCI bus. 6.5.4 Support for Arbitrary Precision Integers. To speed up calculations with arbitrary precision integers, an RC board can be used to accelerate addition and especially multiplication operations. The PAM group used FPGAs to speed up long integer multiplication and implemented an RSA algorithm which held the speed record for a long time [VBR96]. 6.6 Discussion. Our own experience, and experiences from other groups with Lola, Trianus, and Hades, give reason to believe that we have constructed a usable and reliable system to define and implement RC applications. More work is needed to find out what kinds of applications are suitable for RCs, i.e., a taxonomy is needed for quickly determining the applicability of RCs
244. ss comparator, the router optimized address line A24 by moving it to its natural position between A25 and A23 (cf. Section 4.5.1). The wire length was reduced since a crossover was removed, but the decoded address was wrong. Cutting the nets on the PCB by hand and inserting two patch wires solved the problem, however in a not very aesthetic way. The lesson learned was: double and triple check the final plots of your board before sending them to the manufacturer.

4.6.5 Power Consumption
A fully populated Hades board with 8 SRAM chips, 3 decoder PALs, and an idle FPGA draws 980 mA (4.9 W) in standby mode without being accessed. Accessing the SRAM from the host draws another 230 mA (1.15 W). An FPGA design full of registers (4096) toggling at 25 MHz adds another 300 mA (1.5 W). For comparison purposes, the Ethernet board for the Ceres-2 draws 850 mA (4.25 W) in standby mode without being accessed, so the standby current used by Ceres-2 extension cards seems to be quite high.

4.7 System Software
The Hades board occupies 1 MB in the address space of the host's operating system. This address space must not be cached, otherwise changes in the FPGA or in the on-board memory are not observed by the processor. The Ceres-2 runs the Institute's Oberon operating system [WG92], whereby we were able to modify the kernel to provide the needed address space. More details on the bitstream generator and the hardware s
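The current and power figures quoted in the power consumption paragraph are consistent with a 5 V supply. The following minimal sketch checks that arithmetic; the supply voltage is an assumption inferred from the numbers (980 mA and 4.9 W), and the function and dictionary names are purely illustrative.

```python
# Assumption: a 5 V supply, inferred from the quoted pairs (e.g. 980 mA <-> 4.9 W).
SUPPLY_VOLTS = 5


def power_mw(current_ma, volts=SUPPLY_VOLTS):
    """Return the power in milliwatts drawn at the given current in milliamperes."""
    return current_ma * volts


# (current in mA, power in mW) as quoted in the text.
QUOTED = {
    "Hades standby": (980, 4900),
    "SRAM access (additional)": (230, 1150),
    "4096 registers at 25 MHz (additional)": (300, 1500),
    "Ceres-2 Ethernet board standby": (850, 4250),
}

for label, (current_ma, quoted_mw) in QUOTED.items():
    print(f"{label}: {current_ma} mA -> {power_mw(current_ma)} mW (quoted {quoted_mw} mW)")

# Sum of the three Hades contributions, as a rough upper figure under this assumption.
total_ma = 980 + 230 + 300
print(f"Hades, all three contributions: {total_ma} mA -> {power_mw(total_ma)} mW")
```

Working in integer milliamperes and milliwatts avoids floating-point rounding when comparing against the quoted values.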
245. ss to the FPGA over the 33 MHz PCI bus, which takes an optimistic 60 ns, the Alpha could issue, in theory, 60 64-bit instructions. Hence, either the latency of communication must be improved drastically, or reconfigurable coprocessors only make sense with slower CPUs, which are hard to come by these days. In the future, we see several possibilities for improving the communication speed and reducing the latency:
• An RC could be mounted on a memory card and attached directly to the fast local memory bus of the CPU (as is done on Ceres), as opposed to the system bus (as is done on PCI).
• The RC could be attached to the CPU via a coprocessor interface.
• The FPGA could be moved directly onto the CPU die and incorporated into the data path of the CPU [BRA96, DeH96, ECF96, Raz94].

8.6.2 Software
It seems that, due to its very nature, software is never finished. We see several opportunities for improving the software in Hades.

Placer
Although the deterministic placement algorithm currently used in Hades is very fast, the produced results must be improved manually. Slower but smarter algorithms should be evaluated to improve the placement within types and the placement of instances. For random logic, a stochastic algorithm could be used; for arrays, the approach used in the Hades placer gives good results; and for placing whole instances, a min-cut approach could be used.

Router
The routing algorithm is currently the bottleneck in the H
246. st instance and compares the positions of all subsequent instances to that position, excluding routing resources not available to all of them. Obviously, invoking this procedure is not necessary when the module itself is routed.
4. A further problem with multiple instances of the same type is the fact that routing resources crossing these instances must be marked as used, as they are not available for routing the current type. Consider again the instances shown in Figure 5.18. The neighbor wires running upwards in the third column make the cell's north output of the second and third column unusable during the routing of the type. The north output is neither available in the third cell of i1 nor in the second cell of i2, hence it is not available for routing in the second and third cell of the type itself. Program 5.16 shows in detail how the marking is accomplished, namely by using two nested broadcasts: a type broadcast for finding all instances of the current type, and a bounded broadcast for marking the wires intersecting the current instance.
5. Just as was the case in the mapper algorithm, the router has to process input variables separately, as they are additional destinations of nets which are not accessible from

Program 5.14: Overview of Routing Algorithm I
PROCEDURE RouteSingleType (module, type)
  PrepareRoute(module)  (* extract manually inserted wires (1) *)
  TriBase2.TopoSort(type, list)
  WHILE list # NIL DO this
247. stem bus such as a PCI bus [PCI93]. Chapter 7 presents some of the RCs that have been developed in the past. The bigger boards described therein achieve or surpass performance levels normally attributed to supercomputers. Programming an RC is difficult, even more so than programming a CPU. As the execution unit of an RC is one or several FPGAs, programming an RC means designing hardware. There are efforts under way to allow for the generation of FPGA designs directly from a high-level programming language such as C [AS93, Gal95, IS95], but by and large the main method for describing an RC application is by means of schematic entry or hardware description languages. This might be one reason why RCs are still not very popular, as few people are trained in hardware design and as hardware synthesis software is usually big, expensive, and often slow.

1.4 Hardware Synthesis
Hardware is usually described using a combination of the following three means:
• Schematic entry
• Hardware description language (HDL)
• Circuit layout
Many hardware design engineers still use schematic entry to describe a hardware circuit. But over the last years, hardware description languages (HDLs) have gained ground and are becoming more popular. For utmost performance and density, using a layout editor is still the preferred way for describing a circuit. Note that any combination of these three methods can be used to describe a complete system. A
248. t coming from the OR gate in the previous adder element. Routing resources can be put into two categories: channels and switchboxes. A routing channel is a region free of logic cells. The cells lie on either side of the channel. Wires run from one side of the channel to the other, possibly crossing over each other, connecting cells on one side with cells on the other side (see Figure 5.2). A more general routing structure is the switchbox, which connects wires from different sides. While routing channels are often found in standard cell technology, switchboxes are typical for FPGAs. In Figure 5.2, a channel router connected the top row of cells to the lower row of cells. For a given width, the channel router uses as high a channel as is necessary to perform the routing. These routers do not perform well in FPGAs, as there is a multitude of available routing resources, not just channels.

Figure 5.2: Routing Channel

A maze-running router [Lee61, Hig69] spreads a wave from the destination D of a signal in all directions until the wave reaches the source S of the signal (cf. Figure 5.3). Depending on the implementation, the wave starts from the source or from both ends. Also, the shape of the wave can be influenced in many ways, such that only horizontal or only vertical wires are used, or that expansion in one direction is more costly than in another direction. A maze runner represents the most flexibl
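The wave expansion of a maze-running router can be sketched as a breadth-first search on a grid. The sketch below is a generic, uniform-cost illustration of the idea and not the Hades router itself: the grid encoding (`#` for blocked cells), the function name, and the four-neighbor move set are assumptions made for the example.

```python
from collections import deque


def lee_route(grid, source, dest):
    """Route from source to dest on a grid of strings ('#' = blocked).

    The wave starts at the destination and spreads until it reaches the
    source; the path is then traced back along decreasing wave numbers.
    Returns a list of (row, col) cells from source to dest, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def neighbors(cell):
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                yield (nr, nc)

    # Wave expansion: label every reached cell with its distance to dest.
    wave = {dest: 0}
    frontier = deque([dest])
    while frontier:
        cell = frontier.popleft()
        if cell == source:
            break
        for nxt in neighbors(cell):
            if nxt not in wave:
                wave[nxt] = wave[cell] + 1
                frontier.append(nxt)

    if source not in wave:
        return None  # the net cannot be routed on this grid

    # Backtrace: always step to a neighbor whose wave number is one lower.
    path = [source]
    cell = source
    while cell != dest:
        cell = next(n for n in neighbors(cell) if wave.get(n) == wave[cell] - 1)
        path.append(cell)
    return path


if __name__ == "__main__":
    demo = ["...#",
            ".#.#",
            "...."]
    print(lee_route(demo, (0, 0), (2, 3)))
```

Making expansion in one direction more costly, as mentioned in the text, would amount to replacing the queue with a priority queue ordered by accumulated cost.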
249. t routable (38 unrouted nets). XACT is slower by a factor of 5 to 12 with comparable results. Higher placement efforts generally result in better routable designs. But even using the highest placement effort and allowing for the transformation of instances still results in an unroutable design. By using two ripup-and-reroute steps on that design, however, the design can be routed successfully. The mapper circuit of Figure 6.2 is placed within 5x12 cells, as opposed to 8x14 cells by Hades. In this case, the stochastic placement algorithm results in a compact placement. It is noteworthy that for XACT no clear recommendation can be given on whether to use type-based routing or not. Also, for multiple runs using the same option, the stochastic nature of the placement algorithm becomes apparent, as the resulting routing times vary by as much as 400%. The numbers listed in Table 6.3 indicate typical run times, i.e., run times a user most likely experiences. It might be possible that a route completes 4 times faster, as was the case during testing, but there is no clear recipe to achieve that. For some placements, prerouting the types results in faster routing (medium), and for others it slows down routing (high, trans). The use of the Magic routing resources, however, is recommended, as it always reduces the number of unrouted nets. It can have a negative effect on the routing time, though. None of the placements produced by XACT would have been usab
250. tDesc Read PROCEDURE VAR i IntDesc Write END PROCEDURE InitDescriptor VAR i InterfaceDesc name ARRAY OF CHAR map XCBoard MapRegister col INTEGER PROCEDURE SetMap VAR i InterfaceDesc is map register value of sub d subset of of PROCEDURE SubMap VAR sub of InterfaceDesc BOOLEAN

in the loading of a new hardware module. This and other information about an RC application should be stored in an application descriptor to answer questions such as:
• Where on the RC is the application located, and how much area does it occupy?
• Does the application use IOBs to interface with the host?
• Does the application use the on-board SRAM?
• Are interrupts used?
All of this information is present in a Trianus data structure. It has to be distilled into an application descriptor which can be stored together with the configuration bitstream to a file or into a program module.

5.9 Support Modules and Genericity in Oberon
5.9.1 Data Structure to Store Temporary Data
The Trianus framework implements a data structure that contains enough information to represent an FPGA design. If certain algorithms require additional information associated with the nodes and wires occurring in the data structure, these algorithms must manage this additional information themselves. Both the placer and the router are tools that require such additional data structures. We developed a module which defines a data structure an
251. ta, H. J. Kim, C. Jones, S. Lansing, B. Mangione-Smith. Configurable Computing Solutions for Automatic Target Recognition. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1996.
Virtual Computer Corp. http://www.vcc.com, 1997.
J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, P. Boucard. Programmable Active Memories: Reconfigurable Systems Come of Age. IEEE Trans. on VLSI Systems, Vol. 4, No. 1, March 1996.
M. Wazlowski, L. Agarwal, T. Lee, A. Smith, E. Lam, P. Athanas, H. Silverman, S. Ghosh. PRISM-II Compiler and Architecture. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1993.
N. Wirth and J. Gutknecht. Project Oberon: The Design of an Operating System and Compiler. Addison-Wesley, 1992.
N. Wirth. Digital Circuit Design: An Introductory Textbook. Springer, 1995.
N. Wirth. Compiler Construction. Addison-Wesley, 1996.
N. Wirth. The Language Lola, FPGAs, and PLDs in Teaching Digital Circuit Design. Proc. 2nd Intl. Andrei Ershov Memorial Conference, LNCS 1181, Springer, 1996.
M. J. Wirthlin, B. L. Hutchings. A Dynamic Instruction Set Computer. Proc. IEEE Symposium on FPGAs for Custom Computing Machines, IEEE Computer Society Press, 1995.
R. Woods. Answer to question asked during presentation of [WCG96], 1996.
R. Woods, Queen's University of Belfast, Northern Ireland. Personal Communication, 1996.
R. Woods, A. Cassidy, J. Gray. VLSI Arc
252. tations http www eee bham ac uk James RoxbyP reconfig htm james roxby birmingham reconfigurable Philip James Roxby s page about reconfigurable computing containing Lola examples screen shots of Hades in action and more pointers to other resources http www vcc com VCC 6200 Page of Virtual Computer Corporation offering a PCl card featuring an XC6200 FPGA http www xilinx com xilinx 6200 product literature Page with data sheet for the XC6200 167 Act95 AMD95 ASU86 ACG95 Alg90 Alg91 A1t96 ACC95 ACC96 ABD92 AS93 Atm95 BRV89 Ber93 BT94 B1i96 BM77 Bibliography Actel ACT Family Field Programmable Gate Array Data Book 1995 Advanced Micro Devices Programmable Logic Data Book 1995 A V Aho R Sethi J D Ullman Compilers Principles Techniques and Tools Addison Wesley 1986 M Alexander J Cohoon J Ganley G Robins Performance Oriented Place ment and Routing for Field Programmable Gate Arrays Proc European Design Automation Conference IEEE Computer Society Press 1995 Algotronix CAL 1024 Data Sheet 1990 Algotronix Configurable Array Logic User Manual 1991 Altera Data Book 1996 R Amerson R J Carter W B Culbertson P Kuekes G Snider Teramac Configurable Custom Computing Proc IEEE Symposium on FPGAs for Custom Computing Machines IEEE Computer Society Press 1995 R Amerson R J Carter W B
253. ter cannot be read by logic cells and are therefore faster. The serial or 8-bit parallel programming interface can be clocked at 10 MHz. Partial reconfiguration is possible down to a single cell, and the state of all user registers can be read.

Figure 2.11: AT6000 Routing Network (express bus, local bus, repeater)

2.4.3 XC4000EX
The XC4000EX is the third version of the market leader architecture from Xilinx [Xil96]. It is a coarse-grained SRAM-based FPGA based on the architectures of the 2000 and 3000 series. Figure 2.12 shows a diagram of a logic cell. It has three function generators (F, G, H) plus two registers (XQ, YQ). F and G have four inputs; H has three inputs, two of which can be the outputs of F and G. The function generators are implemented as lookup tables, and F and G can be used for implementing distributed SRAM. In addition, the XC4000 features fast carry logic for speeding up adder and counter structures. The rout
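A function generator implemented as a lookup table can be modeled as an integer whose bits form the truth table, indexed by the input values. The sketch below is a generic illustration of that idea under stated assumptions: the helper name `make_lut`, the bit ordering (first input is the least significant index bit), and the particular functions loaded into F, G, and H are all chosen for the example and are not taken from the XC4000EX data sheet.

```python
def make_lut(truth_table, n_inputs):
    """Build an n-input function generator from a 2**n-bit truth table.

    Bit i of truth_table is the output for the input combination whose
    binary encoding is i (first input = least significant bit).
    """
    def lut(*bits):
        if len(bits) != n_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for position, bit in enumerate(bits):
            index |= (bit & 1) << position
        return (truth_table >> index) & 1
    return lut


# Two 4-input generators and one 3-input generator, mirroring F, G, and H.
F = make_lut(0x8000, 4)   # AND of four inputs (only index 15 yields 1)
G = make_lut(0x6996, 4)   # XOR (odd parity) of four inputs
H = make_lut(0xFE, 3)     # OR of three inputs


def logic_cell(f_in, g_in, h1):
    """Combine F and G through H; h1 is H's remaining direct input."""
    return H(F(*f_in), G(*g_in), h1)


if __name__ == "__main__":
    print(logic_cell((1, 1, 1, 1), (0, 0, 0, 0), 0))  # F fires
    print(logic_cell((0, 1, 1, 1), (1, 1, 0, 0), 0))  # neither F nor G fires
```

Because the table contents are just stored bits, the same structure can double as a small memory, which is the property the text refers to when it mentions distributed SRAM.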
254. ternal file format used by XACT to store designs on disk.

6.3.1 Small Pattern Matcher without Hints
The first experiment is the automatic placement and routing of the pattern matcher developed for the Hades reconfigurable coprocessor. It has two parallel pattern matchers, each consisting of four 5-bit characters. It was chosen as a typical example of a coprocessor application with a regular datapath (data and pattern registers, comparators), some random logic (mapper), and little control logic (shift step register). Most often, such a small design is used to determine the placement hints for the data path part, which are then put into the Lola code. It is therefore mandatory that the design cycle is fast, as many iterations are needed to find a satisfactory placement which is also routable. Table 6.3 summarizes the results obtained. We used various options in XACT for automatic placement and routing. A typical user may just use the default options supplied by the tools, which are: high effort, no type-based routing, use of Magic routing resources.

248 Cells and Hades XACT XACT XACT XACT Bre PYNT 31 eds af es ef Reading Design Files ef s ny 1s Mp Os I Tr Pes HHH 294s 4735 Bounding Box 36x22 18x18 13x63 13x54 13555 Ru I I r 1 Without Magic 185s 7733s TRS TT Unos I 18 af ar B PreroutedTypes 75s 178s 1123s 3595 9315
255. the Board
4.7 System Software
4.8 Discussion
5 Hades Software
5.1 Problem Statement and Motivation
5.2 Programming Methodology
5.3 Overview
5.4 Mapper
5.5 Placer and Floor Planner
5.6 Router
5.7 Bitstream Generator and Loader
5.8 Runtime System
5.9 Support Modules and Genericity in Oberon
5.10 Quantitative Issues
5.11 Experiences with Our Programming Methodology and Oberon
5.12 Discussion
6 Application and Evaluation
6.1 Applications Running in Hardware
6.2 Pattern Matching Application
6.3 Comparison to XACTstep Series 6000
6.4 Hades in the World
6.5 Possible Future Applications
6.6 Discussion
7 Related Work
7.1 Custom Computers
7.2 Reconfigurable Coprocessors
7.3 Reconfigurable Processors
7.4 High-Level Hardware Description
7.5 The Need for Better Tools
8 Summary, Conclusions, and Outloo
256. this issue further.

TYPE Example3(N); (* Selector *)
  IN a, b, c, d: BIT; q, r, s, t: [N] BIT;
  OUT p: [N] BIT;
BEGIN
  FOR i := 0 .. N-1 DO p.i := a*q.i + b*r.i + c*s.i + d*t.i END
END Example3

Figure 5.12: Selector Example

TYPE Example4(N); (* Mux *)
  IN a: [2] BIT; q, r, s, t: [N] BIT;
  OUT p: [N] BIT;
BEGIN
  FOR i := 0 .. N-1 DO p.i := MUX(a.1: MUX(a.0: q.i, r.i), MUX(a.0: s.i, t.i)) END
END Example4

Figure 5.13: Multiplexer Example

Next, we present a parallel-to-serial converter in Figure 5.15. When the layout is compared to the same example in [Pfi92], the advantage of the XC6200's ability to implement a register and a multiplexer in the same cell is apparent. The last array element is a register with an enable input, which is implemented by the mapper with a multiplexer with feedback, and the rest are simple multiplexers in front of registers. Again, the expression tree placement heuristic overrules the array placement heuristic. Finally, we show the resultin
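The behavior of the multiplexer example (Example4, Figure 5.13) can be simulated with a few lines of code. This is a sketch under two assumptions: that a two-input MUX returns its first data argument when the select signal is 0 and its second when it is 1, and that Example4 nests three such multiplexers so that the two select bits choose among q, r, s, and t. The Python names are illustrative only.

```python
def mux(sel, x0, x1):
    """Two-input multiplexer: x0 when sel is 0, x1 when sel is 1 (assumed convention)."""
    return x1 if sel else x0


def example4(a, q, r, s, t):
    """Bitwise 4-to-1 selection over N-bit vectors q, r, s, t.

    a is the pair (a0, a1) of select bits; each output bit is built from
    three 2-to-1 multiplexers, as in the nested MUX expression.
    """
    a0, a1 = a
    return [
        mux(a1, mux(a0, q[i], r[i]), mux(a0, s[i], t[i]))
        for i in range(len(q))
    ]


if __name__ == "__main__":
    q, r, s, t = [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]
    for a in ((0, 0), (1, 0), (0, 1), (1, 1)):
        print(a, example4(a, q, r, s, t))
```

With this convention, the select pair (a0, a1) = (0, 0) passes q, (1, 0) passes r, (0, 1) passes s, and (1, 1) passes t.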
257. titioning Graphs. Bell System Technical Journal, Vol. 49, February 1970.
[Keu87] K. Keutzer. DAGON: Technology Binding and Local Optimization by DAG Matching. Proc. 24th Design Automation Conference, ACM/IEEE, 1987.
[KGV83] S. Kirkpatrick, C. D. Gelatt Jr., M. P. Vecchi. Optimization by Simulated Annealing. Science, Vol. 220, May 1983.
[Kni96] G. Knittel. A PCI-compatible FPGA Coprocessor for 2D/3D Image Processing. FPGAs for Custom Computing Machines '96, IEEE Computer Society Press, 1996.
[Kri84] B. Krishnamurthy. An Improved Min-Cut Algorithm for Partitioning VLSI Networks. IEEE Trans. on Computers, Vol. C-33, No. 5, May 1984.
[Lat96] Lattice. Lattice Data Book, 1996.
[Lee61] C. Y. Lee. An Algorithm for Path Connections and its Applications. IRE Trans. Electronic Computers, Vol. EC-10, September 1961.
[LS88] B. Liskov, L. Shrira. Promises: Linguistic Support for Efficient Asynchronous Procedure Calls in Distributed Systems. SIGPLAN '88 Conference on Programming Language Design and Implementation, ACM SIGPLAN Notices 23(7), 1988.
[Log91] Logical Devices. CUPL PLD/FPGA Language Compiler, 1991.
[Lud94] S. Ludwig. Conventions for Programming. Internal Memo, Institute for Computer Systems, ETH Zürich, 1994. Also available in the Oberon System 3 distribution.
[Lud96] S. Ludwig. The Design of a
258. tors of the Trianus data structure onto the XC6200. Table 3.1 listed the operators available in Lola, and Figure 2.4 showed the structure of the logic cell of the XC6200 FPGA. Figure 5.4 shows the possible cell configurations without the optional register, as represented in the XC editor. Inputs to the cells come from left, right, and below.

Figure 5.4: XC6200 Cell Configurations without Registers

For every variable or output signal node in an instance, the expression tree is traversed depth-first, and the nodes encountered are marked as mapped. Each operator node requiring manipulation is changed in place, that is, the data structure is changed on the fly. For instance, if an AND node has to be transformed into an OR node, the fct field of the node is changed accordingly. This saves space and time, as no additional nodes have to be allocated. Implementation note: On some occasions (input variable and constant node duplication), a new node has to be allocated, and in those cases we appreciated the existence of reference parameters in the Oberon language very much. It made the mapping procedure more elegant and
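Changing an operator node in place can be illustrated with a small expression tree. The sketch below rewrites an AND node into an OR node by overwriting its fct field and toggling negation flags according to De Morgan's law, so the node keeps its identity and no new node is allocated. De Morgan's law is used here only as a convenient example of such a rewrite; the negation flags (`neg_x`, `neg_y`, `neg_out`), the `Node` class, and the helper names are hypothetical and do not reproduce the Trianus data structure or the mapping table of the thesis.

```python
from dataclasses import dataclass
from typing import Optional

AND, OR, VAR = "AND", "OR", "VAR"


@dataclass
class Node:
    fct: str                      # operator kind, overwritten in place by the mapper
    x: Optional["Node"] = None    # left subtree
    y: Optional["Node"] = None    # right subtree
    neg_x: bool = False           # hypothetical flag: left input is inverted
    neg_y: bool = False           # hypothetical flag: right input is inverted
    neg_out: bool = False         # hypothetical flag: output is inverted
    name: str = ""                # variable name for VAR leaves


def and_to_or_in_place(node):
    """Rewrite a AND b as NOT(NOT a OR NOT b) without allocating a node."""
    assert node.fct == AND
    node.fct = OR
    node.neg_x = not node.neg_x
    node.neg_y = not node.neg_y
    node.neg_out = not node.neg_out


def evaluate(node, env):
    """Evaluate the tree for the boolean variable assignment env."""
    if node.fct == VAR:
        value = env[node.name]
    else:
        a = evaluate(node.x, env) != node.neg_x   # != acts as XOR on booleans
        b = evaluate(node.y, env) != node.neg_y
        value = (a and b) if node.fct == AND else (a or b)
    return value != node.neg_out


if __name__ == "__main__":
    gate = Node(AND, Node(VAR, name="a"), Node(VAR, name="b"))
    and_to_or_in_place(gate)
    print(gate.fct, gate.neg_x, gate.neg_y, gate.neg_out)
```

Every reference to the node elsewhere in the tree stays valid after the rewrite, which is the point of changing the field rather than building a replacement node.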
259. traints may lead to changes in the design. Therefore, a feedback path, as shown in Figure 1.3, exists from several design phases to several other phases, and often several iterations are necessary to reach a satisfactory result. Note in Figure 1.3 that the netlists between different stages may not be in the same format, as they may be produced by software from different vendors. Also, most often the output from one phase is written to a file and read in again in the next phase, leading to inefficiencies. These and other aspects of hardware synthesis software are discussed in Chapter 3.

Figure 1.3: Hardware Synthesis Flow (Compiler, Netlist, Technology Mapper, Mapped Netlist, Layout Editor, P&R'ed Netlist, Download & Runtime System, FPGA; feedback via Corrections and Adjustments)

1.5 Contributions
This thesis deals with the problem of speeding up computationally intensive tasks by means of specialized hardware. The goal is the development of a hardware description system which runs on machines with moderate performance, allows for the construction of reconfigurable coprocessor applications, and supports a very fast design cycle. The resulting system, called Hades (HArdware DEscription System), makes use of Field Programmable Gate Array technology to implement a reconfigurable coprocessor. It features fast interactive physical design t
260. urrence of one or more patterns by implementing multiple comparator circuits. A 3x3 pattern matcher is shown in Figure 6.1. It has 3 patterns, each of length 3. The data flow is very regular and little control logic is needed. The text is simply streamed by the comparator circuits, which detect a match. Grey boxes in Figure 6.1 represent loadable registers, and the boxes with an equal sign represent comparator circuits. The box with the plus sign represents an OR gate which detects a match calculated by any one of the comparators. The Preprocess box implements the character mapping shown in Table 6.1. In the following, we develop the Lola code for this application and use the Hades software iteratively to place and route the pattern matcher.

6.2.4 Overall Structure
Each 8-bit data character is first loaded into a register, then mapped to a 5-bit character using the mapper circuit shown in Program 6.2, and finally loaded into a 5-bit data register. This data register is compared to the corresponding pattern register, which is loaded beforehand. In our implementation, four data characters at a time are loaded into a 32-bit wide input register. A matching step consisting of 4 cycles is started. It loads a mapped character into the 5-bit data register and shifts the other characters by one position. After 4 such shifts, another 4 characters are loaded and shifted. The result is read back, and a match in the previous 8 characters can be detected. Program
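The streaming structure described above (map each character, shift it into a chain of data registers, compare against preloaded pattern registers, and OR the comparator outputs) can be mimicked in software. The sketch below is a behavioral illustration only. The character mapping of Table 6.1 is not reproduced here, so `map_char` uses a hypothetical stand-in that keeps the low 5 bits of each byte; the function names are likewise assumptions.

```python
def map_char(byte):
    """Hypothetical stand-in for the 8-bit to 5-bit character mapping."""
    return byte & 0x1F


def match_positions(text, pattern):
    """Stream text through a shift register as long as the pattern.

    Returns the text positions at which the register contents equal the
    pattern, i.e. the index of the last character of each match.
    """
    pattern_regs = [map_char(b) for b in pattern]   # loaded beforehand
    data_regs = [None] * len(pattern_regs)          # empty until filled
    hits = []
    for pos, byte in enumerate(text):
        # One cycle: shift by one position and load the new mapped character.
        data_regs = data_regs[1:] + [map_char(byte)]
        if data_regs == pattern_regs:               # comparator row
            hits.append(pos)
    return hits


def match_any(text, patterns):
    """OR over several pattern matchers running in parallel on the same text."""
    return any(match_positions(text, p) for p in patterns)


if __name__ == "__main__":
    print(match_positions(b"abracadabra", b"abra"))
    print(match_any(b"abracadabra", [b"xyz", b"cad"]))
```

Note that a lossy 5-bit mapping makes distinct bytes compare equal; with the stand-in used here, upper-case and lower-case letters collide.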
261. uts: two can come from any of the four neighboring cells, and one can come from one of four local buses. It drives two outputs which are connected to all four neighboring cells. One output can be connected with pass gates to the local buses. A logic cell contains an XOR gate, a (N)AND gate, and a register, plus two additional AND gates reading from the local bus input (Figures 2.10 and 2.11). A cell can, for example, implement a half adder, a counter element, a multiplexer, or a loadable register.

Figure 2.10: AT6000 Function Unit

For routing, the two neighbor inputs can be passed to the neighbor outputs straight through or crossed over. This means that a cell is either used for logic or for routing, but never for both. On each side of the cell there is a local bus which runs along 8 cells. A local bus can be driven by multiple cells, thus implementing a tri-state bus. A vertical and a horizontal local bus can be connected together, implementing a corner turn. At 8x8 block boundaries, so-called repeaters are used to connect local buses to each other or to express buses. The lat
262. vel programming language to produce the netlist representing the application. All three approaches are used by at least one of the presented projects in Sections 7.1 and 7.2. A fourth approach, becoming more and more popular, is compiling a high-level language directly into hardware, as is done in the projects in Section 7.3 and in the following ones.

7.4.1 PRISM
PRISM and PRISM-II [AS93, WAL93] from Brown University, Providence, Rhode Island, are large compiler systems that analyze C source code to find code sequences suitable to be synthesized into hardware. The partitioning unit is a C function. The compiler produces VHDL code. The PRISM-II hardware consists of three Xilinx XC4010 and an AMD 29050 RISC CPU. The advantage of PRISM is that normal C programs, not written specifically for a hardware implementation, can profit from hardware acceleration. The disadvantage is that this speedup comes at a high cost in terms of long compilation times. Reported figures for compilation of small programs are in the minute range, not including synthesis of VHDL and place and route using commercial tools.

7.4.2 Transmogrifier C
Transmogrifier is a custom computer from the University of Toronto, Ontario. To program applications, a C compiler was developed which maps C statements and expressions onto hardware [Gal95]. Only a subset of the C language is supported. In this project, C is used as a hardware description language. There are speci
263. were left out for performance reasons [Kea96], but that the vendor's design guidelines now recommend the insertion of these buffer cells as well, causing additional delay.

5.7.2 Loader
Since the configuration bits for a design are stored in an array of bytes, the array can be written to a file or downloaded directly to the FPGA. When writing to a file, we use a simple algorithm to compress the bitstream to 1/4 of its original size on average [BJL92]. The configuration bits are downloaded to the XC6200 FPGA using 32-bit transfers where possible. A clear separation between the low-level interface to the hardware and the hardware-independent parts of the loader makes it easily retargetable.

5.7.3 Discussion
While not difficult in principle, the development of the bitstream generator was tricky due to the presence of inversions on the routing multiplexers and inside the logic cell. It is these features of hardware that contribute to software bloat. As of now, the bitstream generator and loader lack support for partial reconfiguration of the hardware and for making use of the wildcard registers of the XC6200 (cf. Section 2.3.4). These features can be useful for the implementation of an operating system for the XC6200 like the one described in [Bre96]. The unit of reconfiguration would be a placed and routed instance, where input and output would occur only through padless IO, i.e., using buried inputs and outputs (cf. Section 5.4).
264. where a faulty configuration in the XC6216 is driving the data bus pins, which would prevent the host from being able to reprogram the chip. The XCReset (global reset) and XCGClr (register clear) signals of the XC6216 can be asserted with writes to address space 5 and 6, respectively. The Serial and Wait control signals must be connected to power and ground [Xil96], respectively, since the XC6216 is configured in parallel mode only. There are four global signals on the XC6216: GClk, G1, G2, and GClr. The various clock signals are explained in Section 4.5.4 below. The GClr signal is driven by XCClr of the decoder.

4.5.3 Interface Timing
Figure 4.5 shows a timing diagram of the Ceres-2 bus signals (top) and, derived from them, the chip select signal XCCS for the XC6216 (bottom). In addition, the timings of a data write to the FPGA and a data read from the FPGA are shown (DOut, Din on the Ceres side correspond to DRead, DAvail on the XC6216 side). Additional timing information for the Ceres-2 can be found in [Hee88] and for the XC6216 in [Xil96].

Figure 4.5: Interface Timing (Ceres-2 timing: CLK, CPUDS, CPURW, Din, DOut; XC6216 timing: XCCS, DXCOut, DXCIn)

A read or write access to the XC6216 takes at least two clock
265.
  wire: Wire;            (* list of wires to all destinations of this node *)
  u, v, w, h: INTEGER;   (* physical location and dimension *)
  to: SHORTINT;          (* additional physical information *)
  id: INTEGER            (* general purpose field *)
END;

Object
Variables of a signal type are represented by an Object type, shown in Program 3.6, which is derived from Node.

Program 3.6: Definition of Object
Object = POINTER TO ObjectDesc;
ObjectDesc = RECORD (NodeDesc)
  (* fct: BIT, TS, OC *)
  (* x: expression tree defining this signal *)
  (* y: always NIL *)
  name: Name;            (* name of signal, instance, type *)
  type: Type;            (* type of signal, instance *)
  mode: SHORTINT;        (* IN, INOUT, OUT, VAR *)
  next: Object           (* next in declaration sequence *)
END;

The name of an object is the same as the one in the Lola declaration; mode indicates the type of the signal (input, tri-state, open collector, output, variable); type points to the BIT, TS, or OC type structure; and fct (from Node) is either BIT, TS, or OC. Examples are x, s, and h in type AddElem of the Add example. Elements of a signal array are also represented as objects. The only way to determine that an object is indeed an array element is through its name. Individual elements of a declared array variable x: [3] BIT are represented as objects with names x.0, x.1, x.2.

Instance
Instances of composite and generic types are represented by the Instance type, shown in Program 3.7, which is derived from Object. The name of an instance is the same as the one in the Lola declaration; mode is always VAR, as instan
266. x myHx arg TriBase2 A Arg sig IF arg sig THEN feedback on upper path to XCBase A ELSE arg TriBase2 BArg sig IF arg sig THEN feedback on lower path to XCBase B END END feedback indicated in regl to PositionNode p regi rlu rl v to clock signal PlaceArg p sig x u v 0 myHx XCBase Clk myWy myHy calculate this node s own width and height similar processing for other node types END INC w myWx INC h myHx adjust w h by own size END Check p u v w h adjust u v if top of chip reached

gate in front of the register, e.g., x := REG(x * y). This case is encoded in the Trianus data structure with a special value in the to field of the Reg1 node. The placer handles this case in Program 5.11 (TriBase.Reg) using the TriBase2.AArg and BArg procedures. These traverse the expression tree starting at the current node and return the first or second argument of the node, respectively. In the case of the register, this is the first or second argument of the data input to the register. Applied to the example above, AArg yields the register itself, since it is rooted in object x, which is the first argument to the AND gate in front of the register, and BArg yields object y. Thus, since the register refers to itself, the to field of the Reg1 node is set to XCBase.A. A feedback on the second path would result in to = XCBase.B.

5.5.6 A Note on Routability
Using the tree-based approach desc
267. y Reconfigurable Architectures, Systems, and Software Research Group. BRASS Research Group Homepage. http://www.cs.berkeley.edu/projects/brass/index.html, 1996.
G. Brebner. A Virtual Hardware Operating System for the Xilinx XC6200. Proc. 6th Intl. Workshop on Field Programmable Logic and Applications, LNCS 1142, Springer, 1996.
G. Brebner and J. Gray. Use of Reconfigurability in Variable-Length Code Detection at Video Rates. Proc. 5th Intl. Workshop on Field Programmable Logic and Applications, Springer, 1995.
M. A. Breuer. Min-Cut Placement. Journal of Design Automation and Fault-Tolerant Computing, Vol. 1(4), Oct., pp. 343-362, 1977.
S. D. Brown, R. J. Francis, J. Rose, Z. G. Vranesic. Field-Programmable Gate Arrays. Kluwer Academic Publishers, 1992.
S. Brown, M. Khellah, Z. Vranesic. Minimizing FPGA Interconnect Delays. IEEE Design & Test of Computers, Vol. 13(4), 1996.
R. E. Bryant. Graph-Based Algorithms for Boolean Function Manipulation. IEEE Trans. on Computers, Vol. C-35(6), Aug., pp. 677-691, 1986.
R. E. Bryant. Symbolic Boolean Manipulation with Ordered Binary Decision Diagrams. ACM Computing Surveys, Vol. 24, pp. 293-318, 1992.
I. Buchanan, Xilinx Development Corporation, Scotland. Personal Communication, 1996.
M. Burrows, C. Jerian, B. Lampson, T. Mann. On-Line Compression in a Log-Structured File System. Digital Systems Research Center, Report No. 85, 1992.
S. Casselman, Virtual Computer Corporation, USA. Personal Co
268. y a new graph representation has to be defined. Another approach is to directly map the netlist to the available FPGA cell configurations. The mapping of netlist gates to cell configurations of the FPGA is determined by the algorithm itself: the mapping is constructed. This approach is efficient when the target cells are simple, and more complex for more complicated cells such as the CLBs of the XC4000EX family. Hence, graph-based mapping is used for coarse-grained cells, and direct mapping is used for fine-grained cells. Consequently, the mapper for the XC6200 described in Section 5.4 uses the direct approach.

Placement
Placement is the task of arranging the mapped cells onto the grid in the FPGA. For placing the full adder from the example of Section 3.2.5 onto the XC6200, the task would be to place the two XOR gates, the two AND gates, and the OR gate. How are they placed relative to each other? How is a group of full adder elements making up an N-bit wide adder arranged? How are bigger functional units like an ALU or a state machine placed with regard to each other and with regard to I/O pins connecting the functional units to the outside world? There are several approaches to tackle these problems.

Constructive placement: this is a very fast placement method. An overview of several algorithms is given in [HWA76]. Constructive placement deterministically arranges the cells according to a programmed recipe. For instance, a two-input c
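The five gates named in the placement discussion (two XOR gates, two AND gates, and one OR gate) correspond to the usual decomposition of a full adder. The sketch below models that decomposition and chains N elements into a ripple-carry adder. It assumes the standard gate-level structure; the signal names x, y, ci, h, s, and co and the helper names are chosen for the example and may differ from the Lola source in Section 3.2.5.

```python
def full_adder(x, y, ci):
    """One adder element built from two XORs, two ANDs, and one OR."""
    h = x ^ y                    # first XOR
    s = h ^ ci                   # second XOR: sum bit
    co = (x & y) | (h & ci)      # two ANDs feeding the OR: carry out
    return s, co


def ripple_add(xs, ys, ci=0):
    """Chain N adder elements; xs and ys are bit lists, least significant first."""
    sums = []
    carry = ci
    for x, y in zip(xs, ys):
        s, carry = full_adder(x, y, carry)
        sums.append(s)
    return sums, carry


def to_bits(value, n):
    """Return the n low-order bits of value, least significant first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


if __name__ == "__main__":
    sums, carry = ripple_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(sums), carry)
```

The carry chain from one element to the next is what makes the relative placement of adder elements matter: each element's OR gate output feeds the neighboring element.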
269. yout editor for the XC6200 can be used to optimize a circuit layout by hand and features fast view updates.

8.5 Conclusions

The Hades hardware implements a memory card interface to a reconfigurable coprocessor. Such an interface is essential to implement a low-latency communication path from the host CPU to the application running on the reconfigurable coprocessor.

The Hades software tools make use of the type information and operate type-based, i.e. they operate on the prototype of a circuit and propagate the produced result to all instances of that type. The advantages of this approach are as follows:

- Synthesis results are predictable, as all instances (components) of a certain type have the same placement and the same routing.
- The results are produced quickly, since the algorithms have to perform the work only once and can then propagate the information.
- The algorithms scale to larger devices. When the complexity of a design increases, it will have more levels of hierarchy but not necessarily more components per hierarchy. When FPGA devices get bigger, the runtime of type-based tools therefore increases only linearly and not quadratically.

Automatically generated interfaces make accesses to the hardware easier and safer. They relieve the software programmer from knowing intricate details of the hardware implementation. Further, the designer of the hardware application is encouraged to produce a high-level interface to the applicat
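
The type-based idea summarized above (do the work once on the prototype, then propagate it to every instance) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `place_type` and `place_instances`, the cache dictionary, and the single-row placement recipe are hypothetical and do not reflect the actual Hades data structures.

```python
def place_type(type_name, cells, cache):
    """Place the cells of one type once; later calls reuse the cached result."""
    if type_name not in cache:
        # Hypothetical recipe: put the cells in a single row, left to right.
        cache[type_name] = {cell: (x, 0) for x, cell in enumerate(cells)}
    return cache[type_name]


def place_instances(type_name, cells, origins, cache):
    """Propagate the prototype placement to each instance by adding its origin."""
    proto = place_type(type_name, cells, cache)
    return [
        {cell: (ox + x, oy + y) for cell, (x, y) in proto.items()}
        for ox, oy in origins
    ]


cache = {}
gates = ["xor1", "xor2", "and1", "and2", "or1"]
# Two instances of the same type, stacked in adjacent rows.
layouts = place_instances("FullAdder", gates, [(0, 0), (0, 1)], cache)
```

Both instances receive the same relative placement, shifted only by their origin, and the placement recipe itself runs a single time per type regardless of how many instances exist.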
270. ypes and small expressions. Large types and large expression trees result in wasteful placement, leaving many cells unused. The placement of instances does not take into account the connectivity between these instances. To improve this situation, a min-cut approach as described in [FM82] and [Kri84] could be used, which gives good results at almost linear runtime costs.

It would be advantageous to be able to place only single types and edit them manually. However, the Trianus framework currently does not support the duplication of types or the creation of objects based on a type, except by using the Lola back end to interpret the Lola code tree. Hence, it is not possible to open the layout editor on a single type and improve its placement. Our floor planner is an intermediate solution to this problem, but not a satisfactory one. To implement libraries, the duplication of types is needed anyway, so the framework will have to support it in the future.

5.6 Router

As mentioned in the introduction to this chapter, the router used in Hades is a maze-running router based on the algorithm presented in [Lee61] and used in [Pfi92] (Section 5.3). Despite being a brute-force approach known for a long time, it is the only practical algorithm that meets our requirements. These are:

- generality: the algorithm must be capable of routing all kinds of nets (two or more terminals, various shapes)
- customizability: the user must be able to in
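
For readers unfamiliar with maze-running routers, the Python sketch below shows the core of a Lee-style router: a breadth-first wave expansion from the source followed by a backtrace from the target. It is a simplified illustration, not the Hades router: it assumes a plain rectangular grid with four-neighbour moves and unit cost, handles only two-terminal nets, and the names `lee_route` and `NEIGHBOURS` and the `#` obstacle marker are chosen for this sketch.

```python
from collections import deque

NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def lee_route(grid, source, target):
    """Return a shortest list of (row, col) cells from source to target, or None.

    grid is a list of equal-length strings; '#' marks a blocked cell.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {source: 0}
    queue = deque([source])
    # Wave expansion: label every reachable cell with its distance from source.
    while queue:
        cell = queue.popleft()
        if cell == target:
            break
        r, c = cell
        for dr, dc in NEIGHBOURS:
            nxt = (r + dr, c + dc)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if inside and grid[nxt[0]][nxt[1]] != "#" and nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    if target not in dist:
        return None  # the wave never reached the target
    # Backtrace: step from the target to any neighbour whose label is one lower.
    path = [target]
    while path[-1] != source:
        r, c = path[-1]
        for dr, dc in NEIGHBOURS:
            prev = (r + dr, c + dc)
            if dist.get(prev) == dist[path[-1]] - 1:
                path.append(prev)
                break
    path.reverse()
    return path
```

The exhaustive wave expansion is what makes the approach brute force, and it is also why a route is found whenever one exists on the grid. A net with more than two terminals could be handled by routing one terminal at a time and treating all cells already on the net as sources, which this sketch does not attempt.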

