Processor performance in real

1. 147 B 5 1 PCB search ee ee 147 B 5 2 Register Store Restore 147 B 6 T800 PCB search o ee 147 B 7 THORPCBsearch 149 C Schematics 151 List of Tables 2 1 2 2 2 3 2 4 2 5 2 6 2 7 2 8 MC88100 general purpose registers 27 MC88100 floating point registers 27 MC88100 control registers 28 MC88100 internal registers 29 MC88100 Triadic register and 10 bits immediate instruction formats 29 MC88100 16 bit immediate and control register addressing instruction formats 30 MC88100 indexed addressing instruction formats 31 MC88100 Flow control triadic register and 9 bit vector table index instruc tion formats a a a 32 MC88100 16 bit displacement and 26 bit displacement instruction formats 33 80960KB REG nstructionformat 39 80960KB COBR instructionformat 40 80960 CTRL instructionformat 41 80960 MEMA MEMB instruction formats 41 Am29000 general purpose registers 46 Am29000 special purpose registers 48 Am29000 instruction formats 49 Am29000 exception vectors 52 R2000 instructionformats
2. 45
   2.3.1 Am29000 instruction set 45
   2.3.2 Am29000 data formats 45
   2.3.3 Am29000 register description 46
   2.3.4 Am29000 instruction format 49
   2.3.5 Am29000 processor states 50
   2.3.6 Am29000 pipelining 51
   2.4 MIPS R2000 processor 53
   2.4.1 R2000 instruction set 53
   2.4.2 R2000 data formats 53
   2.4.3 R2000 register description 53
   2.4.4 R2000 instruction format 54
   2.4.5 R2000 processor states 55
   2.4.6 R2000 pipeline 56
   2.5 Cypress SPARC CY7C600 57
   2.5.1 SPARC instruction set 57
   2.5.2 SPARC data formats 58
   2.5.3 SPARC registers 58
   2.5.4 SPARC instruction formats, addressing modes 60
   2.5.5 SPARC traps and exceptions 62
   2.6 INMOS T800 transputer 64
   2.6.1 T800 data formats 64
   2.6.2 T800 instruction set 64
   2.6.3 T800 instruction formats and addressing modes 64
   2.6.4 The T800 registers 65
   2.7 Saab Ericsson Space THOR 66
   2.7.1 THOR instruc
3. 135 A 55 T800 Processor initialisation operations 135 A 56 T800 Floating point Load Store operations 136 A 57 T800 Floating point general operations 136 A 58 T800 Floating point rounding operations 136 A 59 T800 Floating point error operations 136 A 60 T800 Floating point comparison operations 137 A 61 T800 Floating point conversion operations 137 A 62 T800 Floating point arithmetic operations 137 A 63 THOR Arithmetic instructions 138 A 64 THOR Move instructions 138 A 65 THOR Logical instructions 139 10 A 66 THOR Shift instructions A 67 THOR Compare instructions A 68 THOR Control instructions 11 List of Figures 1 1 2 1 B 1 B 2 B 3 B 4 B 5 B 6 C l C 2 C 3 C 4 C 5 C 6 C 7 C 8 A Risc Design Decision Graph 21 Three overlapping windows and globals 59 Process Control Block structure 141 MC88100 multiple storeseguence 150 MC88100 multiple load sequence 150 I80960KB multiple storeseguence 150 I80960KB multiple load sequence 150 MIPS R2000 multiple load store sequenc
4. 4 14 1 SPARC HSO configurationezecutionrate Summary of Results Conclusions Concluding Remarks Instruction set summaries A l A 2 A3 AA A5 A 6 A T MC88100 instruction set summary 180960 KB instruction set summary Am29000 instruction set summary R2000 instruction set summary SPARC CY7C601 instruction set summary T800 instruction set summary THORinstructionsetsummary Processor Context Switch B 1 B 2 MC88100 B 1 1 PCBsearch B 1 2 Register Store I80960KB 111 114 121 125 128 132 138 141 B 2 1 PCBsearch 143 B 2 2 Register Store 143 B 2 3 Register Restore 143 BB Am29000 145 B 3 1 PCB search 145 B 3 2 Register Store Restore 145 B 4 MIPS R2000 146 B 4 1 PCB search ee ee 146 B 4 2 Register Store Restore 146 B 5 SPARC
5. 54 SPARC Register Addressing 58 2 20 SPARC format 1 and format 2 instruction formats 60 2 21 SPARC format Sinstructionformats 61 2 22 SPARC trap vector table 63 2 23 THORinstructionformats 67 2 24 THOR registers ee 68 2 25 THOR Task Control Registers 70 2 26 THOR esrceptionnumbers 72 3 1 Number of cycles required to search the PCB list 84 3 2 Number of cycles required for storing restoring processor context 84 3 3 Total time required for a process switch estimated 85 4 1 Summary real time system configuration 106 4 2 Summary general purpose system configuration 106 A l MC88100 Integer Arithmetic Instructions 111 A 2 MC88100 Logical Instructions 112 A 3 MC88100 Flow Control Instructions 112 A 4 MC88100 Floating Point Instructions 112 A 5 MC88100 Bit Field Instructions 113 A 6 MC88100 Load Store Exchange Instructions 113 A 7 180960KB Load Store instructions 114 A 8 180960KB Integer arithmetic instructions 114 A 9 I80960KB Moveinstructions 115 A 10 I80960KB Shift rotate and logic
6. 87 3 4 6 T800 87 34 7 O aaa a 87 3 5 Conclusions 87 System Hardware Considerations 90 4 1 General notes on the designs 91 4 2 Execution Rate Estimation 91 4 3 Memory Power Consumtion 93 4 4 Instruction Miz 94 4 5 Notes on the Failure Rateestimation 94 4 6 The HDO configurations 94 4 7 T800 HDO configuration 95 4 7 1 T800 Read memory cycle external memory 96 4 7 2 T800 HDO config execution Tate 97 4 8 THOR HDOconfiguration 98 4 8 1 THOR Read memory Cyele 99 4 8 2 THOR HDO configuration execution Tate 99 4 9 SPARC HDO configuration 100 4 9 1 SPARC Read Cycle 101 4 10 4 11 4 12 4 13 4 14 4 15 4 16 4 9 2 SPARC HDO configuration ezecutionrate The HSO configurations General Notes on the HSO configurations T800 HSO configuration 4 12 1 T800 HSO configuration execution rate THOR HSO configuration 4 13 1 THOR HSO config execution rate SPARC HSO configuration
7. cess Entries in the PCB may also be used by the process itself.
• Data Space, where the process data resides.
• Code Space, where the process code resides. It may in some cases be shared by several processes.

In addition to this we must add the processor context to fully describe a process at any time. A processor's context is characterised by:
• Accessible register contents
• Internal, inaccessible register contents
• Processor internal state

During a context switch, at least the processor internal state and the internal register contents must be preserved, or the processor must be allowed to proceed until a well-defined state is reached. For example, the current instruction is allowed to complete. Furthermore, to allow restart of the interrupted program, the status register, stack and program counter must be saved. For a process switch, the entire processor context must be saved, which also includes the accessible registers.

A common method is to let the process stack pointer reside in the upper region of data space, growing downwards. Upon a process switch, the stack pointer itself is stored in the PCB of the process concerned. That is, the minimum set of operations the operating system must perform to freeze a process, while keeping the ability to restart it at any later time, is:
1. Save the entire processor context by pushing it onto the stack.
2. Store the stack pointer value in the PCB.

The process can be restarted simply by loading the
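
The two freeze steps described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the `Process` class, the `freeze` and `restart` helpers and the word-addressed list standing in for data space are all hypothetical, and the restart path (reload the stack pointer from the PCB, then pop the context) is an assumption about how the truncated sentence continues.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process: a data space plus a PCB (modelled as a dict)."""
    data_space: list
    pcb: dict = field(default_factory=dict)


def freeze(proc, registers, sp):
    """Step 1: push the whole context onto the downward-growing stack.
    Step 2: store the resulting stack pointer in the PCB."""
    for value in registers:
        sp -= 1                      # stack grows towards lower addresses
        proc.data_space[sp] = value
    proc.pcb["sp"] = sp


def restart(proc, n_registers):
    """Assumed restart path: reload SP from the PCB and pop the context
    in the reverse of the push order."""
    sp = proc.pcb["sp"]
    registers = [0] * n_registers
    for i in reversed(range(n_registers)):
        registers[i] = proc.data_space[sp]
        sp += 1
    return registers, sp


proc = Process(data_space=[0] * 16)
freeze(proc, [11, 22, 33], sp=16)    # stack starts at the top of data space
restored, sp_after = restart(proc, 3)
```

Only the stack pointer lives in the PCB; everything else needed for the restart sits on the process's own stack.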
8. load signed byte load signed byte from alternate space load signed halfword load signed halfword from alternate space load unsigned byte load unsigned byte from alternate space load unsigned halfword load unsigned halfword from alternate space load word load word from alternate space load doubleword load doubleword from alternate space load floating point register load double floating point register load floating point state register load coprocessor register load double coprocessor register load coprocessor state register atomic load store unsigned byte atomic load store unsigned byte from alternate space store byte store byte into alternate space store halfword store halfword into alternate space store word store word into alternate space store doubleword store doubleword into alternate space store floating point store double floating point store floating point state register store double floating point queue store coprocessor store double coprocessor store coprocessor state register store double coprocessor queue swap register with memory swap register with alternate space memory Table A 38 SPARC Load Store instructions 129 SAVE rs1 rs2 imm rd save callers window RESTORE rsi rs2 imm rd restore callers window RETT address return from trap BA label branch always BN label branch never BNE label branch on not equal BE label branch on equal BG label branch on greater BLE lab
9. time, i.e. different real-time systems in cooperation should be able to use this global time for different purposes. Moreover, the system should provide an accurate delay time for processes that require it. It should be noted that we are really addressing an issue that is different from a conventional real-time clock in a workstation application.

Real-time system software needs careful debugging and testing. Traditionally, processors support this through a trace instruction, i.e. by executing one machine instruction at a time and then returning control to some debugging tool or monitor. In a real-time system, which is event driven, more extensive support would be desirable to catch transient erroneous behaviour resulting from special occurrences of events.

The environments in which real-time systems mostly reside, and the tasks that they most often perform, make continuous service, or service during operation, difficult or impossible to carry out. This makes hardware debugging facilities and fault-tolerance aspects central in real-time system design. The following paragraphs summarize support related to:
• Timer facilities
• Software/hardware debugging
• Fault tolerance

3.4.1 MC88100
The processor can be forced into a serial mode by setting one bit in the status register. This significantly reduces machine throughput but is useful for debugging purposes. Apart from that, software debugging must be accomplished by the
10. loading 80 registers will be accomplished within 79 9 3 1 cycles.

B.3 Am29000
B.3.1 PCB search
PCB search exits with:
    task identification number (T_ID) in r4
    task priority (T_PRI) in r3
    ptr to highest process (task's) PCB in r5

        const   r2, PCBOPTR & 0xFFFF            ;                                         1
        consth  r2, (PCBOPTR >> 16) & 0xFFFF    ;                                         1
                                                ; load immediate into r2 done
        add     r5, r2, 0                       ; ptr to hi priority task                 1
        const   r1, 10                          ; number of PCBs to search                1
        const   r3, 0                           ; initial priority lowest                 1
        const   r4, 0                           ; initial PCB ID undefined                1
    L1: add     r7, r2, T_PRI                   ; compute address of priority in r7       10
                                                ; feedforward, no penalty for r7
        load    0, CNTL, r8, r7                 ; get priority into r8, memory access     30
                                                ; wait for r8
        cplt    r9, r3, r8                      ; compute boolean into r9                 10
        jmpf    r9, L2                          ; branch if previous greater              2
        nop                                     ; always executed                         10
        add     r3, r8, 0                       ; remember new priority                   1
        add     r7, r2, T_ID                    ; compute address of new task ID into r7  1
        load    0, CNTL, r4, r7                 ; remember task ID, memory access         1
        add     r5, r2, 0                       ; remember PCB ptr                        1
    L2: add     r7, r2, T_NEXT                  ; compute address of next PCB ptr         10
        load    0, CNTL, r2, r7                 ; get next PCB pointer, memory access     10
        sub     r1, r1, 1                       ; one more                                1
        cpeq    r9, r1, 0                       ; compute boolean into r9                 10
        jmpf    r9, L1                          ; continue until done                     20
        nop                                     ; always executed                         10

B.3.2 Register Store/Restore
The Load Multiple and Store Multiple instructions allow the entire register file to be restored or saved in
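
For readers who do not want to trace the Am29000 listing, the same search can be sketched in Python. This is an illustration only, not the thesis code: the `PCB` class and `pcb_search` function are hypothetical names, the fields mirror the `T_ID`, `T_PRI` and `T_NEXT` offsets used in the listing, and, like the assembly, the sketch assumes the list really contains at least `count` PCBs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCB:
    """Hypothetical PCB holding only the fields the search touches."""
    t_id: int
    t_pri: int
    t_next: Optional["PCB"] = None


def pcb_search(first, count):
    """Walk `count` PCBs and return (task ID, priority, PCB) of the
    highest-priority task, mirroring r4, r3 and r5 in the listing."""
    best_pri = 0          # initial priority: lowest
    best_id = 0           # initial PCB ID: undefined
    best_pcb = first      # pointer to the highest-priority task so far
    pcb = first
    for _ in range(count):
        if best_pri < pcb.t_pri:      # skip when the previous one is greater or equal
            best_pri = pcb.t_pri
            best_id = pcb.t_id
            best_pcb = pcb
        pcb = pcb.t_next              # follow the T_NEXT link
    return best_id, best_pri, best_pcb


c = PCB(t_id=3, t_pri=5)
b = PCB(t_id=2, t_pri=7, t_next=c)
a = PCB(t_id=1, t_pri=2, t_next=b)
winner_id, winner_pri, winner = pcb_search(a, 3)
```

Because the comparison is strict, the first PCB encountered wins when two tasks share the highest priority, which matches the conditional branch in the listing.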
11. possible through a streamlined design and instruction set simplicity, hence a Reduced Instruction Set Computer [MIP87]. Consider this expression for processor performance:

    $P = \frac{\text{Time}}{\text{Task}} = C \cdot T \cdot I$

where
• C = cycles/instruction
• T = time/cycle
• I = instructions/task

It is clear that P should be kept as small as possible under the given circumstances. There are at least three different ways of minimizing P:
1. Reduce the number of cycles per instruction.
2. Reduce the time per cycle.
3. Reduce the number of instructions per task.

Let us have a closer look at each of these.
1. The cycle time could be made very small through pipelining techniques, i.e. several instructions can be executed simultaneously, each one occupying a different stage of the pipeline. This will keep most of the hardware busy most of the time. The cycle time will be equivalent to the slowest stage in the pipeline. Hence pipelining is a way of reducing C.
2. T can only be kept low through the use of instructions that can be decoded and executed by non-complex, and thereby fast, subsystems; therefore keeping instructions simple will decrease T.
3. I can theoretically be made as low as 1, i.e. when there exists an instruction for each high-level program construct that a task can consist of. This is hard to achieve, but the principle is clear: complex instructions are required to minimize I.

As we can see there is no way of meeting all of the
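
The trade-off between the three factors can be made concrete with a small Python sketch. The function name `time_per_task` and all numbers below are hypothetical and chosen only for illustration; they are not measurements from the thesis.

```python
def time_per_task(cycles_per_instruction, time_per_cycle_ns, instructions_per_task):
    """Time per task in ns as the product C * T * I."""
    return cycles_per_instruction * time_per_cycle_ns * instructions_per_task


# Hypothetical "simple instruction" design: low C and T, but more instructions.
simple_ns = time_per_task(1.5, 40.0, 1000)

# Hypothetical "complex instruction" design: fewer instructions, higher C and T.
complex_ns = time_per_task(6.0, 60.0, 400)

print(f"simple:  {simple_ns:.0f} ns per task")
print(f"complex: {complex_ns:.0f} ns per task")
```

With these made-up figures the reduction in I does not compensate for the growth in C and T, which is the kind of tension the text describes; other figures could of course tip the balance the other way.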
12.
              targ                    branch if greater or equal
              targ                    branch if ordered
              targ                    branch if unordered
Table A.12: I80960KB Branch instructions

    CMPIBE    src1, src2, targ        compare integer, branch if equal
    CMPIBNE   src1, src2, targ        compare integer, branch if not equal
    CMPIBL    src1, src2, targ        compare integer, branch if less
    CMPIBLE   src1, src2, targ        compare integer, branch if less or equal
    CMPIBG    src1, src2, targ        compare integer, branch if greater
    CMPIBGE   src1, src2, targ        compare integer, branch if greater or equal
    CMPIBO    src1, src2, targ        compare integer, branch if ordered
    CMPIBNO   src1, src2, targ        compare integer, branch if unordered
    CMPOBE    src1, src2, targ        compare ordinal, branch if equal
    CMPOBNE   src1, src2, targ        compare ordinal, branch if not equal
    CMPOBL    src1, src2, targ        compare ordinal, branch if less
    CMPOBLE   src1, src2, targ        compare ordinal, branch if less or equal
    CMPOBG    src1, src2, targ        compare ordinal, branch if greater
    CMPOBGE   src1, src2, targ        compare ordinal, branch if greater or equal
    BBS       bitpos, src, targ       check bit, branch if set
    BBC       bitpos, src, targ       check bit, branch if clear
Table A.13: I80960KB Compare and branch instructions

    SETBIT    bitpos, src, dst        set bit
    CLRBIT    bitpos, src, dst        clear bit
    NOTBIT    bitpos, src, dst        not bit (bit toggle)
    CHKBIT    bitpos, src             check bit
    ALTERBIT  bitpos, src2, dst       alter bit
    SCANBIT   src, dst                scan for bit
    SPANBIT   src, dst                span over bit
    EXTRACT   bitpos, len, src, dst   extract bits
    MODIFY    mask src src ds
13. [Figure 2.1: Three overlapping windows and globals. Each window consists of ins r[24]-r[31], locals r[16]-r[23] and outs r[8]-r[15]; the outs of one window are the ins of the next window; the globals are r[0]-r[7].]

Because the processor logically provides new locals and outs after every procedure call, local register values need not be saved and restored across calls. Figure 2.1 shows how parameters may be passed to and from subroutines.

The IU's control/status registers are all 32-bit read/write registers unless specified otherwise. They include the program counters (PC and nPC), the Processor State Register (PSR), the Window Invalid Mask register (WIM), the Trap Base Register (TBR) and the multiply-step (Y) register. The PC contains the address of the instruction currently being executed, and nPC holds the address of the next instruction to be executed, assuming no trap occurs.

The 32-bit PSR contains various fields describing the state of the IU. Among these is ICC, which contains the IU's condition codes. These bits are modified by dedicated instructions and by the WRPSR (write processor state register) instruction. The EC bit determines whether or not the coprocessor is enabled. The EF bit determines whether or not the FPU is enabled. The processor interrupt level is reflected by the contents of the PIL field. The processor only accepts interrupts whose interrupt level is greater than the value in PIL. The S bit determines whe
14. 25 MHz clock rate and a 40 ns instruction cycle time. AMD claims that it can hit a peak execution rate of 25 MIPS and a sustained performance level of 17 MIPS. The Am29000 is an enhanced RISC design, meaning that key RISC concepts have been combined with conventional design to reach the highest possible performance. Among other things it features a four-stage pipeline, a 128 bytes instruction branch target cache and an on-chip memory management unit.

2.3.1 Am29000 instruction set
The Am29000 instruction set contains 112 instructions divided into 9 classes: integer arithmetic, compare, logical, shift, data movement, constant, floating point, branch and miscellaneous instructions. The processor executes all instructions in a single cycle, except for interrupt returns, load multiple and store multiple. The complete instruction set is given in Appendix A.

There are two mutually exclusive modes of program execution: the supervisor mode and the user mode. In the supervisor mode, executing programs have access to all processor resources. In the user mode, certain processor resources may not be accessed; any attempted access causes a trap.

2.3.2 Am29000 data formats
A word is defined as 32 bits of data. A half-word consists of 16 bits and a double-word consists of 64 bits. Bytes are 8 bits in length. Within a word, bits are numbered in increasing order from right to left, starting with the number 0 for the least significant bit. Within a word bytes and half w
15. 40 30 40 38 95 Area mm2 1451 220 154 220 220 154 270 220 255 1944 280 154 270 220 220

4.7.1 T800 Read memory cycle, external memory
T1      Address setup time before address valid strobe.
T2      Address hold time after address valid strobe.
T3      Time for the bus to go to tristate on a read cycle, or to present valid data on a write cycle.
T4, T5  Time for the read or write data pulse.
T6      Time for the bus to remain in tristate after the end of read, or for data to remain valid after the end of write.

For the selected device, 1 Tm = 28.5 ns.
1. The address is latched at the falling edge of T1. The address setup time is a - 8 = 20.5 ns. The 373 typically requires 5 ns, thus T1 = 1 Tm is sufficient.
2. Address hold after the falling edge of T1 is b - 9 = 19.5 ns. The 373 typically needs 6 ns, thus T2 = 1 Tm.
3. For T3, T4 and T5: CS is asserted at the end of T1 during a read cycle, and data is latched at the falling edge of T5. The buffer propagation delay is 11 ns, the T800 needs stable data 25 ns before it is latched, the memory requires 35 ns from CS, and the EDAC requires 36 ns. Hence 35 + 11 + 36 + 25 = 107 ns violates T3 = T4 = T5 = 1 Tm (85.5 ns in total), and two extra Tm's are required.
4. With T6 = 1 Tm we arrive at a total of 8 Tm, i.e. 228 ns, for an external memory cycle. Thus a memory read bus cycle is equivalent to 228/57 = 4 processor cycles.

4.7.2 T800 HDO config execution rate
The following parameters were chosen to
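
The read-cycle arithmetic above can be checked with a short Python sketch. The constants are the figures quoted in the text (Tm = 28.5 ns, a 57 ns processor cycle, the four delays in the data path); the function and variable names are hypothetical, and the two extra Tm periods are taken from the text as given rather than derived.

```python
TM_NS = 28.5                 # one Tm period for the selected device
PROCESSOR_CYCLE_NS = 57.0    # processor cycle implied by 228/57 in the text


def data_path_delay_ns(memory_ns=35, buffer_ns=11, edac_ns=36, setup_ns=25):
    """Delay from CS asserted until data is stable early enough to be latched."""
    return memory_ns + buffer_ns + edac_ns + setup_ns


def read_cycle_tm(extra_tm=2):
    """Total Tm periods: T1 + T2 + (T3 + T4 + T5 + extra wait) + T6."""
    return 1 + 1 + (3 + extra_tm) + 1


required_ns = data_path_delay_ns()             # 107 ns
available_ns = 3 * TM_NS                       # T3 + T4 + T5 at 1 Tm each
violates = required_ns > available_ns          # True, so wait periods are needed

total_tm = read_cycle_tm()
total_ns = total_tm * TM_NS
processor_cycles = total_ns / PROCESSOR_CYCLE_NS

print(f"{required_ns} ns needed, {available_ns} ns available, violates: {violates}")
print(f"{total_tm} Tm = {total_ns} ns = {processor_cycles} processor cycles")
```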
16. Computational instructions 125
    A.33 R2000 Shift instructions 126
    A.34 R2000 Jump, branch instructions 126
    A.35 R2000 Multiply, divide instructions 126
    A.36 R2000 Special, coprocessor instructions 127
    A.37 SPARC Arithmetic, Logical, Shift instructions 128
    A.38 SPARC Load/Store instructions 129
    A.39 SPARC Control Transfer instructions (continued) 130
    A.40 SPARC Control Transfer instructions 131
    A.41 SPARC Read/Write control register operations 131
    A.42 SPARC Miscellaneous instructions 131
    A.43 T800 Function codes 132
    A.44 T800 Arithmetic, Logical operations 132
    A.45 T800 Long arithmetic operations 133
    A.46 T800 General operations 133
    A.47 T800 2D block move operations 133
    A.48 T800 CRC and bit operations 133
    A.49 T800 Indexing, array operations 134
    A.50 T800 Timer handling operations 134
    A.51 T800 Input/Output operations 134
    A.52 T800 Control operations 135
    A.53 T800 Scheduling operations 135
    A.54 T800 Error handling operations
17. EXHW extract half word EXHWS INBYTE INHW MFSR MFTLB MTSR MTSRIM MTTLB rc ra rb const8 extract half word sign extended insert byte insert half word move from special register move from translation look aside buffer register move to special register move to special register immediate move to translation look aside buffer register rera rc ra rb const8 rc ra rb const8 rc spid rera spid rb spid const16 ra rb Table A 26 Am29000 Data movement instructions CONST ra const16 constant CONSTH ra const16 constant high CONSTN ra const16 constant negative Table A 27 Am29000 Constant instructions CALL ta target call subroutine CALLI ra rb JMP target JMPF ra target ra target ra rb call subroutine indirect Jump Jump false jump false and decrement jump false indirect jump indirect jump true jump true indirect JMPFDEC JMPFI JMPI tb JMPT JMPTI ra target ra rb Table A 28 Am29000 Branch instructions 123 rc ra rb floating point add double precision rc ra rb floating point division double precision rc ra rb floating point equal to double precision rc ra rb f p greater than or equal to d p rc ra rb f p greater than d p rc ra rb f p multiply d p rc ra rb f p subtract d p rc ra rb f p add single precision rc ra rb f p divide s p rc ra rb f p equal to s p rc ra rb f p greater than or equal to s p rc ra rb f p greater than s p rc ra rb f p
18. FPINT round to floating integer Table A 61 T800 Floating point conversion operations FPADD floating point add FPSUB floating point subtract FPMUL floating point multiply FPDIV floating point divide FPUABS floating point absolute FPREMFIRST floating point remainder first step FPREMSTEP floating point remainder iteration FPUSQRTFIRST floating point square root first step FPUSQRTSTEP floating point square root step FPUSQRTLAST floating point square root end FPUEXPINC32 multiply by 2 EE 32 FPUEXPDEC32 divide by 2 EE 32 FPUMULBY2 multiply by 2 FPUDIVBY2 divide by 2 Table A 62 T800 Floating point arithmetic operations 137 A T THOR instruction set summary add integer add float add immediate add unsigned divide integer divide float modulus multiply integer multiply float multiply immediatly multiply long multiply unsigned subtract subtract reversed subtract float subtract reversed float subtract unsigned subtract reversed unsigned convert to absolute value convert float to integer convert signed integer to float Table A 63 THOR Arithmetic instructions expr expr push value onto stack push immediate reg expr push register expr expr push indexed pop value from stack reg expr pop register expr expr pop indirect load indirect Table A 64 THOR Move instructions 138 logical and logical and immediate first bit changed logical not logical
19. Load/Store Count Remaining
Table 2.15: Am29000 special purpose registers

Channel Control: Contains information associated with a channel operation and retains this information if the operation does not complete successfully.

Register Bank Protect: Restricts access of User-mode programs to specified groups of registers. This facilitates register banking for multi-tasking applications and protects operating system parameters kept in the global registers from corruption by User-mode programs.

Timer Counter: Supports real-time control and other timing-related functions.

Timer Reload: Maintains synchronisation of the Timer Counter. It includes control bits for the Timer facility.

Program Counter 0: Contains the address of the instruction being decoded when an interrupt or trap is taken. The processor restarts this instruction upon interrupt return.

Program Counter 1: Contains the address of the instruction being executed when an interrupt or trap is taken. The processor restarts this instruction upon interrupt return.

Program Counter 2: Contains the address of the instruction just completed when an interrupt or trap is taken. This address is provided for information only and does not participate in an interrupt return.

MMU Configuration: Allows selection of various memory management options.

LRU Recommendation: Simplifies the reload of entries in the translation look asi
20. Processor Status Register.
5. The Current Status register is modified to indicate interrupt/trap.
6. The address of the first instruction of the interrupt or trap handler is determined.
7. The processor determines whether or not the first instruction is in instruction ROM.
8. An instruction fetch is initiated using the instruction address determined in the previous steps. At this point normal execution resumes.

3.2.4 MIPS R2000
An interrupt exception occurs as a result of a hardware signal or by execution of special instructions.
1. The R2000 branches to the general exception vector for this exception.
2. The IP field in the Cause register shows which of six external interrupts are pending, and the SW field in the Cause register shows which of two software interrupts are pending. More than one interrupt can be pending at a time.
3. The R2000 saves the Kernel/User-previous, Interrupt Enable-previous, Kernel/User-current and Interrupt Enable-current bits of the Status register in the Kernel/User-old, Interrupt Enable-old, Kernel/User-previous and Interrupt Enable-previous bits respectively, and clears the Kernel/User-current and Interrupt Enable-current bits.

3.2.5 SPARC
An interrupt is a special case of a trap condition. A trap causes the following actions:
1. It disables traps.
2. It copies the S field of the PSR into the PS field and then sets the S field to 1.
3. It decrements the CWP by 1, modulo 7.
4. It saves the PC and
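
Step 3 of the R2000 sequence above amounts to pushing a three-deep stack of Kernel/User and Interrupt Enable bit pairs. A minimal Python sketch follows; the function name is hypothetical, and the bit positions (current pair in bits 1-0, previous pair in bits 3-2, old pair in bits 5-4 of the Status register) are an assumption about the register layout that the text itself does not state.

```python
KU_IE_MASK = 0x3F   # assumed: the six low Status bits hold the old/previous/current pairs


def enter_exception(status):
    """Shift previous -> old and current -> previous, then clear the current
    Kernel/User and Interrupt Enable bits; other Status bits are untouched."""
    stack = status & KU_IE_MASK
    pushed = (stack << 2) & KU_IE_MASK   # bits shifted out of the old pair are lost
    return (status & ~KU_IE_MASK) | pushed


# Current pair set (0b11): after the exception it sits in the previous pair
# and the current pair is cleared.
status_after = enter_exception(0b000011)
print(bin(status_after))
```

The shift by two is what clears the current pair, so no separate masking step is needed for that part of the description.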
21. a single instruction. Thus loading as well as storing 192 registers will be accomplished within 4 191 cycles.

B.4 MIPS R2000
B.4.1 PCB search
PCB search exits with:
    task identification number (T_ID) in r4
    task priority (T_PRI) in r3
    ptr to highest process (task's) PCB in r5

        lui     r2, PCBOPTR >> 16           #                                      1
        ori     r2, r2, PCBOPTR & 0xFFFF    #                                      1
                                            # load immediate into r2 done
        or      r5, r0, r2                  # copy into r5                         1
        ori     r1, r0, 9                   # number of PCBs - 1 to search         1
        ori     r3, r0, 0                   # initial priority lowest              1
        ori     r4, r0, 0                   # initial PCB ID undefined             1
    L1: lb      r8, T_PRI(r2)               # priority, memory access              10
        nop                                 # delay slot                           10
        sltu    r9, r3, r8                  # compare priorities, result in r9     10
        nop                                 # delay slot                           10
        blez    r9, L2                      # branch if previous is greater        10
        nop                                 # delay slot                           10
        ori     r3, r8, 0                   # substitute new priority              1
        lb      r4, T_ID(r2)                # remember task ID, memory access      1
        ori     r5, r2, 0                   # remember PCB ptr                     1
    L2: lhu     r6, T_NEXT(r2)              # PCB pointer high, memory access      10
        lh      r7, T_NEXT+2(r2)            # PCB pointer low, memory access       10
        addi    r1, r1, -1                  #                                      10
        or      r2, r6, r7                  # move result into r2                  10
        sltu    r9, r1, r0                  # compute bool into r9                 10
        nop                                 # delay slot                           10
        blez    r9, L1                      # exit when all PCBs searched          9
        nop                                 # delayed branch                       9

B.4.2 Register Store/Restore
The pipeline stalls while data is read from memory or stored in memory (see figure B.6), since this prevents the processor from fetching the next instruction. Thus R2000 lo
22. address r 1 rS2 or r 1 r 2 scale rD rS1 r82 Scale might be 0 1 2 or 3 LDA SZ rD rs1 IMM16 load address rD r 1 r 2 rD rS1 152 LDCR rD erS load from control register ST SZ rD rs1 IMM16 store contents of rD in memory 1r51 IMM16 ST SZ USR rD rS1 rS2 store in r 1 rS2 or rS1 rS2 Scale rD rS1 152 STCR rD crD store to control register XMEM BU rD rS1 IMM16 exhange register with memory XMEM BU USR rD rS1 r52 rD rS1 152 XCR rD rS crS D exhange control register Table A 6 MC88100 Load Store Exchange Instructions 113 A 2 180960 KB instruction set summary src dst load src dst load ordinal byte src dst load ordinal short src dst load integer byte src dst load integer short src dst load long src dst load triple src dst load quad src dst load address src dst store src dst store ordinal byte src dst store ordinal short src dst store integer byte src dst store integer short src dst store long src dst store triple src dst store quad Table A 7 180960KB Load Store instructions srcl src2 dst add integer srcl src2 dst add ordinal srcl src2 dst subtract integer srcl src2 dst subtract ordinal srcl src2 dst multiply integer srcl src2 dst multiply ordinal srcl src2 dst divide integer srcl src2 dst divide ordinal srcl src2 dst add ordinal with carry srcl src2 dst subtract ordinal with carry srcl src2 dst extended multiply srcl src2 dst extended divide srcl src2 dst remainder integer s
23. allow three simultaneous register accesses.

2.2 Intel 80960KB
The 80960KB is an implementation of the 80960 32-bit architecture from Intel. This architecture has been designed to meet the needs of embedded applications such as machine control, robotics, process control, avionics and instrumentation. The architecture provides 32 registers, 28 of which are available for general use. These are divided into two types: globals and locals. There is a 512-byte instruction cache on chip and multiple sets of local registers. Execution of some instructions may be overlapped; this is accomplished by register scoreboarding.

2.2.1 80960KB instruction set
The 80960KB processor implements all the instructions in the 80960 instruction set, which includes all of the data movement, arithmetic, logical and program control instructions commonly found in computer architectures. The processor also includes a set of floating point instructions and several instructions to handle architectural extensions found in the processor. All instructions are 32 bits long and aligned on 32-bit boundaries. There are over 50 instructions that can be executed in a single clock cycle. A summary of the 80960KB instruction set is given in Appendix A.

The processor provides a mode and stack switching mechanism called the user-supervisor protection model. This protection model allows a system to be designed in which kernel code and data reside in the same address space as the u
24. and modified indirectly.

General Purpose registers r0-r31 (table 2.1) contain program data. Their usage is dedicated by software conventions, further discussed in chapter 3. All of these registers, with the exception of r0 (constant zero), have read/write access. A write operation to r0 has no effect.

Floating point operation registers fcr1-fcr7 are used to hold floating point operands and results, while the rest hold various status information from the floating point unit (table 2.2).

Control Registers
Control registers (table 2.3) contain status, execution control and exception processing information. Some of the registers have read/write access, others are read only.

Internal Registers
Internal registers (table 2.4), located in the register file, sequencer and instruction unit, control instruction execution and data availability. These registers are not explicitly accessible to the programmer.

2.1.4 MC88100 instruction formats, addressing modes
All instructions are 32 bits in length. Immediate operands and displacements are encoded in the instruction word. All other operands are located in registers, which can be moved to and from memory with load and store instructions. There are three instruction types: flow control, data memory accesses and register-to-register operations. Each type has unique addressing capabilities. Flow control instruction references are made by the instruction unit. Data memory access instructions address those sect
25. describe the T800 configuration X 2 X91 2 X22 4 X 3 8 X3 2 X4 8

The manufacturer claims that about 70% of executed instructions are encoded in a single byte [Inm89, p. 195]. From the current instruction mix we assume that 50% of the instructions are encoded in 8 bits, 30% of the instructions are encoded in 16 bits, and the rest are encoded in 32 bits. This gives U 2 and with W 3 from the previous section we have Y W U 2 Thus 1 X1 2 Z X 3 8 Z3 5 Za Xa 8 leading to 1 1 4 8 MmizedI P R ggs 5rns 7 8M mivedi Ps

For the memory activity we obtain AMA = 0.18. The total memory power requirement is 189 mW/device.

4.8 THOR HDO configuration
The THOR has an on-chip timer, so no such peripheral device is needed. Furthermore, THOR has a built-in EDAC, so no such peripheral device is needed either. The chip is not yet available. Actual figures concerning the THOR chip are obtained from simulations in the Genesil Silicon Compiler; from these simulations, assuming components satisfying military range requirements, the clock frequency will be 15 MHz.

Component list:

             Device       Qty   Power (mW)   Area (mm2)   FITS
    U1       THOR          1    1500         2450         78
    U2-U6    74ACT245      5    36           220          3
    U7       74ACT138      1    41           220          3
    U8-U10   74ACT244      3    36           220          3
    U11      OTO16         1    100          270          26
    U12      74ACT04       1    30           154          3
    U13-U14  54HCT393      2    26           220          3
    MU1-MU10 CY7C194-35   10    326 (*)      255          218
    (*) Average according to AMA.

4.8.1 THOR Read memory cycle
Assuming a need fo
26. en exception is detected, forcing it into kernel mode. It remains in kernel mode until a Restore From Exception instruction is executed.

2.4.2 R2000 data formats
The R2000 defines a 32-bit word, a 16-bit halfword and an 8-bit byte. The byte ordering is configurable (configuration occurs during hardware reset) into either big-endian or little-endian byte ordering. Bit 0 is always the least significant (rightmost) bit; thus bit designations are always little-endian. The R2000 uses byte addressing with alignment constraints for halfword and word accesses: halfword accesses must be aligned on an even byte boundary, and word accesses must be aligned on a byte boundary divisible by four. Special instructions are provided for addressing words that are not aligned on 4-byte word boundaries: Load/Store Word Left/Right (LWL, LWR, SWL, SWR). These instructions are used in pairs to provide addressing of misaligned words with one additional instruction cycle over that required for aligned words.

2.4.3 R2000 register description
The register set consists of general purpose registers as well as dedicated registers.
• The R2000 provides 32 general purpose 32-bit registers, r0-r31, each consisting of a single word. The registers are treated symmetrically with two exceptions: register r0 is hardwired to a zero value, and r31 is the link register for jump-and-link instructions.

I type J type R type bits encoding bits encoding bits encoding 31 26 OP
27. for use by the subprogram. The global register g15 is reserved for use as a Frame Pointer. Local registers r0, r1 and r2 are reserved for use as Previous Frame Pointer, Stack Pointer and Return Instruction Pointer respectively. Parameters are passed using global registers, which are accessible regardless of which local register set is currently active; thus 15 parameters could conveniently be passed to or from a subprogram. Nested calls therefore require stacking of parameters. 3.1.3 Am29000 register conventions. The Am29000 utilises a large on-chip register set which is organized as a run-time stack. When a subprogram is called, a new activation record, or stack frame, is allocated. This record includes local variables, arguments to the subprogram and a return address. A compiler targeted to the Am29000 should use two run-time stacks for activation records: one for often-used scalar data and another for structured data and additional scalar data. The scalar portion of the activation record can then be mapped into the processor's local registers because of the stack pointer addressing which applies to the local registers. Allocation and de-allocation of activation records can occur largely within the confines of the local registers. The term stack cache refers to the use of local registers to cache a portion of the activation record stack. The principle of locality of reference, which allows any cache to be effective, also applies
28. jump
rS2                 jump to subroutine
B5, rS1, D16        branch on bit clear
B5, rS1, D16        branch on bit set
M5, rS1, D16        branch on condition met
D26                 unconditional branch
B5, rS1, VEC9       trap on bit clear
B5, rS1, VEC9       trap on bit set
rS1, IMM16          trap on bounds check
rS1, rS2
M5, rS1, VEC9       conditional trap
                    return from exception
Table A.3: MC88100 Flow Control Instructions
FADD FSZ   rD, rS1, rS2    floating point add
FCMP FSZ   rD, rS1, rS2    floating point compare
FDIV FSZ   rD, rS1, rS2    floating point divide
FLDCR      rD, fcrS        load from floating point control register
FLT FSZ    rD, rS2         convert integer to floating point
FMUL FSZ   rD, rS1, rS2    floating point multiply
FSTCR      rD, fcrD        store to floating point control register
FSUB FSZ   rD, rS1, rS2    floating point subtract
FXCR       rD, rS, fcrS/D  exchange floating point control registers
INT FSZ    rD, rS2         round floating point to integer
TRNC FSZ   rD, rS2         truncate floating point
Table A.4: MC88100 Floating Point Instructions
rD, rS1, IMM10      clear bit field
rD, rS1, rS2
rD, rS1, IMM10      extract bit field
rD, rS1, rS2
rD, rS1, IMM10      extract unsigned bit field
rD, rS1, rS2
rD, rS2             find first bit clear
rD, rS2             find first bit set
rD, rS1, IMM10      make bit field
rD, rS1, rS2
rD, rS1, IMM10      rotate register (only 5 bits of IMM10 used)
rD, rS1, rS2
rD, rS1, IMM10      set bit field
rD, rS1, rS2
Table A.5: MC88100 Bit Field Instructions
LD SZ       rD, rS1, IMM16    load register rD from memory at address rS1 + IMM16
LD SZ USR   rD, rS1, rS2      load from
29. leading to Y U W 0 51 and Z_1 = 1, Z_2 = 1, Z_3 = 2, Z_4 = 4; finally 1/(ER · 40 ns) = 1/(1.75 · 40 ns) ≈ 14.3 M mixed IPS.
4.14 SPARC HSO configuration. Component list:
Device               Qty   Power (mW)   Area (mm2)   FITS
U1 CY7C601           1     3250         1998         14063
U2 CY7C602           1     2250         1600         13979
U3-U4 CY7C157        2     1250         397          11303
U5 CY7C604           1     3250         2554         14116
U6 CY7C343           1     775          311          4527
MU1-MU8 CYM1624      8     2750         442          11242
MU9-MU10 CY7C338     2     750          226          3398
MU11-MU14 74ACT245   4     95           220          490
MU15-MU17 74ACT244   3     95           220          490
4.14.1 SPARC HSO configuration execution rate. The SPARC configuration utilises a 64 kByte cache memory. Experience has shown that for a cache of this size a hit rate of 90% is probable. Denoting a 32-bit word fetched from the cache Z(C), we write ERE 2101 Z2 2 Z303 Zata 0 104 A C e Z2a C u2 Za C z3 Za C za 0 9 104. Timing analysis, carried out as in 4.9.1, shows that a cache miss will cost one wait state. An access within the cache may be done without wait states. Hence and The HSO configuration runs at 40 MHz, and from this 1/(ER · 25 ns) = 1/(1.735 · 25 ns) ≈ 23 M mixed IPS.
4.15 Summary of Results. As shown in table 4.2, the designs that were intended to show maximum performance clearly favour the SPARC. This is not very surprising: the SPARC CPU is available in a 40 MHz version and offers an architecture designed for single-cycle execution of instructions. The figures for power requirement and the required board area indicate
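The closing step of each execution-rate estimate divides one by the product of the average number of cycles per mixed instruction (ER) and the cycle time. The sketch below repeats that arithmetic for the two figures quoted in this excerpt; the function name `mixed_mips` is an assumption made for illustration, and the ER values 1.75 and 1.735 are simply taken from the text.

```python
def mixed_mips(er_cycles: float, cycle_time_ns: float) -> float:
    """Millions of mixed instructions per second for a given ER and cycle time.

    er_cycles:     average cycles per mixed instruction (ER in the text)
    cycle_time_ns: processor cycle time in nanoseconds
    """
    seconds_per_instruction = er_cycles * cycle_time_ns * 1e-9
    return 1.0 / seconds_per_instruction / 1e6


# ER = 1.75 at a 40 ns cycle, and ER = 1.735 at 25 ns (40 MHz), as quoted.
print(f"{mixed_mips(1.75, 40.0):.1f} M mixed IPS")
print(f"{mixed_mips(1.735, 25.0):.1f} M mixed IPS")
```

The same function can be applied to the other configurations once their ER and clock frequency are known.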
30. or; logical or immediate; logical exclusive or. Table A.65: THOR Logical instructions. Shift left; shift left dynamic; shift right; shift right arithmetic; shift right arithmetic dynamic; shift right dynamic; shift right dynamic long. Table A.66: THOR Shift instructions. Compare lower limit; compare; compare float; compare unsigned; compare upper limit. Table A.67: THOR Compare instructions. Call subprogram; call protected; clear flags; flush cache; enter halt mode; jump relative; jump relative on equal; jump relative on greater than or equal; jump relative on greater than; jump relative on less than or equal; jump relative on less than; jump relative on not equal; jump relative indirect; move top of stack; no operation; return; return to user mode; set flags; test signed integer; raise exception; change TCB; task accept; task accept end; task accept start; task conditional accept; task conditional entrycall; task delay; task entrycall; task entrycall end; task pointer; task schedule. Table A.68: THOR Control instructions. Appendix B: Processor Context Switch. Figure B.1 describes the Process Control Block structure. The PCB search may be accomplished by the following formal scheme. Figures within curly brackets denote the number of times each instruction is executed for a complete search. PCB search (generic): exits with task identification number T_ID in r4, task priority T PR
31. providing a method of implementing operating system calls etc. A trap may be conditional, such as TRAP on OVERFLOW, and used in conjunction with arithmetic operations. Real-time systems are event driven, i.e. an external event should affect the internal state of the system and/or require some form of attention. In a real-time system, the ability to respond to such an event within a specified time is a major requirement. Hardware support for event handling is provided by the processor's interrupt mechanism. The following paragraphs describe these mechanisms. 3.2.1 MC88100. Upon recognition of an interrupt the MC88100 acts as follows: 1. Finish current instruction (synchronize). 2. Freeze all pipelines except the data unit. 3. Allow the data unit to complete or fault. 4. Freeze all shadow registers and copy the PSR to the TPSR. 5. Set the new PSR to indicate exception processing. 6. Generate vector. 7. Prefetch vector and vector + 4. 3.2.2 I80960KB. Whenever the processor receives an interrupt signal it performs the following actions: 1. It temporarily stops work on its current task, whether it is working on a program or another interrupt procedure. 2. It reads the interrupt vector. 3. It compares the priority of the vector with the processor's current priority. 4. If the interrupt priority is higher than that of the processor, the processor continues as described below. 5. If the priority is equal to or less than that of the process
32. stack pointer from the PCB and pulling the processor context from the stack. For a complete process switch, the old process must be preserved and a new process must be selected and started. That is, at least two processor context switches and the selection contribute to the total time required. In a system with several runnable processes, the operating system must choose the one with the highest priority.
Processor     Processor cycles
MC88100       148
I80960KB      136
Am29000       133
MIPS R2000    145
SPARC         144
T800          hardware implemented
THOR          hardware implemented
Table 3.1: Number of cycles required to search the PCB list
Processor     Register file save cycles     Register file restore cycles
MC88100
I80960KB
Am29000
MIPS R2000
SPARC
T800
THOR
Table 3.2: Number of cycles required for storing/restoring processor context. Special hardware support for process switch makes these abundant.
There might, for example, be processes waiting for I/O or processes waiting for synchronization with other processes in the system. In other words: every process (PCB) has to be checked regarding the process status (runnable or not) and priority, to pick the runnable process with the highest priority. The efficiency of this activity is of major importance for a real-time system, where the overall function relies on the system's ability to respond to external events and schedule an appropriate process. As an example of process switch in small real-time systems a
33. the instruction immediately following a branch (conditional or unconditional) is always executed, is used to reduce the penalty associated with changes in program flow. However, this requires a careful strategy by the compiler. Optimising compilers could take advantage of this feature. A uniform instruction execution can only be achieved by using uniform instructions. This leads to a rather simple and reduced instruction set. Data should be accessed within a single cycle; therefore a large on-chip register file is needed at the top of the memory hierarchy. Since instruction addressing modes should be kept simple and data should be kept in registers, there are strong implications for special load/store instructions that perform data traffic, hence the commonly used name load/store architecture. A large register file will create significant overhead in the case of a context switch. Special support for such occasions is therefore needed. Optimising compilers could provide such support. Register windows are another way of reducing context switch overhead. Approximately 20 percent of the executed instructions are used about 80 percent of the time spent executing a program [Rad83], the so-called 20-80 rule. Analysing the instruction mix shows that simple instructions dominate among these 20 percent [Hen90]. We can see strong needs for careful code generation, or the increase of performance may be outbalanced by an increase of static and dynamic ins
34. the operating system software, and the instructions that specify a particular ASI value are privileged and can only be executed in supervisor mode. Arithmetical/logical/shift instructions compute a result using two source operands and place the result in a destination register. In addition to standard arithmetic, this processor includes tagged arithmetic operations to support languages such as LISP and Prolog. Control transfer instructions include jumps, calls, branches and traps. A summary of the complete instruction set is given in Appendix B.
r[24] to r[31]   ins
r[16] to r[23]   locals
r[8] to r[15]    outs
r[0] to r[7]     globals
Table 2.19: SPARC Register Addressing
2.5.2 SPARC data formats. SPARC supports nine data types. Integer data types include byte, unsigned byte, halfword, unsigned halfword, word and unsigned word. The IEEE floating point types include single, double and extended. A byte is 8 bits wide, a halfword is 16 bits, a word is 32 bits, a single is 32 bits, a double is 64 bits and an extended is 128 bits. 2.5.3 SPARC registers. The integer unit has two types of registers associated with it: working registers (r registers) and control/status registers. Working registers are used for normal operations, and control/status registers keep track of control and the state of the IU. The FPU has 32 working registers, called f registers, and two control/status registers: the Floating-point State Register (FSR) and the Floatin
35. use of general trap handling facilities. The MC88100 includes comparator circuits at the output to support fault detection. There are several possible configurations for master/checker operation and other redundant designs. 3.4.2 I80960. To support debugging systems, the i80960 provides a mechanism for monitoring processor activity by means of trace events. The processor can be configured to detect seven different trace events, including instruction execution, branch events, calls, supervisor calls, returns, prereturns and breakpoints. When the processor detects a trace event it signals a trace fault and calls a fault handler. 3.4.3 Am29000. Software debug is supported by the Trace Facility, which guarantees exactly one trap after the execution of any instruction in a program being tested. This allows a debug routine to follow the execution of instructions and to determine the state of the processor and system at the end of each instruction. The processor has a built-in Timer Facility which can be configured to cause periodic interrupts. The Timer Facility consists of 2 special purpose registers, the Timer Counter and the Timer Reload registers, which are accessible only to supervisor mode programs. The Timer Facility may be used to perform precise timing of system events. Each Am29000 output has associated logic which compares the signal on the output with the signal which the processor is providing internally to the output drive
36. 000 instruction format. All instructions for the Am29000 are 32 bits in length and are divided into four fields. These fields have several alternative definitions. In certain instructions, one or more fields are not used and are reserved for future use. The instruction format is shown in table 2.16, and the various fields are interpreted as follows. OP: this field contains an operation code defining the operation to be performed. In some instructions the least significant bit selects between two possible operands. For this reason, this bit is sometimes labelled A or M, with the following interpretations. Absolute: the A bit is used to differentiate between program counter relative (A = 0) and absolute (A = 1) instruction addresses when these addresses appear within instructions. Immediate: the M bit selects between a register operand (M = 0) and an immediate operand (M = 1) when the alternative is allowed by the instruction. RC: the RC field contains a global or local register number. I17..I10: this field contains the most significant 8 bits of a 16-bit instruction address. This is a word address and may be program counter relative or absolute, depending on the A bit of the operation code. I15..I8: this field contains the most significant 8 bits of a 16-bit instruction. VN: this field contains an 8-bit trap vector number. CE/CNTL: this field controls a load or store access. RA: the RA field contains a global or local register number. SA: the SA
37. 2 Saab Ericsson Space. Stack RISC microprocessor instruction set architecture for prototype chip. 1992.
[Sie82] Siewiorek D.P., Bell C.G., Newell A. Computer Structures: Principles and Examples. McGraw-Hill, Singapore, 1982.
[Smi83] Smith J.E., Pleszkun A.R., Katz R.H., Goodman J.R. PIPE: A high performance VLSI architecture. Proceedings of IEE International Workshop on computer systems organisation, March 1983.
[Tab87] Tabak D. RISC Architecture. John Wiley & Sons Inc., New York, 1987.
[You82] Young S.J. Real Time Languages: Design and Development. Ellis Horwood, Chichester, 1982.
Appendix A: Instruction set summaries
A.1 MC88100 instruction set summary
ADD         rD, rS1, IMM16    integer add
ADD CAR     rD, rS1, rS2
ADDU CAR    rD, rS1, IMM16    unsigned integer add
            rD, rS1, rS2
CMP         rD, rS1, IMM16    integer compare
            rD, rS1, rS2
DIV         rD, rS1, IMM16    integer divide
            rD, rS1, rS2
DIVU        rD, rS1, IMM16    integer unsigned divide
            rD, rS1, rS2
MUL         rD, rS1, IMM16    integer multiply
            rD, rS1, rS2
SUB         rD, rS1, IMM16    integer subtract
SUB CAR     rD, rS1, rS2
SUBU        rD, rS1, IMM16    integer unsigned subtract
SUBU CAR    rD, rS1, rS2
Table A.1: MC88100 Integer Arithmetic Instructions
rD, rS1, IMM16    logical and
rD, rS1, rS2      logical and
rD, rS1, IMM16    logical mask immediate
rD, rS1, IMM16    logical or
rD, rS1, rS2      logical or
rD, rS1, IMM16    logical exclusive or
rD, rS1, rS2      logical exclusive or
Table A.2: MC88100 Logical Instructions
rS2               unconditional
38. bits 31-26 OP; bits 25-21 RS; bits 25-0 TARGET; bits 20-16 RT; bits 15-0 IMMEDIATE. Table 2.18: R2000 instruction formats. The two multiply/divide registers (HI, LO) store the double-word (64-bit) result of multiply operations and the quotient and remainder of divide operations. A 32-bit program counter. Exception Handling Registers: the Cause register describes the last exception; the EPC (Exception Program Counter) contains the address where processing can resume after an exception has been serviced; the Status register contains all major status bits; the BadVAddr (Bad Virtual Address) register saves the entire bad virtual address for any addressing exception; the Context register provides information useful for a software TLB exception handler; the PRId (Processor Revision Identifier) register contains information that identifies the implementation revision level of the Processor and System Control Coprocessor. 2.4.4 R2000 instruction format. Every R2000 instruction consists of a single word (32 bits) aligned on a word boundary. There are three instruction formats, described in table 2.18. The fields are interpreted as follows: OP is a 6-bit operation code; RS is a 5-bit source register specifier; RT is a 5-bit target register (source/destination) or branch condition; IMMEDIATE is a 16-bit immediate, branch displacement or address displacement; TARGET is a 26-bit jump target address; RD is a 5 bit shift amount; FU
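The field layout listed for the R2000 (OP in bits 31-26, RS in 25-21, RT in 20-16, IMMEDIATE in 15-0, TARGET in 25-0) lends itself to a small decoding sketch using shifts and masks. This is an illustrative assumption of how such a decoder might look, not code from the thesis; `decode_fields`, `encode_immediate_form` and the sample field values are hypothetical.

```python
def decode_fields(word: int) -> dict:
    """Extract the fields named in the text from a 32-bit instruction word."""
    return {
        "OP": (word >> 26) & 0x3F,       # bits 31-26, 6-bit operation code
        "RS": (word >> 21) & 0x1F,       # bits 25-21, 5-bit source register
        "RT": (word >> 16) & 0x1F,       # bits 20-16, 5-bit target register
        "IMMEDIATE": word & 0xFFFF,      # bits 15-0, 16-bit immediate
        "TARGET": word & 0x03FFFFFF,     # bits 25-0, 26-bit jump target
    }


def encode_immediate_form(op: int, rs: int, rt: int, immediate: int) -> int:
    """Pack OP, RS, RT and IMMEDIATE into one word (inverse of the decoder)."""
    return ((op & 0x3F) << 26) | ((rs & 0x1F) << 21) | ((rt & 0x1F) << 16) | (immediate & 0xFFFF)


# Arbitrary sample values, chosen only to exercise every field.
word = encode_immediate_form(op=0x23, rs=29, rt=8, immediate=4)
print(hex(word), decode_fields(word))
```

Which of the decoded fields is meaningful depends on the format selected by OP; the sketch extracts all of them and leaves that choice to the caller.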
39. C projects. Berkeley SPUR (Symbolic Processing Using RISC) is a multiprocessor research machine for investigations in parallel processing [Hil85], [Hil86]. The SPUR processor is a general purpose RISC with support for LISP and floating point arithmetic. From 6 to 12 SPUR processors may be attached to shared memory and shared I/O devices by the SPUR bus. The University of Wisconsin PIPE (Parallel Instructions and Pipelined Execution) project was an attempt to reduce three common processor bottlenecks with a reduced architecture [Smi83]. In the PIPE, programs are decomposed into separate address and computation tasks. Two independent, identical processors perform these tasks. An access processor is responsible for all memory addressing and access operations. An execute processor performs all data processing. Reading University RIMMS (Reduced Instruction Set architecture for Multi-Microprocessor Systems) resulted from a study of CPU design for SIMD and MIMD multiprocessor systems [Mil83]. The research group saw that the performance gains through concurrency have the potential of being much more significant than performance gains through increased device speeds. The Ben Gurion University MODHEL RISC system [Tab87] was intended as an investigation tool in the study of RISC computing systems. The MODHEL system can be used in experiments with benchmark programs in studies aimed at finding an optimal instruction set. Hewlett Packard has developed a fami
40. E 0 SRC1. Table 2.10: 80960KB REG instruction format. Process and Trace Controls: the processor's process controls are a set of 32 bits that control or show the current execution state of the processor. The trace controls are a set of 32 bits that control the tracing facilities of the processor. 2.2.4 80960KB instruction formats. All of the 80960KB instructions are one word long and begin on word boundaries. One group of instructions allows a second word, which contains a 32-bit displacement. There are four basic instruction formats: REG, COBR, CTRL and MEM. Each instruction has only one format, which is defined by the opcode field of the instruction. REG format: the REG format (Table 2.10) is for operations that are performed on data contained in the global, local or floating point registers. The opcode is 12 bits long and is split between bits 7 through 10 and bits 24 through 31. The SRC1 and SRC2 operand fields specify source operands for the instruction. The operands can be either literals or registers. The mode bits (M1 for SRC1, M2 for SRC2) and the instruction type (floating point or non-floating point) determine whether an operand is a register or a literal. For non-floating point instructions, if a mode bit is set to 0, the respective SRC1 or SRC2 field specifies a global or local register. If the mode bit is set to 1, the field specifies an ordinal literal (5 bits) in the range of 0 to 31. For floating point instructions, if the mod
41. EY and WORK PACKAGE 4: EVALUATION OF PROCESSOR CONFIGURATIONS, PART 1: HARDWARE DESIGNS. Keywords: Hard Real-Time Systems, RISC architectures.
Contents
1 The Background Of RISC
1.1 Computer Architecture
1.2 Trends in computer architectures
1.3 Considerations that lead to the RISC
1.4 A RISC design decision graph
1.5 Early RISCs
1.6 A brief overview of some RISC projects
2 Description Of RISC Architectures
2.1 Motorola MC88100
2.1.1 MC88100 instruction set
2.1.2 MC88100 data formats
2.1.3 MC88100 registers
2.1.4 MC88100 instruction formats, addressing modes
2.1.5 MC88100 processor states
2.1.6 MC88100 pipelining
2.2 Intel 80960KB
2.2.1 80960KB instruction set
2.2.2 80960KB data formats
2.2.3 80960KB registers
2.2.4 80960KB instruction formats
2.2.5 80960KB addressing modes
2.2.6 80960KB processor states
2.3 AMD Am29000
42. However, the fundamental problem remains, since even very large register files may be exhausted. A stack architecture, such as T800 or THOR, provides a natural convention: stacking of all parameters. This is simple and straightforward, and there are no difficulties with nested calls. Furthermore, with THOR, since the 32 bytes close to the top of stack are present in on-chip registers, it is possible to take advantage of the speed of register passing without having to bother with save and restore in the case of nested calls. Am29000, finally, provides a solution similar to SPARC. The large number of registers and the use of a run-time stack made up of registers could be thought of as register windows where the calling and the called program share a set of registers. All of the studied processors treat interrupts in a similar manner. The elapsed time between an interrupt and the point at which processing starts at the appropriate interrupt handler address can be regarded as the interrupt latency time, and is divided into three phases: 1. Finish current instruction (does not apply to exceptions). 2. Check interrupt priority level versus current processor level, i.e. whether the interrupt should be serviced or not. 3. Save enough processor status to be able to continue processing after the interrupt has been serviced. Finishing the current instruction causes no significant delay, provided that no possible instruction from the instruction set may last
43. I in r3, ptr to highest process (task's) PCB in r5.
        move    PCBOPTR, r2       address of first PCB in r2         {1}
        move    r2, r5            ptr to hi priority task            {1}
        move    10, r1            number of PCBs to search           {1}
        move    0, r3             initial priority (lowest)          {1}
        move    0, r4             initial PCB ID (undefined)         {1}
L1:     cmp     (r2)T_PRI, r3     check PCB priority                 {10}
        jmple   L2                branch if previous is greater      {10}
        move    (r2)T_PRI, r3     substitute new priority            {1}
        move    (r2)T_ID, r4      remember task ID                   {1}
        move    r2, r5            remember PCB ptr                   {1}
L2:     move    (r2)T_NEXT, r2    get next PCB pointer               {10}
        sub     1, r1             exit                               {10}
        cmp     0, r1             when                               {10}
        jmpne   L1                all PCBs searched                  {9}
Figure B.1: Process Control Block structure
In the following paragraphs the generic code will be translated to assembly code for the respective processors. The total number of machine cycles required to perform the PCB search will be approximated. Register names are generalised to increase readability; thus the register naming conventions proposed by each manufacturer are not always used. It is assumed that r0 is a hard-wired zero register. It is further assumed that only one substitution of PCB is needed. Figures within curly brackets denote the assumed number of processor cycles with respect to possible pipeline penalties. The code is not tested and not aimed for practical use. The number of clock cycles required for storing/restoring processor context is estimated by considering a m
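The generic PCB search walks a linked list of Process Control Blocks and remembers the entry with the highest priority seen so far. The sketch below restates that loop in Python as a reading aid; the class `PCB`, the function `search_pcbs` and the sample priorities are assumptions made here, and the register names in the comments only indicate which generic register each variable is meant to mirror.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCB:
    t_id: int                        # task identification number (T_ID)
    t_pri: int                       # task priority (T_PRI), 0 is lowest
    t_next: Optional["PCB"] = None   # pointer to the next PCB (T_NEXT)


def search_pcbs(first: PCB, count: int):
    """Scan `count` PCBs starting at `first`; return (task id, priority, PCB).

    Assumes count >= 1 and that the list holds at least `count` entries.
    """
    current = first      # r2: PCB under inspection
    best_pcb = first     # r5: pointer to highest priority PCB so far
    remaining = count    # r1: number of PCBs left to search
    best_pri = 0         # r3: initial priority (lowest)
    best_id = 0          # r4: initial task ID (undefined)
    while True:
        # Replace the candidate only when this PCB has a strictly higher priority.
        if current.t_pri > best_pri:
            best_pri = current.t_pri
            best_id = current.t_id
            best_pcb = current
        current = current.t_next
        remaining -= 1
        if remaining == 0:
            break
    return best_id, best_pri, best_pcb


third = PCB(t_id=3, t_pri=7)
second = PCB(t_id=2, t_pri=7, t_next=third)
head = PCB(t_id=1, t_pri=2, t_next=second)
print(search_pcbs(head, 3)[:2])
```

Because the comparison is strict, the first PCB with the highest priority is kept when several tasks share that priority.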
44. NCT is a 6-bit function field. 2.4.5 R2000 processor states. Normal instruction execution may be preempted by an exception. When the R2000 detects an exception, the normal sequence of instruction execution is suspended and the processor is forced into Kernel mode, where it can respond to the abnormal or asynchronous event. When an exception occurs, the R2000 loads the EPC (Exception Program Counter) with an appropriate restart location where execution may resume after the exception has been serviced. The restart location in the EPC is the address of the instruction which caused the exception or, if the instruction was executing in a branch delay slot, the address of the branch instruction immediately preceding the delay slot. The R2000 aborts the current instruction, which may be an instruction causing the exception, and also aborts all those following in the instruction pipeline which have already begun execution. The R2000 then performs a direct jump into a designated exception handler routine. The following exceptions are recognised by the R2000. Reset: assertion of the R2000's reset signal causes an exception that transfers control to the special vector at address 0xBFC00000. UTLB miss (User TLB miss): a reference is made to a page that has no matching TLB entry. TLB miss: a referenced TLB entry's valid bit is not set, or there is a reference to a page that has no matching TLB entry. TLB modified: during a store operation the vali
45. Processor performance in real-time systems. Roger Johansson, Department of Computer Engineering, Chalmers University of Technology, S-412 96 Göteborg, Sweden. E-mail: roger@ce.chalmers.se. October 9, 1992. Abstract: During the last decade, RISC (Reduced Instruction Set Computer) processors, introduced mainly in workstation applications, have brought excellent performance at low costs. In real-time system design the question arises: how do RISC processors comply with the specific demands of such a system? This thesis describes seven RISC processors from an architectural point of view. Their ability to perform in a real-time system is elaborated and reported. Finally, real-time system hardware considerations are made from six different designs using three different processors. The system hardware considerations show that in a real-time system design there is not very much to gain with a modern general purpose RISC design such as SPARC. On the contrary, while the estimated performance for SPARC was just about the level of THOR, the board area became approximately 40% larger, the power consumption 70% more and the expected failure became 45% greater. This thesis is a revised version of two reports earlier published as a part of the ESTEC RISC evaluation study performed by Saab Space (contract number 8686 89 NL JG SC) during late 1990, namely WORK PACKAGE 3: SURVEY OF COMMERCIAL RISC processors, PART 2: DETAILED ARCHITECTURAL SURV
46. R LOGEPRL LOGR LOGRL MOVR MOVRL MOVRE MULR MULRL REMR REMRL ROUNDR ROUNDRL SCALER SCALERL srcl src2 dst srcl src2 dst src dst src dst srcl src2 dst srcl src2 dst src mask src dst src src srel sre2 srel sre2 srel sre2 srel sre2 src dst src dst srcl src2 dst srcl src2 dst src dst src dst src dst src dst src dst src dst srcl src2 dst srcl src2 dst src dst src dst src dst src dst srcl src2 dst src lsrce2 dst srcl src2 dst srcl src2 dst src dst src dst src dst srcl src2 dst srcl src2 dst srcl src2 dst srcl src2 dst src dst src dst srcl src2 dst srcl src2 dst add real add long real atomic add arctangent real arctangent long real atomic modify classify real classify long real compare ordered real compare ordered long real compare real compare long real cosine real cosine long real copy sign real extended copy reversed sign real extended convert long integer to real convert integer to real convert real to integer convert real to integer long convert truncated real to integer convert truncated real to long integer divide real divide long real exponent real exponent long real log binary real log binary long real log epsilon real log epsilon long real log real log long real move real move long real move extended real multiply real multiply long real remainder real remainder long real round real round long real scale real scale long real Tabl
47. SEQ      vn, ra, rb|const8    assert equal to
ASGE     vn, ra, rb|const8    assert greater than or equal to
ASGEU    vn, ra, rb|const8    assert greater than or equal to, unsigned
ASGT     vn, ra, rb|const8    assert greater than
ASGTU    vn, ra, rb|const8    assert greater than, unsigned
ASLE     vn, ra, rb|const8    assert less than or equal to
ASLEU    vn, ra, rb|const8    assert less than or equal to, unsigned
ASLT     vn, ra, rb|const8    assert less than
ASLTU    vn, ra, rb|const8    assert less than, unsigned
ASNEQ    vn, ra, rb|const8    assert not equal to
Table A.24: Am29000 Compare instructions
AND      rc const8     and logical
ANDN     rb|const8     and-not logical
NAND     rb|const8     nand logical
NOR      rb|const8     nor logical
OR       rb|const8     or logical
XOR      rb|const8     exclusive or logical
XNOR     rb|const8     exclusive nor logical
SLL      rb|const8     shift left logical
SRA      rb|const8     shift right arithmetic
SRL      rb|const8     shift right logical
EXTRACT  rb|const8     extract word, bit aligned
Table A.25: Am29000 Logical/shift instructions
LOAD     ce, cntl, ra, rb|const8    load
LOADL    ce, cntl, ra, rb|const8    load and lock
LOADM    ce, cntl, ra, rb|const8    load multiple
LOADSET  ce, cntl, ra, rb|const8    load and set
STORE    ce, cntl, ra, rb|const8    store
STOREL   ce, cntl, ra, rb|const8    store and lock
STOREM   ce, cntl, ra, rb|const8    store multiple
EXBYTE   rc, ra, rb|const8          extract byte
48. 10 Supervisor Mode Instruction TLB miss
11 Supervisor Mode Data TLB miss
12 Instruction TLB protection violation
13 Data TLB protection violation
14 Timer
15 Trace
16 INTR0
17 INTR1
18 INTR2
19 INTR3
20 TRAP0
21 TRAP1
22-63 Reserved or associated with FP instructions
64-255 User defined
Table 2.17: Am29000 exception vectors
2.4 MIPS R2000 processor. The R2000 is based on research work carried out at Stanford in the beginning of the eighties. In particular, a base level instruction set was proposed from the experience gained during work with optimizing compilers. The R2000 processor consists of two tightly coupled processors implemented on a single chip. The first processor is a full 32-bit RISC CPU. The second processor is a system control coprocessor (CP0), containing a TLB (Translation Lookaside Buffer) and control registers to support a virtual memory subsystem and separate caches for instruction and data. A successor, the R3000, adds a floating point processor to the R2000. Thus what is said in this chapter also applies to the R3000 microprocessor. 2.4.1 R2000 instruction set. The R2000 instruction set contains 74 instructions divided into 6 groups: load/store, computational, jump and branch, coprocessor, coprocessor 0 and special instructions. A summary is given in Appendix B. The R2000 has two operating modes: user mode and kernel mode. The R2000 normally operates in the user mode until
49. TBR instruction does not affect the tt field. In addition to this there is a Floating-Point State Register (FSR) that contains FPU mode and status information. 2.5.4 SPARC instruction formats, addressing modes. The SPARC instructions are classified into three major formats, simply called format 1, format 2 and format 3. These are summarised in tables 2.20 and 2.21. Two formats include subformats. The OP field selects the format (format 1, format 2 or format 3). 1. Format 1 is used by the CALL instruction and contains a 30-bit sign-extended word displacement (DISP30). 2. Format 2 is used by SETHI and branch instructions.
format 1: OP, DISP30
format 2 (SETHI, BRANCH): OP, A, TCOND, OP2, DISP22
Table 2.20: SPARC format 1 and format 2 instruction formats
format 3 (other integer instructions; FP/COPROC operations): OP, RD, OP3, RS1, I, OPF/OPC, SIMM13, RS2
Table 2.21: SPARC format 3 instruction formats
OP2 contains the instruction opcode for format 2. RD: for store instructions, this field selects an r register (or an r register pair) or an f register (or an f register pair) to be the source. For all other instructions, this field selects an r register (or an r register pair) or an f register (or an f register pair) to be the destination. The A bit means annul in format 2 instructions. This bit changes the behaviour
50. Their values are not preserved across procedure calls. Registers 16 through 23 are saved registers; their values must be preserved across procedure calls. Registers 24 and 25 are used for expression evaluation; their values are not preserved across procedure calls. Registers 26 and 27 are reserved for the operating system kernel. Register 28 contains the global pointer. Register 29 contains the stack pointer. Register 30 is a saved register (like 16-23). Register 31 contains the return address. Used for expression evaluation. According to software conventions, four or fewer parameters could be passed in registers. 3.1.5 SPARC register conventions. The organisation of SPARC register windows was described in paragraph 2.5.3. Figure 2.1 shows how 32 general purpose registers are divided into 4 groups. The outs (8 registers) in the active window are identical to the ins of the next window. The out register r[15] is used for saving the current address by the CALL instruction. Thus seven parameters may be passed using registers during a subprogram call. By software convention fewer parameters can be assumed, thus providing additional local registers. If the nesting depth exceeds 4, a trap occurs and the real-time kernel must take appropriate actions. 3.1.6 T800, THOR. Both T800 and THOR are stack architectures. Consequently, parameters are passed via the stack. In THOR 32 wor
51. Unit (FU) determines the location of the next processor instruction and issues the instruction to the decode stage. The instruction is fetched either from the Instruction Prefetch Buffer, the Branch Target Cache, or an external instruction memory.

During the decode stage, the Execution Unit (EU) decodes the instruction selected during the fetch stage, and fetches and/or assembles the required operands. It also evaluates addresses for branches, loads and stores.

During the execute stage, the Execution Unit (EU) performs the operation specified by the instruction. In the case of branches, loads and stores, the Memory Management Unit (MMU) performs address translation if required.

During the write back stage, the results of the operation performed during the execute stage are stored. In the case of branches, loads and stores, the physical address resulting from translation during the execute stage is transmitted to an external device or memory.

Most pipeline dependencies that are internal to the processor are handled by forwarding logic in the processor. For those dependencies that result from the external system, the Pipeline Hold mode ensures proper operation. In a few special cases, the processor pipeline is exposed to software executing on the Am29000.

vector / exception:
  Illegal Opcode
  Unaligned Address
  Out of Range
  Coprocessor Not Present
  Coprocessor Exception
  Instruction Access Violation
  Data Access Violation
  User Mode Instruction TLB miss
  User Mode Data TLB miss
52. a programming language, hardware support (i.e. dedicated registers and instructions) for implementation of Ada Task Switches, Rendezvous, Interrupts, Exceptions and Real Time Clock.

Similar to the Inmos T800, THOR performs operations on an Evaluation Stack. In addition, data can be accessed relative to the top of stack. This makes THOR an interesting synthesis of a traditional stack computer architecture and a Reduced Instruction Set Computer. The microprocessor has built in test support that allows test and debug of hardware and software. As with the T800, multiprocessor configurations are encouraged by the processor architecture.

2.7.1 THOR instruction set

The instruction set is made up of 76 different instructions. Some of these are protected when the processor is running in user mode. There is an unusual group of instructions supporting the Ada task concept, added as extensive support for the Ada programming language. A summary of all instructions is given in Appendix B.

Instructions may be executed either in privileged mode or in user mode. In privileged mode, all instructions can be executed and no memory protection checks are made, apart from ensuring that addresses are within the 2 GByte address space. In user mode, all accesses to each task's stack are protected from access by any other task using memory protect registers (see below). In user mode, some instructions are privileged and an exception will occur on an attempt
53. a formats
• Integer signed (2's complement) and unsigned data formats:
  64 bits (double word)
  32 bits (word)
  16 bits (half word)
  8 bits (byte)
  Data items are aligned so that they do not cross word boundaries, i.e. half words may have only even addresses, words may have addresses divisible by four, double words may have addresses divisible by eight, and byte data may be placed at any address. An attempt to perform a misaligned access causes an exception, if enabled.
• Signed and unsigned bit fields from 1 to 32 bits.
• IEEE 754 single precision (32 bits) floating point.
• IEEE 754 double precision (64 bits) floating point.

Bytes and half words are packed in memory according to the little endian or the big endian scheme. The byte ordering in effect is controlled by a bit in the processor status register. A signed byte or half word stored in a register is automatically sign extended: the data is placed in the least significant part, while the remaining bits are filled with the sign of the data value. In the case of an unsigned byte or half word, the most significant part of the register is filled with zeros. The least significant bit in a data item is denoted b0, the next bit b1, and so on.

2.1.3 MC88100 registers

The register set consists of general purpose registers, registers dedicated to floating point operations, and control registers. There are also some internal registers not available in any of the register models; they can only be used
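The alignment and extension rules given for the MC88100 data formats can be checked with a few lines of code. The following is a minimal sketch, not processor documentation; the function names are hypothetical and the register is modelled as a plain 32 bit integer.

```python
def is_aligned(address, size_bytes):
    """True if an item of size_bytes (1, 2, 4 or 8) may live at address."""
    return address % size_bytes == 0


def extend_to_register(raw, width_bits, signed):
    """Place a byte or half word in a 32-bit register.

    Signed items are sign extended; unsigned items are zero filled.
    The result is returned as an unsigned 32-bit pattern.
    """
    raw &= (1 << width_bits) - 1
    if signed and raw & (1 << (width_bits - 1)):
        raw -= 1 << width_bits          # negative value: fill with sign bits
    return raw & 0xFFFFFFFF
```

With these definitions, a half word at address 6 is aligned but a word at address 6 is not, and the signed byte 0x80 appears in the register as 0xFFFFFF80 while the unsigned byte 0x80 stays 0x00000080.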
54. ads or stores 3 registers within 6 cycles, which makes a total of 31 × 6/3 cycles.

B.5 SPARC

B.5.1 PCB search

PCB search exits with the task identification number (T_ID) in r4, the task priority (T_PRI) in r3, and a pointer to the highest priority task's PCB in r5. The rightmost column is the cycle count.

        sethi   PCBPTR >> 10, r2
        add     r2, PCBPTR & 0x3FF, r2  ! load immediate into r2 done
        add     r2, 0, r5               ! ptr to hi priority task            1
        add     r0, 10, r1              ! number of PCBs to search           1
        add     r0, 0, r3               ! initial priority = lowest          1
        add     r0, 0, r4               ! initial PCB ID = undefined         1
    L1: ldub    r2, T_PRI, r6           ! r6 = temp hold priority (memory access)  1
        sub     r6, r3, r7              ! compare priorities, result in r7   1
        ble,a   L2                      ! branch if previous is greater      1
        add     r0, r6, r3              ! substitute new priority            11
        ldub    r2, T_ID, r4            ! remember task ID (memory access)   1
        add     r0, r2, r5              ! remember PCB ptr                   1
    L2: ld      r2, T_NEXT, r2          ! get next PCB pointer (memory access)  1
        sub     r1, 1, r1               ! exit                               1
        bne,a   L1                      ! when all PCBs searched             1

B.5.2 Register Store/Restore

The SPARC pipeline is similar to that of the R2000, and the same pipeline stalls occur (figure B.6). Thus loading as well as storing the entire SPARC register file will use 136 × 6/3 cycles.

B.6 T800 PCB search

For the T800 there is no need for a software process scheduler, since there is hardware support for this in the processor. The T800 can run several processes concurrently. Processes may be assigned either high or low priority, and there may be any number of each. The processor has a mic
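The logic of the SPARC PCB search in B.5.1 can be restated in a high level language. The sketch below is an illustration only: the `PCB` class and its field names are hypothetical stand-ins for the T_ID, T_PRI and T_NEXT offsets, and, as in the listing, the search starts from priority 0 (lowest) and replaces the candidate only when a strictly higher priority is found.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCB:
    """Hypothetical process control block with the three fields used above."""
    t_id: int
    t_pri: int
    t_next: Optional["PCB"] = None


def pcb_search(head, count):
    """Walk count PCBs and return (task id, priority, PCB) of the best one."""
    best_pri = 0          # r3: initial priority = lowest
    best_id = 0           # r4: initial PCB ID = undefined
    best_pcb = head       # r5: pointer to highest priority task so far
    pcb = head            # r2: current PCB pointer
    for _ in range(count):
        if pcb is None:
            break
        if pcb.t_pri > best_pri:      # the ble,a skips the update otherwise
            best_pri = pcb.t_pri
            best_id = pcb.t_id
            best_pcb = pcb
        pcb = pcb.t_next
    return best_id, best_pri, best_pcb
```

A linked list of three PCBs with priorities 3, 9 and 5 yields the second PCB; its identifier and priority are returned together with the reference to it.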
55. al instructions
A.11 I80960KB Compare/conditional compare instructions
A.12 I80960KB Branch instructions
A.13 I80960KB Compare and branch instructions
A.14 I80960KB Bit/bitfield instructions
A.15 I80960KB Call/return instructions
A.16 I80960KB Conditional fault instructions
A.17 I80960KB Processor management instructions
A.18 I80960KB Synchronous load and move instructions
A.19 I80960KB Floating point instructions
A.20 I80960KB Floating point instructions (continued)
A.21 I80960KB Decimal arithmetic instructions
A.22 I80960KB Miscellaneous instructions
A.23 Am29000 Integer arithmetic instructions
A.24 Am29000 Compare instructions
A.25 Am29000 Logical/shift instructions
A.26 Am29000 Data movement instructions
A.27 Am29000 Constant instructions
A.28 Am29000 Branch instructions
A.29 Am29000 Floating point instructions
A.30 Am29000 Miscellaneous instructions
A.31 R2000 Load/Store instructions
A.32 R2000
56. alid memory accesses, i.e. limits for the memory space that the subprogram may access. Some high level languages, such as Ada, support differentiated error handling, i.e. different subprograms use different error handling routines for the same type of error, which will cause extra overhead at run time. Examples of subprogram exit code are deallocation of local variables, placing return values at the appropriate location, and possibly error checking.

In real time systems it often turns out that stack checking, memory access violation checking and differentiated error handling must be discarded in favour of denser code and faster execution. However, during the debug phase of real time system software these facilities may be of great importance.

3.1.1 MC88100 register conventions

The outline of the MC88100 general purpose registers is described in paragraph 2.1.3 (page 26). The register usage is as follows:
• Register r0 always contains zero, which is used in instructions requiring the constant zero as an operand. This is a hardware convention; the software can write to r0, but this operation has no effect.
• Register r1 contains the return pointer generated by the bsr or jsr (jump to subroutine) instructions. This is a hardware convention; both of these instructions overwrite the data in r1 when they execute. However, this register is not protected; software can read or overwrite the return pointer or any other data contained in r1.
• Regis
57. alue to compute the address of a target instruction that the processor goes to as a result of a comparison. The displacement field can range from -2^10 to 2^10 - 1. To determine the IP of the target instruction, the processor converts the displacement value to a byte displacement. It then adds the resulting byte displacement to the IP of the next instruction.

CTRL format

The CTRL format (table 2.12) is used for instructions that branch to a new IP, including the branch-if, bal and call instructions. The return instruction also uses this format. The opcode field for this format is 8 bits. The instructions that use this format have no operands. The target address for a branch is specified with the DISPLACEMENT field in the same manner as for the COBR format instructions. Here, the DISPLACEMENT field specifies a word displacement that can range from -2^21 to 2^21 - 1. For the return instruction, the DISPLACEMENT field is ignored.

Table 2.12: 80960 CTRL instruction format
  fields: OPCODE, DISPLACEMENT, 0

Table 2.13: 80960 MEMA/MEMB instruction formats
  MEMA: OPCODE, SRC/DST, ABASE, MD, 0, OFFSET
  MEMB: OPCODE, SRC/DST, ABASE, MODE, SCALE, 0, INDEX

MEM format

The MEMA or MEMB formats (table 2.13) are used for instructions that require a memory address to be computed. These instructions include the load, store and lda instructions. Also, the extended versions of the branch, branch and link, and call in s
58. ame time within the processor. When this happens, they are recognized by the processor according to a predefined priority. Exceptions that have the same priority never occur simultaneously.

2.1.6 MC88100 pipelining

There are four separate execution units, which allow the MC88100 to perform up to five different operations simultaneously:
• Access program memory
• Execute an arithmetic, logical or bit field instruction
• Access data memory
• Execute a floating point or integer divide instruction
• Execute a floating point or integer multiply instruction

The instruction unit pipeline supplies the appropriate execution unit with instructions that are to be executed by a concurrent pipeline. Data memory access instructions are dispatched to the data unit, whereas floating point, integer multiply and integer divide instructions are dispatched to the FPU. The FPU contains two pipelines handling floating point add, subtract, compare and conversions between integer and floating point, as well as integer and floating point divide instructions. All other instructions are executed by the integer unit (or the instruction unit for branches) in one machine cycle.

All execution units contain an additional level of parallelism. Instruction decode and source operand fetches from the registers are performed simultaneously. Branch instruction decode and branch target address calculation are performed in parallel with the next instruction fetch. Three internal register buses
59. and 4 outputs.

The Task Pointer (TP) points to the task information block in memory.

The Delay Register (DR) is the delay counter. It holds the delay of the task as a two's complement integer. Normally the register is decremented every microsecond. When it is decremented below zero and this task's Status Register DLY flag is set, scheduling is performed.

The Task Register (TR) holds task status information for each of the on chip tasks. TR holds the following information:
• Ready Flag (RF) is set when the task is ready to execute.
• Delay Flag (DF) is set when the task is delayed.
• Accept Wait Flag (AW) is set when this task is waiting for an accept statement.
• Entry Call Flag (EF) is set when this task is performing an entry call.
• Remote Task Flag (RT) is set when this task is doing a rendezvous with a remote task.
• Queued Entry Flag (QE) is set when queued calls exist for an entry called by this task.
• Rendezvous Field (RZ) is set to the calling task number when a rendezvous with this task starts, or defines the entry number when this task performs an entry call.
• Priority Field (PR) reflects the task's priority.
• Accept Field (AR): when an entry call is pending, the bit corresponding to the calling task is set.

Table 2.25: THOR Task Control Registers
  Result Register
  Exception Register
  Status Register
  Top of Stack
  Top Register
  Program Counter
  End of Stack
  Beginning of Stack

For each task there is a T
60. as made:
• Motorola MC88100
• Intel iAPX80960
• MIPS R2000/R3000
• Cypress SPARC

Another criterion was to select processors which are claimed by their manufacturers to facilitate real time system support and to be suitable for this range of applications. From this group of processors the following selection was made:
• Advanced Micro Devices Am29000
• Inmos T800 transputer
• Saab Ericsson Space THOR

For lack of sufficient time, another selection had to be made for the hardware considerations in chapter 4. The three processors that were selected (SPARC, T800 and THOR) were considered to provide information representative of the entire group.

This thesis is a revised version of two reports published earlier as a part of the ESTEC RISC evaluation study performed by Saab Space (contract number 8686/89/NL/JG(SC)) during late 1990, namely "WORK PACKAGE 3: SURVEY OF COMMERCIAL RISC processors, PART 2: DETAILED ARCHITECTURAL SURVEY" and "WORK PACKAGE 4: EVALUATION OF PROCESSOR CONFIGURATIONS, PART 1: HARDWARE DESIGNS".

Acknowledgements

I wish to thank my supervisor Jan Torin. He is a major contributor to this work. I also thank Jiri Gaisler, who pointed out ambiguities in the original reports, Jonas Vasell, who contributed valuable views on the first three chapters, and Mats Svenningsson for his willingness to share his great knowledge in numerou
61. ask Control Block (TCB) on the processor chip. The TCBs have identical sets of registers, as described in table 2.25.

The Result Register (RR) holds the least significant half of the result of arithmetic instructions that yield 64 bit results.

The Exception Register (ER) points to the exception information block in the stack. ER is a word pointer.

The Status Register (SR) holds condition codes, hardware exception numbers and Ada support information as follows:
• The Negative Flag (N), Zero Flag (Z), Carry Flag (C) and Unsigned Flag (U) are set according to arithmetic conditions.
• The Task Switch Inhibited Flag (TSI) is set when no task switch should occur for this task.
• The User Mode Flag (UM) is set when this task is in user mode.

The TOS register points at the word on top of the stack. The TOP register holds the word at the stack top pointed at by TOS. The 32 words next to the top of the runtime stack are cached on the processor chip.

The Program Counter (PC) holds the address of the last instruction read from memory. This address is a halfword address.

BOS and EOS define the region in memory where this task's data stack is located. The memory protection check is active in user mode. If an access using the stack addressing mode is not within BOS and EOS, or if TOS would move outside BOS or EOS, an exception is raised.

2.7.5 THOR processing states

Normal execution may be preempted by an interrupt condition, by an internal generated exc
62. before the instruction is completed. Floating point/coprocessor traps are caused by floating point/coprocessor instructions and occur before the instruction is completed. Asynchronous traps occur when an external event interrupts the processor. They are not related to any particular instruction and occur between the execution of instructions.

An instruction is defined to be trapped if any trap occurs during the course of its execution. If multiple traps occur during one instruction, the highest priority trap is taken. Lower priority traps are ignored, because the traps are arranged under the assumption that the lower priority traps persist, recur, or are meaningless due to the presence of the higher priority trap.

The ET bit in the PSR must be set for traps to occur normally. If a synchronous trap occurs while traps are disabled, the processor halts and enters an error state.

The Trap Base Register (TBR) generates the exact address of a trap handling routine. When a trap occurs, the hardware writes a value into the trap type (tt) field of the TBR. This uniquely identifies the trap and serves as an offset into the table whose starting address is given by the TBA field of the TBR. The 8 bit wide tt field allows for 256 distinct types of traps, as defined in table 2.22.

Table 2.22 (trap types):
  reset
  instruction access exception
  illegal instruction
  privileged instruction
  fp disabled
  cp disabled
  window overflow
  window underflow
  mem address not aligned
  fp excep
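The way the TBR combines the table base and the trap type into a handler address can be sketched as follows. The field positions (TBA in bits 31-12, tt in bits 11-4, bits 3-0 zero) are taken from the SPARC architecture definition rather than from this excerpt, and the function name is hypothetical.

```python
def trap_handler_address(tba, tt):
    """Compose the trap vector address from the TBA and tt fields of the TBR.

    tba: table base address; only its upper 20 bits are used.
    tt:  8-bit trap type written by the hardware when the trap occurs.
    """
    if not 0 <= tt <= 0xFF:
        raise ValueError("tt is an 8-bit field")
    # The low four bits are zero, so each trap type owns a 16-byte slot,
    # i.e. room for four 32-bit instructions.
    return (tba & 0xFFFFF000) | (tt << 4)
```

With a table based at 0x4000, trap type 5 vectors to 0x4050, and consecutive trap types are 16 bytes apart.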
63. ce constant equals constant constant store local constant store non local operate
Table A.43: T800 Function codes

logical and; logical or; logical xor; bitwise not; shift left; shift right; add; subtract; multiply; fractional multiply; div; remainder; greater than; difference; sum; product for positive/negative register A
Table A.44: T800 Arithmetic/Logical operations

long add; long sub; long sum; long diff; long multiply; long divide; long shift left; long shift right; normalise
Table A.45: T800 Long arithmetic operations

reverse; extend to word; check word; extend to double; check single; minimum integer; duplicate top of stack
Table A.46: T800 General operations

Instruction       Comments
MOVE2DINIT        initialise data for 2D block move
MOVE2DALL         2D block copy
MOVE2DNONZERO     2D block copy non zero bytes
MOVE2DZERO        2D block copy zero bytes
Table A.47: T800 2D block move operations

Instruction       Comments
CRCWORD           calculate crc on word
CRCBYTE           calculate crc on byte
BITCNT            count bits set in word
BITREVWORD        reverse bits in word
BITREVNBITS       reverse bottom n bits in word
Table A.48: T800 CRC and bit operations

BSUB              byte subscript
WSUB              word subscript
WSUBDB            word double word subscript
BCNT              byte count
WCNT              word count
LB                load byte
SB                store byte
MOVE              move message
Table A.49: T800 Indexing/array operations

LDTIMER           load timer
TIN               timer input
TALT              timer alt start
TALTWT            tim
64. d bit is set but the Dirty bit is not set.
• Bus Error: Assertion of the R2000's BERR signal due to such external events as bus timeout, backplane bus parity errors, invalid physical address or invalid access type.
• Address Error: Attempt to load, fetch or store an unaligned word, that is, a word or halfword at an address not evenly divisible by 4 or 2, respectively. Also caused by reference to a virtual address with the most significant bit set while in user mode.
• Overflow: Two's complement overflow during add or subtract.
• System Call: Execution of the syscall instruction.
• Breakpoint: Execution of the break instruction.
• Reserved Instruction: Execution of an instruction with an undefined or reserved major operation code, or a special instruction whose minor opcode is undefined.
• Coprocessor Unusable: Execution of a coprocessor instruction when the CU (Coprocessor Usable) bit is not set for the target processor.
• Interrupt: Assertion of one of the R2000's six hardware interrupt inputs, or setting of one of the two software interrupt bits in the Cause Register.

2.4.6 R2000 pipeline

The execution of a single instruction consists of five pipeline stages:
1. IF: Instruction Fetch. Access the TLB and calculate the instruction address required to read an instruction from the I cache. The instruction is not actually read into the processor until the beginning of the RD pipe stage.
2. RD: Read any required operands from CPU registe
65. d comparable at cost and analyzed to give an estimation of
• maximum possible instruction execution rate
• required number of devices
• area of printed circuit board
• power consumption
• failure rate

4.1 General notes on the designs

In the schematics (see appendix C), readability is emphasised. The diagrams are not complete but rather focus on devices with major impact on the configuration function and performance. For each design, a description of a memory read cycle is given and an analysis is carried out. Estimations are performed using worst case assumptions. The designs are optimised for the highest possible clock frequency, i.e. no attempt is made to reduce wait state penalties due to high clock frequency.

4.2 Execution Rate Estimation

The instruction mix is made up from
• $x_1$: percentage of arithmetical/logical instructions
• $x_2$: percentage of jump/branch instructions
• $x_3$: percentage of load/store instructions
• $x_4$: percentage of floating point instructions

As a consequence, $x_1 + x_2 + x_3 + x_4 = 1$ for a large number of executed instructions.

Parameters that describe the processor in effect are
• $X_1$: the number of processor cycles required to execute an arithmetical/logical instruction
• $X_2$: composed by $0.1 X_{21} + 0.9 X_{22}$, where $X_{21}$ is the number of processor cycles required for a branch not taken instruction and $X_{22}$ is the number of processor cycles required for a branch taken instruction. Hence
66. de
The Operand Effective Address is calculated relative to the top of stack (TOS), either implicitly or by adding the parameter to TOS.

Program Counter Relative addressing mode
The Operand Effective Address is calculated relative to PC by adding the parameter and PC, shifted right one bit to get word boundary alignment.

Indirect X addressing mode
The Operand Effective Address is calculated by adding the parameter and the value on the stack top appearing two instructions previously.

PC Indirect addressing mode
The Operand Effective Address is calculated by adding PC, shifted right one bit, and the value on the stack top appearing two instructions previously.

TOS Indirect addressing mode
The Operand Effective Address is calculated by adding TOS and the value on the stack top appearing two instructions previously.

Table 2.24: THOR registers
  Configuration Register
  Error Address Register
  Signal Input Register
  Signal Output Register
  Real Time Clock MSL
  Real Time Clock MSH
  Task Pointer
  Identification Register

Immediate (I)
The Operand Effective Address is the TOS and the source operand is part of the instruction.

Register (R)
The parameter designates the register to be used either as source or as destination operand.

2.7.4 THOR registers

The processor maintains on chip registers as described in table 2.24. The Configuration Register is used for hardware specific parameters and includes
67. de buffer by providing information on the least recently used entry of the TLB when a TLB miss occurs.

Table 2.16: Am29000 instruction formats
  bits 31-24: OP (bit 24: A or M)
  bits 23-16: RC, I17..I10, I15..I8, VN, or CE/CNTL
  bits 15-8:  RA or SA
  bits 7-0:   RB, RB or I, I9..I2, I7..I0, or UI/RND/FD/FS

The unprotected special purpose registers are defined as follows:

Indirect Pointer C: Allows the indirect access of a general purpose register.

Indirect Pointer B: Allows the indirect access of a general purpose register.

Indirect Pointer A: Allows the indirect access of a general purpose register.

Q: Provides additional operand bits for multiply and divide operations.

ALU Status: Contains information about the outcome of arithmetic and logical operations, and holds residual control for certain instruction operations.

Byte Pointer: Contains an index of a byte or half word within a word. This register is also accessible via the ALU status register.

Funnel Shift Count: Provides a bit offset for the extraction of word length fields from double word operands. This register is also accessible via the ALU status register.

Load/Store Count Remaining: Maintains a count of the number of loads and stores remaining for load multiple and store multiple operations. The count is initialised to the total number of loads or stores to be performed before the operation is initiated. This register is also accessible via the Channel Control Register.

2.3.4 Am29
68. diate
rt, immediate         load upper word immediate
rd, rs, rt            logical OR
rt, rs, immediate     logical OR immediate
rd, rs, rt            logical exclusive or
rt, rs, immediate     logical exclusive or immediate
rd, rs, rt            subtract
rd, rs, rt            subtract unsigned
rd, rs, rt            logical NOR
Table A.32: R2000 Computational instructions

rd, rt, amount        shift left logical
rd, rt, rs            shift left logical variable
rd, rt, amount        shift right arithmetic
rd, rt, rs            shift right arithmetic variable
rd, rt, amount        shift right logical
rd, rt, rs            shift right logical variable
Table A.33: R2000 Shift instructions

BCzF    offset            branch if false (coprocessor z condition is tested)
BCzT    offset            branch if true (coprocessor z condition is tested)
BEQ     rs, rt, offset    branch if equal
BGEZ    rs, offset        branch on greater than or equal to zero
BGEZAL  rs, offset        branch on greater than or equal to zero and link
BGTZ    rs, offset        branch on greater than zero
BLEZ    rs, offset        branch on less than or equal to zero
BLTZ    rs, offset        branch on less than zero
BLTZAL  rs, offset        branch on less than zero and link
BNE     rs, rt, offset    branch on not equal
BREAK                     breakpoint trap
J       target            unconditional jump
JAL     target            unconditional jump and link
JALR    rs                jump and link register
JALR    rd, rs            jump and link register
JR      rs                jump register
Table A.34: R2000 Jump/branch instructions

multiply unsigned multiply signed divide unsigned divide move from register LO move from register HI move to register LO move to r
69. ds from Top of Stack and downwards are reflected in registers on chip. A writeback mechanism provides for consistency with memory contents. The writeback is simultaneous with other processor activities.

3.2 Deviation from normal execution

By normal flow of instruction execution we generally mean the execution of sequential instructions in memory, and of JUMP, BRANCH and CALL instructions; in short, an easily predetermined behaviour of the computer system. A break in the normal flow of instruction execution is an event of some kind, such as:
• An interrupt, normally caused by an external device pulling a dedicated pin on the processor active. That is, a system activity.
• An exception, caused by the execution of an instruction and preventing the instruction from finishing. Examples are arithmetic faults (divide by zero, attempt to take the square root of a negative number, etc.) and violation of permissions (attempt to access supervisor memory in user mode, attempt to execute privileged instructions, etc.). An exception is also raised when a page fault occurs in a virtual memory system. An exception condition may leave the registers in a consistent state, so that the elimination of the cause and the restart of the instruction will give correct results. Such exceptions are often called faults. An exception that potentially leaves the registers and memory in an indeterminate state is often called an abort.
• A trap, caused by a special instruction and
70. e
T800 HDO configuration
THOR HDO configuration
SPARC HDO configuration
T800 and SPARC EDAC
T800, THOR and SPARC memory
T800 HSO configuration
THOR HSO configuration
SPARC HSO configuration

Introduction

As computers become smaller, faster and more reliable, the range of computer applications has grown. From the computer's initial role as an equation solver, its usage has extended into several areas, from toys to spacecraft control.

A rapidly expanding area of computer exploitation is applications that require information processing in order to carry out their prime function, rather than doing the information processing as a prime function. These types of computer applications are called real time systems. A real time system can be understood as any information processing activity or system which has to respond to externally generated input stimuli within a finite and specified period [You82]. In a hard real time system, the ability to respond within a specified time is as important as producing a correct result. That is, if the response or result arrives too late, it is of no use. The system will eventually crash or become unable to fulfill its task. A dedicated applica
71. e A.19: I80960KB Floating point instructions

SINR     src, dst          sine real
SINRL    src, dst          sine long real
SQRTR    src, dst          square root real
SQRTRL   src, dst          square root long real
SUBQ     src1, src2, dst   subtract ordinal with carry
SUBR     src1, src2, dst   subtract real
SUBRL    src1, src2, dst   subtract long real
TANR     src, dst          tangent real
TANRL    src, dst          tangent long real
Table A.20: I80960KB Floating point instructions (continued)

         src, dst          decimal move and test
         src1, src2, dst   decimal subtract with carry
         src1, src2, dst   decimal add with carry
Table A.21: I80960KB Decimal arithmetic instructions

SCANBYTE src1, src2        scan byte for equality
ROTATE   len, src, dst     rotate bits
CMPDECI  src1, src2, dst   compare and decrement integer
CMPDECO  src1, src2, dst   compare and decrement ordinal
Table A.22: I80960KB Miscellaneous instructions

Am29000 instruction set summary

Mnemonics: ADD, ADDS, ADDC, ADDCS, ADDCU, SUB, SUBC, SUBCS, SUBCU, SUBR, SUBRC, SUBRCS, SUBRCU, SUBRS, SUBRU, SUBS, SUBU, MULTIPLU, MULTIPLY, MUL, MULL, MULU, DIV, DIVIDE, DIVIDU, DIVO, DIVL, DIVREM
Operands (column flattened in extraction): rc, ra, followed by rb or const8 for most instructions; rc, ra, rb for some; rb c
72. e bit is set to 0, the respective SRC1 or SRC2 field specifies a register, just as it does for non floating point instructions. If the mode bit is set to 1, the field specifies either a floating point register or one of the two real number literals, 0.0 or 1.0.

The SRC/DST field can specify either a source operand or a destination operand, or both, depending on the instruction. The mode bit M3 and the instruction type determine how this field is used. For non floating point instructions, if M3 is clear, the SRC/DST operand is a global or local register. If M3 is set, the SRC/DST operand can be used only as a src operand, that is, an ordinal literal. For floating point instructions, the SRC/DST field is only used to encode the destination operand. If M3 is clear, the destination operand is a global or local register. If M3 is set, the destination operand is a floating point register.

Table 2.11: 80960KB COBR instruction format
  fields: OPCODE, SRC1, SRC2, M1, DISPLACEMENT, 0

COBR format

The COBR format (table 2.11) is used primarily for control and branch instructions. The opcode field is 8 bits. The SRC1 and SRC2 fields specify source operands for the instruction. The SRC1 field can specify either a global or local register, or a literal, as determined by mode bit M1. The SRC2 field can only specify a local or global register. The displacement field contains a signed two's complement number that specifies a word displacement. The processor uses this v
73. e edac delay 36 ns
• data bus buffer 11 ns

Required: from stable address to data latched, 20 + 35 + 36 + 11 = 102 ns.
Available: 3 processor cycles, 120 + 7 - 3 = 124 ns.
Therefore a bus read cycle will require 3 processor cycles, which implies 2 wait states.

4.9.2 SPARC HDO configuration execution rate

The following parameters were chosen to describe the SPARC configuration: $X_1 = 1$, $X_2 = 1$, $X_3 = 3$, $X_4 = 4$. A SPARC instruction is encoded in 32 bits, so $U = 1$. From the previous section $W = 2$ and $Y = W + U = 3$, thus $Z_1 = Y = W + U = 3$, $Z_2 = Y = W + U = 3$, $Z_3 = 5$, $Z_4 = X_4 = 4$, leading to

$\mathrm{MIPS}_{mixed} = \frac{1}{3.35 \cdot 40\,\mathrm{ns}} \approx 7.5$

The memory power down facility may not be used, since it is not possible to deassert memory chip select during interlocks, and so the total memory power requirement is 650 mW/device.

4.10 The HSO configurations

The HSO configuration is intended to estimate peak performance for a computer system with 1 MByte of memory. It consists of
• a microprocessor with 1 MByte of static random access memory

4.11 General Notes on the HSO configurations

The HSO configuration is accomplished by eliminating the EDAC circuitry and changing the memory devices from the HDO configuration. Glue logic, except for address decoding and bus buffers, is implemented using macro cells. The memory is built from eight 64k × 16 bit 25 ns static RAMs. Address decoding is performed by high speed PAL devices, eliminating any address bus skew which othe
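The read cycle arithmetic for the SPARC HDO configuration can be reproduced with a short calculation. This is a sketch under stated assumptions: the 40 ns cycle time is inferred from "3 processor cycles" corresponding to 120 ns, the net 4 ns margin is the 120 + 7 - 3 = 124 ns figure taken at face value, wait states are counted as cycles beyond the first, and the function name is hypothetical.

```python
def read_cycle(delays_ns, cycle_ns, margin_ns=0):
    """Return (required time, processor cycles, wait states) for a bus read.

    delays_ns: path delays from stable address to data latched.
    cycle_ns:  processor cycle time (assumed, not stated in the excerpt).
    margin_ns: net correction added to the available time per access.
    """
    required = sum(delays_ns)
    cycles = 1
    # Add cycles until the available time covers the required time.
    while cycles * cycle_ns + margin_ns < required:
        cycles += 1
    return required, cycles, cycles - 1


# Delays quoted for the SPARC HDO read path, in ns.
required, cycles, wait_states = read_cycle([20, 35, 36, 11], 40, margin_ns=4)
```

Two cycles would offer only 84 ns against the 102 ns required, so the search stops at three cycles, in agreement with the 2 wait states stated in the text.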
74. e the displacement plus a constant of 8 is added to the IP of the instruction.

2.2.6 80960KB processor states

The 80960KB has four different operating states: executing, interrupted, stopped and stopped interrupted. The processor is placed in one of two states, executing or stopped, at initialization. After that, the processor and software control the processor's state. The processor can switch between the executing and interrupted states, or between the stopped and stopped interrupted states. However, the processor never switches from the executing state to the stopped state unless it detects a series of fault conditions that it cannot handle.

Interrupts, IACs and Faults

The processor defines two methods of asynchronously requesting services from the processor: interrupts and IAC (InterAgent Communication) messages. Interrupts are the more common of the two.

An interrupt is a break in the control flow of a program so that the processor can handle a more urgent chore. Interrupt requests are generally sent to the processor from an external source, often to request I/O services. When the processor receives an interrupt request, it temporarily stops work on its current task and begins work on an interrupt handling procedure. Upon completion of the interrupt handling procedure, the processor generally returns to the task that was interrupted and continues work where it left off. Interrupts also have a priority, which the processor uses t
75. e value
• BYTE is an unsigned 8-bit number
• INT16 is a signed 16-bit number
• INT32 is a signed 32-bit number
• REAL32 conforms to the IEEE 754 single precision standard
• REAL64 conforms to the IEEE 754 double precision standard

2.6.2 T800 instruction set

The T800 provides a vast instruction set, with groups of instructions not found among conventional RISCs. Besides loads, stores, integer arithmetic, logical, floating-point arithmetic, control transfer and control operation instructions, there are block moves, cyclic redundancy check, timer handling and scheduling instructions, to mention a few. There are also facilities for real-time system software debugging. An instruction set summary is given in Appendix B.

2.6.3 T800 instruction formats and addressing modes

All instructions have the same format, designed to give a compact representation. Each instruction consists of a single byte divided into two 4-bit parts. The four most significant bits of the byte are the function code and the four least significant bits are a data value. This representation provides for sixteen functions, each with a data value ranging from 0 to 15. Ten of these are used to encode the most important functions. Two more function codes, prefix and negative prefix, allow the instruction to be extended in length. All instructions are executed by loading the four data bits into the least significant four bits of the operand register which is the
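The byte format and the two prefix functions described above can be sketched as a small decoder. The function code values used here (prefix = 0x2, negative prefix = 0x6, load constant = 0x4) and the shift-by-four behaviour of the prefixes are not stated in this text; they are assumptions taken from the general transputer instruction encoding, and the function name `decode` is hypothetical.

```python
PFIX = 0x2  # assumed function code for "prefix"
NFIX = 0x6  # assumed function code for "negative prefix"


def decode(code):
    """Decode transputer-style instruction bytes into (function, operand) pairs.

    Each byte carries a 4-bit function code (high nibble) and a 4-bit data
    value (low nibble). The data value is always loaded into the low four
    bits of the operand register; prefix instructions then shift the register
    up so the next byte can extend it.
    """
    operand_register = 0
    decoded = []
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        operand_register |= data
        if function == PFIX:
            operand_register <<= 4
        elif function == NFIX:
            operand_register = (~operand_register) << 4
        else:
            decoded.append((function, operand_register))
            operand_register = 0  # cleared after every non-prefix function
    return decoded
```

Under these assumptions, `decode(bytes([0x27, 0x25, 0x44]))` builds the operand 0x754 for function 4 from two prefix bytes, and `decode(bytes([0x60, 0x4F]))` yields the operand −1.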
76. edure. For each procedure that is called, the processor allocates a separate set of 16 local registers. For any one procedure within a program, 36 registers are thus available: the 16 global registers, the 4 floating-point registers and the 16 local registers. These are all maintained on the processor chip.

Global Registers

The 16 global registers are 32-bit registers. Registers g0 through g14 are general-purpose registers; g15 is reserved for the current frame pointer (FP). The FP contains the address of the first byte in the current stack frame.

Floating Point Registers

The four floating-point registers, fp0 through fp3, are 80-bit registers. These registers can be accessed only as operands of floating-point instructions. All numbers stored in these registers are stored in extended real format. The processor automatically converts floating-point values from real or long real format into extended real format when a floating-point register is used as a destination for an instruction.

Local Registers

The 16 local registers are 32-bit registers, like the global registers. The purpose of the local registers is to provide a separate set of registers, aside from the global and floating-point registers, for each active procedure. Each time a procedure is called, the processor automatically sets up a new set of local registers for that procedure and saves the local registers of the calling procedure. Local registers r0 through r2 are rese
77. egister HI Table A 35 R2000 Multiply divide instructions 126 a MFCO move from system control coprocessor MFCz move from coprocessor z MTCO move to system control coprocessor MTCz move to coprocessor RFE restore from exeption SYSCALL system call TLBP probe TLB for matching entry TLBR read indexed TLB entry TLBWI write indexed TLB entry TLBWR write random TLB entry CFCz move control from coprocessor z COPz coprocessor operation CTCz move control to coprocessor z Table A 36 R2000 Special coprocessor instructions 127 A 5 SPARC CY7C601 instruction set summary ADD ADDcc ADDX ADDXcc TADDCC TADDCCTV AND ANDcc ANDN ANDNcc SUB SUBcc SUBX SUBXcc TSUBCC TSUBCCTV MULSCC OR ORCC ORN ORNCC XOR XORCC XNOR XNORCC SLL SRL SRA SETHI rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm tbr rs1 rs2 imm tbr rs1 rs2 imm tbr rs1 rs2 imm tbr rs1 rs2 imm rd rs1 rs2 imm rd rs1 rs2 imm rd const rd integer add integer add modify icc integer add with carry integer add with carry modify icc tagged add and modify icc tagged add modify icc and trap on overflow logical and logical and m
78. el branch on less or equal
BGE label    branch on greater or equal
BL label     branch on less
BGU label    branch on greater unsigned
BLEU label   branch on less or equal unsigned
BCC label    branch on carry clear
BCS label    branch on carry set
BPOS label   branch on positive
BNEG label   branch on negative
BVC label    branch on overflow clear
BVS label    branch on overflow set
FBA label    floating point branch always
FBN label    floating point branch never
FBU label    floating point branch on unordered
FBG label    floating point branch on greater
FBUG label   floating point branch on unordered or greater
FBL label    floating point branch on less
FBUL label   floating point branch on unordered or less
FBLG label   floating point branch on less or greater
FBNE label   floating point branch on not equal
FBE label    floating point branch on equal
FBUE label   floating point branch on unordered or equal
FBGE label   floating point branch on greater or equal
FBUGE label  floating point branch on unordered or greater or equal
FBLE label   floating point branch on less or equal
FBULE label  floating point branch on unordered or less or equal
FBO label    floating point branch on ordered
CBA label    branch always on coprocessor condition
CBN label    branch never on coprocessor condition
CBx label    branch on coprocessor x condition
CBxy label   branch on coprocessor x or y condition
CBxyz label  branch on coprocessor x or y or z condition
CALL label   call subroutine
JMPL address
79. eption or by exceptions raised by software.

THOR interrupt handling

THOR's six input pins, reflected in SIR, are regarded as interrupt pins of different priority. Any one of them turning to an active state forces an interrupt condition. Upon receiving an interrupt, THOR activates a hardware scheduler. The interrupt priority, which may also be regarded as a task number, causes the scheduler to dispatch the corresponding task. This mechanism may be used to synchronise tasks running on different microprocessors in a multiprocessor environment. The entire scheme has some similarities with a conventional vectored interrupt. External events thus rapidly gain the microprocessor's attention, which ensures a minimal interrupt latency.

THOR exception handling

THOR exception handling has adopted the Ada language definition. For each fragment of code, or rather each subprogram, there exists an Exception Information Block, dynamically allocated and initialised before the subprogram is entered. This allows the same type of exception to be processed differently in different subprograms. This strategy decreases the overhead required by a software kernel. Each exception has a corresponding exception number. The first 15 numbers are defined by hardware (table 2.26), but they can also be raised by software; the remaining exception numbers are user defined.

2.8 Conclusions

Historically the major goal with developing new processor arch
80. er alt wait
ENBT     enable timer
DIST     disable timer
Table A.50: T800 Timer handling operations

IN       input message
OUT      output message
OUTWORD  output word
OUTBYTE  output byte
ALT      alt start
ALTWT    alt wait
ALTEND   alt end
ENBS     enable skip
DISS     disable skip
RESETCH  reset channel
ENBC     enable channel
DISC     disable channel
Table A.51: T800 Input/Output operations

return
load pointer to instruction
general adjust workspace
general call
loop end
Table A.52: T800 Control operations

STARTP   start process
ENDP     end process
RUNP     run process
LDPRI    load current priority
Table A.53: T800 Scheduling operations

Instruction   Comments
CSUB0         check subscript from 0
CCNT1         check count from 1
TESTERR       test error and clear
STOPERR       stop on error
SETERR        set error
CLRHALTERR    clear halt on error
SETHALTERR    set halt on error
TESTHALTERR   test halt on error
Table A.54: T800 Error handling operations

TESTPRANAL  test processor analysing
SAVEH       save high priority registers
SAVEL       save low priority registers
STHF        store high priority front pointer
STHB        store high priority back pointer
STLF        store low priority front pointer
STLB        store low priority back pointer
STTIMER     store timer
Table A.55: T800 Processor initialisation operations

Instruction   Comments
FPLDNLSN      fp load non local single
FPLDNLDB      fp load non local double
FPLDNLSNI     fp load non local indexed single
FPLDNLDBI     fp load non local indexed doub
81. es to support off-chip cache and memory devices. The hierarchy must permit fetching of instructions and operands at a rate that is high enough to prevent pipeline stalls. Optimizing compilers provide a mechanism to prevent or reduce the number of pipeline faults by reorganizing code.

From these observations we may conclude that RISC designs are intended for personal computers, workstations and embedded systems where high performance is the primary goal. In a real-time system high performance is of course desirable. However, the set of needs extends due to the specific tasks that the system should carry out. A real-time system must provide rapid process switches and fast interrupt handling so as to meet time requirements. It must be able to perform real-time synchronisation of events. High-level language support and optimizing compilers are essential, and this breaks down into several underlying characteristics, for example:

• The instruction set should be a suitable target for high-level languages used for real-time systems.
• Real-time systems require reliable memory devices, which in turn are large, power consuming and expensive. Consequently there is an implicit demand for compilers that produce dense code for the target processor.
• Subprograms are frequently used by application programmers, and the processor should provide for subprogram calls with a minimum of overhead.

This chapter will discuss essential real time system support provided by the
82. f different instructions in application programs. In these studies they found that approximately 20 percent of the available instructions were used 80 percent of the time. Also, the complexity of the control unit necessary to support rarely used instructions slows the execution of all instructions. Thus, through careful study of program characteristics, one can specify a smaller instruction set consisting only of instructions which are used most of the time and are executed quickly [Rad83].

The first major university RISC research project was at the University of California, Berkeley. David Patterson, Carlo Séquin and a group of graduate students investigated the effective use of VLSI in microprocessor design. The Berkeley RISC concept was adopted by Sun Microsystems, where the SPARC architecture was defined [Pat82].

Shortly after the Berkeley group began its work, researchers at Stanford University under the direction of John Hennessy began looking into the relationship between computers and compilers. Their research evolved into the design and implementation of optimizing compilers and reduced instruction sets. Since this research pointed to the need for single-cycle instruction sets, issues related to complex, deep pipelines were also investigated. This research resulted in a RISC processor for VLSI that is commonly referred to as the Stanford MIPS (Microprocessor without Interlocked Pipeline Stages) [Hen84].

1.6 A brief overview of some RIS
83. Figure B.4: I80960KB multiple store sequence
Figure B.5: I80960KB multiple load sequence (pipeline occupation, cycle by cycle)
Figure B.6: MIPS R2000 multiple load/store sequence (pipeline occupation, cycle by cycle)

Appendix C Schematics

Figure C.1: T800 HDO configuration
Figure C.2: THOR HDO configuration
Figure C.3: SPARC HDO configuration
Figure C.4: T800 and SPARC EDAC
Figure C.5: T800, THOR and SPARC memory
Figure C.6: T800 HSO configuration
Figure C.7: THOR HSO configuration
Figure C.8: SPARC HSO configuration
84. ffset from address 0 of the address space, ranging from $-2^{31}$ to $2^{31}-1$. At the machine level, two absolute addressing modes are provided, depending on the instruction format, i.e. MEMA or MEMB. For the MEMB format the offset is an integer, called a displacement, ranging from $-2^{31}$ to $2^{31}-1$. After evaluating an absolute address, the assembler will convert the address into an offset and select the appropriate machine-level instruction type and addressing mode.

Register Indirect

The register indirect addressing modes allow an address to be specified with an ordinal value (32 bits) in a register, or with an offset or displacement added to a value in a register. Here the value in the register is referred to as the address base.

Register Indirect with Index

The register indirect with index addressing modes allow a scaled index to be added to the value in a register. The index is specified by means of a value placed in a register. This index value is then multiplied by the scale factor. The allowable scale factors are 1, 2, 4, 8 and 16. A displacement may also be added to the address base and scaled index.

Index with Displacement

A scaled index can also be used with a displacement alone. Again, the index is contained in a register and multiplied by a scaling constant before the displacement is added to it.

IP with Displacement

The IP with displacement addressing mode is often used with load and store instructions to make them IP relative. With this mod
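The indexed modes above all reduce to one sum: address base, plus index times scale factor, plus displacement. The following is a minimal sketch of that sum, not the processor's actual address unit. The function name is hypothetical, and wrapping the result to 32 bits is an assumption made here to keep the value inside a 32-bit address space.

```python
ALLOWED_SCALES = (1, 2, 4, 8, 16)  # scale factors listed in the text


def effective_address(base=0, index=0, scale=1, displacement=0):
    """Combine address base, scaled index and displacement.

    Leaving out a component (default 0) gives the simpler modes:
    base only              -> register indirect
    base + displacement    -> register indirect with displacement
    index*scale + disp     -> index with displacement
    """
    if scale not in ALLOWED_SCALES:
        raise ValueError(f"scale factor must be one of {ALLOWED_SCALES}")
    # Assumption: addresses wrap modulo 2**32.
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Register indirect with index and displacement: 0x1000 + 3*4 + 8
example = effective_address(base=0x1000, index=3, scale=4, displacement=8)
```

A negative displacement works the same way, since the displacement is a signed value added before the 32-bit wrap.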
85. field contains a special register number.
• RB: the RB field contains a global or local register number.
• RB or I: this field contains either a global or local register number, or an 8-bit instruction constant, depending on the value of the M bit of the operation code.
• I9..I2: this field contains the least significant 8 bits of a 16-bit instruction address. This is a word address, and may be program counter relative or absolute, depending on the A bit of the operation code.
• I7..I0: this field contains the least significant 8 bits of a 16-bit instruction constant.
• UI, RND, FD, FS: this field controls the operation of the CONVERT instruction.

2.3.5 Am29000 processor states

Normal program flow may be preempted by an interrupt or trap for which the processor is enabled. The effect on the processor is identical for interrupts and traps; the distinction lies in the different mechanisms by which interrupts and traps are enabled. The intention is that interrupts be used for suspending current program execution and causing another program to execute, while traps be used to report errors and exception conditions.

An interrupt or trap is said to occur when all conditions which define the interrupt or trap are met. An interrupt or trap which occurs is not necessarily recognized by the processor, either because of various enables or because of the processor's operational mode. An interrupt is taken when the processor recognizes the interrupt and alters
86. field is ignored. If the trap is taken, execution is transferred to the bounds check exception vector, by concatenation of the VBR, the bounds check exception vector number and three trailing zeroes, forming a 30-bit instruction address.

2. Register with 9-bit vector table index is used by bit test trap instructions, where the bit in the S1 field register specified by the B5 field is tested for either a set or clear condition. It is also used by the conditional trap instructions, where the source 2 register is tested for the conditions specified in the M5 field (see below). In either case, if the test condition is true, the contents of VBR are concatenated with the VEC9 field of the instruction and three trailing zeroes. Exception processing starts at the vector specified by the resulting address.

Table 2.8: MC88100 Flow control triadic register and 9-bit vector table index instruction formats (surviving field names: OPCODE in bits 31–26, B5/M5, SUBOPCODE, S2, VEC9)

The SUBOPCODE field specifies the particular instruction. The M5 field specifies which of four possible conditions to test:
• bit 25: Reserved, must be zero
• bit 24: Maximum negative number
• bit 23: Less than zero
• bit 22: Equal to zero
• bit 21: Greater than zero
Note that multiple conditions can be specified by setting more than one bit in this field.

3. Register with 16-bit displacement/immediate is used by branch and tra
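The vector address formation described above (VBR, then VEC9, then three trailing zeroes) can be written out as a minimal sketch. It assumes that the VBR contributes its upper 20 bits, so that 20 + 9 + 3 bits make up a 32-bit address; the text does not state the VBR width, and the function name is hypothetical.

```python
def vector_address(vbr, vec9):
    """Concatenate VBR, the 9-bit vector index and three zero bits.

    Assumption: only the upper 20 bits of VBR are used as the base.
    The three trailing zeroes make each vector entry 8 bytes long.
    """
    if not 0 <= vec9 < 512:
        raise ValueError("VEC9 is a 9-bit field")
    base = vbr & 0xFFFFF000      # upper 20 bits of the vector base
    return base | (vec9 << 3)    # VEC9 followed by three zero bits


# Vector 7 with the vector table based at 0x12345000
example = vector_address(0x12345000, 7)
```

Under this assumption a table with all 512 vectors occupies 4 KBytes, which is why the base is aligned on a 4 KByte boundary.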
87. for more than one or a few cycles. This is true for today's RISC architectures. Processor activities are assigned priorities determined by the type of activity. For example, reset handling has the highest priority and thus cannot be interrupted. Interrupts are assigned priorities to predetermine the behaviour when simultaneous events occur, and to ensure that no high-priority processor activity is interrupted. The saved processor status required to restart an interrupted program is determined by the activities required to service the interrupt. In general, the processor does not save general register contents when servicing an interrupt. The interrupt handler routine is responsible for saving and restoring register contents which might be altered by the service routine.

Since a real-time system, according to the conventions described in the Introduction of this thesis, must have the ability to respond within a finite time, and events external to the system may require immediate attention, the question of fast rescheduling becomes important. Process switches in real-time systems can be time consuming. Moreover, since processes are created and removed dynamically, it becomes very difficult to predict the time spent on these activities. In analyzing the processor's ability to perform fast task switches, the important observations are:
• The register file should be reasonably sized, since a task switch (process switch) requires the enti
88. format conforms to the IEEE standard for binary floating point arithmetic.

The processor provides three instructions that perform operations on decimal values when the values are presented in ASCII format. Each decimal digit is contained in the least significant byte of an ordinal (32 bits). For decimal operations, bits 8 through 31 of the ordinal containing the decimal digit are ignored.

An individual bit is specified for a bit operation by giving its bit number in the ordinal in which it resides. The least significant bit of a 32-bit ordinal is b0. The most significant bit is b31. A bit field is a contiguous sequence of 0 to 32 bits within a 32-bit ordinal. A bit field is defined by giving its length in bits and the bit number of its lowest numbered bit.

Triple and quad words refer to consecutive bytes in memory or in registers: a triple word is 12 bytes and a quad word is 16 bytes. These data types facilitate the moving of blocks of bytes.

2.2.3 80960KB registers

The processor provides three types of data registers: global, floating-point and local. The 16 global registers (g0–g15) constitute a set of general-purpose registers, the contents of which are preserved across procedure boundaries. The 4 floating-point registers are provided to support extended floating-point arithmetic. Their contents are also preserved across procedure boundaries. The 16 local registers (r0–r15) are provided to hold parameters specific to a proc
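The bit field definition above (a length plus the number of the lowest numbered bit, inside a 32-bit ordinal) maps onto a shift and a mask. The sketch below illustrates that definition only; it is not one of the processor's bit field instructions, and the function names are hypothetical.

```python
def extract_bit_field(ordinal, lowest_bit, length):
    """Return the bit field of `length` bits starting at bit `lowest_bit`.

    Bit numbering follows the text: b0 is the least significant bit and
    b31 the most significant bit of the 32-bit ordinal.
    """
    if not (0 <= length <= 32 and 0 <= lowest_bit <= 31):
        raise ValueError("field must lie within a 32-bit ordinal")
    if lowest_bit + length > 32:
        raise ValueError("field extends past bit 31")
    mask = (1 << length) - 1
    return ((ordinal & 0xFFFFFFFF) >> lowest_bit) & mask


def decimal_digit_byte(ordinal):
    """Keep only the least significant byte; bits 8..31 are ignored."""
    return ordinal & 0xFF


field = extract_bit_field(0xABCD1234, lowest_bit=8, length=8)
digit = chr(decimal_digit_byte(0xFFFFFF37))  # ASCII code 0x37
```

A zero-length field is legal under the definition and simply yields 0.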
89. g EventReq to the execution of the microcode interrupt handler in the CPU is four cycles.

3.2.7 THOR

THOR interrupt handling is described in paragraph 2.7.5. As opposed to a more general interrupt handling approach, THOR gives hardware support for synchronisation between processes running on different processors. On the other hand, in a single-processor system interrupts may be treated in a more conventional and general manner.

The hardware-defined exceptions are listed in table 2.26. All of these exceptions can also be raised by software. The Exception Register (ER) is used when an exception is raised. It points to an Exception Information Block in the stack. This block holds the program counter for the exception handler to call, and the pointer to the Exception Information Block of the next outer scope. When a hardware-generated exception is raised, the following actions occur:
• Top of stack is set to the value of ER.
• Stack top value, i.e. the address of the exception handler, is popped into PC.
• Stack top value, now the new ER, is popped into ER.
• The exception number is pushed according to the preceding table.
Control transfers to the appropriate exception handler.

3.3 Task Switch

In a real-time environment each program under execution constitutes a process. Another name for a process is a task; both terms will be used here. For each process there must exist:
• A Process Control Block (PCB), used by the operating system to maintain the pro
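The effect of the exception sequence above is that the innermost Exception Information Block supplies the handler, and ER then points to the block of the next outer scope. A minimal sketch of that chain follows. It models the blocks as a linked list rather than as words on THOR's stack, and the names `ExceptionInfoBlock`, `Machine`, `enter_scope` and `raise_exception` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ExceptionInfoBlock:
    handler: Callable[[int], str]            # stands in for the handler's PC
    outer: Optional["ExceptionInfoBlock"]    # block of the next outer scope


class Machine:
    def __init__(self):
        self.er = None  # Exception Register: innermost active block

    def enter_scope(self, handler):
        """Set up a block before a subprogram is entered."""
        self.er = ExceptionInfoBlock(handler, self.er)

    def raise_exception(self, number):
        """Dispatch to the innermost handler and unlink its block."""
        block = self.er
        if block is None:
            raise RuntimeError("no Exception Information Block installed")
        self.er = block.outer          # "new ER is popped into ER"
        return block.handler(number)   # handler receives the exception number


m = Machine()
m.enter_scope(lambda n: f"outer handles {n}")
m.enter_scope(lambda n: f"inner handles {n}")
first = m.raise_exception(13)
second = m.raise_exception(13)
```

Raising the same exception twice shows the scoping: the first call is served by the inner subprogram's handler, the second by the outer one.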
90. g point Queue (FQ).

All r registers are 32 bits wide. They are divided into 8 global registers and 7 blocks called windows. Each window contains 24 r registers. The windows are addressed by the CWP, a field of the Processor State Register (PSR). The CWP is incremented by a RESTORE or RETT instruction and decremented by a SAVE instruction. The active window is defined as the window currently pointed to by the CWP. The Window Invalid Mask (WIM) is a register which, under software control, detects the occurrence of IU register file overflows and underflows.

The registers in each window are divided into ins, outs and locals. Registers are addressed as shown in table 2.19. The globals may be addressed when any window is active. Each window shares its ins and outs with adjacent windows. The registers overlap in such a way that, given a register with address o, where 7 < o < 16, o refers to exactly the same register as (o + 16) after the CWP is decremented by 1 (modulo 7), i.e. points to the next window. The windows are joined together in a circular stack, where the highest numbered window is adjacent to the lowest. The outs of window 6 are the ins of window 0.

The global register r[0] is hardwired to zero. Thus reading this register yields a zero result, while writing to it has no effect. The out register r[15] is used for storing the return address when a CALL instruction is executed.

[Figure 2.1: Three overlapping windows and globals. Labels: ins r[31]–r[24], locals r[23]–r[16], active window]
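The overlap rule above, where r[o] for 7 < o < 16 becomes r[o + 16] once the CWP has been decremented, can be checked with a small mapping from (window, register address) to a physical register number. This is a sketch that follows only that decrement rule; the physical numbering chosen here and the function name are assumptions for illustration.

```python
NWINDOWS = 7          # windows in this implementation (from the text)
GLOBALS = 8           # r[0]..r[7]
RING = NWINDOWS * 16  # each window adds 16 registers of its own


def physical_register(cwp, r):
    """Map register address r (0..31) in window cwp to a physical index."""
    if not (0 <= cwp < NWINDOWS and 0 <= r < 32):
        raise ValueError("bad window or register address")
    if r < GLOBALS:
        return r                      # globals are shared by all windows
    # outs (8..15), locals (16..23), ins (24..31) sit on a circular file
    return GLOBALS + (cwp * 16 + (r - 8)) % RING


# After SAVE (CWP decremented), the caller's outs are the callee's ins.
caller, callee = 3, (3 - 1) % NWINDOWS
shared = [physical_register(caller, o) == physical_register(callee, o + 16)
          for o in range(8, 16)]
total = len({physical_register(w, r)
             for w in range(NWINDOWS) for r in range(32)})
```

With 8 globals and 16 new registers per window, `total` comes to 120 physical registers for the seven-window file.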
91. he address of the target instruction. The computed address is placed in the FIP, causing program execution to be transferred to that address. The OPCODE field identifies the particular branch instruction.

Table 2.9: MC88100 16-bit displacement and 26-bit displacement instruction formats (surviving rows: bits 31–26 OPCODE in both formats; bits 20–16 S1)

2.1.5 MC88100 processor states

The MC88100 may be in one of three states:
• Normal instruction execution
• Exception
• Reset

Normal Execution

During normal execution the processor operates at either the supervisor or the user level of privilege. These levels define which memory space is accessed during external bus transactions and which registers are available to the programmer. When operating in supervisor mode, memory accesses reference the supervisor address space in data or instruction memory. This mode allows execution of all instructions and allows access to all control registers and general-purpose registers. Kernel software typically executes in supervisor mode. The kernel may provide services such as resource allocation, exception handling and software execution control. Execution control normally includes control of user programs and protecting the system from accidental corruption by a user program. The user mode changes to supervisor mode if:
• an exception occurs
• a reset is signalled
• a trap instruction i
92. ides in the same memory and are accessed sequentially. Hence the presence of data limits the speed of instruction fetching. This fact influences RISC design considerations.

The principle of a stored program, or a von Neumann architecture, can be implemented in several ways, and this has also been done. To distinguish between different von Neumann architectures we speak more generally about computer architecture. This concept, created by Amdahl while working with the IBM 360, can be summarized as:

The image that the computer presents to the machine language programmer and the compiler writer.

Consequently it covers the processor's instruction set, its registers and other details that are essential for programming the device. The coding and interpretation of a program constitute the instruction set; thus this is a main component of a computer architecture. The register file is heavily utilized by a compiler writer; thus it is another major component of the architecture. Different instructions exhibit different execution times; therefore, on some special occasions the programmer needs to know something about the CPU datapaths, or at least the instruction timing.

Recently the term computer architecture has been given an extended meaning [Hen90], which makes it cover computer hardware and computer organization as well. For the subject as treated in this work, however, Amdahl's definition will suffice.

1.2 Trends in compu
93. ing mode, along with the lda instruction, allows a constant from 0 to 4096 to be loaded into a register. For the register indirect with offset addressing mode (the MD bit is set), the value in the OFFSET field is added to the address in the ABASE register. Setting the offset value to zero creates a register indirect addressing mode; however, this operation can generally be carried out faster by using the MEMB version of this addressing mode.

2. MEMB format

The MEMB format provides seven addressing modes: absolute displacement, register indirect, register indirect with displacement, register indirect with index, register indirect with index and displacement, index with displacement, and IP with displacement. The ABASE and INDEX fields specify local or global registers, the contents of which are used in the address computation. When the INDEX field is used in an addressing mode, the processor automatically scales the value in the index register by the amount specified in the SCALE field. The optional displacement field is contained in the word following the instruction word. The displacement is a 32-bit signed two's complement value.

2.2.5 80960KB addressing modes

The processor offers 11 modes for addressing operands. These modes are grouped as follows:
• Literal
• Register
• Absolute
• Register Indirect
• Register Indirect with displacement
• IP with displacement

Most of the instructions use only the literal and register modes. The remaining
94. ingle-cycle execution of all instructions, every cycle needs an instruction fetch. A shorter instruction format, i.e. more dense code, will decrease the need for instruction fetches.

The Instruction Mix is essential since, for example, load/store instructions introduce extra memory accesses, thus increasing AMA.

Instruction Execution Timing affects memory activity, since the fact that not all instructions execute in one cycle reduces the need for instruction fetches. Thus the longer the execution times, the lower the AMA. Here AMA is estimated by

1 T1 T2 T3 Ta AMA 2 4242424 Dx Xo X3 Xa

4.4 Instruction Mix

The following instruction mix is assumed:
• 50% arithmetical/logical instructions
• 25% jump/branch instructions
• 10% load/store instructions
• 15% floating-point instructions

4.5 Notes on the Failure Rate estimation

Failure rate estimation is carried out according to MIL-HDBK-217E. For the temperature acceleration factor calculation, the thermal resistivity factor was used whenever it was available from the manufacturer's documentation. However, since such information was rare, assumptions had to be made about the junction temperature. For complex circuits such as CPUs and FPUs, a junction temperature of 110 degrees Celsius was assumed. For all others, a junction temperature of 80 degrees Celsius was assumed.

4.6 The HDO configurations

Special requirements for the HDO configuration are:
• microp
95. ion with careful memory hierarchy design, memory management units and floating-point support, on chip or off chip by a coprocessor, these RISC processors seem suitable for embedded systems such as laser printers, and for other general-purpose systems such as workstations. These observations are true for the MC88100, I80960, R2000, Am29000 and SPARC.

The T800 and THOR show another approach: these processors use stack architectures, which eliminates the need for a large register file. The instruction format is flexible while pipelined execution is maintained, and few addressing modes are available.

Chapter 3 Real Time System requirements

The design of reduced instruction set computers is guided by a design philosophy. It does not rely upon inclusion of a set of required features. There is no strict definition of what constitutes a RISC design. However, one may observe some common features.

Pipelining is used in all RISC designs to provide simultaneous execution of multiple instructions.

Simple instructions and addressing modes are used. This results in an instruction decoder that is small, fast and easy to design. With few addressing modes it is easier to map instructions onto a pipeline, since the pipeline can be designed to avoid computation-related conflicts.

A carefully designed memory hierarchy is required for increased processing speed. A typical hierarchy includes high-speed registers, cache buffers located on the CPU chip, and memory management schem
96. ions of memory that contain program data. Register-to-register instructions access only the general-purpose registers or, in some cases, the control registers.

name      proposed usage
r0        zero
r1        subroutine return pointer
r2–r9     called procedure parameter registers
r10–r13   called procedure temporary registers
r14–r25   calling procedure reserved registers
r26       linker
r27       linker
r28       linker
r29       linker
r30       frame pointer
r31       stack pointer
Table 2.1: MC88100 general purpose registers

name    usage
fcr0    f.p. exception cause register
fcr1    f.p. source operand 1 high register
fcr2    f.p. source operand 1 low register
fcr3    f.p. source operand 2 high register
fcr4    f.p. source operand 2 low register
fcr5    f.p. precise operation type register
fcr6    f.p. result high register
fcr7    f.p. result low register
fcr8    f.p. imprecise operation type register
fcr62   f.p. user status register
fcr63   f.p. user control register
Table 2.2: MC88100 floating point registers

name    usage
        processor identification register
        processor status register
        exception time processor status register
        shadow scoreboard register
        shadow execute instruction pointer
        shadow next instruction pointer
        shadow fetched instruction pointer
        vector base register
        transaction register 0
        data register 0
        address register 0
        transaction register 1
        data register 1
        address register 1
        transaction register 2
        data register 2
        address register 2
        supervisor storage register 0
        s
97. it is assumed that 90% of all conditional branches are taken.
• $X_3$ denotes the number of processor cycles required to execute a load/store instruction. For simplicity, loads and stores are considered equal in this sense.
• $X_4$ denotes the number of processor cycles required for the execution of a floating-point instruction.

In order to describe wait-state penalties and different instruction formats, the following parameters are introduced:
• $W$ denotes the number of wait states required for a read bus cycle, determined by the system configuration.
• $U$ denotes the average number of instructions that become available for execution as a result of one 32 8 bits fetch. If, for example, 70% of the instruction set consists of instructions encoded in 16 bits and the rest are encoded in 32 bits, then $U = 0.7 \cdot 2 + 0.3 \cdot 1 = 1.7$.
• $Y(W,U)$ denotes the average number of cycles required to feed the processor with one instruction. This is a function of wait-state penalties and instruction format:

$Y(W,U) = \frac{1 + W}{U}$ [cycles/instruction]

Since instruction fetch and execution are performed simultaneously in a pipelined architecture, we write

$Z_1 = \max(X_1, Y(W,U))$
$Z_2 = \max(X_2, Y(W,U))$
$Z_3 = X_3 + W$
$Z_4 = \max(X_4, Y(W,U))$

We obtain an expression for the Execution Rate Estimation (ERE):

$ERE = Z_1 a_1 + Z_2 a_2 + Z_3 a_3 + Z_4 a_4$ [cycles]

where ERE denotes the average number of cycles required to execute one instruction. Including the cycle time CT in seconds, we arrive at a final ex
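The execution rate estimate can be traced numerically with a minimal sketch. It assumes $Y(W,U) = (1 + W)/U$, which reproduces the value $Y = 3$ quoted for $W = 2$, $U = 1$ in the SPARC HDO calculation, and it uses the instruction mix of section 4.4 as the weights. The function names and the conversion to MIPS (the reciprocal of ERE times the cycle time) are written out here for illustration only.

```python
def fetch_cycles(W, U):
    """Y(W, U): average cycles needed to feed one instruction (assumed form)."""
    return (1 + W) / U


def execution_rate(X, W, U, mix):
    """ERE: average cycles per instruction for cycle counts X and mix weights."""
    y = fetch_cycles(W, U)
    Z = (
        max(X[0], y),   # arithmetical/logical
        max(X[1], y),   # jump/branch
        X[2] + W,       # load/store pays the wait states on top
        max(X[3], y),   # floating point
    )
    return sum(z * a for z, a in zip(Z, mix))


def mips(ere, cycle_time_s):
    """Million instructions per second for a given ERE and cycle time."""
    return 1.0 / (ere * cycle_time_s) / 1e6


MIX = (0.50, 0.25, 0.10, 0.15)      # instruction mix from section 4.4
SPARC_X = (1.101, 1, 3, 4)          # SPARC HDO parameters X1..X4
ere = execution_rate(SPARC_X, W=2, U=1, mix=MIX)
rate = mips(ere, 40e-9)             # 40 ns cycle time
```

For these inputs the Z values are 3, 3, 5 and 4, the ERE is 3.35 cycles, and the rate is about 7.5 MIPS, matching the SPARC HDO figure.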
98. itectures has been to achieve increased performance without a dramatic increase in cost. The RISC approach (single-cycle execution) offers high performance at reasonable cost. Current RISC architectures are characterised by
• a large register file
• instructions that are fast to decode
• pipelined execution
• few addressing modes
• fixed instruction format

Bus Error | An external memory access failed to complete within 255 clock cycles
Address Error | Attempt to access non-physical or protected memory
Data Error | Uncorrectable error in data read
Instruction Error | Attempt to execute a privileged instruction in user mode, or an illegal instruction
Jump Error | Attempt to jump to, call or return to an invalid address
Reserved | Reserved
Constraint Error | A constraint of a CLL or CUL instruction was not satisfied
Access Check | Attempt to use a zero indirect address with the PSHX and POPX instructions, i.e. follow a null pointer
Storage Error | Attempt to access memory outside the task's stack in user mode
Overflow Check | Overflow of signed integer or float arithmetic operation
Underflow Check | Underflow or denormalised result of float arithmetic operation
Division Check | Attempt to divide by zero
Illegal Operation | Illegal float arithmetic instruction caused by any denormalised/NaN operand
Tasking Error | Reserved for future use, currently not raised by hardware
Table 2.26: THOR exception numbers

In combinat
99. its behaviour accordingly. Interrupts are caused by signals applied to any of the external inputs INTR0-INTR3 or by a timer facility. The processor may be disabled from taking certain interrupts by the masking capability provided by the Disable All Interrupts and Traps (DA) bit, the Disable Interrupts (DI) bit and the Interrupt Mask (IM) field in the current processor status register. INTR0 cannot be disabled by the IM field; thus it is a non-maskable interrupt line. Traps are caused by signals applied to one of the inputs TRAP0-TRAP1 or by exceptional conditions such as a protection violation. Interrupt and trap processing relies on the existence of a user-managed vector area in external instruction/data memory or instruction read-only memory (instruction ROM). The Vector Area begins at an address specified by the Vector Area Base Address Register and provides for 256 different exception handling routines. The processor reserves 32 routines for system operation and 32 routines for FP, multiply and divide instructions. When an exception is taken, the processor determines an 8-bit vector number associated with the exception. Vector numbers are either predefined or specified by the instruction causing the trap, as shown in table 2.17.

2.3.6 Am29000 pipelining

The Am29000 implements a four-stage pipeline for instruction execution. The four stages are fetch, decode, execute and write-back. During the fetch stage the Instruction Fetch
100. le
FPLDZEROSN | fp load zero single
FPLDZERODB | fp load zero double
FPLDNLADDSN | fp load non-local and add single
FPLDNLADDDB | fp load non-local and add double
FPLDNLMULSN | fp load non-local and multiply single
FPLDNLMULDB | fp load non-local and multiply double
FPSTNLSN | fp store non-local single
FPSTNLDB | fp store non-local double
FPSTNLI32 | fp store non-local int32
Table A.56: T800 Floating point Load/Store operations

FPENTRY | floating point unit entry
 | floating point reverse
 | floating point duplicate
Table A.57: T800 Floating point general operations

 | set rounding mode to round nearest
 | set rounding mode to round zero
 | set rounding mode to round positive
 | set rounding mode to round minus
Table A.58: T800 Floating point rounding operations

FPCHKERROR | check fp error
FPTESTERROR | test fp error false and clear
FPUSETERROR | set fp error
FPUCLEARERROR | clear fp error
Table A.59: T800 Floating point error operations

Instruction | Comments
FPGT | fp greater than
FPEQ | fp equality
FPORDERED | fp orderability
FPNAN | fp not a number
FPNOTFINITE | fp not finite
FPUCHKI32 | check in range of type int32
FPUCHKI64 | check in range of type int64
Table A.60: T800 Floating point comparison operations

FPUR32TOR64 | real 32 to real 64
FPUR64TOR32 | real 64 to real 32
FPRTOI32 | real to int 32
FPI32TOR32 | int 32 to real 32
FPI32TOR64 | int 32 to real 64
FPB32TOR64 | bit 32 to real 64
FPUNOROUND | real 64 to real 32, no round
101. lly, in most computer applications a program written in assembly language exhibits the shortest execution times. This is because assembly language programmers know the computer architecture well and are capable of taking every advantage of it. It is difficult to accomplish this automatically and for general cases, which is what a compiler must do when it generates code. However, assembly language programming as a way of increasing program performance suffers from some heavy disadvantages. It is probably the most time-consuming method of writing software. Thus it is very expensive and yields results much later than high-level programming. Hence, for a new processor architecture there has to be a compiler for a high-level language. It has been found that it is difficult to construct an efficient compiler for a computer with a large instruction set. The compiler cannot make use of all of the sophisticated instructions that the architecture offers. Therefore the compiler uses simpler instructions and generates larger code, making programs run slower and wasting primary memory in a way that would not be needed if an assembly language programmer wrote the same piece of code. With the experience of these facts, some designers began to question whether CISCs are as fast as they could be, bearing the capabilities of the underlying technology in mind. A few designers offered the hypothesis that increased performance should be
102. ly of computers based upon RISC design. Two of these computers, the Series 930 and the Series 950, are realizations of the HP Precision Architecture [Bir85] RISC-type system. The IBM 6151 RT PC is basically a workstation which uses the IBM ROMP (Research/Office Products Division MicroProcessor) and an MMU (Memory Management Unit) [Hin86]. The ROMP/MMU represents one of the commercial spin-offs from the IBM 801 research project.

Chapter 2 Description Of RISC Architectures

In this chapter a detailed description of seven RISC processors, mostly from an architectural point of view, will be given. Basic features that will be described are
• Instruction set
• Data formats
• CPU register description
• Instruction formats and addressing modes
• Processor states
The following literature was chosen as sources (see the bibliography for complete references): MC88100 RISC microprocessor user's manual [Mot90], 80960KB programmer's reference manual [Int88], MIPS R2000 RISC architecture [MIP87], SPARC RISC user's guide [ROS90], The Transputer databook [Inm89], Am29000 streamlined instruction processor user manual [Adv88], THOR Stack-RISC microprocessor instruction set architecture for prototype chip [Saa92]. For THOR, additional information was gathered from draft issues of a forthcoming user's manual. The purpose of this chapter is to give a standardised description of the selected RISC p
103. modes are used for memory-related instructions.

Literals
The processor recognizes two types of literals: ordinal literals and floating-point literals. An ordinal literal can range from 0 to 31 (5 bits). When an ordinal literal is used as an operand, the processor expands it to 32 bits by adding leading zeroes. If the instruction specifies an operand larger than 32 bits, the processor zero-extends the value to the operand size. If an ordinal literal is used in an instruction that requires integer operands, the processor treats the literal as a positive integer value. The processor also recognizes two floating-point literals, +0.0 and +1.0. These floating-point literals can only be used with floating-point instructions. As with the ordinal literals, the processor converts the floating-point literals to the operand size specified by the instruction. A few of the floating-point instructions use both floating-point and non-floating-point operands, e.g. the convert-integer-to-real instructions. Ordinal literals can be used in these instructions for the non-floating-point operands.

Register
A register is referenced as an operand by giving the register number. Both floating-point and non-floating-point instructions can reference global and local registers in this way. However, floating-point registers can only be referenced in conjunction with floating-point instructions.

Absolute
Absolute addressing is used to reference a memory location directly as an o
104. multiply s p rc ra rb f p subtract s p Table A 29 Am29000 Floating point instructions EMULATE vn ra rb trap to software emulation routine HALT enter halt mode INV invalidate IRET interrupt return IRETINV interrupt return and invalidate SETIP rc ra rb set indirect pointers CLZ rc rb const8 count leading zeros CONVERT rc ra conversion convert data format Table A 30 Am29000 Miscellaneous instructions 124 A 4 R2000 instruction set summary rt offset base rt offset base rt offset base rt offset base rt offset base load byte offset addr signed load byte offset addr unsigned load halfword offset addr signed load halfword offset addr usigned load word offset addr signed rt offset base load word to coprosessor rt offset base load word left rt offset base load word right rt offset base store byte rt offset base store halfword rt offset base rt offset base store word from coprocessor z rt offset base store word left store word Table A 31 R2000 Load Store instructions rd rs rt signed add trap on overflow rt rs immediate signed immediate add trap on overflow rt rs immediate unsigned immediate add rd rs rt unsigned add rd rs rt set on less than rt rs immediate set on less than immediate rt rs immediate set on less than immediate unsigned rd rs rt set on less than unsigned rd rs rt logical and rt rs immediate logical and imme
105. n used as the instruction's operand. All instructions except the prefix instructions end by clearing the operand register, ready for the next instruction. The prefix instruction loads its four data bits into the operand register and then shifts the operand register left four bits. The negative prefix instruction is similar, except that it complements the operand register before the shift. Consequently, operands can be extended to any length up to the length of the operand register by a sequence of prefix instructions. In particular, operands in the range -256 to 255 can be represented using one prefix.

2.6.4 The T800 registers

Expressions are evaluated on the evaluation stack formed by three registers. No hardware mechanism is provided to detect that more than three values are loaded onto the stack. The entire user-accessible register set consists of
• The Workspace Pointer, which points to an area for local variables.
• The Instruction Pointer, which points to the next instruction to be executed.
• The Operand Register, which is used in the formation of instruction operands.
• Three registers A, B and C, which form an evaluation stack. The evaluation stack is used for expression evaluation, to hold the operands of scheduling and communication instructions, and to hold parameters of procedure calls.

2.7 Saab Ericsson Space THOR

THOR is a microprocessor primarily intended for embedded real-time systems. Among other things it facilitates Ad
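The T800 prefix mechanism described in the excerpt above can be sketched as a small decoder. This is an illustrative sketch rather than a model of the real hardware: the byte layout (function code in the high nibble, data in the low nibble) and the function codes for prefix (0x2), negative prefix (0x6) and load constant (0x4) are not stated in the text and are assumed here from the transputer instruction set; `decode` and `to_signed` are hypothetical helper names.

```python
MASK32 = 0xFFFFFFFF
PFIX, NFIX = 0x2, 0x6  # assumed function codes for prefix / negative prefix


def to_signed(value):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def decode(code):
    """Return (function, operand) pairs for the non-prefix instructions in code."""
    oreg = 0  # operand register
    result = []
    for byte in code:
        function, data = byte >> 4, byte & 0xF
        oreg |= data  # every instruction loads its four data bits
        if function == PFIX:
            oreg = (oreg << 4) & MASK32
        elif function == NFIX:
            oreg = (~oreg << 4) & MASK32  # complement first, then shift
        else:
            result.append((function, to_signed(oreg)))
            oreg = 0  # cleared, ready for the next instruction
    return result


# 0x4 is assumed to be "load constant"; it is used only as an example function.
print(decode([0x2F, 0x4F]))  # one prefix, largest positive operand
print(decode([0x6F, 0x40]))  # one negative prefix, most negative operand
```

The two calls exercise the limits of the one-prefix range quoted in the text, 255 and -256.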
106. nPC into r[17] and r[18], respectively, of the new window.
5. It sets the tt field of the TBR to the appropriate value.
6. If the trap is not a reset, it writes the PC with the contents of TBR and the nPC with the contents of TBR + 4. If the trap is a reset, it loads the PC with 0 and the nPC with 4.

3.2.6 T800

The T800 EventReq and EventAck pins provide an asynchronous handshake interface between an external event and an internal process. When an external event (interrupt) pulls EventReq active, the external event channel (additional to the external link channels) is made ready to communicate with a process. When both the event channel and the process are ready, the processor pulls EventAck active and the process, if waiting, is scheduled. Only one process may use the event channel at any given time. If no process requires an event to occur, EventAck will never be activated. If the process is a high-priority one and no other high-priority process is running, the latency is typically 19 processor cycles. Setting a high-priority task to wait for an event input allows the user to interrupt a transputer program running at low priority. The following functions take place:
• Sample EventReq at pad and synchronise.
• Edge-detect the synchronised EventReq and form the interrupt request.
• Sample interrupt vector for microcode ROM in the CPU.
• Execute the interrupt routine for Event rather than the next instruction.
The time taken activatin
107. nts are to be transferred to the selected control register. This field is ignored in load instructions. The OP field identifies the particular instruction. The SFU field specifies a special function unit accessed by the instruction: the value zero specifies the integer/control unit registers, the value one specifies the floating-point unit registers. Other values (2-7) cause an SFU precise exception for the addressed SFU. The S2 field, finally, must contain the same value as the S1 field for decoding purposes.

Data Memory Access Instructions
The MC88100 supports three addressing modes for accessing data in memory or for generating a memory address. Address calculations are performed using unsigned arithmetic. Overflows are not detected and results are truncated to the number of available bits.
1. Register indirect with 16-bit zero-extended immediate index. The contents of rS1 are added to the 16-bit zero-extended immediate index contained in the I16 field of the instruction. The result is a data memory address. This address is
• for the LDA instruction, loaded into the register specified by the D field
• for STORE and EXCHANGE instructions, used as the memory address where the contents of the D field register are stored

[Table 2.7: MC88100 indexed addressing instruction formats (immediate index and register index; fields OPCODE in bits 31-26, D, S1, SUBOPCODE, S2)]

• for the LOAD instruction, used as the memory addres
108. number specified by the content of a special purpose register rather than the instruction field. Three independent indirect register numbers are contained in three separate special purpose registers. The number for Global Register 0 specifies indirect register addressing. An instruction can specify an indirect register for any or all of the source operands or the result. General registers may be partitioned into segments of 16 registers for the purpose of access protection. A register in a protected segment may be accessed only by a program executing in Supervisor mode. An attempted access by a User mode program causes a trap to occur. The Am29000 contains 23 special purpose registers which provide controls and data for certain processor functions. Special purpose registers are accessed by data movement only. Any special purpose register can be written with the contents of any general purpose register and vice versa. Some special purpose registers are protected and can be accessed only in Supervisor mode. This restriction applies to both read and write accesses. Any User mode program violation of this restriction causes a trap to occur. The special purpose registers are partitioned into protected and unprotected registers. Special purpose registers numbered 0-127 and 160-255 are protected and the remaining are unprotected. Not all of these are implemented. The special purpose registers and their definitions are listed in table 2.15. Vector Base A
109. o determine whether to service the interrupt immediately or to postpone the service until a later time. The 80960KB processor provides an alternate method of communicating with other agents in the system, called IAC messages or simply IACs. Using the IAC mechanism, other agents on the system bus are able to communicate with the processor through messages that are exchanged in a reserved section of memory. Like interrupts, IACs are used to request that the processor stop work on its current task and begin work on another task. However, where an interrupt generally causes a temporary break in the execution of a program, an IAC often causes a permanent change in the control flow of the processor. While executing instructions, the processor is able to recognize certain conditions that could cause it to return an inappropriate result or that could cause it to go down a wrong and possibly disastrous path. One example of such a condition is a divisor operand of zero in a divide operation. Another example is an instruction with an invalid opcode. These conditions are called faults. The processor handles faults in almost the same way that it handles interrupts. When the processor detects a fault, it automatically stops its current processing activity and begins work on a fault-handling procedure.

2.3 AMD Am29000

In 1987 Advanced Micro Devices (AMD) released the first microprocessor ever designed by the company, the Am29000. The processor operates at a
110. odify icc logical and not logical and not modify icc subtract integer subtract integer modify icc subtract with carry subtract with carry modify icc tagged subtract and modify icc tagged subtract modify icc and trap on overflow multiply step inclusive or inclusive or modify icc inclusive or not inclusive or not modify icc exclusive or exclusive or and modify icc exclusive nor exclusive nor and modify icc shift left logical shift right logical shift right arithmetic zero least sign 10 bits replace high order bits Table A 37 SPARC Arithmetic Logical Shift instructions 128 LDSB LDSBA LDSH LDSHA LDUB LDUBA LDUH LDUHA LD LDA LDD LDDA LDF LDDF LDFSR LDC LDDC LDCSR LDSTUB LDSTUBA STB STBA STH STHA ST STA STD STDA STF STDF STFSR STDFQ STC STDC STCSR STDCQ SWAP SWAPA address rd address asi rd address rd address asi rd address rd address asi rd address rd address asi rd address rd address asi rd address rd address asi rd address frd address frd address fsr address creg address creg address creg address rd address asi rd rd address rd address asi rd address rd address asi rd address rd address asi rd address rd address asi frd address frd address fsr address fq address creg address creg address csr address cq address source rd regsource fast rd
111. onst8 rb const8 rb const8 rb const8 rb const8 rb const8 rb const8 rb const8 rc ra rb rc ra rb rc rb const8 rc ra rb rc ra rb const8 add signed add add with carry signed add with carry unsigned add with carry subtract subtract with carry subtract with carry signed subtract with carry unsigned subtract reverse subtract reverse with carry subtract reverse with carry signed subtract reverse with carry unsigned subtract reverse signed subtract reverse unsigned subtract signed subtract unsigned integer multiply unsigned integer multiply signed multiply step multiply last step multiply step unsigned divide step integer divide signed integer divide unsigned divide initialize divide last step divide remainder Table A 23 Am29000 Integer arithmetic instructions 121 CPBYTE rc ra rb const8 compare bytes CPEQ rc ra rb const8 compare equal to CPGE rc ra rb const8 compare greater than or equal to CPGEU rc ra rb const8 compare greater than or equal to unsigned CPGT rc ra rb const8 compare greater than CPGTU rc ra rb const8 compare greater than unsigned CPLE rc ra rb const8 compare less than or equal to CPLEU rc ra rb const8 compare less than or equal to unsigned CPLT rc ra rb const8 compare less than CPLTU rc ra rb const8 compare less than unsigned CPNEQ rc ra rb const8 compare not equal to A
112. or the processor sets the appropriate priority bit and vector bit in the pending interrupt record and continues work on its current task. When the processor, in the executing state, decides to service the interrupt, it
1. saves the current state of process controls and arithmetic controls in an interrupt record on the stack that the processor is currently using
2. if the execution of an instruction was suspended, includes a resumption record for the instruction in the current stack and sets the resume flag in the saved process controls
3. switches to the interrupted state
4. sets the state flag in the process controls to interrupted, its execution mode to supervisor and its priority to the priority of the interrupt
5. clears the trace fault pending and trace enable flags
6. allocates a new frame on the interrupt stack and switches to the interrupt stack
7. sets the frame return status field
8. performs an implicit call-extended operation at the address specified by the interrupt table for the specified interrupt vector

3.2.3 Am29000

The following operations are performed by the processor when an interrupt or trap is taken:
1. Instruction execution is suspended.
2. Instruction fetching is suspended.
3. Any in-progress load or store operation is completed. Any additional operations are cancelled in the case of load multiple and store multiple.
4. The contents of the Current Processor Status Register are copied into the Old
113. ords are numbered in increasing order from left to right starting with 0 (big-endian scheme) or from right to left (little-endian scheme), as controlled by the processor configuration register. Most instructions deal directly with word-length integer data; integers may be either signed or unsigned depending on the instruction. Some instructions, e.g. AND, treat word-length operands as strings of bits. In addition there is support for character, half-word and Boolean data types. Floating-point data (single and double precision) are defined but not directly supported by processor hardware. The processor supports character data through extraction (EXBYTE) and insertion (INBYTE) operations on word-length operands and by a compare (CPBYTE) operation on byte-length fields within words. The processor supports half-word data through extraction (EXHW) and insertion (INHW) operations on word-length operands. There is also an Extract Half-Word, Sign-Extended instruction (EXHWS), which acts similarly to EXHW.

absolute register | general purpose register number
 | Indirect Pointer Access
 | Stack Pointer
 | Not Implemented
 | Global Registers 64-127
 | Local Register 125
 | Local Register 126
 | Local Register 127
 | Local Register 0
 | Local Register 1
 | ...
 | Local Register 123
 | Local Register 124
Table 2.14: Am29000 general purpose registers

The Boolean format used by the processor is such that the Boolean values TRUE and FALSE are represented by 1 and 0, respectively, in the m
114. ossible to a limited extent. Built-in fault tolerance support such as self-check and memory error detection and correction is provided only by THOR, while the MC88100 and Am29000 provide support for redundant designs.

Chapter 4 System Hardware Considerations

A physical real-time system, when used in aerospace for example, must meet some important needs. It should be small in size and have low weight and low power consumption. The system should be reliable, and thus only high-quality components (at least military qualified) should be used. Fault tolerance support is desirable, and memory errors must be detected and preferably corrected. See [Jan90] for a thorough description of requirements on microcomputers in critical applications. The purpose of this chapter is to highlight how demands on system hardware impact system performance and dependability. This chapter discusses six computer designs that use the Inmos T800 Transputer, the Saab Ericsson Space THOR and the Cypress SPARC microprocessors, respectively, in order to evaluate hardware aspects of the three processors in two different configurations:
• A real-time system application called the High Dependability Oriented configuration (HDO). The HDO configuration should be thought of as an on-board computer for a spacecraft.
• A general purpose embedded system application called the High Speed Oriented configuration (HSO).
The designs, which are not realised, are considere
115. ost significant bit of a word. The floating-point format defined for the processor conforms to the IEEE Floating Point standard (P754).

2.3.3 Am29000 register description

The Am29000 has three classes of registers which are accessible by instructions. These are general purpose registers, special purpose registers and translation look-aside buffer (TLB) registers. Any available operation can be performed on the general purpose registers, while the special purpose registers and the TLB registers are accessed only by explicit data movement to or from a general purpose register. Table 2.14 lists the 192 general purpose registers and their functions. The following terminology is used to describe the addressing of general purpose registers:
1. Register Number is a software-level number for a general purpose register (0-255).
2. Global Register Number is a software-level number for a global register, ranging from 0 to 127.
3. Local Register Number is a software-level number for a local register, ranging from 0 to 127.
4. Absolute Register Number is a hardware-level number used to select a general purpose register in the Register File. These numbers range from 0 to 255.
The 192 registers are divided into 64 global and 128 local registers. Global registers are addressed with absolute register numbers, while local registers are addressed relative to an internal stack pointer. The general purpose registers may be accessed indirectly, with the register
116. ower Requirement (mW): 119576 104767 169453
Failure Intensity (FITS)
Table 4.2: Summary, general purpose system configuration

Chapter 5 Concluding Remarks

Several decisions have to be made during the design of a new computer architecture. These decisions are based upon the designer's experience as well as the system's requirements. From RISC design concepts, several high-performance microprocessors have been constructed. In this thesis we have studied how seven different microprocessors could perform in real-time systems. Four of these processors are general purpose RISC processors (Motorola 88100, Intel 80960KB, MIPS R2000 and Cypress SPARC), while three processors (AMD 29000, Inmos T800 and Saab Ericsson Space THOR) are targeted for real-time systems. From observations in this study we may conclude that important real-time requirements, such as fault tolerance, precise time handling, rapid response to external events, process switch and debug facilities, have not had a major influence on the design of the general purpose processors. Rather, they are optimized for the highest possible execution rate. A real-time system requirement such as fault tolerance places several restrictions on the system hardware design. It turns out that a high execution rate cannot be maintained, because memory devices for these applications are too slow. Moreover, since the communication between processor and memory must be checked by dedica
117. p 1221-1246, December 1984.
[Hen90] Hennessy J. L., Patterson D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, California, 1990.
[Hil85] Hill M. D. et al. SPUR: A VLSI multiprocessor workstation. Technical report, Computer Science Division, University of California, Berkeley, December 1985.
[Hil86] Hill M. D. et al. Design decisions in SPUR. IEEE Computer, vol. 19, no. 11, pp. 8-22, November 1986.
[Hin86] Hindin H. J. IBM RISC workstation features 40-bit addressing. Computer Design, pp. 28-30, February 1986.
[Inm89] Inmos Limited. Transputer databook, second edition, 1989.
[Int88] Intel Corporation. 80960KB programmer's reference manual, 1988.
[Jan90] Jan Torin. Characterisation of microcomputers for embedded real-time systems: directions and basic criteria. Technical report, Department of Computer Engineering, Chalmers University of Technology, 1990.
[Mil83] Milutinović V. M., editor. High Level Languages in Computer Architecture. Computer Science Press Inc., Oxford, 1983.
[MIP87] MIPS Computer Systems Inc. MIPS R2000 RISC architecture, 1987.
[Mot90] Motorola Inc. MC88100 RISC microprocessor user's manual, second edition, 1990.
[Pat82] Patterson D. A., Séquin C. H. A VLSI RISC. Computer, pp. 8-22, September 1982.
[Rad83] Radin G. The IBM 801 minicomputer. IBM Journal R&D, vol. 27, no. 3, pp. 237-246, May 1983.
[ROS90] ROSS Technology Inc. SPARC RISC user's guide, 1990.
Saa9
118. p instructions for target address and test condition generation. The OPCODE field identifies the particular instruction. For bit-test branch instructions, the bit in source 1 specified by the B5 field is tested for either a set or a clear condition. For condition-test branch instructions, source 1 is tested for the condition(s) specified in the M5 field. In either case, if the test condition is true, the 16-bit displacement specified in the instruction D16 field is shifted left two positions and sign-extended to 32 bits. This value is added to the XIP and the result is loaded into the FIP; thus program execution is transferred to that address. For trap-generating bounds-check instructions, the data in source 1 is compared to the specified immediate operand. A trap is taken if the register data is greater than the unsigned operand. If the trap is taken, the bounds-check vector number is combined with the VBR, the result is concatenated with three trailing zeroes and loaded into the FIP. Exception processing begins from the bounds-check exception vector.
4. 26-bit branch displacement. This form is used to specify the branch target instruction in unconditional branch instructions, which use a sign-extended 26-bit displacement to calculate the location of a new target instruction. The displacement is shifted left by two bits and sign-extended to 32 bits. The two least significant bits are cleared to force word alignment. This value is then added to the XIP to form t
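The branch target calculation described in the excerpt above can be illustrated with a short Python sketch. It is a sketch under assumptions: the helper names `sign_extend` and `branch_target` are hypothetical, registers are modelled as plain integers, and the 32-bit wrap-around on the addition is assumed rather than stated in the text.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Sign-extend a `bits`-wide field to a Python integer."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def branch_target(xip, displacement, bits=16):
    """Target = XIP + (sign-extended displacement << 2), wrapped to 32 bits.

    Shifting left by two leaves the two least significant bits clear,
    so the target stays word aligned whenever XIP is word aligned.
    """
    return (xip + (sign_extend(displacement, bits) << 2)) & MASK32


print(hex(branch_target(0x1000, 0x0004)))         # D16 form, forward by 4 words
print(hex(branch_target(0x1000, 0xFFFF)))         # D16 form, backward by 1 word
print(hex(branch_target(0x1000, 0x3FFFFFF, 26)))  # 26-bit form, backward by 1 word
```

The same function covers both the 16-bit and the 26-bit displacement forms; only the field width changes.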
119. pression for the execution rate: $ER = \frac{1}{ERE \cdot CT}$ instructions/second.

4.3 Memory Power Consumption

The memory used in the HDO configuration (64k nibble Cypress CY7C194) is a 24-pin device with 35 ns access time. Memory is organized as 40-bit words (32 data and 8 check bits); thus each memory access will activate all of the ten devices. If we define the Average Memory Activity (AMA) as the fraction of processor cycles that access memory in an instruction mix, the memory power consumption can be estimated as $P_{average} = AMA \cdot P_{active} + (1 - AMA) \cdot P_{standby}$. For this memory device, $P_{active} = 650$ mW and $P_{standby} = 100$ mW. Determination of AMA is complicated by several factors. The memory device typically needs one cycle to enter standby mode after being accessed. Obviously the memory power requirement depends on the instruction execution order. If, for example, load/store instructions were ordered as every other instruction rather than as consecutive instructions, there would be more memory-active cycles, since we actually need two consecutive cycles that do not access memory to reach standby mode. In the estimations, the instruction order as well as wait-state cycles are ignored and AMA is considered a function of
1. Instruction Fetch Rate
2. Instruction Mix
3. Instruction Execution Timing
Instruction Fetch Rate is limited by the instruction format. For example, with an instruction format of 32 bits and assuming s
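The two estimates in the excerpt above, the execution rate and the average memory power, can be written as small helper functions. This is a minimal sketch: the function names are hypothetical, the power figures (650 mW active, 100 mW standby) and the ten-device word organisation come from the text, and the AMA, ERE and cycle-time values in the example calls are illustrative only.

```python
def execution_rate(ere_cycles, cycle_time_s):
    """ER = 1 / (ERE * CT), in instructions per second."""
    return 1.0 / (ere_cycles * cycle_time_s)


def average_memory_power(ama, p_active_mw=650.0, p_standby_mw=100.0):
    """P_average = AMA * P_active + (1 - AMA) * P_standby, per device, in mW."""
    return ama * p_active_mw + (1.0 - ama) * p_standby_mw


def total_memory_power(ama, devices=10):
    """All ten devices of a 40-bit word are activated on each access."""
    return devices * average_memory_power(ama)


# Illustrative values, not taken from the source.
print(execution_rate(ere_cycles=2.0, cycle_time_s=50e-9))
print(average_memory_power(ama=0.5))
print(total_memory_power(ama=0.5))
```

Because the model ignores instruction order and wait-state cycles, these figures are estimates and not measured power.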
120. r The processor signals situations where the output of any enabled driver does not agree with its input. For a single processor, the output comparison detects short circuits in output signals but does not detect open circuits. It is possible to connect a second processor in parallel with the first, where the second processor has its outputs disabled due to the Test mode. The second processor detects open-circuit signals as well as providing a check of the output of the first processor.

3.4.4 R2000

The instruction set includes a BREAK instruction which causes a BREAK trap to occur. Control is transferred to the applicable system routine.

3.4.5 SPARC

Software debugging is only supported by means of general trap instructions.

3.4.6 T800

Software debugging is supported by a variety of instructions that affect status bits. When the processor Analyse pin is taken high, the transputer will halt at a descheduling point. The T800 offers the possibility to respond differently to interrupts depending on the processor's current mode. The T800 incorporates a timer. The implementation directly supports the occam model of time. Each process can have its own independent timer, which can be used for internal management or real-time scheduling. Hardware redundancy is achieved by means of multiple-transputer configurations.

3.4.7 THOR

THOR has a built-in real-time clock to keep track of system time. Furthermore each process ha
121. r 5 ns setup before data is latched. Taking into account the delay introduced by the 7138: 16 ns. Memory requires 35 ns from CS to valid data. Data bus buffers delay data by 11 ns. Thus we need a cycle time of 15 + 16 + 35 + 11 + 5 = 82 ns. The THOR cycle time is 67 ns and therefore one wait state is required. 4.8.2 THOR HDO configuration execution rate The following parameters were chosen to describe the THOR configuration X 1 X2 1 X3 2 Xa 4 95% of THOR instructions are encoded in 16 bits, the rest are encoded in 32 bits, hence U = 1.95 and with W = 1 from the previous section Y W U 1 03 Thus leading to 1 1 ER R 1073 6T ns 8 9 MmixedIPS For the memory activity, AMA = 0.410. The total memory power requirement is 326 mW/device. 4.9 SPARC HDO configuration The CY7C601 chip running at 25 MHz is available in mil spec. Component list U1 U2 U3 U4 Uli U12 MU1 EU1 EU2 EU3 EU4 EUS EUS Device Qty Power mW Area mm2 FITS CY7C601 1 1750 1998 365 CY7C344 1 1000 289 170 CY7C602 1 1750 1600 358 U6 74ACT244 3 59 220 3 74ACTO4 1 50 154 3 MC146818 1 20 255 49 MU10 CY7C194 35 10 650 255 218 IDT49C460B 1 625 1944 92 cYC7C361 1 750 280 170 74ACT32 1 44 154 3 OTO50 1 100 270 27 EU8 74ACT245 4 59 220 3 74ACT244 1 59 220 3 Not Available in mil spec 4.9.1 SPARC Read Cycle Delays • A2-A17 to CS (PLD decoder): 20 ns • memory data setup time: 35 ns
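The read-cycle arithmetic for THOR can be checked with a short Python sketch. It assumes that each wait state stretches the access by exactly one processor cycle, and that a 32-bit fetch delivers two 16-bit instructions or one 32-bit instruction; both function names are hypothetical, and the source of the first 15 ns term is not visible in the text above.

```python
import math


def wait_states(required_ns, cycle_ns):
    """Extra cycles needed, assuming one wait state adds one full cycle."""
    return max(0, math.ceil(required_ns / cycle_ns) - 1)


def instructions_per_fetch(short_fraction):
    """Average instructions per 32-bit fetch (assumption: 16-bit ones come two per word)."""
    return short_fraction * 2 + (1.0 - short_fraction) * 1


# Delay terms quoted in the text, in ns: unnamed first term, 7138 decoder,
# memory CS to valid data, data bus buffers, setup before latch.
delays_ns = [15, 16, 35, 11, 5]
required_ns = sum(delays_ns)              # 82 ns
thor_wait = wait_states(required_ns, 67)  # one wait state at a 67 ns cycle
u = instructions_per_fetch(0.95)          # approximately 1.95
```

The same helper gives zero wait states whenever the summed delay fits inside a single cycle.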
122. r of the instruction encountered immediately after a control transfer. • TCOND: This field selects the condition code for format 2 instructions. • The IMM22 field contains a 22-bit constant value used by the SETHI instruction. • DISP22: This field contains a 22-bit sign-extended value used for PC-relative addressing when a branch is taken. 3. The remaining instructions use format 3. • The OP3 (op3) field selects one of the format 3 opcodes. • ASI: This 8-bit field is the address space identifier generated by load/store alternate instructions. • RS1: This 5-bit field selects the first source operand from either the r registers for integer instructions, an f register for floating-point instructions, or a c register for coprocessor instructions. • RS2: This 5-bit field selects the second source operand from either the r registers for integer instructions, an f register for floating-point instructions, or a c register for coprocessor instructions. • SIMM13: This field is a sign-extended 13-bit immediate value used as the second ALU operand. It is sign-extended to full word size when used. • OPF/OPC: This 9-bit field identifies a floating-point operate (FPop) instruction or a coprocessor operate (CPop) instruction. 2.5.5 SPARC traps and exceptions SPARC supports three types of traps: synchronous, floating-point/coprocessor, and asynchronous. Asynchronous traps are also called interrupts. Synchronous traps are caused by an instruction and occur
123. rcl src2 dst remainder ordinal srcl src2 dst modulo integer Table A 8 I80960KB Integer arithmetic instructions 114 src dst src dst src dst src dst move move long move triple move quad Table A 9 I80960KB Move instructions SHLO SHRO SHLI SHRI SHRDI len src dst len src dst len src dst len src dst len src dst shift left ordinal shift right ordinal shift left integer shift right integer shift right dividing integer AND A and B A and not B not A and B A or B not A and not B not A B A B not A not A or B A or not B not A or not B srcl src2 dst ANDNOT srcl src2 dst NOTAND srel sre2 dst OR srcl src2 dst NOR srcl src2 dst XOR srcl src2 dst XNOR srcl src2 dst NOT srcl src2 dst NOTOR srcl src2 dst ORNOT srcl src2 dst NAND srcl src2 dst Table A 10 I80960KB Shift rotate and logical instructions CMPI srel sre2 compare integer CMPO srel sre2 compare ordinal srcl src2 conditional compare integer srcl src2 conditional compare ordinal srcl src2 dst compare and increment integer srcl src2 dst compare and increment ordinal CONCMPI CONCMPO CMPINCI CMPINCO Table A 11 I80960KB Compare conditional compare instructions 115 targ branch targ branch extended targ branch and link targ dst branch and link extended targ branch if equal targ branch if not equal targ branch if less targ branch if less than or equal targ branch if greater
124. rd jump and link TA address trap always TN address trap never Table A 39 SPARC Control Transfer instructions continued 130 address trap on not equal address trap on equal address trap on greater address trap on less or equal address trap on greater or equal address trap on less address trap on greater unsigned address trap on less or equal unsigned address trap on carry clear address trap on carry set address trap on positive address trap on negative address trap on overflow clear address trap on overflow set Table A 40 SPARC Control Transfer instructions y rd read y register psr rd read processor state register wim rd read window invalid mask register tbr rd read trap base register rs1 rs2 imm y write y register rs1 rs2 imm psr write processor state register rs1 rs2 imm wim write window invalid mask register rs1 rs2 imm tbr write trap base register Table A 41 SPARC Read Write control register operations CPop coprocessor operations FPop coprocessor operations UNIMP const22 unimplemented instruction IFLUSH address flush instruction cache Table A 42 SPARC Miscellaneous instructions 131 A 6 T800 instruction set summary adress jump constant load local pointer prefix constant load non local constant load constant constant load non local pointer negative prefix constant load local constant add constant adress call subroutine adress conditional jump constant adjust workspa
125. re processor context to be exchanged. • Hardware support for task switches is an essential feature to reduce the time spent on rescheduling. A large register file will delay a processor context switch significantly. Therefore a large register file, which has proved essential for increasing system performance, could become a bottleneck with unpredictable consequences. From paragraph 3.3 we may conclude that a stack architecture such as the T800 or THOR, with hardware support for process switches, provides considerably better performance than any of the other processors. In applications where speed is far beyond human control and the tolerances are small, there is often a need for precise time handling, i.e. processes that require a precise delay should get that delay and nothing else. Three of the studied processors address these issues with on-chip timer facilities: Am29000, T800 and THOR. Real-time systems are used to maintain surveillance and control processes where a system failure might have disastrous consequences: nuclear plants, aircraft and spacecraft, just to mention a few. In the years to come we will see even more applications with steadily growing demands for reliability and security. Consequently, hardware/software debugging support and fault tolerance are also important parts of real-time system design. All of the processors provide some kind of software debug support. Furthermore, the T800 provides facilities that make real-time debugging p
126. rea Address Defines the beginning of the interrupt trap Vector Area Old Processor Status Stores a copy of the current processor status when an interrupt or trap is taken It is later used to restore the current processor status on an interrupt return Current Processor Status contains control information associated with the currently executing process such as interrupt disables and the supervisor mode bit Configuration contains control information which normally varies only from system to system and is usually set only during system initialisation Channel Address Contains the address associated with an external access and retains the address if the access does not complete successfully The Channel Address Register in conjunction with the Channel Data and Channel Control registers allow restarting of unsuccessfull external accesses Channel Data Contains Data associated with a store operation and retains data if the operation does not complete successfully AT register number register protected registers number Vector Base Address Old Processor Status Current Processor Status Configuration Channel Address Channel Data Channel Control unprotected registers Indirect Pointer C Indirect Pointer B Indirect Pointer A Q ALU Status Byte Pointer Funnel Shift Count Register Bank Protect Timer Counter Timer Reload Program Counter 0 Program Counter 1 Program Counter 2 MMU Configuration LRU Recommendation
127. ring the last cycle a prefetch of next instruction is possible thus loading 31 registers will be accomplished within 31 9 3 1 cycles B 2 180960KB B 2 1 PCB search Assuming Normal case execution time Register moves are word sized PCB search exits with task identification number T ID in r4 task priority T PRI in r3 ptr to highest process tasks PCB in r5 lda PCBOPTR r2 address of first PCB in r2 1 move r2 r5 ptr to hi priority task 1 move 10 r1 number of PCB s to search 1 move 0 r3 initial priority lowest 1 move 0 r4 initial PCB ID undefined 1 L1 ldl T PRI r2 r6 memory access 40 cmpibge has to wait for r6 cmpibge r3 r6 L2 branch if previous is greater 30 move r6 r3 substitute new priority 1 ldl T ID r2 r4 remember task ID memory access 2 move r2 r5 remember PCB ptr 1 L2 ldl T NEXT r2 r2 get next PCB pointer memory access 20 subo ri i ri exit 10 cmpobg r1 r0 L1 when all PCB s searched 27 B 2 2 Register Store Cycles 4 6 figure B 4 are memory data accesses that prevents instruction fetch therefore 180960KB will finish 3 stores within every sixth cycle and so storing 80 registers will use 80 6 3 160 cycles B 2 3 Register Restore Cycles 4 9 are memory data accesses that prevents instruction fetch therefore 180960 will finish 3 loads within every tenth cycle During the last cycle a prefetch of next instruction 143 is possible thus
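The register store estimate for the 80960KB (3 stores completed in every 6 cycles, so 80 registers take 160 cycles) follows a simple proportional rule. A minimal Python sketch of that rule is given here; the function name is hypothetical, and the proportional scaling is an assumption that mirrors the arithmetic in the text rather than a cycle-accurate model.

```python
def block_transfer_cycles(registers, transfers_per_group, cycles_per_group):
    """Cycles for a register block, scaling the per-group cost proportionally."""
    if transfers_per_group <= 0 or cycles_per_group <= 0:
        raise ValueError("group sizes must be positive")
    return registers * cycles_per_group / transfers_per_group


# 80 registers, 3 stores finished within every 6 cycles: 80 * 6 / 3 = 160 cycles.
store_cycles = block_transfer_cycles(80, 3, 6)
```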
128. rocessor with 256kB primary memory e only space qualified components e low power consumtion e small printed circuit board area The HDO configuration designs consists of e cpu e 256 kB of static random access memory e error detection and correction circuitry e real time clock 94 In the failure rate estimation for HDO configuration the following assumptions were made e Quality Factor e Voltage Factor S 0 25 1 e Application Environment Factor Space Flight 0 9 The T800 and SPARC designs both utilise an error detection and correction unit EDAC The introduced delay 36 ns worst case for the EDAC in use is inserted by the EDAC control and assures that memory Ready signal will not be asserted until correct data is guaranteed THOR has a built in EDAC so there is no need for this unit in the THOR HDO configuration 4 7 T800 HDO configuration T800 chip running at 17 5 MHz is available in mil spec Since the T800 has an on chip timer no such peripheral device is required Component list Ul U2 U5 U6 U7 U8 U9 U11 U12 U13 U14 MU1 MU10 EU1 EU2 EU3 EU4 EU5 EUS EU9 1 Estimated for the current application 2 Average according to AMA Device T800 G17S 74ACT245 74ACT08 74ACT244 74HCT373 74ACTO4 OTO5 54HCT393 CY7C194 35 IDT49C460B CYC7C361 L66DMB 1 74ACT32 OTO50 74ACT245 74ACT244 Qty 1 4 1 1 2 1 1 2 10 1 1 1 4 1 Power mW 1200 1
129. rocessors The varying ways of implementing floating-point support, memory management etc. will only be mentioned briefly, and no detailed descriptions will be given. 2.1 Motorola MC88100 In early 1988 Motorola Inc. presented the 88000. The basic architecture consists of a processor chip (MC88100) and two identical cache chips (MC88200). This offers a full system solution for a reduced instruction set architecture. The MC88100 has capability for concurrent operations. There are four execution units: the Integer/Bit Field Unit and the Floating Point Unit execute data manipulation instructions, the Data Unit performs data memory accesses, while the Instruction Unit performs instruction prefetches. There are separate data and instruction memory ports (Harvard Bus Structure) and pipelined Load and Store operations. The MC88100 also has three internal buses (a source 1 bus, a source 2 bus and a destination bus) that are used for passing operands between the register file and the different execution units. 2.1.1 MC88100 instruction set The MC88100 instruction set contains 51 instructions. All integer arithmetic, logical, bit-field and certain flow control instructions execute in a single clock cycle. Memory access and floating-point instructions are performed by dedicated execution units. All instructions are implemented directly in hardware, precluding the need for microcoded operations. An instruction set summary is given in the appendix. 2.1.2 MC88100 dat
130. rocoded scheduler which enables any number of concurrent processes to be executed together, sharing the processor time. At any time a concurrent process may be • Active: being executed, or on a list waiting to be executed • Inactive: ready to input, ready to output, or waiting until a specified time. The scheduler operates in such a way that inactive processes do not consume any processor time. It allocates a portion of the processor's time to each process in turn. Active processes waiting to be executed are held in two linked lists of process workspaces, one of high priority processes and one of low priority processes. Each list is implemented using two registers, one of which points to the first process in the list, the other to the last. Each process runs until it has completed its action, but is descheduled whilst waiting for communication from another process or transputer, or for a time to complete. In order for several processes to operate in parallel, a low priority process is only permitted to run for a maximum of two time slices before it is forcibly descheduled at the next descheduling point. The time slice period is approximately 1 ms. A process can only be descheduled on certain instructions, known as descheduling points. As a result, an expression evaluation can be guaranteed to execute without the process being timesliced part way through. Whenever a process is unable to proceed, its instruction pointer i
131. rs while decoding the instruction. 3. ALU: Perform the required operation on instruction operands. 4. MEM: Access memory (D-cache) if required for Load/Store instructions. 5. WB: Write back ALU results or the value loaded from the D-cache to the register file. Each of these steps requires approximately one CPU cycle. The R2000 uses different techniques internally to enable execution of all instructions in a single cycle. However, as discussed earlier, there are load and store instructions as well as jumps and branches which could disturb the smooth flow of instructions through the pipeline. In the R2000, execution continues despite the delay. Loads, jumps and branches do not interrupt the normal flow of instructions through the pipeline. The processor always executes the instruction immediately following one of these delayed instructions. Instead of having the processor deal with pipeline delays, the R2000 turns over the responsibility for dealing with delayed instructions to software. 2.5 Cypress SPARC CY7C600 The SPARC (Scalable Processor ARChitecture), designed by Sun Microsystems, is an open computer architecture. SPARC is an architecturally driven standard, with binary compatibility of software between processor versions ensured by enforcing compliance with the architecture standard. The CY7C600 chip set is a 32-bit custom CMOS implementation of the SPARC architecture, currently available at a clock speed of 40 MHz. The chip set includes integer unit floa
132. rved for special functions as follows: register r0 contains the previous frame pointer (PFP), r1 contains the stack pointer (SP) and r2 contains the return instruction pointer (RIP). The processor accesses the local registers at the same speed as it does the global registers. Register Scoreboarding A mechanism called register scoreboarding can in certain situations permit instructions to execute concurrently. While an instruction is being executed, the processor sets a scoreboard bit to indicate that a particular register or group of registers is being used in an operation. If the instructions that follow do not use registers in that group, the processor is in some instances able to execute those instructions before execution of the prior instruction is complete. Instruction Pointer The instruction pointer (IP) is the address of the instruction currently being executed. This address is 32 bits and the 2 least significant bits are always zero. Instructions in the processor are one or two words long. The IP gives the address of the lowest-order byte of the first word of the instruction. Arithmetic Controls The processor arithmetic controls are made up of a set of 32 bits. These bits include condition codes, floating-point control and status bits, integer control and status bits, and a bit that controls faulting on imprecise faults, i.e. faults where the entire processor status is not known. OPCODE SRC DST SRC2 M3 M2 M1 OPCOD
133. rwise may arise in high clock frequency sys tems Failure Rate Estimations assumes commercial quality components and a Ground benign environment 102 4 12 T800 HSO configuration Component list Device Ul T800 G30S U2 CY7C343 U3 U7 74ACT245 US U11 74ACT244 MU1 MU8 CYM1624 MU9 MU10 CY7C338 4 12 1 NOP OP FP OO ty Power mW 1200 775 71 71 2750 750 Area mm2 1451 311 220 220 442 226 T800 HSO configuration execution rate FITS 13907 4527 490 490 11242 3398 From the T800 read cycle diagram and with the chosen configuration we conclude that an external memory read cycle may be performed without wait state penalty This also implies that there is nothing to gain from a cache memory It should however be emphasised that the T800 internal memory 4 kByte is not considered Hence W 2 U 2 leading to Y W U 1 5 and Z1 2 Z2 3 8 Z3 4 Z4 8 The HSO T800 configuration runs at 30 MHz and thus ER 3 55 33 ns 1 4 13 THOR HSO configuration Component list Device Ul THOR U2 CY7C343 MU1 MU8 CYM1624 MU9 MU1O CY7C338 MU11 MU14 74ACT245 MU15 MU17 74ACT244 w eN Aa O ty Power mW 1500 775 2750 750 35 60 103 8 5 MmizedI PS Area mm2 2450 311 442 226 220 220 FITS 78 4527 11242 3398 490 490 4 13 1 THOR HSO config execution rate In the proposed configuration THOR 25 MHz does not require any wait state so W 0 U 1 95
134. s discussions, his ideas and encouragement. Arne Carlsson, who shared his great experience from the design and construction of real-time systems. Chapter 1 The Background Of RISC 1.1 Computer Architecture A computer is a high-speed device that performs arithmetic operations and symbol manipulation through a set of machine-dependent instructions. A computer consists of several important parts: there are memory systems, input/output devices ranging within a large scale of complexity, and the Central Processing Unit (CPU) with datapaths, control unit and other subsystems. There are at least two principally different ways of managing the central processing. One of these is the data flow machine, another is the von Neumann machine. A von Neumann machine does information processing by sequentially executing algorithms which are organized as programs and stored in a memory. The programs detail interpretation and processing of information coded as data and stored in the same memory. The von Neumann machine consequently consists of at least one processor that sequentially interprets instructions in the program, and a primary memory that stores program and data. These architectures may suffer degraded performance from the so-called von Neumann bottleneck, which means that execution speed is highly dependent on the rate at which primary memory can be accessed (the memory bandwidth). This comes from the fact that code (processor instructions) and data res
135. s a Delay register causing an interrupt after a specified delay. This provides for an efficient implementation of a high-level language real-time delay function, since kernel software is released from polling a delay queue each time scheduling is to be performed. Also, the unique TASK instructions implemented in THOR serve as powerful support for introducing the ADA task concept as constituting a process in a real-time system. There are instructions for scheduling and delaying tasks as well as performing rendezvous between tasks. THOR provides hardware self-check as well as an on-chip Error Detection And Correction (EDAC) unit for checking processor communication with memory. 3.5 Conclusions The large register file present in several of the studied processors allows optimizing compilers to arrange for fast subprogram calls by passing parameters in registers. When a large register file is available, there is a good chance that all or most of the parameters can be passed this way. The MC88100 and R2000 are good examples. Both architectures provide large register sets, and the usage of these registers can be optimized by a compiler. The drawback here comes in the case of nested subprogram calls: only the highest program level can take full advantage of this construction. With a register window design, as in SPARC or I80960KB, it is possible to increase the number of program levels that will benefit from parameters passed in registers
136. s executed by a user program • an interrupt or memory access fault occurs. Exceptions Exceptions are conditions that cause the processor to suspend execution of the current stream and perform exception processing. Exceptions can occur at any time during normal instruction execution. Exceptions are recognized internally when the processor is between instructions. Exceptions occur due to four types of conditions: • Interrupts, which are signalled externally • Externally signalled errors, such as bus errors • Internally recognized errors, such as zero divide • Trap instructions. The processor begins exception handling at the next instruction boundary after the event is recognized. It freezes the execution context in shadow and exception-time registers, which also precludes other interrupts from occurring, and enters the supervisor mode. The FPU is disabled and the data unit is allowed to complete pending accesses. Instruction execution transfers in an orderly manner to the appropriate interrupt handler routine, which is defined by the exception vector associated with that particular interrupt. Exceptions fall into two categories: precise and imprecise. With a precise exception, the exact processor context when the exception occurred is available and the exact cause of the exception is always known. With an imprecise exception, the exact processor context is not known when the exception is processed. The context is no
137. s from which the D field register is loaded. 2. Register indirect with index is similar to the previous mode, but the contents of the register specified by the S2 field are used as an index rather than as an immediate value. The SUBOPCODE field specifies the particular instruction. 3. Register indirect with scaled index: The index is scaled by the size of the access before it is used in the address calculation. Here SUBOPCODE specifies the particular instruction as well as the scaling factor. Flow Control Instructions Flow control instructions address or reference instruction memory by the use of four different addressing modes. Address calculations are performed using signed arithmetic. Overflows are not detected and results are truncated to the number of available bits. 1. Triadic Register Addressing is used to specify the target of a jump instruction or the operands of a trap-on-bound instruction. Not all three of the operands have to be used. The SUBOPCODE identifies the particular instruction. For jump instructions, the contents of the register specified by the S2 field are placed in the FIP, causing program execution to be transferred to that address. The lower two bits of the S2 field register are ignored so that the FIP contains a word address. The S1 and D fields are ignored. For trap-generating bound check instructions, the data in the registers specified by the S1 and S2 fields are compared. A trap is taken if the source 1 data is greater than the source 2 data (unsigned). The D
138. s saved in the process workspace and the next process taken from the list. Process scheduling pointers are updated by instructions which cause scheduling operations, and should not be altered directly. Actual process switch times are less than 1 microsecond, as little state needs to be saved and it is not necessary to save the evaluation stack on rescheduling. The T800 supports two levels of priority. Priority 1 (low priority) processes are executed whenever there are no active priority 0 (high priority) processes. High priority processes are expected to execute for a short time. If one or more high priority processes are able to proceed, then one is selected and runs until it has to wait for a communication or a timer input, or until it completes processing. If no process at high priority is able to proceed, but one or more processes at low priority are able to proceed, then one is selected. If there are n low priority processes, then the maximum latency from the time at which a low priority process becomes active to the time when it starts processing is 2n - 2 timeslice periods. It is then able to execute for between one and two timeslice periods, less any time taken by high priority processes. This assumes that no process monopolises the transputer time, i.e. it has a distribution of descheduling points. B.7 THOR PCB search THOR, like the T800, facilitates hardware support for task switching. There are 6 different Signal In pins (SI0-SI5) which f
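The latency bound for low priority processes can be turned into a quick estimate. The Python sketch below assumes the bound is 2n - 2 timeslice periods and uses the timeslice of approximately 1 ms mentioned in the description of the T800 scheduler; the function name and the default value are illustrative only.

```python
def max_low_priority_latency_ms(n, timeslice_ms=1.0):
    """Worst-case wait before a newly active low priority process starts running.

    Assumes the bound of 2n - 2 timeslice periods for n low priority
    processes, and that no process monopolises the processor.
    """
    if n < 1:
        raise ValueError("there must be at least one low priority process")
    return (2 * n - 2) * timeslice_ms


# With ten low priority processes and a 1 ms timeslice the bound is 18 ms.
latency_ms = max_low_priority_latency_ms(10)
```

Time taken by high priority processes is not included in this figure, so it should be read as a lower limit on the worst case rather than a guarantee.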
139. se requirements at the same time. In fact there are several contradictions in the requirements, such as 1 and 3, 2 and 3, and a closer look will show even more. The RISC approach is to reduce C and T. This can only be done at the cost of I. To minimize this cost one attempts to reduce I with the aid of highly optimizing compilers. Therefore one must bear in mind that the absence of such program development tools will dramatically affect a RISC system. 1.4 A RISC design decision graph The RISC approach leads to several design decisions. Figure 1.1 illustrates how fundamental criteria lead to design decisions that constitute a RISC processor. An attempt to achieve single-cycle execution, i.e. reduce C without affecting cycle time T, leads to a pipelined architecture. The pipeline should be divided into stages which all meet the cycle time requirement stated as T. To fully exploit the advantages of a pipeline, a uniform instruction fetch and execution must be accomplished. This may possibly be disturbed by data dependencies, which prevent an early stage of an instruction from being executed before a later stage of the preceding instruction has been completed. Changes in program flow force a stop, flush and refill of the pipeline. A scoreboard mechanism that indicates registers in use will detect data dependencies. Pipeline forwarding techniques may prove helpful for reducing the penalties. Delayed branch, which means that
140. ser code and data, but access to the kernel procedures (called supervisor procedures) is only allowed through a controlled interface. This interface is provided by the system procedure table. 2.2.2 80960KB data formats The 80960KB operates on seven data types. Integer, real, ordinal and decimal data types can be thought of as numeric data types. The remaining types (bit field, triple word and quad word) represent groupings of bits or bytes that the processor can operate on as a whole, regardless of the nature of the data contained in the group. Integers are signed whole numbers which are stored and operated on in two's complement format. The processor recognizes four sizes of integers: 8-bit byte integers, 16-bit short integers, 32-bit integers and 64-bit long integers. Ordinals are a general-purpose data type. The processor recognizes four sizes of ordinals: 8-bit byte ordinals, 16-bit short ordinals, 32-bit ordinals and 64-bit long ordinals. The processor uses ordinals for both numeric and non-numeric operations. For numeric operations, ordinals are treated as unsigned whole numbers. The processor provides several arithmetic instructions that operate on ordinals. For non-numeric operations, ordinals contain bit fields, byte strings and Boolean values. Reals are floating-point numbers. The processor recognizes three sizes of reals: 32-bit reals, 64-bit long reals and 80-bit extended reals. The real number
141. simple case was analyzed for the studied processors. A real-time system with ten runnable processes is considered. A complete process switch is assumed to be accomplished by storing the old process context, selecting a new process, and loading the new process context into the processor registers. Table 3.1 summarises the processor cycles required to complete a search in the list of PCBs for each processor. The number of cycles required for storing/restoring processor context is given in table 3.2. From these figures and the system's clock frequency, the total time required to perform a process switch can be estimated (Table 3.3). For THOR and T800 there is hardware support for rescheduling, while for the other processors the process switch had to be programmed. Assembly language listings of these programs, and notes about the calculations giving the figures, are gathered in Appendix B. Processor Freq (MHz) Total Time (microseconds) MC88100 12 2 I80960KB 21 4 Am29000 13 1 MIPS R2000 6 8 SPARC 17 2 T800 less than 1 THOR less than 1 Table 3.3 Total time required for a process switch (estimated) 3.4 Real Time System Support As stated earlier in this chapter, a real-time system should provide synchronisation between events. This requires data structures for wait and delay queues, and a timer function used to maintain system time and for process delay purposes. Another important issue is the problem of synchronising local system time with global
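The way a total process switch time follows from cycle counts and clock frequency can be sketched in Python. The function name is hypothetical, and the cycle counts in the example are made-up placeholders, not values from Tables 3.1 and 3.2.

```python
def process_switch_time_us(search_cycles, store_cycles, restore_cycles, clock_mhz):
    """Total switch time in microseconds from cycle counts and clock frequency.

    A clock of f MHz executes f cycles per microsecond, so time = cycles / f.
    """
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    total_cycles = search_cycles + store_cycles + restore_cycles
    return total_cycles / clock_mhz


# Hypothetical example: 150 search + 60 store + 60 restore cycles at 25 MHz.
example_us = process_switch_time_us(150, 60, 60, 25.0)
```

For the processors with hardware rescheduling support (T800 and THOR) this software cycle count does not apply, which is why the document reports their switch time simply as less than one microsecond.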
142. studied processors. That includes subprogram calls, interrupt handling, process switch, real-time synchronisation facilities and debug support. Other aspects of the high-level language support are not within the scope of this work. 3.1 Subprogram Calls A subprogram call is the result of a high-level language function/procedure call statement. In the case of func(p1, p2, ..., pn), the compiler's function is to generate code for a subprogram call with n parameters. The traditional way to do this is to push the n parameters on the stack and perform a subroutine (subprogram) call, then modify the stack pointer and continue. But this requires at least n memory accesses, with possible penalty and degraded performance. Thus it is preferable to hold and pass the parameters in registers. This is made possible by a large number of registers and conventions for the use of these registers, that is, directives for the compiler writer on how to dispose of the register set. The register usage conventions are connected with the processor architecture, and these conventions will be described in the next paragraphs. Besides parameter passing, a compiler generates specific code for a subprogram which is to be executed before the actual translated high-level program (subprogram entry) as well as after the high-level program (subprogram exit). Subprogram entry code should, for example, allocate memory required for local variables, possibly perform stack checking, check pointers for v
143. t modify bit Table A 14 I80960KB Bit bitfield instructions targ call a new precedure targ call a system procedure targ call extended return from procedure Table A 15 180960KB Call return instructions fault if equal fault if not equal fault if less fault if less or equal fault if greater fault if greater or equal fault if ordered fault if unordered Table A 16 I80960KB Conditional fault instructions 117 MODTC MARK FMARK MODPC FLUSHREG MODAC TESTE TESTNE TESTL TESTLE TESTG TESTGE TESTO TESTNO modify trace controls generate breakpoint trace event force mark modify process controls flush local registers modify arithmetic control test for equal test for not equal test for less test for less or equal test for greater test for greater or equal test for ordered mask src dst src mask src dst mask src dst dst dst dst dst dst dst dst dst test for unordered Table A 17 I80960KB Processor management instructions SYNCF SYNLD SYNMOV SYNMOVL SYNMOVQ synchronize faults synchronize load synchronous move synchronous move long synchronous move quad src dst dst src dst src dst src Table A 18 I80960KB Synchronous load and move instructions 118 ADDR ADDL ATADD ATANR ATANRL ATMOD CLASSR CLASSRL CMPOR CMPORL CMPR CMPRL COSR COSRL CPYRSRE CPYSRE CVTILR CVTIR CVTRI CVTRIL CVTZRI CVTZRIL DIVR DIVRL EXPR EXPRL LOGBNR LOGBNRL LOGEP
144. t known because concurrent operations have affected the information that comprises the processor context. The integer unit maintains copies of certain internal registers for use during MC88100 exception processing. The data unit and FPU also maintain copies of internal registers to allow full recovery when exceptions occur. These copies are referred to as shadow registers and are updated on every clock cycle when shadowing is enabled. Shadowing must be specifically enabled, either by clearing the shadow freeze bit in the PSR or by executing an rte instruction. The shadow freeze bit is set by hardware when an exception is processed, in order to preserve the processor context. Exception vectors are entry points into the interrupt handler routines. The MC88100 maintains a vector table consisting of 512 exception vectors on a 4 KB memory page, pointed to by the vector base address in the vector base address register (VBR). Each interrupt and exception vector has a corresponding number, which is generated by hardware or specified as a nine-bit field in a trap instruction. This number is used as an index into the vector table. Each exception vector is two instructions (eight bytes) long. Exception vectors 0-127 are reserved for various events, while exception vectors 128-511 are user defined. Due to the concurrent execution units of the MC88100, multiple exceptions can occur at the s
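The vector table layout described above (512 vectors of eight bytes each on one 4 KB page) implies a simple address calculation. The following is a minimal sketch of that arithmetic, not MC88100 reference code; the function names are hypothetical, and it assumes the VBR value is the byte address of a page-aligned table.

```python
VECTOR_COUNT = 512  # from the text: 512 exception vectors
VECTOR_SIZE = 8     # from the text: two instructions, eight bytes, per vector
TABLE_SIZE = VECTOR_COUNT * VECTOR_SIZE  # 4096 bytes, i.e. one 4 KB page


def vector_address(vbr: int, vector: int) -> int:
    """Return the address of the first instruction of an exception vector.

    Assumption: `vbr` is the byte address of the vector table and is
    aligned to a 4 KB page boundary.
    """
    if not 0 <= vector < VECTOR_COUNT:
        raise ValueError("vector number must fit in nine bits (0-511)")
    if vbr % TABLE_SIZE != 0:
        raise ValueError("vector table is assumed to be 4 KB aligned")
    return vbr + vector * VECTOR_SIZE


def is_user_defined(vector: int) -> bool:
    """Vectors 128-511 are user defined; 0-127 are reserved."""
    return 128 <= vector < VECTOR_COUNT
```

For example, with the table placed at 0x1000, vector 128 (the first user-defined vector) would start at 0x1000 + 128 * 8 = 0x1400.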
145. ted logic, the memory bandwidth is further reduced. Precise time handling is essential for the control of several processes in real-time system applications. The general-purpose processors rely on timer functions provided by other devices in the system, and this is probably not sufficient. The ability to respond within a finite time to an external event depends on the processor's support for a software process switch. Minimizing the latency of a switch between two processes requires hardware support for this event. The general-purpose processors do not provide such support. Debug capabilities in hardware as well as software are necessary for the design of highly dependable systems such as real-time systems. The general-purpose processors do not provide extensive support for debugging of a real-time system. The Am29000, although the manufacturer claims it is designed for real-time systems, is similar to the general-purpose processors. The T800 has several features which support real-time systems, while THOR is the only one of the studied processors that seems to be dedicated to use in real-time systems. Bibliography. Adv88: Advanced Micro Devices, Am29000 Streamlined Instruction Processor, 1988. Bir85: Birnbaum, J. S., Worley, W. S., Beyond RISC: High-precision architecture, Hewlett-Packard Journal, vol. 36, no. 8, pp. 4-10, August 1985. Hen84: Hennessy, J. L., VLSI processor architecture, IEEE Transactions on Computers, vol. C-33, no. 12, p
146. ter architectures. To gain an understanding of the design decisions behind RISC machines, it is necessary to recapitulate the historical development of processors and their instruction sets. Ever since the first digital processing units, instruction sets have been extended and instructions have grown in complexity. The MARK 1 (1948) had seven quite simple instructions, while a mainframe from the late seventies such as the VAX has over 300 instructions. Some of these instructions are extremely complex, requiring a large amount of hardware and several clock cycles to execute. This in turn leads to sophisticated techniques for pipelining and prefetching, and to the use of cache memories. This development from small and simple to large and complex instruction sets is remarkable when it comes to single-chip processors. For example, comparing the Motorola 6800 with the 68020, we find that eleven new addressing modes have been added, the number of instructions has doubled, and new functions have been added for instruction caches and coprocessors. Furthermore, instruction complexity has grown tremendously. The general trend towards the modern CISC (Complex Instruction Set Computer) is a result of several factors. New models within a computer family have to be compatible with their predecessors. As a result, the number of functional units in the processor increases. In this way new functions can be added in new machines without wasting earlier software development effor
147. ters r9 through r2 are used for passing parameters to a called routine. These registers can be overwritten by the called routine. This is a software convention.
• Registers r13 through r10 are used for temporary storage. They can be overwritten by a called routine but do not contain parameters for the called routine. This is a software convention.
• Registers r25 through r14 are used as data storage for the current routine. A called routine must ensure that the data in these registers is returned without modification when it finishes execution; these registers must be preserved for the calling routine. This is a software convention.
• Registers r29 through r26 are reserved for use by the linker, which is a software convention.
• Register r30 is reserved for use as a software frame pointer, which is a software convention.
• Register r31 is reserved for use as a software stack pointer, which is a software convention.
Thus the architecture gives good support to subprogram calls with up to eight parameters passed in registers. It should be noted, though, that nested subprogram calls require stacking of the registers used for parameters during the previous call. 3.1.2 I80960KB register conventions. The 80960 provides a set of 16 local registers for each subprogram. There are 4 sets of these registers on chip. If a nesting depth larger than 4 is used, the processor automatically saves the local register contents on the stack, thus freeing local registers
148. the following fields:
• CLK (Clock Frequency) sets a division factor (1 to 255) of the chip clock to obtain the real-time clock and delay register frequency (nominally 1 MHz). Clocks are stopped when this field is zero.
• CC (Cache Control) controls the use of the data and instruction caches.
• RM controls the IEEE 754 floating-point Rounding Mode.
• S determines the Scheduling Mode used.
• F enables flow control.
• B enables the bus timeout exception.
• WS (Waitstate) sets the number of waitstates in the first 1 GByte of memory. From 0 up to 6 waitstates can be used. Setting this field to 7 indicates use of the Ready signal.
• DC (Data Check) sets the data error checking mode in the first 1 GByte of memory. The mode may be one of Odd/Even Parity, EDAC or disabled.
The Error Address Register (EAR) is set to the first external memory address which caused an error. The register contains a word address. The Identification Register (IR) is a read-only register holding the chip manufacturer identity, part number and version number. The Real Time Clock (RTL, RTM) is a 64-bit value read as two 32-bit registers. The rate at which it is incremented depends on the contents of the Configuration Register. The Signal Registers hold the status of the chip signals used for multiprocessing and interrupts. There is one input register (SIR) and one output register (SOR). Each bit in the registers corresponds to a signal on the chip. There are 6 inputs
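The CLK field described above divides the chip clock down to the real-time clock frequency. The sketch below only illustrates that division; the function names and the example chip clock values are hypothetical and are not taken from THOR documentation.

```python
def real_time_clock_hz(chip_clock_hz: float, clk_field: int) -> float:
    """Real-time clock frequency for a given CLK division factor.

    Per the text, CLK is a division factor from 1 to 255 and the clocks
    are stopped when the field is zero (modelled here as 0.0 Hz).
    """
    if not 0 <= clk_field <= 255:
        raise ValueError("CLK field must be in the range 0-255")
    if clk_field == 0:
        return 0.0
    return chip_clock_hz / clk_field


def clk_field_for(chip_clock_hz: float, target_hz: float = 1_000_000) -> int:
    """Pick the CLK value that gets closest to the nominal 1 MHz clock."""
    divisor = round(chip_clock_hz / target_hz)
    if not 1 <= divisor <= 255:
        raise ValueError("no valid division factor for this chip clock")
    return divisor
```

Under these assumptions, a 20 MHz chip clock would need CLK = 20 to reach the nominal 1 MHz real-time clock.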
149. the price for this superior performance. Table 4.1, however, gives another picture. The restrictions made on the real-time system configuration degrade total SPARC system performance notably; here it is comparable with both THOR and T800. The explanation lies in the absence of cache memory and the presence of an EDAC, which prevent the system from gaining the benefits that the SPARC architecture offers. At the same time, the expected failure rate and the total board area required are considerably larger than for THOR. The power requirement more than doubled compared to both T800 and THOR. 4 16 Conclusions. The system hardware considerations show that in a real-time system design there is not very much to gain from a modern general-purpose RISC design such as SPARC. On the contrary, while the estimated performance for SPARC was just about the level of THOR, the board area became approximately 40% larger, the power consumption 70% more and the expected failure rate 45% greater.
Table 4.1: Summary, real-time system configuration. Rows: Clock Frequency (MHz); Mixed instruction execution rate (MmixedIPS); Number of required devices; Total area for devices (mm2); Total power requirement (mW); Failure Intensity (FITS).
30 / 25 / 40: Clock Frequency (MHz)
8.5 / 14.3 / 23.0: Mixed instruction execution rate (MmixedIPS)
21 / 19 / 23: Number of required devices
7730 / 8289 / 12785: Total area for devices (mm2)
26114 / 26020 / 36190: Total P
150. ther the processor is in supervisor mode or not. Supervisor mode can only be entered by a software or hardware trap. The PS bit contains the value of the S bit at the time of the most recent trap. ET is the Trap Enable bit. When it is set, traps are enabled. When ET is cleared, all asynchronous traps are ignored, and a synchronous trap will cause the processor to halt and enter error mode, i.e. perform a RESET. CWP is the Current Window Pointer, which points to the currently active r register window. It is decremented by traps and the SAVE instruction and incremented by the RESTORE and RETT instructions. The Window Invalid Mask Register (WIM) is used to determine whether a window overflow or window underflow trap should be generated by a SAVE, RESTORE or RETT instruction. Each bit in the WIM corresponds to a window. The register may be written by WRWIM and read by RDWIM instructions. Bits corresponding to nonexistent windows read as zeroes, and values written to them are ignored. The Trap Base Register (TBR) contains three fields that generate the address of the trap handler when a trap occurs. The first is the Trap Base Address (TBA), which is controlled by software. It contains the most significant 20 bits of the trap table address. The TBA field can be written by the WRTBR instruction. The trap type (tt) field is an 8-bit field that is written by the processor at the time of a trap and retains its value until the next trap. It provides an offset into the trap table. The WR
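The way the TBR fields combine into a trap handler address can be sketched as follows. The 20-bit TBA and 8-bit tt widths come from the text above; the placement of tt in bits 11-4 with four zero bits below it is the layout defined by the SPARC architecture and is stated here as an assumption, since the excerpt is cut off before describing the third field. The function name is hypothetical.

```python
def trap_handler_address(tba: int, tt: int) -> int:
    """Combine the TBR fields into the address of a trap table entry.

    tba: the 20 most significant bits of the trap table address.
    tt:  the 8-bit trap type written by the processor.
    Assumption: the low four bits are zero, so each trap table entry
    is 16 bytes (four instructions) apart.
    """
    if not 0 <= tba < (1 << 20):
        raise ValueError("TBA is a 20-bit field")
    if not 0 <= tt < (1 << 8):
        raise ValueError("tt is an 8-bit field")
    return (tba << 12) | (tt << 4)
```

With these assumptions, a trap table based at 0x00001000 (TBA = 0x00001) and trap type 0x80 would give a handler address of 0x00001800.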
151. ting point unit, cache memory management controllers and cache RAMs. In this chapter the integer unit as well as the floating-point unit will be referred to by the name SPARC. 2.5.1 SPARC instruction set. SPARC defines 55 basic integer instructions, 14 basic floating-point instructions and two coprocessor operate instruction formats. The instructions fall into five basic categories: load/store, arithmetic/logical/shift, control transfer, read/write control register, and floating-point operate/coprocessor operate. Load and store instructions are the only way to access memory or external registers. Addresses are calculated using the contents of two registers, or one register and a constant. The destination may be an integer unit, floating-point unit or coprocessor register, which either supplies or receives the data. SPARC employs a supervisor/user mode model of operation. The state determines which address space is accessed with the ASI bits (see below) and whether or not privileged instructions may be used. Privileged instructions restrict control register access to supervisor software, preventing user programs from accidentally altering the state of the machine. Whenever an address is sent to the address bus, the processor also generates 8 bits of address space identifier (ASI). The ASI pins identify for the external system which of the 256 possible address spaces is to be accessed. The address space identifier is intended for use by
152. tion; cp exception; data access exception; tag overflow (tt 10); trap instruction (tt 128-255); interrupt level 15 (tt 31); interrupt level 14 (tt 30); interrupt level 13 (tt 29); interrupt level 12 (tt 28); interrupt level 11 (tt 27); interrupt level 10 (tt 26); interrupt level 9 (tt 25); interrupt level 8 (tt 24); interrupt level 7 (tt 23); interrupt level 6 (tt 22); interrupt level 5 (tt 21); interrupt level 4 (tt 20); interrupt level 3 (tt 19); interrupt level 2 (tt 18); interrupt level 1 (tt 17). Table 2.22: SPARC trap vector table.
2.6 INMOS T800 transputer. The Transputer is a family of 16-bit and 32-bit processors. It is a RISC designed for multiprocessor applications. The architecture allows multiprocessor networks of arbitrary size and topology to be built. A word-length-independent architecture allows the same software to run on any Transputer. Inmos has developed OCCAM, a language that provides a model for concurrency and communication for all Transputers. The Transputer has a stack-oriented instruction set. Most of the instructions operate on the top of an evaluation stack. It has extensive hardware support for concurrency, and special communication links supporting large multiprocessor systems. The IMS T800 is a 32-bit microcomputer with a 64-bit floating-point unit and graphics support. It has 4 KBytes of on-chip RAM, a configurable memory interface and four standard INMOS communication links. 2.6.1 T800 data formats. The OCCAM model provides 7 different data formats: 1. BOOL is a true or fals
153. tion set
2.7.2 THOR data types
2.7.3 THOR instruction formats and addressing modes
2.7.4 THOR registers
2.7.5 THOR processing states
2.8 Conclusions
3 Real Time System requirements
3.1 Subprogram Calls
3.1.1 MC88100 register conventions
3.1.2 I80960KB register conventions
3.1.3 Am29000 register conventions
3.1.4 MIPS R2000 register conventions
3.1.5 SPARC register conventions
3.1.6 T800, THOR
3.2 Deviation from normal execution
3.2.1 MC88100
3.2.2 I80960KB
3.2.3 Am29000
3.2.4 MIPS R2000
3.2.5 SPARC
3.2.6 T800
3.2.7 THOR
3.3 Task Switch
3.4 Real Time System Support
3.4.1 MC88100
3.4.2 i80960
3.4.3 Am29000
3.4.4 R2000
3.4.5 SPARC
154. tion system, such as for process control etc., is an embedded system. Throughout this thesis the term real-time system will be used in the meaning of an embedded hard real-time system. During the last decade, RISC (Reduced Instruction Set Computer) processors, introduced mainly in workstation applications, have brought excellent performance at low cost. In real-time system design the question arises: how do RISC processors comply with the specific demands of such a system? This thesis describes seven RISC processors from an architectural point of view. Their ability to perform in a real-time system is elaborated and reported. Finally, real-time system hardware considerations are made from six different designs using three different processors. The subject will be treated as follows: chapter 1 recapitulates the development path leading to today's RISC architectures. In chapter 2, different processors are described in detail from an architectural point of view. Chapter 3 gives a thorough discussion of real-time system requirements and how the studied processors meet these demands. A real-time system's hardware requirements tend to degrade the total system performance, which is the reason why hardware considerations are emphasised in chapter 4. Chapter 5 gives concluding remarks. Seven different processors have been selected for this study. One selection criterion was to include RISC processors commonly used today. The following selection w
155. to execute them. 2.7.2 THOR data types. Different instructions operate on one or more of the following data types: 32-bit integer (unsigned, signed); 32-bit IEEE 754 single-precision floating point. 2.7.3 THOR instruction formats and addressing modes. There are five different instruction formats (Table 2.23). The format determines the instruction length in bytes and how to interpret the parameter, if present. A 16-bit encoded instruction is designated 2. The format designated 2a is still encoded in 16 bits but includes a parameter P, which is interpreted as a two's complement value (-128 to 127). The format 2b is identical to 2a except for the interpretation of the parameter P: in this format it is interpreted as a binary value (0 to 255). The format 4a is encoded in 32 bits and contains a parameter which is interpreted as a two's complement number (-2… to 2… - 1). The format 4b is identical to 4a except for the interpretation of the parameter P: in this format it is interpreted as a binary value (0 to 2… - 1). Table 2.23: THOR instruction formats (field layouts: opcode; opcode, ext; opcode, parameter; opcode, parameter). All instructions with operands use the stack top as implicit source and/or destination operand effective address. There are five different addressing modes: stack relative, program counter relative, indirect, immediate and register. Stack Relative addressing mo
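The difference between formats 2a and 2b described above is only in how the same 8-bit parameter field is read. A minimal sketch of that interpretation follows; the function name and the string labels used to select the format are hypothetical, and the 32-bit formats are left out because their parameter width is not recoverable from the text.

```python
def decode_parameter(p: int, fmt: str) -> int:
    """Interpret the 8-bit parameter P of a 16-bit THOR instruction.

    fmt "2a": P is a two's complement value (-128 to 127).
    fmt "2b": P is a plain binary value (0 to 255).
    """
    if not 0 <= p <= 0xFF:
        raise ValueError("P is an 8-bit field")
    if fmt == "2b":
        return p
    if fmt == "2a":
        # Bit 7 is the sign bit in two's complement.
        return p - 0x100 if p & 0x80 else p
    raise ValueError("only formats 2a and 2b carry an 8-bit parameter here")
```

The same raw field 0xFF therefore reads as -1 in format 2a and as 255 in format 2b.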
156. to the stack cache. The entries in the stack cache are likely to remain there for re-use, because the dynamic nesting depth of activated procedures tends to remain near a given depth for long periods of time. As a result, the size of the run-time stack does not change very much over long intervals of program execution. Since activation records are allocated and de-allocated within the local registers, most procedure linkage can occur without external references. Also, during procedure execution most data accesses occur without external references, because the scalar data in an activation record is the most frequently referenced. Activation records are typically small, so the 128 locations in the local register file can hold many activation records from the run-time stack. 3.1.4 MIPS R2000 register conventions. The MIPS R2000 assembler denotes the 32 general-purpose registers $0, $1, ..., $31. The register usage is as follows:
• Register 0 always contains zero, which is used in instructions requiring the constant zero as an operand.
• Register 1 is reserved for the assembler.
• Registers 2 and 3 are used for expression evaluation and to hold integer function results. They are also used to pass the static link when calling nested procedures.
• Registers 4 through 7 are used to pass the first 4 words of integer-type actual arguments; their values are not preserved across procedure calls.
• Registers 8 through 15 are used for temporary storage
157. truction count. This is a very strong implication for optimizing compilers. For implementation, a constant chip area should be maintained. Simple decoding logic saves chip area and implies simple instructions. Uniform instruction execution demands uniform instruction fetch. One instruction should be fetched in each cycle, but disturbances from data traffic make this difficult to achieve. Since the memory bandwidth is assumed to be constant, this is another argument for a large on-chip register file. We may thus conclude: RISC high performance relies heavily on low cycle time and single-cycle execution, which implies a Reduced Instruction Set with simple, uniform instructions and efficient optimising compilers. 1.5 Early RISCs. The RISC concept was in fact adopted very early by Seymour Cray in an effort to design a very fast vector processor. The CDC 6600 was register based, and all operations used data from registers local to the arithmetic units. The instruction set was simple and execution was pipelined. Cray realized that all operations must be simplified for maximal performance: one bottleneck in processing may cause all other operations to degrade performance (Sie82). Figure 1.1: A RISC Design Decision Graph. Starting in the mid-1970s, the IBM 801 research team investigated the effect of a small instruction set and optimizing compiler design on computer performance. They performed dynamic studies of the frequency of use o
158. tructions uses this format. The MEMB format offers the option of including a 32-bit displacement, contained in a second word of the instruction. Bit 12 of the first word of the instruction determines whether the format is MEMA (clear) or MEMB (set). 1. MEMA format. For both formats the opcode field is 8 bits long. The SRC/DST field specifies a global or local register. For load instructions, the SRC/DST field specifies the destination register for a word loaded into the processor from memory or, for operands larger than one word, the first of successive destination registers. For store instructions, this field specifies the register or group of registers that contain the source operand to be stored in memory. The mode bit (or, for MEMB, mode bits) determines the address mode used for the instruction. The MEMA format provides two addressing modes: absolute offset, and register indirect with offset. The offset field specifies an unsigned byte offset from 0 to 4095. The ABASE field specifies a global or local register that contains an address in memory. The address is interpreted as either a virtual address or a physical address, depending on whether the processor is operating in virtual addressing or physical addressing mode, respectively. For the absolute offset addressing mode (the MD bit is clear), the processor interprets the offset field as an offset from byte 0 of the current address space. The ABASE field is ignored. The use of this address
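The MEMA decoding rules above can be sketched in a few lines. The 8-bit opcode, the role of bit 12 and the meaning of the MD bit come from the text; the exact bit positions used below for SRC/DST (bits 23-19), ABASE (bits 18-14), MD (bit 13) and the 12-bit offset (bits 11-0) are assumptions based on the usual 80960 encoding and should be checked against the Intel manual. The function names and the 32-entry register list are hypothetical.

```python
def decode_mem_word(word: int) -> dict:
    """Split the first word of an 80960 MEM-format instruction into fields."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must be 32 bits")
    fields = {
        "opcode": (word >> 24) & 0xFF,                         # 8-bit opcode
        "format": "MEMB" if (word >> 12) & 1 else "MEMA",      # bit 12 selects the format
    }
    if fields["format"] == "MEMA":
        fields["src_dst"] = (word >> 19) & 0x1F  # assumed position
        fields["abase"] = (word >> 14) & 0x1F    # assumed position
        fields["md"] = (word >> 13) & 1          # assumed position
        fields["offset"] = word & 0xFFF          # unsigned, 0 to 4095
    return fields


def mema_effective_address(fields: dict, registers: list) -> int:
    """Effective address for the two MEMA addressing modes.

    MD clear: absolute offset (ABASE is ignored).
    MD set:   register indirect with offset.
    """
    if fields["format"] != "MEMA":
        raise ValueError("only MEMA is handled in this sketch")
    base = registers[fields["abase"]] if fields["md"] else 0
    return (base + fields["offset"]) & 0xFFFFFFFF
```

Only the MEMA case is decoded because the excerpt does not describe the MEMB mode bits in enough detail to sketch them.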
159. ts. Several efforts have been made to decrease the semantic gap between high-level programming languages and the instruction set. This has been done by implementing instructions that are close to the high-level statements. Such instructions tend to be extremely complex and not applicable to every possible language. Thus it turns out that the compiler cannot make use of these special instructions. Meanwhile, these instructions require a lot of hardware, which in many cases increases the processor cycle time. To make machines run faster, designers have moved functions from assembly program to microcode, and further on from microcode to hardware. By adding extra hardware in the decoding unit, one can get to a point where the machine cycle has to be lengthened. Thus adding a certain instruction may slow down the execution of every instruction in the set. The development tools and methods used in the design of large VLSI circuits are a support for the design of large architectures. Microcoding is a particularly interesting technique that encourages complex instructions. It is a structured way of implementing, creating and modifying the algorithms that control the execution of complex instructions in the processor. The steady growth of CISC functions is further supported by large micromemories. It is easy to add a new instruction as long as there is enough room in the micromemory. 1.3 Considerations that lead to the RISC. At least historica
160. ull operation. 2. Register with 10-bit immediate addressing is used in bit-field instructions. Data in rS1 is processed and the result is placed in rD. The 10-bit immediate value represents two 5-bit fields specifying the bit-field width and offset, respectively.
Table 2.5: MC88100 triadic register and 10-bit immediate instruction formats. Triadic register: OPCODE (bits 31-26), D, S1, SUBOPCODE, S2. 10-bit immediate: OPCODE (bits 31-26), D, S1, SUBOPCODE, IMM10.
Table 2.6: MC88100 16-bit immediate and control register addressing instruction formats. Fields listed: OPCODE, D, S1, OP, SFU, CRS/CRD, S2.
3. Register with 16-bit immediate addressing is used by arithmetic and logical instructions requiring a 16-bit unsigned immediate value. This value is zero-extended before being processed by any arithmetic instruction. 4. Control Register Addressing is used to reference the general control and FPU control registers. General-purpose registers may be loaded from, stored to or exchanged with the control registers. The CRS/CRD field specifies the control register, which is a source register in the case of a load instruction and a destination register otherwise. The D field specifies a general-purpose register that is loaded with the contents of the selected control register. This field is ignored in store operations. The S1 field specifies the general-purpose register whose conte
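The 10-bit immediate of the bit-field instructions packs two 5-bit values. As a small sketch of that split, the code below assumes the width occupies the upper five bits and the offset the lower five, following the order in which the text names them; the function name is hypothetical and the bit order should be verified against the MC88100 manual.

```python
def split_imm10(imm10: int) -> tuple:
    """Split a 10-bit immediate into (width, offset), each 5 bits wide.

    Assumption: width is in bits 9-5 and offset in bits 4-0.
    """
    if not 0 <= imm10 < (1 << 10):
        raise ValueError("IMM10 is a 10-bit field")
    width = (imm10 >> 5) & 0x1F
    offset = imm10 & 0x1F
    return width, offset
```

Under this assumption an immediate of (8 << 5) | 3 would describe an 8-bit field starting at bit offset 3.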
161. ultiple store as well as a multiple load sequence. Since we are interested in the architecture's impact only, we assume no wait-state penalty from slow memory devices.
B.1 MC88100
B.1.1 PCB search
PCB search exits with: task identification number (T_ID) in r4; task priority (T_PRI) in r3; pointer to the highest-priority task's PCB in r5. The trailing number on each line is the figure given in the source listing.
        lda.h  r2, r0, PCBOPTR    ; address of first PCB in r2               1
        add    r5, r0, r2         ; ptr to hi priority task                  1
        add    r1, r0, 10         ; number of PCB's to search                1
        add    r3, r0, 0          ; initial priority lowest                  1
        add    r4, r0, 0          ; initial PCB ID undefined                 1
    L1: ld.b   r6, r2, T_PRI      ; priority to r6 (memory access)           40
        cmp    r7, r3, r6         ; compare priorities, result in r7         10
        bb1    HS_BIT, r7, L2     ; branch if previous is greater            19
        add    r3, r0, r6         ; substitute new priority                  11
        lda.h  r4, r2, T_ID       ; remember task ID (memory access)         14
        add    r5, r0, r2         ; remember PCB ptr                         1
    L2: lda.h  r2, r2, T_NEXT     ; get next PCB pointer (memory access)     40
        sub    r1, r1, 1          ; exit                                     10
        bcnd   gt0, r1, L1        ; when all PCB's searched                  18
B.1.2 Register Store
Figure B.2 outlines pipeline occupation during multiple store. Cycles 4-6 are memory data accesses that prevent instruction fetch; therefore the MC88100 will finish 3 stores within every sixth cycle, and so storing 31 registers will use 31 × 6 / 3 = 62 cycles.
Register Restore
From figure B.3 we conclude that cycles 4-6 are memory data accesses that prevent instruction fetch; therefore the MC88100 will finish 3 loads within every tenth cycle. Du
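The PCB search loop and the store-cycle estimate above can be restated in a high-level form. This is only an illustrative sketch, assuming a singly linked list of PCBs with the three fields the listing refers to (T_ID, T_PRI, T_NEXT); the class and function names are hypothetical, and the end-of-list check is an addition that the assembly listing does not have.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PCB:
    task_id: int                   # T_ID in the listing
    priority: int                  # T_PRI in the listing
    next: Optional["PCB"] = None   # T_NEXT in the listing


def find_highest_priority(first: Optional[PCB], max_pcbs: int = 10):
    """Walk up to max_pcbs PCBs and return (task_id, priority, pcb).

    Mirrors the listing: priority starts at the lowest value (0) and is
    replaced only when a strictly greater priority is found.
    """
    best_id, best_priority, best_pcb = 0, 0, first
    pcb, remaining = first, max_pcbs
    while pcb is not None and remaining > 0:
        if pcb.priority > best_priority:
            best_id, best_priority, best_pcb = pcb.task_id, pcb.priority, pcb
        pcb = pcb.next
        remaining -= 1
    return best_id, best_priority, best_pcb


def multiple_store_cycles(registers: int = 31,
                          stores_per_group: int = 3,
                          cycles_per_group: int = 6) -> int:
    """Cycle estimate from the text: 3 stores finish every 6 cycles."""
    return registers * cycles_per_group // stores_per_group
```

With the default arguments, multiple_store_cycles() reproduces the 31 × 6 / 3 = 62 cycles quoted for storing 31 registers.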
162. unctionality equals ordinary interrupt signal lines. There are furthermore four different SIGNAL OUT pins (SO0-SO3). Each SIGNAL IN corresponds to a specific task, so that when a SIGNAL IN occurs the hardware will ensure that the corresponding task is scheduled next. This mechanism provides a very rapid response to external events, and indeed supports multiprocessor configurations where different tasks may run in separate processors and the synchronisation between these tasks is accomplished through the SIGNAL OUT and SIGNAL IN pins. Fast software task scheduling is accomplished by hardware. The chip includes registers intended to hold task-related data, i.e. the PCB. The mechanism ensures that the highest-priority process will be scheduled next. Priorities range between 1 and 32. It further ensures that a delayed task receives immediate attention at the end of the delay. THOR thus does not need a software kernel to perform process scheduling. Due to the stack architecture of THOR there is very little context to be saved, and so it is reasonable to assume a process switch time below 1 microsecond.
Figure B.2: MC88100 multiple store sequence (pipeline occupation, cycle by cycle).
Figure B.3: MC88100 multiple load sequence (pipeline occupation, cycle by cycle).
Pipeline occupation, cycle by cycle: fetch 1, fetch 2
163. upervisor storage register 1; supervisor storage register 2; supervisor storage register 3. Table 2.3: MC88100 control registers.
Table 2.4: MC88100 internal registers (name, function). eXecute Instruction Pointer (XIP): contains the address of the instruction that is currently being executed. Next Instruction Pointer (NIP): contains the address of the instruction that is currently being received from memory and decoded by the instruction unit. Fetch Instruction Pointer (FIP): points to the memory location of the next accessed instruction. For sequential execution, FIP = XIP + 4. Jump target addresses are received from the jump instruction operand. Unconditional branch addresses are computed from the XIP and a 26-bit signed displacement, i.e. FIP = XIP + d26. Conditional branch addresses for the branch-taken case are calculated as FIP = XIP + d16. Scoreboard Register: contains a bit corresponding to each register r1-r31. If a bit is set, the corresponding register is currently in use.
Register to Register Instructions. Depending on the instruction, this format provides four addressing modes. 1. Triadic Register Addressing uses three five-bit fields to specify two source register fields (S1, S2) and a destination register field (D). The OPCODE field directs processing to the integer unit or the floating-point unit. Not every instruction uses all three register selection fields. For arithmetic and logical instructions there is a SUBOPCODE field which specifies the f
