Input operand size and hi/low word selection control in data

1. ADDA ... For bit exact implementation of sum of a_i * b_i >> k (quite commonly used in TrueSpeech) the standard Piccolo code would be:
MUL t1, a_0, b_0, ASR k
ADD ans, ans, t1
MUL t2, a_1, b_1, ASR k
ADD ans, ans, t2
There are two problems with this code: it is too long, and the adds are not to 48 bit precision, so guard bits can't be used. A better solution is to use ADDA:
MUL t1, a_0, b_0, ASR k
MUL t2, a_1, b_1, ASR k
ADDA ans, t1, t2, ans
This gives a 25% speed increase and retains 48 bit accuracy.
Add/Subtract in Parallel instructions perform addition and ... An S before the command indicates saturation. The assembler also supports
CMNCMN <dest>, <src1_32>, <src2_32> {,<scale>}
CMNCMP <dest>, <src1_32>, <src2_32> {,<scale>}
CMPCMN <dest>, <src1_32>, <src2_32> {,<scale>}
generated by the standard instructions with no write back.
Flags:
C is set if there is a carry out of bit 15 when adding the two upper sixteen bit halves.
Z is set if the sum of the upper sixteen bit halves is 0.
N is set if the sum of the upper sixteen bit halves is negative.
V is set if the signed 17 bit sum of the upper sixteen bit halves will not fit into 16 bits (post scale).
SZ, SN, SV and SC are set similarly for the lower 16 bit halves.
Reason for inclusion: The parallel Add and S
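The flag rules for the parallel add can be sketched in Python. This is only an illustration under stated assumptions: the function name and the dictionary layout are hypothetical, no scale is applied, and N and Z are taken from the truncated 16-bit result of each half.

```python
def parallel_add_flags(a: int, b: int):
    """Add two 32-bit words as independent 16-bit halves (hypothetical helper).

    Returns (result_word, flags). Primary flags N, Z, C, V describe the upper
    halves; secondary flags SN, SZ, SC, SV describe the lower halves.
    Assumption: no scale is applied to the result.
    """
    def to_signed16(x: int) -> int:
        return x - 0x10000 if x & 0x8000 else x

    def add_half(x: int, y: int):
        raw = x + y                                   # unsigned 17-bit sum
        signed = to_signed16(x) + to_signed16(y)      # signed 17-bit sum
        result = raw & 0xFFFF
        return result, {
            "C": raw > 0xFFFF,                        # carry out of bit 15
            "Z": result == 0,
            "N": bool(result & 0x8000),
            "V": not (-0x8000 <= signed <= 0x7FFF),   # does not fit in 16 bits
        }

    hi, hi_flags = add_half((a >> 16) & 0xFFFF, (b >> 16) & 0xFFFF)
    lo, lo_flags = add_half(a & 0xFFFF, b & 0xFFFF)
    flags = dict(hi_flags)
    flags.update({"S" + name: value for name, value in lo_flags.items()})
    return (hi << 16) | lo, flags
```

For example, adding 0x7FFF0001 and 0x0001FFFF overflows the upper halves (V set) while the lower halves wrap to zero with a carry (SZ and SC set).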
2. Only a 32 bit value can be read, with the upper and lower halves optionally swapped.
Source operand 2 can be one of the following formats:
<src2> will be a shorthand for three options:
a source register of the form [Rn|Rn.l|Rn.h|Rn.x], plus a scale (<scale>) of the final result;
an optionally shifted eight bit constant (<immed_8>), but no scale of the final result;
a six bit constant (<immed_6>), plus a scale (<scale>) of the final result.
<src2_maxmin> is the same as <src2> but a scale is not permitted.
<src2_shift>: shift instructions provide a limited subset of <src2>. See above for details.
<src2_par>: as for <src2_shift>.
For instructions which specify a third operand:
<acc> is short for any of the four accumulator registers [A0|A1|A2|A3]. All 48 bits are read. No refill can be specified.
... more physical registers than can be specified by the ...
The destination register has the format:
<dest>, which is short for [Rn|Rn.l|Rn.h|.l]. With no "." extension the full register is written (48 bits in the case of an accumulator). In the case where no write back to the register is required, the register used is unimportant. The assembler supports the omission of a destination register to indicate that write back is not required, or ".l" to indicate that no writeback is required but flags should be
[U.S. Pat. No. 5,881,259, columns 35 and 36]
3. ... <src2> {,<scale>}
1 SMUL <dest>, <src1_16>, <src2> {,<scale>}
Flags: See section above.
Reasons for inclusion: Signed and saturated multiplies are required by many processes.
Register List Operations are used to perform actions on a set of registers. The Empty and Zero instructions are provided for resetting a selection of registers prior to, or in between, routines. The Output instruction is provided to store the contents of a list of registers to the output FIFO.
Mnemonics:
000 EMPTY <register_list>
001 ZERO <register_list>
010 Unused
011 Unused
100 OUTPUT <register_list> {,<scale>}
101 OUTPUT <register_list> {,<scale>}
110 SOUTPUT <register_list> {,<scale>}
111 SOUTPUT <register_list> {,<scale>}
Flags: Unaffected.
Examples: EMPTY A0 A1 0 3; ZERO Y0 Y3; OUTPUT X0 Y1
[Encoding diagram: bits 31 to 0; fields include OPC and REGISTER_LIST_16.]
OPC specifies the type of instruction.
Action (OPC):
000: for (k = 0; k < 16; k++) if bit k of the register list is set then register k is marked as being empty.
001: for (k = 0; k < 16; k++) if bit k of the register list is set then register k is set to contain 0.
010: Undefined.
011: Undefined.
100: for (k = 0; k < 16; k++) if bit k of the register list is set then (register k >> scale) is written to the output FIFO.
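The three actions spelled out above (OPC 000, 001 and 100) can be modelled with a short Python sketch. The function name, the list-based register file and the FIFO list are hypothetical stand-ins; the remaining OUTPUT and SOUTPUT variants are not modelled because the excerpt does not describe how they differ.

```python
def register_list_op(opc: int, register_list: int, regs: list, empty: list,
                     fifo: list, scale: int = 0) -> None:
    """Apply a register-list operation to a 16-register model (illustrative).

    regs  : 16 integer register values, modified in place
    empty : 16 booleans, True when a register is marked empty
    fifo  : list standing in for the output FIFO
    """
    if opc not in (0b000, 0b001, 0b100):
        raise ValueError("OPC not modelled in this sketch")
    for k in range(16):
        if not (register_list >> k) & 1:
            continue                          # bit k clear: register k untouched
        if opc == 0b000:                      # EMPTY
            empty[k] = True
        elif opc == 0b001:                    # ZERO
            regs[k] = 0
        else:                                 # OUTPUT
            fifo.append(regs[k] >> scale)     # arithmetic shift on Python ints
```

A single 16-bit mask therefore selects any subset of the registers, which is why one instruction can reset or flush several registers between routines.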
4. Piccolo must be halted before a PMIR can be performed. The MCR encoding of this opcode is: [encoding diagram]
This section discusses the Piccolo instruction set which controls the Piccolo data path. Each instruction is 32 bits long. The instructions are read from the Piccolo instruction cache.
Decoding the instruction set is quite straightforward. The top 6 bits (26 to 31) give a major opcode, with bits 22 to 25 providing a minor opcode for a few specific instructions. Bits shaded in grey are currently unused and reserved for expansion (they must contain the indicated value at present).
There are eleven major instruction classes. This does not fully correspond to the major opcode field in the instruction, for ease of decoding some sub classes.
[Encoding table: bits 31 to 0; fields include REGISTER_LIST_16.]
The instructions in the above table have the following names: Standard Data Operation; Logical Operation; Conditional Add/Subtract; Undefined; Shifts; Select; Undefined; Parallel Select; Multiply Accumulate; Undefined; Multiply Double; Undefined; Move Signed Immediate; Undefined; Repeat; Repeat; Register List Operations; Branch; Renaming Parameter Move; Halt/Break.
The format for each class of instructions is described in detail in the following sections. The source and destination operand field
5. [Figure residue from drawing Sheet 7 of 7; labels illegible.] U.S. Patent, Mar. 9, 1999, Sheet 7 of 7, 5,881,259.
INPUT OPERAND SIZE AND HI/LOW WORD SELECTION CONTROL IN DATA PROCESSING SYSTEMS
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to data processing systems. More particularly, this invention relates to data processing systems having a plurality of registers for storing data words to be manipulated by an arithmetic logic unit operating under control of program instruction words.
2. Description of the Prior Art
It is known to provide program instructions that specify whether an operation is to be performed on input operands contained in one register, or an input operand contained in two registers but treated as a single input operand.
SUMMARY OF THE INVENTION
Viewed from one aspect the present invention provides apparatus for data processing, said apparatus comprising: (i) a plurality of registers for storing data words to be manipulated, each of said registers having at least an N bit capacity; and (ii) an arithmetic logic unit responsive to program instruction words to perform arithmetic logic
6. set as though the result is a 16 bit quantity. ^ denotes that the value is written to the output FIFO.
<scale> represents a number of arithmetic scales. There are fourteen available scales: ASR 0, 1, 2, 3, 4, 6, 8, 10; ASR 12 to 16; LSL 1.
<immed_8> stands for an unsigned 8 bit immediate value. This consists of a byte rotated left by a shift of 0, 8, 16 or 24. Hence values 0xYZ000000, 0x00YZ0000, 0x0000YZ00 and 0x000000YZ can be encoded for any YZ. The rotate is encoded as a 2 bit quantity.
<immed_6> stands for an unsigned 6 bit immediate.
<PARAMS> is used to specify register re-mapping and has the following format: <BANK><BASEINC> n<RENUMBER> w<BASEWRAP>
<BANK> can be [X|Y|Z]
<BASEINC> can be [++|+1|+2|+4]
<RENUMBER> can be [0|2|4|8]
<BASEWRAP> can be [2|4|8]
The expression <cond> is shorthand for any one of the following condition codes. Note that the encoding is slightly different from the ARM since the unsigned LS and HI codes have been replaced by more useful signed overflow/underflow tests. The V and N flags are set differently on ...
Arithmetic instructions can be divided into two types: parallel and full width. The full width instructions only set the primary flags, whereas the parallel operators set the primary and secondary flags based on the upper and lower 16 bit halves of the result. The N, Z and V flags are calc
7. [Encoding diagram: COND, 1110, 0000, ..., PICCOLO1, BANK.]
where BANK[3:0] is used to turn off the unaligned mode on a per bank basis. If BANK[1] is set, unaligned mode on bank X is turned off. BANK[2] and BANK[3] turn off unaligned mode on banks Y and Z if set, respectively. N.B. This is a CDP operation.
MPR is encoded as: [encoding diagram]
MPRW is encoded as: [encoding diagram]
where DEST is 1 to 3 for the destination register X0, Y0, Z0.
The output FIFO can hold up to eight 32 bit values. These are transferred from Piccolo by using one of the following ARM opcodes:
STP{<cond>}<16/32> [Rn]{!}, #<size>
MRP Rn
The first saves <size>/4 words from the output FIFO to the address given by the ARM register Rn, indexing Rn if the ! is present. To prevent deadlock, <size> must not be greater than the size of the output FIFO (8 entries in this implementation). If the STP16 variant is used, endian specific behaviour may occur to the data returned from the memory system.
The MRP instruction removes one word from the output FIFO and places it in ARM register Rn. As with MPR, no endian specific operations are applied to the data.
8. The ARM encoding for STP is: [encoding diagram]
where N selects between STP32 (1) and STP16 (0). For the definitions of the P, U and W bits, refer to
... memory or peripherals and must therefore take care to load 16 bit packed data in the correct manner. Piccolo (i.e. the DSP adapted coprocessor), like the ARM (e.g. the ARM7 microprocessors produced by Advanced RISC Machines Limited of Cambridge, United Kingdom), has a BIGEND configuration pin which the programmer can control, perhaps with a programmable peripheral. Piccolo uses this pin to configure the input reorder buffer and output FIFO.
When the ARM loads packed 16 bit data into the reorder buffer it must indicate this by using the 16 bit form of the LDP instruction. This information is combined with the state of the BIGEND configuration input to place data into the holding latches and reorder buffer in the appropriate order. In particular, when in big endian mode the holding register stores the bottom 16 bits of the loaded word, and is paired up with the top 16 bits of the next load. The holding register contents always end up in the bottom 16 bits of the word transferred into the reorder buffer.
The output FIFO may contain either packed 16 bit or 32 bit data. The programmer must use the correct form of the STP instruction so that Piccolo can ensure that the 16 bit
9. ... Y0.l, A2 ; a2 += d2*c0
MULA A3, X1.h, Y0.l^, A3 ; a3 += d3*c0, and load c4
NEXT ; go round loop and advance remapping
To illustrate how the multiply accumulate instructions operate, we will consider the first four MULA instructions. The first instruction multiplies the data value within the first, or lower, 16 bits of the X bank register zero with the lower 16 bits within Y bank register zero, and adds the result to the accumulator register A0. At the same time the lower 16 bits of the X bank register zero are marked by a refill bit, this indicating that that part of the register can now be refilled with a new data value. It is marked in this way since, as will be apparent from FIG. 7, once data item d0 has been multiplied by the coefficient c0 (this being represented by the first MULA instruction), then d0 is no longer required for the rest of the block filter instruction and so can be replaced by a new data value.
The second MULA instruction then multiplies the second, or higher, 16 bits of the X bank register zero with the lower 16 bits of the Y bank register zero, this representing the multiplication d1×c0 shown in FIG. 7. Similarly, the third and fourth MULA instructions represent the multiplications d2×c0 and d3×c0, respectively. As will be apparent from FIG. 7, once these four calculations have been performed, coefficient c0 is no longer required and so the register Y0.l is marked by a refill bit to enable it to be overwritten
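The arithmetic performed by the MULA sequence can be written out in Python as a reference model. This sketch ignores registers, refill and 48 bit accumulator width entirely; the function name and list arguments are hypothetical, and it assumes each coefficient c_i is multiplied by the four data items d_i to d_(i+3), as the walkthrough above describes for c0.

```python
def block_filter(coeffs: list, data: list) -> list:
    """Accumulate four filter outputs: a_j = sum over i of c_i * d_(i+j), j = 0..3."""
    if len(data) < len(coeffs) + 3:
        raise ValueError("need at least len(coeffs) + 3 data items")
    acc = [0, 0, 0, 0]                       # a0..a3, zeroed before the loop
    for i, c in enumerate(coeffs):
        for j in range(4):                   # one MULA per accumulator
            acc[j] += c * data[i + j]
        # c is not needed after these four products; data[i] is not needed
        # after the first one, which is what the refill marking exploits.
    return acc
```

With coefficients [1, 2] and data [1, 2, 3, 4, 5], the four accumulators end up holding 5, 8, 11 and 14.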
10. <size> must be at most 32. In many circumstances <size> will be smaller than this limit to avoid deadlock. The <16/32> field indicates whether the data being loaded should be treated as 16 bit data, and endianness specific action taken (see below), or as 32 bit data.
Note 1: In the following text, when referring to LDP or LDPW this refers to both the 16 bit and 32 bit variants of the instructions.
Note 2: A "word" is a 32 bit chunk from memory, which may consist of two 16 bit data items or one 32 bit data item.
The LDP instruction transfers a number of data items, marking them as destined for a full register. The instruction will load <size>/4 words from address Rn in memory, inserting them into the ROB. The number of words that can be transferred is limited by the following:
The quantity <size> must be a non-zero multiple of 4.
<size> must be less than or equal to the size of the ROB for a particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions).
The first data item transferred will be tagged as destined for <dest>, the second as destined for <dest>+1 and so on (with wrapping from Z3 to A0). If the ! is specified then the register Rn is incremented by <size> afterwards.
If the LDP16 variant is used, endian specific action is performed on the two 16 bit halfwords forming the 32 bit data
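The size limits and destination tagging for LDP can be checked with a short Python sketch. The register ordering A0..A3, X0..X3, Y0..Y3, Z0..Z3 and the names below are assumptions made for illustration; only the wrap from Z3 to A0, the multiple-of-4 rule and the 8 word ROB come from the text.

```python
# Assumed register order; the text only states that Z3 wraps to A0.
REGISTERS = [f"{bank}{i}" for bank in "AXYZ" for i in range(4)]
ROB_WORDS = 8   # reorder buffer size in the first version, per the text


def ldp_destinations(dest: str, size: int) -> list:
    """Return the register tag given to each word loaded by LDP <dest>, #<size>."""
    if size <= 0 or size % 4 != 0:
        raise ValueError("size must be a non-zero multiple of 4 bytes")
    words = size // 4
    if words > ROB_WORDS:
        raise ValueError("transfer does not fit in the reorder buffer")
    start = REGISTERS.index(dest)
    return [REGISTERS[(start + i) % len(REGISTERS)] for i in range(words)]
```

For instance, a 12 byte load starting at Z2 tags its three words for Z2, Z3 and A0.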
11. telephones, in which radio signals are received and transmitted that require decoding and encoding (typically using convolution, transform and correlation operations) to and from an analogue sound signal. Another example is disk driver controllers, in which the signals recovered from the disk heads are processed to yield head tracking control.
In the context of the above, there follows a description of a digital signal processing system based upon a microprocessor core (in this case an ARM core from the range of microprocessors designed by Advanced RISC Machines Limited of Cambridge, United Kingdom) cooperating with a coprocessor. The interface of the microprocessor and the coprocessor, and the coprocessor architecture itself, are specifically configured to provide DSP functionality. The microprocessor core will be referred to as the ARM and the coprocessor as the Piccolo. The ARM and the Piccolo will typically be fabricated as a single integrated circuit that will often include other elements (e.g. on-chip DRAM, ROM, D to A and A to D convertors etc.) as part of an ASIC.
Piccolo is an ARM coprocessor; it therefore executes part of the ARM instruction set. The ARM coprocessor instructions allow ARM to transfer data between Piccolo and memory (using Load Coprocessor, LDC, and Store Coprocessor, STC, instructions), and to transfer ARM registers to and from Piccolo (using move to coprocessor, MCR, and move from coprocessor, MRC, instructions). O
12. the logical and physical register values are relative to the particular bank:
if (Logical Register < REGCOUNT)
Physical Register = (Logical Register + Base) MOD REGCOUNT
else
Physical Register = Logical Register
end if
At the end of the loop, before the next iteration of the loop begins, the following update to the base pointer is performed by the base update logic 58:
Base = (Base + BASEINC) MOD BASEWRAP
At the end of a remapping loop, the register remapping will be switched off and all registers will then be accessed as physical registers. In preferred embodiments, only one remapping REPEAT will be active at any one time. Loops may still be nested, but only one may update the remapping variables at any particular time. However, it will be appreciated that, if desired, remapping repeats could be nested.
To illustrate the benefits achieved with regards to code density as a result of employing the remapping mechanism according to the preferred embodiment of the present invention, a typical block filter algorithm will now be discussed. The principles of the block filter algorithm will first be discussed with reference to FIG. 7. As illustrated in FIG. 7, accumulator register A0 is arranged to accumulate the results of a number of multiplication operations, the multiplication operations being the multiplication of coefficient c0 by data item d0, the multiplication of coefficient c1 by data item d1, the multiplication of coefficient c2
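The two formulas above translate directly into Python. The function names are hypothetical; the arithmetic follows the remap and base update rules as stated in this excerpt, with register numbers taken relative to the bank.

```python
def remap(logical: int, base: int, regcount: int) -> int:
    """Map a bank-relative logical register number to a physical one."""
    if logical < regcount:
        return (logical + base) % regcount    # inside the remapped section
    return logical                            # accessed directly


def next_base(base: int, baseinc: int, basewrap: int) -> int:
    """Base pointer update performed at the end of each loop iteration."""
    return (base + baseinc) % basewrap
```

With REGCOUNT = 4, BASEINC = 1 and BASEWRAP = 4, logical registers 0 to 3 rotate through the four physical registers on successive iterations while registers 4 and above are left alone, which is what lets one loop body stand in for four unrolled copies.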
13. [Encoding diagrams: bits 31 to 0; fields include DEST, SRC1, SRC2_MULA and SCALE.]
The field OPC specifies the type of instruction.
Action (OPC):
00: dest = (acc + (src1 * src2)) >> scale
01: dest = (acc - (src1 * src2)) >> scale
In each case the result is saturated before being written to the destination if the Sa bit is set.
Mnemonics:
00 {S}MULA <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}
01 {S}MULS <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}
An S before the command indicates saturation.
Flags: See section above.
OPC specifies the type of instruction.
Action (OPC):
0: dest = SAT((acc + SAT(2 * src1 * src2)) >> scale)
1: dest = SAT((acc - SAT(2 * src1 * src2)) >> scale)
Mnemonics:
0 SMLDA <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}
1 SMLDS <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}
Flags: See section above.
Reasons for inclusion: The MLD instruction is required for G.729 and other algorithms which use fractional arithmetic. Most DSPs provide a fractional mode which enables a left shift of one bit at the output of the multiplier, prior to accumulation or writeback. Supporting this as a specific instruction provides more
14. Base Update block 58. During the first iteration of the instruction loop, the BASESTART signal is passed by the multiplexor 60 to the storage element 66, whereas for subsequent iterations of the loop, the next base pointer value is supplied by the multiplexor 60 to the storage element 66.
The output of the storage element 66 is passed as the current base pointer value to the ReMap logic 56, and is also passed to one of the inputs of an adder 62 within the Base Update logic 58. The adder 62 also receives a BASEINC signal that provides a base increment value. The adder 62 is arranged to increment the current base pointer value supplied by storage element 66 by the BASEINC value, and to pass the result to the modulo circuit 64.
The modulo circuit also receives a BASEWRAP value, and compares this value to the output base pointer signal from the adder 62. If the incremented base pointer value equals or exceeds the BASEWRAP value, the new base pointer is wrapped round to a new offset value. The output of the modulo circuit 64 is then the next base pointer value to be stored in storage element 66. This output is provided to the multiplexor 60, and from there to the storage element 66.
However, this next base pointer value cannot be stored in the storage element 66 until a BASEUPDATE signal is received by the storage element 66 from the loop hardware managing the REPEAT inst
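The gating of the storage element by BASEUPDATE can be modelled as a tiny state machine in Python. The class and method names are hypothetical, and the sketch assumes BASESTART is loaded when the loop is set up and that the modulo circuit behaves as an ordinary modulo operation.

```python
class BaseUpdate:
    """Behavioural model of the base update path (illustrative names only)."""

    def __init__(self, basestart: int, baseinc: int, basewrap: int):
        self.baseinc = baseinc
        self.basewrap = basewrap
        self.stored = basestart          # storage element: first iteration uses BASESTART

    def next_value(self) -> int:
        """Adder followed by the modulo circuit."""
        return (self.stored + self.baseinc) % self.basewrap

    def clock(self, baseupdate: bool) -> int:
        """Latch the next base pointer only when BASEUPDATE is asserted."""
        if baseupdate:
            self.stored = self.next_value()
        return self.stored               # current base pointer seen by the remap logic
```

Holding the pointer steady until BASEUPDATE arrives means every instruction in one loop iteration sees the same base value.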
15. ROT / IMMEDIATE:
00: 0x000000XY
01: 0x0000XY00
10: 0x00XY0000
11: 0xXY000000
The 6 bit Immediate encoding allows the use of a 6 bit unsigned immediate (range 0 to 63), together with a scale applied to the output of the ALU.
The general Source 2 encoding is common to most instruction variants. There are some exceptions to this rule which support a limited subset of the Source 2 encoding or modify it slightly: Select Instructions; Shift Instructions; Parallel Operations; Multiply Accumulate Instructions; Multiply Double Instructions.
Select instructions only support an operand which is a register or a 6 bit unsigned immediate. The scale is not available as these bits are used by the condition field of the instruction.
Shift instructions only support an operand which is a 16 bit register or a 5 bit unsigned immediate between 1 and 31. No scale of the result is available.
In the case of parallel operations, if a register is specified as the source of the operand, a 32 bit read must be performed. The immediate encoding has slightly different meaning for the parallel operations. This allows an immediate to be duplicated onto both 16 bit halves of a 32 bit operand. A slightly restricted range of scales is available for parallel operations.
[Encoding diagram: SRC2_PARALLEL; fields include IMMED_8, IMMED_6 and SCALE_PAR.]
If the 6 bit immediate is used then it is always duplicated 2 are
16. Two opcodes, REPEAT and NEXT, are provided for defining a hardware loop, the NEXT opcode being used merely as a delimiter and not being assembled as an instruction. The REPEAT goes at the start of the loop, and NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body. In preferred embodiments, the REPEAT instruction can include remapping parameters such as the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters to be employed by the register remapping logic 52.
A number of registers can be provided to store remapping parameters used by the register remapping logic. Within these registers, a number of sets of predefined remapping parameters can be provided, whilst some registers are left for the storage of user defined remapping parameters. If the remapping parameters specified with the REPEAT instruction are equal to one of the sets of predefined remapping parameters, then the appropriate REPEAT encoding is used, this encoding causing a multiplexor or the like to provide the appropriate remapping parameters from the registers directly to the register remapping logic. If, on the other hand, the remapping parameters are not the same as any of the sets of predefined remapping parameters, then the assembler will generate a Remapping Parameter Move Instruction (RMOV) which allows the configuration of the user defined register remapping parameters, the RMOV instruction being followed by the REPEAT
17. ... Y0.h^, A3 ; a3 += d4*c1, and load c5
MULA A0, X1.l^, Y1.l, A0 ; a0 += d2*c2, and load d6
MULA A1, X1.h, Y1.l, A1 ; a1 += d3*c2
MULA A2, X0.l, Y1.l, A2 ; a2 += d4*c2
MULA A3, X0.h, Y1.l^, A3 ; a3 += d5*c2, and load c6
MULA A0, X1.h^, Y1.h, A0 ; a0 += d3*c3, and load d7
MULA A1, X0.l, Y1.h, A1 ; a1 += d4*c3
MULA A2, X0.h, Y1.h, A2 ; a2 += d5*c3
MULA A3, X1.l, Y1.h^, A3 ; a3 += d6*c3, and load c7
In this example, the data values are placed in the X bank of registers and the coefficient values are placed in the Y bank of registers. As a first step, the four accumulator registers A0, A1, A2 and A3 are set to zero. Once the accumulator registers have been reset, an instruction loop is then entered, which is delimited by the REPEAT and NEXT instructions. The value Z1 identifies the number of times that the instruction loop should be repeated, and for the reasons that will be discussed later, this will actually be equal to the number of coefficients (c0, c1, c2, etc.) divided by 4.
The instruction loop comprises 16 multiply accumulate instructions (MULA), which, after the first iteration through the loop, will result in the registers A0, A1, A2, A3 including the result of the calculations shown in the above code between the REPEAT and the first MULA instruction. To ...
... d4×c2 and d5×c2, whilst the final four calculations correspond to the calculations d3×c3
18. ... an ARM data sheet. The ARM encoding for MRP is: [encoding diagram]
... data is provided on the correct halves of the data bus. When configured as big endian, the top and bottom 16 bit halves are swapped when the 16 bit forms of STP are used.
The Piccolo instruction set assumes little endian operation internally. For example, when accessing a 32 bit register as 16 bit halves, the lower half is assumed to occupy bits 15 to 0. Piccolo may be operating in a system with big endian
Piccolo has 4 private registers which can only be accessed from the ARM. They are called S0-S2. They can only be accessed with MRC and MCR instructions. The opcodes are:
MPSR Sn, Rm
MRPS Sn
These opcodes transfer a 32 bit value between ARM register Rm and private register Sn. They are encoded in ARM as a coprocessor register transfer: [encoding diagram]
where L is 0 for the MPSR and 1 for the MRPS.
Register S0 contains the Piccolo u
Writing to the program counter will start Piccolo executing a program at that address (leaving halted state if it is halted). On reset the program counter is undefined, since Piccolo is always started by writing to the program counter.
During execution, Piccolo monitors the execution of instructions and the status of the coprocessor interface. If it detects that
19. d4×c3, d5×c3 and d6×c3. Since, in the above described embodiment, registers are not remappable, each multiplication operation has to be reproduced explicitly with the specific register required being designated in the operands. Once the sixteen MULA instructions have been performed, the instruction loop can be repeated for coefficients c4 to c7 and data items d4 to d10. Also, because the loop acts on four coefficient values per iteration, then the number of coefficient values must be a multiple of four and the computation Z1 = no. of coeffs / 4 must be calculated.
By employing the remapping mechanism in accordance with the preferred embodiment of the present invention, the instruction loop can be dramatically reduced, such that it now only includes 4 multiply accumulate instructions, rather than the 16 multiply accumulate instructions that were otherwise required. Using the remapping mechanism, the code can now be written as follows:
; start with 4 new data values
ZERO {A0-A3} ; Zero the accumulators
REPEAT Z1, X++ n4 w4 r4, Y++ n4 w4 r4 ; Z1 = (number of coefficients)
; Remapping is applied to the X and Y banks.
; Four 16 bit registers in these banks are remapped.
; The base pointer for both banks is incremented by one on each iteration of the loop.
; The base pointer wraps when it reaches the fourth register in the bank.
MULA A0, X0.l^, Y0.l, A0 ; a0 += d0*c0, and load d4
MULA A1, X0.h, Y0.l, A1 ; a1 += d1*c0
MULA A2, X1.l,
20. data item in the particular register identified by the physical register reference.
The remapping mechanism of the preferred embodiment allows each bank of registers to be split into two sections, namely a section within which registers may be remapped, and a section in which registers retain their original register references without remapping. In preferred embodiments, the remapped section starts at the bottom of the register bank being remapped.
A number of parameters are employed by the remapping mechanism, and these parameters will be discussed in detail with reference to FIG. 6, which is a block diagram illustrating how the various parameters are used by the register remapping logic 52. It should be noted that these parameters are given values that are relative to a point within the bank being remapped, this point being, for example, the bottom of the bank.
The register remapping logic 52 can be considered as comprising two main logical blocks, namely the Remap block 56 and the Base Update block 58. The register remapping logic 52 employs a base pointer that provides an offset value to be added to the logical register reference, this base pointer value being provided to the remap block 56 by base update block 58.
A BASESTART signal can be used to define the initial value of the base pointer, this for example typically being zero, although some other value may be specified. This BASESTART signal is passed to multiplexor 60 within the
21. [Table fragment: refill outcomes for half-register targets (Rn.l, Rn.h, both halves), with ROB entries marked valid or empty.]
To summarise, the two halves of a register may be refilled independently from the ROB. The data in the ROB is either marked as destined for a whole register or as two 16 bit values destined for the bottom half of a register.
Data is loaded into the ROB using ARM coprocessor instructions. How the data is marked in the ROB depends on which ARM coprocessor instruction was used to perform the transfer. The following ARM instructions are available for filling the ROB with data:
LDP{<cond>}<16/32> <dest>, [Rn]{!}, #<size>
LDP{<cond>}<16/32>W <dest>, <wrap>, [Rn]{!}, #<size>
LDP{<cond>}16U <bank>, [Rn]{!}
MPR{<cond>} <dest>, Rn
MRP{<cond>} <dest>, Rn
The following ARM instruction is provided for configuring the ROB:
LDPA <bank list>
The first three are assembled as LDCs, MPR and MRP as MCRs, and LDPA is assembled as a CDP instruction.
In the above, <dest> stands for a Piccolo register (A0-Z3), Rn for an ARM register, <size> for a constant number of bytes which must be a non zero multiple of 4, and <wrap> for a constant (1, 2, 4, 8). Fields surrounded by { } are optional. For a transfer to be able to fit into the Reorder Buffer
22. fetch data from the cache or main memory.
The invention recognises the above consideration and provides the solution of using an input operand size flag and a high/low location flag to indicate the input operand size and in which portion of the register it is stored. In this way, a single register can hold more than one input operand, so more efficiently utilising the register resources of the device, and yet those input operands may be separately manipulated.
The advantages of the present invention are enhanced further when an N bit data bus links the data storage device to the register. In this case the data bus may be used to transfer two operands at a time, so more efficiently using the bus bandwidth and reducing the possibility of performance bottleneck.
In preferred embodiments of the invention, said arithmetic logic unit is responsive to at least one parallel operation program instruction word that performs separate arithmetic logic operations upon a first (N/2) bit input operand data word and a second (N/2) bit input operand data word stored within respective high order bit positions and low order bit positions of a single source register.
The provision of parallel operation program instruction words allows two independent calculations to be performed by the arithmetic logic unit, making full use of its N bit datapath capabilities even though the input opera
23. flag is set correctly. Overflow can occur when:
Writing to a 16 bit register when the result is not in the range -2^15 to 2^15-1.
Writing to a 32 bit register when the result is not in the range -2^31 to 2^31-1.
Parallel add/subtract instructions set the Z and V flags independently on the upper and lower halves of the result.
When writing to an accumulator, the V flag is set as if writing to a 32 bit register. This is to allow saturating instructions to use accumulators as 32 bit registers.
The saturating absolute instruction (SABS) also sets the overflow flag if the absolute value of the input operand would not fit in the designated destination.
The Carry flag is set by add and subtract instructions and is used as a "binary" flag by the MAX/MIN, SABS and CLB instructions. All other instructions, including multiply operations, preserve the Carry flag(s).
For add and subtract operations the Carry i
Since Piccolo deals with signed quantities, the unsigned LS and HI conditions have been dropped and replaced by ... and VN, which describe the direction of any overflow. Since the result of the ALU is 48 bits wide, MI and LT now perform the same function, similarly PL and GE. This leaves 3 slots for future expansion.
All operations are signed unless otherwise indicated.
The primary and secondary condition codes each consist of:
N: negative.
Z: zero.
C: carry/unsigned overflow.
V: signed overflow.
24. in the control register was cleared. This allows any future extensions of the instruction set to be trapped and optionally emulated on existing implementations.
Accessing Piccolo State from ARM is as follows. State access mode is used to observe/modify the state of Piccolo. This mechanism is provided for two purposes: Context Switch; Debug.
Piccolo is put in state access mode by executing the PSTATE instruction. This mode allows all Piccolo state to be saved and restored with a sequence of STC and LDC instructions. When put into state access mode, the use of the Piccolo coprocessor ID PICCOLO1 is modified to allow the state of Piccolo to be accessed. There are 7 banks of Piccolo state. All the data in a particular bank can be loaded and stored with a single LDC or STC.
Bank 0: Private registers.
1 32 bit word containing the value of the Piccolo ID Register (Read Only).
1 32 bit word containing the state of the Control Register.
1 32 bit word containing the state of the Status Register.
1 32 bit word containing the state of the Program Counter.
The STC instruction is used to store Piccolo state when Piccolo is in state access mode. The BANK field specifies which bank is being stor
[Interleaved fragment of the shift instruction description: OPC specifies the type of instruction; the listed actions shift src1 left or right by src2.]
25. instruction. Preferably, the user defined remapping parameters would be placed by the RMOV instruction in the registers left aside for storing such user defined remapping parameters, and the multiplexor would then be programmed to pass the contents of those registers to the register remapping logic. In the preferred embodiments, the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters take one of the values identified in the following chart:

PARAMETER: DESCRIPTION
REGCOUNT: This identifies the number of 16 bit registers to perform remapping on, and may take the values 0, 2, 4, 8. Registers below REGCOUNT are remapped, those above or equal to REGCOUNT are accessed directly.
BASEINC: This defines by how many 16 bit registers the base pointer is incremented at the end of each loop iteration. It may in preferred embodiments take the values 1, 2 or 4, although in fact it can take other values if desired, including negative values where appropriate.
BASEWRAP: This determines the ceiling of the base calculation. The base wrapping modulus may take the values 2, 4, 8.
REGWRAP: This determines the ceiling of the remap calculation. The register wrapping modulus may take the values 2, 4, 8. REGWRAP may be chosen to be equal to REGCOUNT.

Returning to FIG 6, an example of how the various parameters are used by the remap block 56 is as follows (in this example
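One possible reading of the four parameters in the chart above is sketched below in Python. This is an assumption made for illustration, not the remap block 56 itself: the sketch supposes that a logical register below REGCOUNT is offset by a base pointer modulo REGWRAP, and that the base pointer advances by BASEINC modulo BASEWRAP at the end of each loop iteration. The class name `RemapState` and its methods are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RemapState:
    """Hypothetical model of the remapping parameters for one register bank."""
    regcount: int   # number of 16 bit registers subject to remapping
    baseinc: int    # base pointer increment per loop iteration
    basewrap: int   # modulus (ceiling) of the base calculation
    regwrap: int    # modulus (ceiling) of the remap calculation
    base: int = 0   # current base pointer

    def physical(self, logical: int) -> int:
        """Map a logical 16 bit register index to a physical index."""
        if logical < self.regcount:
            # Assumed formula: offset by the base pointer, wrapped by REGWRAP.
            return (logical + self.base) % self.regwrap
        # Registers at or above REGCOUNT are accessed directly.
        return logical

    def end_of_iteration(self) -> None:
        """Advance the base pointer at the end of a loop iteration."""
        self.base = (self.base + self.baseinc) % self.basewrap
```

With REGCOUNT, BASEWRAP and REGWRAP all equal to 4 and BASEINC equal to 1, logical register 0 refers to physical register 0 on the first iteration and to physical register 1 on the second, giving a moving window over four 16 bit registers.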
26. is disabled, only private registers 0 and 1 (the ID and Status registers) are accessible, and only then from a privileged mode. Access to any other state, or any access from user mode, will cause an ARM undefined instruction exception. Disabling Piccolo causes it to halt execution. When Piccolo has halted execution, it will acknowledge the fact by setting the E bit in the status register. Piccolo is enabled by executing the PENABLE instruction:

PENABLE ; Enable Piccolo

This instruction is encoded as (fields listed from bit 31 downward): COND 1110 0010 0000 0000 PICCOLO1 0000

(Encoding, fields listed from bit 31 downward: COND 1110 0000 0000 0000 PICCOLO1 0000.) When this instruction is executed the following occurs:
All registers are marked as empty (ready for refill).
Input ROB is cleared.
Output FIFO is cleared.
Loop counters are reset.
Piccolo is put into halted state (and H bit of S2 will be set).
Executing the PRESET instruction may take several cycles to complete (2-3 for this embodiment). Whilst it is executing, following ARM coprocessor instructions to be executed on Piccolo will be busy waited.

In state access mode, Piccolo's state may be saved and restored using STC and LDC instructions (see the below regarding accessing Piccolo state from ARM). To enter state access mode, the PSTATE instruction must first be executed:

PSTATE ; E
27. operations specified by said program instruction words, wherein (iii) said arithmetic logic unit is responsive to at least one program instruction word that includes: (a) a source register bit field specifying a source register of said plurality of registers storing an input operand data word for said program instruction word; (b) an input operand size flag specifying whether said input operand data word has an N-bit size or an N/2-bit size; and (c) when said input size flag specifies an N/2-bit size, a high/low location flag indicating in which of high order bit positions of said source register and low order bit positions of said source register said input operand data word is located.

There has been a trend in data processing to increase the datapath widths of the systems. Early systems had 8-bit datapaths. These then developed to 16-bit datapaths, and it is now common to have 32-bit and 64-bit datapaths. With these increases in datapath widths, the registers within the data processing systems have also increased to have matching widths. The present invention recognises that when the data words to be manipulated are smaller than the datapath widths, then using a full register to store these words is wasteful of the register resources of the device. This is particularly the case in load/store architecture machines, in which all data to be manipulated must be in a register and for which you wish to reduce the number of times you need to
28. programming flexibility. The name equivalents for some of the G-series basic operations are:
L_msu => SMLDS
L_mac => SMLDA
These make use of the saturation of the multiplier when left shifting by one bit. If a sequence of fractional multiply accumulates is required, with no loss of precision, MULA can be used, with the sum maintained in 33.14 format. A left shift and saturate can be used at the end to convert to 1.15 format, if required.

(continued)
101: for (k = 0; k < 16; k++) if bit k of the register list is set then (register k >> scale) is written to the output FIFO and register k is marked as being empty.
110: for (k = 0; k < 16; k++) if bit k of the register list is set then SAT(register k >> scale) is written to the output FIFO.
111: for (k = 0; k < 16; k++) if bit k of the register list is set then SAT(register k >> scale) is written to the output FIFO and register k is marked as being empty.

Multiply Operation instructions perform signed multiplication, and optional scaling/saturation. The source registers (16 bit only) are treated as signed numbers.

(Instruction encoding diagram, bits 31 to 0; fields include DEST, SRC1, SRC2 and OPC.)

OPC specifies the type of instruction. Action (OPC):
0: dest = (src1 * src2) >> {scale}
1: dest = SAT((src1 * src2) >> {scale})
Mnemonics:
0: MUL <dest>, <src1_16>
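A minimal Python sketch of the two multiply actions listed above follows. It is offered as an illustration under stated assumptions rather than as a definition of the instruction: the names `sat32` and `mul` are hypothetical, the destination is assumed to be a 32 bit register, and a negative scale is taken to mean a left shift, following the convention this document gives for shift operands.

```python
def sat32(value: int) -> int:
    """Clamp value to the signed 32 bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, value))


def mul(src1: int, src2: int, scale: int = 0, saturate: bool = False) -> int:
    """Hypothetical model of dest = (src1 * src2) >> scale, optionally saturated.

    src1 and src2 are assumed to be signed 16 bit values already.
    A negative scale shifts left by the absolute amount.
    """
    product = src1 * src2
    shifted = product >> scale if scale >= 0 else product << -scale
    return sat32(shifted) if saturate else shifted
```

The case that motivates the saturating form is the fractional product of -1.0 by -1.0 in 1.15 format: `mul(-32768, -32768, scale=-1, saturate=True)` would overflow 32 bits after the one bit left shift and is clamped to 0x7fffffff.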
29. the operation.
Mnemonics:
0: {S}ADDA <dest>, <src1>, <src2>, <acc> {,<scale>}
1: {S}SUBA <dest>, <src1>, <src2>, <acc> {,<scale>}
S before the command indicates saturation.
Flags: See above.
Reasons for inclusion: The ADDA (add accumulate) instruction is useful for summing two words of an array of integers with an accumulator (for instance to find their average) per cycle. The SUBA (subtract accumulate) instruction is useful in calculating the sum of the differences (for correlation); it subtracts two separate values and adds the difference to a third register. Addition with rounding can be done by using <dest> different from <acc>. For example, X0 = (X1 + X2 + 16384) >> 15 can be done in one cycle by keeping 16384 in the register supplied as <acc>. Addition with a rounding constant can be done by

Action (OPC), parallel add/subtract:
000: dest.h = (src1.h + src2.h) >> {scale}, dest.l = (src1.l + src2.l) >> {scale}
001: dest.h = (src1.h + src2.h) >> {scale}, dest.l = (src1.l - src2.l) >> {scale}
100 and 101: dest.h and dest.l are likewise formed from the corresponding halves of src1 and src2, each shifted by {scale}.
Each sum/difference is independently saturated if the Sa bit is set.
Mnemonics:
000: {S}ADDADD <dest>, <src1_32>, <src2_32> {,<scale>}
001: {S}ADDSUB <dest>, <src1_32>, <src2_32> {,<scale>}
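The rounding example given for ADDA can be sketched in Python as follows. The function names `adda` and `suba` are hypothetical, and the formulas are an assumption drawn from the prose (the sum or difference of the two sources is added to the accumulator operand and the total is then scaled); saturation is omitted for brevity.

```python
def adda(src1: int, src2: int, acc: int, scale: int = 0) -> int:
    """Assumed ADDA behaviour: (acc + src1 + src2) >> scale."""
    return (acc + src1 + src2) >> scale


def suba(src1: int, src2: int, acc: int, scale: int = 0) -> int:
    """Assumed SUBA behaviour: (acc + (src1 - src2)) >> scale."""
    return (acc + (src1 - src2)) >> scale


# Rounded sum: keeping 16384 (one half of 2^15) in the accumulator
# operand rounds the scaled result to nearest instead of truncating.
ROUND = 16384
x0 = adda(24576, 0, ROUND, scale=15)   # 0.75 in units of 2^15 rounds up to 1
```

Without the rounding constant the same operands would truncate to 0; Python's `>>` on integers is an arithmetic shift, which matches the signed behaviour assumed here.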
30. wide. Each of Piccolo's registers can be treated as containing two independent 16 bit values. Bits 0 to 15 contain the low half, bits 16 to 31 contain the high half. Instructions can specify a particular 16 bit half of each register as a source operand, or they may specify the entire 32 bit register.

Piccolo also provides for saturated arithmetic. Variants of the multiply, add and subtract instructions provide a saturated result if the result is greater than the size of the destination register. Where the destination register is a 48 bit accumulator, the value is saturated to 32 bits (i.e. there is no way to saturate a 48 bit value). There is no overflow detection on 48 bit registers. This is a reasonable restriction since it would take at least 65536 multiply accumulate instructions to cause an overflow.

Each Piccolo register is either marked as empty (E flag, see FIG 2) or contains a value (it is not possible to have half of a register empty). Initially, all registers are marked as empty. On each cycle Piccolo attempts, with the refill control circuit 16, to fill one of the empty registers by a value from the input reorder buffer. Alternatively, if the register is written with a value from the ALU, it is no longer marked as empty. If a register is written from the ALU and at the same time there is a value waiting to be placed in the register from the re
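The selection of a 16 bit half or of the whole 32 bit register can be sketched in Python as below. The names `sign_extend` and `read_operand` are hypothetical. The sign extension of a 16 bit operand to 32 bits follows what this description states elsewhere about values read via the A or B busses; the sketch returns Python signed integers rather than bit patterns.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a signed two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def read_operand(reg32: int, size16: bool, high: bool = False) -> int:
    """Read a source operand from a 32 bit register.

    size16 plays the part of the Size bit (True selects a 16 bit operand)
    and high plays the part of the Hi/Lo bit (True selects bits 16 to 31).
    """
    if not size16:
        return sign_extend(reg32, 32)
    half = (reg32 >> 16) if high else reg32
    return sign_extend(half, 16)
```

For a register holding 0x8000FFFF, the low half reads as -1 and the high half as -32768, so one 32 bit register carries two independent signed 16 bit operands.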
31. with another coefficient c4. The next four MULA instructions represent the calculations d1xc1, d2xc1, d3xc1 and d4xc1 respectively. Once the calculation d1xc1 has been performed, the register X0.h is marked by a refill bit, since d1 is no longer required. Similarly, once all four calculations have been performed, the coefficient register is marked for refilling, since the coefficient c1 is no longer needed. Similarly, the next four MULA instructions correspond to the calculations d2xc2, d3xc2

As before, the first step is to set the four accumulator registers A0-A3 to zero. Then the instruction loop is entered, delimited by the REPEAT and NEXT opcodes. The REPEAT instruction has a number of parameters associated therewith, which are as follows:
X: indicates that BASEINC is 1 for the X Bank of registers;
n4: indicates that REGCOUNT is 4, and hence the first four X Bank registers X0.l to X1.h are to be remapped;
w4: indicates that BASEWRAP is 4 for the X Bank of registers;
r4: indicates that REGWRAP is 4 for the X Bank of registers;
Y: indicates that BASEINC is 1 for the Y Bank of registers;
n4: indicates that REGCOUNT is 4, and hence the first four Y Bank registers Y0.l to Y1.h are to be remapped;
w4: indicates that BASEWRAP is 4 for the Y Bank of registers;
r4: indicates that REGWRAP is 4 for the Y Bank of registers.
It should also be not
32. 2. This data was earlier transferred from memory 8 by the ARM 2. The instructions are streamed from the instruction cache 6; the instruction cache 6 drives the data bus as a full bus master. A small Piccolo instruction cache 6 will be a 4 line, 16 words per line direct mapped cache (64 instructions). In some implementations, it may be worthwhile to make the instruction cache bigger.

Thus two tasks are run independently: ARM loading data, and Piccolo processing it. This allows sustained single cycle data processing on 16 bit data. Piccolo has a data input mechanism (illustrated in FIG 2) that allows the ARM to prefetch sequential data, loading the data before it is required by Piccolo. Piccolo can access the loaded data in any order, automatically refilling its register as the old data is used for the last time (all instructions have one bit per source operand to indicate that the source register should be refilled). This input mechanism is termed the reorder buffer and comprises an input buffer 12. Every value loaded into Piccolo (via an LDC or MCR, see below) carries with it a tag Rn specifying which register the value is destined for. The tag Rn is stored alongside the data word in the input buffer. When a register is accessed via a register selecting circuit 14 and the instruction specifies the data register is to be refilled, the register is marked as empty by asserting a signal E. The register is then automatically refilled by a refill contr
33. 2 bit registers, in which case there will be four additional 16 bit registers not directly accessible to the programmer. However, these extra four registers can be made available by the remapping mechanism, thereby providing additional registers for the storage of data items.

The following assembler syntax will be used:
The shift operators denote either a logical shift right, or shift left if the shift operand is negative (see <lscale> below), or an arithmetic shift right, or shift left if the shift operand is negative (see <scale> below).
ROR means Rotate Right.
SAT(a) means the saturated value of a (saturated to 16 or 32 bits depending on the size of the destination register). Specifically, to saturate to 16 bits, any value greater than +0x7fff is replaced by +0x7fff and any value less than -0x8000 is replaced by -0x8000. Saturation to 32 bits is similar, with extremes +0x7fffffff and -0x80000000. If the destination register is 48 bits, the saturation is still at 32 bits.

Source operand 1 can be one of the following formats:
<src1> will be used as a shorthand for [Rn|Rn.l|Rn.h|Rn.x][^]. In other words, all 7 bits of the source specifier are valid and the register is read as a 32 bit value (optionally swapped) or a 16 bit value sign extended. For an accumulator, only the bottom 32 bits are read. The ^ specifies register refill.
<src1_16> is short for [Rn.l|Rn.h][^]. Only 16 bit values can be read.
<src1_32> is short for [Rn|Rn.x]
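The definition of SAT(a) above translates directly into a short Python sketch. The function name `sat` and its `dest_bits` parameter are illustrative only.

```python
def sat(value: int, dest_bits: int) -> int:
    """Saturate value for a destination register of 16, 32 or 48 bits.

    A 16 bit destination clamps to -0x8000 .. +0x7fff. A 32 bit or a
    48 bit destination clamps to -0x80000000 .. +0x7fffffff, since
    saturation of a 48 bit accumulator is still at 32 bits.
    """
    bits = 16 if dest_bits == 16 else 32
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    return max(lo, min(hi, value))
```

For instance, `sat(0x8000, 16)` yields 0x7fff, and `sat(0x80000000, 48)` yields 0x7fffffff.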
34. (continued)
10: dest = (src2 >= 0) ? src1 >> src2 : src1 << -src2
11: dest = (src2 >= 0) ? src1 ROR src2 : src1 ROL -src2
Mnemonics:
00: ASL <dest>, <src1>, <src2_16>
01: LSR <dest>, <src1>, <src2_16>
10: ASR <dest>, <src1>, <src2_16>
11: ROR <dest>, <src1>, <src2_16>
Flags:
Z is set if the result is zero.
N is set if the result is negative.
V is preserved.
C is set to the value of the last bit shifted out (as on the ARM).
The behaviour of register specified shifts is:
LSL by 32 has result zero, C set to bit 0 of src1.
LSL by more than 32 has result zero, C set to zero.
LSR by 32 has result zero, C set to bit 31 of src1.
LSR by more than 32 has result zero, C set to zero.
ASR by 32 or more has result filled with, and C equal to, bit 31 of src1.
ROR by 32 has result equal to src1 and C set to bit 31 of src1.
ROR by n, where n is greater than 32, will give the same result and carry out as ROR by n-32; therefore repeatedly subtract 32 from n until the amount is in the range 1 to 32 and see above.
Reasons for inclusion:
Multiplication/division by a power of 2.
Bit and field extraction.
Serial registers.

Undefined Instructions are set out above in the instruction set listing. Their execution will cause Piccolo to halt execution, set the U bit in the status register, and disable itself (as if the E bit
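The register specified shift cases enumerated above can be modelled in Python as follows. This is a sketch under assumptions: the function names are hypothetical, each function takes a shift amount of at least 1, operates on a 32 bit pattern, and returns the pair (result, carry).

```python
MASK32 = 0xFFFFFFFF


def lsl(x: int, n: int) -> tuple[int, int]:
    """Logical shift left by n >= 1; carry is the last bit shifted out."""
    x &= MASK32
    result = (x << n) & MASK32 if n < 32 else 0
    carry = (x >> (32 - n)) & 1 if n <= 32 else 0
    return result, carry


def lsr(x: int, n: int) -> tuple[int, int]:
    """Logical shift right by n >= 1; carry is the last bit shifted out."""
    x &= MASK32
    return x >> n, (x >> (n - 1)) & 1


def asr(x: int, n: int) -> tuple[int, int]:
    """Arithmetic shift right by n >= 1."""
    x &= MASK32
    sign = x >> 31
    if n >= 32:
        # Result is filled with bit 31 of the source, and so is the carry.
        return (MASK32 if sign else 0), sign
    signed = x - (1 << 32) if sign else x
    return (signed >> n) & MASK32, (x >> (n - 1)) & 1


def ror(x: int, n: int) -> tuple[int, int]:
    """Rotate right by n >= 1, reducing n into the range 1 to 32 first."""
    x &= MASK32
    while n > 32:
        n -= 32
    if n == 32:
        return x, x >> 31
    result = ((x >> n) | (x << (32 - n))) & MASK32
    return result, result >> 31
```

Because Python integers are unbounded, the boundary cases (shifts by exactly 32 or by more than 32) fall out of the same expressions for `lsr`, whereas `lsl`, `asr` and `ror` handle them explicitly.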
35. (Instruction encoding diagram for CLB, bits 31 to 0.)
dest is set to the number of places the value in src1 must be shifted left in order for bit 31 to differ from bit 30. This is a value in the range 0-30, except in the special cases where src1 is either -1 or 0, where 31 is returned.
Mnemonic: CLB <dest>, <src1>
Flags:
Z is set if the result is zero.
N is cleared.
C is set if src1 is either -1 or 0.
V is preserved.
Reasons for inclusion: Step needed for normalisation.

Halt and Breakpoint instructions are provided for stopping Piccolo execution.
(Instruction encoding diagram, bits 31 to 0.)
OPC specifies the type of instruction. Action (OPC):
0: Piccolo execution is stopped and the Halt bit is set in the Piccolo status register.
1: Piccolo execution is stopped, the Break bit is set in the Piccolo status reg
Mnemonics:
0: HALT
1: the breakpoint instruction
Flags: Unaffected.

Logical Operation instructions perform a logical operation on a 32 or 16 bit register. The operands are treated as unsigned values.
(Instruction encoding diagram, bits 31 to 0; fields include OPC, DEST and SRC2.)
OPC encodes the logical operation to perform. Action (OPC):
00: dest = (src1 & src2) >> {scale}
01, 10 and 11: dest is likewise a logical combination of src1 and src2, shifted right by {scale}.
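The CLB result described above can be computed with a simple loop, sketched here in Python. The function name `clb` is illustrative; the sketch treats its argument as a 32 bit two's complement value.

```python
def clb(src1: int) -> int:
    """Number of left shifts needed for bit 31 to differ from bit 30.

    Returns a value in the range 0 to 30, or 31 in the special cases
    where src1 is 0 or -1 (all bits equal).
    """
    u = src1 & 0xFFFFFFFF
    if u in (0, 0xFFFFFFFF):
        return 31
    count = 0
    while ((u >> 31) & 1) == ((u >> 30) & 1):
        u = (u << 1) & 0xFFFFFFFF
        count += 1
    return count
```

Shifting a value left by `clb(value)` places normalises it so that its most significant magnitude bit sits at bit 30, which is the normalisation step the text gives as the reason for inclusion.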
36. CALE. The assembler will also support the syntax OUTPUT Rn, in which case it will output one register using a MOV Rn instruction. The EMPTY instruction will stall until all registers to be emptied contain valid data (i.e. are not empty). Register list operations must not be used within re-mapping REPEAT loops. The OUTPUT instruction can only specify up to eight registers to output.
Reasons for inclusion: After a routine has finished, the next routine expects all registers to be empty so it can receive data from the ARM. An EMPTY instruction is needed to accomplish this. Before performing a FIR or other filter, all accumulators and partial results need to be zeroed. The ZERO instruction helps with this. Both are designed to improve code density by replacing a series of single register moves. The OUTPUT instruction is included to improve code density by replacing a series of MOV Rn instructions.

A Remapping Parameter Move Instruction RMOV is provided to allow the configuration of the user defined register re-mapping parameters. The instruction encoding is as follows: (encoding diagram). If the RMOV instruction is used whilst re-mapping is active, the behaviour is UNPREDICTABLE.
Flags: Unaffected.

Repeat Instructions provide four zero cycle loops in hardware. The REPEAT instruction defines a new hardware loop. Piccolo uses hardware loop 0 for the first REPEAT instruction, hardware loop 1 for a REPEAT instruction nested w
37. Piccolo since the ARM is internally little endian.

The MPRW instruction places the contents of ARM register Rn into the ROB, marking it as two 16-bit data items destined for the 16-bit Piccolo register <dest>.l. The restrictions on <dest> are the same as those for the LDPW instructions (i.e. A0, X0, Y0, Z0). For example, an MPRW instruction naming X0 and R3 will transfer the contents of R3 into the ROB, marking the data as 2 16-bit quantities destined for X0.l. It should be noted that, as for the LDP16W case with a wrap of 1, only the bottom half of a 32-bit register can be targeted. As with MPR, no endianness specific operations are applied to the data.

LDP is encoded as: (encoding diagram, bits 31 to 0) where PICCOLO1 is Piccolo's first coprocessor number (currently 8). The N bit selects between LDP32 (1) and LDP16 (0).

LDPW is encoded as: (encoding diagram, bits 31 to 0) where DEST is 0-3 for destination register A0, X0, Y0, Z0 and WRAP is 0-3 for wrap values 1, 2, 4, 8. PICCOLO2 is Piccolo's second coprocessor number (currently 9). The N bit selects between LDP32 (1) and LDP16 (0).

LDP16U is encoded as: (encoding diagram, bits 31 to 0) where DEST is 1-3 for the destination bank X, Y, Z.

LDPA is encoded as
38. Piccolo will stall the ARM on an LDC if there is not enough room in the input reorder buffer to load in the data, and on an STC if there is insufficient data in the output buffer to store, i.e. the data the ARM is expecting is not in the output buffer 18. Piccolo also executes ARM Coprocessor register transfers to allow ARM to access Piccolo's special registers.

Piccolo fetches its own instructions from memory to control the Piccolo datapath illustrated in FIG 3 and to transfer data from the reorder buffer to registers and from registers to the output buffer 18. The arithmetic logic unit of the Piccolo that executes these instructions has a multiplier/adder circuit 20 that performs multiplies, adds, subtracts, multiply-accumulates (using a carry chain 21), logical operations, shifts and rotates. There is also provided in the datapath an accumulate/decumulate circuit 22 and a scale/saturate circuit 24. The Piccolo instructions are initially loaded from memory into the instruction cache 6, where Piccolo can access them without needing access back to the main memory.

Piccolo cannot recover from memory aborts. Therefore, if Piccolo is used in a virtual memory system, all Piccolo data must be in physical memory throughout the Piccolo task. This is not a significant limitation given the real time nature of Piccolo tasks, e.g. real time DSP. If a memory abort occurs, Piccolo will stop and set a flag in a status register S2.

FIG 3 shows the overall d
39. United States Patent [19]
Glass et al.
US005881259A
[11] Patent Number: 5,881,259
[45] Date of Patent: Mar. 9, 1999
[54] INPUT OPERAND SIZE AND HI/LOW WORD SELECTION CONTROL IN DATA PROCESSING SYSTEMS
[75] Inventors: Simon James Glass; David Vivian Jaggar, both of Cherry Hinton, England
[73] Assignee: ARM Limited, Cambridge, United Kingdom
[21] Appl. No.: 727,213
[22] Filed: 8, 1996
[51] Int. Cl. ... 9/34
[52] U.S. Cl.: 395/386; 395/562; 395/564
[58] Field of Search: 711/212; 395/562, 395/564, 586
[56] References Cited
OTHER PUBLICATIONS
Motorola, MC88110 Second Generation RISC Microprocessor User's Manual, 1991, pp. 5-1 through 5-25.
Iacobovici, Sorin, A Pipelined Interface for High Floating-Point Performance with Precise Exceptions, IEEE Micro, vol. 8, No. 3, Jun. 1, 1988, pp. 77-87.
Kiyohara, Tokuzo et al., Register Connection: A New Approach to Adding Registers into Instruction Set Architectures, Computer Architecture News, vol. 21, No. 2, May 1, 1993, pp. 247-256.
Lee, Ruby B., Subword Parallelism with MAX-2, IEEE Micro, vol. 16, No. 4, Aug. 1, 1996, pp. 51-59.
Muller, Mike, ARM6: a High Performance Low Power Consumption Macrocell, Proceedings of the Spring Computer Society International Conference, No. Conf. 38, Feb. 22, 1993, Institute of Electrical and Electronics Engineers, pp. 80-87.
Sarkar, S. et al., DSP C
40. a to be tagged as destined for the bottom half of the destination register <dest>.l. This is the Half Register case. For example, the instruction LDP16W with destination X0, a wrap of 1, base register R0 and an offset of 8 will load two words into the ROB, marking them as 16-bit data destined for X0.l. R0 will be incremented by 8. The instruction LDP16W with a wrap of 4, base register R0 and an offset of 16 will behave in a similar fashion to the LDP32W examples, except for the fact that endian specific action may be performed on the data as it is returned from memory. All unused encodings of the LDP instruction may be reserved for future expansion.

The LDP16U instruction is provided to support the efficient transfer of non-word aligned 16-bit data. LDP16U support is provided for registers D4 to D15 (the X, Y and Z banks). The LDP16U instruction will transfer one 32-bit word of data (containing two 16-bit data items) from memory into Piccolo. Piccolo will discard the bottom 16 bits of this data and store the top 16 bits in a holding register. There is a holding register for the X, Y and Z banks. Once the holding register of a bank is primed, the behaviour of LDP{W} instructions is modified if the data is destined for a register in that bank. The data loaded into the ROB is formed by the concatenation of the holding register and the bottom 16 bits of data being transferred by the LDP instruction. The upper 16 bits of data being transferred is put into the holding register:
entry <-
41. ack, 16 bits; No register writeback, 32 bits; 16 bit output; 32 bit output.

The register number (Dx) indicates which of the 16 registers is being addressed. The Hi/Lo bit and the Size bit work together to address each 32-bit register as a pair of 16-bit registers. The Size bit defines how the appropriate flags, as defined in the instruction type, will be set, irrespective of whether a result is written to the register bank and/or output FIFO. This allows the construction of compares and similar instructions. The add with accumulate class of instruction must write back the result to a register. The following table shows the behaviour of each encoding:

Encoding / Register Write / FIFO Write / V FLAG
1 / Write whole register / No write / 32-bit overflow
2 / Write whole register / Write 32 bits / 32-bit overflow
3 / Write low 16 bits to Dx.l / No write / 16-bit overflow
4 / Write low 16 bits to Dx.l / Write low 16 bits / 16-bit overflow
5 / Write low 16 bits to Dx.h / No write / 16-bit overflow
6 / Write low 16 bits to Dx.h / Write low 16 bits / 16-bit overflow
7 / No write / No write / 16-bit overflow
8 / No write / No write / 32-bit overflow
9 / No write / Write low 16 bits / 16-bit overflow
10 / No write / Write 32 bits / 32-bit overflow

In all cases, the result of any operation prior to writing back to a register or inserting into the output FIFO is a 48 bit quantity. There are two cases: If the write is of 16 bits, the 48 bit quantity is reduced to a 16-bit quantity by s
42. atapath functionality of Piccolo. The register bank 10 uses 3 read ports and 2 write ports. One write port (the L port) is used to refill registers from the reorder buffer. The output buffer 18 is updated directly from the ALU result bus 26; output from the output buffer 18 is under ARM program control. The ARM coprocessor interface performs LDC (Load Coprocessor) instructions into the reorder buffer, and STC (Store Coprocessor) instructions from the output buffer 18, as well as MCR and MRC (Move ARM register to/from CP register) on the register bank 10.

The remaining register ports are used for the ALU. Two read ports (A and B) drive the inputs to the multiplier/adder circuit 20; the C read port is used to drive the accumulator/decumulator circuit 22 input. The remaining write port W is used to return results to the register bank 10.

The multiplier 20 performs a 16x16 signed or unsigned multiply, with an optional 48 bit accumulate. The scaler unit 24 can provide a 0 to 31 immediate arithmetic or logical shift right, followed by an optional saturate. The shifter and logical unit 20 can perform either a shift or a logical operation every cycle.

Piccolo has 16 general purpose registers named D0-D15 or A0-A3, X0-X3, Y0-Y3, Z0-Z3. The first four registers (A0-A3) are intended as accumulators and are 48 bits wide, the extra 16 bits providing a guard against overflow during many successive calculations. The remaining registers are 32 bits
43. be put into state access mode and have its state examined/modified via the scan chain. The Piccolo Status register contains a single bit to indicate that it has executed a breakpointed instruction. When a breakpointed instruction is executed, Piccolo sets the B bit in the Status register and halts execution. To be able to interrogate Piccolo, the debugger must enable Piccolo and put it into state access mode by writing to its control register before subsequent accesses can occur.

FIG 4 illustrates a multiplexer arrangement responsive to the Hi/Lo bit and Size bit to switch appropriate halves of the selected register to the Piccolo datapath. If the Size bit indicates 16 bits, then a sign extending circuit pads the high order bits of the datapath with 0s or 1s as appropriate.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

We claim:
1. Apparatus for data processing, said apparatus comprising: (i) a plurality of registers for storing data words to be manipulated, each of said registers having at least an N-bit capacity; and (ii) an arithmetic logic unit responsive to program instructi
44. by data item d2, etc. Register A1 accumulates the results of a similar set of multiplication operations, but this time the set of coefficients have been shifted such that c0 is now multiplied by d1, c1 is now multiplied by d2, c2 is now multiplied by d3, etc. Likewise, register A2 accumulates the results of multiplying the data values by the coefficient values shifted another step to the right, such that c0 is multiplied by d2, c1 is multiplied by d3, c2 is multiplied by d4, etc. This shift, multiply and accumulate process is then repeated, with the result being placed in register A3.

If register remapping in accordance with the preferred embodiment of the present invention is not employed, then the following instruction loop will be required to perform the block filter instruction:

    ; start with 4 new data values
    ZERO {A0-A3}                  ; Zero the accumulators
    REPEAT Z1                     ; Z1 = (number of coeffs / 4)
    ; do the next four coefficients, on the first time around:
    ; a0 += d0*c0 + d1*c1 + d2*c2 + d3*c3
    ; a1 += d1*c0 + d2*c1 + d3*c2 + d4*c3
    ; a2 += d2*c0 + d3*c1 + d4*c2 + d5*c3
    ; a3 += d3*c0 + d4*c1 + d5*c2 + d6*c3
    MULA A0, X0.l^, Y0.l, A0      ; a0 += d0*c0, and load d4
    MULA A1, X0.h, Y0.l, A1       ; a1 += d1*c0
    MULA A2, X1.l, Y0.l, A2       ; a2 += d2*c0
    MULA A3, X1.h, Y0.l^, A3      ; a3 += d3*c0, and load c4
    MULA A0, X0.h^, Y0.h, A0      ; a0 += d1*c1, and load d5
    MULA A1, X1.l, Y0.h, A1       ; a1 += d2*c1
    MULA A2, X1.h, Y0.h, A2       ; a2 += d3*c1
    MULA X0.l
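The arithmetic that the block filter loop performs, independent of how it is scheduled onto MULA instructions, can be stated as a brief Python sketch. The function name `block_filter` is hypothetical, and the sketch ignores scaling, saturation and register refill; it only reproduces the four running sums a0 to a3 described in the comments of the listing.

```python
def block_filter(data: list[int], coeffs: list[int]) -> list[int]:
    """Compute four filter outputs from consecutive data windows.

    a[i] = sum over k of data[i + k] * coeffs[k], for i = 0..3.
    data must hold at least len(coeffs) + 3 items.
    """
    if len(data) < len(coeffs) + 3:
        raise ValueError("not enough data items for four outputs")
    acc = [0, 0, 0, 0]              # models accumulators A0 to A3
    for k, c in enumerate(coeffs):
        for i in range(4):
            acc[i] += data[i + k] * c
    return acc
```

Each coefficient is used four times in succession, once per accumulator, which is why a coefficient register can be marked for refill after the fourth use.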
45. cessary remapping. The register remapping logic 52 can be considered as being part of the instruction decoder 50, although it will be apparent to those skilled in the art that the register remapping logic 52 may be provided as a completely separate entity to the instruction decoder 50.

An instruction will typically include one or more operands identifying registers containing the data items required by the instruction. For example, a typical instruction may include two source operands and one destination operand, identifying two registers containing data items required by the instruction and a register into which the result of the instruction should be placed. The register remapping logic 52 receives the operands of an instruction from the instruction decoder 50, these operands identifying logical register references. Based on the logical register references, the register remapping logic will determine whether remapping should or should not be applied, and will then apply a remapping to physical register references as required. If it is determined that remapping should not be applied, the logical register references are provided as the physical register references. The preferred manner in which the remapping is performed will be discussed in more detail later.

Each output physical register reference from the register remapping logic is passed to the Piccolo processor core 54, such that the processor core can then apply the instruction to the
46. colo stalls waiting for the register to be refilled. If a register is marked for refill, and the register is then updated before the refilled value is read, the result is UNPREDICTABLE (for example, an ADD whose destination X0 is also a source marked for refill is unpredictable, since it marks X0 for refill and then refills it by placing the sum into it).

specification of the source accumulator and destination registers. For these instructions the Size bits are used to indicate the source accumulator, and the size bits are implied by the instruction type as 0. When a 16-bit value is read (via the A or B busses), it is automatically sign extended to a 32-bit quantity. If a 48-bit register is read (via the A or B busses), only the bottom 32 bits appear on the bus. Hence in all cases source 1 and source

The 4-bit scale field encodes fourteen scale types:
ASR #0, 1, 2, 3, 4, 6, 8, 10
ASR #12 to 16
LSL #1
Parallel Max/Min instructions do not provide a scale, and therefore the six bit constant variant of source 2 is unused (set to 0 by assembler).

Within a REPEAT instruction, register re-mapping is supported, allowing a REPEAT to access a moving window of registers without unrolling the loop. This is described in more detail below.

Destination operands have the following 7 bit format: (encoding diagram, bits 25 to 19). There are ten variants of this basic encoding. (Table of assembler mnemonics against bits 25 to 19; entries include Dx, Dx.h and an undefined encoding.) No register writeb
47. <cond> <dest>, <src1>, <src2>
01: SELTT<cond> <dest>, <src1>, <src2>
10: SELTF<cond> <dest>, <src1>, <src2>
11: Unused
If a register is marked for refill, it is unconditionally refilled. The assembler also provides the mnemonics:
MOV<cond> <dest>, <src1>
SELFT<cond> <dest>, <src1>, <src2>
SELFF<cond> <dest>, <src1>, <src2>
MOV<cond> A, B is equivalent to SEL<cond> A, B, A. SELFT and SELFF are obtained by swapping src1 and src2 and using SELTF, SELTT.
Flags: All flags are preserved so that a sequence of selects may be performed.
Reasons for inclusion: Used for making simple decisions inline without having to resort to a branch. Used by Viterbi algorithms and when scanning a sample or vector for the largest element.

Shift Operation instructions provide left and right logical shifts, right arithmetic shifts, and rotates by a specified amount. The shift amount is considered to be a signed integer between -128 and +127 taken from the bottom 8 bits of the register contents, or an immediate in the range +1 to +31. A shift of a negative amount causes a shift in the opposite direction by ABS(shift amount). The input operands are sign extended to 32 bits; the resulting 32-bit output is sign extended to 48 bits before write back so that a write to a 48-bit register behaves sensibly.
(Instruction encoding diagram, bits 31 to 0.)
48. converted to 32 bit values. Only accumulate instructions using bus C can access the full 48 bits of an accumulator register.

If the refill bit is set, the register is marked as empty after use and will be refilled from the ROB by the usual refill mechanism (see the section on the ROB). Piccolo will not stall unless the register is used again as a source operand before the refill has taken place. The minimum number of cycles before the refilled data is valid (best case: the data is waiting at the head of the ROB) will be either 1 or 2. Hence it is advisable not to use the refilled data on the instruction following the refill request. If use of the operand on the next two instructions can be avoided, it should be, since this will prevent performance loss on deeper pipeline implementations.

The refill bit is specified in the assembler by suffixing the register number with ^. The section of th

onto both halves of the 32 bit quantity. If the 8 bit immediate is used, it is duplicated only if the rotate indicates that the 8 bit immediate should be rotated onto the top half of the 32 bit quantity.

ROT / IMMEDIATE
10 / 0x00XY00XY
11 / 0xXY00XY00

No scale is available for parallel select operations; the scale field shall be set to 0 for these instructions. The multiply accumulate instructions do not allow an 8 bit rotated immediate to be specified. Bit 10 of the field is used to partly specify which accumulator to use. Source 2 is
49. d input size flag specifies an N/2-bit size, a high/low location flag indicating in which of high order bit positions of said source register and low order bit positions of said source register said input operand data word is located. The above and other objects, features and advantages of this invention will be apparent from the following detailed description of illustrative embodiments, which is to be read in connection with the accompanying drawings. BRIEF DESCRIPTION OF THE DRAWINGS. FIG. 1 illustrates the high level configuration of a digital signal processing apparatus; FIG. 2 illustrates the input buffer of register configuration of a coprocessor; FIG. 3 illustrates the datapath through the coprocessor; FIG. 4 illustrates a multiplexing circuit for reading high or low order bits from a register; FIG. 5 is a block diagram illustrating register remapping logic used by the coprocessor in preferred embodiments; FIG. 6 illustrates in more detail the register remapping logic shown in FIG. 5; and FIG. 7 is a table illustrating a Block Filter Algorithm. DESCRIPTION OF THE PREFERRED EMBODIMENTS. The system described below is concerned with digital signal processing (DSP). DSP can take many forms, but may typically be considered to be processing that requires the high speed (real time) processing of large volumes of data. This data typically represents some analogue physical signal. A good example of DSP is that used in digital mobile
50. data holding register holding register lt data h. This mode of operation is persistent until it is turned off by a LDPA instruction. The holding register does not record the destination register tag or size. These characteristics are obtained from the instruction that provides the next value of data. Endian specific behaviour may always occur on the data returned by the memory system. There is no non-16-bit equivalent to LDP16U, since it is assumed that all 32 bit data items will be word aligned in memory. The LDPA instruction is used to switch off the unaligned mode of operation initiated by a LDP16U instruction. The unaligned mode may be turned off independently on banks X, Y, Z. For example, the instruction LDPA X Y will turn off the unaligned mode on banks X and Y. Data in the holding registers of these banks will be discarded. Executing LDPA on a bank which is not in unaligned mode is allowed, and will leave that bank in aligned mode. The MPR instruction places the contents of ARM register Rn into the ROB, destined for Piccolo register <dest>. The destination register <dest> may be any full register in the range A0 to Z3. For example, the instruction MPR X0, R3 will transfer the contents of R3 into the ROB, marking the data as destined for the full register X0. No endianness specific behaviour occurs to the data as it is transferred from ARM to
51. e arithmetic logic unit is also able to perform parallel operation program instruction words operating independently upon N/2-bit input operand data words stored in respective halves of a register. 11 Claims, 7 Drawing Sheets. U.S. Patent 5,881,259, Mar. 9, 1999, Sheets 1 to 6 of 7. Labels recoverable from the drawings (FIG. 4): Program Instruction Word 30; Input Operand Size Bit 32; Input Operand Size Hi/Lo Bit 36.
52. e register marked as empty depends on the register operand. The two halves of each register may be marked for refill independently; for example, X0.l^ will mark only the bottom half of X0 for refill, and X0^ will mark the whole of X0 for refill. When the top half (bits 47:16) of a 48 bit register is refilled, the 16 bits of data are written to bits 31:16 and sign extended up to bit 47. If an attempt is made to refill the same register twice (e.g. ADD X1, X0^, X0^), then only one refill takes place. The assembler should only allow the syntax ADD X1, X0, X0^. If a register read is attempted before that register has been refilled, Pic... implied as a 16 bit operand. (Source 2 field diagrams, bits 11 to 0: Register number, Hi/Lo, SCALE; IMMED_6, SCALE; SRC2_MULA; SRC2_MULD.) Multiply double instructions do not allow the use of a constant. Only a 16 bit register can be specified. Bit 10 of the field is used to partly specify which accumulator to use. Some instructions always imply a 32 bit operation (e.g. ADDADD), and in these cases the size bit shall be set to 1, with the Hi/Lo bit used to optionally swap the two 16 bit halves of the 32 bit operand. Some instructions always imply a 16 bit operation (e.g. MUL), and the size bit should be set to 0. The Hi/Lo bit then selects which half of the register is used (it is assumed that the missing size bit is clear). Multiply accumulate instructions allow independent
53. ecified independently of the destination register. The bottom two bits of the destination register give the number (acc) of the 48 bit accumulator to accumulate into. Hence ADDA X0, X1, X2, A0 and ADDA A3, X1, X2, A3 are valid, but ADDA X1, X1, X2, A0 is not. With this class of instruction the result must be written back to a register; the no-writeback encodings of the destination field are not allowed. OPC specifies the type of instruction. In the following, acc is DEST[1:0]. The Sa bit indicates saturation. Action (OPC): 0: dest SAT acc src1 src2 >> scale; 1: dest SAT acc src1 src2 >> scale. ... 32 bit registers. The primary condition code flags are set from the result of the most significant 16 bits; the secondary flags are updated from the least significant half. Only 32 bit registers can be specified as the source for these instructions, although the values can be halfword swapped. The individual halves of each register are treated as signed values. The calculations and scaling are done with no loss of precision. Hence ADDADD X2 ASR 1 will produce the correct averages in the upper and lower halves of X0. Optional saturation is provided for each instruction, for which the Sa bit must be set. Mnemonics: OPC defines
54. ed (encoding fields: COND, BANK, PICCOLO1, OFFSET). Bank 1, General Purpose registers (GPR): 16 32-bit words containing the general purpose register state. Bank 2, Accumulators: 4 32-bit words containing the top 32 bits of the accumulator registers (N.B. duplication with GPR state is necessary for restoration purposes; it would imply another write enable on the register bank otherwise). Bank 3, Register/Piccolo ROB/Output FIFO Status: 1 32-bit word indicating which registers are marked for refill (2 bits for each 32 bit register); 8 32-bit words containing the state of the ROB tags (8 7-bit items stored in bits 7 to 0); 3 32-bit words containing the state of the unaligned ROB latches (bits 17 to 0); 1 32-bit word indicating which slots in the output shift register contain valid data (bit 4 indicates empty, bits 3 to 0 encode the number of used entries); 1 32-bit word containing the state of the output FIFO holding latch (bits 17 to 0). Bank 4, ROB Input Data: 8 32-bit data values. Bank 5, Output FIFO Data: 8 32-bit data values. Bank 6, Loop Hardware: 4 32-bit words containing the loop start addresses; 4 32-bit words containing the loop end addresses; 4 32-bit words containing the loop count (bits 15 to 0); 1 32-bit word containing user defined re-mapping parameters and other re-mapping state. The LDC instruction is used to load Piccolo state when Picco
55. ed that now the value Z1 is equal to the number of coefficients, rather than being equal to the number of coefficients/4 as in the prior art example. For the first iteration of the instruction loop, the base pointer value is zero, and so there is no remapping. However, next time the loop is executed, the base pointer value will be 1 for both the X and Y banks, and so the operands will be mapped as follows: X0.l becomes X0.h; X0.h becomes X1.l; X1.l becomes X1.h; X1.h becomes X0.l (since BASEWRAP is 4); Y0.l becomes Y0.h; Y0.h becomes Y1.l; Y1.l becomes Y1.h; Y1.h becomes Y0.l (since BASEWRAP is 4). Hence it can be seen that on the second iteration the four MULA instructions actually perform the calculations indicated by the fifth to eighth MULA instructions in the example discussed earlier that does not include the remapping of the present invention. Similarly, the third and fourth iterations through the loop perform the calculations formerly performed by the ninth to twelfth and thirteenth to sixteenth MULA instructions of the prior art code. Hence it can be seen that the above code performs exactly the same block filter algorithm as the prior art code, but improves code density within the loop body by a factor of four, since only four instructions need to be provided rather than the sixteen required by the prior art. By employing the register remapping technique in accordance with preferred e
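The mapping listed above for a base pointer of 1 can be reproduced with a small Python sketch. The function names are hypothetical, and the exact formula is an assumption: logical 16-bit registers below RENUMBER are offset by the base pointer modulo RENUMBER, and the base pointer advances by BASEINC modulo BASEWRAP after each loop pass. RENUMBER = 4 and BASEINC = 1 are assumed here so that the result matches the example.

```python
# 16-bit register halves of one bank, in logical order (index 0 is X0.l).
HALF_NAMES = ["0.l", "0.h", "1.l", "1.h", "2.l", "2.h", "3.l", "3.h"]


def remap(logical, base, renumber):
    """Map a logical 16-bit register index to a physical index (assumed formula)."""
    if logical < renumber:
        return (logical + base) % renumber
    return logical            # registers at or above RENUMBER are accessed directly


def next_base(base, baseinc, basewrap):
    """Advance the base pointer at the end of a loop pass (assumed formula)."""
    return (base + baseinc) % basewrap


def name(bank, index):
    """Human-readable register name, e.g. name('X', 1) gives 'X0.h'."""
    return bank + HALF_NAMES[index]


if __name__ == "__main__":
    base = next_base(0, baseinc=1, basewrap=4)    # second pass through the loop
    for logical in range(4):
        print(name("X", logical), "becomes", name("X", remap(logical, base, 4)))
```

With these assumed parameters, the four logical operands rotate by one position on each pass, which is what lets four MULA instructions stand in for sixteen.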
56. electing the bottom 16 bits [15:0]. If the instruction saturates, then the value will be saturated into the range -2^15 to 2^15-1. The 16 bit value is then written back to the indicated register and, if the Write FIFO bit is set, to the output FIFO. If it is written to the output FIFO, then it is held until the next 16 bit value is written, when the values are paired up and placed into the output FIFO as a single 32 bit value. For 32 bit writes, the 48 bit quantity is reduced to a 32 bit quantity by selecting the bottom 32 bits [31:0]. For both 32 bit and 48 bit writes, if the instruction saturates, the 48 bit value will be converted to a 32 bit value in the range -2^31 to 2^31-1. Following the saturation: if writeback to an accumulator is performed, the full 48 bits will be written; if writeback to a 32 bit register is performed, bits [31:0] are written; if writeback to the output FIFO is indicated, again bits [31:0] will be written. The destination size is specified in the assembler by a .l or .h after the register number. If no register writeback is performed, then the register number is unimportant, so omit the destination register to indicate no write to a register, or use ^ to indicate a write only to the output FIFO. For example, SUB , X0, Y0 is equivalent to CMP X0, Y0, and ADD ^, X0, Y0 places the value of X0+Y0 into the output FIFO. If there is no room in the ou
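The size reduction and saturation rules above can be sketched as follows. This is a minimal Python model under stated assumptions: the names `saturate` and `write_value` are hypothetical, values are returned as signed integers, and the saturation ranges are taken to be the usual two's-complement ranges for 16 and 32 bits.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits):
    """Clamp a signed value into the two's-complement range of `bits` bits."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(low, min(high, value))


def write_value(acc48, dest_bits, sat=False):
    """Value written back for a 16, 32 or 48 bit destination.

    acc48 is the 48-bit intermediate result as an unsigned bit pattern.
    """
    x = to_signed(acc48, 48)
    if dest_bits == 16:
        # 16-bit write: saturate, or simply keep bits [15:0].
        return saturate(x, 16) if sat else to_signed(x, 16)
    if sat:
        # 32 and 48 bit writes both saturate to the 32-bit range.
        return saturate(x, 32)
    # Without saturation a 32-bit write keeps bits [31:0]; an accumulator keeps all 48.
    return to_signed(x, 32) if dest_bits == 32 else x
```

Note that a saturating write to an accumulator still stores 48 bits, but the stored value has already been limited to the 32-bit range.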
57. erands are logical register references, and these are then mapped to physical register references identifying specific Piccolo registers 10. All operations, including refilling, operate on the physical register. The register remapping only occurs on the Piccolo instruction stream side; data loaded into Piccolo is always destined for a physical register, and no remapping is performed. The remapping mechanism will be discussed further with reference to FIG. 5, which is a block diagram illustrating a number of the internal components of the Piccolo coprocessor 4. Data items retrieved by the ARM core 2 from memory are placed in the reorder buffer 12, and the Piccolo registers 10 are refilled from the reorder buffer 12 in the manner described earlier with reference to FIG. 2. Piccolo instructions stored in the cache 6 are passed to an instruction decoder 50 within Piccolo 4, where they are decoded prior to being passed to the Piccolo processor core 54. The Piccolo processor core 54 includes the multiplier/adder circuit 20, the accumulate/decumulate circuit 22, and the scale/saturate circuit 24 discussed earlier with reference to FIG. 3. If the instruction decoder 50 is handling instructions forming part of an instruction loop identified by a REPEAT instruction, and the REPEAT instruction has indicated that remapping of a number of registers should take place, then the register remapping logic 52 is employed to perform the ne
58. ered a HALT instruction and has halted. A bit: Piccolo suffered a memory abort (load, store or Piccolo instruction) and has halted. D bit: Piccolo has detected a deadlock condition and has halted (see below). Register S2 is the Piccolo program counter (register layout: Program Counter). ... the condition has occurred and report the exact point of failure by reading the ARM and Piccolo program counters and registers. It should be stressed that deadlock can only happen due to an incorrect program or perhaps another part of the system corrupting Piccolo's state. Deadlock cannot occur due to data starvation or overload. There are several operations available that may be used to control Piccolo from the ARM; these are provided by CDP instructions. These CDP instructions will only be accepted when the ARM is in a privileged state. If this is not the case, Piccolo will reject the CDP instruction, resulting in the ARM taking the undefined instruction trap. The following operations are available: Reset; Enter State Access Mode; Enable; Disable. Piccolo may be reset in software by using the PRESET instruction: PRESET ; Clear Piccolo's state. This instruction is encoded as ... ARM coprocessor instructions to be executed on Piccolo will be busy waited. The PENABLE and PDISABLE instructions are used for fast context switching. When Piccolo
59. add or subtract src2 to src1. OPC specifies the type of instruction. Action (OPC): 0: if (carry set) temp = src1 - src2, else temp = src1 + src2; dest = temp >> scale. 1: if (carry set) temp = src1 - src2, else temp = src1 + src2; dest = temp >> scale, BUT if scale is a shift left then the new value of carry (from src1 - src2 or src1 + src2) is shifted into the bottom. Mnemonics: 0 CAS <dest>, <src1>, <src2>, <scale>; 1 CASC <dest>, <src1>, <src2>, <scale>. Flags: See above. Reasons for inclusion: The Conditional Add or Subtract instruction enables efficient divide code to be constructed. ...he 32 bit positive value in X0 by the 32 bit positive value in X1, with early termination: X2, 0 (clear the quotient); LOG Z0, X0 (number of bits X0 can be shifted); LOG Z1, X1 (number of bits X1 can be shifted); SUBS Z0, Z1, Z0 (X1 shift up so 1's match); BLT div_end (X1 > X0 ... is 0); LSL ... Z0 (... ones); ADD Z0, Z0, 1 (number of tests to do); SUBS Z0, Z0, 0 (set carry); REPEAT Z0; CAS ... X1, LSL 1; ADCN X2, X2, X2; NEXT; div_end. At the end X2 holds the quotient and the remainder can be recovered from X0. The Count Leading Bits instruction allows data to be normalised.
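The divide loop above is built on a conditional add-or-subtract step. As a rough illustration of why that step suffices, here is textbook non-restoring division in Python. It is not a transcription of the Piccolo sequence: the function name and the `bits` parameter are hypothetical, and the sign of the running remainder plays the role that the carry flag plays in the text.

```python
def nonrestoring_divide(dividend, divisor, bits=32):
    """Unsigned division using one conditional add or subtract per bit.

    Returns (quotient, remainder). The dividend must fit in `bits` bits
    and the divisor must be positive.
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = 0
    quotient = 0
    for i in range(bits - 1, -1, -1):
        bit = (dividend >> i) & 1
        if remainder >= 0:
            # Previous step did not go negative: subtract the divisor.
            remainder = 2 * remainder + bit - divisor
        else:
            # Previous step went negative: add the divisor back instead.
            remainder = 2 * remainder + bit + divisor
        # The new quotient bit records whether the step stayed non-negative.
        quotient = (quotient << 1) | (1 if remainder >= 0 else 0)
    if remainder < 0:
        remainder += divisor      # final correction recovers the true remainder
    return quotient, remainder
```

The final correction mirrors the remark in the text that the remainder can be recovered after the loop rather than being maintained on every step.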
60. ister. If the number of instructions in the loop is small (1 or 2), then Piccolo may take extra cycles to set the loop up. If the loop count is register specified, a 32 bit access is implied (S=1), though only the bottom 16 bits are significant and the number is considered to be unsigned. If the loop count is zero, then the action of the loop is undefined. A copy of the loop count is taken, so the register can be immediately reused (or even refilled) without affecting the loop. The REPEAT instruction provides a mechanism to modify the way in which register operands are specified within a loop. The details are described above. Encoding of a REPEAT with a register specified number of loops: 11110, RFIELD_4, 0000, INSTRUCTIONS_8. Encoding of REPEAT with a fixed number of loops: 11110, RFIELD_4, LOOPS_13, INSTRUCTIONS_8. The RFIELD operand specifies which of 16 re-mapping parameter configurations to use inside the loop. RFIELD / Re-mapping Operation: 0 = No Re-mapping Performed; 1 = User Defined Re-mapping; 2 to 15 = Preset Re-mapping Configurations (TBD). The assembler provides two opcodes, REPEAT and NEXT, for defining a hardware loop. The REPEAT goes at the start of the loop and the NEXT delimits the end of the loop, allowing the assembler to calculate the n
61. ister and the ARM is interrupted to say that a breakpoint has been reached. Mnemonics: 00 AND <dest>, <src1>, <src2>, <scale>; 01 ORR <dest>, <src1>, <src2>, <scale>; 10 BIC <dest>, <src1>, <src2>, <scale>; 11 EOR <dest>, <src1>, <src2>, <scale>. The assembler supports the following opcodes: TST <src1>, <src2>; TEQ <src1>, <src2>. TST is an AND with the register write disabled. TEQ is an EOR with the register write disabled. Flags: Z is set if the result is all zeros; N, C, V are preserved; SZ, SN, SC, SV are preserved. Reasons for inclusion: Speech compression algorithms use packed bitfields for encoding information. Bitmasking instructions help for extracting/packing these fields. Max and Min Operation instructions perform maximum and minimum operations. ... Reasons for inclusion: In order to find the strength of a signal, many algorithms scan a sample to find the minimum/maximum of the absolute value of the samples. The MAX and MIN operations are invaluable for this. Depending on whether you wish to find the first or last maximum in the signal, the operands src1 and src2 can be swapped around. MAX X0, X0, 0 will convert X0 to a positive number with clipping below. MIN X0, X0, 255 will clip X0 above. This is useful for graphics processing. Max and Min Operati
62. items as they are returned from the memory system. See below for more details on Big Endian and Little Endian Support. The LDPW instruction transfers a number of data items to a set of registers. The first data item transferred is tagged as destined for <dest>, the next for <dest>+1, etc. When <wrap> transfers have occurred, the next item transferred is tagged as destined for <dest>, and so on. The <wrap> quantity is specified in halfword quantities. For LDPW, the following restrictions apply: the quantity <size> must be a non-zero multiple of 4; <size> must be less than or equal to the size of the ROB for a particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions); <dest> may be one of A0, X0, Y0, Z0; <wrap> may be one of 2, 4, 8 halfwords for LDP32W and one of 1, 2, 4, 8 halfwords for LDP16W. The quantity <size> must be greater than 2*<wrap>, otherwise no wrapping occurs and the LDP instruction shall be used instead. For example, the instruction LDP32W 2 R0 8 will load two words into the ROB, marking them as destined for the full register; R0 will be incremented by 8. The instruction LDP32W 4 R0 16 will load four words into the ROB, marking them as destined for X0, X1, X0, X1 (in that order); R0 will not be affected. For LDP16W, <wrap> may be specified as 1, 2, 4 or 8. The wrap of 1 will cause all dat
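The wrap-around tagging described above for LDP32W can be sketched in Python. The function name `ldp32w_tags` is hypothetical, and the sketch assumes that the destination is register 0 of the named bank and that a wrap of w halfwords spans w/2 full 32-bit registers.

```python
def ldp32w_tags(bank, count, wrap_halfwords):
    """Destination tags for `count` 32-bit words loaded with wrap-around.

    bank is one of 'A', 'X', 'Y', 'Z'; wrap_halfwords is 2, 4 or 8.
    """
    if wrap_halfwords not in (2, 4, 8):
        raise ValueError("wrap must be 2, 4 or 8 halfwords for LDP32W")
    registers_per_wrap = wrap_halfwords // 2      # two halfwords per 32-bit register
    return [f"{bank}{i % registers_per_wrap}" for i in range(count)]


if __name__ == "__main__":
    # Four words with a wrap of 4 halfwords: X0, X1, X0, X1.
    print(ldp32w_tags("X", 4, 4))
```

The second example in the text, four words with a wrap of 4 halfwords, corresponds to `ldp32w_tags("X", 4, 4)`.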
63. ithin the first repeat instruction and so on. The REPEAT instruction does not need to specify which loop is being used. REPEAT loops must be strictly nested. If an attempt is made to nest loops to a depth greater than 4, then the behaviour is unpredictable. (Encoding fields: 11111 101, ZPARAMS, YPARAMS, XPARAMS.) Each PARAMS field is comprised of the following entries, occupying bits 6 to 0: BASEWRAP, BASEINC, RENUMBER. The meaning of these entries is described below. RENUMBER: number of 16 bit registers to perform re-mapping on; may take the values 0, 2, 4, 8. Registers below RENUMBER are re-mapped, those above are accessed directly. BASEINC: the amount the base pointer is incremented at the end of each loop; may take the values 1, 2, or 4. BASEWRAP: the base wrapping modulus; may take the values 2, 4, 8. Mnemonics: RMOV <PARAMS>, <PARAMS>. The <PARAMS> field has the following format: <PARAMS> ::= <BANK><BASEINC> n<RENUMBER> w<BASEWRAP>; <BASEINC> ::= [1|2|4]; <RENUMBER> ::= [0|2|4|8]; <BASEWRAP> ::= [2|4|8]. Each REPEAT instruction specifies the number of instructions in the loop (which immediately follows the REPEAT instruction) and the number of times to go around the loop (which is either a constant or read from a Piccolo reg
64. lo is in state access mode. The BANK field specifies which bank is being loaded. The following sequence will store all Piccolo state to the address in register R0: STP B0, R0, 16 (save private registers); STP B1, R0, 64 (save general purpose registers); STP B2, R0, 16 (save accumulators); STP B3, R0, 56 (save Register/ROB/FIFO status); STP B4, R0, 32 (save ROB data); STP B5, R0, 32 (save output FIFO data); STP B6, R0, 52 (save loop hardware). Debug Mode: Piccolo needs to respond to the same debug mechanisms as supported by ARM, i.e. software through Demon and Angel, and hardware with Embedded ICE. There are several mechanisms for debugging a Piccolo system: ARM instruction breakpoints; data breakpoints (watchpoints); Piccolo instruction breakpoints; Piccolo software breakpoints. ARM instruction and data breakpoints are handled by the ARM Embedded ICE module; Piccolo instruction breakpoints are handled by the Piccolo Embedded ICE module. Piccolo software breakpoints are handled by the Piccolo core. The hardware breakpoint system will be configurable such that both the ARM and Piccolo will be breakpointed. Software breakpoints are handled by a Piccolo instruction (Halt or Break) causing Piccolo to halt execution and enter debug mode (B bit in the status register set) and disable ... The following sequence
65. mbodiments of the present invention, the following benefits can be realised: 1. It improves code density. 2. It can in certain situations hide the latency from marking a register as being empty to that register being refilled by Piccolo's reorder buffer. This could be achieved by unrolling loops, at the cost of increased code size. 3. It enables a variable number of registers to be accessed: by varying the number of loop iterations performed, the number of registers accessed may be varied. 4. It can ease algorithm development. For suitable algorithms, the programmer can produce a piece of code for the nth stage of the algorithm, then use register remapping to apply the formula to a sliding set of data. It will be apparent that certain changes can be made to the above described register remapping mechanism without departing from the scope of the present invention. For example, it is possible for the bank of registers 10 to provide ... grammer in an instruction operand. Whilst these extra registers cannot be accessed directly, the register remapping mechanism can make these registers available. For example, consider the example discussed earlier, where the X bank of registers has four 32 bit registers available to the programmer, and hence eight 16 bit registers can be specified by logical register references. It is possible for the X bank of registers to actually consist of, for example, six 3
66. nds are smaller in size than the maximum data path width. This increases the data processing capabilities of the system considerably whilst not incurring significant extra overhead. One modification that may be made is that in which said arithmetic logic unit has a signal path that functions as a carry chain between bit positions in arithmetic logic operations, and, when executing a parallel operation program instruction word, said signal path is broken between said first N/2-bit input operand data word and said second N/2-bit input operand data word. Whilst the parallel operation program instructions could take many forms, it is preferred that said parallel operation program instruction word performs the arithmetic logic operation of one of: (i) a parallel add, in which two parallel N/2-bit additions are performed; (ii) a parallel subtract, in which two parallel N/2-bit subtractions are performed; (iii) a parallel shift, in which two parallel N/2-bit shift operations are performed; and (iv) a parallel add/subtract, in which an N/2-bit add and an N/2-bit subtraction are performed in parallel. A further refinement of the invention is that, when said input size flag specifies an N-bit size, said high/low location flag indicates whether those bits stored in said high order bit positions should be moved to said lower order bit positions and those bits stored in said low order bit positions should be moved to said high order bit positio
67. ne way of viewing the synergistic interaction of the ARM and Piccolo is that ARM acts as a powerful address generator for Piccolo data, with Piccolo being left free to perform DSP operations requiring the real time handling of large volumes of data to produce corresponding real time results. FIG. 1 illustrates the ARM 2 and Piccolo 4, with the ARM 2 issuing control signals to the Piccolo 4 to control the transfer of data words to and from Piccolo 4. An instruction cache 6 stores the Piccolo program instruction words that are required by Piccolo 4. A single DRAM memory 8 stores all the data and instruction words required by both the ARM 2 and Piccolo 4. The ARM 2 is responsible for addressing the memory 8 and controlling all data transfers. The arrangement with only a single memory 8 and one set of data and address buses is less complex and expensive than the typical DSP approach that requires multiple memories and buses with high bus bandwidths. Piccolo executes a second instruction stream (the digital signal processing program instruction words) from the instruction cache 6, which controls the Piccolo datapath. These instructions include digital signal processing type operations, for example Multiply-Accumulate, and control flow instructions, for example zero overhead loop instructions. These instructions operate on data which is held in Piccolo registers 10 (see FIG
68. ngle source register. 5. Apparatus as claimed in claim 4, wherein said arithmetic logic unit has a signal path that functions as a carry chain between bit positions in arithmetic logic operations, and, when executing a parallel operation program instruction word, said signal path is broken between said first N/2-bit input operand data word and said second N/2-bit input operand data word. 6. Apparatus as claimed in claim 4, wherein said parallel operation program instruction word performs the arithmetic logic operation of one of: (i) a parallel add, in which two parallel N/2-bit additions are performed; (ii) a parallel subtract, in which two parallel N/2-bit subtractions are performed; (iii) a parallel shift, in which two parallel N/2-bit shift operations are performed; and (iv) a parallel add/subtract, in which an N/2-bit add and an N/2-bit subtraction are performed in parallel. 7. Apparatus as claimed in claim 1, wherein, when said input size flag specifies an N-bit size, said high/low location flag indicates whether those bits stored in said high order bit positions should be moved to said lower order bit positions and those bits stored in said low order bit positions should be moved to said high order bit positions prior to use as an N-bit input operand data word. 8. Apparatus as claimed in claim 1, wherein said arithmetic logic unit has an N-bit datapath there
69. nique ID and revision code (register layout: Implementor, ..., Revision). Bits [3:0] contain the revision number for the processor. Bits [15:4] contain a 3 digit part number in binary coded decimal format: 0x500 for Piccolo. Bits [23:16] contain the architecture version: 0x00 = Version 1. Bits [31:24] contain the ASCII code of an implementer's trademark: 0x41 = A = ARM Ltd. Register S1 is the Piccolo status register. Primary condition code flags (N, Z, C, V). Secondary condition code flags (SN, SZ, SC, SV). E bit: Piccolo has been disabled by the ARM and has halted. U bit: Piccolo encountered an UNDEFINED instruction and has halted. B bit: Piccolo encountered a BREAKPOINT and has halted. H bit: Piccolo encount... Piccolo has stalled waiting for either a register to be refilled or the output FIFO to have an available entry. The coprocessor interface is busy-waiting because of insufficient space in the ROB or insufficient items in the output FIFO. If both of these conditions are detected, Piccolo sets the D bit in its status register, halts, and rejects the ARM coprocessor instruction, causing ARM to take the undefined instruction trap. This detection of deadlock conditions allows a system to be constructed which can at least warn the programmer that
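The ID register bit fields listed above can be unpacked with a few shifts and masks. The following Python sketch uses a hypothetical function name and dictionary keys; only the bit positions come from the text.

```python
def decode_piccolo_id(word):
    """Split a 32-bit ID register value into the documented fields."""
    return {
        "implementor": chr((word >> 24) & 0xFF),       # bits [31:24], ASCII code
        "architecture": (word >> 16) & 0xFF,           # bits [23:16]
        "part_number": f"{(word >> 4) & 0xFFF:03X}",   # bits [15:4], 3 BCD digits
        "revision": word & 0xF,                        # bits [3:0]
    }


if __name__ == "__main__":
    # Implementor 0x41, architecture 0x00, part number 0x500, revision 2 (example value).
    print(decode_piccolo_id(0x41005002))
```

Printing the part number in hexadecimal is what makes the binary coded decimal digits read directly as 500.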
70. ns prior to use as an N-bit input operand data word. This feature is particularly useful during transform operations. A particularly effective hardware implementation of this functionality comprises at least one multiplexer responsive to said high/low location flag for selecting, for supply to the low order N/2 bits of said datapath, an N/2-bit input operand data word stored in one of high order bit positions of said source register and low order bit positions of said source register. In order to deal with signed arithmetic without undue complication, it is preferred to provide a circuit for sign extending an N/2-bit input operand data word prior to input to said N-bit datapath. Viewed from another aspect, the present invention provides a method of processing data, said method comprising the steps of: (i) storing data words to be manipulated in a plurality of registers, each of said registers having at least an N-bit capacity; and (ii) in response to program instruction words, performing arithmetic logic operations specified by said program instruction words; (iii) wherein at least one program instruction word includes: (a) a source register bit field specifying a source register of said plurality of registers storing an input operand data word for said program instruction word; (b) an input operand size flag specifying whether said input operand data word has an N-bit size or an N/2-bit size; and (c) when sai
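The combined effect of the input operand size flag and the high/low location flag can be sketched in Python for N = 32. The function name and argument names are hypothetical; the behaviour follows the description above: an N-bit operand may have its halves swapped, while an N/2-bit operand is taken from the selected half and sign extended to N bits.

```python
def select_operand(reg, size_is_n, hi):
    """Model operand selection for a 32-bit source register (N = 32).

    size_is_n=True : N-bit operand; hi=True swaps the two 16-bit halves.
    size_is_n=False: N/2-bit operand from the high (hi=True) or low half,
                     sign extended to 32 bits.
    The result is returned as an unsigned 32-bit pattern.
    """
    reg &= 0xFFFFFFFF
    high, low = reg >> 16, reg & 0xFFFF
    if size_is_n:
        return ((low << 16) | high) if hi else reg
    half = high if hi else low
    # Sign extend the selected 16-bit half into the upper 16 bits.
    return (half | 0xFFFF0000) if (half & 0x8000) else half
```

In hardware terms, the half selection corresponds to the multiplexer feeding the low order N/2 bits of the datapath, and the final line corresponds to the sign extending circuit.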
71. nter State Access Mode. When executed, the PSTATE instruction will: halt Piccolo (if it is not already halted), setting the E bit in Piccolo's Status Register; and configure Piccolo into its State Access Mode. Executing the PSTATE instruction may take several cycles to complete, as Piccolo's instruction pipeline must drain before it can halt. Whilst it is executing, following ... Piccolo is disabled by executing the PDISABLE instruction: PDISABLE ; Disable Piccolo. When this instruction is executed, the following occurs: Piccolo's instruction pipeline will drain; Piccolo will halt and the H bit in the Status register set. (Encodings shown in this passage: COND 1110 0001 0000 0000 PICCOLO1 0000 and COND 1110 0011 0000 0000 PICCOLO1 0000.) The Piccolo instruction cache holds the Piccolo instructions which control the Piccolo datapath. If present, it is guaranteed to hold at least 64 instructions, starting on a 16 word boundary. The following ARM opcode assembles into an MCR. Its action is to force the cache to fetch a line of (16) instructions starting at the specified address (which must be on a 16 word boundary). This fetch occurs even if the cache already holds data related to this address. PMIR Rm
72. ol circuit 16 using the oldest loaded value destined for that register within the input buffer 12. The reorder buffer holds 8 tagged values. The input buffer 12 has a form similar to a FIFO, except that data words can be extracted from the centre of the queue, after which later stored words will be passed along to fill the space. Accordingly, the data words furthest from the input are the oldest, and this can be used to decide which data word should be used to refill a register when the input buffer 12 holds two data words with the correct tag Rn. Piccolo outputs data by storing it in an output buffer 18 (FIFO), as shown in FIG. 3. Data is written to the FIFO sequentially, and read out to memory in the same order by ARM. The output buffer 18 holds 8 32-bit values. Piccolo connects to ARM via the coprocessor interface (CP Control signals of FIG. 1). On execution of an ARM coprocessor instruction, Piccolo can either execute the instruction, cause the ARM to wait until Piccolo is ready before executing the instruction, or refuse to execute the instruction. In the last case ARM will take an undefined instruction exception. The most common coprocessor instructions that Piccolo will execute are LDC and STC, which respectively load and store data words to and from the memory 8 via the data bus, with ARM generating all addresses. It is these instructions which load data into the reorder buffer and store data from the output buffer 18
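The reorder buffer behaviour described above (a FIFO-like queue from which the oldest entry with a matching tag can be pulled out of the middle) can be modelled with a short Python class. The class and method names are hypothetical, and returning False from `load` merely stands in for the busy-wait that the text describes when the buffer is full.

```python
class ReorderBuffer:
    """Tagged input buffer: oldest entry first, extraction by tag."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = []            # list of (tag, value), oldest at index 0

    def load(self, tag, value):
        """Append a tagged value; return False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((tag, value))
        return True

    def refill(self, tag):
        """Remove and return the oldest value destined for `tag`, or None."""
        for index, (entry_tag, value) in enumerate(self.entries):
            if entry_tag == tag:
                del self.entries[index]   # later entries move along to fill the gap
                return value
        return None
```

A register that finds no matching entry would, in the real device, stay empty until a suitable value arrives; here that case is simply reported as None.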
73. on words to perform arithmetic logic operations specified by said program instruction words, wherein (iii) said arithmetic logic unit is responsive to at least one program instruction word that includes: (a) a source register bit field specifying a source register of said plurality of registers storing an input operand data word for said program instruction word; (b) an input operand size flag specifying whether said input operand data word has an N-bit size or an N/2-bit size; and (c) when said input size flag specifies an N/2-bit size, a high/low location flag indicating in which of high order bit positions of said source register and low order bit positions of said source register said input operand data word is located. 2. Apparatus as claimed in claim 1, comprising an N-bit data bus for transferring data words between a data storage device and said plurality of registers. 3. Apparatus as claimed in claim 2, comprising an input buffer for receiving data words from said N-bit data bus and for supplying said N-bit data words to said plurality of registers. 4. Apparatus as claimed in claim 1, wherein said arithmetic logic unit is responsive to at least one parallel operation program instruction word that performs separate arithmetic logic operations upon a first N/2-bit input operand data word and a second N/2-bit input operand data word stored within respective high order bit positions and low order bit positions of a si
74. ons in Parallel instructions perform maximum and minimum operations on parallel 16 bit data 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 141312111098 76543210 101 O 1 F S DEST IST R 5 SRC2 P D 1 C 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 141312111098 76543210 OPC specifies the type of instruction Action OPC 0 dest srcl lt 8122 src 1 src2 1 dest srcl gt src2 7 src src2 0 MIN lt dest gt lt srcl gt lt src2 gt dest lt srcl gt lt src2 gt Flags Z is set if the result is zero N is set if the result is negative For Max C is set if src2 gt srcl destesrcl case For Min C is set if src2 gt srcl dest src2 case V preserved 111 1 F S DEST SRC1 SRC2_PARALLEL D 1 45 50 55 60 65 OPC specifies the type of instruction Action OPC 0 dest sre1 1 lt 81222 8121 sre2 1 dest h srcl h lt src2 h 2 srcl h src2 h 1 dest l src1 l gt src2 1 2 8101 sre2 1 dest h srcl h gt src2 h src1 h src2 h Mnemonics 0 MINMIN dest lt srcl gt lt src2 gt 1 MAXMAX dest lt srcl gt lt src2 gt Flags Z is set if the upper 16 bits of the result is zero N is set if the upper 16 bits of the result is negative C For Max C is set if src2 h gt sre1 h dest srcl case For Min C is set if src2 h srcl h dest src2 case V preserved 5 881 259 47 SZ SN SC SV are set similarly f
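The parallel form of Max and Min treats a 32-bit register as two independent 16-bit lanes. A minimal sketch of that lane-wise behaviour is given below. It assumes the halves are compared as signed 16-bit values (suggested by the "negative" wording in the flag description) and it does not model the flags; the function names are hypothetical.

```python
def _signed16(value):
    """Interpret the low 16 bits of `value` as a signed 16-bit number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def _per_half(op, src1, src2):
    """Apply `op` separately to the high and low 16-bit halves."""
    high = op(_signed16(src1 >> 16), _signed16(src2 >> 16)) & 0xFFFF
    low = op(_signed16(src1), _signed16(src2)) & 0xFFFF
    return (high << 16) | low


def minmin(src1, src2):
    """Lane-wise minimum of two packed 16-bit pairs."""
    return _per_half(min, src1, src2)


def maxmax(src1, src2):
    """Lane-wise maximum of two packed 16-bit pairs."""
    return _per_half(max, src1, src2)
```

For example, with `src1 = 0x0001FFFF` (halves 1 and -1) and `src2 = 0xFFFF0002` (halves -1 and 2), each half of the result is chosen on its own.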
75. Primary Examiner: Richard L. Ellis. Attorney, Agent, or Firm: Nixon & Vanderhye P.C. [57] ABSTRACT: A data processing system having a plurality of registers (10) and an arithmetic logic unit (20, 22, 24) includes program instruction words having a source register bit field (Sn) specifying one of the registers storing an input operand data word, together with an input operand size flag indicating whether the input operand has an N-bit size or N/2-bit size, together with a high/low location flag indicating which of the high order bit positions or low order bit positions stores the input operand if it is of the smaller size. It is preferred that th
76. or the lower 16-bit halves. Reasons for inclusion: As for 32-bit Max and Min. Move Long Immediate Operation instructions allow a register to be set to any signed 16-bit, sign-extended value. Two of these instructions can set a 32-bit register to any value (by accessing the high and low half in sequence). For moves between registers, see the select operations. Encoding fields: 11100, DEST, IMMEDIATE_15. Mnemonics: MOV <dest>, #<imm_16>. The assembler will provide a non-interlocking NOP operation using this MOV instruction, i.e. NOP is equivalent to MOV , #0. Flags: Flags are unaffected. Reasons for inclusion: Initialising registers/counters. Multiply Accumulate Operation instructions perform signed multiplication with accumulation or de-accumulation, scaling and saturation. Reasons for inclusion: A one-cycle sustained MULA is required for FIR code. MULS is used in the FFT butterfly. It is also useful for multiply with rounding. For example, A0 = (X0 * X1 + 16384) >> 15 can be done in one cycle by holding 16384 in another accumulator (A1, for example). Different <dest> and <acc> is also required for the FFT kernel. Multiply Double Operation instructions perform signed multiplication, doubling the result prior to accumulation or de-accumulation, scaling and saturation.
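The multiply-with-rounding example in this excerpt appears to be the usual fixed-point idiom: add half of the final unit (16384 is 2^14) before shifting right by 15. The sketch below reads the garbled expression as A0 = (X0 * X1 + 16384) >> 15, which is an assumption, and uses a hypothetical function name. It models only the arithmetic, not the accumulator width or saturation.

```python
def multiply_round_q15(x0, x1, rounding=16384, shift=15):
    """Multiply two signed values, add a rounding constant, then shift.

    Python's >> on negative integers is an arithmetic shift, which matches
    the ASR scaling described for the accumulate instructions.
    """
    return (x0 * x1 + rounding) >> shift
```

With `x0 = 3` and `x1 = 16384` (0.5 in Q15), the exact product is 1.5; the rounding constant makes the result 2, whereas a plain shift would truncate it to 1.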
77. order buffer then the result is undefined. Piccolo's execution unit will stall if a read is made to an empty register. The Input Reorder Buffer (ROB) sits between the coprocessor interface and Piccolo's register bank. Data is loaded into the ROB with ARM coprocessor transfers. The ROB contains a number of 32-bit values, each with a tag indicating the Piccolo register that the value is destined for. The tag also indicates whether the data should be transferred to a whole 32-bit register or just to the bottom 16 bits of a 32-bit register. If the data is destined for a whole register, the bottom 16 bits of the entry will be transferred to the bottom half of the target register and the top 16 bits will be transferred to the top half of the register (sign-extended if the target register is a 48-bit accumulator). If the data is destined for just the bottom half of a register (a so-called 'Half Register'), the bottom 16 bits will be transferred first. The register tag always refers to a physical destination register; no register remapping is performed (see below regarding register remapping). On every cycle Piccolo attempts to transfer a data entry from the ROB to the register bank as follows. Each entry in the ROB is examined and the tags compared with the registers that are empty; it is determined whether a transfer can be made from part or all of an entry to a register. From the set of entries that can make a transfer, the oldest entry i
78. parison. The multiplexor 72 receives as its two inputs the logical register reference and the output from modulo circuit 70 (the remapped register reference). In preferred embodiments of the present invention, if the logical register reference is less than the REGCOUNT value, then the logic 74 instructs the multiplexor 72 to output the remapped register reference as the Physical Register Reference. If, however, the logical register reference is greater than or equal to the REGCOUNT value, then the logic 74 instructs the multiplexor 72 to output the logical register reference directly as the physical register reference. As previously mentioned, in preferred embodiments it is the REPEAT instruction which invokes the remapping mechanism. As will be discussed in more detail later, REPEAT instructions provide four zero-cycle loops in hardware. These hardware loops are illustrated in FIG. 5 as part of the instruction decoder 50. Each time the instruction decoder 50 requests an instruction from cache 6, the cache returns that instruction to the instruction decoder, whereupon the instruction decoder determines whether the returned instruction is a REPEAT instruction. If so, one of the hardware loops is configured to handle that REPEAT instruction. Each repeat instruction specifies the number of instructions in the loop and the number of times to go around the loop (which is either a constant or read from a Piccolo register).
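The remapping path (adder, modulo circuit, then a multiplexor controlled by the REGCOUNT comparison) can be sketched as a small function. This is a simplified model under stated assumptions: the wrap is modelled as a plain modulo by the register wrap value, and the names `physical_register`, `base_pointer`, `regcount` and `reg_wrap` are illustrative rather than taken from the hardware.

```python
def physical_register(logical, base_pointer, regcount, reg_wrap):
    """Map a logical register reference to a physical register reference.

    References at or above `regcount` bypass remapping (the multiplexor
    passes the logical reference straight through). Lower references are
    offset by the base pointer and wrapped within the remapped region.
    """
    if logical >= regcount:
        return logical
    return (logical + base_pointer) % reg_wrap


# A loop that always names logical register 0 sees a moving window of
# physical registers as the base pointer advances each iteration.
window = [physical_register(0, base, regcount=4, reg_wrap=4) for base in range(6)]
```

Here `window` walks through physical registers 0, 1, 2, 3 and then wraps back to 0 and 1, while a reference such as logical register 5 is never remapped.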
79. ruction. The BASEUPDATE signal will be produced periodically by the loop hardware, for example each time the instruction loop is to be repeated. When a BASEUPDATE signal is received by the storage element 66, the storage element will overwrite the previous base pointer value with the next base pointer value provided by the multiplexor 60. In this manner, the base pointer value supplied to the ReMap logic 58 will change to the new base pointer value. The physical register to be accessed inside a remapped section of a register bank is determined by the addition of a logical register reference, contained within an operand of an instruction, and the base pointer value provided by the base update logic 58. This addition is performed by adder 68 and the output is passed to modulo circuit 70. In preferred embodiments, the modulo circuit 70 also receives a register wrap value, and if the output signal from the adder 68 (the addition of the logical register reference and the base pointer value) exceeds the register wrap value, the result will wrap through to the bottom of the remapped region. The output of the modulo circuit 70 is then provided to multiplexor 72. A REGCOUNT value is provided to logic 74 within Remap block 56, identifying the number of registers within a bank which are to be remapped. The logic 74 compares this REGCOUNT value with the logical register reference, and passes a control signal to multiplexor 72 dependent on the result of that com
80. s: CMP <src1>, <src2>; CMN <src1>, <src2>. CMP is a subtract which sets the flags with the register write disabled. CMN is an add which sets the flags with register write disabled. Flags: These have been discussed above. Reasons for inclusion: ADC is useful for inserting carry into the bottom of a register following a shift/MAX/MIN operation. It is also used to do a 32/32-bit divide. It also provides for extended-precision adds. The addition of an N bit gives finer control of the flags, in particular the carry. This enables a 32/32-bit division at 2 cycles per bit. Saturated adds and subtracts are needed for G.729 etc. Incrementing/decrementing counters. RSB is useful for calculating shifts (32 - x is a common operation). Saturated RSB is needed for saturated negation (used in G.729). For the Add and Subtract instructions, OPC specifies the type of instruction. Action (OPC): 100N0: dest = (src1 + src2) >> scale; 110N0: dest = (src1 - src2) >> scale; 10001: dest = SAT((src1 + src2) >> scale); 11001: dest = SAT((src1 - src2) >> scale); 01110: dest = (src2 - src1) >> scale; 01111: dest = SAT((src2 - src1) >> scale). Add/subtract accumulate instructions perform addition and subtraction with accumulation and scaling/saturation. Unlike the multiply accumulate instructions, the accumulator number cannot be sp
81. s are common to most instructions and described in detail in separate sections, as is the register re-mapping. Most instructions require two source operands: Source 1 and Source 2. Some exceptions are saturating absolute. The Source 1 (SRC1) operand has the following 7-bit format (bits 18 to 12), with fields Size, Refill, Register Number and Hi/Lo. The elements of the field have the following meaning. Size: indicates the size of operand to read (1 = 32-bit, 0 = 16-bit). Refill: specifies that the register should be marked as empty after being read and can be refilled from the ROB. Register Number: encodes which of the 16 32-bit registers to read. Hi/Lo: for 16-bit reads, indicates which half of the 32-bit register to read; for 32-bit operands, when set, indicates that the two 16-bit halves of the register should be interchanged. Portion of register accessed, by Size and Hi/Lo: Size 0, Hi/Lo 0: low 16 bits; Size 0, Hi/Lo 1: high 16 bits; Size 1, Hi/Lo 0: full 32 bits; Size 1, Hi/Lo 1: full 32 bits, halves swapped. The register size is specified in the assembler by adding a suffix to the register number: .l for the low 16 bits, .h for the high 16 bits, or .x for 32 bits with the upper and lower sixteen bits interchanged. The general So
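The Size and Hi/Lo selection can be illustrated with a short sketch of the operand read. It follows the four Size/Hi/Lo combinations listed in this excerpt and additionally sign-extends 16-bit reads to 32 bits, which is how the FIG. 4 description elsewhere in the document says the datapath is padded. The function name `read_src1` and the choice to return a 32-bit pattern are assumptions made for illustration; Refill handling is not modelled.

```python
def read_src1(register_value, size, hi_lo):
    """Return the 32-bit operand pattern selected by the Size and Hi/Lo bits.

    size:  1 = 32-bit read, 0 = 16-bit read
    hi_lo: for 16-bit reads, 1 selects the high half and 0 the low half;
           for 32-bit reads, 1 swaps the two 16-bit halves
    """
    register_value &= 0xFFFFFFFF
    low = register_value & 0xFFFF
    high = register_value >> 16
    if size == 1:
        return ((low << 16) | high) if hi_lo else register_value
    half = high if hi_lo else low
    # Sign-extend the selected half into the upper 16 bits of the datapath.
    return half | 0xFFFF0000 if half & 0x8000 else half
```

In assembler terms, the four cases correspond to the `.l`, `.h`, unsuffixed and `.x` forms of a register operand.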
82. s selected and its data transferred to the register bank. The tag of this entry is updated to mark the entry as empty. If only part of the entry was transferred, only the part transferred is marked empty. For example, if the target register is completely empty and the selected ROB entry contains data destined for a full register, the whole 32 bits are transferred and the entry is marked empty. If the bottom half of the target register is empty and the ROB entry contains data destined for the bottom half of a register, the bottom 16 bits of the ROB entry are transferred to the bottom half of the target register and the bottom half of the ROB is marked as empty. The high and low 16 bits of data in any entry can be transferred independently. If no entry contains data that can be transferred to the register bank, no transfer is made that cycle. The table below describes all possible combinations of target ROB entry and target register status. Columns give the target Rn status (empty; low half empty; high half empty); rows give the target ROB entry status. Full Register, both halves valid: [Rn empty] Rn.h <- entry.h, Rn.l <- entry.l, entry marked empty; [low half empty] Rn.l <- entry.l, entry.l marked empty; [high half empty] Rn.h <- entry.h, entry.h marked empty. Full Register, high half valid: [Rn empty] Rn.h <- entry.h, entry marked empty; [high half empty] Rn.h <- entry.h, entry marked empty. Full Register, low half valid: [Rn empty] Rn.l <- entry.l, entry marked empty; [low half empty] Rn.l <- entry.l, entry marked empty
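One cycle of the ROB-to-register transfer described here can be sketched as below. This is a deliberately reduced model: it covers only entries destined for a full register, ignores Half Register entries and accumulator sign extension, and uses hypothetical dictionary keys (`reg`, `data`, `lo_valid`, `hi_valid`, `lo_empty`, `hi_empty`, `value`) to hold the state.

```python
def rob_transfer_step(rob, regs):
    """Perform at most one transfer; `rob` is ordered oldest entry first.

    Returns True if some part of an entry was transferred this cycle.
    """
    for index, entry in enumerate(rob):
        reg = regs[entry["reg"]]
        move_low = entry["lo_valid"] and reg["lo_empty"]
        move_high = entry["hi_valid"] and reg["hi_empty"]
        if not (move_low or move_high):
            continue  # this entry cannot transfer; try the next oldest
        if move_low:
            reg["value"] = (reg["value"] & 0xFFFF0000) | (entry["data"] & 0xFFFF)
            reg["lo_empty"] = False
            entry["lo_valid"] = False  # only the transferred part is marked empty
        if move_high:
            reg["value"] = (reg["value"] & 0x0000FFFF) | (entry["data"] & 0xFFFF0000)
            reg["hi_empty"] = False
            entry["hi_valid"] = False
        if not (entry["lo_valid"] or entry["hi_valid"]):
            del rob[index]  # whole entry is now empty
        return True
    return False
```

The sketch shows the two properties the text stresses: the oldest transferable entry wins, and the two 16-bit halves of an entry can leave the ROB on different cycles.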
83. s that which is generated by either bit 31 or bit 15 of the result, based on whether the destination is 32 or 16 bits wide. The standard arithmetic instructions can be divided up into a number of types, depending on how the flags are set. In the case of Add and Subtract instructions, if the N bit is set then all flags are preserved. If the N bit is not set then the flags are updated as follows: Z is set if the full 48-bit result was 0. N is set if the full 48-bit result had bit 47 set (was negative). V is set if either: the destination register is 16-bit and the signed result will not fit into a 16-bit register (not in the range -2^15 <= x < 2^15); or the destination register is a 32/48-bit register and the signed result will not fit into 32 bits. If <dest> is a 32- or 48-bit register, then the C flag is set if there is a carry out of bit 31 when summing <src1> and <src2>, or if no borrow occurred from bit 31 when subtracting <src2> from <src1> (the same carry value you would expect on the ARM). If <dest> is a 16-bit register, then the C flag is set if there is a carry out of bit 15 of the sum. The secondary flags (SZ, SN, SV, SC) are preserved. In the case of instructions which either carry out a multiplication or accumulate from a 48-bit register: Z is set if the full 48-bit result was 0. N is set if the full 48-bit result had bit 47 set (was negative). V is set if either (1) the destination register is 16-bit and
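The flag rules for an addition can be written out as a small sketch. It assumes no scaling is applied, treats the sources as signed Python integers, and models only the primary flags; the function name `add_flags` and the returned dictionary are illustrative choices, not part of the instruction set description.

```python
def add_flags(src1, src2, dest_bits=32):
    """Compute Z, N, V and C for src1 + src2.

    dest_bits is 16 for a 16-bit destination, or 32 for a 32- or 48-bit
    destination (both use the 32-bit overflow and carry rules).
    """
    total = src1 + src2
    result48 = total & ((1 << 48) - 1)
    limit = 1 << (dest_bits - 1)
    mask = (1 << dest_bits) - 1
    carry_out = ((src1 & mask) + (src2 & mask)) >> dest_bits
    return {
        "Z": result48 == 0,                   # full 48-bit result is zero
        "N": bool(result48 >> 47),            # bit 47 of the 48-bit result
        "V": not (-limit <= total < limit),   # signed result does not fit
        "C": carry_out == 1,                  # carry out of bit 31 or bit 15
    }
```

For instance, adding 1 to 0x7FFFFFFF with a 32-bit destination overflows (V set) without producing a carry, while adding 1 to -1 gives zero with a carry out of bit 31.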
84. te an RMOV to load the user-defined parameters. Action: dest = SAT((src1 >= 0) ? src1 : -src1). The value is always saturated. In particular, the absolute value of 0x80000000 is 0x7fffffff and NOT 0x80000000. Mnemonic: SABS <dest>, <src1>. Flags: Z is set if the result is zero. N is preserved. C is set if src1 < 0 (dest = -src1 case). V is set if saturation occurred. Reasons for inclusion: Useful in many DSP applications. Select Operations (Conditional Moves) serve to conditionally move either source 1 or source 2 into the destination register. A select is always equivalent to a move. There are also parallel operations for use after parallel adds/subtracts. Note that both source operands may be read by the instruction for implementation reasons, and so if either one is empty the instruction will stall, irrespective of whether the operand is strictly required. OPC specifies the type of instruction. Action (OPC): 00: If <cond> holds for primary flags then dest = src1 else dest = src2. 01: If <cond> holds for the primary flags then dest.h = src1.h else dest.h = src2.h; if <cond> holds for the secondary flags then dest.l = src1.l else dest.l = src2.l. 10: If <cond> holds for the primary flags then dest.h = src1.h else dest.h = src2.h; if <cond> fails for the secondary flags then dest.l = src1.l else dest.l = src2.l. 11: Reserved. Mnemonics: 00: SEL <
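The saturating behaviour of the absolute-value operation is easiest to see on the one input whose magnitude does not fit in 32 bits. The following is a minimal sketch for a 32-bit signed operand; the function name `sabs` and the returned `(value, saturated)` pair are illustrative, with `saturated` standing in for the V flag.

```python
INT32_MAX = (1 << 31) - 1


def sabs(src1):
    """Saturating absolute value of a signed 32-bit integer.

    Returns (result, saturated). The only input that saturates is
    -0x80000000, whose true magnitude is one more than INT32_MAX.
    """
    magnitude = src1 if src1 >= 0 else -src1
    saturated = magnitude > INT32_MAX
    return min(magnitude, INT32_MAX), saturated
```

So `sabs(-0x80000000)` yields 0x7FFFFFFF with saturation reported, matching the "0x7fffffff and NOT 0x80000000" remark, while ordinary inputs pass through unchanged in magnitude.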
85. the signed result will not fit into a 16-bit register (not in the range -2^15 <= x < 2^15), or (2) the destination register is a 32/48-bit register and the signed result will not fit into 32 bits. C is preserved. The secondary flags (SZ, SN, SV, SC) are preserved. The other instructions, including logical operations, parallel adds and subtracts, max and min, shifts etc., are covered below. The Add and Subtract instructions add or subtract two registers, scale the result, and then store back to a register. The operands are treated as signed values. Flag updating for the non-saturating variants is optional and may be suppressed by appending an N to the end of the instruction. Action (OPC), continued: 101N0: dest = (src1 + src2 + Carry) >> scale; 111N0: dest = (src1 - src2 + Carry - 1) >> scale. Mnemonics: 100N0: ADD{N} <dest>, <src1>, <src2>, <scale>; 110N0: SUB{N} <dest>, <src1>, <src2>, <scale>; 10001: SADD <dest>, <src1>, <src2>, <scale>; 11001: SSUB <dest>, <src1>, <src2>, <scale>; 01110: RSB <dest>, <src1>, <src2>, <scale>; 01111: SRSB <dest>, <src1>, <src2>, <scale>; 101N0: ADC{N} <dest>, <src1>, <src2>, <scale>; 111N0: SBC{N} <dest>, <src1>, <src2>, <scale>. The assembler supports the following opcode
86. through. 9. Apparatus as claimed in claim 8, comprising at least one multiplexer responsive to said high/low location flag for selecting, for supply to the low order N/2 bits of said datapath, an N/2-bit input operand data word stored in one of high order bit positions of said source register and low order bit positions of said source register. 10. Apparatus as claimed in claim 8, comprising a circuit for sign extending an N/2-bit input operand data word prior to input to said N-bit datapath. 11. A method of processing data, said method comprising the steps of: (i) storing data words to be manipulated in a plurality of registers, each of said registers having at least an N-bit capacity; and (ii) in response to program instruction words, performing arithmetic logic operations specified by said program instruction words; (iii) wherein at least one program instruction word includes: (a) a source register bit field specifying a source register of said plurality of registers storing an input operand data word for said program instruction word; (b) an input operand size flag specifying whether said input operand data word has an N-bit size or an N/2-bit size; and (c) when said input size flag specifies an N/2-bit size, a high/low location flag indicating in which of high order bit positions of said source register and low order bit positions of said source register said input operand data word is located.
87. tput FIFO for a value, Piccolo stalls waiting for space to become available. If a 16-bit value is written, for example ADD X0.h, X1, X2, then the value is latched until a second 16-bit value is written. The two values are then combined and placed into the output FIFO as a 32-bit number. The first 16-bit value written always appears in the lower half of the 32-bit word. Data entered into the output FIFO is marked as either 16- or 32-bit data, to allow endianness to be corrected on big-endian systems. If a 32-bit value is written between two 16-bit writes then the action is undefined. Within a REPEAT instruction, register re-mapping is supported, allowing a REPEAT to access a moving 'window' of registers without unrolling the loop. This is described in more detail below. In preferred embodiments of the present invention, the REPEAT instruction provides a mechanism to modify the way in which register operands are specified within a loop. Under this mechanism, the registers to be accessed are determined by a function of the register operand in the instruction and an offset into the register bank. The offset is changed in a programmable manner, preferably at the end of each instruction loop. The mechanism may operate independently on registers residing in the X, Y and Z banks. In preferred embodiments, this facility is not available for registers in the A bank. The notion of a logical and physical register can be used. The instruction op
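The pairing of 16-bit writes into 32-bit output words can be sketched as follows. The class `OutputFifo` and its method names are hypothetical; the capacity of 8 words comes from the output buffer description elsewhere in the document, and the cases the text calls a stall or undefined behaviour are modelled as exceptions purely so the sketch can signal them.

```python
class OutputFifo:
    """Output FIFO that packs pairs of 16-bit writes into 32-bit words."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.words = []            # (value, width_tag) pairs, oldest first
        self.pending_half = None   # first 16-bit value, latched until paired

    def _push(self, item):
        if len(self.words) >= self.capacity:
            raise BufferError("FIFO full: Piccolo would stall here")
        self.words.append(item)

    def write16(self, value):
        value &= 0xFFFF
        if self.pending_half is None:
            self.pending_half = value
            return
        # The first value written lands in the lower half of the word.
        self._push(((value << 16) | self.pending_half, 16))
        self.pending_half = None

    def write32(self, value):
        if self.pending_half is not None:
            raise RuntimeError("32-bit write between two 16-bit writes is undefined")
        self._push((value & 0xFFFFFFFF, 32))
```

Each stored word carries a 16 or 32 width tag, mirroring the marking the text says is used to correct endianness on big-endian systems.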
88. ubtract instructions are useful for subtraction on two signed 16 bit quantities held in pairs in performing operations on complex numbers held in a single 5 881 259 41 42 32 bit register They used in the FFT kernel It is also EXAMPLE 1 useful for simple addition subtraction of vectors of 16 bit data allowing two elements to be processed per cycle The Branch conditional instruction allows conditional Divide the 32 bit unsigned value in X0 by the 16 bit changes in control flow Piccolo may take three cycles to 5 unsigned value in with the assumption that 0 lt execute a taken branch X1 16 and X1 h 0 31 3029282726 25 24 23 222120 19 18 17 1615 1413 12 1110987654 3210 11111 IMMEDIATE 16 COND Action Branch by offset if lt cond gt holds according to the primary flags 15 LSL X1 X1 15 shift up divisor The offset is a signed 16 bit number of words At the d AT 16 CR feg moment the range of the offset is restricted to 32768 to CASC X0 X1 LSL 1 32767 words NEXT The address calculation performed is target address branch instruction address 4 OFFSET Mnemonics At the end of the loop X0 1 holds the quotient of the divide remainder can be recovered from X0 h depending on the B cond destination label value of carry Flags 25 Unaffected EXAMPIE 2 Reasons for inclusion Highly useful in most routines Conditional Add or Subtract instructions conditionally Divide t
89. ulated based on the full ALU result after the scale has been applied but prior to being m 35 ND Piccolo on the ARM so the translation from condition written to the destination An ASR will always reduce the testing to flag checking is not the same as the ARM either number of bits required to store the result but an ASL would increase it To avoid this Piccolo truncates the 48 bit result when an ASL scale is applied to limit the number of bits 0990 FQ en 40 over which zero detect and overflow must carried out 0001 NE Last result was non zero 0010 CS Used after shift MAX operation The N flag is calculated presuming signed arithmetic is 0011 CC C 0 being carried out This is because when overflow occurs the 0100 Last result was negative soe Cel 0101 PL GE N 0 Tast result was positive most significant bit of the result is either the C flag or the N 0110 VS V Signed overflow saturation on last result flag depending on whether the input operands are signed or 0 0 No overflow saturation on last result 45 unsigned 1000 VP 1 amp N 0 Overflow positive on last result T x S 1001 VELA N 1 Overflow negative on last result The V flag indicates if any loss of precision occurs as a 1010 reserved result of writing the result to the selected destination If no 10 reserved PCS ug Nay af write back is Selected a size is still implied and the 1101 LE N 1 17 41 50 overflow
90. umber of instruc tions in the loop body For the REPEAT it is only necessary to specify the number of loops either as a constant or register For example 25 followed by a REPEAT instruction See the section above for details of the RMOV instruction and the re mapping parameters format If the number of iterations for a loop is 0 then the action of REPEAT is UNPREDICTABLE If the number of instructions field is set to O then the action of REPEAT is UNPREDICTABLE A loop consisting of only one instruction with that instruction being a branch will have UNPREDICTABLE behaviour Branches within the bounds of a REPEAT loop that branch outside the bounds of that loop are UNPREDICT ABLE The Saturating Absolute instruction calculates the satu rated absolute of source 1 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14131211109 876 543210 10011 DEST SRC1 100000000000 D 1 5 Action REPEAT MULA 701 AO MULA YO h Z0 h AO NEXT 40 This will execute the two MULA instructions times Also 45 REPEAT 10 MULA YO 0 NEXT will perform 10 multiply accumulates 50 The assembler supports the syntax REPEAT iterations lt PARAMS gt To specify the re mapping parameters to use for the REPEAT If the required remapping parameters are equal to 55 one of the predefined set of parameters then the appropriate REPEAT encoding is used If it is not then the assembler will genera
91. urce 2 (SRC2) has one of the following three 12-bit formats, occupying bits 11 to 0 (surviving field names from the format diagrams: SRC2_SEL, SCALE, IMMED_8, IMMED_6). FIG. 4 illustrates a multiplexer arrangement responsive to the Hi/Lo bit 36 and the input operand Size bit 32 within a program instruction word 30 to switch appropriate halves of the selected register (selected by the source register bit field 34) to the Piccolo datapath. If the Size bit indicates 16 bits, then a sign-extending circuit pads the high order bits of the datapath with 0s or 1s as appropriate. The first encoding specifies the source as being a register, the fields having the same encoding as the SRC1 specifier. The SCALE field specifies a scale to be applied to the result of the ALU. SCALE (bits 3 2 1 0) and action: 0000: ASR 0; 0001: ASR 1; 0010: ASR 2; 0011: ASR 3; 0100: ASR 4; 0101: RESERVED; 0110: ASR 6; 0111: ASL 1; 1000: ASR 8; 1001: ASR 16; 1010: ASR 10; 1011: RESERVED; 1100: ASR 12; 1101: ASR 13; 1110: ASR 14; 1111: ASR 15. The 8-bit immediate with rotate encoding allows the generation of a 32-bit immediate which is expressible by an 8-bit value and 2-bit rotate. The following table shows the immediate values that can be generated from the 8-bit value XY: ROT
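The SCALE field is a 4-bit selector for a shift applied to the ALU result. The sketch below decodes it with a lookup table. The table is a reconstruction and should be treated as an assumption: the source rows are partly garbled, so the bit patterns are filled in by counting order and the rows shown only as "4", "6" and "16" are assumed to be ASR shifts. The names `SCALE_TABLE` and `apply_scale` are hypothetical.

```python
# Reconstructed mapping from the 4-bit SCALE field to (operation, amount).
# None marks the encodings listed as RESERVED.
SCALE_TABLE = {
    0b0000: ("ASR", 0),  0b0001: ("ASR", 1),  0b0010: ("ASR", 2),  0b0011: ("ASR", 3),
    0b0100: ("ASR", 4),  0b0101: None,        0b0110: ("ASR", 6),  0b0111: ("ASL", 1),
    0b1000: ("ASR", 8),  0b1001: ("ASR", 16), 0b1010: ("ASR", 10), 0b1011: None,
    0b1100: ("ASR", 12), 0b1101: ("ASR", 13), 0b1110: ("ASR", 14), 0b1111: ("ASR", 15),
}


def apply_scale(value, scale_field):
    """Apply the shift selected by `scale_field` to a signed integer."""
    action = SCALE_TABLE[scale_field & 0xF]
    if action is None:
        raise ValueError("reserved SCALE encoding")
    operation, amount = action
    # Python's >> is an arithmetic shift on negative integers, like ASR.
    return value >> amount if operation == "ASR" else value << amount
```

The sketch leaves out the truncation of the 48-bit result that the document describes for ASL scaling.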
92. will load all Piccolo state from the address in register R0: LDP B0 R0 16 (private registers); LDP B1 R0 64 (load general purpose registers); LDP B2 R0 16 (load accumulators); LDP B3 R0 56 (load Register/ROB/FIFO status); LDP B4 R0 32 (load ROB data); LDP B5 R0 32 (load output FIFO data); LDP B6 R0 52 (load loop hardware). itself as if Piccolo had been disabled with a PDISABLE instruction. The program counter remains valid, allowing the address of the breakpoint to be recovered. Piccolo will no longer execute instructions. Single stepping Piccolo will be done by setting breakpoint after breakpoint on the Piccolo instruction stream. Software Debug: The basic functionality provided by Piccolo is the ability to load and save all state to memory via coprocessor instructions when in state access mode. This allows a debugger to save all state to memory, read and/or update it, and restore it to Piccolo. The Piccolo store state mechanism will be non-destructive; that is, the action of storing the state of Piccolo will not corrupt any of Piccolo's internal state. This means that Piccolo can be restarted after dumping its state without restoring it again first. The mechanism to find the status of the Piccolo cache is to be determined. Hardware Debug: Hardware debug will be facilitated by a scan chain on Piccolo's coprocessor interface. Piccolo may then
