VIS Instruction Set User's Manual

1. sss 3 6 3 Preparing To Use INC AS com eee piter nites 3 644 Starting ING AS cd O 3 6 9 Getting Help Azs iniciaban ibiza 3 6 6 Interrupting and Quitting INCAS sess 3 6 7 Using INCAS for Cycle Counting sse 3 6 8 Using INCAS For Debugging 3 6 9 Example Program Used in Illustrating INCAS Operation 34 Process TUNNE ih sieve vk nee e e BUD erased ever vn dine deta de e cea Using VIS AAA A 4j A etse aid de e d te ea e ER tete tied 42 Data Types Usedom oot eon dandoles 4 2 1 Partitioned Data Formats nennen nenne 42 Fixed Data Formats retener ebbe dre eere ed 42 3 Include Directives eerte ote eter dette eerte 43 Utility MINES iaa paa 43 1 vis write gsr vis read gsr ssssssssssssee eee 4 3 2 vis read hi vis read lo vis write hi vis write lo Sun Microelectronics viii 19 19 19 20 20 21 21 22 22 23 23 23 24 24 25 25 25 27 28 28 28 29 30 34 39 41 41 42 43 43 44 44 44 45 4 4 4 5 4 6 4 7 4 8 Contents 43 3 vis feg palt s nudos pe HEP o PU ete prete petiere Dean 46 ABA O scii o aee e e e ER Re Sq RD ea 46 43 5 vis to double vis to double dup sse 47 VIS Logical Instructions s nse onis naen aea e a Aa a e tenente enne nens 49 4 4 1 vis fzero vis fzeros vis_fone vis fones er 49 4 4 2 vis fsrc vi
2. ssssssss 90 4 8 3 Partitioned Arithmetic and Packing sss 90 4 8 4 Finding Maximum and Minimum Pixel Values ss 91 48 5 Merge Code Examples sess 93 Sun Microsystems Inc ix VIS Instruction Set User s Manual 4 8 6 Using VIS Instructions in SPARC Assembly sss 94 4 8 7 Using VIS Block Load and Store Instructions sss 96 49 Using array8 With Assembly Code sse 101 Advanced Topics tii da ede eta 105 Boks OVERVIOW kov ari 105 5 2 Imaging Applications eene 106 5 21 Resampling of Aligned Data With a Filter Width of 4 106 5 222 Handling Three Band Data 108 5 23 Fast Lookup of 8 Bit Data 111 5 24 Alpha Blending Two Images 116 5 3 Graphics Applications sssssssssssssseeenenen eren 119 5 3 1 Texture Mapping 119 5 4 Audio Applications ardran ae a a R e a e Eae eren 121 54 1 Finite Impulse Response FIR Filter sess 121 5 Video Applications eene nennen nennen eren 123 5 5 1 Motion Vector Estimation 123 Performance Optimization sse eene e a nnn nennen 127 A eene em pei mettent enit te ii t ee RE son 127 A Minimization of Conditional Usage cococconecooononcononenenrcnnnnernrarnnecncnnarnrnrrararannnnnnos 128 AJ Dealing With Misaligned Data sse ee 128 AA Cycle Expensive
3. 55 sflh vis read hi sdl 56 Ssfll vis read lo sdl 58 rdlh vis fmul8x16 sflh adh 59 rdll vis fmul8x16 sfll adl 1 61 sf2h vis read hi sd2 62 sf21 vis read lo sd2 64 rd2h vis fmul8x16 sf2h bdh 65 rd21 vis fmul8x16 sf21 bdl 67 rdh vis fpaddl6 rdlh rd2h 68 rdl vis fpaddl6 rdll rd21 70 rd vis fpackl16 to hi rd rdh V rd vis fpackl16 to lo rd rdl 73 dp 0 rd 0x0060 73 std S 0 12 0x0064 x ret 0x0068 xf restore g0 g0 g0 0x006c 0 type vdk vis blend88 2 0x006c size vdk vis blend88 vdk vis blend88 section text tfalloc execinstr 0x0000 0 align 4 SUBROUTINE main 1 OFFSET SOURCE LINE LABEL INSTRUCTION global main main 0x0000 save sp 128 sp 74 1 T 68 o o ode edel HU ee Od CO idee TEI 78 main int argc char argv 79 80 vis d64 s1 1 s2 1 d 1 a 1 82 vdk vis blend88 sl s2 d a 0x0004 82 add fp 16 01 0x0008 add Sfp 24 02 0x000c add fp 8 00 0x0010 Call vdk vis blend88 4 Result 590 0x0014 uA add fp 32 03 84 exit 0 0x0018 84 call exit 1l Result g0 0x001lc or g0 0 00 0x0020 x ret 0x0024 x restore 90 90 90 0x0028 0 type main 2 0x0028 size main main Sun Microelectronics 38 3 Development Flow 0x0028 0 global fs
4. vis_fcmpge32(vis_d64 data1_2_32, vis_d64 data2_2_32);

Description

vis_fcmp[gt, le, eq, neq, lt, ge]() compare four 16-bit partitioned or two 32-bit partitioned fixed-point values within data1_4_16 or data1_2_32 and data2_4_16 or data2_2_32. The 4-bit or 2-bit comparison results are returned in the corresponding least significant bits of a 32-bit value that is typically used as a mask. A single bit is returned for each partitioned compare, and in both cases bit zero is the least significant bit of the compare result.

For vis_fcmpgt(), each bit within the 4-bit or 2-bit compare result is set if the corresponding value of data1_4_16 or data1_2_32 is greater than the corresponding value of data2_4_16 or data2_2_32.

For vis_fcmple(), each bit within the 4-bit or 2-bit compare result is set if the corresponding value of data1_4_16 or data1_2_32 is less than or equal to the corresponding value of data2_4_16 or data2_2_32.

For vis_fcmpeq(), each bit within the 4-bit or 2-bit compare result is set if the corresponding value of data1_4_16 or data1_2_32 is equal to the corresponding value of data2_4_16 or data2_2_32.

For vis_fcmpne(), each bit within the 4-bit or 2-bit compare result is set if the corresponding value of data1_4_16 or data1_2_32 is not equal to the corresponding value of data2_4_16 or data2_2_32.

For vis_fcmplt(), each bit within the 4-bit or 2-bit compare re
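The mask semantics described in this excerpt can be checked on any machine with a plain C model. The sketch below is not the VIS intrinsic itself: `model_fcmpgt16` is a hypothetical name, and it assumes that mask bit i reports the compare of the i-th 16-bit field counted from the least significant end of the 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Portable model of a 4 x 16-bit "greater than" partitioned compare.
 * Assumption: mask bit i reports the compare of the i-th 16-bit field,
 * counting fields from the least significant end of the 64-bit word. */
unsigned model_fcmpgt16(uint64_t data1_4_16, uint64_t data2_4_16)
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        /* Extract field i and reinterpret it as a signed 16-bit value
         * (two's complement conversion, as on every mainstream compiler). */
        int16_t a = (int16_t)(uint16_t)(data1_4_16 >> (16 * i));
        int16_t b = (int16_t)(uint16_t)(data2_4_16 >> (16 * i));
        if (a > b)
            mask |= 1u << i;
    }
    assert(mask <= 0xFu);   /* only the four low bits are ever used */
    return mask;
}
```

A mask produced this way has the same shape as the value that the manual passes to the partial store routines, so the model can be used to reason about which fields a compare would select.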
5. Align addresses relative to destination alignment and load data sl offset VIS OFFSET s1 d offset sl aligned vis alignaddr sl d offset u sl 0 sl aligned 0 u sl 1 sl aligned 1 S2 offset VIS OFFSET s2 d offset s2 aligned vis alignaddr s2 d offset u s2 0 s2 aligned 0 u s2 1 s2 aligned 1 off a VIS OFFSET a d offset alpha aligned vis alignaddr a d offset u alpha 0 alpha aligned 0 u alpha 1 alpha aligned 1 Sun Microsystems Inc 117 VIS Instruction Set User s Manual Number of times through the loop times vis u32 d end gt gt 3 vis_u32 d aligned gt gt 3 1 for i 0 i lt times i void vis alignaddr void 0 off a Set alignment for alpha quad a vis faligndata u alpha 0 u alpha 1 u alpha 0 u alpha 1 u alpha 1 alpha aligned i 2 void vis alignaddr void 0 sl offset Set alignment for sl dbl sl vis faligndata u sl 0 u sl 1 u sl 0 u sl 1 u sl 1 sl aligned i 2 void vis alignaddr void 0 s2 offset Set alignment for s2 dbl s2 vis faligndata u s2 0 u s2 1 u s2 0 u s2 1 u s2 1 s2 aligned i 2 dbl sl e vis fexpand vis read hi dbl s1 dbl s2 e vis fexpand vis read hi dbl s2 dbl tmp2 vis fpsubl6 dbl s2 e dbl sl e dbl tmpl vis fmul8xl6 vis read hi quad a dbl tmp2 dbl suml vis
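The excerpt above re-sets the alignment offset before each vis_faligndata() call because the three sources can be misaligned by different amounts. A minimal portable model of the data alignment step, assuming big-endian byte order and taking the offset as an explicit parameter instead of reading it from the GSR, might look like this (`model_faligndata` is a hypothetical name, not part of the VIS headers):

```c
#include <assert.h>
#include <stdint.h>

/* Model: treat hi:lo as 16 consecutive bytes (hi first, big-endian order)
 * and return the 8 bytes that start `offset` bytes in (offset 0..7). */
uint64_t model_faligndata(uint64_t hi, uint64_t lo, unsigned offset)
{
    assert(offset < 8);
    if (offset == 0)
        return hi;   /* avoids an undefined 64-bit shift by 64 */
    return (hi << (8 * offset)) | (lo >> (64 - 8 * offset));
}
```

With hi = 0x0011223344556677 and lo = 0x8899AABBCCDDEEFF, an offset of 3 selects the bytes 33 44 55 66 77 88 99 AA, which is the window a misaligned 8-byte load would have returned.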
6. thresh 11 off 3 tf vis to float tu t2 vis_fexpand tf Prepare the above values auh above 9 off 3 lt lt 24 above 10 off 3 lt lt 16 above 11 off 3 lt lt 8 above 9 off 3 aul above 10 off 3 lt lt 24 above 11 off 3 lt lt 16 above 9 off 3 lt lt 8 above 10 off 3 a0 vis to double auh aul auh above 11 off 3 lt lt 24 above 9 off 3 lt lt 16 above 10 off 3 lt lt 8 above 11 off 3 aul above 9 off 3 lt lt 24 above 10 off 3 lt lt 16 above 11 off 3 lt lt 8 above 9 off 3 al vis_to_double auh aul auh above 10 off 3 lt lt 24 above 11 off 3 lt lt 16 above 9 off 3 lt lt 8 above 10 off 3 aul above 11 off 3 lt lt 24 above 9 off 3 lt lt 16 above 10 off 3 lt lt 8 above 11 off 3 a2 vis to double auh aul Prepare the below values buh below 9 off 3 lt lt 24 below 10 off 3 lt lt 16 below 11 off 3 lt lt 8 below 9 off 3 bul below 10 off 3 lt lt 24 below 11 off 3 lt lt 16 below 9 off 3 lt lt 8 below 10 off 3 bO vis to double buh bul buh below 11 off 3 lt
7. The additional latency for an internal cache miss and E-Cache hit is 6 cycles (3 internal and 3 external). Reads can be completed in every cycle, with data driven the second cycle after address and control signals. UltraSPARC does not differentiate between burst reads and two consecutive reads; signals used for a single read are simply replicated for each subsequent read. The reads are fully pipelined, and thus full throughput is achieved. Writes can also be completed every cycle, with data driven the cycle after address and control. A dead cycle is created when switching direction on the data bus to avoid overlapping drivers. The total write-after-read (WAR) penalty is two cycles. There is no read-after-write (RAW) penalty.

2.8.5 System Interface

A complete UltraSPARC I subsystem, consisting of the UltraSPARC I processor, synchronous SRAM components for the External Cache tags and data, and two UltraSPARC I Data Buffer (UDB) chips, is shown in Figure 2-6.

[Figure 2-6: UltraSPARC I System Interface]

The UDBs serve to electrically isolate the interaction between the CPU and E-Cache from the system bus
8. to synchronize with returning data process data returned by BLD A0 block load and sync data from BLD BO block store data from BLD AO process data returned by BLD BO block load and sync data from BLD AO block store data from BLD BO issue memory barrier instruction to ensure all previous memory load and store has completed return restore register window ET SIZE vis inverse 8 blk 4 9 Using array8 With Assembly Code An example of using the array8 instruction from assembly code to process 8 pix els in 9 clocks assuming the data are all in L2 cache 8 cycle latency define blockedO 10 define blockedO 11 define base 12 Sun Microsystems Inc 101 VIS Instruction Set User s Manual define seven 13 define fixed0 o0 define fixedl o1 define step 02 define step 03 define stepl5 o4 alignaddr g0 Sseven gO init loop counter to init gsr to 7 numpixels 16 assume numpixels divisible by 16 place initial fixed point address into fixed0 place step into step prior to the loop addx Sfixed0 step7 fixedO array8 Sfixed0 size blocked0 subx fixedO0 step fixedl E array8 Sfixedl size blockedl 7 step into Sstep7 generate f8 f15 fixed0 address of point 7 blocked0 address of point fixedl address of point 6 blockedl address of point ldda base blocked0 ASI FL8 PRIMARY f16 load point subx fixedl step fixed0
9. vis u8 max in max2 max3 vis read hi in vis read lo in overwite my max with the input overwite my min with the input Sun Microsystems Inc 91 VIS Instruction Set User s Manual Results are in bytes 0 2 4 6 of my min and my max min0 vis u8 smy min minl vis u8 smy min 2 min2 vis u8 amp my min 4 min3 vis u8 amp my min 6 max0 vis u8 amp my max maxl vis u8 amp my max 2 max2 vis u8 amp my max 4 max3 vis u8 amp my max 6 define MIN a b a lt b a b define MAX a b a gt b a b min MIN MIN min0 minl MIN min2 min3 max MAX MAX max0 maxl MAX max2 max3 Sun Microelectronics 92 4 8 5 Merge Code Examples 4 Using VIS Byte merging may be used to interleave multi banded images An example of combining separate red green blue and alpha images into a single 4 banded im age with pixels in red blue green and alpha format is illustrated by 4 8 5 1 and an example illustrating how to transpose a block of bytes is presented as 4 8 5 2 In this example an 8 x8 matrix p is transposed into an 8 x 8 matrix q Poo Poi Po Poo Pio P70 900 do1 407 Pio Pri Piz Poi Pr Paf hodi 417 P70 Pn Pr Por Piz Pm 970 An 477 4 8 5 1 Byte Merging vis_d64 red green blue alpha abgr vis_d64 r 9 b a ag br int time
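Byte merging, as used in the merge examples of section 4.8.5, interleaves the bytes of two 32-bit values into one 64-bit value. The sketch below models that interleave in portable C. `model_fpmerge` is a hypothetical name, and the byte order (first operand's most significant byte first) is an assumption stated here rather than something this excerpt spells out.

```c
#include <assert.h>
#include <stdint.h>

/* Interleave the four bytes of a and b: a3 b3 a2 b2 a1 b1 a0 b0,
 * where byte 3 is the most significant byte of each 32-bit input. */
uint64_t model_fpmerge(uint32_t a, uint32_t b)
{
    uint64_t out = 0;
    for (int i = 3; i >= 0; i--) {
        out = (out << 8) | ((a >> (8 * i)) & 0xFFu);
        out = (out << 8) | ((b >> (8 * i)) & 0xFFu);
    }
    /* Sanity check: the top byte of the result comes from a. */
    assert((out >> 56) == (a >> 24));
    return out;
}
```

Applying the merge twice, first to pair up bands and then to pair up the paired results, is one way to arrive at a 4-banded pixel layout from separate single-band inputs.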
10. digits. If the resulting value is negative (i.e., the MSB is set), zero is returned. If the value is greater than 255, then 255 is returned. Otherwise the scaled value is returned. For an illustration of this operation, see 4.7.2.

[Figure 4-17: vis_fpack16() operation]

Example

vis_d64 data_4_16;
vis_f32 result;
result = vis_fpack16(data_4_16);

4.7.2 vis_fpack32()

Function

Truncate two 32-bit fixed values into two unsigned 8-bit integers.

Syntax

vis_d64 vis_fpack32(vis_d64 data_8_8, vis_d64 data_2_32);

Description

vis_fpack32() copies its first argument, data_8_8, shifted left by 8 bits, into the destination or vis_d64 return value. It then extracts two 8-bit quantities, one each from the two 32-bit fixed values within data_2_32, and overwrites the least significant byte position of the destination. Two pixels consisting of four 8-bit bytes each may be assembled by repeated operation of vis_fpack32() on four data_2_32 pairs.

The reduction of data_2_32 from 32 to 8 bits is controlled by the scale factor of the GSR. The initial 32-bit value is shifted left by the GSR scale factor, and the result is considered as a fixed-point number with its binary point between bits 22 and 23. If this number is nega
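The vis_fpack32() description can be followed step by step in plain C. The sketch below is a model under stated assumptions, not the intrinsic: the names `clip_fixed32` and `model_fpack32` are hypothetical, the scale factor is passed as a parameter instead of being read from the GSR, and the clipping rule (negative becomes 0, above 255 becomes 255) is assumed to mirror the vis_fpack16() rule quoted at the start of this excerpt, since the corresponding vis_fpack32() sentence is cut off.

```c
#include <assert.h>
#include <stdint.h>

/* Scale one signed 32-bit fixed-point value and clip it to 0..255. */
static uint8_t clip_fixed32(int32_t value, unsigned scale)
{
    /* Multiply instead of shifting so negative inputs stay well defined. */
    int64_t scaled = (int64_t)value * ((int64_t)1 << scale);
    if (scaled < 0)
        return 0;
    int64_t whole = scaled >> 23;   /* binary point sits between bits 22 and 23 */
    return (uint8_t)(whole > 255 ? 255 : whole);
}

uint64_t model_fpack32(uint64_t data_8_8, uint64_t data_2_32, unsigned scale)
{
    assert(scale < 32);
    uint8_t hi = clip_fixed32((int32_t)(uint32_t)(data_2_32 >> 32), scale);
    uint8_t lo = clip_fixed32((int32_t)(uint32_t)data_2_32, scale);
    uint64_t out = data_8_8 << 8;                /* make room for the new bytes */
    out &= ~(((uint64_t)0xFF << 32) | 0xFFu);    /* clear both low-byte slots */
    out |= ((uint64_t)hi << 32) | lo;
    return out;
}
```

Calling the model four times, each time feeding the previous result back in as data_8_8, accumulates four bytes per 32-bit half, which matches the "repeated operation on four data_2_32 pairs" usage described above.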
11. 3. Development Flow

3.1 Overview

This chapter presents the applications development process and introduces the tools for developing applications, debugging, and performance monitoring. Topics included in this chapter are:

• Development Process Overview
• SPARCompiler 4.x (SC 4.x)
• Use of software VIS Simulator
• Use of INCAS (It's a Nearly Cycle Accurate Simulator)
• Process Tuning

3.2 Development Process Overview

Code written using the VIS instruction set may be compiled and run in three ways:

1. Compile your VIS code using the SPARCompiler 4.x directly, to generate object code for execution on the UltraSPARC CPU.

2. Compile your VIS code using any compatible (not necessarily a SPARCompiler 4.x) C compiler and link with libvis_sim.so or libvis_sim.a, a VIS instruction simulator, to resolve VIS function calls. The VIS instruction simulator substitutes standard C implementations for the VIS instruction set, which permits you to run your code on any compatible processor (not necessarily an UltraSPARC I) to perform debugging and algorithm validation.

3. Compile and specially process your VIS code to run on INCAS (It's a Nearly Cycle Accurate Simulator), which is a nearly cycle-accurate model of the UltraSPARC I processor. This permits you to do independent code performance prediction, cycle counting, and debugging.

3.3 VIS Software Developer's Kit

The VIS S
12. General References

Books

Weaver, David L., editor. The SPARC Architecture Manual, Version 8. Prentice-Hall, Inc., 1992.
Weaver, David L., and Tom Germond, eds. The SPARC Architecture Manual, Version 9. Prentice-Hall, Inc., 1994.

Papers

Boney, Joel. "SPARC Version 9 Points the Way to the Next Generation RISC." Sun World, October 1992, pp. 100-105.
Greenley, D., et al. "UltraSPARC: The Next Generation Superscalar 64-bit SPARC." 40th annual Compcon, 1995.
Kohn, L., et al. "The Visual Instruction Set (VIS) in UltraSPARC." 40th annual Compcon, 1995.
Maturana, G., et al. "Incas: A cycle accurate model of the UltraSPARC." 40th annual Compcon, 1995.
Tremblay, Marc. "A Fast and Flexible Performance Simulator for Microarchitecture Trade-off Analysis on UltraSPARC." DAC 95 Proceedings (in press).
Zhou, C., et al. "MPEG Video Decoding with UltraSPARC Visual Instruction Set." 40th annual Compcon, 1995.

Sun Microsystems Publications

Books and Manuals

UltraSPARC User's Manual, Revision 2.0, June 1996, Part No. 802-7220-01.
UltraSPARC I User's Manual, Part No. STP1030 UG.
INCAS User's Guide 2.0.
UltraSPARC I Data Sheet. This item is available in printed form or through the WWW. See On-Line Resources for information about the UltraSPARC I WWW page.

On-Line Resources

The UltraSPARC I WWW page is located at http www sun com
13. backtrack to point 5 array8 Sfixed0 size blocked0 blockedO address of point ldda base blockedl ASI FL8 PRIMARY f18 load point subx fixedO0 step fixedl backtrack to point 4 array8 fixedl size blockedl blockedl address of point ldda base blocked0 ASI FL8 PRIMARY f20 load point 5 subx fixedl step sfixed0 backtrack to point 43 array8 Sfixed0 size blocked0 blockedO address of point ldda base blockedl ASI FL8 PRIMARY f22 load point 4 subx fixedO0 step fixedl backtrack to point 2 array8 fixedl size blockedl blockedl address of point ldda base blocked0 ASI FL8 PRIMARY f24 load point 3 subx fixedl step fixed0 backtrack to point 1 array8 Sfixed0 size blocked0 blocked0 address of point ldda base blockedl ASI FL8 PRIMARY f26 load point 2 subx fixedO0 step fixedl backtrack to point 0 array8 fixedl size blockedl blockedl address of point ldda base blocked0 ASI FL8 PRIMARY f28 load point 1 addx fixedl stepl5 fixedO0 fixed0 address of point 15 array8 fixed0 size ldda subx fixed0 step Sun Microelectronics 102 blocked0 Sbase blockedl1 ASI FL8 PRIMARY Sfixedl k blocked0 address of point S 30 load point 0 address of point 14 fixedl 15 step into step15 7 6 5 6 4 3 2 1 0 15 4
14. d36 faligndata d4 d6 338 faligndata d6 d8 d40 faligndata d8 d10 d42 faligndata d10 d12 d44 faligndata d12 d14 d46 addcc LO Ty LO bg pt dis fmovd d14 d48 Sun Microelectronics 88 end of loop handling 11 1dda regaddr stda d32 reg faligndata d48 d16 faligndata dl16 d18 faligndata 018 d20 faligndata d20 d22 faligndata d22 d24 faligndata d24 d26 faligndata d26 d28 faligndata d28 d30 addcc 10 1 10 be pnt done fmovd d30 d48 ldda regaddr stda d32 reg ba loop faligndata d48 d0 done end of loop processi 4 Using VIS ASI BLK P d0 addr ASI BLK P d32 d34 5d36 d38 d40 d42 d44 5d46 ASI BLK P dl6 addr ASI BLK P d32 ng See also Section 4 8 7 Using VIS Block Load and Store Instructions 4 8 Code Examples The following are some code examples illustrating the application of the VIS in struction set 4 8 1 Averaging Two Images void ave vis d64 inputsO0 vis d64 outputs int Tnt xs vis d64 input0 inputl vis d64 result hi vis write gsr 2 3 i times inputsO i inputs1 i for i inputO0 inputl result hi 0 result lo outputs i result_ i vis fpaddl6 vis fexpand vis read hi input0 vis fpaddl6 vis fexpand vis read lo input0 vis freg pair vis fpackl16 result hi vis d64 inputsl times lo
15. generate edge mask for start point mask vis edge8 da prepare sourc sp vis d64 dend address and set GSR alignaddr offset vis_alignaddr sa off load 8 bytes of source data s0 sp Sp tt sl sp s vis faligndata s0 8 pixel inversion d vis fnot s store 8 bytes vis pst 8 d dp emask s0 sp dp to in set edge mask will be saved emask Oxff 8 byte loop while vis u32 dp lt tes es Ws a vis pst 8 vis u32 s1 of result so all 8 bytes of data doing while loop x dend2 load 8 bytes of source data sl sp Sun Microelectronics 78 4 Using VIS S vis faligndata s0 s1 8 pixel inversion d vis fnot s store 8 bytes of result vis pst 8 d dp emask s0 s1 Sp tt dp generate edge mask for end point mask vis edge8 dp dend load 8 bytes of source data sl sp S vis_faligndata s0 s1 8 pixel inversion d vis_fnot s store 8 bytes of result vis pst 8 d dp emask Code Example 4 2 Data Boundary Handling by vis inverse8b VO id vis_inverse8b vis_u8 src vis_u8 dst int length vis u8 sa src start point in source vis d64 sp 8 byte aligned start point in source vis u8 da dst start point in destination vis u8 dend dend2 e
16. 3.6.7 Using INCAS for Cycle Counting

The following illustrates the use of INCAS on VIS code example vis_example3, described in section 3.6.9. To perform cycle counting on the binary file vis_example3:

1. Load the binary file into RAM1, starting at address 0.

ieul load 0 ram1 vis_example3

2. Set breakpoints where you want to check the cycle count. See file vis_example3.c in directory VSDKHOME/examples/src and the code listing in section 3.6.9 for the corresponding location of the breakpoints.

ieul breakpoint add &vdk_vis_blend88
ieul breakpoint add &exit

3. Start cycle counting with the command run. When the simulation reaches a breakpoint, use the command time to check the current cycle count at that point.

ieul run
ieul breakpoint 1 stage G at vdk_vis_blend88 0x8518 encountered
ieul time
real time Feb 6 19:09:41.380477 user time 0.330000 system time 0.100000
cycle count 843 1960.47 cps 7.06 MCPH
instr count 68 158.14 ips 0.57 MIPH
cpi 12.397 ipc 0.081
Maximum resident set size 0 pages
ieul run
ieul breakpoint 2 stage G at exit 0x85ac encountered
ieul time
real time Feb 6 19:09:47.609686 user time 0.35740 system time 0.107334
cycle count 969 2115.38 cps 7.62 MCPH
instr count 95 207.39 ips 0.75 MIPH
cpi 10.200 ipc 0.098
Maximum resident set size 0 pages

4. Repeat this process throughout your code. The difference of the cycle
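The two time readings in the session above are cumulative, so the cost of the code between the breakpoints is obtained by subtraction. The helper names below are hypothetical and serve only to make the arithmetic explicit for the figures shown (cycle counts 843 and 969, instruction counts 68 and 95).

```c
#include <assert.h>

/* Cycles (or instructions) spent between two breakpoints, given the
 * cumulative counts that `time` reported at each one. */
unsigned long count_between(unsigned long first, unsigned long second)
{
    assert(second >= first);   /* cumulative counts only grow during a run */
    return second - first;
}

/* Cycles per instruction over the interval between the two readings. */
double cpi_between(unsigned long cycles0, unsigned long cycles1,
                   unsigned long instrs0, unsigned long instrs1)
{
    unsigned long instrs = count_between(instrs0, instrs1);
    assert(instrs > 0);
    return (double)count_between(cycles0, cycles1) / (double)instrs;
}
```

For the session above this gives 969 - 843 = 126 cycles and 95 - 68 = 27 instructions between the two breakpoints, or roughly 4.67 cycles per instruction for that stretch of code.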
17. /* Set shift field of gsr to 2 */
vis_fexpand(vis_read_hi(input1))
vis_fexpand(vis_read_lo(input1))
vis_fpack16(result_lo)

4.8.2 Blending Two Images by a Fixed Percentage

void blend(vis_d64 *inputs0, vis_d64 *inputs1, vis_d64 *outputs,
           int percent, int times)
{
    vis_u32 coeff_hi, coeff_lo;
    vis_f32 coefficients;
    vis_d64 input0, input1, blend0, blend1;
    vis_f32 result_hi, result_lo;
    int i;

    vis_write_gsr(0);
    coeff_hi = (int)(16384.0 * percent / 100.0);
    coeff_lo = 16384 - coeff_hi;
    coefficients = vis_to_float(coeff_hi << 16 | coeff_lo);
    for (i = 0; i < times; i++) {
        input0 = inputs0[i];
        input1 = inputs1[i];
        blend0 = vis_fmul8x16au(vis_read_hi(input0), coefficients);
        blend1 = vis_fmul8x16al(vis_read_hi(input1), coefficients);
        result_hi = vis_fpack16(vis_fpadd16(blend0, blend1));
        blend0 = vis_fmul8x16au(vis_read_lo(input0), coefficients);
        blend1 = vis_fmul8x16al(vis_read_lo(input1), coefficients);
        result_lo = vis_fpack16(vis_fpadd16(blend0, blend1));
        outputs[i] = vis_freg_pair(result_hi, result_lo);
    }
}

4.8.3 Partitioned Arithmetic and Packing

void interpolate(vis_f32 *values, vis_d64 *outputs, int times)
{
    vis_f32 pixels0, pixels1;
    vis_f32 filters;
    vis_d64 filt00, filt01, filt10, filt11;
    vis_f32 result0, result1;

    filters = vis_to_float(0x30001000);
    pixels0 = values[0];
    pixels1
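The blend routine packs two 16-bit weights that always sum to 16384 (2^14) into one 32-bit coefficient word. The sketch below reproduces that packing and adds a scalar reference for a single 8-bit sample. Both function names are hypothetical, and the scalar blend expresses only the arithmetic intent (a weighted average with weights summing to 2^14); it does not try to reproduce the rounding or GSR scaling of the VIS instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the two 16-bit blend weights the way the blend routine does:
 * weight for image 0 in the upper half, weight for image 1 in the lower. */
uint32_t pack_blend_weights(int percent)
{
    assert(percent >= 0 && percent <= 100);
    uint32_t coeff_hi = (uint32_t)(16384.0 * percent / 100.0);
    uint32_t coeff_lo = 16384u - coeff_hi;
    return (coeff_hi << 16) | coeff_lo;
}

/* Scalar reference for one 8-bit sample; the weights sum to 2^14,
 * so shifting right by 14 normalizes the weighted sum. */
uint8_t blend_sample(uint8_t in0, uint8_t in1, uint32_t weights)
{
    uint32_t w0 = weights >> 16;
    uint32_t w1 = weights & 0xFFFFu;
    return (uint8_t)((in0 * w0 + in1 * w1) >> 14);
}
```

For percent = 25 the packed word is 0x10003000, the mirror image of the 0x30001000 constant (0.75 and 0.25) that the interpolate routine loads into its filters variable.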
18. Using VIS loop array8 fixedl size blockedl blockedl address of point 14 ldda base blocked0 ASI FL8 PRIMARY f0 load point 15 subx fixedl step fixed0 fixed0 address of point 13 faligndata f16 Saccuml Saccuml array8 Sfixed0 size blocked0 blocked0 address of point 13 ldda base blocked1 ASI FL8 PRIMARY f2 load point 414 subx fixedO0 step fixedl fixedl address of point 412 faligndata f18 Saccuml Saccuml array8 fixedl size blockedl blockedl address of point 12 ldda base blocked0 ASI FL8 PRIMARY f4 load point 13 subx fixedl step fixed0 fixed0 address of point 11 faligndata f20 Saccuml Saccuml array8 Sfixed0 size blocked0 blocked0 address of point 11 ldda base blocked1 ASI FL8 PRIMARY Sf6 load point 12 subx fixedO0 step fixedl fixedl address of point 10 faligndata f22 Saccuml Saccuml array8 fixedl size blockedl blockedl address of point 10 ldda base blocked0 ASI FL8 PRIMARY f8 load point 11 subx fixedl step fixed0 fixed0 address of point 9 faligndata f24 accuml Saccuml array8 fixed0 size blocked0 blocked0 address of point 9 ldda base blockedl ASI FL8 PRIMARY f10 load point 10 subx fixedO0 step fixedl fixedl address of point 8 faligndata f26 Saccuml Saccuml array8 fixedl size blockedl blockedl address of point 8 ldda base blo
19. byte3 wordl gt gt 24 byte4 wordl gt gt 16 amp Oxff byte5 wordl gt gt 8 amp Oxff byte6 wordl amp Oxff byte7 word2 gt gt 24 word0 word2 wordl word3 word2 vis u32 src 2 i next word3 vis u32 src 2 i next 1 lookup vis ld u8 i vis ras table byte 7 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte6 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte5 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte4 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte3 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte2 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table bytel accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte0 accum vis faligndata lookup accum vis d64 dst i accum break Case 2 for i 0 i lt doubles 1 byte0 word0 gt gt 8 Oxff bytel word0 amp Oxff byte2 wordl gt gt 24 byte3 wordl gt gt 16 amp Oxff byte4 wordl gt gt 8 amp Oxff byte5 wordl amp Oxff byte6 word2 gt gt 24 Sun Microelectronics 114 5 Advanced Topics byte7 word2 gt gt 16 Oxff word0
20. communications, or in the design, construction, operation or maintenance of any nuclear facility. Sun disclaims any express or implied warranty of fitness for such uses. Printed in the United States of America.

Preface

Overview

Welcome to the VIS Instruction Set User's Guide. This book presents information about the VIS Instruction Set, which is an extension to the SPARC V9 instruction set. This book presents:

• An introduction to the UltraSPARC I architecture
• The VIS development environment
• The VIS instructions
• Select examples illustrating the use of VIS to process multimedia data

How to Use This Book

This book is provided with the UltraSPARC I developer's kit and provides you with a complete definition of the VIS instructions, with some illustrative code examples. Since the examples given include some assembly code, you should refer to The SPARC Architecture Manual, Version 9 and The UltraSPARC I User's Manual for a more complete explanation of the concepts presented. While this book does present information on how to set up a VIS development environment and how to use INCAS (It's a Nearly Cycle Accurate Simulator), you should have available for reference the information about the INCAS commands that is included in the INCAS User's Guide 2.0. This guide is a part of the VIS Software Developer's Kit.

Textual Conventions

Fonts are used as
21. counts between two breakpoints gives you the number of cycles the code between the two breakpoints takes to run.

5. You can put INCAS commands into a file and run them in batch mode. See file timing.cmd for an example. Be sure to put the command wait after each run command, so that INCAS will wait for the completion of the run command before executing the time command to show the cycle count.

All INCAS screen output is also saved in a file named incas.log. This file can then be used later for further analysis.

3.6.8 Using INCAS For Debugging

INCAS permits you to examine the processor status at each cycle. Remember, however, that INCAS is a simulator, not a debugger. You can examine processor status but cannot change it. Because INCAS works at the assembly level and below, it is convenient to have an assembly listing of your code on hand for reference. To generate an assembly listing, use the -S option in the compiler. There are several watches which can be set to monitor different activities. Some particularly useful watches are:

ieul watchpipe    monitors the status of the pipeline
ieul watchload    monitors the loading of an instruction
ieul watchdisp    monitors the dispatching of an instruction
ieul watchdone    monitors the finishing of an instruction

The following example is a sample debug session based on code vis_example3.c. You can find the sou
22. fone and vis_fones return vis_d64 and vis_f32 one-filled variables.

Example

vis_f32 data_32;
vis_d64 data_64;
data_64 = vis_fzero();    /* data_64 holds 0x0000000000000000 */
data_32 = vis_fones();    /* data_32 holds 0xffffffff */

These instructions set all 64 bits of data_64 to zeros or ones. They are useful for initializing variables, since data_64 may be regarded as a partitioned variable containing two 32-bit or four 16-bit zero values (see the section on arithmetic instructions).

4.4.2 vis_fsrc(), vis_fsrcs(), vis_fnot(), vis_fnots()

Function

Copy a value or its complement.

Syntax

vis_d64 vis_fsrc(vis_d64 data_64);
vis_f32 vis_fsrcs(vis_f32 data_32);
vis_d64 vis_fnot(vis_d64 data_64);
vis_f32 vis_fnots(vis_f32 data_32);

Description

vis_fsrc() copies one vis_d64 variable to another, and vis_fnot() copies the complement of one vis_d64 variable to another. vis_fsrcs() copies one 32-bit variable to another, and vis_fnots() copies the complement of one 32-bit variable to another.

Example

vis_f32 data1_32, data2_32;
vis_d64 data1_64, data2_64;
data1_32 = vis_fsrcs(data2_32);    /* same as data1_32 = data2_32 */
data1_64 = vis_fnot(data2_64);     /* same as data1_64 = ~data2_64 */

4.4.3 vis_f[or, and, xor, nor, nand, xnor, ornot, andnot][s]()

Function

Perform logical operations between two 32-bit or two vis_d64 partitioned variables.

Syntax

vis_d64 vi
23. ieu ieul run ieul watchdisp 982 I FPO vdk vis blend88 0x40 0x838c fpsubl6 f0 f12 f16 ieul watchdisp 982 J FP1 vdk vis blend88 0x44 0x8390 Sun Microelectronics 32 fmul8x16 3 Development Flow f3 f12 f12 ieul watchpipe G E C OPW ieul watchpipe ieul watchpipe I GF E ieul watchpipe J H uil cycle kckck ck kckck count 0x000003d7 KK KK KK ck kk kk kX CYCLE DONE XXX X kk kk KK KK kk x xx ieul watchload 983 A vdk vis blend88 0x20 0x836c ld sp 0x5c f8 OxOffOOff ieul watc fmul8x16 ieul watc ieul watc ieul watc ieul watc uil cycle ckckck ck kckck ck ieul watc fmul8x16 ieul watc ieul watc ieul watc ieul watc uil cycle ckckck ck kckck ieul watc fmovsS f8 ieul watc ieul watc ieul watc ieul watc uil cycle ckckck ck k kk ieul watc f8 f1 ieul watc fpadd16 ieul watc ieul watc ieul watc ieul watc uil cycle ckckck ck kckck ieul watc fpsubl6 ieul watc fmul8x16 ieul watc fpadd16 ieul watc fpack16 mem addr memory 0x6ffe7c 0x7ffelc f8 0 2 36720e 29 hdisp 983 K FP1 vdk vis blend88 0x48 0x8394 Sf4 f14 f14 hpipe GECNOPW hpipe hpipe K IGF E hpipe JH count 0x000003d8 KK KK kk ck k kk kkX CYCLE DONE X XXX XX kk kk kk kk kk x xx hdisp 984 L FP1 vdk vis blend88 0x4c 0x8398 f5 f16 f16 hpipe G E C OPW hpipe hpipe LK IGF
24. lt 24 below 9 off 3 lt lt 16 below 10 off 3 lt lt 8 below 11 off 3 Sun Microelectronics 110 5 2 3 bul b1 buh bul b2 below 9 off 3 lt lt 24 below 10 off 3 lt lt 16 below 11 off 3 lt lt 8 below 9 off 3 vis to double buh bul below 10 off 3 lt lt 24 below 11 off 3 lt lt 16 below 9 off 3 lt lt 8 below 10 off 3 below 11 off 3 lt lt 24 below 9 off 3 lt lt 16 below 10 off 3 lt lt 8 below 11 off 3 vis to double buh bul Generate edge mask for the start point ma num for if if PE sk vis_edge8 da dend Calculate loop count vis u32 dend vis u32 dp 24 8 pixel loop i 0 i lt num i Process segment 0 HRESHOLD t0 t1 a0 b0 Process segment 1 HRESHOLD t2 t0 al bl Pprocess segment 2 HRESHOLD t1 t2 a2 b2 c O o O Process segment 0 if needed vis u32 dp lt vis u32 dend THRESHOLD t0 t1 a0 b0 Process segment 1 if needed vis u32 dp lt vis u32 dend THRESHOLD t2 t0 al b1 Process segment 2 if needed vis u32 dp lt vis u32 dend THRESHOLD t1 t2 a2 b2 Fast Lookup of 8 Bit Data This routine e
25. max sy fly ny min sy sh fly 16 y new height in bytes if nx lt 0 ny lt 0 return 0 16x16 block is outside search area compute width in 8 byte units nx8 nx gt gt 3 accum vis fzero Sun Microelectronics 124 sll 5 Advanced Topics sal sl2 sa2 row loop for 0 j lt ny j for i 0 i lt nx8 i load 8 bytes of source data from farmel Spl vis d64 vis_alignaddr sal 0 s10 sp1 0 s11 spl 1 sdl vis faligndata s10 s11 load 8 bytes of source data from farme2 Sp2 vis d64 vis alignaddr sa2 0 s20 sp2 0 s21 sp2 1 sd2 vis faligndata s20 s21 accum vis pdist sdl sd2 accum 00 sal sa2 8 sll sal sll fllb S12 sa2 sl2 f21b process what s left over nx 8 in plain c code sal sa2 sll framel fllb fly flx nx8 8 sl2 frame2 f21b f2y flix nx8 8 nx nx8 8 LE NZ d for j 0 j lt ny j for i 0 i lt nx i accum abs sal sa2 sal sa2 sll sal sll fllb S12 sa2 sl2 f21b result d64 accum return result ull Sun Microsystems Inc 125 VIS Instruction Set User s Manual Sun Microelectronics 126 Performance Optimization A A 1 Overview This appendix provides some helpful hints and suggestions to consider when writing code for the UltraSPARC I Sun M
26. multiply

4.6.5 vis_fmuld8sux16(), vis_fmuld8ulx16()

Function

Multiply a 16-bit partitioned vis_f32 variable by a 16-bit partitioned vis_f32 variable to produce a 32-bit partitioned vis_d64 result.

Syntax

vis_d64 vis_fmuld8sux16(vis_f32 data16s1, vis_f32 data16s2);
vis_d64 vis_fmuld8ulx16(vis_f32 data16s1, vis_f32 data16s2);

Description

vis_fmuld8sux16() multiplies the upper 8 bits of each 16-bit signed component of data16s1 by the corresponding signed 16-bit element of data16s2. The 24-bit product is shifted left by 8 bits to return a 32-bit result, as illustrated in Figure 4-15.

[Figure 4-15: vis_fmuld8sux16() operation]

vis_fmuld8ulx16() multiplies the unsigned lower 8 bits of each 16-bit component in data16s1 by the corresponding signed element in data16s2. Each 24-bit product is returned as a sign-extended 32-bit result, as illustrated in Figure 4-16.

[Figure 4-16: vis_fmuld8ulx16() operation]

vis_fmuld8sux16() and vis_fmuld8ulx16() together perform a true 16x16 -> 32-bit multiplication, taking two vis_f32 arguments each containing two 16-bit signed values. As with vis_fmul8sux16() and vis_fmul8ulx16(), each instruct
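The claim that the two partial multiplies add up to a full 16x16 product follows from writing a 16-bit signed value as (signed upper byte) x 256 + (unsigned lower byte). The sketch below checks that identity for a single 16-bit lane in portable C. The function names are hypothetical models, not the VIS intrinsics, and they operate on one lane at a time instead of a partitioned vis_f32.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of the "signed upper byte" multiply; the product is scaled by 256,
 * which is what shifting the 24-bit product left by 8 bits amounts to. */
int32_t model_muld8sux16_lane(int16_t a, int16_t b)
{
    int32_t upper = (a - (a & 0xFF)) / 256;   /* signed upper 8 bits of a */
    return upper * b * 256;
}

/* One lane of the "unsigned lower byte" multiply, sign-extended to 32 bits. */
int32_t model_muld8ulx16_lane(int16_t a, int16_t b)
{
    int32_t lower = a & 0xFF;                 /* unsigned lower 8 bits of a */
    return lower * b;
}

/* Sum of the two partial products: the full signed 16x16 product. */
int32_t model_mul16x16_lane(int16_t a, int16_t b)
{
    int32_t product = model_muld8sux16_lane(a, b) + model_muld8ulx16_lane(a, b);
    assert(product == (int32_t)a * b);
    return product;
}
```

For a = -3 and b = 7 the upper-byte product is -1 x 7 x 256 = -1792 and the lower-byte product is 253 x 7 = 1771, which sum to -21 as expected.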
27. recs ret restore ENTRY vis inverse 8 asm text alloc execinstr function name sp reserve space for stack and adjust register window size 0 8 byte per loop delay instruction after this branch alway get executed see p 145 in V9 Manual return restore register window SET SIZE vis inverse 8 asm 4 8 7 Using VIS Block Load and Store Instructions FUNCTION SYNOPSIS ARGUMENT src Source image dst destination image size image size Sun Microelectronics 96 void vis inverse 8 blk vis inverse 8 blk invert an image into another vis u8 src vis u8 dst vis u32 size NOTES l src and dst must poin Size XSIZE YSIZE ZSIZ I DESCRIPTION dst 255 src I include vis asi h 4 Using VIS t to 64 byt E must be multiple of 64 aligned addresses inimum size of stack frame according to SPARC ABI trails a function and sets the size for the ELF symbol membar membar BI BI BI BI BI StoreLoad StoreLoad BI BI BI BI Bi BI BI EI BI BI BI BI BI BI BI BI LALALA define INFRAME 96 ENTRY provides the standard procedure entry code define ENTRY x X align 4 global x x SET SIZE table define SET SIZE x X size x Xx defi
28. values 1 for i 0 i lt times 1 Sun Microelectronics 90 Multiply pixels0 filt00 vis fmul8x1 vis fmul8x1 filt01 Multiply pixelsO filt10 vis fmul8x1 filtll vis fmul8x1 result0 vis fpackl resultl vis fpackl outputs i Shift input window to the right pixels0 pixelsl pixelsl vis freg pair resultO 4 Using VIS by 0 75 pixesll by 0 25 add 6au pixelsO filters 6al pixelsl filters by 0 25 pixesll by 0 75 add 6al pixels0 filters 6au pixelsl filters 6 vis fpaddl6 filt00 filt01 6 vis fpaddl6 filt10 filt1l result1 El values i 21 4 8 4 Finding Maximum and Minimum Pixel Values void minimum vis d64 inputs int doubles int i int mask vis d64 my min vis f32 zeros vis u8 min0 minl my max in hi in lo min2 min3 max0 my min my max inputs 0 zeros vis fzeros i 0 i lt doubles i inputs i for in in hi in lo vis fpmerge zeros vis fpmerge zeros If an entry of the input my max my max mask my max mask mask vis fcmpgtl6 in hi vis pst 16 in hi amp my max mask vis fcmpgt16 in lo vis pst 16 in lo amp my max If an entry of my min gt the input in hi mask in 10 mask mask vis fcmpgtl16 my min vis pst 16 in hi amp my min mask vis fcmpgtl6 my min vis pst 16 in lo my min vis u8 min maxl
29. vis_fnot(s); /* store 8 bytes of result */ vis_pst_8(d, dp, emask); s0 sl Sp dp /* generate edge mask for end point */ emask = vis_edge8(dp, dend); /* load 8 bytes of source data */ sl s sp s = vis_faligndata(s0, s1); /* 8-pixel inversion */ d = vis_fnot(s); /* store 8 bytes of result */ vis_pst_8(d, dp, emask);

4.7.8 vis_pst_8(), vis_pst_16(), vis_pst_32()

Function
Write mask-enabled 8-, 16- and 32-bit components from a vis_d64 value to memory.

Syntax
    void vis_pst_8(vis_d64 data, void *address, vis_u8 mask);
    void vis_pst_16(vis_d64 data, void *address, vis_u8 mask);
    void vis_pst_32(vis_d64 data, void *address, vis_u8 mask);

Description
vis_pst_8(), vis_pst_16() and vis_pst_32() use mask, typically determined by edge or compare instructions, to control which 8-, 16- or 32-bit components of data are written to memory. Typical uses include writing only selected channels of a multi-channel image, avoiding writes past image boundaries, and selecting between images on a pixel-by-pixel basis based on the result of a comparison instruction.

Example
Code Example 4-3 Creation of Mask That Allows for an Unaligned Store

    vis_d64 *addr, *addr_last, *addr_aligned;
    vis_d64 data;
    int emask;

    emask = vis_edge8(addr, addr_last);
    addr_aligned = vis_alignaddr(addr, 0);
    vis_pst_8(data, addr_aligned, emask);

Code Example 4-4 Loop that Writes Zeroes to a Span of Byt
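The effect of a byte-wise partial store can be pictured with a portable C model. This is a sketch rather than the VIS intrinsic: model_pst_8 is a hypothetical name, and the bit ordering (mask bit 7 controls the byte at the lowest address, which holds the most significant byte of the 64-bit value) is an assumption based on the big-endian layout of SPARC.

```c
#include <stdint.h>

/*
 * Model of an 8-lane partial store: byte i of the 64-bit value is written
 * to addr[i] only when the corresponding mask bit is set.
 * Assumption: mask bit 7 maps to addr[0] (most significant byte of data).
 */
static void model_pst_8(uint64_t data, uint8_t *addr, uint8_t mask)
{
    for (int i = 0; i < 8; i++) {
        if (mask & (0x80u >> i)) {
            addr[i] = (uint8_t)(data >> (56 - 8 * i));
        }
    }
}
```

Bytes whose mask bit is clear are left untouched, which is what makes the operation suitable for the first and last partially covered 8-byte words of a span.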
30. vis fpmerge vis read hi m0426 vis read hi m1537 ql vis fpmerge vis read lo m0426 vis read lo m1537 m0426 vis fpmerge vis read lo m04 vis read lo m26 m1537 vis fpmerge vis read lo m15 vis read lo m37 q2 vis fpmerge vis read hi m0426 vis read hi m1537 q3 vis fpmerge vis read lo m0426 vis read lo m1537 m04 vis fpmerge vis read lo p0 vis read lo p4 m26 vis fpmerge vis read lo p2 vis read lo p6 m15 vis fpmerge vis read lo pl vis read lo p5 m37 vis fpmerge vis read lo p3 vis read 1lo p7 m0426 vis fpmerge vis read hi m04 vis read hi m26 m1537 vis fpmerge vis read hi m15 vis read hi m37 q4 vis fpmerge vis read hi m0426 vis read hi ml537 q5 vis fpmerge vis read lo m0426 vis read lo m1537 m0426 vis fpmerge vis read lo m04 vis read lo m26 m1537 vis fpmerge vis read lo m15 vis read lo m37 q6 vis fpmerge vis read hi m0426 vis read hi m1537 q7 vis_fpmerge vis_read_lo m0426 vis_read_lo m1537 4 8 6 Using VIS Instructions in SPARC Assembly FUNCTION SYNOPSIS Sun Microelectronics 94 void vis inverse 8 asm vis u8 vis u8 vis inverse 8 asm invert an image into another ESTC dst l a a X 4 Using VIS vis u32 size ARGUMENT src source image dst destination image size image size NOTES src and dst must point to 8 byte aligned addresses size XSIZE YSIZE ZSIZE m
31. 4_16. vis_fpadd32() and vis_fpsub32() perform partitioned addition and subtraction between two 64-bit partitioned components, interpreted as two 32-bit signed variables, data1_2_32 and data2_2_32, and return a 64-bit partitioned variable interpreted as two 32-bit components, sum_2_32 or difference_2_32. Overflow and underflow are not detected and result in wraparound.

Figure 4-6 illustrates the vis_fpadd16() and vis_fpsub16() operations. Figure 4-7 illustrates the vis_fpadd32() and vis_fpsub32() operations.

The 32-bit versions interpret their arguments as two 16-bit signed values or one 32-bit signed value. The single-precision versions of these instructions, vis_fpadd16s(), vis_fpsub16s(), vis_fpadd32s() and vis_fpsub32s(), perform two 16-bit or one 32-bit partitioned adds or subtracts. Figure 4-8 illustrates the vis_fpadd16s() and vis_fpsub16s() operations, and Figure 4-9 illustrates the vis_fpadd32s() and vis_fpsub32s() operations.

Figure 4-6 vis_fpadd16() and vis_fpsub16() operation (data1_4_16, data2_4_16, sum_4_16 or difference_4_16)

Figure 4-7 vis_fpadd32() and vis_fpsub32() operation (sum_2_32 or difference_2_32)

Figure 4-8 vis_fpadd16s() and vis_fpsub16s() operation (sum_2_16 or difference_2_16)

(data1_1_32, data2_1_32, sum_1_32 or difference_1_32) Figu
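The wraparound behaviour of a partitioned add can be sketched in portable C. The function below models four independent 16-bit lanes packed into a 64-bit word; model_fpadd16 is a hypothetical name used only for illustration, not a VIS intrinsic.

```c
#include <stdint.h>

/*
 * Model of a 4 x 16-bit partitioned add: each lane is added modulo 2^16,
 * and no carry propagates from one lane into the next.
 */
static uint64_t model_fpadd16(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 16u * lane;
        uint16_t x = (uint16_t)(a >> shift);
        uint16_t y = (uint16_t)(b >> shift);
        uint16_t sum = (uint16_t)(x + y);   /* wraps on overflow */
        result |= (uint64_t)sum << shift;
    }
    return result;
}
```

For example, a lane holding 0x7FFF plus 1 becomes 0x8000 (the most negative 16-bit value) and a lane holding 0xFFFF plus 1 becomes 0x0000, with neighbouring lanes unaffected.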
32. 9 void vdk vis blend88 vis d64 spl vis d64 sp2 30 vis d64 dp vis d64 ap 32 33 vis d64 sdl sd2 ad 34 vis d64 ones 35 vis f32 sflh sf2h sfll sf21 36 vis d64 adh bdh adl bdl 37 vis d64 rdlh rd2h rdll rd21 l 38 vis d64 rdh rdi 39 vis d64 rd 41 sdl sp1 0 42 sd2 sp2 0 43 ad ap 0 45 vis write gsr 3 lt lt 3 47 ones vis to double dup 0x0Off00ff0 0x0004 47 sethi Shi 0xff00c00 500 0x0008 41 ldd i0 f 2 0x000c 45 or 290 24 01 0x0010 42 ldd Sil f4 0x0014 47 add 00 1008 00 0x0018 43 ldd 13 f6 0x001c st 00 sp 92 0x0020 E ld Ssp 92 f8 0x0024 pi fexpand f6 f10 0x0028 f fexpand f7 f12 0x002c wr g0 01 gsr 0x0030 0 fmovs f8 2f0 0x0034 fmovs f8 9f 0x0038 Ef fpsub16 f0 f10 f14 0x003c E fmul8x16 f2 f10 2 f10 0x0040 x fpsub16 f0 f12 5f16 0x0044 fmul8x16 3 12 f12 0x0048 fmul8x16 f4 9f14 f14 0x004c fmul8x16 Sf 5 5f16 f16 0x0050 x fpaddl6 f10 f14 f14 0x0054 EY fpadd16 f12 f16 f10 0x0058 fpack16 f14 f0 0x005c fpack16 f10 f1 49 adh vis fexpand hi ad 50 adl vis fexpand lo ad 52 bdh vis fpsubl6 ones adh 53 bdl vis fpsubl6 ones adl Sun Microsystems Inc 37 VIS Instruction Set User s Manual
33. Cache, Load Buffer, Store Buffer and Data Memory Management Unit (DMMU).
The External Cache (E-Cache), which services misses from the Instruction Cache (I-Cache) in the UltraSPARC front end and the D-Cache of the LSU.

Figure 2-1 Simplified Block Diagram of UltraSPARC I

2.3 The UltraSPARC Front End

The UltraSPARC front end is essentially the Prefetch/Dispatch Unit (PDU). Figure 2-2 illustrates the major components of the UltraSPARC I front end.

Figure 2-2 UltraSPARC I Front End

Instructions are prefetched from a pseudo 2-way, 16-kbyte instruction cache. Each line in the I-Cache contains 8 instructions (32 bytes). Every pair of instructions has a 2-bit branch prediction field, which maintains the history of a possible branch in the pair. The four prediction states are the conventional strongly taken, likely taken, strongly not taken and l
34. Converts two 32-bit partitioned data to two 16-bit partitioned data.

Syntax
    vis_f32 vis_fpackfix(vis_d64 data_2_32);

Description
vis_fpackfix() takes two 32-bit fixed components within data_2_32, then scales and truncates them into two 16-bit signed components. This is accomplished by shifting each 32-bit component of data_2_32 according to GSR.scale_factor and then truncating to a 16-bit scaled value starting between bits 16 and 15 of each 32-bit word. Truncation converts the scaled value to a signed integer (i.e., rounds toward negative infinity). If the value is less than -32768, -32768 is returned. If the value is greater than 32767, 32767 is returned. Otherwise the scaled data_2_16 value is returned. Figure 4-19 illustrates the vis_fpackfix() operation.

Example
    vis_d64 data_2_32;
    vis_f32 data_2_16;

    data_2_16 = vis_fpackfix(data_2_32);

Figure 4-19 vis_fpackfix() operation (shown for GSR.scale_factor = 0110)

4 4 vis_fexpand()

Function
Converts four unsigned 8-bit elements to four 16-bit fixed elements.

Syntax
    vis_d64 vis_fexpand(vis_f32 data_4_8);

Description
vis_fexpand() converts packed format data (e.g., raw pixel data) to a partitioned format. vis_fexpand() takes four 8-bit unsigned element
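The scale, truncate and clamp sequence described for vis_fpackfix() can be sketched for a single 32-bit component in portable C. This is a model of the text above, not the intrinsic itself: model_fpackfix_lane is a hypothetical name, and the scale factor is assumed to be the 4-bit GSR field, so it is masked to the range 0 to 15.

```c
#include <stdint.h>

/*
 * Model of one fpackfix lane:
 *   1. scale the 32-bit input by 2^scale_factor,
 *   2. keep the bits from bit 16 upward (floor division by 2^16),
 *   3. clamp the result to the signed 16-bit range.
 */
static int16_t model_fpackfix_lane(int32_t x, unsigned scale_factor)
{
    int64_t scaled = (int64_t)x * ((int64_t)1 << (scale_factor & 15u));

    /* Floor division: round toward negative infinity, as the text states. */
    int64_t q = scaled / 65536;
    if (scaled % 65536 < 0) {
        q -= 1;
    }

    if (q < -32768) {
        return INT16_MIN;
    }
    if (q > 32767) {
        return INT16_MAX;
    }
    return (int16_t)q;
}
```

Note the rounding direction: an input of -1 with a scale factor of 0 yields -1 rather than 0, because truncation rounds toward negative infinity.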
35. LALA LL LALALA Loss lt lt lt ss lt lt ss std 03 da std O4 da std O5 da std 06 da std 07 da inc 64 da deccc ns ble pn icc loop end nop fendif define INVERSE AO fnotl AO 00 fnotl Al Ol fnotl A2 02 fnotl A3y 03 fnotl A4 04 fnotl 55 05 fnotl A6 O6 fnotl A7 O7 define INVERSE BO fnotl BO O0 fnotl Bl Ol fnotl B2 O2 fnotl B3 034 fnotl B4 04 Inotl B5 Q5 fnotl B6 O6 fnotl B7 O7 hold global data 24 32 40 48 56 SPARC have four integer register groups hold input data o registers 00 to 07 hold output data l registers 10 to 17 hold local data When calling an assembly function stored in i registers from i0 to i5 Stored in stack Note that i6 is reserved for stack pointer and i7 for return address 4 Using VIS LALALA LALA LLL LALALA i registers i0 to i7 g registers g0 to g7 Note that g0 is alway zero write to it has no program visible effect the first 6 arguments are The rest arguments are Only the first 32 f registers can be used as 32 bit registers The last 32 f registers can only be used as 16 64 bit registers define src 10 define dst Sil define sz 12 frame pointer i6 return addr 17 stack pointer 06 Sun Microsystems Inc 99 VIS Instruction Set User s Manual call link 07 define sa 10 define da 11 define se 12 define
36. The binary point of the 16-bit result in this case is to the right of bit 0.

Another example, illustrated below, has 12 fractional bits in each of its 2 component arguments, i.e., the binary point is between bits 11 and 12. A full-precision 32-bit result would have 24 fractional bits, i.e., the binary point between bits 23 and 24. Since, however, only a 16-bit result is provided, the lower 16 fractional bits are dropped after rounding, thus providing a result with 8 fractional bits, i.e., the binary point between bits 7 and 8.

      0101.001010010101     5.161376953125
    x 0001.011001001001     1.392822265625
      00000111.00110000     7.188880741596

Here 7.188880741596 is the exact product; the 16-bit result 00000111.00110000 represents 7.1875.

Figure 4-13 vis_fmul8sux16() operation (upper 8 bits of each data1_4_16 element times data2_4_16; the 16 msb of each product form resultu)

Figure 4-14 vis_fmul8ulx16() operation (lower 8 bits of each data1_4_16 element times data2_4_16; each product is sign-extended and the 8 msb form resultl)

Example
    vis_d64 data1_4_16, data2_4_16, resultl, resultu, result;

    resultu = vis_fmul8sux16(data1_4_16, data2_4_16);
    resultl = vis_fmul8ulx16(data1_4_16, data2_4_16);
    result = vis_fpadd16(resultu, resultl);

16 bit result of a 16 16
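The worked example above can be reproduced with a portable C model of one 16-bit component. This is a sketch, not the VIS intrinsics, and all function names are hypothetical. Rounding to nearest in both partial products is an assumption; it was chosen because it reproduces the result 00000111.00110000 (0x0730) shown in the example.

```c
#include <stdint.h>

/* Floor division for a positive divisor (rounds toward negative infinity). */
static int32_t floor_div(int32_t n, int32_t d)
{
    int32_t q = n / d;
    if (n % d < 0) {
        q -= 1;
    }
    return q;
}

/* "sux" step: signed upper byte of a times b, 24-bit product rounded to 16 bits. */
static int16_t model_fmul8sux16(int16_t a, int16_t b)
{
    int32_t lo = (int32_t)((uint16_t)a & 0xFFu);
    int32_t hi = ((int32_t)a - lo) / 256;        /* exact: a - lo is a multiple of 256 */
    return (int16_t)floor_div(hi * (int32_t)b + 128, 256);
}

/* "ulx" step: unsigned lower byte of a times b, upper 16 bits of the 32-bit value. */
static int16_t model_fmul8ulx16(int16_t a, int16_t b)
{
    int32_t lo = (int32_t)((uint16_t)a & 0xFFu);
    return (int16_t)floor_div(lo * (int32_t)b + 32768, 65536);
}

/* Composite 16x16 -> 16-bit fixed-point multiply (sum of both steps). */
static int16_t model_mul16(int16_t a, int16_t b)
{
    return (int16_t)(model_fmul8sux16(a, b) + model_fmul8ulx16(a, b));
}
```

With a = 0x5295 and b = 0x1649 the two steps give 1827 and 13, which add up to 1840 = 0x0730, the value shown in the example.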
37. This is a more efficient equivalent of using vis_write_hi() and vis_write_lo(), since the compiler attempts to minimize the number of floating-point move operations by strategically using register pairs.

Example
    vis_f32 data1_32, data2_32;
    vis_d64 data_64;

    /* Produces data_64, with data1_32 as the upper and data2_32 as the lower component. */
    data_64 = vis_freg_pair(data1_32, data2_32);

4.3.4 vis_to_float()

Function
Place a vis_u32 variable into a floating-point register without performing a floating-point conversion.

Syntax
    vis_f32 vis_to_float(vis_u32 data_32);

Description
The semantics of the C compiler require a format conversion when assigning an integer data_32 to a float variable. Since VIS does not operate on floating-point variables but only uses the floating-point registers, vis_to_float() bypasses the float conversion and stores the unmodified bit pattern in a floating-point register.

Example
    vis_u32 data_32;
    vis_f32 f;

    f = vis_to_float(data_32);

The same result would be achieved by the following statement:

    f = *((vis_f32 *) &data_32);

Taking an illustrative example:

    data_32 = 21845 (5555 base 16, 0101010101010101 base 2)

    f = data_32;

will result in f containing a floating-point representation of 21845.0, which will have a completely different bit pattern than the one shown.

f vis to floa
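The difference between copying a bit pattern and converting a value can be demonstrated in portable C. This sketch uses memcpy rather than vis_to_float(), and it assumes that float is a 32-bit IEEE 754 single-precision type; bits_to_float and float_to_bits are hypothetical helper names.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float without any numeric conversion. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Read back the raw 32-bit pattern stored in a float. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Under that assumption, float_to_bits(bits_to_float(21845u)) returns 0x00005555 unchanged, whereas float_to_bits((float)21845u) returns 0x46AAAA00, the encoding of the value 21845.0.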
38. a 16-bit partitioned resultu. The 24-bit product is rounded to 16 bits. The operation is illustrated in Figure 4-13.

vis_fmul8ulx16() multiplies the unsigned lower 8 bits of each 16-bit element of data1_4_16 by the corresponding 16-bit element in data2_4_16. Each 24-bit product is sign-extended to 32 bits. The upper 16 bits of the sign-extended value are returned in a 16-bit partitioned resultl. The operation is illustrated in Figure 4-14.

Because the result of fmul8ulx16 is conceptually shifted right 8 bits relative to the result of fmul8sux16, the two results have the proper relative significance to be added together to yield the 16-bit products of data1_4_16 and data2_4_16.

Each of the partitioned multiplications in this composite operation multiplies two 16-bit fixed-point numbers to yield a 16-bit result, i.e., the lower 16 bits of the full-precision 32-bit result are dropped after rounding. The location of the binary point in the fixed-point arguments is under the user's control. It can be anywhere from the right of bit 0 to the left of bit 14.

For example, each of the input arguments can have 8 fractional bits, i.e., the binary point is between bit 7 and bit 8. If a full-precision 32-bit result were provided, it would have 16 fractional bits, i.e., the binary point would be between bits 15 and 16. Since, however, only 16 bits of the result are provided, the lower 16 fractional bits are dropped after rounding.
39. and operate at the system clock frequency, which can be either 1/2 or 1/3 of the processor clock. Collectively, the UDBs have FIFOs for eight 16-byte noncacheable stores, one 64-byte read buffer, two 64-byte write buffers and a 64-byte copyback buffer. The large number of outstanding 16-byte stores is useful for maintaining peak store bandwidth to a frame buffer.

System transactions are packet based, in that address and data transfers are disjoint, non-interfering events. A 36-bit address bus is used to deliver two-cycle request packets that begin a transaction. This bus can be shared by up to three other masters in addition to a centralized system controller.

Arbitration is distributed. Each master on the address bus has the same logic and sees all requests for the bus. There are five potential requests: four potential masters plus one from a high-priority system controller. Arbitration is round-robin, with a hysteresis effect to reduce latency for the last master. This helps reduce latency for bursts of transactions from the same master. There is also a special parking mode for uniprocessors that typically reduces arbitration latency to zero by keeping UltraSPARC enabled onto the address bus between transactions.

2.4 Processor Pipeline

The functions performed by the IEU, LSU and FGU are implemented in a dual 9-stage pipeline. Most instructions go through the pipeline in exactly 9 stages. The instructions are considered terminated after
40. follows: italic font is used to refer to variables in text. Typewriter font is used for code examples. Bold font is used for emphasis.

The VIS User's Manual is designed to introduce you to the VIS Instruction Set, to permit you to write image processing, graphics or other applications for the UltraSPARC processor.

Chapter 1, Introduction, presents a high-level overview of the UltraSPARC I superscalar processor and the performance advantages of the VIS Instruction Set.

Chapter 2, UltraSPARC Concepts, presents some of the hardware features of the UltraSPARC I that account for the substantial performance enhancement.

Chapter 3, Development Flow, introduces you to the VIS development environment, which includes the SPARCompiler 4.x, the VIS simulator (a development and debugging tool) and INCAS (It's a Nearly Cycle Accurate Simulator), a nearly cycle-accurate simulator of the UltraSPARC I processor.

Chapter 4, Using VIS, introduces you to VIS and includes simple examples of instruction use.

Chapter 5, Advanced Topics, presents a sampling of example programs taken from the application areas of imaging, graphics, audio and video.

Appendix A, Performance Optimization, presents some suggestions for performance optimization.

Appendix B, Extending an XIL program using VIS, presents how a function coded with VIS can be incorporated into a higher-level library like XIL.

Related Documents
41. in vis invers8b
Figure 4-26 Blocked-Byte Data Formatting Structure
Figure 4-27 Three-Dimensional Array Fixed-Point Address Format
Figure 4-28 Three-Dimensional Array Blocked-Address Format (Array8)
Figure 4-29 Three-Dimensional Array Blocked-Address Format (Array16)
Figure 4-30 Three-Dimensional Array Blocked-Address Format (Array32)
Figure 5-1 Simultaneous Computation of 8 Filter Output Values

1 Introduction

1.1 Overview

This chapter presents a brief introduction to the UltraSPARC I superscalar processor, with special emphasis on the VIS Instruction Set. Topics included in this chapter are:

- Description of UltraSPARC I
- Introduction to the VIS Instruction Set

1.2 UltraSPARC I

UltraSPARC I is a highly integrated superscalar processor implementing the 64-bit SPARC V9 RISC architecture. The processor's major performance feature is its ability to sustain an execution rate of four instructions per cycle at a high clock rate, even in the presence of conditional branches and cache misses. UltraSPARC I supports 64-bit virtual addresses and integer data sizes up to 64 bits, while preserving compatibility with code written for the 32-bit SPARC V8 proces so
42. kk ck k kk k k CYCLE DONE XX XX XX RA kk KK ieul cycle ieul watchdisp 984 L FP1 vdk vis blend88 0x4c 0x8398 fmul8x16 f5 f16 f16 ieul watchpipe GE C NOP W ieul watchpip ieul watchpipe LK IGFE ieul watchpip JH uil cyclecount 0x000003d9 KK kck kk kc kckckckckck ck ck ck k kk kk K kCYCLE DONE XX XXX kk KK k kx ieul cycle ieul watchdone 985 E FPO vdk vis blend88 0x30 0x837c fmovs f8 f0 ieul watchpipe G E C OPW ieul watchpip ieul watchpipe LKI GF ieul watchpip J H uil cyclecount 0x000003da KOK KK KK kc kckckck ck ckckckck ck kk kc kk k kCY CL E DON BKKKKK KKK KKK KKK KK KKK KKK KK In pipeline watch each instruction is represented by a case sensitive letter and is shown going through seven stages of the pipeline The above example shows three cycles of output In the first cycle instruction A 1d is being loaded and in struction K fmul8x16 is being dispatched In the second cycle instruction L fmul8x16 is being dispatched and instruction K has moved to the second stage of the pipeline In the third cycle instruction E fmovs is finishing instruction K has moved to the third stage of the pipeline and instruction L has moved to the second stage of the pipeline 8 Run the simulation with some watches on INCAS will continually output the status changes of the watches until it reaches a breakpoint ieul debug
43. ns 13 define XX f0 define O00 f16 define O01 f17 define O10 f18 define 011 f19 define O20 f20 define O21 f21 define 030 f22 define O31 f23 define 040 f24 define O41 f25 define 050 f26 define O51 S 27 define 060 f28 define O61 f29 define O70 30 define O71 f31 define 00 f16 define 01 f18 define 02 f20 define O3 f22 define O4 f24 define O5 S 26 define O6 S 28 define O7 f30 define A0 f32 define A1 f34 define A2 f36 define A3 38 define A4 S 40 define A5 f42 define A6 f44 define A7 S 46 define BO S 48 define Bl f50 define B2 f52 define B3 f54 define B4 f56 define B5 f58 define B6 60 define B7 S 62 section Sun Microelectronics 100 text alloc execinstr 4 Using VIS ENTRY vis inverse 8 blk function name save Ssp MINFRAME sp reserve space for stack do some error checking tst sz ble pn Sicc ret calculate loop count sra sz 6 ns add src sz se mov src sa mov dst da MEMBAR_BEFORE_BLD BLD A0 BLD BO loop bon INVERSE A0 BLD AO BST INVERSE BO BLD B0 BST bg pt icc loop bgn loop end MEMBAR AFTI reti ret restore S ER_BLD and adjust register window size gt 0 64 bytes per loop end address of source issue memory barrier instruction to ensure all previous memory load and store has completed issue the 2nd block load instruction
44. of point 30 PRIMARY fixedl xels 8 15 xels 16 23

Advanced Topics

5.1 Overview

This chapter presents sample programs that illustrate the use of the VIS instruction set. Sample programs presented are from the following major application areas:

- Imaging
- Graphics
- Audio
- Video

5.2 Imaging Applications

5.2.1 Resampling of Aligned Data With a Filter Width of 4

This example illustrates the resampling of a pixel array by a filter requiring four pixel values. The use of VIS instructions illustrates the speedup made possible by partitioned arithmetic, which permits the simultaneous computation of 8 filter output values. Figure 5-1 shows four columns, each with 8 data elements, of input data from which 8 output values are simultaneously computed. This figure assumes a 2-dimensional layout of the input data, which does not need to be the case.

Figure 5-1 Simultaneous Computation of 8 Filter Output Values (input columns p, p+1, p+2, p+3)

Input data ibuf[i], stored in transposed form, contains the pixels from column i of 8 consecutive rows. obuf[j] is computed as a weighted sum of the four columns:

    f0 * ibuf[iTable[j]] + ... + f3 * ibuf[iTable[j] + 3]

The input and output data in ibuf and obuf are assumed to be aligned on 64-bit boundaries, so that the use of vis_faligndata(), vis_alignaddr() and vis_edge8() is not required. Th
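The weighted sum described above can be written as a scalar reference in portable C. This is a sketch only: it processes the 8 rows one at a time instead of in parallel, and the column8 type, the resample4 name, the single shared set of four weights and the 8.8 fixed-point weight format are all assumptions made for illustration, not details taken from the manual.

```c
#include <stddef.h>
#include <stdint.h>

enum { ROWS = 8, TAPS = 4 };

/* One column of the transposed input: the pixels of 8 consecutive rows. */
typedef struct {
    uint8_t px[ROWS];
} column8;

/*
 * Scalar reference: obuf[j] = f[0]*ibuf[iTable[j]] + ... + f[3]*ibuf[iTable[j]+3],
 * computed independently for each of the 8 rows.
 * Assumption: weights are 8.8 fixed point (256 represents 1.0).
 */
static void resample4(const column8 *ibuf, const int *iTable,
                      const int f[TAPS], size_t nout, column8 *obuf)
{
    for (size_t j = 0; j < nout; j++) {
        for (int r = 0; r < ROWS; r++) {
            int acc = 0;
            for (int k = 0; k < TAPS; k++) {
                acc += f[k] * ibuf[iTable[j] + k].px[r];
            }
            acc = (acc + 128) / 256;      /* round and drop the fraction */
            if (acc < 0) {
                acc = 0;                  /* clamp to the 8-bit pixel range */
            }
            if (acc > 255) {
                acc = 255;
            }
            obuf[j].px[r] = (uint8_t)acc;
        }
    }
}
```

A reference like this is mainly useful for checking the output of a partitioned implementation, since the inner loop over rows is exactly the part that the 8-wide arithmetic replaces.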
45. performance.

The three versions of the array instruction, array8, array16 and array32, differ only in the scaling of the computed memory offsets. array16 shifts its result left by one position and array32 shifts left by two, in order to handle 16- and 32-bit texture data.

When using the array instructions, a blocked-byte data formatting structure is imposed. The N x N x M volume, where N = 2^n x 64, M = m x 32, 0 <= n <= 5, 1 <= m <= 16, should be composed of 64 x 64 x 32 smaller volumes, which in turn should be composed of 4 x 4 x 2 volumes. This data structure is optimal for 16-bit data. For 16-bit data, the 4 x 4 x 2 volume has 64 bytes of data, which is ideal for reducing cache-line misses; the 64 x 64 x 32 volume has 256 kbytes of data, which is good for improving the TLB hit rate. Figure 4-26 illustrates how the data has to be organized, where the origin (0, 0, 0) is assumed to be at the lower left front corner and the x coordinate varies faster than y, which varies faster than z; i.e., when we traverse the volume from the origin to the upper right back, we go from left to right, front to back, bottom to top.

Figure 4-26 Blocked-Byte Data Formatting Structure

The array instructions have 2 inputs:

1. The (x, y, z) coordinates are input via a single 64-bit integer organised as shown in Figure 4-27.

(field boundaries at bits 55/54, 44/43, 33/32, 22/21 and 11/10) Figure 4 27 Three Dimension
46. predicted set of the target. The high bandwidth provided by the I-Cache (4 instructions/cycle) allows the UltraSPARC to prefetch instructions ahead of time, based on the current instruction flow and on branch prediction. Providing a fetch bandwidth greater than or equal to the maximum execution bandwidth assures that, for well-behaved code, the processor does not starve for instructions. Exceptions to this rule occur when branches are hard to predict, when branches are very close to each other, or when the I-Cache miss rate is high.

2.5.2 Stage 2: Decode (D) Stage

In this stage the fetched instructions are pre-decoded and sent to the Instruction Buffer. The pre-decoded bits generated during this stage accompany the instructions during their stay in the Instruction Buffer. Upon reaching the next stage, where the grouping logic lives, these bits speed up the parallel decoding of up to 4 instructions. While it is being filled, the Instruction Buffer also presents up to 4 instructions to the next stage. A pair of pointers manage the Instruction Buffer, ensuring that as many instructions as possible are presented in order to the next stage.

2.5.3 Stage 3: Grouping (G) Stage

This stage's main task is to group and dispatch a maximum of four (4) valid instructions in one cycle. It receives a maximum of 4 valid instructions from the Prefetch and Dispatch Unit (PDU), it controls the In
47. s1 1 vdk vis blend r exit 0 A char argv s2 1 d 1 ali 88 sl s2 d a 3 6 9 2 Assembly Listing for vis example3 section file text alloc execinstr vis_example3 c section text tfalloc execinstr 0x0000 0 align 4 1 SUBROUTINE vdk_vis_blend88 I OFFSET SOURCE LINE LABEL INSTRUCTION global vdk vis blend88 vdk vis blend88 0x0000 save sp 96 sp FILE vis example3 c 1 Copyright C 1995 Sun Microsystems Inc 3 Lx 4 FUNCTION 5 jp ik vdk vis blend88 blend two 8 pixel arrays 6 p 7 SYNOPSIS 8 ox void vdk vis blend88 vis d64 sp1 vis_d64 sp2 9 I x vis_d64 dp vis_d64 ap 10 LA TE ARGUMENT l 12 1 Spl pointer to 8 bytes of source data 1 13 ES sp2 pointer to 8 bytes of source data 2 l 14 E dp pointer to 8 bytes of destination data 15 ROI ap pointer to 8 bytes of alpha coefficient 16 bk 17 DESCRIPTION Sun Microelectronics 36 18 Ox Blend 19 b dst 0 lt 20 Li 4 j 22 include 23 include 24 include 1 I 27 Joc Koko k kok k kok k k kok k lee 3 Development Flow two arrays with a alpha coefficient array alpha srcl 255 alpha src2 255 alpha lt lt stdlib h gt vis types vis proto hin cae OKCKCKCKCKCKCKCKCKCkCKCkCKCkCK Ck K Ck k Ck k k ck Ck ck ck ck ck I x ke x I 2
48. they go through the last stage (W), after which changes to the processor state are irreversible. Figure 2-7 shows a diagram of the integer and floating-point pipeline stages.

Three additional stages are added to the integer pipeline to make it symmetrical with the floating-point pipeline. This simplifies pipeline synchronization and exception handling, and eliminates the need to implement a floating-point queue.

Floating-point instructions with a latency greater than 3 (divide and square root) behave differently from other instructions, in the sense that the pipe is extended when the instruction reaches stage N. Memory operations are allowed to proceed asynchronously with the pipeline in order to support latencies longer than the latency of the on-chip data cache.

Figure 2-7 UltraSPARC I 9-Stage Dual Pipeline
Integer pipe: E (Execute), C (Cache Access), N1 (D-Cache Hit/Miss), N2 (FP Pipe Sync), N3 (Traps are resolved), W (Write).
Floating-point/graphics pipe: R (Register), X1 (Start Execution), X2 (Execution Continued), X3 (Finish Execution).

2.5 Pipeline Stage Description

2.5.1 Stage 1: Fetch (F) Stage

In this stage, instructions are fetched from the Instruction Cache (I-Cache) and placed in the Instruction Buffer, from where they will be selected for execution. Up to four instructions are fetched along with branch prediction information, the predicted target address of a branch, and the
49. word2 wordl word3 word2 vis_u32 src 2 i next word3 vis_u32 src 2 i next 1 lookup vis_ld_u8_i vis_ras table byte7 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte6 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte5 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte4 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte3 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte2 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table bytel accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte0 accum vis faligndata lookup accum vis d64 dst i accum break Case 3 for i 0 i lt doubles 1 1 byte0 word0 amp Oxff bytel wordl gt gt 24 byte2 wordl gt gt 16 amp Oxff byte3 wordl gt gt 8 amp Oxff byte4 wordl amp Oxff byte5 word2 gt gt 24 byte6 word2 gt gt 16 Oxff byte7 word2 gt gt 8 Oxff wordO0 word2 wordl word3 word2 vis_u32 src 2 i next word3 vis_u32 src 2 i next 1 lookup vis ld u8 i vis ras table byte 7 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte6 accum vis faligndata lookup accum lookup vis 1d u8 i vis ras tab
50. write gsr writes gsr to the Graphics Status Register, and vis_read_gsr() reads the contents of the Graphics Status Register.

Figure 4-3 Graphics Status Register format (scale_factor in bits 6:3, alignaddr_offset in bits 2:0)

Example
This example illustrates writing to the GSR and changing the scale factor only.

    vis_u8 scalef;

    vis_write_gsr((scalef << 3) | (vis_read_gsr() & 0x7));

Note: If you are writing a multi-threaded VIS application, the Graphics Status Register (GSR) is a resource that can be shared between multiple threads. After setting the GSR, a thread should not voluntarily give up control (say, via a mutex) to another thread that also sets the GSR. In that case, the contents of the GSR cannot be relied on after the first thread regains control. If, however, the thread is involuntarily made to give up control to the other thread (say, by an interrupt from the operating system), the operating system performs the necessary context switch, so that each thread can depend on the GSR being uncorrupted.

4.3.2 vis_read_hi(), vis_read_lo(), vis_write_hi(), vis_write_lo()

Function
Read and write to the upper or lower component of a vis_d64 variable.

Syntax
    vis_f32 vis_read_hi(vis_d64 variable);
    vis_f32 vis_read_lo(vis_d64 variable);
    vis_d64 vis_write_hi(vis_d64 variable, vis_f32 uppercomp);
    vis_d64 vis_write_lo(vis_d64 variable, vis f32 l
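The read-modify-write pattern in the example above can be modelled on a plain integer in portable C. This sketch does not touch the real register; gsr_with_scale is a hypothetical helper that mirrors the expression (scalef << 3) | (gsr & 0x7), and the 4-bit width of the scale field is taken from the register layout described above.

```c
#include <stdint.h>

/*
 * Return a GSR value with a new scale factor (bits 6:3) while keeping
 * the current alignaddr offset (bits 2:0).
 */
static uint32_t gsr_with_scale(uint32_t gsr, unsigned scalef)
{
    return ((scalef & 0xFu) << 3) | (gsr & 0x7u);
}
```

For instance, starting from a value with scale factor 15 and offset 5 (0x7D), requesting scale factor 3 gives 0x1D: the offset is preserved and only the scale field changes.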
51. E hpipe JH count 0x000003d9 Ck kk ck k ck kk kk kX CYCLE DONE KKK KK kk kk KK KK kk x x x hdone 985 E FPO vdk vis blend88 0x30 0x8370 f0 hpipe G E C OPW hpipe hpipe LK IGF hpipe JH count 0x000003da Ck kk kk ck kk kk k CYCLE DONE AA KKK KKK KK KK KK kk x xx hdone 986 F FPO vdk vis blend88 0x34 0x8380 fmovs hdisp 986 M FPO vdk vis blend88 0x50 0x839c f10 f14 f14 hpipe GECNOPW hpipe hpipe M LK 1G hpipe JH count 0x000003db Ok kk kk ck kk kk kX CYCLE DONE X X X XXX kk kk KK KK kk x xx hdone 987 G FPO vdk vis blend88 0x38 0x8384 f0 f10 14 hdone 987 H FP1 vdk vis blend88 0x3c 0x8388 f2 f10 10 hdisp 987 N FPO vdk vis blend88 0x54 0x83a0 f12 f16 f10 hdisp 987 O FP1 vdk vis blend88 0x58 0x83a4 f14 f0 Sun Microsystems Inc 33 VIS Instruction Set User s Manual ieul watchpipe GE CNOP W ieul watchpip ieul watchpipe N Ke Le ieul watchpip O J uil cyclecount 0x000003dc KK eK KK kk KK kA kA kA kk kk CYCLE DONF ck ck ck ck Ck ck kk ck Ck ck kk ko ko ko Sk ko ko ko ko ko ieul watchdone 988 I FPO vdk vis blend88 0x40 0x838c fpsubl6 f0 f12 f16 ieul watchdone 988 J FP1 v
52. GSR are cycle-expensive operations, so use them sparingly.

Another cycle-expensive operation is vis_alignaddr(), because it does not get grouped with any other instruction. You should typically use it outside a loop.

When joining two vis_f32 variables into a single vis_d64 variable, vis_freg_pair() is more efficient than vis_write_hi() and vis_write_lo(). This is because the compiler attempts to minimize the number of floating-point move operations by a strategic use of register pairs.

A.5 Advantage of Using Pre-Aligned Data

Since most of the VIS instructions require 8-byte aligned data, non-aligned data must be accessed with vis_alignaddr() and vis_faligndata(). vis_alignaddr(), however, is a very cycle-expensive operation because it does not get grouped with any other instruction. In some cases, dealing with data alignment takes 30% of the running time.

One way to avoid the penalty for vis_alignaddr() and vis_faligndata() is to use pre-aligned data, that is, data which starts at 8-byte aligned addresses (64-byte aligned addresses for code using block load/store instructions). A 64-byte aligned data block can be allocated with the following C code:

    vis_u8 *buf;
    vis_u8 *img;    /* 64-byte aligned address */

    buf = (vis_u8 *) malloc(imagesize + 64);
    img = (vis_u8 *) (((vis_u32) buf & ~0x3f) + 64);

In addition to pre-aligned data, if the image s
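The same over-allocate and round-up idea can be expressed in portable C with uintptr_t, which avoids assuming that a pointer fits in 32 bits. This is a sketch under that substitution; align64 is a hypothetical helper name and is not part of the VIS headers.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Return the first 64-byte aligned address strictly above raw.
 * The caller must have allocated at least 64 bytes more than it needs,
 * and must pass the original pointer (not the aligned one) to free().
 */
static unsigned char *align64(void *raw)
{
    uintptr_t p = (uintptr_t)raw;
    return (unsigned char *)((p & ~(uintptr_t)0x3f) + 64);
}
```

Note that this always advances by between 1 and 64 bytes, even when the original pointer is already aligned, which is why the allocation adds a full 64 bytes of padding.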
53. NCAS run the script SINCASHOME bin incas_startup You should see screen output similar to following Incas Release 2 0 Beta Configuration phase pwd is opt SUNWincas lib Preprocessing configuration file opt SUNWincas lib us 1 conf Parsing configuration file opt SUNWincas lib us 1 conf Creating C module classes Creating module instances and interfaces Performing interface configurations Performing shared object registrations Performing shared object lookups Performing interface configuration verifications Reading ui commands from opt SUNWincas lib incasrc Negative phase is active ieul incasrc is a command file that is executed by INCAS at start up These com mands typically set up some environment variables and some common conve nience aliases 3 6 5 Getting Help You can get information on commands at any point in INCAS with the command help Note however that each module has some unique commands A list of INCAS commands available in help can be found in the file sINCASHOME lib command list For a comprehensive description of INCAS commands refer to the INCAS Users Guide 2 0 found in INCASHOME manu als INCASuserguide ps 3 6 6 Interrupting and Quitting INCAS To interrupt and exit INCAS at any time enter your interrupt character which is lt CTRL gt C by default The INCAS prompt will return after it is interrupted Use command quit to exit INCAS Sun Microelectronics 28
54. Operations sssssssssssseseeeenenenetnne tentent 128 A 5 Advantage of Using Pre Aligned Data sss 128 Extending an XIL program using VIS sse en 131 DL OVA A ette e GRE HI etia 131 B2 Extending AL mines eo p iti 132 Indek Sresi einan cease ate AAKE E ea Ve e AE EE 135 Sun Microelectronics x Listof Figures Figure 1 1 Figure 2 1 Figure 2 2 Figure 2 3 Figure 2 4 Figure 2 5 Figure 2 6 Figure 2 7 Figure 3 1 Figure 4 1 Figure 4 2 Figure 4 3 Figure 4 4 Figure 4 5 Figure 4 6 Figure 4 7 Figure 4 8 Figure 4 9 Figure 4 10 Figure 4 11 Figure 4 12 Figure 4 13 Figure 4 14 Figure 4 15 Figure 4 16 Figure 4 17 Figure 4 18 Figure 4 19 Figure 4 20 Figure 4 21 Four multiplications performed in a single cycle ss 3 Simplified Block Diagram of UltraSPARC I eee 7 UltraSPARC I Front End eee secte ete ae atia tres 8 Integer Execution Unit sse tenente 10 Floating Point and Graphics Unit sess 11 Eo ad7Store Unit cereo eredi e etie e e ete 13 UltraSPARC I System Interface oooconcocinnnononnnncnnenonannnnnnaneraranonnnnanarannrnnrarananonos 15 UltraSPARC I 9 Stage Dual Pipeline sss 17 INCAS Accuracy Model ete tre nettes 26 Graphics Data Formats sss eene 42 Partitioned Data Formats eee nnne nnne 43 Graphics Status Register format sss 44 Fo
55. VIS Instruction Set User's Manual
July 1997

Sun Microelectronics
2550 Garcia Avenue, Mountain View, CA 94043 U.S.A.
1-800-681-8845
www.sun.com/sparc
Part Number: 805-1394-01

Copyright © 1997 Sun Microsystems, Inc. All Rights Reserved.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED "AS IS" WITHOUT ANY EXPRESS REPRESENTATIONS OR WARRANTIES. IN ADDITION, SUN MICROSYSTEMS, INC. DISCLAIMS ALL IMPLIED REPRESENTATIONS AND WARRANTIES, INCLUDING ANY WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT OF THIRD PARTY INTELLECTUAL PROPERTY RIGHTS.

This document contains proprietary information of Sun Microsystems, Inc. or under license from third parties. No part of this document may be reproduced in any form or by any means or transferred to any third party without the prior written consent of Sun Microsystems, Inc.

Sun, Sun Microsystems and the Sun logo are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The information contained in this document is not designed or intended for use in on-line control of aircraft, air traffic, aircraft navigation or aircraft
56. XXXXXXXXXXX df18 XXXXXXXXXXXXX df20 XXXXXXXXXXXXX df22 XXXXXXXXXXXXX df24 XXXXXXXXXXXXX df26 XXXXXXXXXXXXX df28 XXXXXXXXXXXXX df30 XXXXXXXXXXXXX df32 XXXXXXXXXXXXX df34 XXXXXXXXXXXXX df36 XXXXXXXXXXXXX df38 XXXXXXXXXXXXX df40 XXXXXXXXXXXXX df42 XXXXXXXXXXXXX df44 XXXXXXXXXXXXX df46 XXXXXXXXXXXXX df48 XXXXXXXXXXXXX df50 XXXXXXXXXXXXX df52 XXXXXXXXXXXXX df54 XXXXXXXXXXXXX df56 XXXXXXXXXXXXX df58 XXXXXXXXXXXXX df60 XXXXXXXXXXXXX df62 XXXXXXXXXXXXX fprs 0x05 fef 1 du 0 dl 1 fsr 0x0000000000000000 fcc3 fcc2 fccl fcc0 ns ver qne rd ftt 0 0 0 near none invalid overflow underflow divzero inexact tem aexc E cexc 5 gsr 0x00000018 scale f 3 align 0 7 Cycle through a simulation with the watches turned on debug ieu in the following listing is a macro that sets up some watches It is defined in the INCASHOME lib incasrc command file Sun Microsystems Inc 31 VIS Instruction Set User s Manual ieul debug ieu ieul cycle ieul watchload 983 A vdk vis blend88 0x20 0x836c ld sp 0x5c f8 mem addr memory 0x6ffe7c 0x7ffe7c f8 OxOffO0Off0 2 36720e 29 ieul watchdisp 983 K FP1 vdk vis blend88 0x48 0x8394 fmul8x16 f4 f14 f14 ieul watchpipe GE CNOPW ieul watchpipe ieul watchpipe K IGFE ieul watchpip JH uil cyclecount 0x000003d8 KK kk kk k ck ck kck ck ck
57. XXXXXXXXXXXXX 4 OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX 5 OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX 6 0x00000000007ffe80 OxXXXXXXXXXXXXXXXX 0x00000000007ffe20 OxXXXXXXXXXXXXXXXX 7 0x0000000000008594 OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX y OxXXXXXXXX Sp 0x00000000007ffe20 memory 0x6ffe20 fp 0x00000000007ffe80 memory 0x6ffe80 pstate 0x0lc cle 0 tle 0 vg 0 mg 0 mm 0 red 0 pef 1 am 1 priv 1 ie 0 ag 0 Ceres 0x00 XCC TEC pil OxX Window state registers cwp 2 cansave 3 canrestore 3 otherwin 0 cleanwin 6 other 0 normal 0 6 Thecontent of floating point registers may be examined with the command ieul fregs fregs when you focus on ieu1 ieul fregs f00 2 36720e 29 f01 2 36720e 29 f02 0 00000 03 0 00000 04 0 00000 05 0 00000 06 0 00000 07 0 00000 08 2 36720e 29 f09 XXXXXXXXXXXXX f10 0LDER H 11 OLDER H 12 0 00000 13 0 00000 f14 0LDER G f15 OLDER G f16 XXXXXXXXXXXXX f17 XXXXXXXXXXXXX f18 XXXXXXXXXXXXX 19 XXXXXXXXXXXXX f20 XXXXXXXXXXXXX f21 XXXXXXXXXXXXX f22 XXXXXXXXXXXXX 23 XXXXXXXXXXXXX f24 XXXXXXXXXXXXX f25 XXXXXXXXXXXXX 26 XXXXXXXXXXXXX 27 XXXXXXXXXXXXX f28 XXXXXXXXXXXXX 29 XXXXXXXXXXXXX 30 XXXXXXXXXXXXX 31 XXXXXXXXXXXXX df00 6 46621e 232 df02 0 00000 df04 0 00000 df06 0 00000 df08 XXXXXXXXXXXXX df10 XXXXXXXXXXXXX df12 0 00000 df14 XXXXXXXXXXXXX dfl6 XX
58. a physical address. On a load, when there are no other outstanding loads, the data array is accessed so that the data can be forwarded to dependent instructions in the pipeline as soon as possible.

ALU operations executed in the E-Stage generate condition codes in the C-Stage. The condition codes are sent to the PDU, which checks whether a conditional branch in the group was correctly predicted. If the branch was mispredicted, earlier instructions in the pipe are flushed and the correct instructions are fetched. The results of ALU operations are not modified after the E-Stage; the data merely propagates down the pipeline (through the annex register file), where it is available for bypassing for subsequent operations.

In the Floating-Point/Graphics pipe, this stage is the X-Stage. Instructions start their execution during this stage. Instructions of latency one also finish their execution phase during the X-Stage.

2.5.6 Stage 6: N1 Stage

In this stage, a data cache miss/hit or a TLB miss/hit is determined. If a load misses the D-Cache, it enters the Load Buffer. The access will arbitrate for the E-Cache if there are no older unissued loads. If a TLB miss is detected, a trap will be taken and the address translation obtained by a software routine. The physical address of a store is sent to the Store Buffer during this stage. To avoid pipeline stalls when store data is not immediately available, the store address and data parts are decoupled a
59. achieved by electing to use the -fast compile option, since this option chooses the fastest code generation option available on the compile-time hardware. For routines using VIS code, you must include the vis.il file on the command line to resolve VIS function calls. This replaces each VIS instruction with an inline assembly macro implementation. An example illustrating the assembly implementation of vis_fpadd16() is presented in section 3.4.2 on page 23.

If you use -fast with additional optimization option levels -xO[1|2|3|4|5], note that the last optimization level specified in the options string is used, so the basic optimization level of -fast may be overridden.

When compiling VIS code, you must specify the target processor by setting the flag -xchip=ultra and identify the instructions that the compiler may use by setting the flag -xarch=v8plusa. The following example illustrates the compilation and linking of two VIS files:

    cc -c vis.il -xchip=ultra -xarch=v8plusa file1.c
    cc -c vis.il -xchip=ultra -xarch=v8plusa file2.c
    cc file1.o file2.o -o file

Setting the v8plusa flag specifies the 32-bit subset of the 64-bit v9 architecture, including the VIS extension. If you would like to generate assembly code, say file1.s, then use the -S flag, i.e.:

    cc -S vis.il -xchip=ultra -xarch=v8plusa file1.c

3.4.2 Inline Assembly Implementation of vis_fpadd16()

Code Example 3-1 shows the assembly implementation of inline m
60. acro vis_fpadd16().

Code Example 3-1 Inline Assembly Implementation of vis_fpadd16()

    .inline vis_fpadd16,4
    std     %o0,[%sp+0x48]
    ldd     [%sp+0x48],%f4
    std     %o2,[%sp+0x48]
    ldd     [%sp+0x48],%f10
    fpadd16 %f4,%f10,%f0
    .end

3.5 VIS Simulator

The VIS simulator is a development and debugging tool which permits you to test your VIS code on any platform. Linking with the simulator library (libvis_sim.so or libvis_sim.a) supplied with the developer's kit resolves the VIS function calls with a C simulation of the VIS instruction set. The following example shows the compilation of two VIS code files and the linking with the simulator to create the executable binary:

    cc -c file1.c
    cc -c file2.c
    cc file1.o file2.o -o file -L$VSDKHOME/vis_sim -lvis_sim

The resulting binary will run on any machine and produce results that are identical to those produced by the UltraSPARC-specific binary. Although it executes quite slowly, this option permits independent verification of algorithms and debugging of VIS code in an independent environment. The following is an example of a simulator implementation of vis_fpadd16().

3.5.1 Example of Simulator Implementation of vis_fpadd16()

Code Example 3-2 illustrates the simulator implementation of the partitioned addition of two 4x16-bit partitioned values.

Code Example 3-2 Simulator Implementation of vis_fpadd16()

unio
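As a rough illustration of what a C simulation of a partitioned add involves, the sketch below adds the lanes of two 64-bit values independently. It is not the kit's simulator code: sim_fpadd() and the use of uint64_t in place of vis_d64 are assumptions made for this example, and it models lanes that wrap modulo their width rather than saturating.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a partitioned add. `lane_bits` selects four
 * 16-bit lanes or two 32-bit lanes inside the 64-bit operands. Carries
 * never cross a lane boundary; each lane wraps modulo 2^lane_bits. */
static uint64_t sim_fpadd(uint64_t a, uint64_t b, unsigned lane_bits)
{
    uint64_t lane_mask;
    uint64_t result = 0;
    unsigned shift;

    assert(lane_bits == 16 || lane_bits == 32);
    lane_mask = (1ULL << lane_bits) - 1;    /* 0xffff or 0xffffffff */

    for (shift = 0; shift < 64; shift += lane_bits) {
        uint64_t x = (a >> shift) & lane_mask;
        uint64_t y = (b >> shift) & lane_mask;
        result |= ((x + y) & lane_mask) << shift;  /* wrap inside the lane */
    }
    return result;
}
```

Masking each lane before and after the addition is what keeps a carry out of one lane from leaking into its neighbor, which a plain 64-bit addition would not guarantee.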
61. al Array Fixed Point Address Format

Note that z has only 9 integer bits, as opposed to 11 for x and y. Also note that since x, y, z are all contained in one 64-bit register, they can be incremented simultaneously by using a 64-bit add/sub instruction (addx or subx), thus providing a significant performance boost.

2. The X, Y size of the N x N x M volume. Use the following table for the size specification:

[Table: Number of Elements]

So for a 512 x 512 x 32 or a 512 x 512 x 256 volume, you will input a size value of 3. Note that the X and Y size of the volume have to be the same. The z size of the volume is a multiple of 32, ranging between 32 and 512.

The array instructions output an integer memory offset that, when added to the base address of the volume, gives you the address of the voxel and can be used by a load instruction. The offset is correct only if the data has been reformatted as specified above. The output is formatted as shown in Figure 4-28 for Array8, Figure 4-29 for Array16 and Figure 4-30 for Array32.

Figure 4-28 Three Dimensional Array Blocked Address Format (Array8)

Figure 4-29 Three Dimensional Array Blocked Address Format (Array16)

Figure 4-30 Three Dimensional Array Blocked Address Form
62. ample

    vis_u32 pixels1 = 0x00112233;
    vis_u32 pixels2 = 0xaabbccdd;
    vis_f32 d, e;
    vis_d64 mergeresult;

    d = vis_to_float(pixels1);
    e = vis_to_float(pixels2);
    mergeresult = vis_fpmerge(d, e);
    /* mergeresult = 0x00aa11bb22cc33dd */

4.7.6 vis_alignaddr(), vis_faligndata()

Function
Calculate 8-byte aligned address and extract an arbitrary 8 bytes from two 8-byte aligned addresses.

Syntax

    void *vis_alignaddr(void *addr, int offset);
    vis_d64 vis_faligndata(vis_d64 data_hi, vis_d64 data_lo);

Description
vis_alignaddr() and vis_faligndata() are usually used together. vis_alignaddr() takes an arbitrarily aligned pointer addr and a signed integer offset, adds them, places the rightmost three bits of the result in the address offset field of the GSR, and returns the result with the rightmost 3 bits set to 0. This return value can then be used as an 8-byte aligned address for loading or storing a vis_d64 variable. An example is shown in Figure 4-22.

Figure 4-22 vis_alignaddr() example (data start address da = 0x10005, lying between the aligned boundaries 0x10000 and 0x10008)

    vis_alignaddr(0x10005, 0) returns 0x10000 with 5 placed in the GSR offset field.
    vis_alignaddr(0x10005, -2) returns 0x10000 with 3 placed in the GSR offset field.

vis_faligndata() takes two vis_d64 arguments, data_hi and data_lo. It concatenates these tw
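The interaction between the two calls can be modeled in plain C. The following is a sketch under stated assumptions, not the VIS implementation: sim_alignaddr(), sim_faligndata() and the gsr_align variable are hypothetical names, addresses are modeled as integers, and the two 8-byte operands are treated as 16 consecutive big-endian bytes from which 8 bytes are extracted starting at the recorded offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the 3-bit address offset field of the GSR. */
static unsigned gsr_align;

/* Models vis_alignaddr(): adds offset to addr, records the low three bits
 * of the sum, and returns the sum rounded down to an 8-byte boundary. */
static uintptr_t sim_alignaddr(uintptr_t addr, intptr_t offset)
{
    uintptr_t sum = addr + (uintptr_t)offset;

    gsr_align = (unsigned)(sum & 7u);
    return sum & ~(uintptr_t)7;
}

/* Models vis_faligndata(): views hi:lo as 16 consecutive big-endian bytes
 * and returns the 8 bytes that start gsr_align bytes into them. */
static uint64_t sim_faligndata(uint64_t hi, uint64_t lo)
{
    assert(gsr_align < 8);
    if (gsr_align == 0)
        return hi;                      /* avoids an undefined 64-bit shift */
    return (hi << (8 * gsr_align)) | (lo >> (64 - 8 * gsr_align));
}
```

With the address 0x10005 and an offset of 0, the model returns 0x10000 and records 5; with an offset of -2 it returns 0x10000 and records 3, matching the two cases listed for Figure 4-22.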
63. ases is not what you want. So when the aligned addresses differ, it is best to keep address1 less than or equal to address2. The little-endian versions vis_edge[8|16|32]l() compute a mask that is bit-reversed from the big-endian version.

The following examples illustrate the handling of data boundaries by two functions, vis_inverse8a() and vis_inverse8b(), that lead to identical results but differ in the way that they deal with the starting point. vis_inverse8b() never accesses data beyond the 8-byte aligned start address. Such access occurs with vis_inverse8a() when the offset in the destination address alignment is larger than the offset in the source address alignment. vis_inverse8b() uses one additional vis_alignaddr()/vis_faligndata() pair to deal with the offset of address alignment in the destination. This is a safer approach than vis_inverse8a(). Figure 4-24 illustrates start point handling by the function vis_inverse8a(), and Figure 4-25 illustrates start point handling by the function vis_inverse8b().

Figure 4-24 Start Point Handling in vis_inverse8a() (source data at sp = src passes through vis_alignaddr()/vis_faligndata() and INVERSE, then vis_pst_8() writes to dp = dst with emask = 00111111)

[Diagram for start point handling in vis_inverse8b(): vis_alignaddr(), vis_faligndata(), INVERSE, vis_pst_8(), emask = 00111111, dp = dst inc
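The edge mask shown in the figures (emask = 00111111 for a start offset of 2) can be reproduced with a short C model. This is a sketch that assumes the usual big-endian definition of an 8-bit edge mask, in which the left edge comes from the low three bits of address1 and the right edge from the low three bits of address2, and the two are combined only when both addresses fall in the same 8-byte block. sim_edge8(), sim_edge8l() and reverse8() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Big-endian edge mask model: bit 7 of the mask corresponds to the byte
 * at the lowest address of the 8-byte block. */
static unsigned sim_edge8(uintptr_t address1, uintptr_t address2)
{
    unsigned left  = 0xffu >> (address1 & 7u);
    unsigned right = (0xffu << (7u - (unsigned)(address2 & 7u))) & 0xffu;

    /* Same aligned block: both edges apply. Otherwise only the left edge. */
    if ((address1 & ~(uintptr_t)7) == (address2 & ~(uintptr_t)7))
        return left & right;
    return left;
}

/* Reverses the bit order of an 8-bit mask. */
static unsigned reverse8(unsigned mask)
{
    unsigned out = 0;
    int bit;

    assert(mask <= 0xffu);
    for (bit = 0; bit < 8; bit++)
        if (mask & (1u << bit))
            out |= 0x80u >> bit;
    return out;
}

/* Little-endian variant, modeled as the bit-reversed big-endian mask. */
static unsigned sim_edge8l(uintptr_t address1, uintptr_t address2)
{
    return reverse8(sim_edge8(address1, address2));
}
```

A mask produced this way is the kind of value that would be handed to a partial store so that only the bytes at or after the start address are written.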
64. at (Array32)

See the example on page 101 to see how the array8, the load and the add/sub instructions are used and grouped together for maximum throughput. The grouping takes into consideration the latencies of the different instructions, i.e., the load (ldda) following the array8 does not load the voxel just addressed by the array8 in its grouping, but rather the voxel addressed by array8 in the previous grouping.

The array instructions operate on all 64 bits of an integer register. Solaris 2.5 allows all 64 bits of the registers %g2-%g4 and %o0-%o7 to be used; other registers cannot be relied on to retain their upper 32 bits. Since the current SPARCompiler 4.x has limited support for 64-bit integer operations, the array instructions might not be accessed efficiently from C. For a coding example, see "Using array8 With Assembly Code" on page 101.

4.7.11 vis_pdist()

Function
Compute the absolute value of the difference between two pixel pairs, i.e., between eight pairs of vis_u8 components.

Syntax

    vis_d64 vis_pdist(vis_d64 pixels1, vis_d64 pixels2, vis_d64 accumulator);

Description
vis_pdist() takes three double-precision arguments: pixels1, pixels2 and accum. pixels1 and pixels2 contain 8 pixels each in raw format. The pixels are subtracted from one another pair-wise, and the absolute values of the differences are accumulated into accum. Note that the destination register is a double-precision floating-point register, which co
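The accumulation performed by a pixel-distance operation can be written out in scalar C as follows. This is a minimal sketch rather than the VIS implementation: sim_pdist(), load8() and sad_bytes() are hypothetical names, and uint64_t stands in for both the 8-pixel operands and the accumulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds the absolute differences of eight byte pairs to accum. */
static uint64_t sim_pdist(uint64_t pixels1, uint64_t pixels2, uint64_t accum)
{
    unsigned shift;

    for (shift = 0; shift < 64; shift += 8) {
        unsigned a = (unsigned)((pixels1 >> shift) & 0xffu);
        unsigned b = (unsigned)((pixels2 >> shift) & 0xffu);
        accum += (a > b) ? (a - b) : (b - a);
    }
    return accum;
}

/* Packs eight consecutive bytes into a 64-bit value, first byte highest. */
static uint64_t load8(const uint8_t *p)
{
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Sum of absolute differences over two buffers, eight bytes per step. */
static uint64_t sad_bytes(const uint8_t *a, const uint8_t *b, size_t nbytes)
{
    uint64_t accum = 0;
    size_t i;

    assert(nbytes % 8 == 0);
    for (i = 0; i < nbytes; i += 8)
        accum = sim_pdist(load8(a + i), load8(b + i), accum);
    return accum;
}
```

Passing the running total back in as the third argument is what allows a whole block of pixels to be compared with one accumulating call per eight bytes, the pattern used in block-matching searches.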
65. at fmask float amp mask double dpx11 dpx12 dpyll dpyl2 ddy11 ddyl2 ddx11 ddx12 int idxu idxv ipxu ipxv long long value loop through every span line of the triangle while ily gt 0 Check to see if middle edge expired if imy 0 i dir gt 0 4 ipmx iplx idmx idlx else iphx iplx idhx idlx fpyz fpmz fdyz fdmz fpyu fpmu fdyu fdmu fpyv fpmv fdyv fdmv dpyll dpm11 ddyll ddmll dpyl2 dpml2 ddyl2 ddm12 Compute end of span and adjust to first pixel i iphx FIXMSK gt gt FIXSHF j iphx amp FIXMSK fbx fby i 8 number of pixels in the span xcnt ipmx FIXMSK gt gt FIXSHF i if xcnt gt 0 Sun Microsystems Inc 119 VIS Instruction Set User s Manual a float j pxz int fpyz float idxz gt gt i16 a ipxu int fpyu fdxu a ipxv int fpyv fdxv a dpx11 dpyll dpx12 dpyl2 loop through every pixel while xcnt gt 0 texture color lookup fcolor float amp tm ipxv gt gt v shift lt lt logw ipxu gt gt u shift apply diffuse and specular lighting final color texel amp mask diffuse specular fcolor fcolor amp fmask dpxll dpxl2 fcolor vis fpackl6 vis fpadd16 vis fmul8x16 vis fands fcolor fmask dpxll dpx12 send it to frame buffer value lo
66. atal 32 2 vi _64 ta2_32 ta2_64 ta2_32 ta2 64 ta2 32 data2 64 data2 32 x amp data2 32 data2 64 ta2 64 data2 32 ta2 32 64 data2 64 ta2 64 data2 32 ta2 32 data2 64 ta2 64 data2 32 ta2 32 data2 64 ta2 64 x data2 32 ES data2 64 x data2 32 data2 64 ur data2 32 jl data2 64 x data2 32 Af data2 64 data2 32 El Sun Microsystems Inc 51 VIS Instruction Set User s Manual 4 5 4 5 1 Pixel Compare Instructions vis_femplet le eq ne It ge 16 32 0 Function Perform logical comparison between two partitioned variables and generate an integer mask describing the result of the comparison Syntax in in in in in in in in in in in in vis_d64 datal_4 16 vis_d64 data2_4 16 vis_d64 datal_4 16 vis_d64 data2_4 16 vis d64 datal 4 16 vis d64 data2 4 16 vis d64 datal 4 16 vis d64 data2 4 16 vis d64 datal 2 32 vis d64 data2 2 32 vis d64 datal 2 32 vis d64 data2 2 32 vis d64 datal 2 32 vis d64 data2 2 32 vis d64 datal 2 32 vis d64 data2 2 32 vis d64 datal 4 16 vis d64 data2 4 16 vis d64 datal 2 32 vis d64 data2 2 32 vis d64 datal 4 16 vis d64 data2 4 16 vis fcmpgt16 vis fcmplel6 vis fcmpegl6 vis fcmpnel6 vis fcmpgt32 vis fcmpeq32 vis fcmple32 vis fcmpne32 vis fcmplt16 vis fcmplt32 vis fcmpgel6
67. ccurs for a set misprediction. The next-field mechanism allows UltraSPARC to speculate 5 branches deep, representing up to 18 instructions.

Instructions prefetched by the PDU are expanded to 76 bits in order to facilitate decoding done by the grouping logic. These decoded instructions are forwarded to a 12-deep instruction buffer, which allows the prefetcher to get ahead of the execution units. As long as the instruction queue is kept almost full, cache miss, set miss and micro-TLB (µTLB) miss penalties can be hidden from the execution units.

A single-entry µTLB provides the prefetcher with a local copy of the last virtual-to-physical address translation. In the rare case of a µTLB miss, a 1-cycle fetch penalty is incurred in order to get the address from the 64-entry fully associative instruction TLB (iTLB).

The grouping logic always looks at the next four candidates in the instruction buffer and, based on resource availability and dependencies, issues up to four instructions. Maintaining more than one Program Counter (PC) per group allows UltraSPARC to dispatch, in the same group, instructions from two adjacent basic blocks.

2.3.1 Integer Execution Unit (IEU)

The Integer Execution Unit (IEU) performs integer computation for all integer arithmetic/logical operations. The IEU, as depicted in Figure 2-3, includes dual 64-bit adders implemented in dynamic circuitry, an inverter an
68. cations 121 B Block Load and Store Instructions 88 Blocked byte formatting 84 Byte aligned addresses 71 C Compiling VIS Code 23 Cycle Counting with INCAS 29 D Data alignment 71 Data Cache 12 Data Memory Management Unit 14 Data types 42 Debugging with INCAS 30 Development Process 22 Dual Pipeline 17 External Cache 14 F Fixed Data Formats 43 Floating Point and Graphics Unit 11 G Generating a mask 74 Graphics applications 119 Graphics Status Register 44 64 I Imaging applications 106 INCAS 25 Integer Execution Unit 9 L Load Buffer 12 Load Store Unit 12 Logical Instructions 49 Logical operations 50 M Major functional units 6 Multiply instructions 57 to 58 60 62 Sun Microsystems Inc 135 VIS Instruction Set User s Manual P Partitioned data formats 43 Pixel Compare Instructions 52 Pixel formatting instructions 64 Prefetch Dispatch Unit 8 Processor Pipeline 16 R Read and write to registers 44 47 S Short Loads and Stores 82 Store Buffer 14 System Interface 15 T T EdgeMask 75 U Utility inlines 44 V Video Applications 123 vis 46 49 62 to 63 73 to 74 83 87 VIS Simulator 24 vis alignaddr 71 vis edge32 74 vis faligndata 71 to 72 vis fcmpteq 52 vis fcmptge 53 vis fcmptgt 52 vis fcmptle 52 vis fcmptlt 53 vis fcmptne 52 vis fexpand 69 vis fmul8sux16 60 64 vis fmul8ulx16 60 64 vis fmul8x16 57 to 58 69 vis fmul8x16a
69. cked0 ASI FL8 PRIMARY f12 load point 9 addx fixedl stepl5 fixedO0 fixed0 address of point 23 faligndata f28 Saccuml Saccuml array8 Sfixed0 size blocked0 blocked0 address of point 23 ldda base blockedl ASI FL8 PRIMARY f14 load point 8 subx fixedO0 step fixedl fixedl address of point 22 faligndata f30 Saccuml Saccuml std Soutput Saccuml store pixels 0 7 addcc loop counter 1 loop counter add output 8 output array8 fixedl size blockedl blockedl address of point 22 ldda base blocked0 ASI FL8 PRIMARY f16 load point 423 subx fixedl step fixed0 fixed0 address of point 21 faligndata f0 Saccum0 accumO0 array8 Sfixed0 size blocked0 blocked0 address of point 21 ldda base blockedl ASI FL8 PRIMARY f18 load point 422 Sun Microsystems Inc 103 VIS Instruction Set User s Manual subx fixedO0 step fixedl faligndata f2 Saccum0 saccum0 array8 fixedl size blockedl ldda subx fixedl step fixed0 faligndata f4 accum0 Saccum0 array8 Sfixed0 size Sblocked0 ldda subx fixedO0 step fixedl faligndata f6 accum0 Saccum0 array8 fixedl size blockedl ldda subx fixedl step fixed0 faligndata f8 accum0 Saccum0 array8 Sfixed0 size blocked0 ldda subx fixedO0 step fixedl faligndata f10 accum0 Saccum0 array8 fixedl size blockedl ldda addx fix
70. d as an argument to a partial store instruction 8 vis pst 16 or vis pst 32 vis pst 16 datal 4 16 amp data Stor overwri 2 4 16 S the greater 16 bit ting data2 4 16 mask lements of datal 4 16 or data2 4 16 4 6 Arithmetic Instructions The VIS arithmetic instructions perform partitioned addition subtraction or mul tiplication 4 6 1 vis fpadd 16 16s 32 32s vis fpsub 16 16s 32 32s 0 Function Perform addition and subtraction on two 16 bit four 16 bit or two 32 bit partitioned data Syntax vis d64 vis d64 vis d64 vis d64 vis f32 Sun Microelectronics 54 vis fpaddl6 vis d64 vis fpsubl16 vis d64 vis fpadd32 vis d64 vis fpsub32 vis d64 vis fpaddl6s vis f32 datal 2 16 datal 4 16 datal 4 16 datal 2 32 datal 2 32 vis d64 vis d64 vis d64 vis d64 vis f32 data2 2 1 data2 4 16 data2 4 16 data2 2 32 data2 2 32 6 4 Using VIS vis f32 vis fpsubl s vis f32 datal 2 16 vis f32 data2 2 16 vis f32 vis fpadd32s vis f32 datal 1 32 vis f32 data2 1 32 vis f32 vis fpsub32s vis f32 datal 1 32 vis f32 data2 1 32 Description vis fpadd160 and vis fpsub16 perform partitioned addition and subtraction between two 64 bit partitioned variables interpreted as four 16 bit signed components datal 4 16 and data2 4 16 and return a 64 bit partitioned variable interpreted as four 16 bit signed components sum 4 16 or difference
71. d very little extra logic (muxes for immediate bypasses) that form the basic cycle time of the machine, together with the data cache access.

Figure 2-3 Integer Execution Unit

A separate 64-bit adder is provided for virtual address additions for memory instructions. A simple 64-bit integer multiplier and divider complement the IEU. The multiplication unit implements a 2-bit Booth encoding algorithm with an early-out mechanism, with a typical latency of 8 clock cycles. A 1-bit non-restoring subtraction algorithm is used in the divide unit, which yields a latency of 67 clock cycles for a 64-bit by 64-bit division.

2.3.2 Floating-Point/Graphics Unit (FGU)

The Floating-Point and Graphics Unit (FGU), as illustrated in Figure 2-4, integrates five functional units and a 32-register by 64-bit Register File. The floating-point adder, multiplier and divider perform all FP operations, while the graphics adder and multiplier perform the graphics operations of the VIS Instruction Set.

Figure 2-4 Floating Poin
72. dk vis blend88 0x44 0x8390 fmul8x16 f3 f12 f12 ieul watchdisp 988 P FP1 vdk vis blend88 0x5c 0x83a8 fpack16 f10 f1 ieul watchdisp 988 Q vdk vis blend88 0x60 0x83ac std f0 i2 f mem addr not valid yet df0 even 0x00000000 OLDER ieul watchpipe GE CNOPW ieul watchpip ieul watchpipe P M LK ieul watchpip 00 uil cyclecount 0x000003dd E ON CON DONE kk ck ck ck ck ck ck ck ck kk ck kk ko KK ko ko kk ieul watchdone 989 K FP1 vdk vis blend88 0x48 0x8394 fmul8x16 f4 f14 f14 ieul watchdisp 989 R IEU1 vdk vis blend88 0x64 0x83b0 ret predicted branch addr main 0x18 0x83d0 ieul watchpipe GE C NOP W ieul watchpipe ieul watchpipe RPNM L ieul watchpip 00 leul breakpoint 2 stage G at vdk_vis_blend88 0x64 0x83b0 encountered uil cyclecount 0x000003de ON CON DONE ck ck ck ck Ck ck Ck ck ck Ck ck kk Sk kk ko Sk ko ko KA 3 6 9 Example Program Used in Illustrating INCAS Operation The following sections present the source code the assembly listing and the INCAS command batch file for vis example3 3 6 9 1 Source Code for vis example3 FUNCTION SYNOPSIS void vdk vis blend88 ARGUMENT spl X A X A A F F X X Sun Microelectronics 34 vis d64 spl vis d6 vdk vis blend88 blend two 8 pixel arrays vis d64 sp2 4 dp vis d64 ap pointer to 8 bytes of source data 1 tox xo X X X 3 Develop
73. dr0 addrl addr2 addr3 vis u8 addr4 addr5 addr6 addr7 vis d64 val0 vall val2 val3 val4 val5 val6 val7 accum vis d64 output vis alignaddr void 0 7 accum vis fzero for 1 Generate addr0 addr7 somehow val0 vis ld u8 addr0 vall vis ld u8 addr1 val2 vis ld u8 addr2 val3 vis ld u8 addr3 val4 vis ld u8 addr4 val5 vis ld u8 addr5 val6 vis ld u8 addr9 val7 vis ld u8 addr7 accum vis faligndata val7 accum accum vis faligndata val6 accum accum vis faligndata val5 accum accum vis faligndata val4 accum accum vis faligndata val3 accum accum vis faligndata val2 accum accum vis faligndata vall accum accum vis faligndata val0 accum output accum Sun Microsystems Inc 83 VIS Instruction Set User s Manual 47 10 Array Instructions The array instructions facilitate 3 d texture mapping and volume rendering by computing a memory address for data lookup based on fixed point x y and z co ordinates The data are laid out in a blocked fashion so that points which are near one another have their data stored in nearby memory locations If the texture data were laid out in the obvious fashion the z 0 plane following by the z 1 plane etc then even small changes in z would result in references to distant pages in memory The resulting lack of locality would tend to result in TLB misses and poor
74. Figure 2-5 Load Store Unit

Each load is enqueued with an indication of whether it hits or misses the D-Cache, and this information is tracked for the lifetime of the operation, even in the presence of snoops. An age-based associative comparison is performed in order to adjust the raw D-Cache hit/miss indicator of the incoming load to account for allocations or victimizations that may be performed by pending loads to that D-Cache line. Thus, the D-Cache tags are only checked once.

2.3.3.3 Store Buffer

The 8-entry Store Buffer (each entry accounts for a 64-bit datum and its corresponding address) provides a temporary holding place for store operations until they can be committed and the D-Cache and/or the E-Cache is available. The E-Cache update is a two-step process. First, the E-Cache tags are checked for hit/miss. Then the E-Cache write occurs at some later time. The E-Cache tag and data RAM accesses are decoupled so that a tag check can occur in parallel with the E-Cache data write of an older store, thus maintaining a throughput of one store per clock. Additionally, consecutive stores to the same E-Cache line (64B) typically require only a single tag check, thus minimizing tag check transactions. Store compression combines the last two entries in the store buffer w
75. e. See Figure 2-5 for a functional diagram of the Load Store Unit.

2.3.3.1 Data Cache

The Data Cache (D-Cache) is a 16 kB direct-mapped cache. It has a 32B (256 bits) line size with 16B (128 bits) sub-blocks. It is virtually indexed and physically tagged. The D-Cache is non-blocking and operates using a write-through, no-write-allocate policy. Strict inclusion with respect to the E-Cache is maintained, facilitating cache coherency. The D-Cache data SRAM is single-ported and can support a 64-bit load or a 64-bit store every cycle. In the event of a D-Cache miss, an entire sub-block (16B) can be written in one clock. The D-Cache tag SRAM has two ports: a read port and a write port. These two ports allow a load or store to perform a tag look-up in parallel with the allocation for an older D-Cache miss.

2.3.3.2 Load Buffer

The load buffer can eliminate stalls caused by D-Cache misses, load-after-store hazards and other conflicts. Nine entries were implemented to cover the additional 6-cycle latency of a D-Cache miss/E-Cache hit. A rate of one load (E-Cache hit) per cycle can be sustained. Early compiler results indicate that more than 50% (statically) of the loops in SPECfp92 are amenable to be software pipelined based on the E-Cache latency. These loops represent an even larger component of the dynamic execution time. The load buffer is organized as a circular queue.
76. e filter coefficients are taken from coeffs 01 and coeffs 23 They are stored as signed fixed point numbers with 14 fractional digits i e they are roughly between 1 9999 and 1 9999 By choosing the filters according to the sub pixel positions within the source data this routine may be used to implement one pass of a two pass bicubic filtering algorithm finclude vis types h finclude vis proto h Sun Microelectronics 106 5 Advanced Topics void resample vis d64 ibuf Input buffer vis d64 obuf Output buffer int iTable Source column numbers vis f32 coeffs 01 First two filter coefficients vis f32 coeffs 23 Second two filter coefficients int dwidth Number of outputs to produce int p vis f32 f01 f23 vis d64 pix0 pixl pix2 pix3 acc hi acc lo vis write gsr 1 3 for p 0 p lt dwidth p Cache filter coefficients 01 coeffs 01 p 23 coeffs 23 p Read pixel data pix0 ibuf iTableH p pixl ibuf iTableH p 1 pix2 ibuf iTableH p 2 pix3 ibuf iTableH p 3 Compute high and low words of f0 pix0 fl pixl acc hi vis fpaddl6 vis fmul8xl6au vis read hi pix0 f01 vis fmul8x16al vis read hi pix1 f01 acc lo vis fpaddl6 vis fmul8xl6au vis read lo pix0 f01 vis fmul8xl6al vis read lo pix1 f01 Add high and low words of f2 pix2 to accumulator acc hi vi
77. east significant 16 bits of the 32-bit scale are used as a multiplier. Figure 4-12 illustrates the vis_fmul8x16al() operation. Since vis_fmul8x16au() uses the upper 16 bits of scale and vis_fmul8x16al() uses the lower 16 bits of scale, two distinct scale values can be stored in scale.

Figure 4-11 vis_fmul8x16au() operation

Figure 4-12 vis_fmul8x16al() operation

Example

    vis_f32 pixels, scale;
    vis_d64 resultu, resultl;

    /* Most significant 16 bits of scale multiply */
    resultu = vis_fmul8x16au(pixels, scale);

    /* Least significant 16 bits of scale multiply */
    resultl = vis_fmul8x16al(pixels, scale);

4.6.4 vis_fmul8sux16(), vis_fmul8ulx16()

Function
Multiply the corresponding elements of two 16-bit partitioned vis_d64 variables to produce a 16-bit partitioned vis_d64 result.

Syntax

    vis_d64 vis_fmul8sux16(vis_d64 data1_16, vis_d64 data2_16);
    vis_d64 vis_fmul8ulx16(vis_d64 data1_16, vis_d64 data2_16);

Description
Both vis_fmul8sux16() and vis_fmul8ulx16() perform "half" a multiplication. vis_fmul8sux16() multiplies the signed upper 8 bits of each 16-bit signed component of data1_4_16 by the corresponding 16-bit fixed-point signed component in data2_4_16. The upper 16 bits of the 24-bit product are returned in
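The way one 32-bit scale word can carry two multipliers is easier to see in a scalar model. The sketch below is an assumption-laden illustration, not the VIS implementation: sim_fmul8x16a() and round_product() are hypothetical names, the four pixels are taken as unsigned bytes packed high to low in a 32-bit word, and each 24-bit product is assumed to be rounded to nearest before its upper 16 bits are kept.

```c
#include <assert.h>
#include <stdint.h>

/* Keeps the upper 16 bits of a 24-bit signed product after adding half an
 * LSB, i.e. floor((product + 128) / 256), written without right-shifting
 * a negative value. */
static int16_t round_product(int32_t product)
{
    int32_t r = product + 0x80;

    return (int16_t)(r >= 0 ? (r >> 8) : -((-r + 0xff) >> 8));
}

/* use_upper = 1 models the "au" form (upper half of scale),
 * use_upper = 0 models the "al" form (lower half of scale). */
static uint64_t sim_fmul8x16a(uint32_t pixels, uint32_t scale, int use_upper)
{
    uint16_t raw;
    int32_t factor;
    uint64_t result = 0;
    int i;

    assert(use_upper == 0 || use_upper == 1);
    raw = (uint16_t)(use_upper ? (scale >> 16) : (scale & 0xffffu));
    factor = (raw >= 0x8000u) ? (int32_t)raw - 0x10000 : (int32_t)raw;

    for (i = 0; i < 4; i++) {
        int32_t pixel = (int32_t)((pixels >> (24 - 8 * i)) & 0xffu);
        uint16_t lane = (uint16_t)round_product(pixel * factor);

        result |= (uint64_t)lane << (48 - 16 * i);
    }
    return result;
}
```

Calling the model twice on the same scale word, once with each half selected, corresponds to the resultu and resultl computations in the example above.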
78. edl stepl5 fixed0 faligndata f12 accum0 Saccum0 array8 Sfixed0 size blocked0 ldda subx fixedO0 step fixedl faligndata f14 accum0 Saccum0 std Soutput accumO brne loop add output Store pi 8 output exit fa ta f16 Saccuml Saccuml f18 2120 2122 2124 126 128 ta f30 Saccuml tput Saccuml lignd falignd falignd falignd falign falign falignd falignd std Sou ta Saccuml Saccuml ta Saccuml Saccuml ta Saccuml Saccuml Saccuml Saccuml o ta Saccuml Saccuml o ta ta Saccuml Saccuml a a a a a a a a Saccuml store pi Sun Microelectronics 104 Sbase blocked0 ASI FL8 Sbase blockedl1 ASI FL8 Sbase blocked0 ASI FL8 Sbase blockedl1 ASI FL8 Sbase blocked0 ASI FL8 Sbase blockedl1 ASI FL8 fixedl address of point 20 blockedl address of point 20 f20 load point 21 address of point 19 PRIMARY fixedod blocked0 address of point 19 f22 load point 20 address of point 18 PRIMARY fixedl i PRIMARY fixed0 blockedl address of point 18 Sf 24 load point 19 address of point 17 blocked0 address of point 17 f26 load point 18 address of point 16 PRIMARY fixedl blockedl address of point 16 128 load point 17 address of point 431 PRIMARY fixed0 blocked0 address of point 15 S 30 load point 16 address
79. em Controller

Figure 3-1: INCAS Accuracy Model (regions marked "INCAS accurate" and "INCAS less accurate")

Therefore, when working with large data sets, where 2nd-level cache misses (and hence interaction with main memory) may be more frequent, the INCAS cycle count may be greater or less than that achieved on a real UltraSPARC system. In general, the results from INCAS should be treated as ballpark figures and not as hard numbers attainable on a real system.

In reality, the number of cycles a section of C code takes to run does not depend only on the code itself. Adjacent code immediately before and after the executed segment also affects the cycle count, because the compiler optimizes the code based on the whole program when generating the binary. Also, optimizing compilers may not produce the same binary instructions for a code segment compiled alone as for the same segment compiled as part of a larger program.

3.6.3 Preparing To Use INCAS

Since INCAS is a simulator for a processor, it does not include operating system services. Therefore, it is recommended that you make the following modifications to your code and rebuild your binary before running it on INCAS:

1. Modify your code to eliminate all system calls such as malloc, free, scanf, printf, etc., by changing them to incas_malloc, incas_free, incas_scanf, incas_printf, e
80. eptable or not depends on the application The following illustrates the processing of one scan line define VIS OFFSET addr addr 7 define VIS ALIGN addr addr amp 7 void alpha blend vis u8 d vis u8 sl vis u8 s2 vis u8 a int width Arguments d pointer to destination data sl pointer to data for image sl s2 pointer to data for image s2 a pointer to data for control image alpha width data width of sl s2 and alpha Sun Microelectronics 116 5 Advanced Topics Last byte of destination vis u8 d end E Doubleword aligned pointers vis_d64 d aligned sl aligned s2 aligned alpha aligned Alignment of original pointers int d offset sl offset s2 offset alpha offset Unaligned data from memory vis d64 u alpha 0 u alpha 1 u sl 0 u sl 1 u s2 0 u s2 1 Properly aligned data vis d64 quad a dbl s1 dbl s2 dbl a dbl d Temporaries vis d64 dbl sl e dbl s2 e dbl tmpl dbl tmp2 vis d64 dbl suml dbl sum2 Edge mask for partial stores unsigned int emask Loop variables int i times vis write gsr 3 lt lt 3 Four 7 3 bits of fractional precision d end d width 1 d offset VIS OFFSET d d aligned vis d64 VIS ALIGN d Compute initial edge mask for destination mask vis edge8 d d end
81. es

    vis_d64 *addr, *addr_last, *addr_aligned;
    vis_d64 zero;
    int emask;

    zero = vis_fzero();
    addr_aligned = vis_alignaddr(addr, 0);
    emask = vis_edge8(addr, addr_last);
    while ((vis_u32)addr_aligned <= (vis_u32)addr_last) {
        vis_pst_8(zero, addr_aligned, emask);
        addr_aligned++;
        emask = vis_edge8(addr_aligned, addr_last);
    }

Code Example 4-5: Same Function as the Loop in Code Example 4-4, Except Using an Explicit Loop Counter

    vis_d64 *addr, *addr_last, *addr_aligned;
    vis_d64 zero;
    int i, emask, times;

    zero = vis_fzero();
    addr_aligned = vis_alignaddr(addr, 0);
    emask = vis_edge8(addr, addr_last);
    times = ((vis_u32)addr_last >> 3) - ((vis_u32)addr >> 3) + 1;
    for (i = 0; i < times; i++) {
        vis_pst_8(zero, addr_aligned, emask);
        addr_aligned++;
        emask = vis_edge8(addr_aligned, addr_last);
    }

Note: If there are memory-mapped devices in your system and you are using the partial store instruction vis_pst_{8,16,32}() (described in section 4.7.8 of the VIS User's Guide) to store data in memory locations into which the device is mapped, then this operation will only work if the device is cached. The partial store is a read-modify-write operation and will not work for non-cached memory-mapped devices (e.g., it will not work across the S-Bus).

4.7.9 Short Loads and Stores

Function: Perform 8- and 16-bit loads and stores to and from floating-point regi
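The zero-fill loop of Code Example 4-4 can be traced in plain C to see which bytes each partial store touches. This is a sketch under stated assumptions, not VIS code: model_edge8, model_pst8, and zero_range are hypothetical names, byte offsets into an 8-byte-aligned buffer stand in for addresses, and the big-endian mask layout (most significant mask bit controls the lowest byte) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of vis_edge8() for big-endian byte ordering. */
static unsigned model_edge8(size_t a1, size_t a2)
{
    unsigned left  = 0xffu >> (a1 & 7u);
    unsigned right = (0xffu << (7u - (unsigned)(a2 & 7u))) & 0xffu;
    return ((a1 & ~(size_t)7) != (a2 & ~(size_t)7)) ? left : (left & right);
}

/* Hypothetical model of vis_pst_8(): write only the bytes whose mask bit
 * is set; mask bit 7 corresponds to byte 0 of the aligned doubleword. */
static void model_pst8(const uint8_t val[8], uint8_t *aligned, unsigned mask)
{
    assert(mask <= 0xffu);
    for (int b = 0; b < 8; b++)
        if (mask & (0x80u >> b))
            aligned[b] = val[b];
}

/* Zero buf[first..last] inclusive, one aligned doubleword at a time. */
static void zero_range(uint8_t *buf, size_t first, size_t last)
{
    static const uint8_t zero[8] = {0};
    size_t aligned = first & ~(size_t)7;
    unsigned emask = model_edge8(first, last);

    while (aligned <= last) {
        model_pst8(zero, buf + aligned, emask);
        aligned += 8;
        emask = model_edge8(aligned, last);
    }
}
```

For first = 3 and last = 20 the three stores use the masks 0x1f, 0xff, and 0xf8, so only bytes 3 through 20 are written.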
82. esult 64 pO datal result datal 64 vis _64 da o Il 4 result resul V 2 s fors datal tal 32 32 da ll H Cr N Q w w Il result 6 is resul fand datal_ tal_64 amp da rs o l y 4 H Q IST result is resul fands datal tal 32 amp da 32 CEN w l N 4 Q w result_6 resul v 4 fxor data1 tal 64 S 64 aa rT oes H Q w o Il result resul V 2 fxors datal datal 32 S 327 aa Ta INS l H Co l result 6 resul V 4 is fnor datal 64 datal 64 da YT ut H o Il result resul V 2 s fnors datal datal 32 32 da CEN H Co l result 6 resul V 4 s fnand datal datal 64 64 amp da Chips a Il result resul V 2 s fnands datal_ datal 325 32 amp da CEN l Co l result 6 result V 4 s fxnor datal datal 64 _64 da ctam o Il result result V 2 s fxnors datal datal 32 32 da H CEN Co l result 6 resul V 4 s fornot datal 64 datal 64 ctam H o Il result resul s fornots datal 32 datal 32 2 vi 232 resul 7 E 64 vis fandnot datal 64 datal 64 amp data2 64 ct result 3 resul s fandnots datal 32 d
83. fpaddl6 dbl sl he dbl tmpl dbl sl e vis fexpand vis read lo dbl s1 dbl s2 e vis fexpand vis read lo dbl s2 dbl tmp2 vis fpsubl6 dbl s2 e dbl sl e dbl tmpl vis fmul8xl6 vis read lo quad a dbl tmp2 dbl sum2 vis fpaddl16 dbl sl e dbl tmpl dbl d vis freg pair vis fpackl6 dbl suml vis fpackl6 dbl sum2 vis pst 8 dbl d void d aligned emask ttd aligned mask vis edge8 d aligned d end Sun Microelectronics 118 5 Advanced Topics 5 3 Graphics Applications 5 8 1 Texture Mapping This section of code computes the depth Z and color a B G R of each pixel in a triangle object Z is a 32 bit z buffer value and a B G R are 8 bit alpha blue green and red values The 32 bit Z value is concatenated with the 32 bit a B G R value and the resulting 64 bit value is sent to the frame buffer Computing a B G R consists of a lookup from a texture map and then applying diffuse and specular lighting which is a multiply and add operation Using VIS we can stuff a B G R into a 32 bit floating point register and use VIS partitioned arithmetic operators vis fmul8x16 and vis fpadd160 to operate on a B G and R at the same time In the code example shown we are not interested in the o value and hence it is masked out The following is a small section of code that is part of a bigger function and is not complete function by itself float fcolor unsigned mask Oxffffff flo
84. gram using VIS XilMemoryStorage storage XilUnsigned8 im2 data xil export im2 xil get memory storage im2 amp storage my multiply func storage byte data width height storage byte pixel stride storage byte scanline stride The function my multiply func performs the operation and can be written in C or using VIS The VIS example of my multiply func might look like my multiply func data width height pixel stride Scanline stride In this example we will assume the source data is 8 byte aligned and that the width and scanline stride are multiples of 8 It is also assumed that pixel stride 1 int ix iy vis d64 dataptr lineptr NVIS cR32 st rf vis d64 sd se rd dd vis write gsr 3 3 dataptr lineptr vis d64 data for iy 0 iy lt height iy for ix 0 ix lt width gt gt 3 ix sd dataptr Sf vis read hi sd se vis fexpand sf rd vis fmult8x16 sf se rf vis fpackl6 rd dd vis write hi dd rf Sf vis read lo sd se vis fexpand sf rd vis fmult8x16 sf se rf vis_fpack16 rd dd vis_write_lo dd rf dataptr dd dataptr lineptr scanline stride gt gt 3 Sun Microsystems Inc 133 VIS Instruction Set User s Manual Sun Microelectronics 134 Index A Addition and subtraction 54 Arithmetic Instructions 54 Array Instructions 84 array8 array16 and array32 84 Audio Appli
85. hen they both write to the same 16B block. Any number of stores can be combined into one transaction. Hence the number of data write transactions is minimized, an important concern since all stores must update the E-Cache, given that the D-Cache is a write-through design.

2.3.3.4 Data Memory Management Unit (DMMU)

The data memory management unit (DMMU) incorporates a fully associative 64-entry Translation Lookaside Buffer (TLB) that provides one virtual-to-physical address translation per cycle. Any combination of the supported page sizes (8kB, 64kB, 512kB, and 4MB) is allowed. A TLB miss is handled by software for simplicity and flexibility, with a simple hardware assist provided for speed. Two read-only registers contain pointers to translation table entries from the Translation Storage Buffer (TSB), defined as a simple direct-mapped software cache. A separate set of 8 global registers is also accessible as temporary storage.

2.3.4 External Cache

The External Cache (E-Cache) is used to service misses from the I-Cache in the UltraSPARC front end and the D-Cache in the LSU. It is a physically addressed and physically tagged SRAM implementation. The line size is 64 bytes. E-Cache sizes from 512kB to 4MB are supported, with E-Cache data protected by byte parity. An internal delayed-write buffer minimizes the write-after-read (WAR) penalty. Writes to the SRAM core are delayed until the next write arrives, and the buffer is fully bypassed inside the SRAM
86. hh = vis_fmuld8sux16(sh, ff);
            tlh = vis_fmuld8sux16(sl, ff);
            thl = vis_fmuld8ulx16(sh, ff);
            tll = vis_fmuld8ulx16(sl, ff);
            tdh = vis_fpadd32(thh, thl);
            tdl = vis_fpadd32(tlh, tll);
            rdh = vis_fpadd32(rdh, tdh);
            rdl = vis_fpadd32(rdl, tdl);
            ss -= 2;
        }
        dh = vis_fpackfix(rdh);
        dl = vis_fpackfix(rdl);
        dd = vis_freg_pair(dh, dl);
        /* store 8 bytes of result */
        vis_pst_16(dd, dp, emask);
        sa += 8;
        dp++;
        /* prepare edge mask for the end point */
        emask = vis_edge16(dp, dend);
    }

5.5 Video Applications

5.5.1 Motion Vector Estimation

This example presents a single iteration of a motion vector estimation process. A 16x16 block of pixels of frame2 is taken, and a search within a specified area in frame1 is performed to determine if something similar to the 16x16 block from frame2 exists. If it does, then a motion vector is estimated from this location. Similarity is estimated by the absolute sum of differences, diff, between the two 16x16 blocks. The absolute sum of differences is computed in accordance with the following relationship:

$$ \mathrm{diff} = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| \mathrm{frame1}(i,j) - \mathrm{frame2}(i,j) \right| $$

The speedup capability of VIS is illustrated by the loading and processing of 8 bytes at a time: vis_pdist() computes the absolute sum of differences between 8 pixels at a time. Data of less than 8 bytes are processed by plain unparti
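The absolute sum of differences over a 16x16 block can be written out in portable C to make the relationship above concrete. This is a minimal sketch and not the manual's VIS routine: pdist8 is a hypothetical scalar stand-in for what vis_pdist() is described as doing on 8 pixels at a time, and block_diff16 and the stride parameter are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for vis_pdist(): add the absolute
 * differences of 8 byte pairs to the running sum acc. */
static uint32_t pdist8(const uint8_t *a, const uint8_t *b, uint32_t acc)
{
    for (int k = 0; k < 8; k++)
        acc += (a[k] > b[k]) ? (uint32_t)(a[k] - b[k])
                             : (uint32_t)(b[k] - a[k]);
    return acc;
}

/* Absolute sum of differences between two 16x16 blocks.
 * stride is the number of bytes between the starts of consecutive rows. */
static uint32_t block_diff16(const uint8_t *frame1, const uint8_t *frame2,
                             size_t stride)
{
    uint32_t diff = 0;

    assert(stride >= 16);
    for (size_t i = 0; i < 16; i++)
        for (size_t j = 0; j < 16; j += 8)   /* two 8-pixel groups per row */
            diff = pdist8(frame1 + i * stride + j,
                          frame2 + i * stride + j, diff);
    return diff;
}
```

A search would call block_diff16() once per candidate position in frame1 and keep the position with the smallest result.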
87. A.2 Minimization of Conditional Usage

In order to take full advantage of the superscalar pipeline architecture, one should always use the most predictable instruction patterns and avoid the use of conditionals inside tight loops. If tempted to make use of branches to minimize memory references or computations, keep in mind that in many cases this might actually impede the generation of efficient code. This occurs because branching inhibits the efficient grouping of instructions, resulting in inefficient use of the pipelined architecture of the UltraSPARC-I.

A.3 Dealing With Misaligned Data

VIS typically deals in groups of 4 or 8 data values at a time, but the length of your data may not be an exact multiple of 4 or 8. When dealing with 2D image scan lines, you can handle this by using the vis_aligndata() and vis_edge{8,16,32}() instructions. There may be cases, however, where you need some complex logic in combination with VIS instructions to deal with this. In such cases it is typically best to write small clean-up loops for clarity rather than for speed: on average we expect to spend a vanishing percentage of the run time there, so it is not worth spending a significant portion of code development and debugging time on them. In addition, clever loop optimizations often slow down loops that are executed only a few times.

A.4 Cycle-Expensive Operations

Reading and writing the
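The clean-up loop advice in A.3 can be illustrated with an 8-bit inversion (dst[i] = 255 - src[i]) written in portable C. This is a sketch under assumptions, not VIS code: invert8 is a hypothetical name, and a 64-bit integer complement stands in for the 8-pixels-at-a-time VIS operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invert num bytes: groups of 8 in the main loop, leftovers in a small,
 * deliberately simple clean-up loop. */
static void invert8(const uint8_t *src, uint8_t *dst, size_t num)
{
    size_t i = 0;

    assert(src != 0 && dst != 0);

    /* Main loop: 8 values per iteration. */
    for (; i + 8 <= num; i += 8) {
        uint64_t w;
        memcpy(&w, src + i, 8);   /* memcpy tolerates any alignment */
        w = ~w;                   /* 255 - x equals ~x for each byte */
        memcpy(dst + i, &w, 8);
    }

    /* Clean-up loop: at most 7 values, written for clarity, not speed. */
    for (; i < num; i++)
        dst[i] = (uint8_t)(255 - src[i]);
}
```

The clean-up loop runs at most seven times per call, which is why its speed hardly matters.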
88. igndata byte accum is to push byte into the left end of accum The eight output bytes need to be pushed into the accumulator in reverse order ARGUMENTS KS pointer to first byte of first pixel of source data dst pointer to first byte of first pixel of destination table loook up table width number of bytes of pixel data jJ finclude vis types h finclude vis proto h void lookup vis u8 src vis u8 dst vis u8 table 256 int width vis u32 word0 wordl word2 word3 vis d64 lookup accum int byte0 bytel byte2 byte3 byte4 byte5 byte6 byte7 int align doubles next i Set gsr align bits to 7 void vis alignaddr void O 7 Work naively until dst is aligned align 8 dst amp 7 if align width Sun Microelectronics 112 align width if align 8 for i 0 dst i src align dst align width align i align table src i i 5 Advanced Topics Now work based on source offset align unsigned long src amp 0x3 Zero two lsb s of src src vis_u8 unsigned long src amp 0x3 wordO vis u32 src 0 wordl vis u32 src 1 word2 vis u32 src 2 word3 vis u32 src 3 next 4 Last iteration done separately to not to read past the end doubles width 8 1 switch align case 0 for i 0 i lt d
89. 2.5.2 Stage 2: Decode (D) Stage
2.5.3 Stage 3: Grouping (G) Stage
2.5.4 Stage 4: Execution (E) Stage
2.5.5 Stage 5: Cache Access (C) Stage
2.5.6 Stage 6: N1 Stage
2.5.7 Stage 7: N2 Stage
2.5.8 Stage 8: N3 Stage
2.5.9 Stage 9: Write (W) Stage
2.6 Performance Improvement
3 Development Flow
3.1 Overview
3.2 Development Process Overview
3.3 VIS Software Developer's Kit
3.4 SPARCompiler
3.4.1 Compiling VIS Code
3.4.2 Inline Assembly Implementation of vis_fpadd16()
3.5 VIS Simulator
3.5.1 Example of Simulator Implementation of vis_fpadd16()
3.6 Use of INCAS
3.6.1 What Is INCAS?
3.6.2 Limitations of INCAS Simulation
90. ikely not taken. The advantage of the in-cache prediction scheme is that it avoids the alias problems encountered in branch history buffers and other similar structures. Every single branch in the I-Cache has its dedicated prediction bits (ignoring the rare case of branch couples), which translates into a successful prediction rate of 88% for integer code, 94% for floating-point SPEC92, and 90% for typical database applications.

Figure 2-2: UltraSPARC-I Front End

Every group of four instructions in the cache has a "next field", which is simply a pointer to where the prefetcher should access instructions for the very next cycle. In the case of sequential code, or for code with a branch predicted not taken, the next field points to the next 4 instructions in the cache. If a branch is predicted taken, the next field contains the I-Cache index (including the set) of the branch target. The advantage of this scheme is that the next field can always be fed back to the I-Cache without qualifying a possible branch. In order to provide a one-cycle loop back to the I-Cache, a fast dual-ported structure was used to implement the next field and the branch prediction bits. Only one set of the cache is accessed during a fetch, saving power and reducing the cache cycle time. Both tags are read so that an incorrect set prediction can be corrected. A two-cycle penalty o
91. ill start writing at address 0x10000, and the mask 00011111 will disable the writes to 0x10000, 0x10001, and 0x10002 and enable writes to 0x10003, 0x10004, 0x10005, 0x10006, and 0x10007.

vis_edge{8,16,32}() accept two addresses, address1 and address2, where address1 is the address of the next pixel to write and address2 is the address of the last pixel in the scanline. These instructions compute two masks, a left edge mask and a right edge mask. The left edge mask is computed from the 3 least significant bits (LSBs) of address1 and the right edge mask is computed from the 3 LSBs of address2, according to Table 4-1 (or Table 4-2 for little-endian byte ordering).

Table 4-1: Edge Mask Specification (columns: Edge Size, Left Edge, Right Edge)

They then zero out the three least significant bits of address1 and address2 to get 8-byte aligned addresses, i.e., address1 & ~7 and address2 & ~7. If the aligned addresses differ, then the left edge mask is returned; if they are the same, the result of the bitwise ANDing of the left and right edge masks is returned. Note that if the aligned addresses differ and address1 is greater than address2, the edge instructions still return the left edge mask, which in almost all c
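The mask selection rule described above can be sketched in portable C for the 8-bit case. The name model_edge8 is hypothetical, and because the body of Table 4-1 is not reproduced in this excerpt, the left and right mask formulas below are an assumption chosen to agree with the 0x10003 example (mask 00011111) for big-endian byte ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of vis_edge8() for big-endian byte ordering.
 * The most significant mask bit corresponds to the lowest byte address
 * within the 8-byte aligned doubleword. */
static unsigned model_edge8(uintptr_t address1, uintptr_t address2)
{
    /* Left edge: enable bytes from (address1 & 7) through 7. */
    unsigned left = 0xffu >> (unsigned)(address1 & 7u);
    /* Right edge: enable bytes from 0 through (address2 & 7). */
    unsigned right = (0xffu << (7u - (unsigned)(address2 & 7u))) & 0xffu;

    uintptr_t aligned1 = address1 & ~(uintptr_t)7;
    uintptr_t aligned2 = address2 & ~(uintptr_t)7;

    /* Different doublewords: left mask only. Same doubleword: both edges. */
    return (aligned1 != aligned2) ? left : (left & right);
}

/* Worked example from the text: a write starting at 0x10003. */
static void check_edge8_example(void)
{
    assert(model_edge8(0x10003u, 0x10020u) == 0x1fu);  /* 00011111 */
}
```

When both addresses fall in the same doubleword, for example 0x10003 and 0x10005, the two masks are ANDed and only bytes 3 through 5 remain enabled (00011100).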
92. ion computes half of the product; the two halves, when added together, give a 32-bit product.

Example:

    vis_f32 data16s1, data16s2;
    vis_d64 result, resultu, resultl;

    resultu = vis_fmuld8sux16(data16s1, data16s2);
    resultl = vis_fmuld8ulx16(data16s1, data16s2);
    result = vis_fpadd32(resultu, resultl);

4 Pixel Formatting Instructions

Pixel formatting instructions include packing instructions, which convert 16- or 32-bit data to a lower-precision fixed or pixel format. Input values are clipped to the dynamic range of the output format. Packing applies a scale factor, determined from the scale factor field in the Graphics Status Register (GSR), to allow flexible positioning of the binary point. Pixel formatting instructions also include expand instructions, which convert 8-bit elements to 16-bit elements, and merge instructions, which merge 2 independent pixel data elements into a 64-bit result.

4 1 vis_fpack16()

Function: Truncates four 16-bit signed components to four 8-bit unsigned components.

Syntax:

    vis_f32 vis_fpack16(vis_d64 data_4_16);

Description: vis_fpack16() takes four 16-bit fixed components within data_4_16, scales, truncates, and clips them into four 8-bit unsigned components, and returns a vis_f32 result. This is accomplished by left-shifting each 16-bit component by the amount given in the scale factor field of the GSR and truncating to an 8-bit unsigned integer by rounding and then discarding the least significant
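The scale, round, and clip steps of vis_fpack16() can be followed on a single component in portable C. This is a sketch under assumptions rather than the instruction's definition: model_pack16 and model_expand8 are hypothetical names, the binary point is assumed to sit above the low 7 bits of the shifted value, rounding is assumed to add half of that unit, and the expand step is assumed to shift the 8-bit value left by 4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the expand step: 8-bit pixel to 16-bit fixed
 * point with 4 fractional bits (assumed). */
static int16_t model_expand8(uint8_t pixel)
{
    return (int16_t)(pixel * 16);
}

/* Hypothetical model of one vis_fpack16() component.
 * scale plays the role of the GSR scale factor field. */
static uint8_t model_pack16(int16_t x, unsigned scale)
{
    assert(scale <= 15u);

    /* Left shift by scale, done as a multiply so negatives are safe. */
    int32_t v = (int32_t)x * (int32_t)(1L << scale);
    if (v < 0)
        return 0;                 /* clip below the unsigned range */

    v = (v + 64) >> 7;            /* round, then drop 7 fractional bits */
    return (v > 255) ? 255 : (uint8_t)v;   /* clip above the range */
}
```

With a scale factor of 3, packing undoes the assumed expand step exactly, so model_pack16(model_expand8(p), 3) gives back p for every 8-bit p.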
93. itioned add and subtract on either 16- or 32-bit components, and 7 variants of partitioned multiply instructions capable of 8-bit and 16-bit component multiplication.
3. Logical operations that perform any one of 16 bitwise logical operations.
4. Address handling instructions to deal with misaligned data.
5. Array instructions to provide efficient access to three-dimensional data sets.
6. Memory access instructions permitting partial stores of partitioned data and performing 8- and 16-bit loads and stores to and from 64-bit or 32-bit variables.
7. Pixel distance instruction computing the absolute difference between corresponding 8-bit components in a pair of double-precision registers and accumulating the sum of differences.

1.3 Performance Advantage of VIS

Figure 1-1 illustrates the performance advantage of a partitioned 8-bit by 16-bit multiplication, i.e., four 8 x 16 multiplies performed in a single cycle, resulting in a 4 times speedup.

Figure 1-1: Four multiplications performed in a single cycle

2 UltraSPARC Concepts

2.1 Overview

This chapter presents the major hardware features of the new UltraSPARC microprocessor, implementing the 64-bit SPARC V9 architecture, that give accelerated graphics performance using VIS. Topics included in this chapter are descriptions of Functional U
94. ize is a multiple of 8 (64 for code using block load and store), the vis_edge8() instructions can be removed, providing additional speed-up. An example of a VIS implementation for image inversion, covering both a general data format and 8-byte pre-aligned data whose image size is a multiple of 8, is demonstrated in VSDKHOME/examples/src/vis_inverse8.c.

B Extending an XIL Program Using VIS

B.1 Overview

This appendix tells how you can incorporate VIS code into a higher-level library like XIL.

B.2 Extending XIL

If you are writing an imaging application on SPARC, you are likely using XIL, the imaging foundation library. XIL provides an interface for imaging that allows imaging applications to run across the Sun product line, as well as providing source code compatibility on x86 and PowerPC Solaris platforms. XIL was designed to allow easy extensibility for arbitrary processing and has the ability to load run-time modules that have been accelerated for specific platforms. Special VIS-accelerated run-time modules have been created using techniques like the ones described in the previous sections; these are loaded automatically when XIL programs are executed on UltraSPARC systems.

For functions not supported by XIL, the user must gain access to the data and perform the desired processing. This sec
96. le byte5 accum vis faligndata lookup accum lookup vis 1d u8 i vis ras table byte4 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte3 Sun Microsystems Inc 115 VIS Instruction Set User s Manual accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte2 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table bytel accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte0 accum vis faligndata lookup accum vis_d64 dst i accum break Update pointers remaining width src 8 doubles dst 8 doubles width 8 doubles Finish up any remaining pixels for i 0 i lt width 1 dst i table src il 5 2 4 Alpha Blending Two Images This example illustrates an application where two images are blended together For each pair of corresponding pixels in two images s1 and s2 a correspond ing pixel is read from a third control image alpha to compute dst alpha 256 s1 1 alpha 256 s2 sl s2 alpha 256 sl Note that alpha can only range between 0 and 255 so strictly speaking we should divide it by 255 not 256 However the division by 256 occurs for free when we perform the vis_fmul8x16 operation and the destination will differ from the cor rect result by at most 1 Whether this trade off is acc
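The claim above, that dividing by 256 instead of 255 changes the destination by at most 1, can be checked exhaustively in portable C. This is only a sketch with hypothetical function names: it uses the first form of the blend, in which alpha weights s1, and it assumes both results are rounded to the nearest integer, which may differ from how the VIS code rounds.

```c
#include <assert.h>
#include <stdint.h>

/* Blend with the cheap divisor: s2 + (s1 - s2) * alpha / 256, rounded. */
static uint8_t blend_div256(uint8_t alpha, uint8_t s1, uint8_t s2)
{
    /* The numerator is never negative because alpha <= 255 < 256. */
    int32_t num = 256 * (int32_t)s2
                + ((int32_t)s1 - (int32_t)s2) * (int32_t)alpha + 128;
    return (uint8_t)(num / 256);
}

/* Blend with the strictly correct divisor 255, rounded to nearest. */
static uint8_t blend_div255(uint8_t alpha, uint8_t s1, uint8_t s2)
{
    int32_t num = (int32_t)alpha * s1 + (255 - (int32_t)alpha) * s2 + 127;
    return (uint8_t)(num / 255);
}

/* Largest absolute difference between the two over every input. */
static int max_blend_error(void)
{
    int worst = 0;
    for (int a = 0; a < 256; a++)
        for (int p = 0; p < 256; p++)
            for (int q = 0; q < 256; q++) {
                int d = (int)blend_div256((uint8_t)a, (uint8_t)p, (uint8_t)q)
                      - (int)blend_div255((uint8_t)a, (uint8_t)p, (uint8_t)q);
                assert(d >= -1 && d <= 1);
                if (d < 0) d = -d;
                if (d > worst) worst = d;
            }
    return worst;
}
```

The worst case appears at full opacity: with alpha = 255, s1 = 255, and s2 = 0, the divide-by-256 form gives 254 where the exact blend gives 255.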
97. ligned address once and either increment it directly or use array notation. This will ensure that the address arithmetic is performed in the integer units, in parallel with the execution of the VIS instructions.

4.7.7 vis_edge{8,16,32}()

Function: Compute a mask used for partial storage at an arbitrarily aligned start or stop address. These instructions are typically used to handle boundary conditions for parallel pixel scan line loops.

Syntax:

    /* Pure edge handling instructions */
    vis_u8  vis_edge8(void *address1, void *address2);
    vis_u16 vis_edge16(void *address1, void *address2);
    vis_u32 vis_edge32(void *address1, void *address2);

    /* Little-endian version of pure edge handling instructions */
    vis_u8  vis_edge8l(void *address1, void *address2);
    vis_u16 vis_edge16l(void *address1, void *address2);
    vis_u32 vis_edge32l(void *address1, void *address2);

Description: vis_edge8(), vis_edge16(), and vis_edge32() compute a mask to identify which 8-, 16-, or 32-bit components of a vis_d64 variable are valid for writing to an 8-byte aligned address. vis_edge{8,16,32}() are typically used with a partial store instruction. Partial stores always start to write at an 8-byte aligned address; an application, on the other hand, may want to start writing at an arbitrary address that is not 8-byte aligned. This necessitates a mask. For example, if you want to start writing data at address 0x10003, the partial store (using a partial store instruction as described in the next section) w
98. line start point in destination vis u8 dend line end point in destination vis d64 dp 8 byte aligned start point in dest int off offset of address alignment in dest int emask edge masks vis d64 sd s0 sl source data vis f32 sh sl vis f32 ff filter data vis u32 fu vis d64 thh thl tlh termporaries vis d64 tll tdh tdl vis d64 rdh rdl intermediate results vis d64 dg destination data vis f32 dh dl int n k num loop variables set GSR scale factor to O0 each vis s32 component will be saved by vis fpackfix such that bits 16 to 31 of eL Sun Microsystems Inc 121 VIS Instruction Set User s Manual vis write gsr 0 prepare the detination address da vis u8 dst dp vis d64 vis u32 da amp 7 off vis u32 dp vis u32 da dend da 2 dlen 1 generate edge mask for the start point mask vis_edgel6 da dend prepare the source address sa vis u8 src num vis u32 dend gt gt 3 vis_u32 for n 0 n lt num n SS sa rdh vis fzero rdl vis fzero for k 0 k flen k load 8 bytes of source data Sp vis d64 vis alignadar ss s0 sp 0 sl sp 1 sd vis faligndata s0 sl fu fir k lt lt 16 fir k Oxffff ff vis to float fu sh vis read hi sd sl vis read lo sd t
99. lude lt stdlib h gt include vis types h include vis proto h invert an array of 8 bit data int num int num Figure 4 25 Start Point Handling in vis invers8b Examples FUNCTION vis inverse8a vis inverse8b SYNOPSIS void vis inverse8a vis u8 src vis u8 dst void vis inverse8b vis u8 src vis u8 dst ARGUMENT src pointer to first byte of source data dst pointer to first byte of destination data num length of arrays DESCRIPTION dst i 255 src i 0 lt i lt num Sun Microsystems Inc 77 VIS Instruction Set User s Manual Code Example 4 1 void vis inverse8a vis u8 sa src vis d64 sp x vis u8 da dst vis u8 dend dend2 vis d64 dp 8 byte int off offset int emask vis d64 s sl s0 vis d64 d prepare vis u8 src Data Boundary Handling By vis inverse8a vis u8 dst int length start point in source 8 byte aligned start point in source start point in destination end point in destination aligned start point in destination of address alignment in destination edge mask source data destination data destination address dp vis d64 vis u32 da amp 7 off vis u32 dp vis u32 da dend da length 1 pointer to the last byte of data dend2 dend 8 pointer to the last byte which doesn t need edge handling
100. ment Flow sp2 pointer to 8 bytes of source data 2 dp pointer to 8 bytes ap pointer to 8 bytes DESCRIPTION Blend two arrays with a alpha coefficient array dst alpha srcl 255 alpha src2 include lt stdlib h gt include include vis types h vis proto h of destination data of alpha coefficient 0 lt alpha lt 255 BRK kk kk kk kk kk kk kk kk kk kk kk k k K k AA void vdk vis blend88 vis d64 spl vis d64 sp2 vis d64 dp vis d64 ap vis d64 sdl sd2 ad vis d64 ones viscf32 Silh St2h fll st21 vis d64 adh bdh adl bdl vis d64 rdlh rd2h rdll rd21 vis d64 rdh rdl vis d64 rd sdl sd2 ad spl 0 sp2 0 ap 0 vis write gsr 3 3 ones adh adl bdh bdl sflh sf11 rdlh rdil sf2h sf21 rd2h rd21 rdh vis to double dup 0x0Off00ff0 vis fexpand hi ad vis fexpand lo ad vis fpsub16 ones vis fpsubl6 ones vis read hi sdl vis read lo sdl vis fmul8x16 sflh vis fmul8x16 sfl11l vis read hi sd2 vis read lo sd2 vis fmul8x16 sf2h vis fmul8x16 sf21 vis fpadd16 rdlh adh adl adh adl bdh bdl rd2h Sun Microsystems Inc 35 VIS Instruction Set User s Manual rdl vis fpaddl6 rdll rd21 rd vis fpackl16 to hi rd rdh rd vis fpackl6 to lo rd rdl dp 0 rd RRR KR A KG ke X e X e x main int argc vis d64
101. n vis dreg overlay vis d64 d64 vis f32 f32 2 vis u32 u32 2 vis s32 s32 vis ul6 ul6 vis s16 s16 vis u8 u8 8 vis s8 s8 8 unsigned long long ull struct vola uw SL x l Sun Microelectronics 24 3 Development Flow vis d64 vis fpaddl6 vis d64 frsl vis d64 frs2 union vis dreg overlay opl op2 dest op1 d64 frsl op2 d64 frs2 dest s16 0 op1 s16 0 op2 s16 0 dest s16 1 opl s16 1 op2 s16 1 dest s16 2 opl s16 2 op2 s16 2 dest s16 3 opl s16 3 op2 s16 3 return dest d64 3 6 UseofINCAS 3 6 1 What Is INCAS INCAS It s a Near Cycle Accurate Simulator is a near cycle accurate model of the UltraSPARC I processor INCAS offers you a convenient way to do code per formance prediction cycle counting and to examine processor status at each cycle to assist in debugging and optimizing your code 3 6 2 Limitations of Incas Simulation INCAS models the UltraSPARC I processor including the instruction cache the data cache and the external or 2nd level cache quite accurately However the in teraction of the processor with the system controller and main memory is mod eled at a lesser level of accuracy as shown in Figure 3 1 Sun Microsystems Inc 25 VIS Instruction Set User s Manual UItraSPARC I processor with 16 Kbytes Instruction Cache amp 16 Kbytes Data Cache 128 bit wide bus External or 2nd Level Cache 512 Kbytes to 4 Mbytes Syst
102. nd point in destination vis d64 dp 8 byte aligned start point in destination int off offset of address alignment in destination int emask edge mask vis d64 s sl s0 source data vis d64 d destination data prepare destination address dp vis d64 vis u32 da amp 7 off 8 vis u32 da 7 dend da length 1 pointer to the last byte of data dend2 dend 8 pointer to the last byte which doesn t need edge handling EJ generate edge mask for start point mask vis_edge8 da dend prepare source address and set GSR alignaddr offset sp vis_d64 vis_alignaddr sa 0 Sun Microsystems Inc 79 VIS Instruction Set User s Manual load 8 bytes of source data s0 sp sp tt sl sp S vis_faligndata s0 s1 8 pixel inversion d vis_fnot s store 8 bytes of result vis_alignaddr void off 0 vis pst 8 vis faligndata d d dp emask s0 s1 sa off dp prepare source address and set GSR alignaddr offset sp vis d64 vis alignaddr sa 0 set edge mask to 11111111 so all 8 bytes of data will be saved in vis pst 8 doing while loop emask Oxff 8 byte loop while vis_u32 dp lt load 8 bytes sl sp s vis faligndata s0 8 pixel inversion vis u32 dend2 of source data s d
103. nd sent to the Store Buffer separately. In the Floating-point/Graphics pipe, this is the second execution stage (X2), where execution continues for most instructions.

2.5.7 Stage 7: N2 Stage

In this stage, the Integer pipe essentially waits for the Floating-point/Graphics pipe to complete. Most floating-point instructions in the Floating-point/Graphics pipe finish their execution during this stage. After N2, data can be bypassed to other stages or forwarded to the data portion of the Store Buffer. All loads that have entered the Load Buffer in N1 continue their progress through the buffer; they will reappear in the pipeline only when the data comes back.

2.5.8 Stage 8: N3 Stage

In this stage, the Integer and Floating-point/Graphics pipes converge to resolve traps.

2.5.9 Stage 9: Write (W) Stage

In this stage, all results (integer and floating-point) are written to the register files. All actions performed during this stage are irreversible. After this stage, instructions are considered terminated.

2.6 Performance Improvement

The expanded hardware capabilities of the UltraSPARC-I processor offer you a sustained execution rate of four instructions per cycle, even in the presence of conditional branches and cache misses. Typically, this may include a simultaneous execution of 2 floating-point/graphics, 1 integer, and 1 load/store instruction per cycle.
104. ne USE BLD define USE BST define MEMBAR BEFORE BLD define MEMBAR AFTER BLD define BI fmovd XX XX define BUBBLE BI define BUBBLE1 BI define BUBBLE BI BI define BUBBLE BI BI BI define BUBBLE4 BI BI BI BI define BUBBLE5 BI BI BI BI define BUBBLE BI BI BI Bl define BUBBLE BI BI BI BI define BUBBLE BI BI BI BI define BUBBLE BIS BI Bis Bl define BUBBLE10 BI BI BI BI ifdef USE BLD define BLD AO ldda sa ASI BLK P A0 cmp sa se blu pt Sicc 1f inc 64 sa dec 64 sa T3 else Sun Microsystems Inc 97 VIS Instruction Set User s Manual define 15 fendif BLD A0 ldd ldd ldd ldd ldd ldd ldd ldd cmp blu pt inc dec ifdef USE BLD define else define UE fendif BLD BO ldda cmp blu pt inc dec BLD BO ldd ldd ldd ldd lad lad lad ldd cmp blu pt inc dec tifdef USE BST define else define Sun Microelectronics 98 BST stda inc deccc ble pn nop BST std std std sa 0 A0 Sa 8l1 Al5 sa 16 A2 sa 24 A3 sa 32 A4 sa 40 A5 sa 56 A7 sa se SLCC Lt 64 sa 64 sa sa ASI BLK P BO sa se Sicc 1f 64 sa 64 sa sa 0 B0 sa 8 B1 sa 16 B2 sa 24 B3 sa 32 B4 sa 40 B5 sa 48 B6 sa 56 B7 sa se ZLEC If 64 sa 64 sa O0 da ASI BLK P 64 da ns icc loop end 00 da 0 01 da 8 O2 da 16 PPP
105. ng long ipxz gt gt Z SHIFT lt lt i32 unsigned amp fcolor FGR FFB WRITE64 RAW fbx value increment delta ipxu idxu ipxv idxv dpx11 vis_fpaddl6 dpx11 ddx11 dpx12 vis_fpaddl6 dpx12 ddx12 fbx 8 ipxz idxz increment delta iphx idhx ipmx idmx fpyz fdyz fpyu fdyu fpyv fdyv dpyll vis fpaddl6 dpyll ddyll diffuse lighting coefficient dpyl2 vis fpaddl6 dpyl2 ddyl2 specular lighting coefficient fby dlb Sun Microelectronics 120 5 Advanced Topics 5 4 Audio Applications 5 4 1 Finite Impulse Response FIR Filter This example illustrates the implementation of a FIR filter of length flen operating on an input data string of in accordance with the following relationship flen 1 dst n y fir k x src n k 6 O n dlen k 0 A 16 bit x 16 bit multiplication is performed and the result accumulated as a 32 bit value include include include void vis_ lt stdlib h gt vis types h vis proto h fir T6 vis s16 dst vis s16 src int dlen vis s16 fir int flen SEC pointer to first sample of source data dst pointer to first sample of destination data dlen length of destination data e coefficients of FIR filter flen length of FIR filter El vis u8 sa ss start point in source data vis d64 sp 8 byte aligned start point in source vis u8 da
106. nits of the UltraSPARC-I
• UltraSPARC-I front end
• Integer Execution Unit (IEU)
• Floating Point/Graphics Unit (FGU)
• System Interface
• Processor Pipeline

2.2 The Functional Units of UltraSPARC-I
Figure 2-1 is a simplified block diagram identifying the major functional units that make up UltraSPARC-I.

The front end, which is the Prefetch/Dispatch Unit (PDU), prefetches instructions based upon a dynamic branch prediction mechanism and a next-field address, which allows single-cycle branch following. By predicting branches accurately, typically better than 90% of the time, the front end can supply four instructions per cycle to the core execution block.

The Integer Execution Unit (IEU) performs all integer arithmetic/logical operations. The IEU incorporates a novel 3-D register file supporting 7 read and 3 write ports.

The Floating Point/Graphics Unit (FGU) integrates five functional units and a Register File made up of 32 64-bit registers. The floating-point adder, multiplier, and divider, which perform all floating-point operations, have been augmented by a graphics adder and multiplier to perform the partitioned integer operations required by the VIS Instruction Set.

The Load/Store Unit (LSU) executes all instructions that transfer data between the memory hierarchy and the two register files in the IEU and the FGU. Included in this unit are the Data Cache D
107. ntains an integral value. To use vis_pdist from C, it is necessary for the accumulating register (accumulator) to appear both as an argument and as the receiver of the return value.

The vis_pdist instruction is intended to accelerate motion compensation, to support real-time video compression in such applications as H.320 video conferencing.

Example
    vis_d64 accum, pixels1, pixels2;
    accum = vis_fzero();
    accum = vis_pdist(pixels1, pixels2, accum);

4.7.12 Block Load and Store Instructions

Function
Transfer 64 bytes of data between memory and registers.

Syntax
The Block Load and Store instructions do not have a C interface and must be coded in assembly language. For assembly language syntax, refer to section 13.6.4 in the UltraSPARC-I User's Manual.

Description
The block load instruction loads 64 bytes of data with a block transfer from a 64-byte aligned memory area into eight double-precision floating-point registers. The block store instruction stores data with a block transfer from eight double-precision floating-point registers to a 64-byte aligned memory area.

Example
Note that the loop must be unrolled to achieve maximum performance. All FP registers are double-precision. Eight versions of this loop are needed to handle all the cases of double-word misalignment between the source and destination.
    loop:
        faligndata %d0, %d2, %d34
        faligndata %d2, %d4
108. o 64-bit values as data_hi, which is the upper half of the concatenated value, and data_lo, which is the lower half of the concatenated value. Bytes in this value are numbered from most significant to least significant, with the most significant byte being 0. The return value is a vis_d64 variable representing eight bytes extracted from the concatenated value, with the most significant byte specified by the GSR offset field, as illustrated in Figure 4-23, where it is assumed that the GSR address offset field has the value 5.

[Figure 4-23: vis_faligndata example. data_hi starts at the aligned boundary x10000 and data_lo at x10008; with the offset address x10005, vis_faligndata(data_hi, data_lo) returns the shaded data segment.]

Care must be taken not to read past the end of a legal segment of memory. A legal segment can only begin and end on page boundaries, and so if any byte of a vis_d64 lies within a valid page, the entire vis_d64 must lie within the page. However, when addr is already 8-byte aligned, the GSR address offset bits will be set to 0 and no byte of data_lo will be used. Therefore, even though it is legal to read 8 bytes starting at addr, it may not be legal to read 16 bytes, and this code will fail. This problem may be avoided in a number of ways: addr may be compared with some known address of the last legal byte; the final iteration of a loop, which may need to read past the end of the legal data, may be
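The byte extraction described above can be modelled in portable C without VIS hardware. The following is a minimal sketch, not part of the VIS API: the name faligndata_model is hypothetical, uint64_t stands in for vis_d64, and the GSR offset field is passed as an explicit argument instead of being read from the register.

```c
#include <stdint.h>

/* Hypothetical software model of vis_faligndata (illustration only).
 * data_hi and data_lo form a 16-byte value whose bytes are numbered
 * 0..15 from the most significant end. The result is the 8 bytes that
 * start at byte `offset` (0..7), which stands in for the GSR offset. */
uint64_t faligndata_model(uint64_t data_hi, uint64_t data_lo, unsigned offset)
{
    offset &= 7u;
    if (offset == 0)
        return data_hi;   /* aligned case: no byte of data_lo is used */
    return (data_hi << (8 * offset)) | (data_lo >> (64 - 8 * offset));
}
```

With an offset of 5, as in Figure 4-23, the model returns the last three bytes of data_hi followed by the first five bytes of data_lo. The offset 0 branch mirrors the remark that no byte of data_lo is used when addr is already aligned.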
109. oftware Developer's Kit (VSDK) is a set of tools and sample code designed to help in the development of VIS code. The bulk of the sample code in this and later chapters of this guide can be found in the VSDK. Before using the VSDK, the following environment variables must be defined:

    VSDKHOME     the root directory of the VIS Software Developer's Kit
    INCASHOME    the root directory of INCAS

If the SPARCompiler 4.x being used to compile VIS code is not the default compiler, then the environment variable CC needs to be set to point to the SC 4.x compiler in order for the Makefiles in the VSDK to work. An example environment variable definition is:

    setenv VSDKHOME /opt/SUNWvsdk
    setenv INCASHOME /opt/SUNWincas

3.4 SPARCompiler 4.x (SC 4.x)
The SPARCompiler 4.x (SC4.0 or later) is the latest Sun compiler release and is backward compatible with the previous releases of SPARCompilers supporting UltraSPARC development. By incorporating a new, flexible flag scheme, the SPARCompiler 4.x lets you target the UltraSPARC processor implementation of the SPARC-V9 architecture with the VIS instruction set extension. Additionally, the SPARCompiler 4.x offers improved runtime performance, profile-feedback-based optimization, and improved parallelization support.

3.4.1 Compiling VIS Code
When compiling VIS code on a machine incorporating the UltraSPARC CPU, close to optimum performance will be
110. oubles i byte0 word0 gt gt 24 No need to mask with Oxff bytel word0 gt gt 16 amp Oxff byte2 word0 gt gt 8 Oxff byte3 word0 amp Oxff byte4 wordl gt gt 24 byte5 wordl gt gt 16 Oxff byte6 wordl gt gt 8 amp Oxff byte7 wordl amp Oxff wordO0 word2 wordl word3 word2 vis_u32 rc 2 i next word3 vis_u32 src 2 i next 1 lookup vis 1d u8 i vis ras table byte 7 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte6 accum vis faligndata lookup accum lookup vis 1d u8 i vis ras table byte5 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte4 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte3 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table byte2 accum vis faligndata lookup accum lookup vis ld u8 i vis ras table bytel accum vis faligndata lookup accum Sun Microsystems Inc 115 VIS Instruction Set User s Manual lookup vis ld u8 i vis ras table byte0 accum vis faligndata lookup accum vis d64 dst i accum break Case 1 for i 0 i lt doubles i byte0 word0 gt gt 16 Oxff bytel word0 gt gt 8 Oxff byte2 word0 amp Oxff
111. ource vis d64 sp 8 byte aligned start point in source vis u8 da dst start of a line in destination vis u8 dend end point of a line in destination vis d64 dp 8 byte aligned destination start point int off address alignment offset in destination int emask edge mask vis d64 sd sl s0 sdh sdl source data vis d64 t0 tl t2 threshold vis ft32 tf vis u32 tu vis d64 a0 al a2 above value vis u32 auh aul vis d64 b0 bl b2 below value vis u32 buh bul int cmask cmaskh cmaskl comparison masks int i num loop variables Prepare the destination address dp vis d64 vis u32 da amp 7 off vis u32 dp vis u32 da dend da 3 length 1 Prepare the source address Sp vis_d64 vis_alignaddr sa off Prepare the thresholds tu thresh 9 off 3 lt lt 24 thresh 10 off 3 16 thresh 11 off 3 lt lt 8 thresh 9 off 3 tf vis to float tu to vis_fexpand tf tu thresh 10 off 3 lt lt 24 thresh 11 off 3 lt lt 16 thresh 9 off 3 lt lt 8 Sun Microsystems Inc 109 VIS Instruction Set User s Manual thresh 10 off 3 tf vis to float tu tl vis fexpand tf tu thresh 11 off 3 lt lt 24 thresh 9 off 3 lt lt 16 thresh 10 off 3 lt lt 8
112. owercomp

Description
vis_read_hi(), vis_read_lo() and vis_write_hi(), vis_write_lo() permit read and write operations to the upper (uppercomp) or lower (lowercomp) 32-bit components of a vis_d64 variable. However, code written with these instructions cannot be optimized as easily as that written using vis_freg_pair().

Example 1
    vis_d64 data_64;
    vis_f32 data_32;
    /* Extracts the upper 32 bits of data_64 and places them into data_32 */
    data_32 = vis_read_hi(data_64);

In practice the compiler can often accomplish the same effect by taking advantage of register pairs. For example, if the value data_64 resides in the register %d30, vis_read_hi(data_64) becomes a reference to %f30 and vis_read_lo(data_64) becomes a reference to %f31 in the generated assembly code.

Example 2
    vis_d64 data_64;
    vis_f32 data_32;
    /* Writes data_32 to the lower portion of data_64, leaving the upper half of data_64 intact */
    data_64 = vis_write_lo(data_64, data_32);

If data_64 resides in %d30 and data_32 resides in %f5, the C statement might be translated to the assembly language statement:
    fmovs %f5, %f31

4.3.3 vis_freg_pair

Function
Join two vis_f32 variables into a single vis_d64 variable.

Syntax
    vis_d64 vis_freg_pair(vis_f32 data1_32, vis_f32 data2_32);

Description
vis_freg_pair() joins two vis_f32 values, data1_32 and data2_32, into a single vis_d64 variable
113. r init value 0x0028 fsr init value 1

3.6.9.3 INCAS Command Batch File for vis_example3
echo on focus ieul load 0 raml vis_example3 breakpoint add &main breakpoint add &vdk_vis_blend88 breakpoint add &exit main run wait time vdk_vis_blend88 run wait time exit run wait time quit

3.7 Process Tuning
To perform process tuning for increased performance, you may find it useful to refer to Table 17-1 in the UltraSPARC-I User's Manual, which shows the latencies for floating-point and graphics instructions, and to Appendix A for hints and suggestions for performance optimization. As a general guideline, it is not recommended that an instruction be issued prior to its input data becoming available.

4 Using VIS

4.1 Overview
This chapter introduces the comprehensive set of VIS instructions, which are used primarily, but not exclusively, to write graphics and multimedia applications. While the majority of the instructions have a C interface via an inline mechanism, some (for example, the array instructions) do not have a C interface and must be written in assembly language. Topics included in this chapter are:
• A definition of the data structures used
• A description of Utility Inlines
• A description of Logical Instructions
• A description of Arithmetic Instructions
• A description of Packing Instr
114. rce code vis example3 c its assembly listing vis example3 s and INCAS log file vis example3 log in the directory VSDKHOME examples src The source code and assembly listing are also presented in section 3 6 9 1 Start INCAS as described in section 3 6 4 on page 28 2 Load your Binary File e g vis example3 into RAMI starting at address 0 with the following command ieul load 0 raml vis example3 3 You may set breakpoints where you want to examine the code in detail See file vis example3 s in directory VSDKHOME examples src or in section 3 6 9 for corresponding location of the breakpoints ieul breakpoint add amp vdk vis blend88 0x38 ieul breakpoint add amp vdk vis blend88 0x64 4 Start the simulation with the command run The simulation will stop when it reaches a breakpoint ieul run ieul breakpoint 1 stage G at vdk vis blend88 0x38 0x8550 Sun Microelectronics 30 3 Development Flow encountered 5 You may check the content of integer registers at any point with the command ieul iregs iregs when you focus on ieul ieul iregs Youngest registers in window 2 INS LOCALS OUTS GLOBALS 0 0x00000000007ffef8 OxXXXXXXXXXXXXXXXX 0x000000000ff00ff0 0x0000000000000000 1 0x00000000007ffef0 OxXXXXXXXXXXXXXXXX 0x0000000000000018 OxXXXXXXXXXXXXXXXX 2 0x00000000007ffee8 OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX 3 0x00000000007ffee0 OxXXXXXXXXXXXXXXXX OxXXXXXXXXXXXXXXXX OxXXX
115. re 4-9 vis_fpadd32s and vis_fpsub32s

Example
    vis_d64 data1_4_16, data2_4_16, data1_2_32, data2_2_32;
    vis_d64 sum_4_16, difference_4_16, sum_2_32, difference_2_32;
    vis_f32 data1_2_16, data2_2_16, sum_2_16, difference_2_16;
    vis_f32 data1_1_32, data2_1_32, sum_1_32, difference_1_32;

    sum_4_16 = vis_fpadd16(data1_4_16, data2_4_16);
    difference_4_16 = vis_fpsub16(data1_4_16, data2_4_16);
    sum_2_32 = vis_fpadd32(data1_2_32, data2_2_32);
    difference_2_32 = vis_fpsub32(data1_2_32, data2_2_32);
    sum_2_16 = vis_fpadd16s(data1_2_16, data2_2_16);
    difference_2_16 = vis_fpsub16s(data1_2_16, data2_2_16);
    sum_1_32 = vis_fpadd32s(data1_1_32, data2_1_32);
    difference_1_32 = vis_fpsub32s(data1_1_32, data2_1_32);

4.6.2 vis_fmul8x16

Function
Multiply the elements of an 8-bit partitioned vis_f32 variable by the corresponding elements of a 16-bit partitioned vis_d64 variable to produce a 16-bit partitioned vis_d64 result.

Syntax
    vis_d64 vis_fmul8x16(vis_f32 pixels, vis_d64 scale);

Description
vis_fmul8x16() multiplies each unsigned 8-bit component within pixels by the corresponding signed 16-bit fixed-point component within scale and returns the upper 16 bits of the 24-bit product (after rounding) as a signed 16-bit component in the 64-bit returned value. In other words:

    16-bit result = (8-bit pixel element * 16-bit scale element + 128) / 256

The operation is illustrated in Figure 4-10. Thi
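The rounding rule quoted above for vis_fmul8x16 can be checked on a single element in plain C. This is a minimal sketch under stated assumptions: fmul8x16_lane is a hypothetical helper, not a VIS inline, and it models one of the four 8-bit by 16-bit products rather than a whole vis_d64.

```c
#include <stdint.h>

/* Hypothetical model of one vis_fmul8x16 element (illustration only):
 * result = (pixel * scale + 128) / 256, with the division rounding
 * toward minus infinity so that it matches taking the upper 16 bits
 * of the 24-bit product after rounding. */
int16_t fmul8x16_lane(uint8_t pixel, int16_t scale)
{
    int32_t t = (int32_t)pixel * (int32_t)scale + 128;
    /* Floor division by 256 that does not depend on how the compiler
     * right-shifts negative values. */
    int32_t q = (t >= 0) ? (t / 256) : -((-t + 255) / 256);
    return (int16_t)q;
}
```

A scale element of 256 leaves a pixel value unchanged, and a scale element of 128 halves it with rounding, which is the behavior that a filter coefficient written as a fixed-point fraction relies on.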
116. rs. Of major significance is the incorporation of 16 additional double-precision floating-point registers, bringing the total up to 32. The floating-point unit (FPU) data paths have been enhanced to include the capability to perform the partitioned integer arithmetic operations required for graphics applications. This capability is provided by a graphics adder that is organized as 4 independent 16-bit adders, a graphics multiplier that is composed of four 8 x 16 multipliers, and a pixel distance logic implementation. A graphics status register (GSR) with scale factor and align offset fields is included to support format conversions and memory alignment.

The arithmetic is performed on 2 new partitioned data types: pixel and fixed data. Pixels consist of four 8-bit unsigned integers contained in a 32-bit word. (The vis_pdist instruction accepts eight 8-bit unsigned integers in a 64-bit register.) Fixed data consists of either four 16-bit fixed-point components or two 32-bit fixed-point components, both contained in a 64-bit word, or of two 16-bit components or one 32-bit component in a 32-bit register.

To take advantage of the modified floating-point pipeline to perform partitioned integer arithmetic, a VIS Instruction Set extension is included to support graphics and other applications with the following functions:
1. Format conversions, such as converting pixel data to fixed data format, operating on either 16- or 32-bit components.
2. Arithmetic operations such as part
117. s
    for (i = 0; i < times; i++) {
        r = red[i];      /* r0r1r2r3r4r5r6r7 */
        g = green[i];    /* g0g1g2g3g4g5g6g7 */
        b = blue[i];     /* b0b1b2b3b4b5b6b7 */
        a = alpha[i];    /* a0a1a2a3a4a5a6a7 */

        ag = vis_fpmerge(vis_read_hi(a), vis_read_hi(g));    /* a0g0a1g1a2g2a3g3 */
        br = vis_fpmerge(vis_read_hi(b), vis_read_hi(r));    /* b0r0b1r1b2r2b3r3 */
        /* Merge to obtain a0b0g0r0a1b1g1r1 */
        abgr[4 * i] = vis_fpmerge(vis_read_hi(ag), vis_read_hi(br));
        /* Merge to obtain a2b2g2r2a3b3g3r3 */
        abgr[4 * i + 1] = vis_fpmerge(vis_read_lo(ag), vis_read_lo(br));

        ag = vis_fpmerge(vis_read_lo(a), vis_read_lo(g));    /* a4g4a5g5a6g6a7g7 */
        br = vis_fpmerge(vis_read_lo(b), vis_read_lo(r));    /* b4r4b5r5b6r6b7r7 */
        /* Merge to obtain a4b4g4r4a5b5g5r5 */
        abgr[4 * i + 2] = vis_fpmerge(vis_read_hi(ag), vis_read_hi(br));
        /* Merge to obtain a6b6g6r6a7b7g7r7 */
        abgr[4 * i + 3] = vis_fpmerge(vis_read_lo(ag), vis_read_lo(br));
    }

4.8.5.2 Transposing a Block of Bytes
    vis_d64 p0, p1, p2, p3, p4, p5, p6, p7;            /* Inputs */
    vis_d64 q0, q1, q2, q3, q4, q5, q6, q7;            /* Outputs */
    vis_d64 m04, m15, m26, m37, m0426, m1537;          /* Temporaries */

    m04 = vis_fpmerge(vis_read_hi(p0), vis_read_hi(p4));
    m15 = vis_fpmerge(vis_read_hi(p1), vis_read_hi(p5));
    m26 = vis_fpmerge(vis_read_hi(p2), vis_read_hi(p6));
    m37 = vis_fpmerge(vis_read_hi(p3), vis_read_hi(p7));
    m0426 = vis_fpmerge(vis_read_hi(m04), vis_read_hi(m26));
    m1537 = vis_fpmerge(vis_read_hi(m15), vis_read_hi(m37));
    q0
118. s within data_4_8, converts each integer to a 16-bit fixed value by inserting four zeroes to the right and to the left of each byte, and returns four 16-bit elements within a 64-bit result. Since the various vis_fmul8x16() instructions can also perform this function, vis_fexpand() is mainly used when the first operation to be used on the expanded data is an addition or a comparison. Figure 4-20 illustrates the vis_fexpand() operation.

[Figure 4-20: vis_fexpand operation. Each 8-bit component of data_4_8 is placed in the middle of the corresponding 16-bit component of result_4_16, with four zero bits above it and four zero bits below it.]

Example
    vis_d64 result_4_16;
    vis_f32 data_4_8, factor;

    result_4_16 = vis_fexpand(data_4_8);

    /* Using vis_fmul8x16al to perform the same function */
    factor = vis_to_float(0x0100);
    result_4_16 = vis_fmul8x16al(data_4_8, factor);

4.7.5 vis_fpmerge

Function
Merges two 8-bit partitioned vis_f32 arguments by selecting bytes from each in an alternating fashion.

Syntax
    vis_d64 vis_fpmerge(vis_f32 pixels1, vis_f32 pixels2);

Description
vis_fpmerge() interleaves four corresponding 8-bit unsigned values within pixels1 and pixels2 to produce a 64-bit merged result. The operation is illustrated in Figure 4-21.

[Figure 4-21: vis_fpmerge operation. Bytes of pixels1 and pixels2 alternate in the 64-bit merged result.]

Ex
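The two formatting operations described above, vis_fexpand and vis_fpmerge, can be mimicked with integer arithmetic. The sketch below is an assumption-laden illustration: fexpand_model and fpmerge_model are hypothetical names, uint32_t and uint64_t stand in for vis_f32 and vis_d64, and in the merge the byte from pixels1 is taken to precede the byte from pixels2 in each pair.

```c
#include <stdint.h>

/* Hypothetical model of vis_fexpand (illustration only): each 8-bit
 * component becomes a 16-bit component with four zero bits on either
 * side of the original byte. */
uint64_t fexpand_model(uint32_t data_4_8)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t byte = (data_4_8 >> (24 - 8 * i)) & 0xffu;
        result = (result << 16) | (byte << 4);
    }
    return result;
}

/* Hypothetical model of vis_fpmerge (illustration only): bytes of
 * pixels1 and pixels2 alternate, most significant bytes first. */
uint64_t fpmerge_model(uint32_t pixels1, uint32_t pixels2)
{
    uint64_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t b1 = (pixels1 >> (24 - 8 * i)) & 0xffu;
        uint64_t b2 = (pixels2 >> (24 - 8 * i)) & 0xffu;
        result = (result << 16) | (b1 << 8) | b2;
    }
    return result;
}
```

Applying fpmerge_model twice, first to pairs of color planes and then to the two intermediate results, reproduces the interleaving pattern that merge-based pixel packing relies on.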
119. s_for(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fors(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fand(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fands(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fxor(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fxors(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fnor(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fnors(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fnand(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fnands(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fxnor(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fxnors(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fornot(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fornots(vis_f32 data1_32, vis_f32 data2_32);
    vis_d64 vis_fandnot(vis_d64 data1_64, vis_d64 data2_64);
    vis_f32 vis_fandnots(vis_f32 data1_32, vis_f32 data2_32);

Description
The 64-bit versions of these instructions perform one of eight 64-bit logical operations between data1_64 and data2_64. The 32-bit versions of these instructions perform one of eight 32-bit logical operations between data1_32 and data2_32.

Example
    vis_f32 data1_32, data2_32, result_32;
    vis_d64 data1_64, data2_64, result_64;
    /* result_64 holds the result of a logical operation between data1_64 and data2_64;
       result_32 holds the result of a logical operation between data1_32 and data2_32 */
    for r
120. s fpaddl6 aco hi vis fmul8xl6au vis read hi pix2 23 acc lo vis fpaddl6 acc 1o vis fmul8xl6au vis read lo pix2 23 Add high and low words of f3 pix3 to accumulator acc hi vis fpaddl6 acc hi vis fmul8x16al vis read hi pix3 23 acc lo vis fpaddl6 acc lo vis fmul8x16al vis read lo pix3 23 Pack join halves and store result into obuf obuf p vis freg pair vis fpackl6 acc hi vis fpack16 acc lo Sun Microsystems Inc 107 VIS Instruction Set User s Manual 5 2 2 Handling Three Band Data This example illustrates how to handle three band pixel data The value of each pixel in each band is compared to a threshold thresh for that band If the pixel band value is above the threshold the destination is set to the above value for that band otherwise it is set to the below value of that band Each pixel is represented by three values of B G and R Since the VIS processes data as 8 byte partitioned 64 bit words it is not possible to store an even number of complete pixels in a word efficiently To overcome this pixels are arranged for processing in three 8 byte segments that are defined depending on the destination address offset If the destination address offset is 0 then the three processing segments used are de fined as follows Segment 1 BO GO RO B1 G1 R1 B2 G2 Segment 2 R2 B3 G3 R3 B4 G4 R4 B5 Segment 3 G5 R5 B6 G6 R6 B7 G7 R7 If the destination address offset i
121. s_fsrcs, vis_fnot, vis_fnots
4.4.3 vis_f[or, and, xor, nor, nand, xnor, ornot, andnot][s]
4.5 Pixel Compare Instructions
4.5.1 vis_fcmp[gt, le, eq, ne, lt, ge][16, 32]
4.6 Arithmetic Instructions
4.6.1 vis_fpadd[16, 16s, 32, 32s], vis_fpsub[16, 16s, 32, 32s]
4.6.2 vis_fmul8x16
4.6.3 vis_fmul8x16au, vis_fmul8x16al
4.6.4 vis_fmul8sux16, vis_fmul8ulx16
4.6.5 vis_fmuld8sux16, vis_fmuld8ulx16
4.7 Pixel Formatting Instructions
4.7.1 vis_fpack16
4.7.2 vis_fpack32
4.7.3 vis_fpackfix
4.7.4 vis_fexpand
4.7.5 vis_fpmerge
4.7.6 vis_alignaddr, vis_faligndata
4.7.7 vis_edge[8, 16, 32]
4.7.8 vis_pst_[8, 16, 32]
4.7.9 Short Loads and Stores
4.7.10 Array Instructions
4.7.11 vis_pdist
4.7.12 Block Load and Store Instructions
4.8 Code Examples
4.8.1 Averaging Two Images
4.8.2 Blending Two Images by a Fixed Percentage
122. s instruction treats the pixels values as fixed-point, with the binary point to the left of the most significant bit. For example, this operation is used with filter coefficients as the fixed-point scale value and image data as the pixels value.

[Figure 4-10: vis_fmul8x16 operation. Each 8-bit component of pixels is multiplied by the corresponding 16-bit component of scale, and the most significant bits of each product form one 16-bit component of the 64-bit result.]

Example
    vis_f32 pixels;
    vis_d64 result, scale;

    result = vis_fmul8x16(pixels, scale);

4.6.3 vis_fmul8x16au, vis_fmul8x16al

Function
Multiply the elements of an 8-bit partitioned vis_f32 variable by one element of a 16-bit partitioned vis_f32 variable to produce a 16-bit partitioned vis_d64 result.

Syntax
    vis_d64 vis_fmul8x16au(vis_f32 pixels, vis_f32 scale);
    vis_d64 vis_fmul8x16al(vis_f32 pixels, vis_f32 scale);

Description
vis_fmul8x16au() multiplies each unsigned 8-bit value within pixels by a single 16-bit fixed-point component. The 16-bit fixed-point component is the most significant 16 bits of the 32-bit scale. The four pixel values in the 32-bit variable pixels are each multiplied in the same manner as in vis_fmul8x16(), described in section 4.6.2, except that the same 16-bit scale value is used for all four multiplications. The operation is illustrated in Figure 4-11. vis_fmul8x16al() is the same as vis_fmul8x16au() except that the l
123. s not 0 the processing byte segment arrange ment is circularly shifted by the offset value For example a destination address offset of 2 would result in the following processing segments Segment 1 G7 R7 BO G0 RO B1 G1 R1 Segment 2 B2 G2 R2 B3 G3 R3 B4 G4 Segment 3 R4 B5 G5 R5 B6 G6 R6 B7 Then the last length less than 8 pixels if present is processed with three if con ditionals ARGUMENTS Src pointer to first byte of first pixel of source data dst pointer to first byte of first pixel of destination length lenght of the data in pixels thresh pointer to array of thresholds above pointer to array of values for pixels above thresholds below pointer to array of values for pixels below thresholds jt finclude vis types h include vis proto h define THRESHOLD tdh tdl ad bd X s0 sp 0 N sl sp 1 N sd vis faligndata s0 s1 N N sdh vis_fexpand_hi sd Sun Microelectronics 108 5 Advanced Topics sdl vis fexpand lo sd cmaskh vis fcmplel6 tdh sdh N cmaskl vis fcmplel6 tdl sdl N cmask cmaskh lt lt 4 cmaskl N vis pst 8 ad dp emask amp cmask N vis pst 8 bd dp emask amp cmask sp dp mask vis edge8 dp dend BRK KK kk kk kk kk kk kk kk kk kk kk K k AA void vis thresh83 vis u8 src vis u8 dst int length vis sl6 thresh vis sl6 above vis sl6 below vis u8 sa src start point of a line in s
124. sparc/UltraSPARC-I. It contains the latest information about the UltraSPARC-I, including a PostScript copy of the current UltraSPARC-I Data Sheet. The latest information about VIS is located at http://www.sun.com/sparc/vis. More information can be found at the Sun Microelectronics home page, http://www.sun.com/sparc.

Table of Contents
Preface
Related Documents
Table of Contents
List of Figures
1 Introduction
1.2 UltraSPARC
1.3 Performance Advantage of VIS
2 UltraSPARC Concepts
2.1 Overview
2.2 The Functional Units of UltraSPARC-I
2.3 The UltraSPARC Front End
2.3.1 Integer Execution Unit (IEU)
2.3.2 Floating Point/Graphics Unit (FGU)
2.3.3 Load/Store Unit (LSU)
2.3.4 External Cache
2.3.5 System Interface
2.4 Processor Pipeline
2.5 Pipeline Stage Description
2.5.1 Stage 1: Fetch (F) Stage
125. special-cased; slightly more memory than needed may be allocated to ensure that there are valid bytes available after the end of the data.

Example
The following example illustrates how these instructions may be used together to read a group of eight bytes from an arbitrarily aligned address addr:
    void *addr;
    vis_d64 *addr_aligned;
    vis_d64 data_hi, data_lo, data;

    addr_aligned = (vis_d64 *) vis_alignaddr(addr, 0);
    data_hi = addr_aligned[0];
    data_lo = addr_aligned[1];
    data = vis_faligndata(data_hi, data_lo);

When data are being accessed in a stream, it is not necessary to perform all the steps shown above for each vis_d64. Instead, the address may be aligned once and only one new vis_d64 read per iteration:
    addr_aligned = (vis_d64 *) vis_alignaddr(addr, 0);
    data_hi = addr_aligned[0];
    for (i = 0; i < times; i++) {
        data_lo = addr_aligned[i + 1];
        data = vis_faligndata(data_hi, data_lo);
        /* Use data here. */
        /* Move data "window" to the right. */
        data_hi = data_lo;
    }

Of course, the same considerations concerning read-ahead apply here. In general, it is best not to use vis_alignaddr() to generate an address within an inner loop, e.g.:
    {
        addr_aligned = vis_alignaddr(addr, offset);
        data_hi = addr_aligned[0];
        offset += 8;
        ...
    }
since this means that the data cannot be read until the new address has been computed. Instead, compute the a
126. sters

Syntax
Short stores:
    void vis_st_u8(vis_d64 data, void *address);
    void vis_st_u8_i(vis_d64 data, void *address, vis_u32 index);
    void vis_st_u16(vis_d64 data, void *address);
    void vis_st_u16_i(vis_d64 data, void *address, vis_u32 index);
    void vis_st_u8_le(vis_d64 data, void *address);
    void vis_st_u16_le(vis_d64 data, void *address);
Short loads:
    vis_d64 vis_ld_u8(void *address);
    vis_d64 vis_ld_u8_i(void *address, vis_u32 index);
    vis_d64 vis_ld_u16(void *address);
    vis_d64 vis_ld_u16_i(void *address, vis_u32 index);
    vis_d64 vis_ld_u8_le(void *address);
    vis_d64 vis_ld_u16_le(void *address);

Description
vis_ld_u[8, 8_i, 16, 16_i]() and vis_st_u[8, 8_i, 16, 16_i]() perform 8- and 16-bit loads or stores to and from 64-bit variables. Bytes and shorts may be loaded to and stored from the floating-point register file. Bytes may be loaded from and stored to arbitrary addresses, and shorts from and to even addresses. Instructions with the _i suffix add index to address just prior to loading from or storing to memory. vis_ld_u[8_le, 16_le]() and vis_st_u[8_le, 16_le]() perform the same function but use the little-endian addressing convention.

A common trick uses vis_faligndata() and vis_ld_u8() or vis_st_u8() to read a series of noncontiguous bytes, accumulate them into a vis_d64, and store them all at once. This trick can almost double the speed of some memory-bound loops.

Example
    vis_u8 ad
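The accumulation trick mentioned above can be traced in plain C. The sketch below rests on several assumptions that are not stated in this section: a short byte load is taken to leave the byte in the least significant byte of the 64-bit value, the GSR offset is taken to be 7 so that each alignment step keeps the last byte of its first operand and the first seven bytes of its second, and gather8_model is a hypothetical name rather than a VIS inline.

```c
#include <stdint.h>

/* Hypothetical model of gathering eight noncontiguous table bytes into
 * one 64-bit value (illustration only). Bytes are fetched in reverse
 * order so that table[index[0]] ends up most significant. */
uint64_t gather8_model(const uint8_t *table, const uint8_t index[8])
{
    uint64_t accum = 0;
    for (int k = 7; k >= 0; k--) {
        /* Stands in for a short byte load: value in the low-order byte. */
        uint64_t lookup = table[index[k]];
        /* Stands in for an alignment step with an offset of 7: the new
         * byte enters at the top and the older bytes move down. */
        accum = (lookup << 56) | (accum >> 8);
    }
    return accum;
}
```

After eight steps the accumulator holds all eight looked-up bytes, so a single 8-byte store can replace eight separate byte stores.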
127. sult is set if the corresponding value of data1_4_16 (data1_2_32) is less than the corresponding value of data2_4_16 (data2_2_32). For vis_fcmpge, each bit within the 4-bit or 2-bit compare result is set if the corresponding value of data1_4_16 (data1_2_32) is greater than or equal to the corresponding value of data2_4_16 (data2_2_32). The four 16-bit pixel comparison operations are illustrated in Figure 4-4, and the two 32-bit pixel comparison operations are illustrated in Figure 4-5.

[Figure 4-4: Four 16-bit Pixel Comparison Operations. fcmp[gt, le, eq, ne, lt, ge]16 compares the four 16-bit components of data1_4_16 and data2_4_16 and produces a 4-bit mask.]

[Figure 4-5: Two 32-bit Pixel Comparison Operation. fcmp[gt, le, eq, ne, lt, ge]32 compares the two 32-bit components of data1_2_32 and data2_2_32 and produces a 2-bit mask.]

Example
    int mask;
    vis_d64 data1_4_16, data2_4_16, data1_2_32, data2_2_32;

    /* data1_4_16 > data2_4_16 */
    mask = vis_fcmpgt16(data1_4_16, data2_4_16);
    mask = vis_fcmpne16(data1_4_16, data2_4_16);

mask may be use
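How a partitioned compare turns four signed 16-bit comparisons into a bit mask can be sketched in portable C. The code below is an illustration under assumptions: fcmpgt16_model and lane16 are hypothetical names, uint64_t stands in for vis_d64, and mask bit 0 is assumed to correspond to the least significant 16-bit component.

```c
#include <stdint.h>

/* Read the 16-bit component at position `lane` (0 = bits 15:0) and
 * interpret it as a signed two's-complement value. */
static int32_t lane16(uint64_t v, int lane)
{
    int32_t u = (int32_t)((v >> (16 * lane)) & 0xffffu);
    return (u >= 0x8000) ? u - 0x10000 : u;
}

/* Hypothetical model of a greater-than compare on four 16-bit
 * components (illustration only). A mask bit is set where the
 * component of the first operand is greater than that of the second. */
int fcmpgt16_model(uint64_t data1_4_16, uint64_t data2_4_16)
{
    int mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (lane16(data1_4_16, lane) > lane16(data2_4_16, lane))
            mask |= 1 << lane;
    }
    return mask;
}
```

The other comparison variants differ only in the relational operator used inside the loop, and the 32-bit forms in the component width and the number of mask bits.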
128. t(data_32);    /* Causes the desired bit pattern to be placed into f */

4.3.5 vis_to_double, vis_to_double_dup

Function
Place two vis_u32 values into a vis_d64 variable.

Syntax
    vis_d64 vis_to_double(vis_u32 data1_32, vis_u32 data2_32);
    vis_d64 vis_to_double_dup(vis_u32 data_32);

Description
vis_to_double() places two vis_u32 variables, data1_32 and data2_32, in the upper and lower halves of a vis_d64 variable. vis_to_double_dup() places the same vis_u32 variable, data_32, in the upper and lower halves of a vis_d64 variable.

Example
    vis_u32 data1_32, data2_32;
    vis_d64 result1_64, result2_64;

    /* data1_32 in upper half and data2_32 in lower half */
    result1_64 = vis_to_double(data1_32, data2_32);
    /* data1_32 in upper and lower halves */
    result2_64 = vis_to_double_dup(data1_32);

vis_to_double_dup(data1_32) is equivalent to vis_to_double(data1_32, data1_32).

4.4 VIS Logical Instructions
These instructions include logical operations involving none, one, or two arguments.

4.4.1 vis_fzero, vis_fzeros, vis_fone, vis_fones

Function
Set variable to all ones (base 2) or clear variable to zero.

Syntax
    vis_d64 vis_fzero(void);
    vis_f32 vis_fzeros(void);
    vis_d64 vis_fone(void);
    vis_f32 vis_fones(void);

Description
vis_fzero() and vis_fzeros() return vis_d64 and vis_f32 zero-filled variables, and vis
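The placement of the two 32-bit halves described above for vis_to_double and vis_to_double_dup can be expressed with ordinary integer operations. This is a minimal sketch with hypothetical names (to_double_model, to_double_dup_model), in which uint64_t stands in for vis_d64 and only the resulting bit pattern is modelled.

```c
#include <stdint.h>

/* Hypothetical model of vis_to_double (illustration only):
 * data1_32 fills the upper half, data2_32 the lower half. */
uint64_t to_double_model(uint32_t data1_32, uint32_t data2_32)
{
    return ((uint64_t)data1_32 << 32) | data2_32;
}

/* Hypothetical model of vis_to_double_dup (illustration only):
 * the same value fills both halves. */
uint64_t to_double_dup_model(uint32_t data_32)
{
    return to_double_model(data_32, data_32);
}
```

Defining the duplicate form in terms of the two-argument form mirrors the equivalence stated in the example.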
129. t and Graphics Unit

A maximum of two floating-point/graphics operations (FGops) and one FP load/store operation are executed in every cycle, plus another integer or branch instruction. All operations, except for divide and square root, are fully pipelined. Divide and square root operations complete out of order without inhibiting the concurrent execution of other FGops. The two graphics units are both fully pipelined and perform operations on 8- or 16-bit pixel components with 16- or 32-bit intermediate results. The Graphics Adder performs single-cycle partitioned add and subtract, data alignment, merge, expand, and logical operations. Four 16-bit adders are utilized, and a custom shifter is implemented for byte concatenation and variable byte-length shifting. The Graphics Multiplier performs three-cycle partitioned multiplication, compare, pack, and pixel distance operations. Four 8x16 multipliers are utilized, and a custom shifter is implemented. Eight 8-bit pixel subtractions, absolute values, additions, and a final alignment are required for each pixel distance operation.

2.3.8 Load/Store Unit (LSU)

The Load/Store Unit (LSU) executes all instructions that transfer data between the memory hierarchy and the Integer and Floating-Point/Graphics Register files. The LSU includes the Data Cache, Load Buffer, Store Buffer, and is very closely coupled to the second-level external cach
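The pixel distance operation mentioned in snippet 129 (eight subtractions, absolute values, and additions) can be illustrated with a scalar C sketch. The function name pixel_distance is hypothetical, and adding the sum to a running accumulator is an assumption about how the operation is used, not something stated in the text above.

```c
#include <stdint.h>

/* Hypothetical scalar model of a pixel distance step: for each of the
 * eight unsigned bytes in a and b, take the absolute difference and
 * add it to accum. The accumulator argument is an assumption made for
 * illustration. */
uint64_t pixel_distance(uint64_t a, uint64_t b, uint64_t accum)
{
    for (int i = 0; i < 8; i++) {
        unsigned x = (unsigned)(a >> (8 * i)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * i)) & 0xFFu;

        accum += (x > y) ? (x - y) : (y - x);
    }
    return accum;
}
```

Written this way, the work that the hardware performs in one pipelined operation takes eight loop iterations, which gives a feel for why such an operation is attractive for block matching.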
130. ta Formats

4.2.2 Fixed Data Formats

Fixed data values provide an intermediate format with enough precision and dynamic range for filtering and simple image computations on pixel values. Conversion from pixel data to fixed data occurs through pixel multiplication or application of the vis_fexpand instruction. Conversion from fixed data to pixel data is done with the pack instructions, which clip and truncate to an 8-bit unsigned value. Conversion from 32-bit fixed to 16-bit fixed is also supported with the vis_fpackfix instruction. Rounding can be performed by adding one to the round bit position. Complex calculations needing more dynamic range or precision should be performed using floating-point data.

4.2.3 Include Directives

The following include directives apply to all code examples:

    #include "vis_types.h"
    #include "vis_proto.h"

4.3 Utility Inlines

Utility inlines are not part of the VIS extension and are included to complement the use of the VIS. These instructions offer the ability to read and write upper and lower components of floating-point registers and to modify the contents of the Graphics Status Register.

4.3.1 vis_write_gsr, vis_read_gsr

Function: Assign a value to the Graphics Status Register (GSR) and read the Graphics Status Register.

Syntax:

    unsigned int vis_read_gsr();
    void vis_write_gsr(unsigned int gsr);

Description: vis
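The remark in snippet 130 that rounding can be performed by adding one to the round bit position can be sketched in portable C. The helper name fixed_to_pixel and its frac_bits parameter are hypothetical; the sketch models a single 16-bit fixed value, not the partitioned pack instructions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert one signed 16-bit fixed-point value
 * with frac_bits fractional bits to an 8-bit unsigned pixel.
 * Rounding is done by adding one at the round bit position (the bit
 * just below the binary point) before truncating; the result is then
 * clipped to 0..255, as the pack step is described to do. */
uint8_t fixed_to_pixel(int16_t value, unsigned frac_bits)
{
    assert(frac_bits >= 1 && frac_bits <= 15);

    int32_t rounded = (int32_t)value + ((int32_t)1 << (frac_bits - 1));

    if (rounded < 0)
        return 0;                       /* clip negative values */

    int32_t whole = rounded >> frac_bits;   /* truncate the fraction */

    return (whole > 255) ? 255 : (uint8_t)whole;
}
```

With 8 fractional bits, 0x0180 (1.5) rounds up to 2 while 0x017F (just under 1.5) truncates to 1, which is the difference the added round bit makes.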
131. tc and linking with $INCASHOME/util/incas_utils.o. Actually, not including them in your code is preferable.

Replace all dynamically allocated arrays and variables with statically declared ones. For example, replace

    char *a;
    a = malloc(512);

by

    char a[512];

Insert pseudo-breakpoint routines into your code:

    sim_break0();
    vis_fpadd16(a, b);
    sim_break1();

where

    void sim_break0() {}
    void sim_break1() {}

Recompile and statically link your VIS code, using the -dn option, with the INCAS utility routines incas_utils.o, the map file prom.ld, the traps routines traps.o, and the static library libc.a. When compiling INCAS-modified code, you may use all of the compiler flags as if compiling for execution on the UltraSPARC:

    cc -c vis.il -xchip=ultra -xarch=v8plusa file1.c
    cc -c vis.il -xchip=ultra -xarch=v8plusa file2.c
    ld -dn -M prom.ld traps.o incas_utils.o file1.o file2.o /usr/lib/libc.a -o file

There is a makefile in the directory $VSDKHOME/examples/src. You can use it to prepare the binary for the following sections:

    make -f Makefile example3

Because INCAS calculates the processor states cycle by cycle, it is very computationally intensive. It is therefore recommended that you remove all nonessential functions and statements from your code and concentrate on those parts that you wish to debug or cycle count.

3.6.4 Starting INCAS

To start I
132. teger Unit Register File (IURF), and it routes valid data to each integer functional unit. The G Stage sends up to two floating-point or graphics instructions out of the four candidates to the Floating-Point and Graphics Unit (FGU). Additionally, the logic in the G Stage is responsible for comparing register addresses for integer data bypassing and for handling pipeline stalls due to interlocks.

2.5.4 Stage 4: Execution (E) Stage

In this stage, data from the integer register file is processed by the two integer ALUs if the instruction group includes ALU operations. Results are computed and are available for other instructions, through bypasses, in the very next cycle. The virtual address of a memory operation is also calculated in this stage, in parallel with ALU computation. In the Floating-Point/Graphics pipe, this stage corresponds to the Register (R) Stage of the FGU. The floating-point register file is accessed during this cycle. The instructions are also further decoded, and the FGU control unit selects the proper bypasses for the current instructions.

2.5.5 Stage 5: Cache Access (C) Stage

In this stage, the virtual addresses of memory operations calculated in the E Stage are sent to the tag RAM to determine if the access (load or store type) is a hit or a miss in the D-Cache. In a parallel operation, the virtual address is sent to the data MMU to be translated into
133. tion describes how to extend the XIL library using arbitrary VIS programming. It is essentially the same technique as described in the XIL Programmer's Guide, but using VIS instead of C code. The XIL Programmer's Guide contains additional information about gaining access to the image data within the XIL framework. It is available as part of the Solaris Software Developer Kit (SDK).

Consider the following part of a simple XIL application:

    XilSystemState state;
    XilImage im1, im2;
    XilKernel kernel;
    float scale[1], offset[1];

    state = xil_open();

    /* Create images: single band BYTE data */
    im1 = xil_create(state, width, height, 1, XIL_BYTE);
    im2 = xil_create(state, width, height, 1, XIL_BYTE);

    /* Load data into im1 */
    /* Load data into the kernel object */

    /* Do a convolve with zero filled edges */
    xil_convolve(im1, im2, kernel, XIL_EDGE_ZERO_FILL);

Since the convolve operation is already accelerated using VIS, there is little benefit in writing explicit VIS code to implement it. Suppose, however, that after the convolve we wished to multiply the result of the convolution with itself, taking the upper 8 bits of the multiplicand. (XIL, via the xil_multiply function, returns the lower 8 bits.) We first must get the data pointers to the image. XIL requires that images be exported prior to user data access, so we add the following code:
134. tioned C Sun Microsystems Inc 123 VIS Instruction Set User s Manual finclude stdlib h finclude vis types h finclude vis proto h fdefine max a b a b a define min a b a lt b a unsigned long long vis sumabsdiff vis u8 framel int fllb vis_u8 frame2 int f21b int flx int fly int f2y int sx int sy int sh int sw framel pointer to byte data of frame 1 fllb 4 of bytes in one row of frame 1 width frame2 pointer to byte data of frame 2 f21b 4 of bytes in one row of rame 2 width fix f2y upper left corner of 16x16 block in frame 1 f2x f2y upper left corner of 16x16 block in frame 2 sx sy upper left corner of search area in frame 1 sh sw height and width of search area in frame 1 dst pointer to first sample of destination data xl start point in framel vis u8 sal framel fllb fly f1x vis u8 sa2 frame2 f21b f2y flx start point in frame2 vis u8 sll s12 vis d64 spl 8 byte aligned start point in framel vis d64 sp2 8 byte aligned start point in frame2 vis d64 sdl sll s10 source data vis d64 sd2 s21 s20 vis d64 accum accumulated sum of differences union vis d64 d64 unsigned long long ull result int i j int x y nx ny nx8 find intersection of search area and 16x16 block starting at flx fly x max sx fix nx min sx sw f1x 16 x new width in bytes y
135. tive the output is clamped to 0; if greater than 255, it is clamped to 255. Otherwise, the eight bits to the left of the binary point are taken as the output. Another way to conceptualize this process is to think of the binary point as lying to the left of bit (22 - scale factor), i.e., (23 - scale factor) bits of fractional precision. The 4-bit scale factor can take any value between 0 and 15 inclusive. This means that 32-bit partitioned variables which are to be packed using vis_fpack32 may have between 8 and 23 fractional bits.

The following code example takes four variables, red, green, blue, and alpha, each containing data for two pixels in a 32-bit partitioned format (r0r1, g0g1, b0b1, a0a1), and produces a vis_d64 pixels value containing eight 8-bit quantities (r0g0b0a0r1g1b1a1).

    vis_d64 red, green, blue, alpha, pixels;

    /* red, green, blue, and alpha contain data for 2 pixels */
    pixels = vis_fpack32(red, pixels);
    pixels = vis_fpack32(green, pixels);
    pixels = vis_fpack32(blue, pixels);
    pixels = vis_fpack32(alpha, pixels);

    /* The result is two sets of red, green, blue, and alpha values
       packed in pixels */

[Figure 4-18: vis_fpack32 operation. Inputs data_2_32 and data_8_8 (bits 63..0), GSR.scale_factor = 0110, output result.]

4.7.3 vis_fpackfix

Function
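The scaling and clamping rule for vis_fpack32 in snippet 135 can be checked with a small integer model. This is a sketch under stated assumptions, not the intrinsic: the names pack32_element and fpack32_model are hypothetical, and the merge step (shifting data_8_8 left by 8 bits and placing each packed byte in the low byte of its 32-bit half) is an assumption chosen so that four successive calls give the r0g0b0a0r1g1b1a1 layout described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of packing one 32-bit element: shift left by the
 * scale factor, take the bits to the left of the binary point (which
 * sits between bits 23 and 22 after scaling), and clamp to 0..255. */
uint8_t pack32_element(int32_t value, unsigned scale_factor)
{
    assert(scale_factor <= 15);

    if (value < 0)
        return 0;                              /* negative clamps to 0 */

    uint64_t scaled = ((uint64_t)value << scale_factor) >> 23;

    return (scaled > 255) ? 255 : (uint8_t)scaled;
}

/* Hypothetical model of the whole operation on 64-bit bit patterns.
 * Assumption: data_8_8 is shifted left by 8 bits and the two packed
 * bytes replace the low byte of the upper and lower 32-bit halves. */
uint64_t fpack32_model(uint64_t data_2_32, uint64_t data_8_8,
                       unsigned scale_factor)
{
    uint8_t hi = pack32_element((int32_t)(uint32_t)(data_2_32 >> 32),
                                scale_factor);
    uint8_t lo = pack32_element((int32_t)(uint32_t)data_2_32,
                                scale_factor);
    uint64_t shifted = (data_8_8 << 8) & ~0x000000FF000000FFULL;

    return shifted | ((uint64_t)hi << 32) | lo;
}
```

With a scale factor of 6, the inputs are read as having 17 fractional bits, so an element holding 10 << 17 packs to the byte value 10.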
136. uctions. A description of Array Instructions. Code examples illustrating VIS.

4.2 Data Types Used

Figure 4-1 illustrates the data types used.

    Data type         C type     Bits
    Signed byte       vis_s8     7..0
    Unsigned byte     vis_u8     7..0
    Signed short      vis_s16    15..0
    Unsigned short    vis_u16    15..0
    Signed long       vis_s32    31..0
    Unsigned long     vis_u32    31..0
    Float             vis_f32    31..0
    Double            vis_d64    63..0

Figure 4-1: Graphics Data Formats

All VIS signed values are 2's complement.

4.2.1 Partitioned Data Formats

Figure 4-2 illustrates some of the partitioned data formats used.

vis_f32 (four 8-bit fields, bits 31..0): An example of four 8-bit unsigned integers contained in a 32-bit variable. Typically they represent intensity values for an image pixel (e.g., a, B, G, R).

vis_f32 (s16, s16; bits 31..16 and 15..0): An example of two 16-bit signed fixed-point values contained in a 32-bit variable. For example, they may represent filter coefficients or scaling factors.

vis_d64 (s16, s16, s16, s16; bits 63..0): An example of four 16-bit signed fixed-point values contained in a vis_d64 variable. For example, they may represent the result of partitioned multiplication.

vis_d64 (eight 8-bit fields, bits 63..0): An example of eight 8-bit values contained in a vis_d64 variable. Typically they would represent two pixels.

Figure 4-2: Partitioned Da
137. Figure 4-4 Four 16-bit Pixel Comparison Operations
Figure 4-5 Two 32-bit Pixel Comparison Operation
Figure 4-6 vis_fpadd16 and vis_fpsub16 operation
Figure 4-7 vis_fpadd32 and vis_fpsub32 operation
Figure 4-8 vis_fpadd16s and vis_fpsub16s operation
Figure 4-9 vis_fpadd32s and vis_fpsub32s
Figure 4-10 vis_fmul8x16 operation
Figure 4-11 vis_fmul8x16au operation
Figure 4-12 vis_fmul8x16al operation
Figure 4-13 vis_fmul8sux16 operation
Figure 4-14 vis_fmul8ulx16 operation
Figure 4-15 vis_fmuld8sux16 operation
Figure 4-16 vis_fmuld8ulx16 operation
Figure 4-17 vis_fpack16 operation
Figure 4-18 vis_fpack32 operation
Figure 4-19 vis_fpackfix operation
Figure 4-20 vis_fexpand operation
Figure 4-21 vis_fpmerge operation
Figure 4-22 vis_alignaddr example
Figure 4-23 vis_faligndata example
Figure 4-24 Start Point Handling in vis_inverse8a
Figure 4-25 Start Point Handling
138. ust be multiple of 8 T U ESCRIPTION dst 255 src inimum size of stack frame according to SPARC ABI efine MINFRAME 96 ENTRY provides the standard procedure entry code efine ENTRY x align 4 global x SET_SIZE trails a function and sets the size for the ELF symbol table define SET_SIZE x size X R SPARC have four integer register groups i registers i0 to i7 hold input data o registers 00 to 07 hold output data l registers 10 to 17 hold local data g registers g0 to g7 hold global data Note that g0 is always zero write to it has no program visible effect When calling an assembly function the first 6 arguments are stored in i registers from i0 to i5 The rest arguments are Stored in stack Note that i6 is reserved for stack pointer and i7 for return address define src 10 define dst Sil define sz 12 frame pointer i6 return addr 17 stack pointer 06 call link 07 define sa 10 define da 11 define lpcnt 12 Sun Microsystems Inc 95 VIS Instruction Set User s Manual define sd S 0 define dd tf2 Section save sp MINFRAM do some error checking tst sz ble pn icc ret calculate loop count sra Sz 3 lpcnt mov src sa mov dst da sub da 8 da ldd sa sd loop add da 8 da add sa 8 sa fnotl sd dd deccc lpcnt std dd da bg pt Sicc loop ldd sa sd
139. xemplifies the use of multiple cases based on input alignment, as well as a common trick for consolidating output writes, to demonstrate performance improvement over a standard C implementation.

The function to be performed, as written for C, is:

    for (i = 0; i < width; i++)
        dst[i] = table[input[i]];

Using the VIS instructions that permit up to eight 8-bit loads and stores per cycle increases the performance considerably. Writing 8 bytes at a time, however, requires the destination to be double-word aligned. The required alignment is achieved by a small initial loop which processes pixels naively until the destination becomes aligned. Unpacking the source bytes requires the use of shifts and logical ands. Since the source may not be single-word aligned as required, the source pointer is aligned dynamically, and the pattern of byte extractions is determined by its original alignment. If the pointer was unaligned, some readahead is needed to span the boundaries between each chunk of four source bytes. In order to avoid reading beyond the end of the source, one is subtracted from the loop trip count, and another naive byte-by-byte loop at the end of the routine is performed to handle any leftover pixels. Consolidation of the output bytes is performed using vis_faligndata with the GSR alignment bits set to 7. The result of accum vis_fal
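The loop structure described in snippet 139 (a naive loop until the destination is 8-byte aligned, a main loop that writes 8 bytes at a time, and a naive loop for leftovers) can be sketched in portable C. This is an illustration of the control flow only: the name lookup8 is hypothetical, and memcpy of a local 8-byte buffer stands in for the VIS output consolidation and 8-byte store, which are not modeled here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable sketch of an 8-bit table lookup that writes
 * the destination 8 bytes at a time once it is 8-byte aligned. */
void lookup8(uint8_t *dst, const uint8_t *input,
             const uint8_t table[256], size_t width)
{
    size_t i = 0;

    /* Initial naive loop: run until dst + i is 8-byte aligned. */
    while (i < width && ((uintptr_t)(dst + i) & 7u) != 0) {
        dst[i] = table[input[i]];
        i++;
    }

    /* Main loop: gather eight looked-up bytes, then write them with a
     * single 8-byte copy to the aligned destination. */
    for (; i + 8 <= width; i += 8) {
        uint8_t chunk[8];

        for (int k = 0; k < 8; k++)
            chunk[k] = table[input[i + k]];
        memcpy(dst + i, chunk, sizeof chunk);
    }

    /* Final naive loop: handle any leftover pixels. */
    for (; i < width; i++)
        dst[i] = table[input[i]];
}
```

The result is identical to the plain byte-by-byte loop for any alignment of dst; only the grouping of the writes differs.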
