An Approach in Nine Exercises

1. [Figure residue: first code section of the memory alignment exercise, using move.2w and move.4w.] Memory Alignment Exercise. data = 0x01 0x23 0x45 0x67 0x89 0xAB 0xCD 0xEF 0xAA 0xBB 0xCC 0xDD 0xEE 0xFF 0x11 0x22. [Table residue: blank Expected versus Simulator register table for move.w, move.2w, move.2f, and move.4w (registers d0 through d8), covering the second code section.] Compile the Ex4.c file: ccsc100 -be Ex4.c -o Ex4.eld. The Big Endian (-be) option is used in this exercise to make it easier to read the data in the simulator memory window. If desired, the Little Endian mode can also be used. Run the GUI simulator: guisc100. In the simulator command window, type reset d ml to put the simulator in Big Endian mode. Open an assembly window (Windows > Assembly). Load the file (Load Ex4.eld). Set a breakpoint on main by typing break main into the command window. Type go. The code should now be at the start of main. Open a memory window (Windows > Memory) and click OK. Type data into the Scroll box of the memory window to display the contents of the array data defined in Ex4.c. Verify that these contents are as expected. Type next to step through
2. var2); res1 = L_mac(res1, a[i+2], var3); res2 = L_mac(res2, a[i+2], var0); res3 = L_mac(res3, a[i+2], var1); var1 = *x_ptr--; /* var1 = x[n-i-3] */ res0 = L_mac(res0, a[i+3], var1); res1 = L_mac(res1, a[i+3], var2); res2 = L_mac(res2, a[i+3], var3); res3 = L_mac(res3, a[i+3], var0); var0 = *x_ptr--; /* var0 = x[n-i-4] */ } /* Truncate results and store in y */ y[n] = extract_h(res0); y[n+1] = extract_h(res1); y[n+2] = extract_h(res2); y[n+3] = extract_h(res3); x_ptr += 20; /* Increment pointer by 20 to point to x[n+7] for next iteration */ } /* Print results y */ for (n = 0; n < 32; n++) printf("y[%d] = 0x%04hX\n", n, y[n]); } Exercise 7. /* Freescale Semiconductor Inc. COPYRIGHT 1999 Freescale Semiconductor Inc. INTRODUCTION TO THE SC140 TOOLS */ short array1[10] = {1, 1, 1, 2, 2, 2, 2, 2, 3, 3}; short array2[10]; main() { short i; short *array2_ptr; short tmp; array2_ptr = &array2[0]; for (i = 0; i < 10; i++) { tmp = array1[i]; if (tmp l
3. [Equation 3, truncated at the start of this excerpt: $\ldots + x(i+2)\,x(i+2) + x(i+3)\,x(i+3)$, summed over $i = 0, 4, 8, \ldots$] Equation 3 explicitly highlights the four multiply-accumulate operations that can be performed in parallel. Figure 9 highlights where each parallel execution is represented by Group 0, Group 1, and so on. It also shows that the sample number i from one group to the other is incremented by four. [Figure 9. Signal Power Calculation Using the Split Summation Technique.] 1. Open the Ex5.c file. 2. Build the code with -Ot2, then run it and notice the output result. 3. Split the current implementation of the loop, that is, res = L_mac(res, x[i], x[i]), into four independent equations as represented in Figure 9. Independent means that the four equations are accumulated into different variables. Therefore, create four variables, one for each product. Tip: Watch your index increment. 4. Recompile the file and run it. The output result should be the same as before. 5. Recompile with the -S option and view the .sl file. 6. Your code is optimized when the loop is only one cycle and computes four operations at a time. If the inner loop is equal to one cycle for four operations and the result is still correct, congratulations! You have completed Exercise 5. 7. In the box provided below, write the optimized inner loop code. The split summation technique allows
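The split described in step 3 can be sketched in portable C. This is an illustration only, not the Ex5.c solution: plain integer multiplies stand in for the L_mac intrinsic so that the code builds with any C compiler, and the function names energy_single and energy_split4 are hypothetical. The sketch assumes the signal length is a multiple of four.

```c
/* Sum of squares with one accumulator: every product depends on the
 * previous accumulation, so the operations form a serial chain. */
static long energy_single(const short *x, int n)
{
    long res = 0;
    for (int i = 0; i < n; i++)
        res += (long)x[i] * x[i];
    return res;
}

/* Split summation: four independent accumulators, index stepped by four.
 * Assumes n is a multiple of 4 (an assumption of this sketch). */
static long energy_split4(const short *x, int n)
{
    long res0 = 0, res1 = 0, res2 = 0, res3 = 0;
    for (int i = 0; i < n; i += 4) {
        res0 += (long)x[i]     * x[i];
        res1 += (long)x[i + 1] * x[i + 1];
        res2 += (long)x[i + 2] * x[i + 2];
        res3 += (long)x[i + 3] * x[i + 3];
    }
    /* Combine pairwise so the final additions also avoid one long chain. */
    return (res0 + res1) + (res2 + res3);
}
```

Because the four partial sums never depend on one another inside the loop, a compiler for a four-ALU machine is free to schedule them in the same execution set. With non-saturating integer arithmetic, as here, both functions return the same value.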
4. instructions are mapped based on the type of the arithmetic required. For integer arithmetic, the compiler generates integer instructions, for example, imac. For fractional arithmetic, it generates fractional instructions, for example, mac. Also, move instructions are generated with correct data alignment.
Figure 3. Integer and Fractional Compiler Support
Integer | Fractional (supported by intrinsics)
long a; short b, c; | long a; short b, c;
a = a + b * c; | a = L_mac(a, b, c);
move.w (r0),d0 | move.f (r0),d0
imac d0,d1,d2 | mac d0,d1,d2
The energy of a signal x, represented by Equation 1, is considered: $y = \sum_{i=0}^{N-1} x(i)\,x(i)$ (Equation 1), where x(i) is the signal input sample at iteration i, y is the energy of the signal, and N is the signal length. 1. Open the example file Ex2.c. Integer Arithmetic: 2. Compile the file using ccsc100 -Ot2 Ex2.c -o Ex2.eld, where the -Ot2 option optimizes the code for time (Force Parallelization). 3. Run the executable using runsc100. 4. Recompile the file with the -S option, which stops the compiler after compilation. 5. Open the generated assembly file Ex2.sl and look at the integer instructions within the loop. 6. In the box provided here, write down the integer C code and the generated assembly instructions for the loop. Notice that the first data load is automatically pipelined in the software. Integer Arith
5. Local Versus Global Optimization Exercise. [Figure 6. Files for the Local Versus Global Optimization Exercise. The surviving fragment shows the function Prod, which returns long and takes two short array arguments.] 1. Open the two files and understand their functionality. Local Optimization: 2. Compile the two files: ccsc100 -Ot2 Ex3_main.c Ex3_prod.c -o Ex3.eld. 3. Run the code: runsc100 -t Ex3.eld. The -t option for runsc100 enables the cycle count generation. Write the cycle count in the box below. [Box: Local Optimization (Default Mode), Cycle Count.] Global Optimization: 4. Compile the files using global optimization: ccsc100 -Ot2 -Og Ex3_main.c Ex3_prod.c -o Ex3_glo.eld, where -Og is the global optimization option. 5. Run the code: runsc100 -t Ex3_glo.eld. Write the cycle count in the box below. [Box: Global Optimization (-Og option), Cycle Count.] To understand how global optimization makes best use of available information, perform these steps: 1. Recompile the application with the -S option (Stop After Compilation) and with local optimization: ccsc100 -Ot2 Ex3_main.c Ex3_prod.c -S. 2. Rename the .sl files as Ex3_main1.sl and Ex3_prod1.sl. 3. Open the files to see what the compiler has produced. 4. Enable global optimization: ccsc100 -Ot2 -Og Ex3_main.c Ex3_pro
6. Solutions to Exercises. Exercise 1. /* Freescale Semiconductor Inc. COPYRIGHT 1999. INTRODUCTION TO THE SC140 TOOLS */ #include <stdio.h> main() { printf("Welcome to StarCore SC140 Tools\n"); } Exercise 2. /* Freescale Semiconductor Inc. COPYRIGHT 1999 Freescale Semiconductor Inc. INTRODUCTION TO THE SC140 TOOLS */ #include <stdio.h> #include <prototype.h> short x[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}; main() { short i; long res = 0; long fres = 0; for (i = 0; i < 12; i++) res += x[i] * x[i]; for (i = 0; i < 12; i++) fres = L_mac(fres, x[i], x[i]); printf("The integer result is %d 0x%x\n", res, r
7. by the simulator are not aligned operations. If this is not taken into account, unpredictable results can occur when migrating to the hardware, which requires aligned data. Solutions to Exercises. Exercise 5. /* Freescale Semiconductor Inc. COPYRIGHT 1999 Freescale Semiconductor Inc. INTRODUCTION TO THE SC140 TOOLS */ /* Split Summation Technique Exercise */ #include <stdio.h> #include <prototype.h> short x[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}; main() { short i; long res1 = 0, res2 = 0, res3 = 0, res4 = 0; for (i = 0; i < 12; i += 4) { res1 = L_mac(res1, x[i], x[i]); res2 = L_mac(res2, x[i+1], x[i+1]); res3 = L_mac(res3, x[i+2], x[i+2]); res4 = L_mac(res4, x[i+3], x[i+3]); } /* To optimise the code further, break the following dependency: res1 = res1 + res2 + res3 + res4 into: */ res1 = res
8. perform both fractional and integer arithmetic. This exercise presents a reminder about integer and fractional arithmetic representation and then shows how to use the StarCore compiler fractional intrinsics. Values stored in memory or registers are interpreted differently depending on the operation performed. For integers, the binary point is considered to be immediately to the right of the LSB. For the fractional case, the binary point is considered to be immediately to the right of the MSB. Table 1 illustrates this for 16-bit data values.
Table 1. Interpretation of 16-bit Integer and Fractional Data Values
Binary Representation | Hexadecimal Representation | Integer Value (decimal) | Fractional Value (decimal)
0100 0000 0000 0000 | 0x4000 | 16384 | 0.5
0001 0000 0000 0000 | 0x1000 | 4096 | 0.125
0000 0000 0000 0000 | 0x0000 | 0 | 0.0
1100 0000 0000 0000 | 0xC000 | -16384 | -0.5
1111 0000 0000 0000 | 0xF000 | -4096 | -0.125
2.1 Hardware Support on StarCore. StarCore has a dual instruction set for operations that produce different results depending on whether fractional or integer arithmetic is used. The instruction set is complementary when an integer or a fractional operation leads to the same result regardless of the operation type, for example, an addition. The instruction set is dual
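The two readings of a 16-bit pattern in Table 1 can be checked with a few lines of portable C. This is a minimal sketch for a host compiler, not StarCore code; the helper names as_integer and as_fraction are hypothetical.

```c
#include <stdint.h>

/* Read a 16-bit pattern as a two's-complement integer
 * (binary point immediately to the right of the LSB). */
static int32_t as_integer(uint16_t bits)
{
    return (bits & 0x8000u) ? (int32_t)bits - 65536 : (int32_t)bits;
}

/* Read the same pattern as a fraction
 * (binary point immediately to the right of the MSB, the sign bit),
 * which is the integer reading divided by 2^15. */
static double as_fraction(uint16_t bits)
{
    return as_integer(bits) / 32768.0;
}
```

For example, as_integer(0xC000) gives -16384 and as_fraction(0xC000) gives -0.5, matching the fourth row of Table 1.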
9. the fractional assembly instructions generated to the assembly integer instructions. 12. Recompile the code without the -S option to produce an executable file. 13. Run the code using runsc100. The variables res and fres should print to the screen. What is the algebraic relationship between these two variables? Congratulations, you have completed Exercise 2. Good To Know: To perform fractional operations: • Intrinsics are used. The variable types remain integer. • The header file prototype.h should be included in the C source file. All assembly instructions (compiler-generated or hand-written) between square brackets execute in parallel as a single execution set. 3 Local Versus Global Optimization Exercise. The local versus global optimization exercise shows the difference between two C compiler options: local optimization (the default) and global optimization. Local optimization compiles each file of the project individually, as represented in Figure 4. Global optimization acts as a global binder that links all the intermediate representation (IR) files into one file before optimizing the application. Since all the application code information is available, this approach enables further optimizations beyond those achieved using local optimization alone. Compilation
10. wide data move instructions if alignment is not guaranteed. However, if a function is implemented in assembly language and uses wide data move instructions, you must ensure that the data is aligned on the appropriate boundary. Otherwise, the wrong data is transferred. [Figure 7. Memory Granularity: two 8-byte rows of memory (AA BB CC DD EE FF AB BC and 01 23 45 67 89 AB CD EF) with address labels P 0x00, P 0x08, and P 0x10.] The following instructions bring more than one byte at a time to the data register:
move.w (Rx),Dn | Transfer one 16-bit word from memory (2 bytes)
move.f (Rx),Dn | Transfer one 16-bit word from memory (2 bytes)
move.2w (Rx),Dh | Transfer two 16-bit words from memory (4 bytes)
move.2f (Rx),Dh | Transfer two 16-bit words from memory (4 bytes)
move.4w (Rx),Dk | Transfer four 16-bit words from memory (8 bytes)
move.4f (Rx),Dk | Transfer four 16-bit words from memory (8 bytes)
move.2l (Rx),Dh | Transfer two 32-bit words from memory (8 bytes)
where x spans from 0 to 15, and the data register notations are as follows: • Dn represents D0, D1, D2, D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, or D15. • Dh represents D0:D1, D2:D3, D4:D5, D6:D7, D8:D9, D10:D11, D12:D13, or D14:D15. • Dk represents D0:D1:D2:D3, D4:D5:D6:D7, D8:D9:D10:D11, or D12:D13:D14:D15. M
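A quick way to reason about whether an address is suitable for a wide move is to test it against the access width. The sketch below assumes that an access of 2, 4, or 8 bytes must start on a multiple of its own width, which is the usual reading of "aligned on the appropriate boundary"; the helper name is_aligned is hypothetical, and the code is portable C rather than StarCore assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when addr is a multiple of width_bytes, 0 otherwise.
 * width_bytes must be a power of two (2, 4, or 8 for the moves above). */
static int is_aligned(uintptr_t addr, size_t width_bytes)
{
    return (addr & (uintptr_t)(width_bytes - 1)) == 0;
}
```

Under this assumption, address 0x08 is acceptable for an 8-byte move.4w, while address 0x02 is acceptable for a 2-byte move.w but not for a 4-byte move.2w.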
11. *x_ptr--; /* var3 = x[n+3] */ var2 = *x_ptr--; /* var2 = x[n+2] */ var1 = *x_ptr--; /* var1 = x[n+1] */ var0 = *x_ptr--; /* var0 = x[n] */ /* x_ptr now points to x[n-1] */ for (i = 0; i < 12; i++) { res0 = L_mac(res0, a[i], var0); res1 = L_mac(res1, a[i], var1); res2 = L_mac(res2, a[i], var2); res3 = L_mac(res3, a[i], var3); var3 = var2; var2 = var1; var1 = var0; var0 = *x_ptr--; /* var0 = x[n-i-1] */ } /* Truncate results and store in y */ y[n] = extract_h(res0); y[n+1] = extract_h(res1); y[n+2] = extract_h(res2); y[n+3] = extract_h(res3); x_ptr += 20; /* Increment pointer by 20 to point to x[n+7] for next iteration */ } /* Print results y */ for (n = 0; n < 32; n++) printf("y[%d] = 0x%04hX\n", n, y[n]); } Further Optimizing the Speed. /* Freescale Semiconductor Inc. COPYRIGHT 1999 Freescale Semiconductor Inc. INTRODUCTION TO THE SC140 TOOLS
12. C0, y[15] = 0x0E80, y[16] = 0x0E40, y[17] = 0x0E00, y[18] = 0x0DC0, y[19] = 0x0D80, y[20] = 0x0D40, y[21] = 0x0D00, y[22] = 0x0CC0, y[23] = 0x0C80, y[24] = 0x0C40, y[25] = 0x0C00, y[26] = 0x0BC0, y[27] = 0x0B80, y[28] = 0x0B40, y[29] = 0x0B00, y[30] = 0x0AC0, y[31] = 0x0A80 */ main() { long res0, res1, res2, res3; short var0, var1, var2, var3; short n, i, *x_ptr; x_ptr = &input[14]; /* x_ptr points to input[14], which is x[3] */ for (n = 0; n < 32; n += 4) { res0 = 0; res1 = 0; res2 = 0; res3 = 0; var3 = *x_ptr--; /* var3 = x[n+3] */ var2 = *x_ptr--; /* var2 = x[n+2] */ var1 = *x_ptr--; /* var1 = x[n+1] */ var0 = *x_ptr--; /* var0 = x[n] */ /* x_ptr now points to x[n-1] */ for (i = 0; i < 12; i += 4) { res0 = L_mac(res0, a[i], var0); res1 = L_mac(res1, a[i], var1); res2 = L_mac(res2, a[i], var2); res3 = L_mac(res3, a[i], var3); var3 = *x_ptr--; /* var3 = x[n-i-1] */ res0 = L_mac(res0, a[i+1], var3); res1 = L_mac(res1, a[i+1], var0); res2 = L_mac(res2, a[i+1], var1); res3 = L_mac(res3, a[i+1], var2); var2 = *x_ptr--; /* var2 = x[n-i-2] */ res0 = L_mac(res0, a[i+2],
13. Freescale Semiconductor Application Note AN2009, Rev. 1, 11/2004. Introduction to the StarCore SC140 Tools: An Approach in Nine Exercises. By Emmanuel Roy and David Crawford. This document presents a quick, comprehensive, hands-on introduction to the StarCore SC140 DSP core using programming examples and exercises. The goal is to help the software developer start writing high-level language applications in C. Included are software-related tips on how to get the most from the StarCore hardware architecture. We recommend that you complete the exercises in sequential order. The exercises require the use of the SC140 C tools, including compiler, assembler, linker, and simulator, to generate executable files from C and assembly language source files and to verify the code performance. The tools are invoked from a command prompt (DOS or UNIX). If you desire, you can use an integrated development environment (IDE). Be sure to consult the appropriate IDE manuals. This application note provides step-by-step instructions to walk you through the exercises included in the software accompanying it (AN2009SW.zip). You can download this zip file at the web site listed on the back cover of this document. Solutions to the exercises are provided at the end of this application note. The following StarCore software development tools were used in the development of the SC140 exercises. Later versions of the SC140 tools should generate similar or better resu
14. How to Reach Us. Home Page: www.freescale.com. E-mail: support@freescale.com. USA/Europe or Locations not listed: Freescale Semiconductor, Technical Information Center, CH370, 1300 N. Alma School Road, Chandler, Arizona 85224; 1 800 521 6274 or 1 480 768 2130; support@freescale.com. Europe, Middle East, and Africa: Freescale Halbleiter Deutschland GMBH, Technical Information Center, Schatzbogen 7, 81829 Munchen, Germany; 44 1296 380 456 (English), 46 8 52200080 (English), 49 89 92103 559 (German), 33 1 69 35 48 48 (French); support@freescale.com. Japan: Freescale Semiconductor Japan Ltd., Headquarters, ARCO Tower 15F, 1-8-1 Shimo-Meguro, Meguro-ku, Tokyo 153-0064, Japan; 0120 191014 or 81 3 5437 9125; support.japan@freescale.com. Asia/Pacific: Freescale Semiconductor Hong Kong Ltd., Technical Information Center, 2 Dai King Street, Tai Po Industrial Estate, Tai Po, N.T., Hong Kong; 800 2666 8080. For Literature Requests Only: Freescale Semiconductor Literature Distribution Center, P.O. Box 5405, Denver, Colorado 80217; 1 800 441 2447 or 303 675 2140; Fax: 303 675 2150; LDCForFreescaleSemiconductor@hibbertgroup.com. AN2009, Rev. 1, 11/2004. Information in this document is provided solely to enable system and software implementers to use Freescale Semiconductor products. There are no express or implied copyright license
15. as shown in Table 2, in two cases, which automatically take care of data alignment, zero filling, and sign extension: • When an integer or a fractional operation leads to a different result depending on the operation type, for example, a multiplication. • When data is transferred from/to memory.
Table 2. Fractional and Integer Assembly Language Instructions
Operation | Integer | Fractional
Multiply | impy | mpy
Multiply-accumulate | imac | mac
Move | move.b, move.w, move.2w, move.4w | move.f, move.2f, move.4f
2.2 Compiler Support on StarCore. The StarCore compiler implements fractional arithmetic using built-in intrinsic functions based on integer data types. Any fractional values or constants must therefore be defined using their integer equivalent. Useful relationships for deriving these integer representations from the fractional values are as follows: • 16-bit: Integer Value = Fractional Value × $2^{15}$ • 32-bit: Integer Value = Fractional Value × $2^{31}$ • 40-bit: Integer Value = Fractional Value × $2^{31}$. The names of the built-in intrinsics conform to the ITU/ETSI basic operation functions. For instance, the L_mac intrinsic function is used in the following example (see Figure 3), and a complete list of the intrinsic functions for fractional arithmetic can be found in the SC100 C/C++ Compiler User's Manual. The example illustrates how the
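The 16-bit relationship above (Integer Value = Fractional Value × 2^15) can be turned into a small helper for writing fractional constants. This is a portable sketch under stated assumptions: the function name frac_to_q15 is hypothetical, rounding is to nearest, and values at or beyond the representable range are clamped, since +1.0 has no exact 16-bit fractional representation.

```c
#include <stdint.h>

/* Convert a fraction in [-1.0, 1.0) to its 16-bit integer equivalent.
 * Out-of-range inputs are clamped to the nearest representable value. */
static int16_t frac_to_q15(double f)
{
    double scaled = f * 32768.0;                 /* Fractional Value x 2^15 */
    if (scaled >= 32767.0)
        return INT16_MAX;                        /* largest positive value  */
    if (scaled <= -32768.0)
        return INT16_MIN;                        /* exactly -1.0            */
    /* Round to nearest; the cast truncates toward zero. */
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}
```

For example, frac_to_q15(0.5) yields 16384 (0x4000) and frac_to_q15(-0.125) yields -4096 (bit pattern 0xF000), in agreement with Table 1.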
16. d M are passed on the stack. 1. Open the Ex8.c and addvecs.asm files and familiarize yourself with the code. 2. In addvecs.asm are two constants, Z_OFFSET and M_OFFSET, whose values are not set and which are represented by question marks. These offsets pull z and M from the stack. Find the lines of code that perform this task. 3. Before the code can be built, you must assign values to Z_OFFSET and M_OFFSET. To help you to do this, Figure 12 shows the stack on entry to addvecs. [Figure 12. Stack Contents on Entry to addvecs: the parameters are pushed onto the stack prior to jsr/bsr; the return address and status register (each labeled 4 in the figure) are then pushed on the stack by the jsr/bsr instruction; SP is marked prior to the function call and on function entry.] 4. In the box provided here, write what you think the offsets should be: Z_OFFSET, M_OFFSET. 5. Modify the addvecs.asm file to incorporate your offset values. 6. Build the code. 7. Run the code: runsc100 Ex8.eld. 8. The following output should be displayed: [output listing garbled in extraction: "Zum 3855 Mie Dy Xl 195 45 E sum 80"]. 9. If the above output is displayed, your offset values are correct. 10. Rebuild the code, this time with the -S option. 11. Open the generated assembly file Ex8.sl and find the call to addvecs. 12. Find
17. d.c -S. Open Ex3_main.sl to see what the compiler has produced. Since the compiler has all information on the application, it optimizes the application further than with local optimization. The compiler avoids calling the function by inlining the function into the main code, as shown in Ex3_main.sl. Therefore, it eliminates the cycle overhead associated with jumping to and returning from the function and passing the parameters to the function. Congratulations, you have completed Exercise 3. Good To Know: • Global optimization requires a longer compilation time than local optimization. • Global optimization further optimizes the application speed. 4 Memory Alignment Exercise. The memory alignment exercise shows the usage of wide data moves and the necessary alignments for performing these moves. The SC140 memory has byte granularity, as represented in Figure 7. Two arithmetic address units (AAUs) transfer the data from memory to the 4 ALUs and vice versa via two 64-bit data buses. Each data bus allows the transfer of up to eight bytes from memory to the data registers in one cycle, and vice versa. If the compiler must generate the wide data move instructions available in the StarCore instruction set, such as move.2w, move.2f, move.4w, and so on, data must be correctly aligned in memory. This is due to the way the address and data buses operate for multi-byte accesses in the StarCore architecture. The compiler does not generate
18. e calculations in Group 3, x(n-2), x(n-1), and x(n) should already exist in the DSP registers from the calculation of Group 2. The result is a reduction in memory bandwidth requirements that increases code efficiency. 1. Open the Ex6.c file. 2. Compile Ex6.c using the -Ot2 option. Run the code and verify that the output is correct. See the comments in Ex6.c for the correct values of y. 3. Recompile Ex6.c using the -Ot2 and -S options. Examine the assembly language file Ex6.sl to see how the inner loop is compiled. Intermediate Version: Compromise Between Memory and Speed. 4. Save Ex6.c as Ex6_1.c. 5. Change the C code of Ex6_1.c according to the following steps: a. Process the first four samples at a time. Replace the implementation of y(n) in terms of a and x(n) with the equations defined as Group 0 in Equation 5. b. Replace x(n), x(n+1), x(n+2), x(n+3) with variables, for example, var0, var1, var2, var3, respectively, as follows: res0: a[i] × var0; res1: a[i] × var1; res2: a[i] × var2; res3: a[i] × var3 (Group 0). This processes the first group (Group 0). To process the remaining groups (Group 1, and so on), the values from var0, var1, and var2 from Group 0 must be transferred to var1, var2, var3, respectively, for processing Group 1. c. Transfer the val
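The register reuse described above can be sketched in portable C. This is an illustration under stated assumptions, not the Ex6 solution: plain integer multiply-accumulates stand in for L_mac, the function names are hypothetical, the output count is assumed to be a multiple of four, and the input pointer x is assumed to be preceded by at least taps - 1 valid (zero-padded) samples.

```c
/* Reference FIR, one output per pass: y[n] = sum over i of a[i] * x[n - i]. */
static void fir_single_sample(const short *a, int taps,
                              const short *x, long *y, int n_out)
{
    for (int n = 0; n < n_out; n++) {
        long acc = 0;
        for (int i = 0; i < taps; i++)
            acc += (long)a[i] * x[n - i];
        y[n] = acc;
    }
}

/* Multi-sample FIR: four outputs per pass. For each tap, one coefficient
 * and one new input sample are loaded; the other three samples are reused
 * from the previous tap by shifting them along w0..w3. */
static void fir_multi_sample(const short *a, int taps,
                             const short *x, long *y, int n_out)
{
    for (int n = 0; n < n_out; n += 4) {
        long r0 = 0, r1 = 0, r2 = 0, r3 = 0;
        short w0 = x[n], w1 = x[n + 1], w2 = x[n + 2], w3 = x[n + 3];
        for (int i = 0; i < taps; i++) {
            short c = a[i];
            r0 += (long)c * w0;          /* contributes to y[n]     */
            r1 += (long)c * w1;          /* contributes to y[n + 1] */
            r2 += (long)c * w2;          /* contributes to y[n + 2] */
            r3 += (long)c * w3;          /* contributes to y[n + 3] */
            w3 = w2; w2 = w1; w1 = w0;   /* reuse three samples     */
            if (i + 1 < taps)
                w0 = x[n - i - 1];       /* load one older sample   */
        }
        y[n] = r0; y[n + 1] = r1; y[n + 2] = r2; y[n + 3] = r3;
    }
}
```

Each output is still accumulated tap by tap in the same order as in the single-sample loop, which is why this arrangement preserves the order of summation while sharing loads across four results.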
19. es); printf("The fractional result is %d 0x%x\n", fres, fres); } Exercise 3. No code modification is required. Exercise 4. data = 0x01 0x23 0x45 0x67 0x89 0xAB 0xCD 0xEF 0xAA 0xBB 0xCC 0xDD 0xEE 0xFF 0x11 0x22. [Table, aligned case: Expected and Simulator register contents after move.2w, move.2f, and move.4w. Legible cells, identical in both columns, include d0 = 0000 0123, d2 = 00 0123 0000, d3 = 00 4567 0000, d6 = FF FFFF 89AB, and d7 = FF FFFF CDEF; the remaining cells did not survive extraction.] data = 0x01 0x23 0x45 0x67 0x89 0xAB 0xCD 0xEF 0xAA 0xBB 0xCC 0xDD 0xEE 0xFF 0x11 0x22. [Table, not-aligned case: Expected and Simulator register contents for move.w, move.2w, move.2f, and move.4w after r0 is offset into data by 2; the cells did not survive extraction. Column labels: Aligned, Not Aligned.] The crosses indicate that the results provided
20. full use of all four ALUs, reducing the cycle time by more than 70 percent relative to use of a single ALU. The 4-ALU technique does not guarantee bit exactness with the single-ALU technique because the order of accumulation is different. Using the 4-ALU technique therefore has implications in applications that are defined by bit-exact standards, such as speech coding standards from ITU, ETSI, TIA/EIA, and so on. Good To Know: • The use of four variables removes the accumulation dependency, which is required for parallelism. • Bit-exact considerations must be understood if this technique is used; overflow/saturation characteristics may change during split summation. 6 Multi Sample Exercise. The multi-sample exercise demonstrates the multi-sample technique. As the exercise in Section 5 shows, the split summation technique allows a sum-of-products operation to be calculated using all four ALUs by evaluating four intermediate products at a time. However, it does not guarantee bit-exact agreement with serially accumulating each intermediate product using a single ALU. To ensure bit exactness, the order of summation must be preserved by performing each intermediate product accumulation in turn. Therefore, the intermediate products cannot be evaluated in parallel. Furthermore, the split summation technique may not be sui
21. ion avoids transferring data. You must reduce the number of loop iterations by a factor of four to compensate for the fact that the loop is unrolled by a factor of 4. If your inner loop consumes just four cycles and your code still produces the correct output, congratulations! You have completed Exercise 6. Notice that each group of four MAC operations and two data load operations now requires just one processor cycle, which is half the time required by the filtering operation and a quarter of the time required by a single-ALU DSP device. However, the code size for the inner loop has increased by a significant amount (approximately four times that of the second implementation), and this must be weighed against the cycle count performance improvements obtained. Table 3 summarizes the main characteristics of the multi-sample technique.
Table 3. Inner Loop Characteristics of Multi-sample and Single-sample Techniques
Characteristic | Single-sample | Multi-sample
Algorithm | [cell garbled in extraction] | [cell garbled in extraction]
Cycle count | N | N/4
Registers used | Fewer | More
Sample delay | 1 | 4
Number of memory moves (bandwidth) | 2N | N/2
Code size | Small | Large
7 Control Code: The True Bit Exercise. The True bit exercise shows how the compiler uses the True bit and how you can help the compiler to improve the performance. The True bit is set/cleared by compare or test instructions. The use of the True bit as a control flag, together with DSP-specific code, makes the SC140 very powerful f
22. /* For reference, the following output should be observed after running the code: y[0] = 0x0020, y[1] = 0x0080, y[2] = 0x0140, y[3] = 0x0280, y[4] = 0x0460, y[5] = 0x0700, y[6] = 0x0A80, y[7] = 0x0D00, y[8] = 0x0EA0, y[9] = 0x0F80, y[10] = 0x0FC0, y[11] = 0x0F80, y[12] = 0x0F40, y[13] = 0x0F00, y[14] = 0x0EC0, y[15] = 0x0E80, y[16] = 0x0E40, y[17] = 0x0E00, y[18] = 0x0DC0, y[19] = 0x0D80, y[20] = 0x0D40, y[21] = 0x0D00, y[22] = 0x0CC0, y[23] = 0x0C80, y[24] = 0x0C40, y[25] = 0x0C00, y[26] = 0x0BC0, y[27] = 0x0B80, y[28] = 0x0B40, y[29] = 0x0B00, y[30] = 0x0AC0, y[31] = 0x0A80 */ main() { long res0, res1, res2, res3; short var0, var1, var2, var3; short n, i, *x_ptr; x_ptr = &input[14]; /* x_ptr points to input[14], which is x[3] */ for (n = 0; n < 32; n += 4) { res0 = 0; res1 = 0; res2 = 0; res3 = 0; var3 =
23. 1 + res2; res3 = res3 + res4; res1 = res1 + res3; printf("Result %d 0x%x\n", res1, res1); } Exercise 6. Intermediate version: Compromise between Memory and Speed. /* Freescale Semiconductor Inc. COPYRIGHT 1999 Freescale Semiconductor Inc. INTRODUCTION TO THE SC140 TOOLS */ /* Multi-sample technique Exercise on an FIR Filter */ #include <stdio.h> #include <prototype.h> short a[12] = {0x1000, 0x2000, 0x3000, 0x4000, 0x5000, 0x6000, 0x7000, 0x8000, 0x9000, 0xA000, 0xB000, 0xC000}; short input[32+11] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* zero padding */ 0x0100, 0x0200, 0x0300, 0x0400, 0x0500, 0x0600, 0x0700, 0x0800, 0x0900, 0x0A00, 0x0B00, 0x0C00, 0x0D00, 0x0E00, 0x0F00, 0x1000, 0x1100, 0x1200, 0x1300, 0x1400, 0x1500, 0x1600, 0x1700, 0x1800, 0x1900, 0x1A00, 0x1B00, 0x1C00, 0x1D00, 0x1E00, 0x1F00, 0x2000}; short y[32]; /*
24. lts. • StarCore 100 C Compiler: Produces highly optimized code. Compiler features include ANSI C standard compliance, fixed-point optimization, global optimization, and a standard C library. • StarCore 100 Assembler: Translates assembly language files into machine-readable object files. • Linker: Links and relocates the object files and produces executable program files. Complex memory configurations can be specified, and detailed linker maps can be generated. • StarCore 100 Simulator and Run-time Simulator: The StarCore 100 simulator can run from either a text-based or a graphical user interface (GUI). A separate simulator utility, runsc100, is included for run-time I/O support. Freescale Semiconductor, Inc., 2001, 2004. All rights reserved. CONTENTS: 1 File I/O Exercise; 2 Integer and Fractional Arithmetic Exercise; 2.1 Hardware Support on StarCore; 2.2 Compiler Support on StarCore; 3 Local Versus Global Optimization Exercise; 4 Memory Alignment Exercise; 5 Split Summation Exercise; 6 Multi Sample Exercise; 7 Control Code: The True Bit Exercise; 8 Calling an Assembly Routine From C Exercise; 9 The Challenge; 10 Solutions to Exercises. Before starting the exercises, install the files in AN2009SW.zip on your computer in the following directory: On a Windows pla
25. me. Optimized for Space: Recompile using the compiler optimization option for code size (-Os option). Open the generated assembly file Ex7.sl and look at the conditional instructions within the loop. Write down how many execution sets are within the loop in the box. Save Ex7.c as Ex7_1.c. Modify the program to obtain two cycles within the loop. Tip: consider using a temporary variable for both storing the immediate value of array1 and the conditional test. Compile the code using the -Ot2 option. Open the file Ex7_1.sl. 12. If you have obtained two cycles for the inner loop, congratulations! If you have not, please try again. 13. In the following box, write the optimized C code. [Box: C Code, Generated Assembly Code.] 8 Calling an Assembly Routine From C Exercise. Practical DSP applications commonly use a mixture of C and assembly language. This exercise shows how an assembly language function can be called from C code. The code for this exercise is contained in two files, Ex8.c and addvecs.asm. The C code in Ex8.c calls the assembly language function addvecs in file addvecs.asm to add two vectors together and return the sum of all the elements of the resultant vector. The prototype for addvecs is as follows: short addvecs(sh
26. metic
[Box: C code | Generated Assembly code]
Fractional Arithmetic
7 For fractional arithmetic, copy and paste the loop of Ex2.c. The first loop remains unchanged and performs integer calculation, while the second loop is modified to perform fractional arithmetic.
8 In the second loop, replace the integer arithmetic operation with the appropriate fractional intrinsic. Remember, fractional arithmetic is performed using C compiler intrinsics. In this example, the L_mac intrinsic is used. Its prototype is:
long int L_mac(long int, short int, short int)
Therefore, the code modifications should be:
• Create a new variable fres of type long int.
• Replace res += x[i] * x[i] with the instruction fres = L_mac(fres, x[i], x[i]).
• Include the file prototype.h, which contains all the intrinsics prototypes.
• Add another printf statement to print out the fractional result. The result is still a long int, so %d should still be used.
9 Recompile the code with the -S option and look at the generated assembly file Ex2.sl within the second loop.
10 In the box provided below, write the fractional C code and the generated assembly instructions for that loop.
[Box: Fractional Arithmetic: C code | Generated Assembly code]
11 Compare
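The difference between the integer statement and the L_mac form can be sketched in portable C. This is only a minimal model, assuming L_mac follows the usual saturating fractional definition (a 16-bit by 16-bit product doubled into a 32-bit fractional value, then a saturating add). The names l_mac_model, energy_int, and energy_frac are hypothetical and are not part of prototype.h.

```c
#include <stdint.h>

/* Hypothetical portable model of a saturating fractional multiply-accumulate.
   Assumption: result = acc + 2*a*b, with the product and the sum each
   saturated to the 32-bit range. */
int32_t l_mac_model(int32_t acc, int16_t a, int16_t b)
{
    int64_t prod = 2 * (int64_t)a * (int64_t)b;   /* Q15 x Q15 -> Q31 */
    if (prod > INT32_MAX) prod = INT32_MAX;       /* only when a == b == -32768 */

    int64_t sum = (int64_t)acc + prod;
    if (sum > INT32_MAX) sum = INT32_MAX;
    if (sum < INT32_MIN) sum = INT32_MIN;
    return (int32_t)sum;
}

/* Integer energy: plain multiply-accumulate, as in the first loop. */
long energy_int(const int16_t *x, int n)
{
    long res = 0;
    for (int i = 0; i < n; i++)
        res += (long)x[i] * x[i];
    return res;
}

/* Fractional energy: same loop, using the modeled intrinsic. */
int32_t energy_frac(const int16_t *x, int n)
{
    int32_t fres = 0;
    for (int i = 0; i < n; i++)
        fres = l_mac_model(fres, x[i], x[i]);
    return fres;
}
```

Under these assumptions the fractional result is twice the integer result whenever no saturation occurs, which is one way to check the two printed values against each other.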
27. n address and status register contents are pushed onto the stack by the jsr or bsr instruction. If the called function modifies register d6, d7, r6, or r7, it should first save them on the stack and then restore them before returning. All other registers are free for use without saving or restoring them. The calling function must save these registers if it needs their values to be preserved. On function exit, the status register contents and return address are popped from the stack by the rts instruction, and the calling function deallocates the stack space used to pass parameters 3, 4, 5, and so on.
[Figure 11. Typical Stack Contents During Function Execution. The figure shows the stack between low and high addresses, with the SP position, at five stages: (1) prior to function call, (2) on entry to function, (3) during function execution, (4) prior to exit from function, and (5) on return from function, after which the calling function deallocates the parameters on the stack. Stack regions shown: parameters 3, 4, 5, ...; return address; saved registers; local variables, if any.]
Therefore, for the function addvecs, parameters x and y are passed in r0 and r1, while z an
28. ntended or unauthorized application, Buyer shall indemnify and hold Freescale Semiconductor and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Freescale Semiconductor was negligent regarding the design or manufacture of the part. Freescale and the Freescale logo are trademarks of Freescale Semiconductor, Inc. StarCore is a trademark of StarCore LLC. All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2004.
29. /* Multi-sample technique: Exercise on an FIR Filter */
#include <stdio.h>
#include <prototype.h>
short a[12] = {0x1000, 0x2000, 0x3000, 0x4000, 0x5000, 0x6000, 0x7000, 0x8000, 0x9000, 0xA000, 0xB000, 0xC000};
short input[32+11] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* zero padding */
  0x0100, 0x0200, 0x0300, 0x0400, 0x0500, 0x0600, 0x0700, 0x0800,
  0x0900, 0x0A00, 0x0B00, 0x0C00, 0x0D00, 0x0E00, 0x0F00, 0x1000,
  0x1100, 0x1200, 0x1300, 0x1400, 0x1500, 0x1600, 0x1700, 0x1800,
  0x1900, 0x1A00, 0x1B00, 0x1C00, 0x1D00, 0x1E00, 0x1F00, 0x2000};
short y[32];
/* For reference, the following output should be observed after running the code:
   y[0]  = 0x0020
   y[1]  = 0x0080
   y[2]  = 0x0140
   y[3]  = 0x0280
   y[4]  = 0x0460
   y[5]  = 0x0700
   y[6]  = 0x0A80
   y[7]  = 0x0D00
   y[8]  = 0x0EA0
   y[9]  = 0x0F80
   y[10] = 0x0FC0
   y[11] = 0x0F80
   y[12] = 0x0F40
   y[13] = 0x0F00
   y[14] = 0x0E
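The reference outputs listed above can be reproduced off-target with a direct-form FIR written in portable C. The following is a sketch under stated assumptions: L_mac is modeled as a saturating acc + 2*a*b, extract_h is modeled as taking the upper 16 bits of the 32-bit accumulator, and the helper names mac_q31 and fir_reference are hypothetical. The coefficient and input values are generated by formula but equal the tables in the listing (coefficients above 0x7FFF are negative when stored in a 16-bit short).

```c
#include <stdint.h>

#define N_TAPS 12
#define N_OUT  32

/* Hypothetical model of a saturating fractional multiply-accumulate. */
int32_t mac_q31(int32_t acc, int16_t a, int16_t b)
{
    int64_t prod = 2 * (int64_t)a * (int64_t)b;
    if (prod > INT32_MAX) prod = INT32_MAX;
    int64_t sum = (int64_t)acc + prod;
    if (sum > INT32_MAX) sum = INT32_MAX;
    if (sum < INT32_MIN) sum = INT32_MIN;
    return (int32_t)sum;
}

/* Direct-form FIR: y[n] = high 16 bits of sum over i of a[i] * x[n - i]. */
void fir_reference(int16_t y[N_OUT])
{
    int16_t a[N_TAPS];
    int16_t x[N_TAPS - 1 + N_OUT];

    /* a[i] = 0x1000 * (i + 1), wrapped into the signed 16-bit range. */
    for (int i = 0; i < N_TAPS; i++) {
        int32_t v = 0x1000 * (i + 1);
        if (v > 0x7FFF) v -= 0x10000;
        a[i] = (int16_t)v;
    }

    /* 11 zeros of padding, then 0x0100, 0x0200, ..., 0x2000. */
    for (int k = 0; k < N_TAPS - 1; k++)
        x[k] = 0;
    for (int k = 0; k < N_OUT; k++)
        x[N_TAPS - 1 + k] = (int16_t)(0x0100 * (k + 1));

    for (int n = 0; n < N_OUT; n++) {
        int32_t acc = 0;
        for (int i = 0; i < N_TAPS; i++)
            acc = mac_q31(acc, a[i], x[N_TAPS - 1 + n - i]);
        /* Accumulators are non-negative for this data set, so the shift is well defined. */
        y[n] = (int16_t)(acc >> 16);
    }
}
```

Calling fir_reference and printing y[n] with the same format as the exercise gives a host-side baseline to compare against the simulator output.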
30. onsiderations
The following instructions require data to be aligned on the specified boundaries:
move.w (r0),d0                 2-byte boundary
move.f (r0),d0                 2-byte boundary
move.2w (r0),d0:d1             4-byte boundary
move.2f (r0),d0:d1             4-byte boundary
move.4w (r0),d0:d1:d2:d3       8-byte boundary
move.4f (r0),d0:d1:d2:d3       8-byte boundary
move.l (r0),d0                 8-byte boundary
move.2l (r0),d0:d1             8-byte boundary
1 Open the Ex4.c file, which contains a series of assembly instructions within a C framework using asm statements. For alternative and nicer ways of incorporating assembly code, consult the C700 C C Compiler User's Manual.
2 Look at the assembly instructions to understand the wide data move instructions. Notice that the code comprises two sections: the first section with aligned data and the second with non-aligned data.
3 For each instruction, write the result you expect from each section in the boxes provided here in the Expected columns. Array data is of type long int and therefore aligns on a 4-byte boundary.
data: 0x01 0x23 0x45 0x67 0x89 0xAB 0xCD 0xEF 0xAA 0xBB 0xCC 0xDD 0xEE 0xFF 0x11 0x22
[Table to fill in: one row per instruction (move.w (r0),d0; move.2f (r0),d2:d3; ...), with Expected and Simulator columns]
31. or applications including both control and DSP code. The True bit can affect conditional branching as well as conditional execution of groups of instructions.
Conditional branching includes:
• BT/BF: Branch (relative) if True bit is True/False
• BTD/BFD: Branch delayed (relative) if True bit is True/False
• JT/JF: Jump if True bit is True/False
• JTD/JFD: Jump delayed if True bit is True/False
Conditional execution of instructions includes:
• IFT/IFF: IF True bit is True/False
• IFA: IF always, which is unconditionally executed with IFT/IFF
The conditional execution set combinations are very flexible and are represented in Figure 10, which represents the maximum number of ALUs (that is, two) and one Address Arithmetic Unit (AAU) per subset. The C compiler automatically generates the conditional execution set, and some examples are provided to highlight potential code optimization.
[Figure 10. Control Instructions Using the True Bit]
1 Open the example Ex7.c file.
2 Understand the conditional test in the code.
3 Compile the project with the -Ot2 and -S options.
4 Open the generated assembly file Ex7.sl and look at the conditional instructions within the loop.
5 In the box provided here, write down how many execution sets are within the loop.
[Box: Optimized for Ti
32. ort *x,        /* Input vector x */
short *y,        /* Input vector y */
short *z,        /* Output vector */
short length);   /* Length of vectors */
Four parameters are passed to addvecs. The first three are pointers to arrays and are therefore 32-bit values (addresses are 32 bits in StarCore). The fourth parameter is the length of the vectors and is a 16-bit value. The mechanism by which parameters are passed is specified in the application binary interface (ABI). Generally speaking, this ABI specifies the following calling convention:
• The first parameter is passed in d0 if it is a numeric scalar or in r0 if it is an address.
• The second parameter is passed in d1 if it is a numeric scalar or in r1 if it is an address.
• Subsequent parameters are pushed onto the stack.
• The return value, if any, is passed back to the calling function in d0 if it is a numeric scalar or in r0 if it is an address.
For simple functions with two parameters or fewer, the stack is not used to pass parameters, and it may be possible to write the entire assembly language function without explicitly using the stack at all. In general, however, the stack is used to pass parameters into the function and to store local variables. Its contents are as shown in Figure 11. Just prior to the function call, parameters 3, 4, 5, and so on are pushed onto the stack in reverse order, and parameters 1 and 2 are stored in d0/r0 and d1/r1, as described previously. The function is then called and the retur
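To make the expected behavior of addvecs concrete, here is a plain C reference with the same parameter list. It is a sketch only: the name addvecs_ref is hypothetical, and it assumes the assembly routine uses ordinary (non-saturating) 16-bit addition and that all sums fit in a short. The comments map each parameter to the location the calling convention above would give it.

```c
/* Hypothetical C reference for the assembly routine addvecs.
   Per the calling convention described above:
     x      -> first parameter, an address, so passed in r0
     y      -> second parameter, an address, so passed in r1
     z      -> third parameter, pushed onto the stack
     length -> fourth parameter, pushed onto the stack
   The short return value would come back in d0. */
short addvecs_ref(const short *x, const short *y, short *z, short length)
{
    short sum = 0;
    for (short i = 0; i < length; i++) {
        z[i] = (short)(x[i] + y[i]);   /* element-wise vector add */
        sum = (short)(sum + z[i]);     /* running sum of the result vector */
    }
    return sum;
}
```

A reference like this is useful for checking the output vector and return value of the assembly version with the same inputs.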
33. ost processors require operands to be aligned in memory and multiple-operand loads/stores to be aligned. For example, a double-operand load requires an even address, and a quad-operand load requires a double-even address. These restrictions reduce the complexity of the address generation hardware, particularly for modulo addressing. For example, let us consider the move.4w (Rx),Dk instruction, more specifically move.4w (R0),D0:D1:D2:D3: four 16-bit words are moved from the memory address in R0 into the data registers D0, D1, D2, and D3, respectively. The data must align on an 8-byte boundary, so the address contained in R0 should be a multiple of eight. The examples in Figure 8 further illustrate this point.

Bringing one word from memory (memory at 0x00 holds AA BB CC DD):
• Aligned: move.w (r0),d0 where r0 = 0x0 or 0x2. Aligned on a 2-byte boundary. Correct operation: brings either AABB if R0 = 0x0 or CCDD if R0 = 0x2.
• Not aligned: move.w (r0),d0 where r0 = 0x1 or 0x3. Not aligned on a 2-byte boundary. Erroneous operation: brings wrong data in d0.

Bringing two words from memory (memory at 0x00 holds AA BB CC DD EE ...):
• Aligned: move.2w (r0),d0:d1 where r0 = 0x0 or 0x4. Aligned on a 4-byte boundary. Correct operation: brings AABB CCDD if R0 = 0x0.
• Not aligned: move.2w (r0),d0:d1 where r0 = 0x1, 0x2, or 0x3. Not aligned on a 4-byte boundary. Erroneous operation: brings wrong data in d0:d1.

Figure 8. Alignment C
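The alignment rule in these examples reduces to a simple address test, sketched here in portable C. The helper names is_aligned and word_move_boundary are hypothetical; the boundary sizes follow the word-move cases quoted above (one 16-bit word needs a 2-byte boundary, two words a 4-byte boundary, four words an 8-byte boundary).

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of boundary; boundary must be a power of two. */
bool is_aligned(uint32_t addr, uint32_t boundary)
{
    return (addr & (boundary - 1u)) == 0u;
}

/* Required boundary in bytes for a move of n_words 16-bit words (n_words = 1, 2, or 4). */
uint32_t word_move_boundary(uint32_t n_words)
{
    return 2u * n_words;
}
```

For instance, is_aligned(0x2, word_move_boundary(2)) is false, matching the "not aligned" case for a two-word move with r0 = 0x2.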
34. s granted hereunder to design or fabricate any integrated circuits or integrated circuits based on the information in this document. Freescale Semiconductor reserves the right to make changes without further notice to any products herein. Freescale Semiconductor makes no warranty, representation, or guarantee regarding the suitability of its products for any particular purpose, nor does Freescale Semiconductor assume any liability arising out of the application or use of any product or circuit, and specifically disclaims any and all liability, including without limitation consequential or incidental damages. "Typical" parameters, which may be provided in Freescale Semiconductor data sheets and/or specifications, can and do vary in different applications, and actual performance may vary over time. All operating parameters, including "Typicals", must be validated for each customer application by customer's technical experts. Freescale Semiconductor does not convey any license under its patent rights nor the rights of others. Freescale Semiconductor products are not designed, intended, or authorized for use as components in systems intended for surgical implant into the body, or other applications intended to support or sustain life, or for any other application in which the failure of the Freescale Semiconductor product could create a situation where personal injury or death may occur. Should Buyer purchase or use Freescale Semiconductor products for any such uni
35. t 0 tmp tmp array2_ptrt tmp
Exercise 8
Z_OFFSET equ 12
M_OFFSET equ 14
Exercise 9
/* Freescale Semiconductor, Inc. (c) COPYRIGHT 1999 Freescale Semiconductor, Inc.
   INTRODUCTION TO THE SC140 TOOLS */
#include <prototype.h>
#define DATA_LENGTH 6
Word16 y[2];
Word16 a[12] = {0x0200, 0x0400, 0x0200, 0x0400, 0x0200, 0x0400, 0x0200, 0x0400, 0x0200, 0x0400, 0x0200, 0x0400};
Word16 b[12] = {0x0100, 0x0800, 0x1000, 0x2000, 0x1000, 0x0800, 0x0200, 0x0100, 0x1000, 0x0800, 0x0200, 0x0100};
void main()
{
  Word16 i;
  Word32 L_Re1, L_Re2, L_Im1, L_Im2;
  L_Re1 = L_Re2 = L_Im1 = L_Im2 = 0;
  for (i = 0; i < 2*DATA_LENGTH; i += 2) {
    L_Re1 = L_mac(L_Re1, a[i], b[i]);
    L_Im1 = L_mac(L_Im1, a[i], b[i+1]);
    L_Im2 = L_mac(L_Im2, a[i+1], b[i]);
    L_Re2 = L_mac(L_Re2, a[i+1], b[i+1]);
  }
  y[0] = round(L_Re1 - L_Re2);
  y[1] = round(L_Im1 + L_Im2);
36. takes longer when global optimization is enabled. Global optimization compilation flow is represented in Figure 5.
[Figure 4. StarCore Local Optimization: within the StarCore C compiler, each C file passes through its own C compiler front end (IR files), its own optimizer (icode), and the assembler (asmsc100), producing object/library files (.elb).]
[Figure 5. StarCore Global Optimization: within the StarCore C compiler, the C files (.c, .h) pass through the C compiler front end (IR files), then a global optimization stage (optimizer, icode), and the assembler (asmsc100), producing object/library files (.elb).]
The benefit of global optimization is most apparent when several files containing cross-references are used, as is often the case in any sizeable application. In this example, two files are used:
• the main file, called Ex3_main.c
• a function file, called Ex3_prod.c
The main file Ex3_main.c calls a routine defined in the function file Ex3_prod.c, as shown in Figure 6.
37. ted for the application. Other techniques can be used where it is possible to evaluate one intermediate product from each of four output sample calculations in parallel. Consider the FIR filtering operation described by Equation 4:

\[ y(n) = \sum_{i=0}^{N-1} a_i \, x(n-i), \quad 0 \le n < L \qquad \text{(Equation 4)} \]

A C code implementation of this operation typically resembles the implementation of Ex6.c. To use all four ALUs, the operations can be grouped as illustrated in the following equation:

\[
\begin{aligned}
y(n)   &= a_0 x(n)   + a_1 x(n-1) + a_2 x(n-2) + a_3 x(n-3) + \dots + a_{N-2}\, x(n-N+2) + a_{N-1}\, x(n-N+1) \\
y(n+1) &= a_0 x(n+1) + a_1 x(n)   + a_2 x(n-1) + a_3 x(n-2) + \dots + a_{N-2}\, x(n-N+3) + a_{N-1}\, x(n-N+2) \\
y(n+2) &= a_0 x(n+2) + a_1 x(n+1) + a_2 x(n)   + a_3 x(n-1) + \dots + a_{N-2}\, x(n-N+4) + a_{N-1}\, x(n-N+3) \\
y(n+3) &= a_0 x(n+3) + a_1 x(n+2) + a_2 x(n+1) + a_3 x(n)   + \dots + a_{N-2}\, x(n-N+5) + a_{N-1}\, x(n-N+4)
\end{aligned}
\qquad \text{(Equation 5)}
\]

The columns of Equation 5 form the groups: the four products that share the coefficient a_k make up Group k, from Group 0 through Group N-1.

In Equation 5, the products and accumulations within each group are calculated in parallel, but the groups themselves are evaluated in sequence, thus preserving the order of accumulation, which in turn preserves the bit-exactness of Equation 4. Therefore, parallelization is achieved by processing multiple samples in parallel rather than multiple intermediate products belonging to only one output sample. When one group, for example Group 2, is evaluated, only two words of data need to be loaded for the next group (Group 3): a_3 and x(n-3). The other values needed for th
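The grouping in Equation 5 can be sketched in portable C and checked against the direct form of Equation 4. This is an illustration, not the Ex6 solution: it uses plain integer arithmetic instead of L_mac, small made-up data, and hypothetical names (fir_direct, fir_multisample, multisample_matches_direct). The pointer x is assumed to have TAPS - 1 zero samples stored before it, so x[n - i] is valid for negative indices.

```c
#include <stdint.h>

#define TAPS 4
#define OUTS 8   /* must be a multiple of 4 for the multi-sample version */

/* Equation 4: one output sample at a time. */
void fir_direct(const int16_t *a, const int16_t *x, int32_t *y)
{
    for (int n = 0; n < OUTS; n++) {
        int32_t acc = 0;
        for (int i = 0; i < TAPS; i++)
            acc += (int32_t)a[i] * x[n - i];
        y[n] = acc;
    }
}

/* Equation 5: four output samples per outer iteration.
   Each inner iteration is one group: a[i] feeds four independent accumulators. */
void fir_multisample(const int16_t *a, const int16_t *x, int32_t *y)
{
    for (int n = 0; n < OUTS; n += 4) {
        int32_t res0 = 0, res1 = 0, res2 = 0, res3 = 0;
        for (int i = 0; i < TAPS; i++) {
            int16_t coef = a[i];
            res0 += (int32_t)coef * x[n - i];
            res1 += (int32_t)coef * x[n + 1 - i];
            res2 += (int32_t)coef * x[n + 2 - i];
            res3 += (int32_t)coef * x[n + 3 - i];
        }
        y[n] = res0;
        y[n + 1] = res1;
        y[n + 2] = res2;
        y[n + 3] = res3;
    }
}

/* Returns 1 when both versions agree on a small made-up data set. */
int multisample_matches_direct(void)
{
    static const int16_t a[TAPS] = {1, 2, 3, 4};
    /* TAPS - 1 zeros of history, followed by OUTS input samples. */
    static const int16_t buf[TAPS - 1 + OUTS] = {0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8};
    const int16_t *x = buf + (TAPS - 1);
    int32_t yd[OUTS], ym[OUTS];

    fir_direct(a, x, yd);
    fir_multisample(a, x, ym);
    for (int n = 0; n < OUTS; n++)
        if (yd[n] != ym[n])
            return 0;
    return 1;
}
```

Because each accumulator still adds its products in the order i = 0, 1, 2, ..., the per-sample order of accumulation is the same as in the direct form, which is the property the text relies on for bit-exactness.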
38. tform: C:\FreescaleDSP\SC140
• On a UNIX platform: FreescaleDSP/SC140
The exercises directory structure and files are represented in Figure 1. This directory structure is only a recommendation; any location can be used. Once you have installed the exercise files, and if you are running on a Windows platform, all the exercises are located in:
• c:\FreescaleDSP\SC140\Exercises
This path is the reference path for all exercises discussed in this document.
Exercises
  Ex1: I/O (to be created)
  Ex2: Integer & Fractional Arithmetic (Ex2.c)
  Ex3: Local vs Global Optimization (Ex3_main.c & Ex3_prod.c)
  Ex4: Memory Alignment Considerations (Ex4.asm)
  Ex5: Split Summation Technique (Ex5.c)
  Ex6: Multi-Sample Technique (Ex6.c)
  Ex7: Control Code, Use of the True Bit (Ex7.c)
  Ex8: Calling an Assembly Routine from C (Ex8.c and AddVecs.asm)
  Ex9: The Challenge (Ex9.c)
Figure 1. Directory Structure and Files for SC140 Exercises
A typical development process is represented in Figure 2.
[Figure 2 elements: C compiler with IR library files (.lib) and IR files (IR = Intermediate Representation); optimizer (icode); compiler assembly files (.sl) and assembly files (.asm); assembler (asmsc100); object library files (.elb), object files (.eln), and listing files (.lst); linker (sc100-ld); linker map files (.map); R
39. the code. Look at the register contents in the session window and write the values in the Simulator columns boxes above for both sections.
Congratulations, you have completed Exercise 4.
Good To Know
• Unaligned data accesses lead to erroneous results. You must consider these issues when developing assembly code.
5 Split Summation Exercise
The split summation exercise shows how to modify C code using the split summation technique to get better parallelization. The split summation technique helps to maximize the multiple-ALU loading by performing arithmetic operations in parallel while requiring little algorithmic or code modification. To illustrate this technique, the example optimizes the calculation of the energy of a signal, already considered in Exercise 2. The power calculation is represented in Equation 2:

\[ y = \sum_{i=0}^{N-1} x(i)^2 \qquad \text{(Equation 2)} \]

where x(i) is the signal input sample at iteration i, y is the power of the signal, and N is the signal length. As Exercise 2 shows, computing the signal energy directly from Equation 2 results in the use of only one ALU out of the four, with one multiply-accumulate operation performed at each iteration. However, the split summation technique can load all four ALUs. Equation 2 is expanded as follows:
y = Σ x(i) x(i) + x(i+1) x(i+1) + x(i
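The idea of split summation can be sketched in portable C with four independent partial sums. This is only an illustration with hypothetical function names (energy_single, energy_split); it uses plain integer arithmetic rather than the fractional intrinsics and assumes the signal length is a multiple of four.

```c
/* Equation 2 computed directly: one multiply-accumulate per iteration. */
long energy_single(const short *x, int n)
{
    long y = 0;
    for (int i = 0; i < n; i++)
        y += (long)x[i] * x[i];
    return y;
}

/* Split summation: four independent accumulators, so the four
   multiply-accumulates in the loop body do not depend on each other.
   Assumes n is a multiple of 4. */
long energy_split(const short *x, int n)
{
    long y0 = 0, y1 = 0, y2 = 0, y3 = 0;
    for (int i = 0; i < n; i += 4) {
        y0 += (long)x[i]     * x[i];
        y1 += (long)x[i + 1] * x[i + 1];
        y2 += (long)x[i + 2] * x[i + 2];
        y3 += (long)x[i + 3] * x[i + 3];
    }
    return y0 + y1 + y2 + y3;   /* combine the partial sums once, after the loop */
}
```

With plain integer arithmetic and no overflow, both functions return the same value. Split summation does change the order of accumulation, so with saturating fractional arithmetic the two forms are not guaranteed to be bit-exact when saturation occurs.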
40. the instructions that put z and M onto the stack just prior to the function call. Write the offsets in the box provided here.
[Box: Z_OFFSET = ____  M_OFFSET = ____]
13 Are the offsets used in Ex8.c the same as the offsets used in addvecs.asm? If not, can you explain why?
Congratulations, you have completed Exercise 8.
Good To Know
• The stack pointer must always be a multiple of 8. It is illegal to increment it by a non-multiple of 8.
9 The Challenge
This section presents you with a challenge involving an example that implements a complex scalar product. The objective of this session is to optimize the code from Ex9.c for speed and obtain the minimum number of cycles.
1 Put into practice the techniques previously explained to optimize Ex9.c.
The original number of cycles in the inner loop is: [Original: Inner loop ____]
So far, after having modified the code, your best result is: [Your Best: Inner loop ____]
The optimized C code result is: [Target: Inner loop = 1 cycle, with ALUs and AAUs 100 percent used]
If your best result is within 10 percent of the target result, congratulations! You have completed all the exercises and the challenge as well.
10
41. ues in var1, var2, and var3 and load the new sample x[n-1] into var0.
Compile the code with the -Ot2 option and run the code to verify that the correct output values are obtained.
Recompile Ex6_1.c using the -Ot2 and -S options. The inner loop should be only two cycles long. If not, return to Step 5.
During each iteration of the loop, the coefficient a[i] is loaded into a data register. The data value x[n-1-i] is loaded into another data register. The values in the other three registers are reused, but they must first be transferred into the registers where the four MAC instructions expect them. This transfer results in two clock cycles for every four MAC instructions.
In the box on the following page, write the code for the intermediate version.
[Box: C Code | Generated Assembly Code]
Further Speed Optimization
The register-to-register transfers can be eliminated by expanding the inner loop so that each group of four MAC instructions uses the data registers already containing the required data values. This yields faster code, but code size is greater.
1 Save Ex6_1.c as Ex6_2.c.
2 In Ex6_2.c, unroll the inner loop instructions four times so that the first four groups (Group 0, Group 1, Group 2, and Group 3) are all processed in the loop. This loop expans
42. un-time Simulator (runsc100): executes the program to completion, C file I/O capability, DOS based; absolute files (.eld); interactive simulator (simsc100)]
Figure 2. StarCore Development Process
1 File I/O Exercise
The file I/O exercise shows how to use standard ANSI C I/O features within the current tools suite.
1 Create a new text file called io.c.
2 Within the io.c file, write code using the ANSI C printf function to display "Welcome to StarCore SC140 Tools" on the screen (remember to include the header file stdio.h).
3 Compile the file using: ccsc100 io.c -o io.eld
The -o option specifies the output file name, for example io.eld. If the application does not compile successfully, correct the reported mistake(s) and recompile the application until a successful compilation occurs.
4 Run the executable: runsc100 io.eld, to display "Welcome to StarCore SC140 Tools".
The runsc100 executable is a cycle-accurate run-time simulator. It allows you to run an application to completion and print out intermediate/final results. You can use this executable for quick code verification and/or debugging purposes.
Congratulations, you have completed Exercise 1.
2 Integer and Fractional Arithmetic Exercise
One of the strengths of both the StarCore architecture and the StarCore compiler is the ability to
