
Intel(R) C++ Compiler for Linux* Systems User's Guide

1. "source1.cpp": creating precompiled header file "source1.pchi"
"source2.cpp": using precompiled header file "source1.pchi"
"source3.cpp": using precompiled header file "source1.pchi"

If you don't use #pragma hdrstop, a different PCH file is created for each source file if different headers follow common.h, and the subsequent compile times will be longer. #pragma hdrstop has no effect on compilations that do not use these PCH options.

Linking

This topic describes the options that let you control and customize the linking with tools and libraries and define the output of the ld linker. See the ld man page for more information on the linker.

Option               Description
-Ldirectory          Instructs the linker to search directory for libraries.
-Qoption,tool,list   Passes an argument list to another program in the compilation sequence, such as the assembler or linker.
-shared              Instructs the compiler to build a Dynamic Shared Object (DSO) instead of an executable.
-shared-libcxa       Has the opposite effect of -static-libcxa. When it is used, the Intel-provided libcxa C++ library is linked in dynamically, allowing the user to override the static linking behavior when the -static option is used. Note: By default, all C++ standard and support libraries are linked dynamically.
-i_dynamic           Specifies that

Options listed on this page without a surviving description: -static, -static-libcxa, -Bstatic, -Bdynamic.
2. __m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8)
Selects four specific SP FP values from a and b, based on the mask imm8. The mask must be an immediate. See Macro Function for Shuffle Using Streaming SIMD Extensions for a description of the shuffle semantics.

__m128 _mm_unpackhi_ps(__m128 a, __m128 b)
Selects and interleaves the upper two SP FP values from a and b.
    r0 := a2
    r1 := b2
    r2 := a3
    r3 := b3

__m128 _mm_unpacklo_ps(__m128 a, __m128 b)
Selects and interleaves the lower two SP FP values from a and b.
    r0 := a0
    r1 := b0
    r2 := a1
    r3 := b1

__m128 _mm_loadh_pi(__m128 a, __m64 const *p)
Sets the upper two SP FP values with 64 bits of data loaded from the address p.
    r0 := a0
    r1 := a1
    r2 := *p0
    r3 := *p1

void _mm_storeh_pi(__m64 *p, __m128 a)
Stores the upper two SP FP values to the address p.
    *p0 := a2
    *p1 := a3

__m128 _mm_movehl_ps(__m128 a, __m128 b)
Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The upper 2 SP FP values of a are passed through to the result.
    r3 := a3
    r2 := a2
    r1 := b3
    r0 := b2

__m128 _mm_movelh_ps(__m128 a, __m128 b)
Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The lower 2 SP FP values of a are passed through to the result.
    r3 := b1
    r2 := b0
    r1 := a1
    r0 := a0

__m128 _mm_loadl_pi(__m128 a, __m64 const *p)
Sets the lower two SP FP values with
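The lane mappings listed above can be checked with a small C sketch. It assumes an x86 compiler that ships `<xmmintrin.h>`; the helper names `lanes` and `interleave_demo` and the sample values are made up for illustration and do not come from the guide.

```c
#include <xmmintrin.h>

/* Hypothetical helper: copies the four lanes of v into out[0..3],
   so that out[0] is r0 and out[3] is r3. */
static void lanes(__m128 v, float out[4])
{
    _mm_storeu_ps(out, v);
}

/* Applies the four intrinsics to a = (a0..a3) = (0, 1, 2, 3) and
   b = (b0..b3) = (10, 11, 12, 13) and stores each result. */
static void interleave_demo(float hi[4], float lo[4], float hl[4], float lh[4])
{
    /* _mm_setr_ps takes its arguments in lane order: lane 0 first. */
    __m128 a = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    __m128 b = _mm_setr_ps(10.0f, 11.0f, 12.0f, 13.0f);

    lanes(_mm_unpackhi_ps(a, b), hi);  /* r0..r3 = a2 b2 a3 b3 */
    lanes(_mm_unpacklo_ps(a, b), lo);  /* r0..r3 = a0 b0 a1 b1 */
    lanes(_mm_movehl_ps(a, b), hl);    /* r0..r3 = b2 b3 a2 a3 */
    lanes(_mm_movelh_ps(a, b), lh);    /* r0..r3 = a0 a1 b0 b1 */
}
```

Calling `interleave_demo` with four 4-element buffers fills them in r0..r3 order, which makes it easy to compare each result against the listed mapping.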
3. sse 129
data dependence 121
default
    compiler behavior 39
    compiler options 37
denormal results 54
[illegible] 11
-Dname[=value] option 11
-dryrun option 11
-dynamic-linker option 11
[illegible] option 11
ECCCFG environment variable 48
ECPCCFG environment variable 48
EMMS Instruction 210
environment
    customizing 48
    setting with iccvars.sh 40
    variables 48
[illegible] option 11
erf library function 184
erfc library function 184
exp library function 180
exp10 library function 180
exp2 library function 180
expm1 library function 180
[illegible] 11
-f[no]verbose-asm option 11
fabs library function 190
-falias option 11
-fast option 11
-fcode-asm option 11
fdim library function 190
features and benefits 2
[illegible] 1
-ffnalias option 11
files
    configuration 49
    for compiler input 42
    for precompiled headers 55
    include 50
    response 50
finite librar
4. [...] 75
-gcc-name option 11, 71
-gcc-version option 11, 71
global symbols 63
-H option 11
-help option 11
hypot library function 180
[illegible] 11
-i_dynamic option 11
IA32ROOT environment variable 48
IA64ROOT environment variable 48
ICCCFG environment variable 48
iccvars.csh 40
iccvars.sh 40
ICPCCFG environment variable 48
-idirafter option 11
ilogb library function 180
include files
    searching for 51
include files 50
inline expansion 97, 98, 393
-inline_debug_info option 11, 92
Intel extensions 148
Intel math library 60, 176, 179, 180, 184, 187, 189, 190, 193
intermediate language 94
intrinsics
    benefits of using 200
    for cross-processor implementation 311, 315, 318, 323
    for data alignment 307, 308
    for Itanium(R) processor 285, 288, 291, 292, 296, 300
    for new Intel processors 283, 284, 285
    MMX(TM) Technology 211, 213, 215, 217, 218, 220
    Streaming SIMD Extensions 221, 222, 225, 226, 231, 233, 234, 235, 236, 239, 243, 244, 246, 248
    Streaming SIMD Extensions 2
        floating point 249, 251, 252, 257, 259, 260, 261, 262
        integer 263, 268, 269, 272, 274, 275, 277, 280, 281, 282
S
5. Operation                   Class     Syntax
__m128 Initialization       M128      I128vec1 A(__m128 m); Iu16vec8(__m128 m);
__m64 Initialization        M64       I64vec1 A(__m64 m); Iu8vec8 A(__m64 m);
__int64 Initialization      M64       I64vec1 A(__int64 m); Iu8vec8 A(__int64 m);
int i Initialization        M64       I64vec1 A(int i); Iu8vec8 A(int i);
int Initialization          I32vec2   I32vec2 A(int A1, int A0);
                                      Is32vec2 A(signed int A1, signed int A0);
                                      Iu32vec2 A(unsigned int A1, unsigned int A0);
int Initialization          I32vec4   I32vec4 A(short A3, short A2, short A1, short A0);
                                      Is32vec4 A(signed short A3, ..., signed short A0);
                                      Iu32vec4 A(unsigned short A3, ..., unsigned short A0);
short int Initialization    I16vec4   I16vec4 A(short A3, short A2, short A1, short A0);
                                      Is16vec4 A(signed short A3, ..., signed short A0);
                                      Iu16vec4 A(unsigned short A3, ..., unsigned short A0);
short int Initialization    I16vec8   I16vec8 A(short A7, short A6, ..., short A1, short A0);
                                      Is16vec8 A(signed A7, ..., signed short A0);
                                      Iu16vec8 A(unsigned short A7, ..., unsigned short A0);
char Initialization         I8vec8    I8vec8 A(char A7, char A6, ..., char A1, char A0);
                                      Is8vec8 A(signed char A7, ..., signed char A0);
                                      Iu8vec8 A(unsigned char A7, ..., unsigned char A0);
char Initialization         I8vec16   I8vec16 A(char A15, ..., char A0);
                                      Is8vec16 A(signed char A15, ..., signed char A0);
                                      Iu8vec16 A(uns
6. Note that _mm_andnot_[y] intrinsics do not apply to the fvec classes.

Conditional Select Operators: Corresponding Intrinsics and Classes, Part 2
Classes: F64vec2, F32vec4, F32vec1

Operator      Corresponding Intrinsics
select_eq     _mm_cmpeq_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_neq    _mm_cmpeq_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_gt     _mm_cmpgt_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_ge     _mm_cmpge_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_lt     _mm_cmplt_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_le     _mm_cmple_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_ngt    _mm_cmpgt_[x], _mm_and_[y], _mm_andnot_[y], _mm_or_[y]
select_nge    _mm_cmpge_[x]
select_nlt    _mm_cmplt_[x]
select_nle    _mm_cmple_[x]

Packing and Unpacking Operators: Corresponding Intrinsics and Classes, Part 1
Class order of the suffix values: I64vec2, I32vec4, I16vec8, I8vec16, I32vec2

Operator       Corresponding Intrinsic   Suffix values (as extracted)
unpack_high    _mm_unpackhi_[x]          epi64, epi32, epi16, epi8, pi32
unpack_low     _mm_unpacklo_[x]          epi64, epi32, epi16, epi8, pi32
pack_sat       _mm_packs_[x]             epi32, epi16, N/A, pi32
packu_sat      _mm_packus_[x]            N/A, N/A, epi16, N/A, N/A
sat_add        _mm_adds_[x]              N/A, N/A, epi16, epi8, N/A
sat_sub        _mm_subs_[x]              N/A, N/A, epi16, epi8, N/A

Packing and Unpacking Operators: Corresponding Intrinsics and
7.     a[i] = a[i] + 1.0f;

#pragma novector

Syntax: #pragma novector

Definition: The novector loop pragma specifies that the loop should never be vectorized, even if it is legal to do so. In this example, suppose you know the trip count (ub - lb) is too low to make vectorization worthwhile. You can use #pragma novector to tell the compiler not to vectorize, even if the loop is considered vectorizable.

Example

    void foo(int lb, int ub)
    {
    #pragma novector
        for (j = lb; j < ub; j++)
        {
            a[j] = a[j] + b[j];
        }
    }

#pragma vector nontemporal

Syntax: #pragma vector nontemporal

Definition: #pragma vector nontemporal results in streaming stores on Pentium 4 based systems. An example loop (float type) together with the generated assembly are shown in the example below. For large N, significant performance improvements result on Pentium 4 systems over a non-streaming implementation.

Example

    #pragma vector nontemporal
    for (i = 0; i < N; i++)
        a[i] = 1;

    .B1.2:
        movntps XMMWORD PTR _a[eax], xmm0
        movntps XMMWORD PTR _a[eax+16], xmm0
        add     eax, 32
        cmp     eax, 4096
        jl      .B1.2

Dynamic Dependence Testing Example

    float *p, *q;
    for (i = L; i < U; i++)
    {
        p[i] = q[i];
    }
    ...
    pL = p + 4*L;
    pH = p + 4*U;
    qL = q + 4*L;
    qH = q + 4*U;
    if (pH < qL || pL > qH)
    {
        // loop without data dependence
        for (i = L; i < U; i++)
        {
            p[i] = q[i];
        }
    } else {
        for (i = L; i < U; i++)
        {
            p[i]
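The dynamic dependence test shown above can be written out by hand as a runtime range check. The following is a minimal C sketch under stated assumptions: the function name `copy_with_dependence_test` is hypothetical, and the addresses are compared as integers (`uintptr_t`) so that the comparison stays well defined for unrelated objects. A vectorizing compiler would generate its own form of this test.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies q[lo..hi-1] into p[lo..hi-1].
   Returns 1 if the dependence-free path ran, 0 if the conservative one ran. */
static int copy_with_dependence_test(float *p, const float *q,
                                     size_t lo, size_t hi)
{
    /* Byte addresses of the first element and one past the last element. */
    uintptr_t pL = (uintptr_t)(p + lo), pH = (uintptr_t)(p + hi);
    uintptr_t qL = (uintptr_t)(q + lo), qH = (uintptr_t)(q + hi);
    size_t i;

    if (pH <= qL || qH <= pL) {
        /* The two ranges are disjoint: no iteration reads a value that
           another iteration writes, so this loop may be vectorized. */
        for (i = lo; i < hi; i++)
            p[i] = q[i];
        return 1;
    }

    /* The ranges overlap: keep the original sequential order. */
    for (i = lo; i < hi; i++)
        p[i] = q[i];
    return 0;
}
```

The two loops are textually identical; the point of the test is that only the first one carries the guarantee a compiler needs before it reorders or vectorizes the iterations.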
8. Loop Count and Loop Distribution

loop count (n) Directive

The loop count (n) directive indicates the loop count is likely to be n. The syntax for this directive is:

    #pragma loop count (n)

where n is an integer constant. The value of loop count affects heuristics used in software pipelining, vectorization and loop transformations.

Example of loop count (n) Directive

    #pragma loop count (10000)
    for (i = 0; i < m; i++)
    {
        // swp likely to occur in this loop
        a[i] = b[i] + 1.2;
    }

distribute point Directive

The distribute point directive indicates to the compiler a preference of performing loop distribution. The syntax for this directive is:

    #pragma distribute point

Loop distribution may cause large loops to be distributed into smaller ones. This may enable software pipelining for more loops. If the directive is placed inside a loop, the distribution is performed after the directive and any loop-carried dependency is ignored. If the directive is placed before a loop, the compiler will determine where to distribute and data dependency is observed. Only one distribute directive is supported when placed inside the loop.

Example of distribute point Directive

    #pragma distribute point
    for (i = 1; i < m; i++)
    {
        b[i] = a[i] + 1;
        ...
        // Compiler will automatically decide where to distribute.
        // Data dependency is observed.
        c[i] = a[i] + b[i];
        ...
        d[i] = c[i] + 1;
    }
    for (i = 1; i < m; i++)
    {
        b[i] = a[i] + 1;
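To make the effect of loop distribution concrete, here is a hand-written C sketch of the transformation. It is an illustration only: the function names `fused` and `distributed` and the choice of split point are assumptions, and the compiler picks its own split when the directive precedes the loop.

```c
/* The undistributed form: one loop computes b, c and d together. */
static void fused(const float *a, float *b, float *c, float *d, int m)
{
    int i;
    for (i = 1; i < m; i++) {
        b[i] = a[i] + 1;
        c[i] = a[i] + b[i];
        d[i] = c[i] + 1;
    }
}

/* The same work split at one distribution point into two smaller loops.
   The split is legal because the second loop only reads values of b that
   the first loop has already finished writing. */
static void distributed(const float *a, float *b, float *c, float *d, int m)
{
    int i;
    for (i = 1; i < m; i++)
        b[i] = a[i] + 1;
    for (i = 1; i < m; i++) {
        c[i] = a[i] + b[i];
        d[i] = c[i] + 1;
    }
}
```

Both functions produce the same arrays; the distributed form simply gives the optimizer two shorter loop bodies to schedule.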
9.  /* Add all the elements of a 20-element array */
    void Add20ArrayElements(F32vec4 *array, float *result)
    {
        F32vec4 vec0, vec1;
        vec0 = _mm_load_ps((float *) array);  // Load array's first 4 floats

        /* Add all elements of the array, 4 elements at a time */
        vec0 += array[1];  // Add elements 5-8
        vec0 += array[2];  // Add elements 9-12
        vec0 += array[3];  // Add elements 13-16
        vec0 += array[4];  // Add elements 17-20

        /* There are now 4 partial sums. Add the 2 lowers to the
           2 raises, then add those 2 results together */
        vec1 = SHUFFLE(vec1, vec0, 0x40);
        vec0 += vec1;
        vec1 = SHUFFLE(vec1, vec0, 0x30);
        vec0 += vec1;
        vec0 = SHUFFLE(vec0, vec0, 2);
        _mm_store_ss(result, vec0);  // Store the final sum
    }

    void main(int argc, char *argv[])
    {
        int i;

        /* Initialize the array */
        for (i = 0; i < SIZE; i++)
        {
            array[i] = (float) i;
        }

        /* Call function to add all array elements */
        Add20ArrayElements(array, &result);

        /* Print average array element value */
        printf("Average of all array values = %f\n", result / 20.);
        printf("The correct answer is
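The same reduction can be sketched with the plain C intrinsics instead of the F32vec4 class. This is not the guide's code: the function name `add20` is hypothetical, unaligned loads are used so that the caller needs no special alignment, and the final fold uses `_mm_movehl_ps` plus one shuffle rather than the shuffle sequence in the listing above.

```c
#include <xmmintrin.h>

/* Sums 20 floats four lanes at a time, then reduces the four partial sums. */
static float add20(const float *array)
{
    __m128 acc = _mm_loadu_ps(array);   /* elements 1-4 */
    __m128 hi;
    int k;

    for (k = 1; k < 5; k++)             /* elements 5-20 */
        acc = _mm_add_ps(acc, _mm_loadu_ps(array + 4 * k));

    /* acc holds partial sums (s0, s1, s2, s3).
       Fold the upper pair onto the lower pair. */
    hi  = _mm_movehl_ps(acc, acc);      /* (s2, s3, s2, s3) */
    acc = _mm_add_ps(acc, hi);          /* lane 0 = s0+s2, lane 1 = s1+s3 */

    /* Bring lane 1 down to lane 0 and add the two remaining sums. */
    hi  = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 1, 1, 1));
    acc = _mm_add_ss(acc, hi);

    return _mm_cvtss_f32(acc);          /* lane 0 holds the total */
}
```

For an array holding 0 through 19 the function returns 190, so the average printed by a driver like the one above would be 9.5.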
10. Operation                   Corresponding Intrinsics
Greater Than                _mm_cmpgt
Greater Than or Equal To    _mm_cmpgt, _mm_andnot_si64
Less Than                   _mm_cmpgt
Less Than or Equal To       _mm_cmpgt, _mm_andnot_si64

Comparison operators have the restriction that the operands must be the size and sign as listed in the Compare Operator Overloading table.

Compare Operator Overloading

R           Comparison                      A                B
I32vec2 R   cmpeq, cmpne                    I[s|u]32vec2 A   I[s|u]32vec2 B
I16vec4 R                                   I[s|u]16vec4 A   I[s|u]16vec4 B
I8vec8 R                                    I[s|u]8vec8 A    I[s|u]8vec8 B
I32vec2 R   cmpgt, cmpge, cmplt, cmple      Is32vec2 A       Is32vec2 B
I16vec4 R                                   Is16vec4 A       Is16vec4 B
I8vec8 R                                    Is8vec8 A        Is8vec8 B

Conditional Select Operators

For conditional select operands, the third and fourth operands determine the type returned. Third and fourth operands with the same size but different signedness return the nearest common ancestor data type.

Conditional Select Syntax Usage

Return the nearest common ancestor data type if third and fourth operands are of the same size but different signs.

    I16vec4 R = select_neq(Is16vec4, Is16vec4, Is16vec4, Iu16vec4);

Conditional Select for Equality

    R0 := (A0 == B0) ? C0 : D0;
    R1 := (A1 == B1) ? C1 : D1;
    R2 := (A2 == B2) ? C2 : D2;
    R3 := (A3 == B3) ? C3 : D3;

Conditional Select for Inequality

    R0 := (A0 != B0) ? C0 : D0;
    R1 := (A1 != B1) ? C1 : D1;
    R2 := (A2 != B2) ? C2 : D2;
    R3 := (A3 != B3) ? C3 : D3;
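The conditional select semantics (take C where A equals B, else D) can be built from a compare followed by and, andnot and or. The C sketch below shows that pattern for four single-precision lanes using SSE intrinsics. It is an illustration under stated assumptions, not the class library's implementation: the names `select_eq_ps` and `select_neq_ps` are hypothetical, and the excerpt itself describes the integer (MMX) classes.

```c
#include <xmmintrin.h>

/* Per lane: r = (a == b) ? c : d */
static __m128 select_eq_ps(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 mask = _mm_cmpeq_ps(a, b);           /* all ones where equal, else zero */
    return _mm_or_ps(_mm_and_ps(mask, c),       /* keeps c where the mask is set   */
                     _mm_andnot_ps(mask, d));   /* keeps d where the mask is clear */
}

/* Per lane: r = (a != b) ? c : d.
   Uses the same equality compare with the roles of c and d swapped. */
static __m128 select_neq_ps(__m128 a, __m128 b, __m128 c, __m128 d)
{
    return select_eq_ps(a, b, d, c);
}
```

Reusing the equality compare for the inequality case avoids a second comparison intrinsic; only the operands of the and/andnot pair change places.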
11. • floating-point arithmetic comparisons conform to the IEEE 754 specification except for NaN behavior
• the exact operations specified in the code are performed. For example, division is never changed to multiplication by the reciprocal
• the compiler performs floating-point operations in the order specified, without reassociation
• the compiler does not perform the constant folding optimization on floating-point values. Constant folding also eliminates any multiplication by 1, division by 1, and addition or subtraction of 0. For example, code that adds 0.0 to a number is executed exactly as written. Compile-time floating-point arithmetic is not performed, to ensure that floating-point exceptions are also maintained
• floating-point operations conform to ANSI C. When assignments to type float and double are made, the precision is rounded from 80 bits (extended) down to 32 bits (float) or 64 bits (double). When you do not specify -mp, the extra bits of precision are not always rounded before the variable is reused
• sets the -nolib_inline option, which disables inline functions expansion

-mp1 Option

Use the -mp1 option to improve floating-point precision. -mp1 disables fewer optimizations and has less impact on performance than -mp.

Options for IA-32 Only

Caution: A change of the default precision control or rounding mode (for example, by using the -pc32 flag or by user intervention) may affect the results returned by some
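Why reassociation matters can be seen with three doubles whose sum depends on the order of evaluation. The C sketch below is an illustration, not compiler output; the function names `sum_left` and `sum_right` are hypothetical, and `volatile` is used only to force each intermediate result to be rounded to a 64-bit double and to keep the compiler from folding the arithmetic.

```c
/* Evaluates (a + b) + c strictly in that order. */
static double sum_left(double a, double b, double c)
{
    volatile double t = a + b;   /* rounded to double before reuse */
    return t + c;
}

/* Evaluates a + (b + c) strictly in that order. */
static double sum_right(double a, double b, double c)
{
    volatile double t = b + c;   /* rounded to double before reuse */
    return a + t;
}
```

With a = 1e20, b = -1e20 and c = 1.0, `sum_left` yields 1.0 while `sum_right` yields 0.0, because 1.0 is lost when it is added to -1e20 first. A compiler that is free to reassociate may produce either value.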
12. omp_set_nest_lock(lock)
    Forces the executing thread to wait until the nested lock associated with lock is available. The thread is granted ownership of the nested lock when it becomes available.
omp_unset_nest_lock(lock)
    Releases the executing thread from ownership of the nested lock associated with lock if the nesting count is zero. Behavior is undefined if the executing thread does not own the nested lock associated with lock.
omp_test_nest_lock(lock)
    Attempts to set the nested lock associated with lock. If successful, returns the nesting count; otherwise returns zero.

Timing Routines

Function           Description
omp_get_wtime()    Returns a double-precision value equal to the elapsed wallclock time (in seconds) relative to an arbitrary reference time. The reference time does not change during program execution.
omp_get_wtick()    Returns a double-precision value equal to the number of seconds between successive clock ticks.

Examples of OpenMP Usage

The following examples show how to use the OpenMP feature.

A Simple Difference Operator

This example shows a simple parallel loop where the amount of work in each iteration is different. Dynamic scheduling is used to get good load balancing. The for has a nowait because there is an implicit barrier at the end of the parallel region.

    void for_1(float a[], float b[], int n)
    {
        int i, j;
    #pragma omp parallel shared(a,b,n) private(i,j)
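A complete, compilable version of this kind of loop might look like the C sketch below. The excerpt cuts off before the loop body, so the averaging body, the row-major n-by-n layout and the function name `difference` are assumptions made for illustration. Without OpenMP support enabled, the pragmas are ignored and the function runs serially with the same result.

```c
/* a and b are n-by-n arrays stored row by row; row i touches columns 0..i,
   so later rows cost more than earlier ones. */
static void difference(const float *a, float *b, int n)
{
    int i, j;
#pragma omp parallel shared(a, b, n) private(i, j)
    {
        /* Rows are handed out one at a time so that threads finishing
           short rows pick up more work. nowait skips the barrier after
           the loop; the parallel region ends with its own barrier. */
#pragma omp for schedule(dynamic, 1) nowait
        for (i = 1; i < n; i++)
            for (j = 0; j <= i; j++)
                b[j + n * i] = (a[j + n * i] + a[j + n * (i - 1)]) / 2.0f;
    }
}
```

Each row writes a distinct part of b and only reads a, so no synchronization is needed inside the loop.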
13. Description (directive names for these four rows are not part of this excerpt):
    Identifies a construct that restricts execution of the associated structured block to a single thread at a time.
    Synchronizes all the threads in a team.
    Ensures that a specific memory location is updated atomically.
    Specifies a "cross-thread" sequence point at which the implementation is required to ensure that all the threads in a team have a consistent view of certain objects in memory.

Directive Name    Description
ordered           The structured block following an ordered directive is executed in the order in which iterations would be executed in a sequential loop.
threadprivate     Makes the named file-scope or namespace-scope variables specified private to a thread but file-scope visible within the thread.

OpenMP Clauses

Clause          Description
private         Declares variables to be private to each thread in a team.
firstprivate    Provides a superset of the functionality provided by the private clause.
lastprivate     Provides a superset of the functionality provided by the private clause.
shared          Shares variables among all the threads in a team.
default         Enables you to affect the data-scope attributes of variables.
reduction       Performs a reduction on scalar variables.
ordered         The structured block following an ordered directive is executed in the order in which iterations would be executed in a sequential loop.
if              If t

Clauses listed on this page without a surviving description: schedule, copyin.
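As an illustration of the reduction clause, the C sketch below computes a dot product. The function name `dot` is hypothetical. Each thread accumulates into its own private copy of `sum`, and the copies are combined with `+` when the loop ends; when OpenMP is not enabled the pragma is ignored and the loop runs serially.

```c
/* Returns the dot product of x[0..n-1] and y[0..n-1]. */
static double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

    /* sum is a scalar reduction variable: private per thread during the
       loop, combined into the shared variable at the end. */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

Because the order in which the private copies are combined is unspecified, results for general floating-point data can differ in the last bits from a serial run.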
14. Specifier / Description

-ip_args_in_regs=0
    Disables the passing of arguments in registers. By default, external functions can pass arguments in registers when called locally. Normally, only static functions can pass arguments in registers, provided the address of the function is not taken and the function does not use a variable number of arguments.
-ip_ninl_max_stats=n
    Sets the valid max number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The number of intermediate language statements usually exceeds the actual number of source language statements. The default value for n is 230. The compiler uses a larger limit for user inline functions.
-ip_ninl_min_stats=n
    Sets the valid min number of intermediate language statements for a function that is expanded in line. The number n is a positive integer. The default value for -ip_ninl_min_stats is:
    • IA-32 compiler: -ip_ninl_min_stats=7
    • Itanium compiler: -ip_ninl_min_stats=15
-ip_ninl_max_total_stats=n
    Sets the maximum increase in size of a function, measured in intermediate language statements, due to inlining. n is a positive integer whose default value is 2000.

The following command activates procedural and interprocedural optimizations on source.cpp and sets the maximum increase in the number of intermediate language statements to 5 for each functio
15. #pragma omp for schedule(dynamic,1) nowait
        for (i = 1; i < n; i++)

Two Difference Operators

The example below uses two parallel loops fused to reduce fork/join overhead. The first for has a nowait because all the data used in the second loop is different than all the data used in the first loop.

    void for_2(float a[], float b[], float c[], float d[], int n, int m)
    {
        int i, j;
    #pragma omp parallel shared(a,b,c,d,n,m) private(i,j)
        {
    #pragma omp for schedule(dynamic,1) nowait
            for (i = 1; i < n; i++)
                for (j = 0; j <= i; j++)
                    b[j + n*i] = (a[j + n*i] + a[j + n*(i-1)]) / 2.0;
    #pragma omp for schedule(dynamic,1) nowait
            for (i = 1; i < m; i++)
                for (j = 0; j <= i; j++)
                    d[j + m*i] = (c[j + m*i] + c[j + m*(i-1)]) / 2.0;
        }
    }

Intel Extensions to OpenMP

Intel Workqueuing Model

The workqueuing model lets you parallelize control structures that are beyond the scope of those supported by the OpenMP model, while attempting to fit into the framework defined by OpenMP. In particular, the workqueuing model is a flexible mechanism for specifying units of work that are not pre-computed at the start of the worksharing construct. For single, for, and sections constructs all work units that can be executed are known at the time the construct begins execution. The workqueuing pragmas taskq and task relax this restriction by specifying an environment (the taskq) and the units of work (the tasks) separately.

Intel Exte
16. __m128d _mm_add_sd(__m128d a, __m128d b)
Adds the lower DP FP (double-precision, floating-point) values of a and b; the upper DP FP value is passed through from a.
    r0 := a0 + b0
    r1 := a1

__m128d _mm_add_pd(__m128d a, __m128d b)
Adds the two DP FP values of a and b.
    r0 := a0 + b0
    r1 := a1 + b1

__m128d _mm_sub_sd(__m128d a, __m128d b)
Subtracts the lower DP FP value of b from a. The upper DP FP value is passed through from a.
    r0 := a0 - b0
    r1 := a1

__m128d _mm_sub_pd(__m128d a, __m128d b)
Subtracts the two DP FP values of b from a.
    r0 := a0 - b0
    r1 := a1 - b1

__m128d _mm_mul_sd(__m128d a, __m128d b)
Multiplies the lower DP FP values of a and b. The upper DP FP value is passed through from a.
    r0 := a0 * b0
    r1 := a1

__m128d _mm_mul_pd(__m128d a, __m128d b)
Multiplies the two DP FP values of a and b.
    r0 := a0 * b0
    r1 := a1 * b1

__m128d _mm_div_sd(__m128d a, __m128d b)
Divides the lower DP FP values of a and b. The upper DP FP value is passed through from a.
    r0 := a0 / b0
    r1 := a1

__m128d _mm_div_pd(__m128d a, __m128d b)
Divides the two DP FP values of a and b.
    r0 := a0 / b0
    r1 := a1 / b1

__m128d _mm_sqrt_sd(__m128d a, __m128d b)
Computes the square root of the lower DP FP value of b. The up
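The difference between the scalar (_sd) and packed (_pd) forms is easiest to see side by side. The C sketch below assumes an x86 compiler with `<emmintrin.h>`; the function name `add_demo` and the sample values are made up for illustration.

```c
#include <emmintrin.h>

/* Stores the results of the scalar and packed additions of
   a = (a0, a1) = (1, 2) and b = (b0, b1) = (10, 20). */
static void add_demo(double sd[2], double pd[2])
{
    /* _mm_setr_pd takes its arguments in lane order: lane 0 first. */
    __m128d a = _mm_setr_pd(1.0, 2.0);
    __m128d b = _mm_setr_pd(10.0, 20.0);

    _mm_storeu_pd(sd, _mm_add_sd(a, b));  /* (a0 + b0, a1): upper lane from a */
    _mm_storeu_pd(pd, _mm_add_pd(a, b));  /* (a0 + b0, a1 + b1)               */
}
```

After the call, sd holds (11, 2) and pd holds (11, 22): the scalar form touches only the lower lane.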
17. Description (option names for these rows are not part of this excerpt):
    Manual creation of precompiled header (filename.pchi).
    Link using C++ run-time libraries provided with gcc (requires gcc 3.2 or above).
    Link using C++ run-time libraries provided by Intel.
    Maximize speed across the entire program. Turns on -O3, -ipo, and -static.
    Compilation is for the main executable. Absolute addressing can be used and non-position independent code generated for symbols that are at least protected.
    Enables the compiler to treat common variables as if they were defined, allowing the use of gprel addressing of common data variables.
    Generates extra code after every function call to assure the FP stack is in the expected state.
Default column for this page, as extracted: OFF, OFF, OFF, OFF, OFF, ON, OFF, OFF, OFF, OFF

Option / Description / Default

-fvisibility=[extern|default|protected|hidden|internal]
    Global symbols (common and defined data and functions) will get the visibility attribute given by default. Symbol visibility attributes explicitly set in the source code or using the symbol visibility attribute file options will override the -fvisibility setting. Default: OFF
-fvisibility-extern=file
-fvisibility-default=file
-fvisibility-protected=file
-fvisibility-hidden=file
-fvisibility-internal=file
    Space-separated symbols listed in the file argument will get visibility set t
    Default: OFF

Options listed on this page without a surviving description: -fwritable-strings, -gcc-name=name, -gcc-version=nnn.
18. • The objects produced by the compilation phase of -ipo might be linked without the -ipo option and without the use of xild.
• You want to generate an assemblable file for each source file (using -S) while compiling with -ipo. If you use -ipo with -S, but without -ipo_obj, the compiler issues a warning and an empty assemblable file is produced for each compiled source file.

Implementing the IL Files with Version Numbers

An IPO compilation consists of two parts: the compile phase and the link phase. In the compile phase, the compiler produces a file containing an intermediate language (IL) version of your code. In the link phase, the compiler reads the IL and completes the compilation, producing a real object file or executable.

Generally, different compiler versions produce IL based on different definitions, and therefore they can be incompatible. The Intel C++ Compiler assigns a unique version number with each compiler's IL definition. If a compiler attempts to read IL in a file with a version number other than its own, the compilation proceeds, but the IL is discarded and not used in the compilation. The compiler then issues a warning about an incompatible IL.

IL in Objects and Libraries: More Optimizations

The IL produced by the Intel compiler is stored in a special section of the object file. The IL stored in the object file is then placed in the library. If this library is used in an IPO compilation
19. float _Complex cis(float z);

Description: The clog function returns the complex natural logarithm of z.
Calling interface:
    double _Complex clog(double _Complex z);
    long double _Complex clogl(long double _Complex z);
    float _Complex clogf(float _Complex z);

Description: The clog2 function returns the complex logarithm base 2 of z.
Calling interface:
    double _Complex clog2(double _Complex z);
    long double _Complex clog2l(long double _Complex z);
    float _Complex clog2f(float _Complex z);

Description: The conj function returns the complex conjugate of z, by reversing the sign of its imaginary part.
Calling interface:
    double _Complex conj(double _Complex z);
    long double _Complex conjl(long double _Complex z);
    float _Complex conjf(float _Complex z);

Description: The cpow function returns the complex power function, x^y.
Calling interface:
    double _Complex cpow(double _Complex x, double _Complex y);
    long double _Complex cpowl(long double _Complex x, long double _Complex y);
    float _Complex cpowf(float _Complex x, float _Complex y);

Description: The cproj function returns a projection of z onto the Riemann sphere.
Calling interface:
    double _Complex cproj(double _Complex z);
    long double _Complex cprojl(long double _Complex z);
    float _Complex cprojf(float _Complex z);

Description: The creal function returns the real part value of z.
Calling interfa
20. [illegible] 136
Intel Extensions to OpenMP 148
Optimization Support Features 154
    Compiler Directives 154
    Optimizer Report Generation 159
    Timing Your Application 160
[illegible] 162
[illegible] 163
    Key Files Summary for IA-32 Compiler 163
    Key Files Summary for Itanium Compiler 166
Diagnostics and Messages 166
    Diagnostic Messages 168
    Language Diagnostics 168
    Suppressing Warning Messages with lint Comments 169
    Suppressing Warning Messages or Enabling Remarks 169
    Limiting the Number of Errors Reported 170
    Remark Messages 170
Intel Math Library 171
    Using the Intel Math Library 172
    Math Functions 176
Intel C++ Intrinsics Reference 199
    Introduction 199
    Intrinsics Implementation Across All IA 204
    MMX Technology Intrinsics 210
    Streaming SIMD Extensions 221
    Streaming SIMD Extensions 2 249
    New IA-32 [illegible] 283
    Intrinsics for Itanium Instructions 285
21. [...] 11, 76
[illegible] 125
asin library function 176
asind library function 176
asinh library function 179
atan library function 176
atan2 library function 176
atand library function 176
atand2 library function 176
atanh library function 179
-auto_ilp32 option 11
-ax option 11, 86, 119
.bash_profile 40
built-in functions 74
[illegible] 11
C_INCLUDE_PATH environment variable 48
[illegible] option 11
cabs library function 193
cacos library function 193
cacosh library function 193
captureprivate 150
carg library function 193
casin library function 193
casinh library function 193
catan library function 193
catanh library function 193
cbrt library function 180
ccos library function 193
ccosh library function 193
ceil library function 187
cexp library function 193
cexp10 library function 193
cimag library function 193
cis library function 193
class libraries
    floating-point vector classes 363, 364, 365, 366, 370, 371, 372, 376, 379, 380, 381, 389
    integer vector classes 340, 341, 344, 346, 348, 349, 351, 352, 354, 356, 360, 361, 362
class libraries 332, 333, 334, 33
22. output-list
    An output-list consists of one or more output-specs separated by commas. For the purposes of substitution in the asm-template, each output-spec is numbered. The first operand in the output-list is numbered 0, the second is 1, and so on. Numbering is continuous through the output-list and into the input-list. The total number of operands is limited to 10 (i.e. 0-9).
input-list
    Similar to an output-list, an input-list consists of one or more input-specs separated by commas. For the purposes of substitution in the asm-template, each input-spec is numbered, with the numbers continuing from those in the output-list.
clobber-list
    A clobber-list tells the compiler that the asm uses or changes a specific machine register that is either coded directly into the asm or is changed implicitly by the assembly instruction. The clobber-list is a comma-separated list of clobber-specs.
input-spec
    The input-specs tell the compiler about expressions whose values may be needed by the inserted assembly instruction. In order to describe fully the input requirements of the asm, you can list input-specs that are not actually referenced in the asm-template.
clobber-spec
    Each clobber-spec specifies the name of a single machine register that is clobbered. The register name may optionally be preceded by a %. The following are the valid register names: eax, ebx, ecx, edx, esi, edi, ebp, esp, ax, bx, cx, dx, si, di, b
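The operand numbering described above can be illustrated with a short GNU-style extended asm statement. This C sketch is an assumption-laden example rather than text from the guide: the function name `asm_add` is hypothetical, the AT&T-syntax template and the "cc" clobber follow GNU conventions, and a plain C fallback is used on non-x86 targets.

```c
/* Returns a + b. On x86 the sum is computed by an inline-asm template in
   which %0 is the only output-spec and %1, %2 are the input-specs,
   numbered continuously after the output-list. */
static int asm_add(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int sum;
    __asm__("movl %1, %0\n\t"    /* sum = a          */
            "addl %2, %0"        /* sum = sum + b    */
            : "=&r"(sum)         /* output-list: operand 0, written before
                                    the inputs are all read, hence & */
            : "r"(a), "r"(b)     /* input-list: operands 1 and 2 */
            : "cc");             /* clobber-list: addl changes the flags */
    return sum;
#else
    return a + b;                /* portable fallback for other targets */
#endif
}
```

The early-clobber marker on operand 0 keeps the compiler from assigning the output to the same register as operand 2, which the template still needs after the first instruction.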
23. Description (intrinsic names for these rows are not part of this excerpt):
    Compute inverse hyperbolic tangent of the argument with double precision.
    Compute inverse hyperbolic tangent of the argument with single precision.
    Computes absolute value of complex number.
    Computes smallest integral value of double precision argument not less than the argument.
    Computes smallest integral value of single precision argument not less than the argument.
    Computes the hyperbolic cosine of double precision argument.
    Computes the hyperbolic cosine of single precision argument.
    Computes absolute value of single precision argument.

Intrinsic                       Description
double floor(double)            Computes the largest integral value of the double precision argument not greater than the argument.
float floorf(float)             Computes the largest integral value of the single precision argument not greater than the argument.
double fmod(double)             Computes the floating-point remainder of the division of the first argument by the second argument with double precision.
float fmodf(float)              Computes the floating-point remainder of the division of the first argument by the second argument with single precision.
double hypot(double, double)    Computes the length of the hypotenuse of a right

Intrinsics listed on this page without a surviving description: float hypotf(float), double rint(double), float rintf(float), double sinh(double), float sinhf(float), float sqrtf(float), double tanh(double), float tanhf(float).
24. I[s|u]8vec8    I[s|u]8vec8
I32vec2 R    select_gt, select_ge, select_lt, select_le    Is32vec2
I16vec4 R                                                  Is16vec4
I8vec8 R                                                   Is8vec8

Conditional Select Operator Return Value Mapping

Debug

The debug operations do not map to any compiler intrinsics for MMX(TM) instructions. They are provided for debugging programs only. Use of these operations may result in loss of performance, so you should not use them outside of debugging.

Output

The four 32-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

    cout << Is32vec4 A;
    cout << Iu32vec4 A;
    cout << hex << Iu32vec4 A;  /* print in hex format */
    "[3]:A3 [2]:A2 [1]:A1 [0]:A0"

Corresponding Intrinsics: none

The two 32-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

    cout << Is32vec2 A;
    cout << Iu32vec2 A;
    cout << hex << Iu32vec2 A;  /* print in hex format */
    "[1]:A1 [0]:A0"

Corresponding Intrinsics: none

The eight 16-bit values of A are placed in the output buffer and printed in the following format (default in decimal):

    cout << Is16vec8 A;
    cout << Iu16vec8 A;
    cout << hex << Iu16vec8 A;  /* pr
25. Option / Description / Default

-isystemdir
    Add directory dir to the start of the system include path. Default: OFF
-ivdep_parallel
    This option indicates there is absolutely no loop-carried memory dependency in the loop where IVDEP directive is specified. Default: OFF
-Kc++
    Compile all source or unrecognized file types as C++ source files. Default: ON
-Knopic, -KNOPIC
    Deprecated. Use -fpic instead of this option. Default: ON for Itanium-based systems, OFF for IA-32
-KPIC, -Kpic
    Deprecated. Use -fpic instead of this option. Default: OFF
-Ldirectory
    Instruct linker to search directory for libraries. Default: OFF
-long_double
    Changes the default size of the long double data type from 64 to 80 bits. Default: OFF
-M
    Generates makefile dependency lines for each source file, based on the #include lines found in the source file. Default: OFF
-march=cpu
    Generate code exclusively for a given cpu. Default: OFF. Values for cpu are:
    • pentiumpro: Intel Pentium Pro processors
    • pentiumii: Intel Pentium II processors
    • pentiumiii: Intel Pentium III processors
    • pentium4: Intel Pentium 4 processors
-mcpu=cpu
    Optimize for a specific cpu. Default: ON (pentium4 on IA-32, itanium2 on Itanium-based systems). For IA-32, cpu values are:
    • pentium: Optimize for Pentium processor
    • pentiumpro: Optimize for Pentium Pro, Pentium II and Pentium III processors
    • pentium4: Optimize for Pentium 4 processo
26. SSE3

Options Quick Reference Guide

This topic provides a reference to all the compiler options and some linker control options.

• Options supported on both IA-32 and Itanium-based systems

Option | Description | Default
-A- | Disables all predefined macros. |
 | Analyze and reorder memory layout for variables and arrays. |
-[no]restrict | Enables/disables pointer disambiguation with the restrict qualifier. |
-Aname[(value)] | Associates a symbol name with the specified sequence of value. Equivalent to an #assert preprocessing directive. | OFF
-[no_]alias_args | This option implies arguments may be aliased [not aliased]. | -alias_args
-ansi | Equivalent to GNU -ansi. | OFF
-ansi_alias | See the description below. | OFF

-ansi_alias directs the compiler to assume the following:

• Arrays are not accessed out of bounds.
• Pointers are not cast to non-pointer types, and vice versa.
• References to objects of two different scalar types cannot alias. For example, an object of type int cannot alias with an object of type float, or an object of type float cannot alias with an object of type double.

If your program satisfies the above conditions, setting the -ansi_alias flag will help the compiler better optimize the program. However, if your program does not satisfy one of the above conditions, the -ansi_alias flag may lead
27. This section lists and describes the native intrinsics for Itanium instructions These intrinsics cannot be used on the IA 32 architecture The intrinsics for Itanium instructions give programmers access to Itanium instructions that cannot be generated using the standard constructs of the C and C languages The prototypes for these intrinsics are in the ia64intrin h header file Native Intrinsics for Itanium Instructions The prototypes for these intrinsics are in the ia64intrin h header file Integer Operations int64 m64 dep mr len int64 s const int pos 64 dep zr pos pos 64 dep zi const const int len const int int64 64 extr int64 r int pos const int len int64 r const m64 dep mi const int v S const int p const int int64 s const dep Deposit dep Deposit dep z Deposit dep z Deposit extr Extract 285 Intel C Compiler for Linux Systems User s Guide int64 m64 extru int64 r extr u Extract const int pos const int len m64 xmal int64 a xma 1 Fixed point multiply add using m64 xmah b int64 c the low 64 bits of the 128 bit result The result is signed m64 xmalu int64 a xma lu Fixed point multiply add using b __int64 c the low 64 bits of the 128 bit result The result is unsigned int64 a xma h Fixed point multiply add using const int count b __int6
28. b0 Oxffffffff rl al r2 a2 r3 a3 m128 mm cmpnle ost m128 a m128 Compare for not less than or equal rO a0 lt b0 Oxffffffff rl al lt b1 Oxffffffff r2 a2 lt b2 Oxffffffff r3 a3 lt D i Oxffffffff m128 mm cmpngt ss m128 a m128 Compare for not greater than rO a0 gt b0 Oxffffffff rl al r2 a2 r3 a3 m128 mm cmpngt ps m128 a m128 Compare for not greater than r0 a0 gt b0 Oxffffffff rl al gt bl Oxffffffff r2 ze l a2 gt b2 Oxffffffff r3 a3 gt b i Oxffffffff m128 mm cmpnge ss 1m128 a m128 Compare for not greater than or equal ro El al r2 a2 r3 m128 mm cmpnge ps 1m128 a a0 gt b0 Oxffffffff a3 m128 Compare for not greater than or equal rO a0 gt b0 Oxffffffff rl al gt bl Oxffffffff r2 a2 gt b2 Oxffffffff r3 a3 gt b3 Oxffffffff m128 mm cmpord ss m128 a m128 Compare for ordered rO a0 ord bO Oxffffffff rl al r2 a2 r3 a3 m128 mm cmpord ost m128 a m128 Compare for ordered rO a0 ord bO Oxffffffff rl al ord bl Oxffffffff r2 a2 ord D i Oxffffffff r3 a3 ord b3 Oxffffffff 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 229 Intel C Compiler for Linux Systems User s Guide __m128 _mm_cmpunord_ss __m128 a __m128 b Compare for unord
29. http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Pragmas.html#Pragmas
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Unnamed-Fields.html#Unnamed%20Fields
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Min-and-Max.html#Min%20and%20Max
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Volatiles.html#Volatiles
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Restricted-Pointers.html#Restricted%20Pointers
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Vague-Linkage.html#Vague%20Linkage
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/C++-Interface.html#C++%20Interface
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Template-Instantiation.html#Template%20Instantiation
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Bound-member-functions.html#Bound%20member%20functions
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/C++-Attributes.html#C++%20Attributes

gcc Language Extension | Intel Support | GNU Description and Examples
Java Exceptions | No | http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Java-Exceptions.html#Java%20Exceptions
Deprecated Features | No | http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Deprecated-Features.html#Deprecated%20Features
Backwards Compatibility | No | http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Backwards-Compatibility.html#Backwards%20Compatibility

Note: The Intel C++ Compiler supports gcc-style inline ASM if the assembler code uses AT&T System V/386 syntax, as defined in the
30. intrinsics listed in this section are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). They will not function correctly on other IA-32 processors.

• Macro Functions
• Floating-point Vector Intrinsics
• Integer Vector Intrinsics
• Miscellaneous Intrinsics

Macro Functions

The macro function intrinsics listed below are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3).

_MM_SET_DENORMALS_ZERO_MODE(x)
Macro arguments: one of _MM_DENORMALS_ZERO_ON, _MM_DENORMALS_ZERO_OFF.
This causes "denormals are zero" mode to be turned on or off by setting the appropriate bit of the control register.

_MM_GET_DENORMALS_ZERO_MODE()
No arguments. This returns the current value of the "denormals are zero" mode bit of the control register.

Floating-point Vector Intrinsics

The floating-point intrinsics listed below are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3).

Single-precision Floating-point Vector Intrinsics

extern __m128 _mm_addsub_ps(__m128 a, __m128 b);
Subtracts even vector elements while adding odd vector elements.
    r0 := a0 - b0
    r1 := a1 + b1
    r2 := a2 - b2
    r3 := a3 + b3

extern __m128 _mm_hadd_ps(__m128 a, __m128 b);
Adds adjacent vector elements.
    r0 := a0 + a1
    r1 := a2 + a3
    r2 := b0 + b1
    r3 := b2 + b3

extern __m128 _mm_hsub_ps(__m128 a, _
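The element mappings for _mm_addsub_ps and _mm_hadd_ps described above can be checked against a scalar reference model. The following is a minimal sketch in portable C++ that does not use the intrinsics themselves; the names Vec4, addsub_ref, and hadd_ref are hypothetical and exist only for this illustration.

```cpp
#include <array>

// Hypothetical four-lane vector type standing in for __m128 (element 0 is r0).
using Vec4 = std::array<float, 4>;

// Scalar model of _mm_addsub_ps: subtract in even lanes, add in odd lanes.
Vec4 addsub_ref(const Vec4& a, const Vec4& b) {
    return {a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]};
}

// Scalar model of _mm_hadd_ps: sums of adjacent pairs, first from a, then from b.
Vec4 hadd_ref(const Vec4& a, const Vec4& b) {
    return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}
```

A model like this can serve as a test oracle when code that uses the real intrinsics is run on SSE3-capable hardware.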
31. lt lt Is32vec2 I s u N vec N A B Iu32vec2 gt gt lt lt lt lt Iu32vec2 I s u N vec N A B Logical I16vec4 gt gt gt gt lt lt lt lt I16vec4 A I16vec4 B Arithmetic Is16vec4 gt gt gt gt lt lt lt lt Is16vec4 I s u N vec N A B Iu16vec4 gt gt lt lt lt lt Iu16vec4 s u N vec N A B Logical Logical Logical

Comparison Operators

The equality and inequality comparison operands can have mixed signedness, but they must be of the same size. The comparison operators for less-than and greater-than must be of the same sign and size.

Example of Syntax Usage for Comparison Operator

The nearest common ancestor is returned for compare-for-equal/not-equal operations:

    Iu8vec8 A;
    Is8vec8 B;
    I8vec8 C;
    C = cmpneq(A,B);

Type cast needed for different-sized elements for equal/not-equal comparisons:

    Iu8vec8 A, C;
    Is16vec4 B;
    C = cmpeq(A,(Iu8vec8)B);

Type cast needed for sign or size differences for less-than and greater-than comparisons:

    Iu16vec4 A;
    Is16vec4 B, C;
    C = cmpge((Is16vec4)A,B);
    C = cmpgt(B,C);

Inequality Comparison Symbols and Corresponding Intrinsics

Compare For | Operators | Syntax | Intrinsic
Equality | | | _mm_cmpeq_pi32, _mm_cmpeq_pi16, _mm_cmpeq_pi8
Inequality | cmpneq | | _mm_cmpeq_pi32, _mm_cmpeq_pi16, _mm_cmpeq_pi8, _mm_andnot_si64
32. mm cmpeq pd N A N A N A N A mm cmplt sd N A N A N A N A mm cmplt pd N A N A N A N A _mm_cmple_sd N A N A N A N A _mm_cmple_pd N A N A N A N A mm_cmpgt_sd N A N A N A N A _mm_cmpgt_pd N A N A N A N A _mm_cmpge_sd N A N A N A N A _mm_cmpge_pd N A N A N A N A _mm_cmpneq_sd N A N A N A N A mm_cmpneq_pd N A N A N A N A mm cmpnlt sd N A N A N A N A mm cmpnlt pd N A N A N A N A mm cmpnle sd N A N A N A N A mm cmpnle pd N A N A N A N A _mm_cmpngt_sd N A N A N A N A _mm_cmpngt_pd N A N A N A N A mm_cmpnge_sd N A N A N A N A _mm_cmpnge_pd N A N A N A N A 324 Intel C Intrinsics Reference Intrinsic mm com mm com mm com mm cmpord pd mm cmpord sd mm cmpunord pd mm cmpunord sd ieq sd ilt sd ile sd igt sd mm mm mm mm comige sd comineq sd ucomieq sd ucomilt sd ucomile sd ucomigt sd ucomige sd ucomineq sd cvtepi32 pd cvtpd_epi32 cvttpd epi32 cvtepi32 ps cvtps_epi32 cvttps epi32 Across MMX All IA N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A Streaming Streaming SIMD Extenions Extensions 2
33.
    spin_loop:
        pause
        cmp eax, A
        jne spin_loop

In the above example, the program spins until memory location A matches the value in register eax. The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only after the attempt to get a lock has failed.

    get_lock:
        mov eax, 1
        xchg eax, A         ; Try to get lock
        cmp eax, 0          ; Test if successful
        jne spin_loop
    critical_section:
        ; critical_section code
        mov A, 0            ; Release lock
        jmp continue
    spin_loop:
        pause               ; spin-loop hint
        cmp 0, A            ; check lock availability
        jne spin_loop
        jmp get_lock
    continue:
        ; other code

Note that the first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Since PAUSE is backwards compatible to all existing IA-32 processor generations, a test for processor type (a CPUID test) is not needed. All legacy processors will execute PAUSE as a NOP, but in processors which use PAUSE as a hint there can be a significant performance benefit.

Integer Intrinsics Using Streaming SIMD Extensions

The integer intrinsics are listed in the table below, followed by a description of each intrinsic with the most recent mnemonic naming convention. The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.

Intrinsic Alternate Ope
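The test-and-test-and-set sequence described above can also be expressed in C++ rather than assembly. The following is a minimal sketch, assuming a C++11 compiler; SpinLock and cpu_relax are hypothetical names introduced for illustration, and _mm_pause is used for the PAUSE hint only when an x86 target is detected.

```cpp
#include <atomic>

// Hypothetical helper: issue the PAUSE hint on x86 targets, do nothing elsewhere.
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }
#else
inline void cpu_relax() {}
#endif

// Hypothetical test-and-test-and-set lock mirroring the get_lock / spin_loop labels.
class SpinLock {
public:
    // One exchange attempt (the xchg step); returns true if the lock was acquired.
    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            // The attempt failed: spin on plain loads with the PAUSE hint
            // until the lock looks free, then retry the exchange.
            while (locked_.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
        }
    }

    // Release the lock (the "mov A, 0" step).
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};
```

Spinning on a plain load rather than on the exchange keeps the waiting thread from repeatedly writing to the lock's cache line, which is the point of the second "test" in test-and-test-and-set.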
34. unroll n option ses 11 unrollO option 11 unwinder library 4 60 use asm OPIION 11 use msasm option 11 use pch option ssssssseeeee 11 55 V OPW OM ORE PET IS UNE TRU 11 variables environment Hr eret deed 48 setting environment esses 40 vec report n option 11 119 vectorizer 119 120 121 122 123 124 125 129 SW Option ec etis 11 Wall option 11 Wbrief Option 11 Wcheck option ain 11 Awd option centes edet 11 CWE OPUlOM Ses onde EO 11 Werror option 11 397 ee ee EE 11 SWIODLUOnDzs idee ostro RO tw 11 Wp64 option 11 SWI OPtlOM c eee eae e anes 11 WW Option cte ee ets 11 xX option 11 31 50 51 76 81 85 119 120 SXG OpOn sis c pt Da tell 11 398 XK ChOPU OM sss is eaaet der ete 11 KUNG ene ete Eeer 94 95 97 Xlinker option cccecceesseeseeeceeseeseeeseeeteensees 11 yO library function 2 0 0 ee eeceeteeteeteeeeeeeeees 184 yl library function sene 184 yn library function 184 ZAP OPHOM el onte be Meme 11
35. Integer Arithmetic Operations for Streaming SIMD Extensions 2

The integer arithmetic operations for Streaming SIMD Extensions 2 are listed in the following table, followed by their descriptions. The packed arithmetic intrinsics for Streaming SIMD Extensions 2 are listed in the Floating-point Arithmetic Operations topic. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

[Table of intrinsics, operations (Addition, Subtraction, Computes Maxima, Computes Minima) and corresponding instructions; rows not recoverable]

__m128i _mm_add_epi8(__m128i a, __m128i b)
Adds the 16 signed or unsigned 8-bit integers in a to the 16 signed or unsigned 8-bit integers in b.
    r0 := a0 + b0
    r1 := a1 + b1
    ...
    r15 := a15 + b15

__m128i _mm_add_epi16(__m128i a, __m128i b)
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit integers in b.
    r0 := a0 + b0
    r1 := a1 + b1
    ...
    r7 := a7 + b7

__m128i _mm_add_epi32(__m128i a, __m128i b)
Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit integer
36. -O1, -O2, or -O3 is explicitly specified in the command line together with -g.

-nolib_inline | Disables inline expansion of intrinsic functions.

Note: You can turn off all optimizations for specific functions by using #pragma optimize. In the following example, all optimization is turned off for function foo():

    #pragma optimize("", off)
    foo()

Valid second arguments for #pragma optimize are "on" or "off". With the "on" argument, foo() is compiled with the same optimization as the rest of the program. The compiler ignores first argument values.

Floating-point Optimizations

Floating-point Arithmetic Precision

Options for IA-32 and Itanium-based Systems

-mp Option

The -mp option restricts optimization to maintain declared precision and to ensure that floating-point arithmetic conforms more closely to the ANSI and IEEE standards. For most programs, specifying this option adversely affects performance. If you are not sure whether your application needs this option, try compiling and running your program both with and without it to evaluate the effects on both performance and precision.

Specifying the -mp option has the following effects on program compilation:

• User variables declared as floating-point types are not assigned to registers.
• Whenever an expression is spilled (moved from a register to memory), it is spilled as 80 bits (extended precision), not 64 bits (double precision).
37. __m64 _m_psllw(__m64 m, __m64 count)
Shift four 16-bit values in m left the amount specified by count while shifting in zeros.

__m64 _m_psllwi(__m64 m, int count)
Shift four 16-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_pslld(__m64 m, __m64 count)
Shift two 32-bit values in m left the amount specified by count while shifting in zeros.

__m64 _m_pslldi(__m64 m, int count)
Shift two 32-bit values in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_psllq(__m64 m, __m64 count)
Shift the 64-bit value in m left the amount specified by count while shifting in zeros.

__m64 _m_psllqi(__m64 m, int count)
Shift the 64-bit value in m left the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_psraw(__m64 m, __m64 count)
Shift four 16-bit values in m right the amount specified by count while shifting in the sign bit.

__m64 _m_psrawi(__m64 m, int count)
Shift four 16-bit values in m right the amount specified b
38. __mi28d mm cmpneq pd m128d a Oxffffffff Ffffffff 128d b 0x0 Ml Compares the two DP FP values of a and b for inequality ro T a0 al b0 t PI __mi28d _mm_cmpnit_pd __m128d a Oxfffffffff P 0Xfftftfffff m Fffffff ETTfflIf 128d b 0x0 0x0 Compares the two DP FP values of a and b for a not less than b rO a0 lt bO Oxf fffffffffffffff 0x0 rl al lt bl Oxffffffffffffffff 0x0 __m128d _mm_cmpnle_pd __m128d a __m128d b Compares the two DP FP values of a and b for a not less than or equal to b ro rl a0 lt b0 Ox ffffff al lt bl Oxfffffff __mi28d _mm_cmpngt_pd __m128d a M fffffffff fffffffff 128d b 0x0 0x0 Compares the two DP FP values of a and b for a not greater than b ro Yd a0 gt bO Ox fffffff l al gt b1 Oxffffffff m128d mm cmpnge pd 1m128d a Compares the two DP FP values of a and b ro a0 gt b0 Oxfffffff rl al gt b1 Oxfffffff 254 M ffffffff ffffffff 128d b 0x0 0x0 for a not greater than or equal to b PEPE EE 0x0 fffffffff 0x0 Intel C Intrinsics Reference __mi28d _mm_cmpeq_sd __m128d a __m128d bi Compares the lower DP FP value of a and b for equality The upper DP FP value is passed through from a ro a0 bO Oxffffffffffffffff 0x0 rl al __mi28d _mm_cmplt_sd __m128d a __m128d b Compares the lower DP FP value of
39. addend __int64 increment new value not the original value See Note below

Note: _InterlockedSub64 is provided as a macro definition based on _InterlockedAdd64.

    #define _InterlockedSub64(target, incr) _InterlockedAdd64((target), (-(incr)))

Uses cmpxchg to do an atomic sub of the incr value to the target. Maps to a loop with the cmpxchg instruction to guarantee atomicity.

Load and Store

You can use the load and store intrinsics to force the strict memory access ordering of specific data objects. The intended use is for the case when the user suppresses the strict memory access ordering by using the serialize volatile option.

Intrinsic | Description
void __st1_rel(void *dst, const char value); | Generates an st1.rel instruction.
void __st2_rel(void *dst, const short value); | Generates an st2.rel instruction.
void __st4_rel(void *dst, const int value); | Generates an st4.rel instruction.
void __st8_rel(void *dst, const __int64 value); | Generates an st8.rel instruction.
unsigned char __ld1_acq(void *src); | Generates an ld1.acq instruction.
unsigned short __ld2_acq(void *src); | Generates an ld2.acq instruction.
unsigned int __ld4_acq(void *src); | Generates an ld4.acq instruction.
unsigned __int64 __ld8_acq(void *src); | Generates an ld8.acq instruction.

Operating System Related Intrinsics

The prototypes for these intrinsics are in the ia64intrin.h header file.

unsigned __int6
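The note above describes subtraction as an addition of the negated increment, implemented with a compare-and-exchange loop. The following is a minimal portable sketch of that idea using std::atomic; interlocked_add64 and interlocked_sub64 are hypothetical stand-ins for illustration and are not the ia64intrin.h intrinsics.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for an atomic add built from a compare-exchange loop.
// Returns the new value, not the original value.
std::int64_t interlocked_add64(std::atomic<std::int64_t>& target, std::int64_t incr) {
    std::int64_t old_value = target.load();
    // On failure, compare_exchange_weak refreshes old_value with the current
    // contents of target, so the loop retries with up-to-date data.
    while (!target.compare_exchange_weak(old_value, old_value + incr)) {
    }
    return old_value + incr;
}

// Subtraction expressed as addition of the negated increment, as in the macro above.
std::int64_t interlocked_sub64(std::atomic<std::int64_t>& target, std::int64_t incr) {
    return interlocked_add64(target, -incr);
}
```

In ordinary C++ code, fetch_add and fetch_sub on std::atomic give the same effect directly; the explicit loop is shown only to mirror the cmpxchg-based description.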
40. codecov -prj Project_Name -dpi customer.dpi -ref appTests.dpi

The coverage statistics of a differential-coverage run show the percentage of the code that was exercised on a new run but was missed in the reference run. In such cases, the coverage tool shows only the modules that included the code that was uncovered. The coloring scheme in the source views also should be interpreted accordingly.

The code that has the same coverage property (covered or not covered) on both runs is considered as covered code. Otherwise, if the new run indicates that the code was executed while in the reference run the code was not executed, then the code is treated as uncovered. On the other hand, if the code is covered in the reference run but not covered in the new run, the differential-coverage source view shows the code as covered.

Running for Differential Coverage

To run the Intel compiler code-coverage tool for differential coverage, the following files are required:

• The application sources.
• The .spi file generated by the Intel compiler when compiling the application for the instrumented binaries with the -prof_genx option.
• The .dpi file generated by the Intel compiler profmerge utility as the result of merging the dynamic profile information .dyn files, or the .dpi file generated implicitly by the Intel compiler when compiling the application with the -prof_use option.

Once the required files are available, the coverage tool may be launched from this
41. double x, double y); float powf(float x, float y);

Description: The scalb function returns x*2^y, where y is a floating-point value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double scalb(double x, double y);
long double scalbl(long double x, long double y);
float scalbf(float x, float y);

Description: The scalbn function returns x*2^n, where n is an integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double scalbn(double x, int n);
long double scalbnl(long double x, int n);
float scalbnf(float x, int n);

Description: The scalbln function returns x*2^n, where n is a long integer value.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double scalbln(double x, long int n);
long double scalblnl(long double x, long int n);
float scalblnf(float x, long int n);

Description: The sqrt function returns the correctly rounded square root.
errno: EDOM, for x < 0
Calling interface:
double sqrt(double x);
long double sqrtl(long double x);
float sqrtf(float x);

Special Functions

The Intel Math library supports the following special functions:

Description: The annuity function computes the present value factor for an annuity, (1 - (1+x)^(-y)) / x, where x is a rate and y is a p
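The scaling functions described above multiply a value by a power of two. The following is a minimal sketch using the standard C++ <cmath> versions of scalbn and scalbln, which are assumed here to behave like the library functions in this guide; the wrapper names are hypothetical.

```cpp
#include <cmath>

// Hypothetical wrapper: returns x * 2^n for an int exponent.
double scale_by_power_of_two(double x, int n) {
    return std::scalbn(x, n);
}

// Hypothetical wrapper: same operation with a long exponent.
double scale_by_power_of_two_long(double x, long n) {
    return std::scalbln(x, n);
}
```

For example, scale_by_power_of_two(3.0, 4) yields 48.0 and scale_by_power_of_two(1.0, -2) yields 0.25. Because only the exponent changes, the result is exact whenever it does not overflow or underflow.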
42. functions 0 02 OFF ofile Fefile or Fofile Name output file OFF 00 Od Disable optimizations OFF 01 01 Optimizes for speed OFF 02 02 ON P EP Preprocess to file OFF pc32 Qpc 32 Set internal FPU OFF precision to 24 bit significand pc64 Qpc 64 Set internal FPU OFF precision to 53 bit significand pc80 Qpc 80 Set internal FPU ON precision to 64 bit significand prec div Qprec div Improve precision of OFF floating point divides some speed impact prof dirdirectory Qprof dirdirectory Specify directory for OFF profiling output files dyn and dpi prof filefilename Qprof_filefilename Specify file name for OFF profiling summary file 34 Compiler Options Quick Reference Windows Description Linux Default prof gen x Qprof_genx Instrument program OFF for profiling with the x qualifier extra information is gathered Enable use of OFF profiling information during optimization Qinstall dir NA Set diras root of OFF compiler installation Qlocation str dir Qlocation tool path Setdirasthe OFF location of tool specified by st r Qoption str opts Qoption tool list Pass options opts to OFF tool specified by str Op P NA Compile and link for OFF function profiling with UNIX gprof tool Qprof_use prof use rcd Qrcd Enable fast floating OFF point to integer conversions restrict Qrestrict
43. functions do not set the errno variable. So, in code that relies upon the setting of the errno variable, you should use the -nolib_inline option, which turns off inline expansion of library functions.

Also, if one of your functions has the same name as one of the compiler's supplied library functions, the compiler assumes that it is one of the latter and replaces the call with the inlined version. Consequently, if the program defines a function with the same name as one of the known library routines, you must use the -nolib_inline option to ensure that the program's function is the one used.

Note: Automatic inline expansion of library functions is not related to the inline expansion that the compiler does during interprocedural optimizations. For example, the following command compiles the program sum.cpp without expanding the library functions, but with inline expansion from interprocedural optimizations (IPO):

    prompt> icpc -ip -nolib_inline sum.cpp

For details on IPO, see Interprocedural Optimizations.

MASM-Style Inline Assembly

The Intel C++ Compiler supports MASM-style inline assembly with the -use_msasm option. See your MASM documentation for the proper syntax.

GNU-like Style Inline Assembly (IA-32 only)

The Intel C++ Compiler supports GNU-like style inline assembly. The syntax is as follows:

    asm-keyword [ volatile-keyword ] ( asm-template [ asm-interface ] ) ;

Syntax Element | Description
asm-keyword | asm statements begin with
44. gt gt ml gt gt gt gt lal gt l gt l gt gt l gt l gt la gt 321 Intel C Compiler for Linux Systems User s Guide Intrinsic Name mm_set_ps mm_setr_ps _mm_setzero_ps mm_prefetch mm_stream pi mm_stream_ps mm_sfence m_pextrw m_pinsrw __m_pm axsw axub insw inub ovmskb ulhuw m_pshufw m maskmovq m pavgb m pavgw m psadbw 322 Alternate Name m extract pil6 m insert pil6 m max pil6 m max pu8 mm min pil6 m min pu8 mm movemask pi8 mm mulhi pul6 m shuffle pil6 _mm maskmove si64 mm avg pu8 m avg Gul mm sad pu8 Across MMX TM Technology All IA N A N A N A N A N A N A N A N A E N A N A N A d N A Z gt FER N A N N A N A Z gt N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A z z gt gt N A N A N A N A Itanium amp Architecture Streaming SIMD Extensions Streaming SIMD Extensions 2 rl rl ele ee ee S gt PLP ele ele gt er ee a ce ee ee ee l gt ee ee ee ee ee gt Intel C Intrinsics Reference Streaming SIMD Extensions 2 Intrinsics Implementation Streaming SIMD Extensions 2 operate on 128 bit quantities with 64 bit double precision floating point values The Intel Itanium processor does not support parallel double precision computation so Streaming SIMD Extensions 2 are not
45. help help Print help message listing Idirectory Idirectory Add directory to OFF include file search path Preserve the source OFF position of inlined code instead of assigning the call site source position to inlined code inline debug info Qinline debug info ip Qip Enable single file IP OFF optimizations within files ip no inlining Qip no inlining Optimize the behavior OFF of IP disable full and partial inlining requires ip or ipo Enable multifile IP OFF optimizations between files ipo Qipo Qipo obj Optimize the behavior OFF of IP force generation of real object files requires ipo KPIC NA Generate position OFF independent code same as Kpic Kpic NA Generate position OFF independent code same as KPIC long double Qlong double Enable 80 bit long double Instruct linker to OFF produce map file Generate makefile OFF dependency information 33 Intel C Compiler for Linux Systems User s Guide Linux Windows Description Linux Default mp Op Maintain floating point precision disables some optimizations mp1 Qprec Improve floating point precision speed impact is less than mp nobss init Onobss init Disable placement of OFF zero initialized variables in BSS use DATA nolib inline Disable inline OFF expansion of intrinsic
46. mode, invoke the compiler with the -openmp option:

    prompt> icpc -openmp file.cpp

Before you run the multithreaded code, you can set the number of desired threads in the OpenMP environment variable OMP_NUM_THREADS. See OpenMP Environment Variables for further information.

-openmp Option

The -openmp option enables the parallelizer to generate multithreaded code based on the OpenMP directives. The code can be executed in parallel on both uniprocessor and multiprocessor systems. The -openmp option works with both -O0 (no optimization) and any optimization level of -O1, -O2 (default), and -O3. Specifying -O0 with -openmp helps to debug OpenMP applications.

OpenMP Directive Format and Syntax

An OpenMP directive has the form:

    #pragma omp directive-name [clause, ...] newline

where:

• #pragma omp: Required for all OpenMP directives.
• directive-name: A valid OpenMP directive. Must appear after the pragma and before any clauses.
• clause: Optional. Clauses can be in any order, and repeated as necessary, unless otherwise restricted.
• newline: Required. Precedes the structured block which is enclosed by this directive.

OpenMP Diagnostics

The -openmp_report{0|1|2} option controls the OpenMP parallelizer's diagnostic levels 0, 1, or 2 as follows:

-openmp_report0 = no diagnostic information is displayed.
-openmp_report1 = display diagnostics indicating loops, regions, and s
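The directive format described above can be illustrated with a short loop. The following is a minimal sketch, assuming a compiler with OpenMP support; parallel_sum is a hypothetical name. The directive uses the form #pragma omp, then the directive name (parallel for), then a clause (reduction). When OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially.

```cpp
#include <vector>

// Hypothetical example: sums a vector with an OpenMP work-sharing loop.
double parallel_sum(const std::vector<double>& v) {
    double total = 0.0;
    const long n = static_cast<long>(v.size());
    // Each thread accumulates a private partial sum; the reduction clause
    // combines the partial sums into total when the loop finishes.
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}
```

The number of threads used at run time can then be chosen through OMP_NUM_THREADS, as noted above, without recompiling.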
47. most cases, the compiler will consider outermost loops for parallelization and innermost loops for vectorization. If deemed profitable, however, the compiler may even apply loop parallelization and vectorization to the same loop.

Note that in some cases successful loop parallelization (either automatically or by means of OpenMP directives) may affect the messages reported by the compiler for loop vectorization; for example, under the -vec_report2 option indicating loops not successfully vectorized.

Vectorization Key Programming Guidelines

The goal of vectorizing compilers is to exploit single-instruction multiple-data (SIMD) processing automatically. Review these guidelines and restrictions, see code examples in further topics, and check them against your code to eliminate ambiguities that prevent the compiler from achieving optimal vectorization.

Guidelines for loop bodies:

• Use straight-line code (a single basic block).
• Use vector data only; that is, arrays and invariant expressions on the right-hand side of assignments. Array references can appear on the left-hand side of assignments.
• Use only assignment statements.

Avoid the following in loop bodies:

• Function calls.
• Unvectorizable operations.
• Mixing vectorizable types in the same loop.
• Data-dependent loop exit conditions.

Preparing your code for vectorization

To make your code vectorizable, you will often need to make some changes to your loops. However, you should make only
48. n 3 d a3

__m64 _m_pmaxsw(__m64 a, __m64 b)
Computes the element-wise maximum of the words in a and b.
    r0 := max(a0, b0)
    r1 := max(a1, b1)
    r2 := max(a2, b2)
    r3 := max(a3, b3)

__m64 _m_pmaxub(__m64 a, __m64 b)
Computes the element-wise maximum of the unsigned bytes in a and b.
    r0 := max(a0, b0)
    r1 := max(a1, b1)
    ...
    r7 := max(a7, b7)

__m64 _m_pminsw(__m64 a, __m64 b)
Computes the element-wise minimum of the words in a and b.
    r0 := min(a0, b0)
    r1 := min(a1, b1)
    r2 := min(a2, b2)
    r3 := min(a3, b3)

__m64 _m_pminub(__m64 a, __m64 b)
Computes the element-wise minimum of the unsigned bytes in a and b.
    r0 := min(a0, b0)
    r1 := min(a1, b1)
    ...
    r7 := min(a7, b7)

int _m_pmovmskb(__m64 a)
Creates an 8-bit mask from the most significant bits of the bytes in a.
    r := sign(a7)<<7 | sign(a6)<<6 | ... | sign(a0)

__m64 _m_pmulhuw(__m64 a, __m64 b)
Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit intermediate results.
    r0 := hiword(a0 * b0)
    r1 := hiword(a1 * b1)
    r2 := hiword(a2 * b2)
    r3 := hiword(a3 * b3)

__m64 _m_pshufw(__m64 a, int n)
Returns a combination of the four words of a. The selector n must be an immediate.
    r0 := word (n & 0x3) of a
    r1 := word ((n >> 2) & 0x3) of a
    r2 := word ((n >> 4) & 0x3) of a
    r3 := word ((n >> 6) & 0x3) of a

void _m_maskmovq(__m64 d, __m64 n, char *p)
Condition
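The selector semantics of _m_pshufw described above are easy to misread, so a scalar model may help. The following is a minimal sketch in portable C++ that does not use MMX; Words4 and pshufw_ref are hypothetical names for illustration. Each 2-bit field of n, starting from the low bits, picks the source word for one result word.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the four 16-bit words of an __m64 (index 0 is word 0).
using Words4 = std::array<std::uint16_t, 4>;

// Scalar model of _m_pshufw: result word i is source word ((n >> 2*i) & 0x3).
Words4 pshufw_ref(const Words4& a, unsigned n) {
    Words4 r{};
    for (unsigned i = 0; i < 4; ++i) {
        r[i] = a[(n >> (2 * i)) & 0x3];
    }
    return r;
}
```

With this model, the selector 0xE4 (fields 3, 2, 1, 0 from high to low) leaves the words in place, and 0x1B reverses their order.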
49. #pragma distribute point
/* Distribution will start here, ignoring all loop-carried dependency */
sub a n a c il sa i b i l

Loop Unrolling Support

unroll Directive

The unroll directive (unroll(n) | nounroll) tells the compiler how many times to unroll a counted loop. The syntax for this directive is:

    #pragma unroll
    #pragma unroll(n)
    #pragma nounroll

where n is an integer constant from 0 through 255.

The unroll directive must precede the for statement for each for loop it affects. If n is specified, the optimizer unrolls the loop n times. If n is omitted, or if it is outside the allowed range, the optimizer assigns the number of times to unroll the loop. The unroll directive overrides any setting of loop unrolling from the command line.

The directive can be applied only for the innermost nested loop. If applied to the outer loops, it is ignored. The compiler generates correct code by comparing n and the loop count.

Example of unroll Directive

    #pragma unroll(4)
    for (i = 1; i < m; i++) {

Prefetching Support

prefetch Directive

The prefetch and noprefetch directives assert that the data prefetches are generated or not generated for some memory references. This affects the heuristics used in the compiler. The syntax for this directive is:

    #pragma noprefetch
    #pragma prefetch
    #pragma prefetch a,b

If the express
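As a complete illustration of the unroll directive described above, the following is a minimal sketch of a counted innermost loop with an unroll request; add_one is a hypothetical function name. Compilers that do not recognize the directive are expected to ignore it (possibly with a warning), so the result is the same whether or not unrolling takes place.

```cpp
#include <cstddef>

// Hypothetical example: adds 1.0f to each element of data.
void add_one(float* data, std::size_t count) {
    // Request that the loop body be unrolled four times; the directive
    // immediately precedes the for statement it applies to.
    #pragma unroll(4)
    for (std::size_t i = 0; i < count; ++i) {
        data[i] += 1.0f;
    }
}
```

When count is not a multiple of four, the compiler is responsible for handling the leftover iterations, which is what the text means by generating correct code by comparing n and the loop count.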
50. psubb Alternate Name mm empty mm cvtsi32 si64 mm cvtsi64 si32 mm packs pil6 mm packs pi32 mm packs pul6 Across MMX Itanium amp AITTA Technology Architecture Streaming SIMD Extensions Streaming SIMD Extensions 2 N A N A N A kk N A N A N A mm mm_ ERU uU u npac npac _unpac _unpac _unpac npac khi pig khi pil6 khi pi32 lo pig lo pil6 lo pi32 N A N A N A N A N A N A mm add pi8 mm add pil6 mm add pi32 mm adds pi8 mm adds pil6 mm adds pu8 mm adds pul16 mm sub pi8 Z gt dda gt gt gt gt gt gt j ri rrela N A 2 42 4 4 gt gt gt gt Z gt gt gt gt gt gt gt j ele ee er ee ee ee D Z gt 315 Intel C Compiler for Linux Systems User s Guide 316 Intrinsic Name psubw psubd psubsb psubsw psubusb psubusw _pmaddwd _pmulhw psraw _psrawi _psrad psradi Alternate Name m sub pil6 mm sub pi32 mm subs pi8 m subs pil6 m subs pu8 mm subs pul6 mm madd_pi16 m mulhi pil16 m sra pil6 mm srai pil6 m sra pi32 mm srai pi32 Ges N A N A N A N A N A N A N A a a a a ea a a a ee ee a Itanium Across MMX AIIA Technology Architecture Streaming SIMD Extensions Stre
51. q il

Vectorization Examples

This section contains a few simple examples of some common issues in vector programming.

Argument Aliasing: A Vector Copy

The loop in the example below, a vector copy operation, vectorizes because the compiler can prove dest[i] and src[i] are distinct.

Vectorizable Copy Due To Unproven Distinction

    void vec_copy(float *dest, float *src, int len)
    {
        int i;
        for (i = 0; i < len; i++)
        {
            dest[i] = src[i];
        }
    }

The restrict keyword in the example below indicates that the pointers refer to distinct objects. Therefore, the compiler allows vectorization without generation of multi-version code.

Using restrict to Prove Vectorizable Distinction

    void vec_copy(float *restrict dest, float *restrict src, int len)
    {
        int i;
        for (i = 0; i < len; i++)
        {
            dest[i] = src[i];
        }
    }

Data Alignment

A 16-byte or greater data structure or array should be aligned so that the beginning of each structure or array element is aligned in a way that its base address is a multiple of sixteen.

The Misaligned Data Crossing 16-Byte Boundary figure shows the effect of a data cache unit (DCU) split due to misaligned data. The code loads the misaligned data across a 16-byte boundary, which results in an additional memory access causing a six- to twelve-cycle stall. You can avoid the stalls if you know that the data is aligned and you
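The phrase "multi-version code" in the aliasing discussion above can be made concrete with a hand-written analogue. The following is a minimal sketch and not the compiler's actual output: a run-time test selects between a copy that promises no overlap and a plain fallback loop. The names ranges_disjoint and vec_copy_multiversion are hypothetical, and __restrict is assumed to be available as a compiler extension in C++ (GCC, Clang, the Intel compiler, and MSVC accept this spelling).

```cpp
#include <cstdint>

// Hypothetical helper: true when [dest, dest+len) and [src, src+len) do not overlap.
bool ranges_disjoint(const float* dest, const float* src, int len) {
    const auto d = reinterpret_cast<std::uintptr_t>(dest);
    const auto s = reinterpret_cast<std::uintptr_t>(src);
    const auto bytes = static_cast<std::uintptr_t>(len) * sizeof(float);
    return d + bytes <= s || s + bytes <= d;
}

// Hand-written analogue of multi-version code for the vector copy.
void vec_copy_multiversion(float* dest, const float* src, int len) {
    if (ranges_disjoint(dest, src, len)) {
        // Version 1: no overlap, so the no-alias promise holds and the
        // compiler is free to vectorize this loop.
        float* __restrict d = dest;
        const float* __restrict s = src;
        for (int i = 0; i < len; ++i) {
            d[i] = s[i];
        }
    } else {
        // Version 2: possible overlap, keep the original element order.
        for (int i = 0; i < len; ++i) {
            dest[i] = src[i];
        }
    }
}
```

Declaring the parameters themselves with restrict, as in the second example above, removes the need for the run-time test because the caller takes on the no-overlap guarantee.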
52. r0 := a; r1 := b; r2 := c; r3 := d

__m128 _mm_setr_ps(float a, float b, float c, float d)
Sets the four SP FP values to the four inputs in reverse order.
r0 := a; r1 := b; r2 := c; r3 := d

__m128 _mm_setzero_ps(void)
Clears the four SP FP values.
r0 := r1 := r2 := r3 := 0.0

void _mm_store_ss(float *v, __m128 a)
Stores the lower SP FP value.
*v := a0

void _mm_store_ps1(float *v, __m128 a)
Stores the lower SP FP value across four words.
v[0] := a0; v[1] := a0; v[2] := a0; v[3] := a0

void _mm_store_ps(float *v, __m128 a)
Stores four SP FP values. The address must be 16-byte aligned.
v[0] := a0; v[1] := a1; v[2] := a2; v[3] := a3

void _mm_storeu_ps(float *v, __m128 a)
Stores four SP FP values. The address need not be 16-byte aligned.
v[0] := a0; v[1] := a1; v[2] := a2; v[3] := a3

void _mm_storer_ps(float *v, __m128 a)
Stores four SP FP values in reverse order. The address must be 16-byte aligned.
v[0] := a3; v[1] := a2; v[2] := a1; v[3] := a0

__m128 _mm_move_ss(__m128 a, __m128 b)
Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a.
r0 := b0; r1 := a1; r2 := a2; r3 := a3

unsigned int _mm_getcsr(void)
Returns the contents of the control register.

void _mm_setcsr(unsigned int i)
Sets the control register to the value specified.

void _mm_prefetch(char const *a, int sel)
(uses PREFETCH) Loads on
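The set, store and move semantics listed above can be tried directly. This is a small sketch, assuming an x86 target whose compiler provides xmmintrin.h; pack_and_store and move_low are hypothetical helper names, and _mm_loadu_ps is the unaligned load counterpart of _mm_storeu_ps.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics; requires an x86 target */

/* Hypothetical helper: packs a..d with a in element 0, then stores
   the four values without requiring 16-byte alignment. */
static void pack_and_store(float a, float b, float c, float d, float out[4])
{
    assert(out != NULL);
    __m128 v = _mm_setr_ps(a, b, c, d);  /* element 0 = a ... element 3 = d */
    _mm_storeu_ps(out, v);
}

/* Hypothetical helper: element 0 comes from b, elements 1..3 from a. */
static void move_low(const float a[4], const float b[4], float out[4])
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128 r = _mm_move_ss(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, r);
}
```

The unaligned store is used here so the sketch works with ordinary stack arrays; _mm_store_ps would be the choice when the destination is known to be 16-byte aligned.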
53. select

Compare for                    Returns     Example Syntax Usage                Intrinsic
(preceding rows: Equality)                                                     _mm_cmpeq_ps, _mm_cmpeq_pd, _mm_cmpeq_ss
Inequality                     4 floats    F32vec4 R = select_neq ...          _mm_cmpneq_ps
                               2 doubles   F64vec2 R = select_neq ...          _mm_cmpneq_pd
                               1 float     F32vec1 R = select_neq ...          _mm_cmpneq_ss
Less Than                      4 floats    F32vec4 R = select_lt ...           _mm_cmplt_ps
                               2 doubles   F64vec2 R = select_lt ...           _mm_cmplt_pd
                               1 float     F32vec1 R = select_lt ...           _mm_cmplt_ss
Less Than or Equal             4 floats    F32vec4 R = select_le ...           _mm_cmple_ps
                               2 doubles   F64vec2 R = select_le(F64vec2 A ... _mm_cmple_pd
                               1 float     F32vec1 R = select_le(F32vec1 A ... _mm_cmple_ss
Greater Than                   4 floats    F32vec4 R = select_gt(F32vec4 A ... _mm_cmpgt_ps
                               2 doubles   F64vec2 R = select_gt(F64vec2 A ... _mm_cmpgt_pd
                               1 float     F32vec1 R = select_gt(F32vec1 A ... _mm_cmpgt_ss
Greater Than or Equal To       4 floats    F32vec4 R = select_ge(F32vec4 A ... _mm_cmpge_ps
                               2 doubles   F64vec2 R = select_ge(F64vec2 A ... _mm_cmpge_pd
                               1 float     F32vec1 R = select_ge(F32vec1 A ... _mm_cmpge_ss
Not Less Than                  4 floats    F32vec4 R = select_nlt(F32vec4 A ... _mm_cmpnlt_ps
                               2 doubles   F64vec2 R = select_nlt(F64vec2 A ... _mm_cmpnlt_pd
                               1 float     F32vec1 R = select_nlt(F32vec1 A ... _mm_cmpnlt_ss
Not Less Than or Equal         4 floats    F32vec4 R = select_nle(F32vec4 A ... _mm_cmpnle_ps
                               2 doubles   F64vec2 R = se
54. signedness indicates signed (s) or unsigned (u). For the Ivec class, leaving this field blank indicates an intermediate class. There are no unsigned Fvec classes; therefore, for the Fvec classes, this field is blank. bits specifies the number of bits per element. elements specifies the number of elements.

Special Terms and Conventions

The following terms are used to define the functionality and characteristics of the classes and operations defined in this manual.

• Nearest Common Ancestor: This is the intermediate or parent class of two classes of the same size. For example, the nearest common ancestor of Iu8vec8 and Is8vec8 is I8vec8. Also, the nearest common ancestor between Iu8vec8 and I16vec4 is M64.
• Casting: Changes the data type from one class to another. When an operation uses different data types as operands, the return value of the operation must be assigned to a single data type. Therefore, one or more of the data types must be converted to a required data type. This conversion is known as a typecast. Sometimes typecasting is automatic; other times you must use special syntax to typecast explicitly.
• Operator Overloading: This is the ability to use various operators on the same user-defined data type of a given class. Once you declare a variable, you can add, subtract, multiply, and perform a range of operations. Each family of classes accepts a specified range of operators, and must comply with rules and
55. truncation of the rounding mode for all floating-point calculations, including floating-point to integer conversions. Turning on this option can improve performance, but floating-point conversions to integer will not conform to C semantics.

-fp_port Option

The -fp_port option rounds floating-point results at assignments and casts. An impact on speed may result.

-fpstkchk Option

When a function call returns a floating-point value, the return value should be placed at the top of the FP stack. If the return value is unused, the compiler pops the value off the stack to keep the FP stack in the correct state. However, if the application leaves out the function's prototype or incorrectly prototypes the function, then the return value may remain on the stack. This may result in the FP stack filling up and eventually overflowing. Generally, when the FP stack overflows, a NaN value is put into FP calculations and the program's results differ. Unfortunately, the overflow point can be far away from the point of the actual bug. The -fpstkchk option places code that causes an access violation immediately after an incorrect call occurs, thus making it easier to locate these issues.

Floating-point Arithmetic Options for Itanium(R)-based Systems

The following options enable you to control the compiler optimizations for floating-point computations on Itanium-based systems:

• -ftz
• -IPF_fma
• -IPF_fp_speculat
56. Example of Profile-guided Optimization

The three basic phases of PGO are:

• Instrumentation Compilation and Linking
• Instrumented Execution
• Feedback Compilation

Instrumentation Compilation and Linking

Use -prof_gen to produce an executable with instrumented information. Also use the -prof_dir option, as recommended for most programs, especially if the application includes source files located in multiple directories. -prof_dir ensures that the profile information is generated in one consistent place. For example:

    prompt> icpc -prof_gen -prof_dir/profdata -c a1.cpp a2.cpp a3.cpp
    prompt> icpc a1.o a2.o a3.o

In place of the second command, you could use the linker directly to produce the instrumented program.

Instrumented Execution

Run your instrumented program with a representative set of data to create a dynamic information file.

    prompt> a.out

The resulting dynamic information file has a unique name and a .dyn suffix every time you run a.out. The instrumented file helps predict how the program runs with a particular set of data. You can run the program more than once with different input data.

Feedback Compilation

Compile and link the source files with -prof_use to use the dynamic information to optimize your program according to its profile:

    prompt> icpc -prof_use -ipo a1.cpp a2.cpp a3.cpp

Besides the optimization, the compiler produces a pgopt.dpi file. You typically specify t
57. 2 intrinsics have a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and enables the compiler to optimize the instruction scheduling.

The MMX(TM) technology and Streaming SIMD Extension instructions use the following new features:

• New Registers: Enable packed data of up to 128 bits in length for optimal SIMD processing.
• New Data Types: Enable packing of up to 16 elements of data in one register.

The Streaming SIMD Extensions 2 intrinsics are defined only for IA-32, not for Itanium-based systems. Streaming SIMD Extensions 2 operate on 128-bit quantities: two 64-bit double-precision floating-point values. The Itanium architecture does not support parallel double-precision computation, so Streaming SIMD Extensions 2 are not implemented on Itanium-based systems.

New Registers

A key feature provided by the architecture of the processors is the new register sets. The MMX instructions use eight 64-bit registers (mm0 to mm7), which are aliased on the floating-point stack registers.

[Figure: MMX Technology Registers]

Streaming SIMD Extensions Registers

The Streaming SIMD Extensions use eight 128-bit registers (xmm0 to xmm7).

[Figure: Streaming SIMD Extensions Registers]

These new data registers enable the processing of data elements in parallel. Because each register can h
58. 3.2/gcc/Nested-Functions.html#Nested%20Functions
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Constructing-Calls.html#Constructing%20Calls
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Naming-Types.html#Naming%20Types
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Typeof.html#Typeof
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Lvalues.html#Lvalues
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Conditionals.html#Conditionals
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Long-Long.html#Long%20Long
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Complex.html#Complex
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Hex-Floats.html#Hex%20Floats

Table columns: gcc Language Extension, Intel Support, GNU Description and Examples

gcc Language Extension: Arrays of Length Zero; Arrays of Variable Length; Macros with a Variable Number of Arguments; Slightly Looser Rules for Escaped Newlines; String Literals with Embedded Newlines; Non-Lvalue Arrays May Have Subscripts; Arithmetic on void Pointers; Arithmetic on Function Pointers; Non-Constant Initializers; Compound Literals; Designated Initializers; Cast to a Union Type; Case Ranges; Mixed Declarations and Code; Declaring Attributes of Functions; Attribute Syntax

Intel Support: Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Most

GNU Description and Examples: http://gcc.gnu.org/onlin
59. 46 MERSION 4 crt rs 46 WCHAR TYPE 46 KACKEN 46 INTEGRAL MAX BITS eee 46 BEE ette ete aes 46 PGO INSTRUMENT 46 1386 5 s om ten REA en 46 1364 ice RUE RES ERA ec 46 e 46 UNIX oer rr o ee rS 46 preprocessor ODUORS iion EE SGH 43 PROF DIR environment variable 101 prof dir option ssssssssses 11 101 PROF DUMP INTERVAL environment Variable 2 c t aeo e e 116 prof file option 11 prof format 32 option sseesseeesseeserseesseeeesee 11 prof gen x option 11 99 100 101 PROF NO CLOBBER environment variable G One E E avevo P RR 101 prof use optton 11 99 100 profile information 114 115 116 protmerge xe e ees eee 102 Qinstall option 11 Qlocation option sss 11 Intel C Intrinsics Reference Qoption option sssssesessseseesesseesessseesses 11 92 98 Qoption specifiers ssssssssseeee 92 G IE OPO ee 11 CEGEDEL 11 81 remainder library function 189 remquo library function e sseseeeeeseeeesseeeesee 189 requirements lardwate itte 3 STE TOR eo ac os ee e 3 response files i 2 eoe 50 r strict OPtiON need e e ah 76 rint library Dumepon 187 round library function 187 SS ODLOTES ee A ne tet E e T 11 scalb library Duneton 180 scalbIn library function 180 scalbn library function 180 shared libraries iion 62 shared option sssssssessssseseesseserseesersesseer
60. Classes (Part 2)

Operators      Corresponding Intrinsic   I16vec4   I8vec8   F64vec2   F32vec4   F32vec1
unpack_high    _mm_unpackhi_[x]          pi16      pi8      pd
unpack_low     _mm_unpacklo_[x]          pi16      pi8      pd
pack_sat       _mm_packs_[x]             pi16      N/A      N/A
packu_sat      _mm_packus_[x]            pu16      N/A      N/A
sat_add        _mm_adds_[x]              pi16      pi8      pd
sat_sub        _mm_subs_[x]              pi16      pi8      pi16

Conversions Operators: Corresponding Intrinsics and Classes

Operators             Corresponding Intrinsic
F64vec2ToInt          _mm_cvttsd_si32
F32vec4ToF64vec2      _mm_cvtps_pd
F64vec2ToF32vec4      _mm_cvtpd_ps
IntToF64vec2          _mm_cvtsi32_sd
F32vec4ToInt          _mm_cvtt_ss2si
F32vec4ToIs32vec2     _mm_cvttps_pi32
IntToF32vec4          _mm_cvtsi32_ss
Is32vec2ToF32vec4     _mm_cvtpi32_ps

Programming Example

This sample program uses the F32vec4 class to average the elements of a 20-element floating-point array.

    // Include Streaming SIMD Extension Class Definitions
    #include <fvec.h>

    // Shuffle any 2 single precision floating point from a
    // into low 2 SP FP and shuffle any 2 SP FP from b
    // into high 2 SP FP of destination
    #define SHUFFLE(a,b,i) (F32vec4)_mm_shuffle_ps(a,b,i)

    #include <stdio.h>
    #define SIZE 20

    // Global variables
    float result;
    _MM_ALIGN16 float array[SIZE];

    //*****************************************************
    // Function: Add20ArrayElements
61. Enable the restrict keyword for disambiguating pointers. Linux default: OFF

-S (Windows: /S)
    Generates assemblable files with .s suffix, then stops the compilation. Linux default: OFF
-sox (Windows: /Qsox)
    Enable/disable saving of compiler options and version in the executable.
-syntax (Windows: /Zs)
    Perform syntax check only. Linux default: OFF
-tpp5
    Optimize for Pentium processor. Linux default: OFF
-tpp6
    Optimize for Pentium Pro, Pentium II and Pentium III processors. Linux default: OFF

Table columns: Linux, Windows, Description, Linux Default

-tpp7
    Optimize for Pentium 4 processor. Linux default: OFF
-Uname (Windows: /Uname)
    Remove predefined macro.
-unroll0 (Windows: /Qunroll0)
    Disable loop unrolling.

Further rows whose option names did not survive: Display compiler version. Display errors (default OFF). Enable remarks, warnings and errors.

-Wbrief
    Produces less verbose diagnostics. Linux default: OFF

Control diagnostics (default OFF): display errors (n = 0); display warnings and errors (n = 1); display remarks, warnings and errors (n = 2).

-wdL1[,L2,...] (Windows: /Qwd[tag])
    Disable diagnostics L1 through LN. Linux default: OFF
-weL1[,L2,...] (Windows: /Qwe[tag])
    Change severity of diagnostics L1 through LN to error. Linux default: OFF
-wnn (Windows: /Qwn[tag])
    Print a maximum of n errors. Linux default: OFF
-Wp64 (Windows: /Wp64)
    Print diagnostics for 64-bit porting.
-wrL1[,L2,...]
    Change severity of diagnostics L1 through LN to remark. Linux default: OFF
-wwL1[,L2,...] (Windows: /Qww[tag])
    Change severity of diagnostics L1 through LN to warning. Linux default: OFF
-X (Windows: /X)
    Remove standard directories from include file sear
62. Standard Arithmetic Operator Usage

The following two tables show the return values for each class of the standard arithmetic operators, which use the syntax styles described earlier in the Return Value Notation section.

[Table: Standard Arithmetic Return Value Mapping; columns A, Operators, B, F32vec4, F64vec2, F32vec1]

The table below lists standard arithmetic operator syntax and intrinsics.

Standard Arithmetic Operations for Fvec Classes

Operation        Returns     Example Syntax Usage                                            Intrinsic
Addition         4 floats    F32vec4 R = F32vec4 A + F32vec4 B; F32vec4 R += F32vec4 A;      _mm_add_ps
                 2 doubles   F64vec2 R = F64vec2 A + F64vec2 B; F64vec2 R += F64vec2 A;      _mm_add_pd
                 1 float     F32vec1 R = F32vec1 A + F32vec1 B; F32vec1 R += F32vec1 A;      _mm_add_ss
Subtraction      4 floats    F32vec4 R = F32vec4 A - F32vec4 B; F32vec4 R -= F32vec4 A;      _mm_sub_ps
                 2 doubles   F64vec2 R = F64vec2 A - F64vec2 B; F64vec2 R -= F64vec2 A;      _mm_sub_pd
                 1 float     F32vec1 R = F32vec1 A - F32vec1 B; F32vec1 R -= F32vec1 A;      _mm_sub_ss
Multiplication   4 floats    F32vec4 R = F32vec4 A * F32vec4 B; F32vec4 R *= F32vec4 A;      _mm_mul_ps
                 2 doubles   F64vec2 R = F64vec2 A * F64vec2 B; F64vec2 R *= F64vec2 A;      _mm_mul_pd
                 1 float     F32vec
63. Itanium-based systems
• New Code Coverage and Test Prioritization Tools
• New Symbol Visibility Options
• New debug support for IPO
• Updates to Intel Math Library
• Other New Compiler Options
• New functionality for Invoking the Compiler from the Command Line

For further information on New Features, see the Release Notes.

Features and Benefits

The Intel C++ Compiler allows your software to perform best on computers based on the Intel architecture. Using new compiler optimizations, such as profile-guided optimization, prefetch instruction, and support for Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2), the Intel C++ Compiler provides high performance.

Feature                                    Benefit
High Performance                           Achieve a significant performance gain by using optimizations
Support for Streaming SIMD Extensions      Advantage of Intel microarchitecture
Automatic vectorizer                       Advantage of SIMD parallelism in your code achieved automatically
OpenMP Support                             Shared memory parallel programming
Floating-point optimizations               Improved floating-point performance
Data prefetching                           Improved performance due to the accelerated data delivery
Interprocedural optimizations              Larger application modules perform better
Profile-guided optimization                Improved performance based on profiling frequently used functions
Processor dispatch                         Taking advantage of the latest In
64. Intrinsic Name       Corresponding Instruction   Compare For
_mm_cmplt_epi16      PCMPGTWr                    Less Than
_mm_cmplt_epi32      PCMPGTDr                    Less Than

__m128i _mm_cmpeq_epi8(__m128i a, __m128i b)
Compares the 16 signed or unsigned 8-bit integers in a and the 16 signed or unsigned 8-bit integers in b for equality.
r0 := (a0 == b0) ? 0xff : 0x0
r1 := (a1 == b1) ? 0xff : 0x0
...
r15 := (a15 == b15) ? 0xff : 0x0

__m128i _mm_cmpeq_epi16(__m128i a, __m128i b)
Compares the 8 signed or unsigned 16-bit integers in a and the 8 signed or unsigned 16-bit integers in b for equality.
r0 := (a0 == b0) ? 0xffff : 0x0
r1 := (a1 == b1) ? 0xffff : 0x0
...
r7 := (a7 == b7) ? 0xffff : 0x0

__m128i _mm_cmpeq_epi32(__m128i a, __m128i b)
Compares the 4 signed or unsigned 32-bit integers in a and the 4 signed or unsigned 32-bit integers in b for equality.
r0 := (a0 == b0) ? 0xffffffff : 0x0
r1 := (a1 == b1) ? 0xffffffff : 0x0
r2 := (a2 == b2) ? 0xffffffff : 0x0
r3 := (a3 == b3) ? 0xffffffff : 0x0

__m128i _mm_cmpgt_epi8(__m128i a, __m128i b)
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for greater than.
r0 := (a0 > b0) ? 0xff : 0x0
r1 := (a1 > b1) ? 0xff : 0x0
...
r15 := (a15 > b15) ? 0xff : 0x0

__m128i _mm_cmpgt_epi16(__m128i a, __m128i b)
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for greater than.
r0 := (a0 > b0) ? 0xffff : 0x0
r1 := (a1 >
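The per-element mask produced by the equality compare is easiest to see when it is collapsed into an integer. The sketch below assumes an x86 target with emmintrin.h; equal_byte_mask is a hypothetical helper name, and _mm_loadu_si128 and _mm_movemask_epi8 are SSE2 intrinsics that are not described in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics; requires an x86 target */

/* Hypothetical helper: returns a 16-bit mask in which bit i is set
   when a[i] == b[i]. */
static int equal_byte_mask(const unsigned char a[16], const unsigned char b[16])
{
    assert(a != NULL && b != NULL);
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);  /* 0xff where equal, 0x0 elsewhere */
    return _mm_movemask_epi8(eq);         /* gathers the top bit of each byte */
}
```

For two arrays that differ only at index 3, the mask has every bit set except bit 3.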
65. Intrinsic              Across All IA   MMX(TM) Technology   Streaming SIMD Extensions   Streaming SIMD Extensions 2   Itanium(R) Architecture
_mm_setr_epi8          N/A             N/A                  N/A                         A                             N/A
_mm_setzero_si128      N/A             N/A                  N/A                         A                             N/A
_mm_store_si128        N/A             N/A                  N/A                         A                             N/A
_mm_storeu_si128       N/A             N/A                  N/A                         A                             N/A
_mm_storel_epi64       N/A             N/A                  N/A                         A                             N/A
_mm_maskmoveu_si128    N/A             N/A                  N/A                         A                             N/A
_mm_stream_pd          N/A             N/A                  N/A                         A                             N/A
_mm_stream_si128       N/A             N/A                  N/A                         A                             N/A
_mm_clflush            N/A             N/A                  N/A                         A                             N/A
_mm_lfence             N/A             N/A                  N/A                         A                             N/A
_mm_mfence             N/A             N/A                  N/A                         A                             N/A
_mm_stream_si32        N/A             N/A                  N/A                         A                             N/A
_mm_pause              N/A             N/A                  N/A                         A                             N/A

Intel C++ Class Libraries

Introduction to the Class Libraries

The Intel C++ Class Libraries enable Single-Instruction, Multiple-Data (SIMD) operations. The principle of SIMD operations is to exploit microprocessor architecture through parallel processing. The effe
66. Table columns: Intrinsic Name, Alternate Name, Across All IA, MMX(TM) Technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, Itanium(R) Architecture. Every row shown is N/A for Across All IA and for MMX Technology.

Intrinsic Name       Alternate Name
_mm_cvtps_pi8
_mm_move_ss
_mm_shuffle_ps
_mm_unpackhi_ps
_mm_unpacklo_ps
_mm_movehl_ps
_mm_movelh_ps
_mm_movemask_ps
_mm_getcsr
_mm_setcsr
_mm_loadh_pi
_mm_loadl_pi
_mm_load_ss
_mm_load_ps1         _mm_load1_ps
_mm_load_ps
_mm_loadu_ps
_mm_loadr_ps
_mm_storeh_pi
_mm_storel_pi
_mm_store_ss
_mm_store_ps
_mm_store_ps1        _mm_store1_ps
_mm_storeu_ps
_mm_storer_ps
_mm_set_ss
_mm_set_ps1          _mm_set1_ps
67. ORPD) Computes the bitwise OR of the two DP FP values of a and b.
r0 := a0 | b0
r1 := a1 | b1

__m128d _mm_xor_pd(__m128d a, __m128d b)
(uses XORPD) Computes the bitwise XOR of the two DP FP values of a and b.
r0 := a0 ^ b0
r1 := a1 ^ b1

Comparison Operations for Streaming SIMD Extensions 2

Each comparison intrinsic performs a comparison of a and b. For the packed form, the two DP FP values of a and b are compared, and a 128-bit mask is returned. For the scalar form, the lower DP FP values of a and b are compared, and a 64-bit mask is returned; the upper DP FP value is passed through from a. The mask is set to 0xffffffffffffffff for each element where the comparison is true, and 0x0 where the comparison is false. The r following the instruction name indicates that the operands to the instruction are reversed in the actual implementation. The comparison intrinsics for the Streaming SIMD Extensions 2 are listed in the following table, followed by detailed descriptions.

The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

Intrinsic Name      Corresponding Instruction   Compare For
_mm_cmpeq_pd        CMPEQPD                     Equality
_mm_cmplt_pd        CMPLTPD                     Less Than
_mm_cmple_pd        CMPLEPD                     Less Than or Equal
_mm_cmpgt_pd        CMPLTPDr                    Greater Than
_mm_cmpge_pd        CMPLEPDr                    Greater Than or Equal
_mm_cmpord_pd       CMPORDPD                    Ordered
_mm_cmpunord_pd     CMPUNORDPD                  Unordered
_mm_cmpneq_pd       CMPNEQPD
68. OpenMP C++ API, the #pragma omp parallel directive defines the parallel construct. When the master thread encounters a parallel construct, it creates a team of threads, with the master thread becoming the master of the team. The program statements enclosed by the parallel construct are executed in parallel by each thread in the team. These statements include routines called from within the enclosed statements.

The statements enclosed lexically within a construct define the static extent of the construct. The dynamic extent includes the static extent as well as the routines called from within the construct. When the #pragma omp parallel directive reaches completion, the threads in the team synchronize, the team is dissolved, and only the master thread continues execution. The other threads in the team enter a wait state. You can specify any number of parallel constructs in a single program. As a result, thread teams can be created and dissolved many times during program execution.

Using Orphaned Directives

In routines called from within parallel constructs, you can also use directives. Directives that are not in the lexical extent of the parallel construct, but are in the dynamic extent, are called orphaned directives. Orphaned directives allow you to execute major portions of your program in parallel with only minimal changes to the sequential version of the program. Using this functional
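The difference between the static and dynamic extent can be illustrated with an orphaned worksharing directive. This is a minimal sketch; scale and scale_in_parallel are hypothetical function names. When the code is compiled without OpenMP enabled, the pragmas are ignored and the loop simply runs sequentially, which is the "minimal changes to the sequential version" property described above.

```c
#include <assert.h>
#include <stddef.h>

/* The 'omp for' below is an orphaned directive: it is outside the lexical
   (static) extent of any parallel construct, and binds to whichever
   parallel region is active when scale() is called. */
static void scale(double *v, int n, double k)
{
    assert(v != NULL);
    int i;
    #pragma omp for
    for (i = 0; i < n; i++)
        v[i] *= k;
}

static void scale_in_parallel(double *v, int n, double k)
{
    #pragma omp parallel
    {
        /* scale() is in the dynamic extent of this parallel construct,
           so its loop iterations are divided among the team. */
        scale(v, n, k);
    }
}
```

Each iteration is executed exactly once whether or not a team of threads exists, so the result matches the sequential version.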
69. QNaN.

Range errors occur when a mathematically valid argument results in a function value that exceeds the range of representable values for the floating-point data type. Attempting to evaluate exp(1000) results in a range error, where the return value is INF.

When a domain or range error occurs, the following values are assigned to errno:

• domain error (EDOM): errno = 33
• range error (ERANGE): errno = 34

The following example shows how to read the errno value for an EDOM and ERANGE error.

    // errno.c
    #include <errno.h>
    #include <mathimf.h>
    #include <stdio.h>

    int main(void)
    {
        double neg_one = -1.0;
        double zero = 0.0;
        double result;

        // The natural log of a negative number is considered a domain error - EDOM
        // The call is made before printf so that errno is set before it is read
        result = log(neg_one);
        printf("log(%e) = %e and errno(EDOM) = %d \n", neg_one, result, errno);

        // The natural log of zero is considered a range error - ERANGE
        result = log(zero);
        printf("log(%e) = %e and errno(ERANGE) = %d \n", zero, result, errno);
        return 0;
    }

The output of errno.c will look like this:

    log(-1.000000e+00) = nan and errno(EDOM) = 33
    log(0.000000e+00) = -inf and errno(ERANGE) = 34

For the math functions in this section, a corresponding value for errno is listed when applicable.

Other Considerations

Some math functions are inlined automatically by the compiler. The functions actually inlined may vary and may depend on any vectorization or process
70. Same as -syntax. Default: OFF

Table columns: Option, Description, Default

-ftz
    Flushes denormal results to zero. The option is turned ON with -O3. Default: OFF
-funsigned-bitfields
    Change default bitfield type to unsigned. Default: OFF
-funsigned-char
    Change default char type to unsigned. Default: OFF
-fvisibility-default=file
    Space-separated symbols listed in the file argument will get visibility set to default. Default: OFF
-fvisibility-extern=file
    Space-separated symbols listed in the file argument will get visibility set to extern. Default: OFF
-fvisibility-hidden=file
    Space-separated symbols listed in the file argument will get visibility set to hidden. Default: OFF
-fvisibility-internal=file
    Space-separated symbols listed in the file argument will get visibility set to internal. Default: OFF
-fvisibility-protected=file
    Space-separated symbols listed in the file argument will get visibility set to protected. Default: OFF
-fvisibility=[extern|default|protected|hidden|internal]
    Global symbols (common and defined data and functions) will get the visibility attribute given by default. Symbol visibility attributes explicitly set in the source code or using the symbol visibility attribute file options will override the -fvisibility setting. Default: OFF
-fwritable-strings
    Ensure that string literals are placed in a writable data section. Default: OFF
-g
    Gene
71. The mechanism is similar to specifying parallelism using the sections pragma, but is much more flexible because it allows arbitrary code to sit between the taskq and the task pragmas, and because it allows recursive nesting of the function to build a conceptual tree of taskq queues. The recursive nesting of the taskq pragmas is a conceptual extension of OpenMP worksharing constructs to behave more like nested OpenMP parallel regions. Just like nested parallel regions, each nested workqueuing construct is a new instance and is encountered by exactly one thread. However, the major difference is that nested workqueuing constructs do not cause new threads or teams to be formed, but rather re-use the threads from the team. This permits very easy multi-algorithmic parallelism in dynamic environments, such that the number of threads need not be committed at each level of parallelism, but instead only at the top level. From that point on, if a large amount of work suddenly appears at an inner level, the idle threads from the outer level can assist in getting that work finished. For example, it is very common in server environments to dedicate a thread to handle each incoming request, with a large number of threads awaiting incoming requests. For a particular request, its size may not be obvious at the time the thread begins handling it. If the thread uses nested workqueuing constructs, and the scope of the request becomes large aft
72. __int64 _i64_rotl(__int64 value, int shift)
__int64 _i64_rotr(__int64 value, int shift)
double log10(double)
float log10f(float)
double exp(double)
float expf(float)
double pow(double, double)
float powf(float, float)
double sin(double)
float sinf(float)
double cos(double)
float cosf(float)
double tan(double)
float tanf(float)
double acos(double)
float acosf(float)
double acosh(double)
float acoshf(float)
double asin(double)
float asinf(float)
double asinh(double)
float asinhf(float)
double atan(double)
float atanf(float)
double atanh(double)
float atanhf(float)
float cabs(double)
double ceil(double)
float ceilf(float)
double cosh(double)
float coshf(float)
float fabsf(float)
double floor(double)
float floorf(float)
double fmod(double)
float fmodf(float)
double hypot(double, double)
float hypotf(float)
double rint(double)
float rintf(float)
double sinh(double)
float sinhf(float)
float sqrtf(float)
double tanh(double)
float tanhf(float)
char *_strset(char *, _int32)
int memcmp(const void *cs, const void *ct, size_t n)
void *memcpy(void *s, const void *ct, size_t n)
void *memset(void *s, int c, size_t n)
char *strcat(char *s, const char *ct)
int strcmp(const
73.     __m64 *mmx_result = (__m64 *)result;
        __m64 const *mmx_a = (__m64 const *)a;
        __m64 const *mmx_b = (__m64 const *)b;

        for (; length > 3; length -= 4)
            *mmx_result++ = _mm_add_pi16(*mmx_a++, *mmx_b++);

        /* The following code, which takes care of excess elements, is not
           needed if the array sizes passed are known to be multiples of four */
        result = (unsigned short *)mmx_result;
        a = (unsigned short const *)mmx_a;
        b = (unsigned short const *)mmx_b;

        for (; length > 0; length--)
            *result++ = *a++ + *b++;
    }

    __declspec(cpu_dispatch(pentium, pentium_MMX))
    void array_sum(int *r, int const *a, int *b, size_t l)
    {
        /* Empty function body informs the compiler to generate the
           CPU-dispatch function listed in the cpu_dispatch clause */
    }

Processor-specific Runtime Checks, IA-32 Systems

The Intel C++ Compiler optimizations take effect at run time. For IA-32 systems, the compiler enhances processor-specific optimizations by inserting a code segment in the program that performs the run-time checks described below.

Check for Supported Processor with -xN, -xB, or -xP

To prevent execution errors, the compiler inserts code in the program to check for proper processor usage. Programs compiled with options -xN, -xB, or -xP will check at run time whether they are being executed on the Intel Pentium 4 processor, Intel Pentium M processor, or the Intel Pentium 4 p
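The array-sum fragment at the start of this excerpt processes four 16-bit elements per step and then handles any leftover elements one at a time. The same loop structure can be sketched in portable C without MMX intrinsics; array_sum_u16 is a hypothetical name, and the sketch shows only the blocking pattern, not the CPU dispatch mechanism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable version: adds a[i] + b[i] into result[i],
   four elements per iteration, then finishes the remainder. */
static void array_sum_u16(unsigned short *result,
                          const unsigned short *a,
                          const unsigned short *b,
                          size_t length)
{
    assert(result != NULL && a != NULL && b != NULL);
    size_t i = 0;

    /* Main loop: one block of four elements per iteration. */
    for (; i + 4 <= length; i += 4) {
        result[i]     = (unsigned short)(a[i]     + b[i]);
        result[i + 1] = (unsigned short)(a[i + 1] + b[i + 1]);
        result[i + 2] = (unsigned short)(a[i + 2] + b[i + 2]);
        result[i + 3] = (unsigned short)(a[i + 3] + b[i + 3]);
    }

    /* Tail loop: not needed when length is a multiple of four. */
    for (; i < length; i++)
        result[i] = (unsigned short)(a[i] + b[i]);
}
```

With a length of six, the main loop handles elements 0 to 3 and the tail loop handles elements 4 and 5.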
74. a and b for a less than b. The upper DP FP value is passed through from a.
r0 := (a0 < b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmple_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than or equal to b. The upper DP FP value is passed through from a.
r0 := (a0 <= b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpgt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than b. The upper DP FP value is passed through from a.
r0 := (a0 > b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpge_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than or equal to b. The upper DP FP value is passed through from a.
r0 := (a0 >= b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpord_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for ordered. The upper DP FP value is passed through from a.
r0 := (a0 ord b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpunord_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for unordered. The upper DP FP value is passed through from a.
r0 := (a0 unord b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpneq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for inequality. The upper DP FP value is passed through from a.
r0 := (a0 != b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpnlt_sd(__m128d a, __m1
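The scalar compare semantics above (mask in the lower element, upper element passed through from a) can be observed with a short test. The sketch assumes an x86 target with emmintrin.h; low_less_than is a hypothetical helper, and _mm_set_pd, _mm_storeu_pd and _mm_movemask_pd are SSE2 intrinsics not described in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics; requires an x86 target */

/* Hypothetical helper: returns 1 when the lower element of a is less than
   the lower element of b, and writes the upper element of the compare
   result (which should equal a_hi) to *upper_out. */
static int low_less_than(double a_lo, double a_hi,
                         double b_lo, double b_hi, double *upper_out)
{
    assert(upper_out != NULL);
    __m128d a = _mm_set_pd(a_hi, a_lo);  /* second argument is element 0 */
    __m128d b = _mm_set_pd(b_hi, b_lo);
    __m128d r = _mm_cmplt_sd(a, b);      /* r0 = mask, r1 = a1 */

    double out[2];
    _mm_storeu_pd(out, r);
    *upper_out = out[1];

    /* Bit 0 of the move mask is the top bit of r0: 1 for an all-ones mask. */
    return _mm_movemask_pd(r) & 1;
}
```

The all-ones mask is not a meaningful double value, which is why the sketch inspects it through the move mask rather than reading out[0].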
75. a full 128-bit signed result. The 64-bit value c is zero-extended and added to the product. The least significant 64 bits of the sum are then returned.

__int64 _m64_xmalu(__int64 a, __int64 b, __int64 c)
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the product. The least significant 64 bits of the sum are then returned.

__int64 _m64_xmah(__int64 a, __int64 b, __int64 c)
The 64-bit values a and b are treated as signed integers and multiplied to produce a full 128-bit signed result. The 64-bit value c is zero-extended and added to the product. The most significant 64 bits of the sum are then returned.

__int64 _m64_xmahu(__int64 a, __int64 b, __int64 c)
The 64-bit values a and b are treated as unsigned integers and multiplied to produce a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the product. The most significant 64 bits of the sum are then returned.

__int64 _m64_popcnt(__int64 a)
The number of bits in the 64-bit integer a that have the value 1 are counted, and the resulting sum is returned.

__int64 _m64_shladd(__int64 a, const int count, __int64 b)
a is shifted to the left by count bits and then added to b. The result is returned.

__int64 _m64_shrp(__int64 a, __int64 b, const int count)
a and b are conc
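The behavior described for three of these intrinsics can be modeled in portable C for experimentation on non-Itanium hosts. This is only a sketch of the documented semantics; emu_popcnt, emu_shladd and emu_xmalu are hypothetical names and are not the compiler's intrinsics. Unsigned 64-bit arithmetic wraps modulo 2^64, which yields the least significant 64 bits of the full product-plus-addend.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the bits of a that have the value 1. */
static uint64_t emu_popcnt(uint64_t a)
{
    uint64_t n = 0;
    while (a != 0) {
        a &= a - 1;  /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Shifts a left by count bits, then adds b. */
static uint64_t emu_shladd(uint64_t a, int count, uint64_t b)
{
    assert(count >= 0 && count < 64);  /* keeps the C shift well defined */
    return (a << count) + b;
}

/* Least significant 64 bits of a * b + c, treating all values as unsigned. */
static uint64_t emu_xmalu(uint64_t a, uint64_t b, uint64_t c)
{
    return a * b + c;  /* wraps modulo 2^64 */
}
```

For example, (2^64 - 1) * 2 + 3 equals 2^65 + 1, so the low 64 bits returned by emu_xmalu are 1.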
76. a string. Returns s.

int strcmp(const char *, const char *)
    Compares two strings. Returns <0 if cs<ct, 0 if cs==ct, or >0 if cs>ct.
char *strcpy(char *s, const char *ct)
    Copies a string. Returns s.
size_t strlen(const char *cs)
    Returns the length of string cs.
int strncmp(char *, char *, int)
    Compares two strings, but only the specified number of characters.
char *strncpy(char *, char *, int)
    Copies a string, but only the specified number of characters.

Intrinsic Functions

The intrinsic functions listed below are common to IA-32 and the Itanium architecture.

Intrinsic                  Description
void *_alloca(int)         Allocates the buffers.
int _setjmp(jmp_buf)       A fast version of setjmp(), which bypasses the termination handling. Saves the callee-save registers, stack pointer and return address.
_exception_code(void)      Returns the exception code.
_exception_info(void)      Returns the exception information.
_abnormal_termination(void)   Can be invoked only by termination han

Further intrinsics listed in the table: void _enable(void), void _disable(void), int _bswap(int), int _in_byte(int), int _in_dword(int), int _in_word(int), int _inp(int), int _inpd(int), int _inpw(int), int _out_byte(int, int), int _out_dword(int, int), int _out_word(int, int), int _outp(int, int), int _outpd(int, int), int _outpw(int, int)
77. according to the C++ language.
    2: Enables inlining of any function. However, the compiler decides which functions to inline. Enables interprocedural optimizations and has the same effect as -ip.
    Option | Description | Default
    -ofile: Name output file. Default: OFF.
    -openmp: Enables the parallelizer to generate multi-threaded code based on the OpenMP directives. The -openmp option works with both -O0 and any optimization level of -O1, -O2 and -O3. Default: OFF.
    -openmp_report{0|1|2}: Controls the OpenMP parallelizer's diagnostic levels. Default: ON (-openmp_report1).
    -openmp_stubs: Enables OpenMP programs to compile in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked sequentially. Default: OFF.
    Options on the following page: -opt_report, -opt_report_filefilename, -opt_report_levellevel, -opt_report_phasename, -opt_report_routinesubstring, -opt_report_help.
    -opt_report: Generates an optimization report directed to stderr unless -opt_report_file is specified. Default: OFF.
    -opt_report_filefilename: Specifies the filename for the optimization report. It is not necessary to invoke -opt_report when this option is specified. Default: OFF.
    -opt_report_levellevel: Specifies the verbosity level of the output. Valid level arguments: min, med, max. If a level is not specified, min is used by default. Default: OFF.
    -opt_report_phasename: Specifies the compilation name for which reports are generated. Default: OFF. The option can be used multiple times in the same compilation
78. and kmp_realloc(). The memory allocated by these functions must also be freed by the kmp_free() function. While it is legal for the memory to be allocated by one thread and freed with kmp_free() by a different thread, this mode of operation has a slight performance penalty. See the definitions of these functions in the Memory Allocation table below.
    Memory Allocation functions: kmp_malloc(size), kmp_calloc(nelem, elsize), kmp_realloc(ptr, size), kmp_free(ptr).
    Stack Size
    Function | Description
    kmp_get_stacksize_s(): Returns the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can be changed with kmp_set_stacksize_s() prior to the first parallel region or with the KMP_STACKSIZE environment variable.
    kmp_get_stacksize(): This function is provided for backwards compatibility only. Use kmp_get_stacksize_s() for compatibility across different families of Intel processors.
    kmp_set_stacksize_s(size): Sets to size the number of bytes that will be allocated for each parallel thread to use as its private stack. This value can also be set via the KMP_STACKSIZE environment variable. In order for kmp_set_stacksize_s() to have an effect, it must be called before the beginning of the first dynamically executed parallel region in the program.
    kmp_set_stacksize(size): This function is provided for backward compatib
79. b) Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed through from a. r0 := a0 - b0; r1 := a1; r2 := a2; r3 := a3
    __m128 _mm_sub_ps(__m128 a, __m128 b): Subtracts the four SP FP values of a and b. r0 := a0 - b0; r1 := a1 - b1; r2 := a2 - b2; r3 := a3 - b3
    __m128 _mm_mul_ss(__m128 a, __m128 b): Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a. r0 := a0 * b0; r1 := a1; r2 := a2; r3 := a3
    __m128 _mm_mul_ps(__m128 a, __m128 b): Multiplies the four SP FP values of a and b. r0 := a0 * b0; r1 := a1 * b1; r2 := a2 * b2; r3 := a3 * b3
    __m128 _mm_div_ss(__m128 a, __m128 b): Divides the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a. r0 := a0 / b0; r1 := a1; r2 := a2; r3 := a3
    __m128 _mm_div_ps(__m128 a, __m128 b): Divides the four SP FP values of a and b. r0 := a0 / b0; r1 := a1 / b1; r2 := a2 / b2; r3 := a3 / b3
    __m128 _mm_sqrt_ss(__m128 a): Computes the square root of the lower SP FP value of a; the upper 3 SP FP values are passed through. r0 := sqrt(a0); r1 := a1; r2 := a2; r3 := a3
    __m128 _mm_sqrt_ps(__m128 a): Computes the square roots of the four SP FP values of a. r0 := sqrt(a0); r1 := sqrt(a1); r2 := sqrt(a2); r3 := sqrt(a3)
    __m128 _mm_rcp_ss(__m128 a): Computes the approximation of the reciprocal of the lower SP FP value of a; the
80. b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned. r := (a0 != b0) ? 0x1 : 0x0
    int _mm_ucomieq_ss(__m128 a, __m128 b): Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned. r := (a0 == b0) ? 0x1 : 0x0
    int _mm_ucomilt_ss(__m128 a, __m128 b): Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned. r := (a0 < b0) ? 0x1 : 0x0
    int _mm_ucomile_ss(__m128 a, __m128 b): Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned. r := (a0 <= b0) ? 0x1 : 0x0
    int _mm_ucomigt_ss(__m128 a, __m128 b): Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned. r := (a0 > b0) ? 0x1 : 0x0
    int _mm_ucomige_ss(__m128 a, __m128 b): Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned. r := (a0 >= b0) ? 0x1 : 0x0
    int _mm_ucomineq_ss(__m128 a, __m128 b): Compares the lower SP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned. r := (a0 != b0) ? 0x1 : 0x0
    Conversion Operations for Streaming SIMD Extensions
    The conversions operations are l
81. be 16-byte aligned. r0 := p[3]; r1 := p[2]; r2 := p[1]; r3 := p[0]
    Set Operations for Streaming SIMD Extensions
    See summary table in Summary of Memory and Initialization topic. The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
    __m128 _mm_set_ss(float w): Sets the low word of an SP FP value to w and clears the upper three words. r0 := w; r1 := r2 := r3 := 0.0
    __m128 _mm_set_ps1(float w): Sets the four SP FP values to w. r0 := r1 := r2 := r3 := w
    __m128 _mm_set_ps(float z, float y, float x, float w): Sets the four SP FP values to the four inputs. r0 := w; r1 := x; r2 := y; r3 := z
    __m128 _mm_setr_ps(float z, float y, float x, float w): Sets the four SP FP values to the four inputs in reverse order. r0 := z; r1 := y; r2 := x; r3 := w
    __m128 _mm_setzero_ps(void): Clears the four SP FP values. r0 := r1 := r2 := r3 := 0.0
    Store Operations for Streaming SIMD Extensions
    See summary table in Summary of Memory and Initialization topic. The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
    void _mm_store_ss(float *p, __m128 a): Stores the lower SP FP value. *p := a0
    void _mm_store_ps1(float *p, __m128 a): Stores the lower SP FP value across four words. p[0] := a0; p[1] := a0; p[2] := a0; p[3] := a0
    void _mm_store_ps(float *p, __m128 a): Stores four SP FP values. The address must be 16-byte aligned. p[0] := a0; p
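The argument order of the set operations is easy to get backwards, because _mm_set_ps places its last argument in the lowest element while _mm_setr_ps places its first argument there. The following is a minimal sketch of that ordering in plain C. It does not use xmmintrin.h: `vec4_model` and the `model_*` functions are hypothetical stand-ins for __m128 and the intrinsics.

```c
/* Hypothetical stand-in for __m128: r[0] is the lowest element (r0). */
typedef struct { float r[4]; } vec4_model;

/* Models _mm_set_ps: the last argument lands in r0. */
vec4_model model_set_ps(float z, float y, float x, float w)
{
    vec4_model v = { { w, x, y, z } };
    return v;
}

/* Models _mm_setr_ps: the first argument lands in r0. */
vec4_model model_setr_ps(float z, float y, float x, float w)
{
    vec4_model v = { { z, y, x, w } };
    return v;
}

/* Models _mm_set_ss: w in r0, upper three elements cleared. */
vec4_model model_set_ss(float w)
{
    vec4_model v = { { w, 0.0f, 0.0f, 0.0f } };
    return v;
}

/* Reads element i (0..3) of the model vector. */
float model_lane(vec4_model v, int i)
{
    return v.r[i];
}
```

With these models, model_set_ps(4, 3, 2, 1) and model_setr_ps(1, 2, 3, 4) describe the same vector.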
82. c
    #include <stdio.h>
    #include <mathimf.h>

    int main()
    {
        float fp32bits;
        double fp64bits;
        long double fp80bits;
        long double pi_by_four = 3.141592653589793238/4.0;

        // pi/4 radians is about 45 degrees.
        fp32bits = (float) pi_by_four;   // float approximation to pi/4
        fp64bits = (double) pi_by_four;  // double approximation to pi/4
        fp80bits = pi_by_four;           // long double (extended) approximation to pi/4

        // The sin(pi/4) is known to be 1/sqrt(2) or approximately .7071067
        printf("When x = %8.8f, sinf(x) = %8.8f \n", fp32bits, sinf(fp32bits));
        printf("When x = %16.16f, sin(x) = %16.16f \n", fp64bits, sin(fp64bits));
        // Both arguments are long double, so both conversions need the L modifier.
        printf("When x = %20.20Lf, sinl(x) = %20.20Lf \n", fp80bits, sinl(fp80bits));
        return 0;
    }

    Since the example program above includes the long double data type, be sure to include the -long_double compiler option:
    prompt> icc -long_double real_math.c
    The output of a.out will look like this:
    When x = 0.78539816, sinf(x) = 0.70710678
    When x = 0.7853981633974483, sin(x) = 0.7071067811865475
    When x = 0.78539816339744827900, sinl(x) = 0.70710678118654750275

    Example Using Complex Functions
    complex_math.c
    #include <stdio.h>
    #include <mathimf.h>

    int main()
    {
        float _Complex c32in, c32out;
        double _Complex c64in, c64out;
        double pi_by_four = 3.141592653589793238/4.0;
        c64in = 1.0 + I * pi_by_four;  // Create the double precision
83. command line: codecov -prj Project_Name -spi pgopti.spi -dpi pgopti.dpi
    The -spi and -dpi options specify the paths to the corresponding files. The Code-coverage Tool also has the following additional options for generating a link at the bottom of each HTML page to send an electronic message to a named contact, by using the -mname and -maddr options: codecov -prj Project_Name -mname John_Smith -maddr js@company.com
    Test-prioritization Tool
    The Intel compiler Test-prioritization Tool enables profile-guided optimizations to select and prioritize application tests based on prior execution profiles of the application. The tool offers a potential of significant time saving in testing and developing large-scale applications where testing is the major bottleneck. The tool can be used for both IA-32 and Itanium architectures.
    This tool lets you select and prioritize the tests that are most relevant for any subset of the application's code. When certain modules of an application are changed, the Test-prioritization Tool suggests the tests that are most probably affected by the change. The tool analyzes the profile data from previous runs of the application, discovers the dependency between the application's components and its tests, and uses this information to guide the process of testing.
    Features and Benefits
    The tool provides an effective testing hierarchy based on the application's code coverage. The advantages o
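The idea of ordering tests by the code they cover can be illustrated with a small greedy selection. This is only a sketch of one common approach, not the algorithm the Test-prioritization Tool uses: the bitmask encoding of coverage, the `MAX_TESTS` limit, and the function names are all assumptions made for the example.

```c
#include <stdint.h>

#define MAX_TESTS 32   /* arbitrary limit for this sketch */

/* Counts the bits set in x. */
static int count_bits(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Greedy ordering: repeatedly pick the test that covers the most blocks
   not yet covered by earlier picks.
   coverage[i] has bit k set when test i executed block k (hypothetical
   encoding). Test indices are written to order[]; the return value is
   the number of tests that add coverage, or -1 if n_tests is out of range. */
int prioritize_tests(const uint32_t *coverage, int n_tests, int *order)
{
    int used[MAX_TESTS] = {0};
    uint32_t covered = 0;
    int n_ordered = 0;

    if (n_tests < 0 || n_tests > MAX_TESTS)
        return -1;

    for (;;) {
        int best = -1;
        int best_gain = 0;
        for (int i = 0; i < n_tests; i++) {
            int gain;
            if (used[i])
                continue;
            gain = count_bits(coverage[i] & ~covered);
            if (gain > best_gain) {
                best = i;
                best_gain = gain;
            }
        }
        if (best < 0)
            break;   /* no remaining test adds new coverage */
        used[best] = 1;
        covered |= coverage[best];
        order[n_ordered++] = best;
    }
    return n_ordered;
}
```

Tests left out of order[] are redundant with respect to block coverage, which is where the potential time saving comes from.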
84. const int count): shift right
    Intrinsic | Corresponding Instruction | Operation
    _pmpyshr2u(__m64 a, __m64 b, const int count) | pmpyshr2.u | Parallel multiply and shift right
    _pshladd2(__m64 a, const int count, __m64 b) | pshladd2 | Parallel shift left and add
    _pshradd2(__m64 a, const int count, __m64 b) | pshradd2 | Parallel shift right and add
    _psub1uus(__m64 a, __m64 b) | psub1.uus | Parallel subtract
    _psub2uus(__m64 a, __m64 b) | psub2.uus | Parallel subtract
    __int64 _m64_czx1l(__m64 a): The 64-bit value a is scanned for a zero element from the most significant element to the least significant element, and the index of the first zero element is returned. The element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.
    __int64 _m64_czx1r(__m64 a): The 64-bit value a is scanned for a zero element from the least significant element to the most significant element, and the index of the first zero element is returned. The element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is found, the default result is 8.
    __int64 _m64_czx2l(__m64 a): The 64-bit value a is scanned for a zero element from the most significant element to the least significant element, and the index of the first zero element is returned. The element width is 16 bits, so the range of the result is from 0 - 3. If no zero element is found, the defau
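The two 8-bit zero-element scans differ only in the end they start from. The sketch below models them in portable C under one assumption that the excerpt does not spell out: the returned index counts elements in scan order, so index 0 is the first element examined. The `model_*` names are hypothetical.

```c
#include <stdint.h>

/* Scan bytes from most significant to least significant; return the
   position (in scan order) of the first zero byte, or 8 if none. */
int model_czx1l(uint64_t a)
{
    for (int i = 0; i < 8; i++) {
        unsigned byte = (unsigned)(a >> (56 - 8 * i)) & 0xFFu;
        if (byte == 0)
            return i;
    }
    return 8;
}

/* Scan bytes from least significant to most significant; return the
   position (in scan order) of the first zero byte, or 8 if none. */
int model_czx1r(uint64_t a)
{
    for (int i = 0; i < 8; i++) {
        unsigned byte = (unsigned)(a >> (8 * i)) & 0xFFu;
        if (byte == 0)
            return i;
    }
    return 8;
}
```

A typical use of the right-to-left scan is finding the terminating zero byte in eight characters of a string loaded as one 64-bit word.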
85. d F32vec4 A[int i] = float f
    Modify one of the four single-precision floating-point values of A. Permitted values of int i are 0, 1, 2, and 3. If DEBUG is enabled and int i is not one of the permitted values (0-3), a diagnostic message is printed and the program aborts. For example: F32vec4 A[3] = float f. Corresponding intrinsics: none.
    Load and Store Operators
    Loads two double-precision floating-point values, copying them into the two floating-point values of A. No assumption is made for alignment. void loadu(F64vec2 A, double *p). Corresponding intrinsic: _mm_loadu_pd
    Stores the two double-precision floating-point values of A. No assumption is made for alignment. void storeu(double *p, F64vec2 A). Corresponding intrinsic: _mm_storeu_pd
    Loads four single-precision floating-point values, copying them into the four floating-point values of A. No assumption is made for alignment. void loadu(F32vec4 A, float *p). Corresponding intrinsic: _mm_loadu_ps
    Stores the four single-precision floating-point values of A. No assumption is made for alignment. void storeu(float *p, F32vec4 A). Corresponding intrinsic: _mm_storeu_ps
    Unpack Operators for Fvec Operators
    Selects and interleaves the lower double-precision floating-point values from A and B. F64vec2 R = unpack_low(F64vec2 A, F64vec2 B). Corresponding intrinsic: _mm_unpacklo_pd(a, b)
    Selects and interleaves the higher double-precision floating-point v
86. declared with the inline keyword. Also enables inlining according to the C++ language.
    Set internal FPU precision to 64-bit significand.
    Enables the insertion of software prefetching by the compiler.
    Disables the saving of compiler options and version information in the executable file.
    Enable C99 support for C programs.
    Target optimization to the Intel Itanium 2 processor. Generated code is compatible with the Intel Itanium processor.
    Targets optimizations for the Intel Pentium 4 processors.
    Control diagnostics. Displays warnings and errors.
    Packs structures on 16-byte boundaries.
    Building and Debugging Applications
    Getting Started
    Default Behavior of the Compiler
    If you do not specify any options when you invoke the Intel C++ Compiler, the compiler uses the following default settings:
    - Produces executable output with filename a.out.
    - Invokes options specified in a configuration file first. See Configuration Files.
    - The location of shared objects is specified by the LD_LIBRARY_PATH environment variable.
    - Sets 8 bytes as the strictest alignment constraint for structures.
    - Displays error and warning messages.
    - Performs standard optimizations using the default -O2 option. See Setting Optimization Levels.
    - On operating systems that support characters in Unicode (multi-byte) format, the compiler will process file names containing these characters.
    If the compiler does not recognize a command-line option, that option
87. double _Complex z); float _Complex ccosf(float _Complex z)
    CCOSH. Description: The ccosh function returns the complex hyperbolic cosine of z. Calling interface: double _Complex ccosh(double _Complex z); long double _Complex ccoshl(long double _Complex z); float _Complex ccoshf(float _Complex z)
    CEXP. Description: The cexp function computes e**z. Calling interface: double _Complex cexp(double _Complex z); long double _Complex cexpl(long double _Complex z); float _Complex cexpf(float _Complex z)
    CEXP10. Description: The cexp10 function computes 10**z. Calling interface: double _Complex cexp10(double _Complex z); long double _Complex cexp10l(long double _Complex z); float _Complex cexp10f(float _Complex z)
    CIMAG. Description: The cimag function returns the imaginary part value of z. Calling interface: double cimag(double _Complex z); long double cimagl(long double _Complex z); float cimagf(float _Complex z)
    Functions on the following page: CIS, CISD, CLOG, CLOG2, CONJ, CPOW
    CIS. Description: The cis function returns the cosine and sine (as a complex value) of z measured in radians. Calling interface: double _Complex cis(double z); long double _Complex cisl(long double z); float _Complex cisf(float z)
    CISD. Description: The cisd function returns the cosine and sine (as a complex value) of z measured in degrees. Calling interface: double _Complex cisd(double z); long double _Complex cisdl(long double z)
88. double x, double y); long double nexttowardl(long double x, long double y); float nexttowardf(float x, float y)
    SIGNBIT. Description: The signbit function returns a non-zero value if and only if the sign of x is negative. Calling interface: int signbit(double x); int signbitl(long double x); int signbitf(float x)
    SIGNIFICAND. Description: The significand function returns the significand of x in the interval [1,2). For x equal to zero, NaN, or +/- infinity, the original x is returned. Calling interface: double significand(double x); long double significandl(long double x); float significandf(float x)
    Complex Functions
    The Intel Math library supports the following complex functions.
    CABS. Description: The cabs function returns the complex absolute value of z. Calling interface: double cabs(double _Complex z); long double cabsl(long double _Complex z); float cabsf(float _Complex z)
    Functions on the following page: CACOS, CACOSH, CARG, CASIN, CASINH, CATAN
    CACOS. Description: The cacos function returns the complex inverse cosine of z. Calling interface: double _Complex cacos(double _Complex z); long double _Complex cacosl(long double _Complex z); float _Complex cacosf(float _Complex z)
    CACOSH. Description: The cacosh function returns the complex inverse hyperbolic cosine of z. Calling interface: double _Complex cacosh(double _Complex z); long double _Complex cacoshl(long do
89. Index entries (page numbers refer to the printed guide): -par_report option 11, 134, 135; -par_threshold n option 11, 134, 135; -parallel option 11, 120, 134; PATH environment variable 48; -pc32 option 11; -pc64 option 11; -pc80 option 11; -pch option 11, 55; -pch_dir option 11, 55; pow library function 180; -prec_div option 11, 80, 81; precompiled headers 55; precompiled headers, organizing source files for 55; predefined macros 46, including __EDG__, __EDG_VERSION__, __EXTENSION__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__, __GXX_ABI_VERSION, __HONOR_STD, __INTEL_COMPILER, linux, __LONG_DOUBLE_SIZE__, __NO_MATH_INLINES, __NO_STRING_INLINES, __OPTIMIZE__, __PTRDIFF_TYPE__, __REGISTER_PREFIX__, __SIGNED_CHARS__, __SIZE_TYPE__, __STDC_HOSTED__, __USER_LABEL_PREFIX
90. executed branches that are difficult to predict at compile time. An example is code that is heavy with error-checking in which the error conditions are false most of the time. The cold error-handling code can be placed such that the branch is rarely mispredicted. Eliminating the interleaving of hot and cold code improves instruction cache behavior. For example, the use of PGO often enables the compiler to make better decisions about function inlining, thereby increasing the effectiveness of interprocedural optimizations.
    PGO Phases
    The PGO methodology requires three phases:
    - Phase 1: Instrumentation compilation and linking with -prof_gen[x]
    - Phase 2: Instrumented execution by running the executable
    - Phase 3: Feedback compilation with -prof_use
    A key factor in deciding whether you want to use PGO lies in knowing which sections of your code are the most heavily used. If the data set provided to your program is very consistent and it elicits a similar behavior on every execution, then PGO can probably help optimize your program execution. However, different data sets can cause different algorithms to be called. This can cause the behavior of your program to vary from one execution to the next. In cases where your code behavior differs greatly between executions, PGO may not provide noticeable benefits. You have to ensure that the benefit of the profile information is worth the effort required to maintain up-to-date profiles. When using pro
91. Index entries (page numbers refer to the printed guide): #pragma distribute point 155; #pragma hdrstop 55; #pragma ivdep 117, 125, 157; #pragma loop count 155; #pragma noprefetch 156; #pragma noswp 154; #pragma nounroll 156; #pragma novector 125, 157; #pragma omp 137, 140, 147; #pragma optimize 80; #pragma prefetch 156; #pragma swp 154; #pragma taskq 150, 153; #pragma unroll 156; #pragma vector 125, 157; -[no]align option 11; -[no]restrict option 11; __GNUC__ predefined macro 71; __GNUC_MINOR__ predefined macro 71; __GNUC_PATCHLEVEL__ predefined macro 71; __TIME__ macro 76; acos library function 176; acosd library function 176; acosh library function 179; -alias_args option 11; -align option 60; alignment 54; alternate tools and path 53; -Aname[(value)] option 11; annuity library function 184; -ansi option 11, 76; ANSI/ISO standard 76; -ansi_alias option
92. file instead of stdout. Unlike the -E option, the output from -P does not include #line number directives. By default, the preprocessor creates the name of the output file using the prefix of the source file name with a .i extension. You can change this by using the -ofile option. For example, the following command creates two files named prog1.i and prog2.i, which you can use as input to another compilation:
    prompt> icpc -P prog1.cpp prog2.cpp
    Caution: When you use the -P option, any existing files with the same name and extension are overwritten.
    -EP
    Using the -EP option directs the preprocessor to not include #line directives in the output. -EP is equivalent to -E -P.
    prompt> icpc -EP prog1.cpp prog2.cpp
    Preserving Comments in Preprocessed Source Output
    Use the -C option to preserve comments in your preprocessed source output. Comments following preprocessing directives, however, are not preserved.
    Preprocessing Directive Equivalents
    You can use the -A, -D, and -U options as equivalents to preprocessing directives:
    - -A is equivalent to a #assert preprocessing directive
    - -D is equivalent to a #define preprocessing directive
    - -U is equivalent to a #undef preprocessing directive
    -A
    Use the -A option to make an assertion. Syntax: -Aname[(value)]
    Argument | Description
    name: Indicates an identifier for the assertion.
    value: Indicates a value for the assertion. If a value is s
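The equivalence between -D and #define is easiest to see in a source file that supplies a default only when the command line has not. A minimal sketch follows; `BUFFER_SIZE` and `buffer_size` are hypothetical names chosen for the example.

```c
/* Building with -DBUFFER_SIZE=1024 has the same effect as placing
   "#define BUFFER_SIZE 1024" ahead of this point in the source file. */
#ifndef BUFFER_SIZE
#define BUFFER_SIZE 256   /* default used when no -D option is given */
#endif

/* Returns whichever value the preprocessor ended up with. */
int buffer_size(void)
{
    return BUFFER_SIZE;
}
```

Compiled without any -D option, buffer_size() returns 256. Adding -UBUFFER_SIZE after a -D on the same command line would remove the definition again, so the default applies.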
93. float.h
    Descriptions (continued from the previous page): SSE intrinsics for Class Libraries; Standard header file; MMX instructions intrinsics for Class Libraries.
    Files, in table order: limits.h, mathf.h, mathimf.h, mmintrin.h, omp.h, omp_lib.h, pgouser.h, pmmintrin.h, proto.h, sse2mmx.h, stdarg.h, stdbool.h, stddef.h, syslimits.h, varargs.h, xarg.h, xmm_func.h, xmm_utils.h, xmmintrin.h
    Descriptions, in table order: Standard header file; Principal header file for legacy Intel Math Library; Principal header file for current Intel Math Library; Intrinsics for MMX instructions; Principal header file OpenMP; Header file for OpenMP; For use in the instrumentation compilation phase of profile-guided optimizations; Principal header file for Streaming SIMD Extensions 3 intrinsics; Principal header file for Streaming SIMD Extensions 2 intrinsics; Replacement header for standard stdarg.h; Defines _Bool keyword; Standard header file; Replacement header for standard varargs.h; Header file used by stdargs.h and varargs.h; Header file for Streaming SIMD Extensions; Utilities for Streaming SIMD Extensions; Principal header file for Streaming SIMD Extensions intrinsics.
    Key Files
    /lib Files
    Library: libguide.a, libguide.so, libguide_stats.a, libguide_stats.so, libompstub.a, libsvml.a, libirc.a, libircmt.a, libimf.a, libimf.so, libcprts.a, libcprts.so, libcprts.so.3, libunwind.a, libunwind.so, libunwind.so.3, libc
94. for the designated program. If the argument is a command-line option, you must include the hyphen. If the argument contains a space or tab character, you must enclose the entire argument in quotation characters (""). You must separate multiple arguments with commas. The following example directs the linker to create a memory map when the compiler produces the executable file from the source:
    prompt> icpc -Qoption,link,-map,proto.map proto.cpp
    The -Qoption,link option in the preceding example is passing the -map option to the linker. This is an explicit way to pass arguments to other tools in the compilation process. Also, you can use -Xlinker val to pass values (val) to the linker.
    Monitoring Data Settings
    The options described below provide monitoring of Intel compiler-generated code.
    Specifying Structure Tag Alignments
    You can specify an alignment constraint for structures and unions in two ways:
    - Place a pack pragma in your source file, or
    - Enter the alignment option on the command line
    Both specifications change structure tag alignment constraints.
    Flushing Denormal Values to Zero for Itanium-based Systems Only
    Option -ftz flushes denormal results to zero when the application is in the gradual underflow mode. Use this option if the denormal values are not critical to application behavior. Flushing the denormal values to zero with -ftz may improve performance of you
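The effect of a pack pragma on structure layout can be seen by comparing the same members with and without it. The sketch below is an illustration under stated assumptions: the structure and function names are hypothetical, and the push/pop form of the pragma is assumed to be accepted by the compiler in use, as it is by current GCC-compatible compilers.

```c
#include <stddef.h>

/* Default layout: the compiler may insert padding after tag so that
   value is naturally aligned. */
struct natural_layout {
    char tag;
    int  value;
};

/* Packed layout: members are placed on 1-byte boundaries, so no padding
   is inserted between tag and value. */
#pragma pack(push, 1)
struct packed_layout {
    char tag;
    int  value;
};
#pragma pack(pop)

size_t natural_size(void)        { return sizeof(struct natural_layout); }
size_t packed_size(void)         { return sizeof(struct packed_layout); }
size_t packed_value_offset(void) { return offsetof(struct packed_layout, value); }
```

Packing saves space but can make member access slower or, on some architectures, require unaligned loads, which is one reason to restrict the pragma to the structures that need it instead of using a command-line option for the whole compilation.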
95. form. r0 := (int)a0; r1 := (int)a1
    __m128 _mm_cvt_si2ss(__m128, int): Convert the 32-bit integer value b to an SP FP value; the upper three SP FP values are passed through from a. r0 := (float)b; r1 := a1; r2 := a2; r3 := a3
    __m128 _mm_cvt_pi2ps(__m128, __m64): Convert the two 32-bit integer values in packed form in b to two SP FP values; the upper two SP FP values are passed through from a. r0 := (float)b0; r1 := (float)b1; r2 := a2; r3 := a3
    inline __m128 _mm_cvtpi16_ps(__m64 a): Convert the four 16-bit signed integer values in a to four single-precision FP values. r0 := (float)a0; r1 := (float)a1; r2 := (float)a2; r3 := (float)a3
    inline __m128 _mm_cvtpu16_ps(__m64 a): Convert the four 16-bit unsigned integer values in a to four single-precision FP values. r0 := (float)a0; r1 := (float)a1; r2 := (float)a2; r3 := (float)a3
    inline __m128 _mm_cvtpi8_ps(__m64 a): Convert the lower four 8-bit signed integer values in a to four single-precision FP values. r0 := (float)a0; r1 := (float)a1; r2 := (float)a2; r3 := (float)a3
    inline __m128 _mm_cvtpu8_ps(__m64 a): Convert the lower four 8-bit unsigned integer values in a to four single-precision FP values. r0 := (float)a0; r1 := (float)a1; r2 := (float)a2; r3 := (float)a3
    inline __m128 _mm_cvtpi32x2_ps(__m64 a, __m64 b): Convert the two 32-bit signed integer values in a and the two 32-bit signed integer
96. -fvisibility-default=file
    - -fvisibility-protected=file
    - -fvisibility-hidden=file
    - -fvisibility-internal=file
    where file is the pathname of a file containing a list of the symbol names whose visibility you wish to set. The symbol names in the file are separated by white space (blanks, TAB characters, or newlines). For example, the command-line option -fvisibility-protected=prot.txt, where file prot.txt contains the symbols a, b, c, d, and e, sets protected visibility for symbols a, b, c, d, and e. This has the same effect as __attribute__ ((visibility("protected"))) on the declaration for each of the symbols. Note that these two ways to explicitly set visibility are mutually exclusive: you may use __attribute__ ((visibility())) on the declaration, or specify the symbol name in a file, but not both.
    You can set the default visibility for symbols using one of the command-line options:
    - -fvisibility=external
    - -fvisibility=default
    - -fvisibility=protected
    - -fvisibility=hidden
    - -fvisibility=internal
    This option sets the visibility for symbols not specified in a visibility list file and that do not have __attribute__ ((visibility())) in their declaration. For example, the command-line options -fvisibility=protected -fvisibility-default=prot.txt, where file prot.txt is as previously described, will cause all global symbols except a, b, c, d, and e to have protected visibility. Those five symbols, however, will have default visibility and thus b
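Setting visibility on a declaration, the alternative to a symbol list file, looks like this in source. The sketch uses the GCC-style attribute spelling with two hypothetical functions; it uses hidden and default visibility only, since those are accepted on the widest range of targets.

```c
/* Hidden: callable from other files of the same shared object, but not
   exported from it. helper_scale is a hypothetical internal helper. */
__attribute__((visibility("hidden")))
int helper_scale(int x)
{
    return 2 * x;
}

/* Default: exported from the shared object. api_compute is a
   hypothetical public entry point. */
__attribute__((visibility("default")))
int api_compute(int x)
{
    return helper_scale(x) + 1;
}
```

Because the attribute and the list file are mutually exclusive ways to set visibility, neither of these two names should also appear in a file passed to one of the file-based options.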
97. implemented on Itanium-based systems.
    Key to the table entries:
    - A = Expected to give significant performance gain over non-intrinsic-based code equivalent.
    - B = Non-intrinsic-based source code would be better; the intrinsic's implementation may map directly to native instructions, but they offer no significant performance gain.
    - C = Requires contorted implementation for particular microarchitecture. Will result in very poor performance if used.
    Table columns: Intrinsic | Across All IA | MMX Technology | Streaming SIMD Extensions | Streaming SIMD Extensions 2 | Itanium Architecture
    The following intrinsics all have the same entries (Across All IA: N/A; MMX Technology: N/A; Streaming SIMD Extensions: N/A; Streaming SIMD Extensions 2: A; Itanium Architecture: N/A): _mm_add_sd, _mm_add_pd, _mm_sub_sd, _mm_sub_pd, _mm_mul_sd, _mm_mul_pd, _mm_sqrt_sd, _mm_sqrt_pd, _mm_div_sd, _mm_div_pd, _mm_min_sd, _mm_min_pd, _mm_max_sd, _mm_max_pd, _mm_and_pd, _mm_andnot_pd
    Table continued on the following page: _mm_or_pd: N/A, N/A, N/A, N/A; _mm_xor_pd: N/A, N/A, N/A, N/A; _mm_cmpeq_sd: N/A, N/A, N/A, N/A
98. in the following tables apply when A and B are of different classes.
    Ivec Logical Operator Overloading
    Columns: Return (R) | AND | OR | XOR | NAND | A Operand | B Operand
    The operators in each row are &, |, ^, and andnot. Rows (return class; A operand): I64vec1 R; I[s|u]64vec2 A. I64vec2 R; I[s|u]64vec2 A. I32vec2 R; I[s|u]32vec2 A. I32vec4 R; I[s|u]32vec4 A. I16vec4 R; I[s|u]16vec4 A. I16vec8 R; I[s|u]16vec8 A. I8vec8 R; I[s|u]8vec8 A. I8vec16 R; I[s|u]8vec16 A. The B operand column lists the matching classes (I[s|u]64vec2 B through I[s|u]8vec16 B).
    For logical operators with assignment, the return value of R is always the same data type as the pre-declared value of R, as listed in the table that follows.
    Ivec Logical Operator Overloading with Assignment
    Columns: Return Type | Left Side (R) | AND | OR | XOR | Right Side (Any Ivec Type)
    Return types and left sides: I128vec1 (I128vec1 R), I64vec1 (I64vec1 R), I64vec2 (I64vec2 R), I[x]32vec4 (I[x]32vec4 R), I[x]32vec2 (I[x]32vec2 R), I[x]16vec8 (I[x]16vec8 R), I[x]16vec4 (I[x]16vec4 R), I[x]8vec16 (I[x]8vec16 R), I[x]8vec8 (I[x]8vec8 R). The operators are &=, |=, and ^=; the right side is any I[s|u][N]vec[N] A.
99. int mask)
    Descriptions for the entries that begin before this excerpt: Sets the user mask bits of PSR. Maps to the sum imm24 instruction. Resets the user mask.
    __int64 _ReturnAddress(void): Get the caller's address.
    void __lfetch(int lfhint, void *y): Generate the lfetch.lfhint instruction. The value of the first argument specifies the hint type.
    void __lfetch_fault(int lfhint, void *y): Generate the lfetch.fault.lfhint instruction. The value of the first argument specifies the hint type.
    void __lfetch_excl(int lfhint, void *y): Generate the lfetch.excl.lfhint instruction. The value {0|1|2|3} of the first argument specifies the hint type.
    void __lfetch_fault_excl(int lfhint, void *y): Generate the lfetch.fault.excl.lfhint instruction. The value of the first argument specifies the hint type.
    unsigned int __cacheSize(unsigned int cacheLevel): __cacheSize(n) returns the size in bytes of the cache at level n. 1 represents the first-level cache. 0 is returned for a non-existent cache level. For example, an application may query the cache size and use it to select block sizes in algorithms that operate on matrices.
    void __memory_barrier(void): Creates a barrier across which the compiler will not schedule any data access instruction. The compiler may allocate local data in registers across a memory barrier, but not global data.
    void __ssm(int mask): Sets the system mask. Maps to the ssm imm24 instruction.
    void __rsm(int mask): Resets the system mas
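The cache-size entry mentions choosing block sizes for matrix algorithms. The sketch below shows one way an application might turn a cache size in bytes into a square tile dimension. It is an illustration only: `choose_block_dim` is a hypothetical helper, the rule that three tiles should fit in the cache at once (two inputs and one output of a blocked matrix multiply) is an assumption, and the fallback value is arbitrary. The cache size is passed in as a parameter so the code does not depend on the Itanium-only intrinsic.

```c
#include <stddef.h>

/* Returns the largest b such that three b-by-b tiles of elem_bytes-sized
   elements fit in cache_bytes. A cache_bytes of 0 (the value reported for
   a non-existent cache level) yields an arbitrary fallback. */
size_t choose_block_dim(size_t cache_bytes, size_t elem_bytes)
{
    const size_t fallback = 16;
    size_t budget;   /* elements available to each tile */
    size_t b = 1;

    if (cache_bytes == 0 || elem_bytes == 0)
        return fallback;

    budget = cache_bytes / (3 * elem_bytes);
    while ((b + 1) * (b + 1) <= budget)
        b++;
    return b;
}
```

On a system with the intrinsic available, the first argument would be the value obtained for the cache level the algorithm is tuned for.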
100. is ignored and a warning is displayed. See Diagnostic Messages for detailed descriptions about system messages.
    Compilation Phases
    To produce an executable file, the compiler performs by default the compile and link phases. When invoked, the compiler driver determines which compilation phases to perform based on the file name extension and the compilation options specified in the command line. The compiler passes object files and any unrecognized file name to the linker. The linker then determines whether the file is an object file (.o) or a library (.a). The compiler driver handles all types of input files correctly, thus it can be used to invoke any phase of compilation.
    The relationship of the compiler to system-specific programming support tools is presented in the diagram below.
    [Figure: Application Development Cycle]
    Building Applications from the Command Line
    Invoking the Compiler
    The ways to invoke Intel C++ Compiler are as follows:
    - Invoke directly: Running Compiler from the Command Line
    - Use system make file: Running from the Command Line with make
    Invoking the Compiler from the Command Line
    There are two necessary steps to invoke the Intel C++ Compiler from the command line:
    1. set the environment
    2. invoke the compiler using icc or icpc
101. long double x, long double y); int isgreaterf(float x, float y)
    ISGREATEREQUAL. Description: The isgreaterequal function returns 1 if x is greater than or equal to y. This function does not raise the invalid floating-point exception. Calling interface: int isgreaterequal(double x, double y); int isgreaterequall(long double x, long double y); int isgreaterequalf(float x, float y)
    ISINF. Description: The isinf function returns a non-zero value if and only if its argument has an infinite value. Calling interface: int isinf(double x); int isinfl(long double x); int isinff(float x)
    ISLESS. Description: The isless function returns 1 if x is less than y. This function does not raise the invalid floating-point exception. Calling interface: int isless(double x, double y); int islessl(long double x, long double y); int islessf(float x, float y)
    ISLESSEQUAL. Description: The islessequal function returns 1 if x is less than or equal to y. This function does not raise the invalid floating-point exception. Calling interface: int islessequal(double x, double y); int islessequall(long double x, long double y); int islessequalf(float x, float y)
    ISLESSGREATER. Description: The islessgreater function returns 1 if x is less than or greater than y. This function does not raise the invalid floating-point exception. Calling interface: int islessgreater(double
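The point of these comparison functions is that they stay quiet when an operand is NaN, where the ordinary relational operators may raise the invalid floating-point exception. The sketch below uses the C99 macros of the same names from the standard math.h header, which is an assumption made for portability: the guide describes the Intel Math Library versions, and the wrapper function names here are hypothetical.

```c
#include <math.h>

/* 1 if x < y, 0 otherwise, including when either operand is NaN. */
int safe_less(double x, double y)
{
    return isless(x, y) != 0;
}

/* 1 if x < y or x > y, 0 when the operands are equal or unordered. */
int differs_ordered(double x, double y)
{
    return islessgreater(x, y) != 0;
}

/* 1 if x is positive or negative infinity. */
int is_infinite(double x)
{
    return isinf(x) != 0;
}
```

Note that differs_ordered is not the same as x != y: for a NaN operand, x != y is true while islessgreater reports 0.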
102. now holds column 1 of the original matrix, and so on. The transposition function of this macro is illustrated in the "Matrix Transposition Using the _MM_TRANSPOSE4_PS Macro" figure.
[Figure: Matrix Transposition Using _MM_TRANSPOSE4_PS Macro]
Streaming SIMD Extensions 2
This section describes the C++ language-level features supporting the Intel Pentium 4 processor Streaming SIMD Extensions 2 in the Intel C++ Compiler, which are divided into two categories:
• Floating-Point Intrinsics: describes the arithmetic, logical, compare, conversion, memory, and initialization intrinsics for the double-precision floating-point data type (__m128d).
• Integer Intrinsics: describes the arithmetic, logical, compare, conversion, memory, and initialization intrinsics for the extended-precision integer data type (__m128i).
Note: The Pentium 4 processor Streaming SIMD Extensions 2 intrinsics are defined only for IA-32 platforms, not Itanium-based platforms. Pentium 4 processor Streaming SIMD Extensions 2 operate on 128-bit quantities (2 64-bit double-precision floating-point values). The Itanium processor does not support parallel double-precision computati
103. /pch/source32.pchi
-use_pch filename
This option directs the compiler to use the PCH file specified by filename. It cannot be used in the same compilation as -create_pch filename. The -use_pch filename option supports full path names and supports multiple source files when all source files use the same .pchi file.
Example 3 command line:
prompt> icpc -use_pch /pch/source32.pchi source.cpp
Example 3 output:
"source.cpp": using precompiled header file "/pch/source32.pchi"
-pch_dir dirname
Use the -pch_dir dirname option to specify the path (dirname) to the PCH file. You can use this option with -pch, -create_pch filename, and -use_pch filename.
Example 4 command line:
prompt> icpc -pch -pch_dir /pch source32.cpp
Example 4 output:
"source32.cpp": creating precompiled header file "/pch/source32.pchi"
Organizing Source Files
If many of your source files include a common set of header files, place the common headers first, followed by the #pragma hdrstop directive. This pragma instructs the compiler to stop generating PCH files. For example, if source1.cpp, source2.cpp, and source3.cpp all include common.h, then place #pragma hdrstop after common.h to optimize compile times.
#include "common.h"
#pragma hdrstop
#include "noncommon.h"
When you compile using the -pch option:
prompt> icpc -pch source1.cpp source2.cpp source3.cpp
the compiler will generate one PCH file for all three source files
104. pd(double const *dp)
uses MOVSD + shuffling. Loads a single DP FP value, copying to both elements. The address p need not be 16-byte aligned.
r0 := *p
r1 := *p
__m128d _mm_loadr_pd(double const *dp)
uses MOVAPD + shuffling. Loads two DP FP values in reverse order. The address p must be 16-byte aligned.
r0 := p[1]
r1 := p[0]
__m128d _mm_loadu_pd(double const *dp)
uses MOVUPD. Loads two DP FP values. The address p need not be 16-byte aligned.
r0 := p[0]
r1 := p[1]
__m128d _mm_load_sd(double const *dp)
uses MOVSD. Loads a DP FP value. The upper DP FP is set to zero. The address p need not be 16-byte aligned.
r0 := *p
r1 := 0.0
__m128d _mm_loadh_pd(__m128d a, double const *dp)
uses MOVHPD. Loads a DP FP value as the upper DP FP value of the result. The lower DP FP value is passed through from a. The address p need not be 16-byte aligned.
r0 := a0
r1 := *p
__m128d _mm_loadl_pd(__m128d a, double const *dp)
uses MOVLPD. Loads a DP FP value as the lower DP FP value of the result. The upper DP FP value is passed through from a. The address p need not be 16-byte aligned.
r0 := *p
r1 := a1
Set Operations for Streaming SIMD Extensions 2
The following set operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
__m128d _mm_set_sd(doubl
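To make the lane semantics of the load intrinsics above concrete, here is a minimal sketch, assuming an x86 target with SSE2 and a compiler that ships emmintrin.h. The function name demo_loads and its output parameters are hypothetical. Only the loads that do not require 16-byte alignment are used.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical demo: loads from p (two doubles, any alignment) with three
   different intrinsics and writes each result's lanes to a plain array. */
static void demo_loads(const double *p, double both[2], double low[2], double mix[2])
{
    __m128d v_both = _mm_loadu_pd(p);             /* r0 = p[0], r1 = p[1] */
    __m128d v_low  = _mm_load_sd(p);              /* r0 = p[0], r1 = 0.0  */
    __m128d v_mix  = _mm_loadh_pd(v_low, p + 1);  /* r0 = p[0], r1 = p[1] */

    _mm_storeu_pd(both, v_both);
    _mm_storeu_pd(low, v_low);
    _mm_storeu_pd(mix, v_mix);
}
```

Combining _mm_load_sd with _mm_loadh_pd, as in v_mix, is one way to assemble a vector from two doubles that are not adjacent in memory.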
105. restrictions regarding typecasting and operator overloading as defined in the header files. The following table shows the notation used in this documentation to address typecasting, operator overloading, and other rules.
Class Syntax Notation Conventions
Class Name          Description
I[s|u][N]vec[N]     Any value except I128vec1 or I64vec1
I64vec1             __m64 data type
I[s|u]64vec2        two 64-bit values of any signedness
I[s|u]32vec4        four 32-bit values of any signedness
I[s|u]16vec8        eight 16-bit values of any signedness
I[s|u]8vec16        sixteen 8-bit values of any signedness
I[s|u]32vec2        two 32-bit values of any signedness
I[s|u]16vec4        four 16-bit values of any signedness
I[s|u]8vec8         eight 8-bit values of any signedness
Rules for Operators
To use operators with the Ivec classes, you must use one of the following three syntax conventions:
[ Ivec_Class ] R = [ Ivec_Class ] A [ operator ] [ Ivec_Class ] B
Example 1: I64vec1 R = I64vec1 A & I64vec1 B;
[ Ivec_Class ] R = [ operator ] ([ Ivec_Class ] A, [ Ivec_Class ] B)
Example 2: I64vec1 R = andnot(I64vec1 A, I64vec1 B);
[ Ivec_Class ] R [ operator ]= [ Ivec_Class ] A
Example 3: I64vec1 R &= I64vec1 A;
[ operator ]: an operator (for example, &, |, or ^)
[ Ivec_Class ]: an Ivec class
R, A, B: variables declared using the pertinent Ivec classes
The table that follows sho
106. r1 := max(a1, b1)
r2 := max(a2, b2)
r3 := max(a3, b3)
Logical Operations for Streaming SIMD Extensions
The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
__m128 _mm_and_ps(__m128 a, __m128 b)
Computes the bitwise AND of the four SP FP values of a and b.
r0 := a0 & b0
r1 := a1 & b1
r2 := a2 & b2
r3 := a3 & b3
__m128 _mm_andnot_ps(__m128 a, __m128 b)
Computes the bitwise AND-NOT of the four SP FP values of a and b.
r0 := ~a0 & b0
r1 := ~a1 & b1
r2 := ~a2 & b2
r3 := ~a3 & b3
__m128 _mm_or_ps(__m128 a, __m128 b)
Computes the bitwise OR of the four SP FP values of a and b.
r0 := a0 | b0
r1 := a1 | b1
r2 := a2 | b2
r3 := a3 | b3
__m128 _mm_xor_ps(__m128 a, __m128 b)
Computes the bitwise XOR (exclusive-or) of the four SP FP values of a and b.
r0 := a0 ^ b0
r1 := a1 ^ b1
r2 := a2 ^ b2
r3 := a3 ^ b3
Comparisons for Streaming SIMD Extensions
Each comparison intrinsic performs a comparison of a and b. For the packed form, the four SP FP values of a and b are compared, and a 128-bit mask is returned. For the scalar form, the lower SP FP values of a and b are compared, and a 32-bit mask is returned; the upper three SP FP values are passed through from a. The mask is set to 0xffffffff for each element where the comparison is true and 0x0 where the comparison is f
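One common use of the AND-NOT operation described above is clearing the sign bit of each lane. The following is a minimal sketch, assuming an x86 target with SSE and xmmintrin.h available; the helper name abs4 is hypothetical. It relies on _mm_andnot_ps inverting its first operand, not its second.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: per-lane absolute value of four floats.
   sign has only bit 31 set in each lane, so (~sign) & x clears that bit. */
static void abs4(const float in[4], float out[4])
{
    __m128 x    = _mm_loadu_ps(in);
    __m128 sign = _mm_set1_ps(-0.0f);
    _mm_storeu_ps(out, _mm_andnot_ps(sign, x));
}
```

Swapping the two arguments of _mm_andnot_ps would instead keep only the sign bits, which is a frequent source of bugs with this intrinsic.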
107. signed 16-bit integers and saturates.
r0 := SignedSaturate(a0)
r1 := SignedSaturate(a1)
r2 := SignedSaturate(a2)
r3 := SignedSaturate(a3)
r4 := SignedSaturate(b0)
r5 := SignedSaturate(b1)
r6 := SignedSaturate(b2)
r7 := SignedSaturate(b3)
__m128i _mm_packus_epi16(__m128i a, __m128i b)
Packs the 16 signed 16-bit integers from a and b into 8-bit unsigned integers and saturates.
r0 := UnsignedSaturate(a0)
r1 := UnsignedSaturate(a1)
...
r7 := UnsignedSaturate(a7)
r8 := UnsignedSaturate(b0)
r9 := UnsignedSaturate(b1)
...
r15 := UnsignedSaturate(b7)
int _mm_extract_epi16(__m128i a, int imm)
Extracts the selected signed or unsigned 16-bit integer from a and zero extends. The selector imm must be an immediate.
r := (imm == 0) ? a0 : ((imm == 1) ? a1 : ... (imm == 7) ? a7)
__m128i _mm_insert_epi16(__m128i a, int b, int imm)
Inserts the least significant 16 bits of b into the selected 16-bit integer of a. The selector imm must be an immediate.
r0 := (imm == 0) ? b : a0
r1 := (imm == 1) ? b : a1
...
r7 := (imm == 7) ? b : a7
int _mm_movemask_epi8(__m128i a)
Creates a 16-bit mask from the most significant bits of the 16 signed or unsigned 8-bit integers in a and zero extends the upper bits.
r := a15[7] << 15 | a14[7] << 14 | ... a1[7] << 1 | a0[7]
__m128i _mm_shuffle_epi32(__m128i a, int imm)
Shuffles the 4 signed or unsigned 32-bit integers in a as specified by imm. Th
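As an illustration of the mask, insert, and extract intrinsics above, here is a minimal sketch, assuming an x86 target with SSE2. The helper names match_mask16 and insert_then_extract are hypothetical. The first builds a per-byte match mask; the second shows that _mm_insert_epi16 keeps only the low 16 bits and that _mm_extract_epi16 zero-extends.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: bit i of the result is set when buf[i] == c.
   buf must point to at least 16 readable bytes; no alignment is required. */
static int match_mask16(const unsigned char *buf, unsigned char c)
{
    __m128i v  = _mm_loadu_si128((const __m128i *)buf);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)c)); /* 0xFF where equal */
    return _mm_movemask_epi8(eq);                           /* one bit per byte */
}

/* Hypothetical helper: writes value into 16-bit lane 3 of a zero vector and
   reads the lane back. The selector (3) is a compile-time constant. */
static int insert_then_extract(int value)
{
    __m128i v = _mm_setzero_si128();
    v = _mm_insert_epi16(v, value, 3);
    return _mm_extract_epi16(v, 3);
}
```

A mask produced this way is usually scanned for its lowest set bit to find the first matching byte.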
108. systems even though the intrinsics are supported.
• Use _mm_empty() after an MMX instruction if the next instruction is a floating-point (FP) instruction, for example, before calculations on float, double, or long double. You must be aware of all situations when your code generates an MMX instruction with the Intel C++ Compiler, i.e.:
  • when using an MMX technology intrinsic
  • when using Streaming SIMD Extension integer intrinsics that use the __m64 data type
  • when referencing an __m64 data type variable
  • when using an MMX instruction through inline assembly
• Do not use _mm_empty() before an MMX instruction, since using _mm_empty() before an MMX instruction incurs an operation with no benefit (no-op).
• Use different functions for operations that use FP instructions and those that use MMX instructions. This eliminates the need to empty the multimedia state within the body of a critical loop.
• Use _mm_empty() during runtime initialization of __m64 and FP data types. This ensures resetting the register between data type transitions.
• See the Correct Usage coding example below.
Incorrect Usage:
__m64 x = _m_paddd(y, z);
float f = init();
Correct Usage:
__m64 x = _m_paddd(y, z);
float f = (_mm_empty(), init());
For more documentation on EMMS, visit the http://developer.intel.com Web site.
MMX Technology General Support Intrinsics
The prototypes for MMX technology intrinsics a
109. the compiler to generate incorrect code.
-auto_ilp32 (Default: OFF)
Specifies that the application cannot exceed a 32-bit address space, which allows the compiler to use 32-bit pointers whenever possible. To use this option, you must also specify -ipo. Using the -auto_ilp32 option on programs that can exceed 32-bit address space (2**32) may cause unpredictable results during program execution.
-ax{K|W|N|B|P} (Default: OFF)
Generates specialized code for processor-specific codes K, W, N, B, and P while also generating generic IA-32 code.
• K = Intel Pentium III and compatible Intel processors
• W = Intel Pentium 4 and compatible Intel processors
• N = Intel Pentium 4 and compatible Intel processors
• B = Intel Pentium M and compatible Intel processors
• P = Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3)
-C (Default: OFF)
Places comments in preprocessed source output.
-c (Default: OFF)
Stops the compilation process after an object file has been generated. The compiler generates an object file for each C or C++ source file or preprocessed source file. Also takes an assembler file and invokes the assembler to generate an object file.
-c99[-] (Default: ON)
Enables/disables C99 support for C programs.
-complex_limited_range[-] (Default: OFF)
Enables the use of basic algebraic expansions of some arithmetic operations involving data of t
110. the keyword asm. Alternatively, either __asm or __asm__ may be used for compatibility.
volatile keyword
If the optional keyword volatile is given, the asm is volatile. Two volatile asm statements will never be moved past each other, and a reference to a volatile variable will not be moved relative to a volatile asm. Alternate keywords __volatile and __volatile__ may be used for compatibility.
asm-template
The asm-template is a C language ASCII string which specifies how to output the assembly code for an instruction. Most of the template is a fixed string; everything but the substitution directives, if any, is passed through to the assembler. The syntax for a substitution directive is a % followed by one or two characters. The supported substitution directives are specified in a subsequent section.
asm-interface
The asm-interface consists of three parts:
1. an optional output-list
2. an optional input-list
3. an optional clobber-list
These are separated by colon (:) characters. If the output-list is missing, but an input-list is given, the input list may be preceded by two colons (::) to take the place of the missing output-list. If the asm-interface is omitted altogether, the asm statement is considered volatile regardless of whether a volatile keyword was specified.
111. the number of the blocks that were executed are also displayed in front of the execution count. In certain situations, it may be desirable to consider all the blocks generated for a single source position as one entity. In such cases, it is necessary to assume that all blocks generated for one source position are covered when at least one of the blocks is covered. This assumption can be configured with the -nopartial option. When this option is specified, decision coverage is disabled, and the related statistics are adjusted accordingly. The code lines 11 and 12 indicate that the printf statement in line 12 was covered. However, only one of the conditions in line 11 was ever true. With the -nopartial option, the tool treats the partially covered code (like the code on line 11) as covered.
Differential Coverage
Using the code-coverage tool, you can compare the profiles of the application's two runs, a reference run and a new run, identifying the code that is covered by the new run but not covered by the reference run. This feature can be used to find the portion of the application's code that is not covered by the application's tests but is executed when the application is run by a customer. It can also be used to find the incremental coverage impact of newly added tests to an application's test space. The dynamic profile information of the reference run for differential coverage is specified by the -ref option, such as in the following command
112. the program to perform a run-time check for the processor on which the program runs to verify it is one of the afore-listed Intel processors.
Examples:
• Executing a program on a Pentium III processor enables FTZ, but not DAZ.
• Executing a program on an Intel Pentium M processor or Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3) enables both FTZ and DAZ.
These flags are only turned on by Intel processors that have been validated to support them. For non-Intel processors, you can set the flags manually with the following macros:
Enable FTZ: _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON)
Enable DAZ: _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON)
The prototypes for these macros are in xmmintrin.h (FTZ) and pmmintrin.h (DAZ).
Interprocedural Optimizations
Use -ip and -ipo to enable interprocedural optimizations (IPO), which allow the compiler to analyze your code to determine where to apply the optimizations listed in tables that follow.
IA-32 and Itanium-based Applications
Optimization: Affected Aspect of Program
Inline function expansion: Calls, jumps, branches, and loops
Interprocedural constant propagation: Arguments, global variables, and return values
Monitoring module-level static variables: Further optimizations, loop invariant code
Propagation of function: Call deletion and call movement. Also enables knowledge of character
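The FTZ and DAZ macros named above can be exercised as in the following minimal sketch, which assumes an x86 target whose processor supports DAZ and a toolchain that provides xmmintrin.h and pmmintrin.h. The function name enable_ftz_daz is hypothetical. The read-back uses the matching _MM_GET_* macros from the same headers.

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE, _MM_GET_DENORMALS_ZERO_MODE */

/* Hypothetical helper: turns on FTZ and DAZ in the MXCSR register of the
   calling thread and returns 1 when both modes read back as enabled. */
static int enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON
        && _MM_GET_DENORMALS_ZERO_MODE() == _MM_DENORMALS_ZERO_ON;
}
```

Because MXCSR is per-thread state, a multithreaded program would need to make this call in each thread that performs the affected floating-point work.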
113. the symbol will be overridden (preempted) by a definition of the same name in another component. See Symbol Preemption. If a function symbol has external visibility, the compiler knows that it must be called indirectly and can inline the indirect call stub.
• DEFAULT: Other components can reference the symbol. Furthermore, the symbol definition may be overridden (preempted) by a definition of the same name in another component.
• PROTECTED: Other components can reference the symbol, but it cannot be preempted by a definition of the same name in another component.
• HIDDEN: Other components cannot directly reference the symbol. However, its address might be passed to other components indirectly, for example, as an argument to a call to a function in another component, or by having its address stored in a data item referenced by a function in another component.
• INTERNAL: The symbol cannot be referenced outside its defining component, either directly or indirectly.
Static local symbols in C/C++ (declared at file scope or elsewhere with the keyword static) usually have HIDDEN visibility; they cannot be referenced directly by other components (or, for that matter, other compilation units within the same component), but they might be referenced indirectly.
Note: Visibility applies to references as well as definitions. A symbol reference's visibility attribute is an assertio
114. these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The software described in this User's Guide may contain software defects which may cause the product to deviate from published specifications. Current characterized software defects are available on request.
Intel SpeedStep, Intel Thread Checker, Celeron, Dialogic, i386, i486, iCOMP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4, IntelSX2, Intel Inside, Intel Inside logo, Intel NetBurst, Intel NetStructure, Intel Xeon, Intel XScale, Itanium, MMX, MMX logo, Pentium, Pentium II Xeon, Pentium III Xeon, Pentium M, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Other names and brands may be claimed as the property of others.
Copyright © Intel Corporation 1996-2004.
Table Of Contents
Welcome to the Intel C++ Compiler
What's New in This Release
Features and Benefits
Product Web Site and Support
System Requirements
FLEXlm Electronic Licensing
Related Publications
How to Use This Document
Compiler Options Quick Reference
Options Quick Reference Guide
Compiler Opti
115. to search for include files. For multiple search directories, multiple -Idirectory commands must be used. Included files are brought into the program with a #include preprocessor directive. The compiler searches directories for include files in the following order:
• directory of the source file that contains the include
• directories specified by the -I option
• directories specified in the CPATH, C_INCLUDE_PATH, and CPLUS_INCLUDE_PATH environment variables
How to Remove Include Directories
Use the -X option to prevent the compiler from searching the default path specified by the environment variables. You can use the -X option with the -I option to prevent the compiler from searching the default path for include files and direct it to use an alternate path. For example, to direct the compiler to search the path /alt/include instead of the default path, do the following:
prompt> icpc -X -I/alt/include source.cpp
Controlling Compilation
If no errors occur during processing, you can use the output files from a particular phase as input to a subsequent compiler invocation. The table below describes the options to control the output.
Option   Input          Output
-P       Source files   Preprocessed files (.i files)
-E       Source files   Preprocesses source file and directs output to stdout
-EP      Source files   Preprocesses source file, directs output to stdout, and omits line numbers
-c       Source files   Compile to object only (.o), do no
116. upper 3 SP FP values are passed through.
r0 := recip(a0)
r1 := a1
r2 := a2
r3 := a3
__m128 _mm_rcp_ps(__m128 a)
Computes the approximations of reciprocals of the four SP FP values of a.
r0 := recip(a0)
r1 := recip(a1)
r2 := recip(a2)
r3 := recip(a3)
__m128 _mm_rsqrt_ss(__m128 a)
Computes the approximation of the reciprocal of the square root of the lower SP FP value of a; the upper 3 SP FP values are passed through.
r0 := recip(sqrt(a0))
r1 := a1
r2 := a2
r3 := a3
__m128 _mm_rsqrt_ps(__m128 a)
Computes the approximations of the reciprocals of the square roots of the four SP FP values of a.
r0 := recip(sqrt(a0))
r1 := recip(sqrt(a1))
r2 := recip(sqrt(a2))
r3 := recip(sqrt(a3))
__m128 _mm_min_ss(__m128 a, __m128 b)
Computes the minimum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.
r0 := min(a0, b0)
r1 := a1
r2 := a2
r3 := a3
__m128 _mm_min_ps(__m128 a, __m128 b)
Computes the minimum of the four SP FP values of a and b.
r0 := min(a0, b0)
r1 := min(a1, b1)
r2 := min(a2, b2)
r3 := min(a3, b3)
__m128 _mm_max_ss(__m128 a, __m128 b)
Computes the maximum of the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a.
r0 := max(a0, b0)
r1 := a1
r2 := a2
r3 := a3
__m128 _mm_max_ps(__m128 a, __m128 b)
Computes the maximum of the four SP FP values of a and b.
r0 := max(a0, b0)
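The packed minimum and maximum intrinsics above combine naturally into a clamp, and the reciprocal intrinsics return approximations, not exact quotients. The following is a minimal sketch, assuming an x86 target with SSE; the helper names clamp4 and approx_recip are hypothetical.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: clamps each of four floats to the range [lo, hi]. */
static void clamp4(const float in[4], float lo, float hi, float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    v = _mm_max_ps(v, _mm_set1_ps(lo));  /* raise lanes below lo */
    v = _mm_min_ps(v, _mm_set1_ps(hi));  /* lower lanes above hi */
    _mm_storeu_ps(out, v);
}

/* Hypothetical helper: approximate 1/x for a single float via _mm_rcp_ss.
   The result is close to, but generally not exactly, the true reciprocal. */
static float approx_recip(float x)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}
```

Code that needs full single-precision accuracy would divide normally or refine the approximation with a Newton-Raphson step.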
117. values in b to four single-precision FP values.
r0 := (float)a0
r1 := (float)a1
r2 := (float)b0
r3 := (float)b1
inline __m64 _mm_cvtps_pi16(__m128 a)
Convert the four single-precision FP values in a to four signed 16-bit integer values.
r0 := (short)a0
r1 := (short)a1
r2 := (short)a2
r3 := (short)a3
inline __m64 _mm_cvtps_pi8(__m128 a)
Convert the four single-precision FP values in a to the lower four signed 8-bit integer values of result.
r0 := (char)a0
r1 := (char)a1
r2 := (char)a2
r3 := (char)a3
Load Operations for Streaming SIMD Extensions
See summary table in Summary of Memory and Initialization topic. The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.
__m128 _mm_load_ss(float *p)
Loads an SP FP value into the low word and clears the upper three words.
r0 := *p
r1 := 0.0
r2 := 0.0
r3 := 0.0
__m128 _mm_load_ps1(float *p)
Loads a single SP FP value, copying it into all four words.
r0 := *p
r1 := *p
r2 := *p
r3 := *p
__m128 _mm_load_ps(float *p)
Loads four SP FP values. The address must be 16-byte aligned.
r0 := p[0]
r1 := p[1]
r2 := p[2]
r3 := p[3]
__m128 _mm_loadu_ps(float *p)
Loads four SP FP values. The address need not be 16-byte aligned.
r0 := p[0]
r1 := p[1]
r2 := p[2]
r3 := p[3]
__m128 _mm_loadr_ps(float *p)
Loads four SP FP values in reverse order. The address must
118. values of A to two 32-bit integers with truncation, returning the integers in packed form.
Is32vec2 F32vec4ToIs32vec2(F32vec4 A);
r0 := (int)A0
r1 := (int)A1
Convert the 32-bit integer value B to a floating-point value; the upper three floating-point values are passed through from A.
F32vec4 IntToF32vec4(F32vec4 A, int B);
r0 := (float)B
r1 := A1
r2 := A2
r3 := A3
Convert the two 32-bit integer values in packed form in B to two floating-point values; the upper two floating-point values are passed through from A.
F32vec4 Is32vec2ToF32vec4(F32vec4 A, Is32vec2 B);
r0 := (float)B0
r1 := (float)B1
r2 := A2
r3 := A3
Floating-point Vector Classes
The floating-point vector classes, F64vec2, F32vec4, and F32vec1, provide an interface to SIMD operations. The class specifications are as follows:
F64vec2 A(double x, double y);
F32vec4 A(float z, float y, float x, float w);
F32vec1 B(float w);
The packed floating-point input values are represented with the right-most value lowest, as shown in the following table.
[Figure: Single-Precision Floating-point Elements]
F32vec4 returns four packed single-precision floating-point values (R0, R1, R2, and R3). F32vec1 returns one single-precision floating-point value (R0).
Fvec Notation Conventions
This reference uses the following convent
119. z2[N], a2[N] ...
    for (i = lb; i < N; i++)
        a2[i] = a2[i] * x2[i] + y2[i];
If you know that lb is a multiple of 4, you can align the loop with #pragma vector aligned as shown in the example that follows.
Alignment Due to Assertion of Variable as Multiple of 4
    void f(int lb)
    {
        float z2[N], a2[N], y2[N], x2[N];
        assert(lb % 4 == 0);
    #pragma vector aligned
        for (i = lb; i < N; i++)
            a2[i] = a2[i] * x2[i] + y2[i];
    }
Loop Interchange and Subscripts: Matrix Multiply
Matrix multiplication is commonly written as shown in the example below.
Typical Matrix Multiplication
    for (i = 0; i < N; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
The use of b[k][j] is not a stride-1 reference and therefore will not normally be vectorizable. If the loops are interchanged, however, all the references will become stride-1, as shown in the "Matrix Multiplication With Stride-1" example.
Caution: Interchanging is not always possible because of dependencies, which can lead to different results.
Matrix Multiplication With Stride-1
    for (i = 0; i < N; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
Auto-Parallelization
The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into equivalent multithreaded code. The auto para
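The loop interchange described above can be checked with a small self-contained sketch. The dimension DIM and the function names matmul_ijk and matmul_ikj are hypothetical choices for illustration; the loop bodies follow the two orderings in the text. Both functions accumulate into c, so the caller must zero it first.

```c
#define DIM 4  /* hypothetical matrix dimension for this sketch */

/* Typical ordering: the innermost loop walks b[k][j] down a column,
   so consecutive iterations are DIM elements apart (not stride-1). */
static void matmul_ijk(float c[DIM][DIM], float a[DIM][DIM], float b[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            for (int k = 0; k < DIM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged ordering: the innermost loop walks c[i][j] and b[k][j]
   along a row, so every array reference in it is stride-1 or invariant. */
static void matmul_ikj(float c[DIM][DIM], float a[DIM][DIM], float b[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int k = 0; k < DIM; k++)
            for (int j = 0; j < DIM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

For each c[i][j] the products are still added in increasing k in both versions, so the two functions produce identical results here; the interchange only changes the memory access pattern of the innermost loop.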
120. 0 >> count
r1 := a1 >> count
...
r7 := a7 >> count
__m128i _mm_sra_epi16(__m128i a, __m128i count)
Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
...
r7 := a7 >> count
__m128i _mm_srai_epi32(__m128i a, int count)
Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
r2 := a2 >> count
r3 := a3 >> count
__m128i _mm_sra_epi32(__m128i a, __m128i count)
Shifts the 4 signed 32-bit integers in a right by count bits while shifting in the sign bit.
r0 := a0 >> count
r1 := a1 >> count
r2 := a2 >> count
r3 := a3 >> count
__m128i _mm_srli_si128(__m128i a, int imm)
Shifts the 128-bit value in a right by imm bytes while shifting in zeros. imm must be an immediate.
r := srl(a, imm*8)
__m128i _mm_srli_epi16(__m128i a, int count)
Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
...
r7 := srl(a7, count)
__m128i _mm_srl_epi16(__m128i a, __m128i count)
Shifts the 8 signed or unsigned 16-bit integers in a right by count bits while shifting in zeros.
r0 := srl(a0, count)
r1 := srl(a1, count)
...
r7 := srl(a7, count)
__m128i _mm_srli_epi32(__m128i a, int count)
Shifts the 4 signed or unsigned 32-bit integers in
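The difference between shifting in the sign bit and shifting in zeros, as described above, shows up only for negative inputs. Here is a minimal sketch, assuming an x86 target with SSE2; the helper names sra16_lane0 and srl16_lane0 are hypothetical, and the shift count of 4 is an arbitrary choice.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: arithmetic right shift by 4 of a 16-bit value,
   done in lane 0 of a vector. The sign bit is replicated into the top. */
static int sra16_lane0(short x)
{
    __m128i v = _mm_srai_epi16(_mm_set1_epi16(x), 4);
    return (short)_mm_extract_epi16(v, 0);  /* reinterpret lane as signed */
}

/* Hypothetical helper: logical right shift by 4 of the same 16-bit pattern.
   Zeros are shifted in, so the result is always non-negative. */
static int srl16_lane0(short x)
{
    __m128i v = _mm_srli_epi16(_mm_set1_epi16(x), 4);
    return _mm_extract_epi16(v, 0);         /* zero-extended lane value */
}
```

For -256 (bit pattern 0xFF00) the arithmetic shift yields 0xFFF0 and the logical shift yields 0x0FF0; for non-negative inputs the two agree.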
121. -O3 -fp: Debugging information produced, -O3 optimizations enabled, -fp enabled for IA-32-targeted compilations.
Debugging and Assembling
The assembly file is generated without debugging information, but if you produce an object file, it will contain debugging information. If you link the object file and then use the GDB debugger on it, you will get full symbolic representation.
Using Libraries
The Intel C++ Compiler uses the GNU C Library, Dinkumware C++ Library, and the Standard C++ Library. These libraries are documented at the following Internet locations:
GNU C Library: http://www.gnu.org/manual/glibc-2.2.3/html_chapter/libc_toc.html
Dinkumware C++ Library: http://www.dinkumware.com/htm_cpl/lib_cpp.html
Standard C++ Library: http://gcc.gnu.org/onlinedocs/libstdc++
Default Libraries
The following libraries are supplied with the Intel C++ Compiler:
Library                                    Description
libguide.a, libguide.so                    For OpenMP implementation
libguide_stats.a, libguide_stats.so        OpenMP static library for the parallelizer tool with performance statistics and profile information
libompstub.a                               Library that resolves references to OpenMP subroutines when OpenMP is not in use
libsvml.a                                  Short vector math library
libirc.a                                   Intel support library for PGO and CPU dispatch
libircmt.a                                 Multi-thread version of libirc.a
libimf.a, libimf.so                        Intel math library
libcprts.a, libcprts.so, libcprts.so.3     Dinkumware C++ Library
libunwind
122. 1 R = F32vec1 A * F32vec1 B; F32vec1 R *= F32vec1 A; (_mm_mul_ss)
Division
4 floats: F32vec4 R = F32vec4 A / F32vec4 B; F32vec4 R /= F32vec4 A; (_mm_div_ps)
2 doubles: F64vec2 R = F64vec2 A / F64vec2 B; F64vec2 R /= F64vec2 A; (_mm_div_pd)
1 float: F32vec1 R = F32vec1 A / F32vec1 B; F32vec1 R /= F32vec1 A; (_mm_div_ss)
Advanced Arithmetic Operator Usage
The following table shows the return values classes of the advanced arithmetic operators, which use the syntax styles described earlier in the Return Value Notation section.
Advanced Arithmetic Return Value Mapping
R       Operators                            A                     F32vec4   F64vec2   F32vec1
R0 :=   sqrt rcp rsqrt rcp_nr rsqrt_nr       A0
R1 :=   sqrt rcp rsqrt rcp_nr rsqrt_nr       A1                                        N/A
R2 :=   sqrt rcp rsqrt rcp_nr rsqrt_nr       A2                              N/A       N/A
R3 :=   sqrt rcp rsqrt rcp_nr rsqrt_nr       A3                              N/A       N/A
f :=    add_horizontal                       (A0 + A1 + A2 + A3)             N/A       N/A
d :=    add_horizontal                       (A0 + A1)
The table below shows examples for advanced arithmetic operators.
Advanced Arithmetic Operations for Fvec Classes
Returns      Example Syntax Usage             Intrinsic
Square Root
4 floats     F32vec4 R = sqrt(F32vec4 A);     _mm_sqrt_ps
2 doubles    F64vec2 R = sqrt(F64vec2 A);     _mm_sqrt_pd
1 float      F32vec1 R = sqrt(F32vec1 A);     _mm_sqrt_ss
Reciprocal
4 floats     F32vec4 R = rcp(F32vec4 A);      _mm_rcp_ps
2 doubles    F64vec2 R = rcp(F64vec2 A);      _
123. Example
Selectively collect profile information for the portion of the application involved in processing input data.
    input_data = get_input_data();
    while (input_data) {
        _PGOPTI_Prof_Reset();
        process_data(input_data);
        _PGOPTI_Prof_Dump();
        input_data = get_input_data();
    }
Resetting the Dynamic Profile Counters
void _PGOPTI_Prof_Reset(void);
Description: This function resets the dynamic profile counters.
Recommended Usage: Use this function to clear the profile counters prior to collecting profile information on a section of the instrumented application. See the example under _PGOPTI_Prof_Dump().
Dumping and Resetting Profile Information
void _PGOPTI_Prof_Dump_And_Reset(void);
Description: This function may be called more than once. Each call will dump the profile information to a new .dyn file. The dynamic profile counters are then reset, and execution of the instrumented application continues.
Recommended Usage: Periodic calls to this function allow a non-terminating application to generate one or more profile information files. These files are merged during the feedback phase of profile-guided optimization. The direct use of this function allows your application to control precisely when the profile information is generated.
Interval Profile Dumping
void _PGOPTI_Set_Interval_Prof_Dump(int interval);
Description: This function activates Interval Profile Dum
124. The scalar element is 1.0. Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals).
Intrinsic Syntax
To use an intrinsic in your code, insert a line with the following syntax:
data_type intrinsic_name(parameters)
Where:
data_type: Is the return data type, which can be either void, int, __m64, __m128, __m128d, __m128i, __int64. Intrinsics that can be implemented across all IA may return other data types as well, as indicated in the intrinsic syntax definitions.
intrinsic_name: Is the name of the intrinsic, which behaves like a function that you can use in your C++ code instead of inlining the actual instruction.
parameters: Represents the parameters required by each intrinsic.
Intrinsics Implementation Across All IA
The intrinsics in this section function across all IA-32 and Itanium-based platforms. They are offered as a convenience to the programmer. They are grouped as follows:
• Integer Arithmetic Related
• Floating-Point Related
• String and Block Copy Related
• Miscellaneous
Integer Arithmetic Related
Intrinsic                                            Description
int abs(int)                                         Returns the absolute value of an integer.
long labs(long)                                      Returns the absolute value of a long integer.
unsigned long _lrotl(unsigned long value, int shi    Rotates bits left for an unsigned
125. 128i _mm_avg_epu8(__m128i a, __m128i b)
Computes the average of the 16 unsigned 8-bit integers in a and the 16 unsigned 8-bit integers in b and rounds.
r0 := (a0 + b0) / 2
r1 := (a1 + b1) / 2
...
r15 := (a15 + b15) / 2
__m128i _mm_avg_epu16(__m128i a, __m128i b)
Computes the average of the 8 unsigned 16-bit integers in a and the 8 unsigned 16-bit integers in b and rounds.
r0 := (a0 + b0) / 2
r1 := (a1 + b1) / 2
...
r7 := (a7 + b7) / 2
__m128i _mm_madd_epi16(__m128i a, __m128i b)
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer results.
r0 := (a0 * b0) + (a1 * b1)
r1 := (a2 * b2) + (a3 * b3)
r2 := (a4 * b4) + (a5 * b5)
r3 := (a6 * b6) + (a7 * b7)
__m128i _mm_max_epi16(__m128i a, __m128i b)
Computes the pairwise maxima of the 8 signed 16-bit integers from a and the 8 signed 16-bit integers from b.
r0 := max(a0, b0)
r1 := max(a1, b1)
...
r7 := max(a7, b7)
__m128i _mm_max_epu8(__m128i a, __m128i b)
Computes the pairwise maxima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b.
r0 := max(a0, b0)
r1 := max(a1, b1)
...
r15 := max(a15, b15)
__m128i _mm_min_epi16(__m128i a, __m128i b)
Computes the pairwise minima of the 8 signed 16-bit integers from a and the 8 signed 16-bit integers from b.
r0 := min(a0, b0)
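Two of the integer intrinsics above are sketched below, assuming an x86 target with SSE2. The helper names avg_u8 and dot8_i16 are hypothetical. The first shows the rounding behavior of the average, which computes (a + b + 1) >> 1 per lane; the second uses the pairwise multiply-add as the core of a short dot product.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: rounded average of two unsigned bytes, computed in
   every lane of a vector and read back from lane 0. */
static int avg_u8(unsigned char a, unsigned char b)
{
    __m128i r = _mm_avg_epu8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return _mm_cvtsi128_si32(r) & 0xff;
}

/* Hypothetical helper: dot product of two arrays of eight signed 16-bit
   integers. _mm_madd_epi16 yields four 32-bit sums of adjacent products. */
static int dot8_i16(const short *a, const short *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i pairs = _mm_madd_epi16(va, vb);

    int s[4];
    _mm_storeu_si128((__m128i *)s, pairs);
    return s[0] + s[1] + s[2] + s[3];
}
```

Because the pair sums are already 32 bits wide, the products of 16-bit inputs cannot overflow before they are added, which is the usual reason for preferring this intrinsic in fixed-point filters.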
126. 16-byte boundaries.
Controlling Compilation Output
Produces an assembly file with the specified file name, or the default file name if name is not specified.
Generates assemblable file only with .s suffix, then stops the compilation.
Specifying Alternate Tools and Paths
You can direct the compiler to specify alternate tools for preprocessing, compilation, assembly, and linking. Further, you can invoke options specific to your alternate tools on the command line. The following sections explain how to use -Qlocation and -Qoption to do this.
How to Specify an Alternate Component
Use -Qlocation to specify an alternate path for a tool. This option accepts two arguments using the following syntax:
prompt> icpc -Qlocation,tool,path
tool    Description
cpp     Specifies the compiler front-end preprocessor.
c       Specifies the C++ compiler.
asm     Specifies the assembler.
ld      Specifies the linker.
path is the complete path to the tool.
How to Pass Options to Other Programs
Use -Qoption to pass an option specified by optlist to a tool, where optlist is a comma-separated list of options. The syntax for this command is the following:
prompt> icpc -Qoption,tool,optlist
tool    Description
cpp     Specifies the compiler front-end preprocessor.
c       Specifies the C++ compiler.
asm     Specifies the assembler.
ld      Specifies the linker.
optlist indicates one or more valid argument strings
127. 20Vars http gcc gnu org onlinedocs gcc 3 2 gcc Alternate Keywords html Alternate 20Keywords http gcc gnu org onlinedocs gcc 3 2 gcc Incomplete Enums html Incomplete 20Enums http gcc gnu org onlinedocs gec 3 2 gcc Function Names html Function 20Names http gcc gnu org onlinedocs gec 3 2 gcc Return Address html Return 20Address 69 Intel C4 Compiler for Linux Systems User s Guide pen gcc Language Extension Using Vector Instructions Through Built in Functions Other built in functions provided by GCC Built in Functions Specific to Particular Target Machines Pragmas Accepted by GCC Unnamed struct union fields within structs unions Minimum and Maximum operators in C When is a Volatile Object Accessed Restricting Pointer Aliasing Vague Linkage Declarations and Definitions in One Header i Where s the Template Extracting the function pointer from a bound pointer to member function p C Specific Variable Function and Type Attributes 70 Intel Support Some Most No No Yes Yes No Yes Yes No extern template supported No GNU Description and Examples http gcc gnu org onlinedocs gcc 3 2 gcc Vector Extensions html Vector 20Extensions http gcc gnu org onlinedocs gec 3 2 gcc Other Builtins html Other 20Builtins http gcc gnu org onlinedocs gec 3 2 gcc Target Builtins htmlZTarget o20Builtins
128. __m128d b)
Compares the lower DP FP value of a and b for a not less than b. The upper DP FP value is passed through from a.
r0 := !(a0 < b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpnle_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not less than or equal to b. The upper DP FP value is passed through from a.
r0 := !(a0 <= b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpngt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not greater than b. The upper DP FP value is passed through from a.
r0 := !(a0 > b0) ? 0xffffffffffffffff : 0x0
r1 := a1

__m128d _mm_cmpnge_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not greater than or equal to b. The upper DP FP value is passed through from a.
r0 := !(a0 >= b0) ? 0xffffffffffffffff : 0x0
r1 := a1

int _mm_comieq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.
r := (a0 == b0) ? 0x1 : 0x0

int _mm_comilt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.
r := (a0 < b0) ? 0x1 : 0x0

int _mm_comile_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than o
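The two families above differ in what they return: the `_mm_cmp*_sd` forms produce a lane mask (all ones or all zeros) inside an `__m128d`, while the `_mm_comi*_sd` forms return a plain `int`. A minimal sketch of both, assuming an SSE2-capable x86 target; the wrapper names `low_not_less_than`, `low_less_than`, and `check_scalar_compares` are hypothetical.

```cpp
#include <emmintrin.h>  // SSE2 double-precision intrinsics
#include <cassert>

// Mask-style compare: true when the low lane of a is NOT less than the low lane of b.
inline bool low_not_less_than(double a0, double b0) {
    __m128d a = _mm_set_sd(a0);          // low lane = a0, upper lane = 0.0
    __m128d b = _mm_set_sd(b0);
    __m128d m = _mm_cmpnlt_sd(a, b);     // low lane: all ones or all zeros; upper lane copied from a
    return (_mm_movemask_pd(m) & 1) != 0; // bit 0 is the top bit of the low lane
}

// Integer-style compare: returns 1 when a0 < b0, otherwise 0.
inline int low_less_than(double a0, double b0) {
    return _mm_comilt_sd(_mm_set_sd(a0), _mm_set_sd(b0));
}

// Self-check on ordinary (non-NaN) inputs.
inline void check_scalar_compares() {
    assert(low_not_less_than(2.0, 1.0));
    assert(low_not_less_than(1.0, 1.0));
    assert(!low_not_less_than(0.5, 1.0));
    assert(low_less_than(0.5, 1.0) == 1);
    assert(low_less_than(1.0, 1.0) == 0);
}
```

The mask form is convenient when the result feeds further SIMD logic (for example a bitwise select); the `int` form suits ordinary branching.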
129. 4 getReg const int whichReg void setReg const int whichReg unsigned __int64 value nsigned _ int64 getIndReg const int hichlndReg _ int64 index void setIndReg const int whichIndReg _ int64 index unsigned __int64 value void __ptr64 rdteb void nsigned _ int64 fetchadd4_acq unsigned int addend const int increment nsigned _ int64 __fetchadd4_rel unsigned int addend const int increment nsigned _ int64 fetchadd8 acq unsigned int64 addend const int increment nsigned _ int64 __fetchadd8_rel unsigned int64 addend const int increment 292 Gets the value from a hardware register based on the index passed in Produces a corresponding mov r instruction Provides access to the following registers See Register Names for getReg and setReg Sets the value for a hardware register based on the index passed in Produces a corresponding mov r instruction See Register Names for getReg and setReg Return the value of an indexed register The index is the 2nd argument the register file is the first argument Copy a value in an indexed register The index is the 2nd argument the register file 1s the first argument Gets TEB address The TEB address is kept in x13 and maps to the move r tp instruction Executes the serialize instruction Maps to the srlz i instruction Serializes the data Maps to the sr1z d instruction Map the fet chadd
130. 4 acq instruction Map the fet chadd4 rel instruction Map the fet chadd8 acq instruction Map the fet chadd8 rel instruction Intel C Intrinsics Reference cu A ci fwb void f8 const int Reg void src fill const int Reg void src fs void dst whichFloatReg void _ mf void void mfa void void __synci void void __thash __int64 void __ttag __int64 void __itcd __int64 pa Flushes the write buffers Maps to the fwb instruction Map the 1dfs instruction Load a single precision value to the specified register Map the 1d d instruction Load a double precision value to the specified register Map the 1dfe instruction Load an extended precision value to the specified register Map the 1d 8 instruction ldf fill instruction sfts instruction t d instruction t fe instruction t 8 instruction tf spill instruction Executes a memory fence instruction Maps to the mf instruction Executes a memory fence acceptance form instruction Maps to the mf a instruction Enables memory synchronization Maps to the sync i instruction Generates a translation hash entry address Maps to the thash r r instruction Generates a translation hash entry tag Maps to the ttag r r instruction Insert an entry into the data translation cache Map itc d instruction 293 Intel C Compile
131. 4 c the high 64 bits of the 128 bit result The result is signed _m64_xmahu __int64 a xma hu Fixed point multiply add using b int64 c the high 64 bits of the 128 bit result The result is unsigned m64 popcnt int64 a popcent Population count int64 _m64_shladd __int64 a shladd Shift left and add int64 b FSR Operations m64 shrp b const int count int64 a shrp Shift right pair LONE CE void _fsetc int amask int omask void _fclrf void int64 _m64_dep_mr Sets the control bits of FPSR sf0 Maps to the fsetc sf0 r rinstruction There is no corresponding instruction to read the control bits Use mm getfpsr Clears the floating point status flags the 6 bit flags of FPSR sf0 Maps to the clrf sf0 instruction int64 r int64 s const int pos const int len The right justified 64 bit value r 1s deposited into the value in s at an arbitrary bit position and the result is returned The deposited bit field begins at bit position pos and extends to the left toward the most significant bit the number of bits specified by len 286 Intel C Intrinsics Reference int64 _m64_dep_mi const int v __int64 s const int p const int len The sign extended value v either all 1s or all 0s is deposited into the value in s at an arbitrary bit position and the result is returned The deposited bit field begins at bit position p and extends to the left tow
132. 64 bits of data loaded from the address p; the upper two values are passed through from a.
r0 := *p0
r1 := *p1
r2 := a2
r3 := a3

void _mm_storel_pi(__m64 *p, __m128 a)
Stores the lower two SP FP values of a to the address p.
*p0 := a0
*p1 := a1

int _mm_movemask_ps(__m128 a)
Creates a 4-bit mask from the most significant bits of the four SP FP values.
r := sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)

Using Streaming SIMD Extensions on Itanium Architecture
The Streaming SIMD Extensions intrinsics provide access to Itanium instructions for Streaming SIMD Extensions. To provide source compatibility with the IA-32 architecture, these intrinsics are equivalent both in name and functionality to the set of IA-32-based Streaming SIMD Extensions intrinsics. To write programs with the intrinsics, you should be familiar with the hardware features provided by the Streaming SIMD Extensions. Keep the following issues in mind:
• Certain intrinsics are provided only for compatibility with previously defined IA-32 intrinsics. Using them on Itanium-based systems probably leads to performance degradation. See section below.
• Floating-point (FP) data loaded or stored as __m128 objects must be 16-byte aligned.
• Some intrinsics require that their arguments be immediates, that is, constant integers (literals), due to the nature of the instruction.

Data Types
The new
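The sign-mask formula and the 16-byte alignment rule can be illustrated together. A minimal sketch, assuming an x86 target with SSE and a C++11 compiler (for `alignas`); the function names are hypothetical.

```cpp
#include <xmmintrin.h>  // SSE single-precision intrinsics
#include <cassert>

// Builds the vector from explicit lanes; a0 is the lowest lane.
inline int sign_mask(float a3, float a2, float a1, float a0) {
    return _mm_movemask_ps(_mm_set_ps(a3, a2, a1, a0));
}

// Same mask, but loaded from memory. _mm_load_ps requires a 16-byte aligned
// address, so the array is declared with alignas(16).
inline int sign_mask_from_aligned_array() {
    alignas(16) float v[4] = {4.0f, -3.0f, 2.0f, -1.0f};  // v[0] is lane a0
    return _mm_movemask_ps(_mm_load_ps(v));
}

inline void check_sign_mask() {
    // a3 and a1 negative: sign(a3)<<3 | sign(a1)<<1 = 8 | 2
    assert(sign_mask(-1.0f, 2.0f, -3.0f, 4.0f) == 10);
    assert(sign_mask_from_aligned_array() == 10);
}
```

When the alignment of the data is not under your control, `_mm_loadu_ps` is the unaligned counterpart of `_mm_load_ps`.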
133. 8 clog library function ssssseeseeseeseeseeerrsreereee 193 clog2 library function sssssssss 193 code coverage tool 103 compiling and linking ue ace tenu 60 controlling nean 51 from the command line 40 phases of sese enda 39 with alternate tools and paths 53 with Take ien bap om panties nt 42 compiling tee 51 complex limited range option 11 compound library function 184 configuration files 49 conj library function ssssseeseesesseseeeesseeeeseee 193 conventions for class libraries 4 Tor dOCUMENT eon eei n 4 Tor IEAS O soe Ree DURPE 4 copysign library function 190 cos library function 176 cosd library function 176 cosh library function 179 cot library function 00 ec sse 176 cotd library function ssssssseeseeseeseeseeeesseeeeseee 176 CPATH enviroment variable 48 CPLUS INCLUDE PATH enviroment variable Austen eat EE RTE RER iunt 48 cpow library function ssssssssss 193 cproj library function 193 392 cpu dispatch prenerie aiea a 87 creal library function sssssssss 193 create pch option 11 55 csin library function 193 csinh library function 193 csqrt library function 00 0 ce eeeeeeereeeeeeeeees 193 ctan library function 193 ctanh library function 193 Cxxlib gcc option 11 71 cxxlib 1cc option 11 71 data alignment
134. 90, Programming Languages C
• ISO/IEC 14882:1998, Programming Languages C++.
• The Annotated C++ Reference Manual, Special Edition. Ellis, Margaret; Stroustrup, Bjarne. Addison-Wesley, 1991. Provides information on the C++ programming language.
• The C++ Programming Language, 3rd edition, 1997. Addison-Wesley Publishing Company, One Jacob Way, Reading, MA 01867.
• The C Programming Language, 2nd edition. Kernighan, Brian W.; Ritchie, Dennis M. Prentice Hall, 1988. Provides information on the K&R definition of the C language.
• C: A Reference Manual, 3rd edition. Harbison, Samuel P.; Steele, Guy L. Prentice Hall, 1991. Provides information on the ANSI standard and extensions of the C language.
• Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture. Intel Corporation, doc. number 243190.
• Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual. Intel Corporation, doc. number 243191.
• Intel Architecture Software Developer's Manual, Volume 3: System Programming. Intel Corporation, doc. number 243192.
• Intel® Itanium® Assembler User's Guide.
• Intel Itanium-based Assembly Language Reference Manual.
• Itanium Architecture Software Developer's Manual, Vol. 1: Application Architecture. Intel Corporation, doc. number 245317-001.
• Itanium Architecture Software Developer's Manual, Vol. 2: System Archi
135. N/A N/A N/A N/A N/A N/A

Addition and Subtraction Operators
The addition and subtraction operators return the class of the nearest common ancestor when the right-side operands are of different signs. The following code provides examples of usage and miscellaneous exceptions.

Syntax Usage for Addition and Subtraction Operators
    /* Return nearest common ancestor type, I16vec4 */
    Is16vec4 A;
    Iu16vec4 B;
    I16vec4 C;
    C = A + B;
    /* Returns type left-hand operand type */
    Is16vec4 A;
    Iu16vec4 B;
    A += B;
    B -= A;
    /* Explicitly convert B to Is16vec4 */
    Is16vec4 A, C;
    Iu32vec24 B;
    C = A + C;
    C = A + (Is16vec4)B;

Addition and Subtraction Operators with Corresponding Intrinsics
Operation | Symbols | Syntax | Corresponding Intrinsics
Addition | +, += | R = A + B; R += A | _mm_add_epi64, _mm_add_epi32, _mm_add_epi16, _mm_add_epi8, _mm_add_pi32, _mm_add_pi16, _mm_add_pi8
Subtraction | -, -= | R = A - B; R -= A | _mm_sub_epi64, _mm_sub_epi32, _mm_sub_epi16, _mm_sub_epi8, _mm_sub_pi32, _mm_sub_pi16, _mm_sub_pi8

The following table lists addition and subtraction return values for combinations of classes when the right-side operands are of different signedness. The two operands must be the same size; otherwise, you must explicitly indicate the typecasting.

Addition and Subtraction Operator Overloading
Return Value R: I64vec2 R, I32vec4 R, I32vec2 R, I16vec8 R, I16vec4 R, I8vec8 R
136. C ON programs strict ansi Strict ANSI conformance OFF dialect syntax Checks the syntax ofa program OFF and stops the compilation process after the C or C source files and preprocessed source files have been parsed Generates no code and produces no output files Warnings and messages appear on stderr T file Direct linker to read link OFF commands from file Targets optimization for the OFF Itanium processor Targets optimization for the ON Itanium 2 processor Generated code is compatible with the Itanium processor Targets the optimizations for the OFF Pentium processor Targets the optimizations for the OFF Pentium Pro Pentium II and Pentium III processors Targets optimizations for the ON Intel Pentium 4 processors Suppresses any definition of a OFF macro name Equivalent to a undef preprocessing directive unrollO Disable loop unrolling OFF unroll 0 Disable loop unrolling OFF use asm Produce objects through OFF assembler use msasm Accept the Microsoft MASM OFF style inlined assembly format instead of GNU style Option Description Default 27 Intel C Compiler for Linux Systems User s Guide Option Description Default use pch filename Manual use of precompiled OFF header ilename pchi u symbol Pretend the symbol is OFF undefined V Display compiler version OFF information v Show driver tool commands and execute t
137. ICCCFG specifies the configuration file for customizing compilations when invoking the compiler using icc.
• ICPCCFG specifies the configuration file for customizing compilations when invoking the compiler using icpc.
• Several environment variables are supported to specify the location for temporary files. The compiler searches for the following variables in the order specified: TMP, TMPDIR, and TEMP. If none of these variables are found, temporary files are stored in /tmp.
• IA32ROOT (IA-32-based systems) points to the directory containing the bin, lib, include, and substitute header directories.
• IA64ROOT (Itanium-based systems) points to the directory containing the bin, lib, include, and substitute header directories.

GNU Environment Variables
The Intel C++ Compiler supports the following GNU environment variables:
• CPATH: Path to include directory for C/C++ compilations.
• C_INCLUDE_PATH: Path to include directory for C compilations.
• CPLUS_INCLUDE_PATH: Path to include directory for C++ compilations.
• LIBRARY_PATH: The value of LIBRARY_PATH is a colon-separated list of directories, much like PATH.
• DEPENDENCIES_OUTPUT: If this variable is set, its value specifies how to output dependencies for Make based on the non-system header files processed by the compiler. System header files are ignored in the dependency output.
• SUNPRO_DEPENDENCI
138. Data Alignment Memory Allocation Intrinsics and Inline Assembly 307 Intrinsics Cross processor Implementation 311 Intel C Class NE DUI M 332 Introduction to the Class Libraries enne nnne nnns 332 Integer Vector Classes iii 339 Floating point Vector Classes iii 363 Classes Quick Reference iii 381 Programming Gul 389 TINTON D E 391 Welcome to the Intel C Compiler Welcome to the Intel C Compiler Before you use the compiler see System Requirements Most Linux distributions include the GNU C library assembler linker and others The Intel C Compiler includes the Dinkumware C library See Libraries Overview Please look at the individual sections within each main section of this User s Guide to gain an overview of the topics presented For the latest information visit the Intel Web site http developer intel com What s New in This Release New features for this version of the Intel C Compiler include e New gcc Interoperability Options e Improved gcc Compatibility e Support for Precompiled Header Files e New gcc Built in Functions e New gcc Function Attributes e New optimization support for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 e New Processor specific Run Time Checks for IA 32 e New IA 32 Intrinsics for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 e New Synchronization Primitive intrinsics for
139. Description: Computes the Bessel function of the first kind of x with order 0.
Calling interface:
double j0(double x);
float j0f(float x);

Description: Computes the Bessel function of the first kind of x with order 1.
Calling interface:
double j1(double x);
float j1f(float x);

Description: Computes the Bessel function of the first kind of x with order n.
Calling interface:
double jn(int n, double x);
float jnf(int n, float x);

LGAMMA
Description: The lgamma function returns the value of the logarithm of the absolute value of gamma.
errno: ERANGE, for overflow conditions.
Calling interface:
double lgamma(double x);
long double lgammal(long double x);
float lgammaf(float x);

LGAMMA_R
Description: The lgamma_r function returns the value of the logarithm of the absolute value of gamma. The sign of the gamma function is returned in the integer signgam.
errno: ERANGE, for overflow conditions, x = 0 or negative integers.
Calling interface:
double lgamma_r(double x, int *signgam);
long double lgammal_r(long double x, int *signgam);
float lgammaf_r(float x, int *signgam);

TGAMMA
Description: The tgamma function computes the gamma function of x.
errno: EDOM, for x = 0 or negative integers.
Calling interface:
double tgamma(double x);
long double tgammal(long double x);
float tgammaf(float x);

Y0
Description: Computes the B
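A few known identities make a quick sanity check for these functions: J0(0) = 1, J1(0) = 0, and gamma(n + 1) = n!. The sketch below uses the versions found in a typical Linux C library rather than the Intel math library itself; it assumes `j0`, `j1`, and `jn` are exposed by `<math.h>` (they are POSIX extensions, not ISO C++), and the helper names are hypothetical.

```cpp
#include <cmath>    // std::lgamma, std::tgamma, std::log, std::fabs (C++11)
#include <math.h>   // j0, j1, jn: POSIX extensions, assumed to be available
#include <cassert>

// Relative/absolute tolerance comparison (hypothetical helper).
inline bool close_to(double a, double b, double tol = 1e-9) {
    return std::fabs(a - b) <= tol * (1.0 + std::fabs(b));
}

// n! via the gamma function: gamma(n + 1) = n!
inline double factorial(int n) { return std::tgamma(n + 1.0); }

// log(n!) via lgamma; this stays finite where tgamma would overflow.
inline double log_factorial(int n) { return std::lgamma(n + 1.0); }

inline void check_special_functions() {
    assert(close_to(factorial(4), 24.0));
    assert(close_to(log_factorial(4), std::log(24.0)));
    assert(close_to(::j0(0.0), 1.0));            // J0(0) = 1
    assert(close_to(::j1(0.0), 0.0));            // J1(0) = 0
    assert(close_to(::jn(1, 2.5), ::j1(2.5)));   // order-n form with n = 1
}
```

Working with `lgamma` instead of `tgamma` is the usual way to avoid the overflow condition noted above when only ratios or logarithms of factorials are needed.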
140. Macro: _MM_SET_ROUNDING_MODE(x), _MM_GET_ROUNDING_MODE()
Arguments: _MM_ROUND_NEAREST, _MM_ROUND_DOWN, _MM_ROUND_UP, _MM_ROUND_TOWARD_ZERO
Macro Definition: Write to and read from bits thirteen and fourteen of the control register.

The following example tests the rounding mode for round toward zero.

Rounding Mode with _MM_ROUND_TOWARD_ZERO
    if (_MM_GET_ROUNDING_MODE() == _MM_ROUND_TOWARD_ZERO) {
        /* Rounding mode is round toward zero */
    }

Flush-to-Zero Mode
Macro: _MM_SET_FLUSH_ZERO_MODE(x), _MM_GET_FLUSH_ZERO_MODE()
Arguments: _MM_FLUSH_ZERO_ON, _MM_FLUSH_ZERO_OFF
Macro Definition: Write to and read from bit fifteen of the control register.

The following example disables flush-to-zero mode.

Flush-to-Zero Mode with _MM_FLUSH_ZERO_OFF
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF)

Macro Function for Matrix Transposition
The Streaming SIMD Extensions also provide the following macro function to transpose a 4 by 4 matrix of single-precision floating-point values:
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3)
The arguments row0, row1, row2, and row3 are __m128 values whose elements form the corresponding rows of a 4 by 4 matrix. The matrix transposition is returned in arguments row0, row1, row2, and row3, where row0 now holds column 0 of the original matrix, row1
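Because these macros change processor-wide control state, a common pattern is to save the previous mode and restore it afterwards. The following is a minimal sketch of that pattern plus a use of the transpose macro, assuming an x86 target with SSE and a C++11 compiler; the class `ScopedRoundingMode` and the functions `transpose4x4` and `check_control_macros` are hypothetical names, not part of the guide.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and the _MM_* control macros
#include <cassert>

// Sets a rounding mode for the lifetime of the object, then restores the old one.
class ScopedRoundingMode {
public:
    explicit ScopedRoundingMode(unsigned int mode)
        : saved_(_MM_GET_ROUNDING_MODE()) {
        _MM_SET_ROUNDING_MODE(mode);
    }
    ~ScopedRoundingMode() { _MM_SET_ROUNDING_MODE(saved_); }
    ScopedRoundingMode(const ScopedRoundingMode&) = delete;
    ScopedRoundingMode& operator=(const ScopedRoundingMode&) = delete;
private:
    unsigned int saved_;
};

// Transposes a 4 by 4 matrix in place using _MM_TRANSPOSE4_PS.
inline void transpose4x4(float m[4][4]) {
    __m128 r0 = _mm_loadu_ps(m[0]);
    __m128 r1 = _mm_loadu_ps(m[1]);
    __m128 r2 = _mm_loadu_ps(m[2]);
    __m128 r3 = _mm_loadu_ps(m[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // the macro rewrites its four arguments
    _mm_storeu_ps(m[0], r0);
    _mm_storeu_ps(m[1], r1);
    _mm_storeu_ps(m[2], r2);
    _mm_storeu_ps(m[3], r3);
}

inline void check_control_macros() {
    const unsigned int before = _MM_GET_ROUNDING_MODE();
    {
        ScopedRoundingMode guard(_MM_ROUND_TOWARD_ZERO);
        assert(_MM_GET_ROUNDING_MODE() == _MM_ROUND_TOWARD_ZERO);
    }
    assert(_MM_GET_ROUNDING_MODE() == before);  // restored on scope exit

    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    assert(_MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_OFF);
}
```

The guard object keeps the mode change local to one block, which avoids leaking a non-default rounding mode into unrelated code.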
141. EG CR ISR IA64 REG CR IIP IA64 REG CR IFA IA64 REG CR ITIR IA64 REG CR IIPA IA64 REG CR IFS IA64 REG CR IIM IA64 REG CR IHA IA64 REG CR LID IA64 REG CR IVR IA64 REG CR TPR 298 Intel C Intrinsics Reference NN ace IA64 REG CR IRR3 REG C IA64 1A64 REG CR IA64 REG CR CV get Reg only Indirect Registers for getlndReg and setIndReg get IndReg only 299 Intel C Compiler for Linux Systems User s Guide Multimedia Additions The prototypes for these intrinsics are in the ia64intrin h header file ER i m64 d uxl m64 a const int n 64 mux2 m64 a const int n 64_paddluus __m64 a X m64 b padd2uus m64 a __m64 b pavgl nraz m64 a __m64 b pavg2 nraz m64 a X m64 b pavgsubl m64 a _ m64 b pavgsub1 Parallel average subtract _m64 __m64 a _m64 __m64 a _m64 __m64 a _m64 __m64 a __m64 _m64_m __m64 a __m64 b m64 m64 m __m64 a m64 b 64 m64 m m64 a m i b 64 m64 m m _pavgsub2 __m64 a X m64 b pavgsub2 Parallel average subtract _pmpy2r __m64 a __m64 b pmpy2 r Parallel multiply _pmpy21 __m64 a __m64 b pmpy2 1 Parallel multiply m64 _m64_pmpyshr2 __m64 a m64 p pmpyshr2 Parallel multiply and
142. ES: This variable is the same as DEPENDENCIES_OUTPUT, except that system header files are not ignored.

Compilation Environment Options
The Intel C++ Compiler installation includes shell scripts that you can use to set environment variables. See Invoking the Compiler from the Command Line for more information.

Configuration Files
You can decrease the time you spend entering command-line options and ensure consistency by using the configuration file to automate often-used command-line entries. You can insert any valid command-line option into the configuration file. The compiler processes options in the configuration file in the order they appear, followed by the command-line options that you specify when you invoke the compiler.

Note: Options in the configuration file will be executed every time you run the compiler. If you have varying option requirements for different projects, see Response Files.

How to Use Configuration Files
The following example illustrates a basic configuration file. After you have written the .cfg file, simply ensure it is in the same directory as the compiler's executable file when you run the compiler. The text following the pound character (#) is recognized as a comment. The configuration file is icc.cfg.
    # Sample configuration file
    # Define preprocessor macro MY_PROJECT
    -DMY_PROJECT
    # Additional directories to be searched for INCLUDE files before th
143. For routines that contain OpenMP directives, only the -openmp option is honored.

Vectorization (IA-32 only)
The vectorizer is a component of the Intel C++ Compiler that automatically uses SIMD instructions in the MMX, SSE, and SSE2 instruction sets. The vectorizer detects operations in the program that can be done in parallel, and then converts the sequential program to process 2, 4, 8, or 16 elements in one operation, depending on the data type. This section provides guidelines, option descriptions, and examples for the Intel C++ Compiler vectorization on IA-32 systems only. The following list summarizes this section's contents:
• a quick reference of vectorization functionality and features
• descriptions of compiler switches to control vectorization
• descriptions of the C++ language features to control vectorization
• discussion and general guidelines on vectorization levels
• automatic vectorization
• vectorization with user intervention
• examples demonstrating typical vectorization issues and resolutions

Vectorizer Options
Option | Description
-ax{K|W|N|B|P} | Enables the vectorizer and generates specialized and generic IA-32 code. The generic code is usually slower than the specialized code.
-x{K|W|N|B|P} | Turns on the vectorizer and generates processor-specific specialized code.
-vec_reportn | Controls the vectorizer's level of diagnostic messages: n = 0, no diagnostic information is disp
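As an illustration of what the vectorizer looks for, the sketch below contrasts a loop whose iterations are independent with one that carries a dependence from one iteration to the next. The function names are hypothetical, and whether a particular compiler actually vectorizes the first loop depends on the options in effect and on what it can prove about pointer aliasing; the diagnostic option described above is the way to find out.

```cpp
#include <cassert>
#include <cstddef>

// Candidate for vectorization: unit stride, a countable trip count, and no
// iteration reads a value written by another iteration (assuming x and y do
// not overlap).
inline void scale_and_add(std::size_t n, float a, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Not vectorizable as written: iteration i reads y[i - 1], which iteration
// i - 1 has just written (a loop-carried dependence).
inline void running_sum(std::size_t n, float* y) {
    for (std::size_t i = 1; i < n; ++i) {
        y[i] += y[i - 1];
    }
}

inline void check_loops() {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {1.0f, 1.0f, 1.0f, 1.0f};
    scale_and_add(4, 2.0f, x, y);        // y = {3, 5, 7, 9}
    assert(y[0] == 3.0f && y[3] == 9.0f);
    running_sum(4, y);                   // y = {3, 8, 15, 24}
    assert(y[1] == 8.0f && y[3] == 24.0f);
}
```

Both loops compute correct results either way; the difference is only in whether several iterations can safely be executed in one SIMD operation.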
144. I8vecl6 R Sub s LIS u u 64vec2 32vec4 32vec2 16vec8 16vec4 8vec8 A 8vec2 A Return Value Available Operators Right Side Operands A B I I S S u u 64vec2 B 32vec4 B 32vec2 B l6vec8 B l6vec4 B 8vec8 B 8vecl6 B The following table shows the return data type values for operands of the addition and subtraction operators with assignment The left side operand determines the size and signedness of the return value The right side operand must be the same size as the left operand otherwise you must use an explicit typecast Addition and Subtraction with Assignment Return Value R Left Side R Add Sub mm 32vec4 32vec2 R 16vec8 ka l6vec4 8vec16 E 8vec8 I IT I I x x x x x 32vec2 R 32vec2 R 16vec8 4 l6vec4 4 8vec16 8vec8 4 Right Side A 32vec4 32vec2 l6vec4 8vec16 347 Multiplication Operators The multiplication operators can only accept and return data types from the I s u 16vec4 or I s u 16vec8 classes as shown in the following example Syntax Usage for Multiplication Operators Explicitly convert B to Is16vec4 Isl6vec4 A C Iu32vec2 B C A C C A Isl6vec4 B Return nearest common ancestor type I16vec4 Isl6vec4 A Iul6vec4 B Il6vec4 C C A B The mul high and mul add functions take
145. IVDEP directive ensures there is no loop-carried dependency for the store into a.

Example
    #pragma ivdep
    for (j = 0; j < n; j++) {
        a[b[j]] = a[b[j]] + 1;
    }

Parallel Programming
For parallel programming, the Intel C++ Compiler supports both the OpenMP 2.0 API and an automatic parallelization capability. The following table lists the options that perform OpenMP and auto-parallelization support.

Option | Description
-openmp | Enables the parallelizer to generate multithreaded code based on the OpenMP directives. Default: OFF.
-openmp_report{0|1|2} | Controls the OpenMP parallelizer's diagnostic levels. Default: -openmp_report1.
-openmp_stubs | Enables compilation of OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub OpenMP library is linked. Default: OFF.
-parallel | Enables the auto-parallelizer to generate multithreaded code for loops that can be safely executed in parallel. Default: OFF.
-par_threshold{n} | Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, n = 0 to 100. n = 0 implies "always". Default: -par_threshold75.
-par_report{0|1|2|3} | Controls the auto-parallelizer's diagnostic levels. Default: -par_report1.

Note: When both -openmp and -parallel are specified on the command line, the -parallel option is honored only in routines that do not contain OpenMP directives
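To make the OpenMP option concrete, here is a minimal sketch of a loop annotated with an OpenMP directive. The function name `sum_of_squares` is hypothetical. When OpenMP support is not enabled, compilers generally ignore the pragma and the loop runs sequentially with the same result, which is the behavior the sequential mode described above relies on.

```cpp
#include <cassert>
#include <vector>

// Sums the squares of the elements. The reduction clause gives each thread a
// private partial sum and combines the partial sums when the loop finishes.
inline long long sum_of_squares(const std::vector<int>& v) {
    long long total = 0;
    const long n = static_cast<long>(v.size());  // signed loop index, as OpenMP 2.0 expects
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += static_cast<long long>(v[i]) * v[i];
    }
    return total;
}

inline void check_sum_of_squares() {
    std::vector<int> v;
    v.push_back(1);
    v.push_back(2);
    v.push_back(3);
    assert(sum_of_squares(v) == 14);  // 1 + 4 + 9
}
```

Without the reduction clause, concurrent updates to `total` would race; the clause is what makes the loop safe to run on several threads.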
146. In the following descriptions regarding the bits of the MMX register, bit 0 is the least significant and bit 63 is the most significant.

__m64 _mm_setzero_si64() (uses PXOR)
Sets the 64-bit value to zero.
r := 0x0

__m64 _mm_set_pi32(int i1, int i0) (composite)
Sets the 2 signed 32-bit integer values.
r0 := i0
r1 := i1

__m64 _mm_set_pi16(short s3, short s2, short s1, short s0) (composite)
Sets the 4 signed 16-bit integer values.
r0 := w0
r1 := w1
r2 := w2
r3 := w3

__m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0) (composite)
Sets the 8 signed 8-bit integer values.
r0 := b0
r1 := b1
...
r7 := b7

__m64 _mm_set1_pi32(int i)
Sets the 2 signed 32-bit integer values to i.
r0 := i
r1 := i

__m64 _mm_set1_pi16(short s) (composite)
Sets the 4 signed 16-bit integer values to w.
r0 := w
r1 := w
r2 := w
r3 := w

__m64 _mm_set1_pi8(char b) (composite)
Sets the 8 signed 8-bit integer values to b.
r0 := b
r1 := b
...
r7 := b

__m64 _mm_setr_pi32(int i1, int i0) (composite)
Sets the 2 signed 32-bit integer values in reverse order.
r0 := i0
r1 := i1

__m64 _mm_setr_pi16(short s3, short s2, short s1, short s0) (composite)
Sets the 4 signed 16-bit integer values in reverse order.
r0 := w0
r1 := w1
r2 := w2
r3 := w3

__m64 _mm_setr_pi8(char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0) (co
147. Indicate option s required argument s Arguments are separated by comma if more than one are required e Options supported on both IA 32 and Itanium based systems Option alias args toes axB Description Default This option implies arguments alias args may be aliased not aliased Specifies that the application OFF cannot exceed a 32 bit address space which allows the compiler to use 32 bit pointers whenever possible To use this option you must also specify ipo Using the auto i1p32 option on programs that can exceed 32 bit address space 2 32 may cause unpredictable results during program execution Generates specialized code for OFF Intel amp Pentium amp M and compatible Intel processors Intel C4 Compiler for Linux Systems User s Guide complex limited range create pch filename cxxlib gcc cxxlib icc fast fminshared fno common Toro Description Generates specialized code for Intel Pentium 4 and compatible Intel processors Generates specialized code for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 Enables the use of delete basic algebraic expansions of some arithmetic operations involving data of type Complex This can cause some performance improvements in programs that use Complex arithmetic but values at the extremes of the exponent range may not compute correctly Default is complex_limited_range
148. Inequality

Intrinsic Name | Corresponding Instruction | Compare For
_mm_cmpnlt_pd | CMPNLTPD | Not Less Than
_mm_cmpnle_pd | CMPNLEPD | Not Less Than or Equal
_mm_cmpngt_pd | CMPNLTPDr | Not Greater Than
_mm_cmpnge_pd | CMPNLEPDr | Not Greater Than or Equal
_mm_cmpeq_sd | CMPEQSD | Equality
_mm_cmplt_sd | CMPLTSD | Less Than
_mm_cmple_sd | CMPLESD | Less Than or Equal
_mm_cmpgt_sd | CMPLTSDr | Greater Than
_mm_cmpge_sd | CMPLESDr | Greater Than or Equal
_mm_cmpord_sd | CMPORDSD | Ordered
_mm_cmpunord_sd | CMPUNORDSD | Unordered
_mm_cmpneq_sd | CMPNEQSD | Inequality
_mm_cmpnlt_sd | CMPNLTSD | Not Less Than
_mm_cmpnle_sd | CMPNLESD | Not Less Than or Equal
_mm_cmpngt_sd | CMPNLTSDr | Not Greater Than
_mm_cmpnge_sd | CMPNLESDr | Not Greater Than or Equal
_mm_comieq_sd | COMISD | Equality
_mm_comilt_sd | COMISD | Less Than
_mm_comile_sd | COMISD | Less Than or Equal
_mm_comigt_sd | COMISD | Greater Than
_mm_comige_sd | COMISD | Greater Than or Equal
_mm_comineq_sd | COMISD | Not Equal
_mm_ucomieq_sd | UCOMISD | Equality
_mm_ucomilt_sd | UCOMISD | Less Than
_mm_ucomile_sd | UCOMISD | Less Than or Equal
_mm_ucomigt_sd | UCOMISD | Greater Than
_mm_ucomige_sd | UCOMISD | Greater Than or Eq
_mm_ucomineq_sd | UCOMISD |
149. Intel C++ Compiler for Linux Systems User's Guide. Document Number: 253254-018.

Disclaimer and Legal Information
Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty relating to sale and/or use of Intel products, including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright, or other intellectual property right. Intel products are not intended for use in medical, life-saving, or life-sustaining applications.

This User's Guide, as well as the software described in it, is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this manual is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel Corporation. Intel Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves
150. Is16vec4 data only Isl6vec4 A B C D C mul_high A B D mul_add A B Multiplication Operators with Corresponding Intrinsics Syntax Usage Intrinsic mul_high N A R bi pil6 Lhi_epil mm madd pil6 mm madd epil6 The multiplication return operators always return the nearest common ancestor as listed in the table that follows The two operands must be 16 bits in size otherwise you must explicitly indicate typecasting Multiplication Operator Overloading R Mul Il6vec4 R I s u 16vec4 A I s u l6vec4 B Il6vec8 R I s u 16vec8 A I s u 16vec8 B Isl6vec4 R mul add Isl6vec4 A Isl6vec4 B Isl6vec8 mul add Isl6vec8 A Isl6vec8 B 348 Intel C Intrinsics Reference R Mul Is32vec2 R mul_high Isl6vec4 A Isl vec4 B Is32vec4 R mul high sl6vec8 A Isl6vec8 B The following table shows the return values and data type assignments for operands of the multiplication operators with assignment All operands must be 16 bytes in size If the operands are not the right size you must use an explicit typecast Multiplication with Assignment Kn Return Value R Left Side R Mul Right Side A I x 16vec8 I x 16vec8 I s u 16vec8 A m I x 16vec4 I x 16vec4 I s u 16vec4 A Shift Operators The right shift argument can be any integer or Ivec value and is implicitly converted to a M64 data type The first or left operand of a lt lt can
151. L2,...Ln: Disables diagnostics L1 through Ln.

Example
    /* test.c */
    int main()
    {
        int x = 0;
    }

If you compile test.c above using the -Wall option (enable all warnings), the compiler will emit warning #177:
    prompt> icc -Wall test.c
    remark #177: variable "x" was declared but never referenced
To disable warning #177, use the -wd option:
    prompt> icc -Wall -wd177 test.c
Likewise, using the -we option will result in a compile-time error:
    prompt> icc -Wall -we177 test.c
    error #177: variable "x" was declared but never referenced
    compilation aborted for test.c

Limiting the Number of Errors Reported
Use the -wnn option to limit the number of error messages displayed before the compiler aborts. By default, if more than 100 errors are displayed, compilation aborts.

Option | Description
-wnn | Limit the number of error diagnostics that will be displayed prior to aborting compilation to n. Remarks and warnings do not count towards this limit.

For example, the following command line specifies that if more than 50 error messages are displayed during the compilation of a.cpp, compilation aborts:
    prompt> icpc -wn50 -c a.cpp

Remark Messages
These messages report common, but sometimes unconventional, use of C or C++. The compiler does not print or display remarks unless you specify level 4 for the -W option, as described in Suppressing W
152. N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A Itanium amp Architecture N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A 325 Intel C Compiler for Linux Systems User s Guide Intrinsic Across MMX Streaming Streaming Itanium amp AIl IA Technology SIMD SIMD Architecture Extenions Extensions mm cvtpd ps N A N A N A N A mm cvtps pd N A N A N A N A mm cvtsd ss N A N A N A N A mm cvtss sd N A N A N A N A mm cvtsd si32 N A N A N A N A mm cvttsd si32 N A N A N A N A mm cvtsi32 sd N A N A N A N A mm cvtpd pi32 N A N A N A N A mm cvttpd pi32 N A N A N A N A mm cvtpi32 pd N A N A N A N A mm unpackhi pd N A N A N A N A mm unpacklo pd N A N A N A N A mm unpacklo pd N A N A N A N A mm shuffle pd N A N A N A N A mm load pd N A N A N A N A mm loadl pd N A N A N A N A mm loadr pd N A N A N A N A mm loadu pd N A N A N A N A mm load sd N A N A N A N A mm loadh pd N A N A N A N A mm loadl pd N A N A N A N A mm set sd N A N A N A N A 326 Intel C Intrinsics Reference Intrinsic mm mm _mm mm mm _setl_pd Set pd Setr pd Setzero pd move sd store sd Storel pd Store pd Storeu pd Storer
153. N A N A mm sad epu8 N A N A N A N A mm sub epi8 N A N A N A N A mm sub epil N A N A N A N A mm sub epi32 N A N A N A N A mm sub si64 N A N A N A N A mm sub epi64 N A N A N A N A mm subs epi8 N A N A N A N A mm subs epil6 N A N A N A N A mm subs epu8 N A N A N A N A mm subs epul6 N A N A N A N A mm and si128 N A N A N A N A 328 Intel C Intrinsics Reference Intrinsic mm_andnot_sil28 mm_or_sil28 mm_xor_sil28 mm slli epil6 mm sll epil6 mm slli epi32 mm sll epi32 mm slli epio4 mm sll epi 4 mm srai epil6 mm sra epil mm srai epi32 mm sra epi32 mm srli si128 mm srli epil6 mm srl epil6 mm srli epi32 mm srl epi32 mm srli epi64 mm srl epi64 mm cmpeq epi8 Across MMX All IA N A N A N A N A N A N A Streaming Streaming SIMD Extenions Extensions 2 N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A Itanium amp Architecture N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A 329 330 Intel C Compiler for Linux Systems User s Guide Intrinsic Across MMX Streaming Streaming Itanium amp AIl IA Technology SIMD SIMD Arc
154. ND Computes the bitwise AND of the 128 bit value in a and the 128 bit value in b ri a b __mi28i mm andnot si128 m128i a __m128i b uses PANDN Computes the bitwise AND of the 128 bit value in b and the bitwise NOT of the 128 bit value in a r ra amp b 268 Intel C Intrinsics Reference __mi28i mm or si128 mi128i a __m128i b uses POR Computes the bitwise OR of the 128 bit value in a and the 128 bit value in b ro a b m128i mm xor sil128 m128i a m128i bi uses PXOR Computes the bitwise XOR of the 128 bit value in a and the 128 bit value in b FC ze a P Integer Shift Operations for Streaming SIMD Extensions 2 The shift operation intrinsics for Streaming SIMD Extensions 2 and the description for each are listed in the following table The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmint rin h header file Se MAS Corresponding Instruction cke peser all EE ll Ge RM T pc NM ee wee Cai MN HON ad E T EES pu Se pos Nr HAMM NI penu T 269 Intel C Compiler for Linux Systems User s Guide Intrinsie Shift Direction Shift Type Corresponding Instruction __mi28i mm slli si128 m128i a int imm Shifts the 128 bit value in a left by imm bytes while shifting in zeros imm must be an immediate r a lt lt imm 8 m128i mm slli epil6 m128i a int count S
155. Options Quick Reference filel file2 Indicates one or more files to be processed by the compilation system You can specify more than one file Use a space as a delimiter for multiple files linker Indicates options directed to the linker options Example prompt icpc prec div axW my sourcel cpp my source2 cpp Bstatic 41 Intel C Compiler for Linux Systems User s Guide Invoking the Compiler from the Command Line with make To run make from the command line using Intel C Compiler make sure that usr bin is in your path If you use a C shell you can edit your cshrc file and add setenv PATH usr bin full path to Intel compiler F Note To use the Intel compiler your makefile must include the setting CC icc Use the same setting on the command line to instruct the makefile to use the Intel compiler If your makefile is written for gcc the GNU C compiler you will need to change those command line options not recognized by the Intel compiler Then you can compile prompt make f my makefile Compiler Input Files The Intel C Compiler recognizes the file name extensions listed in the table below fi 42 fil fil fil fil fil fil lenam ET fil fil fil Filename ename ename ilename ename ename ename ename enam enam enam enam 00000 SO CC CPP CXX Interpretation Object library When you invoke the compile
156. PILER ERROR message

Suppressing Warning Messages with lint Comments

The UNIX lint program attempts to detect features of a C or C++ program that are likely to be bugs, non-portable, or wasteful. The compiler recognizes three lint-specific comments:

    1. /*ARGSUSED*/
    2. /*NOTREACHED*/
    3. /*VARARGS*/

Like the lint program, the compiler suppresses warnings about certain conditions when you place these comments at specific points in the source.

Suppressing Warning Messages or Enabling Remarks

Use the -w or -Wn option to suppress warning messages or to enable remarks during the preprocessing and compilation phases. You can enter the option with one of the following arguments:

    Option   Description
    -w0      Display only errors (same as -w)
    -w1      Display warnings and errors (DEFAULT)
    -w2      Display remarks, warnings, and errors

For some compilations, you might not want warnings for known and benign characteristics, such as the K&R C constructs in your code. For example, the following command compiles newprog.cpp and displays compiler errors, but not warnings:

    prompt> icpc -W0 newprog.cpp

Use the -ww, -we, or -wd option to indicate specific diagnostics.

    Option            Description
    -wwL1 L2 ... Ln   Changes the severity of diagnostics L1 through Ln to warning.
    -weL1 L2 ... Ln   Changes the severity of diagnostics L1 through Ln to error.
    -wdL1
157. Rl B4 R2 A5 R3 B5 R4 A6 R5 B6 R6 A7 R7 B7 Corresponding intrinsic _mm_unpackhi_pi8 357 Interleave the sixteen 8 bit values from the high half of A with the four 8 bit values from the high half of B I8vecl6 unpack_high I8vecl6 A I8vecl6 B Is8vecl6 unpack_high Is8vecl6 A I8vecl6 B Tu8vecl6 unpack high Iu8vecl 6 A I8vecl6 B RO A8 Rl B8 R2 A9 R3 B9 R4 A10 R5 B10 R6 All R7 B11 R8 A12 R8 B12 R2 A13 R3 B13 RA A14 R5 B14 R6 A15 R7 B15 Corresponding intrinsic _mm_unpackhi_epil6 Interleave the 32 bit value from the low half of A with the 32 bit value from the low half of B RO AO R1 BO Corresponding intrinsic _mm_unpacklo_epi32 Interleave the 64 bit value from the low half of A with the 64 bit values from the low half of B I64vec2 unpack low I64vec2 A I64vec2 B Is64vec2 unpack low Is64vec2 A Is64vec2 B Iu64vec2 unpack low Iu64vec2 A Iu64vec2 B RO AO Rl BO R2 Al R3 Bl Corresponding intrinsic _mm_unpacklo_epi32 Interleave the two 32 bit values from the low half of A with the two 32 bit values from the low half of B T32vec4 unpack_low I32vec4 A I32vec4 B Is32vec4 unpack_low Is32vec4 A Is32vec4 B Tu32vec4 unpack low Iu32vec4 A Iu32vec4 B RO AO R1 BO R2 Al R3 Bl Corresponding intrinsic _mm_unpacklo_epi32 358 Intel C Intrinsics Reference In
158. SSE intrinsics for Class Libraries ia64intrin h ia64regs h Standard header file iso646 h Standard header file ivec h MMX instructions intrinsics for Class Libraries limits h Standard header file mathimf h Principal header file for current Intel Math Library mmintrin h Intrinsics for MMX instructions 166 Key Files File omp h pgouser h proto h sse2mmx h stdarg h stdbool h stddef h syslimits h varargs h xarg h xmmintrin h lib Files File libcprts a libcxa so libirc a libm a libguide a libguide so ibmofl a ibmofl so libunwinder a libintrins a Description Principal header file OpenMP For use in the instrumentation compilation phase of profile guided optimizations Principal header file for Streaming SIMD Extensions 2 intrinsics Replacement header for standard stdarg h Defines _Bool keyword Standard header file Replacement header for standard varargs h Header file used by stdargs h and varargs h Principal header file for Streaming SIMD Extensions intrinsics Description C standard language library C language library indicating I O data location Intel specific library optimizations Math library OpenMP library Shared OpenMP library Multiple Object Format Library used by the Intel assembler Shared Multiple Object Format Library used by the Intel assembler Unwinder library Intrinsic functions library 167 D
159. State Macros Macro Arguments MM SET EXCEPTION STATE x MM EXCEPT INVALID MM GET EXCEPTION STATE MM EXCEPT DIV ZERO M EXCEPT DENORM Macro Definitions MM EXCEPT OVERFLOW Write to and read from the sixth least significant control register bit respectively MM EXCEPT UNDERFLOW MM EXCEPT INEXACT 246 Intel C Intrinsics Reference The following example tests for a divide by zero exception Exception State Macros with MM EXCEPT DIV ZERO if MM LET EXCEPTION STATE x c MM EXCEPT DIU ZERO Exception has occurred OOo Exception Mask Macros Macro Arguments _ _ _MM_SET_EXCEPTION_MASK x ASK INVALID be MM GET EXCEPTION MASK ASK DIV Zl Macro Definitions d ERF LOW Write to and read from the seventh through twelfth control register bits respectively Note All six exception mask bits are always affected Bits not set explicitly are cleared p MM MASK UN MM MASK INEXACT The following example masks the overflow and underflow exceptions and unmasks all other exceptions Exception Mask with MM MASK OVERFLOW and MM MASK UNDERFLOW EPTION MASK MM MASK OVERFLOW MM MASK UNDERFLOW Rounding Mode Macro Arguments MM SET ROUNDING MOD
160. TATIC (no chunk size specified)

Sets the number of threads to use during execution. Default: number of processors.
Enables (TRUE) or disables (FALSE) the dynamic adjustment of the number of threads.
Enables (TRUE) or disables (FALSE) parallelism.

Intel Extension Environment Variables

    Environment Variable: KMP_LIBRARY
    Description: Selects the OpenMP run-time library throughput. The options for the variable value are serial, turnaround, or throughput, indicating the execution mode. The default value of throughput is used if this variable is not specified.
    Default: throughput (execution mode)

    Environment Variable: KMP_STACKSIZE
    Description: Sets the number of bytes to allocate for each parallel thread to use as its private stack. Use the optional suffix b, k, m, g, or t to specify bytes, kilobytes, megabytes, gigabytes, or terabytes.
    Default: IA-32: 2m; Itanium compiler: 4m

OpenMP Run-time Library Routines

OpenMP provides several run-time library functions to assist you in managing your program in parallel mode. Many of these functions have corresponding environment variables that can be set as defaults. The run-time library functions enable you to dynamically change these factors to assist in controlling your program. In all cases, a call to a run-time library function overrides any corresponding environment variable.

The following tab
161. Test2 dpi At this step the profmerge tool merges all the dyn files into one file Test2 dpi that represents the total profile information of the application on Test 2 9 ssue command rm PROF DIR dyn Make sure that there are no unrelated dyn files present 10 Issue command myApp data3 This command runs the instrumented application and generates one or more new dynamic profile information files that have an extension dyn in the directory specified by PROF DIR 11 Issue Command profmerge prof dpi Test3 dpi At this step the profmerge tool merges all the dyn files into one file Test 3 dpi that represents the total profile information of the application on Test 3 12 Create a file named tests list with three lines The first line contains Test 1 dpi the second line contains Test 2 dpi and the third line contains Test 3 dpi When these items are available the Test prioritization Tool may be launched from the command line in PROF DIR directory as described in the following examples In all examples the discussion references the same set of data Example 1 Minimizing the Number of Tests tselect dpi list tests list spi pgopti spi where the spi option specifies the path to the spi file Here is a sample output from this run of the Test prioritization Tool number of tests 3 block coverage 52 17 function coverage 50 00 112 Compiler Optimizations Num RatCvrg BI
162. __m128 b)
Subtracts adjacent vector elements.
    r0 := a0 - a1; r1 := a2 - a3; r2 := b0 - b1; r3 := b2 - b3;

extern __m128 _mm_movehdup_ps(__m128 a)
Duplicates odd vector elements into even vector elements.
    r0 := a1; r1 := a1; r2 := a3; r3 := a3;

extern __m128 _mm_moveldup_ps(__m128 a)
Duplicates even vector elements into odd vector elements.
    r0 := a0; r1 := a0; r2 := a2; r3 := a2;

Double-precision Floating-point Vector Intrinsics

extern __m128d _mm_addsub_pd(__m128d a, __m128d b)
Adds upper vector element while subtracting lower vector element.
    r0 := a0 - b0; r1 := a1 + b1;

extern __m128d _mm_hadd_pd(__m128d a, __m128d b)
Adds adjacent vector elements.
    r0 := a0 + a1; r1 := b0 + b1;

extern __m128d _mm_hsub_pd(__m128d a, __m128d b)
Subtracts adjacent vector elements.
    r0 := a0 - a1; r1 := b0 - b1;

extern __m128d _mm_loaddup_pd(double const *dp)
Duplicates a double value into upper and lower vector elements.
    r0 := *dp; r1 := *dp;

extern __m128d _mm_movedup_pd(__m128d a)
Duplicates lower vector element into upper vector element.
    r0 := a0; r1 := a0;

Integer Vector Intrinsics

The integer vector intrinsic listed below is designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3).

extern __m128i _mm_lddqu_si128(__m128i const *p)
Loads an unaligned 128-bit value. This differs from movdqu in that it can provi
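The element-wise semantics listed above can be checked with a small scalar model. The following is a minimal sketch in portable C, not the intrinsics themselves: the types `vec4f` and `vec2d` and all `model_*` function names are hypothetical stand-ins introduced here for illustration, and no SSE3 hardware or header is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for __m128 (4 floats) and __m128d (2 doubles). */
typedef struct { float f[4]; } vec4f;
typedef struct { double d[2]; } vec2d;

/* Models _mm_hsub_ps: subtract adjacent elements of a, then of b. */
static vec4f model_hsub_ps(vec4f a, vec4f b)
{
    vec4f r;
    r.f[0] = a.f[0] - a.f[1];
    r.f[1] = a.f[2] - a.f[3];
    r.f[2] = b.f[0] - b.f[1];
    r.f[3] = b.f[2] - b.f[3];
    return r;
}

/* Models _mm_movehdup_ps: each odd element is copied into the even slot below it. */
static vec4f model_movehdup_ps(vec4f a)
{
    vec4f r = { { a.f[1], a.f[1], a.f[3], a.f[3] } };
    return r;
}

/* Models _mm_moveldup_ps: each even element is copied into the odd slot above it. */
static vec4f model_moveldup_ps(vec4f a)
{
    vec4f r = { { a.f[0], a.f[0], a.f[2], a.f[2] } };
    return r;
}

/* Models _mm_addsub_pd: subtract in the lower element, add in the upper. */
static vec2d model_addsub_pd(vec2d a, vec2d b)
{
    vec2d r = { { a.d[0] - b.d[0], a.d[1] + b.d[1] } };
    return r;
}

/* Models _mm_hadd_pd: sum the two elements of a, then the two elements of b. */
static vec2d model_hadd_pd(vec2d a, vec2d b)
{
    vec2d r = { { a.d[0] + a.d[1], b.d[0] + b.d[1] } };
    return r;
}

/* Models _mm_loaddup_pd: broadcast one double from memory to both elements. */
static vec2d model_loaddup_pd(const double *dp)
{
    vec2d r;
    assert(dp != NULL);
    r.d[0] = *dp;
    r.d[1] = *dp;
    return r;
}
```

A model like this is useful as a reference when unit-testing code that uses the real intrinsics, since it runs on any processor and makes the lane ordering explicit.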
163. a, libunwind.so, libunwind.so.3: Unwinder library.
libcxa.a, libcxa.so, libcxa.so.3: Intel run-time support for C++ features.

    Library: libcxaguard.a, libcxaguard.so, libcxaguard.so.3
    Description: Used for interoperability support with the -cxxlib-gcc option. See gcc Interoperability.

When you invoke the -cxxlib-gcc option, the following replacements occur:

    - libcprts is replaced with libstdc++ from the gcc distribution (3.2 or newer)
    - libcxa and libunwind are replaced by libgcc from the gcc distribution (3.2 or newer)

If you want to link your program with alternate or additional libraries, specify them at the end of the command line. For example, to compile and link prog.cpp with mylib.a, use the following command:

    prompt> icpc prog.cpp mylib.a

The mylib.a library appears prior to the libimf.a library in the command line for the ld linker.

Caution: The Linux system libraries and the compiler libraries are not built with the -align option. Therefore, if you compile with the -align option and make a call to a compiler-distributed or system library, and have long long, double, or long double types in your interface, you will get the wrong answer due to the difference in alignment. Any code built with -align cannot make calls to libraries that use these types in their interfaces unless they are built with -align (in which case they will not work without -align).

Math Libraries

The Intel ma
164. a right by count bits while shifting in zeros.
    r0 := srl(a0, count); r1 := srl(a1, count); r2 := srl(a2, count); r3 := srl(a3, count);

__m128i _mm_srl_epi32(__m128i a, __m128i count)
Shifts the 4 signed or unsigned 32-bit integers in a right by count bits while shifting in zeros.
    r0 := srl(a0, count); r1 := srl(a1, count); r2 := srl(a2, count); r3 := srl(a3, count);

__m128i _mm_srli_epi64(__m128i a, int count)
Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros.
    r0 := srl(a0, count); r1 := srl(a1, count);

__m128i _mm_srl_epi64(__m128i a, __m128i count)
Shifts the 2 signed or unsigned 64-bit integers in a right by count bits while shifting in zeros.
    r0 := srl(a0, count); r1 := srl(a1, count);

Integer Comparison Operations for Streaming SIMD Extensions 2

The comparison intrinsics for Streaming SIMD Extensions 2 and descriptions for each are listed in the following table. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

    Intrinsic Name     Instruction   Comparison     Elements   Size of Elements (bits)
    _mm_cmpeq_epi8     PCMPEQB       Equality       16         8
    _mm_cmpeq_epi16    PCMPEQW       Equality       8          16
    _mm_cmpeq_epi32    PCMPEQD       Equality       4          32
    _mm_cmpgt_epi8     PCMPGTB       Greater Than   16         8
    _mm_cmpgt_epi16    PCMPGTW       Greater Than   8          16
    _mm_cmpgt_epi32    PCMPGTD       Greater Than   4          32
    _mm_cmplt_epi8     PCMPGTBr      Less Than      16         8
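The shift and comparison semantics above can be expressed as a scalar model. This is a minimal sketch in portable C under stated assumptions: the type `vec4u32` and the `model_*` and `lane` names are hypothetical, introduced only for illustration. Two behaviors are taken from the underlying SSE2 instructions rather than from the excerpt: a logical shift count greater than 31 yields zero for 32-bit elements, and a comparison sets every bit of an element (0xFFFFFFFF) when true and clears it when false.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an __m128i viewed as four 32-bit elements. */
typedef struct { uint32_t u[4]; } vec4u32;

/* Bounds-checked element access, used to keep examples readable. */
static uint32_t lane(vec4u32 v, int i)
{
    assert(i >= 0 && i < 4);
    return v.u[i];
}

/* Models _mm_srli_epi32: logical right shift of each element, zeros shifted in.
 * Counts above 31 produce zero (shifting by >= 32 is undefined in plain C,
 * so it is handled explicitly). */
static vec4u32 model_srli_epi32(vec4u32 a, unsigned count)
{
    vec4u32 r;
    for (int i = 0; i < 4; ++i)
        r.u[i] = (count > 31u) ? 0u : (a.u[i] >> count);
    return r;
}

/* Models _mm_cmpeq_epi32: all ones where equal, zero elsewhere. */
static vec4u32 model_cmpeq_epi32(vec4u32 a, vec4u32 b)
{
    vec4u32 r;
    for (int i = 0; i < 4; ++i)
        r.u[i] = (a.u[i] == b.u[i]) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* Models _mm_cmpgt_epi32: signed greater-than.
 * Flipping the sign bit maps signed order onto unsigned order,
 * which avoids implementation-defined signed conversions. */
static vec4u32 model_cmpgt_epi32(vec4u32 a, vec4u32 b)
{
    vec4u32 r;
    for (int i = 0; i < 4; ++i)
        r.u[i] = ((a.u[i] ^ 0x80000000u) > (b.u[i] ^ 0x80000000u)) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* Models _mm_cmplt_epi32 as greater-than with the operands reversed,
 * which is what the "r" suffix on instructions such as PCMPGTBr denotes. */
static vec4u32 model_cmplt_epi32(vec4u32 a, vec4u32 b)
{
    return model_cmpgt_epi32(b, a);
}
```

The all-ones result is what makes these comparisons composable with the bitwise AND, ANDNOT, and OR intrinsics for branch-free selection.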
165. all Intel-provided libraries should be linked dynamically.

Causes the executable to link all libraries statically, as opposed to dynamically. When -static is not used:

    - /lib/ld-linux.so.2 is linked in
    - all other libs are linked dynamically

When -static is used:

    - /lib/ld-linux.so.2 is not linked in
    - all other libs are linked statically

By default, the Intel-provided libcxa C++ library is linked in dynamically. Use -static-libcxa on the command line to link libcxa statically, while still allowing the standard libraries to be linked in by the default behavior.

This option is placed in the linker command line corresponding to its location on the user command line. This option is used to control the linking behavior of any library being passed in via the command line.

This option is placed in the linker command line corresponding to its location on the user command line. This option is used to control the linking behavior of any library being passed in via the command line.

Suppressing Linking

Use the -c option to suppress linking. For example, entering the following command produces the object files file1.o and file2.o:

    prompt> icpc -c file1.cpp file2.cpp

Note: The preceding command does not link these files to produce an executable file.

Debugging

This section describes the basic command-line options that you can use as tools to debug your compilati
166. ally store byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored.
    if (sign(n0)) p[0] := d0
    if (sign(n1)) p[1] := d1
    ...
    if (sign(n7)) p[7] := d7

__m64 _m_pavgb(__m64 a, __m64 b)
Computes the (rounded) averages of the unsigned bytes in a and b.
    t = (unsigned short)a0 + (unsigned short)b0
    r0 = (unsigned char)((t >> 1) + (t & 0x01))
    ...
    t = (unsigned short)a7 + (unsigned short)b7
    r7 = (unsigned char)((t >> 1) + (t & 0x01))

__m64 _m_pavgw(__m64 a, __m64 b)
Computes the (rounded) averages of the unsigned words in a and b.
    t = (unsigned int)a0 + (unsigned int)b0
    r0 = (unsigned short)((t >> 1) + (t & 0x01))
    ...
    t = (unsigned int)a3 + (unsigned int)b3
    r3 = (unsigned short)((t >> 1) + (t & 0x01))

__m64 _m_psadbw(__m64 a, __m64 b)
Computes the sum of the absolute differences of the unsigned bytes in a and b, returning the value in the lower word. The upper three words are cleared.
    r0 = abs(a0 - b0) + ... + abs(a7 - b7)
    r1 = r2 = r3 = 0

Memory and Initialization Using Streaming SIMD Extensions

This section describes the load, set, and store operations, which let you load and store data into memory. The load and set operations are similar in that both initialize __m128 data. However, the set operations take a float argument and are intended for initialization with constants, whereas the load operations take a floatin
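The rounded average and sum-of-absolute-differences operations above can be modeled in scalar code. The following is a minimal sketch in portable C, not the intrinsics themselves: the type `vec8u8` and the function names are hypothetical, introduced for illustration. The rounding term is written as `(t >> 1) + (t & 0x01)`, which equals `(t + 1) >> 1`, that is, the average rounded up on ties.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an __m64 viewed as eight unsigned bytes. */
typedef struct { uint8_t b[8]; } vec8u8;

/* Rounded average of two unsigned bytes, computed in a wider type
 * so that the intermediate sum cannot overflow. */
static uint8_t rounded_avg_u8(uint8_t x, uint8_t y)
{
    unsigned t = (unsigned)x + (unsigned)y;
    return (uint8_t)((t >> 1) + (t & 0x01u));
}

/* Models _m_pavgb: rounded average of each pair of bytes. */
static vec8u8 model_pavgb(vec8u8 a, vec8u8 b)
{
    vec8u8 r;
    for (int i = 0; i < 8; ++i)
        r.b[i] = rounded_avg_u8(a.b[i], b.b[i]);
    return r;
}

/* Models the low word of _m_psadbw: sum of absolute byte differences.
 * The three upper words of the real result are zero and are not modeled. */
static uint16_t model_psadbw(vec8u8 a, vec8u8 b)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (unsigned)abs((int)a.b[i] - (int)b.b[i]);
    assert(sum <= 8u * 255u);   /* always fits in 16 bits */
    return (uint16_t)sum;
}
```

Widening before the addition is the key design point: adding two bytes directly in an 8-bit type would lose the carry that the rounding depends on.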
167. alse The prototypes for Streaming SIMD Extensions intrinsics are in the xmmint rin h header file Intrinsic Name Comparison Corresponding Instruction _mm_cmpeq_ss Equal CMPEQSS _mm_cmpeq_ps Equal CMPEQPS _mm_cmplt_ss Less Than CMPLTSS _mm_cmplt_ps Less Than CMPLTPS _mm_cmple_ss Less Than or Equal CMPLESS _mm_cmple_ps Less Than or Equal CMPLEPS _mm_cmpgt_ss Greater Than CMPLTSS _mm_cmpgt_ps Greater Than CMPLTPS _mm_cmpge_ss Greater Than or Equal CMPLESS mm_cmpge_ps Greater Than or Equal CMPLEPS _mm_cmpneq_ss Not Equal CMPNEQSS mm cmpneq ps Not Equal CMPNEQPS mm cmpnlt ss Not Less Than CMPNLTSS mm cmpnlt ps Not Less Than CMPNLTPS mm cmpnle ss Not Less Than or Equal CMPNLESS mm cmpnle ps Not Less Than or Equal CMPNLEPS mm cmpngt SS Not Greater Than CMPNLTSS mm cmpngt ps Not Greater Than CMPNLTPS mm cmpnge ss Not Greater Than or Equal CMPNLESS 226 Intel C Intrinsics Reference Intrinsic Name Comparison _mm_cmpnge_ps Not Greater Than or Equal _mm_cmpord_ss Ordered _mm_cmpord_ps Ordered _mm_cmpunord_ss Unordered _mm_cmpunord_ps Unordered _mm_comieq_ss Equal _mm_comilt_ps Less Than mm comile ss Less Than or Equal mm comigt ss Greater Than mm comige ss Greater Than or Equal mm comineq ss Not Equal mm ucomieq ss Equal mm ucomilt ss Less Than mm ucomile ss Less Than or Eq
168. alues from A and B.
    F64vec2 R = unpack_high(F64vec2 A, F64vec2 B);
Corresponding intrinsic: _mm_unpackhi_pd(a, b)

Selects and interleaves the lower two single-precision floating-point values from A and B.
    F32vec4 R = unpack_low(F32vec4 A, F32vec4 B);
Corresponding intrinsic: _mm_unpacklo_ps(a, b)

Selects and interleaves the higher two single-precision floating-point values from A and B.
    F32vec4 R = unpack_high(F32vec4 A, F32vec4 B);
Corresponding intrinsic: _mm_unpackhi_ps(a, b)

Move Mask Operator

Creates a 2-bit mask from the most significant bits of the two double-precision floating-point values of A, as follows:
    int i = move_mask(F64vec2 A)
    i := sign(a1)<<1 | sign(a0)<<0
Corresponding intrinsic: _mm_movemask_pd

Creates a 4-bit mask from the most significant bits of the four single-precision floating-point values of A, as follows:
    int i = move_mask(F32vec4 A)
    i := sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)<<0
Corresponding intrinsic: _mm_movemask_ps

Classes Quick Reference

This appendix contains tables listing the class functionality and corresponding intrinsics for each class in the Intel C++ Class Libraries for SIMD Operations. The following table lists all Intel C++ Compiler intrinsics that are not implemented in the C++ SIMD classes.

Logical Operators: Corresponding Intrinsics and Classes
    I64vec F64vec2 F32vec4 F32vec1 I32ve
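The move mask computation above collects one sign bit per element into the low bits of an integer. The following is a minimal sketch in portable C, not the class library or the intrinsics: the types `vec4f` and `vec2d` and the `model_*` names are hypothetical, and the sketch assumes IEEE 754 float and double layouts, where the sign is the most significant bit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for F32vec4 and F64vec2 contents. */
typedef struct { float f[4]; } vec4f;
typedef struct { double d[2]; } vec2d;

/* Returns the raw sign bit (0 or 1) of a float by inspecting its bit pattern.
 * memcpy is used instead of a pointer cast to avoid aliasing problems. */
static int sign_bit_f32(float x)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 31);
}

/* Same as above for a double. */
static int sign_bit_f64(double x)
{
    uint64_t bits;
    assert(sizeof(double) == sizeof(uint64_t));
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 63);
}

/* Models move_mask on four floats: bit i of the result is the sign of element i. */
static int model_move_mask_ps(vec4f a)
{
    int mask = 0;
    for (int i = 0; i < 4; ++i)
        mask |= sign_bit_f32(a.f[i]) << i;
    return mask;
}

/* Models move_mask on two doubles. */
static int model_move_mask_pd(vec2d a)
{
    return (sign_bit_f64(a.d[1]) << 1) | (sign_bit_f64(a.d[0]) << 0);
}
```

Because the mask is built from the raw sign bit, negative zero contributes a 1 even though it compares equal to zero; code that tests `x < 0` instead would behave differently for that input.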
169. aming SIMD Extensions 2 A a a a ea ea a a a cae ce ee ee ee ee ee ae a Intel C Intrinsics Reference Intrinsic Name m_pand m_pandn m por m pxor _m_pcmpegb _m_pcmpeqw m_pcmpeqd _m_pcmpgtb _m_pcmpgtw _m_pcmpgtd mm setzero si64 mm_set_pi32 _mm_set_pil6 _mm_set_pi8 mm setl pi32 mm setl pil6 mm seti pi8 mm setr pi32 mm setr pil6 mm setr pi8 Alternate Name mm and si64 mm andnot si64 mm or si64 mm xor si64 mm cmpeq pi8 mm cmpeq pil6 mm cmpeq pi32 mm cmpgt pi8 mm cmpgt pil6 mm cmpgt pi32 Across MMX AIIA Technology Architecture Streaming SIMD Extensions Streaming SIMD Extensions 2 N A N A N A N A N A N A N A N A N A N A N A N A A N rele ele ele eye a ee ee Ea N A N A N A N A N A N A N A A Itanium amp al gt l gt gt l gt lalol gt l gt l gt l gt l gt l gt l gt l gt gt l gt l gt gt C mm empty is implemented in Itanium instructions as a NOP for source compatibility only 317 Intel C Compiler for Linux Systems User s Guide Streaming SIMD Extensions Intrinsics Implementation Regular Streaming SIMD Extensions intrinsics work on 4 32 bit single precision values On Itanium based systems basic operations like add or compare will require two SIMD instructions Both can be executed in the same cycle so the throughput is one basic Streaming SIMD Extensions operation per cyc
170. and Debugging Applications

Set the Environment Variables

Before you can operate the compiler, you must set the environment variables to specify locations for the various components. The Intel C++ Compiler installation includes shell scripts that you can use to set environment variables. With the default compiler installation, these scripts are:

    - /opt/intel_cc_80/bin/iccvars.sh
    - /opt/intel_cc_80/bin/iccvars.csh

To run an environment script, enter one of the following on the command line:

    prompt> source /opt/intel_cc_80/bin/iccvars.sh

or:

    prompt> source /opt/intel_cc_80/bin/iccvars.csh

If you want the script to run automatically when you start Linux, add the same command to the end of your startup file. Sample .bash_profile entry for iccvars.sh:

    # set environment vars for Intel C++ compiler
    source /opt/intel_cc_80/bin/iccvars.sh

Invoking the Compiler with icc or icpc

You can invoke the Intel C++ Compiler on the command line with either icc or icpc. Each invocation includes the C++ run-time libraries and header files. Use the -no_cpprt option if you do not want the C++ run-time libraries and headers.

Command-line Syntax

When you invoke the Intel C++ Compiler with icc or icpc, use the following syntax:

    prompt> {icc|icpc} [options] file1 [file2 ...] [linker options]

    Argument   Description
    options    Indicates one or more command-line options. The compiler recognizes one or more letters preceded by a hyphen (-). See the
171. angled triangle with double precision Computes the length of the hypotenuse of a right angled triangle with single precision Computes the integral value represented as double using the IEEE rounding mode Computes the integral value represented with single precision using the IEEE rounding mode Computes the hyperbolic sine of the double precision argument Computes the hyperbolic sine of the single precision argument Computes the square root of the single precision argument Computes the hyperbolic tangent of the double precision argument Computes the hyperbolic tangent of the single precision argument Not implemented on Itanium based systems double in this case is a complex number made up of two single precision 32 bit floating point elements real and imaginary parts 207 Intel C Compiler for Linux Systems User s Guide String and Block Copy Related The following are not implemented as intrinsics on Itanium based platforms Intrinsic Description char strset char _int32 Sets all characters in a string to a fixed value void memcmp const void cs const void Compares two regions of memory ct size t n Return 0 if cs lt ct 0 if cs ct or 0 if cs ct void memcpy void s const void ct Copies from memory Returns s size t n void memset void s int c size t n Sets memory to a fixed value Returns s char strcat char s const char ct Appends to
172. ard the most significant bit, the number of bits specified by len.

__int64 _m64_dep_zr(__int64 s, const int pos, const int len)
The right-justified 64-bit value s is deposited into a 64-bit field of all zeros at an arbitrary bit position, and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len.

__int64 _m64_dep_zi(const int v, const int pos, const int len)
The sign-extended value v (either all 1s or all 0s) is deposited into a 64-bit field of all zeros at an arbitrary bit position, and the result is returned. The deposited bit field begins at bit position pos and extends to the left (toward the most significant bit) the number of bits specified by len.

__int64 _m64_extr(__int64 r, const int pos, const int len)
A field is extracted from the 64-bit value r and is returned right-justified and sign-extended. The extracted field begins at position pos and extends len bits to the left. The sign is taken from the most significant bit of the extracted field.

__int64 _m64_extru(__int64 r, const int pos, const int len)
A field is extracted from the 64-bit value r and is returned right-justified and zero-extended. The extracted field begins at position pos and extends len bits to the left.

__int64 _m64_xmal(__int64 a, __int64 b, __int64 c)
The 64-bit values a and b are treated as signed integers and multiplied to produce
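The deposit and extract operations described above amount to masking and shifting a bit field. The following is a minimal sketch in portable C, not the Itanium intrinsics themselves: the `model_*` and `field_mask` names are hypothetical, and the sketch assumes the field lies entirely inside the 64-bit value (`pos + len <= 64`), a restriction chosen here for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low len bits set; len == 64 is handled separately
 * because shifting a 64-bit value by 64 is undefined in C. */
static uint64_t field_mask(int len)
{
    assert(len >= 1 && len <= 64);
    return (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
}

/* Models _m64_dep_zr: place the low len bits of s at bit position pos
 * in a field of zeros. */
static uint64_t model_dep_zr(uint64_t s, int pos, int len)
{
    assert(pos >= 0 && len >= 1 && pos + len <= 64);
    return (s & field_mask(len)) << pos;
}

/* Models _m64_extru: extract len bits starting at pos, zero-extended. */
static uint64_t model_extru(uint64_t r, int pos, int len)
{
    assert(pos >= 0 && len >= 1 && pos + len <= 64);
    return (r >> pos) & field_mask(len);
}

/* Models _m64_extr: extract len bits starting at pos, sign-extended from
 * the most significant bit of the extracted field.
 * The final cast relies on two's-complement conversion, which mainstream
 * compilers provide. */
static int64_t model_extr(uint64_t r, int pos, int len)
{
    uint64_t field = model_extru(r, pos, len);
    uint64_t sign = UINT64_C(1) << (len - 1);
    return (int64_t)((field ^ sign) - sign);
}
```

For example, depositing the value 5 with `pos = 4` and `len = 3` gives 0x50, and extracting the same field back with `model_extru` returns 5.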
173. ariable list for the taskq. It also implies captureprivate on each enclosed task. The original object referenced by each variable has an indeterminate value upon entry to the construct, must not be modified within the dynamic extent of the construct, and has an indeterminate value upon exit from the construct.

firstprivate: The firstprivate clause creates a private, copy-constructed version for each object in variable list for the taskq. It also implies captureprivate on each enclosed task. The original object referenced by each variable must not be modified within the dynamic extent of the construct and has an indeterminate value upon exit from the construct.

lastprivate: The lastprivate clause creates a private, default-constructed version for each object in variable list for the taskq. It also implies captureprivate on each enclosed task. The original object referenced by each variable has an indeterminate value upon entry to the construct, must not be modified within the dynamic extent of the construct, and is copy-assigned the value of the object from the last enclosed task after that task completes execution.

reduction: The reduction clause performs a reduction operation with the given operator in enclosed task constructs for each object in variable list. operator and variable list are defined the same as in the OpenMP Specifications.

ordered: The ordered clause performs or
174. arning Messages or Enabling Remarks. Remarks do not stop translation or linking. Remarks do not interfere with any output files. The following are some representative remark messages:

    - function declared implicitly
    - type qualifiers are meaningless in this declaration
    - controlling expression is constant

Intel Math Library

The Intel C++ Compiler includes a mathematical software library containing highly optimized and very accurate mathematical functions. These functions are commonly used in scientific or graphic applications, as well as other programs that rely heavily on floating-point computations. Support for C99 Complex data types is included by using the -c99 compiler option. The mathimf.h header file includes prototypes for the library functions. See Using the Intel Math Library. For a complete list of the functions available, refer to the Function List in this section.

Math Libraries for IA-32 and Itanium-based Systems

The math library linked to an application depends on the compilation or linkage options specified. Refer to the table below.

    Library     Description
    libimf.a    Default static math library
    libimf.so   Default shared math library

Using the Intel Math Library

To use the Intel math library, include the header file mathimf.h in your program. Below are two example programs that illustrate the use of the math library.

Example Using Real Functions

real math
175. as follows:

    prompt> icpc @response1.txt source1.cpp @response2.txt source2.cpp

Note: An "at" sign (@) must precede the name of the response file on the command line.

Include Files

Include directories are searched in the default system areas and whatever is specified by the -Idirectory option. For multiple search directories, multiple -Idirectory commands must be used. The compiler searches directories for include files in the following order:

    - directory of the source file that contains the include
    - directories specified by the -I option

How to Remove Include Directories

Use the -X option to prevent the compiler from searching the default system areas. You can use the -X option with the -I option to prevent the compiler from searching the default path for include files and direct it to use an alternate path.

For example, to direct the compiler to search the path /alt/include instead of the default path, do the following:

    prompt> icpc -X -I/alt/include prog.cpp

See also Searching for Include Files.

Searching for Include Files

By default, the compiler searches for the standard include files in the directories specified in the CPATH, C_INCLUDE_PATH, and CPLUS_INCLUDE_PATH environment variables. You can indicate the location of include files in the configuration file.

How to Specify an Include Directory

Use the -Idirectory option to specify an additional directory in which
176. at executables run faster In addition the native intrinsics for the Itantum processor give programmers access to Itanium instructions that cannot be generated using the standard constructs of the C and C languages The Intel C Compiler also supports general purpose intrinsics that work across all IA 32 and Itanium based platforms For more information on intrinsics please refer to the following publications Intel Architecture Software Developer s Manual Volume 2 Instruction Set Reference Manual Intel Corporation doc number 243191 199 Intel C Compiler for Linux Systems User s Guide Intrinsics Availability on Intel Processors i Processors MMXTM Streaming Streaming Itanium Technology SIMD SIMD Processor Intrinsics Extensions Extensions 2 Instructions __ Itanium Processor Pentium 4 Processor Pentium III Processor X X N A X X X X X N A _ Pentium II X N A N A Processor X N A N A N A N A N A N A N A N A Pentium with MMX Technology Pentium Pro Processor x Pentium Processor Benefits of Using Intrinsics The major benefit of using intrinsics is that you now have access to key features that are not available using conventional coding practices Intrinsics enable you to code with the syntax of C function calls and variables instead of assembly language Most MMX technology Streaming SIMD Extensions and Streaming SIMD Extensions
177. atenated to form a 128 bit value and shifted to the right count bits The least significant 64 bits of the result are returned Lock and Atomic Operation Related Intrinsics The prototypes for these intrinsics are in the ia64intrin h header file EE unsigne inter d __int64 ockedExchange8 volatile unsigned char zi unsigne arget d __int64 unsigned __int64 value _InterlockedCompareExchange8_rel volatile unsigned char Destination __int64 Compara unsigne Exchange nd d __int64 u unsigned nsigned _ int64 _InterlockedCompareExchange8_acq volatile unsigned char Destination int64 Compara unsigne _Interl short unsigne Exchange nd d __int64 u unsigned nsigned _ int64 ockedExchange16 volatile unsigned Target d _ int64 unsigned _ int64 value InterlockedCompareExchangel6 rel volatile unsigned short Destination int64 Compara unsigne Exchange nd d _ int64 u unsigned nsigned _ int64 InterlockedCompareExchangel6 acq volatile unsigned short Destination int64 Compara Exchange nd u unsigned nsigned _ int64 int _InterlockedIncrement volatile int addend 288 Map to the xchg1 instruction Atomically write the least significant byte of its 2nd argument to address specified by its 1st argument Compare and exchange atomically the least significant byte at the ad
178. ault By default the Intel compiler OFF creates 64 bit profiling counters dyn and dpi This option creates 32 bit counters for compatibility with the Intel C Compiler 7 0 prof format 32 prof gen x Instruments the program to OFF prepare for instrumented execution and also creates a new static profile information file spi With the x qualifier extra source position is collected which enables code coverage tools prof use Uses dynamic feedback OFF information Qinstall dir Sets dir as root of compiler OFF installation Qlocation tool path Sets path as the location of the OFF tool specified by tool Qoption tool list Passes an argument list to OFF another tool in the compilation sequence such as the assembler or linker Compile and link for function OFF profiling with UNIX prof tool Disables changing of the FPU OFF rounding control Enables fast float to int conversions Generates assemblable files with OFF S suffix then stops the compilation shared Produce a shared object OFF shared libcxa Link Intel 1ibcxa C library ON dynamically sox Enables disables the saving of S0x compiler options and version information in the executable file static Prevents linking with shared OFF libraries 26 Compiler Options Quick Reference static libcxa Link Intel 1ibcxa C library OFF statically std c99 Enable C99 support for
179. b. r0 := a1; r1 := b1. __m128i _mm_unpacklo_epi8(__m128i a, __m128i b): Interleaves the lower 8 signed or unsigned 8-bit integers in a with the lower 8 signed or unsigned 8-bit integers in b. r0 := a0; r1 := b0; r2 := a1; r3 := b1; ... r14 := a7; r15 := b7. __m128i _mm_unpacklo_epi16(__m128i a, __m128i b): Interleaves the lower 4 signed or unsigned 16-bit integers in a with the lower 4 signed or unsigned 16-bit integers in b. r0 := a0; r1 := b0; r2 := a1; r3 := b1; r4 := a2; r5 := b2; r6 := a3; r7 := b3. __m128i _mm_unpacklo_epi32(__m128i a, __m128i b): Interleaves the lower 2 signed or unsigned 32-bit integers in a with the lower 2 signed or unsigned 32-bit integers in b. r0 := a0; r1 := b0; r2 := a1; r3 := b1. __m128i _mm_unpacklo_epi64(__m128i a, __m128i b): Interleaves the lower signed or unsigned 64-bit integer in a with the lower signed or unsigned 64-bit integer in b. r0 := a0; r1 := b0. __m64 _mm_movepi64_pi64(__m128i a): Returns the lower 64 bits of a as an __m64 type. r0 := a0. __m128i _mm_movpi64_epi64(__m64 a): Moves the 64 bits of a to the lower 64 bits of the result, zeroing the upper bits. r0 := a0; r1 := 0x0. __m128i _mm_move_epi64(__m128i a): Moves the lower 64 bits of a to the lower 64 bits of the result, zeroing the upper bits. r0 := a0; r1 := 0x0. Integer Memory and Initialization for Streaming SIMD Extensions 2: The integer load, set, and store i
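The interleave order listed for _mm_unpacklo_epi8 can be sketched in portable C++. This is only a model of the element ordering (r0 = a0, r1 = b0, ..., r15 = b7), not the intrinsic itself; the function name unpacklo_epi8_model and the use of std::array are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: a portable model of the element ordering described
// for _mm_unpacklo_epi8. Index 0 is the lowest element of each operand.
// It does not use SSE2 and is not a replacement for the intrinsic.
std::array<std::uint8_t, 16> unpacklo_epi8_model(
    const std::array<std::uint8_t, 16>& a,
    const std::array<std::uint8_t, 16>& b) {
    std::array<std::uint8_t, 16> r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[2 * i] = a[i];      // even result elements come from a
        r[2 * i + 1] = b[i];  // odd result elements come from b
    }
    return r;
}
```

Only the lower eight elements of each input contribute; the upper eight are ignored, which matches the "lower 8" wording of the description.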
180. Browsing the Frames: The coverage tool creates frames that facilitate browsing through the code to identify uncovered code. The top frame displays the list of uncovered functions, while the bottom frame displays the list of covered functions. For uncovered functions, the total number of basic blocks of each function is also displayed. For covered functions, both the total number of blocks and the number of covered blocks, as well as their ratio (that is, the coverage rate), are displayed. For example, 66.67 (4/6) indicates that four out of the six blocks of the corresponding function were covered. The block coverage rate of that function is thus 66.67%. These lists can be sorted based on the coverage rate, number of blocks, or function names. Function names are linked to the position in the source view where the function body starts. So with one click the user can see the least-covered function in the list, and with another click the browser displays the body of the function. The user can then scroll down in the source view and browse through the function body. Individual Module Source View: Within the individual module source views, the tool provides the list of uncovered functions as well as the list of covered functions. The lists are reported in two distinct frames that provide easy navigation of the source code. The lists can be sorted based on: the number of blocks within uncovered fun
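The coverage-rate arithmetic in the 66.67 (4/6) example can be checked with a few lines of C++. This is a minimal sketch; the function names coverage_rate and format_coverage and the exact output layout are assumptions for illustration, not part of the coverage tool.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: block coverage rate as a percentage.
double coverage_rate(int covered_blocks, int total_blocks) {
    assert(total_blocks > 0);
    assert(covered_blocks >= 0 && covered_blocks <= total_blocks);
    return 100.0 * covered_blocks / total_blocks;
}

// Hypothetical helper: formats the rate in a "rate (covered/total)" layout
// similar to the example in the text.
std::string format_coverage(int covered_blocks, int total_blocks) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.2f (%d/%d)",
                  coverage_rate(covered_blocks, total_blocks),
                  covered_blocks, total_blocks);
    return std::string(buf);
}
```

With four of six blocks covered, format_coverage(4, 6) yields the string 66.67 (4/6), which agrees with the rate quoted above.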
181. b for subtraction. suffix: Denotes the type of data operated on by the instruction. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters denote the type: s, single-precision floating point; d, double-precision floating point; i128, signed 128-bit integer; i64, signed 64-bit integer; u64, unsigned 64-bit integer; i32, signed 32-bit integer; u32, unsigned 32-bit integer; i16, signed 16-bit integer; u16, unsigned 16-bit integer; i8, signed 8-bit integer; u8, unsigned 8-bit integer. A number appended to a variable name indicates the element of a packed object. For example, r0 is the lowest word of r. Some intrinsics are composites because they require more than one instruction to implement them. The packed values are represented in right-to-left order, with the lowest value being used for scalar operations. Consider the following example operation: double a[2] = {1.0, 2.0}; __m128d t = _mm_load_pd(a); The result is the same as either of the following: __m128d t = _mm_set_pd(2.0, 1.0); __m128d t = _mm_setr_pd(1.0, 2.0); In other words, the xmm register that holds the value t will look as follows: bits 127 to 64 hold 2.0 and bits 63 to 0 hold 1.0. The scalar element is 1.0. Due to the nature of the instruction, some intrinsics require their arguments to be immediates (constant integer literals). See A
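The right-to-left ordering described above (load from {1.0, 2.0} equals set(2.0, 1.0) equals setr(1.0, 2.0)) can be modeled without SSE2 hardware. The sketch below is a portable stand-in, not the intrinsics: the type Packed2d and the three *_model functions are hypothetical names, and index 0 is assumed to be the lowest (scalar) element.

```cpp
#include <array>

// Hypothetical model of a packed pair of doubles; element 0 is the lowest
// (scalar) element, element 1 is the highest.
using Packed2d = std::array<double, 2>;

// Models a load: memory order maps directly to element order.
Packed2d load_pd_model(const double* p) {
    return {p[0], p[1]};
}

// Models "set": arguments are given highest element first.
Packed2d set_pd_model(double high, double low) {
    return {low, high};
}

// Models "setr" (reversed): arguments are given lowest element first.
Packed2d setr_pd_model(double low, double high) {
    return {low, high};
}
```

Under this model, the three forms in the text produce the same value, and element 0 (the scalar element) is 1.0.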
182. be of any type except I s u 8vec 8116 Example Syntax Usage for Shift Operators Automatic size and sign conversion Isl6vec4 A C Iu32vec2 B C A A amp B returns I16vec4 which must be cast to Tul 6vec4 to ensure logical shift not arithmetic shift Isl6vec4 A C Iul6vec4 B R R Iul6vec4 A amp B C A amp B returns I16vec4 which must be cast to Is16vec4 to ensure arithmetic shift not logical shift R Isl6vec4 A amp B C 349 Shift Operators with Corresponding Intrinsics Operation Shift Left Shift Right Symbols Syntax Usage lt lt V gt R A lt lt B R amp A R A gt gt B R gt gt A Intrinsic _mm_s 1_si64 mm ol li si64 mm el l pi32 mm s 1 li pi32 _mm_s l pil6 mm sil i pil6 mm srl mm srl mm srl mm srl mm srl li pil6 Im Sr _si64 i si64 pi32 i pi32 pil6 mm sra pi32 mm srai pi32 mm sra pil6 mm srai pil6 Right shift operations with signed data types use arithmetic shifts All unsigned and intermediate classes correspond to logical shifts The table below shows how the return type is determined by the first argument type Shift Operator Overloading Operation R Right Left A Shift Shift I64vecl gt gt gt gt lt lt lt lt 164vecl I64vecl B A Logical I32vec2 gt gt gt gt lt lt lt lt 132vec2 A Arithmetic Is32vec2 gt gt gt gt lt lt
183. b1) ? 0xffff : 0x0; ... r7 := (a7 > b7) ? 0xffff : 0x0. __m128i _mm_cmpgt_epi32(__m128i a, __m128i b): Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for greater than. r0 := (a0 > b0) ? 0xffffffff : 0x0; r1 := (a1 > b1) ? 0xffffffff : 0x0; r2 := (a2 > b2) ? 0xffffffff : 0x0; r3 := (a3 > b3) ? 0xffffffff : 0x0. __m128i _mm_cmplt_epi8(__m128i a, __m128i b): Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for less than. r0 := (a0 < b0) ? 0xff : 0x0; r1 := (a1 < b1) ? 0xff : 0x0; ... r15 := (a15 < b15) ? 0xff : 0x0. __m128i _mm_cmplt_epi16(__m128i a, __m128i b): Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for less than. r0 := (a0 < b0) ? 0xffff : 0x0; r1 := (a1 < b1) ? 0xffff : 0x0; ... r7 := (a7 < b7) ? 0xffff : 0x0. __m128i _mm_cmplt_epi32(__m128i a, __m128i b): Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for less than. r0 := (a0 < b0) ? 0xffffffff : 0x0; r1 := (a1 < b1) ? 0xffffffff : 0x0; r2 := (a2 < b2) ? 0xffffffff : 0x0; r3 := (a3 < b3) ? 0xffffffff : 0x0. Conversion Operations for Streaming SIMD Extensions 2: The following two conversion intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file. __m128i
184. c I16vec l8vec8 Operators Corresponding Intrinsic 1128vec1 l64vec2 I32vec4 I16vec8 l8vec16 amp amp _mm_and_ x si128 si64 pd ps ps Lk mm or x si128 si64 pd ps ps A A _mm_xor_ x si128 si64 pd ps ps Andnot _mm_andnot_ x si128 si64 pd N A N A Arithmetic Corresponding Intrinsics and Classes Part 1 Operators Corresponding I64vec2 I32vec4 I16vec8 I8vec16 Intrinsic LE _mm_add_ x epi64 _mm_ sub x epi64 mm mullo x N A 7 mm div x mul high mm mulhi x N A mul add mm madd x N A 381 Operators Corresponding Intrinsic sqrt mm sqrt x rcp mm rcp x rcp nr mm rcp x mm add x mm sub x mm mul x rsqrt mm rsqrt x x rsqrt nr mm rsqrt mm sub x mm mul x Arithmetic Corresponding Intri perators Corresponding Intrinsic x e e e ul_add m mm mulhi x x mm madd 382 mm rsqrt mm sub x mm mul x l64vec2 I32vec4 116vec8 l8vec16 E N A N A N A N A N A EE nsics and Classes Part 2 l8vec8 pi32 i16 ig Wd P p TD pi32 i16 i8 n p p Z gt 5 BR N A pil6 n Z gt Wd TD Le D D D D Z gt Z gt 5 n IO D Z gt u IO IO 0 0 Qo 0 Wd a n Wd Wd Wd Wd 0 TD Z Z gt gt Intel C Intrinsic
185. c2 A mm cmpnlt pd 1 float F32vecl R cmpnit F32vecl A mm cmpnlt ss Compare for Not Less Than or Equal 4 floats F32vec4 R cmpnle F32vec4 A mm cmpnle ps 2 doubles F64vec2 R cmpnle F64vec2 A mm cmpnle pd float F32vecl R cmpnle F32vecl A mm cmpnle ss Compare for Not Greater Than 4 floats F32vec4 R cmpngt F32vec4 A mm cmpngt ps 2 doubles F64vec2 R cmpngt F64vec2 A mm cmpngt pd float F32vecl R cmpngt F32vecl A mm cmpngt ss Compare for Not Greater Than or Equal 4 floats F32vec4 R cmpnge F32vec4 A mm cmpnge ps 2 doubles F64vec2 R cmpnge F64vec2 A mm cmpnge pd 375 1 float F32vecl R cmpnge F32vecl A mm cmpnge ss Conditional Select Operators for Fvec Classes Each conditional function compares single precision floating point values of A and B The C and D parameters are used for return value Comparison between objects of any Fvec class returns the same class Conditional Select Operators for Fvec Classes Conditional Select for Operators Syntax Equality select eq R select eq A B Inequality select neq R select neq A B Greater Than select gt R select gt A B Greater Than or Equal To select g R select ge A B Not Greater Than select gt R select gt A B fo Not Greater Than or Equal To select_ge R select_ge A B P _ Less Than select lt R select lt A B Less Than or Equal To sele
186. cation. This documentation assumes that you are familiar with the C and C++ programming languages and with the Intel processor architecture. You should also be familiar with the host computer's operating system. Note: This document explains how information and instructions apply differently to each targeted architecture. If there is no specific indication to either architecture, the description is applicable to both architectures. Conventions: This documentation uses the following conventions. This type style: Indicates an element of syntax, reserved word, keyword, filename, computer output, or part of a program example. The text appears in lowercase unless uppercase is significant. This type style: Indicates the exact characters you type as input. This type style: Indicates a placeholder for an identifier, an expression, a string, a symbol, or a value. Substitute one of these items for the placeholder. [items]: Indicates that the items enclosed in brackets are optional. {item1 | item2}: Used for an option's versions; for example, option -x{K|W|B|N|P} has these versions: -xK, -xW, -xB, -xN, and -xP. ... (ellipses): Indicate that you can repeat the preceding item. Naming Syntax for the Intrinsics: Most intrinsic names use a notational convention as follows: _mm_intrin_op_suffix. intrin_op: Indicates the intrinsic's basic operation; for example, add for addition and su
187. ce: double creal(double _Complex z); long double creall(long double _Complex z); float crealf(float _Complex z); Description: The csin function returns the complex sine of z. Calling interface: double _Complex csin(double _Complex z); long double _Complex csinl(long double _Complex z); float _Complex csinf(float _Complex z); Description: The csinh function returns the complex hyperbolic sine of z. Calling interface: double _Complex csinh(double _Complex z); long double _Complex csinhl(long double _Complex z); float _Complex csinhf(float _Complex z); Description: The csqrt function returns the complex square root of z. Calling interface: double _Complex csqrt(double _Complex z); long double _Complex csqrtl(long double _Complex z); float _Complex csqrtf(float _Complex z); Description: The ctan function returns the complex tangent of z. Calling interface: double _Complex ctan(double _Complex z); long double _Complex ctanl(long double _Complex z); float _Complex ctanf(float _Complex z); Description: The ctanh function returns the complex hyperbolic tangent of z. Calling interface: double _Complex ctanh(double _Complex z); long double _Complex ctanhl(long double _Complex z); float _Complex ctanhf(float _Complex z); C99 Macros: The Intel Math library and mathimf.h header file support the following C99 macros: int fpclassify(x); int isfinite(x); int isgreater(x, y); int isgreat
188. cessing directive Causes all predefined macros and assertions to be inactive Preserves comments in preprocessed source output Defines the macro name and associates it with the specified value The default Dname defines a macro with a value ofl Directs the preprocessor to expand your source module and write the result to standard output Directs the preprocessor to expand your source module and write the result to standard output Does not include line directives in the output Directs the preprocessor to expand your source module and store the result ina i file in the current directory Suppresses any automatic definition for the specified macro name 43 Intel C Compiler for Linux Systems User s Guide Preprocessing Only Using Using Using Use the E P or EP option to preprocess your source files without compiling them When using these options only the preprocessing phase of compilation is activated E When you specify the E option the compiler s preprocessor expands your source module and writes the result to stdout The preprocessed source contains 1ine directives which the compiler uses to determine the source file and line number For example to preprocess two source files and write them to stdout enter the following command prompt icpc E progl cpp prog2 cpp P When you specify the P option the preprocessor expands your source module and directs the output toa i
189. ch path 36 Compiler Options Quick Reference Linux Windows Description Linux Default X KIWINIBIP Qx K W N BIP Generates specialized OFF code for processor specific codes K W N B and P e K Intel Pentium III and compatible Intel processors W Intel Pentium 4 and compatible Intel processors N Intel Pentium 4 and compatible Intel processors B Intel Pentium M and compatible Intel processors P Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 Z2p 1 2 4 8 16 Packs structures on 1 2 4 8 or 16 byte boundaries Default Compiler Options e Options supported on both IA 32 and Itanium based systems Option Description c99 Enables C99 support for C programs falias Assume aliasing in program ffnalias Assume aliasing within functions 37 Intel C4 Compiler for Linux Systems User s Guide Option mcpu pentium4 mcpu itanium2 prefetch SOX std c99 tn TA 32 only w1 Zp16 38 gcc version 320 Description This option provides compatible behavior with gcc where nnn indicates the gcc version This version of the Intel compiler supports gcc version 320 Default Optimizes for Intel Pentium 4 processor IA 32 systems only Optimizes for Intel Itanium 2 processor Itanium based systems only Same as O1 on IA 32 Same as O on Itanium based systems Enables inlining of functions
190. char const char char strcpy char s const char ct 313 Intel C Compiler for Linux Systems User s Guide size_t strlen const char cs int strncmp char char int int strncpy char char int void __alloca int int setjmp jmp buf exception code void exception info void abnormal termination void void _enable void disable int bswap int int in byte int int in dword int int in word int int inp int int inpd int int inpw int int out byte int int int out dword int int int out word int int int outp int int int _outpd int int int outpw int int 314 Intel C Intrinsics Reference MMX Technology Intrinsics Implementation Key to the table entries A Expected to give significant performance gain over non intrinsic based code equivalent B Non intrinsic based source code would be better the intrinsic s implementation may map directly to native instructions but they offer no significant performance gain performance if used C Requires contorted implementation for particular microarchitecture Will result in very poor Intrinsic Name empty _m_from int to int packsswb packssdw packuswb punpckhbw punpckhwd punpckhdq punpckl m punpckl punpckl paddb paddw paddd paddsb paddsw paddusb paddusw
191. chitectures with MMX technology; fvec.h is specific to architectures with Streaming SIMD Extensions; dvec.h is specific to architectures with Streaming SIMD Extensions 2. Streaming SIMD Extensions 2 intrinsics cannot be used on Itanium-based systems. The mmclass.h header file includes the classes that are usable on the Itanium architecture. This documentation is intended for programmers writing code for the Intel architecture, particularly code that would benefit from the use of SIMD instructions. You should be familiar with C++ and the use of C++ classes. Details About the Libraries: The Intel C++ Class Libraries for SIMD Operations provide a convenient interface to access the underlying instructions for processors as specified in Processor Requirements for Use of Class Libraries. These processor-instruction extensions enable parallel processing using the single instruction, multiple data (SIMD) technique, as illustrated in the following figure. [Figure: SIMD Data Flow] Performing four operations with a single instruction improves efficiency by a factor of four for that particular instruction. These new processor instructions can be implemented using assembly inlining, intrinsics, or the C++ SIMD classes. Compare the coding required to add four 32-bit floating-point values using each of the available interfaces. Comparison Between Inlining, Intrinsics, and Class Libraries: Assembly Inlining; SIMD Class Libraries: __m128 a, b, c; includ
192. cn option. Set n to one of the following values to round the significand to the indicated number of bits: -pc32: 24 bits (single precision; see the Caution statement above); -pc64: 53 bits (double precision); -pc80: 64 bits (extended precision). Default: The default value for n is 80, indicating extended precision. This option allows full optimization. Using this option does not have the negative performance impact of using the -Op option because only the fractional part of the floating-point value is affected. The range of the exponent is not affected. The -pcn option causes the compiler to change the floating-point precision control when the main function is compiled. The program that uses -pcn must use main as its entry point, and the file containing main must be compiled with -pcn. -rcd Option: The Intel compiler uses the -rcd option to improve the performance of code that requires floating-point to integer conversions. The optimization is obtained by controlling the change of the rounding mode. The system default floating-point rounding mode is round-to-nearest. This means that values are rounded during floating-point calculations. However, the C language requires floating-point values to be truncated when a conversion to an integer is involved. To do this, the compiler must change the rounding mode to truncation before each floating-point to integer conversion and change it back afterwards. The -rcd option disables the change to
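The difference between truncation (what a C or C++ floating-point to integer conversion requires) and round-to-nearest (the default floating-point rounding mode) can be seen in a short sketch. The function names below are hypothetical, and round_to_nearest_int assumes the program has not changed the default rounding mode, since std::nearbyint follows the current mode.

```cpp
#include <cmath>

// Conversion as the language defines it: the fractional part is discarded
// (truncation toward zero), regardless of the floating-point rounding mode.
int truncate_to_int(double x) {
    return static_cast<int>(x);
}

// Rounds using the current floating-point rounding mode before converting.
// Assumption: the mode is the default round-to-nearest.
int round_to_nearest_int(double x) {
    return static_cast<int>(std::nearbyint(x));
}
```

For example, truncate_to_int(2.7) gives 2 while round_to_nearest_int(2.7) gives 3 under the default mode. This gap is why a compiler targeting a round-to-nearest unit has to arrange for truncation around each conversion.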
193. complex number 1 + (pi/4)*i, where i is the imaginary unit. */
c32in = (float _Complex) c64in; /* Create the float complex value from the double complex value. */
c64out = cexp(c64in);
c32out = cexpf(c32in);
/* Call the complex exponential: cexp(z) = cexp(x + iy) = e^(x + iy) = e^x * (cos(y) + i*sin(y)) */
printf("When z = %7.7f + %7.7f i, cexpf(z) = %7.7f + %7.7f i\n", crealf(c32in), cimagf(c32in), crealf(c32out), cimagf(c32out));
printf("When z = %12.12f + %12.12f i, cexp(z) = %12.12f + %12.12f i\n", creal(c64in), cimag(c64in), creal(c64out), cimag(c64out));
return 0;
}
prompt> icc complex_math.c
The output of a.out will look like this:
When z = 1.0000000 + 0.7853982 i, cexpf(z) = 1.9221154 + 1.9221156 i
When z = 1.000000000000 + 0.785398163397 i, cexp(z) = 1.922115514080 + 1.922115514080 i
Note: Complex data types are supported in C but not in C++ programs. Exception Conditions: If you call a math function using argument(s) that may produce undefined results, an error number is assigned to the system variable errno. Math function errors are usually domain errors or range errors. Domain errors result from arguments that are outside the domain of the function. For example, acos is defined only for arguments between -1 and 1 inclusive. Attempting to evaluate acos(-2) or acos(3) results in a domain error, where the return value is
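The identity used in the example, cexp(x + iy) = e^x * (cos(y) + i*sin(y)), can be checked independently. The sketch below uses the C++ standard library type std::complex as a stand-in for the C99 complex types in the example; that substitution and the function name cexp_by_formula are assumptions for illustration only.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper: evaluates the complex exponential directly from
// Euler's formula, e^(x + iy) = e^x * (cos(y) + i*sin(y)).
std::complex<double> cexp_by_formula(std::complex<double> z) {
    const double magnitude = std::exp(z.real());
    return magnitude * std::complex<double>(std::cos(z.imag()),
                                            std::sin(z.imag()));
}
```

For z = 1 + (pi/4)*i the real and imaginary parts are equal, because cos(pi/4) = sin(pi/4), and both come out close to the 1.922115514080 shown in the sample output above.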
194. ct le R select le A B Not Less Than select nlt R select nlt A B Not Less Than or Equal To select nl R select nle A B Conditional Select Operator Usage For conditional select operators the return value is stored in C if the comparison is true or in D if false The following table shows the return values for each class of the conditional select operators using the Return Value Notation described earlier Compare Operator Return Value Mapping Operators F32vec4 F64vec2 F32vec1 BO CO DO X x BO CO DO B1 Cl DI N A I A2 le gt ge B1 Cl DI select ne nlt nle ngt nge 376 Intel C Intrinsics Reference A2 le select ne nle ngt R3 A3 select eq A3 le gt ge select ne nle ngt Operators F32vec4 F64vec2 F32vec1 B2 C2 D2 X N A B2 C2 D2 nlt nge Ie B3 C3 D3 X N A B3 C3 D3 nlt nge The following table shows examples for conditional select operations and corresponding intrinsics Conditional Select Operations for Fvec Classes q F32vec4 A q F64vec2 A q F32vecl A F32vec4 A F64vec2 A F32vecl A t F32vec4 A F64vec2 A t F32vecl A F32vec4 A Compare for Equality 4 floats F32vec4 R select 2 doubles F64vec2 R select 1 float F32vecl R
195. ct of parallel processing is increased data throughput using fewer clock cycles. The objective is to improve application performance of complex and computation-intensive audio, video, and graphical data bit streams. Hardware and Software Requirements: You must have the Intel C++ Compiler version 4.0 or higher installed on your system to use the class libraries. The Intel C++ Class Libraries are functions abstracted from the instruction extensions available on Intel processors, as specified in the table that follows. Processor Requirements for Use of Class Libraries (Header File / Extension Set / Available on These Processors): ivec.h / MMX technology / Pentium with MMX technology, Pentium II, Pentium III, Pentium 4, Intel Xeon, and Itanium processors; fvec.h / Streaming SIMD Extensions / Pentium III, Pentium 4, Intel Xeon, and Itanium processors; dvec.h / Streaming SIMD Extensions 2 / Pentium 4 and Intel Xeon processors. About the Classes: The Intel C++ Class Libraries for SIMD Operations include: integer vector (Ivec) classes; floating-point vector (Fvec) classes. You can find the definitions for these operations in three header files: ivec.h, fvec.h, and dvec.h. The classes themselves are not partitioned like this. The classes are named according to the underlying type of operation. The header files are partitioned according to architecture: ivec.h is specific to ar
196. ctions; the block coverage in the case of covered functions; the function names. This example shows the coverage source view of SAMPLE.C. [Figure: browser screenshot of the code coverage source view for SAMPLE.C, with one frame listing uncovered functions and their block counts and another listing covered functions with their coverage rates.] Setting the Coloring Scheme for the Code Coverage: The tool provides a visible coloring distinction of the following coverage categories: covered code; uncovered basic blocks; uncovered functions; partially covered code; unknown. The default colors that the tool uses for presenting the coverage information are shown in the table that follows. This color / Means: Covered code: The
197. d only by xild. LINK_commandline is the linker command line containing a set of valid arguments to ld. To place the multifile IPO executable in ipo_file, use the option -o filename, for example: prompt> xild -oipo_file a.o b.o c.o xild calls the Intel compiler to perform IPO for objects containing IR and creates a new list of object(s) to be linked. Then xild calls ld to link the object files that are specified in the new list and produce the ipo_file executable specified by the -o filename option. Note: The -ipo option can reorder object files and linker arguments on the command line. Therefore, if your program relies on a precise order of arguments on the command line, -ipo can affect the behavior of your program. Usage Rules: You must use the Intel linker xild to link your application if: your source files were compiled with multifile IPO enabled (multifile IPO is enabled by specifying the -ipo command-line option); you normally would invoke ld to link your application. The xild Options: The additional options supported by xild may be used to examine the results of multifile IPO. These options are described in the following table (Option / Description). -ipo_o[file.s]: Produces assemblable files for the multifile IPO compilation. You may specify an optional name for the listing file, or a directory (with the backslash) in which to place the file. The default listing
198. d prior to aborting wn100 compilation to n Changes the severity of OFF diagnostics L1 through LN to remark wwLl L2 Changes severity of diagnostics OFF L1 through LN to warning W1 o1 02 Pass options o1 02 etc tothe OFF linker for processing Wp64 Print diagnostics for 64 bit OFF 29 Intel C4 Compiler for Linux Systems User s Guide xtype X KIW N B P J 30 Option Description Default All source files found OFF subsequent to xt ype will be recognized as one of the following types e c Csource file e c C source file e c header C header file e cpp output C preprocessed file e assembler assemblable file e assembler with cpp Assemblable file that needs to be preprocessed e none Disable recognition and revert to file extension Removes the standard OFF directories from the list of directories to be searched for include files Generates specialized code for OFF processor specific codes K W N B and P e K Intel Pentium III and compatible Intel processors e W Intel Pentium 4 and compatible Intel processors e N Intel Pentium 4 and compatible Intel processors e B Intel Pentium M and compatible Intel processors e P Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 Compiler Options Quick Reference Option Description Default Xlinker val Pass va1 directly to the linker for proces
199. data type __m128 is used with the Streaming SIMD Extensions intrinsics. It represents a 128-bit quantity composed of four single-precision FP values. This corresponds to the 128-bit IA-32 Streaming SIMD Extensions register. The compiler aligns __m128 local data to 16-byte boundaries on the stack. Global data of these types is also 16-byte aligned. To align integer, float, or double arrays, you can use the declspec alignment. Because Itanium instructions treat the Streaming SIMD Extensions registers in the same way whether you are using packed or scalar data, there is no __m32 data type to represent scalar data. For scalar operations, use the __m128 objects and the scalar forms of the intrinsics; the compiler and the processor implement these operations with 32-bit memory references. But for better performance, the packed form should be substituted for the scalar form whenever possible. The address of a __m128 object may be taken. For more information, see Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, Intel Corporation, doc. number 243191. Implementation on Itanium-based systems: Streaming SIMD Extensions intrinsics are defined for the __m128 data type, a 128-bit quantity consisting of four single-precision FP values. SIMD instructions for Itanium-based systems operate on 64-bit FP register quantities containing two single-precision floating-point values. Thus, each __m128 operand is actually a pair
200. de higher performance in some cases. However, it also may provide lower performance than movdqu if the memory value being read was just previously written. r := *p. Miscellaneous Intrinsics: The miscellaneous intrinsics listed below are designed for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). extern void _mm_monitor(void const *p, unsigned extensions, unsigned hints): Generates the MONITOR instruction. This sets up an address range for the monitor hardware using p to provide the logical address, which will be passed to the monitor instruction in register eax. The extensions parameter contains optional extensions to the monitor hardware, which will be passed in ecx. The hints parameter will contain hints to the monitor hardware, which will be passed in edx. A non-zero value for extensions will cause a general protection fault. extern void _mm_mwait(unsigned extensions, unsigned hints): Generates the MWAIT instruction. This instruction is a hint that allows the processor to stop execution and enter an implementation-dependent optimized state until occurrence of a class of events. In future processor designs, the extensions and hints parameters may be used to convey additional information to the processor. All non-zero values of extensions and hints are reserved. A non-zero value for extensions will cause a general protection fault. Intrinsics for Itanium Instructions
201. dered constructs in enclosed task constructs in original sequential execution order. The taskq directive to which the ordered is bound must have an ordered clause present. nowait: The nowait clause removes the implied barrier at the end of the taskq. Threads may exit the taskq construct before completing all the task constructs queued within it. task Construct: #pragma intel omp task [clause[[,]clause]...] structured-block, where clause can be any of the following: private(variable-list); captureprivate(variable-list). private: The private clause creates a private, default-constructed version for each object in variable-list for the task. The original object referenced by the variable has an indeterminate value upon entry to the construct, must not be modified within the dynamic extent of the construct, and has an indeterminate value upon exit from the construct. captureprivate: The captureprivate clause creates a private, copy-constructed version for each object in variable-list for the task at the time the task is enqueued. The original object referenced by each variable retains its value but must not be modified within the dynamic extent of the task construct. Combined parallel and taskq Construct: #pragma intel omp parallel taskq [clause[[,]clause]...] structured-block, where clause can be any of the following: if(scalar-expression); num_threads(integer-expression); copyin(variable-list); default(shared
202. dlers. Returns TRUE if the termination handler is invoked as a result of a premature exit of the corresponding try-finally region.
Enables the interrupt.
Disables the interrupt.
Intrinsic that maps to the IA-32 instruction BSWAP (swap bytes). Converts a little/big-endian 32-bit argument to big/little-endian form.
Intrinsic that maps to the IA-32 instruction IN. Transfers a data byte from the port specified by the argument.
Intrinsic that maps to the IA-32 instruction IN. Transfers a double word from the port specified by the argument.
Intrinsic that maps to the IA-32 instruction IN. Transfers a word from the port specified by the argument.
Same as _in_byte.
Same as _in_dword.
Same as _in_word.
Intrinsic that maps to the IA-32 instruction OUT. Transfers the data byte in the second argument to the port specified by the first argument.
Intrinsic that maps to the IA-32 instruction OUT. Transfers the double word in the second argument to the port specified by the first argument.
Intrinsic that maps to the IA-32 instruction OUT. Transfers the word in the second argument to the port specified by the first argument.
Same as _out_byte.
Same as _out_dword.
Same as _out_word.

MMX Technology Intrinsics

Support for MMX Technology

MMX technology is an extension to the Intel architecture (IA) instruction set. The MMX instruction set adds 57 opcodes, a 64-bit quadword data type, and eight 64-bit registers. Each of the eight registers can be directly addres
203. dress specified by its 1st argument. Maps to the cmpxchg1.rel instruction with appropriate setup.
Same as above, but using acquire semantic.
Maps to the xchg2 instruction. Atomically writes the least significant word of its 2nd argument to the address specified by its 1st argument.
Compares and exchanges atomically the least significant word at the address specified by its 1st argument. Maps to the cmpxchg2.rel instruction with appropriate setup.
Same as above, but using acquire semantic.
Atomically increments by one the value specified by its argument. Maps to the fetchadd4 instruction.

int _InterlockedDecrement(volatile int *addend)
int _InterlockedExchange(volatile int *Target, long value)
int _InterlockedCompareExchange(volatile int *Destination, int Exchange, int Comparand)
int _InterlockedExchangeAdd(volatile int *addend, int increment)
int _InterlockedAdd(volatile int *addend, int increment)
void * _InterlockedCompareExchangePointer(void * volatile *Destination, void *Exchange, void *Comparand)
unsigned __int64 _InterlockedExchangeU(volatile unsigned int *Target, unsigned __int64 value)
unsigned __int64 _InterlockedCompareExchange_rel(volatile unsigned int *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
unsigned __int64 _InterlockedCompare unsigned int Des
204. Assembly inlining:
    __asm {
        movaps xmm0, b
        movaps xmm1, c
        addps  xmm0, xmm1
        movaps a, xmm0
    }
Intrinsics:
    #include <mmintrin.h>
    __m128 a, b, c;
    a = _mm_add_ps(b, c);
SIMD Class Libraries:
    #include <fvec.h>
    F32vec4 a, b, c;
    a = b + c;

The table above shows an addition of two single-precision floating-point values using assembly inlining, intrinsics, and the libraries. You can see how much easier it is to code with the Intel C++ SIMD Class Libraries. Besides using fewer keystrokes and fewer lines of code, the notation is like the standard notation in C++, making it much easier to implement over other methods.

C++ Classes and SIMD Operations

The use of C++ classes for SIMD operations is based on the concept of operating on arrays, or vectors, of data in parallel. Consider the addition of two vectors, A and B, where each vector contains four elements. Using the integer vector (Ivec) class, the elements A[i] and B[i] from each array are summed as shown in the following example.

Typical Method of Adding Elements Using a Loop
    short a[4], b[4], c[4];
    for (i = 0; i < 4; i++)   /* needs four iterations */
        c[i] = a[i] + b[i];   /* returns c[0], c[1], c[2], c[3] */

The following example shows the same results using one operation with Ivec classes.

SIMD Method of Adding Elements Using Ivec Classes
    Is16vec4 ivecA, ivecB, ivecC;   /* needs one iteration */
    ivecC = ivecA + ivecB;          /* returns ivecC0, ivecC1, ivecC2, ivecC3 */

Available Classes

The Intel C++ SIMD c
205. e compiler does not apply them to any nested loops. Each nested loop needs its own pragma preceding it in order for the pragma to be applied. You must place a pragma only before the loop control statement.

#pragma vector always

Syntax: #pragma vector always

Definition: This pragma instructs the compiler to override any efficiency heuristic during the decision to vectorize or not. #pragma vector always will vectorize non-unit strides or very unaligned memory accesses.

Example

#pragma ivdep

Syntax: #pragma ivdep

Definition: This pragma instructs the compiler to ignore assumed vector dependences. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization. This pragma overrides that decision. Only use this when you know that the assumed loop dependences are safe to ignore. The loop in this example will not vectorize without the ivdep pragma, since the value of k is not known (vectorization would be illegal if k < 0).

Example
    #pragma ivdep
    for (i = 0; i < m; i++)
    {
        a[i] = a[i + k] * c;
    }

#pragma vector

Syntax: #pragma vector {aligned | unaligned}

Definition: The vector loop pragma means the loop should be vectorized, if it is legal to do so, ignoring normal heuristic decisions about profitability. When the aligned or unaligned qualifier is used with this pragma, the loop should be vectorized using aligned or unaligned operatio
206. e shuffle value imm must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.

__m128i _mm_shufflehi_epi16(__m128i a, int imm)
Shuffles the upper 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle value imm must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.

__m128i _mm_shufflelo_epi16(__m128i a, int imm)
Shuffles the lower 4 signed or unsigned 16-bit integers in a as specified by imm. The shuffle value imm must be an immediate. See Macro Function for Shuffle for a description of shuffle semantics.

__m128i _mm_unpackhi_epi8(__m128i a, __m128i b)
Interleaves the upper 8 signed or unsigned 8-bit integers in a with the upper 8 signed or unsigned 8-bit integers in b.
r0 := a8; r1 := b8; r2 := a9; r3 := b9; ... r14 := a15; r15 := b15

__m128i _mm_unpackhi_epi16(__m128i a, __m128i b)
Interleaves the upper 4 signed or unsigned 16-bit integers in a with the upper 4 signed or unsigned 16-bit integers in b.
r0 := a4; r1 := b4; r2 := a5; r3 := b5; r4 := a6; r5 := b6; r6 := a7; r7 := b7

__m128i _mm_unpackhi_epi32(__m128i a, __m128i b)
Interleaves the upper 2 signed or unsigned 32-bit integers in a with the upper 2 signed or unsigned 32-bit integers in b.
r0 := a2; r1 := b2; r2 := a3; r3 := b3

__m128i _mm_unpackhi_epi64(__m128i a, __m128i b)
Interleaves the upper signed or unsigned 64-bit integer in a with the upper signed or unsigned 64-bit integer in
207. e 1 command line:
    prompt>icpc -pch source1.cpp source2.cpp

Example 1 output when .pchi files do not exist:
    "source1.cpp": creating precompiled header file "source1.pchi"
    "source2.cpp": creating precompiled header file "source2.pchi"

Example 1 output when .pchi files do exist:
    "source1.cpp": using precompiled header file "source1.pchi"
    "source2.cpp": using precompiled header file "source2.pchi"

Note: The -pch option will use PCH files created from other sources if the header files are the same. For example, if you compile source1.cpp using -pch, then source1.pchi is created. If you then compile source2.cpp using -pch, the compiler will use source1.pchi if it detects the same headers.

-create_pch

Use the -create_pch filename option if you want the compiler to create a PCH file called filename. Note the following regarding this option:
• The filename parameter must be specified.
• The filename parameter can be a full path name.
• The full path to filename must exist.
• The .pchi extension is not automatically appended to filename.
• This option cannot be used in the same compilation as -use_pch filename.
• The -create_pch filename option is supported for single source file compilations only.

Example 2 command line:
    prompt>icpc -create_pch /pch/source32.pchi source.cpp

Example 2 output:
    "source.cpp": creating precompiled header fil
208. e cache line of data from address a to a location closer to the processor. The value sel specifies the type of prefetch operation: the constants _MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used for IA-32, corresponding to the type of prefetch instruction. The constants _MM_HINT_T1, _MM_HINT_NT1, _MM_HINT_NT2, and _MM_HINT_NTA should be used for Itanium-based systems.

void _mm_stream_pi(__m64 *p, __m64 a) (uses MOVNTQ)
Stores the data in a to the address p without polluting the caches. This intrinsic requires you to empty the multimedia state for the MMX register. See The EMMS Instruction: Why You Need It and When to Use It topic.

void _mm_stream_ps(float *p, __m128 a) (uses MOVNTPS)
Stores the data in a to the address p without polluting the caches. The address must be 16-byte aligned.

void _mm_sfence(void) (uses SFENCE)
Guarantees that every preceding store is globally visible before any subsequent store.

float _mm_cvtss_f32(__m128 a)
This intrinsic extracts a single-precision floating-point value from the first vector element of an __m128. It does so in the most efficient manner possible in the context used. This intrinsic does not map to any specific SSE instruction.

Miscellaneous Intrinsics Using Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file. Instruction
209. e default -I/project/include

Specifying the Location with ICCCFG

You can use the ICCCFG environment variable to specify the location of your configuration file:
    ICCCFG=/cpp/config/my_options.cfg
Each time you invoke the compiler with icc, my_options.cfg is used as your configuration file. The ICPCCFG environment variable is supported for invoking the compiler with icpc. See Environment Variables.

Response Files

Use response files to specify options used during particular compilations. Response files are invoked as an option on the command line. Options in a response file are inserted in the command line at the point where the response file is invoked.

Sample Response Files
    # response file: response1.txt
    # compile with these options
    -axW -pch
    # end of response1 file

    # response file: response2.txt
    # compile with these options
    -mp1 -strict_ansi
    # end of response2 file

Use response files to decrease the time spent entering command-line options and to ensure consistency by automating command-line entries. Use individual response files to maintain options for specific projects, so that you avoid editing the configuration file when changing projects. Any number of options or file names can be placed on a line in the response file. Several response files can be referenced in the same command line. The syntax for using response files is
210. e default option.
• -O3 enables -O2 with more aggressive optimizations.

These options behave similarly on IA-32 and Itanium architectures, with some specifics that are detailed in the sections that follow.

Setting Optimization Levels

The following table details the effects of the -O0, -O1, -O2, -O3, and -fast options. The table first describes the characteristics shared by both IA-32 and Itanium architectures and then explicitly describes the specifics (if any) of the -On options' behavior on each architecture.

Option: Effect

-O0: Disables optimizations.

-O1: Optimizes to favor code size and code locality. Disables loop unrolling. May improve performance for applications with very large code size, many branches, and execution time not dominated by code within loops. In most cases, -O2 is recommended over -O1.
IA-32 systems: Disables intrinsics inlining to reduce code size.
Itanium-based systems: Disables software pipelining and global code scheduling.

-O2, -O: ON by default. Optimizes for code speed. This is the generally recommended optimization level.
Itanium-based systems: Enables software pipelining.

-O3: Enables -O2 optimizations and more aggressive optimizations such as loop and memory access transformations. The -O3 optimizations may slow down code in some cases compared to -O2 optimizations. Recommended for applications that have loops that heavily use floating-point calculations and process large data sets.
IA-32 sys
211. e efficient GP-relative addressing mode when accessing the symbol.

gcc Compatibility

C language object files created with the Intel C++ Compiler are binary compatible with the GNU gcc compiler and glibc, the GNU C language library. C language object files can be linked with either the Intel compiler or the gcc compiler. However, to correctly pass the Intel libraries to the linker, use the Intel compiler. See Linking and Default Libraries for more information.

GNU C includes several non-standard features not found in ISO standard C. Some of these extensions to the C language are supported in this version of the Intel C++ Compiler. See http://www.gnu.org for more information.

gcc Language Extension: Statements and Declarations in Expressions; Labels as Values; Nested Functions; Constructing Function Calls; Naming an Expression's Type; Generalized Lvalues; Conditionals with Omitted Operands; Complex Numbers; Hex Floats; Locally Declared Labels; Referring to a Type with typeof; Double-Word Integers

Intel Support: Yes; Yes; Yes; No; No; Yes; Yes; Yes

GNU Description and Examples:
http://gcc.gnu.org/onlinedocs/gcc-3.2.1/gcc/Statement-Exprs.html#Statement%20Exprs
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Local-Labels.html#Local%20Labels
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html#Labels%20as%20Values
http://gcc.gnu.org/onlinedocs/gcc
212. e> val)
<type> __sync_fetch_and_nand(<type> *ptr, <type> val)
<type> __sync_fetch_and_or(<type> *ptr, <type> val)
<type> __sync_fetch_and_sub(<type> *ptr, <type> val)
<type> __sync_fetch_and_xor(<type> *ptr, <type> val)

Atomic Op-and-fetch Operations
<type> __sync_add_and_fetch(<type> *ptr, <type> val)
<type> __sync_sub_and_fetch(<type> *ptr, <type> val)
<type> __sync_or_and_fetch(<type> *ptr, <type> val)
<type> __sync_and_and_fetch(<type> *ptr, <type> val)
<type> __sync_nand_and_fetch(<type> *ptr, <type> val)
<type> __sync_xor_and_fetch(<type> *ptr, <type> val)

Atomic Compare-and-swap Operations
<type> __sync_val_compare_and_swap(<type> *ptr, <type> old_val, <type> new_val)
int __sync_bool_compare_and_swap(<type> *ptr, <type> old_val, <type> new_val)

Atomic Synchronize Operation
void __sync_synchronize(void)

Atomic Lock-test-and-set Operation
<type> __sync_lock_test_and_set(<type> *ptr, <type> val)

Atomic Lock-release Operation
void __sync_lock_release(<type> *ptr)

Miscellaneous Intrinsics

void *__get_return_address(unsigned int level)
This intrinsic yields the return address of the current function. The level argument must be a constant value. A value of 0 yields the return address of the current function. Any other value yields a zero return address. On Linux sy
213. e inserted in a library:
    prompt>xild -lib cru user.a a.o b.o
See Creating a Multifile IPO Executable Using xild.

Analyzing the Effects of Multifile IPO

The -ipo_c and -ipo_S options are useful for analyzing the effects of multifile IPO, or when experimenting with multifile IPO between modules that do not make up a complete program.

Use the -ipo_c option to optimize across files and produce an object file. This option performs optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized object file. The default name for this file is ipo_out.o.

Use the -ipo_S option to optimize across files and produce an assemblable file. This option performs optimizations as described for -ipo, but stops prior to the final link stage, leaving an optimized assemblable file. The default name for this file is ipo_out.s.

See also Inline Expansion of Functions.

Inline Expansion of Functions

Controlling Inline Expansion of User Functions

The compiler enables you to control the amount of inline function expansion with the options shown in the following summary.

-ip_no_inlining: This option is only useful if -ip is also specified. In this case, -ip_no_inlining disables inlining that would result from the -ip interprocedural optimizations, but has no effect on other interprocedural optimizations.

-ip_no_pinlining: Disables partial inlining; can be used if -ip or -ipo is also specified.
214. e is used to represent the contents of an MMX register, which is the register that is used by the MMX technology intrinsics. The __m64 data type can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

__m128 Data Types

The __m128 data type is used to represent the contents of a Streaming SIMD Extension register used by the Streaming SIMD Extension intrinsics. The __m128 data type can hold four 32-bit floating-point values. The __m128d data type can hold two 64-bit floating-point values. The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit integer values.

The compiler aligns __m128 local and global data to 16-byte boundaries on the stack. To align integer, float, or double arrays, you can use the __declspec statement.

Data Types Usage Guidelines

Since these new data types are not basic ANSI C data types, you must observe the following usage restrictions:
• Use new data types only on either side of an assignment, as a return value, or as a parameter. You cannot use them with other arithmetic expressions (+, -, etc.).
• Use new data types as objects in aggregates, such as unions, to access the byte elements and structures.
• Use new data types only with the respective intrinsics described in this documentation. The new data types are supported on both sides of an assignment statement, as parameters to a function call, and as a return value from a function call.
215. e order.
r0 := w0; r1 := w1; ... r7 := w7

__m128i _mm_setr_epi8(char b0, char b1, char b2, char b3, char b4, char b5, char b6, char b7, char b8, char b9, char b10, char b11, char b12, char b13, char b14, char b15)
Sets the 16 signed 8-bit integer values in reverse order.
r0 := b0; r1 := b1; ... r15 := b15

__m128i _mm_setzero_si128()
Sets the 128-bit value to zero.
r := 0x0

Integer Store Operations for Streaming SIMD Extensions 2

The following store operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

void _mm_store_si128(__m128i *p, __m128i b) (uses MOVDQA)
Stores 128-bit value. Address p must be 16-byte aligned.
*p := b

void _mm_storeu_si128(__m128i *p, __m128i b) (uses MOVDQU)
Stores 128-bit value. Address p need not be 16-byte aligned.
*p := b

void _mm_maskmoveu_si128(__m128i d, __m128i n, char *p) (uses MASKMOVDQU)
Conditionally stores byte elements of d to address p. The high bit of each byte in the selector n determines whether the corresponding byte in d will be stored. Address p need not be 16-byte aligned.
if (n0[7]) p[0] := d0; if (n1[7]) p[1] := d1; ... if (n15[7]) p[15] := d15

void _mm_storel_epi64(__m128i *p, __m128i q) (uses MOVQ)
Stores the lower 64 bits of q to the address p.
*p[63:0] := q0

New IA-32 Intrinsics

The Intel C
216. e preemptable.

Other Visibility-related Command-line Options

-fminshared
The -fminshared option specifies that the compilation unit will be part of a main program component and will not be linked as part of a shareable object. Since symbols defined in the main program cannot be preempted, this allows the compiler to treat symbols declared with default visibility as though they have protected visibility (i.e., -fminshared implies -fvisibility=protected). Also, the compiler need not generate position-independent code for the main program. It can use absolute addressing, which may reduce the size of the global offset table (GOT) and may reduce memory traffic.

-fpic
The -fpic option specifies full symbol preemption. Global symbol definitions as well as global symbol references get default (i.e., preemptable) visibility unless explicitly specified otherwise.

-fno-common
Normally a C/C++ file-scope declaration with no initializer and without the extern or static keyword (int x;) is represented as a common symbol. Such a symbol is treated as an external reference, except that if no other compilation unit has a global definition for the name, the linker allocates memory for it. The -fno-common option causes the compiler to treat what otherwise would be common symbols as global definitions and to allocate memory for the symbol at compile time. This may permit the compiler to use the mor
217. e to be put into the result word.

[Figure: View of Original and Result Words with Shuffle Function Macro, for _mm_shuffle_pd applied to m1 and m2 with selector (1, 0)]

Cacheability Support Operations for Streaming SIMD Extensions 2

The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

void _mm_stream_pd(double *p, __m128d a) (uses MOVNTPD)
Stores the data in a to the address p without polluting caches. The address p must be 16-byte aligned. If the cache line containing address p is already in the cache, the cache will be updated.
p[0] := a0; p[1] := a1

void _mm_stream_si128(__m128i *p, __m128i a)
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated. Address p must be 16-byte aligned.
*p := a

void _mm_stream_si32(int *p, int a)
Stores the data in a to the address p without polluting the caches. If the cache line containing address p is already in the cache, the cache will be updated.
*p := a

void _mm_clflush(void const *p)
Cache line containing p is flushed and invalidated from all caches in the coherency domain.

void _mm_lfence(void)
Guarantees that every load instruction that precedes, in program order, the load fence instruction is globally visible before any load instruction which follows the fence in program order.
218. e w) (composite)
Sets the lower DP FP value to w and sets the upper DP FP value to zero.
r0 := w; r1 := 0.0

__m128d _mm_set1_pd(double w) (composite)
Sets the 2 DP FP values to w.
r0 := w; r1 := w

__m128d _mm_set_pd(double w, double x) (composite)
Sets the lower DP FP value to x and sets the upper DP FP value to w.
r0 := x; r1 := w

__m128d _mm_setr_pd(double w, double x) (composite)
Sets the lower DP FP value to w and sets the upper DP FP value to x.
r0 := w; r1 := x

__m128d _mm_setzero_pd(void) (uses XORPD)
Sets the 2 DP FP values to zero.
r0 := 0.0; r1 := 0.0

__m128d _mm_move_sd(__m128d a, __m128d b) (uses MOVSD)
Sets the lower DP FP value to the lower DP FP value of b. The upper DP FP value is passed through from a.
r0 := b0; r1 := a1

Store Operations for Streaming SIMD Extensions 2

The following store operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

void _mm_store_sd(double *dp, __m128d a) (uses MOVSD)
Stores the lower DP FP value of a. The address dp need not be 16-byte aligned.
*dp := a0

void _mm_store1_pd(double *dp, __m128d a) (uses MOVAPD + shuffling)
Stores the lower DP FP value of a twice. The address dp must be 16-byte aligned.
dp[0] := a0; dp[1] := a0

void _mm_store_pd double d
219. ections successfully parallelized.
-openmp_report2: same as -openmp_report1, plus diagnostics indicating that MASTER constructs, SINGLE constructs, CRITICAL constructs, ORDERED constructs, ATOMIC directives, etc. are successfully handled.
The default is -openmp_report1.

OpenMP Directives and Clauses

OpenMP Directives

Directive Name: Description

parallel: Defines a parallel region.

for: Identifies an iterative work-sharing construct that specifies a region in which the iterations of the associated loop should be executed in parallel.

sections: Identifies a non-iterative work-sharing construct that specifies a set of constructs that are to be divided among threads in a team.

single: Identifies a construct that specifies that the associated structured block is executed by only one thread in the team.

parallel for: A shortcut for a parallel region that contains a single for directive. The parallel or for OpenMP directive must be immediately followed by a for statement. If you place another statement or an OpenMP directive between the parallel or for directive and the for statement, the Intel C++ Compiler issues a syntax error.

parallel sections: Provides a shortcut form for specifying a parallel region containing a single sections directive.

master: Identifies a construct that specifies a structured block that is executed by the master thread of the team.

Other directives listed: critical[lock], barrier, atomic, flush
220. ectively utilize all of the processors all of the time. The turnaround mode is designed to keep active all of the processors involved in the parallel computation in order to minimize the execution time of a single job. In this mode, the worker threads actively wait for more parallel work, without yielding to other threads.

Note: Avoid over-allocating system resources. This occurs if either too many threads have been specified, or if too few processors are available at run time. If system resources are over-allocated, this mode will cause poor performance. The throughput mode should be used instead if this occurs.

Throughput

In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads will yield to other threads while waiting for more parallel work. The throughput mode is designed to make the program aware of its environment (that is, the system load) and to adjust its resource usage to produce efficient execution in a dynamic environment. Throughput mode is the default.

OpenMP Environment Variables

This topic describes the OpenMP environment variables (with the OMP_ prefix) and Intel-specific environment variables (with the KMP_ prefix).

Standard Environment Variables

Sets the runtime schedule type and chunk size. S
221. ectorization would be illegal if k < 0).

Example of ivdep Directive
    #pragma ivdep
    for (i = 0; i < m; i++)
    {
        a[i] = a[i + k] * c;
    }

vector aligned Directive

The vector aligned directive means the loop should be vectorized, if it is legal to do so, ignoring normal heuristic decisions about profitability. When the aligned or unaligned qualifier is used, the loop should be vectorized using aligned or unaligned operations. Specify either aligned or unaligned, but not both.

Caution: If you specify aligned as an argument, you must be absolutely sure that the loop will be vectorizable using this instruction. Otherwise, the compiler will generate incorrect code.

The loop in the example below uses the aligned qualifier to request that the loop be vectorized with aligned instructions, as the arrays are declared in such a way that the compiler could not normally prove this would be safe to do so.

Example of vector aligned Directive
    void foo(float *a)
    {
    #pragma vector aligned
        for (i = 0; i < m; i++)
        {
            a[i] = a[i] * c;
        }
    }

The compiler includes several alignment strategies in case the alignment of data structures is not known at compile time. A simple example is shown below, but several other strategies are supported as well. If, in the loop shown below, the alignment of a is unknown, the compiler will generate a prelude loop that iterates until the array reference that
222. ed with the borrow bits of the subtraction.

__m64 _m64_pmpy2l(__m64 a, __m64 b)
Two signed 16-bit data elements of a, starting with the most significant data element, are multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results are returned as shown in Figure 9.

__m64 _m64_pmpy2r(__m64 a, __m64 b)
Two signed 16-bit data elements of a, starting with the least significant data element, are multiplied by the corresponding two signed 16-bit data elements of b, and the two 32-bit results are returned as shown in Figure 10.

__m64 _m64_pmpyshr2(__m64 a, __m64 b, const int count)
The four signed 16-bit data elements of a are multiplied by the corresponding signed 16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits, and the least significant 16 bits of each shifted product form four 16-bit results, which are returned as one 64-bit word.

__m64 _m64_pmpyshr2u(__m64 a, __m64 b, const int count)
The four unsigned 16-bit data elements of a are multiplied by the corresponding unsigned 16-bit data elements of b, yielding four 32-bit products. Each product is then shifted to the right count bits, and the least significant 16 bits of each shifted product form four 16-bit results, which are returned as one 64-bit word.

__m64 _m64_pshladd2(__m64 a, const int count, __m64 b)
a is shifted to the left by count bits and then is added t
223. edocs/gcc-3.2/gcc/Zero-Length.html#Zero%20Length
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variable-Length.html#Variable%20Length
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Variadic-Macros.html#Variadic%20Macros
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Escaped-Newlines.html#Escaped%20Newlines
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Multi-line-Strings.html#Multi-line%20Strings
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Subscripting.html#Subscripting
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Pointer-Arith.html#Pointer%20Arith
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Pointer-Arith.html#Pointer%20Arith
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Initializers.html#Initializers
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Compound-Literals.html#Compound%20Literals
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Designated-Inits.html#Designated%20Inits
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Cast-to-Union.html#Cast%20to%20Union
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Case-Ranges.html#Case%20Ranges
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Mixed-Declarations.html#Mixed%20Declarations
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Function-Attributes.html#Function%20Attributes
http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Attribute-Syntax.html#Attribute%20Syntax

gcc Language Extension: Prototypes and Old-Style Function Definitions; C++ Style Comments; Dollar Signs in Iden
224. ent provides the minimal summary and max produces the full report. The default is -opt_report_levelmin.
• -opt_report_routine<routine_substring> generates reports from all routines with names containing the substring as part of their name. If not specified, reports from all routines are generated. By default, the compiler generates reports for all routines.

Specifying Optimizations to Generate Reports

The compiler can generate reports for an optimizer you specify in the phase argument of the -opt_report_phase<phase> option. The option can be used multiple times on the same command line to generate reports for multiple optimizers. Currently, the following optimizer reports are supported:

Optimizer: Optimizer Full Name

When one of the above logical names for optimizers is specified, all reports from that optimizer are generated. For example, -opt_report_phaseipo -opt_report_phaseecg generates reports from the interprocedural optimizer and the code generator.

Each of the optimizers can potentially have specific optimizations within them. Each of these optimizations is prefixed with one of the optimizer logical names. For example:

Optimizer_optimization: Full Name
ipo_inline: Interprocedural Optimizer, inline expansion of functions
ipo_constant_propagation: Interprocedural Optimizer, constant propagation
ipo_function_reorder: Interprocedural Optimizer, function reorder
225. epi32(int i3, int i2, int i1, int i0)
Sets the 4 signed 32-bit integer values.
r0 := i0; r1 := i1; r2 := i2; r3 := i3

__m128i _mm_set_epi16(short w7, short w6, short w5, short w4, short w3, short w2, short w1, short w0)
Sets the 8 signed 16-bit integer values.
r0 := w0; r1 := w1; ... r7 := w7

__m128i _mm_set_epi8(char b15, char b14, char b13, char b12, char b11, char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3, char b2, char b1, char b0)
Sets the 16 signed 8-bit integer values.
r0 := b0; r1 := b1; ... r15 := b15

__m128i _mm_set1_epi64(__m64 q)
Sets the 2 64-bit integer values to q.
r0 := q; r1 := q

__m128i _mm_set1_epi32(int i)
Sets the 4 signed 32-bit integer values to i.
r0 := i; r1 := i; r2 := i; r3 := i

__m128i _mm_set1_epi16(short w)
Sets the 8 signed 16-bit integer values to w.
r0 := w; r1 := w; ... r7 := w

__m128i _mm_set1_epi8(char b)
Sets the 16 signed 8-bit integer values to b.
r0 := b; r1 := b; ... r15 := b

__m128i _mm_setr_epi64(__m64 q0, __m64 q1)
Sets the 2 64-bit integer values in reverse order.
r0 := q0; r1 := q1

__m128i _mm_setr_epi32(int i0, int i1, int i2, int i3)
Sets the 4 signed 32-bit integer values in reverse order.
r0 := i0; r1 := i1; r2 := i2; r3 := i3

__m128i _mm_setr_epi16(short w0, short w1, short w2, short w3, short w4, short w5, short w6, short w7)
Sets the 8 signed 16-bit integer values in revers
226. er the inner construct is started, the threads from the outer construct can easily migrate to the inner construct to help finish the request. Since the workqueuing model is designed to preserve sequential semantics, synchronization is inherent in the semantics of the taskq block. There is an implicit team barrier at the completion of the taskq block for the threads that encountered the taskq construct, to ensure that all of the tasks specified inside the taskq block have finished execution. This taskq barrier enforces the sequential semantics of the original program. Just as with the OpenMP worksharing constructs, you are responsible for ensuring that either no dependencies exist or that dependencies are appropriately synchronized between the task blocks, or between code in a task block and code in the taskq block outside of the task blocks.

The syntax, semantics, and allowed clauses are designed to resemble OpenMP worksharing constructs. Most of the clauses allowed on OpenMP worksharing constructs have a reasonable meaning when applied to the workqueuing pragmas.

taskq Construct

#pragma intel omp taskq [clause[[,] clause]...]
    structured-block

where clause can be any of the following:
• private (variable-list)
• firstprivate (variable-list)
• lastprivate (variable-list)
• reduction (operator : variable-list)
• ordered
• nowait

private
The private clause creates a private, default-constructed version for each object in v
227. ered.
r0 := (a0 unord b0) ? 0xffffffff : 0x0; r1 := a1; r2 := a2; r3 := a3

__m128 _mm_cmpunord_ps(__m128 a, __m128 b)
Compare for unordered.
r0 := (a0 unord b0) ? 0xffffffff : 0x0
r1 := (a1 unord b1) ? 0xffffffff : 0x0
r2 := (a2 unord b2) ? 0xffffffff : 0x0
r3 := (a3 unord b3) ? 0xffffffff : 0x0

int _mm_comieq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.
r := (a0 == b0) ? 0x1 : 0x0

int _mm_comilt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.
r := (a0 < b0) ? 0x1 : 0x0

int _mm_comile_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 <= b0) ? 0x1 : 0x0

int _mm_comigt_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.
r := (a0 > b0) ? 0x1 : 0x0

int _mm_comige_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 >= b0) ? 0x1 : 0x0

int _mm_comineq_ss(__m128 a, __m128 b)
Compares the lower SP FP value of a and
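The mask and flag conventions in the compare descriptions above can be modelled on single scalar values. The following is a minimal sketch only: unord_mask and comilt_model are hypothetical helper names, not intrinsics, and the sketch does not reproduce how the hardware comi instructions treat NaN operands.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar model of one element of an "unordered" compare:
// all ones if either operand is NaN, all zeros otherwise.
inline std::uint32_t unord_mask(float a, float b) {
    return std::isunordered(a, b) ? 0xffffffffu : 0x0u;
}

// Hypothetical scalar model of a comi-style "less than" test on the
// low elements: the result is the integer 1 or 0, not a bit mask.
inline int comilt_model(float a0, float b0) {
    return (a0 < b0) ? 1 : 0;
}
```

The point of the contrast is the result type: the cmp family produces a per-element bit mask, while the comi family produces a plain int.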
228. erequal(x, y); int isinf(x); int isless(x, y); int islessequal(x, y); int islessgreater(x, y); int isnan(x); int isnormal(x); int isunordered(x, y); int signbit(x)

See also Miscellaneous Functions.

Intel C++ Intrinsics Reference

Introduction

The Intel Pentium 4 processor and other Intel processors have instructions to enable development of optimized multimedia applications. The instructions are implemented through extensions to previously implemented instructions. This technology uses the single instruction, multiple data (SIMD) technique. By processing data elements in parallel, applications with media-rich bit streams are able to significantly improve performance using SIMD instructions. The Intel Itanium processor also supports these instructions.

The most direct way to use these instructions is to inline the assembly language instructions into your source code. However, this can be time-consuming and tedious, and assembly language inline programming is not supported on all compilers. Instead, Intel provides easy implementation through the use of API extension sets referred to as intrinsics.

Intrinsics are special coding extensions that allow using the syntax of C function calls and C variables instead of hardware registers. Using these intrinsics frees programmers from having to program in assembly language and manage registers. In addition, the compiler optimizes the instruction scheduling so th
229. eriod. errno: ERANGE, for underflow and overflow conditions.
Calling interface: double annuity(double x, double y); long double annuity(double x, double y); float annuityf(float x, double y);

COMPOUND
Description: The compound function computes the compound interest factor (1+x)^y, where x is a rate and y is a period. errno: ERANGE, for underflow and overflow conditions.
Calling interface: double compound(double x, double y); long double compound(double x, double y); float compoundf(float x, double y);

ERF
Description: The erf function returns the error function value.
Calling interface: double erf(double x); long double erfl(long double x); float erff(float x);

ERFC
Description: The erfc function returns the complementary error function value. errno: ERANGE, for underflow conditions.
Calling interface: double erfc(double x); long double erfcl(long double x); float erfcf(float x);

GAMMA
Description: The gamma function returns the value of the logarithm of the absolute value of gamma. errno: ERANGE, for overflow conditions.
Calling interface: double gamma(double x); float gammaf(float x);

GAMMA_R
Description: The gamma_r function returns the value of the logarithm of the absolute value of gamma. The sign of the gamma function is returned in the integer signgam.
Calling interface: double gamma_r(double x, int *signgam); float gammaf_r(float x, int *signgam);
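To make two of the descriptions above concrete, here is a small sketch that uses only standard <cmath> calls. compound_factor and erf_plus_erfc are hypothetical helper names introduced for illustration; the sketch assumes the compound interest factor is (1+x)^y as described and does not call the Intel Math Library's own compound function.

```cpp
#include <cmath>

// Hypothetical stand-in for the compound interest factor (1 + x)^y,
// where x is a rate and y is a number of periods.
inline double compound_factor(double rate, double periods) {
    return std::pow(1.0 + rate, periods);
}

// The complementary error function satisfies erfc(x) = 1 - erf(x),
// so this sum should be 1 up to rounding error.
inline double erf_plus_erfc(double x) {
    return std::erf(x) + std::erfc(x);
}
```

For example, a 5% rate over two periods gives a factor close to 1.1025.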
230. erl(long double x, long double y); float remainderf(float x, float y);

REMQUO
Description: The remquo function returns the value of x REM y. In the object pointed to by quo, the function stores a value whose sign is the sign of x/y and whose magnitude is congruent modulo 2^n to the magnitude of the integral quotient of x/y, where n is an implementation-defined integer greater than or equal to 3.
Calling interface: double remquo(double x, double y, int *quo); long double remquol(long double x, long double y, int *quo); float remquof(float x, float y, int *quo);

Miscellaneous Functions

The Intel Math library supports the following miscellaneous functions:

COPYSIGN
Description: The copysign function returns the value with the magnitude of x and the sign of y.
Calling interface: double copysign(double x, double y); long double copysignl(long double x, long double y); float copysignf(float x, float y);

FABS
Description: The fabs function returns the absolute value of x.
Calling interface: double fabs(double x); long double fabsl(long double x); float fabsf(float x);

FDIM
Description: The fdim function returns the positive difference value, x-y for x > y, or 0 for x <= y. errno: ERANGE, for values too large.
Calling interface: double fdim(double x, double y); long double fdiml(long double x, long double y); float fdimf(floa
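The remquo, copysign, and fdim descriptions above can be tried with the standard <cmath> equivalents. This is a sketch under the assumption that the standard library functions behave as described here; RemQuo and rem_and_quo are hypothetical names used only to return both results together.

```cpp
#include <cmath>

// Hypothetical pair type holding both results of remquo.
struct RemQuo {
    double rem;  // x REM y
    int quo;     // sign of x/y, low-order bits of the integral quotient
};

inline RemQuo rem_and_quo(double x, double y) {
    int q = 0;
    const double r = std::remquo(x, y, &q);
    return RemQuo{r, q};
}
```

Because only the low-order bits of the quotient are guaranteed (at least three), callers that need a portable value should mask the result, for example with quo & 7. The related calls std::copysign(2.0, -1.0) and std::fdim(5.0, 3.0) follow the COPYSIGN and FDIM descriptions directly.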
231. ern of separated work iteration and work creation, and are naturally parallelized with the workqueuing model. Some common cases are:
• while loops
• C++ iterators
• recursive functions

while Loops
If the computation in each iteration of a while loop is independent, the entire loop becomes the environment for the taskq pragma, and the statements in the body of the while loop become the units of work to be specified with the task pragma. The conditional in the while loop and any modifications to the control variables are placed outside of the task blocks and executed sequentially to enforce the data dependencies on the control variables.

C++ Iterators
C++ Standard Template Library (STL) iterators are very much like the while loops just described, whereby the operations on the data stored in the STL are very distinct from the act of iterating over all the data. If the operations are data-independent, they can be done in parallel as long as the iteration over the work is sequential. This type of while loop parallelism is a generalization of the standard OpenMP worksharing for loops. In the worksharing for loops, the loop increment operation is the iterator and the body of the loop is the unit of work. However, because the for loop iteration variable frequently has a closed-form solution, it can be computed in parallel and the sequential step avoided.

Recursive Functions
Recursive functions also can be used to specify parallel iteration spaces
232. es symmetric multiprocessing (SMP) with the following major features:
• Relieves the user from having to deal with the low-level details of iteration space partitioning, data sharing, and thread scheduling and synchronization.
• Provides the benefit of the performance available from shared memory, multiprocessor systems.

The Intel C++ Compiler performs transformations to generate multithreaded code based on the user's placement of OpenMP directives in the source program, making it easy to add threading to existing software. The Intel compiler supports all of the current industry-standard OpenMP directives, except WORKSHARE, and compiles parallel programs annotated with OpenMP directives. In addition, the Intel C++ Compiler provides Intel-specific extensions to the OpenMP C++ version 2.0 specification, including run-time library routines and environment variables.

Note: As with many advanced features of compilers, you must properly understand the functionality of the OpenMP directives in order to use them effectively and avoid unwanted program behavior.

See parallelization options summary for all of the options of the OpenMP feature in the Intel C++ Compiler. For complete information on the OpenMP standard, visit the OpenMP Web site at http://www.openmp.org. For OpenMP C++ version 2.0 API specifications, see http://www.openmp.org/specs.

Parallel Processing with OpenMP

To compile with OpenMP, you need
233. essel function of the second kind of x with order 0. errno: EDOM, for x < 0.
Calling interface: double y0(double x); float y0f(float x);

Y1
Description: Computes the Bessel function of the second kind of x with order 1. errno: EDOM, for x < 0.
Calling interface: double y1(double x); float y1f(float x);

YN
Description: Computes the Bessel function of the second kind of x with order n. errno: EDOM, for x < 0.
Calling interface: double yn(int n, double x); float ynf(int n, float x);

Nearest Integer Functions

The Intel Math library supports the following nearest integer functions:

CEIL
Description: The ceil function returns the smallest integral value not less than x as a floating-point number. This function may be inlined with the Itanium compiler.
Calling interface: double ceil(double x); long double ceill(long double x); float ceilf(float x);

FLOOR
Description: The floor function returns the largest integral value not greater than x as a floating-point value. This function may be inlined with the Itanium compiler.
Calling interface: double floor(double x); long double floorl(long double x); float floorf(float x);

LLRINT
Description: The llrint function returns the rounded integer value according to the current rounding direction as a long long int. errno: ERANGE, for values too large.
Calling interface: long lon
234. essesresse 11
shared-libcxa option, 11
sin library function, 176
sincos library function, 176
sincosd library function, 176
sind library function, 176
sinh library function, 179
sinhcosh library function, 179
software pipelining, 154
sox option, 11
sqrt library function, 180
static option, 11
static-libcxa option, 11
std=c99 option, 11
strict_ansi option, 11, 76
strip mining, 124
structure tag alignments, 54
SUDDOIEL 25 terris reete etes 2
symbol preemption, 64
syntax option, 11
lr e aeee i RAG eR 11
tan library function, 176
tand library function, 176
tanh library function, 179
test prioritization, 109
tgamma library function, 184
threshold control, 135
timing application, 160
TMP environment variable, 48
Appl optione ed eco eod Res 11
APPZ Option zc Anette Se 11
pp Option 4 eee erts 11
tpp6 option, 11
tpp7 option, 11
trunc library function, 187
HEET 11
U option, 11
235. even if no optimization is specified.

Language Conformance

Conformance Options

Option | Description
-ansi | Equivalent to GNU -ansi.
-strict_ansi | Strict ANSI conformance dialect.
-ansi_alias | Directs the compiler to assume the following:
• arrays are not accessed out of bounds
• pointers are not cast to non-pointer types, and vice versa
• references to objects of two different scalar types cannot alias. For example, an object of type int cannot alias with an object of type float, and an object of type float cannot alias with an object of type double.

If your program satisfies the above conditions, setting the -ansi_alias flag will help the compiler better optimize the program. However, if your program does not satisfy one of the above conditions, the -ansi_alias flag may lead the compiler to generate incorrect code.

Conformance to the C Standard

You can set the Intel C++ Compiler to accept either:
• ANSI conformance, equivalent to GNU -ansi, with the -ansi option, or
• Strict ANSI conformance dialect, with the -strict_ansi option

The compiler is set by default to accept extensions and not be limited to the ANSI/ISO standard.

Understanding the ANSI/ISO Standard C Dialect

The Intel C++ Compiler provides conformance to the ANSI/ISO standard for C language compilation (ISO/IEC 9899:1990). This standard requires that conforming C compilers accept minimum translation limits. This compiler exceeds all of
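The third aliasing assumption above (objects of two different scalar types do not alias) is the one most often broken in practice, typically by reading a float through an integer pointer. The following sketch shows one way to inspect a float's bit pattern without such an access. float_bits is a hypothetical helper name, and the expected bit patterns assume a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

// Copies the object representation of a float into an integer.
// Unlike *reinterpret_cast<std::uint32_t*>(&f), this never accesses a
// float object through an lvalue of an unrelated scalar type, so it
// stays within the assumptions listed for the aliasing option.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Compilers generally reduce a fixed-size memcpy like this to a single register move, so the safe form need not cost more than the cast.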
236. ew seconds, run several timings to ensure that the results are not misleading. Certain overhead functions, like loading external programs, might influence short timings considerably.
• If your program displays a lot of text, consider redirecting the output from the program. Redirecting output from the program will change the times reported because of reduced screen I/O.

The following program illustrates a model for program timing:

Sample Timing

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        clock_t start, finish;
        long loop;
        double duration, loop_calc;

        start = clock();
        for (loop = 0; loop < 2000; loop++)
        {
            loop_calc = 123.456 * 789;
            /* printf() included to facilitate example */
            printf("\nThe value of loop is: %ld", loop);
        }
        finish = clock();

        /* Convert elapsed clock ticks to seconds. */
        duration = (double)(finish - start) / CLOCKS_PER_SEC;
        printf("\n%2.3f seconds\n", duration);
        return 0;
    }

Compiler Limits

The table below shows the size or number of each item that the compiler can process. All capacities shown in the table are tested values; the actual number can be greater than the number shown.

Item
Control structure nesting (block nesting)
Conditional compilation nesting
Declarator modifiers
Parenthesis nesting levels
Significant characters, internal identifier
External identifier name length
Number of external identifiers/file
Number of iden
237. f_gen[x]: with the x qualifier, extra source position information is collected, which enables code coverage tools such as the Intel C++ Compiler Code-coverage Tool. Without such tools, -prof_genx does not provide better optimization and may slow parallel compile times.

Basic PGO Options

Option | Description
-prof_gen[x] | Instructs the compiler to produce instrumented code in your object files in preparation for instrumented execution.
-prof_use | Instructs the compiler to produce a profile-optimized executable and merges available dynamic information (.dyn) files into a pgopti.dpi file.

In cases where your code behavior differs greatly between executions, you have to ensure that the benefit of the profile information is worth the effort required to maintain up-to-date profiles.

In the basic profile-guided optimization, the following options are used in the phases of the PGO:

Generating Instrumented Code
The -prof_gen[x] option instruments the program for profiling to get the execution count of each basic block. It is used in Phase 1 of the PGO to instruct the compiler to produce instrumented code in your object files in preparation for instrumented execution. Parallel make is automatically supported for -prof_genx compilations.

Generating a Profile-optimized Executable
The -prof_use option is used in Phase 3 of the PGO to instruct the compiler to produce a profile-optimized execu
238. f the tool usage can be summarized as follows:
• Minimizing the number of tests that are required to achieve a given overall coverage for any subset of the application: the tool defines the smallest subset of the application tests that achieve exactly the same code coverage as the entire set of tests.
• Reducing the turn-around time of testing: instead of spending a long time on finding a possibly large number of failures, the tool enables the users to quickly find a small number of tests that expose the defects associated with regressions caused by a change set.
• Selecting and prioritizing the tests to achieve a certain level of code coverage in a minimal time, based on the data of the tests' execution time.

Command-line Syntax

The syntax for this tool is as follows:

tselect -dpi_list file

where -dpi_list is a required tool option that sets the path to the DPI list file that contains the list of the .dpi files of the tests you need to prioritize.

Tool Options

The tool uses options that are listed in the table that follows.

Option | Description
-help | Prints all the options of the test prioritization tool.
-spi file | Sets the path name of the static profile information file (.spi). Default is pgopti.spi.
-dpi_list file | Sets the path name of the file that contains the names of the dynamic profile information (.dpi) files. Each line of the file should contain one .dpi name, optionally followed by its execution time. The name must uniquely ide
239. ficient manner possible in the context used. This intrinsic does not map to any specific SSE2 instruction.

Streaming SIMD Extensions 2 Floating-point Memory and Initialization Operations

This section describes the load, set, and store operations, which let you load and store data into memory. The load and set operations are similar in that both initialize __m128d data. However, the set operations take a double argument and are intended for initialization with constants, while the load operations take a double pointer argument and are intended to mimic the instructions for loading data from memory. The store operation assigns the initialized data to the address.

Note: There is no intrinsic for move operations. To move data from one register to another, a simple assignment, A = B, suffices, where A and B are the source and target registers for the move operation.

The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

Load Operations for Streaming SIMD Extensions 2

The following load operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2.

The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

__m128d _mm_load_pd(double const *dp)
(uses MOVAPD) Loads two DP FP values. The address p must be 16-byte aligned.
r0 := p[0]; r1 := p[1]

__m128d _mm_load1
240. float acosdf(float x);

ASIN
Description: The asin function returns the principal value of the inverse sine of x in the range [-pi/2, +pi/2] radians for x in the interval [-1, +1]. errno: EDOM, for |x| > 1.
Calling interface: double asin(double x); long double asinl(long double x); float asinf(float x);

ASIND
Description: The asind function returns the principal value of the inverse sine of x in the range [-90, 90] degrees for x in the interval [-1, +1]. errno: EDOM, for |x| > 1.
Calling interface: double asind(double x); long double asindl(long double x); float asindf(float x);

ATAN
Description: The atan function returns the principal value of the inverse tangent of x in the range [-pi/2, +pi/2] radians.
Calling interface: double atan(double x); long double atanl(long double x); float atanf(float x);

ATAN2
Description: The atan2 function returns the principal value of the inverse tangent of y/x in the range [-pi, +pi] radians. errno: EDOM, for x = 0 and y = 0.
Calling interface: double atan2(double x, double y); long double atan2l(long double x, long double y); float atan2f(float x, float y);

ATAND
Description: The atand function returns the principal value of the inverse tangent of x in the range [-90, 90] degrees.
Calling interface: double atand(double x); long double atandl(long double x); float atandf(float x);

ATAN2D
Description: The atan2d func
241. float sinf(float x);

SINCOS
Description: The sincos function returns both the sine and cosine of x measured in radians. This function may be inlined with the Itanium compiler.
Calling interface: void sincos(double x, double *sinval, double *cosval); void sincosl(long double x, long double *sinval, long double *cosval); void sincosf(float x, float *sinval, float *cosval);

SINCOSD
Description: The sincosd function returns both the sine and cosine of x measured in degrees.
Calling interface: void sincosd(double x, double *sinval, double *cosval); void sincosdl(long double x, long double *sinval, long double *cosval); void sincosdf(float x, float *sinval, float *cosval);

SIND
Description: The sind function computes the sine of x measured in degrees.
Calling interface: double sind(double x); long double sindl(long double x); float sindf(float x);

TAN
Description: The tan function returns the tangent of x measured in radians.
Calling interface: double tan(double x); long double tanl(long double x); float tanf(float x);

TAND
Description: The tand function returns the tangent of x measured in degrees. errno: ERANGE, for overflow conditions.
Calling interface: double tand(double x); long double tandl(long double x); float tandf(float x);

Hyperbolic Functions

The Intel Math library supports the following hyperbolic functions:

ACOSH
Description: T
242. for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.
r := (a0 > b0) ? 0x1 : 0x0

int _mm_ucomige_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 >= b0) ? 0x1 : 0x0

int _mm_ucomineq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.
r := (a0 != b0) ? 0x1 : 0x0

Conversion Operations for Streaming SIMD Extensions 2

Each conversion intrinsic takes one data type and performs a conversion to a different type. Some conversions, such as _mm_cvtpd_ps, result in a loss of precision. The rounding mode used in such cases is determined by the value in the MXCSR register. The default rounding mode is round-to-nearest. Note that the rounding mode used by the C and C++ languages when performing a type conversion is to truncate. The _mm_cvttpd_epi32 and _mm_cvttsd_si32 intrinsics use the truncate rounding mode regardless of the mode specified by the MXCSR register.

The conversion operation intrinsics for Streaming SIMD Extensions 2 are listed in the following table, followed by detailed descriptions.

The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

Intrinsic Corresponding Return Parameters Name Instruc
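The difference between the language-level truncating conversion and a conversion that follows the current rounding mode, as described above, can be seen with standard library calls alone. This is a sketch, not the intrinsics themselves: convert_truncating and convert_current_mode are hypothetical names, and the sketch assumes the floating-point environment is still in its default round-to-nearest mode.

```cpp
#include <cmath>

// A C/C++ cast from floating point to integer always truncates toward zero.
inline int convert_truncating(double x) {
    return static_cast<int>(x);
}

// std::lrint rounds according to the current rounding direction, which is
// round-to-nearest (ties to even) unless the program has changed it.
inline long convert_current_mode(double x) {
    return std::lrint(x);
}
```

With 2.7 as input the first function yields 2 and the second yields 3, while the halfway case 2.5 rounds to the even neighbor 2 under the default mode.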
243. for library (.a) files. If you want the linker to search additional libraries, you can add their names to LD_LIBRARY_PATH, to the command line, or to a response file (see Note below). In each case, the names of these libraries are passed to the linker before the names of the Intel libraries that the driver always specifies.

Note: Response files are processed at the location they appear on the command line. If libraries are specified in the response file, references from object files seen after the response file will not be resolved in those libraries.

Modifying LD_LIBRARY_PATH

If you want to add a directory, /libs for example, to the LD_LIBRARY_PATH, you can do either of the following:
• prompt: export LD_LIBRARY_PATH=/libs:$LD_LIBRARY_PATH
• startup file: export LD_LIBRARY_PATH=/libs:$LD_LIBRARY_PATH

To compile file.cpp and link it with the library mylib.a, enter the following command:

prompt> icpc file.cpp mylib.a

The compiler passes file names to the linker in the following order:
1. the object file
2. any objects or libraries specified on the command line, in a response file, or in a configuration file
3. the Intel Math Library, libimf.a

Compiling for Non-shared Libraries

This section includes information on:
• Global Symbols and Visibility Attributes
• Symbol Preemption
• Specifying Symbol Visibility Explicitly
• Other Visibility-related Command-line Options

Global Symbols and Visibility Attribu
244. ft long integer
unsigned long _lrotr(unsigned long value, int shift) | Rotates bits right for an unsigned long integer
unsigned int _rotl(unsigned int value, int shift) | Rotates bits left for an unsigned integer
unsigned int _rotr(unsigned int value, int shift) | Rotates bits right for an unsigned integer

Note: Passing a constant shift value in the rotate intrinsics results in higher performance.

Floating-point Related

Intrinsic: double fabs(double); double log(double); float logf(float); double log10(double); float log10f(float); double exp(double); float expf(float); double pow(double, double); float powf(float, float); double sin(double); float sinf(float); double cos(double); float cosf(float); double tan(double); float tanf(float); double acos(double); float acosf(float)

Description:
Returns the absolute value of a floating-point value.
Returns the natural logarithm ln(x), x > 0, with double precision.
Returns the natural logarithm ln(x), x > 0, with single precision.
Returns the base 10 logarithm log10(x), x > 0, with double precision.
Returns the base 10 logarithm log10(x), x > 0, with single precision.
Returns the exponential function with double precision.
Returns the exponential function with single precision.
Returns the value of x to the power y with double precision.
Returns the val
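What the rotate intrinsics above compute can be written in portable code. The following is a sketch for 32-bit values only: rotl32 and rotr32 are hypothetical names, not the compiler's intrinsics, and reducing the shift count modulo 32 is an assumption of this sketch rather than documented behavior.

```cpp
#include <cstdint>

// Rotates the 32 bits of value left by shift positions.
// The count is reduced modulo 32 so that a shift of 0 or 32 is well defined.
inline std::uint32_t rotl32(std::uint32_t value, int shift) {
    const unsigned s = static_cast<unsigned>(shift) & 31u;
    return (value << s) | (value >> ((32u - s) & 31u));
}

// Rotates the 32 bits of value right by shift positions.
inline std::uint32_t rotr32(std::uint32_t value, int shift) {
    const unsigned s = static_cast<unsigned>(shift) & 31u;
    return (value >> s) | (value << ((32u - s) & 31u));
}
```

Bits shifted out at one end re-enter at the other, so rotating 0x12345678 left by 8 gives 0x34567812.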
245. g int llrint(double x); long long int llrintl(long double x); long long int llrintf(float x);

LLROUND
Description: The llround function returns the rounded integer value as a long long int. errno: ERANGE, for values too large.
Calling interface: long long int llround(double x); long long int llroundl(long double x); long long int llroundf(float x);

LRINT
Description: The lrint function returns the rounded integer value according to the current rounding direction as a long int.
Calling interface: long int lrint(double x); long int lrintl(long double x); long int lrintf(float x);

LROUND
Description: The lround function returns the rounded integer value as a long int. Halfway cases are rounded away from zero. errno: ERANGE, for values too large.
Calling interface: long int lround(double x); long int lroundl(long double x); long int lroundf(float x);

MODF
Description: The modf function returns the value of the signed fractional part of x and stores the integral part, in floating-point format, in *iptr.
Calling interface: double modf(double x, double *iptr); long double modfl(long double x, long double *iptr); float modff(float x, float *iptr);

NEARBYINT
Description: The nearbyint function returns the rounded integral value as a floating-point number, using the current rounding direction.
Calling interface: double nearby
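The LROUND, MODF, and NEARBYINT descriptions above differ mainly in how halfway cases and signs are handled, which a short sketch with the standard <cmath> functions makes visible. Parts and split are hypothetical names used for illustration, and the nearbyint behavior mentioned assumes the default round-to-nearest mode.

```cpp
#include <cmath>

// Hypothetical pair type for the two results of modf.
struct Parts {
    double integral;    // integral part, kept in floating-point format
    double fractional;  // signed fractional part
};

inline Parts split(double x) {
    double ip = 0.0;
    const double frac = std::modf(x, &ip);
    return Parts{ip, frac};
}

// lround rounds halfway cases away from zero, independent of rounding mode.
inline long round_half_away(double x) {
    return std::lround(x);
}
```

For the halfway value 2.5, round_half_away gives 3, whereas std::nearbyint(2.5) gives 2.0 under the default mode because ties go to the even neighbor.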
246. g point arrays are not automatically aligned. To get 16-byte alignment, you can use the alignment __declspec:

__declspec(align(16)) float A[4];

Conversions

All Fvec object variables can be implicitly converted to __m128 data types. For example, the results of computations performed on F32vec4 or F32vec1 object variables can be assigned to __m128 data types.

__m128d mm = A & B; /* where A, B are F64vec2 object variables */
__m128 mm = A & B; /* where A, B are F32vec4 object variables */
__m128 mm = A & B; /* where A, B are F32vec1 object variables */

Constructors and Initialization

The following table shows how to create and initialize F32vec objects with the Fvec classes.

Constructors and Initialization for Fvec Classes

Example | Intrinsic | Returns

Constructor Declaration
F64vec2 A; F32vec4 B; F32vec1 C; | N/A | N/A

__m128 Object Initialization
F64vec2 A(__m128d mm); F32vec4 B(__m128 mm); F32vec1 C(__m128 mm); | N/A | N/A

Double Initialization
/* Initializes two doubles. */ F64vec2 A(double d0, double d1); F64vec2 A = F64vec2(double d0, double d1); | _mm_set_pd | A0 := d0; A1 := d1
F64vec2 A(double d0); /* Initializes both return values with the same double precision value. */ | _mm_set1_pd | A0 := d0; A1 := d0

Float Initialization
F32vec4 A(float f3, float f2, float f1, float f0); | _mm_set_ps | A0 := f0; A1 := f1
F32vec4 A = F32vec4(fl
247. g point argument and are intended to mimic the instructions for loading data from memory. The store operation assigns the initialized data to the address.

The intrinsics are listed in the following table. Syntax and a brief description are contained in the following topics.

The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.

Intrinsic Name | Alternate Name | Operation | Corresponding Instruction
_mm_load_ss | | Load the low value and clear the three high values | MOVSS
_mm_load_ps1 | _mm_load1_ps | Load one value into all four words | MOVSS + Shuffling
_mm_load_ps | | Load four values, address aligned | MOVAPS
_mm_loadu_ps | | Load four values, address unaligned | MOVUPS
_mm_loadr_ps | | Load four values in reverse order | MOVAPS + Shuffling
_mm_set_ss | | Set the low value and clear the three high values | Composite
_mm_set_ps1 | _mm_set1_ps | Set all four words with the same value | Composite
_mm_set_ps | | Set four values, address aligned | Composite
_mm_setr_ps | | Set four values, in reverse order | Composite
_mm_setzero_ps | | Clear all four values | Composite
_mm_store_ss | | Store the low value | MOVSS
_mm_store_ps1 | _mm_store1_ps | Store the low value across all four words. The address must be 16-byte aligned. | Shuffling + MOVSS
_mm_store_ps | | Store four values, address aligned | MOVAPS
_mm_storeu_ps | | Store four values, add
248. gcc documentation at http://www.gnu.org/manual/gas/html_chapter/as_16.html

gcc Interoperability

C++ compilers are interoperable if object files and libraries generated by one compiler can be linked with object files and libraries generated by the second compiler, and the resulting executable runs successfully. The Intel C++ Compiler 8.0 has made significant improvements towards interoperability and compatibility with the GNU gcc compiler. This section describes new interoperability options. See gcc Compatibility for a detailed list of compatibility features.

Interoperability Compiler Options

The Intel C++ Compiler options that affect gcc interoperability include:
• -cxxlib-gcc
• -gcc-name
• -gcc-version

-cxxlib-gcc option

The -cxxlib-gcc option lets you build your applications using the C++ libraries and header files included with the gcc compiler. They include:
• libstdc++ standard C++ header files
• libstdc++ standard C++ library
• libgcc C++ language support

When you compile and link your application using the -cxxlib-gcc option, the resulting C++ object files, libraries, and executables can interoperate with C++ object files, libraries, and executables generated by gcc 3.2. This means that third-party C++ libraries built with gcc 3.2 will work with C++ code generated by the Intel Compiler 8.0.

The -cxxlib-gcc option can only be used on Linux distributions that include gcc 3.2. Thi
249. generally 16-byte aligned.
• Some intrinsics require that their argument be immediates, that is, constant integers (literals), due to the nature of the instruction.
• The result of arithmetic operations acting on two NaN (Not a Number) arguments is undefined. Therefore, FP operations using NaN arguments will not match the expected behavior of the corresponding assembly instructions.

Arithmetic Operations for Streaming SIMD Extensions

The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file.

Intrinsic Name | Corresponding Instruction | Operation
_mm_add_ss | ADDSS |
_mm_sub_ss | SUBSS | Subtraction
_mm_sqrt_ss | SQRTSS | Squared Root
_mm_rsqrt_ss | RSQRTSS | Reciprocal Square Root
_mm_rsqrt_ps | RSQRTPS | Reciprocal Squared Root
_mm_min_ss | MINSS | Computes Minimum
_mm_max_ss | MAXSS | Computes Maximum
_mm_max_ps | MAXPS | Computes Maximum

__m128 _mm_add_ss(__m128 a, __m128 b)
Adds the lower SP FP (single-precision, floating-point) values of a and b; the upper 3 SP FP values are passed through from a.
r0 := a0 + b0; r1 := a1; r2 := a2; r3 := a3

__m128 _mm_add_ps(__m128 a, __m128 b)
Adds the four SP FP values of a and b.
r0 := a0 + b0; r1 := a1 + b1; r2 := a2 + b2; r3 := a3 + b3

__m128 _mm_sub_ss(__m128 a, __m128
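The r0 to r3 notation used in the add descriptions above can be modelled with a plain four-element array, which makes the difference between the scalar (ss) and packed (ps) forms explicit. This is a scalar sketch only: Vec4, add_ss_model, and add_ps_model are hypothetical names and no SIMD instructions are involved.

```cpp
#include <array>

// Element 0 plays the role of r0, the lowest value in the register.
using Vec4 = std::array<float, 4>;

// Model of the "ss" form: only the lowest element is added,
// the upper three are passed through from a.
inline Vec4 add_ss_model(const Vec4& a, const Vec4& b) {
    return Vec4{{a[0] + b[0], a[1], a[2], a[3]}};
}

// Model of the "ps" form: all four elements are added pairwise.
inline Vec4 add_ps_model(const Vec4& a, const Vec4& b) {
    return Vec4{{a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]}};
}
```

The same ss versus ps split applies to the other arithmetic operations in the table.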
250. gy and Streaming SIMD Extensions provide SIMD instructions for most arithmetic and logical operators on 32-bit, 16-bit, and 8-bit integer data types. Vectorization may proceed if the final precision of integer wrap-around arithmetic will be preserved. A 32-bit shift-right operator, for instance, is not vectorized if the final stored value is a 16-bit integer. Also note that, because the MMX instructions and Streaming SIMD Extensions instruction sets are not fully orthogonal (byte shifts, for instance, are not supported), not all integer operations can actually be vectorized.
For loops that operate on 32-bit single-precision and 64-bit double-precision floating-point numbers, the Streaming SIMD Extensions provide SIMD instructions for the arithmetic operators +, -, *, and /. Also, the Streaming SIMD Extensions provide SIMD instructions for the binary MIN and MAX and unary SQRT operators. SIMD versions of several other mathematical operators (like the trigonometric functions SIN, COS, TAN) are supported in software in a vector mathematical run-time library that is provided with the Intel C++ Compiler.
Strip Mining and Cleanup
Strip mining, also known as loop sectioning, is a loop transformation technique for enabling SIMD encodings of loops, as well as providing a means of improving memory performance. By fragmenting a large loop into smaller segments or strips, this technique transform
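The strip-mining idea named in the heading above can be sketched by hand as a main loop over whole strips followed by a cleanup loop. This is only an illustration of the transformation, not the compiler's generated code; the function name add_strip_mined and the strip length kStrip (four floats, matching one 128-bit register) are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adds b to a element-wise using a manually strip-mined loop.
void add_strip_mined(std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    constexpr std::size_t kStrip = 4;               // hypothetical strip length
    const std::size_t n = a.size();
    const std::size_t main_end = n - (n % kStrip);  // largest multiple of kStrip <= n

    // Main loop: advances one whole strip at a time. The inner loop has a
    // fixed trip count, which is the shape that maps onto one SIMD operation.
    for (std::size_t i = 0; i < main_end; i += kStrip) {
        for (std::size_t j = 0; j < kStrip; ++j) {
            a[i + j] += b[i + j];
        }
    }

    // Cleanup loop: handles the remaining n % kStrip elements one by one.
    for (std::size_t i = main_end; i < n; ++i) {
        a[i] += b[i];
    }
}
```

For a 10-element array the main loop covers elements 0 to 7 in two strips and the cleanup loop handles elements 8 and 9.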
251. h the processor type specified by the -x option.
Option  Optimizes Your Code for...
-axK  Intel Pentium III and compatible Intel processors.
-axW  Intel Pentium 4 and compatible Intel processors.
-axN  Intel Pentium 4 and compatible Intel processors. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-axB  Intel Pentium M and compatible Intel processors. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-axP  Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). This option also enables new optimizations in addition to Intel processor-specific optimizations.
Example
The compilation below will generate a single executable that includes:
• a generic version for use on any IA-32 processor
• a version optimized for Intel Pentium III processors, as long as there is a likely performance benefit
• a version optimized for Intel Pentium 4 processors, as long as there is a likely performance benefit
prompt> icpc -axKW prog.cpp
Manual CPU Dispatch (IA-32 only)
Use __declspec(cpu_specific) and __declspec(cpu_dispatch) in your code to generate instructions specific to the Intel processor on which the application is running, and also to execute correctly on other IA-32 processors.
Note: Manual CPU dispatch cannot be used to recognize Intel Itanium processors. The s
252. hanges, then the tool changes the color corresponding to the coverage condition of that portion of the code, and the coverage tool inserts the appropriate color change in the HTML files.
Note: You need to interpret the colors in the context of the code. For instance, comment lines that follow a basic block that was never executed would be colored in the same color as the uncovered blocks. Another example is the closing brackets in C/C++ applications.
Coverage Analysis of a Modules Subset
One of the capabilities of the Intel compiler Code-coverage Tool is efficient coverage analysis of an application's subset of modules. This analysis is accomplished based on the selected option -comp of the tool's execution. You can generate the profile information for the whole application, or a subset of it, and then divide the covered modules into different components and use the coverage tool to obtain the coverage information of each individual component. If only a subset of the application modules is compiled with the -prof_genx option, then the coverage information is generated only for those modules that are involved with this compiler option, thus avoiding the overhead incurred for profile generation of other modules.
To specify the modules of interest, use the tool's -comp option. This option takes the name of a file as its argument. That file must be a text file that includes the names of the modules or directories you would like to analyze: codec
253. he default optimizations (-O2) for phase 1, and specify more advanced optimizations with -ipo for phase 3. This example used -O2 in phase 1 and -O2 -ipo in phase 3.
Note: The compiler ignores the -ipo options with -prof_gen[x]. With the x qualifier, extra information is gathered.
PGO Environment Variables
The table below describes environment values to determine the directory to store dynamic information files or whether to overwrite pgopti.dpi. Refer to your operating system documentation for instructions on how to specify environment values.
Profile-guided Optimization Environment Variables
Variable  Description
PROF_DIR  Specifies the directory in which dynamic information files are created. This variable applies to all three phases of the profiling process.
PROF_NO_CLOBBER  Alters the feedback compilation phase slightly. By default, during the feedback compilation phase, the compiler merges the data from all dynamic information files and creates a new pgopti.dpi file if .dyn files are newer than an existing pgopti.dpi file. When this variable is set, the compiler does not overwrite the existing pgopti.dpi file. Instead, the compiler issues a warning and you must remove the pgopti.dpi file if you want to use additional dynamic information files.
Using profmerge to Relocate the Source Files
The compiler uses the full path to the source file to look up pr
254. he acosh function returns the inverse hyperbolic cosine of x.
errno: EDOM, for x < 1
Calling interface:
double acosh(double x); long double acoshl(long double x); float acoshf(float x);
ASINH
Description: The asinh function returns the inverse hyperbolic sine of x.
Calling interface:
double asinh(double x); long double asinhl(long double x); float asinhf(float x);
ATANH
Description: The atanh function returns the inverse hyperbolic tangent of x.
errno: EDOM, for |x| > 1
errno: ERANGE, for |x| = 1
Calling interface:
double atanh(double x); long double atanhl(long double x); float atanhf(float x);
COSH
Description: The cosh function returns the hyperbolic cosine of x, (e^x + e^-x)/2.
errno: ERANGE, for overflow conditions
Calling interface:
double cosh(double x); long double coshl(long double x); float coshf(float x);
SINH
Description: The sinh function returns the hyperbolic sine of x, (e^x - e^-x)/2.
errno: ERANGE, for overflow conditions
Calling interface:
double sinh(double x); long double sinhl(long double x); float sinhf(float x);
SINHCOSH
Description: The sinhcosh function returns both the hyperbolic sine and hyperbolic cosine of x.
errno: ERANGE, for overflow conditions
Calling interface:
void sinhcosh(double x, float *sinval, float *cosval); void sinhcoshl(long double x, long double *sinval, long double *cosval); vo
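The domain rules listed above can be checked before calling the functions, rather than relying on errno afterwards. The sketch below uses only the standard <cmath> functions; the helpers in_acosh_domain, in_atanh_domain, checked_acosh, and sinh_and_cosh are hypothetical names, and sinh_and_cosh is a stand-in written with two standard calls, not the Intel sinhcosh routine.

```cpp
#include <cassert>
#include <cmath>

// acosh is defined for x >= 1; smaller arguments are a domain error.
bool in_acosh_domain(double x) { return x >= 1.0; }

// atanh is finite for |x| < 1; |x| == 1 is a pole and |x| > 1 a domain error.
bool in_atanh_domain(double x) { return std::fabs(x) < 1.0; }

// Hypothetical wrapper that enforces the domain in debug builds.
double checked_acosh(double x) {
    assert(in_acosh_domain(x));
    return std::acosh(x);
}

// Hypothetical stand-in for a combined routine: both values for one argument.
struct SinhCosh {
    double sinh_value;
    double cosh_value;
};

SinhCosh sinh_and_cosh(double x) {
    return SinhCosh{std::sinh(x), std::cosh(x)};
}
```

Whether errno is actually set on a domain or range error depends on the math library's error-handling mode, which is why the sketch tests the argument instead.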
255. he if (scalar_logical_expression) clause is present, the enclosed code block is executed in parallel only if the scalar_logical_expression evaluates to TRUE. Otherwise the code block is serialized.
Specifies how iterations of the for loop are divided among the threads of the team.
Provides a mechanism to assign the same name to threadprivate variables for each thread in the team executing the parallel region.
OpenMP Support Libraries
The Intel C++ Compiler with OpenMP support provides a production support library, libguide.a. This library enables you to run an application under different execution modes. It is used for normal or performance-critical runs on applications that have already been tuned.
Note: The libguide library is linked dynamically, regardless of command-line options, to avoid performance issues that are hard to debug.
Execution Modes
The Intel compiler with OpenMP enables you to run an application under different execution modes that can be specified at run time. The libraries support the serial, turnaround, and throughput modes. These modes are selected by using the KMP_LIBRARY environment variable at run time.
Serial
The serial mode forces parallel applications to run on a single processor.
Turnaround
In a dedicated (batch or single user) parallel environment where all processors are exclusively allocated to the program for its entire run, it is most important to eff
256. he number of threads used to execute a parallel region. If dynamic_threads is TRUE, dynamic threads are enabled. If dynamic_threads is FALSE, dynamic threads are disabled. Dynamic threads are disabled by default.
Returns TRUE if dynamic thread adjustment is enabled, otherwise returns FALSE.
Enables or disables nested parallelism. If nested is TRUE, nested parallelism is enabled. If nested is FALSE, nested parallelism is disabled. Nested parallelism is disabled by default.
Returns TRUE if nested parallelism is enabled, otherwise returns FALSE.
Description (lock routines)
Initializes the lock associated with lock for use in subsequent calls.
Causes the lock associated with lock to become undefined.
Forces the executing thread to wait until the lock associated with lock is available. The thread is granted ownership of the lock when it becomes available.
Releases the executing thread from ownership of the lock associated with lock. The behavior is undefined if the executing thread does not own the lock associated with lock.
Attempts to set the lock associated with lock. If successful, returns TRUE, otherwise returns FALSE.
Function  Description
omp_init_nest_lock(lock)  Initializes the nested lock associated with lock for use in the subsequent calls.
omp_destroy_nest_lock(lock)  Causes the nested lock associated with lock to become undefined.
257. hifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count; r1 := a1 << count; ... r7 := a7 << count
__m128i _mm_sll_epi16(__m128i a, __m128i count)
Shifts the 8 signed or unsigned 16-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count; r1 := a1 << count; ... r7 := a7 << count
__m128i _mm_slli_epi32(__m128i a, int count)
Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count; r1 := a1 << count; r2 := a2 << count; r3 := a3 << count
__m128i _mm_sll_epi32(__m128i a, __m128i count)
Shifts the 4 signed or unsigned 32-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count; r1 := a1 << count; r2 := a2 << count; r3 := a3 << count
__m128i _mm_slli_epi64(__m128i a, int count)
Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count; r1 := a1 << count
__m128i _mm_sll_epi64(__m128i a, __m128i count)
Shifts the 2 signed or unsigned 64-bit integers in a left by count bits while shifting in zeros.
r0 := a0 << count; r1 := a1 << count
__m128i _mm_srai_epi16(__m128i a, int count)
Shifts the 8 signed 16-bit integers in a right by count bits while shifting in the sign bit.
r0 a
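The immediate-count and register-count forms of the 16-bit shifts listed above can be compared in a short sketch. It assumes an x86 target with SSE2 (emmintrin.h); the wrapper names shift_left_by_3, shift_left_var, shift_right_arith_by_2, and store8 are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 integer intrinsics

// Immediate form: every 16-bit lane is shifted left by 3, zero-filling.
inline __m128i shift_left_by_3(__m128i a) { return _mm_slli_epi16(a, 3); }

// Register form: the count is taken from the low 64 bits of a second register.
inline __m128i shift_left_var(__m128i a, int count) {
    assert(count >= 0);
    return _mm_sll_epi16(a, _mm_cvtsi32_si128(count));
}

// Arithmetic right shift: the sign bit is shifted in, so negatives stay negative.
inline __m128i shift_right_arith_by_2(__m128i a) { return _mm_srai_epi16(a, 2); }

// Hypothetical helper: copies the eight lanes to out[0..7], lane r0 first.
inline void store8(__m128i v, std::int16_t out[8]) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

For example, a lane holding -8 becomes -64 after the left shift by 3 and -2 after the arithmetic right shift by 2.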
258. hitecture Extensions Extensions
_mm_cmpeq_epi16  N/A  N/A  N/A  N/A
_mm_cmpeq_epi32  N/A  N/A  N/A  N/A
_mm_cmpgt_epi8  N/A  N/A  N/A  N/A
_mm_cmpgt_epi16  N/A  N/A  N/A  N/A
_mm_cmpgt_epi32  N/A  N/A  N/A  N/A
_mm_cmplt_epi8  N/A  N/A  N/A  N/A
_mm_cmplt_epi16  N/A  N/A  N/A  N/A
_mm_cmplt_epi32  N/A  N/A  N/A  N/A
_mm_cvtsi32_si128  N/A  N/A  N/A  N/A
_mm_cvtsi128_si32  N/A  N/A  N/A  N/A
_mm_packs_epi16  N/A  N/A  N/A  N/A
_mm_packs_epi32  N/A  N/A  N/A  N/A
_mm_packus_epi16  N/A  N/A  N/A  N/A
_mm_extract_epi16  N/A  N/A  N/A  N/A
_mm_insert_epi16  N/A  N/A  N/A  N/A
_mm_movemask_epi8  N/A  N/A  N/A  N/A
_mm_shuffle_epi32  N/A  N/A  N/A  N/A
_mm_shufflehi_epi16  N/A  N/A  N/A  N/A
_mm_shufflelo_epi16  N/A  N/A  N/A  N/A
_mm_unpackhi_epi8  N/A  N/A  N/A  N/A
_mm_unpackhi_epi16  N/A  N/A  N/A  N/A
_mm_unpackhi_epi32  N/A  N/A  N/A  N/A
Intrinsic (column headings partly lost: Across, MMX, All IA)
_mm_unpackhi_epi64, _mm_unpacklo_epi8, _mm_unpacklo_epi16, _mm_unpacklo_epi32, _mm_unpacklo_epi64, _mm_move_epi64, _mm_movpi64_epi64, _mm_movepi64_pi64, _mm_load_si128, _mm_loadu_si128, _mm_loadl_epi64, _mm_set_epi64, _mm_set_epi32, _mm_set_epi16, _mm_set_epi8, _mm_set1_epi64, _mm_set1_epi32, _mm_set1_epi16, _mm_set1_epi8, _mm_setr_epi64, _mm_setr_epi32, _mm_setr_epi16
N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
259. iagnostics and Messages
This section describes the various messages that the compiler produces. These messages include the sign-on message and diagnostic messages for remarks, warnings, or errors. The compiler always displays any diagnostic message, along with the erroneous source line, on the standard output. This section also describes how to control the severity of diagnostic messages.
Diagnostic Messages
• Display errors (same as -w)
• Display warnings and errors (DEFAULT)
• Display remarks, warnings, and errors
Language Diagnostics
These messages describe diagnostics that are reported during the processing of the source file. These diagnostics have the following format:
filename (linenum): type [#nn]: message
filename  Indicates the name of the source file currently being processed.
linenum  Indicates the source line where the compiler detects the condition.
type  Indicates the severity of the diagnostic message: warning, remark, error, or catastrophic error.
[#nn]  The number assigned to the error (or warning) message. Hard errors or catastrophes are not assigned a number.
message  Describes the diagnostic.
The following is an example of a warning message:
tantst.cpp(3): warning #328: Local variable "increment" never used.
The compiler can also display internal error messages on the standard error. If your compilation produces any internal errors, contact your Intel representative. Internal error messages are in the following form:
FATAL COM
260. icates the difference between serial regions and parallel regions.
main() {                    // Begin serial execution
  ...                       // Only the master thread executes
  #pragma omp parallel      // Begin a Parallel Construct, form a team
  {
    ...                     // This is Replicated Code (each team member executes the same code)
    #pragma omp sections    // Begin a Worksharing Construct
    {
      #pragma omp section   // One unit of work
      {...}
      #pragma omp section   // Another unit of work
      {...}
    }                       // Wait until both units of work complete
    ...                     // More Replicated Code
    #pragma omp for nowait  // Begin a Worksharing Construct; each iteration is a unit of work
    for (...) {             // Work is distributed among the team members
      ...
    }                       // End of Worksharing Construct; nowait was specified, so threads proceed
    #pragma omp critical    // Begin a Critical Section
    {...}                   // Replicated Code, but only one thread can execute it at a given time
    ...                     // More Replicated Code
    #pragma omp barrier     // Wait for all team members to arrive
    ...                     // More Replicated Code
  }                         // End of Parallel Construct; disband team and continue serial execution
  ...                       // Possibly more Parallel constructs
}                           // End serial execution
Compiling with OpenMP
Directive Format and Diagnostics
To run the Intel C++ Compiler in OpenMP
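A compilable version of the outline above might look like the following sketch. The function name parallel_outline and the work done in each construct are hypothetical, chosen so that the result is the same whether the pragmas are honored or, when OpenMP is not enabled, ignored and the code runs serially.

```cpp
#include <cassert>
#include <vector>

// Fills squares[i] = i * i in a worksharing loop and returns the sum of the
// array plus the two section results. All variables declared before the
// parallel construct are shared by the team.
int parallel_outline(std::vector<int>& squares) {
    int first = 0;
    int second = 0;
    bool critical_ran = false;
    const int n = static_cast<int>(squares.size());

    #pragma omp parallel                 // form a team
    {
        #pragma omp sections             // worksharing: one unit of work per section
        {
            #pragma omp section
            { first = 1; }
            #pragma omp section
            { second = 2; }
        }                                // implicit barrier: both units are complete

        #pragma omp for nowait           // iterations are divided among the team
        for (int i = 0; i < n; ++i) {
            squares[i] = i * i;
        }

        #pragma omp critical             // replicated code, one thread at a time
        { critical_ran = true; }

        #pragma omp barrier              // wait for all team members to arrive
    }                                    // disband team, continue serially

    assert(critical_ran);
    int sum = first + second;
    for (int v : squares) sum += v;
    return sum;
}
```

For a four-element vector the result is 0 + 1 + 4 + 9 + 1 + 2 = 17. The critical section only sets a flag, because replicated code that counted its own executions would give a result that depends on the team size.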
261. id sinhcoshf(float x, float *sinval, float *cosval);
TANH
Description: The tanh function returns the hyperbolic tangent of x, (e^x - e^-x) / (e^x + e^-x).
Calling interface:
double tanh(double x); long double tanhl(long double x); float tanhf(float x);
Exponential Functions
The Intel Math library supports the following exponential functions:
CBRT
Description: The cbrt function returns the cube root of x.
Calling interface:
double cbrt(double x); long double cbrtl(long double x); float cbrtf(float x);
EXP
Description: The exp function returns e raised to the x power, e^x. This function may be inlined by the Itanium compiler.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double exp(double x); long double expl(long double x); float expf(float x);
EXP10
Description: The exp10 function returns 10 raised to the x power, 10^x.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double exp10(double x); long double exp10l(long double x); float exp10f(float x);
EXP2
Description: The exp2 function returns 2 raised to the x power, 2^x.
errno: ERANGE, for underflow and overflow conditions
Calling interface:
double exp2(double x); long double exp2l(long double x); float exp2f(float x);
EXPM1
Description: The expm1 function returns e raised to the x power minus 1, e^x - 1.
errno: ERANGE, for overflow conditi
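A short sketch of why expm1 exists alongside exp, and of detecting the overflow condition mentioned above without relying on errno. It uses only standard <cmath> functions; the helper names growth, cube_edge, and exp_overflows are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// e^rate - 1 computed with expm1, which stays accurate for very small rates
// where exp(rate) - 1.0 would lose most of its significant digits.
double growth(double rate) { return std::expm1(rate); }

// Edge length of a cube with the given volume, via cbrt.
double cube_edge(double volume) {
    assert(volume >= 0.0);
    return std::cbrt(volume);
}

// True when exp(x) overflows to infinity for a finite argument.
bool exp_overflows(double x) {
    return std::isfinite(x) && std::isinf(std::exp(x));
}
```

For a rate of 1e-10, growth returns a value within rounding error of 1e-10, since the next term of the series is only about 5e-21.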
262. igned char A15, ... unsigned char A0)
Assignment Operator
Any Ivec object can be assigned to any other Ivec object; conversion on assignment from one Ivec object to another is automatic.
Assignment Operator Examples
  Is16vec4 A;
  Is8vec8 B;
  I64vec1 C;
  A = B;      /* assign Is8vec8 to Is16vec4 */
  B = C;      /* assign I64vec1 to Is8vec8 */
  B = A & C;  /* assign M64 result of '&' to Is8vec8 */
Logical Operators
The logical operators use the symbols and intrinsics listed in the following table.
Bitwise Operator Symbols
[Table damaged in extraction. Columns: Operation; Symbols (Standard, w/assign); Syntax Usage (Standard, w/assign); Corresponding Intrinsic. Legible entries: R = A & B; XOR; andnot with w/assign N/A and usage R = A andnot B; intrinsics _mm_and_si64, _mm_and_si128, and andnot.]
Logical Operators and Miscellaneous Exceptions
A and B converted to M64. Result assigned to Iu8vec8:
  I64vec1 A; Is8vec8 B; Iu8vec8 C;
  C = A & B;
Same size and signedness operators return the nearest common ancestor:
  I32vec2 R = Is32vec2 A ^ Iu32vec2 B;
A & B returns M64, which is cast to Iu8vec8:
  C = Iu8vec8(A & B) + C;
When A and B are of the same class, they return the same type. When A and B are of different classes, the return value is the return type of the nearest common ancestor. The logical operator returns values for combinations of classes listed
263. ile the selected tests would achieve the same total block coverage in only 41 minutes.
Note: The order of tests when prioritization is based on minimizing time (first Test 2, then Test 3) could be different than when prioritization is done based on minimizing the number of tests. See example above: first Test 3, then Test 2. In Example 2, Test 2 is the test that gives the highest coverage per execution time, so it is picked as the first test to run.
Using Other Options
The -cutoff option enables the Test-prioritization Tool to exit when it reaches a given level of basic block coverage.
tselect -dpi_list tests_list -spi pgopti.spi -cutoff 85.00
If the tool is run with the cutoff value of 85.00 in the above example, only Test3 will be selected, as it achieves 45.65% block coverage, which corresponds to 87.50% of the total block coverage that is reached from all three tests.
The Test-prioritization Tool does an initial merging of all the profile information to determine the total coverage that is obtained by running all the tests. The -nototal option enables you to skip this step. In such a case, only the absolute coverage information will be reported, as the overall coverage remains unknown.
PGO API: Profile Information Generation Support
Profile Information Generation Support lets you control the generation of profile information during the instrumented execution phase
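The effect of a coverage cutoff can be illustrated with a small greedy selection. This is a hypothetical sketch, not the algorithm used by the Test-prioritization Tool: the function prioritize, the use of integer block IDs, and the greedy "most new blocks first" rule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Greedily picks tests until the chosen tests cover at least cutoff_percent
// of the blocks covered by all tests together. Returns test indices in the
// order they were picked.
std::vector<std::size_t> prioritize(const std::vector<std::set<int>>& covered_by_test,
                                    double cutoff_percent) {
    assert(cutoff_percent >= 0.0 && cutoff_percent <= 100.0);

    // Total coverage: union of the blocks reached by any test.
    std::set<int> all;
    for (const auto& blocks : covered_by_test) all.insert(blocks.begin(), blocks.end());

    std::vector<std::size_t> order;
    std::vector<bool> used(covered_by_test.size(), false);
    std::set<int> reached;

    while (!all.empty() && 100.0 * reached.size() / all.size() < cutoff_percent) {
        std::size_t best = covered_by_test.size();
        std::size_t best_gain = 0;
        for (std::size_t t = 0; t < covered_by_test.size(); ++t) {
            if (used[t]) continue;
            std::size_t gain = 0;  // blocks this test adds to what is already reached
            for (int block : covered_by_test[t]) gain += reached.count(block) ? 0 : 1;
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        if (best == covered_by_test.size()) break;  // no remaining test adds coverage
        used[best] = true;
        order.push_back(best);
        reached.insert(covered_by_test[best].begin(), covered_by_test[best].end());
    }
    return order;
}
```

With three tests where the third alone reaches 7 of the 9 blocks covered overall (about 78%), a cutoff of 75 selects only that test, while a cutoff of 100 continues until every reachable block is covered.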
264. ility only; use kmp_set_stacksize_s for compatibility across different families of Intel processors.
Description
Allocate memory block of size bytes from thread-local heap.
Allocate array of nelem elements of size elsize from thread-local heap.
Reallocate memory block at address ptr and size bytes from thread-local heap.
Free memory block at address ptr from thread-local heap. Memory must have been previously allocated with kmp_malloc, kmp_calloc, or kmp_realloc.
Workqueuing Constructs
taskq Pragma
The taskq pragma specifies the environment within which the enclosed units of work (tasks) are to be executed. From among all the threads that encounter a taskq pragma, one is chosen to execute it initially. Conceptually, the taskq pragma causes an empty queue to be created by the chosen thread, and then the code inside the taskq block is executed single-threaded. All the other threads wait for work to be enqueued on the conceptual queue. The task pragma specifies a unit of work, potentially executed by a different thread. When a task pragma is encountered lexically within a taskq block, the code inside the task block is conceptually enqueued on the queue associated with the taskq. The conceptual queue is disbanded when all work enqueued on it finishes, and when the end of the taskq block is reached.
Control Structures
Many control structures exhibit the patt
265. ilo_constant_propagation  Intermediate Language Scalar Optimizer, constant propagation
ilo_copy_propagation  Intermediate Language Scalar Optimizer, copy propagation
ecg_software_pipelining  Code Generator, software pipelining
All optimization reports that have a matching prefix with the specified optimizer are generated. For example, if -opt_report_phase ilo_co is specified, a report from both the constant propagation and the copy propagation is generated.
The Availability of Report Generation
The -opt_report_help option lists the logical names of optimizers available for report generation.
Timing Your Application
How fast your application executes is one indication of performance. When timing the speed of applications, consider the following circumstances:
• Run program timings when other users are not active. Your timing results can be affected by one or more CPU-intensive processes also running while doing your timings.
• Try to run the program under the same conditions each time to provide the most accurate results, especially when comparing execution times of a previous version of the same program. Use the same system (processor model, amount of memory, version of the operating system, and so on) if possible.
• If you do need to change systems, you should measure the time using the same version of the program on both systems, so you know each system's effect on your timings.
• For programs that run for less than a f
266. ination to solve the data dependence problem in all dimensions.
Loop Constructs
Loops can be formed with the usual for and while constructs. However, the loops must have a single entry and a single exit to be vectorized.
Correct Usage
  while (i < n) {
    // If branch is inside body of loop
    a[i] = b[i] * c[i];
    if (a[i] < 0.0) a[i] = 0.0;
    i++;
  }
Incorrect Usage
  while (i < n) {
    if (condition) break;  // 2nd exit
    ++i;
  }
Loop Exit Conditions
Loop exit conditions determine the number of iterations that a loop executes. For example, fixed indexes for loops determine the iterations. The loop iterations must be countable; that is, the number of iterations must be expressed as one of the following:
• a constant
• a loop invariant term
• a linear function of outermost loop indices
Loops whose exit depends on computation are not countable. Examples below show countable and non-countable loop constructs.
Correct Usage for Countable Loop
  // Exit condition specified by "N-lb+1"
  count = N;
  ...
  while (count != lb) {
    // lb is not affected within loop
    a[i] = b[i] * x;
    b[i] = c[i] + sqrt(d[i]);
    --count;
  }
Correct Usage for Countable Loop
  // Exit condition is "(n-m+2)/2"
  i = 0;
  for (l = m; l < n; l += 2) {
    a[i] = b[i] * x;
    b[i] = c[i] + sqrt(d[i]);
    ++i;
  }
Incorrect Usage for Non-Countable Loop
  i = 0;
  // Iterations dependent on a[i]
  while (a[i] > 0.0) {
    a[i] = b[i] * c[i];
    ++i;
  }
Types of Loops Vectorized
For integer loops, MMX technolo
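The countable and non-countable cases above can be placed side by side in a self-contained sketch. The function names scale_sqrt and copy_while_positive are hypothetical, and the second loop adds a bounds check that the manual's fragment does not have, so that it is safe to run.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Countable: the trip count n is a loop-invariant value known before the
// loop starts, and there is a single exit.
void scale_sqrt(std::vector<float>& a, const std::vector<float>& b,
                const std::vector<float>& d, float x) {
    assert(a.size() == b.size() && a.size() == d.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] * x + std::sqrt(d[i]);
    }
}

// Non-countable: the exit depends on the data in a[i], so the number of
// iterations cannot be computed before the loop runs.
// Returns the number of elements that were overwritten.
std::size_t copy_while_positive(std::vector<float>& a, const std::vector<float>& b) {
    assert(b.size() >= a.size());
    std::size_t i = 0;
    while (i < a.size() && a[i] > 0.0f) {
        a[i] = b[i];
        ++i;
    }
    return i;
}
```

In the second function the loop stops at the first non-positive element, which is why a vectorizer cannot determine the iteration count in advance.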
267. initiate Interval Profile Dumping in an instrumented application. See the Recommended Usage of _PGOPTI_Set_Interval_Prof_Dump for more information.
High-level Language Optimizations (HLO)
High-level optimizations (HLO) exploit the properties of source code constructs, such as loops and arrays, in the applications developed in high-level programming languages such as C++. They include loop interchange, loop fusion, loop unrolling, loop distribution, unroll-and-jam, blocking, data prefetch, scalar replacement, data layout optimizations, and others. The option that turns on the high-level optimizations is -O3.
IA-32 and Itanium-based applications, -O3: Enable -O2 option plus more aggressive optimizations, for example, loop transformation and prefetching. -O3 optimizes for maximum speed, but may not improve performance for some programs.
IA-32 applications, -O3: In addition, in conjunction with the vectorization options -ax{K|W|N|B|P} and -x{K|W|N|B|P}, -O3 causes the compiler to perform more aggressive data dependency analysis than for -O2. This may result in longer compilation times.
Note: The -fast option enhances execution speed across the entire program by including the following options that can improve run-time performance:
• -O3 (maximum speed and high-level optimizations)
• -ipo (enables interprocedural optimizations across files)
• -static (prevents linking with shared libraries)
To override one of the options se
268. instrumented executables on Test_n.
[Figure residue: for each test, Merge Dynamic Profile Information combines the .dyn files into Test_1.dpi, Test_2.dpi, ... Test_n.dpi. Step 3: Run Test Prioritizer.]
Here are the steps for a simple example (myApp.c) for IA-32 systems:
1. Set PROF_DIR to the myApp prof dir directory.
2. Issue command: prompt> icpc -prof_genx myApp.c
This command compiles the program and generates an instrumented binary as well as the corresponding static profile information, pgopti.spi.
3. Issue command: rm $PROF_DIR/*.dyn
Make sure that there are no unrelated .dyn files present.
4. Issue command: myApp data1
Invocation of this command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR.
5. Issue command: profmerge -prof_dpi Test1.dpi
At this step, the profmerge tool merges all the .dyn files into one file (Test1.dpi) that represents the total profile information of the application on Test 1.
6. Issue command: rm $PROF_DIR/*.dyn
Make sure that there are no unrelated .dyn files present.
7. Issue command: myApp data2
This command runs the instrumented application and generates one or more new dynamic profile information files that have an extension .dyn in the directory specified by PROF_DIR.
8. Issue command: profmerge -prof_dpi
269. int(double x); long double nearbyintl(long double x); float nearbyintf(float x);
RINT
Description: The rint function returns the rounded integral value as a floating-point number, using the current rounding direction.
Calling interface:
double rint(double x); long double rintl(long double x); float rintf(float x);
ROUND
Description: The round function returns the nearest integral value as a floating-point number. Halfway cases are rounded away from zero.
Calling interface:
double round(double x); long double roundl(long double x); float roundf(float x);
TRUNC
Description: The trunc function returns the truncated integral value as a floating-point number.
Calling interface:
double trunc(double x); long double truncl(long double x); float truncf(float x);
Remainder Functions
The Intel Math library supports the following remainder functions:
FMOD
Description: The fmod function returns the value x - n*y for integer n such that, if y is nonzero, the result has the same sign as x and magnitude less than the magnitude of y.
errno: EDOM, for y = 0
Calling interface:
double fmod(double x, double y); long double fmodl(long double x, long double y); float fmodf(float x, float y);
REMAINDER
Description: The remainder function returns the value of x REM y as required by the IEEE standard.
Calling interface:
double remainder(double x, double y); long double remaind
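The differences between rint, round, and trunc, and between fmod and remainder, show up on halfway and negative inputs. The sketch below uses the standard <cmath> functions and assumes the default round-to-nearest-even rounding direction; the names Rounded, round_three_ways, Split, and split_by are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// The three rounding functions applied to the same input.
struct Rounded {
    double rint_value;   // current rounding direction (nearest-even by default)
    double round_value;  // halfway cases away from zero
    double trunc_value;  // toward zero
};

Rounded round_three_ways(double x) {
    return Rounded{std::rint(x), std::round(x), std::trunc(x)};
}

// Splits x into a whole number of steps plus a rest that has the sign of x,
// using fmod: x == whole * step + rest.
struct Split {
    double whole;
    double rest;
};

Split split_by(double x, double step) {
    assert(step != 0.0);  // fmod with y == 0 is a domain error
    const double rest = std::fmod(x, step);
    return Split{(x - rest) / step, rest};
}
```

For 2.5 the three functions give 2, 3, and 2. fmod(-7, 3) is -1 because the result takes the sign of x, while remainder(7, 4) is -1 because the quotient 1.75 is rounded to the nearest integer, 2.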
270. int in hex format
[7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0
Corresponding Intrinsics: none
The four 16-bit values of A are placed in the output buffer and printed in the following format (default in decimal):
cout << Is16vec4 A; cout << Iu16vec4 A; cout << hex << Iu16vec4 A; /* print in hex format */
[3]:A3 [2]:A2 [1]:A1 [0]:A0
Corresponding Intrinsics: none
The sixteen 8-bit values of A are placed in the output buffer and printed in the following format (default is decimal):
cout << Is8vec16 A; cout << Iu8vec16 A; cout << hex << Iu8vec8 A; /* print in hex format instead of decimal */
[15]:A15 [14]:A14 [13]:A13 [12]:A12 [11]:A11 [10]:A10 [9]:A9 [8]:A8 [7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0
Corresponding Intrinsics: none
The eight 8-bit values of A are placed in the output buffer and printed in the following format (default is decimal):
cout << Is8vec8 A; cout << Iu8vec8 A; cout << hex << Iu8vec8 A; /* print in hex format instead of decimal */
[7]:A7 [6]:A6 [5]:A5 [4]:A4 [3]:A3 [2]:A2 [1]:A1 [0]:A0
Corresponding Intrinsics: none
Element Access Operators
int R = Is64vec2 A[i];
unsigned int R = Iu64vec2 A[i];
int R = Is32vec4 A[i];
unsigned int R = Iu32vec4 A[i];
int R = Is32vec2 A[i];
unsigned int R = Iu32vec2 A[i];
short R = Is16vec8 A[i];
unsigned short R = Iu16vec8 A[i];
short R = Is16vec4 A[i];
unsigned shor
271. invoked with the same compiler that produced the IL for the library, then the compiler can extract the IL from the library and use it to optimize the program.
Creating a Multifile IPO Executable
This topic describes how to create a multifile IPO executable for compilations targeted for IA-32 and Itanium-based systems. If you separately compile and link your source modules with -ipo:
1. Compile with -ipo as follows:
prompt> icpc -ipo -c a.cpp b.cpp c.cpp
Use the -c option to stop compilation after generating .o files. Each object file has the IR for the corresponding source file.
2. With preceding results, you can now optimize interprocedurally:
prompt> icpc -ipo a.o b.o c.o
Multifile IPO is applied only to modules that have an IR; otherwise the object file passes to the link stage.
For efficiency, combine steps 1 and 2:
prompt> icpc -ipo a.cpp b.cpp c.cpp
See Using Profile-Guided Optimization: An Example for a description of how to use multifile IPO with profile information for further optimization.
Creating a Multifile IPO Executable with xild
The Intel linker, xild, performs the following steps:
• invokes the Intel compiler to perform multifile IPO if objects containing IR are found
• invokes the GNU linker, ld, to link the application
The command-line syntax for xild is:
prompt> xild [<options>] <LINK_commandline>
where:
• [<options>] (optional) may include any gcc linker options or options supporte
272. ion a(j) is used within a loop, by placing prefetch a in front of the loop, the compiler will insert prefetches for a(j + d) within the loop, where d is determined by the compiler. This directive is supported when option -O3 is on.
Example of prefetch Directive
  #pragma noprefetch b
  #pragma prefetch a
  for (i = 0; i < m; i++) { ... }
Vectorization Support (IA-32)
The vector directives control the vectorization of the subsequent loop in the program, but the compiler does not apply them to nested loops. Each nested loop needs its own directive preceding it. You must place the vector directive before the loop control statement.
vector always Directive
The vector always directive instructs the compiler to override any efficiency heuristic during the decision to vectorize or not, and will vectorize non-unit strides or very unaligned memory accesses.
Example of vector always Directive
  #pragma vector always
  for (i = 0; i < N; i++) { a[32*i] = b[99*i]; }
ivdep Directive
The ivdep directive instructs the compiler to ignore assumed vector dependences. To ensure correct code, the compiler treats an assumed dependence as a proven dependence, which prevents vectorization. This directive overrides that decision. Use ivdep only when you know that the assumed loop dependences are safe to ignore. The loop in the example below will not vectorize without the ivdep, since the value of k is not known (v
273. ionmode
• -IPF_flt_eval_method0
• -IPF_fltacc[-] (Default: -IPF_fltacc-)
Flush Denormal Results to Zero
Use the -ftz option to flush denormal results to zero.
Contraction of FP Multiply and Add/Subtract Operations
-IPF_fma[-] enables/disables the contraction of floating-point multiply and add/subtract operations into a single operation. Unless -mp is specified, the compiler contracts these operations whenever possible. The -mp option disables the contractions. Use -IPF_fma and -IPF_fma- to override the default compiler behavior. For example, a combination of -mp and -IPF_fma enables the compiler to contract operations:
prompt> icpc -mp -IPF_fma prog.cpp
FP Speculation
-IPF_fp_speculationmode sets the compiler to speculate on floating-point operations in one of the following modes:
• fast: sets the compiler to speculate on floating-point operations
• safe: enables the compiler to speculate on floating-point operations only when it is safe
• strict: disables the speculation of floating-point operations
• off: disables the speculation on floating-point operations
Note: -IPF_fp_speculationsafe is the default when -O0 is specified.
FP Operations Evaluation
-IPF_flt_eval_method0 directs the compiler to evaluate the expressions involving floating-point operands in the precision indicated by the variable types declared in the program.
Controlling Accuracy of the FP Results
-IPF_fltacc[-] enables/disables optim
274. ions for syntax and return values. Fvec Classes Syntax Notation: Fvec classes use the syntax conventions shown in the following examples. [Fvec_Class] R = [Fvec_Class] A [operator] [Fvec_Class] B; Example 1: F64vec2 R = F64vec2 A & F64vec2 B; [Fvec_Class] R = [operator]([Fvec_Class] A, [Fvec_Class] B); Example 2: F64vec2 R = andnot(F64vec2 A, F64vec2 B); [Fvec_Class] R [operator]= [Fvec_Class] A; Example 3: F64vec2 R &= F64vec2 A; where [operator] is an operator (for example, &, |, or ^), [Fvec_Class] is any Fvec class (F64vec2, F32vec4, or F32vec1), and R, A, B are declared Fvec variables of the type indicated. Return Value Notation: Because the Fvec classes have packed elements, the return values typically follow the conventions presented in the Return Value Convention Notation Mappings table below. F32vec4 returns four single-precision floating-point values (R0, R1, R2, and R3); F64vec2 returns two double-precision floating-point values; and F32vec1 returns the lowest single-precision floating-point value (R0). Return Value Convention Notation Mappings F32vec1 Example 1 Example 2 R0 A0 andnot R1 A1 andnot R2 A2 andnot Data Alignment: Memory operations using the Streaming SIMD Extensions should be performed on 16-byte-aligned data whenever possible. F32vec4 and F64vec2 object variables are properly aligned by default. Note that floatin
275. isables the insertion of software prefetching by the compiler Default is prefetch By default the Intel compiler creates 64 bit profiling counters dyn and dpi This option creates 32 bit counters for compatibility with the Intel C Compiler 7 0 Link Intel 1ibcxa C library dynamically Link Intel 1ibcxa C library statically Strict ANSI conformance dialect Direct linker to read link commands from file Manual use of precompiled header filename pchi Enable a mode in which a shorter form of the diagnostic output is used When enabled the original source line is not displayed and the error message text is not wrapped when too long to fit on a single line Default OFF OFF OFF OFF OFF ON OFF ON OFF OFF OFF OFF OFF Compiler Options Quick Reference Option Description Default Wcheck Performs compile time code OFF checking for code that exhibits non portable behavior represents a possible unintended code sequence or possibly affects operation of the program because of a quiet change in the ANSI C Standard Print diagnostics for 64 bit porting On D Generates specialized code for OFF Intel Pentium M and compatible Intel processors Generates specialized code for OFF Intel Pentium 4 and compatible Intel processors xP Generates specialized code for the OFF Intel Pentium 4 processor with Streaming SIMD Extensions 3
276. isted in the following table, followed by a description of each intrinsic with the most recent mnemonic naming convention. The alternate name is provided in case you have used these intrinsics before. The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file. Intrinsic Name / Alternate Name / Corresponding Instruction: _mm_cvt_ss2si / _mm_cvtss_si32 / CVTSS2SI; _mm_cvt_ps2pi / _mm_cvtps_pi32 / CVTPS2PI; _mm_cvtt_ss2si / _mm_cvttss_si32 / CVTTSS2SI; _mm_cvtt_ps2pi / _mm_cvttps_pi32 / CVTTPS2PI; _mm_cvt_si2ss / _mm_cvtsi32_ss / CVTSI2SS; _mm_cvt_pi2ps / _mm_cvtpi32_ps / CVTPI2PS; _mm_cvtpi16_ps (composite); _mm_cvtpu16_ps (composite); _mm_cvtpi8_ps (composite); _mm_cvtpu8_ps (composite); _mm_cvtpi32x2_ps (composite); _mm_cvtps_pi16 (composite); _mm_cvtps_pi8 (composite). int _mm_cvt_ss2si(__m128 a): Convert the lower SP FP value of a to a 32-bit integer according to the current rounding mode. r := (int)a0 __m64 _mm_cvt_ps2pi(__m128 a): Convert the two lower SP FP values of a to two 32-bit integers according to the current rounding mode, returning the integers in packed form. r0 := (int)a0, r1 := (int)a1 int _mm_cvtt_ss2si(__m128 a): Convert the lower SP FP value of a to a 32-bit integer with truncation. r := (int)a0 __m64 _mm_cvtt_ps2pi(__m128 a): Convert the two lower SP FP values of a to two 32-bit integers with truncation, returning the integers in packed
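The difference between the rounding and the truncating scalar conversions listed in this entry can be seen in a short sketch. It assumes an x86 target with SSE enabled and the processor's default rounding mode (round to nearest even); the wrapper names to_int_rounded and to_int_truncated are hypothetical, and the intrinsics are called by the alternate names given in the table.

```cpp
#include <xmmintrin.h>

// Converts the low SP FP value using the current rounding mode
// (round to nearest even unless the mode has been changed).
int to_int_rounded(float x) {
    return _mm_cvtss_si32(_mm_set_ss(x));
}

// Converts the low SP FP value with truncation toward zero,
// independent of the current rounding mode.
int to_int_truncated(float x) {
    return _mm_cvttss_si32(_mm_set_ss(x));
}
```

For example, 2.7f converts to 3 with the rounding form and to 2 with the truncating form.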
277. istics: functions that will not return, whether exceptions are thrown, the stack needs alignment, or alignment of arguments. Multifile optimization: Affects the same aspects as -ip, but across multiple files. IA-32 applications only: Optimization / Affected Aspect of Program: Passing arguments in registers / Calls, register usage. Inline function expansion is one of the main optimizations performed by the interprocedural optimizer. For function calls that the compiler believes are frequently executed, the compiler might decide to replace the instructions of the call with code for the function itself (inline the call). With -ip, the compiler performs inline function expansion for calls to functions defined within the current source file. However, when you use -ipo to specify multifile IPO, the compiler performs inline function expansion for calls to functions defined in separate files. For this reason, it is important to compile the entire application or multiple related source files together when you specify -ipo. The IPO optimizations are disabled by default. Interprocedural Optimization Options: -ip, -ip_no_inlining, -ipo, -ipo_c, -ipo_obj, -ipo_S, -inline_debug_info, -nolib_inline. Description: Enables interprocedural optimizations for single file compilation. Disables inlining that would result from the -ip interprocedural optim
278. itional Select Operators Corresponding Intrinsics and Classes Part 1 Operators Corresponding I32vec4 I16vec8 I8vec16 I32vec2 I16vec4 l8vec8 Intrinsic select_eq mm cmpeq x epil epi8 pi32 pil6 pi8 mm and y i si128 si128 si64 si64 si64 mm andnot y i si128 si128 si64 si64 si64 _mm_or_ y i si128 si128 si64 si64 si64 select neq mm cmpeq x epil epi8 pi32 pil6 pi8 mm and y si128 si128 si64 si64 si64 mm andnot y si128 si128 si64 si64 si64 _mm_or_ y si128 si128 si64 si64 si64 select_gt mm cmpgt x epil epi8 pi32 pil6 pi8 mm and y si128 si128 si64 si64 si64 mm andnot y si128 si128 si64 si64 si64 _mm_or_ y si128 si128 si64 si64 si64 select_ge mm cmpge x mm and y mm andnot y mm or pil6 epi8 pi32 pil6 pi8 1128 si128 si64 si64 si64 1128 si128 si64 si64 si64 1128 si128 si64 si64 si64 Qoo select lt mm cmpl epil epi8 pi32 pil6 pi8 mm and si128 si128 s1i64 s1i64 s1i64 mm andnot y si128 si128 si64 si64 si64 Jmm or si128 si128 si64 s1i64 si64 select_le epil6 epi8 pi32 pil pis si128 sil28 si64 si64 si64 si128 sil28 si64 si64 si64 si128 sil28 si64 si64 si64 select ngt mm cmpgt x N A N A N A N A N A N A select nge mm cmpge x N A N A N A N A N A N A select nlt mm cmplt x N A N A N A N A N A N A select nle mm cmple x N A N A N A N A N A N A
279. ity, you can code parallel constructs at the top levels of your program and use directives to control execution in any of the called routines. For example: int main(void) { #pragma omp parallel { phase1(); } } void phase1(void) { #pragma omp for private(i) shared(n) for (i = 0; i < n; i++) { some_work(i); } } This is an orphaned directive because the parallel region is not lexically present. Data Environment Directive: A data environment directive controls the data environment during the execution of parallel constructs. You can control the data environment within parallel and worksharing constructs. Using directives and data environment clauses on directives, you can: privatize scope variables by using the THREADPRIVATE directive; control data scope attributes by using the THREADPRIVATE directive's clauses. The data scope attribute clauses are: COPYIN, DEFAULT, PRIVATE, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, SHARED. You can use several directive clauses to control the data scope attributes of variables for the duration of the construct in which you specify them. If you do not specify a data scope attribute clause on a directive, the default is SHARED for those variables affected by the directive. Pseudo Code of the Parallel Processing Model: A sample pseudo program using some of the more common OpenMP directives is shown in the code example that follows. This example also ind
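The orphaned directive from this entry can be written out as a small compilable sketch. The names run_phases and the squared-index body are hypothetical stand-ins for main and some_work(i). The sketch leaves out the shared(n) clause, because the OpenMP specification does not list shared among the clauses accepted by a for worksharing directive. If OpenMP is not enabled at compile time, the pragmas are ignored and the loop runs serially with the same result.

```cpp
#include <vector>

// The worksharing loop has no lexically enclosing parallel region in this
// function: it is an orphaned directive that binds to whatever parallel
// region is active in the caller at run time.
void phase1(std::vector<int>& out) {
    const int n = static_cast<int>(out.size());
    int i;
#pragma omp for private(i)
    for (i = 0; i < n; i++) {
        out[i] = i * i;  // stand-in for some_work(i)
    }
}

// The parallel region lives here, one call level above the worksharing loop.
void run_phases(std::vector<int>& out) {
#pragma omp parallel
    {
        phase1(out);
    }
}
```

Each iteration writes a distinct element, so the threads of the team never touch the same location.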
280. ization, but has no effect on other interprocedural optimizations. Enables interprocedural optimizations across files. Generates a multifile object file that can be used in further link steps. Forces the compiler to create real object files when used with -ipo. Generates a multifile assemblable file named ipo_out.asm that can be used in further link steps. Preserve the source position of inlined code instead of assigning the call-site source position to inlined code. Disables inline expansion of standard library functions. Using -ip or -ipo with -Qoption Specifiers: Use -Qoption with the applicable keywords to select particular inline expansions and loop optimizations. The option must be entered with a -ip or -ipo specification, as follows: prompt> icpc -ip -Qoption,tool,opts where tool is C++ (c) and opts are -Qoption specifiers (see below). -Qoption Specifiers: If you specify -ip or -ipo without any -Qoption qualification, the compiler expands functions in line, propagates constant arguments, passes arguments in registers, and monitors function-level static variables. You can refine interprocedural optimizations by using the following -Qoption specifiers. To have an effect, the -Qoption option must be entered with either -ip or -ipo also specified, as in this example: prompt> icpc -ip -Qoption,c,-ip_specifier where -ip_specifier is one of the specifiers described in the table below
281. izations that affect floating-point accuracy. By default (-IPF_fltacc-), the compiler may apply optimizations that reduce floating-point accuracy. You may use -IPF_fltacc or -mp to improve floating-point accuracy, but at the cost of disabling some optimizations. Optimizing for Specific Processors. Processor Optimization for IA-32 only: The -tpp{5|6|7} options optimize your application's performance for a specific Intel processor. The resulting binary will also run on the other processors listed in the table below. The Intel C++ Compiler includes gcc-compatible versions of the -tpp options. These options are listed in the gcc Version column. Option / gcc Version / Optimizes for: -tpp5 / -mcpu=pentium / Intel Pentium processors; -tpp6 / -mcpu=pentiumpro / Intel Pentium Pro, Intel Pentium II, and Intel Pentium III processors; -tpp7 / -mcpu=pentium4 / Intel Pentium 4 processors, Intel Pentium M processors, and Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). Note: The -tpp7 option is ON by default. Example: The invocations listed below all result in a compiled binary optimized for Pentium 4. The same binary will also run on Pentium, Pentium Pro, Pentium II, and Pentium III processors. prompt> icpc prog.cpp prompt> icpc -tpp7 prog.cpp prompt> icpc -mcpu=pentium4 prog.cpp Processor Optimization (Itanium-based Systems only): The -tpp{1|2} options o
282. izer by following these coding guidelines: Expose the trip count of loops whenever possible. Specifically, use constants where the trip count is known and save loop parameters in local variables. Avoid placing structures inside loop bodies that the compiler may assume to carry dependent data, for example, function calls, ambiguous indirect references, or global references. Auto-parallelization Data Flow: For auto-parallelization processing, the compiler performs the following steps: 1. Data flow analysis 2. Loop classification 3. Dependence analysis 4. High-level parallelization 5. Data partitioning 6. Multi-threaded code generation. These steps include: Data flow analysis: compute the flow of data through the program. Loop classification: determine loop candidates for parallelization based on correctness and efficiency, as shown by threshold analysis. Dependence analysis: compute the dependence analysis for references in each loop nest. High-level parallelization: analyze dependence graph to determine loops which can execute in parallel; compute run-time dependency. Data partitioning: examine data reference and partition based on the following types of access: shared, private, and firstprivate. Multi-threaded code generation: modify loop parameters; generate entry/exit per threaded task; generate calls to parallel runtime routines for
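The coding guidelines at the start of this entry (expose the trip count, keep loop parameters in local variables, keep calls and indirect references out of the loop body) might look like this in practice. This is only a sketch: the Grid type and the scale function are hypothetical names introduced for illustration, and whether a given compiler actually parallelizes the loop depends on its own analysis.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container used only for this illustration.
struct Grid {
    std::vector<double> cells;
    std::size_t width;
};

void scale(Grid& g, double factor) {
    const std::size_t n = g.width;   // trip count saved in a local variable
    double* cells = g.cells.data();  // base pointer hoisted out of the loop
    for (std::size_t i = 0; i < n; ++i) {
        cells[i] *= factor;          // no calls or global references in the body
    }
}
```

Because n and cells are locals that the loop never modifies, the trip count is fixed before the loop starts and each iteration touches only its own element.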
283. k bits of PSR Maps to the rsm imm24 instruction 295 Intel C Compiler for Linux Systems User s Guide Conversion Intrinsics The prototypes for these intrinsics are in the ia64intrin h header file LI IEEE cco int64 m to int64 m64 a Convert a of type m64 to type int64 Translates to nop since both types reside in the same register on Itanium based systems m64 m from int64 int64 a Convert a of type__int 64 to type ___m64 Translates to nop since both types reside in the same register on Itanium based systems int64 Convert its double precision argument to a round double to int64 double signed integer d unsigned int64 Map the get f exp instruction and return getf exp double d the 16 bit exponent and the sign of its operand Register Names for getReg and setReg The prototypes for getReg and setReg intrinsics are in the ia64regs h header file General Integer Registers 296 Intel C Intrinsics Reference Application Registers R_BSP R_BSPSTORE R_RNAT A Tj z TD 3 D Wei J E R CCV R_UNAT R_FPSR 297 Intel C Compiler for Linux Systems User s Guide Control Registers Ls TA64_REG_CR_DCR IA64 REG CR ITM IA64 REG CR IVA IA64 REG CR PTA IA64 REG CR IPSR IA64 R
284. kCvrg %FncCvrg Test Name @ Options: 1 87.50 45.65 37.50 Test3.dpi; 2 100.00 52.17 50.00 Test2.dpi. In this example, the Test-prioritization Tool has provided the following information: By running all three tests, we achieve 52.17% block coverage and 50.00% function coverage. Test3 covers 45.65% of the basic blocks of the application, which is 87.50% of the total block coverage that can be achieved from all three tests. By adding Test2, we achieve a cumulative block coverage of 52.17%, or 100% of the total block coverage of Test1, Test2, and Test3. Elimination of Test1 has no negative impact on the total block coverage. Example 2: Minimizing Execution Time. Suppose we have the following execution time of each test in the tests_list file: Test1.dpi 00:00:60:35, Test2.dpi 00:00:10:15, Test3.dpi 00:00:30:45. The following command executes the Test-prioritization Tool to minimize the execution time with the -mintime option: tselect -dpi_list tests_list -spi pgopti.spi -mintime Here is a sample output: Total number of tests: 3; Total block coverage: 52.17; Total function coverage: 50.00; Total execution time: 1:41:35. num elapsedTime %RatCvrg %BlkCvrg %FncCvrg Test Name @ Options: 1 10:15 75.00 39.13 25.00 Test2.dpi; 2 41:00 100.00 52.17 50.00 Test3.dpi. In this case, the results indicate that running all tests sequentially would require one hour, 41 minutes, and 35 seconds, wh
285. lasses provide parallelism which is not easily implemented using typical mechanisms of C The following table shows how the Intel C SIMD classes use the classes and libraries SIMD Vector Classes EE Instruction Set Signedness Data Size Elements Header Type File MMX I64vecl unspecified m64 64 1 ivec h technology available for IA 32 and Itantum based systems I32vec2 unspecified int 32 2 ivec h Is32vec2 signed int 32 2 ivec h Iu32vec2 unsigned int 32 2 ivec h Il6vec4 unspecified short 16 4 ivec h Isl6vec4 signed short 16 4 ivec h Iul6vec4 unsigned short 16 4 ivec h I8vec8 unspecified char 8 8 ivec h Is8vec8 signed char 8 8 ivec h Iu8vec8 unsigned char 8 8 ivec h m Streaming SIMD F32vec4 signed float 32 4 fvec h Extensions available for IA 32 and Itanium based systems m F32vec1 signed float 32 1 fvec h Streaming SIMD F64vec2 signed double 64 2 dvec h Extensions 2 available for IA 32 based systems only 335 WT Instruction Set Class Signedness Data Size Elements Header Type File T 1128vecl unspecified __mi28i 128 1 dvec h I64vec2 unspecified long 64 4 dvec h int Is64vec2 long 64 4 dvec h int Iu64vec2 unsigned long 32 4 dvec h int I32vec4 unspecified int 32 4 dvec h Is32vec4 signed int 32 4 dvec h Iu32vec4 unsigned int 32 4 dvec h I16vec8 unspecified int 16 8 dvec h Isl6vec8 signed int 16 8 dvec h l
286. layed; n = 1: display diagnostics indicating loops successfully vectorized (default); n = 2: same as n = 1, plus diagnostics indicating loops not successfully vectorized; n = 3: same as n = 2, plus additional information about any proven or assumed dependences. Usage: If you use -c, -ipo with the -vec_report{n} option, or -c, -x{K|W|N|B|P} or -ax{K|W|N|B|P} with -vec_report{n}, the compiler issues a warning and no report is generated. To produce a report when using the aforementioned options, you need to add the -ipo_obj option. The combination of -c and -ipo_obj produces a single file compilation, and hence does generate object code, and eventually a report is generated. The following commands generate a vectorization report: prompt> icpc -x{K|W|N|B|P} -vec_report3 file.cpp; prompt> icpc -x{K|W|N|B|P} -ipo -ipo_obj -vec_report3 file.cpp; prompt> icpc -c -x{K|W|N|B|P} -ipo -ipo_obj -vec_report3 file.cpp. The following commands do not generate a vectorization report: prompt> icpc -c -x{K|W|N|B|P} -vec_report3 file.cpp; prompt> icpc -x{K|W|N|B|P} -ipo -vec_report3 file.cpp; prompt> icpc -c -x{K|W|N|B|P} -ipo -vec_report3 file.cpp. Loop Parallelization and Vectorization: Combining the -parallel and -x{K|W|N|B|P} options instructs the compiler to attempt both automatic loop parallelization and automatic loop vectorization in the same compilation. In
287. le or 4 32 bit single precision operations per cycle Key to the table entries e A Expected to give significant performance gain over non intrinsic based code equivalent e B Non ntrinsic based source code would be better the intrinsic s implementation may map directly to native instructions but they offer no significant performance gain e C Requires contorted implementation for particular microarchitecture Will result in very poor performance if used Intrinsic Name mm_add_ss mm_add_ps mm_sub_ss mm_sub_ps mm_mul_ss _mm_mul_ps mm_div_ss mm div ps mm sqrt ss mm sqrt ps mm rcp ss mm rcp ps mm rsqrt ss mm rsqrt ps mm min ss mm min ps mm max sS mm max ps 318 Alternate Name Across MMX TM All IA N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A Streaming Itanium amp Technology SIMD Architecture Extensions Streaming SIMD Extensions 2 N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A Intel C Intrinsics Reference Intrinsic Name mm mm _and_ps _andnot_ps _or_ps _XOr_ps __cmpeq_ss _cmpeq_ps cmplt_ss cmplt_ps cmple_ss cmple_ps cmpgt_ss cmpgt_ps cmpge_ss cmpge_ps cmpneq_ss cmpneq_ps cmpnit_ss cmpnit_ps cmpnle ss cmpnle ps cmpngt s
288. le specifies the interfaces to these routines. The names for the routines are in user name space. The omp.h and omp_lib.h header files are provided in the INCLUDE directory of your compiler installation. There are definitions for two different locks, omp_lock_kind and omp_nest_lock_kind, which are used by the functions in the table that follows. Execution Environment Routines (Function: Description): omp_set_num_threads(nthreads): Sets the number of threads to use for subsequent parallel regions. omp_get_num_threads(): Returns the number of threads that are being used in the current parallel region. omp_get_max_threads(): Returns the maximum number of threads that are available for parallel execution. omp_get_thread_num(): Returns the unique thread number of the thread currently executing this section of code. omp_get_num_procs(): Returns the number of processors available to the program. omp_in_parallel(): Returns TRUE if called within the dynamic extent of a parallel region executing in parallel; otherwise returns FALSE. omp_set_dynamic(dynamic_threads): Enables or disables dynamic adjustment of the number of threads. omp_get_dynamic(); omp_set_nested(nested); omp_get_nested(). Lock Routines (Function): omp_init_lock(lock), omp_destroy_lock(lock), omp_set_lock(lock), omp_unset_lock(lock), omp_test_lock(lock) t
289. lect nle F64vec2 A mm cmpnle pd float F32vecl R select nl F32vec1 A mm cmpnle ss Compare for Not Greater Than 4 floats F32vecl R select ngt F32vec4 A mm cmpngt ps 2doubles FO4vec2 R select ngt F64vec2 A mm cmpngt pd float F32vecl R select ngt F32vecl A mm cmpngt ss Compare for Not Greater Than or Equal 4 floats F32vecl R select nge F32vec4 A mm cmpnge ps 2 doubles F64vec2 R select nge F64vec2 A mm cmpnge pd 378 Intel C Intrinsics Reference 1 float F32vecl R select nge F32vecl A mm cmpnge ss Cacheability Support Operations Stores non temporal the two double precision floating point values of A Requires a 16 byte aligned address void store nta double p F64vec2 A Corresponding intrinsic mm stream pd Stores non temporal the four single precision floating point values of A Requires a 16 byte aligned address void store nta float p F32vec4 A Corresponding intrinsic mm stream ps Debugging The debug operations do not map to any compiler intrinsics for MMX TM technology or Streaming SIMD Extensions They are provided for debugging programs only Use of these operations may result in loss of performance so you should not use them outside of debugging Output Operations The two single double precision floating point values of A are placed in the output buffer and printed in decimal format as follo
290. [1] := a1, p[2] := a2, p[3] := a3 void _mm_storeu_ps(float *p, __m128 a): Stores four SP FP values. The address need not be 16-byte-aligned. p[0] := a0, p[1] := a1, p[2] := a2, p[3] := a3 void _mm_storer_ps(float *p, __m128 a): Stores four SP FP values in reverse order. The address must be 16-byte-aligned. p[0] := a3, p[1] := a2, p[2] := a1, p[3] := a0 __m128 _mm_move_ss(__m128 a, __m128 b): Sets the low word to the SP FP value of b. The upper 3 SP FP values are passed through from a. r0 := b0, r1 := a1, r2 := a2, r3 := a3 Cacheability Support Using Streaming SIMD Extensions: The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file. void _mm_pause(void): The execution of the next instruction is delayed an implementation-specific amount of time. The instruction does not modify the architectural state. This intrinsic provides especially significant performance gain and is described in more detail below. PAUSE Intrinsic: The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code detects the release of the lock. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin loop. Example of loop with the PAUSE instruction
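A spin-wait loop of the kind described in this entry can be sketched with the _mm_pause intrinsic. This is not the guide's own example: it assumes an x86 target, uses std::atomic for the flag, and the names spin_until_ready and ready are hypothetical.

```cpp
#include <atomic>
#include <xmmintrin.h>

// Spins until `ready` becomes true and returns how many times PAUSE was issued.
// The counter exists only so the behavior can be observed in this sketch.
unsigned long spin_until_ready(const std::atomic<bool>& ready) {
    unsigned long spins = 0;
    while (!ready.load(std::memory_order_acquire)) {
        _mm_pause();  // hint to the processor that this is a spin-wait loop
        ++spins;
    }
    return spins;
}
```

In real use another thread sets the flag with a release store. If the flag is already set when the function is called, the loop body never runs and no PAUSE is issued.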
291. ll zeros. __m64 _m_pcmpeqd(__m64 m1, __m64 m2): If the respective 32-bit values in m1 are equal to the respective 32-bit values in m2, set the respective 32-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpgtb(__m64 m1, __m64 m2): If the respective 8-bit values in m1 are greater than the respective 8-bit values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpgtw(__m64 m1, __m64 m2): If the respective 16-bit values in m1 are greater than the respective 16-bit values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to all zeros. __m64 _m_pcmpgtd(__m64 m1, __m64 m2): If the respective 32-bit values in m1 are greater than the respective 32-bit values in m2, set the respective 32-bit resulting values to all ones; otherwise set them all to zeros. MMX Technology Set Intrinsics: The prototypes for MMX technology intrinsics are in the mmintrin.h header file. Table columns: Operation, Number of Elements, Element Bit Size, Reverse Order. Rows include _mm_set_pi32, _mm_set1_pi16, _mm_set1_pi8, and _mm_setr_pi32 (operation: set integer values). Note
292. llelizer analyzes the dataflow of the program's loops and generates multithreaded code for those loops which can be safely and efficiently executed in parallel. This enables the potential exploitation of the parallel architecture found in symmetric multiprocessor (SMP) systems. Automatic parallelization relieves the user from: having to deal with the details of finding loops that are good worksharing candidates; performing the dataflow analysis to verify correct parallel execution; partitioning the data for threaded code generation as is needed in programming with OpenMP directives. The parallel run-time support provides the same run-time features found in OpenMP, such as handling the details of loop iteration modification, thread scheduling, and synchronization. While OpenMP directives enable serial applications to transform into parallel applications quickly, the programmer must explicitly identify specific portions of the application code that contain parallelism and add the appropriate compiler directives. Auto-parallelization, triggered by the -parallel option, automatically identifies those loop structures which contain parallelism. During compilation, the compiler automatically attempts to decompose the code sequences into separate threads for parallel processing. No other effort by the programmer is needed. The following example illustrates how a loop's iteration space can be divided so that it can be executed concurrently on two th
293. loat operations, typically on arrays. Supported arithmetic operations include addition, subtraction, multiplication, division, negation, square root, max, and min. Operation on double-precision types is not permitted unless optimizing for a Pentium 4 processor system, using the -xW or -axW compiler option. Integer Array Operations: The statements within the loop body may contain char, unsigned char, short, unsigned short, int, and unsigned int. Calls to functions such as sqrt and fabs are also supported. Arithmetic operations are limited to addition, subtraction, bitwise AND, OR, and XOR operators, division (16-bit only), multiplication (16-bit only), min, and max. You can mix data types only if the conversion can be done without a loss of precision. Some example operators where you can mix data types are multiplication, shift, or unary operators. Other Operations: No statements other than the preceding floating-point and integer operations are allowed. In particular, note that the special __m64 and __m128 datatypes are not vectorizable. The loop body cannot contain any function calls. Use of the Streaming SIMD Extensions intrinsics (for example, _mm_add_ps) is not allowed. Language Support and Directives: This topic addresses language features that better help to vectorize code. The __declspec(align(n)) declaration enables you to overcome hardware alignment constraints. The restrict qualifier and the pragmas address the stylistic issues due to lexical sco
294. lso Naming, Syntax, and Usage for intrinsics. Naming Syntax for the Class Libraries: The name of each class denotes the data type, signedness, bit size, and number of elements, using the following generic format: <type><signedness><bits>vec<elements> { F | I } { s | u } { 64 | 32 | 16 | 8 } vec { 8 | 4 | 2 | 1 } where <type> indicates floating point (F) or integer (I); <signedness> indicates signed (s) or unsigned (u). For the Ivec class, leaving this field blank indicates an intermediate class. There are no unsigned Fvec classes; therefore, for the Fvec classes this field is blank. <bits> specifies the number of bits per element; <elements> specifies the number of elements. Compiler Options Quick Reference. Conventions Used in the Options Quick Guide Tables (conventions listed: [-]; [n]; values in {} with vertical bars; {n}; words in this style following an option). Definitions: [-]: If an option includes [-] as part of the definition, then the option can be used to enable or disable the feature. For example, the -c99[-] option can be used as -c99 (enable c99 support) or -c99- (disable c99 support). [n]: Indicates that the value n in [] can be omitted or have various values. Values in {} with vertical bars: Used for option's version; for example, option -x{K|W|N|B|P} has these versions: -xK, -xW, -xN, -xB, and -xP. {n}: Indicates that option must include one of the fixed values for n
295. lt-in Functions: This version of the Intel C++ compiler supports the following gcc built-in functions: __builtin_frame_address (IA-32 only), __builtin_return_address (IA-32 only), __builtin_abs, __builtin_labs, __builtin_cos, __builtin_cosf, __builtin_fabs, __builtin_fabsf, __builtin_memcmp, __builtin_memcpy, __builtin_sin, __builtin_sinf, __builtin_sqrt, __builtin_sqrtf, __builtin_strcmp, __builtin_strlen, __builtin_strncmp, __builtin_abort, __builtin_prefetch, __builtin_constant_p, __builtin_printf, __builtin_fprintf, __builtin_fscanf, __builtin_scanf, __builtin_fputs, __builtin_memset, __builtin_strcat, __builtin_strcpy, __builtin_strncpy, __builtin_exit, __builtin_strchr, __builtin_strspn, __builtin_strcspn, __builtin_strstr, __builtin_strpbrk, __builtin_strrchr, __builtin_strncat, __builtin_alloca, __builtin_ffs, __builtin_index, __builtin_rindex, __builtin_bcmp, __builtin_bzero, __builtin_sinl, __builtin_cosl, __builtin_sqrtl, __builtin_fabsl. gcc Function Attributes: This version of the Intel C++ Compiler supports the following gcc function attributes: noinline: prevents a function from being inlined; always_inline: inlines the function even if no optimization is specified; used: code must be emitted for the function even if the function is not referenced. Example: int round_sqrt(int) __attribute__((always_inline)); In this example, the function round_sqrt is inlined
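A short sketch shows how the attributes and built-ins listed in this entry are written in source code. It assumes a compiler that accepts the gcc __attribute__ syntax and __builtin_ functions; the function names square_plus_one, absolute_value, and name_length are hypothetical.

```cpp
#include <cstddef>

// always_inline: request inlining even when no optimization is specified.
inline int square_plus_one(int x) __attribute__((always_inline));
inline int square_plus_one(int x) { return x * x + 1; }

// noinline: keep the function as a real call.
int absolute_value(int x) __attribute__((noinline));
int absolute_value(int x) { return __builtin_abs(x); }

// __builtin_strlen is one of the built-in functions named in the list.
std::size_t name_length(const char* s) { return __builtin_strlen(s); }
```

The attribute is attached to a declaration that precedes the definition, the same placement the round_sqrt example uses.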
296. lt result is 4. __int64 _m64_czx2r(__m64 a): The 64-bit value a is scanned for a zero element from the least significant element to the most significant element, and the index of the first zero element is returned. The element width is 16 bits, so the range of the result is from 0 to 3. If no zero element is found, the default result is 4. __m64 _m64_mix1l(__m64 a, __m64 b): Interleave 64-bit quantities a and b in 1-byte groups, starting from the left, as shown in Figure 1, and return the result. [Figure 1] __m64 _m64_mix1r(__m64 a, __m64 b): Interleave 64-bit quantities a and b in 1-byte groups, starting from the right, as shown in Figure 2, and return the result. [Figure 2] __m64 _m64_mix2l(__m64 a, __m64 b): Interleave 64-bit quantities a and b in 2-byte groups, starting from the left, as shown in Figure 3, and return the result. [Figure 3] __m64 _m64_mix2r(__m64 a, __m64 b): Interleave 64-bit quantities a and b in 2-byte groups, starting from the right, as shown in Figure 4, and return the result. [Figure 4] __m64 _m64_mix4l(__m64 a, __m64 b): Interleave 64-bit quantities a and b in 4-byte groups, starting from the left, as shown in Figure 5, and return the result. [Figure 5] __m64 _m64_mix4r(__m64 a, __m64 b): Interleave 64-bit quantities a and b in 4-byte groups, starting from the right, as
297. lt with unsigned saturation. __m64 _m_punpckhbw(__m64 m1, __m64 m2): Interleave the four 8-bit values from the high half of m1 with the four values from the high half of m2. The interleaving begins with the data from m1. __m64 _m_punpckhwd(__m64 m1, __m64 m2): Interleave the two 16-bit values from the high half of m1 with the two values from the high half of m2. The interleaving begins with the data from m1. __m64 _m_punpckhdq(__m64 m1, __m64 m2): Interleave the 32-bit value from the high half of m1 with the 32-bit value from the high half of m2. The interleaving begins with the data from m1. __m64 _m_punpcklbw(__m64 m1, __m64 m2): Interleave the four 8-bit values from the low half of m1 with the four values from the low half of m2. The interleaving begins with the data from m1. __m64 _m_punpcklwd(__m64 m1, __m64 m2): Interleave the two 16-bit values from the low half of m1 with the two values from the low half of m2. The interleaving begins with the data from m1. __m64 _m_punpckldq(__m64 m1, __m64 m2): Interleave the 32-bit value from the low half of m1 with the 32-bit value from the low half of m2. The interleaving begins with the data from m1. MMX Technology Packed Arithmetic Intrinsics: The prototypes for MMX technology intrinsics are in the mmintrin.h header file. Alternate Name m64 64 ml m paddb 64 _m_paddw __m64 ml n bs
298. m_unpacklo_epi8. Interleave the four 8-bit values from the low half of A with the four 8-bit values from the low half of B. I8vec8 unpack_low(I8vec8 A, I8vec8 B); Is8vec8 unpack_low(Is8vec8 A, Is8vec8 B); Iu8vec8 unpack_low(Iu8vec8 A, Iu8vec8 B); R0 = A0, R1 = B0, R2 = A1, R3 = B1, R4 = A2, R5 = B2, R6 = A3, R7 = B3. Corresponding intrinsic: _mm_unpacklo_pi8. Pack Operators: Pack the eight 32-bit values found in A and B into eight 16-bit values with signed saturation. Is16vec8 pack_sat(Is32vec4 A, Is32vec4 B); Corresponding intrinsic: _mm_packs_epi32. Pack the four 32-bit values found in A and B into four 16-bit values with signed saturation. Is16vec4 pack_sat(Is32vec2 A, Is32vec2 B); Corresponding intrinsic: _mm_packs_pi32. Pack the sixteen 16-bit values found in A and B into sixteen 8-bit values with signed saturation. Is8vec16 pack_sat(Is16vec8 A, Is16vec8 B); Corresponding intrinsic: _mm_packs_epi16. Pack the eight 16-bit values found in A and B into eight 8-bit values with signed saturation. Is8vec8 pack_sat(Is16vec4 A, Is16vec4 B); Corresponding intrinsic: _mm_packs_pi16. Pack the sixteen 16-bit values found in A and B into sixteen 8-bit values with unsigned saturation. Iu8vec16 packu_sat(Is16vec8 A, Is16vec8 B); Corresponding intrinsic: _mm_packus_epi16. Pack the eight 16-bit values found in A and B into eight 8-bit values with unsigned saturatio
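The signed and unsigned saturation that the pack operators in this entry apply to each element can be illustrated with a scalar model. This is a sketch of the per-element behavior only, not the Ivec classes or the intrinsics themselves; the function names saturate_signed and saturate_unsigned are hypothetical.

```cpp
#include <cstdint>

// Signed saturation: clamp a signed 16-bit value into the signed 8-bit range.
std::int8_t saturate_signed(std::int16_t v) {
    if (v > 127) return 127;
    if (v < -128) return -128;
    return static_cast<std::int8_t>(v);
}

// Unsigned saturation: clamp a signed 16-bit value into the range [0, 255].
std::uint8_t saturate_unsigned(std::int16_t v) {
    if (v > 255) return 255;
    if (v < 0) return 0;
    return static_cast<std::uint8_t>(v);
}
```

A pack operation applies one of these clamps to every element of A and then of B, which is why out-of-range inputs end up at the nearest representable value instead of wrapping around.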
299. Subtract the two 32-bit values in m2 from the two 32-bit values in m1.

     __m64 _m_psubsb(__m64 m1, __m64 m2)
       Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1 using saturating arithmetic.
     __m64 _m_psubsw(__m64 m1, __m64 m2)
       Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1 using saturating arithmetic.
     __m64 _m_psubusb(__m64 m1, __m64 m2)
       Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8-bit values in m1 using saturating arithmetic.
     __m64 _m_psubusw(__m64 m1, __m64 m2)
       Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values in m1 using saturating arithmetic.
     __m64 _m_pmaddwd(__m64 m1, __m64 m2)
       Multiply four 16-bit values in m1 by four 16-bit values in m2, producing four 32-bit intermediate results, which are then summed by pairs to produce two 32-bit results.
     __m64 _m_pmulhw(__m64 m1, __m64 m2)
       Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and produce the high 16 bits of the four results.
     __m64 _m_pmullw(__m64 m1, __m64 m2)
       Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16 bits of the four results.

     MMX Technology Shift Intrinsics
     The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
     Table columns: Alternate Name, Shift Direction, Corresponding
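The multiply-and-add description above (four products summed by pairs) is easier to follow as scalar code. This is a rough model only, assuming signed 16-bit inputs and element 0 as the lowest element; the function name `multiply_add_pairs` is hypothetical and is not part of any Intel header.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a packed multiply-add: four signed 16-bit products are
// formed as 32-bit intermediates, then adjacent pairs are summed.
std::array<std::int32_t, 2> multiply_add_pairs(
        const std::array<std::int16_t, 4>& m1,
        const std::array<std::int16_t, 4>& m2) {
    auto product = [&](std::size_t i) {
        return static_cast<std::int64_t>(m1[i]) * static_cast<std::int64_t>(m2[i]);
    };
    // Sum in 64 bits, then keep the low 32 bits. This avoids signed overflow
    // in the single corner case where all four inputs of a pair are -32768.
    auto wrap = [](std::int64_t v) {
        return static_cast<std::int32_t>(static_cast<std::uint32_t>(v));
    };
    return { wrap(product(0) + product(1)), wrap(product(2) + product(3)) };
}
```

This shape is the building block of a 16-bit dot product: each call yields two partial sums that a final add combines.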
300. maskmove_si64

     Store the data in A to the address p without polluting the caches. A can be any Ivec type.
       void store_nta(__m64 *p, M64 A);
     Corresponding intrinsic: _mm_stream_pi

     Compute the element-wise average of the respective unsigned 8-bit integers in A and B.
       Iu8vec8 simd_avg(Iu8vec8 A, Iu8vec8 B);
     Corresponding intrinsic: _mm_avg_pu8

     Compute the element-wise average of the respective unsigned 16-bit integers in A and B.
       Iu16vec4 simd_avg(Iu16vec4 A, Iu16vec4 B);
     Corresponding intrinsic: _mm_avg_pu16

     Conversions Between Fvec and Ivec

     Convert the lower double-precision floating-point value of A to a 32-bit integer with truncation.
       int F64vec2ToInt(F64vec2 A);
       r := (int)A0

     Convert the two lower single-precision floating-point values of A to two double-precision floating-point values.
       F64vec2 F32vec4ToF64vec2(F32vec4 A);
       r0 := (double)A0; r1 := (double)A1

     Convert the two double-precision floating-point values of A to two single-precision floating-point values.
       F32vec4 F64vec2ToF32vec4(F64vec2 A);
       r0 := (float)A0; r1 := (float)A1

     Convert the signed int in B to a double-precision floating-point value and pass the upper double-precision value from A through to the result.
       F64vec2 InttoF64vec2(F64vec2 A, int B);
       r0 := (double)B; r1 := A1

     Convert the lower floating-point value of A to a 32-bit integer with truncation.
       int F32vec4ToInt(F32vec4 A);
       r := (int)A0

     Convert the two lower floating point
301. mbol in a different component, the GP-relative address is not known at compile time.

     Symbol preemption is a very rarely used feature that has drastic negative consequences for compiler optimization. For this reason, by default the compiler treats all global symbol definitions as non-preemptable (i.e., protected visibility). Global references to symbols defined in other compilation units are assumed by default to be preemptable (i.e., default visibility). In those rare cases when you need all global definitions, as well as references, to be preemptable, specify the -fpic option to override this default.

     Specifying Symbol Visibility Explicitly

     You can explicitly set the visibility of an individual symbol using the visibility attribute on a data or function declaration. For example:

       int i __attribute__ ((visibility("default")));
       void __attribute__ ((visibility("hidden"))) x () {...}
       extern void y() __attribute__ ((visibility("protected")));

     The visibility declaration attribute accepts one of the five keywords:
       • external
       • default
       • protected
       • hidden
       • internal

     The value of the visibility declaration attribute overrides the default set by the -fvisibility, -fpic, or -fno-common options.

     If you have a number of symbols for which you wish to specify the same visibility attribute, you can set the visibility using one of the five command-line options:
       • fvisibility external file
302. me

     Descriptions:
       Generates dynamic execution counts.
       Treats partially covered code as fully covered code.
       Sets the filename that contains the list of files of interest.
       Finds the differential coverage with respect to ref_dpi_file.
       Demangles both function names and their arguments.
       Sets the name of the web-page owner.
       Sets the email address of the web-page owner.
       Sets the HTML color name or code of the uncovered blocks.
     Defaults listed in this part of the table: pgopti.spi, pgopti.dpi, #ffff99

     Option    Description                                                        Default
     -fcolor   Sets the HTML color name or code of the uncovered functions.       #ffcccc
     -pcolor   Sets the HTML color name or code of the partially covered code.    #fafad2
     -ccolor   Sets the HTML color name or code of the covered code.              #ffffff
     -ucolor   Sets the HTML color name or code of the unknown code.              #ffffff

     Visual Presentation of the Application's Code Coverage

     Based on the profile information collected from running the instrumented binaries when testing an application, the Intel compiler creates HTML files using a code-coverage tool. These HTML files indicate portions of the source code that were or were not exercised by the tests. When applied to the profile of the performance workloads, the code-coverage information shows how well the training workload covers the application's critical code. High coverage of performance-critical modules is essential to taking full advantage of profile-guided op
303. me aliasing within functions. Default: ON

     -fminshared     Compilation is for the main executable. Absolute addressing can be used and non-position-independent code generated for symbols that are at least protected. Default: OFF
     -fno-alias      Assume no aliasing in program. Default: OFF
     -fno-common     Enables the compiler to treat common variables as if they were defined, allowing the use of gprel addressing of common data variables. Default: OFF
     -fno-fnalias    Assume no aliasing within functions, but assume aliasing across calls. Default: OFF
     -fno-rtti       Disable RTTI support. Default: OFF
     -fnsplit[-]     Enables/disables function splitting. Default is ON with -prof_use. To disable function splitting when you use -prof_use, also specify -fnsplit-. Default: OFF
                     Disable using the EBP register as general-purpose register. Default: OFF
     -fpic, -fPIC    For IA-32, this option generates position-independent code. For Itanium-based systems, this option generates code allowing full symbol preemption. Default: OFF
     -fp_port        Round fp results at assignments and casts (IA-32 only). Some speed impact. Default: OFF
     -fpstkchk       Generates extra code after every function call to assure the FP stack is in the expected state. Default: OFF
                     Use only lower 32 floating-point registers. Default: OFF
     -fshort-enums   Allocate as many bytes as needed for enumerated types. Default: OFF
     -fsource-asm    Produce assemblable file with optional code annotations. Requires -S. Default: OFF
     -fsyntax-only
304. ming SIMD Extensions 2, you need only to include the dvec.h file.

     Usage Precautions

     When using the C++ classes, you should follow some general guidelines. More detailed usage rules for each class are listed in Integer Vector Classes and Floating-point Vector Classes.

     Clear MMX Registers

     If you use both the Ivec and Fvec classes at the same time, your program could mix MMX instructions, called by Ivec classes, with Intel x87 architecture floating-point instructions, called by Fvec classes. Floating-point instructions exist in the following Fvec functions:
       • fvec constructors
       • debug functions (cout and element access)
       • rsqrt_nr

     Note: MMX registers are aliased on the floating-point registers, so you should clear the MMX state with the EMMS instruction intrinsic before issuing an x87 floating-point instruction, as in the following example:

       ivecA = ivecA & ivecB;   /* Ivec logical operation that uses MMX instructions */
       empty();                 /* clear state */
       cout << f32vec4a;        /* F32vec4 operation that uses x87 floating-point instructions */

     Caution: Failure to clear the MMX registers can result in incorrect execution or poor performance due to an incorrect register state.

     Follow EMMS Instruction Guidelines

     Intel strongly recommends that you follow the guidelines for using the EMMS instruction. Refer to this topic before coding with the Ivec classes.

     Capabilities

     The fundamental capabilities of each C++ SIMD cla
305. _mm_cvtsi32_si128(int a) (uses MOVD)
       Moves 32-bit integer a to the least significant 32 bits of an __m128i object. Zeroes the upper 96 bits of the __m128i object.
       r0 := a; r1 := 0x0; r2 := 0x0; r3 := 0x0

     int _mm_cvtsi128_si32(__m128i a) (uses MOVD)
       Moves the least significant 32 bits of a to a 32-bit integer.
       r := a0

     __m128 _mm_cvtepi32_ps(__m128i a)
       Converts the 4 signed 32-bit integer values of a to SP FP values.
       r0 := (float)a0; r1 := (float)a1; r2 := (float)a2; r3 := (float)a3

     __m128i _mm_cvtps_epi32(__m128 a)
       Converts the 4 SP FP values of a to signed 32-bit integer values.
       r0 := (int)a0; r1 := (int)a1; r2 := (int)a2; r3 := (int)a3

     __m128i _mm_cvttps_epi32(__m128 a)
       Converts the 4 SP FP values of a to signed 32-bit integer values using truncate.
       r0 := (int)a0; r1 := (int)a1; r2 := (int)a2; r3 := (int)a3

     Macro Function for Shuffle

     The Streaming SIMD Extensions 2 provide a macro function to help create constants that describe shuffle operations. The macro takes two small integers (in the range of 0 to 1) and combines them into a 2-bit immediate value used by the SHUFPD instruction. See the following example.

     Shuffle Function Macro
       _MM_SHUFFLE2(x, y) expands to the value of (x << 1) | y

     You can view the two integers as selectors for choosing which two words from the first input operand and which two words from the second ar
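To make the 2-bit immediate concrete, here is a small scalar sketch. The names `shuffle2`, `Double2`, and `shuffle_pd_model` are hypothetical. The selection rule modelled here (bit 0 picks the element of the first operand for r0, bit 1 picks the element of the second operand for r1) is the usual SHUFPD behaviour and is stated as an assumption, since the text above is cut off before it finishes describing the selectors.

```cpp
#include <array>

// Same arithmetic as the macro described above: (x << 1) | y.
constexpr unsigned shuffle2(unsigned x, unsigned y) {
    return (x << 1) | y;
}

using Double2 = std::array<double, 2>;

// Scalar model of a two-element double-precision shuffle.
// Assumption: bit 0 of imm selects a[0] or a[1] for r0,
// and bit 1 of imm selects b[0] or b[1] for r1.
Double2 shuffle_pd_model(const Double2& a, const Double2& b, unsigned imm) {
    Double2 r{};
    r[0] = a[imm & 1u];
    r[1] = b[(imm >> 1) & 1u];
    return r;
}
```

With this model, `shuffle2(1, 0)` produces the immediate 2, which keeps the low element of the first operand and takes the high element of the second.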
306. mm_rcp_pd

     Operation      Returns     Example Syntax Usage              Intrinsic
                    1 float     F32vec1 R = rcp(F32vec1 A);       _mm_rcp_ss

     Reciprocal Square Root
                    4 floats    F32vec4 R = rsqrt(F32vec4 A);     _mm_rsqrt_ps
                    2 doubles   F64vec2 R = rsqrt(F64vec2 A);     _mm_rsqrt_pd
                    1 float     F32vec1 R = rsqrt(F32vec1 A);     _mm_rsqrt_ss

     Reciprocal Newton Raphson
                    4 floats    F32vec4 R = rcp_nr(F32vec4 A);    _mm_sub_ps, _mm_add_ps, _mm_mul_ps, _mm_rcp_ps
                    2 doubles   F64vec2 R = rcp_nr(F64vec2 A);    _mm_sub_pd, _mm_add_pd, _mm_mul_pd, _mm_rcp_pd
                    1 float     F32vec1 R = rcp_nr(F32vec1 A);    _mm_sub_ss, _mm_add_ss, _mm_mul_ss, _mm_rcp_ss

     Reciprocal Square Root Newton Raphson
                    4 floats    F32vec4 R = rsqrt_nr(F32vec4 A);  _mm_sub_pd, _mm_mul_pd, _mm_rsqrt_ps
                    2 doubles   F64vec2 R = rsqrt_nr(F64vec2 A);  _mm_sub_pd, _mm_mul_pd, _mm_rsqrt_pd
                    1 float     F32vec1 R = rsqrt_nr(F32vec1 A);  _mm_sub_ss, _mm_mul_ss, _mm_rsqrt_ss

     Horizontal Add
                    1 float     float f = add_horizontal(F32vec4 A);   _mm_add_ss, _mm_shuffle_ss
                    1 double    double d = add_horizontal(F64vec2 A);  _mm_add_sd, _mm_shuffle_sd

     Minimum and Maximum Operators

     Compute the minimums of the two double-precision floating-point values of A and B.
       F64vec2 R = simd_min(F64vec2 A, F64vec2 B);
       R0 := min(A0, B0); R1 := min(A1, B1)
     Corresponding intrinsic: _mm_min_pd

     Compute the minimums of the four single-precision floating-point values of A and B.
       F32vec4 R = simd_min(F32vec4 A, F32vec4 B);
       R0 := min(A0, B0); R1 := min(A1, B1); R2 := min(A2, B2); R3 := min(A3, B3)
     Corresp
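The Newton-Raphson rows above list subtract, add, and multiply intrinsics next to the approximate reciprocal. A plausible reading, sketched here in scalar form, is one refinement step applied to a rough estimate. Both function names are hypothetical and the formulas are the textbook Newton-Raphson updates, not code taken from the class library.

```cpp
// One Newton-Raphson step toward 1/a from an estimate x0:
//   x1 = (x0 + x0) - a * x0 * x0
// This uses one add, two multiplies, and one subtract.
float rcp_refine(float a, float x0) {
    return (x0 + x0) - a * x0 * x0;
}

// One Newton-Raphson step toward 1/sqrt(a) from an estimate x0:
//   x1 = 0.5 * x0 * (3 - a * x0 * x0)
// This uses multiplies and one subtract.
float rsqrt_refine(float a, float x0) {
    return 0.5f * x0 * (3.0f - a * x0 * x0);
}
```

Each step roughly doubles the number of correct bits, which is why a low-precision hardware estimate followed by a single step is often accurate enough for single-precision work.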
307. mpiler.
       errno: EDOM, for x < 0
       errno: ERANGE, for x = 0
     Calling interface:
       double log(double x);
       long double logl(long double x);
       float logf(float x);

     Description: The log10 function returns the base-10 log of x, log10(x). This function may be inlined by the Itanium compiler.
       errno: EDOM, for x < 0
       errno: ERANGE, for x = 0
     Calling interface:
       double log10(double x);
       long double log10l(long double x);
       float log10f(float x);

     Description: The log1p function returns the natural log of (x+1), ln(x + 1).
       errno: EDOM, for x < -1
       errno: ERANGE, for x = -1
     Calling interface:
       double log1p(double x);
       long double log1pl(long double x);
       float log1pf(float x);

     LOG2, LOGB, POW, SCALB, SCALBN

     Description: The log2 function returns the base-2 log of x, log2(x).
       errno: EDOM, for x < 0
       errno: ERANGE, for x = 0
     Calling interface:
       double log2(double x);
       long double log2l(long double x);
       float log2f(float x);

     Description: The logb function returns the signed exponent of x.
       errno: EDOM, for x = 0
     Calling interface:
       double logb(double x);
       long double logbl(long double x);
       float logbf(float x);

     Description: The pow function returns x raised to the power of y.
       errno: EDOM, for x = 0 and y < 0
       errno: EDOM, for x < 0 and y is a non-integer
       errno: ERANGE, for overflow conditions
     Calling interface:
       double pow(double x, double y);
       long double powl
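The errno conventions listed above can be checked from calling code. The sketch below uses only the standard `<cmath>` and `<cerrno>` facilities; the helper names are hypothetical. Whether errno is actually set depends on the math library and compiler flags in use, so the code consults `math_errhandling` rather than assuming it.

```cpp
#include <cerrno>
#include <cmath>

// True when this implementation reports math errors through errno.
bool reports_via_errno() {
    return (math_errhandling & MATH_ERRNO) != 0;
}

// Returns the errno value left behind by std::log(x), or 0 if none was set.
int errno_after_log(double x) {
    volatile double input = x;   // volatile discourages compile-time folding
    errno = 0;                   // errno is never cleared by the library
    volatile double result = std::log(input);
    (void)result;
    return errno;
}
```

A caller would compare the returned value against `EDOM` (negative argument) or `ERANGE` (zero argument), but only after confirming `reports_via_errno()`, because some libraries signal these conditions through floating-point exceptions instead.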
308. mposite
       Sets the 8 signed 8-bit integer values in reverse order.
       r0 := b0; r1 := b1; ...; r7 := b7

     MMX Technology Intrinsics on Itanium Architecture

     MMX technology intrinsics provide access to the MMX technology instruction set on Itanium-based systems. To provide source compatibility with the IA-32 architecture, these intrinsics are equivalent both in name and functionality to the set of IA-32-based MMX intrinsics.

     Some intrinsics have more than one name. When one intrinsic has two names, both names generate the same instructions, but the first is preferred as it conforms to a newer naming standard.

     The prototypes for MMX technology intrinsics are in the mmintrin.h header file.

     Data Types

     The C data type __m64 is used when using MMX technology intrinsics. It can hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.

     The __m64 data type is not a basic ANSI C data type. Therefore, observe the following usage restrictions:
       • Use the new data type only on the left-hand side of an assignment, as a return value, or as a parameter. You cannot use it with other arithmetic expressions ("+", "-", and so on).
       • Use the new data type as objects in aggregates, such as unions, to access the byte elements and structures; the address of an __m64 object may be taken.
       • Use new data types only with the respective intrinsics described in this documentation.

     For complete details of the hardware instructions, see the In
309. void _mm_mfence(void)
       Guarantees that every memory access that precedes, in program order, the memory fence instruction is globally visible before any memory instruction which follows the fence in program order.

     void _mm_pause(void)
       The execution of the next instruction is delayed an implementation-specific amount of time. The instruction does not modify the architectural state. This intrinsic provides an especially significant performance gain and is described in more detail below.

     PAUSE Intrinsic

     The PAUSE intrinsic is used in spin-wait loops with the processors implementing dynamic execution (especially out-of-order execution). In the spin-wait loop, PAUSE improves the speed at which the code detects the release of the lock. For dynamic scheduling, the PAUSE instruction reduces the penalty of exiting from the spin loop.

     Example of loop with the PAUSE instruction:

       spin_loop: pause
                  cmp eax, A
                  jne spin_loop

     In the above example, the program spins until memory location A matches the value in register eax. The code sequence that follows shows a test-and-test-and-set. In this example, the spin occurs only after the attempt to get a lock has failed.

       get_lock:  mov eax, 1
                  xchg eax, A      ; Try to get lock
                  cmp eax, 0       ; Test if successful
                  jne spin_loop
       critical_section:
                  code
                  mov A, 0         ; Release lock
                  jmp continue
       spin_loop: pause            ; Spin-loop hint
                  cmp 0, A         ; Check lock availability
                  jne spin_loop
                  jmp ge
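The test-and-test-and-set sequence above maps naturally onto C++ atomics. The following is a minimal sketch under stated assumptions: the `SpinLock` class and `spin_wait_hint` helper are hypothetical names, `_mm_pause` is used only when building for x86-64, and other targets fall back to `std::this_thread::yield()` as a stand-in for the hint.

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

// Spin-loop hint: PAUSE on x86-64, a plain yield elsewhere (assumption).
inline void spin_wait_hint() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

class SpinLock {
public:
    // One atomic exchange, like the xchg in the listing above.
    bool try_lock() {
        return flag_.exchange(1, std::memory_order_acquire) == 0;
    }

    void lock() {
        for (;;) {
            if (try_lock()) {
                return;               // got the lock on the first attempt
            }
            // Spin on plain loads only after the attempt failed, so waiting
            // threads do not keep writing to the shared cache line.
            while (flag_.load(std::memory_order_relaxed) != 0) {
                spin_wait_hint();
            }
        }
    }

    void unlock() {
        flag_.store(0, std::memory_order_release);   // like "mov A, 0"
    }

private:
    std::atomic<int> flag_{0};
};
```

The inner read-only loop is the design point: the expensive exchange is retried only once the lock looks free, and the hint sits in the loop that actually waits.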
310. n
       Iu8vec8 packu_sat(Is16vec4 A, Is16vec4 B);
     Corresponding intrinsic: _mm_packs_pu16

     Clear MMX(TM) Instructions State Operator

     Empty the MMX(TM) registers and clear the MMX state. Read the guidelines for using the EMMS instruction intrinsic.
       void empty(void);
     Corresponding intrinsic: _mm_empty

     Integer Intrinsics for Streaming SIMD Extensions

     Note: You must include the fvec.h header file for the following functionality.

     Compute the element-wise maximum of the respective signed integer words in A and B.
       Is16vec4 simd_max(Is16vec4 A, Is16vec4 B);
     Corresponding intrinsic: _mm_max_pi16

     Compute the element-wise minimum of the respective signed integer words in A and B.
       Is16vec4 simd_min(Is16vec4 A, Is16vec4 B);
     Corresponding intrinsic: _mm_min_pi16

     Compute the element-wise maximum of the respective unsigned bytes in A and B.
       Iu8vec8 simd_max(Iu8vec8 A, Iu8vec8 B);
     Corresponding intrinsic: _mm_max_pu8

     Compute the element-wise minimum of the respective unsigned bytes in A and B.
       Iu8vec8 simd_min(Iu8vec8 A, Iu8vec8 B);
     Corresponding intrinsic: _mm_min_pu8

     Create an 8-bit mask from the most significant bits of the bytes in A.
       int move_mask(I8vec8 A);
     Corresponding intrinsic: _mm_movemask_pi8

     Conditionally store byte elements of A to address p. The high bit of each byte in the selector B determines whether the corresponding byte in A will be stored.
       void mask_move(I8vec8 A, I8vec8 B, signed char *p);
     Corresponding intrinsic: _mm
311. n
       prompt> icpc -ip -Qoption,c,-ip_ninl_max_stats=5 source.cpp

     Multifile IPO

     Multifile IPO obtains potential optimization information from individual program modules of a multifile program. Using the information, the compiler performs optimizations across modules.

     Building a program is divided into two phases: compilation and linkage. Multifile IPO performs different work depending on whether the compilation, linkage, or both are performed.

     Compilation Phase

     As each source file is compiled, multifile IPO stores an intermediate representation (IR) of the source code in the object file, which includes summary information used for optimization.

     By default, the compiler produces "mock" object files during the compilation phase of multifile IPO. Generating mock files instead of real object files reduces the time spent in the multifile IPO compilation phase. Each mock object file contains the IR for its corresponding source file, but no real code or data. These mock objects must be linked using the -ipo option or using the xild tool.

     Note: Failure to link "mock" objects with -ipo or xild will result in linkage errors. There are situations where mock object files cannot be used. See Compilation with Real Object Files for more information.

     Linkage Phase

     When you specify -ipo, the compiler is invoked a final time before the linker. The compiler performs multifile IPO acro
312. n 187
     log library function  180
     log10 library function  180
     log1p library function  180
     log2 library function  180
     logb library function  180
     -long_double option  11, 81
     loop transformation  117
     lrint library function  187
     lround library function  187
     -M option  11
     makefile  42
     -march=cpu option  11, 85
     math library  60
     matrix multiplication  131
     -mcpu=cpu option  11
     -MD option  11
     -MF option  11
     -MG option  11
     -MM option  11
     -MMD option  11
     -mno-relax option  11
     -mno-serialize-volatile option  11
     modf library function  187
     -mp option  11, 81
     -mp1 option  11, 80, 81
     -mrelax option  11
     -mserialize-volatile option  11
     -MX option  11
     nearbyint library function  187
     nextafter library function  190
     nexttoward library function  190
     -no_cpprt option  11
     -nobss_init option  11
     -nodefaultlibs option  11
     -no-gcc option  11, 71
     -nolib_inline option  11, 81, 92
     -nostartfiles option  11
     -nostdinc option  11
     -nostdlib option  11
     OC Optl Onis nent 11 95 OO o
313. n b from the 2 signed or unsigned 64-bit integers in a.
       r0 := a0 - b0; r1 := a1 - b1

     __m128i _mm_subs_epi8(__m128i a, __m128i b)
       Subtracts the 16 signed 8-bit integers of b from the 16 signed 8-bit integers of a using saturating arithmetic.
       r0 := SignedSaturate(a0 - b0); r1 := SignedSaturate(a1 - b1); ...; r15 := SignedSaturate(a15 - b15)

     __m128i _mm_subs_epi16(__m128i a, __m128i b)
       Subtracts the 8 signed 16-bit integers of b from the 8 signed 16-bit integers of a using saturating arithmetic.
       r0 := SignedSaturate(a0 - b0); r1 := SignedSaturate(a1 - b1); ...; r7 := SignedSaturate(a7 - b7)

     __m128i _mm_subs_epu8(__m128i a, __m128i b)
       Subtracts the 16 unsigned 8-bit integers of b from the 16 unsigned 8-bit integers of a using saturating arithmetic.
       r0 := UnsignedSaturate(a0 - b0); r1 := UnsignedSaturate(a1 - b1); ...; r15 := UnsignedSaturate(a15 - b15)

     __m128i _mm_subs_epu16(__m128i a, __m128i b)
       Subtracts the 8 unsigned 16-bit integers of b from the 8 unsigned 16-bit integers of a using saturating arithmetic.
       r0 := UnsignedSaturate(a0 - b0); r1 := UnsignedSaturate(a1 - b1); ...; r7 := UnsignedSaturate(a7 - b7)

     Integer Logical Operations for Streaming SIMD Extensions 2

     The following four logical-operation intrinsics and their respective instructions are functional as part of Streaming SIMD Extensions 2. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.

     __m128i _mm_and_si128(__m128i a, __m128i b) (uses PA
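SignedSaturate and UnsignedSaturate in the entries above mean that a result outside the representable range is clamped to the nearest limit instead of wrapping around. A minimal per-element sketch for the 8-bit case follows; the function names `subs_i8` and `subs_u8` are hypothetical scalar stand-ins, not library calls.

```cpp
#include <algorithm>
#include <cstdint>

// Signed saturating subtract: compute in a wider type, then clamp to [-128, 127].
std::int8_t subs_i8(std::int8_t a, std::int8_t b) {
    const int diff = static_cast<int>(a) - static_cast<int>(b);
    return static_cast<std::int8_t>(std::min(127, std::max(-128, diff)));
}

// Unsigned saturating subtract: anything that would go below zero becomes zero.
std::uint8_t subs_u8(std::uint8_t a, std::uint8_t b) {
    return a > b ? static_cast<std::uint8_t>(a - b) : std::uint8_t{0};
}
```

Saturation is what makes these operations suitable for pixel and audio data, where wrapping from dark to bright or from quiet to loud would be visibly or audibly wrong.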
314. n that the corresponding definition will have that visibility.

     Symbol Preemption

     Sometimes you may need to use some of the functions or data items from a shareable object, but may wish to replace others with your own definitions. For example, you may want to use the standard C runtime library shareable object, libc.so, but to use your own definitions of the heap management routines malloc and free. In this case it is important that calls to malloc and free within libc.so call your definition of the routines and not the definitions present in libc.so. Your definition should override, or preempt, the definition within the shareable object.

     This feature of shareable objects is called symbol preemption. When the runtime loader loads a component, all symbols within the component that have default visibility are subject to preemption by symbols of the same name in components that are already loaded. Since the main program image is always loaded first, none of the symbols it defines will be preempted.

     The possibility of symbol preemption inhibits many valuable compiler optimizations because symbols with default visibility are not bound to a memory address until runtime. For example, calls to a routine with default visibility cannot be inlined because the routine might be preempted if the compilation unit is linked into a shareable object. A preemptable data symbol cannot be accessed using GP-relative addressing because the name may be bound to a sy
315. n the C++ classes due to their immediate arguments. However, the C++ class implementation enables you to mix shuffle intrinsics with the other C++ functions. For example:

       F32vec4 fveca, fvecb, fvecd;
       fveca += fvecb;
       fvecd = _mm_shuffle_ps(fveca, fvecb, 0);

     Typically every instruction with horizontal data flow contains some inefficiency in the implementation. If possible, implement your algorithms without using the horizontal capabilities.

     Branch Compression/Elimination

     Branching in SIMD architectures can be complicated and expensive, possibly resulting in poor predictability and code expansion. The SIMD C++ classes provide functions to eliminate branches, using logical operations, max and min functions, conditional selects, and compares. Consider the following example:

       short a[4], b[4], c[4];
       for (i=0; i<4; i++)
         c[i] = a[i] > b[i] ? a[i] : b[i];

     This operation is independent of the value of i. For each i, the result could be either A or B depending on the actual values. A simple way of removing the branch altogether is to use the select_gt function, as follows:

       Is16vec4 a, b, c;
       c = select_gt(a, b, a, b);

     Caching Hints

     Streaming SIMD Extensions provide prefetching and streaming hints. Prefetching data can minimize the effects of memory latency. Streaming hints allow you to indicate that certain data should not be cached. This results in higher performance for data that should be cached. I
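The idea behind a conditional select can be shown per element with a comparison mask. This is a scalar sketch of the general technique, assuming a select takes two values to compare and two candidate results; `Short4` and `select_gt_model` are hypothetical names and do not reproduce the class library's implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Short4 = std::array<std::int16_t, 4>;

// For each element: if a > b, take if_true, otherwise take if_false.
// The comparison becomes an all-ones or all-zeros mask, so the result is
// built from AND/OR operations on the two candidates, with no if-statement.
Short4 select_gt_model(const Short4& a, const Short4& b,
                       const Short4& if_true, const Short4& if_false) {
    Short4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        const int mask = -static_cast<int>(a[i] > b[i]);   // -1 or 0
        r[i] = static_cast<std::int16_t>((if_true[i] & mask) |
                                         (if_false[i] & ~mask));
    }
    return r;
}
```

Calling `select_gt_model(a, b, a, b)` yields the element-wise maximum, which is the same result the branching loop above computes.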
316. n type
       _Bool keyword

     Note: The -restrict option enables the recognition of the restrict keyword as defined by the ANSI standard. By qualifying a pointer with the restrict keyword, the user asserts that an object accessed via the pointer is only accessed via that pointer in the given scope. It is the user's responsibility to use the restrict keyword only when this assertion is true. In these cases, the use of restrict will have no effect on program correctness, but may allow better optimization.

     These features are not supported:
       • #pragma STDC FP_CONTRACT
       • #pragma STDC FENV_ACCESS
       • #pragma STDC CX_LIMITED_RANGE
       • long double (128-bit representations)

     Conformance to the C++ Standard

     The Intel C++ Compiler conforms to the ANSI/ISO standard (ISO/IEC 14882:1998) for the C++ language; however, the export keyword for templates is not implemented.

     Compiler Optimizations

     Optimization Levels

     This section discusses the command-line options -O0, -O1, -O2, and -O3. The -O0 option disables optimizations. Each of the other three turns on several compiler capabilities. To specify one of these optimizations, take into consideration the nature and structure of your application as indicated in the more detailed description of the options.

     In general terms, -O1, -O2, and -O3 optimize as follows:
       • -O1: code size and locality
       • -O2: code speed; this is th
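The promise that the restrict qualifier makes can be seen in a small loop. The sketch below uses the `__restrict` spelling, which GCC, Clang, and MSVC accept as an extension in C++ code; treating it as equivalent to the keyword discussed above is an assumption, and the function name `scaled_add` is hypothetical.

```cpp
#include <cstddef>

// The caller promises that dst and src never overlap. With that promise the
// compiler may keep src values in registers or vectorize the loop without
// re-reading src after every store to dst.
void scaled_add(float* __restrict dst, const float* __restrict src,
                float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] += k * src[i];
    }
}
```

Calling `scaled_add(buf, buf + 1, 2.0f, n)` would break the promise and make the behavior undefined, which is why the responsibility rests with the caller.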
317. name is ipo_out.s.

     -ipo_o[file.o]
       Produces object file for the multifile IPO compilation. You may specify an optional name for the object file, or a directory (with the backslash) in which to place the file. The default object file name is ipo_out.o.
     -ipo_fcode-asm
       Add code bytes to assemblable files.
     -ipo_fsource-asm
       Add high-level source code to assemblable files.
     -ipo_fverbose-asm, -ipo_fnoverbose-asm
       Enable and disable, respectively, inserting comments containing version and options used in the assemblable file for xild.

     Creating a Library from IPO Objects

     Normally, libraries are created using a library manager such as ar. Given a list of objects, the library manager will insert the objects into a named library to be used in subsequent link steps.

       prompt> xiar cru user.a a.o b.o

     A library named user.a will be created containing a.o and b.o.

     If, however, the objects have been created using -ipo -c, then the objects will not contain a valid object, but only the intermediate representation (IR) for that object file. For example:

       prompt> icpc -ipo -c a.cpp b.cpp

     will produce a.o and b.o that only contain IR to be used in a link-time compilation. The library manager will not allow these to be inserted in a library.

     In this case you must use the Intel library driver, xild -ar. This program will invoke the compiler on the IR saved in the object file and generate a valid object that can b
318. name option

     The -gcc-name=name option, used with -cxxlib-gcc, lets you specify the location of g++ if the compiler cannot locate the gcc C++ libraries. Use this option when referencing a non-standard gcc installation.

     gcc-version

     The -gcc-version=nnn option provides compatible behavior with gcc, where nnn indicates the gcc version. This version of the Intel compiler supports -gcc-version=320 (ON by default).

     Default Libraries and Headers

     The -cxxlib-icc option directs the Intel compiler to use the C++ run-time libraries and C++ header files included with the Intel compiler. They include:
       • libcprts standard C++ headers
       • libcprts standard C++ library
       • libcxa and libunwind C++ language support

     The -cxxlib-icc option is ON by default and can be used with any supported Linux distribution. See Release Notes.

     Summary of Corresponding Libraries and Headers

       Intel Library/Header     gcc Library/Header
       libcprts                 libstdc++
       libcxa, libunwind

     gcc Predefined Macros

     The Intel C++ Compiler 8.0 includes new predefined macros also supported by gcc:
       • __GNUC__
       • __GNUC_MINOR__
       • __GNUC_PATCHLEVEL__

     You can specify the -no-gcc option if you do not want these macros defined. If you need gcc interoperability (-cxxlib-gcc), do not use the -no-gcc compiler option.

     See also GNU Environment Variables.

     gcc Bui
319. ng cpu_dispatch stub, unless the cpu_specific function is declared static. The inline attribute is disabled for all cpu_specific and cpu_dispatch functions.

     Must have a stub for cpu_specific function
       If a function f is defined as __declspec(cpu_specific(p)), then a cpu_dispatch stub must also appear for f within the program, and p must be in the cpuid-list of that stub; otherwise, that cpu_specific definition cannot be called nor generate an error condition.

     Overrides command-line settings
       When a cpu_dispatch stub is compiled, its body is replaced with code that determines the processor on which the program is running, then dispatches the "best" cpu_specific implementation available as defined by the cpuid-list. The cpu_specific function optimizes to the specified Intel processor regardless of command-line option settings.

     Processor Dispatch Example

     Here is an example of how these features can be used:

       #include <mmintrin.h>

       /* Pentium processor function does not use intrinsics to add two arrays. */
       __declspec(cpu_specific(pentium))
       void array_sum(int *r, int *a, int *b, size_t l)
       {
         for (; l > 0; l--)
           *r++ = *a++ + *b++;
       }

       /* Implementation for a Pentium processor with MMX technology uses
          an MMX instruction intrinsic to add four elements simultaneously. */
       __declspec(cpu_specific(pentium_MMX))
       void array_sum(int *r, int const *a, int *b, size_t l)
320. none
       • shared(variable-list)
       • private(variable-list)
       • firstprivate(variable-list)
       • lastprivate(variable-list)
       • reduction(operator: variable-list)
       • ordered

     Clause descriptions are the same as for the OpenMP parallel construct or the taskq construct above, as appropriate.

     Example Function

     The test1 function below is a natural candidate to be parallelized using the workqueuing model. You can express the parallelism by annotating the loop with a parallel taskq pragma and the work in the loop body with a task pragma. The parallel taskq pragma specifies an environment for the while loop in which to enqueue the units of work specified by the enclosed task pragma. Thus, the loop's control structure and the enqueuing are executed single-threaded, while the other threads in the team participate in dequeuing the work from the taskq queue and executing it. The captureprivate clause ensures that a private copy of the link pointer p is captured at the time each task is being enqueued, hence preserving the sequential semantics.

       void test1(LIST p)
       {
         #pragma intel omp parallel taskq shared(p)
         {
           while (p != NULL)
           {
             #pragma intel omp task captureprivate(p)
             {
               do_work1(p);
             }
             p = p->next;
           }
         }
       }

     Optimization Support Features

     This section describes language extensions to the Intel C++ Compiler that let you optimize your source code directly. Examples are included of optimizations
321. ns. Specify one and only one of aligned or unaligned.

     Caution: If you specify aligned as an argument, you must be absolutely sure that the loop will be vectorizable using this instruction. Otherwise, the compiler will generate incorrect code.

     The loop in the example below uses the aligned qualifier to request that the loop be vectorized with aligned instructions, as the arrays are declared in such a way that the compiler could not normally prove this would be safe to do so.

     Example

       void foo(float *a)
       #pragma vector aligned
       for (i = 0; i < m; i++)

     The compiler has at its disposal several alignment strategies in case the alignment of data structures is not known at compile time. A simple example is shown below, but several other strategies are supported as well. If, in the loop shown below, the alignment of a is unknown, the compiler will generate a prelude loop that iterates until the array reference that occurs the most hits an aligned address. This makes the alignment properties of a known, and the vector loop is optimized accordingly.

     Alignment Strategies Example

       float *a;

       /* alignment unknown */
       for (i = 0; i < 100; i++)
       {
         a[i] = a[i] + 1.0f;
       }

       /* dynamic loop peeling */
       p = a & 0x0f;
       if (p != 0)
       {
         p = (16 - p) / 4;
         for (i = 0; i < p; i++)
         {
           a[i] = a[i] + 1.0f;
         }
       }

       /* loop with a aligned (will be vectorized accordingly) */
       for (i = p; i < 100; i++)
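The peeling arithmetic above can be written as ordinary C++ to see how the prelude count falls out of the address. This is a hand-written sketch of the strategy, not compiler output. The function name `add_one_peeled` is hypothetical, and it assumes 4-byte floats and that the array is at least float-aligned.

```cpp
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");

// Adds 1.0f to every element and returns how many leading elements were
// handled by the scalar prelude before a 16-byte boundary was reached.
std::size_t add_one_peeled(float* a, std::size_t n) {
    const std::uintptr_t misalign = reinterpret_cast<std::uintptr_t>(a) & 0x0F;
    std::size_t peel = (misalign != 0) ? (16 - misalign) / sizeof(float) : 0;
    if (peel > n) {
        peel = n;                    // short arrays may end before the boundary
    }

    std::size_t i = 0;
    for (; i < peel; ++i) {          // prelude: runs until a + i is 16-byte aligned
        a[i] += 1.0f;
    }
    for (; i < n; ++i) {             // main loop: starts on an aligned address
        a[i] += 1.0f;
    }
    return peel;
}
```

For an array that starts 4 bytes past a 16-byte boundary, the prelude handles (16 - 4) / 4 = 3 elements, after which aligned vector loads and stores become legal for the rest of the loop.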
322. nsions

     The Intel C++ Compiler implements the following groups of functions as extensions to the OpenMP run-time library:
       • getting and setting stack size for parallel threads
       • memory allocation

     The Intel extensions described in this section can be used for low-level debugging to verify that the library code and application are functioning as intended. It is recommended to use these functions with caution, because using them requires the use of the -openmp_stubs command-line option to execute the program sequentially. These functions are also generally not recognized by other vendors' OpenMP-compliant compilers, which may cause the link stage to fail for these other compilers.

     Note: The functions below require the pre-processor directive #include <omp.h>.

     Stack Size

     In most cases, directives can be used in place of extensions. For example, the stack size of the parallel threads may be set using the KMP_STACKSIZE environment variable rather than the kmp_set_stacksize_s() function.

     Note: A run-time call to an Intel extension takes precedence over the corresponding environment variable setting.

     See the definitions of stack size functions in the Stack Size table below.

     Memory Allocation

     The Intel C++ Compiler implements a group of memory allocation functions as extensions to the OpenMP run-time library to enable threads to allocate memory from a heap local to each thread. These functions are kmp_malloc, kmp_calloc
323. nteger Vector Classes
The Ivec classes provide an interface to SIMD processing using integer vectors of various sizes. The class hierarchy is represented in the following figure.
    [Figure: Ivec Class Hierarchy]
The M64 and M128 classes define the __m64 and __m128i data types from which the rest of the Ivec classes are derived. The first generation of child classes are derived based solely on bit sizes of 128, 64, 32, 16, and 8 respectively for the I128vec1, I64vec1, I64vec2, I32vec2, I32vec4, I16vec4, I16vec8, I8vec16, and I8vec8 classes. The latter seven of these classes require specification of signedness and saturation.
Caution: Do not intermix the M64 and M128 data types. You will get unexpected behavior if you do.
The signedness is indicated by the s and u in the class names:
    Is64vec2  Iu64vec2
    Is32vec4  Iu32vec4
    Is16vec8  Iu16vec8
    Is8vec16  Iu8vec16
    Is32vec2  Iu32vec2
    Is16vec4  Iu16vec4
    Is8vec8   Iu8vec8
Terms, Conventions, and Syntax
The following are special terms and syntax used in this chapter to describe functionality of the classes with respect to their associated operations.
Ivec Class Syntax Conventions
The name of each class denotes the data type, signedness, bit size, and number of elements, using the following generic format:
    <type><signedness><bits>vec<elements>
    { F | I } { s | u } { 64 | 32 | 16 | 8 } vec { 8 | 4 | 2 | 1 }
where type indicates floating point (F) or integer (I)
324. ntify the test prof dpi file Sets the path name of the output report file.
    Option           Description
    -comp            Sets the filename that contains the list of files of interest.
    -cutoff value    Terminates when the cumulative block coverage reaches value% of pre-computed total coverage. value must be greater than 0.0 (for example, 99.00). It may be set to 100.
    -nototal         Does not pre-compute the total coverage.
    -mintime         Minimizes testing execution time. The execution time of each test must be provided on the same line of the dpi_list file after the test name, in dd:hh:mm:ss format.
    -verbose         Generates more logging information about the program progress.
Usage Requirements
To run the Test-prioritization Tool on an application's tests, the following files are required:
    • The .spi file generated by the Intel compilers when compiling the application for the instrumented binaries with the -prof_genx option.
    • The .dpi files generated by the Intel compiler profmerge tool as a result of merging the dynamic profile information (.dyn) files of each of the application tests. The user needs to apply the profmerge tool to all .dyn files that are generated for each individual test and name the resulting .dpi in a fashion that uniquely identifies the test. The profmerge tool merges all the .dyn files that exist in the given directory.
Note: It is very important that you make sure
325. ntrinsics and their respective instructions provide memory and initialization operations for the Streaming SIMD Extensions 2. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
    • Load Operations
    • Set Operations
    • Store Operations
Integer Load Operations for Streaming SIMD Extensions 2
The following load operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2.
    __m128i _mm_load_si128(__m128i const *p)
        (uses MOVDQA) Loads a 128-bit value. Address p must be 16-byte aligned.
        r := *p
    __m128i _mm_loadu_si128(__m128i const *p)
        (uses MOVDQU) Loads a 128-bit value. Address p need not be 16-byte aligned.
        r := *p
    __m128i _mm_loadl_epi64(__m128i const *p)
        (uses MOVQ) Loads the lower 64 bits of the value pointed to by p into the lower 64 bits of the result, zeroing the upper 64 bits of the result.
        r0 := *p[63:0]
        r1 := 0x0
Integer Set Operations for Streaming SIMD Extensions 2
The following set operation intrinsics and their respective instructions are functional in the Streaming SIMD Extensions 2.
    __m128i _mm_set_epi64(__m64 q1, __m64 q0)
        Sets the 2 64-bit integer values.
        r0 := q0
        r1 := q1
    __m128i _mm_set
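As a rough illustration of the three load intrinsics in item 325, the sketch below moves 32-bit integers through an XMM register. The wrapper names are hypothetical, _mm_storeu_si128 (a store intrinsic from the same header) is used only to make the result observable, and the code assumes an x86 target with SSE2 enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical wrapper: aligned 128-bit load (MOVDQA).
// The caller must pass a 16-byte aligned source address.
void copy4_aligned(const std::int32_t* src, std::int32_t* dst) {
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}

// Hypothetical wrapper: unaligned 128-bit load (MOVDQU).
void copy4_unaligned(const std::int32_t* src, std::int32_t* dst) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}

// Hypothetical wrapper: loads only the low 64 bits (MOVQ);
// the upper 64 bits of the register are zeroed.
void load_low64(const std::int32_t* src, std::int32_t* dst) {
    __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
}
```

With a source of {1, 2, 3, 4}, the two copy helpers reproduce all four values, while load_low64 yields {1, 2, 0, 0}.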
326. ny optimization is attempted. To get the debug information, use the -g option. The compiler lets you generate code to support symbolic debugging while -O1, -O2, or -O3 is specified on the command line along with -g, which produces symbolic debug information in the object file. Note that if you specify the -O1, -O2, or -O3 option with the -g option, some of the debugging information returned may be inaccurate as a side effect of optimization. It is best to make your optimization and/or debugging choices explicit:
    • If you need to debug your program excluding any optimization effect, use the -O0 option, which turns off all the optimizations.
    • If you need to debug your program with optimization enabled, specify the -O1, -O2, or -O3 option on the command line along with -g.
Note: The -g option slows down the program when -O1, -O2, or -O3 is not specified. In this case -g turns on -O0, which is what slows the program down. If both -O2 and -g are specified, the code should run at nearly the same speed as if -g were not specified.
Refer to the table below for a summary of the effects of using the -g option with the optimization options.
    These options    Produce these results
    -g               Debugging information produced, -O0 enabled (optimizations disabled), -fp enabled for IA-32-targeted compilations
    -g -O1           Debugging information produced, -O1 optimizations enabled
    -g -O2           Debugging information produced, -O2 optimizations enabled
    -g
327. o b. The upper 32 bits of the result are forced to 0, and then bits [31:30] of b are copied to bits [62:61] of the result. The result is returned.
    __m64 _m64_pshradd2(__m64 a, const int count, __m64 b)
        The four signed 16-bit data elements of a are each independently shifted to the right by count bits (the high-order bits of each element are filled with the initial value of the sign bits of the data elements in a); they are then added to the four signed 16-bit data elements of b. The result is returned.
    __m64 _m64_padd1uus(__m64 a, __m64 b)
        a is added to b as eight separate byte-wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.
    __m64 _m64_padd2uus(__m64 a, __m64 b)
        a is added to b as four separate 16-bit wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.
    __m64 _m64_psub1uus(__m64 a, __m64 b)
        a is subtracted from b as eight separate byte-wide elements. The elements of a are treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word.
    __m64 _m64_psub2uus(__m64 a, __m64 b)
        a is subtracted from b as four separate 16-bit wide elements. The elements of a a
328. o effect. In other words, data is aligned to the maximum of its own alignment or the alignment specified with __declspec(align). You can request alignments for individual variables, whether of static or automatic storage duration. (Global and static variables have static storage duration; local variables have automatic storage duration by default.) You cannot adjust the alignment of a parameter, nor a field of a struct or class. You can, however, increase the alignment of a struct (or union or class), in which case every object of that type is affected. As an example, suppose that a function uses local variables i and j as subscripts into a 2-dimensional array. They might be declared as follows:
    int i, j;
These variables are commonly used together. But they can fall in different cache lines, which could be detrimental to performance. You can instead declare them as follows:
    __declspec(align(8)) struct { int i, j; } sub;
The compiler now ensures that they are allocated in the same cache line. In C++, you can omit the struct variable name (written as sub in the above example). In C, however, it is required, and you must write references to i and j as sub.i and sub.j. If you use many functions with such subscript pairs, it is more convenient to declare and use a struct type for them, as in the following example:
    typedef struct __declspec(align(8)) { int i, j; } Sub;
By placing the __declspec(align) after the keyword struct, you are reque
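The subscript-pair idea in item 328 can also be tried with the standard C++11 alignas specifier, which is swapped in here for the Intel-specific __declspec(align(8)) so the sketch compiles with any conforming compiler. The helper same_block is hypothetical, and the sketch assumes a 4-byte int.

```cpp
#include <cassert>
#include <cstdint>

// Standard C++ counterpart of a struct declared with 8-byte alignment:
// every object of type Sub starts on an 8-byte boundary.
struct alignas(8) Sub {
    int i;
    int j;
};

static_assert(alignof(Sub) == 8, "Sub should be 8-byte aligned");

// Hypothetical check: true when i and j lie inside the same block of
// block_size bytes (for example, one cache line).
bool same_block(const Sub& s, std::uintptr_t block_size) {
    assert(block_size != 0);
    const std::uintptr_t first =
        reinterpret_cast<std::uintptr_t>(&s.i) / block_size;
    const std::uintptr_t last =
        (reinterpret_cast<std::uintptr_t>(&s.j) + sizeof(int) - 1) / block_size;
    return first == last;
}
```

Because an 8-byte object on an 8-byte boundary cannot straddle any block whose size is a multiple of 8, same_block(s, 64) holds for every Sub object under these assumptions.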
329. o extern
    Description                                                                                  Default
    Space-separated symbols listed in the file argument will get visibility set to default.      OFF
    Space-separated symbols listed in the file argument will get visibility set to protected.    OFF
    Space-separated symbols listed in the file argument will get visibility set to hidden.       OFF
    Space-separated symbols listed in the file argument will get visibility set to internal.     OFF
    Ensure that string literals are placed in a writable data section.                           OFF
    Use this option to specify the location of g++ when the compiler cannot locate the gcc C++ libraries. For use with the -cxxlib-gcc configuration. Use this option when referencing a non-standard gcc installation.    OFF
    This option provides compatible behavior with gcc, where nnn indicates the gcc version. This version of the Intel compiler supports gcc version 320.    ON
    Option               Description
    -isystemdir          Add directory dir to the start of the system include path.
    -no-gcc              Do not predefine the __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros.
    -nostdinc            Same as -X.
    -pch                 Automatic processing for precompiled headers.
    -pch_dir dirname     Directs the compiler to find and/or create a file for pre-compiled headers in dirname.
    -prefetch            Enables d
    Remaining options listed on this page: -prof_format_32, -shared-libcxa, -static-libcxa, -strict_ansi, -T file, -use_pch filename, -Wbrief
330. oat
    Returns      Example Syntax Usage             Intrinsic
    Compare for Equality
    4 floats     F32vec4 R = cmpeq(F32vec4 A);    _mm_cmpeq_ps
    2 doubles    F64vec2 R = cmpeq(F64vec2 A);    _mm_cmpeq_pd
    1 float      F32vec1 R = cmpeq(F32vec1 A);    _mm_cmpeq_ss
    Compare for Inequality
    4 floats     F32vec4 R = cmpneq(F32vec4 A);   _mm_cmpneq_ps
    2 doubles    F64vec2 R = cmpneq(F64vec2 A);   _mm_cmpneq_pd
    1 float      F32vec1 R = cmpneq(F32vec1 A);   _mm_cmpneq_ss
    Compare for Less Than
    4 floats     F32vec4 R = cmplt(F32vec4 A);    _mm_cmplt_ps
    2 doubles    F64vec2 R = cmplt(F64vec2 A);    _mm_cmplt_pd
    1 float      F32vec1 R = cmplt(F32vec1 A);    _mm_cmplt_ss
    Compare for Less Than or Equal
    4 floats     F32vec4 R = cmple(F32vec4 A);    _mm_cmple_ps
    2 doubles    F64vec2 R = cmple(F64vec2 A);    _mm_cmple_pd
    1 float      F32vec1 R = cmple(F32vec1 A);    _mm_cmple_ss
    Compare for Greater Than
    4 floats     F32vec4 R = cmpgt(F32vec4 A);    _mm_cmpgt_ps
    2 doubles    F64vec2 R = cmpgt(F64vec2 A);    _mm_cmpgt_pd
    1 float      F32vec1 R = cmpgt(F32vec1 A);    _mm_cmpgt_ss
    Compare for Greater Than or Equal To
    4 floats     F32vec4 R = cmpge(F32vec4 A);    _mm_cmpge_ps
    2 doubles    F64vec2 R = cmpge(F64vec2 A);    _mm_cmpge_pd
    1 float      F32vec1 R = cmpge(F32vec1 A);    _mm_cmpge_ss
    Compare for Not Less Than
    4 floats     F32vec4 R = cmpnlt(F32vec4 A);   _mm_cmpnlt_ps
    2 doubles    F64vec2 R = cmpnlt(F64ve
331. oat 3 float f2 A2 EI float fl float 0 Aie
    F32vec4 A(float f0);     _mm_set1_ps
        A0 := f0; A1 := f0; A2 := f0; A3 := f0;
        Initializes all return values with the same floating-point value.
    F32vec4 A(double d0);    _mm_set1_ps(d)
        A0 := d0; A1 := d0; A2 := d0; A3 := d0;
        Initializes all return values with the same double-precision value.
    F32vec1 A(double d0);    _mm_set_ss(d)
        A0 := d0; A1 := 0; A2 := 0; A3 := 0;
        Initializes the lowest value of A with d0 and the other values with 0.
    F32vec1 B(float f0);     _mm_set_ss
        Initializes the lowest value of B with f0 and the other values with 0.
    F32vec1 B(int I);        _mm_cvtsi32_ss
        Initializes the lowest value of B with I; other values are undefined.
Arithmetic Operators
The following table lists the arithmetic operators of the Fvec classes and generic syntax. The operators have been divided into standard and advanced operations, which are described in more detail later in this section.
    Fvec Arithmetic Operators
    Category    Operation                              Operators    Generic Syntax
    Standard    Addition                               +  +=        R = A + B; R += A;
                Subtraction                            -  -=        R = A - B; R -= A;
                Multiplication                         *  *=        R = A * B; R *= A;
                Division                               /  /=        R = A / B; R /= A;
    Advanced    Square Root                            sqrt         R = sqrt(A);
                Reciprocal                             rcp          R = rcp(A);
                Reciprocal (Newton-Raphson)            rcp_nr       R = rcp_nr(A);
                Reciprocal Square Root                 rsqrt        R = rsqrt(A);
                Reciprocal Square Root (Newton-Raphson)  rsqrt_nr   R = rsqrt_nr(A);
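The table in item 331 pairs the reciprocal operators with Newton-Raphson variants but does not show how the refinement works. The sketch below is the textbook Newton-Raphson step for a reciprocal, x1 = x0 * (2 - a * x0), written as a scalar function; the name refine_reciprocal is hypothetical, and this is not claimed to be what rcp_nr does internally.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: one Newton-Raphson step toward 1/a, starting
// from a rough estimate x0. If the relative error of x0 is e, the
// relative error of the result is about e * e.
float refine_reciprocal(float a, float x0) {
    assert(a != 0.0f);
    return x0 * (2.0f - a * x0);
}
```

Starting from the rough estimate 0.33 for 1/3, a single step lands much closer to the true value, which is why a cheap low-precision estimate followed by one refinement step is a common trade-off.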
332. occurs the most hits an aligned address. This makes the alignment properties of a known, and the vector loop is optimized accordingly.
    Example of Alignment Strategies
    float *a;
    // Alignment unknown
    for (i = 0; i < 100; i++) { a[i] = a[i] + 1.0f; }
    // Dynamic loop peeling
    p = a & 0x0f;
    if (p != 0) {
        p = (16 - p) / 4;
        for (i = 0; i < p; i++) { a[i] = a[i] + 1.0f; }
    }
    // Loop with a aligned. Will be vectorized accordingly.
    for (i = p; i < 100; i++) { a[i] = a[i] + 1.0f; }
novector Directive
The novector directive specifies that the loop should never be vectorized, even if it is legal to do so. In this example, suppose you know the trip count (ub - lb) is too low to make vectorization worthwhile. You can use novector to tell the compiler not to vectorize, even if the loop is considered vectorizable.
    Example of novector Directive
    void foo(int lb, int ub)
    {
        #pragma novector
        for (j = lb; j < ub; j++) { a[j] = a[j] + b[j]; }
    }
Optimizer Report Generation
The Intel C++ Compiler provides options to generate and manage optimization reports:
    • -opt_report generates an optimization report and directs it to stderr. By default, the compiler does not generate optimization reports.
    • -opt_report_file filename generates an optimization report and directs it to a file specified in filename.
    • -opt_report_level {min | med | max} specifies the detail level of the optimization report. The min argum
333. of FP registers, and therefore each intrinsic corresponds to at least one pair of Itanium instructions operating on the pair of FP register operands.
Compatibility versus Performance
Many of the Streaming SIMD Extensions intrinsics for Itanium-based systems were created for compatibility with existing IA-32 intrinsics and not for performance. In some situations, intrinsic usage that improved performance on IA-32 will not do so on Itanium-based systems. One reason for this is that some intrinsics map nicely into the IA-32 instruction set but not into the Itanium instruction set. Thus, it is important to differentiate between intrinsics which were implemented for a performance advantage on Itanium-based systems and those implemented simply to provide compatibility with existing IA-32 code. The following intrinsics are likely to reduce performance and should only be used to initially port legacy code or in non-critical code sections:
    • Any Streaming SIMD Extensions scalar intrinsic (_ss variety): use the packed (_ps) version if possible.
    • comi and ucomi Streaming SIMD Extensions comparisons: these correspond to the IA-32 COMISS and UCOMISS instructions only. A sequence of Itanium instructions is required to implement these.
    • Conversions in general are multi-instruction operations. These are particularly expensive: _mm_cvtpi16_ps, _mm_cvtpu16_ps, _mm_cvtpi8_ps, _mm_cvtpu8_ps, _mm_cvtpi32x2_ps, _mm_cvtps_pi16, _mm_cvtps_pi8
    • Streaming SIMD Extensions
334. of profile-guided optimizations. Normally, profile information is generated by an instrumented application when it terminates by calling the standard exit() function. The functions described in this section may be necessary in assuring that profile information is generated in the following situations:
    • when the instrumented application exits using a non-standard exit routine
    • when the instrumented application is a non-terminating application, where exit() is never called
    • when you want control of when the profile information is generated
This section includes descriptions of the functions and environment variable that comprise Profile Information Generation Support. The functions are available by inserting #include <pgouser.h> at the top of any source file where the functions may be used. The compiler sets a define for _PGO_INSTRUMENT when you compile with either -prof_gen or -prof_genx.
Dumping Profile Information
    void _PGOPTI_Prof_Dump(void);
Description: This function dumps the profile information collected by the instrumented application. The profile information is recorded in a .dyn file.
Recommended Usage: Insert a single call to this function in the body of the function which terminates your application. Normally, _PGOPTI_Prof_Dump should be called just once. It is also possible to use this function in conjunction with _PGOPTI_Prof_Reset to generate multiple .dyn files (presumably from multiple sets of input data).
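One way to follow the recommended usage in item 334 is to guard the call with the _PGO_INSTRUMENT define so the same source still builds without instrumentation. This is a sketch under assumptions: the wrapper dump_profile_once and the counter g_dump_requests are hypothetical names added only to make the behavior observable, and only the header, macro, and _PGOPTI_Prof_Dump come from the text above.

```cpp
#include <cassert>

#ifdef _PGO_INSTRUMENT
#include <pgouser.h>  // declares _PGOPTI_Prof_Dump(); only present for instrumented builds
#endif

// Hypothetical counter: number of times a dump was actually requested.
static int g_dump_requests = 0;

// Hypothetical wrapper: call this on any path that leaves the program
// without going through the standard exit() function. Repeated calls
// are ignored, so the profile is dumped at most once.
void dump_profile_once() {
    static bool done = false;
    if (done) return;
    done = true;
    ++g_dump_requests;
    assert(g_dump_requests == 1);
#ifdef _PGO_INSTRUMENT
    _PGOPTI_Prof_Dump();  // writes the collected profile to a .dyn file
#endif
}
```

In a build without instrumentation the wrapper only updates the counter, so call sites do not need their own conditional compilation.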
335. of the mathematical functions.
-long_double Option
Use -long_double to change the size of the long double type to 80 bits. The Intel compiler's default long double type is 64 bits in size, the same as the double type. This option introduces a number of incompatibilities with other files compiled without this option and with calls to library routines. Therefore, Intel recommends that the use of long double variables be local to a single file when you compile with this option.
-prec_div Option
With some optimizations, such as -xK and -xW, the Intel C++ Compiler changes floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A x (1/B) to improve the speed of the computation. However, for values of B greater than 2^126, the value of 1/B is flushed (changed) to 0. When it is important to maintain the value of 1/B, use -prec_div to disable the floating-point division-to-multiplication optimization. The result of -prec_div is greater accuracy with some loss of performance.
-pcn Option
Use the -pcn option to enable floating-point significand precision control. Some floating-point algorithms are sensitive to the accuracy of the significand or fractional part of the floating-point value. For example, iterative operations like division and finding the square root can run faster if you lower the precision with the -p
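The accuracy trade-off behind the division-to-multiplication optimization in item 335 can be seen by writing both forms out by hand. The function names below are hypothetical, and the sketch assumes the compiler evaluates float expressions in single precision without fast-math style rewrites.

```cpp
#include <cassert>
#include <cmath>

// Plain division: the quotient is rounded once.
float divide(float a, float b) {
    assert(b != 0.0f);
    return a / b;
}

// Division replaced by multiplication with the reciprocal: 1/b is
// rounded first and the product is rounded again, so the result can
// differ from a / b in the last bit.
float divide_via_reciprocal(float a, float b) {
    assert(b != 0.0f);
    const float r = 1.0f / b;
    return a * r;
}
```

For ordinary operands the two results agree to within about one unit in the last place; the larger loss described in the text arises only when 1/B itself falls below the normal range.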
336. ofile summary information. By default, this prevents you from:
    • using the profile summary file (.dpi) if you move your application sources
    • sharing the profile summary file with another user who is building identical application sources that are located in a different directory
Source Relocation
To enable the movement of application sources, as well as the sharing of profile summary files, use profmerge with the -src_old and -src_new options. For example:
    prompt> profmerge -prof_dir <p1> -src_old <p2> -src_new <p3>
where:
    • <p1> is the full path to the dynamic information file (.dpi)
    • <p2> is the old full path to source files
    • <p3> is the new full path to source files
The above command will read the pgopti.dpi file. For each function represented in the pgopti.dpi file whose source path begins with the <p2> prefix, profmerge replaces that prefix with <p3>. The pgopti.dpi file is updated with the new source path information. You can execute profmerge more than once on a given pgopti.dpi file. You may need to do this if the source files are located in multiple directories. For example:
    prompt> profmerge -prof_dir -src_old /src/prog_1 -src_new /src/prog_2
    prompt> profmerge -prof_dir -src_old /proj_1 -src_new /proj_2
In the values specified for -src_old and -src_new, uppercase and lowercase characters are treated as identical. Likewise, forward slash and backward slash characters are t
337. old more than one data element, the processor can process more than one data element simultaneously. This processing capability is also known as single-instruction multiple data processing (SIMD). For each computational and data manipulation instruction in the new extension sets, there is a corresponding C intrinsic that implements that instruction directly. This frees you from managing registers and assembly programming. Further, the compiler optimizes the instruction scheduling so that your executable runs faster.
Note: The MM and XMM registers are the SIMD registers used by the IA-32 platforms to implement MMX technology and Streaming SIMD Extensions / Streaming SIMD Extensions 2 intrinsics. On the Itanium-based platforms, the MMX and Streaming SIMD Extension intrinsics use the 64-bit general registers and the 64-bit significand of the 80-bit floating-point register.
Data Types
Intrinsic functions use four new C data types as operands, representing the new registers that are used as the operands to these intrinsic functions. The table below shows the new data type availability, marked with X.
    Data Types Available
    Columns: New Data Type, MMX Technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, Itanium Processor
    __m64     X    X    X
    __m128    N/A  X    X
    __m128d   N/A  N/A  X
    __m128i   N/A  N/A  X    X
__m64 Data Type
The __m64 data typ
338. ollows:
    -par_report0    no diagnostic information is displayed.
    -par_report1    indicates loops successfully auto-parallelized (default). Issues a "LOOP AUTO-PARALLELIZED" message for parallel loops.
    -par_report2    indicates successfully auto-parallelized loops as well as unsuccessful loops.
    -par_report3    same as 2, plus additional information about any proven or assumed dependencies inhibiting auto-parallelization (reasons for not parallelizing).
Example of Parallelization Diagnostics Report
The example below shows output generated by -par_report3:
    prompt> icpc -c -parallel -par_report3 prog.cpp
    Sample Output
    program prog
    procedure: prog
    serial loop: line 5: not a parallel candidate due to statement at line 6
    serial loop: line 9
        flow data dependence from line 10 to line 10, due to "a"
    12 Lines Compiled
where the program prog.cpp is as follows:
    Sample prog.cpp
    // Assumed side effects
    for (i = 1; i < 10000; i++) { foo(i); }
    // Actual dependence
    for (i = 1; i < 10000; i++) { a[i] = a[i-1] + i; }
Troubleshooting Tips
    • Use -par_threshold0 to see if the compiler assumed there was not enough computational work.
    • Use -par_report3 to view diagnostics.
    • Use -ipo to eliminate assumed side effects done to function calls.
Parallelization with OpenMP
The Intel C++ Compiler supports the OpenMP C++ version 2.0 API specification. OpenMP provid
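The "flow data dependence" reported in item 338 means that an iteration reads a value written by an earlier iteration, so the iterations cannot safely run in any other order. The sketch below makes that visible by running a dependent loop forward and in reverse; all function names are hypothetical and the loop body is only modeled on the sample program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Loop-carried flow dependence: iteration i reads a[i-1], which was
// written by iteration i-1.
std::vector<long> fill_dependent(std::size_t n) {
    std::vector<long> a(n, 0);
    for (std::size_t i = 1; i < n; ++i) a[i] = a[i - 1] + static_cast<long>(i);
    return a;
}

// The same statement with the iterations executed in reverse order,
// standing in for iterations that run out of order in parallel.
std::vector<long> fill_dependent_reversed(std::size_t n) {
    std::vector<long> a(n, 0);
    for (std::size_t i = n; i-- > 1;) a[i] = a[i - 1] + static_cast<long>(i);
    return a;
}

// Independent iterations: each element depends only on its own index,
// so any execution order gives the same result.
std::vector<long> fill_independent(std::size_t n) {
    std::vector<long> a(n, 0);
    for (std::size_t i = 0; i < n; ++i) a[i] = 2 * static_cast<long>(i);
    return a;
}
```

For n = 5 the forward loop produces running sums ending in 10, while the reversed order ends in 4, which is the kind of discrepancy the auto-parallelizer avoids by keeping such loops serial.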
339. ompare for greater than rO a0 gt b0 Oxffffffff 0x0 rl al r2 a2 r3 a3 m128 mm cmpgt ps m128 a m128 b Compare for greater than r0 a0 gt bO Oxffffffff 0x0 rl al gt bl Oxffffffff 0x0 r2 a2 gt b2 Oxffffffff 0x0 r3 a3 gt b3 Oxffffffff 0x0 m128 mm cmpge ss m128 a m128 b Compare for greater than or equal r0 a0 gt b0 Oxffffffff 0x0 rl al r2 a2 r3 a3 m128 mm cmpge ps m128 a m128 b Compare for greater than or equal r0 a0 gt b0 Oxffffffff 0x0 rl al gt bl Oxffffffff 0x0 r2 a2 gt b2 Oxffffffff 0x0 r3 a3 gt b3 Oxffffffff 0x0 m128 mm cmpneq ss m128 a m128 Di Compare for inequality rO a0 b0 Oxffffffff 0x0 rl al r2 a2 r3 a3 m128 mm cmpneq ps m128 a m128 b 228 Compare for inequality r0 ze a0 b Oxffffffff rl al bl Oxffffffff r2 a2 b2 Oxffffffff r3 a3 b3 Oxffffffff 0x0 0x0 0x0 0x0 Intel C Intrinsics Reference __m128 mm cmpnlt ss m128 a __m128 Compare for not less than rO a0 lt bO Oxf fffffff rl al r2 a2 r3 a3 m128 mm cmpnlt ost m128 a m128 Compare for not less than rO a0 lt b0 Oxffffffff rl al lt bl Oxffffffff r2 a2 lt b2 Oxffffffff r3 ze l a3 lt b3 Oxffffftft m128 mm cmpnle ss m128 a m128 Compare for not less than or equal rO a0 lt
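The difference between the packed (_ps) and scalar (_ss) compare results in item 339 can be modeled in plain C++ without any intrinsics. The two functions below are hypothetical models of the documented pseudo-code, operating on ordinary arrays, and assume a 32-bit float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Model of the packed form: every lane becomes an all-ones or
// all-zeros mask depending on its own comparison.
void cmpgt_ps_model(const float a[4], const float b[4], std::uint32_t r[4]) {
    assert(a != nullptr && b != nullptr && r != nullptr);
    for (int k = 0; k < 4; ++k) {
        r[k] = (a[k] > b[k]) ? 0xffffffffu : 0x0u;
    }
}

// Model of the scalar form: only lane 0 is compared; lanes 1..3 carry
// the bit patterns of a through unchanged.
void cmpgt_ss_model(const float a[4], const float b[4], std::uint32_t r[4]) {
    assert(a != nullptr && b != nullptr && r != nullptr);
    r[0] = (a[0] > b[0]) ? 0xffffffffu : 0x0u;
    std::memcpy(&r[1], &a[1], 3 * sizeof(float));
}
```

The all-ones and all-zeros lanes are meant to be combined with the bitwise AND, ANDNOT, and OR operations to select values without branching.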
340. on, so Pentium 4 processor Streaming SIMD Extensions 2 are not implemented on Itanium-based systems. For more details, refer to the Pentium 4 processor Streaming SIMD Extensions 2 External Architecture Specification (EAS) and other Pentium 4 processor manuals available for download from the developer.intel.com web site. You should be familiar with the hardware features provided by the Streaming SIMD Extensions 2 when writing programs with the intrinsics. The following are three important issues to keep in mind:
    • Certain intrinsics, such as _mm_loadr_pd and _mm_cmpgt_sd, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful of their implementation cost.
    • Data loaded or stored as __m128d objects must generally be 16-byte aligned.
    • Some intrinsics require that their argument be immediates, that is, constant integers (literals), due to the nature of the instruction.
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
Floating-point Arithmetic Operations for Streaming SIMD Extensions 2
The arithmetic operations for the Streaming SIMD Extensions 2 are listed in the following table and are followed by descriptions of each intrinsic.
    Table columns: Intrinsic, Corresponding Instruction, Operation, R0 Value, R1 Value
341. on and to display and check compilation errors. The options in this section describe:
    • Parsing for Syntax Only
    • Optimizations and Debugging
Parsing for Syntax Only
Use the -syntax option to stop processing source files after they have been parsed for C++ language errors. This option provides a method to quickly check whether sources are syntactically and semantically correct. The compiler creates no output file. In the following example, the compiler checks prog.cpp and displays diagnostic information to the standard error output:
    prompt> icpc -syntax prog.cpp
Optimizations and Debugging
This topic describes the command-line options that you can use to debug your compilation and to display and check compilation errors. The options that enable you to get debug information while optimizing are as follows:
    Option    Description
    -O0       Disables optimizations. Enables the -fp option.
    -g        Generates symbolic debugging information and line numbers in the object code for use by the source-level debuggers. Turns off -O2 and makes -O0 the default unless -O1, -O2, or -O3 is explicitly specified in the command line together with -g.
    -fp       Disables using the EBP register as a general-purpose register.
    Option              Effect on -fp
    -O1, -O2, or -O3    Disables -fp
    -O0                 Enables -fp
Combining Optimization and Debugging
The -O0 option turns off all optimizations so you can debug your program before a
342. onding intrinsic: _mm_min_ps
Compute the minimum of the lowest single-precision floating-point values of A and B.
    F32vec1 R = simd_min(F32vec1 A, F32vec1 B)
    R0 := min(A0, B0);
    Corresponding intrinsic: _mm_min_ss
Compute the maximums of the two double-precision floating-point values of A and B.
    F64vec2 R = simd_max(F64vec2 A, F64vec2 B)
    R0 := max(A0, B0); R1 := max(A1, B1);
    Corresponding intrinsic: _mm_max_pd
Compute the maximums of the four single-precision floating-point values of A and B.
    F32vec4 R = simd_max(F32vec4 A, F32vec4 B)
    R0 := max(A0, B0); R1 := max(A1, B1); R2 := max(A2, B2); R3 := max(A3, B3);
    Corresponding intrinsic: _mm_max_ps
Compute the maximum of the lowest single-precision floating-point values of A and B.
    F32vec1 R = simd_max(F32vec1 A, F32vec1 B)
    R0 := max(A0, B0);
    Corresponding intrinsic: _mm_max_ss
Logical Operators
The table below lists the logical operators of the Fvec classes and generic syntax. The logical operators for F32vec1 classes use only the lower 32 bits.
    Fvec Logical Operators Return Value Mapping
    Bitwise Operation    Operators    Generic Syntax
    AND                  &  &=        R = A & B; R &= A;
    OR                   |  |=        R = A | B; R |= A;
    XOR                  ^  ^=        R = A ^ B; R ^= A;
    andnot               andnot       R = andnot(A, B);
The following table lists standard logical operators syntax and corresponding intrinsics. Note that there is no corresponding scalar intrinsic for the F32vec1 classes, which accesses the lower 32 bits of the packed vector intrinsics.
Logical Operati
343. ons.
    Calling interface:
        double expm1(double x);
        long double expm1l(long double x);
        float expm1f(float x);
    Description: The frexp function converts a floating-point number x into a signed normalized fraction in [1/2, 1) multiplied by an integral power of two. The signed normalized fraction is returned, and the integer exponent is stored at location exp.
    Calling interface:
        double frexp(double x, int *exp);
        long double frexpl(long double x, int *exp);
        float frexpf(float x, int *exp);
    Description: The hypot function returns the square root of (x^2 + y^2).
    errno: ERANGE, for overflow conditions
    Calling interface:
        double hypot(double x, double y);
        long double hypotl(long double x, long double y);
        float hypotf(float x, float y);
ILOGB
    Description: The ilogb function returns the exponent of x base two as a signed int value.
    errno: ERANGE, for x = 0
    Calling interface:
        int ilogb(double x);
        int ilogbl(long double x);
        int ilogbf(float x);
LDEXP
    Description: The ldexp function returns the value x * 2^exp, where exp is an integer.
    errno: ERANGE, for underflow and overflow conditions
    Calling interface:
        double ldexp(double x, int exp);
        long double ldexpl(long double x, int exp);
        float ldexpf(float x, int exp);
LOG
    Description: The log function returns the natural log of x, ln(x). This function may be inlined by the Itanium co
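The relationship between frexp and ldexp in item 343 is easiest to see as a round trip. The sketch below uses the standard <cmath> versions of these functions rather than anything specific to the Intel Math library; the Decomposed type and the split and join helpers are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pair holding the two outputs of frexp.
struct Decomposed {
    double fraction;  // signed, magnitude in [1/2, 1) for nonzero finite x
    int exponent;     // power of two
};

// Splits x so that x == fraction * 2^exponent.
Decomposed split(double x) {
    int e = 0;
    const double f = std::frexp(x, &e);
    return Decomposed{f, e};
}

// Reassembles the value with ldexp, the inverse operation.
double join(const Decomposed& d) {
    const double x = std::ldexp(d.fraction, d.exponent);
    assert(!std::isnan(x) || std::isnan(d.fraction));
    return x;
}
```

For example, split(8.0) gives a fraction of 0.5 and an exponent of 4, and join restores 8.0; ilogb(8.0) reports 3 because it uses a fraction normalized to [1, 2) instead.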
344. ons Cross Reference
    Default Compiler Options
    Building and Debugging Applications
        Getting KE WEE
        Building Applications from the Command Line
        elle Elei Bleu TEE
        MAKING EcL
        RT e ee Late eL
        Using E
        Default Libraries
        Intel Shared Libraries
        Managing Libraries
        Compiling for Non-shared Libraries
        gcc Compatibility
        gcc Interoperability
        gee BUN ipee
        ele later le nde
    Language Conformance
        Conformance Options
        Conformance to the C Standard
        Conformance to the C++ Standard
    Compiler EE
        Optimizat M DEE
        Floating-point Optimizations
        Optimizing for Specific Processors
        Interprocedural Optimizations
        Multifile IPO
        Inline Expansion of Functions
        Profile-guided Optimizations
        High-level Language Optimizations (HLO)
        EE sara pie
        Vectorization (IA-32 only)
        Auto Parallelization
        Parallelization with OpenMP
345. ons for Fvec Classes
    Operation    Returns      Example Syntax Usage                                                     Intrinsic
    AND          4 floats     F32vec4 R = F32vec4 A & F32vec4 B; F32vec4 R &= F32vec4 A;               _mm_and_ps
                 2 doubles    F64vec2 R = F64vec2 A & F64vec2 B; F64vec2 R &= F64vec2 A;               _mm_and_pd
                 1 float      F32vec1 R = F32vec1 A & F32vec1 B; F32vec1 R &= F32vec1 A;               _mm_and_ps
    OR           4 floats     F32vec4 R = F32vec4 A | F32vec4 B; F32vec4 R |= F32vec4 A;               _mm_or_ps
                 2 doubles    F64vec2 R = F64vec2 A | F64vec2 B; F64vec2 R |= F64vec2 A;               _mm_or_pd
                 1 float      F32vec1 R = F32vec1 A | F32vec1 B; F32vec1 R |= F32vec1 A;               _mm_or_ps
    XOR          4 floats     F32vec4 R = F32vec4 A ^ F32vec4 B; F32vec4 R ^= F32vec4 A;               _mm_xor_ps
                 2 doubles    F64vec2 R = F64vec2 A ^ F64vec2 B; F64vec2 R ^= F64vec2 A;               _mm_xor_pd
                 1 float      F32vec1 R = F32vec1 A ^ F32vec1 B; F32vec1 R ^= F32vec1 A;               _mm_xor_ps
    ANDNOT       2 doubles    F64vec2 R = andnot(F64vec2 A, F64vec2 B);                                _mm_andnot_pd
Compare Operators
The operators described in this section compare the single-precision floating-point values of A and B. Comparison between objects of any Fvec class returns the same class being compared. The following table lists the compare operators for the Fvec classes.
    Compare Operators and Corresponding Intrinsics
    Compare For    Operators    Syntax
    Equality       cmpeq        R = cmpeq(A, B)
    Inequality     cmpneq       R = cmpne
346. ontrol
The -par_threshold{n} option sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel. The value of n can be from 0 to 100. The default value is 75. This option is used for loops whose computation work volume cannot be determined at compile time. The threshold is usually relevant when the loop trip count is unknown at compile time. The -par_threshold{n} option has the following versions and functionality:
    • Default: -par_threshold is not specified in the command line, which is the same as when -par_threshold0 is specified. The loops get auto-parallelized regardless of computation work volume, that is, parallelize always.
    • -par_threshold100: loops get auto-parallelized only if profitable parallel execution is almost certain.
    • The intermediate 1 to 99 values represent the percentage probability for profitable speed-up. For example, n=50 would mean: parallelize only if there is a 50% probability of the code speeding up if executed in parallel.
    • The default value of n is n=75 (or -par_threshold75). When -par_threshold is used on the command line without a number, the default value passed is 75.
The compiler applies a heuristic that tries to balance the overhead of creating multiple threads versus the amount of work available to be shared amongst the threads.
Diagnostics
The -par_report{0|1|2|3} option controls the auto-parallelizer's diagnostic levels 0, 1, 2, or 3 as f
347. ools vec_report n Controls the amount of ON Per vectorizer diagnostic vec reportl information e n 0 no diagnostic information e n l indicates vectorized loops DEFAULT e n 2 indicates vectorized non vectorized loops e n 3 indicates vectorized non vectorized loops and prohibiting data dependence information e n 4 indicates non vectorized loops e n 5 indicates non vectorized loops and prohibiting data W Disable all warnings OFF Wall Enable all warnings OFF Wbrief Enable a mode in which a OFF shorter form of the diagnostic output is used When enabled the original source line is not displayed and the error message text is not wrapped when too long to fit on a single line 28 Compiler Options Quick Reference Option Description Default Wcheck Performs compile time code OFF checking for code that exhibits non portable behavior represents a possible unintended code sequence or possibly affects operation of the program because of a quiet change in the ANSI C Standard wn Control diagnostics ON e n 0 displays errors wl same as w e n 1 displays warnings and errors DEFAULT e n 2 displays remarks warnings and errors wdLl1 L2 Disables diagnostics L1 through OFF LN weLl L2 Changes severity of diagnostics OFF L1 through LN to error Werror Force warnings to be reported as OFF errors Limits the number of errors ON displaye
348. or specific compilation options used. For more information, see Criteria for Inline Expansion of Functions. A change of the default precision control or rounding mode may affect the results returned by some of the mathematical functions. See Floating-point Arithmetic Precision. Depending on the data types used, some important compiler options include:
• -long_double: Use this option when compiling programs that require support for the long double data type (80-bit floating point). Without this option, compilation will be successful, but long double data types will be mapped to double data types.
• -c99: Use this option when compiling programs that require support for Complex data types.
Math Functions
Trigonometric Functions
The Intel Math library supports the following trigonometric functions: ACOS, ACOSD, ASIN, ASIND, ATAN
ACOS
Description: The acos function returns the principal value of the inverse cosine of x in the range [0, pi] radians for x in the interval [-1, 1]. errno: EDOM, for |x| > 1.
Calling interface:
double acos(double x);
long double acosl(long double x);
float acosf(float x);
ACOSD
Description: The acosd function returns the principal value of the inverse cosine of x in the range [0, 180] degrees for x in the interval [-1, 1]. errno: EDOM, for |x| > 1.
Calling interface:
double acosd(double x);
long double acosdl(long double x);
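The relationship between the radian and degree variants can be sketched with the standard library alone. The helper `acosd_sketch` below is hypothetical; it is not the Intel Math library's acosd, only a portable approximation of the behavior described above for x in [-1, 1].

```cpp
#include <cassert>
#include <cmath>

// Hypothetical portable stand-in for a degree-valued inverse cosine.
// Assumes x is in [-1, 1]; out-of-range inputs are not handled here.
double acosd_sketch(double x) {
    const double pi = std::acos(-1.0);  // acos(-1) is pi radians
    return std::acos(x) * 180.0 / pi;   // convert radians to degrees
}
```

For example, `acosd_sketch(0.0)` is approximately 90, matching the [0, 180] degree range given for acosd.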
349. os specified by the ISO ANSI standard are not listed in the table For a list of all macro definitions in effect use the E aM options For example prompt gt icpe E dM progl cpp Macro Name Value Architecture DATE Current date Both ECC 1 Itanium architecture only EDG 1 Both EDG VERSION 302 Both ___ELF 1 Both extension no value Both gnu linux 1 Both __GNUC__ 3 Both GNUC_MINOR 2 Both __GNUC_PATCHLEVEL 0 Both ___GXX_ABI_VERSION 102 Both HONOR STD 1 IA 32 only 1386 1 IA 32 only 1386 1 IA 32 only i386 1 IA 32 only __ia64 1 Itanium architecture only __ia64 1 Itanium architecture only ia64 1 Itanium architecture only Ice 800 IA 32 only 46 Building and Debugging Applications Macro Name Value Architecture __INTEL COMPILER 800 _INTEGRAL_MAX_BITS 64 Itanium architecture only itanium 1 Itanium architecture only __linux 1 linux 1 linux 1 Both __LONG_DOUBLE_SIZE__ 80 IA 32 only __1p64 1 Itanium architecture only LP64 1 Itanium architecture only _LP64 1 Itanium architecture only __NO_INLINE__ 1 Both __NO_MATH_INLINES 1 Both __NO_STRING_INLINES 1 Both OPTIMIZE 1 Both PGO INSTRUMENT 1 Both __PTRDIFE TYPE int Both on IA 32 long on Itanium architecture __OMSPP_ 1 IA 32 only REGISTER PREFIX no value Both SIGNED CHARS 1 Both SIZE_TYPE unsigned Both on IA 32 unsigned long on Itanium architectu
350. ov prj Project Name comp component1 Elte Each line of the component file should include one, and only one, module name. Any module of the application whose full path name has an occurrence of any of the names in the component file will be selected for coverage analysis. For example, if a line of file component1 in the above example contains mod1.cpp, then all modules in the application that have such a name will be selected. The user can specify a particular module by giving more specific path information. For instance, if the line contains /cmp1/mod1.cpp, then only those modules with the name mod1.cpp will be selected that are in a directory named cmp1. If no component file is specified, then all files that have been compiled with -prof_genx are selected for coverage analysis.
Dynamic Counters
This feature displays the dynamic execution count of each basic block of the application, providing useful information for both coverage and performance tuning. The coverage tool can be configured to generate information about dynamic execution counts. This configuration requires the -counts option. The counts information is displayed under the code after a sign, precisely under the source position where the corresponding basic block begins. If more than one basic block is generated for the code at a source position (macros, for example), then the total number of such blocks and
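The component-file selection rule described above can be modeled in a few lines. This is only a sketch under stated assumptions: "an occurrence of the name" is treated as a plain substring match, the function name `selected_for_coverage` and the sample paths are hypothetical, and the case where no component file is given at all is not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if the module's full path contains any of the names read
// from a component file (one name per line). Assumption: substring match.
bool selected_for_coverage(const std::string& module_path,
                           const std::vector<std::string>& component_names) {
    for (const std::string& name : component_names) {
        if (!name.empty() && module_path.find(name) != std::string::npos) {
            return true;
        }
    }
    return false;
}
```

Under this model an entry of `mod1.cpp` selects every module with that name, while `/cmp1/mod1.cpp` selects only the one in a directory named cmp1.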
351. p, __m128d a)
(uses MOVAPD) Stores two DP FP values. The address dp must be 16-byte aligned.
dp[0] := a0
dp[1] := a1
void _mm_storeu_pd(double *dp, __m128d a)
(uses MOVUPD) Stores two DP FP values. The address dp need not be 16-byte aligned.
dp[0] := a0
dp[1] := a1
void _mm_storer_pd(double *dp, __m128d a)
(uses MOVAPD + shuffling) Stores two DP FP values in reverse order. The address dp must be 16-byte aligned.
dp[0] := a1
dp[1] := a0
void _mm_storeh_pd(double *dp, __m128d a)
(uses MOVHPD) Stores the upper DP FP value of a.
*dp := a1
void _mm_storel_pd(double *dp, __m128d a)
(uses MOVLPD) Stores the lower DP FP value of a.
*dp := a0
Miscellaneous Operations for Streaming SIMD Extensions 2
The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file.
__m128d _mm_unpackhi_pd(__m128d a, __m128d b)
(uses UNPCKHPD) Interleaves the upper DP FP values of a and b.
r0 := a1
r1 := b1
__m128d _mm_unpacklo_pd(__m128d a, __m128d b)
(uses UNPCKLPD) Interleaves the lower DP FP values of a and b.
r0 := a0
r1 := b0
int _mm_movemask_pd(__m128d a)
(uses MOVMSKPD) Creates a two-bit mask from the sign bits of the two DP FP values of a.
r := sign(a1) << 1 | sign(a0)
__m128d _mm_shuffle_pd(__m128d a, __m128d b, int i)
(uses SHUFPD) Selects two specific DP FP values from a and b, based on the mask i. The mask must be an immediate. See Macro Function for Shuffle for a description of the shuffle semantics.
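The element-wise semantics listed above for the unpack and movemask operations can be checked with a scalar model. The sketch below deliberately avoids the real intrinsics so it runs on any platform; the `Pair` struct and the `*_model` function names are hypothetical stand-ins, with `lo` playing the role of a0 and `hi` the role of a1.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a two-element DP FP vector: lo = a0, hi = a1.
struct Pair {
    double lo;
    double hi;
};

// Models r := sign(a1) << 1 | sign(a0), using the IEEE sign bit.
int movemask_pd_model(Pair a) {
    return (std::signbit(a.hi) ? 2 : 0) | (std::signbit(a.lo) ? 1 : 0);
}

// Models r0 := a1, r1 := b1.
Pair unpackhi_pd_model(Pair a, Pair b) { return Pair{a.hi, b.hi}; }

// Models r0 := a0, r1 := b0.
Pair unpacklo_pd_model(Pair a, Pair b) { return Pair{a.lo, b.lo}; }
```

Note that the mask reflects the sign bit, so a negative zero in the upper element still sets bit 1.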
352. p, sp, al, bl, cl, dl, ah, bh, ch, dh, st, st(1) - st(7), mm0 - mm7, xmm0 - xmm7, and cc. It is also legal to specify "memory" in a clobber-spec. This prevents the compiler from keeping data cached in registers across the asm statement.
Intrinsics Cross-processor Implementation
This section provides a series of tables that compare intrinsics performance across architectures. Before implementing intrinsics across architectures, please note the following:
• Intrinsics may generate code that does not run on all IA processors. Therefore, the programmer is responsible for using CPUID to detect the processor and generating the appropriate code.
• Implement intrinsics by processor family, not by specific processor. The guiding principle for which family (IA-32 or Itanium processors) the intrinsic is implemented on is performance, not compatibility. Where there is added performance on both families, the intrinsic will be identical.
Intrinsics For Implementation Across All IA
The following intrinsics provide significant performance gain over a non-intrinsic-based code equivalent.
int abs(int)
long labs(long)
float logf(float)
double fabs(double)
double log(double)
unsigned long _lrotl(unsigned long value, int shift)
unsigned long _lrotr(unsigned long value, int shift)
unsigned int _rotl(unsigned int value, int shift)
unsigned int _rotr(unsigned int value, int shift)
353. pd Storeh pd Storel pd add epi8 add epil6 add epi32 add si64 add epi64 adds epi8 adds epil6 adds epu8 adds epu16 avg epu8 Across MMX All IA N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A Streaming Streaming SIMD Extenions Extensions 2 N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A N A A Itanium amp Architecture N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A 327 Intel C Compiler for Linux Systems User s Guide Intrinsic Across MMX Streaming Streaming Itanium amp AIl IA Technology SIMD SIMD Architecture Extenions Extensions mm avg epul6 N A N A N A N A mm madd epil6 N A N A N A N A mm max epil N A N A N A N A mm max epu8 N A N A N A N A mm min epil6 N A N A N A N A mm min epu8 N A N A N A N A mm mulhi epil6 N A N A N A N A mm mulhi epul6 N A N A N A N A mm mullo epil6 N A N A N A N A mm mul su32 N A N A N A N A mm mul epu32 N A N A
354. pe data dependence and ambiguity resolution.
Language Support
Feature: Description
__declspec(align(n)): Directs the compiler to align the variable to an n-byte boundary. Address of the variable is address mod n = 0.
__declspec(align(n,off)): Directs the compiler to align the variable to an n-byte boundary with offset off within each n-byte boundary. Address of the variable is address mod n = off.
restrict: Permits the disambiguator flexibility in alias assumptions, which enables more vectorization.
__assume_aligned(a,n): Instructs the compiler to assume that array a is aligned on an n-byte boundary; used in cases where the compiler has failed to obtain alignment information.
#pragma ivdep: Instructs the compiler to ignore assumed vector dependencies.
#pragma vector{aligned|unaligned|always}: Specifies how to vectorize the loop and indicates that efficiency heuristics should be ignored.
#pragma novector: Specifies that the loop should never be vectorized.
Multi-version Code
Multi-version code is generated by the compiler in cases where data dependence analysis fails to prove independence for a loop due to the occurrence of pointers with unknown values. This functionality is referred to as dynamic dependence testing.
Pragma Scope
These pragmas control the vectorization of only the subsequent loop in the program, but th
355. pecified, it should be quoted along with the parentheses delimiting it. For example, to make an assertion for the identifier fruit with the associated values orange and banana, use the following command:
prompt> icpc -A"fruit(orange,banana)" prog1.cpp
Using -D
Use the -D option to define a macro.
Syntax: -Dname[=value]
Argument: Description
name: The name of the macro to define.
value: Indicates a value to be substituted for name. If you do not enter a value, name is set to 1. The value should be quoted if it contains non-alphanumerics.
For example, to define a macro called SIZE with the value 100, use the following command:
prompt> icpc -DSIZE=100 prog1.cpp
The -D option can also be used to define functions. For example:
prompt> icpc -D"f(x)=2x" prog1.cpp
Using -U
Use the -U option to remove (undefine) a pre-defined macro.
Syntax: -Uname
Argument: Description
name: The name of the macro to undefine.
Note: If you use -D and -U in the same compilation, the compiler processes the -D option before -U, rather than processing them in the order they appear on the command line.
Predefined Macros
The predefined macros available for the Intel C++ Compiler are described in the table below. The Architecture column indicates which Intel architecture supports the macro. Predefined macr
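On the source side, a macro supplied with -D is typically consumed as in the following sketch. The fallback value 16 and the function name `buffer_length` are assumptions made for illustration; only the macro name SIZE comes from the -D example above.

```cpp
#include <cassert>

// SIZE is expected to be supplied on the command line, for example with
// -DSIZE=100. The fallback below is an assumption for this sketch, so the
// file still compiles when the option is omitted.
#ifndef SIZE
#define SIZE 16
#endif

// Returns whatever value the preprocessor substituted for SIZE.
int buffer_length() { return SIZE; }
```

Guarding the definition with #ifndef keeps the command-line value authoritative while giving the code a sensible default.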
356. per DP FP value is passed through from a rO sqrt b0 rd al __mi28d mm sort pd m128d a Computes the square roots of the two DP FP values of a rO sqrt a0 rl sqrt al __mi28d mm min sd mi28d a m128d b Computes the minimum of the lower DP FP values of a and b The upper DP FP value is passed through from a rO min a0 b0 rl sal __mi28d _mm_min_pd __m128d a __m128d b Computes the minima of the two DP FP values of a and b rO min a0 bO rl min al b1 __mi28d _mm_max_sd __m128d a __m128d b Computes the maximum of the lower DP FP values of a and b The upper DP FP value is passed through from a rO max a0 b0 ri at __mi28d _mm_max_pd __m128d a __m128d b Computes the maxima of the two DP FP values of a and b rO max a0 bO rl max al bl Logical Operations for Streaming SIMD Extensions 2 The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmint rin h header file __mi28d _mm_and_pd __m128d a __m128d b uses ANDPD Computes the bitwise AND of the two DP FP values of a and b ro a0 amp bO rl al amp bl m128d mm andnot pd mi128d a m128d bi uses ANDNPD Computes the bitwise AND of the 128 bit value in b and the bitwise NOT of the 128 bit value in a rO a0 amp bO rl al amp bl 251 Intel C Compiler for Linux Systems User s Guide __m128d _mm_or_pd __m128d a m128d b uses
357. ping and sets the approximate frequency at which dumps will occur. The interval parameter is measured in milliseconds and specifies the time interval at which profile dumping will occur. For example, if interval is set to 5000, then a profile dump and reset will occur approximately every 5 seconds. The interval is approximate because the time check controlling the dump and reset is only performed upon entry to any instrumented function in your application.
Note
• Setting interval to zero or a negative number will disable interval profile dumping.
• Setting interval to a very small value may cause the instrumented application to spend nearly all of its time dumping profile information. Be sure to set interval to a large enough value so that the application can perform actual work and collect substantial profile information.
Recommended Usage
Call this function at the start of a non-terminating application to initiate Interval Profile Dumping. Note that an alternative method of initiating Interval Profile Dumping is by setting the environment variable PROF_DUMP_INTERVAL to the desired interval value prior to starting the application. The intention of Interval Profile Dumping is to allow a non-terminating application to be profiled with minimal changes to the application source code.
Environment Variable
PROF_DUMP_INTERVAL
This environment variable may be used to
358. portion of code colored in this color was exercised by the tests. The default color can be overridden with the -ccolor option.
Uncovered basic block: Basic blocks that are colored in this color were not exercised by any of the tests. They were, however, within functions that were executed during the tests. The default color can be overridden with the -bcolor option.
Uncovered function: Functions that are colored in this color were never called during the tests. The default color can be overridden with the -fcolor option.
Partially covered code: More than one basic block was generated for the code at this position. Some of the blocks were covered while some were not. The default color can be overridden with the -pcolor option.
Unknown: No code was generated for this source line. Most probably, the source at this position is a comment, a header file inclusion, or a variable declaration. The default color can be overridden with the -ucolor option.
The default colors can be customized to be any valid HTML color by using the options mentioned for each coverage category in the table above. For code coverage colored presentation, the coverage tool uses the following heuristic: source characters are scanned until reaching a position in the source that is indicated by the profile information as the beginning of a basic block. If the profile information for that basic block indicates that a coverage category c
359. ptimize your application's performance for a specific Intel Itanium processor. The resulting binary will also run on the processors listed in the table below. The Intel C++ Compiler includes gcc-compatible versions of the -tpp options. These options are listed in the gcc Version column.
Option    gcc Version       Optimizes for
-tpp1     -mcpu=itanium     Itanium processors
-tpp2     -mcpu=itanium2    Itanium 2 processors
Note: The -tpp2 option is ON by default.
Example
The invocations listed below all result in a compiled binary optimized for the Intel Itanium 2 processor. The same binary will also run on Intel Itanium processors.
prompt> icpc prog.cpp
prompt> icpc -tpp2 prog.cpp
prompt> icpc -mcpu=itanium2 prog.cpp
Processor-specific Optimization (IA-32 only)
The -x{K|W|N|B|P} options target your program to run on a specific Intel processor. The resulting code might contain unconditional use of features that are not supported on other processors.
Option: Specific Optimization for
-xK: Intel Pentium III and compatible Intel processors.
-xW: Intel Pentium 4 and compatible Intel processors.
-xN: Intel Pentium 4 and compatible Intel processors. Programs where the function main() is compiled with this option will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processo
360. ption iiec 11 80 OT optIOn eene een 11 79 Q2 OptIOT edat cereos 11 79 Q3 ODHOnDa uisa ee ERE eius 11 79 Ob Option Rare e even 11 OMP DYNAMIC environment variable 143 OMP NESTED environment variable 143 OMP NUM THREADS environment variable X H A 134 143 OMP SCHEDULE environment variable 134 143 openmp option 11 140 OpenMP E San RR ou RR 141 CULE CEIVES eet Ee 141 OpenMP 137 140 141 142 143 144 147 148 153 openmp report option 11 140 openmp stubs option 11 opt report option 11 opt report file option 11 159 opt report help option sssssesseeseeseeseees 11 159 opt report level option 11 159 opt report phase option 11 159 opt report routine option 11 159 optimization for floating point precision 81 83 for Intel processors sssssss 85 86 high level language 117 interprocedural 92 94 95 97 98 parallel programming 133 134 135 137 140 141 142 143 144 147 148 153 profile guided 99 100 101 102 103 109 114 115 116 TESINE enero e tue ie 80 vectorization 119 120 121 122 123 124 125 129 131 optimization Aene eheu eee 79 options cross reference 31 default eese a de ne 37 TOW sn nement 7 quick reference rRNA 11 P option 4
361. q A B Greater Than cmpgt R cmpgt A B Greater Than or Equal To cmpge R cmpge A B Not Greater Than cmpngt R cmpngt A B Not Greater Than or Equal To cmpnge R cmpnge A B 372 Intel C Intrinsics Reference RE Compare For Operators Less Than Less Than or Equal To Not Less Than Not Less Than or Equal To Compare Operators The mask is set to Ox for each floating point value where the comparison is true and 0x00000000 where the comparison is false The table below shows the return values for each class of the compare operators which use the syntax described earlier in the Return Value Notation section Compare Operator Return Value Mapping R AO For Any B if True If False F32vec4 F64vec2 F32vec1 Operators RO Al emp eq Bl Oxffff ffff 0Ox0000000 X X X WAL lt B1 le gt ge cmp ne nit nle ngt nge RI AI emp eq B2 Oxffffffff 0x0000000 X X N A ail tt B2 le gt gel cmp ne nit nle ngt nge R2 Al emp eq B3 Oxffffffff 0x0000000 X N A N A Ar1 l 1t B3 le gt ge cmp ne nit nle ngt nge 373 Operators For Any If True If False Oxffffffff 0x0000000 X The table below shows examples for arithmetic operators and intrinsics Compare Operations for Fvec Classes 4 floats 2 doubles float 4 floats 2 doubles 1 fl
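The mask convention described above (all ones where the comparison is true, all zeros where it is false) can be illustrated with a scalar model of a four-float greater-than comparison. The alias `Mask4`, the alias `Float4`, and the function `cmpgt_model` are hypothetical names for this sketch; it does not use the Fvec classes themselves.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Float4 = std::array<float, 4>;
using Mask4 = std::array<std::uint32_t, 4>;

// Per element: 0xffffffff if a[i] > b[i], otherwise 0x00000000.
Mask4 cmpgt_model(const Float4& a, const Float4& b) {
    Mask4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] = (a[i] > b[i]) ? 0xffffffffu : 0x00000000u;
    }
    return r;
}
```

Masks of this shape are convenient because they can be combined with bitwise AND and OR to select between two candidate values per element.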
362. r Default For Itanium based Systems cpu values are e itanium Optimize for Itanium processor e itanium2 Optimize for Itanium 2 processor Default MD Preprocess and compile OFF Generate output file d extension containing dependency information MF file Generate makefile dependency OFF information in file Must specify M or MM MG Similar to M but treats missing OFF header files as generated files MM Similar to M but does not OFF include system header files MMD Similar to MD but does not OFF include system header files mp Favors conformance to the OFF ANSI C and IEEE 754 standards for floating point arithmetic mp1 Improve floating point precision OFF speed impact is less than mp Pass relax to the linker ON mno relax Do not pass relax to the OFF linker 20 Compiler Options Quick Reference Option Description Default mserialize volatile Impose strict memory access OFF ordering for volatile data object references mno serialize volatile The compiler may suppress both OFF run time and compile time memory access ordering for volatile data object references Specifically the xe1 acq completers will not be issued on referencing loads and stores MX Generate dependency file OFF o dep extension containing information used for the Intel wb tool nobss init Places variables that are OFF initialized with ze
363. r application. The default status of -ftz is OFF. By default, the compiler lets results gradually underflow. The -ftz switch only needs to be used on the source containing function main(). The effect of the -ftz switch is to turn on FTZ mode for the process started by main(). The initial thread, and any threads subsequently created by that process, will operate in FTZ mode.
Note: The -O3 option turns -ftz ON. Use -ftz- to disable flushing denormal results to zero.
Allocation of Zero-initialized Variables
By default, variables explicitly initialized with zeros are placed in the BSS section. But using the -nobss_init option, you can place any variables that are explicitly initialized with zeros in the DATA section if required.
Precompiled Header Files
The Intel C++ Compiler supports precompiled header (PCH) files to significantly reduce compile times using the following options:
• -pch
• -create_pch filename
• -use_pch filename
• -pch_dir dirname
Caution: Depending on how you organize the header files listed in your sources, these options may increase compile times. See Organizing Source Files to learn how to optimize compile times using the PCH options.
-pch
The -pch option directs the compiler to use appropriate PCH files. If none are available, they are created as sourcefile.pchi. This option supports multiple source files, such as the ones shown in Example 1: Exampl
364. r equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 <= b0) ? 0x1 : 0x0
int _mm_comigt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than b. If a is greater than b, 1 is returned. Otherwise 0 is returned.
r := (a0 > b0) ? 0x1 : 0x0
int _mm_comige_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a greater than or equal to b. If a is greater than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 >= b0) ? 0x1 : 0x0
int _mm_comineq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a not equal to b. If a and b are not equal, 1 is returned. Otherwise 0 is returned.
r := (a0 != b0) ? 0x1 : 0x0
int _mm_ucomieq_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a equal to b. If a and b are equal, 1 is returned. Otherwise 0 is returned.
r := (a0 == b0) ? 0x1 : 0x0
int _mm_ucomilt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than b. If a is less than b, 1 is returned. Otherwise 0 is returned.
r := (a0 < b0) ? 0x1 : 0x0
int _mm_ucomile_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b for a less than or equal to b. If a is less than or equal to b, 1 is returned. Otherwise 0 is returned.
r := (a0 <= b0) ? 0x1 : 0x0
int _mm_ucomigt_sd(__m128d a, __m128d b)
Compares the lower DP FP value of a and b
365. r for Linux Systems User s Guide is FE Lo ANNUM void itrd i whichTransReg void itri whichTransReg void ptcl i int64 pagesz __ptcga __int64 va pagesz ptri int64 va pagesz ptrd int64 va pagesz void invala gr const int whichGeneralReg void invala fr const int whichFloatReg void break const int Insert an entry into the instruction translation cache Map itc i Map the itr d instruction Map the itr i instruction Map the ptc e instruction Purges the local translation cache Maps to the ptc l r r instruction Purges the global translation cache Maps to the ptc g r rinstruction Purges the global translation cache and ALAT Maps to the ptc ga r r instruction Purges the translation register Maps to the ptr i r r instruction Purges the translation register Maps to the ptr d r r instruction Invalidates ALAT Maps to the invala instruction Same as void invalat void whichGeneralReg 0 127 whichFloatReg 0 127 Generates a break instruction with an immediate void nop const int Generate a nop instruction void debugbreak void Generates a Debug Break Instruction fault void fc int64 294 Flushes a cache line associated with the address given by the argument Maps to the c instruction Intel C Intrinsics Reference au ar oo RN void sum int mask void rum
366. r results.
r0 := a0 * b0
r1 := a2 * b2
__m128i _mm_sad_epu8(__m128i a, __m128i b)
Computes the absolute difference of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b. Sums the upper 8 differences and lower 8 differences, and packs the resulting 2 unsigned 16-bit integers into the upper and lower 64-bit elements.
r0 := abs(a0 - b0) + abs(a1 - b1) + ... + abs(a7 - b7)
r1 := 0x0; r2 := 0x0; r3 := 0x0
r4 := abs(a8 - b8) + abs(a9 - b9) + ... + abs(a15 - b15)
r5 := 0x0; r6 := 0x0; r7 := 0x0
__m128i _mm_sub_epi8(__m128i a, __m128i b)
Subtracts the 16 signed or unsigned 8-bit integers of b from the 16 signed or unsigned 8-bit integers of a.
r0 := a0 - b0
r1 := a1 - b1
...
r15 := a15 - b15
__m128i _mm_sub_epi16(__m128i a, __m128i b)
Subtracts the 8 signed or unsigned 16-bit integers of b from the 8 signed or unsigned 16-bit integers of a.
r0 := a0 - b0
r1 := a1 - b1
...
r7 := a7 - b7
__m128i _mm_sub_epi32(__m128i a, __m128i b)
Subtracts the 4 signed or unsigned 32-bit integers of b from the 4 signed or unsigned 32-bit integers of a.
r0 := a0 - b0
r1 := a1 - b1
r2 := a2 - b2
r3 := a3 - b3
__m64 _mm_sub_si64(__m64 a, __m64 b)
Subtracts the signed or unsigned 64-bit integer b from the signed or unsigned 64-bit integer a.
r := a - b
__m128i _mm_sub_epi64(__m128i a, __m128i b)
Subtracts the 2 signed or unsigned 64-bit integers i
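The sum-of-absolute-differences layout described above for _mm_sad_epu8 can be modeled in scalar code. The function `sad_epu8_model` is a hypothetical name for this sketch, which does not call the intrinsic; it returns only the two nonzero 16-bit results (r0 and r4 in the notation above).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// result[0] models r0 (sum over bytes 0..7),
// result[1] models r4 (sum over bytes 8..15).
// Each sum is at most 8 * 255 = 2040, so it fits in 16 bits.
std::array<std::uint16_t, 2> sad_epu8_model(const Bytes16& a, const Bytes16& b) {
    std::array<std::uint16_t, 2> r{0, 0};
    for (std::size_t i = 0; i < a.size(); ++i) {
        const int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        const int abs_d = d < 0 ? -d : d;
        r[i / 8] = static_cast<std::uint16_t>(r[i / 8] + abs_d);
    }
    return r;
}
```

Widening each byte to int before subtracting avoids the wrap-around that unsigned 8-bit subtraction would otherwise introduce.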
367. r-specific optimizations.
-xB: Intel Pentium M and compatible Intel processors. Programs where the function main() is compiled with this option will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.
-xP: Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3). Programs where the function main() is compiled with this option will detect non-compatible processors and generate an error message during execution. This option also enables new optimizations in addition to Intel processor-specific optimizations.
To execute a program on x86 processors not provided by Intel Corporation, do not specify the -x{K|W|N|B|P} option.
Example
The invocation below compiles prog.cpp for Intel Pentium 4 and compatible processors. The resulting binary might not execute correctly on Pentium, Pentium Pro, Pentium II, Pentium III, or Pentium with MMX technology processors, or on x86 processors not provided by Intel Corporation.
prompt> icpc -xW prog.cpp
Caution: If a program compiled with -x{K|W|N|B|P} is executed on a non-compatible processor, it might fail with an illegal instruction exception or display other unexpected behavior. Executing programs compiled with -xN, -xB, or -xP on unsupported processors (see table above) will display the following run-time error:
368. r with icc, the .i files are treated as C source files. The .i files are treated as C++ sources if you compile with icpc. Compiled object module; Assembly file; Shared object file; Assembly file that requires preprocessing; C language source file; C language source file.
Compilation Options
This section describes the Intel C++ Compiler options that determine the compilation process and output. By default, the compiler converts source code directly to an executable file. Appropriate options allow you to control the process by directing the compiler to produce:
• Preprocessed files (.i) with the -P option.
• Assembly files (.s) with the -S option.
• Object files (.o) with the -c option.
• Executable files (a.out by default).
You can also name the output file or designate a set of options that are passed to the linker. If you specify a phase-limiting option, the compiler produces a separate output file representing the output of the last phase that completes for each primary input file.
Preprocessor Options
This section describes the options you can use to direct the operations of the preprocessor. Preprocessing performs such tasks as macro substitution, conditional compilation, and file inclusion.
Preprocessing Options
Option: Description
-Aname[(values)]: Associates a symbol name with the specified sequence of values. Equivalent to an #assert prepro
369. rallelization Compiler Options Quick Reference Option Description Default par threshold n Sets a threshold for the auto OFF parallelization of loops based on the probability of profitable execution of the loop in parallel n 0 to 100 This option is used for loops whose computation work volume cannot be determined at compile time e par threshold loops get auto parallelized regardless of computation work volume e par threshold100 loops get auto parallelized only if profitable parallel execution is almost certain pc32 Set internal FPU precision to 24 OFF 1A 32 only bit significand pc64 Set internal FPU precision to 53 OFF 14 32 only bit significand pc80 Set internal FPU precision to 64 ON 1 32 only bit significand pch Automatic processing for OFF precompiled headers Directs the compiler to find OFF and or create a file for precompiled headers in dirname Disables the floating point OFF division to multiplication optimization Improves precision of floating point divides prefetch Enables disables the insertion ON of software prefetching by the compiler Default is prefetch prof dir dirname Specify the directory OFF dirname to hold profile information dyn dpi prof file filename Specify the filename for OFF profiling summary file 25 Intel C Compiler for Linux Systems User s Guide Option Description Def
370. rates symbolic debugging OFF information in the object code for use by source level debuggers The g option changes the default optimization from O2 to O0 16 Compiler Options Quick Reference Option Description Default Use this option to specify the OFF location of g when compiler cannot locate gcc C libraries For use with cxxlib gcc configuration Use this option when referencing a non standard gcc installation gcc name name gcc version nnn This option provides compatible ON behavior with gcc where nnn indicates the gcc version This version of the Intel compiler supports gcc version 320 Default H Print include file order and OFF continue compilation help Prints compiler options OFF summary idirafterdir Add directory dir to the OFF second include file search path after I Idirectory Specifies an additional OFF directory to search for include files i dynamic Link Intel provided libraries OFF dynamically inline debug info Preserve the source position of OFF inlined code instead of assigning the call site source position to inlined code Enables interprocedural OFF optimizations for single file compilation Enable disable the combining OFF of floating point multiplies and add subtract operations IPF fltacc Enable disable optimizations OFF that affect floating point accuracy IPF flt eval methodO Floa
371. ration Corresponding Name Name Instruction m pextrw mm extract pil6 Extract on of four words P _m_pinsrw _mm_insert_pil6 Insert a word PINSRW _m_pmaxsw _mm_max_pil6 Compute the maximum PMAXSW _m_pmaxub mm max pu8 Compute the maximum PMAXUB unsigned 236 Intel C Intrinsics Reference Intrinsic Alternate Operation Corresponding Name Name Instruction _m_pminsw _mm_min_pil6 Compute the minimum F m pminub mm min pu8 Compute the minimum unsigned m pmovmskb mm movemask pi8 Create an eight bit mask PMOVMSKB m pmulhuw mm mulhi pul6 Multiply return high bits PMULHUW _m_pshufw mm shuffle pil6 Return a combination of PSHUFW four words m maskmovq mm maskmove si64 Conditional Store ASKMOVOQ m pavgb mm avg pu8 Compute rounded average _m_pavgw _mm_avg_pul6 Compute rounded average _m_psadbw _mm_sad_pu8 Compute sum of absolute PSADBW differences For these intrinsics you need to empty the multimedia state for the mmx register See The EMMS Instruction Why You Need It and When to Use It topic for more details int m pextrw _m64 a int n Extracts one of the four words of a The selector n must be an immediate r n 0 a0 n 1 a1 n 2 a2 a3 D m64 m pinsrw m64 a int d int n Inserts word d into one of four words of a The selector n must be an immediate rO n 0 d a0 rl n 1 d als r2 n 2 d a2 r3
372. re __STDC__ 1 Both. Macro Name, Value, Architecture: __STDC_HOSTED__ 1; __TIME__ Current time; __unix__ 1; __unix 1; unix 1; __USER_LABEL_PREFIX__ (no value); __WCHAR_TYPE__ long int on IA-32, int on Itanium architecture; __WINT_TYPE__ unsigned int. Suppress Macro Definition: Use the -Uname option to suppress any macro definition currently in effect for the specified name. The -U option performs the same function as an #undef preprocessor directive. Compilation Environment: Customizing the Compilation Environment. For IA-32 and the Intel Itanium architecture, you will need to set a compilation environment. To customize the environment used during compilation, you can specify: Environment Variables (the paths where the compiler and other tools can search for specific files); Configuration Files (the options to use with each compilation); Response Files (the options and files to use for individual projects); Include Files (the names and locations of source header files). Environment Variables: You can customize your environment by specifying paths where the compiler can search for special files such as libraries and include files. LD_LIBRARY_PATH specifies the location for shared objects. PATH specifies the directories the system searches for binary executable files. IC
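The equivalence between suppressing a macro on the command line and an #undef directive can be illustrated with a small sketch. TRACE_LEVEL and trace_level_defined are hypothetical names chosen for this example; they do not come from the guide.

```cpp
#include <cassert>

// Hypothetical macro, as if it had been supplied with -DTRACE_LEVEL=3.
#define TRACE_LEVEL 3

// Removing the definition in source; the guide describes -UTRACE_LEVEL
// as performing the same function as this directive.
#undef TRACE_LEVEL

// Reports whether TRACE_LEVEL is still defined at this point
// in the translation unit.
constexpr bool trace_level_defined() {
#ifdef TRACE_LEVEL
    return true;
#else
    return false;
#endif
}
```

After the #undef, code guarded by #ifdef TRACE_LEVEL is no longer compiled, which is why trace_level_defined() reports false here.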
373. re in the mmintrin h header file Alternate Corresponding Operation Name Instruction empty mm empty EMMS Empty MM state from int mm cvtsi32 si64 VD Convert from int to i _mm_cvtsi64 si32 Convert from int 9 B 211 Intel C Compiler for Linux Systems User s Guide Alternate Corresponding Operation Name Instruction m punpcklbw mm unpacklo pi8 PUNPCKLBW weie puo Qr di a FIG Jeer e void _m_empty void Empty the multimedia state m64 m from int int i Convert the integer object i to a 64 bit m64 object The integer value is zero extended to 64 bits int m to int m64 m Convert the lower 32 bits of the __m64 object m to an integer m64 m packsswb m64 ml m64 m2 Pack the four 16 bit values from m1 into the lower four 8 bit values of the result with signed saturation and pack the four 16 bit values from m2 into the upper four 8 bit values of the result with signed saturation m64 m packssdw m64 ml m64 m2 Pack the two 32 bit values from m1 into the lower two 16 bit values of the result with signed saturation and pack the two 32 bit values from m2 into the upper two 16 bit values of the result with signed saturation m64 m packuswb m64 ml m64 m2 Pack the four 16 bit values from m1 into the lower four 8 bit values of the result with unsigned saturation and pack the four 16 bit values from m2 into the upper four 8 bit values of the resu
374. re treated as unsigned, while the elements of b are treated as signed. The results are treated as unsigned and are returned as one 64-bit word. __m64 _m64_pavg1_nraz(__m64 a, __m64 b): The unsigned byte-wide data elements of a are added to the unsigned byte-wide data elements of b, and the results of each add are then independently shifted to the right by one position. The high-order bits of each element are filled with the carry bits of the sums. __m64 _m64_pavg2_nraz(__m64 a, __m64 b): The unsigned 16-bit-wide data elements of a are added to the unsigned 16-bit-wide data elements of b, and the results of each add are then independently shifted to the right by one position. The high-order bits of each element are filled with the carry bits of the sums. Synchronization Primitives: The synchronization primitive intrinsics provide a variety of operations. Besides performing these operations, each intrinsic has two key properties: the function performed is guaranteed to be atomic; and associated with each intrinsic are certain memory barrier properties that restrict the movement of memory references to visible data across the intrinsic operation by either the compiler or the processor. For the intrinsics listed below, <type> is either a 32-bit or 64-bit integer. Atomic Fetch-and-op Operations: <type> __sync_fetch_and_add(<type> *ptr, <type> val) <type> __sync_fetch_and_and(<type> *ptr, <typ
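The fetch-and-op behavior described above (perform the operation atomically and hand back the previous value) can be sketched with the standard C++ std::atomic facility. This is an analogue for illustration, not the compiler's intrinsic interface; the wrapper names fetch_and_add and fetch_and_and are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically adds val to counter and returns the value it held before the add.
inline std::int32_t fetch_and_add(std::atomic<std::int32_t>& counter,
                                  std::int32_t val) {
    return counter.fetch_add(val, std::memory_order_seq_cst);
}

// Atomically ANDs mask into bits and returns the value it held before the AND.
inline std::int32_t fetch_and_and(std::atomic<std::int32_t>& bits,
                                  std::int32_t mask) {
    return bits.fetch_and(mask, std::memory_order_seq_cst);
}
```

Sequentially consistent ordering is used here as the most conservative choice, so that no memory reference moves across the operation.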
375. reads. Original Serial Code: for (i=1; i<100; i++) { a[i] = a[i] + b[i] * c[i]; } Transformed Parallel Code: Thread 1: for (i=1; i<50; i++) { a[i] = a[i] + b[i] * c[i]; } Thread 2: for (i=50; i<100; i++) { a[i] = a[i] + b[i] * c[i]; } Programming with Auto-parallelization: The auto-parallelization feature implements some concepts of OpenMP, such as the worksharing construct (with the parallel for directive). This section provides specifics of auto-parallelization. Guidelines for Effective Auto-parallelization Usage: A loop is parallelizable if: The loop is countable at compile time. This means that an expression representing how many times the loop will execute (also called the loop trip count) can be generated just before entering the loop. There are no FLOW (READ after WRITE), OUTPUT (WRITE after WRITE), or ANTI (WRITE after READ) loop-carried data dependences. A loop-carried data dependence occurs when the same memory location is referenced in different iterations of the loop. At the compiler's discretion, a loop may be parallelized if any assumed inhibiting loop-carried dependencies can be resolved by run-time dependency testing. The compiler may generate a run-time test for the profitability of executing in parallel for loops with loop parameters that are not compile-time constants. Coding Guidelines: Enhance the power and effectiveness of the auto parallel
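Why the absence of loop-carried dependences makes the split safe can be shown with a minimal sketch: when each iteration touches only its own element, the two halves of the iteration space can run in any order (or concurrently) and produce the same array. The helper names update_range, run_serial and run_split are hypothetical, and the halves are executed one after the other here rather than on real threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Applies a[i] = a[i] + b[i] * c[i] for i in [lo, hi).
// Iteration i reads and writes only element i, so there is
// no loop-carried data dependence.
void update_range(std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<double>& c, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo; i < hi; ++i) {
        a[i] = a[i] + b[i] * c[i];
    }
}

// The original serial loop: i = 1 .. 99.
std::vector<double> run_serial(std::vector<double> a, const std::vector<double>& b,
                               const std::vector<double>& c) {
    update_range(a, b, c, 1, 100);
    return a;
}

// The transformed form: the work of "Thread 2" (50 .. 99) is done before the
// work of "Thread 1" (1 .. 49) to stress that the order does not matter.
std::vector<double> run_split(std::vector<double> a, const std::vector<double>& b,
                              const std::vector<double>& c) {
    update_range(a, b, c, 50, 100);
    update_range(a, b, c, 1, 50);
    return a;
}
```

If the body instead read a[i-1], the second half would depend on results from the first, and the two orders could disagree.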
376. reated as identical. Because the source relocation feature of profmerge modifies the pgopti.dpi file, you may wish to make a backup copy of the file prior to performing the source relocation. Code-coverage Tool: The Intel C++ Compiler Code-coverage Tool can be used for both IA-32 and Itanium architectures in a number of ways to improve development efficiency, reduce defects, and increase application performance. The major features of the Intel compiler Code-coverage Tool are: visual presentation of the application's code coverage information with a code-coverage coloring scheme; display of the dynamic execution counts of each basic block of the application; differential coverage, or comparison of the profiles of the application's two runs. Command-line Syntax: The syntax for this tool is as follows: codecov [-codecov_option], where -codecov_option is a tool option. If you do not use any option, the tool will provide the top-level code coverage for your whole program. Tool Options: The tool uses options that are listed in the table that follows. Options: -help, -spi file, -dpi file, -prj, -counts, -nopartial, -comp, -ref, -demang, -mname, -maddr, -bcolor. Descriptions: -help prints all the options of the code-coverage tool. -spi file sets the path name of the static profile information file (.spi). -dpi file sets the path name of the dynamic profile information file (.dpi). -prj sets the project na
377. ress MOVUPS unaligned mm_storer_ps Store four values in MOVAPS reverse order Shuffling Set the low word and pass MOVSS in three high values mm getcsr Return register contents STMXCSR mm setcsr Control Register LDMXCSR mm prefetch mm stream pi mm stream ps mm sfence mm cvtss f32 m128 mm load ss float const a Loads an SP FP value into the low word and clears the upper three words r0 a rl 0 0 r2 0 0 r3 0 0 m128 mm load psl float const a Loads a single SP FP value copying it into all four words r0 a rl a r2 a r3 a m128 _mm_load_ps float const a Loads four SP FP values The address must be 16 byte aligned rO a 0 rl a 1 r2 a 2 r3 a 3 240 Intel C Intrinsics Reference m128 _mm_loadu_ps float const a Loads four SP FP values The address need not be 16 byte aligned ro rl r2 r3 a 0 1 2 3 m128 mm loadr ps float const a Loads four SP FP values in reverse order The address must be 16 byte aligned ro rl r2 r3 a 3 a 2 a 1 a 0 m128 mm set ss float a Sets the low word of an SP FP value to a and clears the upper three words r0 c rl e r2 9 rS i19 0 0 m128 mm set psl float a Sets the four SP FP values to a rO rl r2 r3 a m128 mm set ps float a float b float c float d Sets the four SP FP values to the four inputs
378. rinsics Reference. Naming and Usage Syntax: Most of the intrinsic names use a notational convention as follows: _mm_<intrin_op>_<suffix>. <intrin_op> indicates the intrinsic's basic operation, for example add for addition and sub for subtraction. <suffix> denotes the type of data operated on by the instruction. The first one or two letters of each suffix denote whether the data is packed (p), extended packed (ep), or scalar (s). The remaining letters denote the type: s single-precision floating point; d double-precision floating point; i128 signed 128-bit integer; i64 signed 64-bit integer; u64 unsigned 64-bit integer; i32 signed 32-bit integer; u32 unsigned 32-bit integer; i16 signed 16-bit integer; u16 unsigned 16-bit integer; i8 signed 8-bit integer; u8 unsigned 8-bit integer. A number appended to a variable name indicates the element of a packed object. For example, r0 is the lowest word of r. Some intrinsics are composites because they require more than one instruction to implement them. The packed values are represented in right-to-left order, with the lowest value being used for scalar operations. Consider the following example operation: double a[2] = {1.0, 2.0}; __m128d t = _mm_load_pd(a); The result is the same as either of the following: __m128d t = _mm_set_pd(2.0, 1.0); __m128d t = _mm_setr_pd(1.0, 2.0); In other words, the xmm register that holds the value t will look as follows
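The equivalence of the three forms can be checked with a short sketch, assuming an x86 target with SSE2 and the emmintrin.h header. The Lanes struct and the helper functions are hypothetical names introduced only to read the two elements back out.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics: _mm_load_pd, _mm_set_pd, _mm_setr_pd

// r0 is the lowest element, r1 the highest.
struct Lanes { double r0; double r1; };

// Copies the two double-precision elements of v into a Lanes value.
inline Lanes lanes_of(__m128d v) {
    double out[2];
    _mm_storeu_pd(out, v);  // out[0] receives r0, out[1] receives r1
    return Lanes{out[0], out[1]};
}

inline Lanes via_load() {
    alignas(16) double a[2] = {1.0, 2.0};  // _mm_load_pd needs 16-byte alignment
    return lanes_of(_mm_load_pd(a));
}

// _mm_set_pd takes the highest element first.
inline Lanes via_set() { return lanes_of(_mm_set_pd(2.0, 1.0)); }

// _mm_setr_pd takes the elements in memory ("reverse") order.
inline Lanes via_setr() { return lanes_of(_mm_setr_pd(1.0, 2.0)); }
```

All three helpers yield r0 = 1.0 and r1 = 2.0, which matches the right-to-left register picture: the lowest element sits in the rightmost position.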
379. r1 := min(a1, b1); ... r7 := min(a7, b7). __m128i _mm_min_epu8(__m128i a, __m128i b): Computes the pairwise minima of the 16 unsigned 8-bit integers from a and the 16 unsigned 8-bit integers from b. r0 := min(a0, b0); r1 := min(a1, b1); ... r15 := min(a15, b15). __m128i _mm_mulhi_epi16(__m128i a, __m128i b): Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b. Packs the upper 16 bits of the 8 signed 32-bit results. r0 := (a0 * b0)[31:16]; r1 := (a1 * b1)[31:16]; ... r7 := (a7 * b7)[31:16]. __m128i _mm_mulhi_epu16(__m128i a, __m128i b): Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers from b. Packs the upper 16 bits of the 8 unsigned 32-bit results. r0 := (a0 * b0)[31:16]; r1 := (a1 * b1)[31:16]; ... r7 := (a7 * b7)[31:16]. __m128i _mm_mullo_epi16(__m128i a, __m128i b): Multiplies the 8 signed or unsigned 16-bit integers from a by the 8 signed or unsigned 16-bit integers from b. Packs the lower 16 bits of the 8 signed or unsigned 32-bit results. r0 := (a0 * b0)[15:0]; r1 := (a1 * b1)[15:0]; ... r7 := (a7 * b7)[15:0]. __m64 _mm_mul_su32(__m64 a, __m64 b): Multiplies the lower 32-bit integer from a by the lower 32-bit integer from b, and returns the 64-bit integer result. r := a0 * b0. __m128i _mm_mul_epu32(__m128i a, __m128i b): Multiplies 2 unsigned 32-bit integers from a by 2 unsigned 32-bit integers from b. Packs the 2 unsigned 64-bit intege
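The [31:16] and [15:0] notation above selects the upper or lower half of each 32-bit product. A scalar sketch of what happens in one 16-bit element may make this concrete; the function names mulhi_i16, mulhi_u16 and mullo_16 are hypothetical, and this is portable C++ rather than the intrinsics themselves.

```cpp
#include <cassert>
#include <cstdint>

// Upper 16 bits of the signed 32-bit product: (a * b)[31:16].
inline std::int16_t mulhi_i16(std::int16_t a, std::int16_t b) {
    const std::int32_t product =
        static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b);
    const std::uint16_t high =
        static_cast<std::uint16_t>(static_cast<std::uint32_t>(product) >> 16);
    return static_cast<std::int16_t>(high);
}

// Upper 16 bits of the unsigned 32-bit product: (a * b)[31:16].
inline std::uint16_t mulhi_u16(std::uint16_t a, std::uint16_t b) {
    const std::uint32_t product =
        static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
    return static_cast<std::uint16_t>(product >> 16);
}

// Lower 16 bits of the product: (a * b)[15:0]. These bits are the same
// whether the inputs are interpreted as signed or unsigned.
inline std::uint16_t mullo_16(std::uint16_t a, std::uint16_t b) {
    const std::uint32_t product =
        static_cast<std::uint32_t>(a) * static_cast<std::uint32_t>(b);
    return static_cast<std::uint16_t>(product & 0xFFFFu);
}
```

Combining the mulhi and mullo results of the same pair of inputs recovers the full 32-bit product.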
380. rocessor with Streaming SIMD Extensions 3 (SSE3), respectively, or a compatible Intel processor. If the program is not executed on one of these processors, the program terminates with an error. Example: To optimize the program prog.cpp for the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), issue the following command: prompt> icpc -xP prog.cpp The resulting executable aborts if it is executed on a processor that is not compatible with the Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), such as the Intel Pentium III or Intel Pentium 4 processor. If you intend to run your programs on multiple IA-32 processors, do not use the -x options that optimize for processor-specific features; consider using -ax to attain processor-specific performance and portability among different processors. Setting FTZ and DAZ Flags: Previously, the values of the flags flush-to-zero (FTZ) and denormals-as-zero (DAZ) for IA-32 processors were off by default. However, even at the cost of losing IEEE compliance, turning these flags on significantly increases the performance of programs with denormal floating-point values in the gradual underflow mode run on the most recent IA-32 processors. Hence, for the Intel Pentium III, Pentium 4, Pentium M, Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3), and compatible IA-32 processors, the compiler's default behavior is to turn these flags on. The compiler inserts code in
381. roes in the DATA section. Disables placement of zero-initialized variables in BSS (use DATA). -no_cpprt: Do not link in C++ run-time libraries. Default: OFF. -nodefaultlibs: Do not use standard libraries when linking. -no-gcc: Do not predefine the __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ macros. Default: OFF. -nolib_inline: Disables inline expansion of standard library functions. Default: OFF. -nostartfiles: Do not use standard startup files when linking. Default: OFF. -nostdinc: Same as -X. Default: OFF. -nostdlib: Do not use standard libraries and startup files when linking. Default: OFF. -O: Same as -O1 on IA-32. Same as -O2 on Itanium-based systems. Default: OFF. -O0: Disables optimizations. Default: OFF. -O1: Enable optimizations. Optimizes for speed. For the Itanium compiler, -O1 turns off software pipelining to reduce code size. Default: ON. -O2: Same as -O1 on IA-32. Same as -O on Itanium-based systems. Default: OFF. -O3: Enable -O2 plus more aggressive optimizations that may increase the compilation time. Impact on performance is application dependent; some applications may not see a performance improvement. Default: OFF. -Obn: Controls the compiler's inline expansion. Default: ON. The amount of inline expansion performed varies with the value of n as follows: 0: Disables inlining. 1: Enables (default) inlining of functions declared with the inline keyword. Also enables inlining
382. s cmpngt ps cmpnge ss cmpnge ps cmpord ss cmpord ps Alternate Name Across MMX TM All IA N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A Streaming Itanium amp Technology SIMD Architecture Extensions Streaming SIMD Extensions 2 N A A A N A A A N A A A N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A N A B B N A A A 319 Intel C Compiler for Linux Systems User s Guide mm mm mm Inm mm mm 320 Intrinsic Name cmpunord ss cmpunord ps com com com com com com ieq ss ilt ss ile ss igt ss ige ss ineq ss ucomieq ss ucomilt ss ucomile ss ucomigt ss ucomige ss ucomineq ss cvt ss2si Cvt ps2pi cvtt ss2si cvtt ps2pi cvt si2ss Cvt pi2ps cvtpil6 ps cvtpul6 ps cvtpi8 ps cvtpu8 ps CVtpi32x2 ps cvtps_pil Alternate Name mm cvtss si32 mm cvtps pi32 mm cvttss si32 mm cvttps pi32 mm cvtsi32 ss mm cvtpi32 ps Across MMX TM AIIA Technology N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A N A
383. s Reference Shift Operators Corresponding Intrinsics and Classes Part 1 I32vec4 Operators Corresponding Intrinsic 1128vec1 l64vec2 I16vec8 gt gt gt lt lt lt lt epi64 epi64 N A N A epi64 epi64 epi32 epi32 epi32 epi32 epi32 epi32 epil epil epil epil epil epil l8vec16 N A N A N A N A N A N A 383 Comparison Operators Corresponding Intrinsics and Classes Part 1 I16vec8 Operators Corresponding I32vec4 l8vec16 I32vec2 I16vec4 levec8 Intrinsic cmpeq mm cmpeq x epi32 epil epi8 pi32 pil6 cmpneq mm cmpeq x mm andnot y cmpgt mm cmpgt x epi32 epil epi8 pi32 pil6 cmpge mm cmpge x mm andnot y mm cmple x mm andnot y cmpngt mm cmpngt x epi32 epil epi8 pi32 pil6 cmpnge mm cmpnge x N A N A N A N A N A cmnpnit mm cmpnit x N A N A N A N A N A cmpnle mm cmpnle x N A N A N A N A N A Note that mm andnot y intrinsics do not apply to the fvec classes Comparison Operators Corresponding Intrinsics and Classes Part 2 Operators Corresponding F64vec2 F32vec4 F32vec1 Intrinsic DUE ee cm _mm_cmpeq_ mm andnot y mm cmpngt x mm cmpnge x 384 Intel C Intrinsics Reference Operators Corresponding Intrinsic ne S Cond
384. s in b. r0 := a0 + b0; r1 := a1 + b1; r2 := a2 + b2; r3 := a3 + b3. __m64 _mm_add_si64(__m64 a, __m64 b): Adds the signed or unsigned 64-bit integer a to the signed or unsigned 64-bit integer b. r := a + b. __m128i _mm_add_epi64(__m128i a, __m128i b): Adds the 2 signed or unsigned 64-bit integers in a to the 2 signed or unsigned 64-bit integers in b. r0 := a0 + b0; r1 := a1 + b1. __m128i _mm_adds_epi8(__m128i a, __m128i b): Adds the 16 signed 8-bit integers in a to the 16 signed 8-bit integers in b using saturating arithmetic. r0 := SignedSaturate(a0 + b0); r1 := SignedSaturate(a1 + b1); ... r15 := SignedSaturate(a15 + b15). __m128i _mm_adds_epi16(__m128i a, __m128i b): Adds the 8 signed 16-bit integers in a to the 8 signed 16-bit integers in b using saturating arithmetic. r0 := SignedSaturate(a0 + b0); r1 := SignedSaturate(a1 + b1); ... r7 := SignedSaturate(a7 + b7). __m128i _mm_adds_epu8(__m128i a, __m128i b): Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using saturating arithmetic. r0 := UnsignedSaturate(a0 + b0); r1 := UnsignedSaturate(a1 + b1); ... r15 := UnsignedSaturate(a15 + b15). __m128i _mm_adds_epu16(__m128i a, __m128i b): Adds the 8 unsigned 16-bit integers in a to the 8 unsigned 16-bit integers in b using saturating arithmetic. r0 := UnsignedSaturate(a0 + b0); r1 := UnsignedSaturate(a1 + b1); ... r7 := UnsignedSaturate(a7 + b7). __m
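SignedSaturate and UnsignedSaturate clamp the exact sum to the range of the element type instead of letting it wrap around. A scalar sketch of one 8-bit element follows; the function names are hypothetical, and this is portable C++ modelling the arithmetic, not the intrinsics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// SignedSaturate(a + b) for one 8-bit element: clamp to [-128, 127].
inline std::int8_t signed_saturate_add8(std::int8_t a, std::int8_t b) {
    const int sum = static_cast<int>(a) + static_cast<int>(b);
    return static_cast<std::int8_t>(std::max(-128, std::min(127, sum)));
}

// UnsignedSaturate(a + b) for one 8-bit element: clamp to [0, 255].
// The sum of two unsigned values cannot be negative, so only the
// upper bound needs checking.
inline std::uint8_t unsigned_saturate_add8(std::uint8_t a, std::uint8_t b) {
    const int sum = static_cast<int>(a) + static_cast<int>(b);
    return static_cast<std::uint8_t>(std::min(255, sum));
}
```

With wrap-around arithmetic, 200 + 100 in an unsigned byte would give 44; with saturation it gives 255, which is usually the wanted behavior for pixel or audio data.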
385. s is required for C++ ABI conformance. By default, the Intel C++ Compiler uses headers and libraries included with the product. If you are linking with code compiled with g++, which was compiled against GNU C++ headers, then differences in the headers might cause incompatibilities that result in run-time errors. If you build one shared library against the Intel C++ libraries, build a second shared library against the GNU C++ libraries, and use both libraries in a single application, you will have two C++ run-time libraries in use. Since the application might use symbols from both libraries, the following problems may occur: partially initialized libraries; lost I/O operations from data put in unaccessed buffers; other strange results, such as jumbled output. The Intel C++ Compiler does not support more than one run-time library in one application. Warning: If you successfully compile your application using more than one run-time library, the resulting program will likely be very unstable, especially when new code is linked against the shared libraries. You should use the -cxxlib-gcc option if your application includes source files generated by g++ and source files generated by the Intel C++ Compiler. This option directs the Intel compiler to use the g++ header and library files to build one set of run-time libraries. As a result, your program should run correctly. gcc
386. s the loop structure in two ways: It increases the temporal and spatial locality in the data cache if the data are reusable in different passes of an algorithm. It reduces the number of iterations of the loop by a factor of the length of each vector, or the number of operations being performed per SIMD operation. In the case of Streaming SIMD Extensions, the number of iterations is reduced by a factor of 4, because four floating-point data items are processed per single-precision floating-point SIMD operation. First introduced for vectorizers, this technique consists of the generation of code when each vector operation is done for a size less than or equal to the maximum vector length on a given vector machine. The compiler automatically strip-mines your loop and generates a cleanup loop. Before Vectorization: i = 0; while (i < n) { /* Original loop code */ a[i] = b[i] + c[i]; ++i; } After Vectorization: The vectorizer generates the following two loops: i = 0; while (i < (n - n%4)) { /* Vector strip-mined loop. Subscript [i:i+3] denotes SIMD execution */ a[i:i+3] = b[i:i+3] + c[i:i+3]; i = i + 4; } while (i < n) { /* Scalar clean-up loop */ a[i] = b[i] + c[i]; ++i; } Statements in the Loop Body: The vectorizable operations are different for floating-point and integer data. Floating-point Array Operations: The statements within the loop body may contain f
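The two-loop shape produced by strip-mining can be written out by hand as a sketch. Here the inner lane loop stands in for a single four-wide SIMD operation; kStrip and add_strip_mined are hypothetical names, and the code only models the structure rather than reproducing what the vectorizer actually emits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kStrip = 4;  // four single-precision items per SIMD operation

// Computes a[i] = b[i] + c[i] for i in [0, n) using a strip-mined main loop
// followed by a scalar clean-up loop.
void add_strip_mined(const float* b, const float* c, float* a, std::size_t n) {
    std::size_t i = 0;
    const std::size_t vector_end = n - n % kStrip;

    // Vector strip-mined loop: each pass handles elements [i, i+3].
    for (; i < vector_end; i += kStrip) {
        for (std::size_t lane = 0; lane < kStrip; ++lane) {
            a[i + lane] = b[i + lane] + c[i + lane];
        }
    }

    // Scalar clean-up loop: the remaining n % 4 elements.
    for (; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
```

For n = 10, the main loop runs twice (elements 0 to 7) and the clean-up loop handles elements 8 and 9.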
387. sd_ss(__m128 a, __m128d b): Converts the lower DP FP value of b to an SP FP value. The upper SP FP values in a are passed through. r0 := (float) b0; r1 := a1; r2 := a2; r3 := a3. __m128d _mm_cvtsi32_sd(__m128d a, int b): Converts the signed integer value in b to a DP FP value. The upper DP FP value in a is passed through. r0 := (double) b; r1 := a1. __m128d _mm_cvtss_sd(__m128d a, __m128 b): Converts the lower SP FP value of b to a DP FP value. The upper DP FP value in a is passed through. r0 := (double) b0; r1 := a1. __m128i _mm_cvttpd_epi32(__m128d a): Converts the two DP FP values of a to 32-bit signed integers using truncate. r0 := (int) a0; r1 := (int) a1; r2 := 0x0; r3 := 0x0. int _mm_cvttsd_si32(__m128d a): Converts the lower DP FP value of a to a 32-bit signed integer using truncate. r := (int) a0. __m64 _mm_cvtpd_pi32(__m128d a): Converts the two DP FP values of a to 32-bit signed integer values. r0 := (int) a0; r1 := (int) a1. __m64 _mm_cvttpd_pi32(__m128d a): Converts the two DP FP values of a to 32-bit signed integer values using truncate. r0 := (int) a0; r1 := (int) a1. __m128d _mm_cvtpi32_pd(__m64 a): Converts the two 32-bit signed integer values of a to DP FP values. r0 := (double) a0; r1 := (double) a1. _mm_cvtsd_f64(__m128d a): This intrinsic extracts a double-precision floating-point value from the first vector element of an __m128d. It does so in the most ef
388. second iteration was written into the first iteration. For vectorization, the iterations must be done in parallel, without changing the semantics of the original loop. Data Dependence Theory: Data dependence analysis involves finding the conditions under which two memory accesses may overlap. Given two references in a program, the conditions are defined by: whether the referenced variables may be aliases for the same (or overlapping) regions in memory; and, for array references, the relationship between the subscripts. For array references, the Intel C++ Compiler's data dependence analyzer is organized as a series of tests that progressively increase in power as well as time and space costs. First, a number of simple tests are performed in a dimension-by-dimension manner, since independence in any dimension will exclude any dependence relationship. Multi-dimensional array references that may cross their declared dimension boundaries can be converted to their linearized form before the tests are applied. Some of the simple tests used are the fast GCD test, proving independence if the greatest common divisor of the coefficients of loop indices cannot evenly divide the constant term, and the extended bounds test, which tests potential overlap for the extreme values of subscript expressions. If all simple tests fail to prove independence, the compiler will eventually resort to a powerful hierarchical dependence solver that uses Fourier Motzkin elim
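The GCD test mentioned above can be sketched for the simplest case of two one-dimensional references a[p*i + q] and a[r*j + s]. They can touch the same element only if p*i - r*j = s - q has an integer solution, which requires gcd(p, r) to divide s - q. The function names and the parameter layout are assumptions made for this illustration; the compiler's actual implementation is not described in the source and will differ.

```cpp
#include <cassert>
#include <cstdlib>

// Greatest common divisor of |x| and |y| (Euclid's algorithm).
inline long gcd_of(long x, long y) {
    x = std::labs(x);
    y = std::labs(y);
    while (y != 0) {
        const long t = x % y;
        x = y;
        y = t;
    }
    return x;
}

// Returns true when the GCD test PROVES that a[p*i + q] and a[r*j + s]
// never refer to the same element for any integers i and j.
// A false result means "dependence not ruled out", not "dependence exists":
// loop bounds are ignored by this test.
inline bool gcd_test_proves_independence(long p, long q, long r, long s) {
    const long g = gcd_of(p, r);
    if (g == 0) {
        return q != s;  // both subscripts are constants
    }
    return (s - q) % g != 0;
}
```

For example, a[2*i] and a[2*j + 1] are proven independent (even versus odd elements), while a[2*i] and a[4*j + 2] are not, so a more powerful test would be needed for the second pair.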
389. sed using the register names mm0 to mm7. The prototypes for MMX technology intrinsics are in the mmintrin.h header file. The EMMS Instruction: Why You Need It: Using EMMS is like emptying a container to accommodate new content. For instance, MMX(TM) instructions automatically enable an FP tag word in the register to enable use of the __m64 data type. This resets the FP register set to alias it as the MMX register set. To enable the FP register set again, reset the register state with the EMMS instruction or via the _mm_empty() intrinsic. [Figure: Why You Need EMMS to Reset After an MMX Instruction] Failure to empty the multimedia state after using an MMX instruction and before using a floating-point instruction can result in unexpected execution or poor performance. EMMS Usage Guidelines: The guidelines for when to use EMMS are: Do not use on Itanium-based systems. There are no special registers or overlay for the MMX instructions or Streaming SIMD Extensions on Itanium-based
390. shown in Figure 6 and return the result mm EHEN is 302 Intel C Intrinsics Reference m64 m64 musl m64 a const int n Based on the value of n a permutation is performed on a as shown in Figure 7 and the result is returned Table 1 shows the possible values of n rev mix Gett Gah brest Table 1 Values of n for m64_mux1 Operation ST a Ca 303 Intel C Compiler for Linux Systems User s Guide m64 m64_mux2 __m64 a const int n m64 m64 m64 304 Based on the value of n a permutation is performed on a as shown in Figure 8 and the result is returned GR rq mux ri r2 OxBb shufile 10 00 11 01 mux2 r1 i2 Ox1b reverse 00 01 10 11 GR ri mux2 r1 2 Oxe4 alternate 11 01 10 00 mux2 r1 r2 Oxaa broadcast 10 10 10 10 Fig 8 m64 pavgsubl m64 a __m64 b The unsigned data elements bytes of b are subtracted from the unsigned data elements bytes of a and the results of the subtraction are then each independently shifted to the right by one position The high order bits of each element are filled with the borrow bits of the subtraction m64 pavgsub2 m64 a X m64 b The unsigned data elements double bytes of b are subtracted from the unsigned data elements double bytes of a and the results of the subtraction are then each independently shifted to the right by one position The high order bits of each element are fill
391. sics Reference Conditional Select Symbols and Corresponding Intrinsics Conditional Operators Corresponding Additional Select For Intrinsic Intrinsic Applies to All Equality R mm cmpeq pi32 mm and si64 select eq A mm cmpeq pil6 mm or si64 B C D mm cmpeq pi8 mm andnot si64 Inequality R _mm_cmpeq_pi32 select neq A mm cmpeq pil6 B C D mm cmpeq pi8 Greater Than select gt R mm cmpgt pi32 select gt A mm cmpgt pil6 B C D mm cmpgt pi8 Greater Than R _mm_cmpge_pi32 or Equal To select_gt A mm cmpge pil6 B C D _mm_cmpge_pi8 Less Than select lt R select lt A B C D Less Than R or Equal To select le A B C D All conditional select operands must be of the same size The return data type is the nearest common ancestor of operands C and D For conditional select operations using greater than or less than operations the first and second operands must be signed as listed in the table that follows Conditional Select Operator Overloading R Comparison Aand B C I32vec2 R I s u 32vec2 I s u 32vec2 select eq I s u 32vec2 Il6vec4 R I s u 16vec4 I s uJl6vec4 I s u 16vec4A I8vec8 R I s u 8vec8 Is32vec2 Is32vec2 Isl6vec4 Isl6vec4 Is8vec8 Is8vec8 The table below shows the mapping of return values from RO to R7 for any number of elements The same return value mappings also apply when there are fewer than four return values
392. sing Z2p 1 2 4 8 16 Packs structures on 1 2 4 8 or 16 byte boundaries Compiler Options Cross Reference Linux Default Windows Description Remove all predefined macros QAname val Create an assertion name having value val Anamel val Enable disable assumption of ANSI conformance ansi Za 31 Intel C Compiler for Linux Systems User s Guide Windows Description Linux Default ax KIWINIBIP Qax KIWINIBIP Generates specialized OFF code for processor specific codes K W N B and P while also generating generic IA 32 code e 6K Intel Pentium III and compatible Intel processors e W Intel Pentium 4 and compatible Intel processors e N Intel Pentium 4 and compatible Intel processors e B Intel Pentium M and compatible Intel processors e p Intel Pentium 4 processor with Streaming SIMD Extensions 3 SSE3 eG C Don t strip comments OFF SE c Compile to object OFF o only do not link Dname value Dname value Define macro OFF E E Preprocess to stdout OFF fp Oy Use EBP based stack OFF frame for all functions g Zi Produce symbolic OFF debug information in object file The g option changes the default optimization from 02 to O0 32 Compiler Options Quick Reference Linux Windows Description Linux Default Print include file order
393. specify to assume alignment. [Figure: Misaligned Data Crossing 16-Byte Boundary] For example, if you know that elements a[0] and b[0] are aligned on a 16-byte boundary, then the following loop can be vectorized with the alignment option on (#pragma vector aligned). Alignment of Pointers is Known: float *a, *b; int i; #pragma vector aligned for (i = 0; i < 10; i++) { a[i] = b[i]; } After vectorization, the loop is executed as shown here. [Figure: Vector and Scalar Clean-up Iterations: 2 vector iterations (i = 0,1,2,3 and i = 4,5,6,7), 2 clean-up iterations in scalar mode (i = 8,9)] Both the vector iterations a[0:3] = b[0:3]; and a[4:7] = b[4:7]; can be implemented with aligned moves if both the elements a[0] and b[0] (or, likewise, a[4] and b[4]) are 16-byte aligned. Caution: If you specify the vectorizer with incorrect alignment options, the compiler will generate unexpected behavior. Specifically, using aligned moves on unaligned data will result in an illegal instruction exception. Data Alignment Examples: The example below contains a loop that vectorizes, but only with unaligned memory instructions. The compiler can align the local arrays, but because lb is not known at compile time, the correct alignment cannot be determined. Loop Unaligned Due to Unknown Variable Value at Compile Time: void f(int lb) { float
394. ss all object files that have an IR. Note: The compiler does not support multifile IPO for static libraries (.a files). See Compilation with Real Object Files for more information. -ipo enables the driver and compiler to attempt detecting a whole program automatically. If a whole program is detected, the interprocedural constant propagation, stack frame alignment, data layout, and padding of common blocks optimizations perform more efficiently, while more dead functions get deleted. This option is safe. Compilation with Real Object Files: In certain situations you might need to generate real object files with -ipo. To force the compiler to produce real object files instead of "mock" ones with IPO, you must specify -ipo_obj in addition to -ipo. Use of -ipo_obj is necessary under the following conditions: The objects produced by the compilation phase of -ipo will be placed in a static library without the use of xiar or xild -lib. The compiler does not support multifile IPO for static libraries, so all static libraries are passed to the linker. Linking with a static library that contains "mock" object files will result in linkage errors because the objects do not contain real code or data. Specifying -ipo_obj causes the compiler to generate object files that can be used in static libraries. Alternatively, if you create the static library using xiar or xild -lib, then the resulting static library will work as a normal library
395. ss include: computation; horizontal data motion; branch compression/elimination; caching hints. Understanding each of these capabilities and how they interact is crucial to achieving desired results. Computation: The SIMD C++ classes contain vertical operator support for most arithmetic operations, including shifting and saturation. Computation operations include reciprocal (rcp and rcp_nr), square root (sqrt), and reciprocal square root (rsqrt and rsqrt_nr). Operations rcp and rsqrt are new approximating instructions with very short latencies that produce results with at least 12 bits of accuracy. Operations rcp_nr and rsqrt_nr use software refining techniques to enhance the accuracy of the approximations, with a minimal impact on performance. (The nr stands for Newton-Raphson, a mathematical technique for improving performance using an approximate result.) Horizontal Data Support: The C++ SIMD classes provide horizontal support for some arithmetic operations. The term "horizontal" indicates computation across the elements of one vector, as opposed to the vertical, element-by-element operations on two different vectors. The add_horizontal, unpack_low, and pack_sat functions are examples of horizontal data support. This support enables certain algorithms that cannot exploit the full potential of SIMD instructions. Shuffle intrinsics are another example of horizontal data flow. Shuffle intrinsics are not expressed i
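The software refinement behind the _nr variants can be illustrated with the textbook Newton-Raphson update formulas. This is a scalar sketch of the mathematics only; it is not taken from the SIMD class library, and the function names refine_reciprocal and refine_rsqrt are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step toward 1/a, starting from the estimate x0:
//   x1 = x0 * (2 - a * x0)
inline float refine_reciprocal(float a, float x0) {
    return x0 * (2.0f - a * x0);
}

// One Newton-Raphson step toward 1/sqrt(a), starting from the estimate x0:
//   x1 = x0 * (1.5 - 0.5 * a * x0 * x0)
inline float refine_rsqrt(float a, float x0) {
    return x0 * (1.5f - 0.5f * a * x0 * x0);
}
```

Each step roughly squares the relative error, so an estimate good to about 12 bits becomes good to roughly 24 bits after one step, at the cost of a few multiplies and one subtract.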
396. stems, this intrinsic is synonymous with __builtin_return_address. The name and the argument are provided for compatibility with gcc. void _set_return_address(void *addr): This intrinsic overwrites the default return address of the current function with the address indicated by its argument. On return from the current invocation, program execution continues at the address provided. void * _get_frame_address(unsigned int level): This intrinsic returns the frame address of the current function. The level argument must be a constant value. A value of 0 yields the frame address of the current function. Any other value yields a zero return value. On Linux systems, this intrinsic is synonymous with __builtin_frame_address. The name and the argument are provided for compatibility with gcc. Data Alignment, Memory Allocation Intrinsics, and Inline Assembly: This section describes features that support usage of the intrinsics. The following topics are described: Alignment Support; Allocating and Freeing Aligned Memory Blocks; Inline Assembly. Alignment Support: To improve intrinsics performance, you need to align data. For example, when you are using the Streaming SIMD Extensions, you should align data to 16 bytes in memory operations to improve performance. Specifically, you must align __m128 objects as addresses passed to the _mm_load and _mm_store intrinsics. If you want to declare arrays of floats and treat them as __m128 objects by ca
397. stems User's Guide. Fatal Error: This program was not built to run on the processor in your system. Automatic Processor-specific Optimizations (IA-32 only): The -ax{K|W|N|B|P} options direct the compiler to find opportunities to generate separate versions of functions that take advantage of features that are specific to the specified Intel processor. If the compiler finds such an opportunity, it first checks whether generating a processor-specific version of a function is likely to result in a performance gain. If this is the case, the compiler generates both a processor-specific version of a function and a generic version of the function. The generic version will run on any IA-32 processor. At run time, one of the versions is chosen to execute, depending on the Intel processor in use. In this way, the program can benefit from performance gains on more advanced Intel processors while still working properly on older IA-32 processors. The disadvantages of using -ax{K|W|N|B|P} are: • The size of the compiled binary increases because it contains processor-specific versions of some of the code, as well as a generic version of the code. • Performance is affected slightly by the run-time checks to determine which code to use. Note: Applications that you compile with this option will execute on any IA-32 processor. If you specify both the -x and -ax options, the -x option forces the generic code to execute only on processors compatible wit
398. sting, you need to ensure that the float arrays are properly aligned. Use __declspec(align(n)) to direct the compiler to align data more strictly than it otherwise does on both IA-32 and Itanium-based systems. For example, a data object of type int is allocated at a byte address which is a multiple of 4 by default (the size of an int). However, by using __declspec(align(n)), you can direct the compiler to instead use an address which is a multiple of 8, 16, or 32, with the following restrictions on IA-32: • 32-byte addresses must be statically allocated • 16-byte addresses can be locally or statically allocated. You can use this data alignment support as an advantage in optimizing cache line usage. By clustering small objects that are commonly used together into a struct, and forcing the struct to be allocated at the beginning of a cache line, you can effectively guarantee that each object is loaded into the cache as soon as any one is accessed, resulting in a significant performance benefit. The syntax of this extended attribute is as follows: align(n), where n is an integral power of 2, less than or equal to 32. The value specified is the requested alignment. Caution: In this release, __declspec(align(8)) does not function correctly. Use __declspec(align(16)) instead. Note: If a value is specified that is less than the alignment of the affected data type, it has n
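The effect of an alignment request can be checked at run time. The sketch below is an assumption-laden illustration: it uses the standard C++11 alignas specifier as a portable stand-in for __declspec(align(16)), and HotData, samples, and is_aligned are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Small objects that are used together, clustered into one struct and
// aligned to 16 bytes (alignas stands in for __declspec(align(16))).
struct alignas(16) HotData {
    int   counter;
    float scale;
    short flags;
};

// A global array aligned to 16 bytes, as SIMD loads and stores prefer.
alignas(16) float samples[1000];

// Returns true when p is a multiple of `alignment` (a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A check such as is_aligned(samples, 16) is cheap enough to keep in debug builds before data is handed to aligned load or store operations.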
399. sting the appropriate alignment for all objects of that type. However, the allocation of parameters is unaffected by __declspec(align). If necessary, you can assign the value of a parameter to a local variable with the appropriate alignment. You can also force alignment of global variables, such as arrays: __declspec(align(16)) float array[1000]; Allocating and Freeing Aligned Memory Blocks: Use the _mm_malloc and _mm_free intrinsics to allocate and free aligned blocks of memory. These intrinsics are based on malloc and free, which are in the libirc.a library. You need to include malloc.h. The syntax for these intrinsics is as follows: void* _mm_malloc(int size, int align); void _mm_free(void *p); The _mm_malloc routine takes an extra parameter, which is the alignment constraint. This constraint must be a power of two. The pointer that is returned from _mm_malloc is guaranteed to be aligned on the specified boundary. Note: Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free on memory allocated with _mm_malloc, or calling _mm_free on memory allocated with malloc, will cause unpredictable behavior. Inline Assembly: By default, the compiler inlines a number of standard C, C++, and math library functions. This usually results in faster execution of your program. Sometimes inline expansion of library functions can cause unexpected results. The inlined library
400. summarize the capabilities and restrictions of the vectorizer with respect to loop structures. Data Dependence: Data dependence relations represent the required ordering constraints on the operations in serial loops. Because vectorization rearranges the order in which operations are executed, any auto-vectorizer must have at its disposal some form of data dependence analysis. The Data-dependent Loop example shows some code that exhibits data dependence. The value of each element of an array is dependent on itself and its two neighbors. Data-dependent Loop: float data[N]; int i; for (i = 1; i < N-1; i++) { data[i] = data[i-1]*0.25 + data[i]*0.5 + data[i+1]*0.25; } The loop in the example above is not vectorizable because the write to the current element data[i] is dependent on the use of the preceding element data[i-1], which has already been written to and changed in the previous iteration. To see this, look at the access patterns of the array for the first two iterations, as shown in the following example. Data Dependence Vectorization Patterns: for (i = 0; i < 100; i++) a[i] = b[i]; has access pattern read b[0], write a[0], read b[1], write a[1]. i = 1: READ data[0], READ data[1], READ data[2], WRITE data[1]. i = 2: READ data[1], READ data[2], READ data[3], WRITE data[2]. In the normal sequential version of the loop shown, the value of data[1] read during the
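The dependence can be made visible by running the same update in two orders. The following sketch is illustrative only: smooth_in_place is the sequential loop from the text, while smooth_from_copy is a hypothetical model of a reordered execution in which every read happens before any write.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential loop from the text: iteration i reads data[i-1], which
// iteration i-1 has already overwritten.
std::vector<float> smooth_in_place(std::vector<float> data) {
    for (std::size_t i = 1; i + 1 < data.size(); ++i)
        data[i] = data[i - 1] * 0.25f + data[i] * 0.5f + data[i + 1] * 0.25f;
    return data;
}

// Model of a reordered execution: all reads see the original values,
// as if several iterations had loaded their inputs together.
std::vector<float> smooth_from_copy(const std::vector<float>& in) {
    std::vector<float> out = in;
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = in[i - 1] * 0.25f + in[i] * 0.5f + in[i + 1] * 0.25f;
    return out;
}
```

For the input {0, 4, 0, 4, 0} the two functions agree at index 1 but differ at index 2, because the sequential loop consumes the value it has just written. A vectorizer that reordered the reads would therefore change the program's result.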
401. supported by Intel extended directives and library routines that enhance and/or help analyze performance. Compiler Directives: This section discusses the language extended directives used in: • Software Pipelining • Loop Count and Loop Distribution • Loop Unrolling • Prefetching • Vectorization. Pipelining for Itanium(R)-based Applications: The swp and noswp directives indicate preference for a loop to get software-pipelined or not. The swp directive does not help data dependence, but overrides heuristics based on profile counts or lop-sided control flow. The syntax for this directive is: #pragma swp and #pragma noswp. Example of swp Directive: #pragma swp for (i = 0; i < m; i++) if (a[i] == 0) The software pipelining optimization triggered by the swp directive applies instruction scheduling to certain innermost loops, allowing instructions within a loop to be split into different stages and increasing instruction-level parallelism. This can reduce the impact of long-latency operations, resulting in faster loop execution. Loops chosen for software pipelining are always innermost loops that do not contain procedure calls that are not inlined. Because the optimizer no longer considers fully unrolled loops as innermost loops, fully unrolling loops can allow an additional loop to become the innermost loop. You can request and view the optimization report to see whether software pipelining was applied (see Optimizer Report Generation
402. t R = Iu16vec4 A[i]; signed char R = Is8vec16 A[i]; unsigned char R = Iu8vec16 A[i]; signed char R = Is8vec8 A[i]; unsigned char R = Iu8vec8 A[i]. Access and read element i of A. If DEBUG is enabled and the user tries to access an element outside of A, a diagnostic message is printed and the program aborts. Corresponding Intrinsics: none. Element Assignment Operators: Is64vec2 A[i] = int R; Is32vec4 A[i] = int R; Iu32vec4 A[i] = unsigned int R; Is32vec2 A[i] = int R; Iu32vec2 A[i] = unsigned int R; Is16vec8 A[i] = short R; Iu16vec8 A[i] = unsigned short R; Is16vec4 A[i] = short R; Iu16vec4 A[i] = unsigned short R; Is8vec16 A[i] = signed char R; Iu8vec16 A[i] = unsigned char R; Is8vec8 A[i] = signed char R; Iu8vec8 A[i] = unsigned char R. Assign R to element i of A. If DEBUG is enabled and the user tries to assign a value to an element outside of A, a diagnostic message is printed and the program aborts. Corresponding Intrinsics: none. Unpack Operators: Interleave the 64-bit value from the high half of A with the 64-bit value from the high half of B. I64vec2 unpack_high(I64vec2 A, I64vec2 B); Is64vec2 unpack_high(Is64vec2 A, Is64vec2 B); Iu64vec2 unpack_high(Iu64vec2 A, Iu64vec2 B). R0 = A1; R1 = B1. Corresponding intrinsic: _mm_unpackhi_epi64. Interleave the two 32-bit values from the high half of A with the two 32-bit values from the high half of B. I32vec4 unpack_high(I32
403. t by -fast, specify that option after the -fast option on the command line. The options set by -fast may change from release to release. To target -fast optimizations for a specific processor, use one of the -x options. For example: prompt> icpc -fast -xW source_file.cpp Loop Transformations: All these transformations are supported by data dependence. These techniques also include induction variable elimination, constant propagation, copy propagation, forward substitution, and dead code elimination. The loop transformation techniques include: • loop normalization • loop reversal • loop interchange and permutation • loop skewing • loop distribution • loop fusion • scalar replacement. In addition to the loop transformations listed for both IA-32 and Itanium architectures above, the Itanium architecture allows collapsing techniques. Absence of Loop-carried Memory Dependency with IVDEP Directive: For Itanium-based applications, the -ivdep_parallel option indicates there is absolutely no loop-carried memory dependency in the loop where the IVDEP directive is specified. This technique is useful for some sparse matrix applications. For example, the following loop requires -ivdep_parallel in addition to the directive IVDEP to indicate there are no loop-carried dependencies. Example: #pragma ivdep for (i = 1; i < n; i++) The following example shows that using this option and the
404. t link e Preprocessed files 51 Intel C Compiler for Linux Systems User s Guide Option Input S e Source files Generate assemblable files with s suffix and stops the compilation process Preprocessed files syntax Source files Emits diagnostic list of syntax errors to sdt out There is no output for source files free of syntax errors Preprocessed files Default Source files Executable file out files Preprocessed files Assemblable files Object files Libraries Controlling Compilation Flow Option Description Stops the compilation process after an object file has been generated The compiler generates an object file for each C or C source file or preprocessed source file Also takes an assembler file and invokes the assembler to generate an object file Kpic KPIC Generate position independent code lname Link with a library indicated in name nobss init Places variables that are initialized with zeroes in the DATA section SES E Stops the compilation process after C or C source files have been preprocessed and writes the results to files named according to the compiler s default file naming conventions S Generates assemblable file only with s suffix then stops the compilation sox Enables disables the saving of compiler options and version information in the executable file Default is sox Zp 1121418116 Packs structures on 1 2 4 8 or
405. t x, float y) FMA. Description: The fma functions return (x*y)+z. Calling interface: double fma(double x, double y, double z); long double fmal(long double x, long double y, long double z); float fmaf(float x, float y, float z). FMAX. Description: The fmax function returns the maximum numeric value of its arguments. Calling interface: double fmax(double x, double y); long double fmaxl(long double x, long double y); float fmaxf(float x, float y). FMIN. Description: The fmin function returns the minimum numeric value of its arguments. Calling interface: double fmin(double x, double y); long double fminl(long double x, long double y); float fminf(float x, float y). FPCLASSIFY. Description: The fpclassify function returns the value of the number classification macro appropriate to the value of its argument. Calling interface: double fpclassify(double x); long double fpclassifyl(long double x); float fpclassifyf(float x). ISFINITE. Description: The isfinite function returns 1 if x is not a NaN or infinity. Otherwise 0 is returned. Calling interface: int isfinite(double x); int isfinitel(long double x); int isfinitef(float x). ISGREATER. Description: The isgreater function returns 1 if x is greater than y. This function does not raise the invalid floating-point exception. Calling interface: int isgreater(double x, double y); int isgreaterl
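A short usage sketch for these functions follows. It relies only on the standard <cmath> versions (std::fma, std::fmax, std::fmin), which are assumed here to behave like the library entries described above; dot and clamp_to are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Dot product accumulated with fused multiply-add: each step computes
// x[i] * y[i] + acc with a single rounding.
double dot(const double* x, const double* y, std::size_t n) {
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        acc = std::fma(x[i], y[i], acc);
    return acc;
}

// Clamp v into [lo, hi] using fmin and fmax.
double clamp_to(double v, double lo, double hi) {
    return std::fmax(lo, std::fmin(v, hi));
}
```

For example, dot applied to {1, 2, 3} and {4, 5, 6} yields 32, and clamp_to(5.0, 0.0, 1.0) yields 1.0.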
406. t64 addend specified by its argument Maps to the fet chadd instruction InterlockedExchange64 volatile Do an exchange operation Target __int64 value atomically Maps to the xchg instruction unsigned _ int64 Same as InterlockedExchangeU64 volatile unsigned InterlockedExchange64 __int64 Target unsigned __int64 value for unsigned quantities unsigned __int64 Maps to the cmpxchg rel InterlockedCompareExchange64 rel volatile instruction with appropriate unsigned __int64 Destination unsigned __int64 Exchange unsigned __int64 Comparand setup Atomically compare and exchange the value specified by the first argument a 64 bit pointer unsigned _ int64 Maps to the cmpxchg acq InterlockedCompareExchange64 acq volatile instruction with appropriate unsigned __int64 Destination unsigned __int64 Exchange unsigned __int64 Comparand setup Atomically compare and exchange the value specified by the first argument a 64 bit pointer int64 Same as above for signed InterlockedCompareExchange64 volatile quantities int64 Destination int64 Exchange int64 Comparand int64 InterlockedExchangeAdd64 volatile Use compare and exchange to int64 addend int64 increment do an atomic add of the increment value to the addend Maps to a loop with the cmpxchg instruction to guarantee atomicity int64 InterlockedAdd64 volatile int 64 Same as above Returns the
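The "loop with compare-and-exchange" pattern described for the exchange-add intrinsic can be sketched portably with std::atomic. This is an illustration under stated assumptions, not the intrinsic itself: exchange_add64 is a hypothetical name, and std::atomic stands in for the volatile 64-bit target.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Atomically adds `increment` to `addend` using a compare-and-exchange
// loop and returns the original value.
std::int64_t exchange_add64(std::atomic<std::int64_t>& addend,
                            std::int64_t increment) {
    std::int64_t old_value = addend.load();
    // On failure, compare_exchange_weak reloads old_value with the current
    // contents, so the loop retries with fresh data until the swap succeeds.
    while (!addend.compare_exchange_weak(old_value, old_value + increment)) {
    }
    return old_value;
}
```

Returning old_value + increment instead would give the "returns the new value" variant.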
407. t_lock continue Note that the first branch is predicted to fall through to the critical section in anticipation of successfully gaining access to the lock. It is highly recommended that all spin-wait loops include the PAUSE instruction. Since PAUSE is backwards compatible to all existing IA-32 processor generations, a test for processor type (a CPUID test) is not needed. All legacy processors will execute PAUSE as a NOP, but in processors which use the PAUSE as a hint there can be significant performance benefit. Miscellaneous Operations for Streaming SIMD Extensions 2: The miscellaneous intrinsics for Streaming SIMD Extensions 2 are listed in the following table, followed by their descriptions. The prototypes for Streaming SIMD Extensions 2 intrinsics are in the emmintrin.h header file. [Table: intrinsic names and corresponding instructions; contents not recoverable] __m128i _mm_packs_epi16(__m128i a, __m128i b): Packs the 16 signed 16-bit integers from a and b into 8-bit integers and saturates. r0 := SignedSaturate(a0); r1 := SignedSaturate(a1); ... r7 := SignedSaturate(a7); r8 := SignedSaturate(b0); r9 := SignedSaturate(b1); ... r15 := SignedSaturate(b7). __m128i _mm_packs_epi32(__m128i a, __m128i b): Packs the 8 signed 32-bit integers from a and b into
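The pack-with-saturation semantics listed above (r0..r7 from a, r8..r15 from b) can be modeled in scalar code. The sketch below is a reference model for checking expectations, not the intrinsic; signed_saturate8 and packs_epi16_model are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clamp a signed 16-bit value into the signed 8-bit range [-128, 127].
inline std::int8_t signed_saturate8(std::int16_t v) {
    if (v > 127) return 127;
    if (v < -128) return -128;
    return static_cast<std::int8_t>(v);
}

// Scalar model of the 16-bit to 8-bit pack: elements of a fill r[0..7],
// elements of b fill r[8..15], each with signed saturation.
std::array<std::int8_t, 16> packs_epi16_model(
        const std::array<std::int16_t, 8>& a,
        const std::array<std::int16_t, 8>& b) {
    std::array<std::int8_t, 16> r{};
    for (std::size_t i = 0; i < 8; ++i) {
        r[i] = signed_saturate8(a[i]);
        r[i + 8] = signed_saturate8(b[i]);
    }
    return r;
}
```

Values that already fit in 8 bits pass through unchanged; anything outside the range is pinned to 127 or -128 rather than wrapping.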
408. table and merges available dynamic information (.dyn) files into a pgopti.dpi file. Note: The dynamic information files are produced in Phase 2 when you run the instrumented executable. If you perform multiple executions of the instrumented program, -prof_use merges the dynamic information files again and overwrites the previous pgopti.dpi file. Disabling Function Splitting (Itanium Compiler only): -fnsplit- disables function splitting. Function splitting is enabled by -prof_use in Phase 3 to improve code locality by splitting routines into different sections: one section to contain the cold or very infrequently executed code, and one section to contain the rest of the code (hot code). You can use -fnsplit- to disable function splitting for the following reasons: • Most importantly, to get improved debugging capability. In the debug symbol table, it is difficult to represent a split routine, that is, a routine with some of its code in the hot code section and some of its code in the cold code section. • The -fnsplit- option disables the splitting within a routine but enables function grouping, an optimization in which entire routines are placed either in the cold code section or the hot code section. Function grouping does not degrade debugging capability. • Another reason can arise when the profile data does not represent the actual program behavior, that is, when the routine is actually used frequently rather than infrequently
409. tecture, Intel Corporation, doc. number 245318-001 • Itanium(R) Architecture Software Developer's Manual, Vol. 3: Instruction Set Reference, Intel Corporation, doc. number 245319-001 • Itanium(R) Architecture Software Developer's Manual, Vol. 4: Itanium Processor Programmer's Guide, Intel Corporation, doc. number 245319-001 • Intel Architecture Optimization Manual, Intel Corporation, doc. number 245127 • Intel Processor Identification with the CPUID Instruction, Intel Corporation, doc. number 241618 • Intel Architecture MMX Technology Programmer's Reference Manual, Intel Corporation, doc. number 241618 • Pentium Pro Processor Developer's Manual (3-volume Set), Intel Corporation, doc. number 242693 • Pentium II Processor Developer's Manual, Intel Corporation, doc. number 243502-001 • Pentium Processor Specification Update, Intel Corporation, doc. number 242480 • Pentium Processor Family Developer's Manual, Intel Corporation, doc. numbers 241428-005. Most Intel documents are also available from the Intel Corporation Web site at http://www.intel.com. How to Use This Document: This User's Guide explains how you can use the Intel C++ Compiler. It provides information on how to get started with the Intel C++ Compiler, how this compiler operates, and what capabilities it offers for high performance. You learn how to use the standard and advanced compiler optimizations to gain maximum performance for your appli
410. tel Architecture MMX Technology Programmer's Reference Manual. For descriptions of data types, see the Intel Architecture Software Developer's Manual, Volume 2. Streaming SIMD Extensions: This section describes the C++ language-level features supporting the Streaming SIMD Extensions in the Intel C++ Compiler. These topics explain the following features of the intrinsics: • Floating Point Intrinsics • Arithmetic Operation Intrinsics • Logical Operation Intrinsics • Comparison Intrinsics • Conversion Intrinsics • Load Operations • Set Operations • Store Operations • Cacheability Support • Integer Intrinsics • Memory and Initialization Intrinsics • Miscellaneous Intrinsics • Using Streaming SIMD Extensions on Itanium Architecture. The prototypes for Streaming SIMD Extensions intrinsics are in the xmmintrin.h header file. Floating-point Intrinsics for Streaming SIMD Extensions: You should be familiar with the hardware features provided by the Streaming SIMD Extensions when writing programs with the intrinsics. The following are four important issues to keep in mind: • Certain intrinsics, such as _mm_loadr_ps and _mm_cmpgt_ss, are not directly supported by the instruction set. While these intrinsics are convenient programming aids, be mindful that they may consist of more than one machine-language instruction. • Floating-point data loaded or stored as __m128 objects must be
411. tel C++ Compiler for Linux Systems User's Guide. Criteria for Inline Function Expansion: Once the criteria are met, the compiler picks the routines whose inline expansion will provide the greatest benefit to program performance. The inlining heuristics used by the compiler differ based on whether or not you use profile-guided optimizations (-prof_use). When you use profile-guided optimizations with -ip or -ipo, the compiler uses the following heuristics: • The default heuristic focuses on the most frequently executed call sites, based on the profile information gathered for the program. • By default, the compiler will not inline functions with more than 230 intermediate statements. You can change this value by specifying the option -Qoption,c,-ip_ninl_max_stats=new_value. Note: there is a higher limit for functions declared by the user as inline or __inline. • The default inline heuristic will stop inlining when direct recursion is detected. • The default heuristic will always inline very small functions that meet the minimum inline criteria. • Default for Itanium-based applications: ip_ninl_min_stats = 15. • Default for IA-32 applications: ip_ninl_min_stats = 7. This limit can be modified with the option -Qoption,c,-ip_ninl_min_stats=new_value. If you do not use profile-guided optimizations with -ip or -ipo, the compiler uses less aggressive inlining heuristics: • Inline a function if the inline expansion will not increase the si
412. tel architecture features while maintaining object code compatibility with previous generations of Intel Pentium processors (for IA-32-based systems only). Product Web Site and Support: For the latest information about the Intel C++ Compiler, visit http://developer.intel.com/software/products. For specific details on the Itanium architecture, visit the web site at http://developer.intel.com/design/itanium/under_lnx.htm. System Requirements. IA-32 Processor System Requirements: • A computer based on a Pentium processor or subsequent IA-32-based processor (Pentium 4 processor recommended) • 128 MB of RAM (256 MB recommended) • 100 MB of disk space. Itanium(R) Processor System Requirements: • A computer with an Itanium processor • 256 MB of RAM • 100 MB of disk space. Software Requirements: See the Release Notes for a complete list of system requirements. FLEXlm Electronic Licensing: The Intel C++ Compiler uses Macrovision's FLEXlm licensing technology. The compiler requires a valid license file in the licenses directory in the installation path. The default directory is /opt/intel_cc_80/licenses. The license files have a .lic file extension. If you require a counted license, see Using the Intel License Manager for FLEXlm (flex_ug.pdf). Related Publications: The following documents provide additional information relevant to the Intel C++ Compiler: • ISO/IEC 9899:19
413. tems. In conjunction with the -ax{K|W|N|B|P} and -x{K|W|N|B|P} options, this option causes the compiler to perform more aggressive data dependency analysis than for -O2. This may result in longer compilation times. Option: -fast. Effect: The -fast option enhances execution speed across the entire program by including the following options that can improve run-time performance: • -O3 (maximum speed and high-level optimizations) • -ipo (enables interprocedural optimizations across files) • -static (prevents linking with shared libraries). To override one of the options set by -fast, specify that option after the -fast option on the command line. The options set by -fast may change from release to release. To target -fast optimizations for a specific processor, use one of the -x options. For example: prompt> icpc -fast -xW source_file.cpp Restricting Optimizations: The following options restrict or preclude the compiler's ability to optimize your program. Option and Description: -O0: Disables optimizations. Enables the -fp option. -mp: Restricts optimizations that cause some minor loss or gain of precision in floating-point arithmetic to maintain a declared level of precision and to ensure that floating-point arithmetic more nearly conforms to the ANSI and IEEE standards. -g: Specifying the -g option turns off the default -O2 option and makes -O0 the default unless
414. terleave the 32-bit value from the low half of A with the 32-bit value from the low half of B. I32vec2 unpack_low(I32vec2 A, I32vec2 B); Is32vec2 unpack_low(Is32vec2 A, Is32vec2 B); Iu32vec2 unpack_low(Iu32vec2 A, Iu32vec2 B). R0 = A0; R1 = B0. Corresponding intrinsic: _mm_unpacklo_pi32. Interleave the four 16-bit values from the low half of A with the four 16-bit values from the low half of B. I16vec8 unpack_low(I16vec8 A, I16vec8 B); Is16vec8 unpack_low(Is16vec8 A, Is16vec8 B); Iu16vec8 unpack_low(Iu16vec8 A, Iu16vec8 B). R0 = A0; R1 = B0; R2 = A1; R3 = B1; R4 = A2; R5 = B2; R6 = A3; R7 = B3. Corresponding intrinsic: _mm_unpacklo_epi16. Interleave the two 16-bit values from the low half of A with the two 16-bit values from the low half of B. I16vec4 unpack_low(I16vec4 A, I16vec4 B); Is16vec4 unpack_low(Is16vec4 A, Is16vec4 B); Iu16vec4 unpack_low(Iu16vec4 A, Iu16vec4 B). R0 = A0; R1 = B0; R2 = A1; R3 = B1. Corresponding intrinsic: _mm_unpacklo_pi16. Interleave the eight 8-bit values from the low half of A with the eight 8-bit values from the low half of B. I8vec16 unpack_low(I8vec16 A, I8vec16 B); Is8vec16 unpack_low(Is8vec16 A, Is8vec16 B); Iu8vec16 unpack_low(Iu8vec16 A, Iu8vec16 B). R0 = A0; R1 = B0; R2 = A1; R3 = B1; R4 = A2; R5 = B2; R6 = A3; R7 = B3; R8 = A4; R9 = B4; R10 = A5; R11 = B5; R12 = A6; R13 = B6; R14 = A7; R15 = B7. Corresponding intrinsic: _m
415. tes. A global symbol is one that is visible outside the compilation unit (single source file and its include files) in which it is declared. In C/C++, this means anything declared at file level without the static keyword. For example: int x = 5; /* global data definition */ extern int y; /* global data reference */ int five() /* global function definition */ { return 5; } extern int four(); /* global function reference */ A complete program consists of a main program file and possibly one or more shareable object (.so) files that contain the definitions for data or functions referenced by the main program. Similarly, shareable objects might reference data or functions defined in other shareable objects. Shareable objects are so called because if more than one simultaneously executing process has the shareable object mapped into its virtual memory, there is only one copy of the read-only portion of the object resident in physical memory. The main program file and any shareable objects that it references are collectively called the components of the program. Each global symbol definition or reference in a compilation unit has a visibility attribute that controls how, or if, it may be referenced from outside the component in which it is defined. There are five possible values for visibility: • EXTERNAL: The compiler must treat the symbol as though it is defined in another component. For a definition, this means that the compiler must assume that
416. th library libimf.a contains optimized versions of math functions found in the standard C run-time library. The functions in libimf.a are optimized for program execution speed on Intel Pentium III and Pentium 4 processors. The Itanium compiler also includes a libimf.a designed to optimize performance on Itanium-based systems. The Intel math library is linked by default. See Managing Libraries and Intel Math Library. Intel Shared Libraries: By default, the Intel C++ Compiler links Intel-provided C++ libraries dynamically. The GNU and Linux system libraries are also linked dynamically. Options for Shared Libraries. Option and Description: -i_dynamic: Use the -i_dynamic option to link Intel-provided C++ libraries dynamically (default). This has the advantage of reducing the size of the application binary, but it also requires the libraries to be on the systems where the application runs. -shared: The -shared option instructs the compiler to build a Dynamic Shared Object (DSO) instead of an executable. For more details, refer to the ld man page documentation. -fpic: Use the -fpic option when building shared libraries for Itanium-based systems. It is required for the compilation of each object file included in the shared library. Managing Libraries: The LD_LIBRARY_PATH environment variable contains a colon-separated list of directories in which the linker will search
417. that unrelated .dyn files, oftentimes from previous runs or from other tests, are not present in that directory. Otherwise, profile information will be based on invalid profile data. This can negatively impact the performance of optimized code as well as generate misleading coverage information. Note: For successful tool execution, you should: • Name each test .dpi file so that the file names uniquely identify each test. • Create a DPI list file: a text file that contains the names of all .dpi test files. The name of this file serves as an input for the test prioritization tool execution command. Each line of the DPI list file should include one, and only one, .dpi file name. The name can optionally be followed by the duration of the execution time for a corresponding test in the dd:hh:mm:ss format. For example, Test1.dpi 00:00:60:35 informs that Test1 lasted 0 days, 0 hours, 60 minutes, and 35 seconds. The execution time is optional. However, if it is not provided, then the tool will not prioritize the test for minimizing execution time. It will prioritize to minimize the number of tests only. Usage Model: The chart that follows presents the Test Prioritization Tool usage model. Step 1: Compile with -prof_genx. Keep the static profile information (.spi) for coverage analysis and PGT. Output: Instrumented Executables. Step 2.1: Run instrumented executables on Test_1. Step 2.n: Run
418. the ANSI/ISO requirements for minimum translation limits. Macros Included with the Compiler: The ANSI/ISO standard for the C++ language requires that certain predefined macros be supplied with conforming compilers. The following table lists the macros that the Intel C++ Compiler supplies in accordance with this standard. The compiler provides predefined macros in addition to the predefined macros required by the standard. Macro and Description: __cplusplus: The name __cplusplus is defined when compiling a C++ translation unit. __DATE__: The date of compilation as a string literal in the form Mmm dd yyyy. __FILE__: A string literal representing the name of the file being compiled. __LINE__: The current line number as a decimal constant. __STDC__: The name __STDC__ is defined when compiling a C translation unit. __TIME__: The time of compilation as a string literal in the form hh:mm:ss. C99 Support: The following C99 features are supported in this version of the Intel C++ Compiler when using the -c99 option: • restricted pointers (restrict keyword, available with -restrict; see Note below) • variable-length arrays • flexible array members • complex number support (_Complex keyword) • hexadecimal floating-point constants • compound literals • designated initializers • mixed declarations and code • macros with a variable number of arguments • inline functions (inline keyword) • boolea
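The standard predefined macros listed above can be combined into a simple build stamp. This is a small sketch using only the standard macros; build_stamp is a hypothetical helper name, not something the compiler supplies.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Builds a string such as "<file> built <Mmm dd yyyy> <hh:mm:ss>" from the
// predefined macros; the values are fixed when this file is compiled.
std::string build_stamp() {
    return std::string(__FILE__) + " built " + __DATE__ + " " + __TIME__;
}

// __cplusplus is defined whenever this file is compiled as C++.
#ifndef __cplusplus
#error "this sketch expects a C++ translation unit"
#endif
```

Because __DATE__ and __TIME__ expand at compile time, the stamp identifies the build rather than the moment the program runs.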
419. the changes needed to enable vectorization and no others. In particular, you should avoid these common changes: • do not unroll your loops; the compiler does this automatically • do not decompose one loop with several statements in the body into several single-statement loops. Restrictions. Hardware: The compiler is limited by restrictions imposed by the underlying hardware. In the case of Streaming SIMD Extensions, the vector memory operations are limited to stride-1 accesses, with a preference for 16-byte-aligned memory references. This means that if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it for a distinct target architecture. Style: The style in which you write source code can inhibit optimization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove that two memory references refer to distinct locations. Consequently, this prevents certain reordering transformations. Many stylistic issues that prevent automatic vectorization by compilers are found in loop structures. The ambiguity arises from the complexity of the keywords, operators, data references, and memory operations within the loop bodies. However, by understanding these limitations and by knowing how to interpret diagnostic messages, you can modify your program to overcome the known limitations and enable effective vectorizations. The following topics
420. thread creation and synchronization. Auto-parallelization: Enabling, Options, and Environment Variables. To enable the auto-parallelizer, use the -parallel option. The -parallel option detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops. An example of the command using auto-parallelization follows: prompt> icpc -c -parallel prog.cpp Auto-parallelization Options: The -parallel option enables the auto-parallelizer if the -O2 or -O3 optimization option is also on (the default is -O2). Option and Description: -parallel: Enables the auto-parallelizer. -parallel_threshold{1-100}: Controls the work threshold needed for auto-parallelization (see later subsection). -par_report{1|2|3}: Controls the diagnostic messages from the auto-parallelizer (see later subsection). Auto-parallelization Environment Variables. Variable, Description, and Default: OMP_NUM_THREADS: Controls the number of threads used. Default: number of processors currently installed in the system while generating the executable. OMP_SCHEDULE: Specifies the type of runtime scheduling. Auto-parallelization Threshold Control and Diagnostics. Threshold C
421. ti nsigned __int64 Exchange_acq volatile nation unsigned __int64 Exchange u Comparand nsigned __int64
void _ReleaseSpinLock(volatile int *x)

Atomically decrement by one the value specified by its argument. Maps to the fetchadd4 instruction.
Do an exchange operation atomically. Maps to the xchg4 instruction.
Do a compare and exchange operation atomically. Maps to the cmpxchg4 instruction with appropriate setup.
Use compare and exchange to do an atomic add of the increment value to the addend. Maps to a loop with the cmpxchg4 instruction to guarantee atomicity.
Same as above, but returns new value, not the original one.
Map the exch8 instruction; atomically compare and exchange the pointer value specified by its first argument (all arguments are pointers).
Atomically exchange the 32-bit quantity specified by the 1st argument. Maps to the xchg4 instruction.
Maps to the cmpxchg4.rel instruction with appropriate setup. Atomically compare and exchange the value specified by the first argument (a 64-bit pointer).
Same as above, but map the cmpxchg4.acq instruction.
Release spin lock.

__int64 _InterlockedIncrement64(volatile __int64 *addend)
Increment by one the value specified by its argument. Maps to the fetchadd instruction.

__int64 _InterlockedDecrement64(volatile
Decrement by one the value in
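The "atomic add built from a compare-and-exchange loop" idea described in the excerpt above can be sketched in portable C++ with std::atomic. This is an assumption-laden illustration, not the Itanium intrinsics themselves: exchange_add_sketch and add_and_fetch_sketch are hypothetical names, and the code says nothing about which machine instructions a compiler emits.

```cpp
#include <atomic>

// Hypothetical helper: atomically adds `increment` to `addend` using a
// compare-and-exchange loop and returns the ORIGINAL value.
long exchange_add_sketch(std::atomic<long>& addend, long increment) {
    long observed = addend.load();
    // On failure, compare_exchange_weak stores the current value into
    // `observed`, so the loop retries with fresh data until the swap succeeds.
    while (!addend.compare_exchange_weak(observed, observed + increment)) {
    }
    return observed;
}

// Variant that returns the NEW value instead of the original one.
long add_and_fetch_sketch(std::atomic<long>& addend, long increment) {
    return exchange_add_sketch(addend, increment) + increment;
}
```

The retry loop is what guarantees atomicity: the update is only published if no other thread changed the value between the read and the swap.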
422. tifier Names The Character ESC in Constants p Specifying Attributes of Variables pe Specifying Attributes of Types Inquiring on Alignment of Types or Variables An Inline Function is As Fast As a Macro Assembler Instructions with C Expression Operands E Controlling Names Used in Assembler Code Variables in Specified Registers Alternate Keywords i Incomplete enum Types Function Names as Strings Getting the Return or Frame Address of a Function Most Most Yes Yes Yes Yes Yes GNU Description and Examples http gcc gnu org onlinedocs gcc 3 2 gcc Function Prototypes html Function 20Prototypes http gcc gnu org onlinedocs gec 3 2 gcc C Comments html C 20Comments http gcc gnu org onlinedocs gec 3 2 gcc Dollar Signs html Dollar 20Signs http gcc gnu org onlinedocs gec 3 2 gcc Character Escapes html Character 20Escapes http gcc gnu org onlinedocs gcc 3 2 gcc Variable Attributes html Variable 20Attributes http gcc gnu org onlinedocs gec 3 2 gcc Type Attributes html T ype 20A ttributes http gcc gnu org onlinedocs gec 3 2 gcc Alignment html Alignment http gcc gnu org onlinedocs gec 3 2 gcc Inline html Inline http gcc gnu org onlinedocs gec 3 2 gcc Extended Asm html Extended 20Asm http gcc gnu org onlinedocs gec 3 2 gcc Asm Labels htmlZAsm9620 Labels http gcc gnu org onlinedocs gcc 3 2 gcc Explicit Reg Vars html Explicit 20Reg
423. tifiers in a single block; Number of macros simultaneously defined; Number of parameters to a function call; Number of parameters per macro; Number of characters in a string; Bytes in an object; Include file nesting depth; Case labels in a switch; Members in one structure or union; Enumeration constants in one enumeration; Levels of structure nesting; Size of arrays

Tested Values: 512, 512, 512, 512, 2048, 64K, 128K, 2048, 128K, 512, 512, 128K, 512K, 512, 32K, 32K, 8192, 320, 2 GB

Key Files

Key Files Summary for IA-32 Compiler

The following tables list and briefly describe files that are installed for use by the IA-32 version of the compiler.

bin Files

File          Description
codecov       Code coverage tool
iccvars.sh    Batch file to set environment variables
icc.cfg       Configuration file for use from command line
icc, icpc     Intel C++ Compiler
profmerge     Utility used for Profile Guided Optimizations
proforder     Utility used for Profile Guided Optimizations
tselect       Test prioritization tool
xiar          Tool used for Interprocedural Optimizations
xild          Tool used for Interprocedural Optimizations

include Files

File          Description
dvec.h        SSE2 intrinsics for Class Libraries
emm_func.h    Header file for SSE2 intrinsics (used by emmintrin.h)
emmintrin.h   Principal header file for SSE2 intrinsics
float.h       IEEE 754 version of standard
fvec.h
iso646.h
ivec.h
424. timizations. The code coverage tool can create two levels of coverage:
• Top level (for a group of selected modules)
• Individual module source view

Top Level Coverage

The top-level coverage reports the overall code coverage of the modules that were selected. The following options are provided:
• You can select the modules of interest.
• For the selected modules, the tool generates a list with their coverage information. The information includes the total number of functions and blocks in a module and the portions that were covered.
• By clicking on the title of columns in the reported tables, the lists may be sorted in ascending or descending order based on:
  • basic block coverage
  • function coverage
  • function name

The example that follows shows a top-level coverage summary for a project. By clicking on a module name (for example, SAMPLE.C), the browser will display the coverage source view of that particular module.

[Figure: browser view of the code coverage HTML summary for Sample_Project, listing files with their Functions and Blocks coverage]
425. ting point operands OFF evaluated to the precision indicated by the program 17 Intel C Compiler for Linux Systems User s Guide Option Description Default anise laxed Provides significant OFF performance benefit but slightly less precision when calculating floating point divides reciprocals square roots and reciprocal square roots The results have an error of no more than 1 ulp unit in the last place when rounding to nearest mode is used but most often less than 0 5 ulp and no more than 1 5 ulp when other rounding modes are used Enable floating point OFF speculations with the following mode conditions e fast speculate floating point operations te ionmode e safe speculate only when safe e strict same as off e off disables speculation of floating point operations ip no inlining Disables inlining that would OFF result from the ip interprocedural optimization but has no effect on other interprocedural optimizations ip no pinlining Disable partial inlining OFF Requires ip or ipo ipo Enables interprocedural OFF optimizations across files Generates a multifile object file OFF ipo out o that can be used in further link steps Forces the compiler to create OFF real object files when used with ipo Generates a multifile OFF assemblable file named ipo out that can be used in further link steps 18 Compiler Options Quick Reference
426. tion Type _mm_cvtpd_ps CVTPD2PS m128 __m128d a mm cvtps pd CVTPS2PD mi28d m128 a mm cvtepi32 pd CVTDQ2PD mi128d m128i a mm cvtpd epi32 CVTPD2DOQ m128i m128d a mm cvtsd si32 CVTSD2SI int __m128d a _mm_cvtsd_ss CVTSD2SS m128 __m128 a m128d b mm cvtsi32 sd CVTSI2SD __mi28d __m128d a int b _mm_cvtss_sd CVTSS2SD mi28d mi28da a __m128 b mm cvttpd epi32 CVTTPD2DOQ m128i m128d a mm cvttsd si32 CVTTSD2SI int __m128d a mm cvtpd pi32 CVTPD2PI m64 __m128d a 257 Intel C Compiler for Linux Systems User s Guide Intrinsic Name Corresponding Return Parameters Instruction _mm_cvttpd_pi32 CVTTPD2PI _mm_cvtpi32_pd CVTPI2PD mm cvtsd f64 None m128 mm cvtpd ost 1m128d a Converts the two DP FP values of a to SP FP values rO float a0 rl float al r2 0 0 r3 0 0 mi28d mm cvtps pd 1m128 a Converts the lower two SP FP values of a to DP FP values ro double a0 T double al m128d mm cvtepi32 pd m128i a Converts the lower two signed 32 bit integer values of a to DP FP values r0 double a0 rl double al m128i _mm_cvtpd_epi32 __m128d a Converts the two DP FP values of a to 32 bit signed integer values rO int a0 rl int al r2 0x0 r3 0x0 int _mm_cvtsd_si32 __m128d a Converts the lower DP FP value of a to a 32 bit signed integer value r int a0 m128 _mm_cvt
427. tion returns the principal value of the inverse tangent of y/x in the range from -180 to 180 degrees. errno: EDOM, for x = 0.
Calling interface:
double atan2d(double x, double y);
long double atan2dl(long double x, long double y);
float atan2df(float x, float y);

COS
Description: The cos function returns the cosine of x measured in radians. This function may be inlined with the Itanium compiler.
Calling interface:
double cos(double x);
long double cosl(long double x);
float cosf(float x);

COSD
Description: The cosd function returns the cosine of x measured in degrees.
Calling interface:
double cosd(double x);
long double cosdl(long double x);
float cosdf(float x);

COT
Description: The cot function returns the cotangent of x measured in radians. errno: ERANGE, for overflow conditions.
Calling interface:
double cot(double x);
long double cotl(long double x);
float cotf(float x);

COTD
Description: The cotd function returns the cotangent of x measured in degrees. errno: ERANGE, for overflow conditions.
Calling interface:
double cotd(double x);
long double cotdl(long double x);
float cotdf(float x);

SIN
Description: The sin function returns the sine of x measured in radians. This function may be inlined with the Itanium compiler.
Calling interface:
double sin(double x);
long double sinl(long double x);
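To make the radians-versus-degrees distinction concrete, here is a small sketch built only on the standard <cmath> functions. The helpers deg_to_rad, cosd_sketch, and cotd_sketch are hypothetical stand-ins and are not the Intel Math Library routines; a real degree-based implementation may well be more accurate at arguments such as 90 degrees, where this simple conversion does not return an exact zero.

```cpp
#include <cmath>

// Hypothetical helper: converts an angle in degrees to radians.
double deg_to_rad(double degrees) {
    const double pi = std::acos(-1.0);
    return degrees * pi / 180.0;
}

// Cosine of an angle given in degrees, via the radian-based std::cos.
double cosd_sketch(double x) {
    return std::cos(deg_to_rad(x));
}

// Cotangent of an angle given in degrees, computed as 1 / tan.
double cotd_sketch(double x) {
    return 1.0 / std::tan(deg_to_rad(x));
}
```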
428. to get output from multiple phases. Valid name arguments:
• ipo: Interprocedural Optimizer
• hlo: High Level Optimizer
• ilo: Intermediate Language Scalar Optimizer
• ecg: Code Generator
• omp: OpenMP
• all: All phases

Specifies a routine substring. Reports from all routines with names that include substring as part of the name are generated. By default, reports for all routines are generated. Default: OFF.

Displays all possible settings for -opt_report_phase. No compilation is performed. Default: OFF.

Same as -qp.

Stops the compilation process after C or C++ source files have been preprocessed and writes the results to files named according to the compiler's default file naming conventions.

-parallel
Detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops. Default: OFF.

-par_report{0|1|2|3}
Controls the auto-parallelizer's diagnostic levels 0, 1, 2, or 3 as follows:
• -par_report0: no diagnostic information is displayed
• -par_report1: indicates loops successfully auto-parallelized (default)
• -par_report2: loops successfully and unsuccessfully auto-parallelized
• -par_report3: same as 2, plus additional information about any proven or assumed dependences inhibiting auto-pa
429. to prepare your program by annotating the code with OpenMP directives. The Intel C++ Compiler first processes the application and produces a multithreaded version of the code, which is then compiled. The output is an executable program with the parallelism implemented by threads that execute parallel regions or constructs.

Targeting a Processor Run-time Check

While parallelizing a loop, the Intel compiler's loop parallelizer (OpenMP) tries to determine the optimal set of configurations for a given processor. At run time, a check is performed to determine for which IA-32 processor OpenMP should optimize a given loop. See detailed information in Processor-specific Runtime Checks, IA-32 Systems.

Performance Analysis

For performance analysis of your program, you can use the Intel VTune Performance Analyzer to show performance information. You can obtain detailed information about which portions of the code require the largest amount of time to execute and where parallel performance problems are located.

Parallel Processing Thread Model

This topic explains the processing of the parallelized program and adds more definitions of the terms used in parallel programming.

The Execution Flow

As previously mentioned, a program containing OpenMP C++ API compiler directives begins execution as a single process, called the master thread of execution. The master thread executes sequentially until the first parallel construct is encountered. In the
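A minimal sketch of annotating code with an OpenMP directive is shown below. The function name parallel_sum is hypothetical. The sketch assumes a compiler with OpenMP support enabled; when OpenMP is not enabled the pragma is typically ignored and the loop simply runs on the single initial thread, which matches the sequential part of the execution flow described above.

```cpp
#include <vector>

// Sums the elements of `values`. With OpenMP enabled, the loop iterations are
// divided among threads and the per-thread partial sums are combined by the
// reduction clause. Without OpenMP, the loop runs serially.
double parallel_sum(const std::vector<double>& values) {
    double total = 0.0;
    const long n = static_cast<long>(values.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}
```

The code before and after the loop runs on the master thread; only the annotated loop forms a parallel construct.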
430. treaming SIMD Extensions 2 249 USA ge SYNTAX 2 5 stresse ere anus 203 HP option cedi ease eee 11 92 98 ip no inlining option 11 92 97 ip no pinlining option 11 97 IPF flt eval method0 option 11 80 83 IPF fltacc option 11 80 83 PF fma option ssssss 11 80 83 IPF fp speculation option 0 s 0 11 80 83 ipo option ssseesse 11 92 94 95 97 98 ipo Optlon 1e nter 11 92 ipo obj option 11 92 94 119 ipo S option 11 92 isnan library function 190 Isystem Option 11 ivdep_parallel option 11 117 JO library function 184 394 jl library function 184 jn library function 184 Ket option x ssid Ace eh een 11 KMP LIBRARY environment variable 143 KMP STACKSIZE environment variable 143 Knopic option 11 KPIC option cecceceesseesceesceesceeeeeseeeteenseeneees 11 G 11 language Conformance seseeeeeeerreereee 76 Ee VE 150 LD LIBRARY PATH environment variable 62 Idexp library function ssessse 180 legal information 2 Igamma library function 184 Igamma r library function 184 libumta saos tere Moo tee 60 libraries MANA QING esee iine Eee 62 eenegen 3 Ilrint library function 187 Ilround library functio
431. types for MMX technology intrinsics are in the mmintrin.h header file.

Alternate Name    Operation    Corresponding Instruction

__m64 _m_pand(__m64 m1, __m64 m2)
Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2.

__m64 _m_pandn(__m64 m1, __m64 m2)
Perform a logical NOT on the 64-bit value in m1 and use the result in a bitwise AND with the 64-bit value in m2.

__m64 _m_por(__m64 m1, __m64 m2)
Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2.

__m64 _m_pxor(__m64 m1, __m64 m2)
Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.

MMX Technology Compare Intrinsics

The prototypes for MMX technology intrinsics are in the mmintrin.h header file.

Intrinsic Name   Alternate Name    Comparison   Number of Elements   Element Bit Size   Corresponding Instruction
_m_pcmpeqb       _mm_cmpeq_pi8
_m_pcmpgtb       _mm_cmpgt_pi8
_m_pcmpgtd       _mm_cmpgt_pi32                 2                    32

__m64 _m_pcmpeqb(__m64 m1, __m64 m2)
If the respective 8-bit values in m1 are equal to the respective 8-bit values in m2, set the respective 8-bit resulting values to all ones; otherwise set them to all zeros.

__m64 _m_pcmpeqw(__m64 m1, __m64 m2)
If the respective 16-bit values in m1 are equal to the respective 16-bit values in m2, set the respective 16-bit resulting values to all ones; otherwise set them to a
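The semantics described above (NOT-then-AND, and element-wise compares that produce all-ones or all-zeros masks) can be modeled with plain 64-bit integers. The functions pandn_model and pcmpeqb_model below are hypothetical scalar models written for illustration; they do not use the MMX intrinsics and are not a substitute for them.

```cpp
#include <cstdint>

// Model of the "logical NOT on m1, then bitwise AND with m2" operation.
std::uint64_t pandn_model(std::uint64_t m1, std::uint64_t m2) {
    return ~m1 & m2;
}

// Model of a packed 8-bit equality compare: each byte of the result is
// 0xFF where the corresponding bytes of m1 and m2 match, and 0x00 otherwise.
std::uint64_t pcmpeqb_model(std::uint64_t m1, std::uint64_t m2) {
    std::uint64_t result = 0;
    for (int byte = 0; byte < 8; ++byte) {
        const int shift = byte * 8;
        const std::uint64_t a = (m1 >> shift) & 0xFFu;
        const std::uint64_t b = (m2 >> shift) & 0xFFu;
        if (a == b) {
            result |= (std::uint64_t{0xFF} << shift);
        }
    }
    return result;
}
```

A mask produced this way is usually combined with AND and AND-NOT operations to select between two sets of values without branching.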
432. ual Not Equal

__m128d _mm_cmpeq_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for equality.
r0 := (a0 == b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 == b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmplt_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a less than b.
r0 := (a0 < b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 < b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmple_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a less than or equal to b.
r0 := (a0 <= b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 <= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpgt_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a greater than b.
r0 := (a0 > b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 > b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpge_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for a greater than or equal to b.
r0 := (a0 >= b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 >= b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpord_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for ordered.
r0 := (a0 ord b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 ord b1) ? 0xffffffffffffffff : 0x0

__m128d _mm_cmpunord_pd(__m128d a, __m128d b)
Compares the two DP FP values of a and b for unordered.
r0 := (a0 unord b0) ? 0xffffffffffffffff : 0x0
r1 := (a1 unord b1)
433. ual mm ucomigt ss Greater Than mm ucomige ss Greater Than or Equal mm ucomineq ss Not Equal m128 mm cmpeq ss m128 a _ m128 Compare for equality rO a0 b0 Oxffffffff rl al r2 a2 r3 a3 m128 mm cmpeq ps m128 a _ m128 Compare for equality rO a0 b0 Oxffffffff rl al bl Oxffffffff r2 a2 b2 Oxffffffff r3 a3 b3 Oxffffffff Corresponding Instruction CMPNLEPS CMPORDSS CMPORDPS CMPUNORDSS CMPUNORDPS COMISS COMISS COMISS COMISS COMISS COMISS UCOMISS UCOMISS UCOMISS UCOMISS UCOMISS UCOMISS 0x0 0x0 0x0 0x0 227 Intel C Compiler for Linux Systems User s Guide m128 _mm_cmplt_ss __m128 a __m128 b Compare for less than rO a0 lt bO Oxffffffff 0x0 rl al r2 a2 r3 a3 m128 mm cmplt ps m128 a m128 b Compare for less than r0 a0 lt b0 Oxffffffff 0x0 rl al lt bl Oxffffffff 0x0 r2 a2 lt b2 Oxffffffff 0x0 r3 a3 lt b3 Oxffffffff 0x0 m128 mm cmple ss m128 a m128 b Compare for less than or equal rO a0 lt b0 Oxffffffff 0x0 rl al r2 a2 r3 a3 m128 mm cmple pat m128 a __m128 b Compare for less than or equal rO a0 lt b0 Oxffffffff 0x0 rl al lt bl Oxffffffff 0x0 r2 a2 lt b2 Oxffffffff 0x0 r3 a3 lt b3 Oxffffffff 0x0 m128 mm cmpgt ss m128 a m128 b C
434. uble _Complex z);
float _Complex cacoshf(float _Complex z);

CARG
Description: The carg function returns the value of the argument in the interval [-pi, pi].
Calling interface:
double carg(double _Complex z);
long double cargl(long double _Complex z);
float cargf(float _Complex z);

CASIN
Description: The casin function returns the complex inverse sine of z.
Calling interface:
double _Complex casin(double _Complex z);
long double _Complex casinl(long double _Complex z);
float _Complex casinf(float _Complex z);

CASINH
Description: The casinh function returns the complex inverse hyperbolic sine of z.
Calling interface:
double _Complex casinh(double _Complex z);
long double _Complex casinhl(long double _Complex z);
float _Complex casinhf(float _Complex z);

CATAN
Description: The catan function returns the complex inverse tangent of z.
Calling interface:
double _Complex catan(double _Complex z);
long double _Complex catanl(long double _Complex z);
float _Complex catanf(float _Complex z);

CATANH
Description: The catanh function returns the complex inverse hyperbolic tangent of z.
Calling interface:
double _Complex catanh(double _Complex z);
long double _Complex catanhl(long double _Complex z);
float _Complex catanhf(float _Complex z);

CCOS
Description: The ccos function returns the complex cosine of z.
Calling interface:
double _Complex ccos(double _Complex z);
long double _Complex ccosl(long
435. ue of x to the power y with single precision.
Returns the sine of x with double precision.
Returns the sine of x with single precision.
Returns the cosine of x with double precision.
Returns the cosine of x with single precision.
Returns the tangent of x with double precision.
Returns the tangent of x with single precision.
Returns the arccosine of x with double precision.
Returns the arccosine of x with single precision.

Intrinsic               Description
double acosh(double)    Compute the inverse hyperbolic cosine of the argument with double precision.
float acoshf(float)     Compute the inverse hyperbolic cosine of the argument with single precision.
double asin(double)     Compute arc sine of the argument with double precision.
float asinf(float)      Compute arc sine of the argument with single precision.
double asinh(double)    Compute inverse hyperbolic sine of the argument with double precision.
float asinhf(float)     Compute inverse hyperbolic sine of the argument with single precision.
double atan(double)     Compute arc tangent of the argument with double precision.
float atanf(float)      Compute arc tangent of the argument with single precision.
double atanh(double)
float atanhf(float)
float cabs(double)
double ceil(double)
float ceilf(float)
double cosh(double)
float coshf(float)
float fabsf(float)
436. u16vec8    unsigned int        16    8     dvec.h
I8vec16    unspecified char    8     16    dvec.h
Is8vec16   signed char         8     16    dvec.h
Iu8vec16   unsigned char       8     16    dvec.h

Most classes contain similar functionality for all data types and are represented by all available intrinsics. However, some capabilities do not translate from one data type to another without suffering from poor performance, and are therefore excluded from individual classes.

Note: Intrinsics that take immediate values and cannot be expressed easily in classes are not implemented. For example: _mm_shuffle_ps, _mm_shuffle_pi16, _mm_extract_pi16, _mm_insert_pi16.

Access to Classes Using Header Files

The required class header files are installed in the include directory with the Intel C++ Compiler. To enable the classes, use the #include directive in your program file as shown in the table that follows.

Include Directives for Enabling Classes

Instruction Set Extension        Include Directive
MMX Technology                   #include <ivec.h>
Streaming SIMD Extensions        #include <fvec.h>
Streaming SIMD Extensions 2      #include <dvec.h>

Each succeeding file from the top down includes the preceding class. You only need to include fvec.h if you want to use both the Ivec and Fvec classes. Similarly, to use all the classes including those for the Strea
437. utility intrinsic _mm_movemask_ps. If the inaccuracy is acceptable, the SIMD reciprocal and reciprocal square root approximation intrinsics (rcp and rsqrt) are much faster than the true div and sqrt intrinsics.

Macro Function for Shuffle Using Streaming SIMD Extensions

The Streaming SIMD Extensions provide a macro function to help create constants that describe shuffle operations. The macro takes four small integers (in the range of 0 to 3) and combines them into an 8-bit immediate value used by the SHUFPS instruction. See the example below.

Shuffle Function Macro

_MM_SHUFFLE(z, y, x, w) expands to the following value:
(z<<6) | (y<<4) | (x<<2) | w

You can view the four integers as selectors for choosing which two words from the first input operand and which two words from the second are to be put into the result word.

View of Original and Result Words with Shuffle Function Macro

_mm_shuffle_ps(m1, m2, _MM_SHUFFLE(1, 0, 3, 2))

Macro Functions to Read and Write the Control Registers

The following macro functions enable you to read and write bits to and from the control register. For details, see Set Operations. For Itanium-based systems, these macros do not allow you to access all of the bits of the FPSR. See the descriptions for the getfpsr and setfpsr intrinsics in the Native Intrinsics for Itanium Instructions topic. Exception
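The way the four selectors are packed into one 8-bit immediate, and then used to pick elements, can be modeled without any SIMD hardware. In the sketch below, shuffle_imm and shuffle_model are hypothetical names: the first mirrors the macro's bit packing as a constexpr function, and the second is a scalar model that assumes the usual SHUFPS layout, in which the two low result elements are selected from the first operand and the two high result elements from the second.

```cpp
#include <array>

// Hypothetical constexpr equivalent of the macro's bit packing.
constexpr unsigned shuffle_imm(unsigned z, unsigned y, unsigned x, unsigned w) {
    return (z << 6) | (y << 4) | (x << 2) | w;
}

// Scalar model of a SHUFPS-style shuffle: bits 1:0 and 3:2 of imm8 select
// from a, bits 5:4 and 7:6 select from b.
std::array<float, 4> shuffle_model(const std::array<float, 4>& a,
                                   const std::array<float, 4>& b,
                                   unsigned imm8) {
    std::array<float, 4> r{};
    r[0] = a[imm8 & 3u];
    r[1] = a[(imm8 >> 2) & 3u];
    r[2] = b[(imm8 >> 4) & 3u];
    r[3] = b[(imm8 >> 6) & 3u];
    return r;
}
```

With the selectors (1, 0, 3, 2) from the example above, the packed immediate is 0x4E, and the model returns elements 2 and 3 of the first operand followed by elements 0 and 1 of the second.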
438. vec4 A I32vec4 B Is32vec4 unpack_high Is32vec4 A Is32vec4 B Iu32vec4 unpack high Iu32vec4 A Iu32vec4 B RO Al Rl Bl R2 A2 R3 B2 Corresponding intrinsic mm unpackhi epi32 356 Intel C Intrinsics Reference Interleave the 32 bit value from the high half of A with the 32 bit value from the high half of B I32vec2 unpack high I32vec2 A I32vec2 B Is32vec2 unpack high Is32vec2 A Is32vec2 B Iu32vec2 unpack high Iu32vec2 A Iu32vec2 B RO Al Rl Bl Corresponding intrinsic mm unpackhi pi32 Interleave the four 16 bit values from the high half of A with the two 16 bit values from the high half of B Il6vec8 unpack high Il6vec8 A I16vec8 B Isl6vec8 unpack high Isl6vec8 A Isl6vec8 B Iul6vec8 unpack high Iul6vec8 A Iul6vec8 B RO A2 R1 B2 R2 A3 R3 B3 Corresponding intrinsic mm unpackhi epil6 Interleave the two 16 bit values from the high half of A with the two 16 bit values from the high half of B Il6vec4 unpack high Il6vec4 A I16vec4 B Isl6vec4 unpack_high Isl6vec4 A Isl6vec4 B Iul6vec4 unpack high Iul6vec4 A Iul6vec4 B RO R2 A2 R1 A3 R3 B2 B3 Corresponding intrinsic mm unpackhi pil6 Interleave the four 8 bit values from the high half of A with the four 8 bit values from the high half of B I8vec8 unpack high I8vec8 A I8vec8 B Is8vec8 unpack high Is8vec8 A I8vec8 B Iu8vec8 unpack high Iu8vec8 A I8vec8 B RO A4
439. w ubs pil6 F __m64 Corresponding Instruction PADDB PADDW PADDD PADDSB PADDSW PADDUSB PADDUSW U We U ca U We BUSB PSUBUSW PMADDWD PMUL PMUL m2 m2 Multiplication 4 16 4 16 high Multiplication a 4 16 4 16 low

Add the eight 8-bit values in m1 to the eight 8-bit values in m2.

Add the four 16-bit values in m1 to the four 16-bit values in m2.

__m64 _m_paddd(__m64 m1, __m64 m2)
Add the two 32-bit values in m1 to the two 32-bit values in m2.

__m64 _m_paddsb(__m64 m1, __m64 m2)
Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using saturating arithmetic.

__m64 _m_paddsw(__m64 m1, __m64 m2)
Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using saturating arithmetic.

__m64 _m_paddusb(__m64 m1, __m64 m2)
Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2 using saturating arithmetic.

__m64 _m_paddusw(__m64 m1, __m64 m2)
Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2 using saturating arithmetic.

__m64 _m_psubb(__m64 m1, __m64 m2)
Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1.

__m64 _m_psubw(__m64 m1, __m64 m2)
Subtract the four 16-bit values in m2 from the four 16-bit values in m1.

__m64 _m_psubd(__m64 m1, __
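"Saturating arithmetic" in the descriptions above means that a result which would overflow is clamped to the largest or smallest representable value instead of wrapping around. The scalar sketch below models this for a single 8-bit element; sat_add_i8 and sat_add_u8 are hypothetical helper names and operate on one value at a time rather than on packed data.

```cpp
#include <cstdint>

// Signed 8-bit saturating add: results are clamped to [-128, 127].
std::int8_t sat_add_i8(std::int8_t a, std::int8_t b) {
    const int sum = static_cast<int>(a) + static_cast<int>(b);
    if (sum > 127) return 127;
    if (sum < -128) return -128;
    return static_cast<std::int8_t>(sum);
}

// Unsigned 8-bit saturating add: results are clamped to [0, 255].
std::uint8_t sat_add_u8(std::uint8_t a, std::uint8_t b) {
    const unsigned sum = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return sum > 255u ? std::uint8_t{255} : static_cast<std::uint8_t>(sum);
}
```

A wrapping add of 100 and 100 in a signed 8-bit element would give -56; the saturating version gives 127.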
440. ws:
cout << F64vec2 A;    "[1]:A1 [0]:A0"
Corresponding intrinsics: none

The four single-precision floating-point values of A are placed in the output buffer and printed in decimal format as follows:
cout << F32vec4 A;    "[3]:A3 [2]:A2 [1]:A1 [0]:A0"
Corresponding intrinsics: none

The lowest single-precision floating-point value of A is placed in the output buffer and printed.
cout << F32vec1 A;
Corresponding intrinsics: none

Element Access Operations

double d = F64vec2 A[int i]
Read one of the two double-precision floating-point values of A without modifying the corresponding floating-point value. Permitted values of i are 0 and 1. For example:
double d = F64vec2 A[1];
If DEBUG is enabled and i is not one of the permitted values (0 or 1), a diagnostic message is printed and the program aborts.
Corresponding intrinsics: none

float f = F32vec4 A[int i]
Read one of the four single-precision floating-point values of A without modifying the corresponding floating-point value. Permitted values of i are 0, 1, 2, and 3. For example:
float f = F32vec4 A[2];
If DEBUG is enabled and i is not one of the permitted values (0-3), a diagnostic message is printed and the program aborts.
Corresponding intrinsics: none

Element Assignment Operations

F64vec2 A[int i] = double d
Modify one of the two double-precision floating-point values of A. Permitted values of int i are 0 and 1. For example:
F32vec4 A[1] = double
441. ws automatic and explicit sign and size typecasting. "Explicit" means that it is illegal to mix different types without an explicit typecasting. "Automatic" means that you can mix types freely and the compiler will do the typecasting for you.

Summary of Rules: Major Operators

Operators                  Sign Typecasting   Size Typecasting      Other Typecasting Requirements
Assignment                 N/A                N/A                   N/A
Logical                    Automatic          Automatic (to left)   Explicit typecasting is required for different types used in non-logical expressions on the right side of the assignment.
Addition and Subtraction   Automatic          Explicit              N/A
Multiplication             Automatic          Explicit              N/A
Shift                      Automatic          Explicit              Casting required to ensure arithmetic shift.
Compare                    Automatic          Explicit              Explicit casting is required for signed classes for the less-than or greater-than operations.
Conditional Select         Automatic          Explicit              Explicit casting is required for signed classes for less-than or greater-than operations.

Data Declaration and Initialization

The following table shows literal examples of constructor declarations and data type initialization for all class sizes. All values are initialized with the most significant element on the left and the least significant to the right.

Declaration and Initialization Data Types for Ivec Classes

Operation     Class   Syntax
Declaration   M128    I128vec1 A; Iu8vec16 A;
Declaration   M64     I64vec1 A; Iu8vec16 A
442. x, double y);
int islessgreaterl(long double x, long double y);
int islessgreaterf(float x, float y);

ISNAN
Description: The isnan function returns a non-zero value if and only if x has a NaN value.
Calling interface:
int isnan(double x);
int isnanl(long double x);
int isnanf(float x);

ISNORMAL
Description: The isnormal function returns a non-zero value if and only if x is normal.
Calling interface:
int isnormal(double x);
int isnormall(long double x);
int isnormalf(float x);

ISUNORDERED
Description: The isunordered function returns 1 if either x or y is a NaN. This function does not raise the invalid floating-point exception.
Calling interface:
int isunordered(double x, double y);
int isunorderedl(long double x, long double y);
int isunorderedf(float x, float y);

NEXTAFTER
Description: The nextafter function returns the next representable value in the specified format after x in the direction of y. errno: ERANGE, for values too large.
Calling interface:
double nextafter(double x, double y);
long double nextafterl(long double x, long double y);
float nextafterf(float x, float y);

NEXTTOWARD
Description: The nexttoward function returns the next representable value in the specified format after x in the direction of y. If x equals y, then the function returns y converted to the type of the function. errno: ERANGE, for values too large.
Calling interface:
double nexttoward
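The classification and "next representable value" behavior described above can be tried with the standard C++ <cmath> counterparts, std::isunordered and std::nextafter. The sketch below uses those standard functions, not the Intel Math Library entry points listed in the excerpt, and the wrapper names either_is_nan and next_up are hypothetical.

```cpp
#include <cmath>
#include <limits>

// Hypothetical wrapper: true when x and y cannot be ordered, that is,
// when at least one of them is a NaN.
bool either_is_nan(double x, double y) {
    return std::isunordered(x, y);
}

// Hypothetical wrapper: the representable double immediately above x,
// obtained by stepping from x in the direction of +infinity.
double next_up(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity());
}
```

For x = 1.0, the next value up differs from 1.0 by exactly the machine epsilon of the double format.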
443. xa.a, libcxa.so, libcxa.so.3, libcxaguard.a, libcxaguard.so, libcxaguard.so.3

Description
• For OpenMP implementation
• OpenMP static library for the parallelizer tool with performance statistics and profile information
• Library that resolves references to OpenMP subroutines when OpenMP is not in use
• Short vector math library
• Intel support library for PGO and CPU dispatch
• Multi-thread version of libirc.a
• Intel math library
• Intel math library
• Dinkumware C++ Library
• Unwinder library
• Intel run-time support for C++ features
• Used for interoperability support with the -cxxlib-gcc option. See gcc Interoperability.

Key Files Summary for Itanium Compiler

The following tables list and briefly describe files that are installed for use by the Itanium compiler.

bin Files

File          Description
codecov       Code coverage tool
iccvars.sh    Batch file to set environment variables
icc.cfg       Configuration file for use from command line
icc, icpc     Intel C++ Compiler
profmerge     Utility used for Profile Guided Optimizations
proforder     Utility used for Profile Guided Optimizations
tselect       Test prioritization tool
xiar          Tool used for Interprocedural Optimizations
xild          Tool used for Interprocedural Optimizations

include Files

File          Description
emmintrin.h   Principal header file for SSE2 intrinsics
float.h       IEEE 754 version of standard float.h
fvec.h
444. y count, while shifting in the sign bit. For the best performance, count should be a constant.

__m64 _m_psrad(__m64 m, __m64 count)
Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit.

__m64 _m_psradi(__m64 m, int count)
Shift two 32-bit values in m right the amount specified by count while shifting in the sign bit. For the best performance, count should be a constant.

__m64 _m_psrlw(__m64 m, __m64 count)
Shift four 16-bit values in m right the amount specified by count while shifting in zeros.

__m64 _m_psrlwi(__m64 m, int count)
Shift four 16-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_psrld(__m64 m, __m64 count)
Shift two 32-bit values in m right the amount specified by count while shifting in zeros.

__m64 _m_psrldi(__m64 m, int count)
Shift two 32-bit values in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

__m64 _m_psrlq(__m64 m, __m64 count)
Shift the 64-bit value in m right the amount specified by count while shifting in zeros.

__m64 _m_psrlqi(__m64 m, int count)
Shift the 64-bit value in m right the amount specified by count while shifting in zeros. For the best performance, count should be a constant.

MMX Technology Logical Intrinsics

The proto
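The difference between "shifting in the sign bit" (arithmetic shift) and "shifting in zeros" (logical shift) is easy to see on a single 32-bit element. The helpers sra32 and srl32 below are hypothetical scalar models. Their handling of counts above 31 (all sign bits for the arithmetic shift, zero for the logical shift) is an assumption based on the documented behavior of the PSRAD and PSRLD instructions and is not stated in the excerpt above.

```cpp
#include <cstdint>

// Logical right shift: vacated high bits are filled with zeros.
// Assumed behavior for counts above 31: the result is zero.
std::uint32_t srl32(std::uint32_t v, unsigned count) {
    return count > 31 ? 0u : (v >> count);
}

// Arithmetic right shift: vacated high bits are filled with the sign bit.
// Written with unsigned arithmetic so that it does not rely on how the
// language defines >> for negative signed values.
// Assumed behavior for counts above 31: every bit becomes the sign bit.
std::int32_t sra32(std::int32_t v, unsigned count) {
    if (count > 31) count = 31;
    const std::uint32_t u = static_cast<std::uint32_t>(v);
    const std::uint32_t shifted = (v < 0) ? ~((~u) >> count) : (u >> count);
    return static_cast<std::int32_t>(shifted);
}
```

For a negative input such as -8, the arithmetic shift by one yields -4, while the logical shift of the same bit pattern yields a large positive number.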
445. y function 190 e EE 150 floor library function sssssesss 187 flushing denormal results 54 fma library function 190 fmax library function 190 fmin library function 190 fminshared option 11 fmod library function 189 fno alias option 11 fno common option 11 fno fnalias option sssssssseeee 11 fno rtti option ssssessessssseeeesseserseesersersteresseseeses 11 fnsplit option 11 100 fp option 11 80 fp port option 11 80 81 PIC Options errore Rn 11 fpstkchk option 11 81 SREL 11 frexp library function 180 fshort enums option sssee 11 fsource asm option 11 fsyntax only option 11 ftz option seen cede 11 83 function splitting sse 100 funsigned bitfields option 11 funsigned char option 11 fvisibility option 11 fvisibility default option 11 fvisibility extern option 11 fvisibility hidden option 11 fvisibility internal option 11 fvisibility protected option 11 O OPUlON oco ac tree iR Dd iren 11 gamma library function sess 184 gamma r library function sseeeeeeeeeeee 184 gcc interoperability with sess 71 gcc function attributes
446. yntax of these extended attributes is as follows:

cpu_specific(cpuid)
cpu_dispatch(cpuid-list)

The values for cpuid and cpuid-list are shown in the tables below.

Processor                                                  Values for cpuid
x86 processors not provided by Intel Corporation           generic
Intel Pentium processors                                   pentium
Intel Pentium processors with MMX Technology               pentium_mmx
Intel Pentium Pro processors                               pentium_pro
Intel Pentium II processors                                pentium_ii
Intel Pentium III processors                               pentium_iii
Intel Pentium III (exclude xmm registers)                  pentium_iii_no_xmm_regs
Intel Pentium 4 processors                                 pentium_4
Intel Pentium M processors                                 pentium_m
Intel Pentium 4 processor with Streaming SIMD
Extensions 3 (SSE3)                                        future_cpu_10

Values for cpuid-list
cpuid-list, cpuid

The attributes are not case sensitive. The body of a function declared with __declspec(cpu_dispatch) must be empty, and is referred to as a stub (an empty-bodied function).

Use the following guidelines to implement automatic processor dispatch support:

1. Stub for cpu_dispatch must have a cpuid defined in cpu_specific elsewhere. If the cpu_dispatch stub for a function f contains the cpuid p, then a cpu_specific definition of f with cpuid p must appear somewhere in the program; otherwise, an unresolved external error is reported. A cpu_specific function definition need not appear in the same translation unit as the correspondi
447. ype Complex. This can cause some performance improvements in programs that use Complex arithmetic, but values at the extremes of the exponent range may not compute correctly. Default is complex limited range

Option / Description / Default

-create_pch filename
Manual creation of precompiled header (filename.pchi). Default: OFF

-cxxlib-gcc
Link using C++ run-time libraries provided with gcc (requires gcc 3.2 or above). Default: OFF

-cxxlib-icc
Link using C++ run-time libraries provided by Intel. Default: ON

…
Output macro definitions in effect after preprocessing (use with -E). Default: OFF

-Dname[=value]
Defines a macro name and associates it with the specified value. Equivalent to a #define preprocessor directive. Default: OFF

-dryrun
Show driver tool commands but do not execute tools. Default: OFF

-dynamic-linkerfilename
Selects a dynamic linker (filename) other than the default. Default: OFF

…
Stops the compilation process after the C or C++ source files have been preprocessed, and writes the results to stdout. Default: OFF

…
Preprocess to stdout omitting #line directives. Default: OFF

-f[no]verbose-asm
Produce assemblable file with compiler comments. Default: ON

-falias
Assume aliasing in program. Default: ON

…
Maximize speed across the entire program. Turns on -O3, -ipo, and -static. Default: OFF

-fcode-asm
Produce assemblable file with optional code annotations. Requires -S. Default: OFF

-ffnalias
Assu
448. ze of the final program
- Inline a function if it is declared with the inline or __inline keywords.

Profile-guided Optimizations

Profile-guided optimizations (PGO) tell the compiler which areas of an application are most frequently executed. By knowing these areas, the compiler is able to use feedback from a previous compilation to be more selective in optimizing the application. For example, the use of PGO often enables the compiler to make better decisions about function inlining, thereby increasing the effectiveness of interprocedural optimizations.

Instrumented Program

Profile-guided optimization creates an instrumented program from your source code and special code from the compiler. Each time this instrumented code is executed, the instrumented program generates a dynamic information file. When you compile a second time, the dynamic information files are merged into a summary file. Using the profile information in this file, the compiler attempts to optimize the execution of the most heavily travelled paths in the program.

Unlike other optimizations, such as those used strictly for size or speed, the results of IPO and PGO vary. This is due to each program having a different profile and different opportunities for optimizations. The guidelines provided here help you determine if you can benefit by using IPO and PGO.

Profile-guided Optimizations Methodology

PGO works best for code with many frequently
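
The MMX shift intrinsics listed in excerpt 444 differ mainly in what is shifted in from the left: the psrad family replicates the sign bit, while the psrl family shifts in zeros. As a rough illustration, the sketch below models that difference on two 32-bit lanes in portable C++. It does not call the MMX intrinsics themselves. The type aliases and helper names (Lanes32, ULanes32, shift_right_arithmetic, shift_right_logical, shift_examples_hold) are hypothetical, and the handling of counts above 31 is an assumption about the underlying instructions that the excerpt does not state.

```cpp
#include <array>
#include <cstdint>

using Lanes32  = std::array<std::int32_t, 2>;   // two signed 32-bit lanes
using ULanes32 = std::array<std::uint32_t, 2>;  // two unsigned 32-bit lanes

// Hypothetical scalar model of an arithmetic right shift on each lane:
// vacated high bits are filled with copies of the sign bit.
Lanes32 shift_right_arithmetic(Lanes32 v, unsigned count) {
    for (std::int32_t& lane : v) {
        if (count > 31) {
            // Assumption: an oversized count leaves only sign bits.
            lane = (lane < 0) ? -1 : 0;
        } else if (lane < 0) {
            // Complement, shift the non-negative value, complement back.
            // This avoids relying on how >> treats negative operands.
            lane = ~(~lane >> count);
        } else {
            lane >>= count;
        }
    }
    return v;
}

// Hypothetical scalar model of a logical right shift on each lane:
// vacated high bits are filled with zeros.
ULanes32 shift_right_logical(ULanes32 v, unsigned count) {
    for (std::uint32_t& lane : v) {
        // Assumption: an oversized count clears the lane.
        lane = (count > 31) ? 0u : (lane >> count);
    }
    return v;
}

// Small self-check: the same bit pattern shifted both ways.
bool shift_examples_hold() {
    const Lanes32  a = shift_right_arithmetic(Lanes32{-8, 16}, 1);
    const ULanes32 l = shift_right_logical(ULanes32{0xFFFFFFF8u, 16u}, 1);
    return a[0] == -4 && a[1] == 8 && l[0] == 0x7FFFFFFCu && l[1] == 8u;
}
```

The value -8 and the unsigned pattern 0xFFFFFFF8 share the same bits, so the two results show the only difference between the two families: the top bit stays set in the arithmetic case and is cleared in the logical case.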

