Home

Read the User`s Guide - BeBOP (Berkeley Benchmarking and

image

Contents

1. 5 3 1 Applying the transpose of a matrix We follow the BLAS convention of allowing the user to apply the transpose or for complex data the transpose or Hermitian transpose See Table 6 for a list of transpose options provided by the scalar type oski_matop_t The notation op A indicates that any of A A or A may be applied i e op A A AT AH The high level kernel oski MatTransMatMult has inherently more matrix reuse op portunities This kernel allows the user to apply any of the four matrix operations listed in Table 7 given a matrix A AAT AT A AA and AP A 5 3 2 Aliasing The interface guarantees correct results only if multivector view object input arguments do not alias any multivector view object output arguments i e if input and output views do not view the same user data If such aliasing occurs the results are not defined OP AT A Apply AT A OP AH A Apply AH A OP A AT Apply AAT OP A AH Apply AA Table 7 Matrix transpose times matrix options type oski_ataop_t Constants that allow a user to apply AT A AH A AAT or AA in calls to the routine oski MatTransMatMult These options are called op2 A in Table 5 5 4 Tuning 22 5 3 3 Scalars vs 1x1 matrix objects An object of type oski matrix t created with dimensions 1 x 1 is riot treated as a scalar by the kernel routines Therefore such an object may only be applied to a single vector and not a n x k multivector o
2. oski SetHintMatMult and MatTransMult oski SetHintMatPowMult oski SetHint Specify hints about the non zero structure that may be relevant to tuning For a list of available hints see Table 9 on the following page oski TuneMat Tune the matrix data structure using all hints and implicit workload data accumulated so far Table 8 Tuning primitives The user tunes a matrix object by first specifying work load and structural hints followed by an explicit call to the tuning routine oski TuneMat Bindings appear in Appendix B 4 on page 52 so desires to determine if reordering has taken place and access P PT and A directly to reduce the number of permutations 5 4 1 Providing workload hints explicitly Each of the kernels listed in Table 5 on page 21 has a corresponding workload hint routine The user calls these routines to tell the library which kernels she will call and with what arguments for a given matrix object and the expected frequency of such calls The routines for specifying workload hints Table 8 all have an argument signature of the form oski SetHint lt kERNEL gt A tunable lt KERNEL_PARAMS gt num calls where num calls is an integer This hint tells the library that we will call the specified KERNEL on the object A tunable with the arguments lt KERNEL_PARAMS gt and that we expect to make num calls such calls Instead of specifying an estimate of the number of
3. D OSKI Library Integration Notes The following subsections discuss integration of OSKI with specific higher level libraries and applications D 1 Sparse BLAS The recent revision of the BLAS standard defines an interface for sparse matrix kernels 6 3 Indeed the very first OSKI design proposed to add just a single routine to the Sparse BLAS for explicit tuning analogous to oski TuneMat and a few additional routines to provide the new cache friendly kernels like sparse ATA x and A x 22 Chapter 8 Although the current OSKI design departs from the Sparse BLAS in several ways the similarities make OSKI a suitable basis for building a Sparse BLAS implementation Below we describe the main differences and outline what a Sparse BLAS implementation based on OSKI might look like The primary ways in which OSKI differs from the Sparse BLAS interface are as follows e Explicit tuning In OSKI the user must call a tuning routine explicitly thereby ex posing the point during program execution at which the non trivial cost of tuning may occur In the Sparse BLAS any such tuning happens implicitly e Workload hints for tuning Both the Sparse BLAS and OSKI define allow the user to provide hints about matrix structure for example dense rectangular block substruc ture OSKI also allows workload hints such as the frequency and mix of which kernels will be called so that an OSKI implementation can perform a cost benefit analysis to deci
4. 1 0 35 oski_vecview_t x_view oski_CreateVecView x 3 STRIDE_UNIT oski_MatTrisolve A tunable OP NORMAL 1 0 x view Should print the solution x 0 1 0 2 0 3 printf x Sf f f n x 0 x 1 x 2 int oski MatTransMatMult const oski matrix t A tunable oski_ataop_t opA oski value t alpha const oski_vecview_t x view oski value t beta oski vecview t y view oski vecview t t view Computes y a op A x 8 y where op A AAT AT A AA AP A Also optionally computes t A x if op 4 ATA AP A t AT zif op A AAT or t AP z if op A AA at the caller s request Parameters A tunable input A tunable is valid An object for a matrix A opA input See Table 7 on page 21 Specifies op A alpha beta input The scalar constants a p respectively x_view input x_view is valid View object for a multi vector z y view input output y view is valid View object for a multi vector y t view output t view may be valid or INVALID MAT An optional view object for a multi vector t Actions and Returns Returns an error code and leaves y and t if specified unchanged on error Otherwise returns 0 and computes y a op A x B y On a O return also computes t if t view is specified and not equal to INVALID MAT Error conditions and actions Possible error conditions include unsatisfied argument preconditions
5. B y and z w op A z C z where op A A AT AH Parameters A_tunable input output A tunable is valid An object for a matrix A alpha beta omega zeta input The scalar constants a 3 w C respectively B 4 Tuning 55 opA input See Table 6 on page 21 Specifies op A x view y view w_view z_view input Vectors are valid or symbolic see Table 11 on page 25 View objects for multi vectors x y w and z respectively num calls input num calls is non negative or symbolic see Table 10 on page 25 The number of times this kernel will be called with these arguments Actions and Returns If A z and y have compatible dimensions and if op A w and z have compatible dimensions then this routine registers the workload hint with A tunable and returns 0 Otherwise returns an error code Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD ARG ERR BAD MAT ERR BAD VECVIEW and incompatible operand dimensions ERR DIM MISMATCH int oski SetHintMatPowMult oski matrix t A tunable oski matop t opA int power oski value t alpha const oski vecview t x view oski value t beta const oski vecview t y view const oski vecview t T view int num calls Workload hint for the kernel operation oski MatPowMult which computes a power of a matrix times a vector or y a op A z 8 y Also optionally computes T t t 1
6. ERR BAD ARG ERR BAD MAT ERR BAD VECVIEW and incompatible operand dimensions ERR DIM MISMATCH Example Let A tunable be an object corresponding to the sparse lower triangular matrix shown in Figure 1 on page 9 and let x7 1 2 3 The following code computes B 3 Kernels 50 Tite A z and y AT A x Set up vectors double x 1 2 3 oski_vecview_t x view oski_CreateVecView x 3 STRIDE UNIT doublet 1 21 1 y oski vecview t t view oski CreateVecView t 3 STRIDE UNIT double y 1 1 1 y oski vecview t y view oski_CreateVecView y 3 STRIDE_UNIT Execute kernel t A z y ATA x oski MatTransMatMult A tunable OP AT A 1 x view 0 y view t view Print results Should display a t 0 1 0 0 35 cy y 0 275 0 0 35 printf t Sf Sf n t 0 t 1 t 2 printf y Sf An yl0 y 1 yI2 X int oski MatMultAndMatTransMult const oski matrix t A tunable oski value t alpha const oski_vecview_t x view oski value t beta oski vecview t y view oski matop t opA oski value t omega const oski vecview t w view oski value t zeta oski vecview t z view Computes y a A z 8 y and z w op A z C z where op A A AT AP Y Parameters A tunable input A tunable is valid An object for a matrix A alpha beta omega zeta input The scalar
7. OSKI constants are shown red Furthermore this example assumes you have built the default int double version of the library 3 1 Initializing OSKI To initialize the library your application should include oski oski n line 7 and call oski Init line 27 For symmetry you can call oski Close line 44 at the end of your program to shut down the library although this step is optional 3 2 Creating a matrix The sparse matrix in Listing 1 on the next page is a 3 x 3 lower triangular matrix with all ones on the diagonal The input matrix here declared statically for simplicity in lines 15 17 is stored in a compressed sparse row CSR format format using 2 integer arrays Aptr and Aind to represent the non zero pattern and one array of doubles Aval to store the non zero values The diagonal is not stored explicitly This representation is a standard way of implementing CSR format in various sparse libraries 17 16 1 This particular example assumes the convention of 0 based indices and does not store the diagonal ex plicitly Lines 19 20 declare and initialize two arrays x and y to represent the vectors Again these declarations are standard implementations in that the user could call the dense BLAS on these arrays to perform for instance dot products or scalar times vector products axpy operations in the BLAS terminology We create a tunable matrix object A tunable from the input matrix by a call to oski CreateMatCSR
8. PermutedMatColPerm and to apply these permutations to vector view objects oski _PermuteVecView Given the user s matrix A suppose that tuning produces the representation A PT A P where P and P are permutation matrices and multiplying by A is much faster than multiplying by A This form corresponds to reordering the rows and columns of A by pre and post multiplying A by P and PZ to produce an optimized matrix A To compute y A x correctly while maintaining its interface and taking advantage of fast multiplication by A the kernel oski MatMult must instead compute P y P 2 Carrying out this computation in place requires permuting x and y on entry and then again on return If tuning produces such a permuted matrix all kernels perform these permutations as necessary Since the user may be able to reduce the permutation cost by permuting only once be fore executing her algorithm and perhaps again after her algorithm completes we provide several routines to extract view objects of P Pe and A These objects are views of the inter nal tuned representation of A Therefore they are valid for the lifetime of the matrix object that represents A they do not have to be deallocated explicitly by the user Moreover if A is re tuned these representations will be updated automatically We provide an additional type for permutations oski perm t and the routines listed in Table 13 An object of type oski perm t may
9. Then A i j rex Aval k That is repeated elements are summed Packed 3 array compressed sparse row using 1 based indices The user provides 3 arrays Aptr Aind and Aval corresponding to A These arrays satisfy the following conditions 1 Aptr is of length at least Aptr m 1 Aptr 0 gt 1 and for all 0 i lt m Aptr i Aptr i 1 2 Aind is of length at least Aptr m Each element of Aind lies between 1 and n inclusive 3 Avalis of length at least Aptr m A matrix element A i j is computed from this representation as follows Let K be the set k Aptr i 1 k lt Aptr i and Aind k j Then A i j peg Aval K Repeated elements are summed Packed 3 array compressed sparse column using 0 based indices The user pro vides 3 arrays Aptr Aind and Aval corresponding to A These arrays satisfy the following conditions 1 Aptr is of length at least Aptr n 1 Aptr 0 gt 0 and for all 0 lt j lt n Aptr j Aptr j 1 2 Aind is of length at least Aptr n Each element of Aind lies between 0 and m 1 inclusive 3 Avalis of length at least Aptr n A matrix element A i j is computed from this representation as follows Let Kj be the set k Aptr j 1 k lt Aptr j and Aind k i 1 Then A i j rex Aval k Repeated elements are summed Packed 3 array compressed sparse column using 1 based indices The user pro vides 3 arrays Aptr Aind and Aval
10. Wn T 3 T 4 TI5 printf y A 3 x 3f bf ee L yO yl21 B 4 Tuning int oski SetHintMatMult oski matrix t A tunable oski matop t opA oski value t alpha const oski vecview t x view oski value t beta const oski vecview t y view int num calls Workload hint for the kernel operation oski MatMult which computes y a op A x B y where op A A AT AH Parameters A tunable input output A tunable is valid An object for a matrix A B 4 Tuning 53 opA input See Table 6 on page 21 Specifies op A alpha beta input Scalar constants a 8 respectively x view y view input Vectors are valid or symbolic see Table 11 on page 25 View object for a multi vector x and y respectively num calls input num calls is non negative or symbolic see Table 10 on page 25 The number of times this kernel will be called with these arguments Actions and Returns Registers the workload hint with A tunable and returns 0 only if the dimensions of op A x and y are compatible Otherwise returns an error code Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD ARG ERR BAD MAT ERR BAD_VECVIEW and incompatible operand dimensions ERR DIM MISMATCH Example See Listing 3 on page 13 int oski SetHintMatTrisolve oski matrix t T tunable oski matop t opT oski value t alpha const oski vecview t x view in
11. lines 28 31 with the following arguments 1 Arguments 1 3 specify the CSR arrays line 28 2 Arguments 4 5 specify the matrix dimensions line 28 Fortran interfaces will be available soon 21 26 31 36 41 46 3 2 Creating a matrix 9 Listing 1 Using OSKI A first example This example illustrates basic object creation and kernel execution in OSKI Here we perform one sparse matrix vector multiply for a lower triangular matrix A with all ones on the diagonal as shown in the leading comment This example computes y a A x p y where 1 00 25 1 x A 2 1 0 z 45 and yis initially 1 5 0 1 65 1 Ais a sparse lower triangular matrix with a unit diagonal and zx y are dense vectors x include lt stdio h gt include lt oski oski h gt Get OSKI bindings define DIM 3 x matrix dimension define NUM STORED NZ 2 x number of stored non zero values int main Sparse matrix A in compressed sparse row CSR format int Aptr DIM 1 0 0 1 2 E int Aind NUM STORED NZ 0 0 y double Aval NUM_STORED_NZ 2 0 5 Y Dense vectors x y and scalar multipliers a B double x DIM 25 45 65 double y DIM 1 1 1 1 y double alpha 1 beta 1 OSKI matrix vector objects x oski_matrix_t A_tunable oski_vecview_t x_view y_view Initialize OSKI and create matrix oski_Init A tunable oski_CreateMatCSR
12. 4 3 on page 25 The user must explicitly call the tune rou tine oski TuneMat to initiate tuning Conceptually this routine marks the point in program execution at which the library may spend time changing the data structure The tune routine uses any hints about the non zero structure or workload whether they are specified explicitly by the user via calls to the above tuning primitives or they are gathered implicitly during any kernel calls made during the lifetime of the matrix object Section 4 on page 12 illustrates the common ways in which we expect users to use the interface to tune The library may optimize kernel performance by permuting the rows and columns of the matrix to reduce the bandwidth 13 19 4 10 5 or to create dense block structure 15 That is the library may compute a tuned matrix representation A P A P for the user s matrix A where P and P are permutation matrices However this optimization requires each kernel to permute its vectors on entry and exit to maintain the correctness of the interfaces Section 5 4 4 on page 25 discusses functionality that allows the user if she The interface also permits an implementation of this interface to generate code or perform other instruction level tuning at run time as well 5 4 Tuning 23 oski SetHintMatMult Workload hints specify the expected oski SetHintMatTrisolve options and frequency of the oski SetHintMatTransMatMult corresponding kernel call
13. 9 on page 24 User specified structural hint This hint may be followed by optional arguments as listed and typed in Table 9 on page 24 Actions and Returns Returns 0 if the hint is recognized and A_tunable is valid or an error code otherwise Error conditions and actions Possible error conditions include an invalid matrix object ERR BAD MAT or specifying a hint with the wrong number of hint arguments ERR BAD HINT ARC Example See Listing 3 on page 13 int oski TuneMat oski matrix t A tunable Tune the matrix object using all hints and implicit profiling data Parameters A tunable input output A tunable is valid Matrix object to tune Actions and Returns Returns a non negative status code whose possible values are defined by the constants listed in Table 12 on page 26 or an error code otherwise Error conditions and actions Possible error conditions include providing an invalid matrix ERR BAD MATT Example See Listing 3 on page 13 and Listing 4 on page 15 B 5 Permutations int oski IsMatPermuted const oski matrix t A tunable Checks whether a matrix has been tuned by reordering Parameters B 5 Permutations 57 A tunable input A tunable is valid A matrix object corresponding to some matrix A Actions and Returns Returns 1 if A tunable has been tuned by reordering That is if tuning produced a representation JA PT AP where either P or P is not equal to the iden
14. A tunable printf Matrix transformations An sin end n xforms Saveto a file fprintf fp saved xforms s n xforms fclose fp saved xforms free xforms eo oski_GetMatTransforms Retrieve a string representation of the tuning transformations that have been applied to a given matrix oski_ApplyMatTransforms Apply tuning transformations to a given ma trix Table 14 Saving and restoring tuning transformations The interface defines a basic fa cility to allow users to view the tuning transformations that have been applied to matrix and later re apply those or other transformations to another matrix Bindings appear in Appendix B 6 on page 58 5 5 Saving and restoring tuning transformations The interface defines basic facilities that allow users to view the tuning transformations which have been applied to a matrix to edit these transformations and to re apply them Table 14 To get a string describing how the input matrix data structure was transformed during tuning the user calls oski_GetMatTransforms This routine returns a newly al located string containing the transformations description To set the data structure i e to apply some set of transformations to the input matrix data structure the user calls oski_ApplyMatTransforms Such a call is equivalent to calling oski TuneMat except that instead of allowing the library to decide what data structure t
15. Aptr Aind Aval 3 3 CSR arrays SHAREINPUTMAT copy mode remaining args specify how to interpret non zero pattern 3 INDEX ZERO BASED MAT TRI LOWER MAT UNIT DIAG IMPLICIT Create wrappers around the dense vectors x x view oski CreateVecView x 3 STRIDE UNIT y view oski CreateVecView y 3 STRIDE UNIT Perform matrix vector multiply y a A x B y oski MatMult A tunable OP NORMAL alpha x view beta y view Clean up interface objects and shut down OSKI oski DestroyMat A tunable oski Destroy VecView x view oski Destroy VecView y view oski Close Print result y Should be 75 1 05 225 J printf Answer y Sf bf f JAn yl0l yi yI2 X return 0 3 3 Specifying the dense vectors 10 3 The 6th argument to oski CreateMatCSR line 29 specifies one of two possible copy modes for the matrix object to help control the number of copies of the assembled ma trix that may exist at any point in time The value SHARE INPUTMAT indicates that both the user and the library will share the CSR arrays Aptr Aind and Aval because the user promises a not to free the arrays before destroying the object A tunable via a call to oski DestroyMat line 41 and b to adhere to a particular set of read write conventions The other available mode COPY INPUTMAT indicates that the library must make a copy of these arrays before
16. HINT SINGLE BLOCKSIZE ARGS NONE the user may specify the block size explicitly if it is known oski SetHint A tunable HINT SINGLE BLOCKSIZE 6 6 6 x 6 blocks In this case either call is correct since specifying the block size is optional See Table 9 on the previous page for a list of hints their arguments and whether the arguments are optional or required A library implementation is free to ignore all hints so there is no strict definition of the library s behavior if for example the user provides conflicting hints We recommend that implementations use the following interpretation of multiple hints e If more than one hint from a mutually exclusive group is specified assume the latter is true For example if the user specifies HINT SINGLE BLOCKSIZE followed by HINT NO BLOCKS then no block hint should override the single block size hint e Hints from different groups should be joined by a logical and That is if the user specifies HINT SINGLE BLOCKSIZE and HINT SYMM PATTERN this combina tion should be interpreted as the user claiming the matrix is both nearly structurally symmetric and dominated by a single block size 5 43 Initiating tuning This interface defines a single routine oski TuneMat which marks the point during pro gram execution at which tuning may occur This routine returns one of the integer status codes shown in Table 12 on the following page to indicate whether it changed the data struc
17. an overview of the semantics and intent behind these bindings 5 1 Basic scalar types Most sparse matrix formats require storing both floating point data for non zero values and integer index data Our interface is defined in terms of two scalar types accordingly oski value t and oski index t By default these types are bound to the C double and int types respectively The library build process allows the user to generate separate interfaces bound to other ordinal and value type combinations to support applications that need to use multiple types These other interfaces are still C and Fortran callable but the names are mangled to include the selected type information To override the default bindings the user should include an alternate header file oski oski_Txy h instead of oski oski h where xy indicates the desired binding To bind oski index t to int or long let x be i or 1 respectively Similarly to bind oski value t to float double complex float or complex double let y be s d c or z respectively For example to use the OSKI bindings with oski index t long and oski value t float the user would include oski oski Tls h instead of oski oski hin Listing 1 on page 9 It is possible to mix types within a single source file See Appendix C on page 60 for details In some instances in which a value of type oski value t is returned a NaN value is possible Since oski value t may be bound to either a real or complex type w
18. conditions and actions Error conditions and actions and short usage examples Example As discussed in Section 5 6 on page 29 the interface provides two error handling mech anisms return codes and error handling functions By convention readers can assume that any routine returning integers type int will return negative values on errors In ad dition all routines call an error handler if one is available in a given context according to the process described in Section 5 6 For all routines a violation of argument preconditions is always considered an error condition Many of the specifications refer to the mathematical matrix A defined by a given matrix object We take this matrix to have dimensions m x n and with elements A i j referenced beginning at position 1 1 ie 1 i mand 1 j n However since we are presenting the C interface note that all array indexing will be zero based B 1 Matrix object creation and modification B 1 Matrix object creation and modification 36 oski_matrix_t oski_CreateMatCSR oski_index_t Aptr oski index tx Aind oski value t Aval oski index t num rows oski index t num cols oski_copymode_t mode int k oski inmatprop t property 1 oski inmatprop t property k Creates and returns a valid tunable matrix object from a compressed sparse row CSR representa tion Parameters num rows x num cols input num rows gt 0 num cols gt 0 Dimensio
19. constants a 3 w C respectively opA input See Table 6 on page 21 Specifies op A x view w view input x view w view are valid View objects for multi vectors x and w respectively y view z view input output y view z view are valid View objects for multi vectors y and z respectively Actions and Returns If A z and y have compatible dimensions and if op A w and z have compatible dimensions then this routine computes y a A x 8 y and z w op A w C z and returns 0 Otherwise returns an error code and takes no action Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD ARG ERR BAD MAT ERR BAD VECVIEW and incompatible operand dimensions ERR DIM MISMATCH Example B 3 Kernels 51 Let A tunable be a matrix object for the sparse lower triangular matrix A shown in Listing 1 on page 9 and let x7 1 2 3 This example computes EN y A z and z AT z double x 1 2 3 oski vecview t x view oski_CreateVecView x 3 STRIDE UNIT double y 1 1 1 oski_vecview_t y view oski CreateVecView y 3 STRIDE UNIT doublez 1 1 1 y oski vecview t z view oski_CreateVecView z 3 STRIDE UNIT Compute y A x and z AT a oski MatMult and MatTIransMult A tunable 1 x view 0 y view OP TRANS 1 x view 0 z view Print results Should pr
20. equal a special symbolic constant representing an identity permutation of any size PERM IDENTITY This constant may be used in either of the routines to apply a permutation or its inverse to a vector view Listing 5 on the next page sketches the way in which a user might use these routines in her application It reduces the number of permutations performed if P P a condition easily tested line 20 by directly comparing the corresponding permutation variables Pr and Pc 24 29 34 39 5 4 Tuning 27 Listing 5 An example of extracting permutations This example computes y A x Suppose the library tunes by symmetrically reordering the rows and columns i e by computing A PT A P where P is a permutation matrix and multiplying by A is much faster than multiplying by A Then this example shows how to pre and post permute x y only each instead of at every multiplication A x Create A x and y x oski matrix t A tunable oski vecview t x view y view Store permuted form Declared as Type const since they will be read only const oski perm t Pr Pc Stores P Po const oski matrix t A fast x Stores int iter max power 3 k Tune for our computation oski SetHintMatMult A tunable OP NORMAL 1 x view 1 y view max power oski TuneMat A tunable Extract permuted form A P A PT A fast oski ViewPermutedMat A tun
21. of columns then an error is generated Example See Listing 1 on page 9 oski matrix t oski CreateMatCSC oski index t Aptr oski index tx Aind oski value t Aval oski index t num rows oski index t num cols oski_copymode_t mode int k oski inmatprop t property 1 oski inmatprop t property k Creates and returns a valid tunable matrix object from a compressed sparse column CSC repre sentation Parameters B 1 Matrix object creation and modification 37 num rows x num cols input num_rows gt 0 num cols gt 0 Dimensions of the input matrix Aptr Aind Aval input Aptr Aind Aval ANULL The input matrix pattern and values must correspond to a valid CSC representation as defined in Appendix A on page 34 mode input See Table 2 on page 19 Specifies the copy mode for the arrays Aptr Aind and Aval See Section 5 2 1 on page 16 for a detailed explanation k input k 20 The number of qualifying properties property 1 property k input optional See Table 3 on page 20 The user may assert that the input matrix satisfies zero or more properties listed in Table 3 on page 20 Grouped properties are mutually exclusive and specifying two or more properties from the same group generates an error see below Actions and Returns A valid tunable matrix object or INVALID MAT on error Any kernel operations or tuning opera tions may be called using this object Error conditions and actio
22. option with papi to specify the appropriate link flags For example to link against the copy of PAPI in the papi 3 0 8 1ib subdirectory of your home directory 3 Using OSKI A First Example 8 oski 1 0 1h configur with prefix S OSKIDIR with papi LS HOME papi 3 0 8 lib lpapi We recommend that you use PAPI only if you encounter an error with the timing package during installation 3 Using OSKI A First Example This section introduces the C version of OSKI by example The interface uses an object oriented calling style where the two main object types are 1 a sparse matrix object and 2 a dense multiple vector object In addition to showing OSKT s basic usage and providing you with a first example to test your OSKI installation this example illustrates how you can gradually migrate your existing application to use OSKI provided that application uses standard array representations of sparse matrices and dense vectors Listing 1 on the following page shows a simple example C program that computes one sparse matrix vector multiply SpMV on a 3 x 3 lower triangular matrix A and then prints the result This example is located in the examp1e 01 subdirectory of the OSKI source tree We explain Listing 1 on the next page below to just build the program to test your installation skip ahead to Section 3 5 on page 11 Calls to OSKI routines are shown in bold face OSKI objects are shown in blue bold face
23. returning from this call because you may choose to free the arrays at any time We discuss the semantics of both modes in de tail in Section 5 on page 14 In this example think of A tunable as a wrapper around these arrays 4 Arguments 7 10 tell the library how to interpret the CSR arrays line 31 Argument 7 is a count that says the next 3 arguments are semantic properties needed to inter pret the input matrix correctly INDEX ZERO BASED says that the index values in Aptr and Aind follow the C convention of starting at 0 as opposed to the typical Fortran convention of starting at 1 the default is 1 based indexing if not otherwise specified The value MAT TRI LOWER asserts the pattern is lower triangular and MAT UNIT DIAG IMPLICIT asserts that no diagonal elements are specified explic itly but should be taken to be 1 The library will at this call check these properties to ensure they are true if the cost of doing so is O no of non zeros Since this example uses the SHARE INPUTMAT copy mode and performs no tuning OSKI will not create any copies of the input matrix The routine oski CreateMatCSR accepts a variable number of arguments only the first 6 arguments are required If you do not provide the optional arguments the library assumes the defaults discussed in Section 5 2 on page 16 3 3 Specifying the dense vectors Dense vector objects of type oski vecview t are always wrappers around user array rep resentations lines 34
24. static library versions of Listing 1 on page 9 Builds the program in Listing 1 on page 9 on Linux systems Location of your copy of OSKI OSKL DIR usr local OSKIINCS DIR OSKI_DIR include OSKILIBS_DIR OSKI_DIR lib oski OSKI link flags OSKILIBS SHARED Wl rpath W1 OSKILIBS_DIR L OSKILIBS_DIR V cat OSKILIBS_DIR site modules shared txt OSKILIBS_STATIC Wl whole archive cat OSKILIBS_DIR site modules static txt A W1 no whole archive CC icc CFLAGS I OSKI_DIR include O3 xB g CLDFLAGS SHARED OSKILIBS SHARED Im CLDFLAGS STATIC OSKILIBS STATIC 1dl lm sn wA ME all example1 shared examplel static examplel shared examplel o CC CFLAGS o 9 example1 o CLDFLAGS SHARED examplel static examplel o CC CFLAGS o 0 example1 o H CLDFLAGS STATIC C 0 CC CFLAGS o 9 c lt clean rm rf example1 shared example1 static examplel o core eof That A tunable x view and y view are shared with the library implies you can con tinue to operate on the data to which these views point as you normally would For in stance you can call dense BLAS operations such as a dot products or scalar vector multi ply the so called axpy operation on x and y You might also choose to introduce calls to the OSKI kernels se
25. the name of the U C Berkeley mascot Go Bears 2 2 How to install OSKI 5 IRIX Darwin and Cygwin Microsoft Windows platforms are supported only through Cygwin You ll also need to think about the following issues 22 What integer and floating point precisions do you need Sparse matrices are stored using both scalar integer indices and floating point val ues and you can build different versions of OSKI with various combinations of types OSKI currently supports supports the C language int and long usually 32 bit and 64 bit values respectively for the integer types and for floating point values supports single precision real using float double precision real using double and complex valued versions of single and double precision By default the installation process builds int double You specify the combinations of types you need during the Configuration step below In what directory will you install OSKI OSKI installs itself in the subdirectories of usr 10ca1 by default but you can over ride this In all of our examples we will use OSKIDIR as a placeholder for the directory of your choice The OSKI files are installed in S OSKIDIR bin S OSKIDIR lib oski S OSKIDIR include oski Do you want to build static or shared libraries The default installation builds both static and shared libraries except on Cygwin where only static libraries are built We recommend that when possible you build and use sh
26. their complex variants as the BLAS 2 Matrix handle creation Section 5 2 on page 16 The Sparse BLAS routine which ends the matrix assembly should construct a CSR or CSC matrix internally and call the OSKI matrix handle creation routine on this matrix The Sparse BLAS implemen tation needs to maintain a mapping between the handle it returns to the user and the OSKI handle Since the matrix is logically fixed in the Sparse BLAS once assembly ends the OSKI routines to get change non zero values are irrelevant 3 Kernels Section 5 3 on page 19 OSKI does not define any sparse or dense vector kernels The sparse matrix dense vector matrix multiply and triangular solve ker nels correspond directly to the equivalent OSKI routines The Sparse BLAS kernels just need to wrap the dense vector matrix data by creating vector views which are essentially just semantic wrappers around their data see Section 5 2 on page 16 Higher level kernels like A7 A x are not available in the Sparse BLAS and are thus not relevant to an OSKI based implementation 4 Tuning Sections 5 4 5 5 The simplest way to enable tuning is to use the implicit profiling style supported by OSKI Section 4 2 on page 14 at the end of every SpMV or sparse triangular solve SpTS call the Sparse BLAS calls oski TuneMat In addi tion any Sparse BLAS structural hints provided by the user can be mapped directly to equivalent OSKI structural hints and provided to OSKI during
27. to extract diagonal entries diag num input 1 m lt diag num n 1 Number d of the diagonal to extract diag vals output diag vals is a valid view Let X be the r x s multi vector object into which to store the diagonal values such that s gt 1 and r is at least the length of the diagonal i e r gt min max m n d min m njj Actions and Returns For all j i d stores A i j in X k 1 where k min i j and returns 0 On error returns an error code Error conditions and actions Possible error conditions include B 1 Matrix object creation and modification 42 1 Providing an invalid matrix ERR BAD MATT 2 Providing an invalid vector view or a vector view with invalid dimensions ERR BAD VECVIEW 3 Specifying an invalid diagonal ERR BAD ARG Example 1 2 6 First create A 2 1 0 asparse symmetric matrix with a unit diagonal 5 0 1 int Aptr 1 3 3 3 Aind 1 2 Uses 1 based indices double Aval 2 0 5 double diag vals 0 0 0 y oski vecview t diag vals view oski CreateVecView diag vals 3 STRIDE_UNIT oski matrix t A tunable oski CreateMatCSR Aptr Aind Aval 3 3 SHARE INPUTMAT 2 MAT SYMM UPPER MAT UNIT DIAG IMPLICIT Prints Main diagonal 1 1 1 oski GetMatDiagValues A tunable 0 diag vals view print Main diagonal f f f Wn diag vals 0 diag vals 1 diag vals 2 Prints First superd
28. un changed Error conditions and actions Possible error conditions include providing an invalid permutation ERR BAD PERM or vector ERR BAD VECVIEW Example See Listing 5 on page 27 B 6 Saving and restoring tuning transformations char x oski GetMatTransforms const oski matrix t A tunable Returns a string representation of the data structure transformations that were applied to the given matrix during tuning B 7 Error handling 59 Parameters A tunable input A tunable is valid The matrix object to which from which to extract the specified data structure transformations Actions and Returns Returns a newly allocated string representation of the transformations that were applied to the given matrix during tuning or NULL on error The user must deallocate the returned string by an appropriate call to the C free routine Returns NULL on error Error conditions and actions Possible error codes include providing an invalid matrix ERR BAD_MAT Example See Listing 6 on page 28 int oski_ApplyMatTransforms const oski matrix t A tunable const char xforms Replace the current data structure for a given matrix object with a new data structure specified by a given string Parameters A tunable input A tunable is valid The matrix object to which to apply the specified data structure transformations xforms input A string representation of the data structure transformations to be applie
29. where k ty op A x forall 1 k lt p A tunable input output A tunable is valid and square An object for a matrix A opA input See Table 6 on page 21 Specifies op A power input power gt 0 Power p of the matrix A to apply alpha beta input The scalar constants a 3 respectively x view y view input Vectors are valid or symbolic see Table 11 on page 25 single vectors View objects for the vectors zx y T_view input A valid or symbolic multivector or INVALID MAT If not equal to INVALID MAT T view is either a view object for the multivector T ti tp 1 or SYMBOLIC MULTIVEC num calls input num calls is non negative or symbolic see Table 10 on page 25 The number of times this kernel will be called with these arguments Actions and Returns Registers the workload hint with A tunable and returns 0 if the operand dimensions are compati ble Otherwise returns an error code Error conditions and actions B 5 Permutations 56 Possible error conditions include unsatisfied argument preconditions ERR BAD_ARG ERR BAD MAT ERR BAD_VECVIEW and incompatible operand dimensions ERR DIM MISMATCH int oski_SetHint oski matrix t A tunable oski_tunehint_t hint y Register a hint about the matrix structure with a matrix object Parameters A tunable input output A tunable is valid Matrix object for which to register a structural hint hint input See Table
30. 3 n vals 3 prints A 3 3 1 int oski_DestroyMat oski_matrix_t A_tunable Frees object memory associated with a given matrix object The object is no longer usable Parameters A_tunable input output A tunable is valid The object representing some m x n matrix A Actions and Returns Returns 0 if the object memory was fully successfully freed or an error code on error Error conditions and actions Regardless of the return value A tunable should not be used after this call Possible error condi tions include an invalid matrix object ERR BAD MATT B 2 Vector object creation 45 Example See Listing 1 on page 9 B 2 Vector object creation oski vecview t oski CreateVecView oski_value_t x oski index t length oski index t inc Creates a valid view on a single dense column vector z Parameters length input length gt 0 Number of vector elements inc input inc gt 0 Stride or distance in the user s dense array between logically consecutive elements of x Specifying STRIDE_UNIT is the same as setting inc 1 x input x ZNULL A pointer to the user s dense array representation of the vector x Element x of the logical vector x for all 1 lt i lt length lies at position x i 1 inc Actions and Returns Returns a valid vector view object for x or INVALID_VEC on error Error conditions and actions An error occurs if any of the argument preconditions are not sati
31. 35 We refer to such wrappers as views A vector view encapsulates basic information about an array such as its length or such as the stride between consec utive elements of the vector within the array As with the BLAS a non unit stride allows a dense vector to be a submatrix In addition an object of type oski vecview t can encapsu late multiple vectors multivector for kernels like sparse matrix multiple vector multiply SpMM or triangular solve with multiple simultaneous right hand sides The multivector object would also store the number of vectors and the memory organization i e row vs column major as discussed Section 5 2 3 on page 19 These vector views help unify and simplify some of the kernel argument lists 3 4 Calling sparse matrix vector multiply The argument lists to kernels such as oski MatMult for SpMV in this example line 37 follow the conventions of the BLAS For example a user can specify the constant OP TRANS as the second argument to apply A instead of A or specify other values for a and 6 The calls to oski_DestroyMat and oski Destroy VecView free any memory allocated by the library to these objects lines 41 43 However since the user and library share the arrays underlying A tunable x view and y view you are responsible for deallocating these arrays here Aptr Aind Aval x and y 21 26 31 3 5 Linking 11 Listing 2 Makefile for the first example This Makefile builds shared and
32. 83 return MBCSR InputMat 1 3 141 401 891 8 7179 return MBCSR InputMat 1 4 In this case the first 2 columns are the row and column block size and the fourth col umn shows the corresponding performance in Mflop s and the text after the 75 symbol show the OSKI Lua transformation program describing that data structure If you know your matrix has lots of dense block substructure look at factors of the structural block size in this file Comparing these data to the data in the corresponding files for CSR or CSC will give some indication as to whether or not you should even expect significant speedups 6 4 Pre built synthetic benchmarks The installation process installs several binary executables in OSKIDIR bin that call OSKI on simulated matrix workloads These programs are listed below where the suf fix Txy is a placeholder for the specific integer floating point type combinations you se lected during installation where x is either i for int or 1 for 1ong and y is one of s c z for single precision real double precision real single precision complex and double precision complex respectively e oskitimer The OSKI timer measures time in arbitrary units of ticks which are usually but not always equivalent to cycles You can use this utility to print the tick to seconds conversion factor which will generally match the clock rate of your machine e oskibench Txy Benchmarks OSKI on a user specified workload and sp
33. A Sloot D Abramson A V Bogdanov J J Dongarra A Y Zomaya and Y E Gorbachev eds vol LNCS 2660 Melbourne Australia June 2003 Springer pp 705 714 22 R W VUDUC Automatic performance tuning of sparse matrix kernels PhD thesis University of California Berkeley December 2003 23 R C WHALEY Automatically Tuned Linear Algebra Software ATLAS home page 2005 netlib org atlas A Valid input matrix representations 34 A Valid input matrix representations The user creates a sparse matrix object in our interface from a pre assembled input matrix At present we support the matrix representations listed below Each representation de fines a mathematical matrix A of size m x n whose element values we denote by A i j where 1 i mand 1 j n Packed 3 array compressed sparse row using 0 based indices The user provides 3 arrays corresponding to A Aptr Aind whose elements are non negative integers and Aval whose elements are real or complex values These arrays satisfy the fol lowing conditions 1 Aptr is of length at least Aptr m 1 Aptr 0 gt 0 and for all 0 i lt m Aptr i Aptr i 1 2 Aind is of length at least Aptr m Each element of Aind lies between 0 and n 1 inclusive 3 Avalis of length at least Aptr m A matrix element A i j is computed from this representation as follows Let Kj be the set k Aptr i 1 k lt Aptr i and Aind k j 1
34. AYS TUNE AGGRESSIVELY The relative amount of tuning is platform dependent These constants instruct the library to go ahead and try tuning at the next call to oski TuneMat assuming the application can always amortize cost This facility might be useful when say benchmarking an application on a test problem to gauge the potential perfor mance improvement from tuning Once A tunable has been created you may call the tuning hints as often as and when ever you choose For example suppose the user mixes calls to SpMV and ATA x in roughly equal proportion You specify such a workload as follows oski SetHintMatMult A tunable 1000 oski SetHintMatTransMatMult A tunable 1000 other hints oski TuneMat A tunable Then oski TuneMat will try to choose a data structure that yields good performance over all for this workload Workload hints are cumulative i e the call oski SetHintMatMult A tunable 2000 is equivalent to the two call sequence oski SetHintMatMult A tunable 1000 oski SetHintMatMult A tunable 1000 n assuming the arguments given by are identical and furthermore independent of what other operations occur in between the two calls 42 Tuning style 2 Implicit profiling Sparse matrix objects may also be tuned without any explicit hints In this case OSKI quietly monitors the number of times each is called with a particular matrix and kernel argume
35. C 500 workload hint oski SetHint A tunable HINT SINGLE BLOCKSIZE ARGS NONE structural hint oski TuneMat A tunable Mdots x oski_vecview_t x view oski CreateVecView oski_vecview_t y view oski_CreateVecView for i 0 i 100 i IH xs for k 0 k 5 k PE s oski MatMult A tunable OPLNORMAL 1 0 x_view 1 0 y view IH o Ho the expected workload or assert performance relevant structural properties of the matrix non zeros Listing 3 sketches a simple example in which we provide two tuning hints The first hint made via a call to oski SetHintMatMult specifies the expected workload We refer to such a hint as a workload hint This example tells the library that the likely workload consists of at least a total of 500 SpMV operations on the same matrix The argument list looks identical to the corresponding argument list for the kernel call oski MatMult except that there is one additional parameter to specify the expected frequency of SpMV operations The frequency allows the library to decide whether there are enough SpMV operations to hide the cost of tuning For optimal tuning the values of these parameters should match the actual calls as closely as possible The constant SYMBOLIC VEC indicates that we will apply the matrix to a single vec tor with unit stride Alternatively we could use the constant SYMBOLIC MULTIVEC to indicate that we will perform SpM
36. E INPUT MAT as discussed in Section 5 2 1 on page 16 so that not all methods would have to be specialized initially The cost of this arrangement is that there may be two copies of the matrix the local CSR data structure and a tuned data structure e Ata minimum an initial implementation of MatOSKI should implement the follow ing methods defined by Mat matrix assembly get set non zero values and SpMV kernel calls e Tuning could be performed in the implicit self profiling style described in Section 4 2 on page 14 since PETSc will not necessarily a priori which solver the user will call and therefore it will not known the number of SpMV operations The call to oski TuneMat could be inserted after each SpMV call since such calls will be low overhead calls until tuning occurs The current pre alpha implementation uses the explicit style tuning at the end of matrix assembly and also upon any explicit conversion D 3 MATLAB P 64 e Symmetric matrices in PETSc are stored in full storage format though a special tag identifies the matrix as symmetric For MatOSKI we may specify that the diagonal block is symmetric by providing the appropriate hint at handle creation time see Table 3 on page 20 e The default error handler for each OSKI handle should be replaced with a routine that prints an error message to PETSc s error log e We recommend that additional routines and command line options be created con forming to PE
37. GetMatEntry A_tunable 2 2 printf A 2 3 n oski_GetMatEntry A_tunable 2 3 printf A 3 1 n oski_ GetMatEntry A_tunable 3 1 int oski SetMatEntry oski matrix t A_tunable oski index t row oski index t col oski value t val Changes the value of the specified matrix element Parameters A tunable input output A tunable is valid The object representing some m x n matrix A row col input 1 lt row xm 1 lt col lt n Specifies the element whose value is to be modified This element must have had an associated element stored explicitly in the input matrix when A tunable was created If the user asserted that her input matrix was symmetric or Hermitian when the matrix was created these properties are preserved with this change in value In contrast asserting a tuning hint to say the matrix is structurally symmetric does not cause this routine to insert both A i j and A j 1 Actions and Returns Returns 0 and sets A row col val If the matrix was created as either symmetric or Hermitian including the semantic properties MAT SYMM FULL and MAT HERM FULL this routine logi cally sets A col row to be val also On error A tunable remains unchanged and an error code is returned NOTE When A tunable is tuned the tuned data structure may store additional explicit zeros to improve performance The user should avoid changing entries that were not explicitly stored when A tunable was
38. KI and Who Should Use It The Optimized Sparse Kernel Interface OSKI is a collection of low level primitives that provide automatically tuned computational kernels on sparse matrices for use by the de velopers of solver libraries and scientific and engineering applications These kernels in clude sparse matrix vector multiply and sparse triangular solve among others Tuning means choosing a data structure and corresponding kernel implementation that both a compactly represents the particular sparse matrix and b has an efficient implementation on the target machine The primary aim of OSKI is to hide the complex decision making process needed to tune while also exposing the steps and potentially non trivial costs of tuning at run time The interface also allows for optional continuous profiling and peri odic re tuning as well as user inspection and control of the tuning process OSKI provides functionality essentially at the level of the Basic Linear Algebra Subroutines BLAS 3 6 and we assume a user who has a basic familiarity with the BLAS The current version of OSKI targets uniprocessors machines based on cache based superscalar microprocessors although we are actively extending OSKI for vector architectures SMPs and distributed memory machines OSKI is part of on going work by the Berkeley Benchmarking and OPtimization Be BOP group an active research program on automatic performance tuning and analysis at the University o
39. LID PERM on error Error conditions and actions This routine calls the error handler and returns INVALID MAT if any argument preconditions are not satisfied Example See Listing 5 on page 27 B 6 Saving and restoring tuning transformations 58 const oski_perm_t oski ViewPermutedMatColPerm const oski_matrix_t A_tunable Given a matrix A possibly reordered during tuning to the form P A PT returns a read only object corresponding to P Parameters A tunable input A tunable is valid A matrix object corresponding to some matrix A Actions and Returns Returns a read only permutation object representing P This return is exactly equal to PERM IDENTITY if the matrix is not reordered i e if P P I the identity matrix Returns INVALID PERM on error Error conditions and actions This routine calls the error handler and returns INVALID MAT if any argument preconditions are not satisfied Example See Listing 5 on page 27 int oski PermuteVecView const oski perm t P oski matop t opP oski vecview t x view Permute a vector view object i e computes xz op P x Parameters P input P is valid An object corresponding to some permutation P opP input Specifies op P x view input output x view is valid The view object corresponding to the multi vector z Actions and Returns Permutes the elements of x and returns 0 On error returns an error code and leaves x view
40. M on at least two vectors Better still we could pass an actual instance of a oski vecview t object which has the precise stride and data layout information Analagous routines exist for each of the other kernels in the system The second hint made via a call to oski SetHint is a structural hint telling the library that we believe that the matrix non zero structure is dominated by a single block size Sev eral of the possible structural hints accept optional arguments that may be used to qualify the hint for this example you might explicitly specify a block size though here she in stead uses the constant ARGS NONE to avoid doing so The library implementation might 42 Tuning style 2 Implicit profiling 14 then know to try register blocking since it would be most likely to yield the fastest imple mentation 14 We describe a variety of other hints in Section 5 4 on page 22 These hints are directly related to candidate optimizations explored in our research and we expect the list of hints to grow over time The actual tuning i e possible change in data structure occurs at the call to oski TuneMat This example happens to execute SpMV exactly 500 times though there is certainly no requirement to do so Indeed instead of specifying an exact number or es timate the user may force the library to try a moderate level of tuning by specifying the symbolic constant ALWAYS TUNE or an aggressive level of tuning by specifying ALW
41. OSKI does not seem to be tuning first try enabling run time debug messages at level 10 see Section 6 2 OSKT s performance tuning heuristics will print data that helps explain why it failed to tune Keep in mind that OSKI does a cost benefit analysis to determine whether or not to tune and a large number of kernel calls may be required to trigger tuning You can also try running any of the pre built OSKI benchmarks on your matrix and a synthetic workload see Section 6 4 on the following page 6 4 Pre built synthetic benchmarks 31 You may also wish to inspect the off line benchmarking data collected during instal lation to get an idea of what performance improvements you might expect These files all have dat extensions and may be found in OSKIDIR 1ib oski bench These files are organized by data structure e g CSR or its blocked variant block compressed sparse row BCSR format and by type e 2 int double and show performance in Mflop s for some kernel on some test matrix as a function of the data structure parameters For instance the file OSKIDIR lib oski bench profile_MBCSR_MatMult_Tid dat contains SpMV data for a dense matrix stored in sparse format for the modified block com pressed sparse row MBCSR format data structure with the int double type combination Lines in this file might look like 111 273 755 5 03946 return MBCSR InputMat 1 1 121 347 543 8 3001 return MBCSR InputMat 1 2 131 386 285 8 43
42. SYMM UPPER MAT_SYMM_LOWER MAT_SYMM_FULL MAT_HERM_UPPER MAT_HERM_LOWER MAT_HERM_FULL Input matrix specifies all non zeros Only non zeros in the upper triangle exist Only non zeros in the lower triangle exist Matrix is symmetric but only the upper triangle is stored Matrix is symmetric but only the lower triangle is stored Matrix is symmetric and all non zeros are stored Matrix is Hermitian but only the upper triangle is stored Matrix is Hermitian but only the lower triangle is stored Matrix is Hermitian and all non zeros are stored MAT_DIAG_EXPLICIT MAT_UNIT_DIAG_IMPLICIT Any non zero diagonal entries are specified explicitly No diagonal entries are stored but should be assumed to be equal to 1 INDEX ONE BASED INDEX ZERO BASED Array indices start at 1 default Fortran convention Array indices start at 0 default C convention INDEX UNSORTED INDEX SORTED Non zero indices in CSR CSC format within each row column appear in any order Non zero indices in CSR CSC format within each row column are sorted in increasing order INDEX REPEATED INDEX UNIQUE Indices may appear multiple times Indices are unique Table 3 Input matrix properties type oski inmatprop t Upon the call to create a ma trix object the user may characterize the input matrix by specifying one or more of the above properties Properties grouped within the same box are
43. T A AP A t AT wif op A AAT ort A x if op A AA Parameters A tunable input output A tunable is valid An object for a matrix A opA input See Table 7 on page 21 Specifies op A alpha beta input The scalar constants a p respectively x view y view input x view y view are valid or symbolic see Table 11 on page 25 View objects for multi vector objects x y respectively for a multi vector x t view input May be valid symbolic or INVALID MAT An optional view object for a multi vector t num calls input num calls is non negative or symbolic see Table 10 on page 25 The number of times this kernel will be called with these arguments Actions and Returns Registers the workload hint with A tunable and returns 0 only if the argument dimensions are compatible Otherwise returns an error code Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD ARG ERR BAD MAT ERR BAD VECVIEW and incompatible operand dimensions ERR DIM MISMATCH int oski SetHintMatMultAndMatIransMult oski matrix t A tunable oski value t alpha const oski vecview t x view oski value t beta const oski vecview t y view oski matop t opA oski value t omega const oski vecview t w view oski value t zeta const oski vecview t z view int num calls Workload hint for the kernel operation oski MatMultAndMatTransMult which computes y a A z
44. TSc s calling style to provide access to the following OSKI interface features always tuning at matrix assembly and saving and restoring tuning trans formations Such routines and options would act as no ops if the user called them on a matrix not of type MatOSKI Users can use the usual PETSc mechanisms to select this type either on the command line at run time or by explicitly requesting it via an appropriate function call in their applica tions For most users this amounts to a one line change in their source code D 3 MATLAB P MATLAPBY P is a research prototype system providing distributed parallel extensions to MATLAB 18 Internally matrices are stored in CSC format Since MATLAB P is intended for use interactively the amount of time available for tuning may be limited Therefore we recommend using OSKI in the implicit self profiling mode described in Section 4 2 on page 14 D 4 Kokkos Trilinos The Trilinos solver package is another possible target for OSKI integration 11 Trilinos provides a uniform C interface for its users and solver components through Kokkos the core BLAS like sparse and dense kernel package OSKI could be used to implement the abstract interfaces defined in Kokkos to complement Kokkos existing sparse format implementations
45. The Optimized Sparse Kernel Interface OSKI Library User s Guide for Version 1 0 1h Richard Vuduc James W Demmel Katherine A Yelick Berkeley Benchmarking and OPtimization BeBOP Group University of California Berkeley http bebop cs berkeley edu oski June 27 2007 OSKI is based on research supported in part by the National Science Foundation under NSF Cooperative Agreement No ACI 9813362 NSF Cooperative Agreement No ACI 9619020 the Department of Energy under DOE Grant No DE FC02 01ER25478 and gifts from HP and Intel This research used resources of the Center for Computational Sciences at Oak Ridge National Laboratory which is supported by DOE under Contract No DE AC05 000R22725 the High Performance Computing Research Facility Mathematics and Computer Science Division Argonne National Laboratory the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory the University of Electro Communications in Tokyo Japan the Dept of Biomedical Informatics at Ohio State University and the OpenPower Project at the University of Augsburg Germany The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred Copyright 2005 2007 Regents of the University of California OSKI is distributed under a BSD license http bebop cs berkeley edu oski license txt Contents List of Tables List of Li
46. The new values are a 125 1 int rows 1 2 E int cols 1 2 E double vals 1 1 1 1 in row major order double new vals 1 125 125 1 in row major order oski vecview t vals view oski_CreateMultiVecView vals 2 2 LAYOUT ROWMAJ 2 oski vecview t new vals view oski CreateMultiVecView new vals 2 2 LAYOUT ROWMAJ 2 2 2 i 3 oski_GetMatClique A_tunable rows 2 cols 2 vals_view printf A 1 1 n vals 0 prints A 1 1 1 Retrieve the submatrix of values printf A 1 2 f n vals 1 prints A 1 2 2 printf A 2 1 f n vals 2 prints A 2 1 2 printf A 2 2 f n vals 3 prints A 2 2 1 Change the above values to ET iy oski SetMatClique A_tunable rows 2 cols 2 new_vals_view oski GetMatClique A tunable rows 2 cols 2 vals view printf A 1 1 n vals 0 prints A 1 1 1 printf A 1 2 n vals 1 prints A 1 2 0 125 printf A 2 1 n vals 2 prints A 2 1 0 125 printf A 2 2 n vals 3 prints A 2 2 1 int oski GetMatDiagValues const oski matrix t A tunable oski index t diag num oski vecview t diag vals Extract the diagonal d from A i e all entries A i j such that j i d Parameters A tunable input A tunable is valid The m x n matrix A from which
47. able Pr oski ViewPermutedMatRowPerm A tunable Pc oski ViewPermutedMatColPerm A tunable if Pr Pc May reduce the number of permutations needed Compute y A z in three steps 1 y P y x Pex x oski PermuteVecView Pr OP NORMAL y view oski PermuteVecView Pc OP NORMAL x view Ir 2R y A ax for iter 0 iter lt max power iter oski MatMult A fast OP NORMAL 1 0 x view 1 0 y view 3 y PF yx Pl 2 oski_PermuteVecView Pr OP_TRANS y_view oski_PermuteVecView Pc OP_TRANS x_view else use a conventional implementation for iter 0 iter lt max power iter oski MatMult A tunable OP NORMAL 1 0 x view 1 0 y view Clean up oski DestroyMat A tunable Invalidates Var Pr Var Pc and VVar AN fast oski Destroy VecView x view oski Destroy VecView y view 20 5 5 Saving and restoring tuning transformations 28 Listing 6 An example of saving transformations This example extracts the tuning trans formations applied to a matrix object A_tunable and saves them to a file oski matrix t A tunable oski CreateMatCSR chars xforms stores transformations FILE fp saved xforms fopen my x orm txt wt text file for output x Tune the matrix object oski TuneMat A tunable P sso f Extract transformations xforms oski GetMatTransforms
48. and its source location The arguments beginning at format string are printf compatible arguments that provide a flexible way to provide supplemental error information For example the following code shows how the error handler might be called from within the the SpMV kernel oski MatMult when the user incorrectly tries to apply a matrix A tunable with dimensions mxn to a vector of length veclen where n 4 veclen if n Zveclen your handler ERR DIM MISMATCH oski MatMult Mismatched dimensions oski_MatMult c 507 Matrix dimensions d x d Vector length d n m n veclen return ERR DIM MISMATCH 6 Troubleshooting If you are having difficulty with OSKI here are a few things to try to help track down the problem 6 1 Installation problems The configure script generates a log of its execution in con ig 1og Inspecting this file or including it in problem reports can help identify configuration problems 6 2 Run time errors When running your application if OSKI reports errors that you can t otherwise figure out try setting the environment variable OSK1 DEBUG LEVEL to a value of 1 or higher before running your application env OSKI DEBUG LEVEL 10 your application This causes the library to print diagnostic messages to standard error Inspecting this out put or including snippets of it in problem reports may help identify the problem 6 3 Tuning difficulties If
49. anual Tech Rep MCC 14 03 PUC Rio April 2003 www lua org REFERENCES 33 13 E J IM AND K A YELICK Optimizing sparse matrix vector multiplication on SMPs in Proceedings of the SIAM Conference on Parallel Processing for Scientific Comput ing San Antonio TX USA March 1999 14 E IM K A YELICK AND R VUDUC SPARSITY Framework for optimizing sparse matrix vector multiply International Journal of High Performance Computing Appli cations 18 2004 pp 135 158 15 A PINAR AND M HEATH Improving performance of sparse matrix vector multipli cation in Proceedings of Supercomputing 1999 16 K REMINGTON AND R POZO NIST Sparse BLAS User s Guide tech rep NIST 1996 gams nist gov spblas 17 Y SAAD SPARSKIT A basic toolkit for sparse matrix computations 1994 www cs umn edu Research arpa SPARSKIT sparskit html 18 I THE MATHWORKS Matlab 2003 www mathworks com 19 S TOLEDO Improving memory system performance of sparse matrix vector multi plication in Proceedings of the 8th SIAM Conference on Parallel Processing for Sci entific Computing March 1997 20 R VUDUC J W DEMMEL AND K A YELICK An interface for a self optimizing sparse matrix kernel library 2005 bebop cs berkeley edu oski 21 R VUDUC A GYULASSY J W DEMMEL AND K A YELICK Memory hierarchy optimizations and bounds for sparse AT Az in Proceedings of the ICCS Workshop on Parallel Linear Algebra P M
50. ared libraries for two rea sons First the entire OSKI library can be quite large and shared libraries will help reduce the size of your executable Second we designed OSKI to allow new exten sions to be plugged in as dynamically loaded modules Do you want OSKI to use any support libraries OSKI can take advantage of additional libraries that may be available for your sys tem such as a highly optimized dense BLAS library or PAPI for timing benchmark ing For a list see Section 2 3 3 on page 7 How to install OSKI Follow these steps to compile and install OSKI We show sample command lines for sh csh compatible shells where denotes the command line prompt 1 Download oski 1 0 1h tar gz or oski 1 0 1h tar bz2 from the OSKI home page 2 Unpack the distribution gunzip oski 1 0 1h tar gz tar xvf This command unpacks the OSKI distribution into a subdirectory named oski 1 0 1h At is possible to create Windows style DLLs using the nable shared configure flag but we cur rently consider their use to be experimental Proceed with caution 2 2 How to install OSKI 6 3 Configure OSKI for your platform oo mkdir build 1 0 1h cd build 1 0 1h oski 1 0 1h configure prefix OSKIDIR oe oe As done in this example we strongly recommend building OSKI in a directory sep arate from the source tree you unpacked in Step 2 Here we use a directory named build 1 0 1h The configure script t
51. arse matrix using a particular user specified OSKI Lua tuning transformation e oskibench autotune Txy This utility is the same as oskibench Txy except OSKI selects the OSKI Lua trans formation based on the workload Thus you can use it to test OSKI s automatic tuning Run these utilities with no options to see their usage REFERENCES 32 Acknowledgements We thank Rajesh Nishtala Rob Schreiber Matt Knepley Michael Heroux Viral Shah Christof V mel and Iain Duff for discussion on the key ideas behind and early drafts of this docu ment References 1 S BALAY K BUSCHELMAN W D GROPP D KAUSHIK M KNEPLEY L C MCINNES B F SMITH AND H ZHANG PETSc User s Manual Tech Rep ANL 95 11 Revision 2 1 5 Argonne National Laboratory 2002 www mcs anl gov petsc 2 S BALAY W D GROPP L C MCINNES AND B F SMITH Efficient management of parallelism in object oriented numerical software libraries in Modern Software Tools in Scientific Computing E Arge A M Bruaset and H P Langtangen eds Birkhauser Press 1997 pp 163 202 3 S L BLACKFORD J W DEMMEL J DONGARRA I S DUFF S HAMMARLING G HENRY M HEROUX L KAUFMAN A LUMSDAINE A PETITET R Pozo K REMINGTON AND R C WHALEY An updated set of basic linear algebra subpro grams BLAS ACM Transactions on Mathematical Software 28 2002 pp 135 151 4 D A BURGESS AND M B GILES Renumbering unstructured grids to impro
52. ault error handler called when one of the OSKI routines detects an error condition Parameters message input A descriptive string message describing the error or its context The string message NULL if no message is available source filename input The name of the source file in which the error occurred or NULL if not applicable line number input The line number at which the error occurred or a non positive value if not applicable format string input A printf compatible format string A formatting string for use with the printf routine This argument and any remaining arguments may be passed to a printf like function to provide any supplemental information Actions and Returns This routine dumps a message describing the error to standard error C Mixing Types The OSKI implementation supports mixing of basic data types Section 5 1 on page 16 within a source file by providing type specific subroutine bindings which a user may call explicitly OSKI uses the C preprocessor to generate these bindings and the user can access their prototypes as shown below For example suppose a user needs bindings for both one type of matrix which uses int for the indices and float for the values i e a Tis matrix and another which uses long for the indices and double for the values T1d She should first include the type specific headers as follows include oski oski Tis h int float x define OSKI_REBIND Res
53. bject when k gt 2 5 3 4 Compatible dimensions for matrix multiplication All of the kernels apply a matrix op A to a multi vector x and store the result in another multi vector y Let m x n be the dimensions of op A let p x k be the dimensions of x and let q x l be the dimensions of y We say these dimensions are compatible ifm q n p and k I 5 3 5 Floating point exceptions None of the kernels attempt to detect or to trap floating point exceptions 5 4 Tuning The user tunes a valid matrix object at any time and as frequently as she desires for a given matrix object of type oski matrix t The library tunes by selecting a data structure customized for the user s matrix kernel workload and machine The interface defines three groups of tuning operations listed in Table 8 on the following page and summarized as follows e Workload specification Section 5 4 1 on the next page These primitives allow the user to specify which kernels she will execute and how frequently she expects to execute each one There is one workload specification routine per kernel Structural hint specification Section 5 4 2 on the following page The user may op tionally influence the tuning decisions by providing hints about the non zero struc ture of the matrix For example the user may tell the library that she believes the structure of the matrix consists predominantly of uniformly aligned 6 x 6 dense blocks Explicit tuning Section 5
54. calls explicitly the user may substi tute the symbolic constant ALWAYS TUNE or ALWAYS TUNE AGGRESSIVELY to tell the library to go ahead and assume the application can amortize cost see Table 10 on page 25 The use of two constants allows a library implementation to provide two levels of tuning when the user cannot estimate the number of calls Where a kernel expects a vector view object to be passed as an argument the user may pass to the workload hint either SYMBOLIC VEC or SYMBOLIC MULTIVEC instead of an actual vector view object Table 11 on page 25 The user should use SYMBOLIC VEC if she anticipates using a single vector or SYMBOLIC MULTIVEC if she anticipates using at least two vectors Specifying actual vector view objects is preferred since they will contain additional information relevant to tuning including storage layout for multivectors i e row vs column major and strides or leading dimensions 5 4 2 Providing structural hints A user provides one or more structural hints by calling oski SetHint as illustrated in Sec tions 4 1 4 2 Providing these hints is entirely optional but a library implementation may use these hints to constrain a tuning search Some hints allow the user to provide additional information For instance consider the hint HINT SINGLE BLOCKSIZE which tells the library that the matrix structure is 5 4 Tuning 24 Hint Arguments Description HINT_NO_BLOCKS none Matrix contai
55. corresponding to A These arrays satisfy the following conditions B Bindings Reference 35 1 Aptr is of length at least Aptr n 1 Aptr 0 gt 1 and for all 0 j lt n Aptr j Aptr j 1 2 Aind is of length at least Aptr n Each element of Aind lies between 1 and m inclusive 3 Avalis of length at least Aptr n A matrix element A i j is computed from this representation as follows Let Kj be the set k Aptr j 1 k lt Aptr j and Aind k i Then A i j rex Aval k Repeated elements are summed B Bindings Reference We define each routine in the interface using the formatting conventions used in the fol lowing example for a function to compute the factorial of a non negative integer int factorial int n Given an integer n gt 0 returns n n x n 1 x x3x2xlifn gt 1 orlifn 0 Parameters n input n0 Non negative integer of which to compute a factorial Actions and Returns An integer whose value equals n if n is greater than 1 or 1 if n equals 0 The return value is undefined if n exceeds the maximum positive integer of type int Error conditions and actions Aborts program if n is less than 0 Example int n 4 int ans factorial n printf Sd d n n ans Should print 4 24 The specification indicates any argument preconditions under Parameters return val ues and side effects Actions and Returns possible error
56. created Error conditions and actions Possible error conditions include 1 Invalid matrix ERR BAD MAT 2 The position row col is out of range ERR BAD ENTRY 3 The position row col was not explicitly stored when A tunable was created i e the spec ified entry should always be logically zero ERR ZERO ENTRY This condition cannot always be enforced e g if tuning has replaced the data structure and freed the original so this error will not always be generated 4 Changing row col would violate one of the asserted semantic properties when A tunable was created ERR INMATPROP CONFLICT For instance suppose A i j is in the upper triangle of a matrix in which MAT TRI LOWER was asserted is an error condition or suppose the caller asks CURES a diagonal element to a non unit value when MAT UNIT DIAG IMPLICIT was asserted Example 1 2 5 First create A 2 1 0 asparse symmetric matrix with a unit diagonal ED 0 1 int Aptr 1 3 3 3 Aind 1 2 Uses 1 based indices double Aval 2 0 5 oski matrix t A_tunable oski CreateMatCSR Aptr Aind Aval 3 3 SHARE INPUTMAT 2 MAT SYMM UPPER MAT_UNIT_DIAG_ IMPLICIT printf A 1 3 n oski_ GetMatEntry A tunable 1 3 prints A 1 3 0 5 printf A 3 1 n oski GetMatEntry A tunable 3 1 prints A 3 1 0 5 Change A 3 1 and A 1 3 to 5 B 1 Matrix object creation and modification 39 o
57. d to the matrix represented by A tunable The conditions xforms NULL and xforms equivalent to the empty string are both equivalent to requesting no changes to the data structure Actions and Returns Returns 0 if the transformations were successfully applied or an error code otherwise On success the data structure specified by xforms replaces the existing tuned data structure if any Error conditions and actions Possible error conditions include an invalid matrix ERR BAD MATT a syntax error while process ing xforms ERR BAD SYNTAX and an out of memory condition ERR OUT OF MEMORY Example See Listing 7 on page 29 B 7 Error handling oski errhandler t oski GetErrorHandler void Returns a pointer to the current error handling routine Actions and Returns Returns a pointer to the current error handler or NULL if there is no registered error handler oski errhandler t oski SetErrorHandler oski errhandler t new handler C Mixing Types 60 Changes the current error handler Parameters new handler input A valid error handling routine or NULL Pointer to a new function to handle errors Actions and Returns This routine changes the curent error handler to be new handler and returns a pointer to the pre vious error handler void oski HandleErrorDefault int error code const charx message const charx source filename unsigned long line number const chars format string The def
58. de if tuning will be profitable given its run time cost However workload hints are optional if not given OSKI will try to infer the workload implicitly based on D 1 Sparse BLAS 62 what kernels the user calls An implicit profiling style is the only mode available in the Sparse BLAS e Additional kernels In addition to level 2 and level 3 versions of matrix vector mul tiply and triangular solve OSKI includes kernels that have more inherent reuse of the matrix simultaneous multiplication by A and AT sparse ATA z and matrix powers multiplication A x If these kernels prove useful they should be considered in a future revision of the Sparse BLAS interface e Explicit permutation representation One possible optimization is to permute the rows and columns of the matrix to improve temporal or spatial locality and the user can access these permutations explicitly through the OSKI interface if her algorithm can amortize their cost see Section 5 4 4 on page 25 e Tuning save restore mechanism OSKI provides a way to determine what tuning transformation was applied and also to request a specific transformation Section 5 5 on page 28 The following outline shows how a Sparse BLAS could be implemented using OSKI by mapping the various components of OSKI described in Sections 5 1 5 6 to the Sparse BLAS interface 1 Basic scalar types Section 5 1 on page 16 OSKI supports the same basic floating point types single double and
59. e denote NaN s by NaN_VALUE throughout 5 2 Creating and modifying matrix and vector objects Our interface defines two basic abstract data types for matrices and vectors oski matrix t and oski vecview t respectively Available primitives to create and manipulate objects of these types appears in Table 1 on the next page and C bindings appear in Appendix B 1 on page 35 5 2 1 Creating matrix objects The user creates a matrix object of type oski matrix t from a valid input matrix Logically such an object represents at most one copy of a user s input matrix tuned for some kernel workload with a fixed non zero pattern for the entire lifetime of the object 5 2 Creating and modifying matrix and vector objects 17 Matrix oski CreateMatCSR Create a valid tunable matrix object from a objects CSR input matrix oski CreateMatCSC Create a valid tunable matrix object from a compressed sparse column CSC format input matrix oski CopyMat Clone a matrix object oski DestroyMat Free a matrix object oski GetMatEntry Get the value of a specific matrix entry oski SetMatEntry Set the value of a specific non zero entry oski GetMatClique Get a block of values specified as a clique oski SetMatClique Change a block of non zero values specified as a clique oski GetMatDiagValues Get values along a diagonal of a matrix oski SetMatDiag Values Change values along a diagonal Vector oski CreateVecView Create a view objec
60. e ere we ee ew 64 List of Tables 1 Creating and modifying matrix and vector objects 17 2 Copy modes type oski copymode t 6 4 650 bed eee oe eee 19 3 Inputmatrix properties type oskiinmatpropt 4 4 2 65658 565 20 4 Dense multivector dense matrix storage modes type oski storage t 20 5 o IIA 21 6 Matrix transpose options typeoski matop t o 21 7 Matrix transpose times matrix options typeoski ataopt 21 8 ga ce quc qo Roo Rosen a Wess ETS APR AR eR 23 9 X Available structural hints type oski_tunehintt 24 10 Symbolic calling frequency constants typeint 00 0 25 11 Symbolic vector views for workload hints type oski vecview t 25 12 dune status CODE uoce phone rro RO Cook E RO eese 26 13 Extracting and applying permuted forms 6 5 6 tr 26 14 Saving and restoring tuning transformations 0 28 15 Eeror anlinge routines a o o eye RR EERO ho eS 30 Listings J Ling OSKIA frstexampie Lus e e a EA OER EN SRE eR E 9 2 Makefile forthe Te CMRI s ose A ga A a i ord 11 3 Anexampleof basicexplicit Wining cscs 9o o9 ok b EE we 13 4 Anexample of implicit tuning 4o km 9 eR RE RO 15 5 Anesxampleof extracting permutations lt lt oo or o o 27 6 Anexampleofsavingtransformations les 28 7 Anexampleofapplying transformations o o o 29 1 What is OSKI and Who Should Use It 4 1 What is OS
61. e overhead of tuning can be amor tized but you may not always be able to estimate this workload before execution Rather than specifying the workload explicitly you may rely on the library to mon itor kernel calls to determine the workload dynamically You still explicitly call the same tune routine as above to perform optimizations but this routine optimizes based on an inferred workload To get a better sense of which style is most appropriate for your application see the exam ples below 41 Tuning style 1 Providing explicit hints In this style you tune a sparse matrix object by optionally providing one or more hints followed by an explicit call to the matrix tuning routine oski TuneMat Hints describe 20 4 Tuning style 1 Providing explicit hints 13 Listing 3 An example of basic explicit tuning This example creates a sparse matrix object A tunable and then tunes it for a workload in which we expect to call SpMV 500 times In addition we provide an additional hint to the library that the matrix non zero structure is dominated by a dense blocks of a single size uniformly aligned Later in the application we actually call SpMV a total of 500 times in some doubly nested loop Create a tunable sparse matrix object A tunable oski CreateMatCSR Tell the library we expect to perform 500 SpMV operations with a 1 8 1 oski SetHintMatMult A tunable OP NORMAL 1 0 SYMBOLIC VEC 1 0 SYMBOLIC VE
62. ed in Section 2 1 on page 4 That completes installation of OSKI You should be ready for the first example in Section 3 on page 8 2 3 Customizing your OSKI build using configure 7 23 Customizing your OSKI build using configure The Configure step can be customized in many ways The most commonly used options are discussed below For a complete list run from the build tree in our previous example oe oski 1 0 1h configure help 2 3 1 Overriding the default compiler and or flags To specify the compiler and or compiler flags yourself define the cc and or CFLAGS en vironment variables accordingly For example to use the Intel C compiler to build OSKI specifically for a Pentium M machine env CC icc CFLAGS 03 xB oski 1 0 1h configure We very much welcome your contributions and suggestions if you find compiler and optimization flag combinations that improve on the defaults The Configure step will try to auto detect a Fortran 77 compiler so that OSKI can later check for and link against a native BLAS library If you do not want or have a Fortran com piler and run into configuration problems try specifying the disable fortran flag To override the F77 compiler or F77 compiler flags use the F77 or FFLAGS environment variables as with CC CFLAGS above 2 3 2 Selecting other or additional scalar type precisions By default OSKI uses your C compiler s int data type to store integer indices and double for f
63. et bindings include oski oski Tld h long double x The first include directive uses the C preprocessor to bind the standard OSKI subroutine names to mangled names that include the type information The effect is to introduce a D OSKI Library Integration Notes 61 define directive of the form shown below along with a C prototype for the type specific name as if the user had entered the following define oski MatMult oski MatMult Tis int oski MatMult int x int x float x The second include directive preceeded by the macro definition OSKI REBIND first un defines the standard names previously bound to the Tis type and then redefines them to the T1d type The user should insert additional OSKI_REBIND and include directives for other types as necesary The header files declare all the type specific prototypes and the type mangled names are now available for the user to call explicitly Note that the basic types e g oski matrix t are similarly available in type mangled forms e g oski matrix t Tis oski matrix t Tld Matrix A in int float CSR format int Aptr Aind float Aval Matrix B in long double CSR format x long Bptr Bind double Bval eo x oski matrix t Tis A oski CreateMatCSR Tis Aptr Aind Aval oski matrix t Tld B oski CreateMatCSR Tld Bptr Bind Bval eo x oski MatMult Tis A oski MatMult Tld B eo x
64. f California Berkeley OSKI is based in large part on the SPARSITY framework for automatically tuning sparse matrix kernels 14 For information on the philosophy of the OSKI design and pointers to the related research see the OSKI Design Document 20 a draft of which is available at http bebop cs berkeley edu oski How to read this document The fastest way to get started is to look at the material in Sections 2 4 which contains enough information to build and install OSKI and to write your first test program If you run into problems see Section 6 on page 30 for debugging hints or post a question in the online forum through the OSKI home page 2 Installation This section describes how to compile and install OSKI 1 0 1h from the source There may also be pre compiled binaries for your platform check the OSKI home page Automatic tuning in OSKI happens in two phases an installation time phase which benchmarks your machine and a run time phase when you call OSKI with a particular matrix You will need to run a benchmark during installation but this step is automated and should not require any manual intevention on your part 2 1 What you will need to get started To compile and install OSKI 1 0 1h you will need the following e An ANSI C compiler e Anenvironment providing standard UNIX tools such as make grep awk and sed OSKI has been tested in the following environments Linux FreeBSD NetBSD AIX 10SKI is also
65. g and modifying sparse matrix and dense vector objects Section 5 2 on the following page Table 1 on page 17 A sparse matrix object must be created from an existing user allocated pre assembled matrix We refer to this user assembled matrix as the input matrix Appendix A on page 34 defines currently supported input matrix formats The user may specify whether the library and the user share the input matrix arrays Section 5 2 1 on the next page When the library tunes a matrix object it may choose a new internal representation sparse data structure Dense vector objects are wrappers around user allocated dense arrays Executing kernels Section 5 3 on page 19 Table 5 on page 21 e g sparse matrix vector multiply triangular solve The interfaces to our kernel routines mimic the look and feel of the BLAS Tuning Section 5 4 on page 22 Table 8 on page 23 Tuning occurs only if and when the user calls a particular routine in our interface In addition to this tune routine we also provide auxiliary routines that allow users to provide optional tuning hints Saving and restoring tuning transformations Section 5 5 on page 28 Table 14 on page 28 We provide a routine to allow the user to see a precise description repre sented by a string of the transformations that convert the input matrix data structure to the tuned data structure The user may then call an additional routine to execute this p
66. he OSKI home page Comments and contributions are welcome PETSc is written in C in an object oriented style which defines an abstract matrix in terface type Mat specific matrix formats are concrete types derived from this interface that implement the interface Available formats at the time of this writing include serial CSR type MatAIJ distributed CSR MatMPIAIJ serial and distributed block compressed sparse row format MatBAIJ and MatMPIBAIJ respectively for square block sizes up to 8 x 8 and a matrix free format MatShell implemented using callbacks to user defined routines among others The simplest way to integrate OSKI into PETSc is to define a new concrete type say MatOSKI based on the distributed compressed sparse row format MatMPIAIJ with the following modifications e PETSc distributes blocks of consecutive rows of the matrix across the processors Internally each processor stores its rows in two packed 3 array 0 based CSR data structures Appendix A on page 34 one for a diagonal block and one for all re maining elements PETSc determines the distribution and sets up the corresponding data structures when the user calls a special matrix assembly routine The MatOSKT type would store two additional handles on each processor corresponding to the rep resentation e Since the abstract matrix interface defines a large number of possible methods each BeBOP handle would be created using the shared copy mode SHAR
67. he block of values The array rows must be of length at least num rows and cols must be of length at least num cols The entries of rows and cols must satisfy 1 rows lt m for all 0 lt i lt num rows and 1 lt cols j lt n for all 0 lt j lt num cols vals input vals is valid The object vals is a multivector view see Section 5 2 3 on page 19 of a logical two dimensional array to be used to store the block of values We use a view here to allow the user to specify row or column major storage and the leading dimension of the array Actions and Returns Let X be the num rowsxnum cols matrix corresponding to vals If rows and cols are valid as discussed above then this routine sets A i j X r c where i rows r 1 and j cols c 1 for all 1 lt r lt num_rows and 1 lt c num cols and returns 0 Otherwise this routine returns an error code and leaves X unchanged If the matrix was created as either symmetric or Hermitian including the semantic properties MAT SYMM FULL and MAT HERM FULL this routine logically sets A i j and A 7 i If both i j and j i are explicitly specified by the clique the behavior is undefined if the corresponding values in vals are inconsistent If an entry A i j is specified by the clique and appears multiple times within the clique with inconsistent values in vals the behavior is undefined NOTE When A tunable is tuned the tuned data structure may store additio
68. i rows r 1 and j cols c 1 for all 1 lt r lt num rows and 1 lt c lt num cols and returns 0 Otherwise this routine returns an error code and leaves X unchanged Error conditions and actions Possible errors conditions include 1 Invalid matrix ERR BAD MAT 2 An invalid row col was given ERR BAD ENTRY T Example Let A be the matrix shown in Listing 1 on page 9 and stored in A tunable int rows 1 3 int cols 1 3 E double vals 1 1 1 1 oski vecview t vals view oski_CreateMultiVecView vals 2 2 LAYOUT ROWMAJ 2 oski GetMatClique A tunable rows 2 cols 2 vals view printf A 1 1 n vals 0 prints A 1 1 1 print A 1 3 n vals 1 prints A 1 3 0 printf A 3 1 f n vals 2 prints A 3 1 0 5 print A 3 3 n vals 3 prints A 3 3 1 B 1 Matrix object creation and modification 40 int oski_SetMatClique oski matrix t A tunable const oski index t rows oski index t num rows const oski index t cols oski index t num cols const oski vecview t vals Changes a block of values defined by a clique in a matrix Parameters A tunable output A tunable is valid The object representing some m x n matrix A num rows num cols input num rows gt 1 num cols gt 1 Dimensions of the block of values rows cols input rows ZNULL cols ZNULL Indices defining t
69. iagValues A tunable 1 diag vals view Prints First superdiagonal 1 2 diag vals 0 0 diag vals 1 0 oski GetMatDiagValues A tunable 1 diag vals view printf First superdiagonal f f n diag_vals 0 diag vals 1 oski matrix t oski CopyMat const oski matrix t A tunable Creates a copy of a matrix object Parameters A tunable input A tunable is valid The object representing some m x n matrix A Actions and Returns Returns a new matrix object or INVALID MAT on error The new matrix object is equivalent to the matrix object the user would obtain if she performed the following steps 1 Re execute the original call to oski CreateMatCSR oski CreateMatCSC to create a new untuned matrix object A copy in the copy mode COPY INPUTMAT Thus A copy may exist independently of A tunable and of any data upon which A tunable might depend 2 Get the tuning transformations that have been applied to A tunable by the time of this call Equivalently execute oski GetMatTransformations A tunable and store the result 3 Apply these transformations to A copy Error conditions and actions B 1 Matrix object creation and modification 44 Possible error conditions include an invalid source matrix object ERR BAD_MAT or an out of memory condition while creating the clone ERR OUT_OF_MEMORY Example Let A be the matrix shown in Listing 1 on page 9 and stored in A tunable ass
70. iagonal 2 0 oski_GetMatDiag Values A tunable 1 diag vals view printf First superdiagonal f f n diag_vals 0 diag vals 1 Prints Second subdiagonal 0 5 oski GetMatDiagValues A tunable 2 diag vals view printf Second subdiagonal f n diag_vals 0 int oski_SetMatDiag Values oski_matrix_t A tunable oski index t diag num const oski vecview t diag vals Sets the values along diagonal d of A i e all entries A i j such that j i d Parameters A tunable input output A tunable is valid The m x n matrix A in which to change diagonal entries diag num input 1 m xdiag num n 1 Number d of the diagonal to change diag vals output diag vals is a valid view Let X be the r x s multi vector object into which to store the diagonal values such that s gt 1 and r is at least the length of the diagonal i e r gt min max m n d min m njj Actions and Returns For all j i d such that A i j was an explicitly stored entry when A tunable was created sets A i j X k 1 where k min i j and returns 0 On error returns an error code and leaves A tunable unchanged If the matrix was created as either symmetric or Hermitian including the semantic properties MAT SYMM FULL and MAT HERM FULL this routine also logically changes the correspond ing symmetric diagonal diag num NOTE When A tunable is tuned the tuned data structure may st
71. icance of this shared mode is that in the absence of explicit calls to the tuning routine only one copy of the matrix will exist i e the user may consider A_tunable to be a wrapper or view of the input matrix Properties property 17 through property k7 in this example are optional and the user should specify as many as needed for the library to interpret the non zero pattern cor rectly For instance Listing 1 on page 9 creates a matrix with implicit ones on the diagonal which are not stored so the user must specify MAT UNIT DIAG IMPLICIT as a prop erty A list of available properties appears in Table 3 on page 20 where default properties assumed by the library are marked with a red asterisk The user may create a copy of A tunable by calling oski CopyMat This copy is log ically equivalent to creating a matrix object in the COPY BUFFERS mode The user frees A tunable or its copies by a call to oski DestroyMat In addition to user created matrix objects there is one immutable pre defined matrix object with a special meaning INVALID MAT This matrix is returned when matrix cre ation fails and is conceptually a constant analogous to the NULL constant for pointers in C 5 2 2 Changing matrix non zero values The non zero pattern of the input matrix fixes the non zero pattern of A tunable but the user may modify the non zero values If the input matrix contains explicit zeros the library treats these entries as logical non zer
72. ing static linking The w1 whole archive flag asks the linker to include all of the objects from the list of libraries that follow If linking against OSKI statically this flag is important to ensure all OSKI symbols are available at run time The matching W1 no whole archive flag turns off the whole archive inclusion for subsequent libraries listed on the link command line If you have difficulties or questions please post them at the help forums on the OSKI home page 4 Overview of Tuning by Example The example of Section 3 on page 8 shows the basics of calling OSKI but does not perform any tuning There are two primary tuning styles in OSKI summarized as follows 1 Tuning using explicit workload and structural hints Section 4 1 Any information you can provide a priori is information the library in principle does not need to re discover thereby reducing the potential overhead of tuning In this tuning style you provide the library with structured hints that describe for example the expected workload i e which kernels will be used and how frequently or whether there is special non zero structure e 2 uniformly aligned dense blocks symmetry You then call a tune routine which uses this information to choose a new data structure whose performance is optimized for the specified workload 2 Tuning using an implicit workload Section 4 2 on page 14 The library needs to know the anticipated workload to decide when th
73. input output x view is valid A vector view object to destroy No action is taken if x view is one of the predefined symbolic vectors such as INVALID VEC SYMBOLIC VEC or SYMBOLIC MULTIVEC Actions and Returns B 3 Kernels 47 Returns 0 if the object memory excluding the data on which this object views was successfully freed or an error code otherwise Error conditions and actions Regardless of the return value x_view should not be returned after this call unless x_view is equal to one of the predefined vector constants The global error handler is called on error Possible error conditions include providing an invalid vector ERR BAD_VECVIEW Example See Listing 1 on page 9 and the example for the routine oski CreateMultiVecView oski vecview t oski Copy VecView const oski vecview t x view Creates a copy of the given multi vector view Parameters x view input x view is valid A vector view object to clone Actions and Returns Returns another view object that views the same data as the source view object If x view is one of the symbolic vector constants e g INVALID VEC SYMBOLIC VEC SYMBOLIC MULTIVEC then that same constant is returned and no new object is created On error returns INVALID VEC Error conditions and actions Returns INVALID VEC on error and calls the global error handler Error conditions include speci iying an invalid vector view object ERR BAD_VECVIEW or an o
74. int y ves OU 0s ies 1 z 010203 printf y sf sf y 0 y 1 y 2 printf z sf f f z 0 z 1 z 2 int oski MatPowMult const oski matrix t A tunable oski matop t opA int power oski value t alpha const oski vecview t x view oski value t beta oski vecview t y view oski vecview t T view Computes a power of a matrix times a vector or y a op A x y Also optionally computes T ti t a where tp op A x foralll lt k lt p A tunable input A tunable is valid An object for a matrix A If p gt 1 then A must be square opA input See Table 6 on page 21 Specifies op A power input power gt 0 Power p of the matrix A to apply alpha beta input The scalar constants a 3 respectively x view input x view is a valid single vector View object for the vector z y view input output y view is valid single vector View object for the vector y T view output T view is a valid multivector view of at least p 1 vectors orNULL If non NULL T view is a view object for the multivector T t1 t5 3 Actions and Returns Let A be an n x n matrix The vectors x and y must be single vectors of length n If T is specified via a valid T view object then T must have dimensions n x p 1 If all these conditions are satisfied B 4 Tuning 52 then this routine computes y A 2 3 y tk A a foralll lt k lt p if appr
75. ix object A tunable Similarly any changes to the matrix object do not affect any of the input matrix arrays If the user does not or cannot free the input matrix arrays then two copies of the matrix will exist 2 SHARE INPUTMAT The user and the library agree to share the input matrix arrays subject to the following conditions a The user promises that the input matrix arrays will not be freed or reallocated before a call to oski DestroyMat on the handle A tunable b The user promises not to modify the elements of the input matrix arrays except through the interface s set value routines Section 5 2 2 on the next page This condition helps the library keep any of its internally maintained tuned copies consistent with the input matrix data 5 2 Creating and modifying matrix and vector objects 18 c The library promises not to change any of the values in Aptr Aind This con dition fixes the pattern and maintains the properties of the input matrix data given at creation time Elements of Aval may change only on calls to the set value routines e The library promises to keep the input matrix arrays and any tuned copies syn chronized This condition allows the user to continue to read these arrays as needed That is if the user calls a set value routine to change a non zero value the library will update its internal tuned copy if any and the corresponding non zero value stored in the input matrix array Aval The signif
76. lectively or gradually over time 3 5 Linking The exact procedure for linking depends on your platform and whether you have chosen to use static libraries or shared libraries Listing 2 shows a sample Makefile in that builds Listing 1 on page 9 on a Linux system Take note of a few aspects of this Makefile e OSKI installation directories Lines 4 6 refer to the various OSKI installation paths relative to the default root of usr local see Section 2 2 on page 5 4 Overview of Tuning by Example 12 e Site module files lines 10 and 12 OSKI is divided into a number of smaller mod ules including one module for each of the possible sparse matrix data structures OSKI will choose from when tuning To keep you from having to remember what modules are available in your installa tion the installer places two files in OSKIDIR 1ib oski called site modules shared txt and site modules static txt which list the names of the available shared li braries and static libraries respectively as link flags These files can be pasted right onto the command line as done in this example Setting the run time link path In this example W1 rpath W1 OSKILIBS_DIR sets this executable s run time path to point to the directory containing the OSKI li brary This path can also usually be specified when the program runs by including it in an environment variable like LD_LIBRARY_PATH on Linux or LIBPATH on AIX Including whole modules dur
77. loating point values Select different combinations by adding the conf igure time flag nable lt INT gt lt VAL gt where valid INT types are int and long and valid floating point value types are single double scomplex and dcomplex You may specify any number of combinations as well as disable the default For example to disable the default int double version and instead build both the int single complex and 1ong double complex versions run oe oski 1 0 1h configure prefix OSKIDIR disable int double Y nable int scomplex nable long dcomplex To use these types in your application see Section 5 1 on page 16 2 3 3 Optional tools OSKI can use any of the freely available highly tuned dense BLAS libraries such as AT LAS 23 or Goto s BLAS 9 or commercial hardware vendor implementations The configure script will attempt to detect their presence but you can also manually specify how to link against them with the with blas lt LINK FLAGS gt option For example oe oski 1 0 1h configure prefix OSKIDIR with blas L my blas path lmy_blas asks OSKI to link against the 1ibmy blas a library in the my b1as path directory To obtain high resolution timing OSKI internally uses the hardware cycle counter reader subpackage of FFTW 3 0 1 8 Alternatively you can ask OSKI to use your local installation of the Performance API PAPI cycle counter reader using the configure time
78. matrix assembly Another possible style is to assume an arbitrarily large workload e g always assume 100 SpMV operations or use ALWAYS TUNE AGGRESSIVELY and specify this in formation using a workload hint at matrix assembly time Then the Sparse BLAS implementation would call oski TuneMat before returning from the end assembly routine thus hiding the cost of tuning in the assembly cost This style is the explicit tuning style described in Section 4 1 on page 12 D 2 PEISc 63 A third more explicit tuning style is possible Any OSKI implementation will make its data structure tuning transformations available via the tuning save restore inter face Thus all the tuning logic could be implemented internally in the Sparse BLAS using OSKI only to implement a particular transformation 5 Error handling Section 5 6 on page 29 OSKI provides both error return codes and a way to call a user defined error handler Since the Sparse BLAS uses only error code returns to communicate errors an implementation would simply disable the error handler by passing oski SetErrorHandler a NULL function argument D 2 PETSc The PETSc Portable Extensible Toolkit for Scientific computing library provides a portable interface for distributed iterative sparse linear solvers based on MPI and is a primary in tegration target of our interface 2 1 We are currently implementing such an extension and a pre alpha implementation is available from t
79. mutually exclusive Default properties marked by a red asterisk LAYOUT ROWMAj The multivector is stored in row major format as in C C LAYOUT COLMAJ The multivector is stored in column major format as in Fortran Table 4 Dense multivector dense matrix storage modes type oski storage t Storage modes for the dense multivector creation routines 5 3 Executing kernels 21 oski MatMult Sparse matrix vector multiply SpMV y a op A x where op 4 A AT 4 oski MatTrisolve Sparse triangular solve SpTS x a op A t x oski_MatTransMatMult y a op A x B y where op A AT A AP A AAT AAH oski MatMultAndMatTransMult Simultaneous computation of y 0a A 2 0B y AND z lt w op A w z oski MatPowMult Matrix power multiplication Computes y a op A x 8 y Table 5 Sparse kernels This table summarizes all of the available sparse kernel routines The user selects single or multivector versions by passing in an appropriate vector view Section 5 2 3 on page 19 See Appendix B 3 on page 47 for bindings OP NORMAL Apply A OP_TRANS Apply AT OP CONJ TRANS Apply AP A the conjugate transpose of A Table 6 Matrix transpose options type oski matop t Constants that allow a user to apply a matrix A its transpose A or for complex valued matrices its conjugate transpose A These options are called op A in Table 5
80. nal explicit zeros to improve performance The user should avoid changing entries that were not explicitly stored when A tunable was created Error conditions and actions Possible error conditions include 1 Invalid matrix ERR BAD MATT 2 The position row col is out of range ERR BAD ENTRY 3 The position row col was not explicitly stored when A tunable was created i e the spec ified entry should always be logically zero ERR ZERO ENTRY This condition cannot always be enforced e g if tuning has replaced the data structure and freed the original so this error will not always be generated 4 Changing row col would violate one of the asserted semantic properties when A tunable was created ERR INMATPROP CONFLICT For instance suppose A i j is in the upper triangle of a matrix in which MAT TRI LOWER was asserted is an error condition or suppose the caller asks ME a diagonal element to a non unit value when MAT UNIT DIAG IMPLICIT was asserted Example 1 2 First create A 2 1 5 0 5 0 a sparse symmetric matrix with a unit diagonal 1 int Aptr 1 3 3 3 Aind 1 2 Uses 1 based indices B 1 Matrix object creation and modification 41 double Aval 2 0 5 oski matrix t A_tunable oski CreateMatCSR Aptr Aind Aval 3 3 SHARE INPUTMAT 2 MAT SYMM UPPER MAT UNIT DIAG IMPLICIT Clique of values to change using 1 based indices to match matrix 1 125
81. ng dimension or stride as is possible with the dense BLAS Thus users who need the BLAS can continue to mix BLAS operations on their data with calls to the OSKI kernels In addition to user created vector views we define two special immutable vector view objects SYMBOLIC VEC and SYMBOLIC MULTIVEC Conceptually these objects are constants that may be used with the tuning workload specification routines to indicate tuning for single vectors or multivectors instead of specifying instantiated view objects See Section 5 4 on page 22 5 3 Executing kernels We summarize the available kernels in Table 5 on page 21 and present their bindings in Appendix B 3 on page 47 In addition to both single vector and multivector versions of sparse matrix vector multiply oski MatMult and sparse triangular solve oski Mat Trisolve we provide interfaces for three high level sparse kernels that provide more opportunities to reuse the elements of A e Simultaneous multiplication of A and AT or A by a dense multivector oski Mat MultAndMatTransMult e Multiplication of AT A or A AT or conjugate transpose variants by a dense mul tivector oski MatIransMatMult e Multiplication of a non negative integer power of a matrix oski MatPowMult We have recently reported on experimental justifications and suggested implementations for these kernels 21 22 5 3 Executing kernels 20 MAT GENERAL MAT TRI UPPER MAT TRI LOWER MAT
82. ns Possible error conditions include 1 Any of the argument preconditions above are not satisfied ERR BAD ARC 2 More than 1 property from the same group are specified see Table 3 on page 20 ERR_IN MATPROP_CONFLICT 3 The input matrix arrays do not correspond to a valid CSC representation ERR NOT_CSC or are incompatible with any of the asserted properties ERR FALSE INMATPROP oski_value_t oski GetMatEntry const oski matrix t A tunable oski index t row oski_index t col Returns the value of a matrix element Parameters A tunable input A tunableis valid The object representing some m x n matrix A row col input 1 row xm 1 lt col lt n Specifies the element whose value is to be returned The precondition above must be satisfied Note that matrix entries are referenced using 1 based indices regardless of the convention used when the matrix was created Actions and Returns If row and col are valid then this routine returns the value of the element A row col Otherwise it returns NaN VALUE Error conditions and actions Possible error conditions include 1 Invalid matrix ERR BAD MAT 2 Position row col is out of range ERR BAD ENTRY Example Let A be the matrix shown in Listing 1 on page 9 and stored in A tunable The following should prints A 22 1 A 2 3 0 and A 3 1 5 B 1 Matrix object creation and modification 38 printf A 2 2 n oski_
83. ns little or no dense block substructure HINT_SINGLE_BLOCKSIZE int r c Matrix structure is dominated by a single block size r x c HINT MULTIPLE BLOCKSIZES int k r c1 eng Tks cr Matrix structure consists of at least k gt 1 multiple block sizes These sizes include r x Cl Tk X Ck HINT ALIGNED BLOCKS none Any dense blocks are uniformly aligned That is let i j be the 1 1 element of a block of size rxc Then 1 mod r j 1 mod c 0 HINT UNALIGNED BLOCKS none Any dense blocks are not aligned or the alignment is unknown HINT SYMM PATTERN none The matrix non zero pattern is structurally symmetric or nearly SO HINT_NONSYMM_PATTERN none The matrix non zero pattern is structurally very unsymmetric HINT RANDOM PATTERN none The matrix non zeros or non zero blocks are nearly dis tributed uniformly randomly over all positions HINT_CORRELATED_PATTERN none The row indices and column in dices for non zeros are highly cor related HINT_NO_DIAGS HINT_DIAGS none int k d ves n The matrix contains little if any explicit diagonal structure The matrix has structure best represented by multiple diago nals The diagonal lengths in clude dj d possibly among other lengths Table 9 Available structural hints type oski_tunehint_t The user may provide ad ditional hints
84. ns of the input matrix Aptr Aind Aval input Aptr Aind Aval ZNULL The input matrix pattern and values must correspond to a valid CSR representation as defined in Appendix A on page 34 mode input See Table 2 on page 19 Specifies the copy mode for the arrays Aptr Aind and Aval See Section 5 2 1 on page 16 for a detailed explanation k input k 20 The number of qualifying properties property 1 property k input optional See Table 3 on page 20 The user may assert that the input matrix satisfies zero or more properties listed in Table 3 on page 20 Grouped properties are mutually exclusive and specifying two or more properties from the same group generates an error see below The user must supply exactlyk properties Actions and Returns A valid tunable matrix object or INVALID MAT on error Any kernel operations or tuning opera tions may be called using this object Error conditions and actions Possible error conditions include 1 Any of the argument preconditions above are not satisfied ERR BAD ARC 2 More than 1 property from the same group are specified see Table 3 on page 20 ERR IN MATPROP CONFLICT 3 The input matrix arrays do not correspond to a valid CSR representation ERR NOT CSR or are incompatible with any of the asserted properties ERR FALSE INMATPROP As an example of the latter error if the user asserts that the matrix is symmetric but the number of rows is not equal to the number
85. nts For instance suppose that we cannot know statically the number of iterations that the innermost while loop executes in Listing 4 on the following page At run time the library implementation can log the calls to oski MatMult so that if and when the application calls oski TuneMat the library can make an educated guess about whether SpMV is called a sufficient number of times to hide the cost of tuning 5 Guide to the Complete Interface The available library routines fall into 6 broad categories summarized as follows 1 Initializing OSKI You must call oski Init as shown in Listing 1 on page 9 before calling other routines in the library 10 5 Guide to the Complete Interface 15 Listing 4 An example of implicit tuning This example calls oski TuneMat periodically without explicitly providing any hints At each call to oski TuneMat the library poten tially knows more and more about how the user is using A tunable and may therefore tune accordingly oski matrix t A tunable oski CreateMatCSR oski vecview t x view oski CreateVecView oski_vecview_t y view oski CreateVecView oski_SetHint A tunable HINT SINGLE BLOCKSIZE 6 6 i6 for i 20 i num times i FF iss while converged is kar oski MatMult A tunable OP NORMAL 1 0 x view 1 0 y view TF o oski_TuneMat A tunable maybe change a few non zero values for the next solve 2 Creatin
86. o use we are specifying it explicitly We illustrate the usage of these two routines in Listing 6 and Listing 7 on the following page 22 5 6 Handling errors 29 Listing 7 An example of applying transformations This example reads a string repre sentation of tuning transformations from a file and applies them to an untuned matrix FILE fp_saved_xforms fopen my_xform txt rt x text file for input Buffer for storing transformation read from the file The actual size of this buffer should should be the file size but we use some maximum constant here for illustration purposes char xforms SOME MAX BUFFER SIZE int num chars read oski matrix t A tunable oski CreateMat CSR Read transformations num chars read fread xforms sizeof char SOME MAX BUFFER SIZE 1 fp_saved_xforms xforms num chars read NULL Execute one un tuned matrix multiply oski MatMult A tunable Change matrix data structure oski ApplyMatTransforms A tunable xforms Now perform matrix multiply in the new data structure oski MatMult A tunable Mdots 5 6 Handling errors The OSKI interface provides two methods for handling errors 1 Error code returns Many of the routines in the interface return an integer error code of type int All of the possible error codes are negative providing a simple way for an application to check for an err
87. oit exceptions ss se we kk Rok RR km e Rem Re o A n a a a e a a ao a i aa a aa G ai 54 1 Providing workload hints explicitly 542 Providing structural Rints 2 os o RA RE 255 Miann RENE es cues oni cee eo See eon e Saeed 5 4 4 Accessing the permuted form aaao 5 5 Saving and restoring tuning transformations De Mondine ODE cs ra ER EN RS ER ERREAG TEC XA GIUER SE LDPE 6 Troubleshooting ml Taslanan problems e coa cars ION OE X AA A 6 me AA eti 62 a SEEOE cond eS exce e Eoo o Yen Ree dU me ub A E eX e eS bo Tuning ISS e r rares RR Rer oe de eek om RR OS 64 Pre built synthetic benchmarks 4 6524464 20 be we ee REG NNNNG ER BO Oo 00 10 11 12 12 14 LIST OF TABLES 3 References 32 A Valid input matrix representations 34 Bindings Reference 35 B 1 Matrix object creation and modification osos RR x x n 35 B2 Yecorobect GIBSON a uus dede 9 ngo cdd ey dex dC CR RO eee 45 Ba Kemmel o ei a eh AY oe ee ee ee ee ee mi 47 Ba TMG oo vice ee eh ve Pap Dapa pu ii tpg dig eee pile 52 BS Permutations e o 6 6 see be Awe poA UR eee Awe OE ee 56 B 6 Saving and restoring tuning transformations 58 Br BDOPRDABUBOE Lungo ado Ee RRS RE BSS ERS LSS 59 C Mixing Types 60 OSKI Library Integration Notes 61 PD Opare Pac Ce A EORR pup X aue d auper Ses PES 61 DE gc APP LS ED CHES ERE EE RSS EDS SAS ESS 63 ok ick Bee vue BO e Ae ad ee ols SARNA 64 DI Kokkos UlnoBh e e sersa a S
88. opriate and returns 0 Otherwise no action is taken and an error code is returned Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD_ARG ERR BAD MAT ERR BAD VECVIEW and incompatible operand dimensions ERR DIM MISMATCH Example 23 d UU First create a matrix A 0 75 0 in CSR format using 1 based indices do de d int Aptr 2 3 6 ji 1 based 1 int Aind 1 2 1 2 3 1 based double Aval 25 75 75 25 1 a A tunable oski_CreateMatCSR Aptr Aind Aval 3 3 SHARE INPUTMAT 0 Create a vector a 1 1 1 doublex 1 1 1 y oski_vecview_t x view oski_CreateVecView x 3 STRIDE UNIT Result vector y double y 1 1 1 oski vecview t y view oski CreateVecView y 3 STRIDE UNIT Storage space to keep intermediate vectors T t t2 lnipally leti 1 l 1d 2 2 2 double T 1 1 1 2 2 2 incolumn major order oski vecview t T view oski CreateMultiVecView T 3 2 LAYOUT COLMA 3 Compute y A x and the intermediate vectors Tui oski MatPowMult A tunable OP NORMAL 3 1 0 x view 0 0 y view T view Print results ELeSAx2 10229 027532 17 2 AC2 x 0 0625 0 5625 2 375 PE y A 8 x 0 015625 0 421875 2 x bL printf t1 Axx S S ant T 0 T 1 T 2 printf t2 A 2 x amp f bf d
89. or on return from any OSKI routine 2 Error handling routines The library calls an error handling routine when an error occurs By default this routine is oski HandleErrorDefault However the user may also replace this routine with her own handler or replace it with NULL to disable error handler calling entirely When an error condition is detected within a OSKI routine it is always handled using the following procedure e The routine calls the current error handler e Regardless of which error handler is called if any the routine may return an error code A user may change the error handler by calling oski SetErrorHandler or retrieve the current error handler by calling oski GetErrorHandler The error handling routines are summarized in Table 15 on the next page An error handling routine has the following signature oski errhandler t void your handler int error code const charx message const charx source filename unsigned long line number const char format string 507 508 510 512 6 Troubleshooting 30 oski_GetErrorHandler Returns a pointer to the current error han dling routine oski_SetErrorHandler Changes the current error handling routine to a user supplied handler oski HandleErrorDefault The default error handler provided by the li brary Table 15 Error handling routines Bindings appear in Appendix B 7 on page 59 The first 4 parameters describe the error
90. ore additional explicit zeros to improve performance The user should avoid changing entries that were not explicitly stored when A tunable was created If the user attempts to change such an entry by specifying a non zero value in a corresponding entry of diag vals the value may or may not be changed B 1 Matrix object creation and modification 43 Error conditions and actions Possible error conditions include 1 Providing an invalid matrix ERR BAD MATT 2 Providing an invalid vector view or a vector view with invalid dimensions ERR BAD_ VECVIEW 3 Specifying an invalid diagonal ERR BAD ENTRY 4 Specifying the main diagonal when A tunable was created with MAT UNIT DIAG IMPLI CIT Example 1 2 6 First create A 2 1 25 asparse symmetric matrix with a unit diagonal 5 0 1 int Aptr 1 3 4 4 Aind 1 2 3 Uses 1 based indices double Aval 2 0 5 0 25 double diag vals 0 0 0 oski vecview t diag vals view oski_Create Vec View diag vals 3 STRIDE UNIT oski matrix t A tunable oski CreateMat CSR Aptr Aind Aval 3 3 SHARE INPUTMAT 2 MAT SYMM UPPER MAT UNIT DIAG IMPLICIT Prints First superdiagonal 2 0 25 oski GetMatDiagValues A tunable 1 diag vals view printf First superdiagonal f f n diag_vals 0 diag vals 1 Change first superdiagonal to be 1 2 diag vals 0 1 diag vals 1 2 oski SetMatD
91. os whose values may be modified later We provide several routines to change non zero values To change individual entries the user may call oski SetMatEntry and to change a block of values defined by a a clique the user may call oski SetMatClique If A tunable is a shallow copy of the user s matrix the user s values array will also change Logical non zero values are subject to properties asserted at matrix creation time see Appendix B 1 on page 35 We also define primitives for obtaining all of the values along an arbitrary diagonal and storing them into a dense array oski GetMatDiag Values and for setting all of the non zero values along an arbitrary diagonal from a dense array oski SetMatDiagValues The same restriction on altering only non zero values in the original matrix applies for these routines Tuning may select a new data structure in which explicit zero entries are stored that were implicitly 0 e not stored in the input matrix The behavior if the user tries to change these entries is not defined for two reasons First allowing the user to change these entries would yield inconsistent behavior across platforms for the same matrix since whether a filled in entry could be changed would depend on what data structure the library chooses Second requiring that the library detect all such attempts to change these entries might in the worst case require keeping a copy of the original input matrix pattern creating memory o
92. patible but any dimension is 0 this routine returns 0 but y view is left unchanged Otherwise returns an error code and leaves y view unchanged Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD ARG ERR BAD MAT ERR BAD VECVIEW or incompatible input output operand dimensions ERR DIM MIS MATCH Example See Listing 1 on page 9 int oski MatTrisolve const oski matrix t T tunable oski matop t opT oski value t alpha oski vecview t x view Computes z a op T x where T is a triangular matrix Parameters T tunable input T tunable is valid square and triangular Matrix object for an n x n upper or lower triangular matrix T opT input See Table 6 on page 21 Specifies op T alpha input Scalar constant a x_view input output x_view is valid View object for a multi vector z Actions and Returns If op T and x have compatible dimensions computes x a op T 7 x and returns 0 Otherwise returns an error code B 3 Kernels 49 Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD_ARG ERR BAD MAT ERR_BAD_VECVIEW and incompatible operand dimensions ERR DIM MISMATCH Example Let A tunable be object corresponding to the the sparse lower triangular matrix A shown in Listing 1 on page 9 The following example solves A x b where iP sil 3 double x
93. ries to detect your operating system and CPU and chooses a compiler and default compiler optimization flags accordingly These choices will be displayed when the script finishes If you wish to specify these flags yourself or if configure is unable to choose a compiler or default flags see Section 2 3 1 on the following page This step as shown builds a int double precision library by default To build other type combinations see Section 2 3 2 on the next page 4 Compile OSKI o make Compilation times vary widely across platforms and could take anywhere between half an hour to a several hours This is a good time for that much needed coffee break 5 Benchmark OSKI o make benchmarks Like compilation benchmark times vary widely across machines Allow for approx imately half an hour to an hour The sufficiently curious may view the raw benchmark data located in bench dat files in the build tree Section 6 3 on page 30 describes these files in additional detail 6 OPTIONAL Test OSKI o make check This step runs your build of OSKI through an extensive series of tests The test bat tery includes all precision combinations selected at configure time These tests are designed primarily for developer regression testing and could take an hour or more to run 7 Install OSKI o make install This step installs all of the OSKI files and benchmarking data into the subdirectories of OSKIDIR specifi
94. rogram on the same or similar input matrix thereby providing a way to save and restore tuning transfor mations across application runs in the spirit of FFTW s wisdom mechanism 7 More over the save restore facility is an additional way for an advanced user to specify her own sequence of optimizing transformations 5 1 Basic scalar types 16 The interface itself does not define the format of these string based transformations However we suggest a procedural high level scripting language OSKI Lua de rived from the Lua language 12 for representing such transformations We pro vide a high level overview in the OSKI design document 20 We expect the details of this language to be continually refined in future releases 6 Error handling Section 5 6 on page 29 Table 15 on page 30 In addition to the error codes and values returned by every routine in the interface a user may optionally specify her own handler to be called when errors occur to access additional diagnos tic information Tables 1 15 summarize the available routines A user who only needs BLAS like kernel functionality for her numerical algorithms or applications only needs to know about the object creation and kernel routines Categories 1 and 2 above Although tuning i e Cat egories 3 and 4 is an important part of our overall design its use is strictly optional The C bindings are presented in detail in Appendix B on page 35 The following text provides
95. s and Returns Returns a valid multivector view on the data stored in X or INVALID VEC on error Error conditions and actions Returns INVALID VEC and calls the global error handler on an error Possible error conditions include 1 Any of the above argument preconditions are not satisfied ERR BAD ARC 2 The leading dimension is invalid for the specified storage orientation ERR BAD LEAD DIM Example Let A be the matrix shown in Listing 1 on page 9 and stored in A tunable assuming zero based indices 1 1 20 25 LetY y y2 1 1 initially and let X x 22 45 45 1 1 65 65 The following example computes Y Y A X double Y 1 1 1 1 1 1 in row major order oski vecview t Y view oski CreateMultiVecView Y 3 2 LAYOUT ROWMAJ 2 double X 25 45 65 25 45 65 in column major order oski_vecview_t X view oski_CreateMultiVecView X 3 2 LAYOUT COLMA 3 oski MatMult A tunable OP NORMAL 1 X view 1 Y view Views no longer needed oski Destroy VecView X view oski Destroy VecView Y view Print result Should be Pl 12550985 17278 I7 1 25 0 95 1 775 printf bie S Sf Sf n Y O Y 2 Y 4 printf y2 Sf Nn Y 1 Y 3 Y 5 int oski Destroy VecView oski vecview t x view Destroy a vector view Parameters x view
96. sfied ERR BAD_ARG Example See Listing 1 on page 9 oski_vecview_t oski CreateMultiVecView oski value tx X oski index t length oski index t num vecs oski storage t orient oski index t lead dim Creates a view on k dense column vectors X z1 stored as a submatrix in the user s data Parameters length input length gt 0 Number of elements in each column vector num vecs input num vecs gt 0 The number of column vectors i e k as shown above orient input Specifies whether the multivector is stored in row major LAYOUT ROWMAJ or column major LAYOUT _COLMAJ order lead dim input lead dim gt 0 This parameter specifies the leading dimension as specified in the BLAS standard The leading dimension is the distance in X between the first element of each row vector and must be at least B 2 Vector object creation 46 num_vecs if orient LAYOUT_ROWMAJ If instead orient LAYOUT_COLMAJ then the lead ing dimension i is the distance in X between the first element of each column vector and must be at least length X input X ZNULL Pointer to the user s dense array representation of X Foreach1 lt i lt lengthand1 lt j lt num_vecs element z the i element of the j column is stored at one of the following positions 1 If orient LAYOUT ROWMAJ then x is stored at element X i lead dim j 2 If orient LAYOUT COLMAJ then x is stored at element X i lead dim Action
97. ski_SetMatEntry A_tunable 3 1 5 printf A 1 3 f n oski_ GetMatEntry A_tunable 1 3 prints A 1 3 0 5 printf A 3 1 f n oski GetMatEntry A tunable 3 1 prints A 3 1 0 5 int oski_GetMatClique const oski_matrix_t A_tunable const oski index t rows oski index t num rows const oski index t cols oski index t num cols oski vecview t vals Returns a block of values defined by a clique from a matrix Parameters A tunable input A tunable is valid The object representing some m x n matrix A num rows num cols input num rows gt 1 num cols gt 1 Dimensions of the block of values rows cols input rows ZNULL cols ZNULL Indices defining the block of values The array rows must be of length at least num rows and cols must be of length at least num cols The entries of rows and cols must satisfy 1 lt rows lt m for all 0 lt i lt num rows and 1 lt cols j lt n for all 0 lt j lt num cols vals output vals is valid The object vals is a multivector view see Section 5 2 3 on page 19 of a logical two dimensional array to be used to store the block of values We use a view here to allow the user to specify row or column major storage and the leading dimension of the array Actions and Returns Let X be the num rowsxnum cols matrix corresponding to vals If rows and cols are valid as discussed above then this routine sets X r c A i j where
98. stings 1 What is OSKI and Who Should Use It 2 Installation 21 Whatyouwilneedtogetstarted cs cee de dee ee Ha RY ES 22 How toinstalOSKI osr geom ek ee ede be wee A wea 23 Customizing your OSKI build using configure sss 2 3 1 Overriding the default compiler and or flags 2 3 2 Selecting other or additional scalar type precisions Z4 Qptionaltools ud ve tae eee ene bah eo Rr Roo RR 3 Using OSKE A First Example Dl INESIS uda wore or dub eR BERS td UR ob OR UR LAG ER UR Oc LISIIDIES MANDA ciebat ec be be Pee x4 orc yo Xx ea BORSE RO Rye bou oo pec ying he dense VECIONS e so cp dorade e EROR REESE Bs 3 4 Calling sparse matrix vector multiply 64 6 6s ce ee ro E 35 IONS IR 4 Overview of Tuning by Example 41 Tuning style 1 Providing explicit hints ek bee n nmm RR n 42 Tuningstyle 2 Implicit profiling lt acs og o EE RE 5 Guide to the Complete Interface 51 Basi scalor ea roro xG ERRARE VETE ERERESPQs 5 2 Creating and modifying matrix and vector objects S21 Creslng malus Obes uox ooo oe MEDS Kee oo ee en 522 Changing matrix non zero Vales sco om Rh Bes a Lar en poe ea eed Vox Bor S phiri nd AN o os eae 6 6 8 b ee de oe red uk ox ror P auis p Ard 5 3 1 Applying the transposeofamatrix less A oc crudi eod oer dot Fo qo Soo e d oe C eR A Do BSulusve xm copes cosacos ARA A 5 3 4 Compatible dimensions for matrix multiplication 5 25 Ploslhng p
99. t for a single vector objects oski CreateMultiVecView Create a view object for a multivector oski Copy VecView Clone a vector view object oski Destroy VecView Free a multi vector view object Table 1 Creating and modifying matrix and vector objects Bindings appear in Ap pendix B 1 on page 35 At present we support 0 and 1 based CSR and CSC representations for the input ma trix For detailed definitions of valid input formats refer to Appendix A on page 34 All of the supported input matrix formats use array representations and a typical call to create a matrix object from say CSR format looks like A tunable oski CreateMatCSR Aptr Aind Aval num rows num cols copy mode lt k gt property l lt property k gt y where A tunable is the newly created matrix object Aptr Aind and Aval are user created arrays that store the input matrix here in a valid CSR format copy mode specifies how the library should copy the input matrix data and property 1 through property k specify how the library should interpret that data To make memory usage logically explicit the interface supports two data copy modes These modes defined by the scalar type oski copymode t Table 2 on page 19 are 1 COPY INPUTMAT The library makes a copy of the input matrix arrays Aptr Aind and Aval The user may modify or free any of these arrays after the return from oski CreateMat without affecting the matr
100. t num calls Workload hint for the kernel operation oski_MatTrisolve which computes x o op T x where T is a triangular matrix Parameters T tunable input output T tunable is valid square and triangular Matrix object for an n x n upper or lower triangular matrix T opT input See Table 6 on page 21 Specifies op T alpha input Scalar constant a x_view input x_view is valid or symbolic see Table 11 on page 25 View object for a multi vector z num calls input num calls is non negative or symbolic see Table 10 on page 25 The number of times this kernel will be called with these arguments Actions and Returns Registers the workload hint with A tunable and returns 0 only if the dimensions of op T and x have compatible dimensions Otherwise returns an error code Error conditions and actions Possible error conditions include unsatisfied argument preconditions ERR BAD ARG ERR BAD MAT ERR BAD VECVIEW and incompatible operand dimensions ERR DIM_ MISMATCH int B 4 Tuning 54 oski SetHintMatTransMatMult oski matrix t A tunable oski_ataop_t opA oski value t alpha const oski vecview t x view oski value t beta const oski vecview t y view const oski vecview t t view int num calls Workload hint for the kernel operation oski MatTransMatMult which computes y a op A xz B y where op A AAT AT A AA AP A and also optionally computes to A xif op A E A
101. tity matrix J then this routine returns 1 If P P I then this routine returns 0 Returns an error code on error Error conditions and actions Possible error conditions include providing an invalid matrix ERR BAD MAT Example See Listing 5 on page 27 const oski matrix t oski ViewPermutedMat const oski matrix t A tunable Given a matrix A possibly reordered during tuning to the form P A PT returns a read only object corresponding to A Parameters A tunable input A tunable is valid A matrix object corresponding to some matrix A Actions and Returns Returns a read only matrix object representing A This return is exactly equal to A tunable if the matrix is not reordered i e if P P I the identity matrix Returns INVALID MAT on error Error conditions and actions Possible error conditions include providing an invalid matrix ERR BAD MATT Example See Listing 5 on page 27 const oski perm t oski ViewPermutedMatRowPerm const oski matrix t A tunable Given a matrix A possibly reordered during tuning to the form P A PT returns a read only object corresponding to P Parameters A_tunable input A tunable is valid A matrix object corresponding to some matrix A Actions and Returns Returns a read only permutation object representing P This return is exactly equal to PERM IDENTITY if the matrix is not reordered i e if P P I the identity matrix Returns INVA
102. ture TUNESTAT NEW or not TUNESTAT AS IS 5 4 4 Accessing the permuted form The interface defines several routines Table 13 on the next page that allow the user to determine whether the library has optimized kernel performance by reordering the rows 5 4 Tuning 26 TUNESTAT NEW The library selected a new data structure for the matrix based on the current workload data and hints TUNESTAT AS IS Thelibrary did not change the data structure Table 12 Tuning status codes Status codes returned by oski TuneMat in the event that no error occurred during tuning oski IsMatPermuted Determine whether a matrix has been tuned by reordering its rows or columns oski ViewPermutedMat Returns a read only matrix object for the re ordered copy of A oski ViewPermutedMatRowPerm Returns the row permutation P oski ViewPermutedMatColPerm Returns the column permutation P oski PermuteVecView Apply a permutation object or its inverse transpose to a vector view Table 13 Extracting and applying permuted forms If tuning produces a tuned matrix A P A PT the above routines allow the user to detect and to extract read only views of P P and A and apply the permutations P and P Bindings appear in Appendix B 5 on page 56 and columns of the matrix by calling oski IsMatPermuted to extract the corresponding permutations oski ViewPermutedMat oski ViewPermutedMatRowPerm oski View
103. uming zero based indices int rows 0 2 int cols 0 2 E double vals 1 1 1 1 oski vecview t vals view oski_CreateMultiVecView vals 2 2 LAYOUT ROWMA 2 oski matrix t A copy For testing purposes record and print a 2x2 clique of values oski GetMatClique A tunable rows 2 cols 2 vals view printf A 1 1 An vals 0 prints A 1 1 1 printf A 1 3 f n vals 1 prints A 1 3 0 printf A 3 1 n vals 2 prints A 3 1 0 5 printf A 3 3 n vals 3 prints A 3 3 1 Create a clone A copy oski CopyMat A tunable The clone is independent of the original so we may delete the original oski DestroyMat A tunable Clear temporary clique value storage memset vals 0 sizeof double x 4 clear vals array printf vals 0 f n vals 0 prints vals 0 0 printf vals 1 f n vals 1 prints vals 1 0 printf vals 2 f n vals 2 prints vals 2 0 printf vals 3 f n vals 3 prints vals 3 0 Verify that the correct values were copied oski GetMatClique A copy rows 2 cols 2 vals view printf A 1 1 f n vals 0 prints A 1 1 1 prints A 1 3 0 printf A 1 3 n vals 1 printf A 3 1 n vals 2 prints A 3 1 0 5 printf A 3
104. ut of memory condition ERR UT OF MEMORY Example Let A z and y be as specified in Listing 1 on page 9 and stored in A tunable x view and y view respectively Make a copy of the original view on x oski_vecview_t x copy view oski Copy VecView x view Dispose of original view oski Destroy VecView x view Multiply with the copy oski MatMult A tunable OP NORMAL 1 x copy view 1 y view Finished with all objects oski DestroyMat A tunable oski Destroy VecView x copy view oski Destroy VecView y view Print result y Should be 75 1 05 225 printf Answer y f f Wn y 0 y 1 y 2 B 3 Kernels int oski MatMult const oski matrix t A tunable oski matop t opA B 3 Kernels 48 oski_value_t alpha const oski_vecview_t x view oski_value_t beta oski_vecview_t y_view Computes y a op A x B y where op A A AT A Parameters A tunable input A tunable is valid An object for a matrix A opA input See Table 6 on page 21 Specifies op A alpha beta input Scalar constants a 8 respectively x_view input x_view is valid View object for a multi vector x y view input output y view is valid View object for a multi vector y Actions and Returns Computes y a op A 8 y and returns 0 only if the dimensions of op A x and y are compatible If the dimensions are com
105. ve the performance of codes on hierarchical memory machines tech rep Numerical Analy sis Group Oxford University Computing Laboratory Wolfson Building Parks Road Oxford OX1 3QD 1995 5 E CUTHILL AND J MCKEE Reducing the bandwidth of sparse symmetric matrices in Proceedings of the ACM National Conference 1969 6 I S DUFF M A HEROUX AND R POZO An overview of the sparse basic linear al gebra subprograms The new standard from the BLAS technical forum ACM Trans actions on Mathematical Software 28 2002 pp 239 267 7 M FRIGO AND S JOHNSON FFTW An adaptive software architecture for the FFT in Proceedings of the International Conference on Acoustics Speech and Signal Pro cessing Seattle Washington May 1998 8 FFTW 3 home page 2005 fftw org 9 K Goro High Performance BLAS 2005 www cs utexas edu users flame goto 10 G HEBER R BISWAS AND G R RAO Self avoiding walks over adaptive unstruc tured grids Concurrency Practice and Experience 12 2000 pp 85 109 11 M A HEROUX R A BARTLETT V E HOWLE R J HOEKSTRA J J Hu T G KOLDA R B LEHOUCQ K R LONG R P PAWLOWSKI E T PHIPPS A G SALINGER H K THORNQUIST R S TUMINARO J M WILLENBRING A WILLIAMS AND K S STANLEY An Overview of the Trilinos Project ACM Trans actions on Mathematical Software 31 2005 pp 397 423 12 R IERUSALIMSCHY L H DE FIGEIREDO AND W CELES Lua 5 0 Reference M
106. verhead The specifications in Appendix B 1 on page 35 allow but do 5 3 Executing kernels 19 SHARE INPUTMAT User and library agree to share the input matrix ar rays COPY INPUTMAT The library copies the input matrix arrays and the user may free them immediately upon return from the handle creation routine Table 2 Copy modes type oski copymode t Copy modes for the matrix creation rou tines as discussed in detail in Section 5 2 1 on page 16 not require the library implementation to report attempts to change implicit zeros to non zero values as errors 5 2 3 Vector objects Vector objects type oski vecview t are always views on the user s dense array data Such objects may be views of either single column vectors created by a call to oski CreateVec View or multiple column vectors multivectors created by a call to oski CreateMulti VecView A multivector consisting of k gt 1 vectors of length n each is just a dense n x k matrix but we use the term multivector to suggest a common case in applications in which k is on the order of a small constant e g 10 or less A single vector is the same as the multivector with k 1 This interface expects the user to store her multivector data as a dense matrix in either row major C default or column major Fortran default array storage Table 4 on the fol lowing page The interface also supports submatrices by allowing the user to provide the leadi
107. via a call to the routine oski_SetHint about the non zero structure of the matrix which may be useful to tuning Several of the hints allow the user to specify addi tional arguments shown in column 2 All arguments are optional The table groups hints into 5 mutually exclusive sets e g a user should only specify one of HINT NO BLOCKS HINT_SINGLE_BLOCKSIZE and HINT MULTIPLE BLOCKSIZES if she specifies any of these hints at all 5 4 Tuning 25 ALWAYS TUNE The user expects many calls and the li brary may therefore elect to do some basic tuning ALWAYS TUNE AGGRESSIVELY The user expects a sufficient number of calls that the library may tune aggressively Table 10 Symbolic calling frequency constants type int Instead of providing a numer ical estimate of the number of calls the user expects to make when specifying a workload hint the user may use one of the above symbolic constants SYMBOLIC VEC A symbolic single vector view SYMBOLIC MULTIVEC A symbolic multivector view consisting of at least two vectors Table 11 Symbolic vector views for workload hints type oski vecview t Instead of passing an actual vector view object to the workload hint routine Table 8 on page 23 the user may pass in one of the above symbolic views dominated by dense blocks of a particular size Rather than just indicate the presence of a single block size by the following call oski SetHint A tunable

Download Pdf Manuals

image

Related Search

Related Contents

Accuview WM170E touch screen monitor  Istruzioni d`uso - VEGADIS 176 - Indicatore digitale senza  Sony Notebook Benutzerhandbuch  くらなび通信52号 - ふくい・くらしの研究所  LF Series  IDG-2000 EVB User Guide  SERVICE MANUAL  Pegasus Manual - Ctechinfo.net  Nicast  OpenVox V100 User Manual  

Copyright © All rights reserved.
Failed to retrieve file