
SMP Superscalar (SMPSs) User's Manual, Version 1.0
Barcelona Supercomputing Center
Centro Nacional de Supercomputación
July 2007


Table of Contents

1 Introduction
2 Installation
2.1 Compilation requirements
2.2 Compilation
2.3 User environment
2.4 Compiler environment
3 Programming with SMPSs
3.1 Task selection
3.2 SMPSs syntax
4 Compiling
4.1 Usage
4.2 Examples
5 Setting the environment and executing
5.1 Setting the number of CPUs and executing
6 Programming examples
6.1 Matrix multiply
7 Internals of SMP Superscalar
8 Advanced features
8.1 Using Paraver
8.2 Configuration file
9 References

1 Introduction

This document is the user manual of the SMP Superscalar (SMPSs) framework, which is based on a source-to-source compiler and a runtime library. The supported programming model allows programmers to write sequential applications, and the framework is able to exploit the existing concurrency and to use the different cores of a multi-core or SMP machine by means of automatic parallelization at execution time.
The requirements we place on the programmer are that the application is composed of coarse-grained functions (for example, by applying blocking) and that these functions do not have collateral effects (only local variables and parameters are accessed). These functions are identified by annotations, somewhat similar to the OpenMP ones, and the runtime will try to parallelize the execution of the annotated functions (also called tasks). The source-to-source compiler separates the annotated functions from the main code, and the library calls the annotated code. However, an annotation before a function does not indicate that this is a parallel region, as it does in OpenMP. The annotation just indicates the direction of the parameters: input, output, or inout.

To be able to exploit the parallelism, the SMPSs runtime takes this information about the parameters and builds a data dependency graph, where each node represents an instance of an annotated function and edges between nodes denote data dependencies. From this graph, the runtime is able to schedule independent nodes for execution on different cores at the same time. Techniques imported from the computer architecture area, such as data dependency analysis, data renaming and data locality exploitation, are applied to increase the performance of the application.

While OpenMP explicitly specifies what is parallel and what is not, with SMPSs what is specified are functions whose invocations could be run in parallel, depending on the data dependencies. The runtime will find the data dependencies and will determine, based on them, which functions can run in parallel with others and which cannot. Therefore, SMPSs provides programmers with a more flexible programming model, with an adaptive parallelism level that depends on the application input data.
2 Installation

SMP Superscalar is distributed in source code form and must be compiled and installed before using it.

2.1 Compilation requirements

The SMP Superscalar compilation process requires the following system components:
- GCC 4.0 or later
- GNU make
- Optional: automake, autoconf, libtool, bison, flex

2.2 Compilation

To compile and install SMP Superscalar, please follow these steps:

1. Decompress the source tarball:
> tar -xvzf smpss-1.0.tar.gz
2. Enter the source directory:
> cd smpss-1.0
3. Run the configure script, specifying the installation directory as the prefix argument (more information can be obtained by running configure --help):
> ./configure --prefix=/opt/SMPSS
4. Run make:
> make
5. Run make install:
> make install

2.3 User environment

If SMP Superscalar has not been installed into a system directory, the user must set the following environment variables:

1. The PATH environment variable must contain the bin subdirectory of the installation:
> export PATH=$PATH:/opt/SMPSS/bin
2. The LD_LIBRARY_PATH environment variable must contain the lib subdirectory of the installation:
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/SMPSS/lib
2.4 Compiler environment

The SMP Superscalar compiler requires the following programs to be available on the PATH environment variable:
- GNU indent

3 Programming with SMPSs

SMP Superscalar applications are based on the parallelization, at task level, of sequential applications. The tasks (functions or subroutines) selected by the programmer will be executed on the different cores. Furthermore, the runtime detects when tasks are data-independent of each other and is able to schedule the simultaneous execution of several of them on different cores. All the above-mentioned actions (data dependency analysis and scheduling) are performed transparently to the programmer. However, to benefit from this automation, the computations to be executed on the cores should be of a certain granularity (about 80 microseconds or more). A limitation on the tasks is that they should only access their parameters and local variables.

3.1 Task selection

In the current version of SMP Superscalar it is the responsibility of the application programmer to select tasks of a certain granularity. For example, blocking is a technique that can be applied to increase the granularity of the tasks in applications that operate on matrices. Below is a sample code for a block matrix multiplication:
void block_addmultiply(double C[BS][BS], double A[BS][BS], double B[BS][BS])
{
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

3.2 SMPSs syntax

Starting and finishing SMPSs applications

The following optional pragmas indicate the scope of the program that will use the SMPSs features:

#pragma css start
#pragma css finish

When the start pragma is reached, all the threads are initiated, and they run until the finish pragma is reached or the program finishes. When present, annotated functions must be called between these two pragmas. If they are not present in the user code, the compiler will automatically insert the start pragma at the beginning of the application and the finish pragma at the end.

Specifying a task

Notation:

#pragma css task [input(<input parameters>)] [inout(<inout parameters>)] [output(<output parameters>)] [highpriority]
<function declaration or function definition>

Input clause: list of parameters whose input value will be read.
Inout clause: list of parameters that will be read and written by the task.
Output clause: list of parameters that will be written to.
Highpriority clause: specifies that the task will be sent for execution earlier than tasks without the highpriority clause.

Parameter notation: <parameter>[<dimension>]...
Examples

In this example, the factorial task has a single input parameter, n, and a single output parameter, result:

#pragma css task input(n) output(result)
void factorial(unsigned int n, unsigned int *result)
{
    *result = 1;
    for (; n > 1; n--)
        *result = *result * n;
}

The next example has two input vectors, left of size leftSize and right of size rightSize, and a single output, result, of size leftSize+rightSize:

#pragma css task input(left[leftSize], right[rightSize]) output(result[leftSize+rightSize])
void merge(float *left, unsigned int leftSize, float *right, unsigned int rightSize, float *result);

The next example shows another feature. In this case, with the keyword highpriority, the user gives a hint to the scheduler: the lu0 tasks will be executed (when data dependencies allow it) before the ones that are not marked as high priority:

#pragma css task highpriority inout(diag)
void lu0(float diag[64][64]);

Waiting on data

Notation:

#pragma css wait on(<list of expressions>)

On clause: comma-separated list of expressions corresponding to the addresses that the system will wait for. In Example 1, the vector data is generated by bubblesort. The wait pragma waits for this function to finish before printing the result.

Example 1:

#pragma css task inout(data[size]) input(size)
void bubblesort(float *data, unsigned int size);

void main()
{
    ...
    bubblesort(data, size);
    #pragma css wait on(data)
    for (unsigned int i = 0; i < size; i++)
        printf("%f\n", data[i]);
    ...
}

In Example 2, matrix[N][N] is a two-dimensional array of pointers to two-dimensional arrays of floats. Each of these two-dimensional arrays of floats is generated in the application by annotated functions. The pragma waits on the address of each of these blocks before printing the result into a file.

Example 2:

void write_matrix(FILE *file, matrix_t matrix)
{
    int rows, columns;
    int i, j, ii, jj;
    fprintf(file, "%d\n%d\n", N * BSIZE, N * BSIZE);
    for (i = 0; i < N; i++)
        for (ii = 0; ii < BSIZE; ii++) {
            for (j = 0; j < N; j++) {
                #pragma css wait on(matrix[i][j])
                for (jj = 0; jj < BSIZE; jj++)
                    fprintf(file, "%f ", matrix[i][j][ii][jj]);
            }
            fprintf(file, "\n");
        }
}
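To see these pieces working together, here is a minimal, self-contained sketch; it is not taken from the distribution, and the scale task, the sizes and the file layout are made up for illustration. Compiled with mcc it runs the scale calls as tasks; compiled with a native C compiler, the pragmas are ignored and it runs sequentially:

#include <stdio.h>

#define NBLOCKS 4
#define BS 256

/* One task per block: reads factor, reads and writes v (BS floats). */
#pragma css task input(factor) inout(v[BS])
void scale(float factor, float *v)
{
    for (unsigned int i = 0; i < BS; i++)
        v[i] *= factor;
}

static float data[NBLOCKS][BS];

int main()
{
    #pragma css start
    for (int b = 0; b < NBLOCKS; b++)
        scale(2.0f, data[b]);        /* NBLOCKS independent tasks */
    #pragma css wait on(data[0])     /* block until block 0 has been computed */
    printf("%f\n", data[0][0]);      /* now safe to read block 0 */
    #pragma css finish
    return 0;
}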
4 Compiling

All the steps of the SMPSs compiler have been integrated into a single compilation step, invoked through mcc with the corresponding compilation options, which are indicated in the usage section below. The mcc compilation process consists of preprocessing the SMPSs pragmas, compiling for the native architecture with the corresponding compiler, and linking with the needed libraries, including the SMPSs libraries. The current version is only able to compile single-source-code applications. A way of overcoming this limitation is to provide, through libraries, the code that does not contain annotations and does not call annotated functions.

4.1 Usage

The mcc compiler has been designed to mimic the options and behaviour of common C compilers. We also provide a means to pass non-standard parameters to the platform compiler and linker. The list of supported options is the following:
> mcc --help
-Dmacro          Option passed to the preprocessor
-g               Option passed to the native compiler
-h, --help       Prints this information
-I<dir>          Option passed to the native preprocessor
-k, --keep       Keep temporary files
-L<dir>          Option passed to the native compiler
-l<lib>          Option passed to the native compiler
--noincludes     Don't try to regenerate include directives
-O, -O1          Option passed to the native compiler
-O2              Option passed to the native compiler
-O3              Option passed to the native compiler
-o <file>        Sets the name of the output file
-t, --tracing    Enable program tracing
-v, --verbose    Enables some informational messages
-Wc,OPTIONS      Comma-separated list of options passed to the native compiler
-Wl,OPTIONS      Comma-separated list of options passed to the native linker
-Wp,OPTIONS      Comma-separated list of options passed to the native preprocessor

4.2 Examples

> mcc -O3 matmul.c -o matmul

Compilation of the application file matmul.c with the O3 optimization level. If there are no compilation errors, the executable file matmul is created, which can be called from the command line:

> ./matmul

> mcc --keep cholesky.c -o cholesky

Compilation, with the keep option, of the cholesky.c application. The keep option will not delete the intermediate files (the files generated by the preprocessor and the object files). If there are no compilation errors, the executable file cholesky is created.
> mcc -O2 -t matmul.c -o matmul

Compilation with the -t (tracing) feature. When matmul is executed, a tracefile of the execution of the application will be generated.

> mcc -O2 -Wc,-funroll-loops,-ftree-vectorize,-ftree-vectorizer-verbose=3 matmul.c -o matmul

The list of flags after -Wc, is passed to the native compiler (for example, c99). These options perform automatic vectorization of the code. Note: vectorization seems to not work properly on gcc with -O3.

5 Setting the environment and executing

5.1 Setting the number of CPUs and executing

Before executing an SMP Superscalar application, the number of processors to be used in the execution has to be defined. The default value is 2, but it can be set to a different number with the CSS_NUM_CPUS environment variable, for example:

> export CSS_NUM_CPUS=8

SMP Superscalar applications are started from the command line in the same way as any other application. For example, the applications from the compilation examples of section 4.2 can be started as follows:

> ./matmul <pars>
> ./cholesky <pars>

6 Programming examples

This section presents a programming example for the block matrix multiplication. The code is not complete, but you can find the complete and working code in the directory <install_dir>/share/docs/cellss/examples, <install_dir> being the installation directory. More examples are also provided in this directory.
6.1 Matrix multiply

This example presents SMPSs code for a block matrix multiply. The block size is 64 x 64 floats.

#pragma css task input(A, B) inout(C)
static void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    int i, j, k;
    for (i = 0; i < BS; i++)
        for (j = 0; j < BS; j++)
            for (k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

int main(int argc, char **argv)
{
    int i, j, k;
    initialize(argc, argv, A, B, C);
    #pragma css start
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
    #pragma css finish
}

This main code will run in the main thread, while the block_addmultiply calls will be executed in all the threads. It is important to note that the sequential code, including the annotations, can be compiled with the native compiler, obtaining a sequential binary. This is very useful for debugging the algorithms.
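For example, assuming GCC is the native compiler and that the example above is stored in matmul.c (both assumptions; any C compiler that ignores unknown pragmas should behave the same), the sequential debugging binary could be built and run as:

> gcc -O2 matmul.c -o matmul-seq
> ./matmul-seq <pars>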
7 Internals of SMP Superscalar

[Figure 1: SMP Superscalar runtime behavior. The figure shows the main thread and the worker threads, one per CPU, all linked against the SMPSs runtime library (user main program or original task code, plus scheduling and task execution), together with the per-thread ready task queues, the global ready task queues (high and low priority), the renaming table in memory, and work stealing between the thread queues.]

When compiling an SMPSs application with mcc, the resulting object files are linked with the SMPSs runtime library. Then, when the application is started, the SMPSs runtime is automatically invoked. The runtime is decoupled into two parts: one runs the main user code, and the other runs the tasks. The most important change to the original user code is that the SMPSs compiler replaces each call to an annotated function with a call to the css_addTask function. At runtime, these calls to css_addTask are responsible for the intended behavior of the application. At each call to css_addTask, the main thread performs the following actions:

- A node that represents the called task is added to the task graph.
- Data dependency analysis of the new task against the previously called tasks.
- Parameter renaming: similarly to register renaming (a technique from the superscalar processor area), the output and input/output parameters are renamed. For every function call that has a parameter that will be written, instead of writing to the original parameter location, a new memory location is used; that is, a new instance of that parameter is created, and it replaces the original one, becoming a renaming of the original parameter location. This allows that function call to be executed independently of any previous function call that would write or read that parameter. By using additional storage, this technique effectively removes some data dependencies, thus improving the chances of extracting more parallelism.
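The real css_addTask interface is internal to the runtime and is not documented here. The following self-contained C sketch, with hypothetical names such as model_add_task, only models the bookkeeping described above: each new task gets a read-after-write edge to the last writer of each of its inputs, while written parameters are renamed, so no write-after-read or write-after-write edges are needed:

#include <stdio.h>

#define MAX_TASKS 64
#define MAX_ADDRS 64

typedef struct {
    int deps[MAX_TASKS];   /* ids of tasks this task must wait for */
    int ndeps;
} node_t;

static node_t graph[MAX_TASKS];
static int ntasks = 0;

static void *tracked[MAX_ADDRS];    /* parameter addresses seen so far */
static int last_writer[MAX_ADDRS];  /* task that last wrote each address */
static int naddrs = 0;

static int addr_index(void *p)
{
    for (int i = 0; i < naddrs; i++)
        if (tracked[i] == p)
            return i;
    tracked[naddrs] = p;
    last_writer[naddrs] = -1;       /* no writer yet */
    return naddrs++;
}

/* Model of adding one task: inputs give read-after-write dependencies;
 * outputs would get fresh renamed storage in the real runtime, so only
 * the "last writer" record is updated and no extra edges are created. */
static int model_add_task(void *in[], int nin, void *out[], int nout)
{
    node_t *n = &graph[ntasks];
    n->ndeps = 0;
    for (int i = 0; i < nin; i++) {
        int w = last_writer[addr_index(in[i])];
        if (w >= 0)
            n->deps[n->ndeps++] = w;
    }
    for (int i = 0; i < nout; i++)
        last_writer[addr_index(out[i])] = ntasks;
    return ntasks++;
}

int main()
{
    float a, b, c;
    void *in0[] = { &a }, *out0[] = { &b };  /* task 0: b = f(a) */
    void *in1[] = { &b }, *out1[] = { &c };  /* task 1: c = g(b), depends on task 0 */
    void *in2[] = { &a }, *out2[] = { &b };  /* task 2: rewrites b; renaming, no edge */
    model_add_task(in0, 1, out0, 1);
    model_add_task(in1, 1, out1, 1);
    model_add_task(in2, 1, out2, 1);
    for (int t = 0; t < ntasks; t++) {
        printf("task %d depends on:", t);
        for (int d = 0; d < graph[t].ndeps; d++)
            printf(" %d", graph[t].deps[d]);
        printf("\n");
    }
    return 0;
}

Running the sketch prints that task 1 depends on task 0, while task 2, which overwrites b, depends on nothing: the write-after-read conflict has been absorbed by renaming.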
Every thread has its own ready task queue, including the main thread. There is also a global queue with priority. Whenever a task that has no predecessors is added to the graph, it is also added to the global ready task queue. The worker threads consume ready tasks from the queues in the following order of preference:

1. High-priority tasks from the global queue.
2. Tasks from their own queue, in LIFO order.
3. Tasks from any other thread's queue, in FIFO order.

Whenever a thread finishes executing a task, it checks which tasks have become ready and adds them to its own queue. This allows the thread to continue exploring the same area of the task graph, unless there is a high-priority task or that area has become empty. In order to preserve temporal locality, threads consume tasks from their own queue in LIFO order, which allows them to reuse output parameters to a certain degree. The task stealing policy tries to minimise adverse effects on the cache by stealing in FIFO order; that is, it tries to steal the coldest tasks of the stolen thread.
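This queue discipline can be captured in a few lines. The sketch below is illustrative only (ready_queue_t and its operations are not the runtime's real data structures, and the synchronization a real owner/thief pair needs is omitted); it shows why one buffer can serve both policies, with the owner popping the newest, hottest task from the tail and a thief removing the oldest, coldest task from the head:

#include <stdio.h>

#define QCAP 128

/* Illustrative per-thread ready queue: a deque of task ids, head <= tail. */
typedef struct {
    int task[QCAP];
    int head, tail;
} ready_queue_t;

static void rq_push(ready_queue_t *q, int id)
{
    q->task[q->tail++] = id;                             /* owner appends newly ready tasks */
}

static int rq_pop_lifo(ready_queue_t *q)
{
    return q->head < q->tail ? q->task[--q->tail] : -1;  /* owner: newest (hot) task */
}

static int rq_steal_fifo(ready_queue_t *q)
{
    return q->head < q->tail ? q->task[q->head++] : -1;  /* thief: oldest (cold) task */
}

int main()
{
    ready_queue_t q = { .head = 0, .tail = 0 };
    for (int id = 0; id < 4; id++)
        rq_push(&q, id);
    printf("owner executes %d (LIFO)\n", rq_pop_lifo(&q));    /* prints 3 */
    printf("thief steals %d (FIFO)\n", rq_steal_fifo(&q));    /* prints 0 */
    printf("owner executes %d (LIFO)\n", rq_pop_lifo(&q));    /* prints 2 */
    return 0;
}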
The main thread's purpose is to populate the graph in order to feed tasks to the worker threads. Nevertheless, it may stop generating new tasks under several conditions: too many tasks in the graph, a wait on, a barrier, or the end of the program. In those situations it takes the same role as the worker threads, consuming tasks until the blocking condition is no longer valid.

8 Advanced features

8.1 Using Paraver

To understand the behavior and performance of their applications, users can generate Paraver tracefiles of their SMP Superscalar applications. If the -t (tracing) flag is enabled at compilation time, the application will generate a Paraver tracefile of the execution. The default name for the tracefile is gss-trace-<id>.prv. The name can be changed by setting the environment variable CSS_TRACE_FILENAME. For example, if it is set as follows:

> export CSS_TRACE_FILENAME=tracefile

then after the execution the files tracefile-0001.row, tracefile-0001.prv and tracefile-0001.pcf are generated. All these files are required by the Paraver tool.

The traces generated by SMP Superscalar can be visualized and analyzed with Paraver. Paraver is distributed independently of SMP Superscalar and can be obtained from http://www.cepba.upc.es/paraver. Several configuration files to visualise and analyse SMP Superscalar tracefiles are provided with the SMP Superscalar distribution, in the directory <install_dir>/share/cellss/paraver_cfgs.
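Putting the tracing pieces of this section and of section 4.2 together, a full session could look as follows (matmul.c and <pars> stand for any SMPSs application and its arguments; the generated file names follow the pattern described above):

> mcc -O2 -t matmul.c -o matmul
> export CSS_TRACE_FILENAME=tracefile
> ./matmul <pars>

After the run, tracefile-0001.prv, tracefile-0001.pcf and tracefile-0001.row can be loaded into Paraver together with one of the provided configuration files.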
The following table summarizes what is shown by each configuration file:

execution_phases.cfg
    Profile of the percentage of time spent by each thread (master/worker) in each of the major phases of the runtime library (i.e., generating tasks, scheduling, task execution).

flushing.cfg
    Intervals (dark blue) where each thread is flushing its local trace buffer to disk. The effect of the flushing overhead on the main thread is of significance, since it prevents the main thread from adding newer tasks to the graph. This could lead to a starvation of the worker threads that would not happen when running without tracing.

task.cfg
    Outlined function being executed by each thread.

task_number.cfg
    Number (in order of task generation) of the task being executed by each thread. Light green for the initial tasks in program order, blue for the last tasks in program order. Intermixed green and blue indicate out-of-order execution.

task_profile.cfg
    Time (in microseconds) each thread spent executing the different tasks. Change the statistic to # bursts to see the number of tasks of each type per thread, or to Average burst time to see the average duration of each task type.

3dh_duration_tasks.cfg
    Histogram of the duration of the tasks, one plane per task. Fixed Value Selector: left column 0 microseconds, right column 3000 ms. The darker a duration appears, the higher the number of task instances with that duration.
8.2 Configuration file

With the objective of tuning the behaviour of the SMPSs runtime, a configuration file where some variables can be set is provided. However, we do not recommend playing with these variables unless the user considers it necessary to improve the performance of his or her applications. The current set of variables is the following (values in parentheses denote the defaults):

task_graph.task_count_high_mark (1000): defines the maximum number of non-executed tasks that the graph will hold. The purpose of this variable is to control the memory usage.

task_graph.task_count_low_mark (900): whenever the task graph reaches task_graph.task_count_high_mark tasks, task generation is suspended until the number of non-executed tasks goes below task_graph.task_count_low_mark.

These variables are set in a plain text file with the following syntax:

task_graph.task_count_high_mark = 2000
task_graph.task_count_low_mark = 1500

The file where the variables are set is indicated by setting the CSS_CONFIG_FILE environment variable. For example, if the file file.cfg contains the above variable settings, the following command can be used:

> export CSS_CONFIG_FILE=file.cfg
9 References

[1] SMP Superscalar website: http://www.bsc.es/plantillaG.php?cat_id=385
[2] Josep M. Perez, Rosa M. Badia and Jesus Labarta. A flexible and portable programming model for SMP and multi-cores. Technical Report 03/2007, Barcelona Supercomputing Center, Centro Nacional de Supercomputación, June 2007.
[3] Paraver website: www.cepba.upc.edu/paraver
