Home

- Computer Science @ UC Davis

1. Summary of Current Work AeA CQ C 10 11 12 13 14 14 15 17 18 18 18 19 20 20 21 23 23 25 9 Future Work 9 1 Distributed Scheduling Enhancements 9 1 1 Designj soca c uu ea a aaa 9 1 2 Example 9 1 3 Implementation Optimization 9 1 4 Evaluation 9 2 Extending CATAPULTS With Realtime Capabilities 9 2 1 Requirements 9 22 JObstacles 9 23 Evaluation 9 3 Grammar Parser Generation Through Metaprogramming 10 Timeline iii List of Figures 1 Thread attribute declarations a e a a 2 Thread import declarations ee 3 Global data declarations 2 a o a s 4 Application variable imports les 5 Event handlers to initialize the scheduler and handle new thread creation events The main scheduling event handler ens 7 Event handlers for context switching away from a thread and performing a named So d o 0 N N en eee RRR RAO SE Pee ee OE Be Oe Ra ee ees 10 Rieu ce MU UELLE 10 9 CATAPULTS scheduler for the CoW web server part 1 of 2 24 10 CATAPULTS scheduler for the CoW web server part 2 of 2 25 11 Node family declarations 0000000000002 22 ee 29 12 Node collections lt a siaa ee ee 29 13 Distributed global scheduling variables len 29 14 Thread dispatch block 2 62 6 eee ee Ss 30 15 Thread a
2. feature of CATAPULTS is the use of imported application variables discussed in Section since it allows direct interaction between the application and the scheduler Importing application variables is an optional feature that allows more specialized scheduler development at the cost of tighter coupling between the application and scheduler the ap 10 plication developer can decide whether this tradeoff is worthwhile for the specific application Even if application specific schedulers that import application variables are not desired performance can often be enhanced simply by selecting an appropriate generic scheduler for the application In this case the scheduler can be developed and fine tuned by a third party which makes the use of a CATAPULTS scheduler just as safe and easy as using the original built in scheduler The CATAPULTS scheduling language was designed with three major goals in mind modular ization and pluggability of scheduling logic prevention of common programming errors encountered in schedulers and portability across different scheduling libraries with different capabilities This section discusses these three goals A fourth important goal good performance is implicit in the design decisions made for CATAPULTS and is discussed in Section 7 4 1 Modularization One reason thread schedulers are so hard to write is because scheduling logic is spread throughout the scheduling library The more complex a scheduling l
3. an effective tool for realtime scheduler development and Section 9 3 describes the requirements of a CATAPULTS metaprogramming model 9 1 Distributed Scheduling Enhancements CATAPULTS scheduling can be extended to distributed computing Application specific schedulers are most likely to be useful on distributed systems with heterogeneous nodes for reasons explained below Developing application specific schedulers for distributed programs provides several advan tages e For applications that do not explicitly manage the distribution of their threads threads can be assigned to computing nodes that most closely match their needs For example threads that are expected to do lots of I O can be placed on nodes with fast disk access threads that need to perform large memory intensive calculations can be assigned to nodes with lots of RAM etc Likewise dissimilar threads can be grouped together so that they can more efficiently overlap I O with computation so that e g two I O bound threads will not wind up competing for disk bandwidth on the same machine e Different nodes can easily utilize different local scheduling policies and can still use global scheduling information i e information about the state of the system as a whole to guide their local decisions e If responsibility for thread distribution is moved from the application to the scheduler it becomes easier to migrate the entire application to a new set of hardwar
4. Overcoming these obstacles while still allowing flexibility in the types of schedulers that can be created is our goal with realtime CATAPULTS Scheduler Execution Time Successful realtime scheduling depends on the accurate prediction and bounding of task execution times Since the scheduler is itself a task that must compete for processor time it is crucial that the execution time of the scheduler itself be bounded and predictable The CATAPULTS translator will be able to analyze scheduling code and determine the worst cost execution time WCET of various events especially the schedule event Accomplishing this will likely disallow the use of some of CATAPULTS more general programming features such as unbounded loops or verbatim statements expressions This is an inconvenience but a necessary tradeoff to obtain meaningful realtime guarantees Application Analysis In order to perform meaningful realtime guarantees a CATAPULTS scheduler will need to have relatively accurate estimates of the WCET of the various tasks that are to be scheduled Estimating execution times will require a means of analyzing the application source code or possibly the generated object code to determine the length of each task Since task scheduling is so important to realtime applications we feel that a tighter coupling of application and scheduler is acceptable in a realtime system than in a regular application so an application code preprocessor or source
5. all operations are guaranteed to be safe but this limits the overall power of the language For example Bossa does not allow any form of unbounded loop in contrast CATAPULTS provides traditional for while and do loops for cases where a safer foreach loop does not suffice Our compiler will generate a warning if it cannot be sure that the loop will terminate CATAPULTS also differs from Bossa in that Bossa is tightly coupled with a specific target lan guage and platform i e it generates Linux kernel C code CATAPULTS allows different backends to be written for different target platforms and languages Modularizing scheduling code has also started to receive some attention from Linux kernel de velopers A recent Linux kernel patch 12 separates all scheduling logic out into a separate kernel source file thus making it much easier to replace the kernel scheduler Although it appears that this pluggable scheduler framework is unlikely to be accepted into the mainline kernel it has received notable support and is being developed as an external patch to the kernel This pluggable sched uler framework provides some of the benefits that systems such as Bossa or CATAPULTS do modularization and ease of replacement but lacks the portability and safety benefits that can be obtained from using a domain specific language like CATAPULTS Had it existed early enough the pluggable scheduler framework would have been an excellent foundation on which
6. and threads are scheduled with the operating system s highly tuned scheduler which also allows them to be distributed among multiple processors Unlike processes kernel threads share an address space which makes inter thread communication easier although thread synchronization is required to prevent race conditions The disadvantage of using kernel level threads is that each context switch requires a transition into kernel code and then back again which is a relatively expensive operation Even though the kernel scheduler is highly tuned for optimal system performance it does not have any knowledge of the application level semantics and can often make poor decisions about which of a program s threads to run next User level threads differ from kernel level threads in that they are completely transparent to the operating system kernel This means that all context switches are performed in user space either preemptively or cooperatively by a separate library specific scheduler without kernel intervention which results in a significant performance gain However this performance benefit comes at a price user level threads run inside a single kernel process so they cannot be distributed among multiple processors furthermore I O or other blocking system calls must be wrapped or replaced with non blocking versions so that a single thread does not block the entire application Despite these limitations which they share with event driven applications
7. application it may also need to monitor or update regular global application variables For CATAPULTS to link a general application variable with an identifier in the scheduler the imported variable must be declared in an imports block along with a default value to use in case the application does not register the variable or the variable needs to be used by the scheduler before the application has a chance to register it In our example a single global variable temperature is imported from the application This variable will be used later in the scheduler s event handlers to determine whether data threadref current current thread threadref next next thread named yield stack NQ new threads queue standard sensors sensors queue slow sensors sensors monitored less frequently threadref display display driver queue calculations calculation threads Last time various thread types ran int last display last sensori last sensor2 const int UNKNOWNCLASS 0 SENSOR1CLASS 1 SENSOR2CLASS DISPLAYCLASS 3 CALCCLASS 2 4 Figure 3 Global data declarations or not the display output thread should be run imports 1 int temperature default 0 Figure 4 Application variable imports The remainder of the scheduler definition consists of event and query handlers These handlers which resemble C functions are callbacks that the base threading library has been modified to call when
8. can be performed by that thread the useless context switch and networking system calls performed add a great deal of unnecessary overhead e The more quickly HTTP requests are processed and flushed from the system the faster new requests can be accepted By giving greater priority to request handler threads that are near completion the overall throughput of the system can be significantly increased We developed a custom scheduler for CoW that addressed these issues scheduler specification can be downloaded from Reference 22 Only about 20 30 lines of code needed to be added modified in CoW in order to get it to run with CATAPULTS these modifications consisted of registering application variables with the scheduler We benchmarked the performance benefit of using the CATAPULTS scheduler by running the httperf 23 HTTP benchmark against against CoW running both with and without the CATA PULTS scheduler several times We varied both the number of handler threads used by the server and the size of the files being requested Our results are summarized in Tables 1 and B the numbers given are averages of several runs with low variances For the most part use of our application specific scheduler resulted in more successful responses and quicker response times although there were a few tests where performance decreased 7 2 Weather Monitoring Station Application In order to measure the benefit of using CATAPULTS on an embedded appli
9. code annotations may be suitable requirements for application development 33 9 2 3 Evaluation As we mentioned above realtime embedded control systems are one of our primary targets with this research Although it is difficult to accumulate the hardware requirements of a real embedded control system it is easy to develop synthetic control systems that simulate the behavior of real control hardware We intend to develop simulations of real world realtime control systems and we will measure our success by attempting to develop more effective schedulers for these systems with CATAPULTS In terms of realtime scheduling we will consider a scheduler effective if it allows more tasks to be run than conservative generic realtime schedulers while still maintaining realtime guarantees 9 3 Grammar Parser Generation Through Metaprogramming Both of the CATAPULTS extensions in the previous sections distributed and realtime schedulers will require modifications to the CATAPULTS grammar Some of these modifications are additions to the grammar e g the introduction of a time datatype or the addition of thread dispatch blocks in a distributed scheduler while others are removals e g unbounded loops should not be used in a realtime scheduler Although we have tried to keep the CATAPULTS language generic and portable our design decisions have likely been influenced by the design and requirements of the specific threading frameworks that we are cur
10. deals with concurrency by spawning multiple OS processes e g by using the forkO system call on UNIX like operating systems Multi process applications are popular for a number of reasons First all modern operating systems provide a means of spawning additional processes so programs that make use of multiple processes are easy to port between operating systems they require few if any modifications when moving to a new system Second multi process applications provide a layer of safety by isolating the different parts of a program from each other Since each process has its own address space it is more difficult for a misbehaving process to corrupt the memory of other tasks and bring down the entire application graceful recovery is easier to accomplish than in other concurrency models Finally since each process is a separate kernel level object the application will get the full benefit of the kernel process scheduler and I O scheduler both of which have been heavily tweaked and optimized However this model is the most heavy weight of the possible strategies for dealing with con currency Process context switching is a relatively expensive operation and further overhead is incurred if separate processes need to communicate with each other or share global data Further more the memory requirements of a multi process application are higher than in other models since some data must be duplicated 2 1 2 Event Driven Applications Even
11. has the additional advantage of eliminating the indirect function calls via function pointers used in the Pth backend although the overhead of calling scheduling code was too small to measure on a regular PC we expect that it would have played a much larger role in the embedded environments that Dynamic C is used on The modifications to cmthread lib to make it work with CATAPULTS are minor about 100 new lines of code were added to the original 457 lines Because this new code is being generated by CATAPULTS its formatting sometimes splits what would normally be one line of code over several So a fairer estimate is about 50 lines of new code Also a good portion of this code is functions that simply do callbacks 7 Experience This section describes our experience including performance results in using CATAPULTS to write custom schedulers for two main applications a web server on Linux and a weather station application on an embedded system We have also used CATAPULTS to write custom schedulers for a variety of other applications e g thread schedulers that use various prioritization schemes We are currently developing CATAPULTS schedulers for MySQL 2 and a web server on an embedded system 7 1 The CoW Web Server on Linux To measure the performance benefits of using CATAPULTS on a real application we adapted CoW a cooperatively multithreaded web server that we had previously written to use a CATAPULTS scheduler CoW was devel
12. means of generating random numbers even though CATAPULTS does not have any instructions to do this Likewise if the programmer wishes to write some values to the screen while debugging he will need some way of generating output since CATAPULTS does not include any output commands To overcome these limitations CATAPULTS provides verbatim statements and verbatim expres sions which allow the programmer to include a block of code either a statement level block or a single expression of the target language directly in the scheduler Thus the programmer can express anything that could be coded in the target scheduler s programming language at the expense of some portability i e the verbatim statements and verbatim expressions will need to be re written for 17 each target language library to which the scheduler is compiled 6 Implementation This section describes the implementation of the CATAPULTS frontend and both of its existing backends As noted earlier CATAPULTS supports multiple backend code generators Each is written as a separate module which makes it easy to add new backends for other languages and libraries We have currently implemented two such code generation modules for CATAPULTS a GNU Pth backend for PC s and a Dynamic C backend for embedded systems 6 1 The Frontend The CATAPULTS translator is written in Python using PLY Python Lex Yacc I8 The trans lator uses very simple propagation based static analy
13. of CATAPULTS to a typical embedded application and obtained a performance gain of about 12 696 at the expense of about 12 796 increase in code size Future directions for the CATAPULTS project include extending the language to facilitate the development of schedulers for both realtime applications and distributed applications as well as developing a metaprogramming model for generating CATAPULTS grammars specialized for a given threading package GNU Pth cmthread lib etc and application class classic multithreaded realtime distributed etc Contents 1 Introduction 2 Background and Related Work Me s sou on dou a Le s poe eee 2 1 1 Multi Process Applications 2 aa Dow oan es Ges See ee eee GPA A eat oe Ras ead avi EER ee SESS i wie ae ae af a Ba ae ates a ed ly St ata at eae Be eee 23 Related Work sre 22a mone 0p 4 AO Re RR X RR eee PERE Sean ee Dai 3 An Introductory Example of a CATAPULTS Scheduler 4 1 Modularization 4 2 Error Prevention 4 3 Portability 5 Language Details 5 1 Data Types 5 2 Imported Application Variables e 5 3 Verbatim Statements and Expressions o oo o a e 6 Implementation 6 1 The Frontend 6 2 The GNU Pth Backend 6 3 The Dynamic C Backend 8 7 1 The CoW Web Server on Linux 22s 7 2 Weather Monitoring Station Application o a e 7 3 Embedded CoW Web Server 7 4 The MySQL Database
14. specification will require two additional sections not present in regular CATAPULTS schedulers The first section lists all globally distributed scheduler variables that the node scheduler uses These variables are visible to the schedulers on all nodes and to the distribution specification described below via a Distributed Shared Memory DSM model The second new section is a list of all properties of a specific node that should be exposed to code in the distribution specification e g amount of free RAM at a node Each node family will be defined in a separate file not only does this organization break a large distributed scheduler into smaller more manageable chunks it also facilitates the reuse of certain node families between applications that have different overarching distributed schedulers Distribution Specification As described above a separate file is used to define each node family of a distributed CATAPULTS scheduler An additional file the distribution specification specifies behavior of the scheduler as a whole This file will contain the following e Declarations of node collections described below e Any globally available variables that should be made available to all node schedulers e Default implementations of event or query handlers for operations that are implemented in the same way across several node families 27 e A node arrival block that contains logic for organizing the sets of nodes currently pre
15. 1 Data Types CATAPULTS provides a typical set of primitive types In addition CATAPULTS provides several thread container types for organizing the threads in the system queue stack pqueue pstack and threadref A pqueue or pstack is similar to a queue or stack but its threads are ordered by a user specified key A threadref can hold at most one thread As mentioned in Section 4 2 all threads must be present in one and only one of the scheduler s thread containers at any time and the thread transfer operator is used to move threads between containers Each container type has a designated insertion point and removal point that controls how the transfer statement manipulates the container The container types provided by CATAPULTS are threadref A thread container with a slot for a single thread Using an empty threadref as the 14 source of a transfer operation or using a full threadref as the destination of a thread transfer are errors stack Threads are both inserted at and removed from the beginning of the list This data structure is unbounded but attempting to transfer a thread out of an empty stack is an error queue Threads are inserted at the end of the list and removed from the beginning As with the stack this data structure is unbounded but using an empty queue as the source of a transfer operation is an error pqueue Threads are maintained in a sorted order based on the pqueue s designated sort key An inserte
16. Application specific User Level Thread Schedulers Qualifying Examination Paper Matthew D Roper roperGcs ucdavis edu May 2005 Department of Computer Science University of California Davis Application specific User Level Thread Schedulers Matthew D Roper roperGcs ucdavis edu Abstract This paper describes CATAPULTS a domain specific language for creating and testing application specific user level thread schedulers Using a domain specific language to write user level thread schedulers provides three advantages First it modularizes the thread scheduler making it easy to plug in and experiment with different thread scheduling strategies Second using a domain specific language for scheduling code helps prevent several of the common pro gramming mistakes that are easy to make when developing thread schedulers Finally the CATAPULTS translator has multiple backends that generate code for different languages and libraries This makes it easy to prototype an application in a high level language and then later port it to a low level language the CATAPULTS translator will take care of generating the appropriate code for both the prototype and the final version of the program from a sin gle scheduler specification Using our preliminary implementation of CATAPULTS for GNU Pth we were able to significantly improve both the throughput and response time of CoW a multi threaded web server We also applied our embedded system version
17. S t gt standard sensors else if t threadclass SENSOR2CLASS t gt slow sensors else if t threadclass DISPLAYCLASS t gt display else if t threadclass CALCCLASS t gt calculations event set next thread t 1 t gt next Figure 7 Event handlers for context switching away from a thread and performing a named yield to a specific thread query threads_ready return standard_sensors slow_sensors display calculations Figure 8 An example query handler that returns to the base threading library the number of threads currently ready to run in the system different CATAPULTS backend to generate scheduling code for whatever language and library was being used for the prototype 4 Design Allowing application programmers to replace the thread scheduler a very highly tuned component of most software systems is a controversial approach Although errors introduced in the schedul ing specification can result in poor performance or instability well written schedulers can result in significantly improved performance The use of CATAPULTS is an optimization with a tradeoff higher performance at the cost of additional work writing a CATAPULTS scheduler and less as surance of stability We expect applications to be written without regards for the scheduler and then if higher thread performance is necessary an application specific scheduler can be written and plugged in The most dangerous
18. Scheduler CATAPULTS is most easily introduced by providing an example of applying it to a simple hypo thetical multi threaded application the embedded control system of a weather monitoring station The application has to monitor several temperature sensors which have to be checked with different frequencies drive a display that changes when the temperature reaches a certain threshold and perform various calculations while the hardware is idle Such a situation is relatively easy to model in a multi threaded application one thread is assigned to each temperature sensor one thread drives the output display and one or more threads perform miscellaneous calculations during the processor s idle time Control systems of this form are common applications for languages on embedded systems such as Dynamic C 16 an extended subset of C that runs on Z World s 8 bit Rabbit processors Dy namic C and our implementation of CATAPULTS on it are discussed in Section 6 3 Although straightforward to implement a standard Dynamic C implementation as described would fail to utilize the processor fully because Dynamic C s native thread scheduler uses a simple first come first serve algorithm Even though some threads do not need to run as often as other threads or only really need to run when certain application level conditions occur the Dynamic C scheduler has no such knowledge It schedules the threads in an inefficient manner resulting in unnecessary con
19. ary for Linux http people redhat com 2003 Ralf S Engelschall GNU Pth the GNU Portable Threads http www gnu org software pth L Barreto and G Muller Bossa A language based approach for the design of real time schedulers In 10th International Conference on Real Time Systems RTS 2002 Con Kolivas Pluggable CPU scheduler framework October 2004 http groups beta google com group fa Travis Newhouse and Joseph Pasquale A user level framework for scheduling within service execution environ ments Proc of the 2004 IEEE International Conference om Services Computing September 2004 Thomas E Anderson Brian N Bershad Edward D Lazowska and Henry M Levy Scheduler Activations Ef fective Kernel Support for the User Level Management of Parallelism ACM Transactions on Computer Systems 10 1 53 79 February 1992 David Keppel Tools and techniques for building fast portable threads packages Technical Report UWCSE 93 05 06 University of Washington Department of Computer Science and Engineering May 1993 Dynamic C user s manual http www zworld com documentation docs manuals DC DCUserManual index htm Takashi Ishihara and Matthew Roper CoW A cooperative multithreading web server edu roper cow David Beazley PLY Python Lex Yacc http systems cs uchicago edu ply Mike Stevens Linker hijacking http neworder box sk newsread php newsid 4735 Matthew Roper Dynamic threading and scheduling with D
20. can only be sorted on thread attributes not on imported application variables We are currently getting CoW to work again on the Rabbits The problem appears to be with the Dynamic C networking code which has changed since the previous version with which CoW worked This problem is not related to CATAPULTS since we have the same problem with non CATAPULTS code We expect to have fixed this problem and report for CoW the same kind of data as in Table Blin the next version of this paper 7 4 The MySQL Database We are also in the process of developing a CATAPULTS scheduler for the MySQL database server MySQL is written using the Pthreads API and can be compiled against Pth s implementation of Pthreads MySQL uses a separate thread per connection and also has some additional threads for advanced features such as replication We see two possible sources of speedup in implementing 23 scheduler cow thread int state int bytesleft internal threadimports int bytesleft default 0 data threadref current current thread threadref next next thread named yield ready threads pqueue RQ sortable on bytesleft internal queue SQ suspended int READY 1 SUSPENDED 2 Y event init noop event newthread t t state READY t gt RQ event schedule Find next thread to run if next 1 next gt current else RQ gt current dispatch current Figure 9 CATAPULTS
21. cation we simulated the weather monitoring station example described in Section Since we do not have access to 21 Table 2 Average response time in ms of CoW running without and with a CATAPULTS application specific scheduler Generic Scheduler CATAPULTS scheduler Improvement 5 handlers 90K files 6014 4691 22 0 20 handlers 90K files 4570 1527 66 6 100 handlers 90K files 1893 1963 3 7 5 handlers 250K files 10687 9708 9 2 20 handlers 250K files 7354 5698 22 5 100 handlers 250K files 4865 4248 12 7 5 handlers 4 byte files 732 794 8 5 20 handlers 4 byte files 589 455 22 8 100 handlers 4 byte files 333 346 3 9 Table 3 Comparison of CATAPULTS threading library and generic cmthread lib Lines of Code Compiled Code Size Simulation Duration Generic cmthread lib 457 21120 bytes 76 508 sec Generated CATAPULTS library lt 546 517 23808 bytes 66 837 sec real weather monitoring hardware we wrote a Dynamic C application with the appropriate control logic and replaced actual sensor and display hardware I O with small loops The CATAPULTS scheduler specification described in the example in Section 3 was used to control thread scheduling The complete specification including the minor details omitted in Figures was a total of 174 lines of code and was translated into 546 lines of Dynamic C In contrast the original cmthread lib libra
22. d 6 3 The Dynamic C Backend While our Pth backend illustrated the the benefits of using CATAPULTS with a large complex threading library we also wanted to apply CATAPULTS in an embedded environment For this purpose we chose to develop a CATAPULTS backend for Dynamic C an extended subset of C that is used to program Z World s 8 bit Rabbit devices Dynamic C includes builtin cooperative multithreading in the form of costatements and cofunctions but only allows a round robin scheduling policy for the threads Although it is possible to use tricks to accomplish dynamic scheduling in Dynamic C 20 doing so requires invasive changes to the application itself which results in confusing code and does not integrate well with CATAPULTS Instead we chose to integrate CATAPULTS with the cmthread lib threading library that we had previously written for Dynamic C cmthread lib is a substitute for Dynamic C s language level multithreading and provides an API that is more consistent with other popular threading API s such as Pth or Pthreads cmthread lib also provides 19 better performance in many cases Dynamic C applications run in an embedded environment with no operating system so dy namically loading schedulers at runtime as our Pth implementation does is neither possible nor advantageous Instead our Dynamic C backend generates a custom version of the cmthread lib library that contains the custom generated scheduling code inline This approach
23. d thread will be placed after all other threads with the same sort key Threads are removed from the beginning of the list This data structure is unbounded but using an empty pqueue as the source of a transfer operation will result in an error pstack Identical to a pqueue except that an inserted thread will be placed before all other threads with the same sort key Collection types are declared using type var syntax e g queue ready threads or threadref current thread with the exception of pstack s and pqueue s These two data types maintain a sorted collection of threads so their definition must also specify the sorting key The syn tax is type var reverse sortable on attribute where attribute is one of the attributes declared in the thread definition section of the scheduler The threads will be sorted such that the thread with the highest sort key will be removed first unless the reverse keyword is specified in which case the thread with the lowest sort key will be first on the queue CATAPULTS also allows arrays of both primitive types and thread containers Note that an array of threadrefs is not considered a container itself and cannot be used as the source or destination of the thread transfer operation instead a specific element of the array must be indexed directly 5 2 Imported Application Variables The primary goal of CATAPULTS is to not only make it easier to write new thread schedulers in general but to allow the
24. d threading libraries Ideally CATAPULTS would be able recompile schedulers for different threading packages with no modifications to the scheduler specification at all However since CATAPULTS allows programmers to write callback routines for various scheduling events it may be necessary to add code to the scheduler specification when switching to a more featureful output module For example a scheduler developed for use with Dynamic C need only specify callback code for a basic set of thread events thread creation thread selection etc If that scheduler specification is then used to generate a scheduler for a more advanced threading library such as Pth additional code will need to be written to specify what actions to perform on Pth s more advanced scheduling events e g OS signal received I O operation complete etc Each CATAPULTS backend code generation module includes a list of the scheduling events that are must be specified in order to create a complete scheduler if a code generation module is used with a scheduler specification that does not include one or more of the required events an error will be returned and translation will stop 5 Language Details This section provides details about the language mechanisms used in the example in Section 3 and also introduces a few additional features that were not used in the example This section can safely be skipped without affecting understanding of the rest of the paper 5
25. definition must start with a section that declares these families and indicates where their specification files are located as shown in Figure After node families are specified node collections are declared as shown scheduler distrib Node families nodefamilies nodetypes fastdisk family fastdisk nodetypes lowram family lowram nodetypes embedded family slowcpu Figure 11 Node family declarations in Figure 12 These collections which function the same way as collections of threads in a regular scheduler provide a way for the distributed scheduler to organize and manage the processing nodes to which it can dispatch threads Next the scheduler must define any globally distributed scheduling Node collections nodes pqueue numcrunch sortable on cpuspeed pqueue decompress sortable on freeram Stack busynodes queue sensors node database only a single DB node node display 3 exactly 3 display nodes Figure 12 Node collections variables that should be available to the schedulers on all nodes This section as illustrated in Figure is similar to the data section in a traditional CATAPULTS scheduler As described Distributed global data global int lowram threads 0 int total ram 0 Figure 13 Distributed global scheduling variables earlier a CATAPULTS scheduler may require a dispatch block which describes how new threads are allocated among the available process
26. development of application specific schedulers for the absolute maximum performance on a specific application Since optimal scheduling decisions often require knowledge about the internal state of an application CATAPULTS provides a means for applications to register their internal variables with the scheduler Once a variable is registered with the scheduler it is linked with a corresponding variable declared in the imports section of the scheduler specification 15 Section 3 Any changes that the application makes to the variable will immediately be visible through the linked scheduler variable and vice versa Imported application variables are the most controversial feature of CATAPULTS since mixing application level data with scheduler logic can be seen as a dangerous entanglement of separate system levels This optional feature provides a tradeoff to the application programmer it becomes harder to change the application without also making changes to the scheduler but performance can be significantly improved by making use of application level information As seen in Section 8 CATAPULTS allows two types of variables to be registered imported with the scheduler general global application variables and per thread instance variables Registering general variables is useful for providing the scheduler with information about the status or load of the system as a whole common examples include the number of open network connections i
27. discussed in depth in Section Figure 2 illustrates the per thread imports for our example a single application variable threadclass is imported which allows the scheduler to determine the scheduling class slow sensor display etc to which a given thread belongs threadimports possible values are thread class constants defined in data section int threadclass default 0 Figure 2 Thread import declarations The scheduler specification must also include declarations for any global objects used by the scheduler including both global variables and constants of primitive types i e integers and floats and thread collections queues stacks etc provided by the runtime system see Section 5 1 FigureB shows this data declaration section for our example scheduler Several thread collections are declared to hold different classes of threads new i e just created threads are placed on a stack sensor threads are divided depending on their speed between two queues a thread reference is used to hold the single thread that drives the display and another queue is used to hold the calculation threads Regular variables of primitive types just integers in this case are also defined here to keep track of the last time a thread of a specific class ran and constants are defined for the different scheduling classes to which a thread can belong Just as a scheduler may need to import thread specific attributes from the
28. e In this case only the scheduler needs to be modified rather than the actual application code There is a lot more variation in the design and style of distributed applications than there is in traditional threading libraries Distributed applications vary widely in areas such as explicit versus 26 implicit distribution communication models shared memory versus message passing etc Because these differences are so significant and have such a large impact on how the application is written we will refer to the libraries and software used for distribution as distributed threading frameworks rather than just threading libraries 9 1 1 Design CATAPULTS as described in Section 5 currently assumes a single scheduler on a single node In CATAPULTS view of a distributed system a scheduler specification will consist of several cooper ating local schedulers with some centralized logic to glue them together Node Families Since different nodes are expected to have different resources and the exact number of nodes may or may not be known before runtime CATAPULTS will be extended with the concept of node families A node family will be specified much as a regular CATAPULTS scheduler is specified its data section lists variables that will be used locally by the scheduler on the node but that are not available to schedulers on other nodes and its set of event handlers will be executed on nodes of that family A node family
29. e 5 Event handlers to initialize the scheduler and handle new thread creation events event schedule threadref tmp Move new threads to their appropriate containers if we know what type of thread they are yet foreach tmp in NQ if tmp threadclass SENSOR1CLASS tmp gt standard sensors else if tmp threadclass SENSOR2CLASS tmp gt slow sensors else if tmp threadclass DISPLAYCLASS tmp gt display else if tmp threadclass CALCCLASS tmp gt calculations Update last run times last_display last sensorit last_sensor2 Determine next thread to run run target of named yield if any run display if temperature gt 100 and display hasn t been updated in over 10 ticks run regular sensor if none run in 3 ticks run slow sensor if none run in 6 ticks else run calculation thread if next 1 next size of next next gt current next is 1 or O here else if temperature gt 100 amp amp last display 10 display gt current last display 0 else if last_sensori gt 3 amp amp standard_sensors gt 0 1 standard_sensors gt current last sensor 0 else if last_sensor2 gt 6 amp amp slow sensors 0 slow_sensors gt current 0 last_sensor2 else calculations gt current dispatch current Figure 6 The main scheduling event handler event switch out t 1 if t threadclass SENSOR1CLAS
30. ems are a primary target for CATAPULTS schedulers 9 2 1 Requirements A couple of straightforward modifications to the CATAPULTS design will be necessary to facilitate realtime scheduling Time Datatype Clearly the concept of time is extremely important to the development of a re altime scheduler At the moment CATAPULTS provides no portable way of specifying and manip ulating time values although some systems can use verbatim statements and verbatim expressions to accomplish this The creation of a time datatype will be necessary for realtime scheduling 32 Event Failure In the current CATAPULTS model event handlers can never fail to execute That is operations like create thread schedule or suspend always succeed While this assumption is generally valid for regular schedulers it is unsuitable for realtime schedulers Realtime schedulers need to perform admission control i e determine whether or not the system is capable of supporting another thread while still maintaining realtime deadlines Therefore CATAPULTS must be modified to allow event handlers to return a failure status if an event cannot be fulfilled This failure will be returned to the application level when events such as create thread are denied 9 2 2 Obstacles In addition to the straightforward changes outlined above several more difficult obstacles will need to be overcome to allow the development of realtime schedulers with CATAPULTS
31. enience individual threadrefs can be grouped into arrays but each element of the array must be accessed directly the array itself is not considered to be a thread collection CATAPULTS enforces the invariant that each thread in the system is contained in exactly one collection at any time this is a strength of CATAPULTS because thread references can never be duplicated or lost due to programmer error although they can be explicitly destroyed when no longer needed The only way to add or remove a thread from a container is to use the thread transfer operator whose syntax is src gt dest Each type of thread container has predefined logic that specifies how threads are inserted and removed by the transfer statement e g removal occurs at the end of queues but at the beginning of stacks When this transfer operator is encountered in a scheduler specification the CATAPULTS translator attempts to verify statically that there will be at least one element in the source container if this cannot be guaranteed the CATAPULTS translator inserts runtime assertions into the generated code Similar checks are made to ensure that a bare thread reference used as the destination of a transfer statement does not already contain a thread All thread transfer operations fall into one of four cases and cause the following types of checks to be performed threadref gt threadref Attempt to statically determine that the source threadref is full and that the tar
32. fic language for writing application specific schedulers T his approach has three major benefits First all scheduling code is collected into a single replaceable component The programmer need only fill in the body of various scheduling events e g new thread thread blocked on I O thread terminated etc in an aspect oriented manner Second using a domain specific language allows much better static analysis to be performed than if the scheduler were directly written in a lower level language such as C For example it is impossible to lose a reference to a thread using our language Finally using a domain specific language allows multiple translation backends to be developed in order to target different threading libraries or programming languages Our primary target for CATAPULTS is highly concurrent Internet servers web database mail etc that run on uniprocessor machines since these applications will gain the most benefit from custom schedulers Another important area that CATAPULTS has been applied to is applications for embedded systems has been applied to We intend to further develop the CATAPULTS language to provide mechanisms for the development of schedulers for realtime applications and also facilitate the creation of schedulers for distributed mul tithreaded applications Because realtime and distributed schedulers require significantly different features beyond what a traditional multithreaded application need
33. get threadref is empty If either of these cannot be determined with certainty runtime assertions are inserted into the generated code 12 threadref gt unbounded container Attempt to statically determine that the source threadref is full If unable to determine statically a runtime assertion is inserted into the generated code unbounded container gt threadref Attempt to statically determine that the source container contains at least one thread and that the target threadref is empty If either of these cannot be determined with certainty runtime assertions are inserted into the generated code unbounded container gt unbounded container Attempt to statically determine that the source container contains at least one thread If unable to determine statically a runtime assertion is inserted into the generated code It should be noted that due to CATAPULTS use of callback like event and query handlers the empty or full status of a container can only be inferred intra handler and not inter handler Since event handlers can be called in any order the contents of all containers with the exception of threadrefs used as parameters to a handler are completely unknown at the start of an event handler As thread transfers are performed it will become statically apparent that some containers are not empty i e they have had threads transferred into them or that some threadrefs are definitely empty they have had a thread transfer
34. hat is specialized for the specific application Thus it is very easy to try out different scheduling strategies 11 4 2 Error Prevention User level threading libraries are often written in C since they generally need low level access to operating system facilities such as signal processing or event handling On embedded systems C and assembly are common languages for threading libraries since these are the languages that are most often used for application development Although such low level languages are very powerful they do very little to prevent programming errors especially when manipulating complex data structures via pointers When such errors occur in thread schedulers they often take the form of a thread coexisting on two different data structures at once essentially duplicating a thread or of a thread s reference being lost by the scheduler Even in higher level languages these types of mistakes are often easy to make and they are often very hard to track down and debug since the exact scheduling conditions that trigger the bug may not be easy to reproduce CATAPULTS provides a very simple yet seemingly sufficient for most schedulers in our expe rience set of data structures for storing collections of threads threadrefs stacks queues and lists that are sortable on any thread attribute e g priority All of these containers are unbounded except for threadrefs which are bounded containers with a single slot For conv
35. ibrary is the more places scheduling code will have to be modified For example although the GNU Pth threading library isolates its main scheduling routine in a file pth_sched c making any large changes to the Pth scheduler such as replacing Pth s new thread queue ready queue etc with different data structures and thread organizations can require code modifications to more than 20 files This is because other routines in the Pth library such as I O functions that place blocked threads onto a waiting queue make assumptions about how threads will be scheduled and manipulate the scheduling data structures directly Furthermore if the developer wishes to actually change the data structures used to store threads e g add a new queue for threads of a specific type the modifications required become even more invasive Simpler threading libraries such as those used by embedded systems may require fewer changes especially if they do not handle other programming concerns such as I O and event handling but replacing a scheduler is usually still non trivial since all uses of the scheduling data structures must be tracked down and possibly modified With CATAPULTS we overcome this problem by allowing the developer to write the scheduler specification independently from the rest of the threading library The CATAPULTS translator will then weave the user specified scheduling code into the rest of the threading library to create a version of the library t
36. ing nodes Some threading libraries may require that the 29 application specify this explicitly at thread creation time but schedulers for libraries that do not will require logic to dispatch new threads The dispatch block shown in Figure has access to all globally distributed variables but cannot access the local scheduling variables on any node After the dispatch block is defined node arrival and node departure blocks may be defined to Dispatch is like an event but doesn t operate in any specific node It can only access distributed globals not any node local data dispatch node tmp if decompress gt 0 decompress gt tmp else if numcrunch gt 0 numcrunch gt tmp select tmp tmp gt busynodes Figure 14 Thread dispatch block specify how new processing nodes should be managed As with the dispatch block these blocks are unnecessary for schedulers targeted at frameworks that handle thread placement at the application level Examples are shown in Figures I5 and 16 Finally the main distributed scheduler specification arrival n if n is fastdisk n gt database else if n is lowram amp amp lowram MAXDECOMPRESS n gt decompress other cases else n gt sensors Figure 15 Thread arrival block concludes with default implementations for event or query handlers 9 1 3 Implementation Optimization The usage of globally distributed scheduling va
37. it needs to perform a scheduling action or get information from the scheduler see Section 6 3 The difference between an event handler and a query handler is the type of action performed Event handlers are used when the base threading library is directing the scheduler to perform a specific action e g suspend this thread Event handlers are intended to perform side effects by manipulating the scheduler s global data structures they return no value In contrast query handlers are used when the internals of the base threading library need to know something about the scheduler e g how many threads are currently in the system query handlers return a value and must not have any side effects Figures pf contain a subset of the example scheduler s event and query handlers the full set of event and query handlers is not reproduced here to save space After writing an entire specification such as that in Figures 1 through the developer then runs the CATAPULTS translator on the specification It produces a scheduler targeted for a particular backend The developer then links that scheduler together with the application code If the developer decided to prototype simulate the system on a regular PC before actually de veloping the embedded Dynamic C version the scheduler specification could be passed through a event init last display 0 event newthread t t gt NQ Place t on new thread queue Figur
38. le task is now distributed throughout the program 2 1 3 Multi threaded Applications The use of threads is the third main concurrency model and provides a compromise between the ease of programming of process based applications and the high performance of event driven applications Threads use fewer system resources and usually have faster context switches than processes which allows them to scale better as the concurrency of a system increases threaded applications can even match or exceed the performance of event driven applications 6 Threads also provide a program ming model similar to that of process based applications which makes threaded applications much easier to develop and maintain than event based applications Threads can be further subdivided into kernel level threads and user level threads as discussed in Section 2 2 2 2 Types of Threads Threads can be divided into two categories kernel level threads and user level threads Kernel level threads or Light Weight Processes as they are sometimes called are created by system 1 The use of asynchronous I O AIO could overcome this limitation but at this time AIO is still immature and poorly supported by many popular operating systems calls into the OS kernel and are scheduled preemptively by the kernel This provides some of the same benefits as using processes each thread can perform I O or other blocking operations without suspending the entire application
39. lers and measure the performance impact that application specific schedulers have on pro grams running on these frameworks Distributed threading frameworks is a very broad target so multiple frameworks and applications will be examined Kaffe JVM Kaffe 24 is a clean room implementation of the Java virtual machine Kaffe would be a good framework to update for use with CATAPULTS because by default it uses its own internal threading library although native Pthreads are also supported under Linux and it also includes an RMI implementation for developing distributed applications Since Kaffe is highly portable and the threading routines are already modularized 25 extending it to use CATAPULTS should be straightforward ports already exist for several embedded systems and RTOS s A large number of distributed applications already exist that use Java RMI so it should be easy to find some good applications to use as benchmarks 31 DesCaRTeS Runtime DesCaRTeS 26 is a platform for developing distributed applications on a network of embedded processors DesCaRTeS is currently written using the multithreading features of the Dynamic C programming language 16 but we know that it should be relatively straightfor ward to port this cnthread lib a cooperative multithreading library we developed for the Rabbit processors and that already has a traditional CATAPULTS backend described in Section 6 3 Al though no real world applications exi
40. n a multithreaded Internet server or the number of calculations completed in a scientific application In contrast registering per thread instance variables with the scheduler is useful for tracking informa tion that the application stores for each thread Per thread instance variables are useful not only for monitoring information that the application would be tracking anyway e g the number of packets that have been processed on a network connection for an Internet server but also for specifically directing scheduler behavior from the application e g threadclass declared in Figure 2 and used in Figures 6 and Although registering variables requires some modification to the base application and removes the transparency of CATAPULTS the modifications required are minimal only a single registration statement is necessary near the beginning of the program for each variable that is to be registered with the scheduler As a real world example when we modified our existing CoW web server 17 to use a CATAPULTS generated scheduler for GNU Pth as described in Section 7 1 we only had to add the following code near the beginning of main Catapults specific register of open files with scheduler pth_schedvar_register noof amp cow_noof and the following code to the routine that spawns the thread pool Register our per thread information with the scheduler pth_threadvar_register isresponding OFFSETOF thrrespo
41. nages collections of threads available for scheduling To facilitate this kind of organization CATAPULTS will make some of the thread collection types stacks queues pstacks and pqueues available for organizing processing nodes A node type will be provided for maintaining references to individual nodes similar to the way threadref works and node objects can also be organized into arrays Further Extensions Further distribution oriented extensions to CATAPULTS can be imple mented as additional event handlers in the CATAPULTS scheduler For example load balancing is often a concern in distributed systems and may or may not be handled explicitly at the application level One possible way to accomplish implicit load balancing would be to introduce work stealing event handlers A scheduler could maintain two separate ready queues one of regular fixed threads and one of stealable threads A new steal event could be added to allow the scheduler to select a thread to migrate to another idle node and the schedule event handler could be extended to steal from other nodes if no useful work is available on the current node 28 9 1 2 Example This section provides an example of what a CATAPULTS specification for a distributed scheduler might look like As mentioned in Section 9 1 1 a complete distributed scheduler consists of several node families each of which is defined in a separate file The general scheduler
42. nd yield immediately this eliminates the additional overhead of useless hardware I O but still incurs the overhead of an unnecessary context switch As shown in Table B3 the simulation completed almost 10 seconds faster when using the CATAPULTS generated scheduler a 12 696 speedup 7 3 Embedded CoW Web Server We also adapted CoW I7 a cooperatively multithreaded web server to use a CATAPULTS sched uler The version of CoW that runs on the Rabbits uses the standard Dynamic C first come first serve scheduler We identified that we could obtain better performance if the scheduler was tailored to the specific requirements of CoW In particular the more quickly HTTP requests are processed and flushed from the system the faster new requests can be accepted By giving greater priority to request handler threads that are near completion the overall throughput of the system can be significantly increased Figures 9 and 10 give the complete CATAPULTS scheduler specification for CoW To give pri ority to handler threads that are near completion the scheduler imports the variable bytesleft from each thread The scheduler represents the ready list as a pqueue see Section sorted by bytesleft internal This scheduler variable bytesleft internal mirrors the application vari able bytesleft it is updated each time a thread is context switched out by the switch out event This extra level of indirection is necessary because pqueues and pstacks
43. nse isresponding 16 pth_threadvar_register bytesleft OFFSETOF thrresponse bytesleft Create the Catapults scheduler data scheddata malloc sizeof thrresponse if scheddata NULL perror Failed to allocate scheduler data exit errno scheddata gt isresponding 0 scheddata gt bytesleft 0 pth_set_userdata tid scheddata A similar number of lines of code would have to be changed for other typical applications If an application tries to link an application variable to a scheduler variable that does not exist in the currently loaded scheduler the registration command will cause a fatal error or will be silently ignored as per user specified option The silent choice makes it easier to swap in and out schedulers that make use of different application variables For example a web server may try to register the number of open network connections with the scheduler but if it is executed with a scheduler that does not make use of this information the registration command will be ignored 5 3 Verbatim Statements and Expressions Although CATAPULTS portability between target languages and libraries is generally considered beneficial it is sometimes a disadvantage such as when the programmer wishes to make a library or system call that is not exposed through the CATAPULTS language For example a scheduler that uses a random number generator to make some of its scheduling decisions will need some
44. oped with the GNU Pth threading library described in Section 6 2 When initially developed CoW relied on the generic first come first serve scheduler built into the Pth li brary Although this scheduler provided adequate performance CoW s performance was comparable to that of other popular servers such as Apache I and Boa 21 we realized that better performance could be obtained if the scheduler was tailored to the specific requirements of CoW We identified two cases in which more intelligent scheduling decisions could result in superior performance 20 Table 1 Pages served by CoW running without and with a CATAPULTS application specific sched uler Generic Scheduler CATAPULTS scheduler Improvement 5 handlers 90K files 8658 10101 16 6796 20 handlers 90K files 9807 10983 11 99 100 handlers 90K files 10917 10653 2 48 5 handlers 250K files 5604 5760 2 78 20 handlers 250K files 5478 5109 7 22 100 handlers 250K files 5184 5406 4 28 5 handlers 4 byte files 89883 86343 3 94 20 handlers 4 byte files 202938 295425 45 57 100 handlers 4 byte files 463093 464457 0 3 e If CoW is processing a large number 1024 on Linux systems of HTTP requests simultane ously it will exhaust its supply of file descriptors and be unable to accept any more network connections In this case context switching to the thread that accepts new connections is undesirable since no actual work
45. ppended to its contents and the resulting filename will be loaded via the dlopen library call If the PTH SCHED variable is not set the original builtin scheduler will be used Loading schedulers at runtime from shared libraries introduces a slight amount of additional overhead since scheduling code that was embedded directly in the Pth library before is now replaced with an invocation of a function pointer that is loaded from a shared library Fortunately we found the overhead of calling scheduling code via a function pointer rather than executing it directly to be minimal and we measured no decrease in performance after instrumenting the Pth library The one disadvantage of using this shared library technique is that it poses a security risk for applications that run setuid A malicious user could add any code he wanted to the scheduler after CATAPULTS has translated it to C code but before it has been compiled into a so library This code would then be executed with target user s permission when the application tried to perform regular scheduling actions This security problem is somewhat similar to the one that arises when LD PRELDAD is honored for setuid programs a situation that is no longer allowed by most modern operating systems 19 However we expect that CATAPULTS will be most applicable to server type applications these types of applications generally do not need to be started by regular users and are unlikely to be marked as setui
46. red out of them So in general less than half of this kind of container checks can be done statically at compile time only when a container has been previously operated on by the current event or query handler can any information about its contents be inferred 4 3 Portability One of the primary goals of CATAPULTS is to make it as portable and retargettable as possible CATAPULTS is not restricted to generating code for any one threading library different code generation modules can be plugged in to allow generation of scheduling code for different libraries or even different languages At the moment we have backend code generators for Dynamic C a C like language with threading features that is used to program ZWorld s embedded Rabbit controllers 16 and GNU Pth 10 a powerful cooperative threading library for PC s writing more backends for other languages will be straightforward The fact that CATAPULTS can compile to different target languages and libraries and has backends for both embedded systems and regular PC s is especially advantageous when simulat ing or prototyping a system on a PC and then re writing a final version on the actual embedded hardware CATAPULTS allows the programmer to develop a scheduler once and then assuming 13 a CATAPULTS code generation module exists for both languages simply recompile to generate schedulers for both the prototype and final system even though they are using different languages an
47. rently working with GNU Pth Dynamic C etc It is likely that other frameworks and scheduling domains will introduce concepts or requirements that we had not foreseen when we first developed the CATAPULTS language Rather than de velop a separate version of CATAPULTS with a different grammar for each of these situations e g CATAPULTS generic CATAPULTS rt CATAPULTS distrib etc we will explore the possibility of using metaprogramming to generate instances of the CATAPULTS grammar that are specialized for the target framework When CATAPULTS is first ported to a new threading framework the porter should be able to write a specification for the new framework e g yes a time datatype is required no event failures are not required etc and have the grammar and translator frontend automatically generated Although this approach somewhat decreases the portability of CATA PULTS schedulers it is unlikely that the same scheduler would make sense for applications running on threading frameworks with such different structures For example our existing schedulers for the Pth threading library would not make sense in a distributed framework since they include no concept of separate processing nodes or distribution The frontend and backend of the translator would still be separate since many frameworks might be able to make use of the same frontend as 34 Pth and Dynamic C already do 10 Timeline This timeline is only tentative as it i
48. riables is likely to vary significantly between appli cations Some applications will use globally distributed variables heavily and relatively uniformly among nodes Other applications will use certain variables primarily at a single node and not at all at other nodes Due to these different usage patterns no one method of distributing the values is ap 30 arrival n 1 if n is fastdisk n gt database else if n is lowram amp amp lowram MAXDECOMPRESS 1 total ram n freeram n gt decompress Other cases else n gt sensors Figure 16 Thread departure block propriate for all types of applications To provide the best performance possible the CATAPULTS compiler should be able to analyze where and how often global variables are used by a scheduler s node families and generate appropriate code for distributing the value For example if all nodes need to use a scheduling variable frequently but only a single node updates the variable s value a broadcast method may be the best way to distribute changes to the value On the other hand if the value is used primarily by a single node the value can be stored at that node and other nodes will issue requests if they occasionally need the value 9 1 4 Evaluation In order to evaluate the benefit of using CATAPULTS to generate distributed schedulers we will need to update existing distributed threading frameworks to make use of CATAPULTS generated schedu
49. rrival block a a e i ai d a a ee ee 30 16 Thread departure block aoaaa a 31 List of Tables 1 Pages served by CoW running without and with a CATAPULTS application specific See bs PP bee ee Pe el oe UIT 21 2 Average response time in ms of CoW running without and with a CATAPULTS cp 22 3 Comparison of CATAPULTS threading library and generic cmthread lib 22 iv 1 Introduction Multi threaded designs have become a popular architecture for highly concurrent applications such as Internet servers Well known examples of such multi threaded applications include the Apache 2 0 web server 1 MySQL 2 and OpenLDAP 3 Threads are a popular model for these types of applications because they offer an intuitive programming style similar to multi process applications while providing performance similar to that of event driven applications Although highly concurrent multi threaded applications scale much better than their multi process counterparts performance can suffer if the algorithm used to schedule the multitude of threads makes poor decisions This paper introduces CATAPULTS a system for developing application specific schedulers for user space threading libraries Modifying schedulers for threading libraries has historically been a difficult undertaking because scheduling is a cross cutting issue and the relevant code is usually not collected as an easily pluggable component Our solution is to provide a domain speci
50. rs web mail database etc to have similar scheduling requirements and obtain a similar performance benefit from application specific scheduling Although the scheduling requirements for embedded applica tions are much less consistent we believe that our embedded test of CATAPULTS to be representa tive of a typical embedded system so our very positive results are encouraging Performance gains may vary depending on the complexity of the scheduling algorithm required by a given system and by the penalty incurred by inefficient thread scheduling but we have shown that developing cus tom thread schedulers with CATAPULTS is relatively straightforward and can provide significant performance gains Moreover using CATAPULTS provides additional safety and portability 25 9 Future Work CATAPULTS success with traditional centralized schedulers has inspired us to explore the exten sion of CATAPULTS to other scheduling domains such as realtime applications and distributed applications The modifications necessary to facilitate these extensions have also led us to explore the use of a metaprogramming model for generating CATAPULTS grammars specialized for a given threading package GNU Pth cmthread lib etc and application class classic multithreaded re altime distributed etc Section 9 1 discusses our plans for using CATAPULTS with distributed systems Section 9 2 discusses the obstacles that will need to be overcome to make CATAPULTS
51. ry on which CATAPULTS output is based is a total of 457 lines of Dynamic C code Although space is a scarce resource on embedded systems this size increase is quite reasonable considering how much more sophisticated the generated scheduler is than the simple first come first serve scheduler in cmthread lib The CATAPULTS library also links with another 517 line auxiliary library that contains implementations of the various thread container types provided by CATAPULTS The Dynamic C compiler will only link in the functions from this auxiliary library that are actually used by the specific application so only a couple hundred of these lines are likely to be included in any given application So comparing only lines of code is somewhat misleading comparing code size is more useful After compiling the simulation application along with the threading library the total code size downloaded to the Rabbit processor was as shown in Table 3 23808 bytes when the generated CATAPULTS library was used as compared to 21120 bytes when the generic cmthread lib was used i e a 12 7 increase in size To measure the performance difference between the CATAPULTS generated scheduler and generic cmthread lib scheduler we executed the control simulation until a total of 10000 execu 22 tions of the calculation threads had run and then measured the total runtime When using the generic cmthread lib we allow threads to notice that they have no work to do a
52. s we also intend to explore the use of metaprogramming to create dialects of the CATAPULTS language specialized for a specific application domain The rest of this paper is organized as follows Section 2 presents background material on de sign of concurrent applications and on threading alternatives and gives an overview of related work Section 3 provides an introductory example of a CATAPULTS scheduler Section 4 describes the purpose and design of the CATAPULTS system Section 5 provides details on the CATAPULTS domain specific language Section 6 describes how the components of the system were implemented Section 7 describes our experience using CATAPULTS on several applications Section 8 provides a brief summary of our existing work and preliminary results Section 9 discusses future directions for CATAPULTS including distributed application scheduling realtime scheduling and using metapro gramming to generate a CATAPULTS grammar and parser that contains the exact subset of thread features required by a particular scheduler target Section 10 provides a timeline for completion of the future work 2 Background and Related Work 2 1 Models For Highly Concurrent Applications Concurrent applications such as Internet servers or multimedia applications generally follow one of three main design strategies multi process event driven or multi threaded 2 1 1 Multi Process Applications A multi process application
53. s difficult to determine the amount of time necessary for each phase of the project The numbers below are estimates of the amount of time required to perform background research and to implement test and benchmark each feature They do not include the time required to write conference or journal papers or to maintain the current implementation Task Start Month End Month Distributed Frontend 1 1 Distributed DesCaRTeS Backend 1 4 Distributed Kaffe Backend 5 9 Distributed Rthreads Backend 10 13 Realtime Scheduling 14 21 CATAPULTS Metaprogramming 22 25 35 References 10 11 12 13 14 15 16 17 18 19 20 21 22 The Apache HTTP server project bttp httpd apache org MySQL OpenLDAP itep Tu opeidap org Davide Libenzi dev epoll home page Jonathan Lemon Kqueue A generic and scalable event notification facility In Proceedings of the FREENIX Track 2001 USENIX Annual Technical Conference 2001 R Behren J Condit F Zhou G Necula and E Brewer Capriccio Scalable threads for internet services In Proceedings of the Ninteenth Symposium on Operating System Principles SOSP 2003 Posix threads programming http www llnl gov computing tutorials workshops workshop pthreads S Walton LinurzThreads 1997 http www ibiblio org pub Linux docs faqs Threads FAQ html Ulrich Drepper and Ingo Molnar The native POSIX thread libr
54. scheduler for the CoW web server part 1 of 2 a MySQL specific scheduler First giving priority to threads that have exclusive write locks on database tables will improve performance by quickly unblocking multiple threads that need shared read locks on those tables Second the threads that MySQL uses to implement advanced database features only need to be run when certain database events occur e g a database replication thread need only be run after an update to the database We are just now beginning work on our specification of the MySQL scheduler and hope to have performance results for the next iteration of this paper 24 event switch out t 1 t bytesleft internal t bytesleft t gt RQ event event_raised t t state READY t gt RQ event set_next_thread t t gt next query threads_ready return RQ query threads suspended return SQ query threads_total return IRQI ISQI 1 query is_ready tid return tid state READY query can_switch_to tid return tid state READY event suspend_thread tid tid state SUSPENDED tid gt SQ event resume thread tid 1 tid state READY tid gt RQ Figure 10 CATAPULTS scheduler for the CoW web server part 2 of 2 8 Summary of Current Work Our experience developing a customized scheduler for CoW shows CATAPULTS applicability to Internet servers running on uniprocessor machines We expect most types of serve
55. sent in the system In some distributed applications this code may only be called a single time at the beginning of the program if nodes are static and in other applications it may be called at various points during execution of the application if nodes come and go dynamically e A node departure block that is executed when nodes leave the system during execution e dispatch block that specifies on which node in the system a new thread should be created Some frameworks for distributed applications manage this explicitly at the application level this section will not be required in schedulers for such frameworks On the other hand some distributed applications will not explicitly manage their own distribution e g Pthreads appli cations automatically distributed via the Rthreads framework as described in Section 1 4 leaving this to be handled by the scheduler Node Collections Distributed applications may or may not have a static set of machines onto which they can distribute their threads Even for applications that do assume a static set of machines the optimal distribution of threads to those machines may depend on application specific information that is only available at runtime e g how many threads are already running on a machine how much disk space is left at a node etc This requires that the distribution specification be able to collect and organize collections of available nodes similar to how a scheduler ma
56. sis to track the various invariants described in Section 4 2 Specifically this static analysis is used to track the following information e presence or absence of threads on a container or in a thread reference e failure to store a thread passed as a parameter into a permanent container in an event handler that requires this e g newthread t e failure of a query handler to return a value e failure of an event handler to produce a side effect e code following a dispatch or return statement e statically known variable values As discussed in Section CATAPULTS generates runtime assertions in the generated code for scheduler code that it cannot analyze statically 6 2 The GNU Pth Backend GNU Pth I0 is a relatively advanced cooperative threading library written in C In addition to providing traditional thread facilities create thread suspend thread yield etc Pth also provides thread aware wrappers for I O operations per thread signal handling message ports and more The many advanced features of Pth make it a large complex library and scheduling related code is spread over Pth s many source files To integrate CATAPULTS with Pth we made modifications to allow 18 Pth to load custom schedulers at application startup time from so files When an application linked against our modified version of Pth begins execution it will check the contents of the PTH SCHED environment variable If this variable is not empty so will be a
57. st for DesCaRTeS yet we plan to develop a synthetic benchmark that will simulate the processing and resource requirements of a real embedded control system Rthreads Rthreads 27 is another distributed threading framework that could be extended for use with CATAPULTS Rthreads provides a distributed threading model that is closely related to Pthreads both syntactically and semantically Global variables are transparently shared between processing nodes via a DSM approach A preprocessor is available to transform traditional Pthreads programs into distributed Rthreads programs Rthreads is an excellent target for CATAPULTS because unlike Kaffe and DesCaRTeS distribution does not have to be explicitly defined at the application layer and can be handled at the scheduler level Since Rthreads can translate and run traditional Pthreads programs it should be straightforward to find suitable benchmark applications 9 2 Extending CATAPULTS With Realtime Capabilities Although embedded systems are already a functioning target of CATAPULTS we would like to extend CATAPULTS to provide better support for developing soft realtime schedulers Hard realtime systems generally perform schedulability analysis off line and preallocate resources to tasks which makes them less suitable for the CATAPULTS but soft realtime systems are capable of dynamic scheduling and adaptation which makes them a more suitable match for CATAPULTS Realtime embedded control syst
58. t driven servers run as a single process and use non blocking I O and event notification mech anisms such as select or pollO to internally multiplex control between active connections Because there is no kernel level context switching event driven servers are light weight and perform very well with low memory and CPU usage making them ideal for use on low resource embedded systems However event driven servers suffer from a number of limitations First event driven applications run inside a single OS process which means they can only submit a single I O request to the kernel at a time This prevents them from taking advantage of the operating system s drive head scheduling or multiple hard drives thus damaging performance Furthermore Internet servers such as web servers often need to handle a large number of open network connections since each network connection requires a file descriptor the supply of descriptors available to a single process can be exhausted quickly Another factor that can limit the performance of event driven servers is the event notification model used most event driven servers use select OO which does not scale well over large numbers of connections although it is the most portable mechanism Use of less portable event models such as epoll on Linux 4 or Kqueue on FreeBSD 5 can provide much better performance Finally event driven applications are difficult to develop and debug since the control flow of a sing
59. text switches and additional overhead In our weather monitoring example the slow sensors will be queried for information as often as the fast sensors even though they will not be ready to report information each time Using CATAPULTS can make such an application more efficient It allows the programmer to quickly and easily create a thread scheduler tailored specifically for this application Figures through B show the scheduler specification some minor details are omitted to save space Our example scheduler begins with a thread definition section shown in Figure It specifies what attributes the scheduler should track for each thread In this example only a single attribute state is declared to track the status of a thread i e whether the thread is new running suspended blocked on I O etc Next the scheduler declares which per thread application variables should be imported into the scheduler Importing an application variable into the scheduler allows the scheduler to monitor the variable for changes made by the application and also allows the scheduler to modify the variable s thread int state new running suspended etc Figure 1 Thread attribute declarations contents which is a useful way of communicating information back to the application Per thread variables imported this way can be referenced exactly like regular thread attributes in event handler code Application variable imports are
60. to build Bossa or other kernel based frameworks Other user level scheduling work includes superd 13 a user level daemon that provides coarse grained process scheduling control for Unix systems scheduler activations I4 a framework by which an OS kernel can operate more efficiently with user level threads and QuickThreads I5 a toolkit for building thread packages that allows arbitrary scheduling code Superd focuses on restricting the set of kernel level processes that can run at a given time in order to guarantee that various process classes get specific shares of the processor time Fine grained scheduling of intra application threads is not possible Scheduler activations are a means by which an operating system kernel can provide notifications activations to a multiprocessor aware user level threading library when scheduling decisions are to be performed while still retaining control over physical processor allocation Scheduler activations move all intra application scheduling to user space so this approach to OS design would integrate nicely with CATAPULTS generated schedulers QuickThreads is a toolkit that provides a very low level set of thread operations for context switching between threads Higher level concerns such as scheduling or stack allocation are left to the threading library or application which makes it possible to develop application specific scheduling routines 3 An Introductory Example of a CATAPULTS
61. user level threads are ideal for highly concurrent applications that are meant to run on uniprocessor systems Threading implementations should not be confused with threading interfaces API s The most well known threading API is Pthreads the POSIX threading API 7 The Pthreads can be imple mented either by an operating system kernel or by a user space threading library For example Linux 2 4 used a thread implementation called Linux Threads 8 and exposed via glibc the Pthreads API to user applications In Linux 2 6 the threading implementation was replaced with the Native POSIX Thread Library NPTL 9 but the interface with applications remained the Pthreads API And user space libraries can also provide implementations of the Pthreads API GNU Pth I0 a popular user space threading library can also be compiled with a Pthreads API which makes it easy to run multi threaded applications on different thread implementations 2 3 Related Work Very little work has been done in the area of domain specific languages for writing schedulers The most closely related project is Bossa 11 a system for generating Linux kernel schedulers using a domain specific language Although Bossa is similar in nature to CATAPULTS it aims to solve a different set of problems Since Bossa deals with operating system schedulers instead of application level schedulers its primary focus is on safety rather than performance or expressibility In Bossa
62. ynamic C http www cs ucdavis edu roper Larry Doolittle and Jon Nelson BOA web server 2004 http www boa org Matthew D Roper CATAPULTS scheduler code for CoW webserver http www cs ucdavis edu roper 36 23 David Mosberger and Tai Jin httperf A tool for measuring web server performance In First Workshop on Internet Server Performance pages 59 67 ACM June 1998 http www hpl hp com personal David_ Mosberger httperf ps 24 Kaffe org http www kaffe org 25 Michael Barr and Jason Steinhorn Kaffe anyone implementing a Java virtual machine Embedded Systems Programming Embedded com February 1998 26 Justin T Maris Matthew D Roper and Ronald A Olsson DesCaRTes a run time system with SR like functionality for programming a network of embedded systems Computer Languages Systems and Structures 29 4 75 100 December 2003 27 B Dreier M Zahn and T Ungerer The Rthreads distributed shared memory system In Third International Conference on Massively Parallel Computing Systems MPCS 98 1998 37

- Computer Science @ UC Davis

Contents

Download Pdf Manuals

Related Search

Related Contents