
Catenation and Specialization for Tcl Virtual Machine Performance


Contents

1. …substantially less than a full-blown JIT compiler, but can still improve performance. It eschews expensive optimizations and reuses code from the interpreter, which also reduces semantic errors. The compiler uses fixed-size native code templates and precomputes as much information as possible, and thus runs quickly. Catenation builds on techniques used by Piumarta and Riccardi to implement their selective inlining [15]. This involves making relocated copies of the native code used by the interpreter to implement each virtual opcode. The original technique placed several constraints on which virtual opcodes could be compiled, because not all code (e.g., call instructions to pc-relative targets) is trivially relocatable. These constraints do not apply to most of the short, low-level opcodes of Forth, or of the VM used to evaluate selective inlining (OCaml). However, the constraints present a severe problem for a virtual machine such as Tcl's, which has much higher-level opcodes. We remove the constraints so that all Tcl opcode code can be moved, albeit at the expense of portability. Catenation allows compilation of every instruction in a bytecode program into a separate unit of native code. This property, along with the infrastructure we built to relocate all code, allows us to perform several optimizations. We (1) specialize the operands of bytecode instructions into the operands of the native instructions in the template and employ…
2. …Dynamic native optimization of interpreters. In Proc. of the 2003 Workshop on Interpreters, Virtual Machines and Emulators.
[20] Sun Microelectronics. UltraSPARC IIi User's Manual, 1997.
[21] Tcl Core Team. Tcllib benchmarks (online), 2003. Available from http://www.tcl.tk/software/tcllib.
[22] B. Vitale. Catenation and Operand Specialization for Tcl Virtual Machine Performance. Master's thesis, University of Toronto, 2004.
3. …map of the native code offset for each bytecode.¹ The offset map is used to determine the native destination address for jump instructions, and for exception handling (described below). The second pass then copies the native code for each template. For example, to compile the incr_immed opcode shown in Figure 4, the essential operation consists of looking up the starting and ending addresses of the native code for this opcode in the interpreter by referring to the inst_table…

¹By unit of code we mean an object in the Tcl object system whose type is code. This typically corresponds to a proc (procedure), but might also be an anonymous block of code, such as a command typed interactively at the interpreter prompt, or passed to the eval command, etc.

    typedef void *ntv_addr;

    struct {
        ntv_addr start;
        ntv_addr end;
    } inst_table[] = {
        /* Labels as Values */
        { &&inst_incr_immed_begin, &&inst_incr_immed_end },
        ...
        { 0 }
    };

    #define NEXT_INSTR goto *inst_table[*vpc].start

    /* increment integer object at top of stack by arg */
    inst_incr_immed_begin:
    {
        int incr_amount;
        Tcl_Obj *object = *tosPtr;          /* tos is Top Of Stack */
        /* skip opcode, extract arg */
        incr_amount = ((uchar *) vpc)[1];
        object->int_val += incr_amount;
        vpc += 2;                           /* advance vpc over argument */
    }
    inst_incr_immed_end:
        NEXT_INSTR;

Figure 4: Using gcc's labels-as-values extension to delineate the interpreter case for an opcode.
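To make the copying pass concrete, here is a minimal sketch (ours, not the paper's code) of catenation's second pass, assuming the inst_table of Figure 4; code_buf, inst_offset, and native_offset are hypothetical names:

    #include <stddef.h>
    #include <string.h>

    /* Copy the interpreter's template for each bytecode instruction
     * into the native code buffer, recording the offset map. */
    void catenate(unsigned char *bytecode, int *inst_offset, int num_insts,
                  unsigned char *code_buf, size_t *native_offset)
    {
        unsigned char *emit = code_buf;
        for (int i = 0; i < num_insts; i++) {
            int op = bytecode[inst_offset[i]];
            size_t len = (char *) inst_table[op].end
                       - (char *) inst_table[op].start;
            memcpy(emit, inst_table[op].start, len);  /* copy template */
            native_offset[i] = emit - code_buf;       /* offset map entry */
            emit += len;
        }
    }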
4. …so dynamic execution paths do not look (as they do in this case) exactly like static code in memory. Catenation handles control flow by changing virtual machine jumps into native code jumps. In Figure 3d, the native code for the second virtual instruction (push 3) starts at native address 4. Thus, a virtual jump to virtual pc 2 would be compiled into a native jump to address 4. Catenation is based on the use of fixed-size templates of native code for each virtual opcode. Each template is self-contained, and does not depend on the context of the opcode being compiled. The separate cases of an interpreter largely meet this description, and indeed we generate our templates directly from the interpreter. Following Piumarta and Riccardi [15], we leverage interpreter-oriented extensions in the GNU C compiler to treat C language labels as values delineating the boundaries of native code for each virtual opcode, as shown in Figure 4. Catenating a unit of Tcl bytecodes is a two-pass process. The first pass computes the total size of the resulting native code by adding up the sizes of the templates for each bytecode instruction in the unit. It then allocates a native code buffer of the appropriate size. The size of a template may be slightly larger than the interpreter case, to allow room for instructions we synthesize and emit at the end of a template. These are used for jumps, debugging, and profiling instrumentation. The pass also builds a…
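The jump conversion just described can be sketched as follows (our illustration, assuming a Sparc-style 22-bit word displacement in the branch instruction; native_offset is the hypothetical offset map from the earlier sketch):

    /* Rewrite a copied branch so it targets the native code emitted
     * for the virtual instruction at target_vpc. */
    void compile_virtual_jump(unsigned char *code_buf, size_t *native_offset,
                              unsigned int *branch_word, int target_vpc)
    {
        long target = (long) (code_buf + native_offset[target_vpc]);
        long disp   = (target - (long) branch_word) >> 2;  /* word disp */
        *branch_word = (*branch_word & ~0x3fffffu) | (disp & 0x3fffff);
    }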
5. …the separate initialization pass. While it was still profitable to compile code that was executed many times, the faster technique vastly broadens the applicability of catenation. The final patch type is UPDATE_PC. While we don't maintain the vpc during execution, the nature of catenated code makes it easy to rematerialize. UPDATE_PC is used on the right-hand side of assignments to the interpreter variable vpc, and is patched with the constant value vpc would have had during execution of the original interpreter. In catenated code there is a separate body of native code for every static bytecode instruction, and so the value of vpc is implicitly known. For example, an UPDATE_PC in the native code emitted for the virtual instruction at vpc 5 is patched to the constant 5. We set the value only in exception handling paths in the interpreter code. To conserve space, we will not describe Tcl exception handling here, except to say that stack unwinding and handler resolution is done at runtime, using a map from vpc of exception to vpc of handler. Another map allows us to report source line number diagnostics on uncaught errors. We could have translated all these maps to native addresses, entirely eliminating the vpc in our case, but the current implementation works well, and it is useful to demonstrate rematerialization of vpc, because one can imagine other interpreters or dynamic recompilation systems requiring it.

5.4 Implementation…
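An UPDATE_PC patch, as described above, only has to write a statically known constant into the copied instruction. A minimal sketch (ours; the simm13 immediate format is an assumption about how the compiler sets vpc):

    /* Patch the immediate field of a "set vpc" instruction with the
     * virtual pc known for this instruction's body. */
    void apply_update_pc(unsigned int *inst, unsigned int vpc_const)
    {
        *inst = (*inst & ~0x1fffu) | (vpc_const & 0x1fff);
    }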
6. Catenation and Specialization for Tcl Virtual Machine Performance. Benjamin Vitale (bv@cs.toronto.edu), Department of Computer Science, and Tarek S. Abdelrahman (tsa@eecg.toronto.edu), Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto, Toronto, M5S 3G4, Canada.

ABSTRACT. We present techniques for eliminating dispatch overhead in a virtual machine interpreter using a lightweight just-in-time native code compilation. In the context of the Tcl VM, we convert bytecodes to native Sparc code by concatenating the native instructions used by the VM to implement each bytecode instruction. We thus eliminate the dispatch loop. Furthermore, immediate arguments of bytecode instructions are substituted into the native code using runtime specialization. Native code output from the C compiler is not amenable to relocation by copying; fix-up of the code is required for correct execution. The dynamic instruction count improvement from eliding dispatch depends on the length, in native instructions, of each bytecode opcode implementation. These are relatively long in Tcl, but dispatch is still a significant overhead. However, their length also causes our technique to overflow the instruction cache. Furthermore, our native compilation consumes runtime. Some benchmarks run up to three times faster, but roughly half slow down or exhibit little change.

Categories and Subject Descriptors: D.3.4 [Programming Languages]…
7. …Microsystems SunBlade 100, with a 502 MHz UltraSparc IIe, 640 MB RAM, a 16 KB 2-way set-associative instruction cache, and a 16 KB direct-mapped data cache. The 4-way unified Level 2 cache is 256 KB. It runs the Solaris 8 operating system.

[Figure 8: Performance counter benchmarks. Bar chart ("Execution Time") of cycles relative to the bytecode interpreter for EXPR, MD5, FACT, MATRIX, and IF, with each bar broken down into bc_compile, dispatch, ntv_compile, I-cache stall, and work.]

The Tcl benchmarks are runtime evaluation of a simple arithmetic expression of constants and variables, a hash function involving many bitwise operations, the factorial function, multiplication of two 15x15 matrices, and a long if-then-else tree. The results are depicted in Figure 8. For each benchmark, the bar on the left shows the performance of the benchmark with the original bytecode interpreter. The bar on the right shows the same, but with our catenating compiler. The performance of each benchmark is measured by the total number of execution cycles, normalized with respect to the original bytecode interpreter. Cycles are broken down into those spent on useful work, those devoted to dispatch, those devoted to bytecode compilation and to native compilation, and those wasted in instruction cache stalls. As intended, catenation removes all dispatch overhead…
8. …overhead in all cases. Furthermore, the three optimizations it enables (operand specialization, virtual-to-native branch conversion, and elimination of the virtual program counter) substantially reduce the amount of work cycles in all cases. For three cases, FACT, MATRIX, and IF, the result is a significant improvement in total execution time. Our techniques do not always reduce execution time. Sometimes it is mostly unchanged, or actually increased. There are two main problems, one of which shows up in the EXPR benchmark, and the other in MD5. While not shown, we measured dynamic instruction counts in addition to cycles. In every case besides EXPR, catenation reduces instruction counts, because it saves dispatch and benefits from the subsequent optimizations. EXPR (unbraced), however, requires a large amount of compilation time. In fact, this benchmark is a contrived idiom to force continuous late recompilation due to Tcl semantics. This idiom is well known by experienced programmers, and generally avoided. We include it here to underscore that any workload which spends lots of time recompiling will do poorly in any JIT compiler, including ours. The JIT must regenerate native code each time bytecode is regenerated. Our JIT is relatively fast, but typically requires between 100% and 150% as much time as the bytecode compilation. The MD5 benchmark exhibits a more serious problem. The catenated code does slightly less work than the interpreted…
9. …overhead in Tcl's large opcode bodies, and that its removal via catenation significantly improves performance for some benchmarks. Thus, we believe our techniques are applicable to VMs with large opcode bodies. Furthermore, Ertl and Gregg [8] found that I-cache stalls induced by code growth due to replication are not a major problem. In contrast, in our system we find that I-cache stalls are a problem. This is perhaps not surprising, given the larger opcodes in the Tcl VM. Exploiting the C compiler to build templates for code copying is a clever and largely portable technique. We have extended it, allowing all opcode bodies to be moved, but sacrificed portability. Furthermore, our experience leaves us with the opinion that the approach is too brittle for general experimentation, depending excessively on the compiler and machine architecture. A more explicit code generation infrastructure would be more flexible for exploring the interesting issues surrounding native compilation and optimization of dynamic languages such as Tcl. Catenation implicates the classic inlining tradeoff, and a more selective technique is required, perhaps preferring code expansion to dispatch overhead only where profitable. Mixed-mode execution might facilitate this, and we would like to explore it in future work. A key related question is deciding when to compile and when to interpret. A useful heuristic might be based on the potential correlation between…
10. …threading. Alternatively, in direct threaded code, the bytecode program represents opcodes by the native addresses of their implementations, rather than one-byte opcode numbers, saving the table lookup. This dispatch can be expressed in roughly three machine instructions. Its execution time depends largely on the time to handle the indirect branch to the next opcode, but on typical workloads it averages about 12 cycles on a modern CPU such as the UltraSparc III we use. On our Sparc, switch dispatch requires roughly 12 dynamic instructions, or 19 cycles. We found token threading takes about 14 cycles. If the bodies (the real work) of the typical instruction are small, as with Forth, this overhead can dominate interpreter performance. Even with larger bodies, such as Tcl's, the overhead is significant. For example, the Tcl code to compute the factorial function, shown in Figure 2, uses on average about 100 cycles per virtual opcode. Thus, indirect threaded dispatch accounts for 16% overhead. Even direct threading, at 12 cycles, takes significant time. We thus pose the question: can all dispatch be eliminated? And if so, is this profitable?

3. CATENATION

To remove all dispatch, we must instead execute native code. The typical approach is a full-blown compiler, including an intermediate representation and code generator, which can do much more than simply avoid dispatch. In this section we describe a simpler approach to just avoid dispatch…
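To make the dispatch being measured concrete, here is a minimal direct threaded dispatch loop in C, using gcc's labels-as-values extension; this sketch is ours, not the paper's, and the opcodes are invented:

    #include <stdio.h>

    /* Direct threading: the "bytecode" program stores the native
     * addresses of the opcode bodies, so NEXT is a load plus an
     * indirect jump (roughly three machine instructions). */
    void interpret_direct(void)
    {
        static void *program[] = { &&op_inc, &&op_inc, &&op_halt };
        void **vpc = program;        /* virtual program counter */
        int acc = 0;

    #define NEXT goto **vpc++

        NEXT;
    op_inc:
        acc += 1;
        NEXT;
    op_halt:
        printf("%d\n", acc);
    }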
11. …stream. Among other things, this will reduce the need to maintain the virtual program counter. Catenated code provides the foundation for another technique, one that removes the fetches. Next we describe this technique, which we call operand specialization. The implementation of a virtual opcode in the original interpreter is generic, in the sense that it must handle any instance of that opcode, in any location in any bytecode program, and with any valid values for operands. After catenation, on the other hand, there is a separate instance of native code implementing each static bytecode instruction in the program. During code generation we know the operands of each instruction, and can treat them as runtime constants. We specialize a template with the values of these constants. Essentially, we are compiling bytecode instruction operands into native instruction operands. Now we will describe how to enhance the templates so they are appropriate for specialization. Given a virtual opcode with operands, one of the first tasks of the interpreter body implementing that opcode is to fetch and decode the operands. They are usually placed in variables. We remove the interpreter code that implements this fetch, and replace the variable with a magic number of an appropriate size. At specialization time, we substitute the magic number with the real operands from the bytecode program. We include…

    #ifdef INTERPRET
    #define MAGIC_OP1_U1_LITERAL codePtr…
12. …targets, and thus the processor's branch predictor does poorly, because it uses the address of the branch instruction as the main key when predicting the target [8]. Interpreters can employ many other dispatch techniques [3, 6, 7]. In the function table approach, each opcode body is a separate function. Dispatch consists of a for loop looking up the address of each function in a table, and an indirect function call. By making the table lookup explicit in the code, instead of letting the compiler generate it as with switch, it can sometimes be more efficient, avoiding the spurious bounds check, for example. However, the function call, with associated stack frame setup, register saving, etc., is more expensive than the simple jump required in switch.

    # compute n!, i.e., factorial function
    proc factorial {n} {
        set fact 1
        while {$n > 1} {
            set fact [expr {$fact * $n}]
            incr n -1
        }
        return $fact
    }

Figure 2: Tcl code to compute the factorial function.

Another approach is to remove the dispatch from the external for/switch loop, and instead place it at the end of each opcode body. The interpreter community refers to this as shared dispatch. This saves the jump from the end of each body to the single dispatch code. More importantly, it gives the branch predictor many static indirect jumps to work with, dramatically improving its chances of success. If switch-like table lookup dispatch is placed in each opcode, this is known as token threading…
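For contrast with Figure 1's switch loop, here is a minimal function table dispatch in C (our sketch, with invented opcodes), following the description above:

    /* Each opcode body is a separate function; dispatch is a table
     * lookup plus an indirect call per instruction. */
    typedef void (*op_fn)(void);

    void op_add(void);              /* bodies defined elsewhere */
    void op_sub(void);

    static op_fn op_table[] = { op_add, op_sub };

    void interpret_table(unsigned char *pc)
    {
        for (;;)                    /* simplified: no halt opcode */
            op_table[*pc++]();      /* indirect call per opcode */
    }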
13. …released to the research community, and it is no longer actively developed.

8. CONCLUSION AND FUTURE WORK

In this paper we presented techniques that allow us to compile away the bytecode program, acting as a very naive JIT compiler. However, almost all the runtime infrastructure of the interpreter remains, so it may be better to think of this system as an advanced interpretation technique rather than a true compiler. In practice, there is a range of techniques, from simple interpretation to advanced dynamic compilation, which trade off implementation size and complexity, start-up performance, and portability. We offer a new point on this spectrum, and implement it in a non-portable but transparent and complete Tcl interpreter system. Our experimental evaluation indicates that catenation, and the optimizations it enables, improve the performance of several benchmarks, often significantly. On the other hand, for some benchmarks, the I-cache stalls often induced by code growth from catenation degrade performance. The overall effect on a typical microprocessor is that about half our benchmarks speed up, by as much as a factor of 3, but the other half slow down. Ertl and Gregg [9] suggest that dispatch is more of a problem for small opcode bodies, and that the architecture of inefficient popular VMs should be improved before advanced interpretation techniques are applied. On the other hand, we find that dispatch is a source of overhead…
14. …described in more detail below. Refer to Figure 5 to see the interpreter C code for the push opcode, with our conditionally compiled variation to produce a template for specialization. Figures 6 and 7 show the assembly language output by the C compiler for each variation. In addition to removing the load of the operand from the bytecode stream, Figure 7 shows that several related instructions were also removed. In the Tcl VM, the operand of the push opcode is not the address of the object to be pushed. Rather, the operand is an unsigned integer index into the literal table, a table of objects stored along with the bytecode. We treat the index and the table as runtime constants at specialization time, then perform a simple constant propagation. The result is that push is reduced from twelve Sparc instructions (twenty including dispatch) to seven, including one load instruction instead of four. This is an extreme example, because push is much shorter than most Tcl opcodes, whose bodies can average hundreds of instructions. However, it is significant, because push is an extremely common opcode, both statically and dynamically. We also employ this table-lookup constant propagation on the builtin_math_func opcode, which takes a small unsigned integer argument to indicate which of several math functions should be invoked (e.g., sin, sqrt, etc.). Catenation runs very fast, because it requires only copying native code, followed by a few fixups and linking…
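The table-lookup constant propagation amounts to performing the interpreter's load once, at specialization time. A sketch (ours; patch_sethi_or is a hypothetical helper, and the objArrayPtr field name follows Figure 5):

    /* Both the index and the literal table are runtime constants
     * when we specialize, so compute the object's address now and
     * patch it into the template as an immediate. */
    Tcl_Obj *obj = codePtr->objArrayPtr[index];     /* compile-time load */
    patch_sethi_or(template_word, (unsigned long) obj);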
15. …

        unsigned char opcode, *pc = &program[0];
        int sum;

        for (;;) {
            opcode = *pc;
            switch (opcode) {
            case INST_ADD:
                sum = POP() + POP();
                PUSH(sum);
                pc += 1;
                break;
            case INST_SUB:
                ...
                break;
            /* other instruction cases */
            }
        }
    }

Figure 1: Simple interpreter dispatch using C for and switch statements.

…details of our implementation, including coaxing the C compiler into generating templates, and subsequent object code processing. We evaluate the performance of these techniques in Section 6. Finally, in Sections 7 and 8, we discuss related and future work, and conclude.

2. TRADITIONAL INTERPRETERS

The most fundamental job of the virtual machine is to dispatch virtual instructions in sequence. In this section we compare some of the techniques used in practice. A typical dispatch loop is shown in Figure 1. Here, dispatch is implemented as a C switch statement, and kept separate from the implementation of opcode semantics. A C compiler will likely translate this to a table lookup. While efficient in principle, the compiler often includes extra instructions for a bounds check. Furthermore, there is only one copy of this table lookup code, and each opcode body must end with a jump back to it. The two loads (get opcode from program, then look up opcode in the switch table) and indirect branch required are pipeline hazards on typical microprocessors. Worse, there is only one static copy of the indirect branch to dispatch to many possible opcode targets…
16. …opcode case we append a macro for token threaded dispatch, so that the optimizer is constrained to produce templates suitable for catenation, as discussed in Section 3. We redefine operand fetch macros to various magic constants, instead of code to perform actual fetches from the bytecode stream. We use a constant of the appropriate size and type, so that the template is efficient. For example, if a virtual opcode takes a one-byte integer operand, we use a magic constant less than 256. For a four-byte operand, we use a constant that forces the compiler to generate code general enough to load any four-byte operand. On the Sparc, this is a two-instruction sequence. Occasionally, we had to manually introduce uses of the C volatile keyword, or other tricks, to force the optimizer not to propagate the magic constants too deeply into the opcode implementation. Other times, our code rewriting system, discussed below, handles the optimizer's output. At build time, the bodies are assembled into a single C function, with each delineated by labels, as shown in Figure 4. We generate C code to store all these labels in a table indexed by opcode number. Then the function is compiled to assembly, with gcc set to optimize. The resulting assembly is largely suitable for templates, but we must post-process it to undo the work of the C optimizer in two situations. The first occurs when a native branch instruction in the middle of a body branches to the…
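The two-instruction sequence for a four-byte operand mentioned above is the Sparc sethi/or pair shown in Figure 7. A sketch (ours, not the paper's) of rewriting both immediates during specialization:

    /* Patch a sethi/or pair so the register receives val: sethi
     * holds the high 22 bits, or supplies the low 10. */
    void specialize_u4(unsigned int *code, unsigned val)
    {
        code[0] = (code[0] & ~0x3fffffu) | (val >> 10);    /* sethi */
        code[1] = (code[1] & ~0x1fffu)   | (val & 0x3ff);  /* or    */
    }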
17. …and target architectures, some including full system emulation. Where we use magic numbers to identify points for specialization in our templates, QEMU stores operands in global variables, and exploits the ELF relocation metadata generated by the linker. Some unwarranted chumminess is still required, to elide function prologues, etc. CPU opcode instruction bodies tend to be small, with complex implementations outlined to called functions, and thus catenation performs well. Brouhaha [13] is a portable Smalltalk VM that executes bytecode using a direct threaded interpreter built from native code templates. The templates are written in a style similar to function table dispatch, but then, like our system, after compilation the assembly is post-processed to remove prologues and epilogues and make other transformations. Where we post-process using Tcl, Brouhaha uses sed, and does significantly more rewriting. Little runtime rewriting seems required, although neither superinstructions, replication, nor operand specialization is implemented. DyC [10] is a dynamic compilation system based on programmer annotations of runtime constants, which are used in dynamic code generation to exploit specialization by partial evaluation. The authors motivate one of their papers using the example of a bytecode interpreter, in this case m88ksim, a simulation of a Motorola 88000 CPU. The input data for this benchmark is a bytecode program. DyC treats…
18. …be a drop-in replacement for an interpreter, preserving the interactive use of a virtual machine system, especially important for scripting. Unfortunately, a JIT compiler is typically large and complex to develop and portably deploy. Furthermore, the delay for compilation at runtime may interfere with interactive use. In this paper, we present an implementation of a relatively new point in the design space between interpretation and JIT compilation. Our technique, which we call catenation, eliminates the overhead of instruction dispatch in the virtual machine. Catenation creates a native code version of a bytecode program by making copies of the native code used by the interpreter to execute each of the virtual instructions in the original program. The resulting native code has certain properties that enable several optimizations, in particular the removal of operand fetch code from the bytecode implementations (bodies). We have implemented this technique for the Tcl scripting language. Originally, Tcl interpreted scripts directly from source, but in 1995 it evolved to compile scripts to bytecodes on the fly, and then interpret the bytecodes [11]. While the bytecode compiler improves performance significantly, many Tcl applications still demand more speed. Our system extends the bytecode compilation with an additional native compilation step, targeting the Sparc microprocessor. The programming effort required in our approach is substantially…
19. …assembled, and the entire Tcl virtual machine is linked together in the traditional fashion. In the rest of this section, we describe the runtime manipulations of the templates that accomplish compilation of Tcl bytecode.

5.2 Compiler Initialization

During and after catenation, we apply many small changes to the fixed-length templates to complete the process of compilation. These changes fix the code so that it executes correctly after it is relocated, and also handle operand specialization. We call these fixups patches, and refer to the overall process as patching. For example, when a native call instruction targeting a C library subroutine (e.g., malloc) is relocated during catenation, the target will be wrong, because Sparc calls are pc-relative; that is, the target depends on the address of the call instruction itself. A patch is used to correct this problem, by adjusting the target address in the relocated instruction. This patch type actually applies to any pc-relative instruction whose target lies outside the template being moved. A few other patch types are described below. To accelerate catenation in general, and patching in particular, we analyze the native code once, at initialization time. We locate the starting and ending points of the template for each opcode, and find and store the native code offsets, types, and sizes of each patch required by each opcode. Using this information, a patch can be applied very quickly…
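The pc-relative call fixup can be sketched as follows (ours, not the paper's code; it assumes the Sparc call format, whose low 30 bits hold a word displacement):

    /* After a call moves from old_addr to new_addr, adjust its
     * displacement so it still reaches the same absolute target. */
    void fix_call(unsigned int *inst, long old_addr, long new_addr)
    {
        long disp = (long) ((int) (*inst << 2) >> 2);  /* sign-extend disp30 */
        disp -= (new_addr - old_addr) >> 2;            /* compensate the move */
        *inst = (*inst & 0xc0000000u) | ((unsigned) disp & 0x3fffffff);
    }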
20. …Gregg. Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters. In Proc. of PLDI, 2003.
[9] M. A. Ertl and D. Gregg. The Structure and Performance of Efficient Interpreters. Journal of Instruction-Level Parallelism, 5:1-25, 2003.
[10] B. Grant, M. Mock, M. Philipose, C. Chambers, and S. J. Eggers. DyC: an expressive annotation-directed dynamic compiler for C. Theoretical Computer Science, 248(1-2):147-199, 2000.
[11] B. Lewis. An on-the-fly bytecode compiler for Tcl. In Proc. of the 4th Annual Tcl/Tk Workshop, 1996.
[12] P. S. Magnusson, F. Larsson, et al. SimICS/sun4m: A Virtual Workstation. In Proc. of the Usenix Annual Technical Conference, 1998.
[13] E. Miranda. BrouHaHa - A Portable Smalltalk Interpreter. In Proc. of OOPSLA '87, pages 354-365.
[14] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[15] I. Piumarta and F. Riccardi. Optimizing direct threaded code by selective inlining. In Proc. of PLDI, pages 291-300, 1998.
[16] F. Rouse and W. Christopher. A Typing System for an Optimizing Multiple-Backend Tcl Compiler. In Proc. of the 5th Annual Tcl/Tk Workshop, 1997.
[17] M. Sofer. Tcl Engines (online). Available from http://sourceforge.net/projects/tclengine.
[18] SPARC International, Inc. The SPARC Architecture Manual, Version 8, 1992.
[19] G. T. Sullivan, D. L. Bruening, I. Baron, T. Garnett, and S. Amarasinghe.…
21. …interpreted version, but the overall execution time stays the same. As the chart shows, increased I-cache misses defeat the advantages of removing dispatch and improving work efficiency. Now, in many cases catenated code exhibits better I-cache performance, because useful code is tightly packed in memory, without intervening unused opcode implementations. However, catenation causes major code growth: on average, we measure a factor of 30. The expanded code's working set can easily exceed a typical 32 KB I-cache, and consequently I-cache stall cycles can overwhelm the savings from removing dispatch and operand fetch.

6.2 Varying I-cache Size

To further explore the effect of I-cache misses induced by code growth, we ran the entire set of 520 Tcl benchmarks, with the original and catenating interpreters, under the Simics [12] full machine simulator, configured in separate experiments with I-caches of 32, 128, and 1024 KB. A fourth experiment simulated an infinite I-cache, by running with no latency to the external L2 cache. The benchmarks are all quite short, but some solve interesting problems, such as gathering statistics on DNA sequences. At the realistic 32 KB I-cache size, 54% of benchmarks actually run slower using catenated code. The larger 128 and 1024 KB caches slowed 45% and 34% of benchmarks, respectively. Even with the infinite I-cache, 18% of benchmarks slow down. This is because there is not enough execution time to amortize…
22. …between opcode body size and I-cache performance. To study this, we would like to apply our technique to other interpreters, including some with shorter, lower-level bodies than Tcl's. Finally, we would like to explore outlining of large instruction bodies, perhaps by moving them to separate functions and calling these from the body.

9. ACKNOWLEDGMENTS

We thank Angela Demke Brown and Michael Stumm for their ideas and encouragement. Colleagues Mathew Zaleski and Marc Berndl offered stimulating brainstorms. Virtutech Inc. provided an academic license for its excellent Simics full system simulator. Finally, the anonymous reviewers' feedback was invaluable.

10. REFERENCES

[1] J. Aycock. Converting Python Virtual Machine Code to C. In Proc. of the 7th Intl. Python Conf., 1998.
[2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. In Proc. of PLDI, 2000.
[3] J. R. Bell. Threaded code. Communications of the ACM, 16:370-372, 1973.
[4] F. Bellard. QEMU x86 CPU emulator (online), 2004. Available from http://fabrice.bellard.free.fr/qemu.
[5] D. Cuthbert. The Kanga Tcl-to-C converter (online), 2000. Available from http://sourceforge.net/projects/kt2c.
[6] R. B. Dewar. Indirect threaded code. Communications of the ACM, 18:330-331, 1973.
[7] M. A. Ertl. Threaded code (online), 1998. Available from http://www.complang.tuwien.ac.at/forth/threaded-code.html.
[8] M. A. Ertl and D. Gregg.…
23. …exit of the body. The normal target of the branch is one native instruction beyond the end of the case, which is precisely what is required for catenation. However, the optimizer exploits the Sparc delayed branch slot [18], schedules an instruction of the (extraneous) dispatch code after the branch, and changes the branch target to two instructions beyond the end. Suppressing the optimizer's delayed branch scheduler, using a command line switch to the C compiler, was not a reasonable option, because this optimization is important to the performance of Sparc code, and we want to leave most such transformations intact. The second problem occurs when the compiler applies a tail merging [14] optimization to two or more opcode cases. This results in the cases sharing part of their implementation, but catenation requires that each case be self-contained, because later passes may make changes to the code, and these must be separate for each opcode. Using text processing techniques on the assembly output, we de-optimize the delayed branch and tail merging transformations if they have been applied in the places described. Together they affect 14 of 97 opcodes. All build-time manipulation and assembly post-processing for de-optimization is performed with Tcl scripts. These scripts, after bootstrapping, are executed by a Tcl interpreter which incorporates our native code compiler. After post-processing, the templates are assembled…
24. …Languages]: Processors - Interpreters. General Terms: bytecode interpreters. Keywords: virtual machines, just-in-time compilation, Tcl.

1. INTRODUCTION

Many portable high-level languages are implemented using virtual machines. Examples include traditional programming languages, such as Java, and scripting languages, such as Tcl, Perl, and Python. The virtual machine may provide portability, and intimate linkage with a high-level runtime system that is more sophisticated than a typical general-purpose hardware machine; for example, it may provide garbage collection, object systems, and other services. One penalty of the virtual machine approach is lost performance. Because the virtual machine code is interpreted, it runs slower than native code. Scripting systems are usually designed for portability, flexibility, extensibility, and expressiveness, not high performance. However, they are powerful enough that they are used to implement entire software systems, and thus their performance is important. The efforts to address the VM performance problem can be divided into two main camps. Clever interpreter designs, such as threaded code, can make the process of interpretation faster. Alternatively, compilers can translate the virtual machine code into native code and execute that directly. Just-in-time (JIT) compilers make this translation at runtime. Avoiding a separate, slow, static compilation step means that a JIT compiler can try to be…
25. …linking. Specialization consumes more time, running for every operand of every bytecode instruction. It still can run quickly, because our templates are fixed-size, and we can pre-compute the offsets where bytecode operands can be specialized into native code operands. In the next section, we give details of this and other aspects of our implementation.

    add   %l6, 4, %l6             ! incr VM stack pointer
    sethi %hi(operand), %o1       ! object to push *
    or    %o1, %lo(operand), %o1  ! object to push *
    st    %o1, [%l6]              ! store to top of VM stack
    ld    [%o1], %o0              ! next 3 instructions incr
    add   %o0, 1, %o0             !   ref count of pushed object
    st    %o0, [%o1]

Figure 7: Template for the push opcode, compiled from Figure 5 using the SPECIALIZE variation. Note that it is much shorter than Figure 6. The native operands of the instructions marked * are points for operand specialization. The sethi/or instruction pair is the Sparc idiom for setting a 32-bit constant in a register.

5. PREPARING AND USING TEMPLATES

We build our templates using the output of the GNU C compiler, compiling a modified version of the Tcl interpreter. This assembly language output itself requires two small build-time transformations, and is then conventionally linked into the Tcl virtual machine. Finally, at runtime, our catenating compiler is divided into several phases, which include transformations on the templates. In this section we describe each of these steps.

5.1 Building Templates

To each opcode…
26. …employ some constant propagation; (2) convert virtual branches into native branches; and (3) elide maintenance of the virtual program counter, and, if necessary, rematerialize it. In evaluating its performance, we found that our technique improves some benchmarks significantly, but slows others down. We attribute the slowdown to more instruction cache misses, resulting from the code growth caused by catenation. As we discuss in Section 7, we feel this result is interesting in light of recent work by Ertl and Gregg [8]. They find that a replication technique similar to our catenation also causes code growth, but that the code growth rarely causes enough I-cache misses to hurt performance. The key distinction, perhaps predicted by Ertl and Gregg, is that Tcl and many other popular VMs are not efficient interpreters. The large opcode bodies in such VMs suffer less dispatch overhead, and thus gain less from its removal. More importantly, the large bodies, when replicated, cause excessive code growth. The rest of this paper is organized as follows. Section 2 reviews the structure and performance of some common techniques for interpreter dispatch. Section 3 presents catenation, our approach to eliminating dispatch overhead. Section 4 introduces operand specialization, which removes the overhead of bytecode operand fetch. Section 5 provides some…

    enum { INST_ADD, INST_SUB, ... };

    void interpret(unsigned char *program)
    {
        unsigned…
27. …micro-architectural statistics. The second experiment tries to answer questions raised by the first, by running a larger set of benchmarks on both our catenating Tcl interpreter and the original, while varying only the size of the instruction cache. This hypothetical scenario requires a simulation infrastructure, because of the lack of variety in I-cache sizes within generations of the Sparc CPUs. Using CPUs from different generations, which have substantial architectural differences, would confound the results.

6.1 Cycle Counts and I-Cache Misses

The highest precision clock available for our first experiment is the CPU's cycle counter, available using the Sparc performance counters [20], which are present in the sparcv8 instruction set on the UltraSparc II and later CPUs. Two 64-bit hardware registers can each be programmed to collect any one of a wide variety of events, such as cycles, retired instructions, cache misses, branch mis-predicts, etc. To facilitate the experiment, we implemented a Tcl command to collect performance statistics while running arbitrary Tcl code, and separately track bytecode compilation time, native compilation time, and execution time. We can also choose whether to include events during execution of system code on behalf of the application. We ran our benchmarks on an otherwise unloaded machine, and exclude events incurred while the operating system was executing other programs. The machine is a Sun…
28. …confirm it appears the correct number of times. If these assertions fail, the interpreter code must be retooled. This happened during development with a handful of opcodes, because our pattern matcher did not yet recognize all variations of compiler output. The input and output types for each patch can largely be determined from the type of virtual opcode (e.g., virtual branch) or operands, or during the analysis phase (e.g., native pc-relative calls). There is a small table of exceptions, coded by hand, to handle special cases. Our code contains a large number of assertions. During development, we identified the exceptions when a small number of assertions failed.

5.3 Compilation

A code unit is compiled to native code only the first time it is executed. The process runs quickly, because most of the work has been done in the initialization pass. The compilation is simply interpretation of the patches for the catenated program. Each patch can be interpreted very quickly, because it requires only a load from the bytecode stream, possibly another to index through constant tables, sometimes one or two adds to handle pc-relative instructions, possibly a lookup in the virtual-to-native pc map, a few operations for masking and shifting the data into the destination native instruction, and then a store. On average, this requires about 120 us per patch. The initialization pass requires about 4 ms. An initial version of our system did not perform…
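Read that way, compilation is a small interpreter over patches. A sketch of such a loop (ours; the struct layout, patch type names, and the emit_branch helper are assumptions, not the paper's code):

    enum { PATCH_OPERAND_U1, PATCH_JUMP /* UPDATE_PC, calls, ... */ };

    struct patch { int type, code_off, bc_off; };

    void emit_branch(unsigned int *at, int native_off);  /* hypothetical */

    void apply_patches(unsigned int *code, const unsigned char *bc,
                       const struct patch *p, int n, const int *ntv_of_vpc)
    {
        for (int i = 0; i < n; i++) {
            switch (p[i].type) {
            case PATCH_OPERAND_U1:   /* one-byte operand -> immediate */
                code[p[i].code_off] =
                    (code[p[i].code_off] & ~0x1fffu) | bc[p[i].bc_off];
                break;
            case PATCH_JUMP:         /* virtual target -> native target */
                emit_branch(&code[p[i].code_off],
                            ntv_of_vpc[bc[p[i].bc_off]]);
                break;
            /* other patch types handled similarly */
            }
        }
    }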
29. …Implementation Notes

The implementation is divided into several modules, following the structure described above. The pre-computation of templates and patches is implemented in 771 lines of C, containing 462 semicolons. The code generator, which uses the templates to implement catenation and operand specialization, is 581 lines long, or 258 semicolons. While these modules include some profiling instrumentation, an additional 535 lines (240 semicolons) of Tcl extension commands collect detailed statistics when running on real machines. Finally, for the simulation experiments described below, we created 1181 lines (513 semicolons) of statistics extensions to both Tcl and our simulator. As described above, we also made small (typically one- or two-line) changes to several Tcl VM opcode implementations. Excluding these changes and the instrumentation code, roughly two weeks of effort were required to code the template and code generators. However, we spent considerably more time finding and resolving subtle bugs, such as the de-optimizations (requiring 250 lines of Tcl) described in Section 5.1.

6. PERFORMANCE EVALUATION

To measure the impact of catenation and specialization in our implementation, we constructed two sets of experiments. The first measures execution time on a small number of benchmarks from the Tcl performance benchmark suite [21], to determine if our modified interpreter actually improves performance. We capture detailed…
30. …amortize the cost of native code compilation, except in four (less than 1%) of the benchmarks, due to continuous recompilation, as described earlier.

7. RELATED WORK

Ertl and Gregg [8] evaluate the performance of a large number of advanced interpretation techniques. Their focus is the high proportion of indirect branch instructions in an interpreter (Forth) with short opcode bodies, and the predictability of those branches by a pipelined micro-architecture. Their techniques include forming superinstructions and making replicas of code. Their dynamic superinstructions across basic blocks with replication technique is very similar to our catenation, except that they leave some dispatch in place to handle virtual branches, whereas we remove all dispatch. Instead, their key goal is to have more indirect branch instructions, one for each static bytecode instruction, whose behavior and context precisely match the behavior of the bytecode program. This yields much better branch prediction. They also find that I-cache stalls induced by code growth due to replication are not a major problem for their efficient Forth VM, which has short opcode bodies. On the other hand, our VM, with large opcode bodies, encounters major I-cache stalls on many benchmarks. QEMU [4] is a CPU emulator which employs techniques very similar to our catenation and operand specialization. It is more portable than our system, supporting a plurality of host and…
31. …

    #define MAGIC_OP1_U1_LITERAL codePtr->objArrayPtr[TclGetUInt1AtPtr(pc + 1)]
    #define PC_OP(x) pc += (x)
    #define NEXT_INSTR break
    #elif SPECIALIZE
    #define MAGIC_OP1_U1_LITERAL ((Tcl_Obj *) 0x7bc5c5c1)
    #define PC_OP(x)                         /* unnecessary */
    #define NEXT_INSTR goto *jump_table[*pc].start
    #endif

    case INST_PUSH1:
    {
        Tcl_Obj *objectPtr;
        objectPtr = MAGIC_OP1_U1_LITERAL;
        *++tosPtr = objectPtr;
        Tcl_IncrRefCount(objectPtr);
        PC_OP(2);
        NEXT_INSTR;
    }

Figure 5: Source for the interpreter implementation of the push opcode, with a variation to generate a template suitable for specialization.

    add   %l6, 4, %l6        ! incr VM stack ptr
    add   %l5, 1, %l5        ! inc vpc past opcode
    ldub  [%l5], %o0         ! load operand
    ld    [%fp + 0x48], %o2  ! load addr of execution context
    ld    [%o2 + 0x4c], %o1  ! load addr of literal tbl from ctx
    sll   %o0, 2, %o0        ! compute offset into table
    ld    [%o1 + %o0], %o1   ! load from literal table
    st    %o1, [%l6]         ! store to top of VM stack
    ld    [%o1], %o0         ! next 3 instructions increment
    inc   %o0                !   reference count of pushed object
    st    %o0, [%o1]
    inc   %l5                ! increment vpc
    sethi %hi(0x800), %o0
    or    %o0, 0x2f0, %o0
    ld    [%l7 + %o0], %o1   ! rest is dispatch to next instr
    ldub  [%l5], %o0
    sll   %o0, 2, %o0
    ld    [%o1 + %o0], %o0
    jmp   %o0
    nop                      ! branch delay slot unusable

Figure 6: The C compiler's assembly language translation of the code in Figure 5, using the INTERPRET variation.

…several assertions in the compiler initialization code to ensure this results in a suitable template, as described…
32. …dispatch. Consider the problem of interpreting the bytecode program in Figure 3a. The interpreter uses the imaginary native instructions in Figure 3b to implement the push and add opcodes, and instruction dispatch. In Figure 3c, we show in bold the dynamic sequence of useful (i.e., non-dispatch) native instructions executed when interpreting this bytecode program. Now we are set to understand the idea of catenation, which is our technique for improving interpreter performance. If we simply copy into executable memory the sequence of useful work instructions (those shown in bold), we have compiled a native code program with exactly the same semantics as the interpreted version.

    push 2
    push 3
    add

(a) Sample bytecode program. Note that opcode push is used twice, with different operands.

    push:     p1 p2 p3
    add:      a1 a2 a3 a4 a5
    dispatch: d1 d2 d3

(b) Definitions of virtual instruction bodies in native code. Each p, a, etc. represents one native instruction.

    d1 d2 d3 p1 p2 p3 d1 d2 d3 p1 p2 p3 d1 d2 d3 a1 a2 a3 a4 a5

(c) Dynamic sequence of native instructions executed when interpreting the program in (a).

    p1 p2 p3 p1 p2 p3 a1 a2 a3 a4 a5

(d) Static sequence of native instructions emitted by the catenating compiler for the program in (a).

Figure 3: Compiling bytecode objects into catenated copies of native code from the interpreter avoids dispatch overhead.

This new program has no dispatch overhead. Of course, most programs contain branches and loops…
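Comparing (c) and (d): the work instructions survive, and the d1 d2 d3 dispatch sequences disappear. A minimal sketch (ours, not the paper's) of turning a buffer of copied template code into something executable:

    #include <string.h>
    #include <sys/mman.h>

    /* Allocate executable memory and copy the catenated templates
     * into it; error handling omitted for brevity. */
    void *make_executable(const void *catenated, size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANON, -1, 0);
        memcpy(buf, catenated, len);
        return buf;   /* jumping here runs the program of Figure 3d */
    }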
33. …inst_table, and then memcpy'ing the code into the native code buffer. There are a few things to observe about the code in Figure 4. First, the compiled code still has to fetch operands from the bytecode stream. It does this using the virtual program counter (vpc). The code uses the entire interpreter runtime, interfacing with it at the register level. Also note that, while the instruction template ends with the label inst_incr_immed_end, we still include the NEXT_INSTR code after the template proper. This code consists of a traditional token threaded dispatch. Its purpose is to force the C optimizer to build a control flow graph which reflects the fact that any opcode case may be preceded or followed by any other opcode case, which is precisely the situation after catenation, and during normal interpretation. However, even position-independent code output by the C compiler is not intended for this sort of relocation; indeed, considerable transformation of the output is required to make catenation work. We undertake some of these transformations at interpreter build time, and others during catenation. We describe this in some detail in Section 5. By itself, catenation is not a true compilation. Among other things, the resulting code still refers to operands in the bytecode instruction stream. We address this problem in the next section.

4. SPECIALIZATION

We would like to eliminate the fetching of operands from the bytecode stream…
34. …statements. It performs some analysis of the types and locations of objects pushed onto the stack, but no use is made of this information. For Python, the similar 211 system [1] compiles extended basic blocks of Python bytecode into superinstructions, by concatenating the C code for the interpreter cases, performing some peephole optimization, and then invoking the C compiler in a separate, offline step. We are sympathetic to Aycock's suggestions that a VM based on registers and lower-level bytecodes would be more amenable to compilation. The s4 [17] project has experimented with improved dispatch techniques for the Tcl VM, such as token threading, and many of its results have been folded back into the production Tcl release, improving performance. The ICE 2.0 Tcl Compiler project [16] created a commercial stand-alone static (ahead-of-time) Tcl version 7 to C compiler in 1995, and then a later version in 1997 that also targeted bytecode and included a conservative type inference system, setting the stage for classical compiler optimizations. The compiler offered an approximately 30% improvement in execution time over the Tcl 8.0 bytecode compiler. Both ICE compilers were static; that is, they required a separate compile step. This precludes using the compiler as a drop-in replacement for the original interactive Tcl interpreter, an important modality for scripting languages. The source code of the ICE Compilers was never released…
35. …treats the entire bytecode program as a constant and, using an optimization they call complete loop unrolling, accomplishes essentially the same effect as our catenation. The system applies traditional optimizations after partial evaluation. This process is static and quite expensive, and thus might not be appropriate for a dynamic scripting language, which frequently compiles and re-compiles. At static compile time, they specialize their runtime specializer, so it is pre-loaded with most of the analysis for optimization and code generation. This is more general than, but similar to, our system of patches, which pre-computes the necessary fix-ups. They report speedups of 1.8 on m88ksim, but do not discuss the complexities of I-cache and code explosion. Trace-based dynamic re-compilation techniques [2] also have the promise to automatically accomplish effects similar to catenation, and many other optimizations. Sullivan et al. [19] show how to extend Dynamo's infrastructure by telling it about the virtual program counter, so that it is able to perform well while executing virtual machine interpreters, whereas it had previously done poorly. There have been several efforts to improve Tcl performance. The kt2c system [5], while unfinished, uses the bytecode for a given function to build a C file containing a huge superinstruction implementing all the bytecode instructions in the function. It converts virtual jumps into C goto statements…
36. …quickly, essentially in two steps. First, some information is extracted from the bytecode stream, for example a one-byte unsigned integer operand. Then, it may be filtered or manipulated in some way. Finally, it is applied to the output native code template. Thus, we store for each patch an input and an output type, along with its relative offset from the beginning of the template. Input types essentially select the kind of patch necessary, and include, for example, plain operands of various sizes and signedness, destination addresses of virtual jumps, and operands which need translation through the literal table or the builtin math function table. After a patch has fetched and transformed the input data, the data must be placed into the template at the appropriate location. This means changing the operands of one or more native machine instructions. On a RISC processor like the Sparc, there are only a few formats, and indeed only a few instructions, to recognize for patching: loading a small immediate integer constant (one which fits in 13 bits), loading larger constants, calls, and a few branches. In the case of virtual branch opcodes, the patch may also synthesize, rather than rewrite, the native instruction (branch false, branch true, or branch always). The analysis expects certain instruction patterns to appear a certain number of times, usually once for each operand of a given opcode. The magic constant is used to locate the pattern and confirm…
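The pattern location can be sketched as a scan of the template's instruction words for the magic value (ours, not the paper's code; the Sparc format-3 "i" bit and simm13 field are the only encoding details assumed):

    /* Find the byte offset of the instruction whose immediate field
     * holds the one-byte magic constant; -1 means the interpreter
     * code must be retooled. */
    int find_patch_offset(const unsigned int *tmpl, int nwords,
                          unsigned magic)
    {
        for (int i = 0; i < nwords; i++)
            if ((tmpl[i] & 0x2000) &&            /* i bit: immediate form */
                (tmpl[i] & 0x1fff) == magic)     /* simm13 matches magic */
                return i * 4;
        return -1;
    }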
