Rex - A Scanner Generator (J. Grosch)

Contents

1. ...

   #STD# letter (letter | digit) * : {ident = MakeIdent (TokenPtr, TokenLength); return 74;}

Appendix 3: Example Specification of a Modula-2 Scanner in Modula-2

   GLOBAL {
   FROM Strings IMPORT tString;
   FROM StringM IMPORT tStringRef, PutString;
   FROM Idents  IMPORT tIdent, MakeIdent;

   VAR level: CARDINAL;

   PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute);
   BEGIN
   END ErrorAttribute;
   }

   LOCAL   { VAR Word: tString; ident: tIdent; ref: tStringRef; }

   BEGIN   { level := 0; }

   DEFAULT { IO.WriteS (IO.StdOutput, "illegal character "); yyEcho; IO.WriteNl (IO.StdOutput); }

   DEFINE
      digit  = {0-9} .
      letter = {a-z A-Z} .
      cmt    = ... .

   START comment

   RULE
              "(*" : {INC (level); yyStart (comment);}
   #comment#  "*)" : {DEC (level); IF level = 0 THEN yyStart (STD); END;}
   #comment#  ... cmt ... : {}

   #STD# digit + ,
   #STD# digit + / ".."       : {GetWord (Word); ref := PutString (Word); RETURN 1;}
   #STD# {0-7} + B            : {GetWord (Word); ref := PutString (Word); RETURN 2;}
   #STD# {0-7} + C            : {GetWord (Word); ref := PutString (Word); RETURN 3;}
   #STD# digit {0-9 A-F} * H  : {GetWord (Word); ref := PutString (Word); RETURN 4;}
   #STD# digit + "." digit * (E {+\-} ? digit +) ?
                              : {GetWord (Word); ref := PutString (Word); RETURN 5;}
   #STD# ...                  : {GetWord (Word); ...
2. ...

   # define ENDIAN_NONE    0   /* no endian property specified */
   # define ENDIAN_LITTLE  1   /* little endian */
   # define ENDIAN_BIG     2   /* big endian */

5.2.3 Scanner Driver

A main program is necessary to test a generated scanner. Rex can provide a minimal main program in the file <Scanner>Drv.cxx which can serve as a test driver. It counts the tokens and looks like the following:

   # include <stdio.h>
   # include "Position.h"
   # include "<Scanner>.h"

   int main (void)
   {
      int Token, Count = 0;
      <Scanner> Scanner;

      do {
         Token = Scanner.GetToken ();
         Count ++;
   # ifdef Debug
         char Word [2048];
         if (Token != <Scanner>_EofToken) Scanner.GetWord (Word); else Word [0] = '\0';
         WritePosition (stdout, Scanner.Attribute.Position);
         printf ("%5d %s\n", Token, Word);
   # endif
      } while (Token != Scanner.EofToken);
      printf ("%d\n", Count);
      return 0;
   }

5.3 Modula-2

5.3.1 Scanner Interface

The scanners generated by Rex offer an interface given by the following definition module named <Scanner>.md:

   DEFINITION MODULE Scanner;
   IMPORT Position, Strings;
   TYPE tScanAttribute = RECORD Position: Position.tPosition; END;
   PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute);
   CONST EofToken = 0;
   VAR TokenLength : INTEGER;
   VAR T...
3. ...

   RULE
   "BEGIN" : { printf ("BEGIN recognized\n"); }
   END     : { printf ("END recognized\n"); }
   ...     : { printf ("... recognized\n"); }

The scanner generated from the above example specification prints an appropriate message whenever one of the character sequences BEGIN, END, or ... appears in the input. We say a character sequence and a regular expression match if the character sequence has a structure according to the regular expression. In general, the input of the scanner is searched for character sequences which match one of the specified regular expressions, and the associated action is executed. Input characters which are not matched by any regular expression are copied by default to standard output.

The syntax for regular expressions is as follows (see Appendix 1 for a complete definition of the syntax). The productions are given in increasing precedence:

   Reg_Expr : Reg_Expr '|' Reg_Expr
            | Reg_Expr Reg_Expr
            | Reg_Expr '+'
            | Reg_Expr '*'
            | Reg_Expr '?'
            | Reg_Expr '[' Number ']'
            | Reg_Expr '[' Number '-' Number ']'
            | '(' Reg_Expr ')'
            | Character_Set
            | Character
            | Identifier
            | String

A character is matched by a single identical character.

   a            matches the character a
   \t           matches a tab character
   \n           matches a newline character
   \10          matches a newline character (only if ASCII is used)
   ...          matches the character ...
   \0xabcd      matches a Unicode character
   \uabcdef01   matches a ...
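The behaviour described in this excerpt (an action fires whenever BEGIN or END is found, and unmatched characters are copied to the output) can be sketched by hand in C. This is only an illustration of the idea, not the table-driven code Rex generates: the function name `scan_keywords` and the output buffer that stands in for standard output are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Rex): scans `in`, counts occurrences of
   the sequences BEGIN and END, and copies every unmatched character to `out`,
   which stands in for standard output. Returns the number of matches. */
size_t scan_keywords(const char *in, char *out, size_t cap)
{
    static const char *const keywords[] = { "BEGIN", "END" };
    const size_t count = sizeof keywords / sizeof keywords[0];
    size_t matches = 0, used = 0;

    assert(in != NULL && out != NULL && cap > 0);
    while (*in != '\0') {
        size_t k, matched = 0;
        for (k = 0; k < count && matched == 0; k++) {
            size_t len = strlen(keywords[k]);
            if (strncmp(in, keywords[k], len) == 0)
                matched = len;
        }
        if (matched > 0) {
            matches++;              /* the associated action would run here */
            in += matched;
        } else {
            if (used + 1 < cap)     /* default action: echo the character */
                out[used++] = *in;
            in++;
        }
    }
    out[used] = '\0';
    return matches;
}
```

For the input `xBEGINyENDz` the sketch reports two matches and echoes `xyz`.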
4. ... END <Scanner>Source.

The newline character may constitute a token of its own in applications such as dialog programs. As for every other token, Rex needs a lookahead of at least one character to recognize this token. Because input is usually line buffered by the operating system, the user then has to type not just one extra character but a complete extra input line. This behaviour is undesirable. The problem can be solved by modifying the procedure GetLine in the file <Scanner>Source.mi. The variant in the comment "# ifdef Dialog ... # else" adds a dummy character after the newline character to serve as lookahead. The dummy character should be a character that is ignored, such as a blank.

5.3.3 Scanner Driver

A main program is necessary to test a generated scanner. Rex can provide a minimal main program in the file <Scanner>Drv.mi which can serve as a test driver. It counts the tokens and looks like the following:

   MODULE <Scanner>Drv;

   FROM <Scanner> IMPORT BeginScanner, GetToken, GetWord, Attribute, EofToken, TokenLength, CloseScanner;
   FROM Strings   IMPORT tString, ArrayToString, WriteL;
   FROM IO        IMPORT StdOutput, WriteI, WriteC, WriteNl, CloseIO;
   FROM Position  IMPORT WritePosition;
   FROM System    IMPORT Exit;

   VAR Token : INTEGER;
       Word  : tString;
       Debug : BOOLEAN;
       Count : INTEGER;

   BEGIN
      Debug := F...
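The dialog workaround described above (append an ignorable dummy character after a newline so the scanner has its one character of lookahead) can be sketched in C. The function name `add_dialog_lookahead` and its parameters are hypothetical; the real change belongs in the GetLine procedure of the generated source module.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: if the block just read ends with a newline and there
   is room left in the buffer, append a blank as dummy lookahead character.
   Returns the new number of valid characters in the buffer. */
int add_dialog_lookahead(char *buffer, int length, int capacity)
{
    assert(buffer != NULL && length >= 0 && length <= capacity);
    if (length > 0 && length < capacity && buffer[length - 1] == '\n') {
        buffer[length] = ' ';   /* a blank is skipped by the scanner */
        return length + 1;
    }
    return length;
}
```

A caller would have to reserve one spare byte in the buffer, since the sketch leaves the block unchanged when no room is left.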
5. ...

         ... word);
      } while (token != <Scanner>.eofToken);
      scanner.finalize ();
      System.out.println (count);
      ...

6. Usage

NAME
   rex - generator of lexical analyzers

SYNOPSIS
   rex [-options] [-k[124]] [-ffile] [-ldirectory] [file]

DESCRIPTION
   Rex generates program code to be used in lexical analysis of text. A typical application is the generation of scanners for compilers. The generated scanners can handle single-byte input as well as Unicode input. The input file contains regular expressions to be searched for, and actions written in the implementation language to be executed when strings according to the expressions are found. Unrecognized portions of the input are copied to standard output.

   In order to recognize tokens depending on their context, Rex provides start states to handle left context, and the right context can be specified by an additional regular expression. If several regular expressions match the input characters, the longest match is preferred. If there are still several possibilities, the regular expression given first in the specification is chosen.

   Rex-generated scanners automatically provide the line and column position of every token. For languages like Pascal and Ada, where the case of letters is insignificant, tokens can be normalized to lower or upper case. There are predefined rules to skip white space such as blanks, tabs, or newlines, and there is a mechanism to handle include files.
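The disambiguation policy in the description (prefer the longest match, and among equally long matches prefer the rule given first) can be illustrated with a small C sketch. The three hand-written matchers and the name `select_rule` are assumptions for this illustration; a generated scanner reaches the same decision through its automaton tables rather than by trying each rule in turn.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A matcher returns the length of the match at the start of s, 0 if none. */
typedef size_t (*matcher_fn)(const char *s);

static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n])) n++;
    return n;
}

static size_t match_number(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Rules in specification order: keyword, identifier, number. */
static const matcher_fn rules[] = { match_if, match_ident, match_number };

/* Returns the index of the selected rule, or -1 if no rule matches.
   The strict '>' keeps the earliest rule when lengths are equal. */
int select_rule(const char *s, size_t *length)
{
    int best = -1;
    size_t best_len = 0, i;

    assert(s != NULL && length != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    *length = best_len;
    return best;
}
```

With these rules, `if` selects the keyword rule (a tie of length 2 resolved by order), while `iffy` selects the identifier rule because its match is longer.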
6. OPTIONS (continued; the option letters of the entries marked ? did not survive in the source)

   ?        do not optimize table size
            Effects: fast scanner, large tables (factor 1-10), short generation time
   ?        (default) improve table size
            Effects: slower scanner (0-5 %), medium size tables (factor 1-2), medium generation time (factor 1-2)
   ?        suppress warnings
   ?        generate line directives
   ?        do not partition character set into blocks during generation (implies -k1)
   ?        touch output files only if necessary
   ?        print information about ambiguous rules
   ?        print statistics about the generated lexical analyzer
   ?        print help information
   -ffile   specify a file to be used as skeleton for the scanner
   -ldir    specify the directory dir where rex finds its data files

FILES
   if output is in C:
      <Scanner>.h           header file of the generated scanner
      <Scanner>.c           body of the generated scanner
      <Scanner>Source.h     header file of support module source
      <Scanner>Source.c     body of support module source
      <Scanner>Drv.c        body of the scanner driver (main program)
   if output is in C++:
      <Scanner>.h           header file of the generated scanner
      <Scanner>.cxx         body of the generated scanner
      <Scanner>Source.h     header file of support module source
      <Scanner>Source.cxx
      <Scanner>Drv.cxx
   if output is in Modula-2:
      <Scanner>.md
      <Scanner>.mi
      <Scanner>Source.md
      <Scanner>Source.mi
      <Scanner>Drv.mi
   if output is in Ada:
      <Scanner>.ads
      <Scanner>.adb
      <Scanner>source.ads
      <Scanner>source...
7. ...

      ... Last : out Integer) is
      function rRead (File: Integer; Buffer: Address; Size: Integer) return Integer;
      pragma Interface (C, rRead);
      pragma Interface_Name (rRead, "rRead");
   begin
      Last := rRead (File, Buffer'Address, Size);
   end GetLine;

   procedure CloseSource (File: Integer) is
      procedure rClose (File: Integer);
      pragma Interface (C, rClose);
      pragma Interface_Name (rClose, "rClose");
   begin
      rClose (File);
   end CloseSource;

   end <Scanner>Source;

5.4.3 Scanner Driver

A main program is necessary to test a generated scanner. Rex can provide a minimal main program in the file <Scanner>drv.adb which can serve as a test driver. It counts the tokens and looks like the following:

   with <Scanner>, Text_Io, Position, Strings;
   use  <Scanner>, Text_Io, Position, Strings;

   procedure <Scanner>Drv is
      package Int_Io is new Text_Io.Integer_IO (Integer); use Int_Io;
      Token : Integer := 1;
      Word  : tString;
      Debug : Boolean := False;
      Count : Integer := 0;
   begin
      BeginScanner;
      while Token /= EofToken loop
         Token := GetToken;
         Count := Count + 1;
         if Debug then
            WritePosition (Standard_Output, Attribute.Position);
            Put (Standard_Output, Token, 5);
            if TokenLength > 0 then
               Put (Standard_Output, ' ');
               GetWord (Word);
               WriteS (Standard_Output, Word);
            end if;
            New_Line (Standard_Output);
         end if;
      end loop;
      CloseScanner;
      Put (Standard_Output, Count, 0);
      New...
8. RETURN SymSlash RETURN SymLess RETURN SymGreater PrevStat rules RETURN SymColon 7 x PrevStat rules RETURN SymColonMinus Va Attribute Ch 012C RETURN SymChar b Attribute Ch 010C RETURN SymChar ME Attribute Ch 011C RETURN SymChar Nn Attribute Ch 012C RETURN SymChar v Attribute Ch 013C RETURN SymChar NE Attribute Ch 014C RETURN SymChar Mir Attribute Ch 015C RETURN SymChar digit GetWord Word SubString Word 2 Length Word TargetCode n LONGCARD StringToInt TargetCode IF n lt MaxUCHAR THEN Attribute Ch ELSE Attribute Ch DE E H Cs RN SymChar n 0 N O xX hexdigit GetWord Word SubString Word n StringToNumber IF n lt MaxUCHAR TH Attribute Ch ELSE r 4 Length E N n Message number out of range Error Attribute Position INC ErrorCount Word TargetCode TargetCode 16 Rex 61 Message number out of range Error Attribute Position INC ErrorCount Attribute Ch 0 END RETURN SymChar STD set rules AN xXuU hexdigit GetWord Word SubString Word 3 Length Word TargetCode n StringToNumber TargetCode 16 IF n lt MaxUCHAR THEN Attribute Ch n ELSE Message number out of range Error Attribute Position INC ErrorCount Attribute
9. Rex - A Scanner Generator
J. Grosch
DR. JOSEF GROSCH, COCOLAB - DATENVERARBEITUNG, ACHERN, GERMANY

Cocktail: Toolbox for Compiler Construction
Rex - A Scanner Generator
Josef Grosch, Aug. 01, 2006
Document No. 5
Copyright 2006 Dr. Josef Grosch
Dr. Josef Grosch, CoCoLab - Datenverarbeitung, Höhenweg 6, 77855 Achern, Germany
Phone: +49-7841-669144, Fax: +49-7841-669145, Email: grosch@cocolab.com

1. Introduction

Rex generates program code to be used in lexical analysis of text. A typical application is the generation of scanners for compilers. The generated scanners can handle single-byte input as well as Unicode input. Rex stands for Regular EXpression tool. In principle it is a remake of LEX [Les75].

Rex processes a specification containing regular expressions to be searched for, and actions written in one of the target languages C, C++, Modula-2, Ada, Eiffel, or Java to be executed when the regular expressions match. Unrecognized portions of the input are copied by default to standard output.

Rex generates a table-driven scanner consisting of a scanner routine and control tables. The scanner routine implements a tunnel automaton [Gro89] and contains a copy of the specified actions.

The scanners generated by Rex are 5 times faster and up to 5 times smaller than those generated by LEX. It is possible to reach a speed of 1.5 million lines per minute on a SPARC station ELC, including inp...
10. ...

      Scanner   : <Scanner>
      f         : rFILE
      Attribute : ScanAttribute
      ...
      !!f.make_write_from_fp (f.stdout_fp)
      !!Scanner
      Scanner.BeginScanner
      from
         Token := Scanner.GetToken
         Count := 1
         debug
            Scanner.Attribute.Position.WritePosition (f)
            f.putint2 (Token, 5)
            f.putchar (' ')
            f.putstring (Scanner.GetWord)
            f.new_line
         end
      until
         Token = Scanner.EofToken
      loop
         Token := Scanner.GetToken
         Count := Count + 1
         debug
            Scanner.Attribute.Position.WritePosition (f)
            f.putint2 (Token, 5)
            f.putchar (' ')
            f.putstring (Scanner.GetWord)
            f.new_line
         end
      end
      Scanner.CloseScanner
      f.putint (Count)
      f.new_line
      f.close
   end
   end

5.6 Java

5.6.1 Scanner Interface

The file <Scanner>.java contains the class <Scanner>, which offers the following features as default:

   import de.cocolab.reuse.*;

   public class <Scanner> {
      class ScanAttribute implements HasPosition ...
      public ScanAttribute errorAttribute (int token)
      public static final int eofToken = 0;
      public int tokenLength;
      public ScanAttribute attribute;
      public <Scanner> () throws java.io.IOException
      public void beginFile (java.io.InputStream stream) throws java.io.IOException
      public int getToken () throws java.io.IOException
      public String getWord ()
      public String getLower ()
      public String getUpper ()
      public void closeFile () throws java.io.IOException
      public void finalize ()
   }

The procedure getToken is the ce...
11. ... Software Practice & Experience 19(11), 1089-1103, Nov. 1989.
J. Grosch, Efficient Generation of Table-Driven Scanners, CoCoLab Germany, Document No. 2.

7. Implementation

Rex is implemented by a 12,000 line Modula-2 program. The program makes heavy use of a library of reusable Modula-2 modules currently comprising 9,000 lines of code [Groa, Grob]. Of the 12,000 lines of Rex around 4,900 lines are generated by tools:

- 2100 lines for the scanner are generated by Rex itself
- 1500 lines for the parser are generated by the LALR(1) parser generator lark
- 1100 lines for a tree data structure are generated by the abstract syntax tree tool ast
- 250 lines for an attribute evaluator are generated by the attribute evaluator generator ag

How can Rex generate a part of itself before its existence? The scanner has been bootstrapped using LEX. The first version of the scanner was a separate C program generated by LEX which wrote the internal representation of the tokens to a file. A simple hand-written scanner read the tokens from this file during construction of Rex. After Rex was operational it could generate its own scanner in Modula-2.

And how does Rex work? It differentiates between constant regular expressions and non-constant ones as defined in [Gro89]. The non-constant regular expressions constitute a nondeterministic finite automaton. The so-called subset construction algorithm is used for conversion into ...
12. ... ref := PutString (Word); ... (Appendix 3 listing, continued; the remainder of this page is illegible in the source)
13. regTerm regFactor regFactor regFactor regFactor regFactor regFactor Number regFactor Number Number regExpr charSet Char Ident String Number charSet Ft range Char Char Char CsChar N decimal number N hex number r r x X hex digit M fur U hex digit letter letter or digit Ident DottedIdent Ident letter digit mu character 7 decimal_number character Tyf character Na ENT n 5 TNF v V b ENZ r N character octal number decimal number hex number 46 octal_number decimal_number hex_number Rex 0 octal digit digit 0 x X hex digit 47 Rex Appendix 2 Example Specifi cation of a Modula 2 Scanner in C GLOBAL include Memory h include StringM h include Idents h int level 0 void ErrorAttribute Token int Token tScanAttribute Attribute LOCAL char word 256 tIdent ident tStringRef ref A int length DEFAULT Attribute printf illegal character yyEcho printf An DEFINE digit 0 9 letter a z A Z cmt meo END START comment RULE Wen level yyStart comment comment level if level 0 yyStart
14. ...

   #comment# ... cmt ... : {}

The procedure PutString is imported from the module StringM. It is used to store the string representation of some tokens.

   #STD# digit + ,
   #STD# digit + / ".."       : {length = GetWord (Word); ref = PutString (Word, length); return 1;}
   #STD# {0-7} + B            : {length = GetWord (Word); ref = PutString (Word, length); return 2;}
   #STD# {0-7} + C            : {length = GetWord (Word); ref = PutString (Word, length); return 3;}
   #STD# digit {0-9 A-F} * H  : {length = GetWord (Word); ref = PutString (Word, length); return 4;}
   #STD# digit + "." digit * (E {+\-} ? digit +) ?
                              : {length = GetWord (Word); ref = PutString (Word, length); return 5;}
   #STD# ...                  : {length = GetWord (Word); ref = PutString (Word, length); return 6;}

   #STD# ...   : {return 7;}
   #STD# "&"   : {return 8;}
   #STD# ...   : {return 9;}
   #STD# ...   : {return 10;}
   #STD# ...   : {return 11;}
   #STD# ...   : {return 12;}
   #STD# ...   : {return 13;}
   #STD# ...   : {return 14;}
   #STD# ...   : {return 15;}
   #STD# ...   : {return 16;}
   #STD# ...   : {return 17;}
   #STD# ...   : {return 18;}
   #STD# ...   : {return 19;}
   #STD# ...   : {return 20;}
   #STD# "<"   : {return 21;}
   #STD# "<="  : {return 22;}
   #STD# "<>"  : {re...
15. ... return tNumber ...

The above example shows how to handle nested comments in a Modula-2 scanner. The rule for opening comment brackets is recognized in all inclusive start states. The nesting level is increased and we change the start state to the inclusive start state comment with the predefined statement yyStart. Closing comment brackets are recognized only if the scanner is in start state comment. Upon their recognition the nesting level is decreased. Should the nesting level reach zero, the comment is finished and we change the state back to STD using yyStart again. While the scanner is in start state comment, everything except opening and closing comment brackets is skipped by specifying an empty action. The last rule, specifying the structure of decimal numbers, is recognized only in the start state STD. The problem of how to declare the variable for counting the nesting level of comments is solved in section 3.7.

Example:

   START S - T U
   RULE ...

This example declares one inclusive start state S and two exclusive start states T and U. The following table gives for every rule the set of start states where the rule is active.

Table: Start States (the rules of the example and the table entries are illegible in the source)

3.6 Scanner Name

A specification may be optionally headed by a name for the scanner ...
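The nested-comment technique explained above (a nesting counter plus a switch between the start states STD and comment) can be mirrored in a short hand-written C sketch. The function name `strip_comments`, the enum, and the choice to copy non-comment text to an output buffer are assumptions for this illustration; in a Rex specification the same logic is expressed through rules and yyStart.

```c
#include <assert.h>
#include <stddef.h>

enum start_state { STD, COMMENT };

/* Hypothetical helper: copies `in` to `out` without Modula-2 style comments
   "(* ... *)", which may be nested. Returns the nesting level at the end of
   the input; a value above zero means a comment was left open. */
int strip_comments(const char *in, char *out, size_t cap)
{
    enum start_state state = STD;
    int level = 0;
    size_t used = 0;

    assert(in != NULL && out != NULL && cap > 0);
    while (*in != '\0') {
        if (in[0] == '(' && in[1] == '*') {
            level++;                    /* recognized in STD and in COMMENT */
            state = COMMENT;
            in += 2;
        } else if (state == COMMENT && in[0] == '*' && in[1] == ')') {
            if (--level == 0)           /* outermost comment is finished */
                state = STD;
            in += 2;
        } else {
            if (state == STD && used + 1 < cap)
                out[used++] = *in;      /* inside comments: empty action */
            in++;
        }
    }
    out[used] = '\0';
    return level;
}
```

For the input `a(* x (* y *) z *)b` the sketch produces `ab` and returns level 0.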
16. Appendix 4: Example Specification of a Scanner for Rex

   EXPORT {
   FROM Idents   IMPORT tIdent;
   FROM StringM  IMPORT tStringRef;
   FROM Texts    IMPORT tText;
   FROM Position IMPORT tPosition;
   FROM UniCode  IMPORT UCHAR;

   TYPE tScanAttribute = RECORD
           Position : tPosition;
           CASE : INTEGER OF
           | 1: Ident  : tIdent;
           | 2: Number : SHORTCARD;
           | 3: String : tStringRef;
           | 4: Ch     : UCHAR;
           | 5: Text   : tText;
           END;
        END;

   VAR ErrorCount : CARDINAL;

   PROCEDURE ErrorAttribute (Token: INTEGER; VAR Attribute: tScanAttribute);
   PROCEDURE startCode;
   PROCEDURE startCharset;
   PROCEDURE startSet;
   PROCEDURE startRules;
   }

   GLOBAL {
   FROM Strings  IMPORT tString, Concatenate, Char, SubString, cMaxStrLength, StringToInt,
                        StringToNumber, AssignEmpty, Length, ArrayToString, IntToString;
   FROM Texts    IMPORT MakeText, Append;
   FROM StringM  IMPORT tStringRef, PutString;
   FROM Idents   IMPORT tIdent, MakeIdent, NoIdent, GetString;
   FROM Errors   IMPORT Message, Error, Restriction;
   FROM ScanGen  IMPORT Language, tLanguage, Procedures, AppendLine, pGetWord, pGetLower,
                        pGetUpper, pinput, pyyPush, pyyPop;
   FROM Position IMPORT tPosition;
   FROM UniCode  IMPORT MaxUCHAR;

   CONST
      SymIdent      = 1;
      SymNumber     = 2;
      SymString     = 3;
      SymChar       = 4;
      SymTargetcode = 7;
      SymScanner    = 37;
      SymImport     = 39;
      SymExport     = 32;
      SymGlobal     = 6;
      SymLocal      = 31;
      ...
17. Ch 0 END RETURN SymChar STD set rulesf AN ANY GetWord Word Attribute Ch RETURN SymChar ORD Char Word 2 STD set rules Nr An Attribute Ch lt ORD CNN RETURN SymChar STD set rulesf t n f r 26 GetWord Word Attribute Ch RETURN SymChar ORD Char Word 1 NE N N26 charset digit IsChar NOT IsChar GetWord Word IF NOT IsChar THEN Attribute Ch ORD Char Word 1 RETURN SymChar Attribute Number StringTolnt Word RETURN SymNumber charset 0 octdigit IsChar NOT IsChar GetWord Word Attribute Number StringToNumber Word 8 RETURN SymNumber Rex 62 charset digit IsChar NOT IsChar GetWord Word Attribute Number StringToInt Word RETURN SymNumber charset 0 xX hexdigit IsChar NOT IsChar GetWord Word SubString Word 3 Length Word TargetCode Attribute Number StringToNumber TargetCode 16 RETURN SymNumber charset a Attribute Ch ORD 007C IsChar FALSE RETURN SymChar charset NN b Attribute Ch ORD 010C IsChar FALSE RETURN SymChar charset NN t Attribute Ch ORD 011C Is
18. Contents (continued)

   ... Interface
   Scanner Driver
   ...
   5.6.1 Scanner Interface
   5.6.2 Tuning the Scanner Interface
   5.6.3 Source Interface
   5.6.4 Scanner Driver
   Usage
   Implementation
   Differences to LEX
   Appendix 1: Syntax of the Specification Language
   Appendix 2: Example Specification of a Modula-2 Scanner in C
   Appendix 3: Example Specification of a Modula-2 Scanner in Modula-2
   Appendix 4: Example Specification of a Scanner for Rex
   References
19. References (continued)

   ... Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
   [Gro89] J. Grosch, Efficient Generation of Lexical Analysers, Software Practice & Experience 19(11), 1089-1103, Nov. 1989.
   [Groa]  J. Grosch, Reusable Software - A Collection of Modula-Modules, Cocktail Document No. 4, CoCoLab Germany.
   [Grob]  J. Grosch, Reusable Software - A Collection of C-Modules, Cocktail Document No. 30, CoCoLab Germany.
   [Les75] M. E. Lesk, LEX - A Lexical Analyzer Generator, Computing Science Technical Report 39, Bell Telephone Laboratories, Murray Hill, NJ, 1975.

Contents (section titles recoverable from the source, in order)

   Introduction
   Overview
   Specification Language
   Lexical Conventions
   Regular Expressions
   Ambiguous Specifications
   Definitions
   Start States
   Scanner Name
   Target Code
   ...
   Character Set
   Predefined ...
   ...
   Start States
   Scanner Interface
   Source Interface
   Scanner Driver
   ...
   Eiffel
   Source ...
20. The generated scanners are implemented as table-driven deterministic finite automatons.

OPTIONS (the option letters of the entries marked ? did not survive in the source)

   -a      generate all = -ds
   -c      generate a lexical analyzer in C
   ?       generate a lexical analyzer in C++
   -m      generate a lexical analyzer in Modula-2 (default)
   -u      generate a lexical analyzer in Ada
   -e      generate a lexical analyzer in Eiffel
   -j      generate a lexical analyzer in Java
   -d      generate a header file or definition module
   -s      generate support modules: a source module for input and a main program to be used as test driver
   -i      do not predefine rules for skipping of white space
   ?       require explicit definitions for used identifiers
           (default: undefined identifiers are treated as strings)
   ?       do not generate dummy labels (might cause compiler messages such as 'statement not reached')
           (default: generate dummy labels, which might cause compiler messages such as 'label not used')
   ?       reduce the number of generated case or switch labels (might be necessary due to compiler restrictions)
           Effects: slower scanner (2-4 %), larger tables, same scanner size
   ?       use ISO 8 bit code (default: ASCII 7 bit code)
   -k<n>   generate scanner for characters having n bytes (default: 1);
           n > 1 implies -z and disables CHARACTER_SET
   -z<n>   map characters to classes at run time, use an array of n elements (n >= 256, default: 16384)
   ?       optimize table size
           Effects: slower scanner (0-15 %), small tables, long generation time (factor 1-10)
21. ... a default value for the additional properties of the token Token.

The variable Exit refers to a procedure which is called upon an internal error in the scanner. The default procedure terminates the program execution. The variable can be changed in order to achieve a different behaviour.

If the scanner reaches the end of the input it returns the special token called EofToken, which is encoded by 0.

5.3.2 Source Interface

The scanners generated by Rex need a source module for blocked input of characters. Rex can provide a prototype source module which reads from standard input. It is contained in the files <Scanner>Source.md and <Scanner>Source.mi. The definition module in the file <Scanner>Source.md has the following contents:

   DEFINITION MODULE <Scanner>Source;

   FROM SYSTEM IMPORT ADDRESS;
   FROM System IMPORT tFile;

   PROCEDURE BeginSource (FileName: ARRAY OF CHAR): tFile;
   PROCEDURE GetLine     (File: tFile; Buffer: ADDRESS; Size: CARDINAL): INTEGER;
   PROCEDURE CloseSource (File: tFile);

   END <Scanner>Source.

- BeginSource is called from the scanner in order to open files or to initialize any other source of input. If not called, input is read from standard input.
- GetLine is called in order to fill a buffer starting at address Buffer with a block of maximal Size characters. Lines are terminated by newline characters (ASCII = 12C). GetL...
22. FILES (continued)

   if output is in C++:
      <Scanner>Source.cxx    body of support module source
      <Scanner>Drv.cxx       body of the scanner driver (main program)
   if output is in Modula-2:
      <Scanner>.md           definition module of the generated scanner
      <Scanner>.mi           implementation module of the generated scanner
      <Scanner>Source.md     definition module of support module source
      <Scanner>Source.mi     implementation module of support module source
      <Scanner>Drv.mi        implementation module of the scanner driver
   if output is in Ada:
      <Scanner>.ads          package of the generated scanner
      <Scanner>.adb          package body of the generated scanner
      <Scanner>source.ads    package of support module source
      <Scanner>source.adb    package body of support module source
      <Scanner>drv.adb       package body of the scanner driver
   if output is in Eiffel:
      <Scanner>.e            class of the generated scanner
      <Scanner>buffer.e      class of the character buffer for the scanner
      <Scanner>drv.e         class of the scanner driver (main program)
      <Scanner>.txt          tables controlling the generated scanner (ASCII format)
      source.e               support class for input
      attribute.e            support class for the description of properties of nonterminals
      scanattribute.e        support class for the description of properties of tokens
      position.e             support class for the representation of source positions
      rfile.e                support class extending the class FILE
      rsystem.e              support class for system specific properties
   if output is in Java:
      <Scanner>.java         class of the generated scanner
      <Scanner>Drv.java      class of the scanner driver (main program)

SEE ALSO
   J. Grosch, Rex - A Scanner Generator, CoCoLab Germany, Document No. 5
   J. Grosch, Efficient Generation of Lexical Analyzers, ...
23. ... all characters except the newline character.

   ANY     matches all characters except the newline character

Two regular expressions separated by the operator '|' match characters that are matched by the first or by the second regular expression.

   a | b   matches the characters a or b

Two regular expressions following each other with no operator in between match the concatenation of character sequences matched by the single regular expressions.

   ab      matches the character sequence ab

The operator '?' matches a character sequence matched by the preceding regular expression or the empty character sequence. In other words, the specified characters are optional.

   ab ?    matches the character sequences a and ab

The operator '+' matches a character sequence which can be matched by the repetition of the preceding regular expression one or more times.

   a +     matches the character sequences a, aa, aaa, ...

The operator '*' matches a character sequence which can be matched by the repetition of the preceding regular expression zero or more times.

   ab *    matches the character sequences a, ab, abb, abbb, ...

A regular expression followed by a number in brackets matches a character sequence which can be matched by the repetition of the preceding regular expression exactly the number of times specified.

   a [4]   matches the character sequence aaaa

A regular exp...
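The repetition operators described above differ only in their lower and upper bounds, which a small C sketch for a single repeated character makes explicit. The function `match_repeat` and the macro `UNBOUNDED` are hypothetical names chosen for this illustration and are unrelated to how Rex compiles these operators into tables.

```c
#include <assert.h>
#include <stddef.h>

#define UNBOUNDED ((size_t)-1)

/* Hypothetical helper: counts how often character c repeats at the start of
   s, taking at most `max` repetitions. Returns the matched length, or -1 if
   fewer than `min` repetitions are present. */
long match_repeat(const char *s, char c, size_t min, size_t max)
{
    size_t n = 0;

    assert(s != NULL && min <= max);
    while (n < max && s[n] != '\0' && s[n] == c)
        n++;
    return n >= min ? (long)n : -1;
}
```

In this sketch `a ?` corresponds to the bounds (0, 1), `a +` to (1, UNBOUNDED), `a *` to (0, UNBOUNDED), and `a [4]` to (4, 4).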
24. [Fig. 1: Rex Overview. Legend: data flow, compilation, invocation (use), program (executable), module, file, memory area.]

... replaced by any other main program or subroutine, e.g. a parser.

3. Specification Language

The input of Rex consists mainly of three parts:

- code written in the target language to be copied unchanged to the output (see 3.7)
- definitions of named regular expressions and start states (see 3.4, 3.5)
- a set of regular expressions with associated actions written in the target language (see 3.2)

The first two parts are optional. We discuss the three parts in reverse order after introducing some lexical conventions.

3.1 Lexical Conventions

The specification can be written in an unformatted manner. That means white space in the form of blanks, tab characters, and newline characters has no meaning except to separate other items.

Comments are written in the styles of C or C++: text enclosed in /* and */ or from // to the end of line is ignored. Comments may not be nested.

The specification uses a few keywords, which should be escaped if needed as identifiers (see below):

   BEGIN CHARACTER_SET CLOSE DEFAULT DEFINE EOF EXPORT GLOBAL LOCAL NOT RULE RULES SCANNER START

The following special characters are used as op...
25. ... file. Otherwise input comes from memory and the parameter File can be ignored. Lines are terminated by newline characters (ASCII = 0xa). The function returns the number of characters transferred. Reasonable block sizes are between 128 and 8192, or the length of a line. Smaller block sizes, especially block size 1, will drastically slow down the scanner. The end of file or end of input is indicated by a return value <= 0.

<Scanner>_GetWLine is the same as <Scanner>_GetLine for type wchar_t instead of type char.

<Scanner>_CloseSource is called from the scanner function <Scanner>_CloseFile at end of file or at end of input, respectively. It can be used to close files.

The functions <Scanner>_BeginSource and <Scanner>_CloseSource can be called in a nested way, for example in order to handle include files. The encoding and the endian property of the input stream are stacked. Therefore, after a call of <Scanner>_CloseSource, the properties of the previous input stream are restored.

The function <Scanner>_SetEncoding can be called by the user in order to specify the encoding and the endian property of the input stream. The arguments have to be values as defined below. This function has to be called after the function <Scanner>_BeginSource. If neither little endian nor big endian is specified, then the endian property of the current system is assumed to hold for the input ...
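A GetLine-style function that serves input from memory, as mentioned at the start of this excerpt, might look like the following C sketch. The struct `mem_source` and the function `mem_get_line` are hypothetical names; the sketch only mirrors the contract stated above (fill the buffer with a block of at most Size characters and return the number transferred) and returns 0 once the input is exhausted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source: text plus a read position. */
struct mem_source {
    const char *data;
    size_t length;
    size_t pos;
};

/* Copies the next block of at most `size` characters into `buffer` and
   returns the number of characters transferred; 0 signals end of input. */
int mem_get_line(struct mem_source *src, char *buffer, int size)
{
    size_t remaining, n;

    assert(src != NULL && buffer != NULL && src->pos <= src->length);
    if (size <= 0)
        return 0;
    remaining = src->length - src->pos;
    n = remaining < (size_t)size ? remaining : (size_t)size;
    memcpy(buffer, src->data + src->pos, n);
    src->pos += n;
    return (int)n;
}
```

Called repeatedly with a 4-byte buffer on the 6-character text `hello\n`, the sketch returns 4, then 2, then 0.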
26. ... of the scanner as well as during generation time of the table. In order to make a scanner work, these two internal representations have to agree. This is no problem as long as a scanner is generated on a machine with the same encoding of characters as the machine where the scanner is supposed to run. For example, if both machines use ASCII, everything is fine. However, if the encoding of characters is different, then Rex has to know about the internal representation of the character set on the target machine. This can be done by a specification like the following:

   CHARACTER_SET {
      0 0xf0   1 0xf1   ...   9 0xf9
      A 0xc1   B 0xc2   ...   Z 0xe9
      a 0x81   b 0x82   ...   z 0xa9
      0x09 0x05        /* tab        */
      \n   0x25        /* newline    */
      32   0x40        /* space      */
      \\   0xe0        /* back slash */
      ...  0x6a
      ...  0xd0
   }

The curly brackets after the keyword CHARACTER_SET contain a list of pairs. A pair describes a translation, and it consists of a character and its internal code. A character is given by a printable character, a C escape sequence (\n, \t, \v, \b, \r, \f), or a number, and a code is given by a number. The numbers can be either decimal, octal, or hexadecimal numbers. As in C, octal numbers start with the digit 0 and hexadecimal numbers with 0x. While the character refers to the representation on the host machine, the code refers to the representation on the target machine. If no translation is given for a character then the internal repres...
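The list of pairs can be pictured as a 256-entry translation table from host character to target code. The C sketch below builds such a table from a few of the pairs legible in the example above; the names `build_translation` and `struct pair` are hypothetical, and defaulting untranslated characters to their own code is an assumption of this sketch, since the source sentence on that point is cut off.

```c
#include <assert.h>
#include <stddef.h>

/* One translation: character on the host machine, code on the target. */
struct pair {
    unsigned char host;
    unsigned char target;
};

/* A few pairs taken from the CHARACTER_SET example. */
static const struct pair pairs[] = {
    { '0', 0xf0 }, { '1', 0xf1 }, { '9', 0xf9 },
    { 'B', 0xc2 }, { 'Z', 0xe9 },
    { 'a', 0x81 }, { 'b', 0x82 }, { 'z', 0xa9 },
    { '\n', 0x25 }
};

/* Fills table[c] with the target code of host character c. */
void build_translation(unsigned char table[256])
{
    size_t i;

    assert(table != NULL);
    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;    /* assumption: identity by default */
    for (i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
        table[pairs[i].host] = pairs[i].target;
}
```

A generator following this scheme would consult the table when emitting its control tables, so that the tables index by target codes rather than host codes.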
27. rules comment Strl Str2 CStrl CStr2 AStrl AStr2 charset PPline RULES targetcode n d AppendCode INC level targetcode Wyn d DEC level IF level 0 THEN yyStart PrevState Append Attribute Text TargetCode Attribute Position Position RETURN SymTargetcode ELSE AppendCode END targetcod code gGiy r AppendCode targetcod GetWord IF Language Java THE INCL Procedures pGetWord END AppendCode targetcod getWord IF Language Java THEN INCL Procedures pGetWord END AppendCode targetcod GetLower IF Language Java THEN INCL Procedures pGetLower END AppendCode targetcod getLower IF Language Java THEN INCL Procedures pGetLower END AppendCode targetcod GetUpper IF Language Java THEN INCL Procedures pGetUpper END AppendCode targetcod getUpper IF Language Java THEN INCL Procedures pGetUpper 57 Rex END AppendCode targetcod input INCL Procedures pinput Append targetcod yyPush INCL Procedures pyyPush Append targetcod yyPop INCL Procedures pyyPop Append targetcod t 4 Strings Append TargetCode 110 yyTab targetcode XrMm i Append Attribute Text TargetCode AssignEmpty TargetCode yyEol
28. token which is given by the parameter Token. The procedure should return in the second argument, called Attribute, a default value for the additional properties of the token Token.

The variable <Scanner>_Exit refers to a procedure which is called upon an internal error in the scanner. The default procedure terminates the program execution. The variable can be changed in order to achieve a different behaviour.

The internal scanner interface consists of the following objects:

The initial size of the scanner input buffer is defined by the value of the preprocessor symbol yyInitBufferSize, with a default of 8448. The buffer size is increased automatically when necessary. The initial buffer size can be changed by including a preprocessor directive in the GLOBAL section such as

   # define yyInitBufferSize 562

For best results the value should be a power of two plus a constant between 50 and 256.

The initial size of the stack for include files is defined by the value of the preprocessor symbol yyInitFileStackSize, with a default of 8. The stack size is increased automatically when necessary. The initial stack size can be changed by including a preprocessor directive in the GLOBAL section such as

   # define yyInitFileStackSize 16

The value for tab stops is defined by the preprocessor symbol yyTabSpace, with a default of 8. This value can be changed by including a preprocessor directive in the GLOBAL section such as de
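Both buffer sizes mentioned above follow the recommended shape: 8448 = 8192 + 256 and 562 = 512 + 50. The small C helper below checks a candidate value against that rule. It is an illustrative sketch; the function name is hypothetical and not part of Rex.

```c
/* Returns 1 if size can be written as 2^k + c with 50 <= c <= 256, else 0. */
int is_recommended_buffer_size(unsigned long size)
{
    for (unsigned long p = 1; p != 0 && p < size; p <<= 1) {
        unsigned long c = size - p;   /* remainder above this power of two */
        if (c >= 50 && c <= 256)
            return 1;
    }
    return 0;
}
```

For instance, 1024 fails the check because no power of two below it leaves a remainder between 50 and 256.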
29. to be generated. Example:

   SCANNER lexer

The identifier is used to derive the names of the scanner and source modules and, if the target language requires it, a prefix for the objects exported by the scanner. If the name is missing it defaults to Scanner. In the following we refer to this name by <Scanner>. The prefixes <Scanner> and <Scanner>_ are generated only if this clause is present. Otherwise they are omitted in order to be compatible with former versions of Rex.

If the target language is Java, this name may include a package name. Example:

   SCANNER mydomain.mypackage.Lexer

Here the scanner name is Lexer and the generated class will include a package declaration placing it in mydomain.mypackage.

3.7 Target Code

The actions associated with regular expressions may need variables or, in general, arbitrary declarations to perform their task. A scanner specification may be preceded by several kinds of sections written in the target language. The syntax rules for actions apply to these sections, too. These sections are copied unchanged and unchecked to the generated scanner at the following places:

The IMPORT section is used to declare use of other modules by the scanner. For Ada, target code after the keyword IMPORT is included in the specification part of the generated scanner before the package header. It can be used to introduce WITH and USE clauses. For Java, target code after the keyword IMPORT is included a
30. 0 targetcode ANY A GetWord Word Strings Append TargetCode Char Word targetcod NGANG 2 ANG targetcod Strings Append TargetCode TNI targetcod 24 GetWord String IF Language C OR Language Cpp THEN yyS ELSIF Language Ada THEN yyStart AStrl ELSE yyStart Strl END StringPosition Attribute Position targetcode yr d GetWord String IF Language C OR Language Cpp THEN yyS ELSIF Language Ada THEN yyStart AStr2 ELSE yyStart Str2 END StringPosition Attribute Position Strl Strchl A Str2 Strch2 CStrl CStrChl ANY CStr2 CStrCh2 AN ANY AStr2 AStrCh GetWord Word Concatenate String CStrl CStr2 NN Nr n Strings Append String Strings Append String 12C yyEol Strl CStrlif Str2 CStr2 AStr2 AStrif ANY GetWord Word Concatenate String yyPrevious Concatenate TargetCode 58 Code Code Code 2 r tart CStrl tart CStr2 Word String 59 Rex Strl Str2 CStrl CStr2 t Strings Append String 11C yyTab Strl Str2 CStrl CSt
31. ALSE Count 0 BeginScanner REPEAT Token GetToken INC Count IF Debug THEN GetWord Word WritePosition StdOutput Attribute Position WriteI StdOutput Token 5 EH me Rex 31 WriteC StdOutput WriteL StdOutput Word END UNTIL Token EofToken CloseScanner Writel StdOutput Count 0 WriteNl StdOutput CloselO rExit 0 END lt Scanner gt Drv 5 4 Ada 5 4 1 the fi vith Scanner Interface The scanners generated by Rex offer an interface given by the following package contained in le lt Scanner gt ads Position Strings package lt Scanner gt is type tScanAttribute is record Position tPosition end record procedure ErrorAttribute Token Integer Attribute out tScanAttribute EofToken constant Integer 0 TokenLength Integer TokenIndex Integer Attribute tScanAttribute procedure BeginScanner 7 procedure BeginFile FileName String function GetToken return Integer procedure GetWord Word out Strings tString procedure GetLower Word out Strings tString procedure GetUpper Word out Strings tString procedure CloseFil i procedure CloseScanner end Scanner The procedure GetToken is the central scanning routine It returns the next token found in the input or whatever is specifi ed in the actions associated with the regular expressions The procedure
32. BeginFile may be called in order to open an input file or a nested include file. The parameter FileName specifies the file name. The empty string denotes input from standard input. If BeginFile is not called, input is read from standard input. Include files up to a nesting depth of 15 can be processed.

The procedure CloseFile may be called in order to close the current input file before reaching end of file. CloseFile is called automatically by the scanner upon reaching end of file.

The procedure BeginScanner may be called in order to initialize user data. The contents of the target code section named BEGIN is included in the body of this procedure.

The procedure CloseScanner may be called in order to finalize user data. The contents of the target code section named CLOSE is included in the body of this procedure.

The procedures GetWord, GetLower, and GetUpper allow access to the matched character sequence as described in section 4.4.

The variable TokenLength specifies the number of matched characters.

The variable TokenIndex is an array index of the internal buffer (an array of characters) which specifies the location where the matched character sequence starts. It can be used as argument for the macros that compute source positions.

The variable Attribute is supposed to communicate additional properties of the current token. The value must be provided by appropriate action statements. This variable is of type tScanAttri
33. Char FALSE RETURN SymChar charset AX n Attribute Ch ORD 012C IsChar FALSE RETURN SymChar charset NN v Attribute Ch ORD 013C IsChar FALSE RETURN SymChar charset AN f Attribute Ch ORD 014C IsChar FALSE RETURN SymChar charset NN r Attribute Ch ORD 015C IsChar FALSE RETURN SymChar charset NN ANY zod IsChar FALSE GetWord Word Attribute Ch ORD Char Word 2 RETURN SymChar charset t n f r 26 IsChar FALSE GetWord Word Attribute Ch ORD Char Word 1 RETURN SymChar charset t n digit IsChar FALSE Attribute Ch ORD RETURN SymChar charset jJ yyStart STD RETURN SymRBrace lt line yyPush PPline PPline 0 9 GetWord Word PPLine StringToInt Word 1 1 to compensate for the following yyEol PPline A n GetWord Word SubString Word 2 Length Word 1 Word yyLineCount PPLine Rex 63 change the line only if there is a file name Attribute Position File Makeldent Word PPline Ar An yyPop References ASU86 Gro89 Groa Grob Les75 CASE yyStartState OF targetcode AppendLine TargetCode yyLineCount Attribute Position File ELSE END yyEol 0 don t move yyEol before CASE A V Aho R Sethi and J D Ullman Compilers
34. GetLower (<Scanner>_xxtChar * Word); extern int <Scanner>_GetUpper (<Scanner>_xxtChar * Word); extern void <Scanner>_CloseFile (void); extern void <Scanner>_CloseScanner (void); extern void <Scanner>_ResetScanner (void);

The procedure <Scanner>_GetToken is the central scanning routine. It returns the next token found in the input, or whatever is specified in the actions associated with the regular expressions.

The procedure <Scanner>_BeginFile may be called in order to open an input file or a nested include file. It has one parameter of type char * (string) which specifies the file name. The value NULL indicates input from standard input. If it is not called, input is read from standard input. Include files may be nested to arbitrary depth.

The procedure <Scanner>_BeginFileW does the same as <Scanner>_BeginFile for file names given by wide character strings.

The procedure <Scanner>_BeginMemory may be called in order to indicate that input should be read from the null terminated string of input items at location InputPtr. The input string may not contain null characters. The contents of the string may not be changed until it has been processed completely.

The procedure <Scanner>_BeginMemoryN may be called in order to indicate that the input should be Length input items at location InputPtr. The input may contain null characters. The contents of the input m
35. MD o Z th gt vo o n Hn mo A Q O IO o H O sk A Note oce ee WO VO VO A A rat n 24H gt 2 2 NS GE BEINE ENE LER ES ESA et ne Set Je US SEs ney cE et ne ce Ge OS 2 0 1 1 HAA Aa Eu Fd H Hi H k S 2 027 Oona 4 4 4 4 4 4 4 4 QE GG Ex Er a a a Ex G6 C6 A A C CEP A x S GB B E aa u Gc OG Ex 4 Gr E eG o qma EM E EA EM EH EM EM EM EM EM EM EM EM EM EM EM EM EM EM EM EM EM FA EM EA FA FA E E E RE EH 1 1 1 1 un EM EH EM EM EM EM EM EM EAM EA EA EA BG 02 0 02 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 02 NND 0 o 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 OQ Rex STD QUALIFIED RETURN 62 STD RECORD RETURN 63 STD REPEAT RETURN 64 STD RETURN RETURN 65 STD SE RETURN 66 STD HEN RETURN 67 STD O RETURN 68 STD YPE RETURN 69 STD UNTIL RETURN 70 STD VAR RETURN 71 STD WHILE RETURN 72 STD WITH RETURN 73 STD letter letter digit GetWord Word ident MakeIdent Word RETURN
36. THEN Message string length 0 required Error Attribute Position INC ErrorCount ArrayToString TargetCode ELSE GetWord Word END Attribute String RETURN SymString STD set rules STD STD STD set STD set rules STD rules STD rules sin STD rules Tp STD rules Ww STD rules mw STD rules mo STD rules 2 STD rules ii STD rules qu STD rules DU rules rules T rules rules rules ma rules Wa STD set rulest STD set rules STD set rules STD set rules STD set rules STD set rules STD set rules STD set rules Rex PutString TargetCode r 60 SubString Word 2 Length Word 1 TargetCode RETURN SymDot M RETURN SymEqual 7 yyPrevious RETURN SymRBrace 7 y RETURN SymMinus RETURN SymComma H RETURN SymBar RETURN SymPlus E RETURN SymAsterisk A RETURN SymQuestion RETURN SymLParen RETURN SymRParen 5 RETURN SymLBracket RETURN SymRBracket H StartPosition Attribute Position RETURN SymLBrace RETURN SymNrSign 2
37. Unicode character. A string is matched by a character sequence identical to the characters that make up the string. iz matches the character sequence mn matches the character

An identifier may be defined to refer to a regular expression. In this case it matches the same characters as the regular expression. An undefined identifier is treated like a string: it matches its own character sequence. END matches the character sequence END. NOT matches the character sequence NOT.

A number is treated like a string: it matches its own character sequence. 007 matches the character sequence 007.

A character set matches one arbitrary character contained in the set. It is written as a sequence of characters enclosed in braces. Ranges may be used to include intervals of characters. The same escapes as described for characters may be used. Unprintable characters and the following ones have to be escaped within character sets: F T Tyn r ENT

The predefined identifier ANY stands for a character set containing every character except the newline character. If a character set is preceded by the operator it matches one arbitrary character except the ones contained in the set. ANS matches the arithmetic operators A Z a z 0 9 matches all letters and digits NOxabcd Nuabcdef01 matches a set of Unicode characters mud matches all characters An matches
38. _Line (Standard_Output); end <Scanner>Drv;

5.5 Eiffel

5.5.1 Scanner Interface

The file <Scanner>.e contains the class <Scanner> which offers the following features:

   class <Scanner>
   creation BeginScanner
   feature
      EofToken: INTEGER is 0
      TokenLength: INTEGER
      Attribute: ScanAttribute
      BeginScanner is
      BeginFile (FileName: STRING) is
      GetToken: INTEGER is
      GetWord: STRING is
      GetLower: STRING is
      GetUpper: STRING is
      CloseFile is
      CloseScanner is
      ErrorAttribute (Token: INTEGER): ScanAttribute is
      SetAttribute (Value: ScanAttribute) is

The procedure GetToken is the central scanning routine. It returns the next token found in the input, or whatever is specified in the actions associated with the regular expressions.

The procedure BeginFile may be called in order to open an input file or a nested include file. The parameter FileName specifies the file name. The empty string denotes input from standard input. If it is not called, input is read from standard input. Include files may be nested to arbitrary depth.

The procedure CloseFile may be called in order to close the current input file before reaching end of file. CloseFile is called automatically by the scanner upon reaching end of file.

The procedure BeginScanner instantiates a scanner object and performs the necessary initializations. For example the tables are
39. a deterministic finite automaton. Then an algorithm to minimize the number of states is applied. After extending the automaton to a tunnel automaton, the constant regular expressions are added in linear time using the algorithm described in [Gro89]. The sparse matrix that controls the automaton is compressed into a data structure called comb vector [ASU86] to save space.

The key to the performance of scanners generated by Rex lies in the following facts:
- access to the comb vector table is fast
- input happens rarely because blocks of characters are transferred
- no check for the last character of a block is necessary because of the sentinel technique used
- the same holds for the check of stack underflow for the stack that records the passed states
- the treatment of right context is efficient and only necessary in a few cases because partial evaluation has been applied

8 Differences to LEX

Some specialists might want to know about the differences between Rex and LEX [Les75] (see Table).

Advantages of Rex:
- Rex can generate scanners for Unicode
- the standard or initial start state has a documented name: STD
- a list of start states can be inverted using the operator NOT to specify that a rule is valid in all states except the listed ones
- specifications can be written unformatted: white space in the form of blanks, tabs, and newlines is skipped
- Identifiers used to refer to named regular expressions are wr
40. anner>_BASE_CLASS { public:

   # define <Scanner>_EofToken 0
   <Scanner>_xxtChar * TokenPtr;
   int TokenLength;
   <Scanner>_tScanAttribute Attribute;
   void (* Exit) (void);
   <Scanner> (void);
   void BeginFile (char * FileName);
   void BeginFileW (wchar_t * FileName);
   void BeginMemory (void * InputPtr);
   void BeginMemoryN (void * InputPtr, int Length);
   void BeginGeneric (void * InputPtr);
   int GetToken (void);
   int GetWord (<Scanner>_xxtChar * Word);
   int GetLower (<Scanner>_xxtChar * Word);
   int GetUpper (<Scanner>_xxtChar * Word);
   void CloseFile (void);
   ~ <Scanner> (void);
   void ErrorAttribute (int Token, <Scanner>_tScanAttribute * Attribute);
   Errors ErrorsObj;

The method GetToken is the central scanning routine. It returns the next token found in the input, or whatever is specified in the actions associated with the regular expressions.

The method BeginFile may be called in order to open an input file or a nested include file. It has one parameter of type char * (string) which specifies the file name. The value NULL indicates input from standard input. If it is not called, input is read from standard input. Include files may be nested to arbitrary depth.

The method BeginFileW does the same as BeginFile for file names given by wide character strings.

The method BeginMemory may be called in order to indicate that input should be read from the null terminated string of input items at locatio
41. are rescanned for the next character sequence.

yyStart (s)   The start state is changed to state s.

yyPush (s)   The current start state is pushed on a stack and the start state is changed to state s.

yyPop   The start state is changed to the state popped from a stack.

yyPrevious   The start state is changed to the state valid before the last execution of yyStart, yyPush, yyPop, or yyPrevious.

yyStartState   This is not a statement but an expression of type short or SHORTCARD, respectively, whose value is the current start state. It can be used to execute different statements in one action depending on the current start state.

yyTab   This statement should be used if a regular expression is specified by the user to process tab characters. Its purpose is to update the internal variable used to calculate the column position of tokens. yyTab works only if the tab character is specified exclusively by a rule.

yyTab1 (a)   Like yyTab, this statement should be used if a regular expression is specified by the user to process tab characters. Its purpose is to update the internal variable used to calculate the column position of tokens. yyTab1 works if the tab character is embedded in other characters. The parameter a must specify the number of characters before the tab character.

yyEol (n)   This statement should be used if a regular expression is specified by the user to process newline characters. Its purpose is to update the internal variables used to calculate the line and column position of tokens. yyEol should be ex
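The interplay of the four statements that change the start state can be modelled with a small state record. The C fragment below is a sketch written for illustration only; the type StartStates and the ss_ functions are hypothetical names and do not show how Rex implements these statements internally. The fixed stack depth is likewise an assumption of the sketch.

```c
#define SS_STACK_MAX 16            /* assumed fixed depth, for the sketch only */

typedef struct {
    int current;                   /* current start state */
    int previous;                  /* state valid before the last change */
    int stack[SS_STACK_MAX];
    int depth;
} StartStates;

/* Models yyStart (s): switch to state s. */
void ss_start(StartStates *st, int s)
{
    st->previous = st->current;
    st->current = s;
}

/* Models yyPush (s): save the current state, then switch to s. Returns 0 if full. */
int ss_push(StartStates *st, int s)
{
    if (st->depth >= SS_STACK_MAX)
        return 0;
    st->stack[st->depth++] = st->current;
    ss_start(st, s);
    return 1;
}

/* Models yyPop: switch to the most recently pushed state. Returns 0 if empty. */
int ss_pop(StartStates *st)
{
    if (st->depth == 0)
        return 0;
    ss_start(st, st->stack[--st->depth]);
    return 1;
}

/* Models yyPrevious: go back to the state valid before the last change. */
void ss_previous(StartStates *st)
{
    ss_start(st, st->previous);
}
```

Because yyPrevious itself counts as a change, calling it twice in a row in this model returns to the state that was current before the first call.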
42. ay not be changed until it has been processed completely.

The procedure <Scanner>_BeginGeneric may be called in order to indicate that the input is user defined at location InputPtr. The source module (see below) has to be extended by the user in order to implement this feature.

The procedure <Scanner>_CloseFile may be called in order to close the current input stream before reaching end of file or end of input. <Scanner>_CloseFile is called automatically by the scanner upon reaching end of file or end of input.

The procedure <Scanner>_BeginScanner may be called in order to initialize user data. The contents of the target code section named BEGIN is included in the body of this procedure.

The procedure <Scanner>_CloseScanner may be called in order to finalize user data. The contents of the target code section named CLOSE is included in the body of this procedure.

If the scanner reaches the end of the input it returns the special token called <Scanner>_EofToken, which is encoded by 0.

The preprocessor symbol <Scanner>_xxMaxCharacter is used to describe the range of the character set.

The preprocessor symbol <Scanner>_xxtChar is defined to be either char or wchar_t. It describes the type used as representation of a character. Note that the size of wchar_t can be 2 or 4 bytes, depending on the compiler.

The procedures <Scanner>_GetWord, <Scanner>_GetLower, and <Scann
43. bute which has to be a record type with at least one field called Position of type tPosition. tPosition has to be a record type with at least two fields called Line and Column. The values of Line and Column are computed by the scanner automatically. They indicate the source position of the current token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The types tScanAttribute and tPosition are predefined as given above. The definitions of these types can be changed as described in section 3.7.

During automatic error repair a parser may insert tokens. In this case the parser calls the procedure ErrorAttribute in order to ask for the additional properties of an inserted token, which is given by the parameter Token. The procedure should return in the second argument, called Attribute, a default value for the additional properties of the token Token.

If the scanner reaches the end of the input it returns the special token called EofToken, which is encoded by 0.

5.4.2 Source Interface

The scanners generated by Rex need a source module for blocked input of characters. Rex can provide a prototype source module which reads from standard input. It is contained in the files <Scanner> source.ads and <Scanner> source.adb. The package module in the file <Scanner> source.ads has the following contents: package <Scanner> Source is function BeginSource Fi
44. c error repair a parser may insert tokens. In this case the parser calls the method ErrorAttribute in order to ask for the additional properties of an inserted token, which is given by the parameter Token. The method should return in the second argument, called Attribute, a default value for the additional properties of the token Token.

The variable Exit refers to a procedure which is called upon an internal error in the scanner. The default procedure terminates the program execution. The variable can be changed in order to achieve a different behaviour.

The preprocessor symbol <Scanner>_BASE_CLASS can be used to specify a base class for the class <Scanner> using a define directive in the EXPORT section of a scanner description. Example:

   EXPORT { # define Scanner_BASE_CLASS : public BaseClass }

The internal scanner interface consists of the following objects:

The initial size of the scanner input buffer is defined by the value of the preprocessor symbol yyInitBufferSize, with a default of 8448. The buffer size is increased automatically when necessary. The initial buffer size can be changed by including a preprocessor directive in the GLOBAL section such as

   # define yyInitBufferSize 562

For best results the value should be a power of two plus a constant between 50 and 256.

The initial size of the stack for include files is defined by the value of the preprocessor symbol yyInitFileStackSize wi
45. called by the user in order to specify the encoding and the endian property of the input stream. The arguments have to be values as defined below. This function has to be called after the function <Scanner>_BeginSource. If neither little endian nor big endian is specified, then the endian property of the current system is assumed to hold for the input. The function <Scanner>_GetWLine will convert the input stream to a stream of type wchar_t.

The following constants describe the encoding of the input stream:

   # define CODE_NONE    0
   # define CODE_BYTE    1   /* 1 byte            */
   # define CODE_WCHAR_T 2   /* 2 or 4 bytes      */
   # define CODE_UCS2    3   /* 2 bytes           */
   # define CODE_UCS4    4   /* 4 bytes           */
   # define CODE_UTF8    5   /* seq. of 1 byte    */
   # define CODE_UTF16   6   /* seq. of 2 bytes   */

The above comments give the size of an input stream item in bytes. All input stream items, or sequences of input stream items in the cases of UTF8 and UTF16, represent Unicode characters. The encodings BYTE, UCS2, and UTF16 (and possibly WCHAR_T) can represent only subsets of the full Unicode character set. A Unicode character will be stored in variables of type wchar_t. Note that the size of wchar_t can be 2 or 4 bytes, depending on the compiler. Therefore, if the size of wchar_t is 2, then characters encoded by UCS4, UTF8, and UTF16 will be truncated to two bytes.

The following constants describe the endian property of the input stream:
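The truncation mentioned above is easy to see with a concrete value. The C sketch below decodes one UTF-8 sequence into a code point and then stores it in 16 bits, as would happen with a 2-byte wchar_t. Both function names are hypothetical and the decoder is deliberately minimal (it does not reject overlong encodings or surrogates); it is not the conversion code used by Rex.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the input is malformed or too short. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t v;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                  /* stray continuation byte or invalid lead byte */

    if (len > n) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Keep only the low 16 bits, as happens when the character type is 2 bytes wide. */
uint16_t store_in_16_bits(uint32_t cp)
{
    return (uint16_t)(cp & 0xFFFFu);
}
```

A character such as U+20AC fits into 16 bits and survives unchanged, whereas U+1F600 becomes 0xF600 after truncation and no longer identifies the original character.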
46. cessor symbol <Scanner>_xxtChar is defined to be either char or wchar_t. It describes the type used as representation of a character. Note that the size of wchar_t can be 2 or 4 bytes, depending on the compiler.

The methods GetWord, GetLower, and GetUpper allow access to the matched character sequence as described in section 4.4.

Alternatively, the matched character sequence can be accessed using the member variables TokenPtr and TokenLength. TokenPtr points to the beginning of the matched character sequence. TokenLength specifies the number of matched characters. Note that the matched character sequence is not terminated by a 0 character.

The member variable Attribute is supposed to communicate additional properties of the current token. The value must be provided by appropriate action statements. This variable is of type <Scanner>_tScanAttribute, which has to be a struct type with at least one member called Position of type tPosition. tPosition has to be a struct type with at least two members called Line and Column. The values of Line and Column are computed by the scanner automatically. They indicate the source position of the current token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The types <Scanner>_tScanAttribute and tPosition are predefined as given above. The definitions of these types can be changed as described in section 3.7.

During automati
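Since the matched sequence reached through TokenPtr and TokenLength is not 0-terminated, it has to be copied before it can be passed to functions that expect a C string. The helper below is a sketch of such a copy for the char case; copy_token is a hypothetical name and not part of the generated scanner.

```c
#include <stddef.h>
#include <string.h>

/* Copy token_length characters starting at token_ptr into dst and terminate
   the result with '\0'. At most dst_size - 1 characters are copied.
   Returns the number of characters actually copied. */
size_t copy_token(char *dst, size_t dst_size,
                  const char *token_ptr, size_t token_length)
{
    size_t n;

    if (dst_size == 0)
        return 0;
    n = token_length < dst_size - 1 ? token_length : dst_size - 1;
    memcpy(dst, token_ptr, n);
    dst[n] = '\0';
    return n;
}
```

In an action this would be called with the scanner's token pointer and length as the last two arguments; GetWord offers the same service through the documented interface.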
47. cters. They consist of a sequence of characters enclosed in double quotes. It is not possible to include a double quote or a newline character in a string. No escape is needed within strings. A string is a shorthand for escaping a whole sequence of characters. BEGIN wea

Actions are statements to be copied unchanged into the generated code. The statements have to be written in the desired target language. The actions have to be enclosed in braces. The characters { and } can be used within the actions as long as they are either properly nested or contained in strings or in character constants. Otherwise they have to be escaped by a backslash character \. The escape character \ has to be escaped by itself if it is used outside of strings or character constants: \\. In general, a backslash character \ can be used to escape any character outside of strings or character constants. Within those tokens the escape conventions are disabled and the tokens are left unchanged. There are additional statements available to aid in scanning (see section 4.4).

   { printf ("BEGIN recognized\n"); return tBegin; }
   { if (level > 0) { GetWord (String); Concatenate (Word, String); } }
   { printf ("recognized\n"); }

3.2 Regular Expressions

In general, the specification of a scanner consists of the keyword RULE or RULES followed by a list of regular expressions, each one associated with an action
48. ction <Scanner>_GetLine.

<Scanner>_BeginSourceFileW is called from the scanner function <Scanner>_BeginFileW. It does the same as <Scanner>_BeginSourceFile for file names given by wide character strings. The source module has to be extended by the user in order to implement this feature.

<Scanner>_BeginSourceMemory is called from the scanner function <Scanner>_BeginMemory, indicating that input should be read from the null terminated string of input items at location InputPtr. The input string may not contain null characters. The contents of the string may not be changed until it has been processed completely.

<Scanner>_BeginSourceMemoryN is called from the scanner function <Scanner>_BeginMemoryN, indicating that the input should be Length input items at location InputPtr. The input may contain null characters. The contents of the input may not be changed until it has been processed completely.

<Scanner>_BeginSourceGeneric is called from the scanner function <Scanner>_BeginGeneric, indicating that the input is user defined at location InputPtr. The source module has to be extended by the user in order to implement this feature.

<Scanner>_GetLine is called from the scanner in order to fill a buffer at address Buffer with a block of at most Size characters. Input should be read from a file specified by the integer file descriptor File if the current input stream comes from a
49. e following:

   # include "Position.h"
   # include "<Scanner>.h"

   int main (void)
   {
      int Token, Count = 0;
      char Word [2048];

      <Scanner>_BeginScanner ();
      do {
         Token = <Scanner>_GetToken ();
         Count ++;
   # ifdef Debug
         if (Token != <Scanner>_EofToken) <Scanner>_GetWord (Word); else Word [0] = '\0';
         WritePosition (stdout, <Scanner>_Attribute.Position);
         printf ("%5d %s\n", Token, Word);
   # endif
      } while (Token != <Scanner>_EofToken);
      <Scanner>_CloseScanner ();
      printf ("%d\n", Count);
      return 0;
   }

5.2 C++

5.2.1 Scanner Interface

The scanner interface consists of two parts: While the objects specified in the external interface can be used from outside the scanner, the objects of the internal interface can be used only within a scanner description. The external scanner interface is described by a class named <Scanner>. The name <Scanner> may be specified after the keyword SCANNER. It defaults to Scanner. The class definition is contained in a file named <Scanner>.h which has the following contents:

   # include "Position.h"

   typedef struct { tPosition Position; } <Scanner>_tScanAttribute;

   # define <Scanner>_xxMaxCharacter 255
   # if xxMaxCharacter < 256
   # define <Scanner>_xxtChar char
   # else
   # define <Scanner>_xxtChar wchar_t
   # endif
   # define <Scanner>_BASE_CLASS

   class <Scanner> <Sc
50. e specified only once. n zt n matches both possible forms of Modula-2 strings.

3.3 Ambiguous Specifications

Rex can handle ambiguous specifications. When more than one expression can match the current input, Rex chooses as follows:
- The longest match is preferred.
- Among rules which match the same number of characters, the rule given first is preferred.
The length of a match is the number of matched characters plus the number of characters matched by the regular expression following the right context operator, if applicable.

Example:

   {0-9}+             : { return tDecimal; }
   {0-9}+ / ".."      : { return tDecimal; }
   {0-9}+ "." {0-9}*  : { return tReal;    }
   ".."               : { return tRange;   }
   "."                : { return tDot;     }

Suppose the rule with the right context above is missing. The input 1.. would be recognized as tReal and tDot, because tReal matches two characters. To get the right solution the right context is necessary. Now the input is recognized as tDecimal and tRange, because the second rule for tDecimal matches 3 characters.

Example:

   BEGIN   : { return tBegin; }
   END     : { return tEnd;   }
   {A-Z}+  : { return tIdent; }

The rules for keywords should be given before the rule for identifiers. Otherwise the keywords would be recognized as identifiers.

An analysis that checks a scanner specification for ambiguous rules can be requested with option p. The resu
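The two selection rules, longest match first and earliest rule on a tie, can be written down in a few lines. The C function below is a sketch of the selection step only, under the assumption that the match length of every rule at the current input position is already known (including any characters matched by a right context); the name choose_rule is hypothetical.

```c
/* lengths[i] is the match length of rule i at the current position, 0 meaning
   that the rule does not match. Returns the index of the selected rule, or -1
   if no rule matches. */
int choose_rule(const int lengths[], int n)
{
    int best = -1;
    int best_len = 0;

    for (int i = 0; i < n; i++) {
        /* Strictly greater: on equal lengths the rule given first is kept. */
        if (lengths[i] > best_len) {
            best = i;
            best_len = lengths[i];
        }
    }
    return best;
}
```

For the input BEGIN a keyword rule and an identifier rule both match five characters, so the one listed first is selected, which is why keyword rules should precede the identifier rule.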
51. ecuted once for every newline character matched. The parameter n should specify the number of characters matched after the last newline character. In simple cases where the pattern consists only of a newline character, one invocation of yyEol (0) is sufficient.

input   This is a function call returning the next character from the input. It is used where regular expressions alone are not able to describe the input language, for example Fortran style constants.

unput (c)   This pushes the character c back into the input, to be considered when scanning for the next token.

5 Interface of the Generated Scanners

The scanners generated by Rex offer an interface to be used by a main program (e.g. a parser), and they require a source module for blocked input of characters to obey a certain interface. The structure of these two interfaces is independent of a specific target language. The interfaces are discussed in language specific chapters because the syntactic details vary from one target language to another.

5.1 C

The option c selects the generation of a scanner in C that can be translated by compilers for ANSI C, K&R C, or C++. This is accomplished by appropriate preprocessor directives. It has already been mentioned that the prefixes <Scanner> and <Scanner>_ are generated only if the keyword SCANNER is present. Otherwise they are omitted in order to be compatible with former versions of Rex.

5.1.1 Scanne
52. ed using three cpp macros which are predefi ned as follows If the target language is C define yyColumn Ptr int Ptr char yyLineStart define yyOffset Ptr yyFileOffset Ptr yyChBufferStart2 define yySetPosition lt Scanner gt _Attribute Position Line yyLineCount N lt Scanner gt _Attribute Position Column yyColumn lt Scanner gt _TokenPtr If the target language is C define yyColumn Ptr int Ptr char yyLineStart define yyOffset Ptr yyFileOffset Ptr char yyChBufferStart define yySetPosition Attribute Position Line yyLineCount Attribute Position Column yyColumn TokenPtr If the target language is Modula 2 f define yyColumn Index Index yyLineStart define yyOffset Index yyFileOffset Index yyChBufferStart f define yySetPosition Attribute Position Line yyLineCount N Attribute Position Column yyColumn TokenIndex Rex 14 If the target language is Ada define yyColumn Index Index yyLineStart define yyOffset Index yyFileOffset Index yyChBufferStart define yySetPosition Attribute Position Line yyLineCount N Attribute Position Column yyColumn TokenIndex If the target language is Java f define yyColumn Index Index yyLineStart define yyOffset Index yyFileOffset Index yyChBufferStart define yySetPosition N attribute new ScanAttribute yyLineCo
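The index-based variant of yyColumn above computes a column as the difference between a buffer index and the line start. The C sketch below models this with plain offsets. The column computation follows that macro; the update performed on a newline is an assumption made for illustration, chosen so that the first character of a line gets column 1, and the names PosState, column_of, and on_newline are hypothetical.

```c
typedef struct {
    long line_start;   /* offset of the newline that precedes the current line */
    int  line_count;   /* current line number */
} PosState;

/* Column of the character at offset index, as in: Index - yyLineStart. */
int column_of(const PosState *st, long index)
{
    return (int)(index - st->line_start);
}

/* Assumed bookkeeping after a match that contains a newline: end_of_match is
   the offset just past the match and n is the number of characters matched
   after the last newline. */
void on_newline(PosState *st, long end_of_match, int n)
{
    st->line_count++;
    st->line_start = end_of_match - 1 - n;
}
```

With the text "ab\ncd" and an initial line start of -1, the characters a and b get columns 1 and 2; after the newline at offset 2 has been matched alone (n = 0), the character c at offset 3 gets column 1 on line 2.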
53. egular expressions. It can be used to detect illegal characters, for example. If not given, it is predefined as described below. Target code after the keyword EOF is included in the scanner routine, to be executed upon reaching the end of the input. It can be used to return a value different from the predefined one (<Scanner>_EofToken = 0) or to check for unclosed comments or strings, for example. If the IMPORT, EXPORT, GLOBAL, or DEFAULT sections are not used, then the following predefined declarations are included. If the target language is C: EXPORT { # include "Position.h" typedef struct { tPosition Position; } <Scanner>_tScanAttribute; extern void <Scanner>_ErrorAttribute (int Token, <Scanner>_tScanAttribute * Attribute); } GLOBAL { void <Scanner>_ErrorAttribute (Token, Attribute) int Token; <Scanner>_tScanAttribute * Attribute; { } } DEFAULT { yyEcho; } If the target language is C++: EXPORT { # include "Position.h" typedef struct { tPosition Position; } <Scanner>_tScanAttribute; } GLOBAL { void <Scanner>::ErrorAttribute (int Token, <Scanner>_tScanAttribute * Attribute) { } } DEFAULT { yyEcho; } If the target language is Modula-2: EXPORT { IMPORT Position; TYPE tScanAttrib
54. entation of the host machine will be used. The following should be noted if a character set is specified: The option -i might be necessary if codes greater than 127 are used. Option -i selects an 8-bit representation for characters. The action statements <Scanner>_GetLower and <Scanner>_GetUpper (see section 4.4) may not work because they rely on ASCII. The operator < for matching the beginning of lines may cause trouble. This feature is implemented by testing whether the character before a line is an end-of-line character. The end-of-line character is predefined as the ASCII newline character, or whatever this character is translated to by the specification of the character set: # define yyEolCh (unsigned char) '\12' For C and C++ scanners this definition can be overridden by supplying an appropriate preprocessor directive in the GLOBAL section. 4. Predefined Items Rex knows several predefined items, described in the next sections. 4.1 Definitions The identifier ANY is predefined to match one arbitrary character (except newline): DEFINE ANY = - {\n} . 4.2 Start States The identifier STD is predefined to denote the standard start state of Rex. The generated scanners are initially in this state: START STD 4.3 Rules The four rules given below are predefined after the user-specified rules by default. By giving their own rules the user can override these because of the strategy to s
55. er>_GetUpper allow access to the matched character sequence as described in section 4.4. Alternatively, the matched character sequence can be accessed using the variables <Scanner>_TokenPtr and <Scanner>_TokenLength. <Scanner>_TokenPtr points to the beginning of the matched character sequence. <Scanner>_TokenLength specifies the number of matched characters. Note that the matched character sequence is not terminated by a '\0' character. The variable <Scanner>_Attribute is supposed to communicate additional properties of the current token. The value must be provided by appropriate action statements. This variable is of type <Scanner>_tScanAttribute, which has to be a struct type with at least one member called Position of type tPosition. tPosition has to be a struct type with at least two members called Line and Column. The values of Line and Column are computed by the scanner automatically. They indicate the source position of the current token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The types <Scanner>_tScanAttribute and tPosition are predefined as given above. The definitions of these types can be changed as described in section 3.7. During automatic error repair a parser may insert tokens. In this case the parser calls the procedure <Scanner>_ErrorAttribute in order to ask for the additional properties of an inserted
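Because the matched character sequence is described only by a pointer and a length and carries no terminating '\0', code that wants an ordinary C string has to copy it first. The following is a minimal sketch of such a copy. The helper name copy_token and its parameters are hypothetical and not part of the Rex interface; the sketch also assumes a scanner whose character type is char.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Rex interface: copy a matched
   character sequence given by pointer and length into a caller-supplied
   buffer and append the terminating '\0' that the scanner does not provide.
   Returns the number of characters copied (without the terminator). */
static size_t copy_token(const char *token_ptr, int token_length,
                         char *dest, size_t dest_size)
{
    size_t n;

    assert(token_ptr != NULL && dest != NULL && dest_size > 0);
    n = token_length > 0 ? (size_t) token_length : 0;
    if (n > dest_size - 1)
        n = dest_size - 1;          /* truncate rather than overflow */
    memcpy(dest, token_ptr, n);
    dest[n] = '\0';
    return n;
}
```

Inside an action, such a helper could be called with the two interface variables as the first two arguments. The GetWord statement described in section 4.4 serves a similar purpose.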
56. erators, delimiters or escape characters: 2 0 Me ER O BT Me ease A XN Besides keywords and the above special characters, a scanner specification is composed of characters, numbers, identifiers, strings, and actions. A character denotes itself. Special characters have to be escaped using a preceding escape character. The escape character is a backslash (\). For certain non-graphic characters the same escape sequences as in C are available: bell (BEL) \a, backspace (BS) \b, character tabulation (HT) \t, line feed (LF) \n, line tabulation (VT) \v, form feed (FF) \f, carriage return (CR) \r. Other unprintable characters are represented by the escape character followed either by an integer decimal number or by a hexadecimal number giving the internal encoding. These escape sequences can be used to denote Unicode characters, whose representation can take up to 4 bytes. gt t An MO Oxabcd Nuabcdef01 Numbers denote numerical integer values. They consist of a sequence of digits. 8 12 0 Identifiers are used to refer to named entities. They consist of a letter followed by letters, digits, or underscore characters (_). Lower case as well as upper case letters are possible. If an identifier is not defined, its character sequence is treated as a string. Identifiers that are keywords have to be escaped by a preceding escape character. letter HexDigit under_score \BEGIN \END Strings denote a sequence of chara
57. ex SymBegin 7 SymClose 8 SymEof 34 SymDefault 36 SymCharSet 38 SymDefine 9 SymStart 10 SymRules 11 SymDot 12 2 SymComma 13 SymEqual 14 H SymColon 15 H SymColonMinus 35 SymNrSign 33 SymSlash 16 SymBar 17 H SymPlus 18 SymMinus 19 2 SymAsterisk 20 SymOuestion 21 SymLParen 22 SymRParen 23 SymLBracket 24 SymRBracket 25 SymLBrace 26 SymRBrace 27 SymLess 28 SymGreater 29 i SymRule 30 level INTEGER string TargetCode tString E NoString tStringRef StartPosition StringPosition tPosition PrevState SHORTCARD H IsChar BOOLEAN 2 PROCEDURE ErrorAttribute Token CARDINAL VAR Attribute BEGIN pAttribute Position Attribute Position CASE Token OF SymIdent Attribute Ident Noldent SymNumber Attribute Number 0 SymString Attribute String NoString SymChar Attribute Ch ORD SymTargetcode MakeText Attribute Text E iS E END END ErrorAttribute PROCEDURE startCode yyStart targetcode MakeText Attribute Text AssignEmpty TargetCode level 1 tScanAttribute zi z oO startCode PROCEDURE BEGIN yyStart ch Esthar TI tartChar start tj z oO un start PROCEDURE yyStart se END startSet s
58. f an inserted token, which is given by the parameter Token. The procedure should return a default value for the additional properties of the token Token. The procedure SetAttribute can be used to store values in the variable Attribute. If the scanner reaches the end of the input it returns the special token called EofToken, which is encoded by 0. 5.5.2 Source Interface The scanners generated by Rex need a source class for blocked input of characters. Rex can provide a prototype source module which reads from standard input. It is contained in the file source.e and has the following interface: class Source creation Open feature Open filename STRING is GetLine wanted INTEGER STRING is Close is Open is called from the scanner in order to open files or to initialize any other source of input. If not called, input is read from standard input. GetLine is called in order to return a block of at most wanted characters. Close is called from the scanner at end of file or at end of input, respectively. It can be used to close files. 5.5.3 Scanner Driver A main program is necessary for the test of a generated scanner. Rex can provide a minimal main program in the file <Scanner>drv.e which can serve as test driver. It counts the tokens and looks as follows: class <Scanner>Drv creation main feature main is local Token INTEGER Count INTEGER do
59. f the action of a rule is preceded by a colon, as in all the examples so far. However, if the character - is appended to the colon (:-), the calculation of the source position is disabled for that rule. There are mainly two reasons not to compute the position. First, some compound tokens have to be recognized by the combination of several rules, usually in connection with a start state. In order to get the correct position, which is the position yielded by the first rule, the calculation of the position has to be disabled for the following rules. Example (Pascal strings): START string RULE STD yyStart string string t n 250 string cs 0 string yyStart STD return tString Second, there is no need to calculate the source position in rules that skip input characters without returning a token. In this case disabling the computation of the position yields an increase in run-time efficiency. The typical examples are comments. The example given in the chapter about start states should be rewritten as follows. Example (Modula-2 comments): START comment RULE TE level yyStart comment comment level if level 0 yyStart STD comment AtAn The automatic computation of the line and column position for every token is the default behaviour of a generated scanner. This mechanism can be changed. It is implement
60. face HasPosition, i.e. it has a method position which returns an instance of Position. Position has to be a class with at least two fields called line and column. This arrangement leaves the user free to decide whether to have a field of type Position or to inherit directly from Position. The default definition of the macro yySetPosition assumes the latter; this minimises the number of objects created. The values of line and column are computed by the scanner automatically. They indicate the source position of the current token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. During automatic error repair a parser may insert tokens. In this case the parser calls the procedure errorAttribute in order to ask for the additional properties of an inserted token, which is given by the parameter token. The procedure should return default values for the additional properties of the token. In the event of an internal error in the scanner, an exception de.cocolab.reuse.FatalError will be thrown. It is not required to catch this exception. If the scanner reaches the end of the input it returns the special token called eofToken, which is encoded by 0. The internal scanner interface consists of the following objects: The initial size of the scanner input buffer is defined by the value of the preprocessor symbol yyInitBufferSize, with a default of 8448. The buffer size is increased automatica
61. fer e wchar_t_ Buffer e int Length int Size int Size <Scanner>_BeginSourceFile is called from the scanner method BeginFile, indicating that input should be read from a file. The file specified by the parameter FileName is opened and used as input file. If not called, input is read from standard input. The function should return an integer file descriptor as provided by the system call open, or any other handle understood by the function <Scanner>_GetLine. <Scanner>_BeginSourceFileW is called from the scanner method BeginFileW. It does the same as <Scanner>_BeginSourceFile for file names given by wide character strings. The source module has to be extended by the user in order to implement this feature. <Scanner>_BeginSourceMemory is called from the scanner method BeginMemory, indicating that input should be read from the null-terminated string of input items at location InputPtr. The input string may not contain null characters. The contents of the string may not be changed until it has been processed completely. <Scanner>_BeginSourceMemoryN is called from the scanner method BeginMemoryN, indicating that the input consists of Length input items at location InputPtr. The input may contain null characters. The contents of the input may not be changed until it has been processed completely. <Scanner>_BeginSourceGeneric is called from the scanner method BeginGeneric, ind
62. fine yyTabSpace 4 5.1.2 Source Interface The scanners generated by Rex need a source module that provides blocked input of characters. Rex can provide a prototype source module which can read from standard input, from any file, or from memory. It is contained in the files <Scanner>Source.h and <Scanner>Source.c. The specification file <Scanner>Source.h consists of something like: extern void <Scanner>_SetEncoding (int Encoding, int Endian); extern int <Scanner>_BeginSourceFile (char * FileName); extern int <Scanner>_BeginSourceFileW (wchar_t * FileName); extern void <Scanner>_BeginSourceMemory (void * InputPtr); extern void <Scanner>_BeginSourceMemoryN (void * InputPtr, int Length); extern void <Scanner>_BeginSourceGeneric (void * InputPtr); extern int <Scanner>_GetLine (int File, char * Buffer, int Size); extern int <Scanner>_GetWLine (int File, wchar_t * Buffer, int Size); extern void <Scanner>_CloseSource (int File); <Scanner>_BeginSourceFile is called from the scanner function <Scanner>_BeginFile, indicating that input should be read from a file. The file specified by the parameter FileName is opened and used as input file. If not called, input is read from standard input. The function should return an integer file descriptor as provided by the system call open, or any other handle understood by the fun
63. his macro is empty. The macro yyAttributePosition (attribute) may be used to change how position information is obtained from an instance of ScanAttribute. This is only of significance when the scanner is reporting some internal error, such as misuse of the yyPush/yyPop methods. A scanner to be used by a lark generated parser has other requirements. 5.6.3 Source Interface The scanners generated by Rex need a source module for blocked input of characters. For Java this is any class which descends from java.io.InputStream. 5.6.4 Scanner Driver A main program is necessary for the test of a generated scanner. Rex can provide a minimal main program in the file <Scanner>Drv.java which can serve as test driver. It counts the tokens and looks like the following: import java io Simple class for driving a generated scanner class <Scanner>Drv public static void main String argv throws java.io.IOException <Scanner> scanner new <Scanner> int token boolean debug false String filename null int count 0 for int i 0 i < argv length i if argv i equals D debug true else filename argv i if filename null scanner beginFile new FileInputStream filename do token scanner getToken count if debug String word scanner getWord System err println scanner attribute position token
64. icating that the input is user-defined at location InputPtr. The source module has to be extended by the user in order to implement this feature. <Scanner>_GetLine is called from the scanner in order to fill a buffer at address Buffer with a block of at most Size characters. Input should be read from the file specified by the integer file descriptor File if the current input stream comes from a file. Otherwise input comes from memory and the parameter File can be ignored. Lines are terminated by newline characters (ASCII = 0xa). The function returns the number of characters transferred. Reasonable block sizes are between 128 and 8192, or the length of a line. Smaller block sizes, especially block size 1, will drastically slow down the scanner. The end of file or end of input is indicated by a return value <= 0. <Scanner>_GetWLine is the same as <Scanner>_GetLine for type wchar_t instead of type char. <Scanner>_CloseSource is called from the scanner method CloseFile at end of file or at end of input, respectively. It can be used to close files. The functions <Scanner>_BeginSource and <Scanner>_CloseSource can be called in a nested way, for example in order to handle include files. The encoding and the endian property of the input stream are stacked. Therefore, after a call of <Scanner>_CloseSource the properties of the previous input stream are restored. The function <Scanner>_SetEncoding can be
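The contract of a GetLine-style function (fill a buffer with at most Size characters, return the number of characters transferred, signal the end of input by a non-positive result) can be illustrated with a small in-memory source. This is only a sketch under stated assumptions: the type MemSource and the function mem_get_line are hypothetical names, the signature differs from the generated <Scanner>_GetLine, and stopping each block at a newline is just one possible blocking policy.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory input source; all names are illustrative only. */
typedef struct {
    const char *data;   /* input text */
    int length;         /* total number of characters */
    int pos;            /* index of the next unread character */
} MemSource;

/* Fill Buffer with at most Size characters, stopping after a newline so
   that a block never extends past the end of a line.  Returns the number
   of characters transferred; 0 signals the end of the input. */
static int mem_get_line(MemSource *src, char *Buffer, int Size)
{
    int n = 0;

    assert(src != NULL && Buffer != NULL && Size > 0);
    while (n < Size && src->pos < src->length) {
        char ch = src->data[src->pos++];
        Buffer[n++] = ch;
        if (ch == '\n')
            break;
    }
    return n;
}
```

A caller would invoke the function repeatedly until it returns 0. With a block size of 1 every call transfers a single character, which is the slow case the text warns about.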
65. ike LEX Rex Appendix 1 Syntax of the Specifi cation Language specification decls name code characterSet define start rules charDef definition identList rule patternList pattern startStates regExpr decls rules decl decl decl decl decl name code charact define start un nm nun SCANNER Iden SCANNER Dotte ji LH IMPOR arge EXPOR arge GLOBAL Targe OCAL Targe BEGIN Targe CLOSE arge DEFAULT Targe EOF arge CHARACTER_SET DEFINE defini erSet t dIdent Java only tCode tCode tCode tCode tCode tCode tCode tCode charDef tion START identL RULE rule RULES rule CsChar CsNu CsNumber CsNu Ident reg Ident identList Ide identList patternList patternList pattern patternList startStates identList rir rxr rur NOT ident rar ident regExpr r regTerm ist identList mber mber EXpr nt Ident TargetCode TargetCode pattern lt regExpr regExpr gt 23 List List egTerm 45 regTerm regFactor charSet range Char Ident DottedIdent letter or digit String Number Target code CsChar CsNumber Rex
66. ine returns the number of characters transferred. Reasonable block sizes are between 128 and 2048, or the length of a line. Smaller block sizes, especially block size 1, will drastically slow down the scanner. CloseSource is called from the scanner at end of file or at end of input, respectively. It can be used to close files. The implementation module in the file <Scanner>Source.mi has the following contents: IMPLEMENTATION MODULE <Scanner>Source; FROM SYSTEM IMPORT ADDRESS; FROM System IMPORT tFile, OpenInput, Read, Close; PROCEDURE BeginSource (FileName: ARRAY OF CHAR): tFile; BEGIN RETURN OpenInput (FileName); END BeginSource; PROCEDURE GetLine (File: tFile; Buffer: ADDRESS; Size: CARDINAL): INTEGER; CONST IgnoreChar = ' '; VAR n: INTEGER; VAR BufferPtr: POINTER TO ARRAY [0..30000] OF CHAR; BEGIN # ifdef Dialog n := Read (File, Buffer, Size); (* Add dummy after newline character in order to supply a lookahead for rex. This way newline tokens are recognized without typing an extra line. *) BufferPtr := Buffer; IF (n > 0) AND (BufferPtr^[n - 1] = 012C) THEN BufferPtr^[n] := IgnoreChar; INC (n); END; RETURN n; # else RETURN Read (File, Buffer, Size); # endif END GetLine; PROCEDURE CloseSource (File: tFile); BEGIN Close (File); END CloseSource
67. itten without enclosing braces. Rex automatically calculates the source position of the tokens in the fields Line and Column of the variable <Scanner>_Attribute. There are predefined rules to skip the white space characters. Include files with an unlimited nesting depth can be processed. Routines are provided to normalize tokens to upper or lower case characters. Table: Syntactical differences between Rex and LEX, given as meaning: LEX / Rex. delimiter for character classes: [ ] / { }; complement of character classes: [^ ] / - { }; any character: . / ANY; left justification: ^ / <; right justification: $ / >; replicator: {n} / [n]; replicator: {m,n} / [m-n]; delimiter for start states: < > / # #; escape representation for characters: octal / decimal; scanner routine: yylex / <Scanner>_GetToken; access to matched string: yytext / <Scanner>_GetWord; length of matched string: yyleng / result of <Scanner>_GetWord; output of matched string: ECHO / yyEcho; retain part of matched string: yyless / yyLess; initial start state: INITIAL / STD; change of start state: BEGIN / yyStart; character set: %T / CHARACTER_SET; action at end of input: yywrap / EOF section. No adjustment of the internal data structures is necessary to be able to process large specifications. Disadvantages of Rex: The action statement yymore is not available. The action statement REJECT is not available. Rex can only find one solution and not all l
68. leName: String) return Integer; procedure GetLine (File: Integer; Buffer: out String; Last: out Integer); procedure CloseSource (File: Integer); end <Scanner>Source; BeginSource is called from the scanner in order to open files or to initialize any other source of input. If not called, input is read from standard input. GetLine is called in order to fill a buffer starting at address Buffer with a block of at most Size characters. Lines are terminated by newline characters (ASCII = 12C). GetLine returns the number of characters transferred. Reasonable block sizes are between 128 and 2048, or the length of a line. Smaller block sizes, especially block size 1, will drastically slow down the scanner. CloseSource is called from the scanner at end of file or at end of input, respectively. It can be used to close files. The implementation module in the file <Scanner>source.adb has the following contents: with System; use System; package body <Scanner>Source is function BeginSource FileName String return Integer is function OpenInput FileName Address return Integer pragma Interface C OpenInput pragma Interface_Name OpenInput OpenInput C_Name gt String l aa 256 begin C Name 1 FileName Last FileName C Name FileName Last 1 Character Val 0 return OpenInput C_Name Address end BeginSource procedure GetLine File Integer Buffer out String Size Integer
69. lly when necessary. The initial buffer size can be changed by including a C preprocessor directive in the GLOBAL section such as: # define yyInitBufferSize 562 For best results the value should be a power of two plus a constant between 50 and 256. The stack for include files is supplied by default with unlimited size. If nested include files are not required, the size of the generated scanner can be reduced by including a preprocessor directive in the GLOBAL section such as: # define yyInitFileStackSize 0 The value for tab stops is defined by the preprocessor symbol yyTabSpace, with a default of 8. This value can be changed by including a preprocessor directive in the GLOBAL section such as: # define yyTabSpace 4 The stack used by yyPush and yyPop initially has 16 elements and will grow as required. A different initial size can be specified by including a preprocessor directive in the GLOBAL section such as: # define yyInitStStStackSize 32 If the initial size is given as zero, then there is no start state stack and yyPush/yyPop may not be used. This feature may be used to obtain the smallest possible scanner. 5.6.2 Tuning the Scanner Interface It is not possible to design one interface to the scanner that is optimal in all circumstances. This is because different cases will require different emphasis on the following characteristics: the number of objects created per token: there will always be an object represe
70. lt of this analysis is a list of pairs of patterns that are ambiguous with respect to each other. 3.4 Definitions Regular expressions can be given names. This serves to avoid duplication of regular expressions or to increase the expressive power of a specification. After the keyword DEFINE a list of identifiers can be associated with regular expressions. Defined identifiers appearing within regular expressions are replaced by the regular expression given in the definition. Identifiers have to be declared before use. Undefined identifiers are treated as strings by default and reported as errors when option -x is set. The identifier ANY is predefined to match any character except newline. Example: letter 1 AZ 2 2 digit 0 9 string character Mn ANY 2 3.5 Start States For complex tasks Rex offers a facility called start states. Usually the generated scanner is always in the standard state called STD, and all specified patterns are recognized. In general, the scanner is allowed to change its state between an arbitrary number of user-defined states. The patterns can be specified to be recognized only in certain states. Initially the scanner is in the standard start state STD. There are special statements to change the state of the scanner (see section 4.4). They can be used in the actions of the rules. Two kinds of start states are distinguished: inclusive start states and exclusive start states. This disti
71. n InputPtr. The input string may not contain null characters. The contents of the string may not be changed until it has been processed completely. The method BeginMemoryN may be called in order to indicate that the input consists of Length input items at location InputPtr. The input may contain null characters. The contents of the input may not be changed until it has been processed completely. The method BeginGeneric may be called in order to indicate that the input is user-defined at location InputPtr. The source module (see below) has to be extended by the user in order to implement this feature. The method CloseFile may be called in order to close the current input file before reaching end of file or end of input. CloseFile is called automatically by the scanner upon reaching end of file or end of input. The constructor <Scanner> is called automatically in order to initialize a scanner object. The contents of the target code section named BEGIN is included in the body of this method. The destructor ~<Scanner> is called automatically in order to finalize a scanner object. The contents of the target code section named CLOSE is included in the body of this method. If the scanner reaches the end of the input it returns the special token called <Scanner>_EofToken, which is encoded by 0. The preprocessor symbol <Scanner>_xxMaxCharacter is used to describe the range of the character set. The prepro
72. nction is relevant for patterns given without start states. Start states have to be defined by giving a list of identifiers after the keyword START. Two groups of identifiers, divided by a separator character, can be given. The identifiers in the first group are treated as inclusive start states, while the identifiers in the second group are treated as exclusive start states. The identifiers in every group may be separated by commas. The standard state STD is predefined as an inclusive start state. A pattern given without start states is recognized when the scanner is in any inclusive start state. The exclusive start states are not considered. A pattern preceded by the corresponding marker characters is recognized when the scanner is in any start state. Both inclusive and exclusive start states are considered. A pattern preceded by a list of start states enclosed in # characters is recognized only if the scanner is in one of the listed start states. Again, the listed start states may be separated by commas. A pattern preceded by the keyword NOT and a list of start states enclosed in # characters is recognized only if the scanner is in a start state not listed. Instead of the keyword NOT, a single-character operator can be used as well. Example: START comment RULE 20 level yyStart comment comment level if level 0 yyStart STD comment 4 STD 0 9
73. nting source location (Position), and there will often be an additional object encapsulating this together with other information about the token, for example a coded identifier. This is important where the input is large and run time is to be minimized. memory usage: the size of an instance of ScanAttribute is important if it is to be stored in an abstract syntax tree and the size of the input may be large. the number of classes generated: the time to load a Java applet over the WWW increases with the number of classes. For small input, load time is the most significant factor. A scanner generated for Java may be tuned in a number of ways, using macros defined in the GLOBAL section and by proper choice of design for the ScanAttribute class. These facilities are introduced here; more details may be found by examining the skeleton from which Rex generates a Java scanner, i.e. the file Scanner.java found in the lib/rex directory within the Cocktail installation. The name ScanAttribute may be defined to Position, so that attributes consist directly of an instance of Position, this being the minimum requirement of a lark generated parser. This is suitable for a small, fast scanner which does not have to deliver additional attributes. ScanAttribute may extend Position instead of having a field of type Position. This avoids an additional object creation at the cost of having some information in the ScanAttribute class about the implementation of P
74. ntral scanning routine. It returns the next token found in the input, or whatever is specified in the actions associated with the regular expressions. The procedure beginFile may be called in order to open an input file or a nested include file. The parameter stream specifies the input source. If not called, input is read from standard input. Include files may be nested to arbitrary depth. The procedure closeFile may be called in order to close the current input file before reaching end of file. closeFile is called automatically by the scanner upon reaching end of file. The contents of the target code section named BEGIN is included in the constructor <Scanner> and is executed whenever a new scanner object is created. The procedure finalize may be called in order to finalize user data. The contents of the target code section named CLOSE is included in the body of this procedure. A good JVM will call this procedure when the scanner object is garbage collected, or it may be called explicitly. The procedures getWord, getLower and getUpper allow access to the matched character sequence as described in section 4.4. The variable tokenLength specifies the number of matched characters. The variable attribute is supposed to communicate additional properties of the current token. The value must be provided by appropriate action statements. This variable is of type ScanAttribute, which has to be a class which implements the inter
75. okenIndex: INTEGER; VAR Attribute: tScanAttribute; VAR Exit: PROC; PROCEDURE BeginScanner; PROCEDURE BeginFile (FileName: ARRAY OF CHAR); PROCEDURE GetToken (): INTEGER; PROCEDURE GetWord (VAR Word: Strings.tString); PROCEDURE GetLower (VAR Word: Strings.tString); PROCEDURE GetUpper (VAR Word: Strings.tString); PROCEDURE CloseFile; PROCEDURE CloseScanner; END <Scanner>. The procedure GetToken is the central scanning routine. It returns the next token found in the input, or whatever is specified in the actions associated with the regular expressions. The procedure BeginFile may be called in order to open an input file or a nested include file. The parameter FileName specifies the file name. The empty string denotes input from standard input. If not called, input is read from standard input. Include files up to a nesting depth of 15 can be processed. The procedure CloseFile may be called in order to close the current input file before reaching end of file. CloseFile is called automatically by the scanner upon reaching end of file. The procedure BeginScanner may be called in order to initialize user data. The contents of the target code section named BEGIN is included in the body of this procedure. The procedure CloseScanner may be called in order to finalize user data. The contents of the target code section named CLOSE is included in the body of this procedure. The procedures GetWord GetLo
76. olve ambiguities. The predefined rules help to calculate the line and column positions and to skip blanks efficiently. The implicit definition of the first 3 rules can be switched off with option -v. The fourth rule can be overwritten using the DEFAULT section. RULE Nt n ANY yyTab yyEol 0 yyEcho 4.4 Action Statements The following statements can be used within the actions associated with regular expressions. The exact syntax varies according to the target language, for example GetWord may be a function returning a value; see the next section for details. <Scanner>_GetWord (v) This statement gives access to the matched character sequence. In C or C++ the sequence is returned in the variable v, which must be of type char * v or wchar_t * v. Additionally, the length of the sequence is returned as result of the function. In Modula-2 the sequence is returned in the variable v, which must be of type Strings.tString. <Scanner>_GetLower (v) Like <Scanner>_GetWord, except that every letter is normalized to lower case. <Scanner>_GetUpper (v) yyEcho yyLess (n) yyStart (s) yyPush (s) yyPop () yyPrevious yyStartState yyTab Like <Scanner>_GetWord, except that every letter is normalized to upper case. The matched character sequence is printed on standard output. The matched character sequence is truncated to the first n characters. The other characters
77. osition. Specifically, there needs to be a constructor which mirrors that of Position, and a decision must be made as to what toString should return: just the position, or some representation of any additional attributes. For an example of this technique see the default EXPORT section in the skeleton.

Another way of achieving the same end is to have ScanAttribute implement HasPosition. The values of line and column are stored as fields and used to create an instance of Position only when the position method is called, that is, only if a syntax error is detected. This achieves the aim of creating only one object per token for correct input, while avoiding the issue of what toString should return.

By default the macro yySetPosition is defined to create an instance of ScanAttribute from the position information. This macro is called whenever a pattern is matched and position calculation has not been suppressed (see section 3.8), but before any user action code is entered. If the user action code may create some subclass of ScanAttribute in order to include attributes specific to the type of token (the value of a numeric literal, for example), then either yySetPosition must be redefined or position calculation must be suppressed for those rules which will instantiate some descendant of ScanAttribute.

The macro yyGetTokenBegin may be used to execute code at the beginning of getToken, that is, for every token read. By default t
78. r Interface

The scanner interface consists of two parts. While the objects specified in the external interface can be used from outside the scanner, the objects of the internal interface can be used only within a scanner description. The external scanner interface in the file <Scanner>.h has the following contents:

   # include "Position.h"
   typedef struct { tPosition Position; } <Scanner>_tScanAttribute;
   extern void <Scanner>_ErrorAttribute (int Token, <Scanner>_tScanAttribute * Attribute);
   # define <Scanner>_EofToken 0
   # define <Scanner>_xxMaxCharacter 255
   # if xxMaxCharacter < 256
   # define <Scanner>_xxtChar char
   # else
   # define <Scanner>_xxtChar wchar_t
   # endif
   extern <Scanner>_xxtChar * <Scanner>_TokenPtr;
   extern int <Scanner>_TokenLength;
   extern <Scanner>_tScanAttribute <Scanner>_Attribute;
   extern void (* <Scanner>_Exit) (void);
   extern void <Scanner>_BeginScanner (void);
   extern void <Scanner>_BeginFile (char * FileName);
   extern void <Scanner>_BeginFileW (wchar_t * FileName);
   extern void <Scanner>_BeginMemory (void * InputPtr);
   extern void <Scanner>_BeginMemoryN (void * InputPtr, int Length);
   extern void <Scanner>_BeginGeneric (void * InputPtr);
   extern int <Scanner>_GetToken (void);
   extern int <Scanner>_GetWord (<Scanner>_xxtChar * Word);
   extern int <Scanner>_
79. r2 AStr2f Nr An Message unterminated string Error Attribute Position INC ErrorCount Strings Append String 12C yyEol 0 yyPrevious Concatenate TargetCode String Strl Str2 CStrl CStr2 r Strings Append String 15C STD rules charsetf 2 yyStart comment StartPosition Attribute Position commentf cmtch comment yyPrevious STD rules charsetf ANY STD IMPORT PrevState STD RETURN SymImport B STD EXPOR PrevState STD RETURN SymExport H STD GLOBAL PrevState STD RETURN SymGlobal STD OCAL PrevState STD RETURN SymLocal H STD BEGIN PrevState STD RETURN SymBegin 3 STD CLOSE PrevState STD RETURN SymClose H STD DEFAU PrevState STD RETURN SymDefault 7 STD EOF PrevState STD RETURN SymEof H STD SCANNER RETURN SymScanner AN STD CHARACTER_SET RETURN SymCharSet 7 STD DEFINE RETURN SymDefine 7 STD START RETURN SymStart 7 STD RULE RETURN SymRule 7 STD RULES RETURN SymRules 7 STD rules letter letter digit _ GetWord Word Attribute Ident MakeIdent Word RETURN SymIdent STD rules digit GetWord Word Attribute Number StringTolnt Word RETURN SymNumber STD rules NIU string X mod IF TokenLength cMaxStrLength THEN Message string too long max 255 Restriction Attribute Position INC ErrorCount ArrayToString TargetCode ELSIF TokenLength 2
80. read in from a file named <scanner>.txt. The contents of the target code section named BEGIN are included in the body of this procedure.

The procedure CloseScanner may be called in order to finalize user data. The contents of the target code section named CLOSE are included in the body of this procedure.

The procedures GetWord, GetLower, and GetUpper allow access to the matched character sequence as described in section 4.4. The variable TokenLength specifies the number of matched characters.

The variable Attribute is supposed to communicate additional properties of the current token. The value must be provided by appropriate action statements. The class of this feature has to be a subclass of the predefined support class ScanAttribute. This class has one feature called Position of type Position. The class Position has two features called Line and Column. The values of Line and Column are computed automatically by the scanner. They indicate the source position of the current token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The classes ScanAttribute and Position are predefined in the library reuse eiffel. Subclasses of these classes can be defined in order to reflect application-specific needs.

During automatic error repair a parser may insert tokens. In this case the parser calls the procedure ErrorAttribute in order to ask for the additional properties o
81. ression followed by a range in brackets matches a character sequence which can be matched by repeating the preceding regular expression a number of times lying between the two given numbers. a [2-4] matches the character sequences aa, aaa, and aaaa.

Parentheses can be used for grouping in more complex regular expressions. a bt c d matches strings like acdcd, cdcdcd, bcd, or bbb, but not ab, abb, or abcd.

A complete regular expression which is not part of any other regular expression is called a pattern. A pattern is matched in exactly the same way as a regular expression. It can be augmented by the following specifications.

A pattern preceded by the operator '<' matches a character sequence only if it appears at the beginning of a line. a z matches identifiers only at the beginning of lines.

A pattern followed by the operator '>' matches a character sequence only if it appears at the end of a line. n > matches trailing spaces. < C ANY > matches FORTRAN comment lines.

A pattern followed by the operator '/' and a regular expression matches a character sequence only if it is followed by a character sequence that is matched by the regular expression behind the operator. 0 9 4 f matches numbers, but only if followed by two dots.

Several patterns that share a common action can be given in a comma-separated list; thus the action has to b
82. ritten in the language described in the next chapter. The output is the source text of a scanner. The source text consists of a specification and a body part. These parts are files with the suffixes .h and .c if C is the target language. In the case of Modula-2 the parts are a definition and an implementation module.

The scanner requires a source module to get blocks of characters, e.g. by input from file. Rex can be asked to provide a prototype source module which performs input from the UNIX standard input file. Additionally, Rex can be asked to provide a main program to serve as test driver of the scanner. This main program calls the scanner routine until the end of the input is reached. The above-mentioned source programs constitute the minimum configuration to run the generated scanner.

What happens after the compilation of the program modules is shown in the run time half of Fig. 1. The scanner driver starts calling the scanner routine, which in turn sometimes calls the source module routines to get characters. The data flow is in the opposite direction: the source module returns blocks of characters to the scanner. The scanner analyzes the character stream, executes the associated actions upon finding character sequences matched by regular expressions, and eventually returns tokens to the scanner driver. In general the scanner driver can be

[Fig. 1: generation time, run time, Scanner]
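The control flow described above (a driver that calls the scanner routine until the end of the input is reached) can be sketched in C as follows. The scanner here is a hand-written stub that splits on blanks, not a scanner generated by Rex; begin_memory, get_token, count_tokens, TOKEN_WORD and EOF_TOKEN are hypothetical names chosen for this sketch, with EOF_TOKEN set to 0 to mirror the EofToken constant of the scanner interface.

```c
#include <ctype.h>

#define EOF_TOKEN  0   /* mirrors the EofToken = 0 convention */
#define TOKEN_WORD 1   /* hypothetical token code for the stub */

static const char *input_ptr = "";

/* Stub "source": the scanner reads from a string held in memory. */
static void begin_memory(const char *text)
{
    input_ptr = text;
}

/* Stub scanner routine: skips blanks, then returns TOKEN_WORD for each
   run of non-blank characters and EOF_TOKEN at the end of the input. */
static int get_token(void)
{
    while (*input_ptr != '\0' && isspace((unsigned char)*input_ptr))
        input_ptr++;
    if (*input_ptr == '\0')
        return EOF_TOKEN;
    while (*input_ptr != '\0' && !isspace((unsigned char)*input_ptr))
        input_ptr++;
    return TOKEN_WORD;
}

/* Driver loop: calls the scanner until the end-of-file token is seen.
   The end-of-file token itself is counted as well. */
static int count_tokens(const char *text)
{
    int token;
    int count = 0;

    begin_memory(text);
    do {
        token = get_token();
        count++;
    } while (token != EOF_TOKEN);
    return count;
}
```

With this stub, the text "a bb  ccc" yields three word tokens plus the end-of-file token.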
83. se items are needed in combination with parser generators. A variable called <Scanner>_Attribute of type <Scanner>_tScanAttribute is used to communicate additional properties of the tokens from the scanner to the parser. The type <Scanner>_tScanAttribute has to be a struct (record) type with at least one member (field) called Position of type tPosition. tPosition has to be a struct (record) type with at least two members (fields) called Line and Column (see section 3.8). It can be imported from the predefined module Position or from a user-modified version of it.

During automatic error repair a parser may insert tokens. In this case the parser calls the procedure <Scanner>_ErrorAttribute in order to ask for the additional properties of an inserted token, which is given by the parameter Token. The types tPosition and <Scanner>_tScanAttribute are predefined as given above and the procedure <Scanner>_ErrorAttribute is empty. If only one of the sections IMPORT, EXPORT, or GLOBAL is used, it has to contain declarations consistent with the remaining predefined ones.

3.8 Source Position

The generated scanners automatically compute the line and column position of every token. This position can be accessed via the fields Position.Line and Position.Column of the global variable <Scanner>_Attribute, as described in the section about the Scanner Interface. The source position is computed automatically i
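The requirements on the attribute type stated in section 3.7 above (a struct with a member Position, whose type has members Line and Column) can be sketched in C as follows. This is an illustration only: the tPosition shown is a stand-in for the one from the Position module, and the member Value, the constant TOKEN_NUMBER, the function name error_attribute and the chosen default are hypothetical additions for the sketch.

```c
/* Stand-in for the type normally imported from the Position module. */
typedef struct {
    int Line;
    int Column;
} tPosition;

/* Attribute record: the required member Position plus one
   hypothetical token-specific member. */
typedef struct {
    tPosition Position;
    long Value;          /* hypothetical: value of a numeric literal */
} tScanAttribute;

#define TOKEN_NUMBER 1   /* hypothetical token code */

/* Sketch of an ErrorAttribute-style procedure: supplies the additional
   properties of a token inserted by the parser during error repair.
   Position is left untouched in this sketch. */
static void error_attribute(int token, tScanAttribute *attribute)
{
    attribute->Value = 0;
    if (token == TOKEN_NUMBER)
        attribute->Value = 1;   /* arbitrary placeholder literal */
}
```

Keeping Position as a member of the record preserves the layout the text requires while leaving room for further token-specific members.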
84. t The function <Scanner>_GetWLine will convert the input stream to a stream of type wchar_t. The following constants describe the encoding of the input stream:

   # define CODE_NONE     0
   # define CODE_BYTE     1   /* 1 byte */
   # define CODE_WCHAR_T  2   /* 2 or 4 bytes */
   # define CODE_UCS2     3   /* 2 bytes */
   # define CODE_UCS4     4   /* 4 bytes */
   # define CODE_UTF8     5   /* seq. of 1 byte */
   # define CODE_UTF16    6   /* seq. of 2 bytes */

The above comments give the size of an input stream item in bytes. All input stream items, or sequences of input stream items in the cases of UTF8 and UTF16, represent Unicode characters. The encodings BYTE, UCS2, and UTF16, and possibly WCHAR_T, can represent subsets of the full Unicode character set only. A Unicode character will be stored in variables of type wchar_t. Note: the size of wchar_t can be 2 or 4 bytes depending on the compiler. Therefore, if the size of wchar_t is 2, then characters encoded by UCS4, UTF8, and UTF16 will be truncated to two bytes.

The following constants describe the endian property of the input stream:

   # define ENDIAN_NONE    0   /* no endian property specified */
   # define ENDIAN_LITTLE  1   /* little endian */
   # define ENDIAN_BIG     2   /* big endian */

5.1.3 Scanner Driver

A main program is necessary for the test of a generated scanner. Rex can provide a minimal main program in the file <Scanner>Drv.c which can serve as test driver. It counts the tokens and looks like th
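The truncation mentioned above for a 2-byte wchar_t can be illustrated with fixed-width integer types. The sketch assumes that truncation keeps the low-order 16 bits, as C's conversion to an unsigned 16-bit type does; the text does not state how the generated scanner performs it. The names truncate_to_16_bits and fits_in_16_bits are hypothetical.

```c
#include <stdint.h>

/* Keeps only the low-order 16 bits of a code point, which is what
   storing it in a 2-byte unsigned character type amounts to. */
static uint16_t truncate_to_16_bits(uint32_t code_point)
{
    return (uint16_t)(code_point & 0xFFFFu);
}

/* Tells whether a code point survives the truncation unchanged. */
static int fits_in_16_bits(uint32_t code_point)
{
    return code_point <= 0xFFFFu;
}
```

For example, the code point 0x1F600 does not fit and is reduced to 0xF600, while 0x41 is unaffected.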
85. t the head of the class file after the package declaration, if there is one. It may be used to add import statements. For other languages an IMPORT section is treated like an EXPORT section.

Target code after the keyword EXPORT is included in the specification part (definition module) of the generated scanner. It allows extending the set of objects exported by the scanner module. If not given, it is predefined as described below.

Target code after the keyword GLOBAL is included in the scanner module at level 0, that is, the extent of variables given in this section is the run time of the whole program. If not given, it is predefined as described below.

Target code after the keyword LOCAL is included in the scanner routine called <Scanner>_GetToken at level 1, that is, the extent of variables given in this section is one invocation of this routine.

Target code after the keyword BEGIN is included in the routine <Scanner>_BeginScanner, which may be called in order to initialize the data structures declared in the sections EXPORT and GLOBAL.

Target code after the keyword CLOSE is included in the routine <Scanner>_CloseScanner, which may be called after scanning is finished. These statements can be used to finalize the data structures declared in the sections EXPORT and GLOBAL.

Target code after the keyword DEFAULT is included in the scanner routine, to be executed whenever a character is not matched by one of the r
86. tart PROCEDURE yyStart ru tartRule E 9 u LOCAL VAR String n PPLine PROCEDURE BEGIN GetWord Concatena ND AppendCo Appen tr BEGIN level 0 AssignEmpty NoString ErrorCount DEFAULT Word Rex 7 Charset arset RUE set Set EG Rules les S tString LONGCARD dCode Word te TargetCode de Word string PutString string 07 Message illegal character INC ErrorCount Error Attribute Position BOF CASE yyStartState OF targetcode set Message terminating missing INC ErrorCount comment Message unterminated comment INC ErrorCount CSErl OStr2 AStrl AStr2 Strl Str2 Message unterminated string INC ErrorCount ELSE END yyStart STD Error Error Error StartPosition StartPosition StringPosition 56 Rex DEFINE letter A Z a z digit 0 9 octdigit 0 7 hexdigit 0 9 A F a f string n cmtch t n code t n r gGiyf StrChl t n StrCh2 t n CStrChl t n CStrCh2 t n AStrch 7 START targetcode set
87. th a default of 8. The stack size is increased automatically when necessary. The initial stack size can be changed by including a preprocessor directive in the GLOBAL section such as:

   # define yyInitFileStackSize 16

The value for tab stops is defined by the preprocessor symbol yyTabSpace with a default of 8. This value can be changed by including a preprocessor directive in the GLOBAL section such as:

   # define yyTabSpace 4

5.2.2 Source Interface

The scanners generated by Rex need a source module that provides blocked input of characters. Rex can provide a prototype source module which can read from standard input, from any file, or from memory. It is contained in the files <Scanner>Source.h and <Scanner>Source.cxx. The specification file <Scanner>Source.h consists of something like:

extern void lt Scanne extern int lt Scanne extern int lt Scanne extern void lt Scanne extern void lt Scanne extern void lt Scanne extern int lt Scanne extern int lt Scanne extern void lt Scanne r gt BeginSou r gt BeginSou r gt BeginSou r gt BeginSou r gt BeginSou r gt GetLine r gt GetWLine r gt CloseSou r SetEncoding rceFile rceFileW rceMemory rceMemoryN rceGeneric rece int Encoding int Endian char FileName wchar_t FileName void void void int Fil int Fil int Fil InputPtr InputPtr InputPtr e char Buf
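The effect of the tab stop setting mentioned above (yyTabSpace, default 8) on column positions can be sketched as follows. This shows the conventional tab-stop arithmetic with 1-based columns; it is an assumption that generated scanners compute columns this way, since the text does not give the formula. TAB_SPACE and next_tab_column are hypothetical names.

```c
#define TAB_SPACE 8   /* mirrors the documented default of yyTabSpace */

/* Returns the column reached after a tab character that is read at
   the given 1-based column, with tab stops every tab_space columns
   (columns 1, 1 + tab_space, 1 + 2 * tab_space, ...). */
static int next_tab_column(int column, int tab_space)
{
    return ((column - 1) / tab_space + 1) * tab_space + 1;
}
```

With the default of 8, a tab in column 1 or column 8 leads to column 9; with a tab space of 4, a tab in column 3 leads to column 5.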
88. turn 23 STD return 24 STD gt return 25 STD gt return 26 STD return 27 STD return 28 STD return 29 STD return 30 STD return 31 STD return 32 STD return 33 STD AND return 34 STD ARRAY return 35 STD BEGIN return 36 STD BY 5 return 37 STD CASE return 38 STD CONST return 39 STD DEFINITION return 40 STD DIV return 41 STD DO return 42 STD ELSE return 43 STD ELSIE return 44 STD END return 45 STD EXIT return 46 STD EXPORT return 47 STD FOR return 48 STD FROM return 49 STD IF return 50 STD IMPLEMENTATION return 51 STD IMPORT return 52 STD IN return 53 STD LOOP return 54 STD MOD return 55 STD MODULE return 56 STD NOT return 57 SA 58 STD OF se Rex STD return 59 STD STD return 60 z o Oo C m E po return 61 return 62 ST iw G D H nj H s return 63 return 64 return 65 turn 66 n WW WON 9 215 ED Q O ye z nn nn nn an anno by EN return 67 return 68 YPE return 69 UNTIL return 70 STD VAR return 71 STD WHILE return 72 STD WITH return 73
89. unt tokenIndex yyLineStart

The macro yyColumn determines the column number for a given buffer location (TokenPtr in C and C++, TokenIndex in Modula-2, Java, and Ada). The macro yySetPosition is used by the generated scanner in order to assign the position data to the variable Attribute. It can be redefined by the user in the GLOBAL section. This allows, for example, the following:

- It is possible to get rid of the fields Line and Column.
- The fields Line and Column can be named differently.
- It is possible to implement a completely different representation for source positions, such as the absolute character offset in a file, as is used by fseek of Unix. This is achieved by using the macro yyOffset, which determines the offset value for a given buffer location.

In the following example the fields Offset and End receive the absolute character positions of the beginning and the end of a token:

   # define yySetPosition Attribute.Position.Offset = yyOffset (TokenPtr); \
      Attribute.Position.End = yyOffset (TokenPtr + TokenLength - 1);

3.9 Character Set

Scanners generated by Rex depend on the internal representation of the character set. The reason originates from the implementation of the finite automaton, which uses in principle a two-dimensional array that maps a state and a character to a new state:

   State := Table [State, Character]

The internal representation of the characters is used for the array access during run time
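The two-dimensional state table described above can be sketched in C as follows. This is a hand-built automaton for identifiers of the form letter followed by letters or digits, not a table generated by Rex, and all names are hypothetical. Filling the table with ranges such as 'a' to 'z' assumes an ASCII-compatible character set, which is exactly the kind of dependence on the internal character representation the text refers to.

```c
enum { STATE_START = 0, STATE_IDENT = 1, STATE_ERROR = 2, STATE_COUNT = 3 };

/* table[state][character] yields the new state. */
static unsigned char table[STATE_COUNT][256];

/* Builds the table: every transition leads to STATE_ERROR unless a
   lower-case letter starts or continues an identifier, or a digit
   continues one. Assumes contiguous codes for 'a'..'z' and '0'..'9'. */
static void build_table(void)
{
    int s, c;

    for (s = 0; s < STATE_COUNT; s++)
        for (c = 0; c < 256; c++)
            table[s][c] = STATE_ERROR;
    for (c = 'a'; c <= 'z'; c++) {
        table[STATE_START][c] = STATE_IDENT;
        table[STATE_IDENT][c] = STATE_IDENT;
    }
    for (c = '0'; c <= '9'; c++)
        table[STATE_IDENT][c] = STATE_IDENT;
}

/* Runs the automaton over the whole string: one array access per
   character, indexed by the character's internal code. */
static int is_identifier(const char *text)
{
    int state = STATE_START;

    build_table();
    for (; *text != '\0'; text++)
        state = table[state][(unsigned char)*text];
    return state == STATE_IDENT;
}
```

The cast to unsigned char matters: indexing with a plain char could produce a negative index for characters above 127 on platforms where char is signed.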
90. ut from file. If additionally hashing of identifiers is performed, the speed is around 1.25 million lines per minute. The generator Rex itself is 10 to 20 times faster than LEX in typical cases.

Like LEX, Rex has all the features necessary to scan contemporary languages, that is, the left and the right context can be taken into account to identify a token. The left context is handled by so-called start states and the right context by additional regular expressions. The source coordinates (line and column number) of recognized words are calculated automatically. Scanners can be generated in the languages C, C++, Modula-2, Ada, Eiffel, or Java. Rex itself was originally implemented in Modula-2.

The following chapters constitute the user manual of Rex. Chapter 2 gives an overview of the operation of Rex and how its output is to be integrated into, e.g., compilers. Chapter 3 describes the specification language. Chapter 4 summarizes the predefined items of the specification language. Chapter 5 contains the specification of the interface of the generated scanners. Chapter 6 shows how to invoke and use Rex. Chapter 7 contains some details of the implementation. Chapter 8 describes the differences between Rex and LEX for those already familiar with LEX. The appendices contain a grammar for the input language and some examples.

2 Overview

Figure 1 gives an overview of the observable behaviour of Rex. It takes as input a specification of a lexical analyser w
91. ute RECORD Position Position tPosition END PROCEDURE ErrorAttribute Token INTEGER VAR Attribute tScanAttribute GLOBAL PROCEDURE ErrorAttribute Token INTEGER VAR Attribute tScanAttribute BEGIN END ErrorAttribute DEFAULT yyEcho

If the target language is Ada:

EXPORT type tScanAttribute is record Position tPosition end record procedure ErrorAttribute Token Integer Attribute out tScanAttribute GLOBAL procedure ErrorAttribute Token Integer Attribute out tScanAttribute is begin null end ErrorAttribute DEFAULT Text_Io Put Text_Io Standard_Output yyChBufferPtr yyChBufferIndex 1

If the target language is Eiffel:

GLOBAL ErrorAttribute Token INTEGER ScanAttribute is do Result Attribute end DEFAULT yyEcho

If the target language is Java:

IMPORT import de cocolab reuse EXPORT class ScanAttribute extends Position public ScanAttribute int line int column super line column public ScanAttribute Position other super other line other column public ScanAttribute errorAttribute int token return new ScanAttribute Position NoPosition DEFAULT yyEcho

These sections import the type tPosition from a module named Position, and they declare the type <Scanner>_tScanAttribute as well as the procedure <Scanner>_ErrorAttribute. The
92. wer, and GetUpper allow access to the matched character sequence as described in section 4.4.

The variable TokenLength specifies the number of matched characters.

The variable TokenIndex is an array index of the internal buffer (an array of characters) which specifies the location where the matched character sequence starts. It can be used as argument for the macros that compute source positions.

The variable Attribute is supposed to communicate additional properties of the current token. The value must be provided by appropriate action statements. This variable is of type tScanAttribute, which has to be a record type with at least one field called Position of type tPosition. tPosition has to be a record type with at least two fields called Line and Column. The values of Line and Column are computed by the scanner automatically. They indicate the source position of the current token. The position of a token is the position of the first character of the token. For exceptions see section 3.8. The types tScanAttribute and tPosition are predefined as given above. The definitions of these types can be changed as described in section 3.7.

During automatic error repair a parser may insert tokens. In this case the parser calls the procedure ErrorAttribute in order to ask for the additional properties of an inserted token, which is given by the parameter Token. The procedure should return in the second argument called Attribute
