Unitex User Manual

Contents

1. [Chapter 10: Compound word inflection] [Table residue: inflected forms of the Serbian MWU gladan kao vuk ("hungry as a wolf"), inflection graph AC_A3XN2, listing the adjective forms gladna and gladne, the noun forms kao vuk, kao vuci and kao vukovi, and the feature codes w4mgea, w4fgea and w4ngea.] [Figure 10.28: Inflection graph NC_2XN1 for Serbian MWUs (zxiro racyun)] [Figure 10.29: Inflection graph NC_2XN2 for Serbian MWUs (avio prevoznik, avioprevoznik)] [Figure 10.30: Inflection graph NC_N2X1 for Serbian MWUs (predsednik drzxave; plural predsednici drzxave and predsednici drzxava)] [10.3 Integration in Unitex] Novi Sad, Crvena Zastava, Ujeinxenxe Nacije <2> <3 Gen g Nb n Case
2. [Figure 7.16: ELAG grammars compilation frame] in the left frame and click on the button to add them to the list in the right frame. Then click on the Compile button. This launches the ElagComp program, which compiles the selected grammars and creates a file named elag.rul by default. If you have selected grammars in the right frame, you can search for patterns with them by clicking on the Locate button. This opens the Locate Pattern window and automatically enters a graph name ending with -conc.fst2. This graph corresponds to the if part of the grammar. You can thus obtain the occurrences in the text to which the grammar will apply. NOTE: The -conc.fst2 file used to locate the if part of a grammar is generated automatically when ELAG grammars are compiled by means of the Compile button. It is thus necessary to compile your grammar before searching with the Locate button. 7.3.3 Resolving Ambiguities. Once you have compiled your grammar into an elag.rul file, you can apply it to a text automaton. In the text automaton window, click on the Apply Elag Rule button. A dialog box will a
3. Figure 6.45: Result of the application of the transducer in figure 6.44. If the beginning or the end of a variable is malformed (end of a variable before its beginning, or absence of the beginning or end of a variable), it will be ignored during the emission of outputs. There is no limit to the number of possible variables. Variables can be nested and can even overlap, as shown in figure 6.46. 6.8 Applying graphs to texts. This section only applies to syntactic graphs. 6.8.1 Configuration of the search. In order to apply a graph to a text, open the text, then click on Locate Pattern in the Text menu, or press <Ctrl+L>. You can then configure your search in the window shown in figure 6.47. [Figure 6.46: Overlapping variables. The graph lists the month names and the day names, uses <NB>, and defines the variables DayAndNumber and NumberAndMonth.] In the Locate pattern in the form of field, choose Graph and select your graph by clicking on the Set button. You can choose a graph in .grf format (Unicode Graphs) or a compiled graph in .fst2 format (Unicode Compiled Graphs). If your graph is a .grf one, Unitex will compile it automatically before starting the search. The Index field allows you to select the recognition mode
4. franc maçon, NC_AN1, fp; francs maçons, franc maçon, NC_AN1, mp; francs maçonnes, franc maçon, NC_AN1, fp; mémoire vive, mémoire vive, NC_NN, fs; mémoires vives, mémoire vive, NC_NN, fp; microscope à effet tunnel, microscope à effet tunnel, NC_NXXXXXX, ms; microscopes à effet tunnel, microscope à effet tunnel, NC_NXXXXXX, mp; porte serviette, porte serviette, NC_VNm, ms; porte serviettes, porte serviette, NC_VNm, ms; porte serviettes, porte serviette, NC_VNm, mp. [Figure 10.21: Inflection graph NC_XXN for French MWUs, e.g. avant garde, <Gen g Nb n>] [Figure 10.22: Inflection graph NC_NN for French MWUs, e.g. bateau mouche, <Gen g Nb n>] [Figure 10.23: Inflection graph NC_NXXXX for French MWUs, e.g. pomme de terre, <Gen g Nb n>] [Figure 10.24: Inflection graph NC_NNmf for French MWUs, e.g. assistant approvisionneur, <Gen g Nb n>] [Figure 10.25: Inflection graph NC_AN1 for French MWUs, e.g. franc maçon, <Gen g Nb n>] [Figure 10.26: Inflection graph NC_NXXXXXX for French MWUs, e.g. microscope à effet tunnel, <Gen g Nb n>] [Figure 10.27: Inflection graph NC_VNm for French MWUs, <Gen m Nb p>] 10.3.3 Complete Example in Serbian. Let us assume that the description of morphological features of Serbian is given by the following Morphology t x
5. [Figure 2.9: Sentence splitting grammar for French; graph created by Nathalie Friburger (LI Tours), Anne Dister (Univ. de Liège) and Denis Maurel (LI Tours)] By default, the space is optional between two boxes. If you want to prohibit the presence of the space, you have to use the special character #. Conversely, if you want to force the presence of the space, you must use the sequence " ". [2.5 Preprocessing a text] Lower and upper case letters are defined by an alphabet file (see chapter 12). For more details on grammars, see chapter 5. For more information about sentence boundary detection, see [21]. The grammar used here is named Sentence.fst2 and can be found in the following directory: (user home directory)/(language)/Graphs/Preprocessing/Sentence. This grammar is applied to a text with the Fst2Txt program in MERGE mode. As a result, the output produced by the grammar, in this case the symbol {S}, is inserted into the text. This program takes a .snt file and modifies it. 2.5.3 Normalization of non-ambiguous forms. Certain forms present in texts can be normalized (for example, the English sequence "I'm" is equivalent to "I am"). You may want to replace these forms according to your own needs. However, you have to make sure that the normalized forms are unambiguous, or that the removal of ambiguity has no undesirable consequences. For instance, if you want to normalize O'clock to on the c
6. [Figure 7.2: Overlap between a compound word and a combination of simple words] sentence automaton in figure 7.3, you can conclude that the word "which" has been coded twice as a determiner, in two subcategories of the category DET. This granularity of description will be of no use if you are only interested in the grammatical category of this word. It is therefore necessary to adapt the granularity of the dictionaries to the intended use. [Figure 7.3: Double entry for which as a determiner] For each lexical unit of the sentence, Unitex searches the dictionary of the simple words of the text for all possible interpretations. Afterwards, all combinations of lexical units that have an interpretation in the dictionary of the compound words of the text are taken into account. All the combinations of this information constitute the sentence automaton. [Chapter 7: Text automaton] NOTE: If the text contains lexical labels (e.g. {out of date,.A+z1}), these labels are reproduced identically in the automaton, without trying to decompose them. In each box, the first line contains the inflected form found in the text, and the second line contains the canonical form if it is different. The other information is coded below the box (cf. section 7.4.1). The spaces that separate the lexical units are not copied into the automaton except for the
7. [Chapter 3: Dictionaries] [Figure 3.2: Dictionary example] [Figure 3.3: Checking a dictionary] 3.3 Sorting. Unitex uses the dictionaries without having to worry about the order of the entries. When displaying them, it is sometimes preferable to sort the dictionaries. The sorting depends on a number of criteria, first of all on the language of the text. For instance, a Thai dictionary is sorted according to an order different from the alphabetical order, so different in fact that Unitex uses a sorting procedure developed specifically for Thai (see chapter 11). For European languages, sorting is usually done according to the lexicographical order, although there are some variants. Certain languages, like French, treat some characters as equivalent. For example, the difference between the characters e and é is ignored when comparing the words manger and mangés, because the contexts r and s are enough to decide the order. The difference is only taken into account when the contexts are identical, as they are when comparing pêche and pèche. To allow for this effect, the SortTxt program uses a file which defines the equivalence of characters. This file is named Alphabet_sort.txt and ca
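The two-level comparison described in this excerpt can be sketched in Python. This is only an illustration, not the SortTxt implementation: it assumes that "equivalent characters" means "same letter once accents are removed", whereas Unitex reads the equivalences from its alphabet sort file. The function names `strip_accents`, `sort_key` and `sort_entries` are hypothetical.

```python
import unicodedata


def strip_accents(word):
    """Return the word with combining accents removed (assumed equivalence rule)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(word):
    """Compare on unaccented letters first; accents only break exact ties."""
    return (strip_accents(word), word)


def sort_entries(words):
    """Sort dictionary entries with the two-level key above."""
    return sorted(words, key=sort_key)


if __name__ == "__main__":
    print(sort_entries(["mang\u00e9s", "p\u00eache", "manger", "p\u00e8che"]))
```

With this key, manger and mangés are ordered by their final letters alone, and the accent difference only matters for pairs whose unaccented spellings are identical. The tie-break order used here (raw code point order) is an arbitrary choice for the sketch.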
8. 7.3.1 Grammars For Resolving Ambiguities; 7.3.2 Compiling ELAG grammars; 7.3.3 Resolving Ambiguities; 7.3.4 Grammar collections; 7.3.5 Window For ELAG Processing; 7.3.6 Description of the tag sets; 7.3.7 Grammar Optimization; 7.4 Manipulation of text automata; 7.4.1 Displaying sentence automata; 7.4.2 Modifying the text automaton; 7.4.3 Display configuration; 7.5 Converting the text automaton into linear text; 8 Lexicon-grammar; 8.1 …; 8.2 Conversion of a table into graphs; 8.2.1 Principle of parameterized graphs; 8.2.2 Format of the table; 8.2.3 Parameterized graphs; 8.2.4 Automatic generation of graphs; 9 Text alignment; 9.1 Loading texts; 9.2 Aligning texts; 9.3 …; 10 Compound word inflection; 10.1 Multi-Word Units; 10.1.1 Formal Description of the Inflectional Behavior of Multi-word Units; 10.1.2 Lexicalized vs Grammar Based Approach to Morph
9. egory. For example, the codes 1, 2 and 3, which indicate the person of the entry, are relevant for pronouns but not for adjectives. Each line describes an inflectional attribute (gender, tense, etc.) and is made up of the attribute name, followed by the character = and the values which it can take. For example, the following line declares an attribute pers that can take the values 1, 2 or 3: pers = 1 2 3. cat: this part declares the syntactic and semantic attributes which can be assigned to the entries belonging to the part of speech concerned. Each line describes an attribute and the values which it can take. The codes declared for the same attribute must be exclusive. In other words, an entry cannot take more than one value for the same attribute. On the other hand, the tags of a given part of speech do not necessarily take values for all the attributes of that part of speech. For example, to define the attribute niveau_de_langue, which can take the values z1, z2 and z3, the following line can be written: niveau_de_langue = z1 z2 z3, but this attribute is not necessarily present in all words. discr: this part consists of the declaration of a single attribute. The syntax is the same as in the cat part, and the attribute described here must not be repeated there. This part allows the grammatical category to be divided into discriminating subcategories in which the entries have similar inflectional attributes. For pronouns, for example, a perso
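As a rough illustration of the declaration lines discussed in this excerpt, the following Python sketch parses one attribute line and checks the exclusivity rule (at most one value per attribute, possibly none). It assumes that the separator between the attribute name and its values is the character =, which the extracted text does not show; the function names `parse_attribute` and `attribute_values` are hypothetical and not part of Unitex.

```python
def parse_attribute(line):
    """Split a declaration such as 'pers = 1 2 3' into (name, list of values).

    Assumption: the separator is '=' (lost in the extracted source text).
    """
    name, sep, values = line.partition("=")
    if not sep or not name.strip():
        raise ValueError(f"not an attribute declaration: {line!r}")
    return name.strip(), values.split()


def attribute_values(tag_codes, declaration):
    """Return the codes of a tag that belong to the declared attribute.

    An empty result is allowed (the attribute may be absent);
    more than one value breaks the exclusivity rule and raises ValueError.
    """
    name, allowed = declaration
    chosen = [code for code in tag_codes if code in allowed]
    if len(chosen) > 1:
        raise ValueError(f"attribute {name!r} has several values: {chosen}")
    return chosen


if __name__ == "__main__":
    level = parse_attribute("niveau_de_langue = z1 z2 z3")
    print(attribute_values(["Hum", "z1"], level))
```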
10. Figure 2.4: Saving in Unicode with OpenOffice Writer. By default, the encoding proposed on a PC is always Unicode Little Endian. The texts thus obtained no longer contain any formatting information (fonts, colors, etc.) and are ready to be used with Unitex. [Chapter 2: Loading a text] 2.3 Editing text files. You can also use the text editor integrated into Unitex, accessible via the Open command in the File Edition menu. This editor offers search and replace functionality for the texts and dictionaries handled by Unitex. To use it, click on the Find icon. You will then see a window divided into three parts. The Find part corresponds to the usual search operations. If you open a text split into sentences, you can base your search on sentence numbers in the Find Sentence part. Lastly, the Search Dictionary part, visible in figure 2.5, enables you to carry out operations concerning the electronic dictionaries. In particular, you can restrict a search to inflected forms, lemmas, grammatical and semantic codes, and/or inflectional codes. Thus, if you want to search for all the verbs which have the semantic feature t, which indicates transitivity, you just have to search for t after selecting Grammatical code. You will get the matching entries, without confusion with all the other occurrences of the letter t.
11. GHT TO LEFT=false [Chapter 12: File formats] PACKAGE NODES COLOR=-2302976; CONTEXT NODES COLOR=-16711936; CHAR BY CHAR=false; ANTIALIASING=false; HTML VIEWER=; MAX TEXT FILE SIZE=2097152; ICON BAR POSITION=West; PACKAGE PATH=D:\repository; MORPHOLOGICAL DICTIONARY=D:\MyUnitex\English\Dela\zz.bin; MORPHOLOGICAL NODES COLOR=-3911728; MORPHOLOGICAL USE OF SPACE=false. The first two lines are comment lines. The following three lines indicate the name, the style and the size of the font used to display texts, dictionaries, lexical units, sentences in text automata, etc. The CONCORDANCE FONT NAME and CONCORDANCE FONT HTML SIZE parameters define the name and the size of the font to be used when displaying concordances in HTML. The size of the font has a value between 1 and 7. The INPUT FONT and OUTPUT FONT parameters define the name, the style and the size of the fonts used for displaying the paths and the transducer outputs of the graphs. The following 10 parameters correspond to the parameters given in the headers of the graphs. Table 12.3 describes the correspondences. [Table 12.3: parameter in the Config file -> parameter in the .grf file] DATE -> DDATE; FILE NAME -> DFILE; PATH NAME -> DDIR; FRAME -> DFRAME; RIGHT TO LEFT -> DRIG; BACKGROUND COLOR -> BCOLOR; FOREGROUND COLOR -> FCOLOR; AUXILIARY NODES COLOR
12. In order to call a sub-graph, its name is inserted into a box and preceded by the character :. If you enter the text alpha+:beta+gamma+:E:\greek\delta.grf into a box, you get a box similar to the one in figure 5.7. [Chapter 5: Local grammars] [Figure 5.7: Graph that calls sub-graphs beta and delta] You can indicate the full name of the graph (E:\greek\delta.grf) or simply the base name without the path (beta); in this case, the sub-graph is expected to be in the same directory as the graph that references it. References to absolute path names should as a rule be avoided, since such calls are not portable. If you use such an absolute path name, the graph compiler will emit a warning (see figure 5.8). [Figure 5.8: Warning about a non-portable graph name. The console shows the compilation of the graphs alpha, beta and E:\greek\delta, the recursion detection steps, the message "Compilation has succeeded" and the warning "Absolute path name detected (Windows): E:\greek\delta.grf. Absolute path names are not portable".] For portability, you should not use \ or / as separator in graph path names. Use : instead, which is understood as a system-independent separator. In figure 5.8 and are i
13. Johnson DET (see Figure 5.11). [Figure 5.10: Graph repository example] [Figure 5.11: Call to a graph located in the repository] TRICK: If you want to avoid long path names like Det Johnson DET, you can create a graph named DET and put it in the repository root (here D:\repository\DET.grf). In this graph, just put a call to Det Johnson DET. Then you can just call DET in your own graphs. This has two advantages: 1) you do not have long path names; 2) you can modify the graphs in your repository with no constraints on your own graphs, because the only graph that will have to be modified is the one located at the repository root. Calls to sub-graphs are represented in the boxes by grey lines, or brown lines in the case of graphs located in the repository. On Windows, you can open a sub-graph by clicking on the grey line while pressing the Alt key. On Linux, the combination <Alt+Click> is intercepted by the system. In order to open a sub-graph, click on its name while pressing the left and right mouse buttons simultaneously. 5.2.3 Manipulating boxes. You can select several boxes using the mouse. In order to do so, click and drag the mouse without releasing the button. When you release the button, all boxes touched by the selection rectangle will be selected and displayed in white on a blue background, as shown in Figure 5.12. When boxes are selec
14. [Figure 7.5: Automaton that has been normalized with the grammar of figure 7.4] paths for the sequence 1 because of the keep best paths heuristic (see section 7.2.4). The normalization at the time of the construction of the automaton allows you to add paths to the automaton, but not to remove any. Paths will be partially removed by the keep best paths heuristic, if enabled. To go further, you will need to use the ELAG disambiguation functionality. 7.2.3 Normalization of clitic pronouns in Portuguese. In Portuguese, verbs in the future tense and in the conditional can be modified by the insertion of one or two clitic pronouns between the root and the suffix of the verb. For example, the sequence dir-me-ão (they will tell me) corresponds to the complete verbal form dirão, associated with the pronoun me. In order to be able to manipulate this rewritten form, it is necessary to introduce it into the text automaton in parallel to the original form. Thus, the user can search for one or the other form. Figures 7.6 and 7.7 show the automaton of a sentence after normalization of the clitics. [Figure residue: text automaton window, 3543 sentences, showing a Portuguese sentence beginning "Os benfeitores"]
15. Shortest matches: give precedence to the shortest matches. Longest matches: give precedence to the longest sequences; this is the default mode. All matches: give out all recognized sequences. The Search limitation field allows you to limit the search to a certain number of occurrences. By default, the search is limited to the first 200 occurrences. The Grammar outputs field concerns transducers. The Merge with input text mode inserts the output sequences into the input sequences. The Replace recognized sequences mode replaces the recognized sequences with the produced sequences. The third mode ignores all outputs; it is used by default. After you have selected the parameters, click on SEARCH to start the search. 6.8.2 Concordance. The result of a search is an index file that contains the positions of all encountered occurrences. The window of Figure 6.48 lets you choose whether to construct a concordance or modify the text. [Chapter 6: Advanced use of graphs] [Figure 6.47: Locate pattern window] In order
16. 34 B 65 35 13 invoke three major points: (1) they are composed of two or more words; (2) they show some degree of morphological, distributional or semantic non-compositionality; (3) they have unique and constant references. However, the basic notions (a word, a reference, non-compositionality) and measures (degree of non-compositionality) used in those definitions are themselves controversial. Pragmatically, we consider a MWU as a contiguous sequence of graphical units which, for some application-dependent reasons, has to be listed, described (morphologically, syntactically, semantically, etc.) and processed as a unit. 10.1.1 Formal Description of the Inflectional Behavior of Multi-word Units. The main issue in MULTIFLEX is the inflectional morphology of MWUs. This phenomenon has been linguistically analyzed for English, Polish and French in [63]. Obviously, a reliable inflection processing of single words is a necessary condition for the inflection processing of MWUs. However, this condition is rarely a sufficient one. For example, in order to obtain the plural forms of battle cry, battle royal and battle of nerves in English, we need to know not only how to generate the plurals of battle, royal and cry, but also how the different inflected forms of these constituents combine: battle cries; battle royals or battles royal; battles of nerves, but not bat
17. As described in section 3.1.2, a line in a DELAS consists of a canonical form and a sequence of grammatical or semantic codes: aviatrix,N4+Hum; matrix,N4+Math; radix,N4. The first code is used to determine the grammatical code of the entry as well as the name of the grammar used to inflect the canonical form. There are two possible forms: V32 (grammar name V32.fst2, grammatical code V, i.e. the longest letter prefix) and N(NC_XXX) (grammar name NC_XXX.fst2, grammatical code N). These inflectional grammars are compiled automatically if needed. In the example above, all entries will be inflected by a grammar named N4. In order to inflect a dictionary, click on Inflect in the DELA menu. The window in figure 3.5 allows you to specify the directory in which the inflectional grammars are found. By default, the subdirectory Inflection of the directory for the current language is used. [3.4 Automatic inflection] [Figure 3.5: Configuration of automatic inflection] [Figure 3.6: Inflectional grammar N4] Figure 3.6 shows an example of an inflectional grammar. The paths describe the suffixes to add or to remove to get from a canonical form to an inflected form, and the outputs (text in bold under the boxes) are the inflectional codes to add to a dictionary entry. In our example, two paths are
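The rule that maps the first code of a DELAS entry to a grammar name and a grammatical code can be sketched as follows. This is an illustration of the two forms named in the excerpt, not Unitex code. It assumes the DELAS line syntax `lemma,CODE+CODE...` and the parenthesised form `N(NC_XXX)`, whose punctuation is missing from the extracted text; the helper names `first_code` and `grammar_and_category` are hypothetical.

```python
import re


def first_code(delas_line):
    """Return the first code of a DELAS line such as 'aviatrix,N4+Hum'."""
    _, _, codes = delas_line.partition(",")
    return codes.split("+")[0]


def grammar_and_category(code):
    """Map a first code to (inflection grammar file name, grammatical code).

    'V32'       -> ('V32.fst2', 'V')       longest letter prefix
    'N(NC_XXX)' -> ('NC_XXX.fst2', 'N')    assumed parenthesised form
    """
    explicit = re.fullmatch(r"([^()]+)\(([^()]+)\)", code)
    if explicit:
        return explicit.group(2) + ".fst2", explicit.group(1)
    prefix = re.match(r"[^\W\d_]+", code)  # leading run of letters
    if prefix is None:
        raise ValueError(f"no grammatical code in {code!r}")
    return code + ".fst2", prefix.group(0)


if __name__ == "__main__":
    for line in ("aviatrix,N4+Hum", "matrix,N4+Math", "radix,N4"):
        print(line, grammar_and_category(first_code(line)))
```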
18. [Figure 2.5: Searching an electronic dictionary for the semantic feature t] 2.4 Opening a text. Unitex deals with two types of text files. The files with the extension .snt are text files preprocessed by Unitex which are ready to be manipulated by the different system functions. The files ending with .txt are raw files. To use a text, open the .txt file by clicking on Open in the Text menu. Choose the file type Raw Unicode Texts and select your text. [Figure 2.6: Text menu (Unitex 2.0, current language English)] 2.5 Preprocessing a text. After a text is selected, Unitex offers to preprocess it. Text preprocessing consists of performing the following operations: normalization of separators, splitting into sentences, normalization of non-ambiguous f
19. [Figure 4.4: Locate pattern window] [Chapter 4: Searching with regular expressions] The Locate pattern in the form of box allows you to select regular expression or grammar. Click on Regular expression. The Index box allows you to select the recognition mode. Shortest matches: prefers shortest matches in case of nested sequences. For instance, if your grammar can recognize the sequences a very hot chili and very hot, the first one will be discarded. Longest matches: prefers longest matches (a very hot chili in our example). This is the default. All matches: outputs all recognized sequences. The Search limitation box is used to limit the number of results to a certain number of occurrences. By default, the search is limited to the first 200 occurrences. The options of the Grammar outputs box do not concern regular expressions. They are described in section 6.8. Enter an expression and click on Search in order to start the search. Unitex will transform the expression into a grammar in the .grf format. This grammar will then be compiled into a grammar in the .fst2 format that will be used for the search. 4.8.2 Presentation of the results. When the search is finished, the window of figure 4.5 appears, showing the n
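The three recognition modes can be illustrated with a small Python sketch over match positions. It is a simplification under stated assumptions, not the Locate algorithm: matches are given as half-open (start, end) token spans, and only strictly nested matches are filtered, as in the a very hot chili / very hot example. The name `select_matches` is hypothetical.

```python
def select_matches(matches, mode="longest"):
    """Filter (start, end) spans according to a recognition mode.

    'shortest' drops every span that contains another span,
    'longest'  drops every span nested inside another span,
    'all'      keeps everything.
    """
    def contains(outer, inner):
        return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]

    if mode == "all":
        return sorted(matches)
    if mode == "longest":
        return sorted(m for m in matches
                      if not any(contains(other, m) for other in matches))
    if mode == "shortest":
        return sorted(m for m in matches
                      if not any(contains(m, other) for other in matches))
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Token spans in "a very hot chili": the whole phrase and "very hot".
    spans = [(0, 4), (1, 3)]
    for mode in ("shortest", "longest", "all"):
        print(mode, select_matches(spans, mode))
```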
20. That is particularly true for some grammatical words, when their subcategories carry almost as much information as the lemmas themselves. In any case, it is recommended to specify syntactic, semantic and inflectional features as fully as possible. For example, with the dictionaries provided for French, it is preferable to replace symbols like <je.PRO:1s>, <je.PRO+PpvIL:1s> and <je.PRO> with the symbol <PRO+PpvIL:1s>. Indeed, all these symbols are identical insofar as they can recognize only the single entry of the dictionary je,PRO+PpvIL:1ms:1fs. However, as the program does not deduce this information automatically, if all these features are not specified, the program will consider nonexistent labels such as <je.PRO:3p>, <je.PRO+PronQ>, etc. in vain. 7.4 Manipulation of text automata. 7.4.1 Displaying sentence automata. As we have seen above, the text automaton is in fact the collection of the sentence automata of a text. This structure can be represented using the .fst2 format, also used for representing the compiled grammars. This format does not allow the system to directly display the sentence automata. Instead, the system uses the Fst2Grf program to convert the sentence automaton into a graph that can be displayed. This program is called automatically when you select a sentence, in order to generate the corresponding .grf file. The generated gr
21. This program compiles a grammar into a .fst2 file (for more details, see section 6.2). The parameter graph denotes the complete path of the main graph of the grammar, without omitting the extension .grf. OPTIONS: -y/--loop_check: enables error checking (loop detection); -n/--no_loop_check: disables error checking (default); -a ALPH/--alphabet=ALPH: specifies the alphabet file to be used for tokenizing the content of the grammar boxes into lexical units; -c/--char_by_char: tokenization will be done character by character. If neither the -c nor the -a option is used, lexical units will be sequences of any Unicode letters; -d DIR/--pkgdir=DIR: specifies the repository directory to use (see section 5.2.2, page 76); -e/--no_empty_graph_warning: no warning will be emitted when a graph matches the empty word. This option is used by MultiFlex in order not to scare users with meaningless error messages when they design an inflection grammar that matches the empty word. The result is a file with the same name as the graph passed to the program as a parameter, but with the extension .fst2. This file is saved in the same folder as graph. 11.18 ImplodeFst2. ImplodeFst2 [OPTIONS] <txtauto>. This program computes and stores in OUT the compact form of the text automaton <txtauto>. OPTIONS: -o OUT/--output=OUT: output file. By default, OUT is made of <txtauto> with imp before the extension, like foo imp fst2. 11 1
22. [Figure 10.34: Inflection graph NC_ImePrezime for Serbian MWUs, with paths for feminine names (surname + first name, e.g. Jovanovic Katarina) and masculine names (first name + surname, e.g. Ljuba Popovic; surname + first name, e.g. Popovic Ljuba)] [Figure 10.35: Inflection graph AC_A3XN2 for Serbian MWUs (gladan kao vuk)] Chapter 11: Use of external programs. This chapter presents the use of the different programs of which Unitex is composed. These programs, which can be found in the Unitex/App folder, are automatically called by the interface. It is possible to see the commands that have been executed by clicking on Info > Console. It is also possible to see the options of the different programs on Info > Help on commands. WARNING: many programs use the text directory (my_text_snt). This directory is created by the graphical interface after the normalization of the text. If you work with the command line, you have to create the directory manually before the execution of the program Norma
23. inflection codes that are associated to the entry. The mode of compression of the canonical form varies as a function of the inflected form. If the two forms are identical, the compressed form contains only the grammatical, semantic and inflectional information, as in N+Hum:ms. If the forms are different, the compression program cuts up the two forms into units. These units can be a space, a hyphen, or a sequence of characters that contains neither a space nor a hyphen. This way of cutting up units allows the program to efficiently take into account the inflected forms of the compound words. If the inflected and the canonical form do not have the same number of units, the program encodes the canonical form by the number of characters to be removed from the inflected form, followed by the characters to append. For instance, the line below is a line in the initial dictionary: James Bond,007.N. Since the sequence James Bond contains three units and 007 only one, the canonical form is encoded with _10\0\0\7. The _ character indicates that the two forms do not have the same number of units. The following number (here 10) indicates the number of characters to be removed. The sequence \0\0\7 indicates that the sequence 007 should be appended. The digits are preceded by the character \ so that they will not be confused with the number of characters to be removed. Whenever the two forms have the same number of units, the units are compressed two by two. Each
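The case where the inflected and canonical forms have different numbers of units can be sketched in Python as below. This is a hypothetical illustration of the rule as stated in the excerpt, not the Unitex compression program: it assumes that the escape character for digits is the backslash (lost in the extracted text), and it deliberately leaves the same-number-of-units case unimplemented because the excerpt is cut off before describing it. The names `split_units` and `compress_canonical` are invented for this sketch.

```python
import re

# A unit is a space, a hyphen, or a run of characters without space or hyphen.
UNIT = re.compile(r"[ -]|[^ -]+")


def split_units(form):
    """Cut a form into units as described above."""
    return UNIT.findall(form)


def compress_canonical(inflected, canonical):
    """Encode the canonical form relative to the inflected form.

    Only the two cases fully described in the excerpt are handled.
    """
    if inflected == canonical:
        return ""  # nothing to encode: only the codes are stored
    if len(split_units(inflected)) != len(split_units(canonical)):
        # Remove the whole inflected form, then append the canonical form,
        # escaping digits (assumed escape character: backslash).
        escaped = "".join("\\" + ch if ch.isdigit() else ch for ch in canonical)
        return "_" + str(len(inflected)) + escaped
    raise NotImplementedError("unit-by-unit compression is not covered here")


if __name__ == "__main__":
    print(split_units("James Bond"))
    print(compress_canonical("James Bond", "007"))
```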
24. <A-z3> thus recognizes all the adjectives that do not have the code z3 (cf. table 3.2). If you want to refer to a code containing the character -, you have to escape this character by preceding it with a \. Thus, the mask <N+faux\-ami> could recognize all entries of the dictionaries containing the codes N and faux-ami. The order in which the codes appear in the mask is not important. The three following patterns are equivalent: <N-Hum+z1>, <z1+N-Hum>, <-Hum+z1+N>. NOTE: it is not possible to use a lexical mask that only has prohibited codes. <-N> and <-A-z1> are thus incorrect masks. However, you can express such constraints using contexts (see section 6.3). 4.3.4 Inflectional constraints. It is also possible to specify constraints about the inflectional codes. These constraints have to be preceded by at least one grammatical or semantic code. They are represented like the inflectional codes present in the dictionaries. Here are some examples of lexical masks using inflectional constraints: <A:m> recognizes a masculine adjective; <A:mp:f> recognizes a masculine plural or a feminine adjective; <V:2:3> recognizes a verb in the 2nd or 3rd person, which excludes all tenses that have neither a 2nd nor a 3rd person (infinitive, past participle and present participle) as well as the forms that are conjugated in the first person. In
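The way inflectional constraints select entries can be sketched as a set inclusion test. This is an illustration only: it assumes that each constraint after a colon is a set of one-character features, and that a mask matches when at least one inflectional code of the entry contains all the characters of at least one constraint. The verb codes P2s, P1s and W used in the example are illustrative values, and the function names are hypothetical.

```python
def parse_inflectional_constraints(mask):
    """Split a mask such as '<A:mp:f>' into ('A', [{'m', 'p'}, {'f'}])."""
    body = mask.strip("<>")
    parts = body.split(":")
    return parts[0], [set(part) for part in parts[1:]]


def satisfies(entry_codes, constraints):
    """True if some inflectional code of the entry covers some constraint."""
    if not constraints:
        return True  # no inflectional constraint in the mask
    return any(constraint <= set(code)
               for constraint in constraints
               for code in entry_codes)


if __name__ == "__main__":
    _, adjective = parse_inflectional_constraints("<A:mp:f>")
    print(satisfies(["fs"], adjective))  # feminine singular entry
    print(satisfies(["ms"], adjective))  # masculine singular entry
```

Under these assumptions, <V:2:3> accepts an entry carrying a code such as P2s and rejects entries that only carry codes without a 2 or a 3, which matches the behaviour described for infinitives and first person forms.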
25. • -s N/--size=N: font size to use in output HTML page.

11.5 Convert

Convert [OPTIONS] <text_1> [<text_2> <text_3> ...]

With this program you can transcode text files.

OPTIONS:

• -s X/--src=X: input encoding;
• -d X/--dest=X: output encoding (default=LITTLE-ENDIAN).

Output options:

• -r/--replace: input files are overwritten (default);
• --ps=PFX: input files are renamed with the PFX prefix (toto.txt ⇒ PFXtoto.txt);
• --pd=PFX: output files are renamed with the PFX prefix;
• --ss=SFX: input files are named with the SFX suffix (toto.txt ⇒ totoSFX.txt);
• --sd=SFX: output files are named with the SFX suffix.

HTML options:

Convert offers some special options dedicated to HTML files. You can use a combination of the following options:

• --dnc (Decode Normal Chars): things like &eacute;, &#120; and &#xF8; will be decoded as the single equivalent Unicode character, except if it represents an HTML control character;
• --dcc (Decode Control Chars): &lt;, &gt;, &amp; and &quot; will be decoded as <, >, & and the quote (the same for their decimal and hexadecimal representations);
• --eac (Encode All Chars): every character that is not supported by the output encoding will be encoded as a string like &#457;
• --ecc (Encode Control Chars): <, >, & and the quote will be encoded by &lt;, &gt;, &amp; and &quot;.

All HTML op
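The behaviour described for Decode Normal Chars can be illustrated with the Python standard library. This is a sketch of the idea only, not the code used by Convert: the function name is hypothetical, and it relies on `html.unescape` to resolve named, decimal and hexadecimal references while leaving untouched any reference that stands for one of the four HTML control characters.

```python
import html
import re

# Characters that the text treats as HTML control characters.
CONTROL_CHARS = {"<", ">", "&", '"'}

# Named, decimal and hexadecimal character references.
REFERENCE = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def decode_normal_chars(text):
    """Decode character references unless they denote a control character."""
    def replace(match):
        reference = match.group(0)
        decoded = html.unescape(reference)
        # Unknown references come back unchanged; control characters are
        # deliberately left encoded.
        if decoded == reference or decoded in CONTROL_CHARS:
            return reference
        return decoded
    return REFERENCE.sub(replace, text)


print(decode_normal_chars("caf&eacute; &#120; &#xF8; &lt;b&gt; &#60;"))
```

Keeping control characters encoded preserves the markup of the page, which is why decoding them is a separate option.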
26. 11.30 Tokenize

Tokenize [OPTIONS] <txt>

This program tokenizes a text into lexical units. <txt> is the complete path of the text file, without omitting the .snt extension.

OPTIONS:

• -a ALPH/--alphabet=ALPH: alphabet file;
• -c/--char_by_char: indicates whether the program is applied character by character, with the exceptions of the sentence delimiter {S}, the stop marker {STOP} and lexical tags like {today,.ADV}, which are considered to be single units;
• -w/--word_by_word: with this option, the program considers a unit to be either a sequence of letters (the letters are defined by file alphabet), or a character which is not a letter, or the sentence separator {S}, or a lexical label like {aujourd'hui,.ADV}. This is the default mode.

The program codes each unit as a whole number. The list of units is saved in a text file called tokens.txt. The sequence of codes representing the units then allows the coding of the text. This sequence is saved in a binary file named text.cod. The program also produces the following four files:

• tok_by_freq.txt: text file containing the units sorted by frequency;
• tok_by_alph.txt: text file containing the units sorted alphabetically;
• stats.n: text file containing information on the number of sentence separators, the number of units, the number of simple words and the number of numbers;
• enter.pos: binary file containing the list of newline positions in the text.

The coded repr
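The word-by-word mode and the coding of units described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Tokenize program: the function names are hypothetical, letters are taken to be whatever Python considers a Unicode letter instead of being read from an alphabet file, and anything written between braces (for example {S} or {STOP}) is treated as a single unit.

```python
import re
from collections import Counter

# A unit is a tag in braces, a run of letters, or any single other character.
UNIT = re.compile(r"\{[^{}]*\}|[^\W\d_]+|.", re.DOTALL)


def tokenize(text):
    """Return the list of distinct units and the text coded as unit indices."""
    vocabulary, index, codes = [], {}, []
    for unit in UNIT.findall(text):
        if unit not in index:
            index[unit] = len(vocabulary)
            vocabulary.append(unit)
        codes.append(index[unit])
    return vocabulary, codes


def by_frequency(vocabulary, codes):
    """List (unit, count) pairs, most frequent first."""
    counts = Counter(codes)
    return sorted(((vocabulary[code], n) for code, n in counts.items()),
                  key=lambda pair: (-pair[1], pair[0]))


vocabulary, codes = tokenize("a b a")
print(vocabulary, codes)
print(by_frequency(vocabulary, codes))
```

Here `vocabulary` plays the role of the list of units and `codes` the role of the coded text; the real files have their own formats, which this sketch does not reproduce.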
27. Figure 11.1: Graph with a cycle

11.15 Fst2Txt

Fst2Txt [OPTIONS] <fst2>

This program applies a transducer to a text at the preprocessing stage, when the text has not been cut into lexical units yet.

OPTIONS:

• -t TXT/--text=TXT: the text file to be modified, with extension .snt;
• -a ALPH/--alphabet=ALPH: the alphabet file of the language of the text;
• -s/--start_on_space: this parameter indicates that the search will start at any position in the text, even before a space. This parameter should only be used to carry out morphological searches;
• -x/--dont_start_on_space: forbids the program to match expressions that start with a space (default);
• -c/--char_by_char: works in character by character tokenization mode. This is useful for languages like Thai;
• -w/--word_by_word: works in word by word tokenization mode (default).

Output options:

• -M/--merge: merge transducer outputs with text inputs (default);
• -R/--replace: replace text inputs with corresponding transducer outputs.

This program modifies the input text file.

11.16 Fst2Unambig

Fst2Unambig [OPTIONS] <fst2>

This program takes a .fst2 text automaton and produces an equivalent text file if the automaton is linear (i.e. with no ambiguity). See section 7.5, page 149.

OPTIONS:

• -o TXT/--out=TXT: the output text file.

11.17 Grf2Fst2

Grf2Fst2 [OPTIONS] graph
28. Figure 5.26: Graph with reading direction set to right to left

Representation

The preferences configuration window has an extra option concerning antialiasing (see figure 5.27). This option activates antialiasing by default for all graphs in the current language. It is advised not to activate this option if your machine is old and slow. You can also change the position of the icon bar.

NOTE: the Right to Left option is not present on the general graph configuration frame. The orientation of graphs is set by default for the current language, as defined in the Text Presentation tab (see Figure 4.7, page 69).

5.4 Exporting graphs

5.4.1 Inserting a graph into a document

In order to include a graph into a document, you have to convert it to an image. To do this, save your graph as a PNG image. Click on Save as... in the FSGraph menu and select the PNG file format. You will get an image ready to be inserted into a document or to be edited with an image editor. You should activate antialiasing for the graph that interests you (this is not obligatory but results in a better image quality).

Another solution consists of making a screenshot.

On Windows:

Press Print Screen on your keyboard. This key should be next to the F12 key. Start the Paint program in the Windows Utilities menu. Press <Ctrl+V>. Paint will tell you that
29. Figure 9.3: Text alignment frame

9.2 Aligning texts

Once you have loaded your texts, you can align them by clicking on the Align button. You will be asked to provide the name of the XML file that will contain all the information about the alignment. Then Unitex launches the XAlign program, and the alignment is displayed in the form of red links between aligned sentences, as shown on Figure 9.4.

You can edit the alignment links with the mouse. Clicking on a link removes it. To add a link (or remove it if it already exists), click on one sentence in the text you want (source or destination), and then move your mouse over the corresponding sentence in the other text. The link about to be added will appear in yellow, as shown on Figure 9.5. When you click, the link is actually added and becomes red.

When you have made all your corrections, you can save your modified alignment using the Save alignment and Save alignment as... buttons.

An interesting feature of XAlign is that it is reentrant
30. Figure 7.13: Result of applying the grammar in figure 7.12

So if one considers the sentence of figure 7.15, beginning with Est-il, one can see that all non-verb interpretations of Est were removed.

7.3.2 Compiling ELAG Grammars

Before an ELAG grammar can be applied to a text automaton, the grammar must be compiled into a .rul file. This operation is carried out via the Elag Rules command in the Text menu, which opens the window shown in figure 7.16.

If the frame on the right already contains grammars which you don't wish to use, you can withdraw them with the << button. Then select your grammar(s) in the file explorer located

Figure 7.14: Use of the synchronization point

Figure 7.15: Result of the application of the grammar in figure 7.14
31. However, the presence of these tags can alter the application of preprocessing graphs. To avoid complications, you can use the Open Tagged Text... command in the Text menu. With it, you can open a tagged text and skip the application of preprocessing graphs, as shown on Figure 2.14.

Figure 2.14: Preprocessing a tagged text

Chapter 3 Dictionaries

3.1 The DELA dictionaries

The electronic dictionaries distributed with Unitex use the DELA syntax (Dictionnaires Electroniques du LADL, LADL electronic dictionaries). This syntax describes the simple and compound lexical entries of a language with their grammatical, semantic and inflectional information. We distinguish two kinds of electronic dictionaries. The one that is used most often is the dictionary of inflected forms DELAF (DELA de formes Fléchies, DELA of inflected forms) or DELACF (DELA de formes Composées Fléchies, DE
32. License

This license can also be found in [33].

GNU LESSER GENERAL PUBLIC LICENSE
Version 2.1, February 1999

Copyright (C) 1991, 1999 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

[This is the first released version of the Lesser GPL. It also counts as the successor of the GNU Library Public License, version 2, hence the version number 2.1.]

Preamble

The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users.

This license, the Lesser General Public License, applies to some specially designated software packages--typically libraries--of the Free Software Foundation and other authors who decide to use it. You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below.

When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can ge
33. Morphological dictionaries
6.4.4 Dictionary Entry Variables
6.5 Exploring grammar paths
6.6 Graph collections
6.7 Rules for applying transducers
6.7.1 Insertion to the left of the matched pattern
6.7.2 Application while advancing through the text
6.7.3 Priority of the leftmost match
6.7.4 Priority of the longest match
6.7.5 Transducer outputs with variables
6.8 Applying graphs to texts
6.8.1 Configuration of the search
6.8.2 Concordance
6.8.3 Modification of the text
6.8.4 Extracting occurrences
6.8.5 Comparing concordances
7 Text automaton
7.1 Displaying text automaton
7.2 Construction
7.2.1 Construction rules for text automata
7.2.2 Normalization of ambiguous forms
7.2.3 Normalization of clitical pronouns in Portuguese
7.2.4 Keeping the best paths
7.3 Resolving Lexical Ambiguities with ELAG
34. N Comp 3vfp istrazxnim sudijama istrazxni sudija NC_AXNF N Comp 3vfp istrazxne sudije istrazxni sudija NC_AXNF N Comp 4vfp istrazxne sudije istrazxni sudija NC_AXNF N Comp 5vfp istrazxnima sudijama istrazxni sudija NC_AXNF N Comp 6vfp istrazxnim sudijama istrazxni sudija NC_AXNF N Comp 6vfp istrazxnima sudijama istrazxni sudija NC_AXNF N Comp 7vfp 10 3 INTEGRATION IN UNITEX is trazxnim sudijama istrazxni sudija NC_AXNF N Comp 7vfp 189 istrazxne sudije istrazxni sudija NC_AXNF N Comp 2vfw istrazxne sudije istrazxni sudija NC_AXNF N Comp 4vfw istrazxnoga sudiju istrazxni sudija NC_AXNF N Comp ms4v istrazxnog sudiju istrazxni sudija NC_AXNF N Comp ms4v istrazxni sudija istrazxni sudija NC_AXNF N Comp 1lvms istrazxnoga sudije istrazxni sudija NC_AXNF N Comp 2vms istrazxnog sudije istrazxni sudija NC_AXNF N Comp 2vms istrazxnomu sudiji istrazxni sudija NC_AXNF N Comp 3vms istrazxnome sudiji istrazxni sudija NC_AXNF N Comp 3vms istrazxnom sudiji istrazxni sudija NC_AXNF N Comp 3vms istrazxnomu sudiji istrazxni sudija NC_AXNF N Comp 7vms istrazxnome sudiji istrazxni sudija NC_AXNF N Comp 7vms istrazxnom sudiji istrazxni sudija NC_AXNF N Comp 7vms istrazxni sudijo istrazxni sudija NC_AXNF N Comp 5vms i
35. <Nb=p> battles royal <Nb=p>

After rewriting these forms into the Unitex DELACF format, we obtain the following entries:

battle royal,battle royal.N:s
battle royals,battle royal.N:p
battles royal,battle royal.N:p

Note that this description is independent of the way we generate inflected forms of single words, because we suppose that this problem is handled by an existing external morphological system for single words. In the Unitex-interfaced version of MULTIFLEX, we would generate the plural of royal due to the fact that its lemma is known as having the inflection code N1, represented on Figure 10.3.

In an inflection paradigm of a MWU, each constituent is accompanied only by those morphological categories which it should inflect for. The categories that remain unchanged don't have to be mentioned. For instance, in bateau-mouche in French (a Paris-style river boat), both noun constituents have their gender set, but they inflect in number (bateaux-mouches). That's why on Figure 10.4, containing the inflection graph for this MWU, the corresponding boxes contain value assignments for number only. Note that both constituents may or may not agree in gender (here bateau is masculine while mouche is feminine).

Figure 10.3: Inflection graph N1 for simple words inflecting like royal
36. P3p white white A

Chapter 8 Lexicon-grammar

The tables of lexicon-grammar are a compact way of representing syntactical properties of the elements of a language. It is possible to automatically construct local grammars from such tables, thanks to a mechanism of parameterized graphs.

In the first part of the chapter, the formalism of tables is presented. The second part describes parameterized graphs and a mechanism for automatically lexicalizing them with lexicon-grammar tables.

8.1 Lexicon-grammar tables

Lexicon-grammar is a methodology developed by Maurice Gross and the LADL team ([9], [10], [36]), based on the following principle: every verb has an almost unique set of syntactical properties. Because of this, these properties need to be systematically described, since it is impossible to predict the exact behavior of a verb. These descriptions are represented by matrices where rows correspond to verbs and columns to syntactical properties. The considered properties are formal properties such as the number and nature of allowed complements of the verb and the different transformations the verb can undergo (passivization, nominalisation, extraposition, etc.). The matrices, or tables, are mostly binary: a + sign occurs at the intersection of a row and a column of a property if the verb has that property, a - sign if not.

More information in http://infolingu.univ-mlv.fr, including some lexicon-grammar tables that you can freely down
37. • its inflection paradigm (called inflection code),
• the inflection features of forms to be generated.

Thus, within the Unitex-MULTIFLEX interface, the description of a single unit is done as follows:

• vive(vif.A54:fs)

where A54 is the inflection code of vif, and fs is the DELA-style description using morphological features appearing in the Equivalences.txt file (cf. section 10.2.1). Knowing that vive is a feminine singular form of vif, we may demand the generation of its plural without having to explicitly indicate the plural of which gender we are interested in: since we only wish to change the number, the gender remains as in the original word vive, i.e. feminine.

10.2.3 Inflection paradigm of a MWU

The morphological description of MWUs in our formalism is inspired by the DELA system, in the sense that:

• each MWU is attributed an inflection code,
• a MWU's inflection code explicitly describes each inflected form of a MWU in terms of actions to be performed on the lemma and inflectional features to be attached to each form.

In the Unitex-interfaced version, MULTIFLEX uses inflection codes represented as Unitex graphs compiled into the .fst2 format. For example, Figure 10.1 contains the inflection graph for battle royal.

Figure 10.1: Inflection graph for battle royal

According to the Unitex convention, three constituents are present in battle royal: battle referred to as $1, a space r
38. Figure 6.48: Configuration for displaying the encountered occurrences

to start the modification of the text. The precedence rules that are applied during these operations are described in section 6.7.

After this operation, the resulting file is a copy of the text in which transducer outputs have been taken into account. Normalization operations and splitting into lexical units are automatically applied to this text file. The existing text dictionaries are not modified. Thus, if you have chosen to modify the current text, the modifications will be effective immediately. You can then start new searches on the text.

WARNING: if you have chosen to apply your graph ignoring the transducer outputs, all occurrences will be erased from the text.

6.8.4 Extracting occurrences

To extract from a text all sentences containing matches, set the name of your output text file using the Set File button
39. 12 File formats
12.1 Unicode Little-Endian encoding
12.3.2 Format .fst2
12.4.2 .snt Files
12.4.3 File text.cod
12.4.4 The tokens.txt file
12.4.5 The tok_by_alph.txt and tok_by_freq.txt files
12.4.6 The enter.pos file
12.5 Text Automaton
12.5.1 The text.fst2 file
12.5.2 The cursentence.grf file
12.5.3 The sentenceN.grf file
12.5.4 The cursentence.txt file
12.6.1 The concord.ind file
12.6.2 The concord.txt file
12.6.3 The concord.html file
12.6.4 The diff.html file
40. abracadabra INTJ

All chars used in forms

2 grammatical/semantic codes used in dictionary: INTJ, and INTJ preceded by a space (warning: 1 suspect char, 1 space)

0 inflectional code used in dictionary

12.9 ELAG files

12.9.1 tagset.def file

See section 7.3.6, page 140.

12.9.2 .lst files

.LST FILES ARE NOT UNICODE FILES.

A .lst file contains a list of .grf file names. These files are supposed to be located in the ELAG directory corresponding to the current working language. Here is the elag.lst file used for French:

PPVs/PpvIL.grf
PPVs/PpvLE.grf
PPVs/PpvLUI.grf
PPVs/PpvPR.grf
PPVs/PpvSeq.grf
PPVs/SE.grf
PPVs/postpos.grf

12.9.3 .elg files

.elg files contain compiled ELAG rules. These files are in the .fst2 format.

12.9.4 .rul files

.RUL FILES ARE NOT UNICODE FILES.

A .rul file contains the different .elg files that compose an ELAG rule set. It contains one part per .elg file. Each part l
41. among these occurrences the one that will be taken into account. Unitex applies the following priority rule for that purpose: the leftmost sequence is used.

If this rule is applied to the three occurrences of the preceding concordance, the occurrence in ancient overlaps with ancient times. The first is retained because it is the leftmost occurrence, and ancient times is eliminated. The following occurrence, times a, is no longer in conflict with ancient times and can therefore appear in the result:

Don there extended in ancient times a large forest

The rule of priority of the leftmost match is applied only when the text is modified, be it during preprocessing or after the application of a syntactic graph (cf. section 6.8.3).

6.7.4 Priority of the longest match

During the application of a syntactic graph, it is possible to choose whether priority should be given to the shortest or the longest sequences, or whether all sequences should be retained. During preprocessing, priority is always given to the longest sequences.

6.7.5 Transducer outputs with variables

As we have seen in section 5.2.5, it is possible to use variables to store some text that has been analyzed by a grammar. These variables can be used in preprocessing graphs and in syntactic graphs.

You have to give names to the variables you use. These names can contain non-accentuated lower-case and upper-case letters between A and Z, digits and the character _
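The priority of the leftmost match can be sketched as a single pass over the occurrences sorted by start position. This is an illustration only: the function name is hypothetical, occurrences are represented as (start, end) pairs of token positions with an exclusive end, and breaking ties between occurrences that start at the same position in favour of the longest one is an assumption of this sketch.

```python
def keep_leftmost(matches):
    """Keep non-overlapping occurrences, giving priority to the leftmost."""
    kept = []
    next_free = 0  # first token position not covered by a kept occurrence
    # Sort by start position; at equal start, try the longest occurrence first.
    for start, end in sorted(matches, key=lambda m: (m[0], -m[1])):
        if start >= next_free:
            kept.append((start, end))
            next_free = end
    return kept


# Tokens: 0 = in, 1 = ancient, 2 = times, 3 = a
occurrences = [(1, 3),   # ancient times
               (0, 2),   # in ancient
               (2, 4)]   # times a
print(keep_leftmost(occurrences))
```

With the three occurrences of the example, "in ancient" is kept, "ancient times" is dropped because it overlaps the kept occurrence, and "times a" becomes available again.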
42. be the current position in the text at this time. Now, the Locate program tries to match the expression described inside the right context. If it fails, then there will be no match. If it matches the whole right context (that is to say, if Locate reaches the right context end), then the program will rewind to the position pos and go on exploring the grammar after the right context end.

You can also define negative right contexts, using $![ to indicate the right context start. Figure 6.13 shows a graph that matches numbers that are not followed by th. The difference with positive right contexts is that when Locate tries to match the expression described inside the context, reaching the context end will be considered as a failure, because it would have matched a forbidden sequence. Conversely, if the context end cannot be reached, then Locate will rewind to the position pos and go on exploring the grammar after the context end.

Right contexts can appear anywhere in the graph, including the beginning of the graph. Figure 6.14 shows a graph that matches an adjective in the right context of something that is not a past participle. In other words, this graph matches adjectives that are not ambiguous with past participles.

Figure 6.13: Using a negative right context

Figure 6.14: Matching an adjective that is not ambiguous with a past participle

This mechanism allows you to formulate complex patterns. For insta
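Readers familiar with regular expression libraries may find it helpful to compare right contexts with lookahead assertions. The following Python sketch is an analogy, not Unitex syntax: a positive lookahead checks what follows and then rewinds, so the checked text is not part of the match, and a negative lookahead succeeds only when the forbidden sequence cannot be matched. The currency symbols and sample sentences are assumptions chosen for the illustration.

```python
import re

# Positive right context: a number followed by a currency symbol.
# The symbol is checked but is not part of the matched sequence.
NUMBER_BEFORE_CURRENCY = re.compile(r"\d+(?= ?[$€])")

# Negative right context: a number that is not followed by "th".
# The extra \d alternative prevents matching only the first digits of "10th".
NUMBER_NOT_BEFORE_TH = re.compile(r"\d+(?!\d|th)")

print(NUMBER_BEFORE_CURRENCY.findall("pay 20 $ or 30 apples"))
print(NUMBER_NOT_BEFORE_TH.findall("the 5th and 7 dwarfs, 10th"))
```

In both cases the engine returns to the position reached before the context was tested, which corresponds to the rewind to position pos described above.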
43. Figure 10.31: Inflection graph NC_AXN3 for Serbian MWUs

Figure 10.32: Inflection graph NC_N3XN for Serbian MWUs

Figure 10.33: Inflection graph NC_AXNF for Serbian MWUs
44. compile the graph Det of Figure 6.10. When you start a pattern search with a .grf graph, if Unitex detects an error at the graph compilation, the locate operation is automatically interrupted.

6.3 Contexts

Unitex graphs as we have described them up to now are equivalent to algebraic grammars. These are also known as context-free grammars, because if you want to match a sequence A, the context of A is irrelevant. Thus, you cannot use a context-free graph for matching occurrences of president not followed by of the republic.

However, you can draw graphs with positive or negative contexts. In that case, graphs are no longer equivalent to algebraic grammars, but to context-sensitive grammars, which do not have the same theoretical properties.

6.3.1 Right contexts

To define a right context, you must bound a zone of the graph with boxes containing $[ and $], which indicate the start and the end of the right context. These bounds appear in the graph as green square brackets. Both bounds of a right context must be located in the same graph.

Figure 6.12: Using a right context

Figure 6.12 shows a simple right context. The graph matches numbers followed by a currency symbol, but this symbol will not appear in matched sequences, i.e. in the concordance.

Right contexts are interpreted as follows. During the application of a grammar on a text, let us assume that a right context start is found. Let pos
45. Figure 9.9: Displaying matched sentences and sentences they are linked to

Chapter 10 Compound word inflection

MULTIFLEX is a multi-lingual Unicode-compatible platform for automatic inflection of multi-word units (MWUs), also known as compound words. It is meant in particular for the creation of morphological dictionaries of MWUs. It implements a unification-based formalism [64] for the description of inflectional behavior of MWUs, which supposes the existence of a module for the inflectional morphology of simple words.

In this chapter, we present the notion of multi-word unit, and we describe the method to inflect them with MULTIFLEX. This chapter is derived from the MULTIFLEX manual written by Agata Savary, the author of MULTIFLEX.

10.1 Multi-Word Units

Multi-word units (MWUs) encompass a range of hard-to-define and controversial linguistic objects (cf. [39], [18]). Their numerous linguistic and pragmatic definitions 5 22 51 4
46. decide if a label must be rejected as an invalid one while loading the text automaton. In fact, optional codes are independent of other codes, such as, for example, the attribute of the language level (z1, z2 or z3).

In the same manner as for inflectional codes, it is possible to deny an inflectional attribute by writing the character ! right before the name of the attribute. Thus, with our example file, the <A!gauche:f> symbol recognizes all adjectives in the feminine which do not have the gauche code.

All codes which are not declared in the tagset.def file are discarded by ELAG. If a dictionary entry contains such a code, ELAG will produce a warning and will withdraw the code from the entry.

Consequently, if two concurrent entries differ in the original text automaton only by undeclared codes, these entries will become indistinguishable by the programs and will thus be unified into only one entry in the resulting automaton.

Thus, the set of labels described in the tagset.def file is compatible with the dictionaries distributed with Unitex, by factorizing words which differ only by undeclared codes, and this independently of the applied grammars.

For example, in the most complete version of the French dictionary, each individual use of a verb is characterized by a reference to the lexicon-grammar table which contains it. We have con
47. describes how to search a text for simple patterns by using regular expressions.

4.1 Definition

The goal of this chapter is not to give an introduction on formal languages, but to show how to use regular expressions in Unitex in order to search for simple patterns. Readers who are interested in a more formal presentation can consult the many works that discuss regular expression patterns.

A regular expression can be:

• a token (book) or a lexical mask (<smoke.V>);
• the concatenation of two regular expressions (he smokes);
• the union of two regular expressions (Pierre+Paul);
• the Kleene star of a regular expression (bye*).

4.2 Tokens

In a regular expression, a token is defined as in 2.5.4 (page 29). Note that the symbols dot, plus, star, less than, opening and closing parentheses and double quotes have a special meaning. It is therefore necessary to precede them with an escape character \ if you want to search for them. Here are some examples of valid tokens:

cat
des
<N:ms>
{S}

By default, Unitex is set up to let lower-case patterns also find upper-case matches. It is possible to enforce case-sensitive matching using quotation marks. Thus, "peter" recognizes only the form peter and not Peter or PETER.

NOTE: in order to make a space obligatory, it needs to be enclosed in quotation marks.

4.3 Lexical masks

A lexical mask is a search query that matches t
48. designed to be displayed in the graphical interface. This section describes these files and some others.

12.11.1 The dlf.n, dlc.n and err.n files

These three files are text files that are stored in the text directory. They contain the number of lines of the dlf, dlc and err files respectively. These numbers are followed by a newline.

12.11.2 The stat_dic.n file

This file is a text file in the directory of the text. It has three lines that contain the number of lines of the dlf, dlc and err files.

12.11.3 The stats.n file

This file is in the text directory and contains a line with the following form:

3949 sentence delimiters, 169394 (9428 diff) tokens, 73788 (9399) simple forms, 438 (10) digits

The numbers indicated are interpreted in the following way:

• sentence delimiters: number of sentence separators ({S});
• tokens: total number of lexical units in the text. The number preceding diff indicates the number of different units;
• simple forms: the total number of lexical units in the text that are composed of letters. The number in parentheses represents the number of different lexical units that are composed of letters;
• digits: the total number of digits used in the text. The number in parentheses indicates the number of different digits used (10 at most).

12.11.4 The concord.n file

The concord.n file is a text file in the directory of the text. It contains information on the latest search of
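As an illustration of the stats.n line described in section 12.11.3 above, the following Python sketch extracts its seven numbers. The function name and the dictionary keys are hypothetical, and the exact punctuation of the line (commas between the four groups, counts of different units in parentheses) is an assumption of this sketch; reading the file itself, including its encoding, is left out.

```python
import re

# Assumed layout of the line; the separators are an assumption of this sketch.
STATS_LINE = re.compile(
    r"(\d+) sentence delimiters, "
    r"(\d+) \((\d+) diff\) tokens, "
    r"(\d+) \((\d+)\) simple forms, "
    r"(\d+) \((\d+)\) digits"
)

KEYS = ("sentence_delimiters", "tokens", "different_tokens",
        "simple_forms", "different_simple_forms",
        "digits", "different_digits")


def parse_stats(line):
    """Return the counts found in a stats.n line as a dictionary of integers."""
    match = STATS_LINE.search(line)
    if match is None:
        raise ValueError("unexpected stats.n line: %r" % line)
    return dict(zip(KEYS, map(int, match.groups())))


stats = parse_stats("3949 sentence delimiters, 169394 (9428 diff) tokens, "
                    "73788 (9399) simple forms, 438 (10) digits")
print(stats["tokens"], stats["different_tokens"])
```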
49. each line is composed by a unit followed by a tab and the number of occurrences of the unit within the text The lines of the tok_by_freq txt file are formed after the same principle but the number of occurrences is placed after the tab and the unit 12 5 TEXT AUTOMATON 223 12 4 6 The enter pos file This file is a binary file containing the list of positions of the newline symbol in the snt file Each position is the index in the text cod file where a newline has been replaced by a space These positions are integers that are encoded in 4 bytes 12 5 Text Automaton 12 5 1 The text fst2 file The text fst2 file is a special fst2 file that represents the text automaton In that file each sub graph represents a sentence automaton The areas reserved for the names of the sub graphs are used to store the sentences from which the sentence automata have been constructed With the exception of the first label which is always the empty word lt E gt the labels have to be either lexical units or entries in the DELAF format in braces Example Here is the file that corresponds to the text He is drinking orange juice 00000000014 1 He is drinking orange juice Y O1 GA Ra Ss N r hot lt He he N s p 4 He he PRO Nomin 3ms 4 is be V P3s is i N p f drinking drinking A drinking drinking N s 4 drinking drink V G orange orange A orange orange N Conc s
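The enter.pos layout described in excerpt 49 (a flat sequence of integers, each encoded in 4 bytes) can be decoded with Python's struct module. This is a sketch under stated assumptions: the excerpt does not give the byte order or signedness, so little-endian signed integers are assumed here and exposed as a parameter; read_enter_pos is a hypothetical helper name.

```python
import struct


def read_enter_pos(data, byte_order="<"):
    """Decode a bytes object made of consecutive 4-byte integers.

    byte_order is an assumption ("<" = little-endian); the excerpt only
    states that each position occupies 4 bytes.
    """
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of 4 bytes")
    count = len(data) // 4
    return list(struct.unpack("%s%di" % (byte_order, count), data))


# Round trip on synthetic data: three newline positions.
sample = struct.pack("<3i", 12, 57, 103)
positions = read_enter_pos(sample)
```

To read a real file, the bytes would be obtained with open(path, "rb").read(); if the decoded positions look implausibly large, the byte order assumption is the first thing to revisit.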
50. filters to recognize roman numerals. Note that it also uses contexts in order to avoid recognizing uppercase letters in some contexts. By default, dictionary graphs are applied in MERGE mode. If you want to apply them in REPLACE mode, you must suffix graph names with -r. This can be combined with the + and - priority marks: bagpipe-r.fst2, McAdam-r+.fst2, phtirius-r-.fst2. 3.6.4 Morphological dictionary graphs. In addition to dictionary graphs that produce new entries in the text dictionaries, you can design morphological dictionary graphs. The output of such graphs will be used as special input for the construction of the text automaton. We call them morphological dictionary graphs because their main utility is to introduce new morphological analysis in the text automaton, using the morphological mode (see section 6.4). This functionality will be helpful for agglutinative languages like Korean. The rule is simple: any output of a dictionary graph that begins with a slash will be added to the file tags.ind, located in the text directory. This file is used by the Txt2Fst2 program in order to add interpretations into the text automaton. Let us consider the grammar shown on Figure 3.14, which matches words made of the prefix un followed by an adjective. If we apply this grammar as a dictionary graph, we obtain new paths in the text automaton, as shown on Figure 3.15. 3.7 Bibliography. Table 3.4 gives some references for electronic di
51. finite state transducer in the favorable case and an optimized grammar strictly equivalent to the original grammar if not default e d N depth N maximum depth to which graph calls should be unfolded The default value is 10 11 13 Fst2Grf Fst2Grf OPTIONS lt fst2 gt This program extracts a sentence automaton in grf format from the given text automaton OPTIONS e s N sentence N the number of the sentence to be extracted e o XXX output XXX pattern used to name output files XXX grf and XXX txt default cursentence e f FONT font FONT sets the font to be used in the output grf default Times new Roman The program produces the following two files and saves them in the directory of the text e cursentence grf graph representing the automaton of the sentence e cursentence txt text file containing the sentence 204 CHAPTER 11 USE OF EXTERNAL PROGRAMS 11 14 Fst2List Fst2List o out p s f d a t s m f s a s L R sO Str v rx L R 1 line 1 subname c SS 0xxxx fname This program takes a fst2 file and lists the sequences recognized by this grammar The parameters are e fname grammar name including fst2 e o out specifies the output file 1st t xt by default e a t s m indicates if the program must take into account t or not a the out puts of the grammars if any s indicates that there is only one initial state wh
52. for making modifications to it. Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Linguistic Resource is not restricted, and output from such a program is covered only if its contents constitute a work based on the Linguistic Resource (independent of the use of the Linguistic Resource in a tool for writing it). Whether that is true depends on what the program that uses the Linguistic Resource does. You may copy and distribute verbatim copies of the Linguistic Resource as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Linguistic Resource. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. You may modify your copy or copies of the Linguistic Resource or any portion of it, thus forming a work based on the Linguistic Resource, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a linguistic resource. b) You must cause
53. fsf org 12 11 6 2 Anna ANASTASSIADIS SYMEONIDIS Tita KYRIACOPOULOU Elsa SKLAVOUNOU Ias son THILIKOS and Rania VOSKAKI A system for analysing texts in modern greek representing and solving ambiguities In Proceedings of COMLEX 2000 Workshop on Computational Lexicography and Multimedia Dictionaries Patras 2000 3 7 3 Jean Claude ANSCOMBRE Pourquoi un moulin vent n est pas un ventilateur Langue Francaise 86 1990 10 1 4 Laurie BAUER English Word Formation Cambridge University Press 1983 10 1 5 Emile BENVENISTE Fondements syntaxiques de la composition nominale Formes nouvelles de la composition nominale pages 145 176 Gallimard Paris 1974 10 1 6 Olivier BLANC and Anne DISTER Automates lexicaux avec structure de traits In Actes RECITAL 2004 2004 7 3 7 Xavier BLANCO Noms compos s et traduction francais espagnol Lingvistice Investi gationes 21 1 1997 Amsterdam Philadelphia John Benjamins Publishing Company oF 8 Xavier BLANCO Les dictionnaires lectroniques de l espagnol DELASs et DELACs Lingvistice Investigationes 23 2 2000 Amsterdam Philadelphia John Benjamins Pub lishing Company 3 7 9 Jean Paul BOONS Alain GUILLET and Christian LECLERE La structure des phrases simples en frangais classes de constructions transitives Technical report LADL Paris 1976 8 1 10 Jean Paul BOONS Alain GUILLET and Christian LECLERE La structure des phrases sim ples en francais
54. graph. Compilation is the operation that converts the grf format to a format that can be manipulated more easily by Unitex programs. In order to compile a graph, you must open it and then click on "Compile FST2" in the "Tools" submenu of the menu "FSGraph". Unitex then launches the Grf2Fst2 program. You can keep track of its execution in a window (cf. Figure 6.4). Messages with a colored background are generated by the interface, not by the external programs. [Figure 6.4: Compilation window] If the graph references subgraphs, those are automatically compiled. The result is a fst2 file that contains all the graphs that make up a grammar. The grammar is then ready to be used by Unitex programs. 6.2.2 Approximation with a finite state transducer. The FST2 format conserves the architecture in subgraphs of the grammars, which is what makes them different from strict finite state transducers. The Flatten program allows you to turn a FST2 grammar into a finite state transducer whenever this is possible, and to construct an approximation if not. This function thus makes it possible to obtain obje
55. in OUT the developed form of the text automaton <txtauto>. OPTIONS: • -o OUT/--output=OUT: output file. By default, OUT is made of <txtauto> with -exp before the extension, like foo-exp.fst2. 11.11 Extract. Extract [OPTIONS] <text>. This program extracts from the given text all sentences that contain at least one occurrence from the concordance. The parameter <text> represents the complete path of the text file, without omitting the extension .snt. OPTIONS: • -y/--yes: extracts all sentences containing matching units (default); • -n/--no: extracts all sentences that don't contain matching units; • -o OUT/--output=OUT: output text file; • -i X/--index=X: the .ind file that describes the concordance. By default, X is the concord.ind file located in the text directory. The result file is a text file that contains all extracted sentences, one sentence per line. 11.12 Flatten. Flatten [OPTIONS] <fst2>. This program takes a .fst2 grammar as its parameter and tries to transform it into a finite state transducer. OPTIONS: • -f/--fst: the grammar is unfolded to the maximum depth and is truncated if there are calls to sub-graphs. Truncated calls are replaced by void transitions. The result is a .fst2 grammar that only contains a single finite state transducer; • -r/--rtn: calls to sub-graphs that remain after the transformation are left as they are. The result is therefore a
56. is applied to the following text: Formulas like E=mc2 have nothing to do with acorn-shells. you will get the following lines in the dictionary of compound words of the text: E=mc2,.FORMULA acorn-shells,.N:p Entry Factorization. Several entries containing the same inflected and canonical forms can be combined into a single one if they also share the same grammatical and semantic codes. Among other things, this allows us to combine identical conjugations for a verb: bottle,.V:W:P1s:P2s:P1p:P2p:P3p If the grammatical and semantic information differ, one has to create distinct entries: bottle,.N+Conc:s bottle,.V:W:P1s:P2s:P1p:P2p:P3p Some entries that have the same grammatical and semantic codes can have different meanings, as is the case for the French word poêle, which describes a stove or a type of sheet in the masculine sense and a kitchen instrument in the feminine sense. You can thus distinguish the entries in this case: poêle,.N+z1:fs/ poêle à frire poêle,.N+z1:ms/ voile, linceul; appareil de chauffage NOTE: In practice, this distinction has the only consequence that the number of entries in the dictionary increases. For the different programs that make up Unitex, these entries are equivalent to: poêle,.N+z1:fs:ms Whether this distinction is made is thus left to the maintainers of the dictionaries. 3.1.2 The DELAS Format. The DELAS format is ver
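The factorization rule in excerpt 56 can be sketched in a few lines of Python. This is an illustration only, not the algorithm used by any Unitex program: it assumes the usual DELAF line shape inflected,lemma.CODES:infl1:infl2 (the punctuation is not visible in the extracted text), and it ignores escaped characters and trailing / comments. The function name factorize is hypothetical.

```python
from collections import OrderedDict


def factorize(entries):
    """Merge entries that share forms and grammatical/semantic codes.

    Assumes lines shaped like 'inflected,lemma.CODES:infl1:infl2';
    escaped dots and '/' comments are not handled in this sketch.
    """
    merged = OrderedDict()
    for entry in entries:
        # 'bottle,' | 'V:W:P1s'  (split on the first dot)
        head, _, codes = entry.partition(".")
        gram, *inflections = codes.split(":")
        bucket = merged.setdefault((head, gram), [])
        for code in inflections:
            if code not in bucket:
                bucket.append(code)
    return [
        head + "." + ":".join([gram] + codes)
        for (head, gram), codes in merged.items()
    ]


result = factorize(["bottle,.V:W", "bottle,.V:P1s", "bottle,.N+Conc:s"])
```

Entries are grouped on the pair (forms, grammatical and semantic codes), so the noun and verb readings of bottle stay separate while the two verb lines collapse into one.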
57. kao vu k gladan kao vu CHAPTER 10 COMPOUND WORD INFLECTION k AC_A3X k AC_A3X k AC_A3X k AC_A3X k AC_A3X 2 s6fgea 2 s6ngea 2 s7mgda 2 s7fgea 2 s7ngda hungry as a wo hungry as a wo hungry as a wol AC_A3XN2 s7mgka hungry as a wolf hungry as a wol hungry as a wol Lf lf kao vuk gladan kao vuk AC_A3XN2 s7mgda hungry as a wolf Lf Lf kao vuk gladan kao vuk AC_A3XN2 s7ngda hungry as a wolf Lf AC_A3XN2 s7ngka hungry as a wolf kao vuk gladan kao vuk AC_A3XN2 plmgea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 plmgea hungry as a wolf kao vukovi gladan kao vuk AC_A3XN2 plmgea hungry as a wol kao vuk gladan kao vuk AC_A3XN2 plfgea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 plfgea hungry as a wolf kao vukovi gladan kao vuk AC_A3XN2 plfgea hungry as a wol kao vuk gladan kao vuk AC_A3XN2 plngea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 plngea hungry as a wolf kao vukovi gladan kao vuk AC_A3XN2 plngea hungry as a wol kao vuk gladan kao vuk AC_A3XN2 p2mgea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 p2mgea hungry as a wolf kao vukovi gladan kao vuk AC_A3XN2 p2mgea hungry as a wo kao vuk gladan kao vuk AC_A3XN2 p2fgea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 p2fgea hungry as a wolf kao vukovi gladan kao vuk AC_A3XN2 p2fgea hungry as a wo gladnih kao vuk gladan kao vuk AC_A3XN2 p2ngea hungry as a wolf gladnih kao vuci gladan kao vuk AC_A3
58. linguistic queries In Matthieu Constant Takuya Nakamura Michele De Gioia and Sara Vecchiato editors 27th International Conference on Lexis and Grammar LGC 08 pages 117 125 September 2008 9 9 2 BIBLIOGRAPHY 267 61 Adam PRZEPI RKOWSKI and Marcin WOLINSKI The Unbearable Lightness of Tag ging A Case Study in Morphosyntactic Tagging of Polish In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora EACL 2003 2003 10 1 1 10 2 2 62 Roger Bruno RABENNILAINA Le verbe malgache AUPELF UREF et Universit Paris 13 Paris 1991 8 1 63 Agata SAVARY Recensement et description des mots compos s m thodes et applications 2000 These de doctorat Universit de Marne la Vall e 3 7 10 1 1 10 1 2 64 Agata SAVARY A formalism for the computational morphology of multi word units Archives of Control Sciences 15 3 437 449 2005 10 10 1 2 10 2 65 Max SILBERZTEIN Les groupes nominaux productifs et les noms compos s lexical is s Lingvistice Investigationes 27 2 405 426 1999 Amsterdam Philadelphia John Benjamins Publishing Company 3 7 10 1 66 Carlos SUBIRATS R GGEBERG Sentential complementation in Spanish A lexico grammatical study of three classes of verbs John Benjamins Amsterdam Philadelphia 1987 8 1 67 Thomas TREIG Compl tives en allemand classification Technical Report 7 LADL 1977 8 1 68 Lidia VARGA Classification syntaxique des verbes de mou
59. lt var gt Gen lt var gt adv The above file says that for Polish three inflection categories are considered the number Nb the case Case and the gender Gen Each category is given an exhaustive list of its possible values singular and plural for number etc Further each morphological class is described with respect to the categories it inflects for and those that are fixed for it For example a noun inflects for number and case and has a fixed gender The presence of 10 2 FORMALISM FOR THE COMPUTATIONAL MORPHOLOGY OF MWUS 171 such a file is necessary if we wish to express the fact that a certain word inflects for number gender or case without having to explicitly enumerate each time which inflectional values singular plural masculine etc it can take Similarly for French the Morphology txt file may be as follows French lt CATEGORIES gt Nb s p Gen m f lt CLASSES gt noun Nb lt var gt Gen lt var gt adj Nb lt var gt Gen lt var gt adv However in the existing systems for computational morphology such a description of classes categories and values is not always present For example according to the DELA conven tions 20 the morphological values of each simple word are plain sequences of characters e g ms for masculine singular without any explicit mention of their corresponding cate gories In order for the program to be compatible with such systems we use a list co
60. of the application of the grammar shown on Figure 6.24. [Figure 6.26: A grammar with both left and right contexts] [Figure 6.27: Results of the application of the grammar shown on Figure 6.26] 6.4 The morphological mode. 6.4.1 Why? As Unitex works on a tokenized version of the text, it is not possible to perform queries that need to enter inside tokens, except with morphological filters (see section 4.7), as shown on Figure 6.28. [Figure 6.28: Matching morphological things. The figure notes that a plain box does not work and that a morphological filter should be used instead.] However, even morphological filters cannot allow any query, since they cannot refer to dictionaries. Thus, it is impossible to formulate this way a query like: a word made of the prefix un followed by an adjective suffixed
61. operation on a list of words or expressions from a text editor to a box in a graph. In order to avoid having to copy every term manually, Unitex provides a means to copy lists. [Figure 5.16: Inverting month and year in a date] To use this, select the list in your text editor and copy it using <Ctrl+C> or the copy function integrated in your editor. Then create a box in your graph, and press <Ctrl+V> or use the "Paste" command in the "Edit" menu to paste it into the box. A window as in Figure 5.17 opens. [Figure 5.17: Selecting a context for copying a list] This window allows you to define the left and right contexts that will automatically be used for each term of the list. By default, these contexts are empty. If you use the contexts < and .V> with the following list: eat, sleep, drink, play, read, you will get the box in figure 5.18: <eat.V>, <sleep.V>, <drink.V>, <play.V>, <read.V>. [Figure 5.18: Box resulting from copying a list and applying contexts] 5.2.7 Special Symbols. The Unitex graph editor interprets the following symbols in a special manner: 2 < > Table 5.1 summarizes the meanin
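The effect of the left and right contexts described in excerpt 61 amounts to wrapping each pasted term. A minimal Python sketch, assuming the contexts are the strings "<" and ".V>" (the dot is an assumption, since punctuation is not visible in the extracted text); apply_contexts is a hypothetical name and has nothing to do with the editor's actual implementation.

```python
def apply_contexts(terms, left="", right=""):
    """Surround every term with the given left and right contexts."""
    return [left + term + right for term in terms]


# The list from the excerpt, with assumed contexts "<" and ".V>".
box_lines = apply_contexts(
    ["eat", "sleep", "drink", "play", "read"], left="<", right=".V>"
)
```

With empty contexts, which the excerpt gives as the default, the terms are pasted unchanged.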
62. order to let a dictionary entry E be recognized by mask M, it is necessary that at least one inflectional code of E contains all the characters of an inflectional code of M. Consider the following example: E = pretext,.V:W:P1s:P2s:P1p:P2p:P3p, M = <V:P3s:P3>. No inflectional code of E contains the characters P, 3 and s at the same time. However, the code P3p of E does contain both characters P and 3. The code P3 is included in at least one code of E; mask M thus recognizes entry E. The order of the characters inside an inflectional code is without importance. 4.3.5 Negation of a lexical mask. It is possible to negate a lexical mask by placing the character ! immediately after the character <. Negation is possible with the masks <MOT>, <MIN>, <MAJ>, <PRE>, <DIC>, as well as with the masks that carry grammatical, semantic or inflectional codes (i.e. <!V-z3:P3>). The masks # and " " are the negation of each other. The mask <!MOT> recognizes all tokens that do not consist of letters, except for the sentence separator {S} and the {STOP} marker. Negation has no effect on <NB>, <SDIC>, <CDIC> and <TOKEN>. The negation is interpreted in a special way in the lexical masks <!DIC>, <!MIN>, <!MAJ> and <!PRE>. Instead of recognizing all forms that are not recognized by the mask without negation, these masks find only forms that are sequences of letters. Thus, the ma
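The inclusion rule for inflectional codes stated at the start of excerpt 62 can be checked with a short Python sketch. It covers only the inflectional part of the comparison (grammatical and semantic codes are left out), treats each code as an unordered set of characters as the excerpt describes, and uses hypothetical function names.

```python
def code_included(mask_code, entry_code):
    """True if every character of mask_code occurs in entry_code.

    Character order is ignored, as stated in the excerpt.
    """
    return all(ch in entry_code for ch in mask_code)


def inflection_matches(entry_codes, mask_codes):
    """True if some mask code is included in some entry code."""
    return any(
        code_included(m, e) for m in mask_codes for e in entry_codes
    )


# Inflectional codes of entry E and mask M from the excerpt.
entry = ["W", "P1s", "P2s", "P1p", "P2p", "P3p"]
mask = ["P3s", "P3"]
recognized = inflection_matches(entry, mask)
```

Here P3s is not included in any code of E, but P3 is included in P3p, so the mask recognizes the entry; a mask carrying only P3s would not.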
63. see the frame shown on Figure 9.1. You provide texts under two formats: raw Unicode text, as you do for your corpus, or TEI-encoded texts (an XML format, see [41]). In the last text field, you can select an XML alignment file if you have already built one. If you select a raw text, Unitex will need to build a basic TEI version of it (for more details, see section 11.32 about the XMLizer program). So, when you click on OK, you will be asked to provide an XML file name, as shown on Figure 9.2. Then Unitex builds the XML versions of your texts if needed and displays the frame shown on Figure 9.3. As you can see, each text is presented as a list, each cell representing a sentence. [Figure 9.1: Text alignment selection frame] [Figure 9.2: Warning about raw texts]
64. spaces inside compound words. The case of lexical units is retained. For example, if the word Here is encountered, the capital letter is preserved (cf. figure 7.1). This choice allows you to keep this information during the transition to the text automaton, which could be useful for applications where case is important, as for recognition of proper names. 7.2.2 Normalization of ambiguous forms. During construction of the automaton, it is possible to normalize ambiguous forms by applying a normalization grammar. This grammar has to be called Norm.fst2 and must be placed in your personal folder, in the subfolder /Graphs/Normalization of the desired language. The normalization grammars for ambiguous forms are described in section 6.1.3. If a sequence of the text is recognized by the normalization grammar, all the interpretations that are described by the grammar are inserted into the text automaton. Figure 7.4 shows the part of the grammar used for the ambiguity of the sequence l' in French. [Figure 7.4: Normalization of the sequence l'] If this grammar is applied to a French sentence containing the sequence l', a sentence automaton that is similar to the one in figure 7.5 is obtained. You can see that the four rules for rewriting the sequence l' have been applied, which has added four labels to the automaton. These labels are not concurrent with the two preexisting
65. [Figure 6.43: Concordance obtained by application of graph TitleName] Outputs with variables can be used to move word groups. In fact, the application of a transducer in REPLACE mode inserts only the produced sequences into the text. In order to invert two word groups, you just have to store them into variables and produce an output with these variables in the desired order. Thus, the application of the transducer in Figure 6.44 in REPLACE mode to the text Ivanhoe results in the concordance of Figure 6.45. [Figure 6.44: Inversion of words using two variables]
66. the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is v
67. the pleasant town 123 127 the noble seats 157 161 the fabulous Dragon 189 193 the Civil Wars 455 459 the feeble interference 463 467 the English Council 568 572 the national convulsions 592 596 the inferior gentry 628 632 the English constitution 698 702 the petty kings 815 819 the certain hazard 898 902 the great Barons 940 944 the very edge. The first line indicates in which transduction mode the concordance has been constructed. The three possible values are: • #I: transducer outputs have been ignored; • #M: transducer outputs have been inserted before the corresponding inputs (MERGE mode); • #R: transducer outputs have replaced the recognized sequences (REPLACE mode). Each occurrence is described in one line. The lines start with the start and end position of the occurrence. These positions are given in lexical units. If the file has the heading line #I, the end position of each occurrence is immediately followed by a newline. Otherwise, it is followed by a space and a sequence of characters. In REPLACE mode, that sequence corresponds to the output produced for the recognized sequence. In MERGE mode, it represents the recognized sequences into which the outputs have been inserted. In MERGE or REPLACE mode, this sequence is displayed in the concordance. If the outputs have been ignored, the contents of the occurrence is extracted from the
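The concordance index layout described in excerpt 67 (a mode line, then one occurrence per line with start position, end position and an optional sequence) can be parsed as follows. This Python sketch is written under assumptions: the mode line is taken to be a single letter I, M or R, optionally preceded by #, and the fields are taken to be separated by single spaces. The function name parse_concord_ind is hypothetical, and decoding the file from disk is left out because the excerpt does not state its encoding.

```python
def parse_concord_ind(lines):
    """Parse concordance index lines into (mode, occurrences).

    Each occurrence is a tuple (start, end, sequence), where sequence is
    None when the line stops after the end position.
    """
    mode = lines[0].strip().lstrip("#")  # "I", "M" or "R" (assumed)
    occurrences = []
    for line in lines[1:]:
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split(" ", 2)  # start, end, rest of the line
        start, end = int(parts[0]), int(parts[1])
        sequence = parts[2] if len(parts) > 2 else None
        occurrences.append((start, end, sequence))
    return mode, occurrences


merged = parse_concord_ind(["#M\n", "123 127 the noble seats\n"])
ignored = parse_concord_ind(["#I\n", "59 63\n"])
```

Splitting at most twice keeps any spaces inside the trailing sequence intact, which matters in MERGE and REPLACE mode where that sequence is what the concordance displays.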
68. the same axis; • Bottom: boxes are aligned with the bottom-most box. [Figure 5.22: Alignment window] The possibilities for vertical alignment are: • Left: boxes are aligned with the left-most box; • Center: boxes are centered on the same axis; • Right: boxes are aligned with the right-most box. Figure 5.23 shows an example of alignment. The group of boxes to the right is a copy of the group to the left, to which alignment has been applied. The option "Use Grid" in the alignment window shows a grid as the background of the graph. This allows you to approximately align the boxes. [Figure 5.23: Example of box alignment] [Figure 5.24: Example of using the grid] 5.3.5 Display options, fonts and colors. You can configure the display style of a graph by pressing <Ctrl+R> or by clicking on "Presentation" in the "Format" sub-menu of the "FSGraph" menu, which opens the window as in figure 5.25. The font parameters are: • Input: font used within the boxes and in the text area where the contents of the boxes is edited; • Output: font used for the attached transducer outputs.
69. them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.) b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with. c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution.
70. then reset the automaton of that sentence by clicking on the button "Reset Sentence Graph" (cf. figure 7.24). During the construction of the text automaton, all the modified sentence graphs in the text file are erased. [Figure 7.24: Modified sentence automaton] NOTE: After you reconstruct the text automaton, you can save your manual modifications. In order to do that, click on the button "Rebuild FST Text". All sentences that have been modified are then replaced in the text automaton by their modified versions. The new text automaton is then automatically reloaded. 7.4.3 Display configuration. Sentence automata are subject to the same presentation options as the graphs. They use the same colors and fonts, as well as the antialiasing effect. In order to configure the appearance of the sentence automata, you modify the general configuration by clicking on "Preferences" in the "Info" menu. For further details, refer to section 5.3.5. You can also print a sentence automaton by clicking on "Print" in the "FSGraph" menu or by pressing <Ctrl+P>. Make sure that the printer's page orientation is set to landscape mode. To configure this parameter, click on "Page Setup" in
71. these graphs are normalization of non-ambiguous forms and sentence boundary recognition. The interpretation of these graphs in Unitex is very close to that of syntactic graphs used by the search for patterns. The differences are the following: • you can use the special symbol <^> that recognizes a newline; • if you work in character by character mode, you can use the special symbol <L> that recognizes one letter, as defined in the alphabet file; • it is impossible to refer to information in dictionaries; • it is impossible to use morphological filters; • it is impossible to use morphological mode; • it is impossible to use contexts. The figures 2.9 (page 26) and 2.10 (page 28) show examples of preprocessing graphs. 6.1.3 Graphs for normalizing the text automaton. Graphs for normalizing the text automaton allow you to normalize ambiguous forms. They can describe several labels for the same form. These labels are then inserted into the text automaton, thus making the ambiguity explicit. Figure 6.3 shows an extract of the normalization graph used by default for French. [Figure 6.3: Extract of the normalization graph used for French] The paths describe the forms that have to be normalized. Lower case and upper case variants are taken into account according to the following principle: uppercase letters in the graph only recognize uppercase letters in the text automaton
72. to display a concordance, you have to click on the "Build concordance" button. You can parameterize the size of left and right contexts in characters. You can also choose the sorting mode that will be applied to the lines of the concordance in the "Sort According to" menu. For further details on the parameters of concordance construction, refer to section 4.8.2. The concordance is produced in the form of an HTML file. You can parameterize Unitex so that concordance files can be read using a web browser (cf. section 4.8.2). If you display concordances with the window provided by Unitex, you can access a recognized sequence in the text by clicking on the occurrence. If the text window is not iconified and the text is not too long to be displayed, you see the selected sequence appear (cf. Figure 6.49). Furthermore, if the text automaton has been constructed, and if the corresponding window is not iconified, clicking on an occurrence selects the automaton of the sentence that contains this occurrence. 6.8.3 Modification of the text. You can choose to modify the text instead of constructing a concordance. In order to do that, type a file name in the "Modify text" field in the window of Figure 6.48. This file has to have the extension .txt. If you want to modify the current text, you have to choose the corresponding .txt file. If you choose another file name, the current text will not be affected. Click on the "GO" button
73. underscore. In order to define the boundaries of the zone to be stored in a variable, you have to create two boxes that contain the name of the variable enclosed in the characters $ and ( for the start of a variable, and $ and ) for its end. In order to use a variable in a transducer output, its name must be surrounded by the character $ (cf. Figure 6.42). Variables are global. This means that you can define a variable in a graph and reference it in another, as is illustrated in the graphs of Figure 6.42. If the graph TitleName is applied in MERGE mode to the text Ivanhoe, the concordance in Figure 6.43 is obtained.
Figure 6.42: Definition of a variable in a subgraph
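The way an output refers to a variable can be illustrated with a small sketch. It is not Unitex code: the function expand_output and the dictionary of captured values are hypothetical, and the sketch assumes that a reference is written $name$ and that names are made of unaccented letters, digits and underscores.

```python
import re

# Hypothetical helper, not part of Unitex: expands $name$ references in a
# transducer output using the text stored for each variable.
VAR_REF = re.compile(r"\$([A-Za-z0-9_]+)\$")

def expand_output(template, variables):
    """Replace each $name$ in `template` with the text stored for `name`."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return variables[name]
    return VAR_REF.sub(lookup, template)

# Variables are global, so a value captured in a subgraph can be used by
# the output of the calling graph. The value below is an illustrative one.
captured = {"title": "Sir"}
print(expand_output("[TITLE=$title$]", captured))
```

A single dictionary shared by all graphs is used here because it mirrors the statement that variables are global. How Unitex actually stores the captured text is not described in this section.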
74. wish, you can convert your data into other encodings, for example UTF-8, in order for instance to create web pages. The "Add Files" button enables you to select the files to be converted. The "Remove Files" button makes it possible to remove a list of files erroneously selected. The "Transcode" button will start the conversion of all the selected files. If an error occurs while a file is processed (for example, a file which is already in Unicode), the conversion continues with the next file.
Figure 2.3: Transcoding files
To obtain a text in the right format, you can also use a text processor like the free software from OpenOffice.org [57] or Microsoft Word, and save your document with the format "Unicode text". In OpenOffice Writer, you have to choose the "Coded Text (*.txt)" format and then select the "Unicode" encoding in the configuration window, as shown on figure 2.4.
75. with able. To overcome this difficulty, we introduced a morphological mode in the Locate program. It consists of bounding a part of your grammar with the special symbols $< and $>. Within this zone, things are matched letter by letter, as shown on Figure 6.29.
Figure 6.29: Example of morphological zone in a grammar
6.4.2 The rules
In this mode, the content of the graph is not interpreted as it is in the normal way:
1. There is no implicit space between boxes. So, if you want to match a space, you have to make it explicit with a space between double quotes.
2. You can still use subgraphs, but the end of the morphological zone must occur in the same graph as its beginning.
3. You cannot declare variables with $xxx( and $xxx).
4. You can use morphological filters on <DIC> and patterns referring to dictionaries like <be>, <N:ms>, etc.
5. Left and right contexts are forbidden.
6. You can use outputs.
7. <MOT> will match any letter, as defined in the alphabet file.
8. <MIN> will match any lowercase letter, as defined in the alphabet file.
9. <MAJ> will match any uppercase letter, as defined in the alphabet file.
10. <DIC> will match any word present in the morphological dictionaries (see below).
11. You can use patterns that refer to the morphological dictionaries like <have>, <V:K>, etc.
12. The meta <PRE>, <
76. 0 describes the compound word inflection module as a complement of the simple word inflection mechanism presented in chapter 3. Chapter 11 contains a detailed description of the external programs that make up the Unitex system. Chapter 12 contains descriptions of all file formats used in the system. The reader will find in appendix the GPL and LGPL licenses under which the Unitex source code is released, as well as the LGPLLR license which applies for the linguistic data distributed with Unitex.
Chapter 1 Installation of Unitex
Unitex is a multi-platform system that runs on Windows as well as on Linux or MacOS. This chapter describes how to install and how to launch Unitex on any of these systems. It also presents the procedures used to add new languages and to uninstall Unitex.
1.1 Licenses
Unitex is free software. This means that the sources of the programs are distributed with the software, and that anyone can modify and redistribute them. The code of the Unitex programs is under the LGPL licence [33], except for the TRE library for dealing with regular expressions from Ville Laurikari [50], which is under GPL licence [32]. The LGPL licence is more permissive than the GPL licence, because it makes it possible to use LGPL code in nonfree software. From the point of view of the user, there is no difference, because in both cases the software can freely be used and distributed. All the data that go with Un
77. d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place.
e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy.
For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable.
It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Library together in an executable that you distribute.
7. You may place library facilities that are a work based on the Library side-by-side in a single library together with other library facilities not covered by this License, and distribute such a combined library, provided that the separate distribution of the work based on the Library and of the other library facilities is otherwise permitted, and provided that you do these two things:
a) Accompany
78. 21 fs NC_NN cx caf caf Nl ms au carte carte N21 fs cousin cousin N8 ms franc franc A47 ms postale pos m moir germain germain A8 ms NC_NNmf ma on ma on N41 ms NC_ANI lait NC_NXXXX tal A8 fs NC_NNS m moire N21 fs microscope microscope Nl ms porte serviette serviett vive vi a N2l E f A48 fs NC_NN effet tunnel NC_NXXXXXX s NC_VNm The corresponding inflection graphs for MWUs are shown on figures 10 21 through 10 27 The DELACF dictionary resulting from the inflection via MULTIFLEX of the above DELAC is as follows avant garde avant garde NC_XXN avant gardes avant garde NC_XXN fs fp bateau mouche bateau mouche NC_NN ms bateaux mouches bateau mouche N caf au lait caf au lait NC_NX caf s au lait caf au lait NC_N carte postale carte postale NC_ cartes postales carte postale N cousin germain cousin germain N cousins germains cousin germain cousine germaine cousin germain cousines germaines cousin germa franc macon franc macon NC_AN1 franc maconne franc macon NC_ franc macon franc macon NC_AN1 AN franc maconne franc ma on NC_ francs maconnes francs ma ons franc macon NC_AN1 C_NN mp XXX ms XXXX mp NN fs C_NN f C_NNmf ms NC_NNmf mp NC_NNmf fs in NC_NNmf ms AN ms Jace mp
79. Here is an example of a file:
<html lang=en>¶
<head>¶
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">¶
<title>6 matches</title>¶
</head>¶
<body>¶
<table border="0" width="100%"><td nowrap>¶
<font face="Courier new" size=3>¶
on there <a href="116 124 2">extended</a>&nbsp;i&nbsp;<br>¶
&nbsp;extended <a href="125 127 2">in</a>&nbsp;ancient&nbsp;<br>¶
&nbsp;Scott S <a href="32 34 2">IN</a>&nbsp;THAT PL&nbsp;<br>¶
STRICT of <a href="61 66 2">merry</a>&nbsp;Engl&nbsp;<br>¶
S IN THAT <a href="40 48 2">PLEASANT</a>&nbsp;D&nbsp;<br>¶
&nbsp;which is <a href="84 91 2">watered</a>&nbsp;by&nbsp;<br>¶
</font>¶
</td>¶
</table>¶
</body>¶
</html>¶
Figure 12.2 shows the page that corresponds to the file above.
Figure 12.2: Example of a concordance
12.6.4 The diff.html file
The diff.html file is an HTML file that presents the differences between two concordances. This file is encoded in UTF-8. Here is an example of file (new lines have been introduced for presentation convenience): <html> <
80. 2XN1 N C zxiro racyun zxiro racyun NC_2XN1 N Co zxiro racyuna zxiro racyun NC_2XN1 N C zxiro racyunu zxiro racyun NC_2XN1 N C zxiro racyun zxiro racyun NC_2XN1 N Co zxiro racyune zxiro racyun NC_2XN1 N C zxiro racyunom zxiro racyun NC_2XN1 N zxiro racyunu zxiro racyun NC_2XN1 N C zxiro racyuni zxiro racyun NC_2XN1 N C zxiro racyuna ZXiro racyun C_ 2XN1 N C zxiro racyunima zxiro racyun NC_2XN1 N zxiro racyune zxiro racyuni ZXiro racyun ZXiro racyun C_ C_ 2XN1 N C 2XN1 N C zxiro racyunima zxiro racyun NC_2XN1 N zxiro racyunima zxiro racyun NC_2XN1 N zxiro racyuna zxiro racyuna avio prevozni avio prevozni avio prevozni avio prevozni avio prevoznicye avio prevoznik NC_2XN avio prevozni avio prevozni avio prevoznici avio prevoznik avio prevozni avio prevoznicima avio prevoznik NC_2X ZxXiro racyun ZXiro racyun C_ C_ 2XN1 N C 2XN1 N C X 1 1 XN1 N Comp p6qm X 1 omp w2qm omp w4qm mp slqm omp s2qm omp s3qm mp s4qm omp s5qm Comp s6qm omp s7qm omp plqm omp p2qm Comp p3qm omp p4qm omp p5qm Comp p6qm Comp p7qm omp w2qm omp w4qm k avio prevoznik NC_2XN2 N Comp slvm ka avio prevoznik ku avio prevoznik ka avio prevoznik kom avio prevozni ku avio prevoznik ka avio prevoznik avio prevozni avio prevoznici avio prevoznik avio prevoznicima avio prevoznik NC_2X avio prevoz
81. 11.19 Locate
Locate [OPTIONS] <fst2>
This program applies a grammar to a text and constructs an index of the occurrences found.
OPTIONS:
• -t TXT/--text=TXT: complete path of the text file, without omitting the .snt extension;
• -a ALPH/--alphabet=ALPH: complete path of the alphabet file;
• -m DICS/--morpho=DICS: this optional parameter indicates which morphological dictionaries are to be used, if needed by some .fst2 dictionaries. DICS represents a list of .bin files (with full paths) separated with semi-colons;
• -s/--start_on_space: this parameter indicates that the search will start at any position in the text, even before a space. This parameter should only be used to carry out morphological searches;
• -x/--dont_start_on_space: forbids the program to match expressions that start with a space (default);
• -c/--char_by_char: works in character by character tokenization mode. This is useful for languages like Thai;
• -w/--word_by_word: works in word by word tokenization mode (default);
• -d DIR/--sntdir=DIR: puts produced files in DIR instead of the text directory. Note that DIR must end with a file separator (\ or /).
Search limit options:
• -l/--all: looks for all matches (default);
• -n N/--number_of_matches=N: stops after the first N matches.
Matching mode options:
• -S/--shortest_matches;
• -L/--longest_matches (default);
• -A/--all_matches.
Output options:
• -I/--ignore: ignore transducer outputs (de
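As an illustration of how these options combine, the following sketch assembles an argument list for a call to Locate from a script. The helper build_locate_args and the file names are hypothetical. The --name=value spelling of the long options is reconstructed from the option list above and should be checked against the usage message of your own Locate binary before relying on it.

```python
def build_locate_args(fst2, snt, alphabet, morpho_dics=(), max_matches=None,
                      matching="longest"):
    """Return an argument list for a Locate call (hypothetical helper)."""
    matching_options = {
        "shortest": "--shortest_matches",
        "longest": "--longest_matches",
        "all": "--all_matches",
    }
    args = ["Locate", f"--text={snt}", f"--alphabet={alphabet}"]
    if morpho_dics:
        # Several .bin dictionaries are joined with semi-colons.
        args.append("--morpho=" + ";".join(morpho_dics))
    if max_matches is None:
        args.append("--all")  # look for all matches
    else:
        args.append(f"--number_of_matches={max_matches}")
    args.append(matching_options[matching])
    args.append("--ignore")  # ignore transducer outputs
    args.append(fst2)        # the grammar comes last, after the options
    return args

# File names below are placeholders, not files shipped with Unitex.
print(build_locate_args("grammar.fst2", "corpus.snt", "Alphabet.txt",
                        max_matches=200))
```

The list can be passed to subprocess.run as is. Building a list rather than a single command string avoids quoting problems with paths that contain spaces.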
82. ACOLOR
COMMENT NODES COLOR SCOLOR
SELECTED NODES COLOR CCOLOR
Table 12.3: Meaning of the parameters
The PACKAGE NODES parameter defines the color to be used for displaying calls to subgraphs located in the repository.
The CONTEXT NODES parameter defines the color to be used for displaying boxes that correspond to context bounds.
The CHAR BY CHAR parameter indicates if the current language must be tokenized character by character or not.
The ANTIALIASING parameter indicates whether graphs as well as sentence automata are displayed by default with the antialiasing effect.
The HTML VIEWER parameter indicates the name of the web browser to be used for displaying concordances. If no browser name is defined, concordances are displayed in a Unitex window.
The MAX TEXT FILE SIZE parameter is deprecated.
The ICON BAR POSITION parameter indicates the default position of icon bars in graph frames.
The PACKAGE PATH parameter specifies the location of the repository.
The MORPHOLOGICAL DICTIONARY parameter specifies the list of morphological dictionaries to use, separated with semi-colons.
The MORPHOLOGICAL NODES COLOR parameter specifies the color to use to render the $< and $> tags.
The MORPHOLOGICAL USE OF SPACE parameter indicates if the Locate program is allowed to start matching on spaces. Default is false.
12.10.2 The system_dic.def file
The system_dic.def file is a text
83. ADV NS in A Ze NS in PART in PREP is be V P3s is i N p law N s law V W P1s P2s Pl1p P2p P3p Ss NS as well as a dictionary of compound words consisting of a single entry:
father-in-law,.N+NPN+Hum+z1:s
Since the sequence Igor is neither a simple English word nor a part of a compound word, it is treated as an unknown word. The application of dictionaries is done through the program Dico. The three files produced (dlf for simple words, dlc for compound words and err for unknown words) are placed in the text directory. The dlf and dlc files are called text dictionaries.
As soon as the dictionary look-up is finished, Unitex displays the sorted lists of simple, compound and unknown words found in a new window. Figure 2.12 shows the result for an English text.
It is also possible to apply dictionaries without preprocessing the text. In order to do this, click on "Apply Lexical Resources..." in the "Text" menu. Unitex then opens a window (see figure 2.13) in which you can select the list of dictionaries to apply.
The list "User resources" lists all dictionaries present in the directory (current language)/Dela of the user. The dictionaries installed in the system are listed in the scroll list named "System resources". Use the <Ctrl+click> combination to select several dictionaries. System dictionaries will be applied prior to user dictionaries. Within the system or user list, you can fix the order of dic
84. At the time of disambiguation, the Elag program is launched in a processing window which displays the messages printed by the program during its execution. For example, when the text automaton contains symbols which do not correspond to the set of ELAG labels (see the following section), a message indicates the nature of the error. In the same way, when a sentence is rejected (all possible analyses were eliminated by grammars), a message indicates the number of the sentence. That makes it possible to locate the source of the problems quickly.
Footnote 1: Entries which gather several different inflectional interpretations, such as for example {se,.PRO+PpvLE:3ms:3fs:3mp:3fp}.
Evaluation of ambiguity removal
The evaluation of the ambiguity rate is not based solely on the average number of interpretations per word. In order to get a more representative measure, the system also takes into account the various combinations of words. While instances of ambiguities are resolved, the Elag program calculates the number of possible analyses in the text automaton before and after the modification, which corresponds to the number of possible paths through the automaton. On the basis of this value, the program computes the average ambiguity by sentence and word. It is this last measure which is used to represent the ambiguity rate of the text, because it does not vary with the size of the corpus, nor with the number of sentences wi
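The notion of "number of possible analyses" can be made concrete with a small sketch that counts the paths of an acyclic sentence automaton. It is an illustration only, not the algorithm used by Elag: the function count_paths, the state numbering and the labels of the toy automaton are all assumptions made for the example.

```python
from functools import lru_cache

def count_paths(transitions, initial, finals):
    """Count the paths from `initial` to a final state of an acyclic automaton.

    `transitions` maps a state to a list of (label, next_state) pairs.
    """
    finals = set(finals)

    @lru_cache(maxsize=None)
    def paths_from(state):
        # A final state closes one path; every outgoing transition adds
        # the paths that start from its target state.
        total = 1 if state in finals else 0
        for _label, target in transitions.get(state, ()):
            total += paths_from(target)
        return total

    return paths_from(initial)

# Toy sentence automaton with made-up labels: 1 x 3 x 2 = 6 analyses.
before = {
    0: [("the.DET", 1)],
    1: [("light.N", 2), ("light.A", 2), ("light.V", 2)],
    2: [("fades.V", 3), ("fades.N", 3)],
}
# The same sentence after two labels have been removed: 1 x 2 x 1 = 2 analyses.
after = {
    0: [("the.DET", 1)],
    1: [("light.N", 2), ("light.A", 2)],
    2: [("fades.V", 3)],
}
print(count_paths(before, 0, {3}), count_paths(after, 0, {3}))
```

Memoizing the count per state keeps the computation linear in the number of transitions, even though the number of paths grows multiplicatively with each ambiguous word.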
85. Blandine Courtois and Max Silberztein, editors. Les dictionnaires électroniques du français. Larousse, Langue française, vol. 87, 1990. 3.7, 10.2.1, 10.2.2
[21] Anne DISTER, Nathalie FRIBURGER, and Denis MAUREL. Améliorer le découpage en phrases sous INTEX. In Anne Dister, editor, Revue Informatique et Statistique dans les Sciences Humaines, volume Actes des 3èmes Journées INTEX, pages 181-199, 2000. 2.5.2
[22] Pamela DOWNING. On the Creation and Use of English Compound Nouns. In Proceedings of CICLING 2002, volume 53, pages 810-842. Linguistic Society of America, 1977. 10.1
[23] Dana Marina DUMITRIU and Sébastien PAUMIER. Requêtes linguistiques sur alignements multilingues. In Directia Terminologie si Inginerie Lingvistica, DTIL'08, February 2008. ISBN 978-9-291220-37-3. 9
[24] Inkscape. Vector Graphics Editor. http://www.inkscape.org. 5.4.1
[25] Anibale ELIA. Le verbe italien. Les complétives dans les phrases à un complément. Schena/Nizet, Fasano/Paris, 1984. 8.1
[26] Anibale ELIA. Lessico-grammatica dei verbi italiani a completiva. Tavole e indice generale. Liguori, Napoli, 1984. 8.1
[27] Anibale ELIA and Simoneta VIETRI. Electronic dictionaries and linguistic analysis of italian large corpora. In Actes des 5es Journées internationales d'Analyse statistique des Données Textuelles. Ecole Polytechnique fédérale de Lausanne, 2000. 3.7
[28] Anibale ELIA and Simoneta VIETRI. L analisi automatica dei t
86. Concordance obtained in MERGE mode with the transducer of figure 6.39
6.7.2 Application while advancing through the text
During the preprocessing operations, the text is modified as it is being read. In order to avoid the risk of infinite loops, it is necessary that the sequences that are produced by a transducer will not be re-analyzed by the same one. Therefore, whenever a sequence is inserted into the text, the application of the transducer is continued after that sequence. This rule only applies to preprocessing transducers, because during the application of syntactic graphs, the transductions do not modify the processed text but a concordance file which is distinct from the text.
6.7.3 Priority of the leftmost match
During the application of a local grammar, overlapping occurrences are all indexed. Note that we talk about real overlapping occurrences like abc and bcd, not nested occurrences like abc and bc. During the construction of the concordance, all these overlapping occurrences are presented (cf. Figure 6.41).
Figure 6.41: Overlapping occurrences in concordance
On the other hand, if you modify a text instead of constructing a concordance, it is necessary to choose
87. Dictionary entry variables
Whereas you cannot define standard variables in morphological mode, you can associate variables to patterns that refer to the morphological dictionaries (except <DIC>). To do that, you must set the output of the box with $xxx$, where xxx is a valid variable name. That defines a special variable named xxx that represents the dictionary entry that has matched with your pattern. Now, you can get the inflected form, lemma and codes of the entry with $xxx.INFLECTED$, $xxx.LEMMA$ and $xxx.CODE$, as shown on Figure 6.32. Moreover, such variables can be used even after the end of the morphological mode, as shown on Figure 6.34.
Figure 6.32: Using a morphological variable
Figure 6.33: Results of grammar of Figure 6.32 applied in MERGE mode
Figure 6.34: Using a morphological variable in normal mode
6.5 Exploring grammar paths
It is possible to generate the paths recognized by a grammar (if they are in finite number), for example to check that it correctly generates the expected forms. For that, open the main graph of your grammar, and ensure that the gr
88. Directory field, select the root directory which you want to explore (in our example, the directory Dicos). In the field "Resulting GRF grammar", enter the name of the produced grammar.
WARNING: Do not place the output grammar in the tree structure which you want to explore, because in this case the program will try to read and to write simultaneously in this file, which will cause a crash.
When you click on "OK", the program will copy the graphs to the directory of the output grammar, and will create subgraphs corresponding to the various sub-directories, as one can see in figure 6.38, which shows the output graph generated for our example. One can observe that one box contains the calls to subgraphs corresponding to sub-directories (here directories Banque and Nourriture), and that the other box calls all the graphs which were in the directory (here the graph truc.grf).
Figure 6.38: Main graph of a graph collection
6.7 Rules for applying transducers
This section describes the rules for the application of transducers along with the operations of preprocessing and the search for patterns. The following does not apply to inflection graphs and normalization graphs for ambiguous forms.
6.7.1 Insertion to the left of the matched pattern
When a transducer is applied in REPLACE mode, the output replaces the sequences that have been read in the text. Wh
89. French, Unitex proposes to convert your text (see footnote 1), assuming that it is coded using a French code page. By default, Unitex proposes to either replace the original text or to rename the original file by inserting .old at the beginning of its extension. For example, if one has an ASCII file named balzac.txt, the conversion process will create a copy of this ASCII file named balzac.old.txt, and will replace the contents of balzac.txt with its equivalent in Unicode.
Figure 2.2: Automatic conversion of a non-Unicode text
If the encoding suggested by default is not correct, or if you want to rename the file differently than with the suffix .old, you must use the "Transcode Files" command in the "File Edition" menu. This command allows you to choose source and target encodings of the documents to be converted (see figure 2.3). By default, the selected source encoding is that which corresponds to the current language, and the destination encoding is Unicode Little-Endian. You can modify these choices by selecting any source and target encodings. Thus, if you
Footnote 1: Unitex also proposes to automatically convert graphs and dictionaries that are not in Unicode Little-Endian.
90. IONS <dic_1> <dic_2> <dic_3>
This program applies dictionaries to a text. The text must have been cut up into lexical units by the Tokenize program.
OPTIONS:
• -t TXT/--text=TXT: complete .snt text file name;
• -a ALPH/--alphabet=ALPH: the alphabet file to use;
• -m DICS/--morpho=DICS: this optional parameter indicates which morphological dictionaries are to be used, if needed by some .fst2 dictionaries. DICS represents a list of .bin files (with full paths) separated with semi-colons.
<dic_i> represents the path and name of a dictionary. The dictionary must be a .bin dictionary (obtained with the Compress program) or a dictionary graph in the .fst2 format (see section 3.6, page 50). It is possible to give priorities to the dictionaries. For details, see section 3.6.1.
The program Dico produces the following files, and saves them in the directory of the text:
• dlf: dictionary of simple words in the text;
• dlc: dictionary of compound words in the text;
• err: list of unknown words in the text;
• tags.ind: sequences to be inserted in the text automaton (see section 3.6.3, page 51);
• stat_dic.n: file containing the number of simple words, the number of compound words, and the number of unknown words in the text.
NOTE: Files dlf, dlc and err are not sorted. Use the program SortTxt to sort them.
11.7 Elag
Elag [OPTIONS] <txtauto>
This program takes a .fst2 text a
91. In order to enforce the presence of a space, you have to enclose it in double quotes. For prohibiting the presence of a space, you have to use the special symbol #.
Syntactic graphs can reference subgraphs (cf. section 5.2.2). They also have outputs, including outputs with variables. The produced sequences are interpreted as strings of characters that will be inserted in the concordances or in the text if you want to modify it (cf. section 6.8.3).
Syntactic graphs can use contexts (see section 6.3).
Syntactic graphs can use morphological filters (see section 4.7).
Syntactic graphs can use morphological mode (see section 6.4).
The special symbols that are supported by the syntactic graphs are the same as those that are usable in regular expressions (cf. section 4.3.1).
It is not obligatory to compile syntactic graphs before using them for pattern matching. If a graph is not compiled, the system will compile it automatically.
6.1.5 ELAG grammars
ELAG grammars for disambiguation between lexical symbols in text automata are described in section 7.3.1, page 134.
6.1.6 Parameterized graphs
Parameterized graphs are meta-graphs that allow you to generate a family of graphs using a lexicon-grammar table. It is possible to construct parameterized graphs for all possible kinds of graphs. The construction and use of parameterized graphs are explained in chapter 8.
6.2 Compilation of a grammar
6.2.1 Compilation of a
92. It means that you can take an existing alignment as a set of mandatory links in input of the alignment process. This can be useful if you want to work with cognates. For more details about cognates and XAlign, see discussion in [60].
Figure 9.4: Aligned sentences
93. J orange orange N s orange juice orange juice N XN z1 s ¶ juice juice N Conc s juice juice V W P1s P2s P1p P2p P3p ¶
12.5.2 The cursentence.grf file
The cursentence.grf file is generated by Unitex during the display of a sentence automaton. The Fst2Grf program constructs a .grf file from the text.fst2 file that represents a sentence automaton.
12.5.3 The sentenceN.grf file
Whenever the user modifies a sentence automaton, that automaton is saved under the name sentenceN.grf, where N represents the number of the sentence.
12.5.4 The cursentence.txt file
During the extraction of the sentence automaton, the text of the sentence is saved in the file called cursentence.txt. That file is used by Unitex to display the text of the sentence under the automaton. That file contains the text of the sentence, followed by a newline.
12.6 Concordances
12.6.1 The concord.ind file
The concord.ind file is the index of the occurrences found by the program Locate during the application of a grammar. It is a text file that contains the starting and ending position of each occurrence, possibly accompanied by a sequence of letters if the construction of the concordance took into account the possible transducer outputs of the grammar. Here is an example of such a file:
#M¶
59 63 the ADJ greater part¶
67 71 the beautiful hills¶
87 91
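A file of this kind can be read with a few lines of code. The sketch below is a hedged illustration, not the Unitex reader: parse_concord_ind is a hypothetical function, it assumes that positions are plain integers separated from the optional output by single spaces, and it treats any line starting with # as a header to be skipped.

```python
def parse_concord_ind(text):
    """Return a list of (start, end, output) tuples from concord.ind content.

    `output` is None when an occurrence has no transducer output.
    """
    occurrences = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumed header line, for example "#M"
        parts = line.split(" ", 2)
        start, end = int(parts[0]), int(parts[1])
        output = parts[2] if len(parts) > 2 else None
        occurrences.append((start, end, output))
    return occurrences

sample = "#M\n59 63 the ADJ greater part\n67 71 the beautiful hills\n87 91\n"
for occurrence in parse_concord_ind(sample):
    print(occurrence)
```

Because the file is stored in Unicode Little-Endian, it would normally be opened with the utf-16 codec before its content is handed to such a function.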
94. [Figure: concordance screenshot; the extracted text is unreadable.]
95. LA of compound inflected forms in the case of compound forms. The second type is a dictionary of non-inflected forms called DELAS (DELA de formes simples, simple forms DELA) or DELAC (DELA de formes composées, compound forms DELA). Unitex programs make no distinction between simple and compound form dictionaries. We will use the terms DELAF and DELAS to distinguish the inflected and non-inflected dictionaries, whether they contain simple words, compound words or both.
3.1.1 The DELAF format
Entry syntax
An entry of a DELAF is a line of text terminated by a newline that conforms to the following syntax:
apples,apple.N+Conc:p/this is an example
The different elements of this line are:
• apples is the inflected form of the entry; it is mandatory;
• apple is the canonical form (lemma) of the entry. For nouns and adjectives (in French), it is usually the masculine singular form; for verbs, it is the infinitive. This information may be left out, as in the following example:
apple,.N+Conc:s
This means that the canonical form is the same as the inflected form. The canonical form is separated from the inflected form by a comma.
• N+Conc is the sequence of grammatical and semantic information. In our example, N designates a noun, and Conc indicates that this noun designates a concrete object (see table 3.2). Each entry must have at least one grammatical or semantic code, separated from the canonica
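The entry syntax can be illustrated by a short parser. This is a sketch under stated assumptions, not the code used by the Unitex programs: parse_delaf_entry is a hypothetical function, it assumes the separators are the comma, the period, the plus sign, the colon and the slash, and it ignores escaped special characters.

```python
def parse_delaf_entry(line):
    """Split a DELAF line into its parts (escaped characters are not handled)."""
    body, _, comment = line.partition("/")
    inflected, _, rest = body.partition(",")
    lemma, _, codes = rest.partition(".")
    grammatical, *inflections = codes.split(":")
    return {
        "inflected": inflected,
        # An empty lemma means that it is identical to the inflected form.
        "lemma": lemma or inflected,
        "codes": grammatical.split("+"),
        "inflections": inflections,
        "comment": comment or None,
    }

print(parse_delaf_entry("apples,apple.N+Conc:p/this is an example"))
print(parse_delaf_entry("apple,.N+Conc:s"))
```

For the second entry the parser fills in apple as the lemma, which reflects the rule that an omitted canonical form is the same as the inflected form.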
96. NB>, <TOKEN>, <SDIC> and <CDIC> are forbidden.
13. If you reach the end of the morphological zone and if you are not at the end of a token, the match will fail. For instance, if the text contains enabled, you cannot match only enable.
6.4.3 Morphological dictionaries
In morphological mode, you can perform queries using dictionaries. For instance, you can ask for every word made of the prefix un followed by an adjective with the grammar shown on Figure 6.30.
Figure 6.30: Matching words made of un + adjective ending with able
However, if we want to match with this grammar the word unaware, we must know that aware is an adjective. But aware may not be present in the text, so that we cannot rely on the text dictionaries. This is the reason why we must define a list of dictionaries to look up in morphological mode. To do that, go in "Info>Preferences>Morphological dictionaries", as shown on Figure 6.31. You can select as many dictionaries as you want, but they MUST be .bin ones. Once done, you can apply your grammar and get results.
Figure 6.31: Configuration of morphological dictionaries
6.4.4
97. NLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Libraries

If you develop a new library, and you want it to be of the greatest possible use to the public, we recommend making it free software that everyone can redistribute and change. You can do so by permitting redistribution under these terms (or, alternatively, under the terms of the ordinary General Public License).

To apply these terms, attach the following notices to the library. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

<one line to give the library's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>

This library is free software; you can redistribute it
98. TE: The program will also try to use the tags.ind file, if any (see section 12.7.3).

11.32 XMLizer

XMLizer [OPTIONS] <txt>

This program takes the raw text file <txt> and produces a corresponding basic TEI or XML file. The difference between TEI and XML is that TEI files will contain a TEI header.

OPTIONS:

• -x/--xml: produces an XML file;
• -t/--tei: produces a TEI file (default);
• -n XXX/--normalization=XXX: specifies the normalization rule file to be used (see section 12.11.5);
• -o OUT/--output=OUT: optional output file name (default: file.txt > file.xml);
• -a ALPH/--alphabet=ALPH: alphabet file;
• -s SEG/--segmentation_grammar=SEG: sentence delimitation grammar to be used. This grammar should be like the Sentence.grf one used during the preprocessing of a corpus, but it can include the special tag {P} to indicate paragraph bounds.

Chapter 12 File formats

This chapter presents the formats of files read or generated by Unitex. The formats of the DELAS and DELAF dictionaries have already been presented in sections 3.1.1 and 3.1.2.

NOTE: In this chapter, the symbol ¶ represents the newline symbol. Unless otherwise indicated, all text files described in this chapter are encoded in Unicode Little Endian.

12.1 Unicode Little Endian encoding

All text files processed by Unitex have to be encoded in Unicode Little Endian. This encoding allows the representation of 65536 characters by coding each of them
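The encoding requirement above can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: it treats "Unicode Little Endian" as UTF-16 LE, where each of the 65536 basic characters is stored on two bytes with the least significant byte first, and it assumes that files start with the conventional byte-order mark FF FE. The function names are hypothetical and are not part of Unitex.

```python
import codecs


def encode_utf16le(text: str) -> bytes:
    """Encode text as UTF-16 LE, preceded by the FF FE byte-order mark (assumed)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def decode_utf16le(data: bytes) -> str:
    """Decode UTF-16 LE data, dropping a leading byte-order mark if present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]
    return data.decode("utf-16-le")


if __name__ == "__main__":
    raw = encode_utf16le("ab")
    print(raw.hex(" "))         # two bytes per character, low byte first
    print(decode_utf16le(raw))
```

A text file prepared this way can be written with `open(path, "wb").write(encode_utf16le(text))`; whether a given Unitex version accepts other encodings is not covered here.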
99. UCH DAMAGES.

END OF TERMS AND CONDITIONS

Appendix: How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

one line to give the program's name and a brief idea of what it does.
Copyright (C) yyyy name of author

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Also add information on how to contact you by electronic and paper mail.

If the program
100. UNITEX 2.1 USER MANUAL

Université Paris-Est Marne-la-Vallée
http://www-igm.univ-mlv.fr/~unitex
unitex@univ-mlv.fr

Sébastien Paumier

English translation of previous version by the local grammar group at the CIS, Ludwig-Maximilians-Universität, Munich, Oct 2003 (Wolfgang Flury, Franz Guenthner, Friederike Malchok, Clemens Marschner, Sebastian Nagel, Johannes Stiehler)
http://www.cis.uni-muenchen.de/

Contents

Introduction
    What's new from version 1.2
    Content
1 Installation of Unitex
    1.1 Licenses
    1.2 Java runtime environment
    1.3 Installation on Windows
    1.4 Installation on Linux and MacOS X
    1.5 First use
    1.6 Adding new languages
    1.7 Uninstalling Unitex
2 Loading a text
    2.1 Selecting a language
    2.2 Text formats
    2.3 Editing text files
    2.4 Opening a text
    2.5 Preprocessing a text
        2.5.1 Normalization of separators
        2.5.2 Splitting into sentences
        2.5.3 Normalization of non-ambiguous forms
        2.5.4 Splitting a text into tokens
        2.5.5 Applying dictionaries
        2.5.6 Analysis o
101. Unitex. It serves to indicate the end of the header information.

The lines after the header give the contents and the position of the boxes in the graph. The following example corresponds to a graph recognizing a number:

3¶
"<E>" 84 248 1 2 ¶
"" 272 248 0 ¶
s"1+2+3+4+5+6+7+8+9+0" 172 248 1 1 ¶

The first line after the header indicates the number of boxes in the graph, immediately followed by a newline. This number cannot be lower than 2, since a graph always has an initial and a final state.

The following lines define the boxes of the graph. The boxes are numbered starting at 0. By convention, state 0 is the initial state and state 1 is the final state. The contents of the final state is always empty.

Each box in the graph is defined by a line that has the following format:

contents X Y N transitions ¶

contents is a sequence of characters enclosed in quotation marks that represents the contents of the box. This sequence can sometimes be preceded by an s if the graph is imported from Intex; this character is then ignored by Unitex. The contents of the sequence is the text that has been entered in the editing line of the graph editor. Table 12.2 shows the encoding of two special sequences that are not encoded in the same way as they are entered into the .grf files.

Sequence in the graph editor | Sequence in the .grf file

Table 12.2: Encoding of spec
102. XN2 p2ngea hungry as a wolf gladnih kao vukovi gladan kao vuk AC_A3XN2 p2ngea hungry as a wo gladnima kao vuk gladan kao vuk AC_A3XN2 p3mgea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p3mgea hungry as a wol gladnima kao vukovi gladan kao vuk AC_A3XN2 p3mgea hungry as a gladnim kao vuk gladan kao vuk AC_A3XN2 p3mgea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p3mgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p3mgea hungry as a wo gladnima kao vuk gladan kao vuk AC_A3XN2 p3fgea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p3fgea hungry as a wol gladnima kao vukovi gladan kao vuk AC_A3XN2 p3fgea hungry as a gladnim kao vuk gladan kao vuk AC_A3XN2 p3fgea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p3fgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p3fgea hungry as a wo gladnima kao vuk gladan kao vuk AC_A3XN2 p3ngea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p3ngea hungry as a wol gladnima kao vukovi gladan kao vuk AC_A3XN2 p3ngea hungry as a gladnim kao vuk gladan kao vuk AC_A3XN2 p3ngea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p3ngea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p3ngea hungry as a wol gladne kao vuk gladan kao vuk AC_A3XN2 p4mgea hungry as a wolf gladne kao vuci gladan kao vuk AC_A3XN2 p4mgea hungry as a wolf gladne kao vukovi gladan kao vuk AC_A3XN2 p4mgea hungry as a wolf glad
103. You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change.

b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License.

c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)

These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the
104. a modified version of the text and save it in a file named TXT (see section 6.8.3).

Other options:

• -d DIR/--directory=DIR: indicates to the program that it must not work in the same directory as <index> but in DIR;
• -a ALPH/--alphabet=ALPH: alphabet file used for sorting.

The result of the application of this program is a file called concord.txt if the concordance was constructed in text mode, a file called concord.html if the output mode was --html or --glossanet, and a text file with the name defined by the user of the program if the program has constructed a modified version of the text.

In --html mode, the occurrence is coded as a hypertext link. The reference associated to this link is of the form <a href="X Y Z">. X and Y represent the beginning and ending positions of the occurrence in characters in the file text_name.snt. Z represents the number of the sentence in which the occurrence was found.

11.4 ConcorDiff

ConcorDiff [OPTIONS] <concor1> <concor2>

This program takes two concordance files and produces an HTML page that shows their differences (see section 6.8.5). The <concor1> and <concor2> concordance index files must have absolute names, because Unitex uses these names to deduce on which text they were computed.

OPTIONS:

• -o X/--out=X: output HTML page;
• -f FONT/--font=FONT: name of the font to use in output HTML page;
•
105. a template graph.

OPTIONS:

• -r GRF/--reference_graph=GRF: name of the template graph;
• -o OUT/--output=OUT: name of the result main graph;
• -s XXX/--subgraph_pattern=XXX: if this optional parameter is specified, all the produced subgraphs will be named according to this pattern. In order to have unambiguous names, we recommend including @% in the parameter (remember that @% will be replaced by the line number of the entry in the table). For instance, if you set the pattern parameter to subgraph-@%.grf, subgraph names will be such as subgraph-0013.grf. By default, subgraph names look like result_0013.grf, where result.grf designates the result main graph.

11.28 TagsetNormFst2

TagsetNormFst2 [OPTIONS] <txtauto>

This program normalizes the specified .fst2 text automaton according to a tagset description file, discarding undeclared dictionary codes and incoherent lexical entries. Inflectional features are unfactorized, so that rouge.A:fs:ms will be divided into the 2 tags rouge.A:fs and rouge.A:ms. The text automaton is modified.

OPTIONS:

• -t TAGSET/--tagset=TAGSET: name of the tagset description file.

11.29 TEI2Txt

TEI2Txt [OPTIONS] <xml>

Produces a raw text file from the given <xml> TEI file.

OPTIONS:

• -o TXT/--output=TXT: name of the output text file. By default, the output file has the same name as the input one, replacing .xml by .txt.
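The unfactorization of inflectional features performed by TagsetNormFst2 can be pictured with a small sketch. It is not the program's actual code: it assumes that inflectional codes are introduced by `:` in a tag such as rouge.A:fs:ms, it ignores protected characters, and the function name `unfactorize` is hypothetical.

```python
from typing import List


def unfactorize(tag: str) -> List[str]:
    """Split a tag carrying several inflectional codes into one tag per code."""
    head, *features = tag.split(":")
    if not features:
        # No inflectional code: the tag is returned unchanged.
        return [head]
    return [f"{head}:{feature}" for feature in features]


if __name__ == "__main__":
    print(unfactorize("rouge.A:fs:ms"))
```

Each resulting tag carries exactly one inflectional code, which is the property the tagset normalization relies on.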
106. aP3ms SFX

Figure 3.8: A toy semitic inflection grammar

Such a grammar obeys the following rules:

1. All standard inflection operators can be used (L, R, etc.).

2. A digit stands for a consonant of the skeleton (1 for the first, 2 for the second, etc.). In our example, 1, 2 and 3 will respectively stand for k, t and b.

3. The output of a path must be made of sequences of the form XXX. Each symbol must appear alone in a box. The current content of the stack will be dumped between and XXX each time an output containing will be found. In our example, the output will be yakotobu V aP3ms da SFX.

4. The DELAF output is of the following form:

yakotobuda ktb V yakotobu V aP3ms da SFX

The inflected form corresponds to the concatenation of all the inflection productions, the lemma is the consonant skeleton, and the inflected forms is replaced by the output of the grammar.

NOTE: for the moment, such a dictionary cannot be exploited by Unitex programs, but further versions will take this kind of dictionary into account for the construction of the text automaton.

3.5 Compression

Unitex applies compressed dictionaries to the text. The compression reduces the size of the dictionaries and speeds up the lookup. This operation is done by the Compress program. This program takes a dictionary in text form as input (for example my_dico.dic) and produces two files:

• my_dico.bin contains the minim
107. aches several thousands of occurrences, it is advisable to display it in a web browser (Firefox [11], Netscape [12], Internet Explorer, etc.) instead. Check "Use a web browser to view the concordance" (cf. figure 4.6). This option is activated by default if the number of occurrences is greater than 2000. You can configure which web browser to use by clicking on "Preferences" in the menu "Info". Click on the tab "Language & Presentation" and select the program to use in the field "Html Viewer" (cf. figure 4.7).

If you choose to open the concordance in Unitex, you will see a window as shown on Figure 4.8. Occurrences react as hyperlinks. If you click on an occurrence, the text frame is opened and the corresponding sequence is highlighted. Moreover, if the text automaton is available and if this window is not iconified, the sentence automaton that contains the occurrence will be shown.

Figure 4.7: Selection of a web browser for displaying concordances
108. al automaton of the inflected forms of the dictionaries;
• my_dico.inf contains the codes extracted from the original dictionary.

The minimal automaton in the my_dico.bin file is a representation of inflected forms in which all common prefixes and suffixes are factorized. For example, the minimal automaton of the words me, te, se, ma, ta and sa can be represented by the graph shown in Figure 3.9.

Figure 3.9: Representation of a minimal automaton

To compress a dictionary, open it and click on "Compress into FST" in the "DELA" menu. The compression is independent from the language and from the content of the dictionary. The messages produced by the program are displayed in a window that is not closed automatically. You can see the size of the resulting .bin file, the number of lines read and the number of inflectional codes created. Figure 3.10 shows the result of the compression of a dictionary of simple words.

The resulting files are compressed to about 95% for dictionaries containing simple words and 50% for those with compound words.

Figure 3.10: Results of a compression (messages shown: Compressing... Minimizing... Minimization done. Binary file: 111437 bytes. 13976 lines read. 2179 INF entries created. 11358 states, 16340 transitions.)

3.6 Applying dictionaries

Dictionaries can be applied 1 a
109. am CheckDic. It is a text file that contains information about the analysed dictionary and has four parts.

The first part is the possibly empty list of all syntax errors found in the dictionary: absence of the inflected or the canonical form, the grammatical code, empty lines, etc. Each error is described by the number of the line, a message describing the error, and the contents of the line. Here is an example of a message:

Line 12451: no point found
garden,N:s

The second and third parts display the list of grammatical codes and/or semantic and inflectional codes respectively. In order to prevent coding errors, the program reports encodings that contain spaces, tabs, or non-ASCII characters. For instance, if a Greek dictionary contains the ADV code where the Greek A character is used instead of the Latin A character, the program reports the following warning:

ADV warning: 1 suspect char (1 non ASCII char): (0391 D V)

Non-ASCII characters are indicated by their hexadecimal character number. In the example above, the code 0391 represents Greek A. Spaces are indicated by the SPACE sequence:

Km s warning: 1 suspect char (1 space): (K m SPACE s)

When the following dictionary is checked:

1,2 et 3,.INTJ
abracadabra,INTJ
supercalifragilisticexpialidocious INTJ
damned INTJ

the following CHECK_DIC.TXT file is obtained:

Line 1: unprotected comma in lemma¶
1,2 et 3,.INTJ¶
Line 2: no point found¶
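The way suspect characters are displayed in these warnings can be imitated with a short sketch. It only mirrors the report format shown above (four-digit hexadecimal number for a non-ASCII character, SPACE for a space) and is not CheckDic's implementation; tabs are not handled, and the function names are hypothetical.

```python
from typing import Tuple


def describe_code(code: str) -> str:
    """Return a CheckDic-style, character-by-character description of a code."""
    parts = []
    for ch in code:
        if ch == " ":
            parts.append("SPACE")
        elif ord(ch) > 127:
            # Non-ASCII characters are shown by their hexadecimal number.
            parts.append(f"{ord(ch):04X}")
        else:
            parts.append(ch)
    return "(" + " ".join(parts) + ")"


def count_suspects(code: str) -> Tuple[int, int]:
    """Return (number of spaces, number of non-ASCII characters) in a code."""
    spaces = code.count(" ")
    non_ascii = sum(1 for ch in code if ord(ch) > 127)
    return spaces, non_ascii


if __name__ == "__main__":
    greek_adv = "\u0391DV"  # Greek capital alpha followed by Latin D and V
    print(describe_code(greek_adv), count_suspects(greek_adv))
    print(describe_code("Km s"), count_suspects("Km s"))
```

The Greek alpha and the Latin A look identical on screen, which is why a per-character dump of this kind is more useful than printing the code itself.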
110. an expression Y. Of course, it was already possible to do that with a grammar like the one shown on Figure 6.18. However, with such a grammar, the context part on the left will be included in the match, as shown on Figure 6.19.

Figure 6.18: Matching a noun that occurs after a numerical determiner

To avoid that, you can use the special symbol $* to indicate the end of the left context of the expression you want to match. This symbol will be represented by a green star in the graph, as shown on Figure 6.20. The effect of such a context is to use this part of the grammar for computing matches, but to ignore it in the results, as shown on Figure 6.21.

Concordance: D:\My Unitex\English\Corpus\ivanhoe_snt\concord.html (screenshot listing matches such as "five challengers", "five hundred", "five knights" and "five lances")
111. an kao vuk AC_A3XN2 p7fgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p7fgea hungry as a wolf gladnim kao vuk gladan kao vuk AC_A3XN2 p7fgea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p7fgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p7fgea hungry as a wolf gladnima kao vuk gladan kao vuk AC_A3XN2 p7ngea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p7ngea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p7ngea hungry as a wolf gladnim kao vuk gladan kao vuk AC_A3XN2 p7ngea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p7ngea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p7ngea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 w2mgea hungry as a wolf gladna kao vuci gladan kao vuk AC_A3XN2 w2mgea hungry as a wolf gladna kao vukovi gladan kao vuk AC_A3XN2 w2mgea hungry as a wolf gladne kao vuk gladan kao vuk AC_A3XN2 w2fgea hungry as a wolf gladne kao vuci gladan kao vuk AC_A3XN2 w2fgea hungry as a wolf gladne kao vukovi gladan kao vuk AC_A3XN2 w2fgea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 w2ngea hungry as a wolf gladna kao vuci gladan kao vuk AC_A3XN2 w2ngea hungry as a wolf gladna kao vukovi gladan kao vuk AC_A3XN2 w2ngea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 w4mgea hungry as a wolf gladna kao vuci gladan kao vuk AC_A3XN2 w4mgea hungry as a wolf
112. and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not specify a license version number, you may choose any version ever published by the Free Software Foundation.

14. If you wish to incorporate parts of the Library into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. IN NO EVENT U
113. and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA

Also add information on how to contact you by electronic and paper mail.

You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the library, if necessary. Here is a sample; alter the names:

Yoyodyne, Inc., hereby disclaims all copyright interest in the library `Frob' (a library for tweaking knobs) written by James Random Hacker.

<signature of Ty Coon>, 1 April 1990
Ty Coon, President of Vice

That's all there is to it!

Appendix C Lesser General Public License For Linguistic Resources

This license was designed by the University of Marne-la-Vallée and it has received the approval of the Free Software Foundation [1].

Preamble

The licenses for most data are designe
114. aph window is the active window (the active window has a blue title bar, while the inactive windows have a gray title bar). Now go to the "FSGraph" menu, then to the "Tools" menu, and click on "Explore Graph paths". The window of figure 6.35 appears.

The upper box contains the name of the main graph of the grammar to be explored. The following options are connected to the outputs of the grammar and to subgraph calls:

• "Ignore outputs": outputs are ignored;
• "Separate inputs and outputs": outputs are displayed after inputs (a b c / A B C);
• "Merge inputs and outputs": each output is emitted immediately after the input to which it corresponds (a/A b/B c/C);
• "Only paths": calls to subgraphs are explored recursively;
• "Do not explore subgraphs recursively": calls to subgraphs are printed but not explored recursively.

Figure 6.35: Exploring the paths of a grammar

If the option "Maximum number of sequences" is activated, the specified number will be the maximum number of generated paths. If the option is not selected, all paths will be generated, if they are in finite number.

Here you see what is created for the graph
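The three output options can be illustrated on a single path. The sketch below is only an illustration of the display modes, not Unitex code: a path is modelled as a list of (input, output) pairs, the `/` separators follow the examples a b c / A B C and a/A b/B c/C, and the names `render_path`, "ignore", "separate" and "merge" are hypothetical.

```python
from typing import List, Tuple

Path = List[Tuple[str, str]]  # one (input, output) pair per box on the path


def render_path(path: Path, mode: str) -> str:
    """Format one explored path according to the chosen output option."""
    inputs = [inp for inp, _ in path]
    outputs = [out for _, out in path]
    if mode == "ignore":
        # Outputs are dropped entirely.
        return " ".join(inputs)
    if mode == "separate":
        # All inputs first, then all outputs.
        return " ".join(inputs) + " / " + " ".join(outputs)
    if mode == "merge":
        # Each output immediately follows its own input.
        return " ".join(f"{inp}/{out}" for inp, out in path)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    path = [("a", "A"), ("b", "B"), ("c", "C")]
    for mode in ("ignore", "separate", "merge"):
        print(mode, "->", render_path(path, mode))
```

Merging keeps each output next to the input that produced it, which is usually easier to read when checking a transducer, while the separate mode makes it easier to compare the recognized sequence with the produced one.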
116. ar in the text (default);
• LC: left context for primary sort, then occurrence for secondary sort;
• LR: left context, then right context;
• CL: occurrence, then left context;
• CR: occurrence, then right context;
• RL: right context, then left context;
• RC: right context, then occurrence.

For details on the sorting modes, see section 4.8.2.

Output options:

• -H/--html: produces a concordance in HTML format encoded in UTF-8 (default);
• -t/--text: produces a concordance in Unicode text format;
• -g SCRIPT/--glossanet=SCRIPT: produces a concordance for GlossaNet in HTML format. The HTML file is encoded in UTF-8;
• -i/--index: produces an index of the concordance, made of the content of the occurrences (with the grammar outputs, if any), preceded by the positions of the occurrences in the text file, given in characters;
• -u/--uima: the same as --index, but the ending position of each occurrence is also given;
• -A/--axis: quite the same as --index, but the numbers represent the median character of each occurrence. For more information, see [29];
• -x/--xalign: another index file, used by the text alignment module. Each line is made of 3 integers X Y Z followed by the content of the occurrence. X is the sentence number, starting from 1. Y and Z are the starting and ending positions of the occurrence in the sentence, given in characters;
• -m TXT/--merge=TXT: indicates to the program that it is supposed to produce
117. constructions intransitives. Droz, Genève, 1976. (section 8.1)

[11] Firefox. Web browser. http://www.mozilla.com/firefox/. (section 4.8.2)

[12] Netscape. Web browser. http://www.netscape.com. (section 4.8.2)

[13] Pierre CADIOT. À entre deux noms: vers la composition nominale. Lexique, 11:193-240, 1992. (section 10.1)

[14] Folker CAROLI. Les verbes transitifs à complément de lieu en allemand. Lingvisticæ Investigationes, 8(2):225-267, 1984. Amsterdam-Philadelphia: John Benjamins Publishing Company. (section 8.1)

[15] A. CHROBOT, B. COURTOIS, M. HAMMANI-MCCARTHY, M. GROSS, and K. ZELLAGUI. Dictionnaire électronique DELAC anglais: noms composés. Technical Report 59, LADL, Université Paris 7, 1999. (section 3.7)

[16] Unicode Consortium. http://www.unicode.org. (section 2.2)

[17] Matthieu CONSTANT and Anastasia YANNACOPOULOU. Le dictionnaire électronique du grec moderne: conception et développement d'outils pour son enrichissement et sa validation. In Studies in Greek Linguistics, Proceedings of the 23rd annual meeting of the Department of Linguistics, Faculty of Philosophy, Aristotle University of Thessaloniki, 2002. (section 3.7)

[18] Danielle CORBIN. Hypothèses sur les frontières de la composition nominale. Cahiers de grammaire, 17:26-55, 1992. Université de Toulouse-Le Mirail. (section 10.1)

[19] Blandine COURTOIS. Formes ambiguës de la langue française. Lingvisticæ Investigationes, 20(1):167-202, 1996. Amsterdam-Philadelphia: John Benjamins Publishing Company. (section 3.7)

[20]
118. containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you".

Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does.

1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program.

You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee.

2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions:

a)
119. ctional behavior; in this case, the complete part is made up of only one line per subcategory. Let us consider, for example, the following lines from the pronoun description:

Pdem <genre> <nombre>
PpvIl <genre> <nombre> <pers>
PpvPr

These lines mean:

• all the demonstrative pronouns (<PRO+Pdem>) have only a gender and a number;
• clitic pronouns in the nominative (<PRO+PpvIl>) are labelled grammatically in person, gender and number;
• the prepositional pronouns (en, y) do not have any inflectional feature.

All combinations of inflectional features and discriminant subcategories which appear in the dictionaries must be described in the tagset.def file; otherwise, the information in the corresponding entries will be discarded by ELAG.

If words of the same subcategory differ by their inflectional profile, it is necessary to write several lines into the complete part. The disadvantage of this method of description is that it becomes difficult to make the distinction between such words in an ELAG grammar.

If one considers the description given by the previous example of a tagset.def file, certain adjectives of French take a gender and a number, whereas others do not have any inflectional feature. This allows for coding fixed sequences like de bonne humeur as adjectives, on the basis of their syntactic behavior.

Consider a French dictionary with such sequences as invariab
120. ctionaries with simple and compound words. For more details, see the references page on the Unitex website (http://www-igm.univ-mlv.fr/~unitex).

[Figure 3.13: Dictionary graph of roman numerals]

[Figure 3.14: Example of morphological dictionary graph]

[Figure 3.15: Path added by a morphological dictionary graph]

Table 3.4: Some bibliographical references for electronic dictionaries

    Language        Simple words / Compound words
    English         43 56 15 63
    French          19 20 48 20 85 65 37
    Modern Greek    2 17 45 46 47
    Italian         27 28 69
    Spanish         8 7

Chapter 4 Searching with regular expressions

This chapter
121. ctory. This directory allows you to save your personal data. For each language that you will be using, the program will copy the root directory of that language to your personal directory, except the dictionaries. You can then modify your copy of the files without risking damage to the system files.

[Figure 1.1: First use under Windows]

[Figure 1.2: First use under Linux]

[Figure 1.3: Creating the personal work directory]

1.6 Adding new languages

There are two different ways to add languages. If you want to add a language that is to be accessible by all users, you have to copy the corresponding directory to the Unitex system directory, for which you will need to have the access rights (this might mean that you need to ask your system administrator to do it). On the other hand, if the language is only used by a sin
122. cts that are easier to manipulate and to which all classical algorithms on automata can be applied. In order to compile and thus transform a grammar, select the command "Compile & Flatten FST2" in the "Tools" submenu of the "FSGraph" menu. The window of Figure 6.5 allows you to configure the approximation process.

[Figure 6.5: Configuration of approximation of a grammar]

The box "Flattening depth" lets you specify the level of embedding of subgraphs. This value represents the maximum depth up to which the calls to subgraphs will be replaced by the subgraphs themselves.

The "Expected result grammar format" box allows you to determine the behavior of the program beyond the selected limit. If you select the "Finite State Transducer" option, the calls to subgraphs will be replaced by <E> beyond the maximum depth. This option guarantees that we obtain a finite state transducer, however possibly not equivalent to the original grammar. On the contrary, the "equivalent FST2" option indicates that the program should allow for subgraph calls beyond the limited depth. This option guarantees the strict equivalence of the result with the original grammar but does not necessarily produce a finite state tra
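The effect of the two options can be illustrated with a small Python sketch. It is not the Unitex implementation: the grammar layout (a dictionary of alternative paths), the ":" prefix marking a subgraph call, and the way depth is counted from 0 are all assumptions made for this example.

```python
# Toy grammar: each graph is a list of alternative paths; a path is a list of
# symbols. A symbol starting with ":" is a call to the subgraph of that name
# (hypothetical convention chosen for this sketch).

def flatten(grammar, name, depth, max_depth, keep_calls):
    """Return the paths of graph `name`, inlining subgraph calls up to max_depth."""
    result = []
    for path in grammar[name]:
        expansions = [[]]
        for symbol in path:
            if not symbol.startswith(":"):
                choices = [[symbol]]              # ordinary label
            elif depth < max_depth:
                choices = flatten(grammar, symbol[1:], depth + 1,
                                  max_depth, keep_calls)
            elif keep_calls:
                choices = [[symbol]]              # "equivalent FST2": call remains
            else:
                choices = [["<E>"]]               # "Finite State Transducer": call dropped
            expansions = [e + c for e in expansions for c in choices]
        result.extend(expansions)
    return result


grammar = {"NP": [["the", "noun"], ["the", "noun", "of", ":NP"]]}
for path in flatten(grammar, "NP", 0, 1, keep_calls=False):
    print(" ".join(path))
```

With keep_calls=True the last path ends with the remaining call :NP, so the result is still equivalent to the recursive grammar but is no longer a plain finite state transducer.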
123. d hui ADV

Parameter <text> represents the complete path of the text file. The program creates a modified version of the text that is saved in a file with extension .snt.

OPTIONS:

• -n/--no_carridge_return: every separator sequence will be turned into a single space;
• -r XXX/--replacement_rules=XXX: specifies the normalization rule file to be used. See section 12.11.5 for details about the format of this file. By default, the program only replaces { and } by [ and ].

WARNING: if you specify a normalization rule file, its rules will be applied prior to anything else. So, you have to be very careful if you manipulate separators in such rules.

11.23 PolyLex

PolyLex [OPTIONS] <list>

This program takes a file containing unknown words <list> and tries to analyse each of the words as a compound obtained by concatenating simple words. The words that have at least one analysis are removed from the file of unknown words, and the dictionary lines that correspond to the analysis are appended to file OUT.

OPTIONS:

• -a ALPH/--alphabet=ALPH: the alphabet file to use;
• -d BIN/--dictionary=BIN: .bin dictionary to use;
• -o OUT/--output=OUT: designates the file in which the produced dictionary lines are to be printed; if that file already exists, the produced lines are appended at the end of the file;
• -i INFO/--info=INFO: designates a text file in which the information about
124. d to take away your freedom to share and change it. By contrast, this License is intended to guarantee your freedom to share and change free data, to make sure the data are free for all their users.

This license, the Lesser General Public License for Linguistic Resources, applies to some specially designated linguistic resources, typically lexicons, grammars, thesauri and textual corpora.

TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

0. This License Agreement applies to any Linguistic Resource which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License for Linguistic Resources (also called "this License"). Each licensee is addressed as "you".

A "linguistic resource" means a collection of data about language prepared so as to be used with application programs.

The "Linguistic Resource", below, refers to any such work which has been distributed under these terms. A "work based on the Linguistic Resource" means either the Linguistic Resource or any derivative work under copyright law: that is to say, a work containing the Linguistic Resource or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)

"Legible form" for a linguistic resource means the preferred form of the resource
125. definitions of a graphical unit obviously has an impact on the definition of a multi-word unit. However, we wish our formalism for MWUs to be adaptable to different morphological systems for simple words. Thus, the definition of a graphical unit is a parameter to our system: each time MULTIFLEX is used with an external module for single units, this module has to decide how a sequence of characters is to be divided into units. In our formalism, units are referred to by numerical variables: $1, $2, $3, etc. For example, with Unitex, a sequence like

• Athens '04

consists of five constituents referred to in MULTIFLEX as $1 (Athens), $2 (<space>), $3 ('), $4 (0) and $5 (4).

Each simple unit subject to inflection within a MWU has to be morphologically identified. The identification means providing sufficient data so that any inflected form of the same item may be generated on demand. For instance, in

• mémoire vive

we need to know that vive is the feminine singular form of a lemma, and we have to be able to generate the feminine plural form of the same lemma (vives). We suppose that the external module for single units working with MULTIFLEX is responsible for such identification and generation of inflected forms of single units. In Unitex, the generation of forms is strongly inspired by the DELA system [20]. In order to be able to generate one or more inflected forms of a word, we have to know:

• its lemma
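The division into graphical units can be sketched in a few lines of Python. This is only an approximation: it treats any Unicode letter as part of the alphabet, whereas Unitex reads the alphabet from a language-specific file, and the function name is hypothetical.

```python
import re

# A unit is either a maximal run of letters or any single other character
# (space, apostrophe, digit...). Using Unicode letters as the alphabet is an
# assumption of this sketch.
UNIT = re.compile(r"[^\W\d_]+|.", re.DOTALL)

def graphical_units(sequence):
    """Split a character sequence into graphical units."""
    return UNIT.findall(sequence)

for index, unit in enumerate(graphical_units("Athens '04"), start=1):
    print(f"${index} = {unit!r}")
```

The five units printed correspond to the variables $1 to $5; note that each digit is a unit of its own.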
126. dependently from each other (pranie mózgów, prania mózgu, prania mózgów, etc.). That is why each of them has a different unification variable for number inflection ($n1 and $n2). The three variables $n1, $n2 and $c may be instantiated to any value from their respective domains ({sing, pl}, {sing, pl} and {Nom, Gen, Dat, Acc, Inst, Loc, Voc}, cf. Morphology.txt file in section 10.2.1). The whole MWU inherits its gender, number and case from its first constituent. Its gender is fixed (variable $g), while its number and case are instantiated to any of the 14 possible combinations. The single path in this graph would have to be replaced by 28 different ones if the use of unification variables were not allowed.

[Figure 10.7: Inflection graph for pranie mózgu]

Orthographic and Other Variants

Our formalism allows for any constituent to be omitted or moved within different inflected forms, if there is a need for that. It also enables the insertion of extra graphical units which do not appear in the base form of the MWU. This makes it possible to extend an inflection paradigm to a more general variation description, e.g. orthographic or partly syntactic variation (see [42] for an extensive study on term variation). For example, in English, student union appears in corpus al
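The two counts quoted above (14 feature combinations for the whole MWU, 28 explicit paths without unification variables) follow directly from the sizes of the domains. A short Python check, using only the domain values listed in the text:

```python
from itertools import product

numbers = ["sing", "pl"]
cases = ["Nom", "Gen", "Dat", "Acc", "Inst", "Loc", "Voc"]

# Features of the whole MWU: number ($n1) and case ($c) vary, gender is fixed.
mwu_features = list(product(numbers, cases))

# Without unification variables, every value of $n1, $n2 and $c would need
# its own path in the graph.
explicit_paths = list(product(numbers, numbers, cases))

print(len(mwu_features))    # 2 * 7
print(len(explicit_paths))  # 2 * 2 * 7
```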
127. der and compares them line by line. The result is an HTML page that presents results in two columns. A blue line indicates that an utterance is common to the two concordances. A red line indicates that a match is common to both concordances but with different range, i.e. the two matches only overlap partially. A green line indicates an utterance that appears in only one concordance. Figure 6.50 gives an example.

NOTE: you cannot click on utterances in a concordance comparison.

If you have no previous concordance, the button is deactivated.

[Figure 6.50: Example of a concordance comparison]

Chapter 7 Text automaton

Natural languages contain much lexical ambiguity. The text automaton is an effective and visual way of representing such ambiguity. Each se
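The colour rules can be summarised by the following Python sketch. It is not the Unitex comparison program: matches are represented here as hypothetical (start, end) offsets with an exclusive end, and each match is listed on its own row rather than paired in two columns.

```python
def classify(matches_a, matches_b):
    """Assign a colour to every match of two concordances.

    Matches are (start, end) offsets, end exclusive (an assumption of this
    sketch).
    """
    set_a, set_b = set(matches_a), set(matches_b)
    rows = []
    for match in sorted(set_a | set_b):
        if match in set_a and match in set_b:
            colour = "blue"        # identical match in both concordances
        else:
            other = set_b if match in set_a else set_a
            overlaps = any(s < match[1] and match[0] < e for s, e in other)
            colour = "red" if overlaps else "green"
        rows.append((match, colour))
    return rows


previous = [(0, 3), (10, 12), (20, 25)]
current = [(0, 3), (11, 14), (30, 31)]
for match, colour in classify(previous, current):
    print(match, colour)
```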
128. double quotes. Thus, this graph will correctly match Fe but not FE, while this restriction cannot be specified in a normal DELAF.

Another advantage of dictionary graphs is that they can use results given by previous dictionaries. Thus, it is possible to apply the standard dictionary, and then tag as proper names all the unknown words that begin with an uppercase letter, thanks to the graph NPr+ shown in Figure 3.12. The + in the graph name gives it a low priority, so that it will be applied after the standard dictionary. This graph works with words that are still unknown after the application of the standard dictionary. Square brackets stand for a context definition. For more information about contexts, see section 6.3.

[Figure 3.11: Dictionary graph of chemical elements]

[Figure 3.12: Dictionary graph that tags unknown words beginning with an uppercase letter as proper names]

Since dictionary graphs are applied using the engine of Locate, they have exactly the same properties as syntactic graphs. So, you can use morphological filters and/or morphological mode. For instance, the graph shown on Figure 3.13 uses morphological
129. drzxave NC_N2X1 N Comp s1vm

    predsednika drzxave     predsednik drzxave   NC_N2X1   N Comp s2vm
    predsedniku drzxave     predsednik drzxave   NC_N2X1   N Comp s3vm
    predsednika drzxave     predsednik drzxave   NC_N2X1   N Comp s4vm
    predsednicye drzxave    predsednik drzxave   NC_N2X1   N Comp s5vm
    predsednikom drzxave    predsednik drzxave   NC_N2X1   N Comp s6vm
    predsedniku drzxave     predsednik drzxave   NC_N2X1   N Comp s7vm
    predsednici drzxave     predsednik drzxave   NC_N2X1   N Comp p1vm
    predsednici drzxava     predsednik drzxave   NC_N2X1   N Comp p1vm
    predsednika drzxave     predsednik drzxave   NC_N2X1   N Comp p2vm
    predsednika drzxava     predsednik drzxave   NC_N2X1   N Comp p2vm
    predsednicima drzxave   predsednik drzxave   NC_N2X1   N Comp p3vm
    predsednicima drzxava   predsednik drzxave   NC_N2X1   N Comp p3vm
    predsednike drzxave     predsednik drzxave   NC_N2X1   N Comp p4vm
    predsednike drzxava     predsednik drzxave   NC_N2X1   N Comp p4vm
    predsednici drzxave     predsednik drzxave   NC_N2X1   N Comp p5vm
    predsednici drzxava     predsednik drzxave   NC_N2X1   N Comp p5vm
    predsednicima drzxave   predsednik drzxave   NC_N2X1   N Comp p6vm
    predsednicima drzxava   predsednik drzxave   NC_N2X1   N Comp p6vm
    predsednicima drzxave   predsednik drzxave   NC_N2X1   N Comp p7vm
    predsednicima drzxava   predsednik drzxave   NC_N2X1   N Comp p7vm
    predsednika drzxave     predsednik drzxave   NC_N2X1   N Comp w2vm
    predsednika drzxava     predsednik drzxave   NC_N2X1   N Comp w2vm
    predsednika drzxave     predsednik drzxave   NC_N2X1   N Comp w4vm
    predsednika drzxava     predsednik drzxave   NC_N2X1   N Com
130. duced by others.

Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license.

Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries, and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs.

When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library.

We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are t
131. e conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Linguistic Resource at all. For example, if a patent license would not permit royalty-free redistribution of the Linguistic Resource by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Linguistic Resource.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free resource distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of data distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute resources through any other system and a licensee cannot impose that choice.

This section is intended to make thoroughly clear what is be
132. • consonant skeletons for semitic languages
• introduction of the text alignment tool XAlign
• no size limit for text file display
• SVG export of graphs

From a computational point of view, a special effort has been made to clean and comment the source code of Unitex programs, in order to facilitate the integration of new components. Moreover, the development of Unitex is now made with an SVN server, which makes collaborative work much easier.

Content

Chapter 1 describes how to install and run Unitex.

Chapter 2 presents the different steps in the analysis of a text.

Chapter 3 describes the formalism of the DELA electronic dictionaries and the different operations that can be applied to them.

Chapters 4 and 5 present different means for making text searches more effective. Chapter 5 describes in detail how to use the graph editor.

Chapter 6 is concerned with the different possible applications of grammars. The particularities of each type of grammar are presented.

Chapter 7 introduces the concept of text automaton and describes the properties of this notion. This chapter also describes operations on this object, in particular how to disambiguate lexical items with the ELAG program.

Chapter 8 contains an introduction to lexicon-grammar tables, followed by a description of the method of constructing grammars based on these tables.

Chapter 9 describes the text alignment module based on the XAlign tool.

Chapter 1
133. e des Langues 2 5 41 1993 10 1

[40] IGM. Lesser General Public License for Linguistic Resources. http://igm.univ-mlv.fr/~unitex/lgpllr.html. (1.1)

[41] Text Encoding Initiative. http://www.tei-c.org. (9)

[42] Christian JACQUEMIN. Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001. (7)

[43] Gaby KLARSFELD and Mary HAMMANI MC CARTHY. Dictionnaire électronique du LADL pour les mots simples de l'anglais (DELASa). Technical report, LADL, Université Paris 7, 1991. (3.7)

[44] Cvetana KRSTEV, Duško VITAS and Agata SAVARY. Prerequisites for a Comprehensive Dictionary of Serbian Compounds. LNCS 4139:552-563, 2006. (10.2)

[45] Tita KYRIACOPOULOU. Les dictionnaires électroniques : la flexion verbale en grec moderne, 1990. Thèse de doctorat, Université Paris 8. (3.7)

[46] Tita KYRIACOPOULOU. Un système d'analyse de textes en grec moderne : représentation des noms composés. In Actes du 5ème Colloque International de Linguistique Grecque, 13-15 septembre 2001, Sorbonne, Paris, 2002. (3.7)

[47] Tita KYRIACOPOULOU, Safia MRABTI and Anastasia YANNACOPOULOU. Le dictionnaire électronique des noms composés en grec moderne. Lingvisticæ Investigationes, 25(1):7-28, 2002. Amsterdam-Philadelphia: John Benjamins Publishing Company. (3.7)

[48] Jacques LABELLE. Le traitement automatique des variantes linguistiques en français : l'exemple des concrets. Lingvisticæ Investigationes, 19(1):137-152, 1995. Amst
134. e is an automaton for each sentence of the text. Therefore, the combination of all these automata corresponds to the automaton of the text. This is why we use the term "text automaton", even if this object is not manipulated as a global automaton for practical reasons.

[Figure 7.1: Sentence automaton example]

7.2 Construction

In order to construct the text automaton, open the text, then click on "Construct FST Text" in the menu "Text". One should first split the text into sentences and apply dictionaries. If sentence boundary detection is not applied, the construction program will arbitrarily split the text in sequences of 2000 lexical units instead of constructing one automaton per sentence. If no dictionaries are applied, the text automaton that you obtain will consist of only one path made up of unknown words per sentence.

7.2.1 Construction rules for text automata

Sentence automata are constructed from text dictionaries. The resulting degree of ambiguity is therefore directly linked to the granularity of the descriptions of dictionaries. From the
135. ed form of the same lemma on demand by the same morphological module. For instance, in case of Unitex, the form porte yields 7 morphological identifications, 6 of which are factorized with respect to their inflection code:

    porte →  porte,porte.N21:s
             porte,porter.V3:P1s:P3s:S1s:S3s:Y2s

In case of ambiguity, as above, the proper identification has to be done, for the time being, by the user during the edition of the MWU lemma to be inflected (in future this task will be partly automated). For instance, in case of porte-fenêtre, the first constituent has to be identified by the user as a noun rather than a verb.

• For a given morphological identification and a set of inflectional values, it returns all corresponding inflected forms. For instance, in Polish, if the instrumental forms of the word ręka are to be produced, three forms should be returned: ręką (singular instrumental), rękami and rękoma (two variants of the plural instrumental):

    ręka: <Case=Inst> →
        ręką: <Nb=sing;Gen=fem;Case=Inst>
        rękami: <Nb=pl;Gen=fem;Case=Inst>
        rękoma: <Nb=pl;Gen=fem;Case=Inst>

Such definition of an interface between the morphological system for simple words and the one for MWUs allows a better modularity and independence of one another. The latter doesn't need to know how inflected forms of simple words are described, analyzed and generated. It only requires a set of correct inflected forms of a MWU's constituents. Co
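The two services expected from the single-word module (identification of a form, and generation of forms for given inflectional values) can be pictured with a toy Python lexicon. Everything here is hypothetical: the function names, the tuple layout and the inflection code "N" are chosen for illustration and are not the MULTIFLEX interface itself.

```python
# Toy lexicon standing in for the external single-word module.
# Each entry: (inflected form, lemma, inflection code, features).
LEXICON = [
    ("ręka",   "ręka", "N", {"Nb": "sing", "Gen": "fem", "Case": "Nom"}),
    ("ręką",   "ręka", "N", {"Nb": "sing", "Gen": "fem", "Case": "Inst"}),
    ("rękami", "ręka", "N", {"Nb": "pl",   "Gen": "fem", "Case": "Inst"}),
    ("rękoma", "ręka", "N", {"Nb": "pl",   "Gen": "fem", "Case": "Inst"}),
]

def identify(form):
    """Return every (lemma, code, features) analysis of an inflected form."""
    return [(lemma, code, feats)
            for inflected, lemma, code, feats in LEXICON
            if inflected == form]

def generate(lemma, code, wanted):
    """Return all forms of a lemma whose features include the wanted values."""
    return [(inflected, feats)
            for inflected, entry_lemma, entry_code, feats in LEXICON
            if entry_lemma == lemma and entry_code == code
            and all(feats.get(key) == value for key, value in wanted.items())]

for form, feats in generate("ręka", "N", {"Case": "Inst"}):
    print(form, feats)
```

The MWU layer only calls these two functions; it never needs to know how the lexicon is stored or how the forms were produced.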
136. een space and newline is maintained at this point because the presence of newlines may have an effect on the process of splitting the text into sentences. The result of the normalization of a text named my_text.txt is a file in the same directory as the .txt file and is named my_text.snt.

NOTE: When the text is preprocessed using the graphical interface, a directory named my_text_snt is created immediately after normalization. This directory, called text directory, contains all the data associated with this text.

2.5.2 Splitting into sentences

Splitting texts into sentences is an important preprocessing step, since this helps in determining the units for linguistic processing. The splitting is used by the text automaton construction program. In contrast to what one might think, detecting sentence boundaries is not a trivial problem. Consider the following text:

The family has urgently called Dr. Martin.

The full stop that follows Dr is followed by a word beginning with a capital letter. Thus, it may be considered as the end of the sentence, which would be wrong. To avoid the kind of problems caused by the ambiguous use of punctuation, grammars are used to describe the different contexts for the end of a sentence. Figure 2.9 shows an example grammar for sentence splitting for French sentences.

When a path of the grammar recognizes a sequence in the text and when this path produces the sentence delimiter symbol {S}, this symbol is inserted
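The Dr. Martin problem can be reproduced with a very small Python sketch. Unitex performs sentence splitting with a grammar, not with the regular expression used here; the abbreviation list, the function name and the exact position of the {S} mark are assumptions made for this illustration.

```python
import re

# Hypothetical abbreviation list; in Unitex such contexts are described by
# the sentence splitting grammar.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Prof"}

# A word, a final punctuation mark, whitespace, then an uppercase letter.
BOUNDARY = re.compile(r"(\w+)([.!?])(\s+)(?=[A-Z])")

def mark_sentences(text):
    """Insert {S} after sentence-final punctuation, skipping abbreviations."""
    def insert_delimiter(match):
        word, mark, space = match.groups()
        if mark == "." and word in ABBREVIATIONS:
            return match.group(0)          # "Dr." does not end the sentence
        return f"{word}{mark}{{S}}{space}"
    return BOUNDARY.sub(insert_delimiter, text)

print(mark_sentences("The family has urgently called Dr. Martin. He arrived late."))
```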
137. eferred to as $2 and royal referred to as $3. If a variable appears alone in a box, the constituent has to be the same as in the lemma of the MWU. For instance, <$3> in the uppermost path means that the unit royal is to be recopied as such. If the variable is accompanied by a set of category-feature equations, the constituent has to be inflected to the required form. E.g. <$3:Nb=p> means that the plural form of royal is needed.

In order to generate all inflected forms of the MWU, we have to explore all the paths existing in the graph. Each path starts at the leftmost right arrow and ends at the final encircled box. Each time we come to a node, we perform the action contained in the box (a recopy or an inflection of a constituent) and we accumulate the morphological features contained under the box. The total of the accumulated node outputs should result in the complete morphological description of the inflected form.

For example, in the graph on Figure 10.1, if we follow the intermediate path shown on Figure 10.2,

[Figure 10.2: One path of the inflection graph for battle royal]

we recopy battle ($1) and the space ($2) and we put royal into plural, which yields the plural form battle royals of the whole MWU. As the graph on Figure 10.1 contains three different paths, the whole set of inflected forms generated for battle royal would be:

    battle royal <Nb=s>
    battle royals <
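The path exploration can be mimicked in Python as follows. This is a simplified sketch, not the MULTIFLEX engine: the third path (battles royal) is an assumption, since the list above is cut off after the second form; the output features are stored once per path instead of being accumulated box by box; and inflect is a stand-in for the single-word module that only knows the regular English plural.

```python
# Constituents of the lemma "battle royal", indexed like $1, $2, $3.
CONSTITUENTS = {1: "battle", 2: " ", 3: "royal"}

def inflect(word, features):
    """Stand-in for the single-word module: regular English plural only."""
    return word + "s" if features.get("Nb") == "p" else word

# Each path: the boxes crossed (variable, required features) and the
# features of the resulting form. An empty feature set means "recopy".
PATHS = [
    ([(1, {}), (2, {}), (3, {})], {"Nb": "s"}),
    ([(1, {}), (2, {}), (3, {"Nb": "p"})], {"Nb": "p"}),
    ([(1, {"Nb": "p"}), (2, {}), (3, {})], {"Nb": "p"}),   # assumed third path
]

def explore(paths):
    """Generate one inflected form of the MWU per path."""
    forms = []
    for boxes, output in paths:
        form = "".join(inflect(CONSTITUENTS[var], feats) for var, feats in boxes)
        forms.append((form, output))
    return forms

for form, output in explore(PATHS):
    print(form, output)
```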
138. els between the initial state and the call to the subgraph is crucial. In fact, if there were at least one label different from epsilon between the beginning of the graph Det and the call to DetCompose, this would mean that the Unitex programs exploring the graph Det would have to read the pattern described by that label in the text before calling DetCompose recursively. In this case, the programs would loop infinitely only if they recognized the pattern an infinite number of times in the text, which is impossible.

6.2.4 Error detection

In order to keep the programs from blocking or crashing, Unitex automatically detects errors during graph compilation. The graph compiler checks that the main graph does not recognize the empty word and searches for all possible forms of void loops.

[Figure 6.11: Error message when trying to compile Det. The compilation window reports: ERROR: Det calls DetCompose that recalls the graph Det]

When an error is encountered, an error message is displayed in the compilation window. Figure 6.11 shows the message that appears if one tries to
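The search for infinite recursions can be sketched in Python under a simplifying assumption: for each graph we already know which subgraphs it can call before reading any input (that is, with only epsilon labels between its initial state and the call). The data layout and the function name are hypothetical and do not describe the actual graph compiler.

```python
def find_infinite_recursion(first_calls):
    """Return a cycle of graph names as a list, or None if there is none.

    first_calls[g] lists the subgraphs that graph g can call without
    having read anything in the text.
    """
    def visit(graph, stack):
        if graph in stack:
            return stack[stack.index(graph):] + [graph]
        for callee in first_calls.get(graph, ()):
            cycle = visit(callee, stack + [graph])
            if cycle:
                return cycle
        return None

    for graph in first_calls:
        cycle = visit(graph, [])
        if cycle:
            return cycle
    return None


# Det calls DetCompose at once, and DetCompose calls Det at once: void loop.
print(find_infinite_recursion({"Det": ["DetCompose"], "DetCompose": ["Det"]}))

# If DetCompose must read a label before calling Det, the call is not a
# "first call" and no infinite recursion is reported.
print(find_infinite_recursion({"Det": ["DetCompose"], "DetCompose": []}))
```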
139. en a box in a transducer has no output, it is processed as if it had an <E> output. In MERGE mode, the output is inserted to the left of the recognized sequences.

[Figure 6.39: Example of a transducer]

Look at the transducer in Figure 6.39. If this transducer is applied to the novel Ivanhoe by Sir Walter Scott in MERGE mode, the following concordance is obtained:

    of pointed beams which the Adj adjacent forest supplied defended the o
    f the outlaws with whom the Adj adjacent forest abounded or by the viol
    es may be still seen in the Adj antique Colleges of Oxford or Cambridge
    insolence fellow said the Adj armed rider breaking in on his prattle an
    take a turn round the Adj back of the hill to gain the wind on the
    ring the greater part of the Adj beautiful hills and valleys which lie be
    mantle and hood were of the Adj best Flanders cloth and fell in ample
    dest wine cask place the Adj best mead the mightiest ale the riches
    Then sad relief from the Adj bleak coast that hears The German Ocean
    e bring to the shrine of the Adj Blessed Virgin Well you have said en
    rong And yellow hair d the Adj blue eyed Saxon came Thomson s Liber
    the son of Beowulph is the Adj born thrall of Cedric of Rotherwood Be

Figure 6.40
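The insertion performed in MERGE mode can be imitated on a token list. The sketch below is only an analogy: it assumes that the transducer recognizes the word the followed by an adjective and attaches the output Adj to the adjective box, and it replaces the dictionary lookup with a hypothetical set of adjectives.

```python
def apply_merge(tokens, adjectives, output="Adj"):
    """Insert `output` to the left of every adjective that follows 'the'."""
    result = []
    for position, token in enumerate(tokens):
        follows_the = position > 0 and tokens[position - 1].lower() == "the"
        if follows_the and token.lower() in adjectives:
            result.append(output)      # MERGE: output goes before the input
        result.append(token)
    return result


tokens = "which the adjacent forest supplied".split()
print(" ".join(apply_merge(tokens, {"adjacent", "antique", "armed"})))
```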
140. ence 8011

if the state is final, the three following bytes encode the index in the .inf file of the compressed form to be used to reconstruct the dictionary lines for this inflected form.

Example: if the state refers to the compressed form with index 25133, the corresponding hexadecimal sequence is 00622D.

each leaving transition is then encoded in 5 bytes. The first 2 bytes encode the character that labels the transition, and the three following encode the byte position of the result state in the .bin file. The transitions of a state are encoded next to each other.

Example: a transition that is labeled with the A letter and goes to the state of which the description starts at byte 50106 is represented by the hexadecimal sequence 004100C3BA.

By convention, the first state of the automaton is the initial state.

12.8.2 The .inf files

A .inf file is a text file that describes the compressed forms that are associated to a .bin file. Here is an example of a .inf file:

    0000000006
    _10\0\0\7.N
    .PREP
    _3.PREP
    .PREP,_3.PREP
    1-1.N+Hum:mp
    3er 1.N+AN+Hum:fs

The first line of the file indicates the number of compressed forms that it contains. Each line can contain one or more compressed forms. If there are multiple forms, they are separated by commas. Each compressed form is made up of a sequence required to reconstruct a canonical form from an inflected form, followed by a sequence of grammatical, semantic and
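The two hexadecimal examples above can be reproduced with a few lines of Python. The big-endian byte order is inferred from those examples; the function names are hypothetical, and the bytes that precede these fields in a state description are not covered here.

```python
def encode_inf_index(index):
    """3 bytes, big-endian: index of the compressed form in the .inf file."""
    return index.to_bytes(3, "big")

def encode_transition(char, target_offset):
    """2 bytes for the character code, then 3 bytes for the target state offset."""
    return ord(char).to_bytes(2, "big") + target_offset.to_bytes(3, "big")

print(encode_inf_index(25133).hex().upper())        # final state, index 25133
print(encode_transition("A", 50106).hex().upper())  # transition A -> byte 50106
```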
141. [Figure 9.5: Adding a link]

9.3 Pattern matching

You can perform pattern matching queries on any of your texts by clicking on its "Locate" button. The first time you click, Unitex will ask you to build a working version of your text, as shown on Figure 9.6. This text version will be preprocessed according to the text language; in particular, the default dictionaries will be applied.

WARNING: the text language is determined on the basis of the path name. For instance, if your text file is located in MyUnitex/Klingon/Corpus, the language will be considered to be Klingon. So, if your text is not in a subdirectory of your personal Unitex directory, its language will not be identified.

[Figure 9.6: Unitex needs to build a working version of your text]

Figure 9.7: Pattern matching f
142. ents into a graph

[Figure 5.1: FSGraph menu]

[Figure 5.2: Empty graph]

[Figure 5.3: Creating a box]

[Figure 5.4: Box containing I+you+he+she+it+we+they]

To connect a box to another one, first click on the source box, then click on the target box. If there already exists a transition between two boxes, it is deleted. It is also possible to do that by clicking first on the target box and then on the source box while pressing Shift. In our example, after connecting the box to the initial and final states of the graph, we get a graph as in Figure 5.5.

[Figure 5.5: Graph that recognizes English pronouns]

NOTE: If you double-click a box, you connect this box to itself (see Figure 5.6). To undo this, double-click on the same box a second time, or use the "Undo" button.

[Figure 5.6: Box connected to itself]

Click on "Save as" in the "FSGraph" menu to save the graph. By default, Unitex proposes to save the graph in the sub-directory Graphs in your personal folder. You can see if the graph was modified after the last saving by checking if the title contains the text (Unsaved).

5.2.2 Sub-Graphs
143. erdam-Philadelphia: John Benjamins Publishing Company. (3.7)

[49] Eric LAPORTE and Anne MONCEAUX. Elimination of lexical ambiguities by grammars: The ELAG system. Lingvisticæ Investigationes, 22:341-367, 1998. Amsterdam-Philadelphia: John Benjamins Publishing Company. (7, 7.3)

[50] Ville LAURIKARI. TRE home page. http://laurikari.net/tre/. (1.1, 4.7)

[51] Judith N. LEVI. The Syntax and Semantics of Complex Nominals. Academic Press, New York, London, 1978. (10.1)

[52] XAlign (Alignement multilingue). LORIA, 2006. http://led.loria.fr/outils/ALIGN/align.html. (9)

[53] Annie MEUNIER. Nominalisation d'adjectifs par verbes supports, 1981. Thèse de doctorat, Université Paris 7. (8.1)

[54] Sun Microsystems. Java. http://java.sun.com. (1.2)

[55] Christian MOLINIER and Françoise LEVRIER. Grammaire des adverbes : description des formes en -ment. Droz, Genève, 2000. (8.1)

[56] Anne MONCEAUX. Le dictionnaire des mots simples anglais : mots nouveaux et variantes orthographiques. Technical Report 15, IGM, Université de Marne-la-Vallée, 1995.

[57] OpenOffice.org. http://www.openoffice.org. (1.8, 2.2)

[58] Dong Ho PAK. Lexique-grammaire comparé français-coréen. Syntaxe des constructions complétives. PhD thesis, UQAM, Montréal, 1996. (8.1)

[59] Soun Nam PARK. La construction des verbes neutres en coréen, 1996. Thèse de doctorat, Université Paris 7. (8.1)

[60] Sébastien PAUMIER and Dana Marina DUMITRIU. Editable text alignments and powerful
144. ereas m indicates that there are several ones this mode is useful in Korean The default value is a s e 1 line maximum number of lines to be printed in the output file e i subname indicates that the recursive exploration must end when the program enters in graph subname This parameter can be used several times in order to specify several stop graphs e p s f d s displays paths graph by graph f default displays global paths d displays global paths with information on nested graph calls e c SS 0xXXXX replaces symbol SS when it appears between angle brackets by the Unicode character whose hexadecimal number is 0xXXXX e s L R specifies the left L and right R delimiters that will enclose items By default no delimiters are specified e s0 Str if the program must take outputs into account this parameter specifies the sequence Str that will be inserted between input and output By default there is no separator e f a s if the program must take outputs into account this parameter specifies the format of the lines that will be generated in0 inl out0 out1 s orin0 out0 inl outl a The default value is s e v prints information during the process verbose mode e rx L R specifies how cycles must be displayed L and R are delimiters If we consider the graph shown on Figure 11 1 here are the results for L and R il fait tr s tres il fait tr s beau 11 15 FST2TXT
145. ers to the information in the text dictionaries. The four possible forms are:

• <be> matches all the entries that have be as canonical form;
• <be.V> matches all entries having be as canonical form and the grammatical code V;
• <V> matches all entries having the grammatical code V;
• {am,be.V} or <am,be.V> matches all the entries having am as inflected form, be as canonical form and the grammatical code V. This kind of lexical mask is only of interest if applied to the text automaton, where all the ambiguity of the words is explicit. While executing a search on the text, that lexical mask matches the same as the simple token am.

4.3.3 Grammatical and semantic constraints

The references to dictionary information (be, V) in these examples are basic. It is possible to express more complex lexical masks by using several grammatical or semantic codes separated by the character +. An entry of the dictionary is then only found if it has all the codes that are present in the mask. The mask <N+z1> thus recognizes the entries:

broderies,broderie.N+z1:fp
capitales européennes,capitale européenne.N+NA+Conc+HumColl+z1:fp

but not:

Descartes,René Descartes.N+Hum+NPropre:ms
habitué,.A+z1:ms

It is possible to exclude codes by preceding them with a negation character instead of +. In order to be recognized, an entry has to contain all the codes required by the lexical mask and none of the prohibited ones. The mask
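The matching rule for grammatical and semantic codes (every required code present, no prohibited code present) can be sketched in Python as follows. This is only an illustration: the Entry and Mask classes and the matches function are hypothetical names, not part of Unitex, and the parsing of the mask syntax itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Entry:
    """A dictionary entry reduced to the fields used by lexical masks."""
    inflected: str
    lemma: str
    codes: FrozenSet[str]


@dataclass(frozen=True)
class Mask:
    """A lexical mask already split into its parts (hypothetical structure)."""
    lemma: Optional[str] = None              # None means any canonical form
    required: FrozenSet[str] = frozenset()   # codes the entry must carry
    forbidden: FrozenSet[str] = frozenset()  # codes the entry must not carry


def matches(mask: Mask, entry: Entry) -> bool:
    """Return True if the entry has all required codes and no forbidden one."""
    if mask.lemma is not None and mask.lemma != entry.lemma:
        return False
    has_all_required = mask.required <= entry.codes
    has_no_forbidden = not (mask.forbidden & entry.codes)
    return has_all_required and has_no_forbidden
```

With a mask requiring the codes N and z1, the entry for broderies (codes N, z1) is accepted, while the entry for habitué (codes A, z1) is rejected because N is missing.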
146. esentation of the text does not contain newlines, but spaces. Since a newline counts as two characters and a space as a single one, it is necessary to know where newlines occur in the text when the positions of occurrences located by the Locate program are to be synchronized with the text file. File enter.pos is used for this by the Concord program. Thanks to this, when clicking on an occurrence in a concordance, it is correctly selected in the text. File enter.pos is a binary file containing the list of the positions of newlines in the text.

All produced files are saved in the text directory.

11.31 Txt2Fst2

Txt2Fst2 [OPTIONS] <txt>

This program constructs an automaton of a text. <txt> represents the complete path of a text file, without omitting the .snt extension.

OPTIONS:
• -a ALPH/--alphabet=ALPH: alphabet file;
• -c/--clean: indicates whether the rule of conservation of the best paths (see section 7.2.4) should be applied;
• -n XXX/--normalization_grammar=XXX: name of a normalization grammar that is to be applied to the text automaton.

If the text is separated into sentences, the program constructs an automaton for each sentence. If this is not the case, the program arbitrarily cuts the text into sequences of 2000 lexical units and produces an automaton for each of these sequences.

The result is a file called text.fst2, which is saved in the directory of the text.

NO
147. essus s il platt Elle O All sentences Plain text si daca toate astea nu sintindeajuns avem si un fel de reminiscenta de regret 108 All sentences Plain text 8 Matched sentences All sentences HTML 8 Matched sentences O All sentences HTML Aligned with source concordance O Aligned with target concordance Locate Clear alignment Align Save alignment Save alignment as Locate Figure 9 8 Displaying matched sentences To exploit parallel texts it is then interesting to retrieve sentences aligned with matched sentences This can be done by selecting for the other text the display mode Aligned with source concordance In this mode Unitex filters sentences that are not linked to matched sentences in the source text So it is easy to lookup for an expression in one text and to find the corresponding sentences in the other as shown on Figure 9 9 9 3 PATTERN MATCHING D My Unitex XAlign funtana xml mais nous assassinons tour de bras corme nous mangeons comme nous respirons comme nous accomplissons les gestes les plus quotidiens Apr s avoir mang le sien l un d entre nous commen ait Tante donne moi le dessus s il nla t Elle All sentences Plain text 8 Matched sentences All sentences HTML Aligned with target concordance sugrumam dar noi asasinam cu at ta nongalant de parca am minca am respira am face un gest de zi
148. esti e i dizionari elet tronici In E Burattini and R Cordeschi editors Manuale di Intelligenza Artificiale per le Scienze Umane Roma Carocci 2002 3 7 29 A Simple English Axis Generator http nlp cs nyu edu GMA docs HOWTO axis 11 3 30 Jacqueline GIRY SCHNEIDER Les nominalisations en fran ais L op rateur faire dans le lex ique Droz Gen ve Paris 1978 8 1 31 Jacqueline GIRY SCHNEIDER Les pr dicats nominaux en fran ais Les phrases simples verbe support Droz Gen ve Paris 1987 8 1 32 GNU General Public License http ww gnu org licenses gp1 html 1 1 12 11 6 33 GNU Lesser General Public License http www gnu org licenses lgpl html 1 1 12 116 34 Gaston GROSS D finition des noms compos s dans un lexique grammaire Langue Francaise 87 1990 10 1 35 Gaston GROSS Les expressions fig es en francais Noms compos s et autres locutions Ophrys Paris 1996 3 7 10 1 36 Maurice GROSS M thodes en syntaxe Hermann Paris 1975 8 1 37 Maurice GROSS Grammaire transformationnelle du fran ais 3 Syntaxe de l adverbe AS STRIL Paris 1986 3 7 8 1 38 Alain GUILLET and Christian LECLERE La structure des phrases simples en francais les constructions transitives locatives Droz Gen ve 1992 8 1 39 Beno t HABERT and Christian JACQUEMIN Noms compos s termes d nominations complexes probl matiques linguistiques et traitements automatiques Traitement Au tomatiqu
149. et file is specified the SortTxt program sorts in the order of the Unicode encoding 12 3 Graphs This section presents the two graph formats the graphic format grf and the compiled format fst2 12 3 1 Format grf A grf file is a text file that contains presentation information in addition to information representing the contents of the boxes and the transitions of the graph A grf file begins with the following lines fUnigraphY SIZE 1313 9504 FONT Times New Roman 124 OFONT Times New Roman B 124 BCOLOR 167772159 FCOLOR 04 218 CHAPTER 12 FILE FORMATS ACOLOR 126322564 SCOLOR 167116804 CCOLOR 2554 DBOXES y PORIENT LY 4 The first line Unigraph is acomment line The following lines define the parameter values of the graph presentation SIZE x y defines the width x and the hight y of a graph in pixels FONT name xyz defines the font used for displaying the contents of the boxes name represents the name of the mode x indicates if the text should be in bold face or not If x is B it indicates that it should be bold For non bold face x should be a space In the same way y has value I if the text should be italic a space if not z represents the size of the text OFONT name xyz defines the mode used for displaying transducer outputs Param eters name x y and z are defined in the same way as FONT BCOLOR x defines the background color of the graph x represen
150. except for those used to mark a sentence separator ({S}) or a valid lexical tag ({aujourd'hui,.ADV}). The newline needs to be encoded with the two special characters with hexadecimal values 000D and 000A.

12.4.2 .snt files

.snt files are .txt files that have been processed by Unitex. These files should not contain any tabs. They should also not contain multiple consecutive spaces or newlines. The only allowed braces in .snt files are those of the sentence delimiter {S} and those of lexical labels ({aujourd'hui,.ADV}).

12.4.3 File text.cod

The text.cod file is a binary file containing a sequence of integers that represent the text. Each integer i reflects the token with index i in the tokens.txt file. These integers are encoded in four bytes.

NOTE: Tokens are numbered starting at 0.

12.4.4 The tokens.txt file

The tokens.txt file is a text file that contains the list of all lexical units of the text. The first line of this file indicates the number of units found in the file. Units are separated by a newline. Whenever a sequence is found in the text with capitalization variants, each variant is encoded as a distinct unit.

NOTE: Newlines that might be in the .snt file are encoded like spaces. Therefore there is no unit encoding the newline.

12.4.5 The tok_by_alph.txt and tok_by_freq.txt files

These two files are text files that contain the list of lexical units sorted alphabetically or by frequency. In the tok_by_alph.txt file
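The link between text.cod and tokens.txt can be sketched in Python as follows. This is a minimal illustration, not Unitex code. Two points are assumptions that this section does not state: the four-byte integers are read as little-endian, and tokens.txt is read as UTF-16 text with a byte-order mark. The function names are hypothetical.

```python
import struct
from pathlib import Path


def read_tokens(path, encoding="utf-16"):
    """Read tokens.txt: the first line is the token count, then one token per line."""
    lines = Path(path).read_text(encoding=encoding).split("\n")
    count = int(lines[0].strip())
    # Tokens are not stripped: a single space is a legitimate token.
    return [line.rstrip("\r") for line in lines[1:1 + count]]


def read_text_cod(path, byteorder="<"):
    """Read text.cod as a flat sequence of 4-byte integers (byte order assumed)."""
    data = Path(path).read_bytes()
    n = len(data) // 4
    return list(struct.unpack(f"{byteorder}{n}i", data[:4 * n]))


def rebuild_text(tokens, codes):
    """Rebuild the text by looking up each index (numbered from 0) in the token list."""
    return "".join(tokens[i] for i in codes)
```

Because newlines are encoded like spaces, the rebuilt string does not reproduce the line breaks of the original file.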
151. f Sections 1 and 2 above provided that you accompany it with the complete corresponding machine readable source code which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange If distribution of object code is made by offering access to copy from a designated place then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code even though third parties are not compelled to copy the source along with the object code 5 A program that contains no derivative of any portion of the Library but is designed to work with the Library by being compiled or linked with it is called a work that uses the Library Such a work in isolation is not a derivative work of the Library and therefore 12 11 VARIOUS OTHER FILES 251 falls outside the scope of this License However linking a work that uses the Library with the Library creates an executable that is a derivative of the Library because it contains portions of the Library rather than a work that uses the library The executable is therefore covered by this License Section 6 states terms for distribution of such executables When a work that uses the Library uses material from a header file that is part of the Library the object code for the work may be a derivative work of the Library even though the source code is not Whether this is true is especially signi
152. f compound words in Dutch German Norwegian and Rus SUBIR A bn Moda e Se he im un EN Ge Hee ee Oa ae he ec 32 26 Op ningatagged text oci rara roria uwuh a hee eee ES 32 3 Dictionaries 35 31 The DELA dictionaries 8 mue gun e RS pa b 35 3211 TheDELAP OEMa 22 22 amada gants due hui m d due ws 35 312 The DELAS Pen ccoo dde dogs eae Ae we a dune 38 3d Dichonary Contents se Lis i taeog wnt A a eee ess sa 39 3 2 Checking dictionary formiat niche RAE 41 Da SERA erg At a a EE ee eS a E Eed e E geg AR 42 3 4 29 3 6 3 7 Searching with regular expressions 4 1 Definition 4 2 4 3 4 4 4 5 4 6 4 7 4 8 Automatic inflection Inflection of simple words 3 4 2 Inflection of compound words 3 4 3 Inflection of semitic languages Compression Applying dictionaries Priorities e oa Fb oo RO EWR ES 3 6 2 Application rules for dictionaries 3 6 3 Dictionary graphs 3 64 Morphological dictionary graphs Bibliography 3 4 1 3 6 1 Tokens Lexical 4 3 1 4 3 2 4 3 3 4 3 4 435 Kleene Search 4 8 1 EE Special symbols References to information in the dictionaries Grammatical and semantic constraints Inflectional constraints Negation of a lexical mask Concatenation EE teg AA A d Morphological filters Configuration of the search 4 8 2 Presentation of the results Local grammars 5 1 The local grammar formalism 52 59 5 11 Algebraic grammars 5 1 2 Extended algebraic gram
153. f files are not interpreted in the same manner as the grf files that rep resent graphs constructed by the user In fact in a normal graph the lines of a box are separated by the symbol In the graph of a sentence each box represents either a lexical unit without a tag or a dictionary entry enclosed by curly brackets If the box only repre sents an unlabeled lexical unit this unit appears alone in the box If the box represents a dictionary entry the inflected form is displayed followed in another line by the canonical form if it is different The grammatical and inflectional information is displayed below the box as a transducer output Figure 7 23 shows the graph obtained for the first sentence of Ivanhoe The words Ivanhoe Walter and Scott are considered unknown words The word by corresponds to two en tries in the dictionary The word Sir corresponds to two dictionary entries as well but since the canonical form of these entries is sir it is displayed because it differs from the inflected form by a lower case letter Figure 7 23 Automaton of the first sentence of Ivanhoe 7 4 2 Modifying the text automaton It is possible to manually modify the sentence automaton You can add or erase boxes or transitions When a graph is modified it is saved to the text file sentenceN gr where N represents the number of the sentence When you select a sentence if a modified graph exists for this sentence this one is displayed You can
154. fault

• -M/--merge: merge transducer outputs with text inputs;
• -R/--replace: replace text inputs with corresponding transducer outputs.

Ambiguous output options:
• -b/--ambiguous_outputs: allows the production of several matches with the same input but different outputs (default);
• -z/--no_ambiguous_outputs: forbids ambiguous outputs. In case of ambiguous outputs, one will be arbitrarily kept, depending on the internal state of the program.

Variable error options:
These options have no effect if the output mode is set to ignore; otherwise, they determine the behavior of the Locate program when an output is found that contains a reference to a variable that is not correctly defined.
• -X/--exit_on_variable_error: kills the program;
• -Y/--ignore_variable_errors: acts as if the variable has an empty content (default);
• -Z/--backtrack_on_variable_errors: stops exploring the current path of the grammar.

This program saves the references to the found occurrences in a file called concord.ind. The number of occurrences, the number of units belonging to those occurrences, as well as the percentage of recognized units within the text are saved in a file called concord.n. These two files are stored in the directory of the text.

11.20 MergeTextAutomaton

MergeTextAutomaton <txtauto>

This program reconstructs text automaton <txtauto> taking into account the manual mod i
155. ficant if the work can be linked without the Library or if the work is itself a library The threshold for this to be true is not precisely defined by law If such an object file uses only numerical parameters data structure layouts and acces sors and small macros and small inline functions ten lines or less in length then the use of the object file is unrestricted regardless of whether it is legally a derivative work Exe cutables containing this object code plus portions of the Library will still fall under Section 6 Otherwise if the work is a derivative of the Library you may distribute the object code for the work under the terms of Section 6 Any executables containing that work also fall under Section 6 whether or not they are linked directly with the Library itself 6 As an exception to the Sections above you may also combine or link a work that uses the Library with the Library to produce a work containing portions of the Library and distribute that work under terms of your choice provided that the terms permit modifica tion of the work for the customer s own use and reverse engineering for debugging such modifications You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License You must supply a copy of this License If the work during execution displays copyright notices you must include the copyright notice for the Library among
156. fications. If the program finds a file sentenceN.grf in the same directory as <txtauto>, it replaces the automaton of sentence N with the one represented by sentenceN.grf. The <txtauto> file is replaced by the new text automaton. The old text automaton is backed up in a file called text.fst2.bck.

11.21 MultiFlex

MultiFlex [OPTIONS] <dela>

This program carries out the automatic inflection of a DELA dictionary containing simple (see section 3.1.2) or compound word lemmas (see chapter 10).

OPTIONS:
• -o DELAF/--output=DELAF: output DELAF file;
• -a ALPH/--alphabet=ALPH: alphabet file;
• -d DIR/--directory=DIR: the directory containing Morphology and Equivalences files and inflection graphs for single and compound words.

Note that .fst2 inflection transducers will automatically be built from corresponding .grf files if absent or older than .grf files.

11.22 Normalize

Normalize [OPTIONS] <text>

This program carries out a normalization of text separators. The separators are space, tab and newline. Every sequence of separators that contains at least one newline is replaced by a unique newline. All other sequences of separators are replaced by a single space.

This program also checks the syntax of lexical tags found in the text. All sequences in curly brackets should be either the sentence delimiter {S}, the stop marker {STOP}, or valid entries in the DELAF format ({aujour
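The separator rule of Normalize can be sketched in a few lines of Python. This is an illustration of the rule only, not the actual implementation; it assumes that newlines are already represented by a single "\n" character, and the function name is hypothetical. The check of lexical tags is not covered.

```python
import re

# The three separators named in the description: space, tab and newline.
_SEPARATOR_RUN = re.compile(r"[ \t\n]+")


def normalize_separators(text):
    """Replace each run of separators by one newline or one space."""
    def collapse(match):
        # A run containing at least one newline becomes a single newline;
        # any other run becomes a single space.
        return "\n" if "\n" in match.group(0) else " "

    return _SEPARATOR_RUN.sub(collapse, text)
```

For example, a run made of a space, two newlines and a space collapses to a single newline, whereas a run of two spaces and a tab collapses to a single space.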
157. file that describes the list of system dictionaries that are applied by default. This file can be found in the directory of the current language. Each line corresponds to the name of a .bin file. The system dictionaries are in the system directory, in the Dela sub-directory of the current language. Here is an example of this file:

delacf.bin
delaf.bin

12.10.3 The user_dic.def file

The user_dic.def file is a text file that describes the list of dictionaries the user has defined to be applied by default. This file is in the directory of the current language and has the same format as the system_dic.def file. The dictionaries need to be in the Dela sub-directory of the current language, in the personal directory of the user.

12.10.4 The user.cfg file

Under Linux, Unitex expects the personal directory of the user to be called unitex and expects it to be in his root directory ($HOME). Under Windows, it is not always possible to associate a directory with a user by default. To compensate for that, Unitex creates a .cfg file for each user that contains the path to his personal directory. This file is saved under the name (user login).cfg in the Unitex/Users system sub-directory.

WARNING: THIS FILE IS NOT IN UNICODE
WARNING 2: THE PATH OF THE PERSONAL DIRECTORY IS NOT FOLLOWED BY A NEWLINE

12.11 Various other files

For each text, Unitex creates multiple files that contain information that are
158. for permission NO WARRANTY 12 BECAUSE THE LINGUISTIC RESOURCE IS LICENSED FREE OF CHARGE THERE IS NO WARRANTY FOR THE LINGUISTIC RESOURCE TO THE EX TENTPERMITTED BY APPLICABLE LAW EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND OR OTHER PARTIES PRO VIDE THE LINGUISTIC RESOURCE AS IS WITHOUT WARRANTY OF ANY KIND EITHER EXPRESSED OR IMPLIED INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LINGUISTIC RESOURCE IS WITH YOU SHOULD 262 CHAPTER 12 FILE FORMATS THE LINGUISTIC RESOURCE PROVE DEFECTIVE YOU ASSUME THE COST OF ALL NECESSARY SERVICING REPAIR OR CORRECTION 13 IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER OR ANY OTHER PARTY WHO MAY MODIFY AND OR REDISTRIBUTE THE LINGUISTIC RESOURCE AS PERMITTED ABOVE BE LIABLE TO YOU FOR DAMAGES INCLUDING ANY GENERAL SPECIAL INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LINGUISTIC RE SOURCE INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BE ING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LINGUISTIC RESOURCE TO OPERATE WITH ANY OTHER SOFTWARE EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES END OF TERMS AND CONDITIONS Bibliography 1 Free Software Foundation http www
159. format frame, indicate the name of the parameterized graph to be used. In the Resulting GRF grammar frame, indicate the name of the main graph that will be generated. This main graph is a graph that invokes all the graphs that are going to be generated. When launching a search in a text with that graph, all the generated graphs are simultaneously applied.

The Name of produced subgraphs frame is used to set the name of each graph that will be generated. Enter a name containing @, because for each line of the table, @ will be replaced by the line number, which guarantees that each graph name will be unique. For example, if the main graph is called TestGraph.grf and if subgraphs are called TestGraph_@.grf, the graph generated from the 16th line of the table will be named TestGraph_0016.grf.

Figures 8.8 and 8.9 show two graphs generated by applying the parameterized graph of figure 8.3 to table 31H.

Figure 8.7: Configuration of the automatic generation of graphs

Figure 8.10 shows the resulting main graph.

Figure 8.8: Gra
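The naming scheme for the generated subgraphs can be illustrated with a short Python sketch. It is not Unitex code: the function name is hypothetical, the placeholder character is passed as a parameter whose default "@" is an assumption of this sketch, and the four-digit zero padding is inferred only from the TestGraph_0016 example.

```python
def subgraph_name(template, line_number, placeholder="@", width=4):
    """Build a subgraph file name by inserting the zero-padded table line number."""
    if placeholder not in template:
        # Without a placeholder every line would map to the same file name.
        raise ValueError("template must contain the placeholder")
    return template.replace(placeholder, str(line_number).zfill(width))


def all_subgraph_names(template, line_count):
    """Names for lines 1..line_count of a table; each one is unique."""
    return [subgraph_name(template, n) for n in range(1, line_count + 1)]
```

Since every table line has a distinct number, the resulting names cannot collide, which is the property the text relies on.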
160. from his long captivity had become an Le Figure 4 2 Result of a search for the pattern lt MOT gt 4 4 Concatenation There are three ways to concatenate regular expressions The first consists in using the concatenation operator which is represented by the dot Thus the expression lt DET gt lt N gt recognizes a determiner followed by a noun The space can also be used for concatenation as well as the empty string The following expressions the lt A gt cat the lt A gt cat recognizes the token the followed by an adjective and the token cat The parenthesis are used as delimiters of a regular expression All of the following expressions are equivalent the lt A gt cat the lt A gt cat the lt A gt cat the lt A gt cat the lt A gt cat 4 5 Union The union of regular expressions is expressed by typing the character between them The expression I youthe she it we they lt V gt 4 6 KLEENE STAR 63 recognizes a pronoun followed by a verb If an element in an expression is optional it is sufficient to use the union of this element and the empty word epsilon Examples the little lt E gt cat recognizes the sequences the cat and the little cat lt E gt Anglo French Indian recognizes French Indian Anglo French and Anglo Indian 4 6 Kleene star The Kleene star represented by the character allows you to recognize zero one or several occurrences of an expressio
161. from the ones used in western languages. Spaces can be forbidden, optional or mandatory. In order to better cope with these particularities, Unitex splits texts in a language-dependent way. Thus, languages like English are treated as follows. A token can be:

• the sentence delimiter {S};
• the stop marker {STOP}. This token is a special one that can NEVER be matched in any way by a grammar. It can be used to bound elements in a corpus. For instance, if a corpus is made of news items separated by {STOP}, no grammar can match a sequence that overlaps the end of one news item and the beginning of the following one;
• a lexical tag {aujourd'hui,.ADV};
• a contiguous sequence of letters (the letters are defined in the language alphabet file);
• one (and only one) non-letter character, i.e. any character not defined in the alphabet file of the current language; if it is a newline, it is replaced by a space.

For other languages, tokenization is done on a character-by-character basis, except for the sentence delimiter {S}, the {STOP} marker and lexical tags. This simple tokenization is fundamental for the use of Unitex, but limits the optimization of search operations for patterns.

Regardless of the tokenization mode, newlines in a text are replaced by spaces. Tokenization is done by the Tokenize program. This program creates several files that are saved in the text directory:

• tokens.txt contains the list of tokens in the orde
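The tokenization rules for languages like English can be sketched in Python as follows. This is a simplified illustration, not the Tokenize program: the alphabet is passed as a plain string of letters instead of being read from an alphabet file, any sequence between an opening and a closing brace is taken as one token without checking that it is {S}, {STOP} or a valid lexical tag, and the function name is hypothetical.

```python
def tokenize(text, letters):
    """Split text into tokens: braced tags, letter sequences, single characters."""
    alphabet = set(letters)
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            # {S}, {STOP} or a lexical tag: keep the braced sequence whole.
            end = text.find("}", i)
            if end != -1:
                tokens.append(text[i:end + 1])
                i = end + 1
                continue
        if ch in alphabet:
            # Contiguous sequence of letters of the alphabet.
            j = i
            while j < len(text) and text[j] in alphabet:
                j += 1
            tokens.append(text[i:j])
            i = j
            continue
        # One and only one non-letter character; a newline becomes a space.
        tokens.append(" " if ch == "\n" else ch)
        i += 1
    return tokens
```

Character-by-character tokenization, as used for other languages, corresponds to calling this function with an empty alphabet.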
162. from the library whereas the latter must be combined with the library in order to run GNU LESSER GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING DISTRIBUTION AND MODIFICATION 0 This License Agreement applies to any software library or other program which con tains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License also called this License Each licensee is addressed as you A library means a collection of software functions and or data prepared so as to be con veniently linked with application programs which use some of those functions and data to form executables The Library below refers to any such software library or work which has been dis tributed under these terms A work based on the Library means either the Library or any derivative work under copyright law that is to say a work containing the Library or a portion of it either verbatim or with modifications and or translated straightforwardly into another language Hereinafter translation is included without limitation in the term modification Source code for a work means the preferred form of the work for making modifications toit For a library complete source code means all the source code for all modules it contains plus any associated interface definition files plus the scripts used to control compilation and installation of the library Acti
163. fter pre processing or 2 by explicitly clicking on Apply Lexical Resources in the Text menu see section 3 6 Unitex can manipulate compressed dictionaries bin and dictionary graphs fst 2 We will now describe the rules for applying dictionaries in detail Dictionary graphs will be described in section 3 6 3 3 6 1 Priorities The priority rule says that if a word in a text is found in a dictionary this word will not be taken into account by dictionaries with lower priority This allows for eliminating a part of ambiguity when applying dictionaries For example the French word par has a nominal interpretation in the golf domain If you don t want to use this meaning it is sufficient to create a filter dictionary containing only the entry par PREP and to apply this with highest priority This way even if simple word dictionaries contain different entries they will be ignored given the priority rule There are three priority levels The dictionaries whose names without extension end with have the highest priority those that end with have the lowest one All other dictionaries are applied with medium priority The order in which dictionaries with the same priority are applied does not matter On the command line the command Dico ex snt alph txt ctr bin cities bin rivers bin regions bin will apply the dictionaries in the following order ex snt is the text to which the dictionar ies are applied and alph txt i
164. g of these symbols for Unitex as well as the ways to rec ognize these characters in texts Caracter Meaning Escape R quotation marks mark sequences that must not be in K terpreted by Unitex and whose case must be taken verbatim separates different lines within the boxes mam introduces a call to a subgraph te or indicates the start of a transduction within a box 7 lt lt indicates the start of a pattern or a meta lt or lt gt gt indicates the end of a pattern or a meta gt or gt prohibits the presence of a space ou escapes most of the special characters Table 5 1 Encoding of special characters in the graph editor 5 2 8 Toolbar Commands The toolbar on the left of a graph contains shortcuts for certain commands and allows you to manipulate boxes of a graph by using some tools This toolbar may be moved by clicking on the rough zone It may also be dissociated from the graph and appear in an separate window see figure 5 19 In this case closing this window puts the toolbar back at its initial position Each graph has its own toolbar The first two icons are shortcuts for saving and compiling the graph The following five correspond to the Copy Cut Paste Redo and Undo operations The last icon showing a key is a shortcut to open the window with the graph display options 5 3 DISPLAY OPTIONS 83 A a De Sp b ew BD Fig
165. give the alarm when any one approaches 5 But I trust soon omanlike and bravely Of twenty four arrows shot in succession ten started up and bent their bous i Six arrows placed on the string were he back of which was decorated with two ass s ears and which was placed These two squires were followed by two attendants whose dark visages ber with a grave pace followed by four attendants bearing in a table co ake part and being divided into two bands of equal numbers might fig Figure 6 21 Results of the application of the grammar shown on Figure 6 20 106 CHAPTER 6 ADVANCED USE OF GRAPHS All the outputs produced in the left context are ignored as you can see in the concordance of Figure 6 23 showing the results obtained with the grammar of Figure 6 22 seven N eight nine ten Figure 6 22 Ignored output in a left context Concordance D My UnitexiEnglishiCorpusiivanhoe_snticoncord htm e courses and cast to the ground three N antagonists 5 I add that seven of utes to keep at sword s point his three N antaqonists turning and wheeling with entinels to give the alarm when any one N approaches 5 But I trust soon to ga omanlike and bravely 5 Of twenty four N arrows shot in succession ten were fi started up and bent their bows Six N arrows placed on the string were pointe he back of which was decorated with two N ass s ears and which was placed about These two squires were follo
166. gle user, he can also copy the directory to his working directory. He can work with this language without this language being shown to other users.

1.7 Uninstalling Unitex

No matter which operating system you are working with, it is sufficient to delete the Unitex directory to completely delete all the program files. Under Windows, you may have to delete the shortcut to Unitex.jar if you have created one on your desktop. The same has to be done on Linux if you have created an alias.

Chapter 2 Loading a text

One of the main functionalities of Unitex is to search a text for expressions. To do that, texts have to undergo a set of preprocessing steps that normalize non-ambiguous forms and split the text into sentences. Once these operations are performed, the electronic dictionaries are applied to the texts. Then one can search more effectively in the texts by using grammars.

This chapter describes the different steps for text preprocessing.

2.1 Selecting a language

When starting Unitex, the program asks you to choose the language in which you want to work (see figure 2.1). The languages displayed are the ones that are present in the Unitex system directory and those that are installed in your personal working directory. If you use a language for the first time, Unitex copies the system directory for this language to your personal directory, except for the dictionaries, in order to save disk space.

WARNING: If you already have a personal directo
167. grf graph to be generated. 11.25 Reg2Grf: Reg2Grf <txt>. This program constructs a .grf file corresponding to the regular expression written in file <txt>. The parameter <txt> represents the complete path to the file containing the regular expression. This file needs to be a Unicode text file. The program takes into account all characters up to the first newline. The result file is called regexp.grf and is saved in the same directory as <txt>. 11.26 SortTxt: SortTxt [OPTIONS] <txt>. This program carries out a lexicographical sorting of the lines of file <txt>. <txt> represents the complete path of the file to be sorted. OPTIONS: • -n/--no_duplicates: remove duplicate lines (default); • -d/--duplicates: do not remove duplicate lines; • -r/--reverse: sort in descending order; • -o XXX/--sort_order=XXX: sorts using the alphabet of the order defined by file XXX. If this parameter is missing, the sorting is done according to the order of Unicode characters; • -l XXX/--line_info=XXX: backup the number of lines of the result file in file XXX; • -t/--thai: option for sorting Thai text. The input text file is modified. By default, the sorting is performed in the order of Unicode characters, removing duplicate lines. 11.27 Table2Grf: Table2Grf [OPTIONS] <table>. This program automatically generates graphs from a lexicon grammar <table> and
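The sorting behaviour described for SortTxt (duplicates removed by default, Unicode order unless an alphabet is given, optional descending order) can be illustrated with a minimal Python sketch. This is not the SortTxt implementation: the function name `sort_lines`, its parameters, and the representation of the sort alphabet as a plain string of characters are assumptions made for illustration only, and the real sorted alphabet file has its own format.

```python
def sort_lines(lines, keep_duplicates=False, reverse=False, alphabet=None):
    """Sort lines lexicographically, loosely mirroring the SortTxt options.

    alphabet is a hypothetical simplification: a string whose character
    order defines the collation. Characters missing from it sort after
    the listed ones, in Unicode order.
    """
    if alphabet is None:
        # Default: plain Unicode code point order.
        def key(line):
            return [ord(ch) for ch in line]
    else:
        rank = {ch: i for i, ch in enumerate(alphabet)}

        def key(line):
            return [(0, rank[ch]) if ch in rank else (1, ord(ch)) for ch in line]

    # Duplicates are dropped unless the caller asks to keep them.
    items = list(lines) if keep_duplicates else set(lines)
    return sorted(items, key=key, reverse=reverse)


print(sort_lines(["b", "a", "b"]))                        # duplicates removed
print(sort_lines(["b", "a", "b"], keep_duplicates=True))  # duplicates kept
print(sort_lines(["a", "b"], alphabet="ba"))              # custom order
```

Unlike SortTxt, which modifies the input file in place, the sketch returns a new list and leaves its input untouched.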
168. h further by using a formalism even more powerful than automata These grammars are represented as graphs that the user can easily create and update Lexicon grammar tables are matrices describing properties of some words Many such ta bles have been constructed for all simple verbs in French as a way of describing their rele vant syntactic properties Experience has shown that every word has a quasi unique behav ior and these tables are a way to present the grammar of every element in the lexicon hence the name lexicon grammar for this linguistic theory Unitex offers a way to automatically build grammars from lexicon grammar tables Unitex can be viewed as a tool in which one can put linguistic resources and use them Its technical characteristics are its portability modularity the possibility of dealing with lan guages that use special writing systems e g many Asian languages and its openness thanks to its open source distribution Its linguistic characteristics are the ones that have motivated the elaboration of these resources precision completeness and the taking into 11 12 CONTENTS account of frozen expressions most notably those which concern the enumeration of com pound words What s new from version 1 2 Here are some interesting new features e left contexts morphological mode in Locate e brand new version of Convert replacement of Inflect by MultiFlex that can inflect compound words and that can handl
169. have received copies or rights from you under this License will not have their licenses terminated so long as such parties remain in full compliance You are not required to accept this License since you have not signed it However nothing else grants you permission to modify or distribute the Linguistic Resource or its derivative works These actions are prohibited by law if you do not accept this Li cense Therefore by modifying or distributing the Linguistic Resource or any work based on the Linguistic Resource you indicate your acceptance of this License to do so and all its terms and conditions for copying distributing or modifying the Linguis tic Resource or works based on it Each time you redistribute the Linguistic Resource or any work based on the Linguis tic Resource the recipient automatically receives a license from the original licensor to copy distribute link with or modify the Linguistic Resource subject to these terms and conditions You may not impose any further restrictions on the recipients exercise of the rights granted herein You are not responsible for enforcing compliance by third parties with this License If as a consequence of a court judgment or allegation of patent infringement or for any other reason not limited to patent issues conditions are imposed on you whether by court order agreement or otherwise that contradict the conditions of this License they do not excuse you from th
170. he Linguistic Resource but is designed to work with the Linguistic Resource or an encrypted form of the Linguistic Resource by reading it or being compiled or linked with it is called a work that uses 12 11 VARIOUS OTHER FILES 259 the Linguistic Resource Such a work in isolation is not a derivative work of the Linguistic Resource and therefore falls outside the scope of this License However combining a work that uses the Linguistic Resource with the Linguistic Resource or an encrypted form of the Linguistic Resource creates a package that is a derivative of the Linguistic Resource because it contains portions of the Linguistic Resource rather than a work that uses the Linguistic Resource If the package is a derivative of the Linguistic Resource you may distribute the package under the terms of Section 4 Any works containing that package also fall under Section 4 4 As an exception to the Sections above you may also combine a work that uses the Linguistic Resource with the Linguistic Resource or an encrypted form of the Lin guistic Resource to produce a package containing portions of the Linguistic Resource and distribute that package under terms of your choice provided that the terms per mit modification of the package for the customer s own use and reverse engineering for debugging such modifications You must give prominent notice with each copy of the package that the Linguistic Resource is used in it and
171. he first and the last constituent respectively. 10.1.2 Lexicalized vs. Grammar-Based Approach to Morphological Description: A previous study [63] has confirmed the status of MWUs as units on the frontier between morphology and syntax. Their compound structure suggests productivity, which can hardly be processed without a grammar-based approach. However, some of their morphological, syntactic and semantic properties exclude their processing merely in terms of the properties of their constituents. For example, in both examples below (• chief justice • lord justice) there are few automatically accessible hints indicating that the former is morphologically a standard English Noun Noun phrase taking an s at its last constituent in the plural, while the plural of the latter has three variants: • chief justices • lord justices, lords justice, lords justices. Thus, at least one of the above examples has to be considered as lexicalized in order for the automatic morphological processing to be reliable. MULTIFLEX implements a unification-based formalism for the description of the inflectional behavior of MWUs, presented in [64]. Its features are described in section 10.2. This formalism requires the description to be fully lexicalized: each MWU listed in a dictionary obtains a code (e.g. NC_NN, NC_NN2, etc.) representing its inflectional paradigm, for instance in the DELA-like format: aircraft ca
172. he reason we use the ordinary General Public License for many libraries However the Lesser license provides advantages in certain special circumstances For example on rare occasions there may be a special need to encourage the widest possible use of a certain library so that it becomes a de facto standard To achieve this non free programs must be allowed to use the library A more frequent case is that a free library does the same job as widely used non free libraries In this case there is little to gain by limiting the free library to free software only so we use the Lesser General Public License In other cases permission to use a particular library in non free programs enables a greater number of people to use a large body of free software For example permission to use the GNU C Library in non free programs enables many more people to use the whole GNU operating system as well as its variant the GNU Linux operating system Although the Lesser General Public License is Less protective of the users freedom it does ensure that the user of a program that is linked with the Library has the freedom and the wherewithal to run that program using a modified version of the Library The precise terms and conditions for copying distribution and modification follow Pay close attention to the difference between a work based on the library and a work that uses 12 11 VARIOUS OTHER FILES 249 the library The former contains code derived
173. he removal of a path through that variable It is possible to swap the meaning of these signs by typing an exclamation mark in front of the symbol In that case the path is removed when there is a sign and keeped where there is a one In all other cases the variable is replaced by the content of the table cell The special variable is replaced by the number of the line in the table The fact that its value is different for each line allows for its use as a simple characterization of a line That variable is not affected by an exclamation point to the left of it Figure 8 3 shows an example of a parameterized graph designed to be applied to the lexicon grammar table 31H presented in figure 8 4 154 CHAPTER 8 LEXICON GRAMMAR le verbe n ne v rifie pas la propri t de la colonne A lt p v gt vers av NO V vers N Figure 8 3 Example of parameterized graph G Y_31H OpenOffice org Calc Fichier diter Afficher Ins rer Format Outils Donn es Fen tre Aide x A SHaABIABSRIFZ LBS Soe 1ABNHIBYIMOBEQIO l dl arial We Hei siFSeSeSmB Ml EEA lt OPT gt E AVOIT AUX abandonner Paul agabandonn s abuser Max abuse acquiescer Max aSacquiesc E de adouber Paul adoube checs agioter Max agiote sur les chan agoniser Maxagonises archaiser Cet auteur archaise volc arquer Max agarqu stoute la jou arriver Max estgarriv s atermoyer Max atermoie badaude
174. head gt lt meta http equiv Content Type content text html charset UTF 8 gt 12 7 TEXT DICTIONARIES 227 lt style type text css gt a blue color blue text decoration underline a red color red text decoration underline a green color green text decoration underline lt style gt lt head gt lt body gt lt h4 gt lt font color blue gt Blue lt font gt identical sequences lt br gt lt font color red gt Red lt font gt similar but different sequences lt br gt lt font color green gt Green lt font gt sequences that occur in only one of the two concordances lt br gt lt table border 1 cellpadding 0 style font family Courier new font size 12 gt lt tr gt lt td width 450 gt lt font color blue gt ed in ancient times lt u gt a large forest lt u gt covering the greater par lt font gt lt td gt lt td width 450 gt lt font color blue gt ed in ancient times lt u gt a largeforest lt u gt covering the greater par lt font gt lt td gt lt tr gt lt tr gt lt td width 450 gt lt font color green gt ge forest covering lt u gt the greater part lt u gt nbsp of the beautiful hills lt font gt lt td gt lt td width 450 gt lt font color green gt lt font gt lt td gt lt tr gt lt table gt lt body gt lt html gt 12 7 Text dictionaries The Dico program produces severa
175. hem sha d not dizzied thine understanding thou mightst know Clericus clericum non decimat 5 that is thine understanding thou mightst know Clericus clericum non decimat 5 that is to say we ch derstanding thou mightst know Clericus clericum non decimat 5 that is to say we churchmen d thou mightst know Clericus clericum non decimat 5 that is to say we churchmen do not exhaust ointed servants It is true replied Wamba that I being but an ass am nevertheless hon o How call d you your Franklin Prior Aymer Cedric answered the Prior 3 Cedric the Sa all d you your Franklin Prior Aymer Cedric answered the Prior 5 Cedric the Saxon T mer Cedric answered the Prior 5 Cedric the Saxon Tell me good fellow are we near road will be uneasy to find answered Gurth who broke silence for the first time and the f Figure 4 1 Result of the search for lt DIC gt Here are some examples of lexical masks with the different types of constraints e lt A Hum fs gt a non human adjective in the feminine singular e lt lire V P F gt the verb lire in the present or future tense 61 e lt suis suivre V gt the word suis as inflected form of the verb suivre as opposed to the form of the verb tre e lt facteur N Hum gt all nominal entries that have facteur as canonical form and that do not have the semantic code Hum e lt ADV gt all words that are not adverbs e lt MOT g
176. hese names by the longest prefixes made of letters if you have selected the Remove class numbers button Thus N4 is replaced by N By choosing the inflectional grammar names carefully one can construct a ready to use dictionary Let s have a look at the dictionary we get after the DELAS inflection in our example 3 4 2 Inflection of compound words See chapter 10 3 4 3 Inflection of semitic languages Semitic languages like Arabic or Hebrew are not inflected in the same way than other kinds of languages since their morphology obey a different logic In fact in such languages words are inflected according to consonant skeletons A lemma is made of consonants and the inflection process is supposed to enrich this skeleton with vowels Moreover as some agglutinative phenomena can occur the content of a semitic inflection grammar is interpre tated in a special way 48 CHAPTER 3 DICTIONARIES aviatrices aviatrix N Hum p aviatrix aviatrix N Hum s matrices matrix N Math p matrix matrix N Math s radices radix N p radix radix N s Figure 3 7 Result of automatic inflection First let us see what a semitic entry is supposed to be ktb V31 123 The sign before the grammatical code indicates that this is a semitic entry and the lemma here ktb is the consonant skeleton Figure 3 8 shows the toy grammar V31 123 grf that illustrates how the semitic inflection process works KKK yal o203u _ __ da 1D IM
177. hey will apply to the already partially disambiguated automaton, which makes it possible to accumulate the effects of several grammars. 7.3.4 Grammar collections: It is possible to gather several ELAG grammars into a grammar collection in order to compile and apply them in one step. The sets of ELAG grammars are described in .lst files. They are managed through the window for compiling ELAG grammars (figure 7.16). The label on the top left indicates the name of the current collection, by default elag.lst. The contents of this collection are displayed in the right part of the window. To modify the name of the collection, click on the Browse button. In the dialog box that appears, enter the .lst file name for the collection. To add a grammar to the collection, select it in the file explorer in the left frame and click on the button. Once you have selected all your grammars, compile them by clicking on the Compile button. This will create a .rul file bearing the name indicated at the bottom right (the name of the file is obtained by replacing .lst by .rul). You can now apply your grammar collection. As explained above, click on the Apply Elag Rule button in the text automaton window. When the dialog asks for the .rul file to use, click on the Browse button and select your collection. The resulting automaton is identical to that which would have been obtained by applying each grammar successively. 7.3.5 Window For ELAG Processing
178. hological dictionaries Directories Language amp Presentation Display Colors v Date Background File Name Foreground Pathname Auxiliary Nodes w Frame Selected Nodes Comment Nodes Antialiasing _ Enable antialising for rendering graphs Icon Bar Position West O North East South None Fonts wm Bang 10 Reset to Default Output Arial Unicode MS 12 Figure 5 27 Default preferences configuration the image in the clipboard is too large and asks if you want to enlarge the image Click on Yes You can now edit the screen image Select the area that interests you To do so switch to the select mode by clicking on the dashed rectangle symbol in the upper left corner of the window You can now select the area of the image using the mouse When you have selected the zone press lt Ctrl C gt Your selection is now in the clipboard you can now just go to your document and press lt Ctrl V gt to paste your image On Linux Take a screen capture for example using the program xv Edit your image at once using a graphic editor for example TheGimp and paste your image in your document in the same way as in Windows 5 4 EXPORTING GRAPHS 91 Vector graphics If you prefer vector graphics you can save your graph under the SVG file format which is editable with softwares like the Open Source one Inkscape 24 With this software you can obtain PostScript exports ready to u
179. hts 239 240 CHAPTER 12 FILE FORMATS We protect your rights with two steps 1 copyright the software and 2 offer you this license which gives you legal permission to copy distribute and or modify the software Also for each author s protection and ours we want to make certain that everyone un derstands that there is no warranty for this free software If the software is modified by someone else and passed on we want its recipients to know that what they have is not the original so that any problems introduced by others will not reflect on the original authors reputations Finally any free program is threatened constantly by software patents We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses in effect making the program proprietary To prevent this we have made it clear that any patent must be licensed for everyone s free use or not licensed at all The precise terms and conditions for copying distribution and modification follow TERMS AND CONDITIONS FOR COPYING DISTRIBUTION AND MODIFICATION 0 This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License The Program below refers to any such program or work and a work based on the Program means either the Program or any derivative work under copyright law that is to say a work
180. hungry as a wolf gladnog kao vuk gladan kao vuk AC_A3XN2 s2mgda hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 s2mgka hungry as a wolf gladne kao vuk gladan kao vuk AC_A3XN2 s2fgea hungry as a wolf gladnoga kao vuk gladan kao vuk AC_A3XN2 s2ngda hungry as a wolf gladnog kao vuk gladan kao vuk AC_A3XN2 s2ngda hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 s2ngka hungry as a wolf gladnome kao vuk gladan kao vuk AC_A3XN2 s3mgda hungry as a wolf gladnom kao vuk gladan kao vuk AC_A3XN2 s3mgda hungry as a wolf gladnu kao vuk gladan kao vuk AC_A3XN2 s3mgka hungry as a wolf gladnoj kao vuk gladan kao vuk AC_A3XN2 s3fgea hungry as a wolf gladnome kao vuk gladan kao vuk AC_A3XN2 s3ngda hungry as a wolf gladnom kao vuk gladan kao vuk AC_A3XN2 s3ngda hungry as a wolf gladnu kao vuk gladan kao vuk AC_A3XN2 s3ngka hungry as a wolf gladnu kao vuk gladan kao vuk AC_A3XN2 s4fgea hungry as a wolf gladno kao vuk gladan kao vuk AC_A3XN2 s4ngea hungry as a wolf gladni kao vuk gladan kao vuk AC_A3XN2 s5mgea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 s5fgea hungry as a wolf gladno kao vuk gladan kao vuk AC_A3XN2 s5ngea hungry as a wolf gladnim kao vuk gladan kao vuk AC_A3XN2 s6mgea hungry as a wolf 190 gladnom gladnim gladnome gladnom kao vu gladnu kao vuk gladnoj kao vu gladnome gladnom Kao vu Kao vu Kao vu k gladan kao vu k gladan kao vu k gladan kao vu gladan kao vuk k gladan
181. ial sequences NOTE The characters between lt and gt or between and are not interpreted Thus the Character in sequence le lt A Conc gt is not interpreted as a line separator since the pattern lt A Conc gt is interpreted with priority X and Y represent the coordinates of the box in pixels Figure 12 1 shows how these coordi nates are interpreted by Unitex 0 0 x y Figure 12 1 Interpretation of the coordinates of boxes N represents the number of outgoing transitions of the box This number is always 0 for the final state The transitions are defined by the number of their target box Every line of the box definition ends with a newline 12 3 2 Format fst2 An fst2 file is a text file that describes a set of graphs Here is an example of an fst2 file 00000000024 1 NP 12 3 GRAPHS 221 1 14 2 2 22 3 349 t 4 9 2 Adj 6151414 t 4 4 KEEN the DETY lt A gt ADIF S lt N gt Sniceq pretty smallq DN The first line represents the number of graphs that are encoded in the file The beginning of each graph is identified by a line that indicates the number and the name of the graph 1 NP and 2 Adj in the file above The following lines describe the states of the graph If the state is final the line starts with the t character and with the character if not For each state the list of transitions is a possibly empty sequence of pairs of integers e the first in
182. if any. For instance, the sequence LLUx applied to the word mang s produces the inflected form mangex, since U has turned the er into a e. In the example below, the inflection of choose is shown. The sequence LLDRRn describes the form chosen: • Step 0: the canonical form is copied on the stack, and the cursor is set behind the last letter. • Step 1: the cursor is moved one position to the left. • Step 2: the cursor is moved one position to the left again. • Step 3: one character is deleted; everything to the right of the cursor is shifted one position to the left. • Step 4: the cursor is moved to the right. • Step 5: and to the right again. • Step 6: the character n is pushed on the stack. When all operations have been fulfilled, the inflected form consists of all letters before the cursor (here chosen). The inflection program explores all paths of the inflectional grammar and tries all possible forms. In order to avoid having to replace the names of inflectional grammars by the real grammatical codes in the dictionary used, the program replaces t
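The cursor-and-stack walk-through for choose can be replayed with a short Python sketch. It is an illustration under stated assumptions, not the code of the inflection program: the function name `apply_inflection_code` is hypothetical, D is taken to delete the character immediately to the left of the cursor (which is what the choose example implies), a pushed letter is written at the cursor position, and the operators C and U are deliberately left out.

```python
def apply_inflection_code(lemma, code):
    """Apply a sequence of inflection operators to a canonical form.

    Assumed semantics, inferred from the 'choose' example only:
      L     move the cursor one position to the left
      R     move the cursor one position to the right
      D     delete the character left of the cursor, shifting the rest
      other write the character at the cursor and advance the cursor
    """
    stack = list(lemma)
    cursor = len(stack)  # Step 0: cursor behind the last letter
    for op in code:
        if op == "L":
            cursor = max(cursor - 1, 0)
        elif op == "R":
            cursor = min(cursor + 1, len(stack))
        elif op == "D":
            if cursor > 0:
                del stack[cursor - 1]
                cursor -= 1
        elif op in ("C", "U"):
            # These operators exist but are outside the scope of this sketch.
            raise NotImplementedError(f"operator {op!r} is not modelled here")
        else:
            if cursor < len(stack):
                stack[cursor] = op
            else:
                stack.append(op)
            cursor += 1
    # The inflected form is made of all letters before the cursor.
    return "".join(stack[:cursor])


print(apply_inflection_code("choose", "LLDRRn"))
```

Under these assumptions the call reproduces the seven steps listed above and yields chosen for the code LLDRRn.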
183. igure 5 14 number Figure 5 14 Example of a transducer The output associated with a box is represented in bold text below it 5 2 5 Using Variables It is possible to select parts of a text sequence recognized by a grammar using variables To associate a variable var1 with parts of a grammar use the special symbols var1 and var1 to define the beginning and the end of the part to store Create two boxes contain ing one varl and the second var1 These boxes must not contain anything but the variable name preceded by and followed by a parenthesis Then link these boxes to the zone of the grammar to store In the graph in figure 5 15 you see a sequence of digits before dollar or dollars This sequence will be stored in a variable named var1 2 1 Sg VALUE varl var Li Figure 5 15 Using the variable var Variable names may contain latin letters without accents upper or lower case numbers or the _ underscore character Unitex distinguishes between uppercase and lowercase Characters When a variable is defined you can use it in transducer outputs by surrounding its name with The grammar in figure 5 16 recognizes a date formed by a month and a year and produces the same date as an output but in the order year month If you want to use the character in the output of a box you have to double it as shown on figure 5 15 5 2 6 Copying lists It can be practical to perform a copy paste
184. in 2 bytes. In Little Endian, the bytes are in lo byte, hi byte order. If this order is reversed, we speak of Big Endian. A text file encoded in Unicode Little Endian starts with the special character with the hexadecimal value FEFF. The newline symbols have to be encoded by the two characters 000D and 000A. Consider the following text: Unitex (newline) β-version (newline). Here is its representation in Unicode Little Endian: header FFFE; U 5500; n 6E00; i 6900; t 7400; e 6500; x 7800; newline 0D000A00; β B203; - 2D00; v 7600; e 6500; r 7200; s 7300; i 6900; o 6F00; n 6E00; newline 0D000A00. Table 12.1: Hexadecimal representation of a Unicode text. The hi bytes and lo bytes have been reversed, which explains why the start character is encoded as FFFE instead of FEFF, and 000D and 000A are 0D00 and 0A00 respectively. 12.2 Alphabet files: There are two kinds of alphabet files: a file which defines the characters of a language and a file that indicates the sorting preferences. The first is designated by the name alphabet, the second by the name sorted alphabet. 12.2.1 Alphabet: The alphabet file is a text file that describes all characters of a language as well as the correspondences between capitalized and non-capitalized letters. This file is called Alphabet.txt and is found in the root of the directory of a language. Its presence is obligatory for Unitex to fu
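The byte layout of Table 12.1 can be checked with a few lines of Python. The sketch below is only an illustration: the helper names `encode_unitex_text` and `hex_words` are hypothetical, and the only library facilities used are the standard `codecs.BOM_UTF16_LE` constant and the `utf-16-le` codec.

```python
import codecs


def encode_unitex_text(text):
    """Encode text as Unicode Little Endian with a leading FEFF marker.

    Newlines are normalized to the two characters 000D 000A, as the
    format requires.
    """
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return codecs.BOM_UTF16_LE + normalized.encode("utf-16-le")


def hex_words(data):
    """Split the bytes into 2-byte groups shown as uppercase hexadecimal."""
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


words = hex_words(encode_unitex_text("Unitex\n\u03b2-version\n"))
print(" ".join(words))
```

The first group printed is FFFE, which is the FEFF marker with its two bytes swapped, and the Greek letter appears as B203 for the same reason.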
185. in the Extract units frame Figure 6 48 Then click on Extract 122 CHAPTER 6 ADVANCED USE OF GRAPHS Concordance D My UnitexEnglishiCorpuslivanhoe_snticoncord html gt o SI D ted of yore the fabulous Dragon of Wantley 5 here were fought many of lt 4 b D My Unitex EnglishiCorpus ivanhoe snt cog D 2343 sentence delimiters 186612 9300 diff tokens 83774 9274 simple forms 25 9 di 81970 occurrences 13284 DLF entries simple words 273 occurrences 274 DLC entries 5 IN THAT PLEASANT DISTRICT of merry England which is vatered by the river Don there extended in ancient times a large forest covering the greater part of the beautiful hills and valleys which lie between Sheffield and the pleasant town of Doncaster 5 The remains of this extensive wood are still to be seen at the noble seats of Wentworth of Varncliffe Park and around Rotherham S Here haunted of yore the fabulous Dragon of Wantley 5 here were fought many of the most desverate battles during the Civil Wars of the Figure 6 49 Selection of an occurrence in the text matching units At the opposite if you click on Extract unmatching units all sentences that do not contain any match will be extracted 6 8 5 Comparing concordances With the Show differences with previous concordance option you can compare the current concordance with the previous one The ConcorDiff program builds both concordances according to text or
186. information that will be produced. matrix matrices Figure 6.1: Example of an inflectional grammar. The paths may contain operators and letters. The possible operators are represented by the characters L, R, C and D. All letters that are not operators are characters. The only allowed special symbol is the empty word <E>. It is not possible to refer to information in dictionaries in an inflection transducer, but it is possible to reference subgraphs. Transducer outputs are concatenated in order to produce a string of characters. This string is then appended to the produced dictionary entry. Outputs with variables do not make sense in an inflection transducer. Case of letters is respected: lowercase letters stay lowercase, the same for uppercase letters. Besides, the connection of two boxes is exactly equivalent to the concatenation of their contents together with the concatenation of their outputs (cf. figure 6.2). Figure 6.2: Two equivalent paths in an inflection grammar. Inflection transducers may be compiled before being used by the inflection program. If not, the inflection program will compile them on the fly. For more details see section 3.4. 6.1.2 Preprocessing graphs: Preprocessing graphs are meant to be applied to texts before they are tokenized into lexical units. These graphs can be used for inserting or replacing sequences in the texts. The two customary uses of
187. into the text. The path shown at the top of figure 2.9 recognizes the sequence consisting of a question mark and a word beginning with a capital letter, and inserts the symbol {S} between the question mark and the following word. The following text: "What time is it? Eight o'clock." will be converted to: "What time is it ?{S} Eight o'clock." A grammar for end-of-sentence detection may use the following special symbols: • <E>: empty word, or epsilon. Recognizes the empty sequence; • <MOT>: recognizes any sequence of letters; • <MIN>: recognizes any sequence of letters in lower case; • <MAJ>: recognizes any sequence of letters in upper case; • <PRE>: recognizes any sequence of letters that begins with an upper case letter; • <NB>: recognizes any sequence of digits (1234 is recognized but not 1 234); • <PNC>: recognizes the punctuation symbols ; , ! ? : and the inverted exclamation points and question marks in Spanish and some Asian punctuation letters; • <^>: recognizes a newline; • #: prohibits the presence of a space. [Figure: sentence delimitation graph "Placement des marques de séparation de phrases" (placement of sentence separation marks)]
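To make the letter and digit symbols listed above more concrete, here is a minimal Python sketch that reports which of <MOT>, <MIN>, <MAJ>, <PRE> and <NB> a given token would satisfy. It is an approximation under stated assumptions: the function name `matching_symbols` is hypothetical, and "letter" is approximated with Python's `str.isalpha`, whereas Unitex decides what counts as a letter from the alphabet file of the language.

```python
def matching_symbols(token):
    """Return the special symbols a token satisfies, in a fixed order.

    Letters are approximated with str.isalpha(); this is an assumption
    of the sketch, not the rule used by Unitex itself.
    """
    symbols = []
    if token.isalpha():
        symbols.append("<MOT>")        # any sequence of letters
        if token.islower():
            symbols.append("<MIN>")    # all lower case
        if token.isupper():
            symbols.append("<MAJ>")    # all upper case
        if token[0].isupper():
            symbols.append("<PRE>")    # begins with an upper case letter
    elif token.isascii() and token.isdigit():
        symbols.append("<NB>")         # any sequence of digits
    return symbols


for tok in ["Eight", "clock", "NATO", "1234", "1 234"]:
    print(tok, matching_symbols(tok))
```

The last token contains a space, so it is neither a sequence of letters nor a sequence of digits and satisfies none of the five symbols, in line with the 1 234 remark above.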
188. ion NC_NXXXX p Adam s apple Adam s apple NC_XXXXN s Adam s apples Adam s apple NC_XXXXN p air brake air brake NC_XXN s air brakes air brake NC_XXN p date of birth birth date NC_NN_NofN s dates of birth birth date NC_NN_NofN p birth date birth date NC_NN_NofN s birth dates birth date NC_NN_NofN p criminal police criminal police NC_XXXinv p cross roads cross roads NC_XXNs s cross roads cross roads NC_XXNs p eads of government head of government NC_NofNs p eads of governments head of government NC_NofNs p h h head of government head of government NC_NofNs s n n n otaries public notary public NC_NsNs p otary public notary public NC_NsNs s otary publics notary public NC_NsNs p rolling stone rolling stone NC_XXN s rolling stones rolling stone NC_XXN p students union student union NC_Ns N s students unions student union NC_Ns N p students union student union NC_Ns N s students unions student union NC_Ns N p student union student union NC_Ns N s student unions student union NC_Ns N p Figure 10 10 Inflection graph N1 for En Figure 10 11 Inflection graph N3 for English glish simple words simple words 10 3 INTEGRATION IN UNITEX 181 e g angle of reflection lt Nb n gt Figure 10 12 Inflection graph NC_NXXXX for English MWUs e g advance booking office Figure 10 13 Inflection graph NC_XXXXN fo
189. is interactive make it output a short notice like this when it starts in an interactive mode Gnomovision version 69 Copyright C yyyy name of author Gnomovision comes with ABSOLUTELY NO WARRANTY for details type show w This is free software and you are welcome to redistribute it under certain condi tions type show c for details 12 11 VARIOUS OTHER FILES 245 The hypothetical commands show w and show c should show the appropriate parts of the General Public License Of course the commands you use may be called something other than show wand show c they could even be mouse clicks or menu items whatever suits your program You should also get your employer if you work as a programmer or your school if any to sign a copyright disclaimer for the program if necessary Here is a sample alter the names Yoyodyne Inc hereby disclaims all copyright interest in the program Gnomovision which makes passes at compilers written by James Hacker signature of Ty Coon 1 April 1989 Ty Coon President of Vice This General Public License does not permit incorporating your program into propri etary programs If your program is a subroutine library you may consider it more useful to permit linking proprietary applications with the library If this is what you want to do use the GNU Library General Public License instead of this License 246 CHAPTER 12 FILE FORMATS Appendix B GNU Lesser General Public
190. ists the ELAG grammars that correspond to a given elg file 12 10 CONFIGURATION FILES 233 elg file names are surrounded with angles brackets The lines that start with a tabulation are considered as comments by the Elag program Here is the elag rul file used for French lt elag rul 0 elg gt f PPVs PpvIL elgf PPVs PpvLE elgY PPVs PpvLUI elgf PPVs PpvPR elg PPVs PpvSeq elg PPVs SE elg PPVs postpos elgq lt elag rul 1 elg gt 4 12 10 Configuration files 12 10 1 The Config file Whenever the user modifies his preferences for a given languages these modifications are saved in a text file named Config which can be found in the directory of the current lan guage The file has the following syntax the order of lines can vary Unitex configuration file of paumier for English Y Fri Oct 10 15 18 06 CEST 20084 TEXT FONT NAME Courier New TEXT FONT STYLE 0Y TEXT FONT SIZE 10Y CONCORDANCE FONT NAME Courier new CONCORDANCE FONT HTML SIZE 124 OU OU O FI RI BACKGROUND COLOR 14 FOREGROUND COLOR 167772164 AUXILIARY NODES COLOR 32896514 COMMENT NODES COLOR 655364 SELECTED NODES COLOR 167769614 INPUT FONT NAME Times New Roman INPUT FONT STYLE 04 INPUT FONT SIZE 104 TPUT FONT NAME Arial Unicode MS TPUT FONT STYLE 14 UTPUT FONT SIZE 124 DATE t rueq LEX NAME t rue PATH NAME falseY FRAME trueY
191. itex are distributed under the LGPLLR license [40]. Full text versions of GPL, LGPL and LGPLLR can be found in the appendices of this manual.

1.2 Java runtime environment

Unitex consists of a graphical interface written in Java and external programs written in C/C++. This mixture of programming languages is responsible for a fast and portable application that runs on different operating systems.

Before you can use the graphical interface, you first have to install the runtime environment, usually called Java virtual machine or JRE (Java Runtime Environment).

For the graphical mode, Unitex needs Java version 1.6 (or newer). If you have an older version of Java, Unitex will stop after you have chosen the working language.

You can download the virtual machine for your operating system for free from the Sun Microsystems web site [54] at the following address: http://java.sun.com.

If you are working under Linux or MacOS, or if you are using a Windows version with personal user accounts, you have to ask your system administrator to install Java.

1.3 Installation on Windows

If Unitex is to be installed on a multi-user Windows machine, it is recommended that the system administrator perform the installation. If you are the only user on your machine, you can perform the installation yourself.

Decompress the file unitex_2.0.zip (You can download this file from the following address: http
192. l files that represent text dictionaries.

12.7.1 dlf and dlc

dlf and dlc are simple and compound word dictionaries in the DELAF format (see section 3.1.1).

12.7.2 err

This file is made of unknown words, one per line.

12.7.3 tags.ind

This file has the same format as a concord.ind file obtained in MERGE or REPLACE mode, but its header is T. Note that the outputs DO NOT BEGIN with a slash.

12.8 Dictionaries

The compression of the DELAF dictionaries by the Compress program produces two files: a .bin file that represents the minimal automaton of the inflected forms of the dictionaries, and a .inf file that contains the compressed forms required for the construction of the dictionaries from the inflected forms. This section describes the format of these two file types, as well as the format of the CHECK_DIC.TXT file, which contains the result of the verification of a dictionary.

12.8.1 The .bin files

A .bin file is a binary file that represents an automaton. The first 4 bytes of the file represent an integer that indicates the size of the file in bytes. The states of the automaton are encoded in the following way:

• the first two bytes indicate if the state is final as well as the number of its outgoing transitions. The highest bit is 0 if the state is final, 1 if not. The other 15 bits encode the number of transitions. Example: a non final state with 17 transitions is encoded by the hexadecimal sequ
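The two-byte state header described above can be illustrated with a small sketch. It works on the 16-bit value itself rather than on raw bytes, because the byte order used in the file is not restated in this excerpt; the function names are hypothetical and not taken from Unitex.

```python
FINAL_FLAG = 0x8000        # highest bit: 0 means final, 1 means non-final
COUNT_MASK = 0x7FFF        # remaining 15 bits: number of outgoing transitions


def encode_state_header(is_final, transition_count):
    """Pack the finality flag and the transition count into one 16-bit value."""
    if not 0 <= transition_count <= COUNT_MASK:
        raise ValueError("transition count must fit in 15 bits")
    flag = 0 if is_final else FINAL_FLAG
    return flag | transition_count


def decode_state_header(value):
    """Return (is_final, transition_count) for a 16-bit header value."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("header must be a 16-bit value")
    return (value & FINAL_FLAG) == 0, value & COUNT_MASK


header = encode_state_header(is_final=False, transition_count=17)
is_final, count = decode_state_header(header)
```

Note that the flag is inverted with respect to what one might expect: a set high bit marks a state that is not final.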
193. l form by a period. If there are more codes, these are separated by the + character.

• :p is an inflectional code which indicates that the noun is plural. Inflectional codes are used to describe gender, number, declination and conjugation. This information is optional. An inflectional code is made up of one or more characters that represent one information each. Inflectional codes have to be separated by the : character, for instance in an entry like the following:

hang,.V:W:P1s:P2s:P1p:P2p:P3p

The : character is interpreted in OR semantics. Thus, :W:P1s:P2s:P1p:P2p:P3p means "infinitive", or "1st person singular present", or "2nd person singular present", etc. (see table 3.3). Since each character represents one information, it is not necessary to use the same character more than once. In this way, encoding the past participle using the code :PP would be exactly equivalent to using :P alone;

• /this is an example is a comment. Comments are optional and are introduced by the / character. These comments are left out when the dictionaries are compressed.

IMPORTANT REMARK: It is possible to use the full stop and the comma within a dictionary entry. In order to do this, they have to be escaped using the \ character:

1\,000,one thousand.NUMBER
United Nations,U\.N\..ACRONYM

WARNING: Each character is taken into account within a dictionary line. For example, if you insert spaces, they are considered to be a part of the information. In the follo
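The entry layout described above (inflected form, comma, canonical form, period, codes, optional colon-separated inflectional codes, optional comment) can be sketched as a small parser. This is only an illustration under assumptions: the separators are taken to be "," "." "+" ":" "/" with "\" as the escape character, an empty canonical form is taken to mean "same as the inflected form", and the function names are hypothetical.

```python
def find_unescaped(text, target, start=0):
    """Index of the first occurrence of target not preceded by a backslash, or -1."""
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2          # skip the backslash and the character it protects
            continue
        if text[i] == target:
            return i
        i += 1
    return -1


def unescape(text):
    """Drop each escaping backslash, keeping the character that follows it."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 1
        out.append(text[i])
        i += 1
    return "".join(out)


def parse_delaf_line(line):
    comma = find_unescaped(line, ",")
    if comma < 0:
        raise ValueError("missing comma after the inflected form")
    dot = find_unescaped(line, ".", comma + 1)
    if dot < 0:
        raise ValueError("missing period after the canonical form")
    inflected = unescape(line[:comma])
    lemma = unescape(line[comma + 1:dot]) or inflected  # assumption: empty means identical
    rest = line[dot + 1:]
    slash = find_unescaped(rest, "/")
    comment = rest[slash + 1:] if slash >= 0 else None
    if slash >= 0:
        rest = rest[:slash]
    codes, *inflections = rest.split(":")
    return {
        "inflected": inflected,
        "lemma": lemma,
        "codes": codes.split("+"),
        "inflections": inflections,   # alternatives, read with OR semantics
        "comment": comment,
    }


verb = parse_delaf_line("hang,.V:W:P1s:P2s:P1p:P2p:P3p")
acronym = parse_delaf_line(r"United Nations,U\.N\..ACRONYM")
number = parse_delaf_line(r"1\,000,one thousand.NUMBER")
```

The scan for the comma and the period has to skip escaped characters first; splitting the raw line on "," or "." would break entries such as the two escaped examples.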
194. le adjectives without inflectional features. The problem is that if one wants to refer exclusively to this type of adjectives in a disambiguation grammar, the <A> symbol is not appropriate, since it will recognize all adjectives. To circumvent this difficulty, it is possible to deny an inflectional attribute by writing the @ character right before one of the possible values for this attribute. Thus, the <A:@m@p> symbol recognizes all the adjectives which have neither a gender nor a number. Using this operator, it is possible to write grammars like the one in figure 7.19, which imposes agreement in gender and number between a noun and an adjective that follows it. This grammar will preserve the correct analysis of sentences like: Les personnes de bonne humeur m'insupportent.

It is however recommended to limit the use of the @ operator, because it harms the legibility of the grammars. It is preferable to distinguish the labels which accept various inflectional combinations by means of discriminating subcategories defined in the discr part.

Figure 7.19: ELAG grammar that verifies gender and number agreement

Optional Codes

The optional syntactic and semantic codes are declared in the cat part. They can be used in ELAG grammars like other codes. The difference is that these codes do not intervene to

Footnote: This grammar is not completely correct, because it eliminates for example the correct analysis of the sentence: J'ai reçu des coups
195. lieved to be a consequence of the rest of this License.

9. If the distribution and/or use of the Linguistic Resource is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Linguistic Resource under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

10. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License for Linguistic Resources from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Linguistic Resource specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Linguistic Resource does not specify a license version number, you may choose any version ever published by the Free Software Foundation.

11. If you wish to incorporate parts of the Linguistic Resource into other free programs whose distribution conditions are incompatible with these, write to the author to ask
196. lize.

WARNING 2: whenever a parameter contains spaces, it needs to be enclosed in quotation marks so it will not be considered as multiple parameters.

11.1 CheckDic

CheckDic [OPTIONS] dic

This program carries out the verification of the format of a dictionary of DELAS or DELAF type. The parameter dic corresponds to the name of the dictionary that is to be verified.

OPTIONS:

• -f/--delaf: checks an inflected dictionary;
• -s/--delas: checks a non inflected dictionary.

The program checks the syntax of the lines of the dictionary. It also creates a list of all characters occurring in the inflected and canonical forms of words, the list of grammatical and syntactic codes, as well as the list of inflectional codes used. The results of the verification are stored in a file called CHECK_DIC.TXT.

11.2 Compress

Compress [OPTIONS] dictionary

OPTIONS:

• -f/--flip: indicates that the inflected and canonical forms should be swapped in the compressed dictionary. This option is used to construct an inverse dictionary, which is necessary for the program Reconstrucao.

This program takes a DELAF dictionary as a parameter and compresses it. The compression of a dictionary dico.dic produces two files:

• dico.bin: a binary file containing the minimum automaton of the inflected forms of the dictionary;
• dico.inf: a text file containing the compressed forms required for the reconstructi
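The warning about parameters that contain spaces can be illustrated with a short sketch that builds a CheckDic command line. It only constructs and prints the command, it does not run it. The helper name build_command and the dictionary path are hypothetical, and the long option spelling --delaf is an assumption about how the f/delaf option is written.

```python
import shlex


def build_command(program, options, *params):
    """Return the argument list and a shell-safe rendering of the same command."""
    argv = [program, *options, *params]
    return argv, shlex.join(argv)   # shlex.join needs Python 3.8 or newer


# Hypothetical dictionary path containing a space.
argv, shown = build_command(
    "CheckDic", ["--delaf"], "My Unitex/English/Dela/agreeably.dic"
)
print(shown)
```

Passing the argument list (rather than one concatenated string) to a process launcher such as subprocess.run keeps the path as a single parameter without any manual quoting; the quoted rendering is only needed when the command is typed into a shell.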
197. load This type of description has also been applied to adjectives [53], predicative nouns [30, 31], adverbs [37, 55], as well as frozen expressions, in many languages [14, 25, 26, 58, 59, 62, 66, 67, 68].

Figure 8.1 shows an example of a lexicon-grammar table. The table contains verbs that, among other definitional properties, do not admit passivization.

[Figure 8.1: screenshot of a lexicon-grammar table. The rows list verbs (accepter, accueillir, accuser, admettre, affecter, afficher, aimer, approcher, arpenter, atteindre, avoir, avoisiner, ...) with example sentences such as "Ce salon accepte vingt personnes", "Max accuse 80 kilos" and "La plante aime l'eau".]
198. lock, it would be a bad idea to replace O' by on the, because a sentence like:

John O'Connor said: "it's 8 O'clock"

would be replaced by the following incorrect sentence:

John on the Connor said: "it's 8 on the clock"

Thus, one needs to be very careful when using the normalization grammar. One needs to pay attention to spaces as well. For example, if one replaces 're by are, the sentence:

You're stronger than him.

will be replaced by:

Youare stronger than him.

To avoid this problem, one should explicitly insert a space, i.e. replace 're by " are" (with a leading space).

The accepted symbols for the normalization grammar are the same as the ones allowed for the sentence splitting grammar. The normalization grammar is called Replace.fst2 and can be found in the following directory:

/(home directory)/(active language)/Graphs/Preprocessing/Replace

As in the case of sentence splitting, this grammar is applied using the Fst2Txt program, but in REPLACE mode, which means that input sequences recognized by the grammar are replaced by the output sequences that are produced. Figure 2.10 shows a grammar that normalizes verbal contractions in English.

Figure 2.10: Normalization of English verbal contractions

2.5.4 Splitting a text into tokens

Some languages, in particular Asian languages, use separators that are different
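The two pitfalls described above (a missing space in the replacement, and a rule that fires on forms it was not meant for) can be reproduced with plain string replacement. This is only an analogy: Unitex applies a compiled grammar with Fst2Txt, not Python string methods, and the helper name apply_rule is hypothetical.

```python
def apply_rule(text, source, target):
    """Replace every occurrence of source by target, like a naive rewrite rule."""
    return text.replace(source, target)


sentence = "You're stronger than him"
glued = apply_rule(sentence, "'re", "are")     # no space in the output: words get glued
spaced = apply_rule(sentence, "'re", " are")   # explicit leading space in the output

# A rule that is too general also rewrites the proper name.
quote = "John O'Connor said it's 8 O'clock"
overreach = apply_rule(quote, "O'", "on the ")

print(glued)
print(spaced)
print(overreach)
```

The second call shows the fix recommended in the text: the space belongs to the output sequence of the rule, it is not added automatically.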
199. lowercase letters can recognize both lowercase and uppercase letters.

The transducer outputs represent the sequences of labels that will be inserted into the text automaton. These labels can be dictionary entries or strings of characters. The labels that represent dictionary entries have to respect the DELAF format and must be enclosed by the { and } symbols. Outputs with variables do not make sense in this kind of graph.

You cannot use morphological filters, morphological mode or contexts. It is possible to reference subgraphs. It is not possible to reference information in dictionaries in order to describe the forms to normalize. The only special symbol that is recognized in this type of graph is the empty word <E>. The graphs for normalizing ambiguous forms need to be compiled before using them.

6.1.4 Syntactic graphs

Syntactic graphs, often called local grammars, allow you to describe syntactic patterns that can then be searched in the texts. Of all kinds of graphs, these have the greatest expressive power, because they allow you to refer to information in dictionaries.

Lower case/upper case variants may be used according to the principle described above. It is still possible to enforce respect of case by enclosing an expression in double quotes. The use of double quotes also allows you to enforce the respect of spaces. In fact, Unitex by default assumes that a space is possible between two boxes
200. mars
Editing graphs
5.2.1 Creating a graph
5.2.2 Sub-Graphs
5.2.3 Manipulating boxes
5.2.4 Transducers
5.2.5 Using Variables
5.2.6 Copying lists
5.2.7 Special Symbols
5.2.8 Toolbar Commands
Display options
5.3.1 Sorting the lines of a box
5.3.5 Display options, fonts and colors
5.4 Exporting graphs
5.4.1 Inserting a graph into a document
5.4.2 Printing a graph
6 Advanced use of graphs
6.1 Types of graphs
6.1.1 Inflection transducers
6.1.2 Preprocessing graphs
6.1.3 Graphs for normalizing the text automaton
6.1.4 Syntactic graphs
6.1.6 Parameterized graphs
6.2 Compilation of a grammar
6.2.1 Compilation of a graph
6.2.2 Approximation with a finite state transducer
6.2.3 Constraints on grammars
6.2.4 Error detection
6.3 Contexts
6.3.1 Right contexts
6.3.2 Left contexts
6.4 The morphological mode
6.4.3
201. n The star must be placed on the right hand side of the element in question. The expression this is very* cold recognizes this is cold, this is very cold, this is very very cold, etc. The star has a higher priority than the other operators. You have to use brackets in order to apply the star to a complex expression. The expression 0,(0+1+2+3+4+5+6+7+8+9)* recognizes a zero followed by a comma and by a possibly empty sequence of digits.

WARNING: It is prohibited to search for the empty word with a regular expression. If you try to search for (0+1+2+3+4+5+6+7+8+9)*, the program will raise an error as shown in figure 4.3.

Note: Messages with a colored background are generated by the interface, not by the external programs.

Figure 4.3: Error message when searching for the empty string (the log ends with "ERROR: the main graph regexp recognizes <E>")

4.7 Morphological filters

It is possible to apply morphological filters to the lexemes found. For that, it is necessary to immediately follow the lexeme found by a filter in double angle brackets:

lexical mask<<morphological pattern>>

The morphological filters are expressed as regular expressions in POSIX format (see [50] for the detailed syntax). Here are some examples of elementary fil
202. n be found in the user directory for the current language. By default, the first lines of this file for French look like this:

AÀÂÄaàâä
Bb
CÇcç
Dd
EÉÈÊËeéèêë

Figure 3.4: Results of checking

Characters in the same line are considered equivalent if the context permits. If two equivalent characters must be compared, they are sorted in the order they appear in, from left to right. As can be seen from the extract above, there is no difference between lower and upper case. Accents and the cedilla are ignored as well.

To sort a dictionary, open it and then click on "Sort Dictionary" in the "DELA" menu. By default, the program always looks for the file Alphabet_sort.txt. If that file doesn't exist, the sorting is done according to the character indices in the Unicode encoding. By modifying that file, you can define your own sorting order.

NOTE: After applying the dictionaries to a text, the files dlf, dlc and err are automatically sorted using this program.

3.4 Automatic inflection

3.4.1 Inflection of simple words
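The sorting rule described above can be sketched as a two-level sort key: characters on the same line of the sort alphabet compare as equal first, and only when two words are otherwise equal does the left-to-right position within the line decide. This is an illustrative reading of the rule, not the actual sorting program; the function name, the tiny alphabet and the fallback to code points for unlisted characters are assumptions.

```python
def build_sort_key(alphabet_lines):
    """Build a key function from the lines of a sort alphabet (one class per line)."""
    rank = {}
    for line_no, line in enumerate(alphabet_lines):
        for pos, ch in enumerate(line):
            rank.setdefault(ch, (line_no, pos))

    def key(word):
        # Primary level: the class (line) of each character; unlisted characters
        # are placed after all listed ones, ordered by code point (assumption).
        primary = tuple(
            (0, rank[c][0]) if c in rank else (1, ord(c)) for c in word
        )
        # Secondary level: position inside the line, used only to break ties.
        secondary = tuple(rank[c][1] if c in rank else 0 for c in word)
        return primary, secondary

    return key


# Hypothetical miniature sort alphabet in the style of the excerpt above.
lines = ["AÀaà", "Bb", "Cc", "EÉeé", "OÔoô", "Tt"]
words = ["côte", "cote", "Cote", "bac", "abc"]
sorted_words = sorted(words, key=build_sort_key(lines))
print(sorted_words)
```

With this key, case and accents are ignored while words differ elsewhere, and they only matter as a tie-breaker, which matches the behaviour described for equivalent characters.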
203. n feature is assigned to entries that are part of the personal pronoun sub-category, but not to relative pronouns. These dependencies are described in the complete part.

• complete: this part describes the inflectional part of the tags of the words in the current part of speech. Each line describes a valid combination of inflectional codes by their discriminating sub-category (if such a category was declared). If an attribute name is specified in angle brackets (< and >), this signifies that any value of this attribute may occur. It is possible as well to declare that an entry does not take any inflectional feature by means of a line containing only the _ character (underscore). So, for example, consider the following lines, extracted from the section describing the verbs:

W
K <genre> <nombre>

They make it possible to declare that verbs in the infinitive (indicated by the W code) do not have other inflectional features, while the forms in the past participle (K code) are also assigned a gender and a number.

Description of the inflectional codes

The principal function of the discr part is to divide a part of speech into subcategories having similar inflectional behavior. These subcategories are then used to facilitate writing the complete part.

For the legibility of the ELAG grammars, it is desirable that the elements of the same sub-category all have the same infle
204. nce of the rest of this License.

If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.

The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundati
205. nce, the graph of figure 6.15 matches a sequence of two simple nouns that is not ambiguous with a compound word. In fact, the <CDIC> pattern with its morphological filter matches a compound word with exactly one space, and the <N> pattern with its morphological filter matches a noun without space, that is to say a simple noun. Thus, in the sentence Black cats should like the town hall, this graph will match Black cats, but not town hall, which is a compound word.

Figure 6.15: Advanced use of right contexts

You can use nested contexts. For instance, the graph shown in figure 6.16 matches a number that is not followed by a dot, except for a dot followed by a number. Thus, in the sequence 5 0 7 12, this graph will match 5 0 and 12.

Figure 6.16: Nested contexts

If a right context contains boxes with transducer outputs, the outputs are ignored. However, it is possible to use a variable that was defined inside a right context (cf. figure 6.17). If you apply this graph in MERGE mode to the text the cat is white, you will obtain:

the <pet name="cat" color="white"/> is white

Figure 6.17: Variable defined inside a right context

6.3.2 Left contexts

It is also possible to look for an expression X only if it occurs after
206. nction. Example: the English alphabet file has to be in the directory English.

Each line of the alphabet file must have one of the following three forms, followed by a newline symbol:

• #XY: a hash symbol followed by two characters X and Y, which indicate that all characters between X and Y are letters. All these characters are considered to be in non-capitalized and capitalized form at the same time. This method is used to define the alphabets of Asian languages like Korean, Chinese or Japanese, where there is no distinction between upper and lower case and where the number of characters makes a complete enumeration tedious;
• XY: two characters X and Y indicate that X and Y are letters and that X is a capitalized equivalent of the non-capitalized Y form;
• X: a unique character X defines X as a letter in capitalized and non-capitalized form. This form is used to define a single Asian character.

For certain languages like French, it is possible that a lower case letter corresponds to multiple upper case letters. For example é, in practice, can have the upper case form E or É. To express this, it suffices to use multiple lines. The reverse is equally true: a capitalized letter can correspond to multiple lower case letters. Thus, E can be the capitalization of e, é, è, ê or ë. Here is an excerpt of the French alphabet file which defines different properties of letter e:

Ee
Eé
Éé

12.2.2 So
207. ne kao vuk gladan kao vuk AC_A3XN2 p4fgea hungry as a wolf gladne kao vuci gladan kao vuk AC_A3XN2 p4fgea hungry as a wolf gladne kao vukovi gladan kao vuk AC_A3XN2 p4fgea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 p4ngea hungry as a wolf gladna kao vuci gladan kao vuk AC_A3XN2 p4ngea hungry as a wolf gladna kao vukovi gladan kao vuk AC_A3XN2 p4ngea hungry as a wolf gladni kao vuk gladan kao vuk AC_A3XN2 p5mgea hungry as a wolf gladnu adan kao vuk gladni gladni gladni gladne gladne gladne gladna gladna gladna gladni gladni gladni gladni gladni gladni kao vuk gl N 10 3 INTEGRATION IN UNITEX gladni kao vuci gladan kao vuk AC_A3XN2 p5mgea hungry as a wolf gladni kao vukovi gladan kao vuk AC_A3XN2 p5mgea hungry as a wolf gladne kao vuk gladan kao vuk AC_A3XN2 p5fgea hungry as a wolf gladne kao vuci gladan kao vuk AC_A3XN2 p5fgea hungry as a wolf gladne kao vukovi gladan kao vuk AC_A3XN2 p5fgea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 p5ngea hungry as a wolf gladna kao vuci gladan kao vuk AC_A3XN2 p5ngea hungry as a wolf gladna kao vukovi gladan kao vuk AC_A3XN2 p5ngea hungry as a wolf gladnima kao vuk gladan kao vuk AC_A3XN2 p6mgea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p6mgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p6mgea hungry as a wolf gladnim kao
208. nicima avio prevoznik NC_2X avio prevoznika avio prevozni avio prevoznika avio prevoznik avio prevoznik avioprevoznika avio prevoznik NC_2XN2 avioprevozniku avio prevoznik NC_2XN2 avioprevoznika avio prevoznik NC_2XN2 avioprevoznik ke avio prevoznik avioprevoznicye avio prevoznik avioprevoznikom avio prevoznik avioprevozniku avio prevoznik NC_2XN2 avioprevoznici avio prevoznik NC_2XN2 avioprevoznika avio prevoznik NC_2XN2 avioprevoznicima avio prevoznik NC_2XN avioprevoznik avio prevoznik NC_2XN2 avioprevoznici avio prevoznik NC_2XN2 C_2XN2 NC_2XN2 C_2XN2 k NC_2XN NC_2XN2 NC_2XN2 NC_2XN2 C_2XN2 C_2XN2 C_2XN2 C_2XN2 N N Comp s2vm N Comp s3vm N Comp s4vm 2 N Comp s5vm 2 N Comp s6vm N Comp s7vm N Comp plvm N Comp p2vm N2 N Comp p3vm N Comp p4vm N Comp p5vm N2 N Comp p6vm N2 N Comp p7vm k NC_2XN2 N Comp w2vm N Comp w4vm Comp s1vm Comp s2vm N Comp s3vm NC_2XN2 NC_2XN2 N Comp s4vm N Comp s5vm N Comp s6vm Comp s7vm Comp plvm N Comp p2vm 2 N Comp p3vm N Comp p4vm Comp p5vm 187 188 CHAPTER 10 COMPOUND WORD INFLECTION avioprevoznicima avio prevoznik NC_2XN2 N Comp p6vm avioprevoznicima avio prevoznik NC_2XN2 N Comp p7vm avioprevoznika avio prevoznik NC_2XN2 N Comp w2vm avioprevoznika avio prevoznik NC_2XN2 N Comp w4vm predsednik drzxave predsednik
209. nine gender (e.g. main courante, moissonneuse-batteuse, etc.), a new graph has to be created, which is identical to Figure 10.5 up to the final output containing <Gen=f;Nb=$n>. That is not very intuitive, since circuit séquentiel and main courante inflect in the same way, in the sense that in both cases we need to put the first and the last constituent to plural in order to obtain the plural form of the whole MWU.

Footnote: Up to the case when single constituents appearing in the lemma of a MWU are already in plural, as in cross-roads.

That's why another type of instantiation for unification variables has been introduced. It is accompanied by a double equal sign == (as opposed to the single equal sign =, as for $n on Figure 10.5). If a unification variable is assigned to a category by this symbol, then it inherits the value of this category from the corresponding constituent as it appears in the lemma of the MWU.

For instance, Figure 10.6 contains a graph describing the inflected forms for both masculine and feminine French compounds of types Noun Noun and Noun Adjective. Its first box contains the double assignment of the gender to variable $g, which means that this variable has its value fixed to the gender value of the first constituent. For bateau-mouche it is fixed to masculine, because bateau is masculine, while for main courante it is fixed to feminine.

<Gen=$g;Nb=$n> e.g. bateau m
210. nsducer. This option can be used for optimizing certain grammars.

A message indicates at the end of the approximation process if the result is a finite state transducer or an FST2 grammar, and in the case of a transducer, if it is equivalent to the original grammar (cf. Figure 6.6).

6.2.3 Constraints on grammars

With the exception of inflection grammars, a grammar can never have an empty path. This means that the paths of a main graph must not recognize the empty word, but this does not prevent a subgraph of that grammar from recognizing epsilon.

Figure 6.6: Result of the approximation of a grammar (the log ends with "The resulting grammar is an equivalent finite state transducer")

It is not possible to associate a transducer output with a call to a subgraph. Such outputs are ignored by Unitex. It is therefore necessary to use an empty box that is situated to the left of the call to the subgraph in order to specify the output (cf. Figure 6.7).

Figure 6.7: How to associa
211. nstructed

Figure 7.10: Configuration of the construction of the text automaton (the dialog states that the program will construct the text FST according to the DLF, DLC and tags.ind files previously built by the Dico program for the current text)

tokens, while the compound adverb path does not contain any unknown word. Figure 7.11 shows the automaton of figure 7.9 after cleaning.

Figure 7.11: Automaton of figure 7.9 after cleaning

7.3 Resolving Lexical Ambiguities with ELAG

The ELAG program allows for applying grammars for ambiguity removal to the text automaton. This powerful mechanism makes it possible to write rules independently of already existing rules. This chapter briefly presents the grammar formalism used by ELAG and describes how the program works. For more details, the reader may refer to [6] and [49].

7.3.1 Grammars For Resolving Ambiguities

The grammars used by ELAG have a special syntax. They consist of two parts, which we call the if and then parts. The if part of an ELAG grammar is divided in two parts which are separated by a box containing the <!> symbol. The then part is divided the same way using the <=> symbol. The meaning of a grammar is
212. ntained in a file called Equivalences.txt that describes which foreign inflectional feature corresponds to which category-value pair in our description. For example, the following lists:

Polish
s : Nb=sing
p : Nb=pl
M : Case=Nom
D : Case=Gen
C : Case=Dat
B : Case=Acc
I : Case=Inst
L : Case=Loc
V : Case=Voc
o : Gen=masc_pers
z : Gen=masc_anim
r : Gen=masc_inanim
f : Gen=fem
n : Gen=neu

French
s : Nb=s
p : Nb=p
f : Gen=f
m : Gen=m

describe the equivalences between the previous Morphology.txt file (for Polish and French, respectively) and the single-character features that might be used in DELA dictionaries for those languages under Unitex.

10.2.2 Decomposition of a MWU into Units

The notion of an elementary graphical unit is controversial and varies across languages and NLP systems. For instance, in Unitex, an alphabet (i.e. a set of characters) is first defined for each language. Each non-alphabet character is called a separator. A graphical unit is then either a single separator (usually a punctuation mark, a digit, etc.) or a contiguous sequence of alphabet characters (e.g. aujourd'hui in French consists, according to this definition, of 3 units). In other systems, a graphical unit may contain a punctuation mark (e.g. c'est-à-dire), or a limit between two graphical units may occur within a sequence of alphabet characters (widział bym, cf. [61]). This variety of possible
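The Unitex-style definition of graphical units given above (a unit is either one separator or a maximal run of alphabet characters) can be sketched in a few lines. The function name split_units is hypothetical, and str.isalpha stands in for a real per-language alphabet, which is an assumption made to keep the example self-contained.

```python
def split_units(text, is_alphabet=str.isalpha):
    """Split text into units: runs of alphabet characters, or single separators."""
    units, word = [], []
    for ch in text:
        if is_alphabet(ch):
            word.append(ch)            # extend the current run of letters
        else:
            if word:
                units.append("".join(word))
                word = []
            units.append(ch)           # every separator is a unit of its own
    if word:
        units.append("".join(word))
    return units


units = split_units("aujourd'hui")
print(units)
```

Under this definition the apostrophe is a separator, which is why aujourd'hui decomposes into three units; a system that treats the apostrophe as part of the word would pass a different is_alphabet predicate.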
213. ntence of a text is represented by an automaton whose paths represent all possible interpretations. This chapter presents the concept of text automaton, the details of its construction and the operations that can be applied, in particular ambiguity removal with ELAG [49]. It is not possible at the moment to search the text automaton for patterns.

7.1 Displaying text automaton

The text automaton makes explicit all possible lexical interpretations of the words. These different interpretations are the different entries presented in the dictionary of the text. Figure 7.1 shows the automaton of the fourth sentence of the text Ivanhoe.

You can see in Figure 7.1 that the word Here has three interpretations here (adjective, adverb and noun), haunted two (adjective and verb), etc. All the possible combinations are expressed, because each interpretation of each word is connected to all the interpretations of the following and preceding words.

In case of an overlap between a compound word and a sequence of simple words, the automaton contains a path that is labeled by the compound word, parallel to the paths that express the combinations of simple words. This is illustrated in Figure 7.2, where the compound word courts of law overlaps with a combination of simple words.

By construction, the text automaton does not contain any loop. One says that the text automaton is acyclic.

NOTE: The term text automaton is an abuse of language. In fact, ther
214. nternally converted by the graph compiler to E:\greek\delta.grf.

Graph repository

When you need to call a grammar X inside a grammar Y, a simple method is to copy all the graphs of X into the directory that contains the graphs of Y. This method raises two problems:

• the number of graphs in the directory grows quickly;
• two graphs cannot share the same name.

To avoid that, you can store the grammar X in a special directory, called the graph repository. This directory is a kind of library where you can store graphs, and then call them using :: instead of :. To use this mechanism, you first need to set the path to the graph repository. Go into the "Info>Preferences>Directories" menu, and select your directory in the "Graph repository" frame (see Figure 5.9). There is one graph repository per language, so feel free to share or not the same directory for all the languages you work with.

Figure 5.9: Setting the path to the graph repository

Let us assume that we have a repository tree as on Figure 5.10. If we want to call the graph named DET that is located in sub-directory Johnson, we must use the call Det
215. nversely the former system knows nothing about how the latter one combines the provided forms to produce multi-word sequences. 10.3 Integration in Unitex. One of the major design principles of MULTIFLEX is to be as independent as possible of the morphological system for simple words. However, the existence of such a system is inevitable, because MWUs consist of simple words, which we need to be able to inflect in order to inflect a MWU as a whole. In its present version, MULTIFLEX relies on the Unitex simple word inflection system: • MULTIFLEX uses the same character encoding standard as Unitex, i.e. Unicode 3.0. • MULTIFLEX uses the Unitex graph editor for the representation of inflectional paradigms of MWUs. • MULTIFLEX admits principles of morphological description similar to those admitted in the DELA system implemented in Unitex. Thus, an inflection paradigm is a set of actions to be performed on the lemma in order to generate its inflected forms, and of corresponding inflection features to be attached to each generated form. • MULTIFLEX extends the Unitex dictionary treatment to the inflection of a DELAC (DELA electronic dictionary of compounds) into a DELACF (DELA electronic dictionary of compounds' inflected forms). The format of the generated DELACF is compatible with Unitex, while the format of the DELAC is novel but inspired by that of the DELAS (DELA electronic dictionary
216. Figure 7.7: Normalized phrase automaton. The Reconstrucao program allows you to construct a normalization grammar for these forms, for each text, dynamically. The grammar thus produced can then be used for normalizing the text automaton. The configuration window of the automaton construction offers an option Build clitic normalization grammar (cf. figure 7.10). This option automatically starts the construction of the normalization grammar, which is then used to construct the text automaton if you have selected the option Apply the Normalization grammar. 7.2.4 Keeping the best paths. An unknown word can perturb the text automaton by overlapping with a completely labeled sequence. Thus, in the automaton of figure 7.8, it can be seen that the adverb aujourd'hui overlaps with the unknown word aujourd, followed by an apostrophe and the past participle of the verb huir. Figure 7.8: Ambiguity due to a sentence containing an unknown word. This phenomenon can also take place in the treatment of certain Asian languages like Thai. When words are not delimited, there is no other solution than to con
217. of simple words. The following sections present, for several languages, complete examples of a DELAC into DELACF inflection within the MULTIFLEX/Unitex interface. 10.3.1 Complete Example in English. Let us assume that the description of morphological features of English is given by the following Morphology.txt file: English <CATEGORIES> Nb s p <CLASSES> noun Nb <var> adj and that the equivalences between these features and their corresponding codes in DELA dictionaries are given by the following Equivalences.txt file: English s Nb s p Nb p. Consider the following sample English DELAC file: angle angle N1 s of reflection NC_NXXXX; Adam's apple apple N1 s NC_XXXXN; air brake brake N1 s NC_XXN; birth date date N1 s NC_NN_NofN; criminal police NC_XXXinv; cross roads NC_XXNs; head head N1 s of government government N1 s NC_NofNs; notary notary N3 s public public N1 s NC_NsNs; rolling stone stone N1 s NC_XXN; student student N1 s union union N1 s NC_Ns N. The corresponding inflection graphs N1 and N3 for simple words are represented on figures 10.10 and 10.11, while those for compounds are shown on figures 10.12 through 10.20. The DELACF dictionary resulting from the inflection, via MULTIFLEX, of the above DELAC is as follows: angle of reflection angle of reflection NC_NXXXX s angles of reflection angle of reflect
218. oid, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the condi
219. okens or sequences of tokens. 4.3.1 Special symbols. There are two kinds of lexical masks. The first category contains all the symbols introduced in section 2.5.2, except for the symbol <PNC>, which matches punctuation signs, and <^>, which matches a line feed. Since all line feeds have been replaced by spaces, this symbol can no longer be useful when searching for lexical masks. These symbols, also called meta-symbols, are the following: • <E>: the empty word, or epsilon; matches the empty string. • <TOKEN>: matches any token except the space; used by default for morphological filters. • <MOT>: matches any token that consists of letters. • <MIN>: matches any lower-case token. • <MAJ>: matches any upper-case token. • <PRE>: matches any token that consists of letters and starts with a capital letter. • <DIC>: matches any word that is present in the dictionaries of the text. • <SDIC>: matches any simple word in the text dictionaries. • <CDIC>: matches any compound word in the dictionaries of the text. • <NB>: matches any contiguous sequence of digits (1234 is matched, but not 1 234). • #: prohibits the presence of a space. NOTE: as described in section 2.5.4, NO meta-symbol can be used to match the {STOP} marker, not even <TOKEN>. 4.3.2 References to information in the dictionaries. The second kind of lexical masks ref
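The meta-symbols of section 4.3.1 can be approximated by simple predicates on a token. The Python sketch below is only an illustration of the descriptions given in that list, written with the symbols in angle brackets; the function name `matches_meta` and the two word sets standing in for the text dictionaries are hypothetical, and case handling in the real matcher may differ.

```python
def matches_meta(meta, token, simple_words=frozenset(), compound_words=frozenset()):
    """Approximate test of a token against one meta-symbol.

    simple_words and compound_words are hypothetical stand-ins for the
    simple and compound entries of the text dictionaries.
    """
    if meta == "<E>":
        return token == ""                      # empty word
    if meta == "<TOKEN>":
        return token != "" and token != " "     # any token except the space
    if meta == "<MOT>":
        return token.isalpha()                  # letters only
    if meta == "<MIN>":
        return token.isalpha() and token.islower()
    if meta == "<MAJ>":
        return token.isalpha() and token.isupper()
    if meta == "<PRE>":
        return token.isalpha() and token[0].isupper()
    if meta == "<NB>":
        return token.isascii() and token.isdigit()  # contiguous digits only
    if meta == "<SDIC>":
        return token in simple_words
    if meta == "<CDIC>":
        return token in compound_words
    if meta == "<DIC>":
        return token in simple_words or token in compound_words
    raise ValueError(f"unsupported meta-symbol: {meta}")
```

For example, `matches_meta("<NB>", "1234")` holds while `matches_meta("<NB>", "1 234")` does not, because the space breaks the contiguous sequence of digits.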
220. ological Descrip; 10.2 Formalism for the Computational Morphology of MWUs; 10.2.1 Morphological Features of the Language; 10.2.2 Decomposition of a MWU into Units; 10.2.3 Inflection paradigm of a MWU; 10.3 Integration in Unitex; 10.3.1 Complete Example in English; 10.3.2 Complete Example in French; 10.3.3 Complete Example in Serbian; 11 Use of external programs; 11.21 MultiFlex
221. on This notion, which is derived from the field of finite state automata, enables a grammar to produce some output. With an eye towards clarity, we will use the terms grammar or graph. When a grammar produces outputs, we will use the term transducer, as an extension of the definition of a transducer in the area of finite state automata. 5.2 Editing graphs. 5.2.1 Creating a graph. In order to create a graph, click on New in the FSGraph menu. The window shown in figure 5.2 will then come up. The symbol in arrow form is the initial state of the graph. The round symbol with a square is the final state of the graph. The grammar only recognizes expressions that are described along the paths between initial and final states. In order to create a box, click inside the window while pressing the Ctrl key. A blue rectangle will appear that symbolizes the empty box that was created (see figure 5.3). After creating the box, it is automatically selected. You see the contents of that box in the text field at the top of the window. The newly created box contains the <E> symbol, which represents the empty word epsilon. Replace this symbol by the text I+you+he+she+it+we+they and press the Enter key. You see that the box now contains seven lines (see figure 5.4). The + character serves as a separator. The box is displayed in the form of red text lines, since it is not connected to another one at the moment. We often use this type of boxes to insert comm
222. on of the dictionary lines from the inflected forms contained in the automaton. For more details on the format of these files, see chapter 12. 11.3 Concord. Concord [OPTIONS] <index>. This program takes a concordance index file produced by the program Locate and produces a concordance. It is also possible to produce a modified text version taking into account the transducer outputs associated to the occurrences. Here is the description of the parameters. OPTIONS: • -f FONT/--font=FONT: the name of the font to use if the output is an HTML file; • -s N/--fontsize=N: the font size to use if the output is an HTML file. The font parameters are required if the output is an HTML file; • -l X/--left=X: number of characters on the left of the occurrences. In Thai mode, this means the number of non-diacritic characters; • -r X/--right=X: number of characters (non-diacritic ones in Thai mode) on the right of the occurrences. If the occurrence is shorter than this value, the concordance line is completed up to right. If the occurrence is longer than the length defined by right, it is nevertheless saved as a whole. NOTE: For both --left and --right, you can add the s character to stop at the first {S} tag. For instance, if you set 40s for the left value, the left context will end at 40 characters at most, less if the {S} tag is found before. Sort order options: TO order in which the occurrences appe
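The effect of the left and right context values, including the optional s suffix, can be sketched in Python as follows. This is not the Concord implementation: the function names are hypothetical, the sentence delimiter is assumed to be written {S}, and the right value is read here as a budget that includes the occurrence itself, which is one interpretation of the description above.

```python
def parse_context(value):
    """Split a value such as '40s' into (40, True); '40' gives (40, False)."""
    stop_at_tag = value.endswith("s")
    size = int(value[:-1] if stop_at_tag else value)
    return size, stop_at_tag


def concordance_line(text, start, end, left, right):
    """Return (left context, occurrence, right context) for text[start:end]."""
    n_left, stop_left = parse_context(left)
    n_right, stop_right = parse_context(right)

    before = text[max(0, start - n_left):start]
    if stop_left and "{S}" in before:
        # Keep only what follows the last sentence delimiter.
        before = before.rsplit("{S}", 1)[1]

    occurrence = text[start:end]  # always kept whole, even if longer than right

    # Assumption: the line is completed up to n_right characters in total.
    extra = max(0, n_right - len(occurrence))
    after = text[end:end + extra]
    if stop_right and "{S}" in after:
        # Stop at the first sentence delimiter.
        after = after.split("{S}", 1)[0]

    return before, occurrence, after
```

For the text `Hello.{S} The cat sat.{S} Bye.` and the occurrence `cat`, the values `40s` and `20s` give the contexts ` The ` and ` sat.`; without the s suffix, the left context would also include `Hello.{S}`.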
223. on, we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF S
224. on of the package is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place. • Verify that the user has already received a copy of these materials or that you have already sent this user a copy. If the package includes an encrypted form of the Linguistic Resource, the required form of the "work that uses the Linguistic Resource" must include any data and utility programs needed for reproducing the package from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Linguistic Resource together in a package that you distribute. You may not copy, modify, sublicense, link with, or distribute the Linguistic Resource except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Linguistic Resource is void, and will automatically terminate your rights under this License. However, parties who
225. onjunction because
DET    determiner     each
INTJ   interjection   eureka
N      noun           evidence, group theory
PREP   preposition    without
PRO    pronoun        you
V      verb           overeat, plug-and-play
Table 3.1: Frequent grammatical codes

Code       Description            Example
z1         general language       joke
z2         specialized language   floppy disk
z3         very specialized language   serialization
Abst       abstract               patricide
Anl        animal                 horse
AnlColl    collective animal      flock
Conc       concrete               chair
ConcColl   collective concrete    rubble
Hum        human                  teacher
HumColl    collective human       parliament
t          transitive verb        kill
i          intransitive verb      agree
Table 3.2: Some semantic codes

NOTE: The descriptions of tense in table 3.3 correspond to French. Nonetheless, the majority of these definitions can be found in other languages (infinitive, present, past participle, etc.). In spite of a common base in the majority of languages, the dictionaries contain encoding particularities that are specific for each language. Thus, as the declension codes vary a lot between different languages, they are not described here. For a complete description of all codes used within a dictionary, we recommend that you contact the author of the dictionary directly.

Code, Description: masculine; feminine; neuter; singular; plural; 1st, 2nd, 3rd person; present indicative; imperfect indicative; present subjunctive; imperfect subjunctive; present imperative; present conditional; simple past indicati
226. ords by concatenating together other words. For example, the word aftenblad, meaning evening journal, is obtained by combining the words aften (evening) and blad (journal). The PolyLex program parses the list of unknown words after the application of dictionaries and tries to analyze each of these words as a compound word. If a word has at least one analysis as a compound word, it is removed from the list of unknown words, and the lines produced for this word are appended to the simple word text dictionary. 2.6 Opening a tagged text. A tagged text is a text containing words with lexical tags enclosed in braces: I do not like the {square bracket,.N} sign! {S} Figure 2.13: Parameterizing the application of dictionaries. Such tags can be used to avoid ambiguities. In the previous example, it will be impossible to match square bracket as the combination of two simple words
227. original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to con
228. orms tokenization and application of dictionaries. If you choose not to preprocess the text, it will nevertheless be normalized and tokenized, since these operations are necessary for all further Unitex operations. It is always possible to carry out the preprocessing later by clicking on Preprocess Text in the Text menu. If you choose to preprocess the text, Unitex proposes to parameterize it as in the window shown in figure 2.8. The option Apply FST2 in MERGE mode is used to split the text into sentences. The option Apply FST2 in REPLACE mode is used to make replacements in the text, especially for the normalization of non-ambiguous forms. With the option Apply All default Dictionaries, you can apply dictionaries in the DELA format (Dictionnaires Electroniques du LADL). The option Analyze unknown words as free compound words is used in Norwegian for correctly analyzing compound words constructed via concatenation of simple forms. Finally, the option Construct Text Automaton is used to build the text automaton. This option is deactivated by default, because it consumes a large amount of memory and disk space if the text is too large. The construction of the text automaton is described in chapter 7. NOTE: If you click on Cancel but tokenize text, the program will carry out the normaliza
229. Figure 10.6: Inflection graph for bateau mouche with two types of instantiation. Note that the double assignment, contrary to the single assignment, no longer means that the variable is to be instantiated to all values of the corresponding category domain. It has a unique value all through the path on which it appears, even if it is concerned by another single assignment somewhere else on the same path. For example, on Figure 10.6 the final output contains Gen=$g, but $g may only take one value, determined by the first constituent. Unification variables are particularly useful in highly inflected languages. For example, in Polish most nouns inflect for number (2 values) and case (7 values), which implies at least 14 different forms, if variants and syncretic forms are distinguished. This score is even higher for adjectives, which inflect for number, case and gender (3 to 9 values, according to different approaches). If no unification mechanism were available, each of these numerous forms would have to be described by a separate path in the graph. The use of unification variables allows the size of the graph to be reduced dramatically, to one path only in most cases. For example, Figure 10.7 shows the graph for Polish compounds that inflect like pranie mózgu (brainwashing) or powozenie koniem (horse coaching). Their third constituent has its case fixed (most often to genitive or instrumental). Their first and third constituent inflect in number in
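The saving brought by unification variables can be made concrete with a small count. In the Python sketch below, one description with two variables (number and case) stands for what would otherwise be 14 separate paths. The feature codes, the function name and the placeholder inflection function used in the usage note are all hypothetical; no real Polish morphology is computed.

```python
from itertools import product

# Hypothetical feature codes: 2 numbers and 7 cases, as counted in the text.
NB = ["s", "p"]
CASE = ["nom", "gen", "dat", "acc", "ins", "loc", "voc"]


def forms_from_one_path(inflect_head, fixed_tail):
    """Expand a single path whose variables n and c are shared between the
    inflected first constituent and the features of the whole compound."""
    forms = []
    for n, c in product(NB, CASE):
        head = inflect_head(n, c)          # first constituent agrees with (n, c)
        features = f"Nb={n};Case={c}"      # same values reused in the output
        forms.append((f"{head} {fixed_tail}", features))
    return forms
```

Calling `forms_from_one_path(lambda n, c: f"pranie[{n},{c}]", "mózgu")` returns 14 entries, one per combination of number and case, from a single description.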
230. p w4vm Ujedinxene nacije Ujedinxene nacije NC_AXN3 N Comp NProp Org fplq Ujedinxenih nacija Ujedinxene nacije NC_AXN3 N Comp NProp Org fp2q Ujedinxenima nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp3q Ujedinxenim nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp3q Ujedinxene nacije Ujedinxene nacije NC_AXN3 N Comp NProp Org fp4q Ujedinxene nacije Ujedinxene nacije NC_AXN3 N Comp NProp Org fp5q Ujedinxenima nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp6q Ujedinxenim nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp6q Ujedinxenima nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp7q Ujedinxenim nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp7q Kosovo i Metohija Kosovo i Metohija NC_N3XN N Comp NProp Top Reg nslq Kosova i Metohije Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns2q Kosovu i Metohiji Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns3q Kosovo i Metohiju Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns4q Kosovo i Metohijo Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns5q Kosovom i Metohijom Kosovo i Metohija NC_N3XN N Comp NProp ToptReg ns6q Kosovu i Metohiji Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns7q istrazxne sudije istrazxni sudija NC_AXNF N Comp lvfp istrazxnih sudija istrazxni sudija NC_AXNF N Comp 2vfp istrazxnima sudijama istrazxni sudija NC_AXNF
231. pair consists of a unit of the inflected form and the corresponding unit of the canonical form. If each of the two units is a space or a hyphen, the compressed form of the unit is the unit itself, as in the following line: 0-1.N:p which is the output for battle-axes,battle-axe.N:p. This maintains a certain readability of the .inf file when the dictionary contains compound words. Whenever one or both of the units in a pair is neither a space nor a hyphen, the compressed form is composed of the number of characters to be removed, followed by the sequence of characters to be appended. Thus, the dictionary line première partie,premier parti.N+AN+Hum:fs is encoded by the line 3er 1.N+AN+Hum:fs. The 3er code indicates that 3 characters are to be removed from the sequence première and that the characters er are to be appended to obtain premier. The 1 indicates that only one character needs to be removed from partie to obtain parti. The number 0 is used whenever no letter should be removed. 12.8.3 Dictionary information file. In the Apply lexical resources frame, it is possible for some dictionaries to get some information with a right click. Such information is attached to a biniou.bin or biniou.fst2 dictionary by means of a raw text file named biniou.txt, located in the same directory. 12.8.4 The CHECK_DIC.TXT file. This file is produced by the dictionary verification progr
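The per-unit compression scheme described above (a number of characters to remove, followed by the characters to append) can be sketched in Python as follows. The function names are hypothetical and the code only handles one unit at a time; it is an illustration of the principle, not the actual Unitex compression code, and it assumes that the appended suffix does not start with a digit.

```python
def compress_unit(inflected, canonical):
    """Encode how to obtain the canonical unit from the inflected unit."""
    common = 0
    for a, b in zip(inflected, canonical):
        if a != b:
            break
        common += 1
    to_remove = len(inflected) - common
    return f"{to_remove}{canonical[common:]}"


def decompress_unit(inflected, code):
    """Rebuild the canonical unit from the inflected unit and its code."""
    n_digits = 0
    while n_digits < len(code) and code[n_digits].isdigit():
        n_digits += 1
    to_remove = int(code[:n_digits])
    kept = inflected[:len(inflected) - to_remove]
    return kept + code[n_digits:]
```

With these definitions, `compress_unit("première", "premier")` gives `3er` and `compress_unit("partie", "parti")` gives `1`, matching the codes of the example line; a unit that is identical in both forms is encoded as `0`.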
232. ph generated for the verb archaïser. Figure 8.9: Graph generated for the verb badauder. Figure 8.10: Main graph referring to all the generated graphs. Chapter 9 Text alignment. The principle of text alignment is simple: aligning two or more texts, one supposed to be the source and the other(s) supposed to be its translation(s). The alignment is made at the sentence level, because word alignment is not possible yet, and certainly not relevant. Then, one can look for an expression A in one of the texts and look for its translations in the sentences aligned with those containing occurrences of A. To include such a functionality into Unitex, Patrick Watrin integrated the Open Source text alignment tool XAlign, developed at the LORIA [52]. In this chapter, we will explain how to use the alignment module. The reader interested in details about the integration of XAlign can consult [23], or [60] and [70] for an illustration of what can be done with this module. 9.1 Loading texts. First, you need to select your 2 texts. To do that, go into XAlign > Open files and you will
233. possible. The first does not modify the canonical form and adds the inflectional code s. The second deletes a letter with the L operator, then adds the ux suffix and adds the inflectional code mp. Five operators are possible: • L (left) removes a letter from the entry. • R (right) restores a letter to the entry. In French, many verbs of the first group are conjugated in the present singular of the third person form by removing the r of the infinitive and changing the 4th letter from the end to è: peler → pèle, acheter → achète, gérer → gère, etc. Instead of describing an inflectional suffix for each verb (LLLLèle, LLLLète and LLLLère), the R operator can be used to describe it in one way: LLLLèRR. • C (copy) duplicates a letter in the entry and moves everything on its right by one position. In cases like permitted or hopped, we see a duplication of the final consonant of the verb. To avoid writing an inflectional graph for every possible final consonant, one can use the C operator to duplicate any final consonant. • D (delete) deletes a letter, shifting anything located on the right of this letter. For instance, if you want to inflect the Romanian word european into europeni, you must use the sequence LDRi: L will move the cursor on the a, D will delete the a, shifting the n on the left, and then Ri will restore the n and add an i. • U (unaccent) removes the accent of the current character
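The behaviour of the L, R, C and D operators can be modelled with a small stack machine. The Python sketch below is one reading that reproduces the examples given in the list above (european with LDRi, and peler with the sequence read here as LLLLèRR); it is an assumption and not the actual Unitex inflection code. In particular, it assumes that a letter written after some L operators overwrites one of the removed letters, and it leaves out the U operator. The function name is hypothetical.

```python
def apply_operators(lemma, code):
    """Apply an inflection code made of operators and literal letters.

    Simplified model (assumption): removed letters are kept on a stack;
    writing a literal letter consumes one stacked letter, if any.
    """
    out = list(lemma)
    stack = []
    for op in code:
        if op == "L":                # remove the last letter, remember it
            if out:
                stack.append(out.pop())
        elif op == "R":              # restore the most recently removed letter
            if stack:
                out.append(stack.pop())
        elif op == "D":              # delete the last letter for good
            if out:
                out.pop()
        elif op == "C":              # duplicate the last letter
            if out:
                out.append(out[-1])
        elif op == "U":
            raise NotImplementedError("U (unaccent) is not modelled here")
        else:                        # literal letter of the suffix
            out.append(op)
            if stack:
                stack.pop()          # it takes the place of a removed letter
    return "".join(out)
```

Under this model, `apply_operators("european", "LDRi")` gives `europeni` and `apply_operators("peler", "LLLLèRR")` gives `pèle`. The code `Ced` applied to `permit` gives `permitted`; that code is a made-up illustration of the C operator, not one quoted from the manual.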
234. ppear, which asks for the .rul file to be used (see figure 7.17). The default file is elag.rul. This will launch the Elag program, which will try to resolve the ambiguities. Once the program has finished, you can view the resulting automaton by clicking on the Open Elag Frame button. As you can see in figure 7.18, the window is separated into two parts: the original text automaton can be seen at the top, and the result at the bottom. Figure 7.17: Text automaton frame. Figure 7.18: Split text automaton frame. Don't be surprised if the automaton shown at the bottom seems more complicated. This results from the fact that factorized lexical entries were exploded in order to treat each inflectional interpretation separately. To refactorize these entries, click on the Implode button. Clicking on the Explode button shows you an exploded view of the text automaton. If you click on the Replace button, the resulting automaton will become the new text automaton. Thus, if you use other grammars t
235. pplies to what was recognized by the lexical mask Here are some examples of such combinations 4 8 SEARCH 65 e lt V K gt lt lt i gt gt Past participle ending with i e lt CDIC gt lt lt gt gt A compound word containing a dash e lt CDIC gt lt lt gt gt a compound word containing at least two spaces e lt A fs gt lt lt pro gt gt a feminine singular adjective beginning with pro e lt DET gt lt lt u u n un gt gt a French determiner different from un e lt DIC gt lt lt es gt gt a word which is not in the dictionary and which ends with es e lt V S T gt lt lt uiss gt gt a verb in the past or present subjunctive and containing uiss NOTE By default morphological filters are subject to the same variations of case as lexical masks Thus the filter lt lt gt gt will recognize all the words starting with but also those which start with or E To force the matcher to respect case add _f_ immediately after the filter e g lt A gt lt lt gt gt _f_ 4 8 Search 4 8 1 Configuration of the search In order to search for an expression first open a text cf chapter 2 Then click on Locate Pattern in the Text menu The window of figure 4 4 appears F Locate Pattern Locate pattern in the form of O Regular expression 8 Graph SS Index Grammar outputs O Shortest matches 8 Are not taken into account 8
236. Figure 4.8: Example concordance. Chapter 5 Local grammars. Local grammars are a powerful tool to represent the majority of linguistic phenomena. The first section presents the formalism in which these grammars are represented. Then we will see how to construct and present grammars using Unitex. 5.1 The local grammar formalism. 5.1.1 Algebraic grammars. Unitex grammars are variants of algebraic grammars, also known as context-free grammars. An algebraic grammar consists of rewriting rules. Below you see a grammar that matches any number of a characters: S → aS, S → ε. The symbols to the left of the rules are called non-terminal symbols, since they can be replaced. Symbols that cannot be replaced by other rules are called terminal symbols. The items at the right side are sequences of non-terminal and terminal symbols. The epsilon symbol ε designates the empty word. In the grammar above, S is a non-terminal symbol and a a terminal symbol. S can be rewritten as either an a followed by a S, or as the empty word. The operation of rewriting by applying a
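The two-rule grammar of section 5.1.1, read here as S → aS and S → ε, can be explored with a few lines of Python. The sketch shows the successive rewriting steps that produce a word, and a recursive check that a word is recognized. The function names are hypothetical and the code is specific to this one grammar.

```python
def derive(n):
    """List the rewriting steps that produce the word made of n letters a."""
    steps = ["S"]
    current = "S"
    for _ in range(n):
        current = current.replace("S", "aS")   # apply S -> aS
        steps.append(current)
    steps.append(current.replace("S", ""))      # apply S -> epsilon
    return steps


def recognizes(word):
    """Check whether the grammar S -> aS | epsilon generates the word."""
    if word == "":
        return True                              # S -> epsilon
    return word[0] == "a" and recognizes(word[1:])  # S -> aS
```

For instance, `derive(2)` returns the steps `S`, `aS`, `aaS`, `aa`, and `recognizes("aaa")` holds while `recognizes("ab")` does not.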
237. r English MWUs sa 0 e g air brake lt Nb n gt Figure 10 14 Inflection graph NC_XXN for English MWUs e g birth date gt sa Figure 10 15 Inflection graph NC_NN_NofN for English MWUs lt lt s1 gt lt 2 gt jH lt s lt Nb p gt e g criminal police Figure 10 16 Inflection graph NC_XXXinv for English MWUs He lt Nb n gt e g cross roads Figure 10 17 Inflection graph NC_XXNs for English MWUs 182 CHAPTER 10 COMPOUND WORD INFLECTION e g head of government lt Nb p gt Figure 10 18 Inflection graph NC_NofNs for English MWUs lt Nb n gt Figure 10 20 Inflection graph NC_Ns N for English MWUs 10 3 2 Complete Example in French Let us assume that the description of morphological features of French is given by the fol lowing Morphology txt file French lt CATEGORIES gt Nb s p Gen m f lt CLASSES gt noun Nb lt var gt Gen lt var gt adj Nb lt var gt Gen lt var gt adv and that the equivalences between these features and their corresponding codes in DELA 10 3 INTEGRATION IN UNITEX dictionaries are given by the following Eq 183 uivalences txt file French S P Nb s Nb p m Gen m f Gen f Consider the following sample French DELAC file the DELAS inflection codes may vary from those present in UNITEX avant garde garde N21 fs NC_XXN bateau bateau N3 ms mouche mouche N
238. Figure 8.4: Lexicon-grammar table 31H. 8.2.4 Automatic generation of graphs. In order to be able to generate graphs from a parameterized graph and a table, first of all the table must be opened by clicking on Open in the Lexicon-Grammar menu (see figure 8.5). The table must be in Unicode text format. The selected table is then displayed in a window (see figure 8.6). If it does not appear on your screen, it may be hidden by other Unitex windows. Figure 8.5: Menu Lexicon-Grammar. Figure 8.6: Displaying a table. To automatically generate graphs from a parameterized graph, click on Compile to GRF in the Lexicon-Grammar menu. The window in figure 8.7 shows this. In the Reference Graph in GRF
239. r in which they are found in the text;

• text.cod contains an integer array; every integer corresponds to the index of a token in the file tokens.txt;
• tok_by_freq.txt contains the list of tokens sorted by frequency;
• tok_by_alph.txt contains the list of tokens in alphabetical order;
• stats.n contains some statistics about the text.

Tokenizing the text "A cat is a cat." returns the following list of tokens:

    A
    SPACE
    cat
    is
    a
    .

You will observe that tokenization is case sensitive (A and a are two distinct tokens), and that each token is listed only once. Numbering these tokens from 0 to 5, the text can be represented by a sequence of numbers (integers), as described in the following table:

    Token number          0  1      2    1      3   1      4  1      2    5
    Corresponding token   A  SPACE  cat  SPACE  is  SPACE  a  SPACE  cat  .

Table 2.1: Representation of the text "A cat is a cat."

For more details, see chapter 12.

[Figure 2.11: Tokens of an English text sorted by frequency]

2.5.5 Applying dictionaries

Applying dictionaries consists of building the subset of dictionaries consisting only of forms that are present in the text. Thus, applying an English dictionary to the text "Igor's father in law is ill" produces a dictionary of the following simple words:

    father,.N+Hum:s
    father,.V:W:P1s:P2s:P1p:P2p:P3p
    ill,.A
    ill
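The relation between the token list and the integer sequence described in this excerpt can be illustrated with a minimal sketch. It is not the Unitex tokenizer: the token definition used here (a maximal run of letters, or any single other character) is an assumption made for the example, and the final period of the sample text is assumed because the excerpt numbers six tokens, from 0 to 5.

```python
import re

# Assumption: a token is a maximal run of letters or any single other character.
TOKEN_RE = re.compile(r"[^\W\d_]+|.", re.DOTALL)


def encode(text):
    """Return (token list, code sequence) for `text`."""
    tokens = []   # distinct tokens, in order of first appearance
    index = {}    # token -> its number in `tokens`
    codes = []    # one integer per token occurrence in the text
    for tok in TOKEN_RE.findall(text):
        if tok not in index:
            index[tok] = len(tokens)
            tokens.append(tok)
        codes.append(index[tok])
    return tokens, codes


tokens, codes = encode("A cat is a cat.")
print(tokens)  # each distinct token appears once; "A" and "a" are distinct
print(codes)   # the text as a sequence of token numbers
```

Here `tokens` plays the role of the token list and `codes` the role of the integer array; the names `encode` and `TOKEN_RE` are hypothetical.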
240. rame for aligned texts

Once Unitex has created and preprocessed the working version of the text, you can perform your query using the frame shown on Figure 9.7. As the matching operation is performed by the Locate program, you can perform the same queries as you would on a normal corpus. The only restriction is that you cannot exploit the outputs of your grammars, if any.

For instance, let us look up the pattern <manger> (to eat) in the French text of our example. At first we see no result, because we have not yet changed the display mode for the French text, which by default is "All sentences/Plain text". Clicking on "Matched sentences", we only see sentences that contain occurrences, highlighted as usual in blue, as shown on Figure 9.8. Clicking on "All sentences/HTML" will display all sentences, highlighting occurrences in blue.

[Figure 9.8: screenshot of the aligned texts in funtana.xml]
241. [Figure 8.1: Lexicon-grammar Table 32NM]

8.2 Conversion of a table into graphs

8.2.1 Principle of parameterized graphs

The conversion of a table into graphs is carried out by a mechanism involving parameterized graphs. The principle is the following: a graph that describes the possible constructions is built manually. That graph refers to the columns of the table in the form of parameters or variables. Afterwards, for each line of the table, a copy of this graph is constructed in which the variables are replaced with the contents of the cell at the intersection of the line and the column that corresponds to the variable. If a cell of the table contains the + sign, the corresponding variable is replaced by <E>. If the cell contains the - sign, the box containing the corresponding variable is removed, interrupting the paths through that box. In all other cases the variable is replaced by the contents of the cell.

8.2.2 Format of the table

The lexicon-grammar tables are usually encoded with the aid of a spreadsheet like OpenOffice.org Calc [57]. To make them usable with Unitex, the tables have to be encoded in Unicode text format in accordance with the following convention: the columns need to be separated by a tab and the lines by a newline.

In order to convert a table with OpenOffice.org Calc, save it in
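The substitution rule of section 8.2.1 can be sketched as follows. This is only an illustration under several assumptions: a graph is reduced to a flat list of box labels (transitions are ignored), a variable is written `@` followed by a column name, and the two signs found in table cells are taken to be `+` and `-`. None of these names come from Unitex itself.

```python
def instantiate(boxes, row):
    """Build the copy of a parameterized graph for one table row.

    `boxes` is a list of box labels; a label starting with "@" is a
    variable naming a column (hypothetical notation).  `row` maps
    column names to cell contents.
    """
    result = []
    for box in boxes:
        if not box.startswith("@"):
            result.append(box)      # ordinary box: copied unchanged
            continue
        cell = row[box[1:]]
        if cell == "+":
            result.append("<E>")    # "+" cell: variable becomes the empty word
        elif cell == "-":
            continue                # "-" cell: the box is removed
        else:
            result.append(cell)     # any other cell: contents are copied
    return result


# Hypothetical template and row, for illustration only.
template = ["<N>", "@A", "@B", "@C"]
row = {"A": "+", "B": "-", "C": "cacher"}
instance = instantiate(template, row)
```

In a real graph, removing a box also cuts every path that went through it; the flat list used here cannot show that effect.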
242. re designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.

To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rig
243.
    rrier(carrier.N1:s),NC_NN
    chief justice(justice.N1:s),NC_NN
    lord(lord.N1:s) justice(justice.N1:s),NC_NN2

However, only a few codes, which can be seen as a phrase grammar of the language, account for the large majority of all MWUs. Thus, the lexicalization of the description mainly consists of pointing out the MWUs which respect or don't respect the grammar.

10.2 Formalism for the Computational Morphology of MWUs

A formalism for describing the morphological paradigms of MWUs was proposed in [64]. It is based on studies of English, Polish and French, and was further tested for Serbian [44]. It consists of a language-independent kernel which is to be completed by a set of morphological elements characteristic of the given language. In this section we give an in-depth description of this formalism.

10.2.1 Morphological Features of the Language

When processing MWUs of a given language, we have to provide some general data about that language. These data are included in two textual files. The Morphology.txt file gives the morphological classes (noun, adjective, ...), categories (number, gender, case, ...) and values (masculine, feminine, singular, nominative, ...). Consider the following example:

    Polish
    <CATEGORIES>
    Nb: sing, pl
    Case: Nom, Gen, Dat, Acc, Inst, Loc, Voc
    Gen: masc_pers, masc_anim, masc_inanim, fem, neu
    <CLASSES>
    noun: (Nb,<var>),(Case,<var>),(Gen,<fixed>)
    adj: (Nb,<var>),(Case
244. rs used in the inflectional and canonical forms, the list of grammatical and semantic codes, and the list of inflectional codes that appear in the dictionary. The character list makes it possible to verify that the characters used in the dictionary are consistent with those in the alphabet file of the language. Each character is followed by its value in hexadecimal notation. The code lists can be used to check that there are no typing errors in the codes of the dictionary.

The CheckDic program works with non-compressed dictionaries, i.e. the files in text format. The general convention is to use the .dic extension for these dictionaries. In order to check the format of a dictionary, you first open it by choosing "Open" in the "DELA" menu.

[Figure 3.1: "DELA" Menu]

Let's load the dictionary as in figure 3.2. Then click on "Check Format" in the "DELA" menu. A window like in figure 3.3 is opened. You must select the type of dictionary you want to check. After checking the dictionary in Figure 3.2, results are presented as shown in Figure 3.4.

The first error is caused by a missing period. The second is caused by the fact that no comma was found after the end of an inflected form. The third error indicates that the program didn't find any grammatical or semantic codes
245. rted alphabet

The sorted alphabet file defines the sorting priorities of the letters of a language. It is used by the SortTxt program. Each line of that file defines a group of letters. If a group of letters A is defined before a group of letters B, every letter of group A is inferior to every letter in group B.

The letters of a group are only distinguished if necessary. For example, if the group of letters eéèêë has been defined, the word ébahi should be considered "smaller" than estuaire, and also "smaller" than été. Since the letters that follow e and é determine the order of the words, it is not necessary to compare the letters e and é, since they are of the same group.

On the other hand, if the words chantés and chantes are to be sorted, chantes should be considered as "smaller". It is therefore necessary to compare the letters e and é to distinguish these words. Since the letter e appears first in the group eéèêë, it is considered to be "smaller" than é. The word chantes should therefore be considered to be "smaller" than the word chantés.

The sorted alphabet file allows the definition of equivalent characters. It is therefore possible to ignore the different accents as well as capitalization. For example, if the letters b, c and d are to be ordered without considering capitalization and the cedilla, it is possible to write the following lines:

    Bb
    CcÇç
    Dd

This file is optional. If no sorted alphab
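The two-level comparison described in this excerpt (groups first, position inside a group only to break ties) can be sketched with a sort key. This is an illustration, not the SortTxt implementation; the group list below is a hypothetical sorted alphabet (one group per unaccented lowercase letter, with the accented variants of e added to its group), and the accented sample words are assumed from the excerpt's French examples.

```python
import string


def make_key(groups):
    """Return a sort key built from an ordered list of letter groups."""
    group_of = {}        # letter -> index of its group
    rank_in_group = {}   # letter -> position inside its group
    for g, letters in enumerate(groups):
        for r, ch in enumerate(letters):
            group_of[ch] = g
            rank_in_group[ch] = r

    def key(word):
        # Primary comparison: letters of one group are treated as equal.
        # Characters absent from the file sort after every declared group.
        primary = tuple(group_of.get(ch, len(groups) + ord(ch)) for ch in word)
        # Secondary comparison: used only when the primary keys are identical.
        secondary = tuple(rank_in_group.get(ch, 0) for ch in word)
        return primary, secondary

    return key


# Hypothetical sorted alphabet: a..z, with accented variants grouped under e.
groups = list(string.ascii_lowercase)
groups[groups.index("e")] = "eéèêë"

key = make_key(groups)
words = sorted(["été", "chantés", "estuaire", "chantes", "ébahi"], key=key)
```

With this key, ébahi precedes estuaire because the second letters decide, while chantes precedes chantés only because of the tie-breaking rank of e and é inside their group.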
246. rule is called derivation. We say that a grammar generates a word if there exists a sequence of derivations that produces that word. The non-terminal that is the starting point of the first derivation is called an axiom.

The grammar above also generates the word aa, since we can derive this word from the axiom S by applying the following derivations:

    Derivation 1: rewriting the axiom to aS
        S → aS
    Derivation 2: rewriting S at the right side of aS
        S → aS → aaS
    Derivation 3: rewriting S to ε
        S → aS → aaS → aa

We call the set of words generated by a grammar the language generated by the grammar. The languages generated by algebraic grammars are called algebraic languages or context-free languages.

5.1.2 Extended algebraic grammars

Extended algebraic grammars are algebraic grammars where the members on the right side of the rule are not just sequences of symbols but regular expressions. Thus, the grammar that generates a sequence of an arbitrary number of a's can be written as a grammar consisting of one rule:

    S → a*

These grammars, also called recursive transition networks (RTN) or syntax diagrams, are suited for a user-friendly graphical representation. Indeed, the right member of a rule can be represented as a graph whose name is the left member of the rule.

However, Unitex grammars are not exactly extended algebraic grammars, since they contain the notion of transducti
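The derivation sequence above can be reproduced mechanically. The sketch below assumes that "the grammar above" consists of the two rules S → aS and S → ε, which is what the three derivations shown in this excerpt imply; the function name `derive` is hypothetical.

```python
def derive(n):
    """Return the successive forms obtained when deriving a^n from the axiom S.

    Assumed rules: S -> aS (applied n times), then S -> empty word (once).
    """
    form = "S"
    steps = [form]
    for _ in range(n):
        form = form.replace("S", "aS", 1)   # rewrite S to aS
        steps.append(form)
    form = form.replace("S", "", 1)         # rewrite S to the empty word
    steps.append(form)
    return steps


steps_for_aa = derive(2)
print(" -> ".join(steps_for_aa))
```

For the word aa, the list holds the axiom followed by the result of each of the three derivations.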
247. ry for a given language, Unitex won't try to copy system data into it. So, if an update has modified a resource file other than a dictionary, you will have to copy this file yourself, or delete your personal directory for this language and let Unitex rebuild it properly.

Choosing the language allows Unitex to find certain files, for example the alphabet file. You can change the language at any time by choosing "Change Language" in the "Text" menu. If you change the language, the program will close all windows related to the current text, if there are any. The active language is indicated in the title bar of the graphical interface.

2.2 Text formats

Unitex works with Unicode texts. Unicode is a standard that describes a universal character code. Each character is given a unique number, which allows for representing texts without having to take into account the proprietary codes on different machines and/or operating systems. Unitex uses a two-byte representation of the Unicode 3.0 standard, called Unicode Little-Endian (for more details, see [16]).

[Figure 2.1: Language selection when starting Unitex]

Texts that come with Unitex are already in Unicode format. If you try to open a text that is not in Unicode, the program proposes to convert it (see figure 2.2). This conversion is based on the current language: if you are working in
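What a two-byte little-endian representation looks like can be shown with Python's standard `utf-16-le` codec. Using this codec as a stand-in for the format named in the excerpt is an assumption (the two coincide for characters that fit in two bytes); whether a byte order mark is expected at the start of a file is not stated here, so none is written.

```python
def to_little_endian(text):
    """Encode `text` with two bytes per character, low-order byte first.

    Assumption: Python's "utf-16-le" codec stands in for the two-byte
    little-endian Unicode representation mentioned in the text.
    """
    return text.encode("utf-16-le")


data = to_little_endian("Aé")
# "A" is U+0041 and "é" is U+00E9; each takes two bytes, low byte first.
print(data.hex(" "))
```

Decoding with the same codec (`data.decode("utf-16-le")`) gives the original string back.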
248. rzxave drzxava N600 fs2q NC_N2X1 N Comp Ujedinxene Ujedinxen Al aefplg nacije nacija N600 fplq NC_AXN3 N Comp NProp Org Kosovo Kosovo N308 nslq i Metohija Metohija N623 fslq NC_N3XN N Comp NProp Top Reg istrazxni istrazxni A2 admslg sudija sudija N679 mslv NC_AXNF N Comp Mirosinka Mirosinka N1637 fslv Dinkicx Dinkicx N1028 mslv NC_ImePrezime N Comp Hum PersName gladan gladan Al8 akmslg kao vuk vuk N128 mslv AC_A3XN2 hungry as a wolf The corresponding inflection graphs for MWUs are shown on figures 10 28 through 10 35 The DELACF dictionary resulting from the inflection via MULTIFLEX of the above DELAC is as follows zxiro racyun zxiro racyun NC_2XN1 N Comp slqm zxiro racyuna zxiro racyun NC_2XN1 N Comp s2qm zxiro racyunu zxiro racyun NC_2XN1 N Comp s3qm zxiro racyun zxiro racyun NC_2XN1 N Comp s4qm zxiro racyune zxiro racyun NC_2XN1 N Comp s5qm 10 3 INTEGRATION IN UNITEX zxiro racyunom zxiro racyun NC_2XN1 N zxiro racyunu zxiro racyun NC_2XN1 N C zxiro racyuni zxiro racyun NC_2XN1 N C zxiro racyuna zxiro racyun NC_2XN1 N C zxiro racyunima zxiro racyun NC_2 zxiro racyune zxiro racyun NC_2X N C zxiro racyuni zxiro racyun NC_2XN1 N C zxiro racyunima zxiro racyun NC_2 Comp s6qm omp s7qm omp plqm omp p2qm 1 N Comp p3qm omp p4qm omp p5qm zxiro racyunima zxiro racyun NC_2XN1 N Comp p7qm zxiro racyuna zxiro racyun NC_2XN1 N C zxiro racyuna zxiro racyun NC_
249. s a collection of programs developed for the analysis of texts in natural language by using linguistic resources and tools. These resources consist of electronic dictionaries, grammars and lexicon-grammar tables, initially developed for French by Maurice Gross and his students at the Laboratoire d'Automatique Documentaire et Linguistique (LADL). Similar resources have been developed for other languages in the context of the RELEX laboratory network.

The electronic dictionaries specify the simple and compound words of a language together with their lemmas and a set of grammatical (semantic and inflectional) codes. The availability of these dictionaries is a major advantage compared to the usual utilities for pattern searching, as the information they contain can be used for searching and matching, thus describing large classes of words using very simple patterns. The dictionaries are presented in the DELA formalism and were constructed by teams of linguists for several languages (French, English, Greek, Italian, Spanish, German, Thai, Korean, Polish, Norwegian, Portuguese, etc.).

The grammars used here are representations of linguistic phenomena on the basis of recursive transition networks (RTN), a formalism closely related to finite state automata. Numerous studies have shown the adequacy of automata for linguistic problems at all descriptive levels, from morphology and syntax to phonetic issues. Grammars created with Unitex carry this approac
250. s the alphabet file used

1 cities bin 2 regions bin 3 rivers bin 4 ctr bin

3.6.2 Application rules for dictionaries

Besides the priority rule, the application of dictionaries respects upper case letters and spaces. The upper case rule is as follows:

• if there is an upper case letter in the dictionary, then an upper case letter has to be in the text;
• if a lower case letter is in the dictionary, there can be either an upper or lower case letter in the text.

Thus, the entry peter,.N:fs will match the words peter, Peter and PETER, while Peter,.N+firstName only recognizes Peter and PETER. Lower and upper case letters are defined in the alphabet file passed to the Dico program as a parameter.

Respecting white space is a very simple rule: each sequence in the text to be recognized by a dictionary entry has to have exactly the same number of spaces. For example, if the dictionary contains aujourd'hui,.ADV, the sequence Aujourd' hui will not be recognized because of the space that follows the apostrophe.

3.6.3 Dictionary graphs

The Dico program can also apply dictionary graphs. Dictionary graphs conform to the following rule: if applied by Locate in MERGE mode, they must produce output sequences that are valid DELAF lines. Figure 3.11 shows a graph that recognizes chemical elements. We can observe a first advantage of graphs over usual dictionaries: we can force case with
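The upper case rule and the white space rule of section 3.6.2 can be combined in a small matching function. This sketch is not the Dico program: it relies on Python's own notion of lower and upper case instead of an alphabet file, and the function name `matches` is hypothetical.

```python
def matches(entry_form, text_form):
    """Tell whether a dictionary form recognizes a sequence of the text.

    - an upper case letter in the entry requires the same upper case letter;
    - a lower case letter in the entry accepts its lower or upper case form;
    - every other character (space, apostrophe, ...) must be identical.
    Assumption: str.islower()/str.upper() replace the alphabet file.
    """
    if len(entry_form) != len(text_form):
        return False
    for d, t in zip(entry_form, text_form):
        if d.islower():
            if t != d and t != d.upper():
                return False
        elif t != d:
            return False
    return True


for entry in ("peter", "Peter"):
    for word in ("peter", "Peter", "PETER"):
        print(entry, word, matches(entry, word))
```

The form peter accepts all three words, while Peter rejects peter; a sequence with an extra space, such as Aujourd' hui against aujourd'hui, is rejected because the characters no longer line up.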
251. se in pretty LaTeX documents

5.4.2 Printing a Graph

You can print a graph by clicking on "Print" in the "FSGraph" menu or by pressing <Ctrl+P>.

WARNING: You should make sure that the page orientation parameter (portrait or landscape) corresponds to the orientation of your graph.

You can set up the printing preferences by clicking on "Page Setup" in the "FSGraph" menu. You can also print all open graphs by clicking on "Print All".

Chapter 6 Advanced use of graphs

6.1 Types of graphs

Unitex can handle several types of graphs that correspond to the following uses: automatic inflection of dictionaries, preprocessing of texts, normalization of text automata, dictionary graphs, search for patterns, disambiguation and automatic graph generation. These different types of graphs are not interpreted in the same way by Unitex. Certain operations, like transduction, are allowed for some types and forbidden for others. In addition, special symbols are not the same depending on the type of graph. This section presents each type of graph and shows their peculiarities.

6.1.1 Inflection transducers

An inflection transducer describes the morphological variation that is associated with a word class by assigning inflectional codes to each variant. The paths of such a transducer describe the modifications that have to be applied to the canonical forms, and the corresponding outputs contain the inflectional
252. shown on Figure 6.36 with default settings (ignoring outputs, limit = 100 paths):

[Path listing; recoverable fragments: <NB>, <boule>, de, glace a la pistache, glace a la fraise, glace a la vanille, glace vanille, glace fraise, glace pistache, pistache, fraise, vanille]

[Figure 6.36: Sample graph]

6.6 Graph collections

It can happen that one wants to apply several grammars located in the same directory. For that, it is possible to automatically build a grammar starting from a file tree structure. Let us suppose, for example, that one has the following tree structure:

• Dicos:
    – Banque:
        * carte.grf
    – Nourriture:
        * eau.grf
        * pain.grf
    – truc.grf

If one wants to gather all these grammars into a single one, one can do it with the "Build Graph Collection" command in the "FSGraph > Tools" sub-menu. One configures this operation by means of the window seen in figure 6.37.

[Figure 6.37: Building a graph collection]

In the Source
253. sider all possible combinations, which causes the creation of numerous paths carrying unknown words that are mixed with the labeled paths. Figure 7.9 shows an example of such an automaton of a Thai sentence.

[Figure 7.9: Automaton of a Thai sentence]

It is possible to suppress parasite paths. You have to select the option "Clean Text FST" in the configuration window for the construction of the text automaton (cf. figure 7.10). This option indicates to the automaton construction program that it should clean up each sentence automaton.

This cleaning is carried out according to the following principle: if several paths are concurrent in the automaton, the program keeps those that contain the fewest unlabeled tokens. For instance, the compound adverb aujourd'hui is preferred to the sequence made of aujourd followed by a quote and hui, because aujourd and the quote are both unlabeled.

[Configuration window options: Build clitic normalization grammar (available only for Portuguese (Portugal)); Apply the Normalization grammar (Norm.fst2); Clean Text FST; Use morpheme structures (available for Korean); Normalize according to Elag tagset.def; Use Following Dictionaries previously co
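The cleaning principle stated above (among concurrent paths, keep those with the fewest unlabeled tokens) can be sketched as follows. The data model is an assumption made for the example: a path is a list of (token, label) pairs, and `None` marks an unlabeled token. The label given to hui is purely hypothetical.

```python
def clean(paths):
    """Keep, among concurrent paths, those with the fewest unlabeled tokens.

    Each path is a list of (token, label) pairs; label None means that
    the token is unlabeled (hypothetical representation).
    """
    if not paths:
        return []

    def unlabeled(path):
        return sum(1 for _, label in path if label is None)

    best = min(unlabeled(path) for path in paths)
    return [path for path in paths if unlabeled(path) == best]


compound = [("aujourd'hui", "ADV")]
# "X" is a placeholder label; the excerpt only says that aujourd and the
# quote are unlabeled.
split = [("aujourd", None), ("'", None), ("hui", "X")]

kept = clean([split, compound])
```

Several paths are kept when they tie on the number of unlabeled tokens, which matches the wording "keeps those that contain the fewest".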
254. sidered until now that this information is more relevant to syntax than to lexical analysis, and we have thus not integrated it into the description of the tagset. These codes are thus automatically eliminated when the text automaton is loaded, which reduces the rate of ambiguity.

In order to distinguish the effects bound to the tagset from those of the ELAG grammars, it is advised to proceed to a preliminary stage of normalization of the text automaton before applying disambiguation grammars to it. This normalization is carried out by applying to the text automaton a grammar that does not impose any constraint, like that of figure 7.20. Note that this grammar is normally present in the Unitex distribution and precompiled in the file norm.rul.

[Figure 7.20: ELAG grammar without any constraint]

Footnote 3: This code indicates that the adjective must appear on the left of the noun to which it refers, as is the case for bel.

The result of applying such a grammar is that the original automaton is cleaned of all the codes which either are not described in the tagset.def file or do not conform to this description (because of unknown grammatical categories or invalid combinations of inflectional features). By then replacing the text automaton with this normalized automaton, one can be sure that later modifications of the automaton will only be effects of ELAG grammars.

7.3.7 Grammar Optimization

Compilation of ELAG grammars by
255. sk <!DIC> allows you to find all unknown words in a text. These unknown forms are mostly proper names, neologisms and spelling errors.

The negation of a dictionary mask like <V:G> will match any word except for those that are matched by this mask. For instance, <!V:G> will not match the word being, even if there are homonymic non-verbal entries in the dictionaries:

    being,.A
    being,.N+Abst:s
    being,.N+Hum:s

[Figure: concordance over the English corpus ivanhoe (concord.html)]
256. so as students union and students union, in singular or plural in each case. Our formalism allows both types of variation to be included in one description (cf. Figure 10.8).

[Figure 10.8: Inflection graph for student union]

Figure 10.9 shows an example in which, in addition to the insertion of a new constituent, the order of constituents may be reversed. The upper path generates e.g. birth date and birth dates, while the lower one represents the syntactic variants of the previous forms: date of birth and dates of birth.

[Figure 10.9: Inflection graph for birth date]

Interface with the Morphological System for Simple Words

MULTIFLEX is an implementation of the formalism for the inflectional morphology of MWUs presented above. It supposes the existence of a morphological system for single words which satisfies the following interface constraints:

• For a given sequence of characters, it returns its segmentation into indivisible graphical units (tokens), cf. section 10.2.2. For instance, in the case of the Unitex definition of a token, the sequence Athens 04 is to be divided into 5 tokens.
• For a given simple inflected form, it returns all its possible morphological identifications. A morphological identification has to allow the generation of any other inflect
257. strazxni sudija istrazxni sudija NC_AXNF N Comp 5vms istrazxnim sudijom istrazxni sudija NC_AXNF N Comp 6vms Dinkicx Mirosinka Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName slvf Dinkicx Mirosinke Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s2vf Dinkicx Mirosinki Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s3vf Dinkicx Mirosinku Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s4vf Dinkicx Mirosinka Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s5vf Dinkicx Mirosinkom Mirosinka Dinkicx NC_ImePrezime N Comp Hum tPersName s6vf Dinkicx Mirosinki Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s7vf Mirosinka Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName slvf Mirosinke Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s2vf Mirosinki Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s3vf Mirosinku Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s4vf Mirosinka Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s5vf Mirosinkom Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s6vf Mirosinki Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s7Jvf gladni kao vuk gladan kao vuk AC_A3XN2 slmgda hungry as a wolf gladan kao vuk gladan kao vuk AC_A3XN2 slmgka hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 slfgea hungry as a wolf gladno kao vuk gladan kao vuk AC_A3XN2 slngea hungry as a wolf gladnoga kao vuk gladan kao vuk AC_A3XN2 s2mgda
258. [Table of contents, partly illegible in extraction]

12.8 Dictionaries
    12.8.2 The .inf files
    12.8.3 Dictionary information file
    12.8.4 The CHECK_DIC.TXT file
12.9 ELAG files
12.11 Various other files
    12.11.4 The concord.n file
    12.11.5 Normalization rule file
    12.11.6 Forbidden word file
Appendix A: GNU General Public License
Appendix B: GNU Lesser General Public License
Appendix C: Lesser General Public License For Linguistic Resources

Introduction

Unitex i
259. t all tokens that are not made of letters (cf. figure 4.2). This mask does not recognize the sentence separator {S} and the special tag {STOP}.

[Figure 4.2: concordance over the English corpus ivanhoe (concord.html)]
260. [Figure 2.7: Opening a Unicode text]

[Figure 2.8: Preprocessing Window]

tion of separators and split the text into tokens. Click on "Cancel and close text" to cancel the operation.

2.5.1 Normalization of separators

The standard separators are the space, the tab and the newline characters. There can be several separators following each other, but since this isn't useful for linguistic analyses, separators are normalized according to the following rules:

• a sequence of separators that contains at least one newline is replaced by a single newline;
• all other sequences of separators are replaced by a single space.

The distinction betw
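The two normalization rules of section 2.5.1 translate directly into a regular-expression substitution. This is a sketch of the rules as stated, not the code used by Unitex; the function name is hypothetical.

```python
import re

# A run of one or more standard separators: space, tab or newline.
SEPARATORS = re.compile(r"[ \t\n]+")


def normalize_separators(text):
    """Replace each run of separators by a single newline or a single space."""
    def replacement(match):
        # Rule 1: a run containing at least one newline becomes one newline.
        # Rule 2: any other run becomes one space.
        return "\n" if "\n" in match.group(0) else " "

    return SEPARATORS.sub(replacement, text)


sample = "one \t two \n\n  three\tfour"
normalized = normalize_separators(sample)
print(repr(normalized))
```

Applying the function twice gives the same result as applying it once, since a normalized text only contains runs of length one.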
261. t file:

    Serbian
    <CATEGORIES>
    Nb: s, p, w
    Case: 1, 2, 3, 4, 5, 6, 7
    Gen: m, f, n
    Anim: v, q, g
    Comp: a, b, c
    Det: d, k, e
    <CLASSES>
    noun: (Nb,<var>),(Case,<var>),(Gen,<var>),(Anim,<fixed>)
    adj: (Nb,<var>),(Case,<var>),(Gen,<var>),(Anim,<var>),(Comp,<var>),(Det,<var>)
    adv:

The particularity of this morphological model is not only its richness but also the existence of "no-care" features like Anim=g or Det=e. These features agree with all other features in the same category. They are used only for some particular subclasses of nouns or adjectives, and are necessary for a better compactness of the inflection paradigms of simple words, which are already very large and would be even larger if no "no-care" symbols were used.

Let us assume that the equivalences between the above features and their corresponding codes in DELA dictionaries are given by the following Equivalences.txt file:

    Serbian
    s : Nb=s
    p : Nb=p
    w : Nb=w
    1 : Case=1
    2 : Case=2
    3 : Case=3
    4 : Case=4
    5 : Case=5
    6 : Case=6
    7 : Case=7
    m : Gen=m
    f : Gen=f
    n : Gen=n
    v : Anim=v
    q : Anim=q
    g : Anim=g
    a : Comp=a
    b : Comp=b
    c : Comp=c
    d : Det=d
    k : Det=k
    e : Det=e

Consider the following sample Serbian DELAC file (the DELAS inflection codes may vary from those present in Unitex):

zxiro racyun racyun Nl ms1lq NC_2XN1 N Comp avio prevoznik prevoznik N10 mslv NC_2XN2 N Comp predsednik predsednik N10 mslv d
262. t it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things.

To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it.

For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. You must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights.

We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library.

To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be intro
263. te an output with a call to a subgraph. The grammars must not contain void loops, because the Unitex programs cannot terminate the exploration of such a grammar. A void loop is a configuration that causes the Locate program to enter an infinite loop. Void loops can originate from transitions that are labeled by the empty word or from recursive calls to subgraphs. Void loops due to transitions with the empty word can have two origins, the first of which is illustrated by Figure 6.8. This type of loop arises because a transition with the empty word cannot be eliminated automatically by Unitex when it is associated with an output. Thus the transition with the empty word of Figure 6.8 will not be suppressed and will cause a void loop. Figure 6.8: Void loop due to a transition by the empty word with a transduction. The second category of loop by epsilon concerns calls to subgraphs that can recognize the empty word. This case is illustrated in Figure 6.9: if the subgraph Adj recognizes epsilon, there is a void loop that Unitex cannot detect. Figure 6.9: Void loop due to a call to a subgraph that recognizes epsilon. The third possibility of void loops is related to recursive calls to subgraphs. Look at the graphs Det and DetCompose in Figure 6.10. Each of these graphs can call the other without reading any text. The fact that none of these two graphs has lab
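The recursive case described above (Det and DetCompose calling each other without reading any text) amounts to a cycle in a call graph. The Python sketch below shows one way such a cycle could be checked, assuming a hypothetical mapping `empty_calls` from each graph name to the graphs it can call before consuming any token. This is an illustration of the idea only, not the check that Unitex performs.

```python
def has_void_loop(empty_calls):
    """Return True if some graph can reach itself through calls made
    without reading any input (depth-first search for a cycle)."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}

    def visit(graph):
        color[graph] = GREY
        for callee in empty_calls.get(graph, ()):
            state = color.get(callee, WHITE)
            if state == GREY:  # back edge: callee is already on the path
                return True
            if state == WHITE and visit(callee):
                return True
        color[graph] = BLACK
        return False

    return any(color.get(g, WHITE) == WHITE and visit(g) for g in empty_calls)
```

With `{"Det": ["DetCompose"], "DetCompose": ["Det"]}` the function reports a loop; removing either call makes it report none.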
264. ted you can move them by clicking and dragging the cursor without releasing the button In order to cancel the selection click on an empty area of the graph If you click on a box all boxes of the selection will be connected to it 1To avoid confusion graph calls that refer to the repository are displayed in brown instead of grey 21f you are working on KDE you can deactivate lt Alt Click gt in kcontrol 5 2 EDITING GRAPHS 79 Figure 5 12 Selecting several boxes You can perform a copy paste with several boxes Select them and press lt Ctrl C gt or click on Copy in the Edit menu The selection is now in the Unitex clipboard You can then paste this selection by pressing lt Ctrl V gt or by selecting Paste in the Edit menu Saturday Sunday Figure 5 13 Copy Paste of a multiple selection NOTE You can paste a multiple selection into a different graph than the one where you copied it from In order to delete boxes select them delete the text that they contain i e the text presented in the text field above the window and press the Enter key The initial and final states cannot be deleted 5 2 4 Transducers A transducer is a graph in which outputs can be associated with boxes To insert an output use the special character All characters to the right of it will be part of the output Thus 80 CHAPTER 5 LOCAL GRAMMARS the text one two three number results in a box like in f
265. teger indicates the number of the label or sub graph that corresponds to the transition Labels are numbered starting at 0 Sub graphs are represented by nega tive integers which explains why the numbers preceding the names of the graphs are negative e the second integer represents the number of the result state after the transition In each graph the states are numbered starting at 0 By convention state 0 is the initial state Each state definition line terminates with a space The end of each graph is marked by a line containing an followed by a space and a newline Labels are defined after the last graph If the line begins with the character the contents of the label is to be searched without allowing case variations This information is not used if the label is not a word If the line starts with a capitalization variants are authorized If a label carries a transducer output sequence the input and output sequences are separated by the character example the DET By convention the first label is always the empty word lt E gt even if that label is never used for any transition The end of the file is indicated by a line containing the f character followed by a newline 222 CHAPTER 12 FILE FORMATS 12 4 Texts This section presents the different files used to represent texts 12 4 1 txt files txt files are text files encoded in Unicode Little Endian These files should not contain any opening or closing braces
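The pair structure of a state line described above (first integer: label or sub-graph number; second integer: destination state) can be sketched in Python as follows. The function assumes the caller has already removed whatever marker character starts the state line, so that only the integers remain; the function name and the tuple layout are choices made for this example.

```python
def parse_transitions(numbers):
    """Parse the integer pairs of one state line.

    `numbers` is assumed to hold only the integers of the line, e.g.
    "3 1 -2 4". A negative first integer denotes a call to a sub-graph.
    """
    values = [int(tok) for tok in numbers.split()]
    if len(values) % 2:
        raise ValueError("transitions must come in (label, state) pairs")
    transitions = []
    for ref, target in zip(values[0::2], values[1::2]):
        if ref < 0:
            transitions.append(("subgraph", -ref, target))
        else:
            transitions.append(("label", ref, target))
    return transitions
```

The trailing space that ends each state definition line is harmless here, because `split()` ignores it.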
266. ters e lt lt ss gt gt contains ss e lt lt a gt gt begins with a e lt lt ez gt gt ends with ez e lt lt a s gt gt contains a followed by any character followed by s e lt lt a s gt gt contains a followed by a sequence of any character followed by s e lt lt ss tt gt gt contains ss ortt e lt lt aeiouy gt gt contains a non accentuated vowel e lt lt aeiouy 3 5 gt gt contains a sequence of non accentuated vowels whose length is between 3 and 5 e lt lt e gt gt contains followed by an optional e e lt lt ss e gt gt contains ss followed by an optional character which is not e It is possible to combine these elementary filters to form more complex filters e lt lt ai ble gt gt ends with able or ible e lt lt anti pro gt gt begins with anti or pro followed by an optional dash e lt lt rst aeiouy 2 gt gt a word formed by 2 or more sequences beginning with r s or t followed by a non accentuated vowel e lt lt 1 1 e gt gt does not begin with 1 unless the second letter is an e in other words any word except the ones starting with le Such constraints are better de scribed using contexts see section 6 3 By default a morphological filter alone is regarded as applying it to the lexical mask lt TOKEN gt that means any token except space and STOP On the other hand when a filter follows a lexical mask immediately it a
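The filter descriptions above correspond to ordinary regular expressions. The following Python sketch restates a few of them with the standard `re` module. Python's regex dialect is used here only as a stand-in, so the patterns are written from the prose descriptions ("contains ss", "ends with able or ible", and so on) and may differ in detail from the syntax Unitex accepts between its filter delimiters.

```python
import re

# Patterns written from the prose descriptions above (assumed equivalents).
FILTERS = {
    "contains ss": r"ss",
    "begins with a": r"^a",
    "ends with ez": r"ez$",
    "ends with able or ible": r"[ai]ble$",
    "3 to 5 unaccented vowels in a row": r"[aeiouy]{3,5}",
}


def matches(pattern, token):
    """True if the token satisfies the filter (search, not full match)."""
    return re.search(pattern, token) is not None
```

For example, `matches(FILTERS["ends with able or ible"], "possible")` holds, while `matches(FILTERS["begins with a"], "banana")` does not.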
267. test validity of any such claims this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that sys tem it is up to the author donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License 12 If the distribution and or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries so that distribution is permitted only in or among countries not thus excluded In such case this License incorporates the limitation as if written in the body of this License 13 The Free Software Foundation may publish revised and or new versions of the Lesser General Public License from time to time Such new versions will be similar in spirit to the present version but may differ in detail to address new problems or concerns Each version is given a distinguishing version number If the Library specifies a version number of this License which applies to it
268. text file. 12.6.2 The concord.txt file. The concord.txt file is a text file that represents a concordance. Each occurrence is encoded in a line composed of three character sequences separated by a tab, representing the left context, the occurrence (possibly modified by transducer outputs) and the right context. 12.6.3 The concord.html file. The concord.html file is an HTML file that represents a concordance. This file is encoded in UTF-8. The title of the page is the number of occurrences it describes. The lines of the concordance are encoded as lines where the occurrences are considered to be hypertext links. The reference associated with each of these links has the following form: <a href="X Y Z">. X and Y represent the start and end positions of the occurrence, in characters, in the file name_of_text.snt. Z represents the number of the sentence in which this occurrence appears. All spaces at the left and right edges of lines are encoded as non-breaking spaces (&nbsp; in HTML), which preserves the alignment of the utterances even if one of them has a left context with spaces. NOTE: If the concordance has been constructed with the glossanet parameter, the HTML file has the same structure except for the links. In these concordances, the occurrences are real links pointing to the web server of the GlossaNet application. For more information on GlossaNet, consult the link on the Unitex web site
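Since each concord.txt line is described above as three tab-separated sequences, reading one line back is straightforward. The Python sketch below is an illustration under that description only; the `Occurrence` type and the function name are introduced for the example, and decoding the file from its on-disk encoding is left to the caller.

```python
from typing import NamedTuple


class Occurrence(NamedTuple):
    left: str   # left context
    match: str  # the occurrence, possibly modified by transducer outputs
    right: str  # right context


def parse_concord_line(line):
    """Split one concordance line into its three tab-separated parts."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(parts)}")
    return Occurrence(*parts)
```

Leading and trailing spaces inside each field are kept on purpose, since they belong to the contexts.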
269. text format csv extension You can then parameterize the output format with a window as shown on Figure 8 2 Choose Unicode select tabulation as column separator and do not set any text delimiter Export de texte d x Options de champ Jeu de caract res Unicode y L S d Annuler S parateur de champ trab y S parateur de texte v Aide I Largeur de colonne fixe Figure 8 2 Saving a table with OpenOffice org Calc During the generation of the graphs Unitex skips the first line considering that it contains the headings of the columns It is therefore necessary to ensure that the headings of the columns occupy exactly one line If there is no line for the heading the first line of a table will be ignored anyway and if there are multiple heading lines from the second line on they will be interpreted as lines of the table 8 2 3 Parameterized graphs Parameterized graphs are graphs with variables referring to the columns of a lexicon grammar table This mechanism is usually used with syntactical graphs but nothing prevents the con struction of parameterized graphs for inflection preprocessing or for normalization Variables that refer to columns are formed with the symbol followed by the name of the column in capital letters the columns are named starting with A Example C refers to the third column of the table Whenever a variable takes the value of a or sign the sign corresponds to t
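The column naming rule above (columns are named starting with A, so C is the third column) can be made concrete with a small Python sketch. Single letters follow directly from the text; the handling of multi-letter names such as AA in spreadsheet style is an assumption of this example, as is the function name.

```python
def column_index(name):
    """Return the 1-based column number for a column name in capitals.

    "A" -> 1, "C" -> 3. Multi-letter names are read spreadsheet-style
    ("AA" -> 27), which is an assumption of this sketch.
    """
    if not name:
        raise ValueError("empty column name")
    index = 0
    for ch in name:
        if not "A" <= ch <= "Z":
            raise ValueError(f"invalid column letter {ch!r}")
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index
```

When such an index is used to read a table row, the first line of the table has to be skipped first, since it is treated as the column headings.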
270. that the Linguistic Resource and its use are covered by this License You must supply a copy of this License If the package during execution displays copyright notices you must include the copyright notice for the Linguistic Resource among them as well as a reference directing the user to the copy of this License Also you must do one of these things a Accompany the package with the complete corresponding machine readable leg ible form of the Linguistic Resource including whatever changes were used in the package which must be distributed under Sections 1 and 2 above and if the package contains an encrypted form of the Linguistic Resource with the com plete machine readable work that uses the Linguistic Resource as object code and or source code so that the user can modify the Linguistic Resource and then encrypt it to produce a modified package containing the modified Linguistic Re source b Use a suitable mechanism for combining with the Linguistic Resource A suit able mechanism is one that will operate properly with a modified version of the Linguistic Resource if the user installs one as long as the modified version is interface compatible with the version that the package was made with c Accompany the package with a written offer valid for at least three years to give the same user the materials specified in Subsection 4a above for a charge no more than the cost of performing this distribution d If distributi
271. the FSGraph menu. 7.5 Converting the text automaton into linear text. If the text automaton does not contain any lexical ambiguity, it is possible to build a text file corresponding to the unique path of the automaton. Go into the Text menu and click on Convert FST Text to Text. You can set the output text file in the window shown in Figure 7.25. Figure 7.25: Setting output file for linearization of the text automaton. If the automaton is not linear, an error message will give you the number of the first sentence that contains ambiguity. Otherwise, the Fst2Unambig program will build the output file according to the following rules: the output file contains one line per sentence; every line but the last is ended by {S}; for each box, the program writes its content followed by a space. Figure 7.26: Example of a linear text automaton. NOTE: correcting spaces in the output text can only be done manually. If the original text is the one of the text automaton shown in Figure 7.26, the output text will be 2 3 cats cat N Anl p are be V P2s Plp P2p
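The three output rules listed above can be sketched in a few lines of Python. The input representation (a list of sentences, each a list of box contents) and the function name are assumptions of this example. The sentence delimiter is a parameter, with `{S}` assumed as the default.

```python
def linearize(sentences, delimiter="{S}"):
    """Build linear text from an unambiguous automaton.

    One line per sentence; each box content is followed by a space;
    every line but the last ends with the sentence delimiter.
    """
    lines = []
    for i, boxes in enumerate(sentences):
        line = "".join(box + " " for box in boxes)
        if i < len(sentences) - 1:
            line += delimiter
        lines.append(line)
    return "\n".join(lines)
```

Because every box is followed by a space, the result carries extra spaces (for example before punctuation), which is consistent with the note that spaces must be corrected manually.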
272. the analysis has been produced Language options e D dutch e G german e N norwegian e R russian NOTE for Dutch or Norwegian words the program tries to read a text file containing a list of forbidden words This file is supposed to be named ForbiddenWords t xt see section 12 11 6 and stored in the same directory than BIN 11 24 Reconstrucao Reconstrucao OPTIONS lt index gt This program generates a normalization grammar designed to be applied before the con struction of an automaton for a Portuguese text The lt index gt file represents a concordance which has to be produced by applying in MERGE mode to the considered text a grammar that extracts all forms to be normalized This grammar is called V Pro Suf and is stored in the Portuguese Graphs Normalization directory OPTIONS 11 25 REG2GRF 211 a ALPH alphabet ALPH the alphabet file to use r ROOT root ROOT the inverse bin dictionary to use to find forms in the fu ture and conditional given their canonical forms It has to be obtained by compressing the dictionary of verbs in the future and conditional with the parameter flip see section 11 2 d BIN dictionary BIN the bin dictionary to use p PRO pronoun_rules PRO the fst2 grammar describing pronoun rewrit ing rules n PRO nasal_pronoun_rules PRO the fst2 grammar describing nasal pro noun rewriting rules o OUT output 0UT the name of the
273. the ElagComp program consists in building an automaton whose language is the set of the sequences of lexical tags (or lexical analyses of a sentence) which are not accepted by the grammars. This task is complex and can take a lot of time. It is however possible to speed it up appreciably by observing certain principles when writing grammars. Limiting the number of branches in the then part: It is recommended to limit the number of then parts of a grammar to a minimum. This can considerably reduce the compile time of a grammar. Generally, a grammar having many then parts can be rewritten with one or two then parts without a loss of legibility. This is for example the case of the grammar in Figure 7.21, which imposes a constraint between a verb and the pronoun which follows it. Figure 7.21: ELAG grammar checking verb pronoun agreement. As one can see in Figure 7.22, one can write an equivalent grammar by factorizing all the then parts into only one. The two grammars will have exactly the same effect on the text automaton, but the second one will be compiled much more quickly. Figure 7.22: Optimized ELAG grammar checking verb pronoun agreement. Using lexical symbols: It is better to use lemmas only when it is necessary
274. the combined library with a copy of the same work based on the Library uncombined with any other library facilities This must be distributed under the terms of the Sections above b Give prominent notice with the combined library of the fact that part of it is a work based on the Library and explaining where to find the accompanying uncombined form of the same work 8 You may not copy modify sublicense link with or distribute the Library except as expressly provided under this License Any attempt otherwise to copy modify sublicense link with or distribute the Library is void and will automatically terminate your rights under this License However parties who have received copies or rights from you under this License will not have their licenses terminated so long as such parties remain in full compliance 9 You are not required to accept this License since you have not signed it However nothing else grants you permission to modify or distribute the Library or its derivative works These actions are prohibited by law if you do not accept this License Therefore by modifying or distributing the Library or any work based on the Library you indicate your acceptance of this License to do so and all its terms and conditions for copying dis tributing or modifying the Library or works based on it 10 Each time you redistribute the Library or any work based on the Library the recipi ent automatically receives a license from the
275. the files modified to carry prominent notices stating that you changed the files and the date of any change c You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License These requirements apply to the modified work as a whole If identifiable sec tions of that work are not derived from the Linguistic Resource and can be rea sonably considered independent and separate works in themselves then this Li cense and its terms do not apply to those sections when you distribute them as separate works But when you distribute the same sections as part of a whole which is a work based on the Linguistic Resource the distribution of the whole must be on the terms of this License whose permissions for other licensees ex tend to the entire whole and thus to each and every part regardless of who wrote it Thus it is not the intent of this section to claim rights or contest your rights to work written entirely by you rather the intent is to exercise the right to con trol the distribution of derivative or collective works based on the Linguistic Re source In addition mere aggregation of another work not based on the Linguistic Re source With the Linguistic Resource or with a work based on the Linguistic Re source on a volume of a storage or distribution medium does not bring the other work under the scope of this License 3 A program that contains no derivative of any portion of t
276. the following: In the text automaton, if a path of the if part is recognized, then it must also be recognized by the then part of the grammar, or it will be withdrawn from the text automaton. If tu follows a verb in the 2nd person singular and a dash, then it is a pronoun and not the past participle of taire. Figure 7.12: ELAG grammar elag-tu.grf. Figure 7.12 shows an example of a grammar. The if part recognizes a verb in the 2nd person singular followed by a dash and tu, either as a pronoun or as a past participle of the verb taire. The then part imposes that tu is then regarded as a pronoun. Figure 7.13 shows the result of the application of this grammar on the sentence Feras-tu cela bientôt ? One can see in the automaton at the bottom that the path corresponding to tu past participle was eliminated. Synchronization point: The if and then parts of an ELAG grammar are divided into two parts by <!> in the if part and <=> in the then part. These symbols form a synchronization point. This makes it possible to write rules in which the if and then constraints are not necessarily aligned, as is the case for example in Figure 7.14. This grammar is interpreted in the following way: if a dash is found followed by il, elle or on, then this dash must be preceded by a verb, possibly
277. the text and looks like the following: "6 matches", "6 recognized units", "0.004 of the text is covered". The first line gives the number of found occurrences and the second the number of units covered by these occurrences. The third line indicates the ratio between the covered units and the total number of units in the text. 12.11.5 Normalization rule file. This file is used by the Normalization and XMLizer programs. It represents replacement rules. Each line stands for a rule made of an input sequence and an output sequence separated by a tabulation character. If you want to use the tabulation or the new line, you must protect them with a backslash like this: 123 t ONE_TWO_THREE_NEW_LINE 12.11.6 Forbidden word file. The PolyLex program requires a forbidden word file for Dutch and Norwegian. This raw text file is supposed to be named ForbiddenWords.txt. It must be in the user's Dela directory corresponding to the language to work on. Each line is supposed to contain one forbidden word. Appendix A GNU General Public License. This license can also be found in [32]. Version 2, June 1991. Copyright 1989, 1991 Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble. The licenses for most software a
278. the work to be licensed at no charge to all third parties under the terms of this License 250 CHAPTER 12 FILE FORMATS d If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility other than as an argument passed when the facility is invoked then you must make a good faith effort to ensure that in the event an application does not supply such function or table the facility still operates and performs whatever part of its purpose remains meaningful For example a function in a library to compute square roots has a purpose that is en tirely well defined independent of the application Therefore Subsection 2d requires that any application supplied function or table used by this function must be optional if the application does not supply it the square root function must still compute square roots These requirements apply to the modified work as a whole If identifiable sections of that work are not derived from the Library and can be reasonably considered independent and separate works in themselves then this License and its terms do not apply to those sections when you distribute them as separate works But when you distribute the same sections as part of a whole which is a work based on the Library the distribution of the whole must be on the terms of this License whose permissions for other licensees extend to the entire whole and thus to each and e
279. thin The formula applied is: lexical ambiguity rate = exp(log(number of paths) / text length). The relationship between the ambiguity rate before and after applying the grammars gives a measure of their efficiency. All this information is displayed in the ELAG processing window. 7.3.6 Description of the tag sets. The Elag and ElagComp programs require a formal description of the tag set to be used in dictionaries. This description consists essentially of an enumeration of all the parts of speech present in the dictionaries, with, for each of them, the list of syntactic and inflectional codes compatible with it and a description of their possible combinations. This description must be contained in a file called tagset.def and placed in your personal folder, in the Elag subfolder of the desired language. tagset.def file: Here is an extract of the tagset.def file used for French: NAME francais POS ADV POS PRO inflex pers 1 2 3 genre nombre s discr Il D th subcat Pind Pdem PpvIL PpvLUI PpvLE Ton PpvPR PronQ Dnom Pposs1s complete Pind <genre> <nombre> Pdem <genre> <nombre> Pposs1s <genre> <nombre> Pposs1p <genre> <nombre> Pposs2s <genre> <nombre> Pposs2p <genre> <nombre> Pposs3s <genre> <nombre> Pposs3p <genre> <nombre> Pp
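Reading the formula at the start of this snippet as exp(log(number of paths) / text length), the rate is the text-length-th root of the number of paths, that is, an average number of readings per unit of text. The Python sketch below computes it under that reading; the function and parameter names are chosen for this example.

```python
import math


def lexical_ambiguity_rate(number_of_paths, text_length):
    """exp(log(number of paths) / text length).

    Equivalent to number_of_paths ** (1 / text_length): a value of 1.0
    means the automaton has a single path, i.e. no ambiguity.
    """
    if number_of_paths < 1 or text_length < 1:
        raise ValueError("both arguments must be at least 1")
    return math.exp(math.log(number_of_paths) / text_length)
```

For a text of 3 units with 8 possible paths the rate is about 2, meaning two readings per unit on average. Comparing the value before and after applying the grammars gives the efficiency measure mentioned in the text.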
280. tion x Display Colors vi Date Background vi File Name Foreground Pathname Auxiliary Nodes vi Frame Selected Nodes vi Right to Left Comment Nodes Fonts Default Input Times New Roman 24 OK A AA AA lt lt Output Arial Unicode MS 12 gement Figure 5 25 Configuring the display options of a graph The color parameters are e Background the background color e Foreground the color used for the text and for the box display e Auxiliary Nodes the color used for calls to sub graphs e Selected Nodes the color used for selected boxes e Comment Nodes the color used for boxes that are not connected to others The other parameters are e Date display of the current date in the lower left corner of the graph e File Name display of the graph name in the lower left corner of the graph e Pathname display of the graph name along with its complete path in the lower left corner of the graph This option only has an effect if the option File Name is selected e Frame draw a frame around the graph e Right to Left invert the reading direction of the graph see an example in figure 5 26 You can reset the parameters to the default ones by clicking on Default If you click on OK only the current graph will be modified In order to modify the preferences for a lan guage as a default click on Preferences in the Info menu and choose the tab Graph
281. tionaries using the up and down arrows as shown on figure 2 13 The button Set Default allows you to define the current selection of dictionaries as the default This default selection will then be used during preprocessing if you activate the option Apply All default Dictionaries If you right click on a dictio nary name the associated documentation if any will be displayed in the lower frame of the window 32 DLF 13284 simple word lexical entries a DET Dind s a N s Aaron N PR Hurm abandoned abandoned abandon V K lis abate V W Pis P2s Pip P2p abated abate V K I1is I2s I abbey N Conc s abbot N Hum s abbots abbot N Hum p DLC 274 compound lexical entries absolute necessity N XN 2 act of violence N NPN z21 4 agnus castus N XN NX Conc all around A DA 21 all comers N 2N 21 p all in Atz21 Anglo Saxon N XN Hum z21 s Anglo Saxons dnglo Saxon N as usual i asi zl CHAPTER 2 LOADING A TEXT ERR 412 unknown simple words Abdalla Abednego acidum adale Adelaide Adsum Alfred Alicia Allan aller altereth Ambrose Armon Andalusia andTermagaunt Anjou Anthony Anwold Apollyon appeareth Arcite arguest AtasV 21 Figure 2 12 Result after applying dictionaries to an English text 2 5 6 Analysis of compound words in Dutch German Norwegian and Russian In certain languages like Norwegian German and others it is possible to form new com pound w
282. tions are deactivated by default Other options e m main names prints the list of the encoding main names e a aliases prints the list of the encoding aliases e A all infos prints all the information about all the encodings e i X info x prints all the information about the encoding X The encodings can take values in the following list non exhaustive see below FRENCH ENGLISH GREEK THAI CZECH GERMAN SPANISH PORTUGUESE ITALIAN NORWEGIAN LATIN default latin code page windows 1252 Microsoft Windows 1252 Latin I Western Europe USA 200 CHAPTER 11 USE OF EXTERNAL PROGRAMS Microsoft Windows 1250 Central Europe Microsoft Windows 1257 Baltic Microsoft Windows 1251 Cyrillic Microsoft Windows 1254 Turkish Microsoft Windows 1258 Viet Nam windows 1250 windows 1257 windows 1251 windows 1254 windows 1258 iso 8859 1 iso 8859 15 ISO 8859 1 Latin 1 Europe de l ouest amp USA ISO 8859 15 Latin 9 Western Europe amp USA iso 8859 2 ISO 8859 2 Latin 2 Eastern and Central Europe iso 8859 3 ISO 8859 3 Latin 3 Southern Europe iso 8859 4 ISO 8859 4 Latin 4 Northern Europe iso 8859 5 ISO 8859 5 Cyrillic iso 8859 7 ISO 8859 7 Greek iso 8859 9 ISO 8859 9 Latin 5 Turkish iso 8859 10 next step ISO 8859 10 Latin 6 Nordic NextStep code page LITILE ENDIAN BIG ENDIAN UTF8 11 6 Dico Dico OPT
283. tions of this License If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations then as a consequence you may not distribute the Program at all For ex ample if a patent license would not permit royalty free redistribution of the Program by all those who receive copies directly or indirectly through you then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program If any portion of this section is held invalid or unenforceable under any particular circumstance the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances 12 11 VARIOUS OTHER FILES 243 10 11 Itis not the purpose of this section to induce you to infringe any patents or other prop erty right claims or to contest validity of any such claims this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices Many people have made generous contri butions to the wide range of software distributed through that system in reliance on consistent application of that system it is up to the author donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice This section is intended to make thoroughly clear what is believed to be a conseque
284. tles cries battles royals battles of nerve_ Formally a fully explicit description of the inflectional paradigms of MWUs requires an answer to the following questions e What is the MWU s morphological class noun adjective etc and thus what inflec tion categories number gender case etc are relevant to it 61 argue for a mor phosyntactically motivated definition of morphological classes a morphological class should fully determine the inflection categories the word inflects for as well as those that are lexically fixed for the word e g in Polish a noun has a gender and inflects for number and case e What are the exceptions to the inflection categories determined above E g in Polish wybory powszechne general election is a compound noun but it doesn t have a singular form although its head word wybory does 10 1 MULTI WORD UNITS 169 e What are the inflectional characteristics base form morphological class inflection paradigm etc of the single constituents of the MWU E g in French porte door is an uninflected verb in porte avion aircraft carrier while it is an inflected noun in porte fen tre French window which takes an s in plural portes fen tres e How should we combine the inflected forms of the single constituents in order to gen erate the inflected forms of the whole compound E g to inflect battle of nerves and battle cry in number we need to inflect t
285. ts the color in RGB format. FCOLOR x: defines the foreground color of the graph; x represents the color in RGB format. ACOLOR x: defines the color inside the boxes that correspond to the calls of sub-graphs; x represents the color in RGB format. SCOLOR x: defines the color used for writing in comment boxes (boxes that are not linked up with any others); x represents the color in RGB format. CCOLOR x: defines the color used for designing selected boxes; x represents the color in RGB format. DBOXES x: this line is ignored by Unitex. It is conserved to ensure compatibility with Intex graphs. DFRAME x: there will be a frame around the graph if x is y, not if it is n. DDATE x: puts the date at the bottom of the graph if x is y, not if it is n. DFILE x: puts the name of the file at the bottom of the graph, depending on whether x is y or n. DDIR x: prints the complete path of the graph, depending on whether x is y or n. This option has no effect if the DFILE option is set to n. DRIG x: displays the graph from right to left or left to right, depending on whether x is y or n. DRST x: this line is ignored by Unitex. It is conserved to ensure compatibility with Intex graphs. FITS x: this line is ignored by Unitex. It is conserved to ensure compatibility with Intex graphs. PORIENT x: this line is ignored by Unitex. It is conserved to ensure compatibility with Intex graphs. this line is ignored by
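The y/n display options described above lend themselves to a small reader. The Python sketch below assumes that each option sits on its own line as a key followed by a single space and its value (for example "DFRAME y"); that layout, the function name and the choice to skip the lines Unitex ignores are assumptions of this example rather than a specification of the file format.

```python
# Options documented above as taking y or n and having an effect in Unitex.
FLAG_KEYS = {"DFRAME", "DDATE", "DFILE", "DDIR", "DRIG"}


def parse_display_flags(lines):
    """Collect the y/n display options of a graph header as booleans."""
    flags = {}
    for line in lines:
        key, _, value = line.strip().partition(" ")
        if key in FLAG_KEYS and value in ("y", "n"):
            flags[key] = value == "y"
    return flags


def shows_path(flags):
    """DDIR only has an effect when DFILE is also set, as stated above."""
    return flags.get("DFILE", False) and flags.get("DDIR", False)
```

Lines such as DBOXES or DRST are skipped, mirroring the statement that they are kept only for compatibility with Intex graphs.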
286. u mouche <Gen=m; Nb=p> Figure 10.4: Inflection graph for MWUs inflecting like bateau-mouche. Unification Variables: An important feature of our formalism is unification variables. They are introduced by the dollar sign followed by an identifier, which may contain any number of characters, e.g. $1, $num_10, $c, etc. For example, Figure 10.5 shows a graph roughly equivalent to the one in Figure 10.4, in the sense that it generates the same inflected forms for the same MWUs. However, this time a single path represents both the singular and the plural form. That is possible due to the unification variable $n, which may be instantiated to any value of the domain of its category (Nb), here $n=s or $n=p. The instantiation is unique for all elements on a path: if we fix the singular value for the first constituent, the same value has to be set for the third one, as well as for the whole MWU. Similarly, if we fix $n to p while processing the first node, it has to remain p until the end of the path. Figure 10.5: Inflection graph for bateau-mouche with a unification variable. The inflection graph in Figure 10.5 applies to most kinds of French compounds of types Noun Noun and Noun Adjective (bateau-mouche, ange gardien, circuit séquentiel, etc.) which are of masculine gender. That is because the output of the final node contains Gen=m. For all compounds of the same types but of femi
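The effect of a unification variable described above (one value of Nb shared by the first constituent, the third constituent and the whole compound) can be imitated in Python. In the sketch below the small `FORMS` table, the function name and the output string format are all hypothetical; they stand in for the real inflection of the single constituents.

```python
# Toy inflection table for the constituents (hypothetical data).
FORMS = {
    "bateau": {"s": "bateau", "p": "bateaux"},
    "mouche": {"s": "mouche", "p": "mouches"},
}


def inflect_compound(first, separator, third, domain=("s", "p")):
    """Generate the forms of a compound along a single path.

    The same value n is used for the first constituent, the third
    constituent and the features of the whole unit, which is what the
    unification variable enforces.
    """
    results = []
    for n in domain:
        form = FORMS[first][n] + separator + FORMS[third][n]
        results.append((form, f"Gen=m;Nb={n}"))
    return results
```

Calling `inflect_compound("bateau", "-", "mouche")` produces the singular and the plural form from one description, where a graph without variables needs one path per number value.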
287. umber of matched occurrences, the number of recognized tokens, and the ratio between this number and the total number of tokens in the text. Figure 4.5: Search results (200 matches, 644 recognized units, 0.345% of the text is covered). After clicking on OK, you will see the window of Figure 4.6 appear, which allows you to configure the presentation of the matched occurrences. You can also open this window by clicking on Display Located Sequences in the Text menu. The list of occurrences is called a concordance. Figure 4.6: Configuration of the presentation of the found occurrences. The Modify text box offers the possibility to replace the matched occurrences with the generated outputs. This possibility will be examined in chapter 6. The Extract units box allows you to create a text file with all the sentences that do or do not contain matched units. With the button Set File
288. ure 5.19: Toolbar. The other six icons correspond to edit commands for boxes. The first one, a white arrow, corresponds to the normal edit mode for boxes. The five others correspond to specific tools. To use a tool, click on the corresponding icon: the mouse cursor changes its shape and mouse clicks are then interpreted in a particular fashion. What follows is a description of these tools, from left to right. Creating boxes: creates a box at the empty place where the mouse was clicked. Deleting boxes: deletes the box that you click on. Connect boxes to another box: using this tool, you select one or more boxes and connect them to another one. In contrast to the normal mode, the connections are inserted to the box on which the mouse button was released. Connect boxes to another box in the opposite direction: this tool performs the same operation as the one described above, but connects the boxes to the one clicked on in the opposite direction. Open a sub-graph: opens a sub-graph when you click on a grey line within a box. 5.3 Display options. 5.3.1 Sorting the lines of a box. You can sort the content of a box by selecting it and clicking on Sort Node Label in the Tools submenu of the FSGraph menu. This sort operation does not use the SortTxt program. It uses a basic sort mechanism that sorts the lines of the box according to the order of the characters in the Unicode encoding. 5.3.2 Zoom. The Zoom submenu allows
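The Unicode-order sort described in section 5.3.1 of this excerpt can be approximated in a few lines of Python, whose built-in string comparison also works on code points. This is only a sketch of the ordering, not the code used by the graph editor; the function name `sort_box_lines` is hypothetical.

```python
def sort_box_lines(lines):
    """Sort box lines by raw Unicode code point order.

    No alphabet file or locale is involved, so uppercase letters come
    before lowercase ones and accented letters come after plain ASCII.
    """
    return sorted(lines)


# "\u00e9" is the precomposed character e-acute (code point 233).
box = ["\u00e9t\u00e9", "zoo", "Zoo", "apple"]
print(sort_box_lines(box))
```

The example makes the practical consequence visible: "Zoo" sorts before "apple" and the accented word sorts last, which differs from a language-aware sort.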
289. utomaton <txtauto> and applies ambiguity removal rules to it.
OPTIONS:
  -l LANG/--language=LANG: ELAG configuration file for the language of the text;
  -r RULES/--rules=RULES: rule file compiled in the .rul format;
  -o OUT/--output=OUT: output text automaton;
  -d DIR/--directory=DIR: directory where ELAG rules are located.
11.8 ElagComp
ElagComp [OPTIONS]
This program compiles the ELAG grammar named GRAMMAR, or all the grammars specified in the RULES file. The result is stored in the OUT file that will be used by the Elag program.
OPTIONS:
  -r RULES/--rules=RULES: file listing ELAG grammars;
  -g GRAMMAR/--grammar=GRAMMAR: single ELAG grammar;
  -l LANG/--language=LANG: ELAG configuration file for the language of the grammar(s);
  -o OUT/--output=OUT: output file. By default, the output file name is the same as RULES, except for the extension, which is .rul;
  -d DIR/--directory=DIR: directory where ELAG rules are located.
11.9 Evamb
Evamb [OPTIONS] <txtauto>
This program computes an average lexical ambiguity rate on the text automaton <txtauto>, or just on the sentence whose number is specified by N. The results of the computation are displayed on the standard output. The text automaton is not modified.
OPTIONS:
  -s N/--sentence=N: sentence number.
11.10 ExplodeFst2
ExplodeFst2 [OPTIONS] <txtauto>
This program computes and stores
290. vIL <genre> <nombre> <pers> PpvLE <genre> <nombre> <pers> PpvLUI <genre> <nombre> <pers> Ton <genre> <nombre> <pers> lui elle moi PpvPR en y Prong o qui que quoi Dnom rien POS A adjectifs inflex genre m if nombre s cat gauche g droite d complete <genre> <nombre> pour de bonne humeur A au bord des larmes A par exemple POS V inflex temps pers genre nombre complete IJKPSTVWYGX 3 I D 3 Q F 2 D <pers> <nombre> <pers> <nombre> <pers> <nombre> <pers> <nombre> <pers> <nombre> <pers> <nombre> <pers> <nombre> s euss duss puiss fuss je 1 p 2 <nombre> <genre> <nombre>. The # symbol indicates that the remainder of the line is a comment. A comment can appear at any place in the file. The file always starts with the word NAME, followed by an identifier (français, for example). This is followed by the POS sections, one for each part of speech. Each section describes the structure of the lexical tags of the lexical entries belonging to the part of speech concerned. Each section is composed of 4 parts, which are all optional. inflex: this part enumerates the inflectional codes belonging to the grammatical cat
291. ve, infinitive, present participle, past participle, future indicative. Table 3.3: Common inflectional codes. However, these codes are not exclusive. Users can introduce their own codes and create their own dictionaries. For example, for educational purposes, one could use a marker faux-ami (false friend) in a French dictionary:
  blesser,.V+faux-ami/injure
  casque,.N+faux-ami/helmet
  journée,.N+faux-ami/day
It is equally possible to use dictionaries to add extra information. Thus, you can use the inflected form of an entry to describe an abbreviation and the canonical form to provide the complete form:
  DNA,DeoxyriboNucleic Acid.ACRONYM
  LADL,Laboratoire d'Automatique Documentaire et Linguistique.ACRONYM
  UN,United Nations.ACRONYM
3.2 Checking dictionary format. When dictionaries become large, it becomes tiresome to check them by hand. Unitex contains the program CheckDic, which automatically checks the format of DELAF and DELAS dictionaries. This program verifies the syntax of the entries. For each malformed entry, the program outputs the line number, the content of the line, and an error message. Results are saved in the file CHECK_DIC.TXT, which is displayed when the verification is finished. In addition to possible error messages, the file also contains the list of all characte
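The kind of check described in section 3.2 of this excerpt (line number, line content, error message for each malformed entry) can be sketched in Python as follows. This is a simplified illustration and not the CheckDic implementation: it assumes entries of the form `inflected,canonical.codes`, ignores escaped characters and comments, and the function names and error messages are hypothetical.

```python
def check_delaf_line(line):
    """Return an error message for a malformed entry, or None if it looks valid.

    Assumed simplified format: inflected,canonical.codes
    (the canonical form may be empty; escapes and comments are not handled).
    """
    if "," not in line:
        return "missing comma between inflected and canonical form"
    inflected, rest = line.split(",", 1)
    if not inflected:
        return "empty inflected form"
    if "." not in rest:
        return "missing period before grammatical codes"
    _canonical, codes = rest.split(".", 1)
    if not codes:
        return "empty grammatical code"
    return None


def check_dictionary(lines):
    """Collect (line number, line content, error message) for each bad entry."""
    errors = []
    for number, line in enumerate(lines, start=1):
        message = check_delaf_line(line)
        if message is not None:
            errors.append((number, line, message))
    return errors


sample = ["DNA,DeoxyriboNucleic Acid.ACRONYM", "blesser V", "horse,horse."]
for number, line, message in check_dictionary(sample):
    print(f"line {number}: {line!r}: {message}")
```

Reporting the line number together with the offending content mirrors what the text says the real program writes to its report file.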
292. vement en hongrois dans l'optique d'un traitement automatique. In F. Kiefer, G. Kiss, and J. Pajzs, editors, Papers in Computational Lexicography (COMPLEX), pages 257-265, Budapest, Research Institute for Linguistics, Hungarian Academy of Sciences, 1996. [69] Simoneta VIETRI. On the study of idioms in Italian. In Sintassi e morfologia della lingua italiana, Congresso internazionale della Società di Linguistica Italiana, Roma, Bulzoni, 1984. [70] Duško VITAS, Svetla KOEVA, Cvetana KRSTEV, and Ivan OBRADOVIC. Tour du monde through the dictionaries. In Matthieu Constant, Takuya Nakamura, Michele De Gioia, and Sara Vecchiato, editors, 27th International Conference on Lexis and Grammar (LGC'08), pages 249-256, September 2008.
293. very part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms o
294. vities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Library is not restricted, and output from such a program is covered only if its contents constitute a work based on the Library (independent of the use of the Library in a tool for writing it). Whether that is true depends on what the Library does and what the program that uses the Library does. 1. You may copy and distribute verbatim copies of the Library's complete source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of
295. vuk gladan kao vuk AC_A3XN2 p6mgea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p6mgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p6mgea hungry as a wolf gladnima kao vuk gladan kao vuk AC_A3XN2 p6fgea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p6fgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p6fgea hungry as a wolf gladnim kao vuk gladan kao vuk AC_A3XN2 p6fgea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p6fgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p6fgea hungry as a wolf gladnima kao vuk gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnim kao vuk gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnima kao vuk gladan kao vuk AC_A3XN2 p7mgea hungry as a wolf gladnima kao vuci gladan kao vuk AC_A3XN2 p7mgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p mgea hungry as a wolf gladnim kao vuk gladan kao vuk AC_A3XN2 p7mgea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p7mgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC_A3XN2 p7mgea hungry as a wolf gladnima kao vuk gladan kao vuk AC_A3XN2 p7fgea hungry as a wolf gladnima kao vuci glad
296. Figure 6.23: Results of the application of the grammar shown in Figure 6.22. However, you can capture text with variables (see section 6.7.5) and use them outside the left context, as shown in the grammar of Figure 6.24. So, with left and right contexts, you can distinguish between the pattern used to match something and the thing you want to extract in your results. For instance, the grammar shown in Figure 6.26 looks for expressions like the animal's, but only extracts nouns, as you can see in Figure 6.27. Figure 6.24: Using a variable in a left context. Figure 6.25: Results
297. whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with
298. wing line:
  hath,have.V:P3s /old form of has
The space that precedes the / character will be considered to be part of a 4-character inflectional code. It is possible to insert comments into a DELAF or DELAS dictionary by starting the line with a / character. Example:
  / English designates a pool spin
  English,.N+z3:s
Compound words with spaces or dashes. Certain compound words like acorn-shell can be written using spaces or dashes. In order to avoid duplicating the entries, it is possible to use the = character. At the time when the dictionary is compressed, the Compress program checks for each line whether the inflected or canonical form contains a non-escaped = character. If this is the case, the program replaces this entry by two entries: one where the = character is replaced by a space, and one where it is replaced by a dash. Thus, the following entry:
  acorn=shells,acorn=shell.N:p
is replaced by the following entries:
  acorn shells,acorn shell.N:p
  acorn-shells,acorn-shell.N:p
NOTE: If you want to keep an entry that includes the = character, escape it using \ as in the following example:
  E\=mc2,.FORMULA
This replacement is done when the dictionary is compressed. In the compressed dictionary, the escaped = characters are replaced by simple =. As such, if a dictionary containing the following lines is compressed:
  E\=mc2,.FORMULA
  acorn=shell,.N:s
and if the dictionary
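The expansion of compound entries at compression time can be illustrated with a short Python sketch. It is not the Compress program: it assumes that the special character is `=` and that the escape character is a backslash (both symbols are lost in the extracted text), and for simplicity it scans the whole line rather than isolating the inflected and canonical forms. The function name `expand_entry` is hypothetical.

```python
import re

# An "=" that is not immediately preceded by a backslash (assumed escape character).
UNESCAPED_EQUALS = re.compile(r"(?<!\\)=")


def expand_entry(line):
    """Expand a dictionary line containing unescaped "=" into two variants.

    One variant uses a space and the other a dash in place of each unescaped
    "=". Escaped "\\=" sequences are then turned into a plain "=".
    """
    if UNESCAPED_EQUALS.search(line):
        variants = [
            UNESCAPED_EQUALS.sub(" ", line),
            UNESCAPED_EQUALS.sub("-", line),
        ]
    else:
        variants = [line]
    return [variant.replace("\\=", "=") for variant in variants]


print(expand_entry("acorn=shells,acorn=shell.N:p"))
print(expand_entry("E\\=mc2,.FORMULA"))
```

Replacing the unescaped characters first and unescaping afterwards keeps the two steps from interfering with each other.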
299. ww igm univ mlv fr unitex into a directory Unitex, which should preferably be created within the Program Files folder. After decompressing the file, the Unitex directory contains several subdirectories, one of which is called App. This directory contains a file called Unitex.jar. This file is the Java executable that launches the graphical interface. You can double-click on this icon to start the program. To make launching Unitex easier, you may want to add a shortcut to this file on the desktop. 1.4 Installation on Linux and Mac OS X. To install Unitex on Linux, it is recommended to have system administrator permissions. Decompress the file Unitex2.0.zip in a directory named Unitex by using the following command:
  unzip Unitex2.0.zip -d Unitex
Within the directory Unitex/Src/C++/build, start the compilation of Unitex with the command:
  make install
or, if you have a 64-bit computer, with:
  make install 64BITS=yes
You can then create an alias in the following way:
  alias unitex='cd Unitex/App; java -jar Unitex.jar'
1.5 First use. If you are working on Windows, the program will ask you to choose a personal working directory, which you can change later in Info > Preferences > Directories. To create a directory, click on the icon showing a file (see Figure 1.3). If you are using Linux or MacOS, the program will automatically create a unitex directory in your HOME dire
300. y similar to the one used in the DELAF. The only difference is that there is only a canonical form, followed by grammatical and/or semantic codes. The canonical form is separated from the different codes by a comma. Here is an example:
  horse,N4+Anl
The first grammatical or semantic code will be interpreted by the inflection program as the name of the grammar used to inflect the entry. The entry in the example above indicates that the word horse has to be inflected using the grammar named N4. It is possible to add inflectional codes to the entries, but the nature of the inflection operation limits the usefulness of this possibility. For more details, see section 3.4 below. 3.1.3 Dictionary Contents. The dictionaries provided with Unitex contain descriptions of simple and compound words. These descriptions indicate the grammatical category of each entry, optionally their inflectional codes, and various semantic information. The following tables give an overview of some of the different codes used in the Unitex dictionaries. These codes are the same for almost all languages, though some of them are specific to certain languages (e.g. the code for neuter nouns).
  Code   Description               Examples
  A      adjective                 fabulous, broken down
  ADV    adverb                    actually, years ago
  CONJC  coordinating conjunction  but
  CONJS  subordinating c
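The DELAS entry structure described at the start of this excerpt (a canonical form, a comma, then codes of which the first names the inflection grammar) can be sketched in Python. This is only an illustration: it assumes that codes are separated by `+`, it does not handle escaped characters or inflectional codes, and the function name `parse_delas_entry` is hypothetical.

```python
def parse_delas_entry(entry):
    """Split a DELAS-style entry into its lemma and its codes.

    Assumed format: lemma,CODE1+CODE2+...
    The first code is read as the name of the inflection grammar.
    """
    lemma, codes = entry.split(",", 1)
    code_list = codes.split("+")
    return {
        "lemma": lemma,
        "inflection_grammar": code_list[0],
        "other_codes": code_list[1:],
    }


print(parse_delas_entry("horse,N4+Anl"))
```

Treating the first code separately reflects the rule stated in the text: it selects the grammar, while the remaining codes are plain grammatical or semantic labels.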
301. you can select the output file. Then click on Extract matching units or Extract unmatching units, depending on whether you are interested in sentences with or without matching units. In the Show matching sequences in context box, you can select the length, in characters, of the left and right contexts of the occurrences that will be presented in the concordance. If an occurrence has fewer characters than its right context, the line will be completed with the necessary number of characters. If an occurrence has a length greater than that of the right context, it will be displayed completely. NOTE: in Thai, the size of the contexts is measured in displayable characters and not in real characters. This makes it possible to keep the line alignment in the concordance despite the presence of diacritics that combine with other letters instead of being displayed as normal characters. You can choose the sort order in the list Sort According to. The mode Text Order displays the occurrences in the order of their appearance in the text. The other six modes allow you to sort in columns. The three zones of a line are the left context, the occurrence, and the right context. The occurrences and the right contexts are sorted from left to right. The left contexts are sorted from right to left. The default mode is Center, Left Col. The concordance is generated in the form of an HTML file. If a concordance re
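The column sort described above can be sketched in Python for one of the modes, sorting first on the occurrence and then on the left context. This is an illustration of the ordering rule only, not the program that builds concordances: the row layout `(left, occurrence, right)` and the function name `sort_center_left` are assumptions, and plain code point comparison stands in for whatever alphabet-aware comparison the real tool may use.

```python
def sort_center_left(rows):
    """Sort concordance rows by occurrence, then by left context.

    Each row is a (left_context, occurrence, right_context) tuple.
    The occurrence is compared from left to right; the left context is
    compared from right to left, i.e. starting with the characters closest
    to the occurrence, which is done here by reversing the string.
    """
    return sorted(rows, key=lambda row: (row[1], row[0][::-1]))


rows = [
    ("the big", "cat", "sat"),
    ("a small", "cat", "ran"),
    ("one", "ant", "fell"),
]
for left, occurrence, right in sort_center_left(rows):
    print(f"{left:>10} | {occurrence} | {right}")
```

Reversing the left context in the sort key groups together the lines whose occurrence is preceded by the same words, which is the practical benefit of sorting that column from right to left.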
302. you to choose the zoom scale that is applied to display the graph. The Fit in screen option stretches or shrinks the graph in order to fit it into the screen. The Fit in window option adjusts the graph so that it is displayed entirely in the window. Figure 5.20: Zoom sub-menu. 5.3.3 Antialiasing. Antialiasing is a shading effect that avoids pixelization. You can activate this effect by clicking on Antialiasing in the Format sub-menu. Figure 5.21 shows one graph displayed normally (the graph on top) and with antialiasing (the graph at the bottom). This effect slows Unitex down. We recommend not using it if your machine is not powerful enough. Figure 5.21: Antialiasing example. 5.3.4 Box alignment. In order to get nice-looking graphs, it is useful to align the boxes, both horizontally and vertically. To do this, select the boxes to align and click on Alignment in the Format sub-menu of the FSGraph menu, or press <Ctrl+M>. You will then see the window in Figure 5.22. The possibilities for horizontal alignment are: Top: boxes are aligned with the top-most box. Center: boxes are centered on
