AnnoLab - User Manual
Arguments

source
    A data store location.
destination
    A location on the file system. This can be a file name or the path to an existing directory.

Options

--suffix suffix
    By default, exported files are created with the suffix .xml. However, if you transform the integrated representation using the --xslt parameter, you may want to specify another suffix.
--xslt file
    An XSLT file used to transform the integrated representation before writing it to a file.

    Tip: It is a good idea to write the integrated representation to files without using the --xslt parameter and then transform them afterwards using a fast XSLT processor such as xsltproc. This approach saves time especially when you want to export to multiple target formats.

Switches

--drop-empty
    When enabled, all segments containing only whitespace are excluded from the output. This results in a smaller integrated representation, but the original signal may not be extractable from it. Default: disabled.
--split
    When enabled, one output file is generated for each signal. This is turned on by default if the destination is a directory.
--trim
    When enabled, all trailing and leading whitespace in segments is trimmed, i.e. the segment boundaries are changed to the first and last non-whitespace characters. This switch does not affect the signal; the original signal can still be extracted from the integrated representation. Default: disabled.

Examples

In the following example we create a directory called integrated and then export all da…
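The boundary adjustment performed by the --trim switch can be sketched as follows. This is a hypothetical Python illustration, not AnnoLab code; the function name is made up.

```python
# Sketch of --trim: segment boundaries move inward to the first and last
# non-whitespace characters, while the underlying signal text stays untouched.

def trim_segment(signal: str, start: int, end: int) -> tuple:
    """Return new (start, end) boundaries excluding leading/trailing whitespace."""
    text = signal[start:end]
    if not text.strip():
        return (start, start)  # whitespace-only segment collapses to empty
    new_start = start + (len(text) - len(text.lstrip()))
    new_end = end - (len(text) - len(text.rstrip()))
    return (new_start, new_end)

signal = "The  quick  fox."
# a segment covering "  quick  " (offsets 3..12)
print(trim_segment(signal, 3, 12))  # -> (5, 10), covering "quick"
```

Note that only the anchors change; extracting the original signal from the integrated representation still works, unlike with --drop-empty.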
Name
    info — information about the installation
Synopsis
    info
Description
    This command shows some pieces of information about the AnnoLab installation.
Examples
    Get some basic information about the AnnoLab installation:
        annolab info

Name
    lemmatize — TreeTagger wrapper
Synopsis
    lemmatize list
    lemmatize process [--split] [--filter layer] [--suffix value] [--xslt file] model source destination
Description
    This command provides a small processing pipeline using a simple tokeniser and TreeTagger for part-of-speech tagging and lemmatisation.

    The sub-command list prints a list of the available tagging models. One of these models has to be specified when using the process sub-command.

    The sub-command process runs the pipeline. It can optionally filter the texts (hide certain parts from the analysis components so as not to confuse them) and transform the results from the integrated representation format using an XSLT style sheet.

    This command is deprecated. It requires that the module org.annolab.module.treetagger.TreeTaggerModule has been configured in the annolab.xconf file. Instead, the pipeline command should be used in conjunction with the treetagger-ae PEAR. This combination provides more flexibility and control and does not require a local installation of TreeTagger or modifications to the annolab.xconf file.
Arguments
    model
        One of the models found by the list command.
    source
        One or more lo…
      <relation casName="parent" mapAs="dominance" inverted="true"/>
      <relation casName="this" mapAs="segments"/>
      <relation casName="posTag" mapAs="feature" select="value"/>
      <relation casName="lemma" mapAs="feature" select="value"/>
    </element>
  </layer>
</mapping>

Rules specification

Each line of the filter specification file corresponds to one rule. The first part of the line, before the colon, is the name of a feature (XML attribute). After the colon follows a regular expression. The rule matches if an annotation element bears the given feature and the feature value matches the regular expression.

The following example of a filter specification defines two rules. The first rule matches all XML elements bearing an attribute class with either the value table or abstract. The second rule matches all XML elements bearing an attribute speaker ending in Smith. Data covered by XML elements matching either of these rules is not included in the output.

    class: table|abstract
    speaker: .*Smith

Examples

To add layers to documents already in a data store, simply do not specify any source. The following command will run the pipeline pos-pipe.xml on each signal in the data store default and add the resulting annotations as layers to those signals:

    annolab pipeline pos-pipe.xml annolab:default

Name
    copy — copy data
Synopsis
    copy [--fanout] source destination
Description
    Th…
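The filter-rule semantics described above can be sketched as follows. This is a hypothetical Python illustration, not AnnoLab's implementation; the function names are made up.

```python
# Sketch of filter rules: each rule is "feature: regex"; an element is
# filtered out when it bears that feature and the value matches the regex.
import re

def parse_rules(spec: str):
    rules = []
    for line in spec.strip().splitlines():
        feature, _, pattern = line.partition(":")
        rules.append((feature.strip(), re.compile(pattern.strip())))
    return rules

def matches(attributes: dict, rules) -> bool:
    """True if any rule applies, i.e. the element's data would be excluded."""
    return any(f in attributes and r.fullmatch(attributes[f])
               for f, r in rules)

rules = parse_rules("""
class: table|abstract
speaker: .*Smith
""")
print(matches({"class": "table"}, rules))        # True  -> excluded
print(matches({"speaker": "John Smith"}, rules)) # True  -> excluded
print(matches({"class": "figure"}, rules))       # False -> kept
```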
notator consists of a tree made up of constituents and words (and a lot of relations between them that we will not talk about here). We want to map this tree to a tree layer, so we start by defining a layer.

Figure 3.3. Step 1: Define the layer

    <mapping>
      <layer type="tree" name="parse" segType="sequential">
      </layer>
    </mapping>

The type attribute set to tree indicates that we want to map to a tree layer. The other possible value here is positional, in order to create a positional layer. The attribute name defines the name the layer should have and can be chosen freely. The attribute segType indicates that the layer will be anchored on a sequential medium, as the parser works on text. Currently the only valid value here is sequential.

Next we define that two CAS types, Token and Constituent, should be included in the layer. To do this we add <element> tags. The casType attribute of the tags bears the qualified name of the CAS types.

Figure 3.4. Step 2: Define the elements

    <mapping>
      <layer type="tree" name="parse" segType="sequential">
        <element casType="de.julielab.jules.types.Constituent">
        </element>
        <element casType="org.annolab.uima.pear.stanford.Token">
        </element>
      </layer>
    </mapping>

Now we define how these CAS types are mapped to annotation elements. That is, we define how to interpret the relations Token and Constituent have…
Name
    tree:following-sibling, tree:preceding-sibling
    http://annolab.org/annolab/tree
Synopsis
    tree:following-sibling($A as node()*, $n as xs:integer) as node()*
    tree:preceding-sibling($A as node()*, $n as xs:integer) as node()*
Description
    These functions get the n-th following or preceding siblings of the nodes in the sequence of nodes $A. If the parameter $n is 0, the input list is returned; if it is 1, all immediately following siblings are returned; if it is 2, all second following siblings are returned, and so on. This function was implemented to be more efficient than using the following-sibling axis.
Arguments
    $A  A sequence of nodes.
    $n  The offset of the siblings to return.
Examples
    Assume that the layer Token contains segment elements with the part of speech encoded in the attribute posTag. The following query extracts all verbs and one token to the left and to the right of them:

    annolab mquery annolab:default
    Enter query. Press <CTRL+D> when done to start execution.
    element results {
      for $v in ds:layer($QUERY_CONTEXT, "Token")//segment[starts-with(@posTag, "V")]
      return element result {
        tree:preceding-sibling($v, 1), $v, tree:following-sibling($v, 1)
      }
    }
    <CTRL+D>

Name
    txt:find — pattern-based search
    http://annolab.org/annolab/textual
Synopsis
    txt:find($elements as element()*, $pattern as xs:string) as element()*
Description
    Search for the given regular expression pattern in the areas addressed by the elements and…
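The offset semantics of tree:following-sibling and tree:preceding-sibling can be sketched on a plain list of siblings. This is a hypothetical Python illustration, not AnnoLab code; the function name is made up.

```python
# Sketch of the sibling-offset semantics: offset 0 returns the node itself,
# 1 the immediate following sibling, 2 the one after that, and so on.
# A negative offset corresponds to tree:preceding-sibling.

def following_sibling(siblings, node, n):
    """Return the sibling n positions after node, or None if out of range."""
    i = siblings.index(node)
    return siblings[i + n] if 0 <= i + n < len(siblings) else None

tokens = ["a", "quick", "brown", "fox"]
print(following_sibling(tokens, "quick", 1))   # 'brown'
print(following_sibling(tokens, "quick", 0))   # 'quick'
print(following_sibling(tokens, "quick", -1))  # preceding sibling: 'a'
```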
location has to point to an eXist-based data store.
Arguments
    location
        An AnnoLab URI pointing to an eXist-based data store.
Login problems
    It may happen that AnnoLab fails to pre-configure the database login dialog. If the login fails, make sure the username is admin, the type is Remote, and the URL is the xmldb: URL containing store. Replace store in the URL with the name of the data store addressed by the location argument given on the command line when starting the client. All other fields should remain empty.

Figure 18. eXist login settings

Examples
    This command starts the eXist client on the data store default:
        annolab exist-client annolab:default

Name
    export — export data
Synopsis
    export [--drop-empty] [--split] [--trim] [--suffix suffix] [--xslt file] source destination
Description
    This command recursively exports all signals and their layers to the given destination. A so-called integrated representation is created for each signal. This is an XML document containing all layers on that signal plus the signal itself. Such a document is suitable for transformation to arbitrary target formats using XSLT style sheets.
Arguments
    source, destination
Options
    --suffix suffix, --xslt file
Switches
ment and Deployment. In: Proceedings of the Fifth Workshop on Treebanks and Linguistic Theories (TLT 2006), pp. 247–258. Edited by Jan Hajič and Joakim Nivre. Institute of Formal and Applied Linguistics, Prague, Czech Republic, December 2006. ISBN 80-239-8009-2.
 subst="x"/>
      <substitution orig="…" subst="x"/>
      <substitution orig="…" subst="x"/>
    </substitutionTable>

Examples
    Print a list of available tagging models:
        annolab lemmatize list
    Create a new directory called results. Then process all files in the directory texts using the model english and write one result file per text. Transform the results using the XSLT file bnc.xslt:
        mkdir results
        annolab lemmatize process --split --xslt bnc.xslt english texts results

Name
    list — list data store contents
Synopsis
    list source
Description
    This command recursively lists the content of a data store. Each signal is listed along with its annotation layers, listed indented below it.
Arguments
    location
        One location to recursively list the content of.
Examples
    List the contents of the data store default:
        annolab list annolab:default

Name
    matchlist — statistical distribution
Synopsis
    matchlist [--lowercase] [-f fields] [-m model] source destination
Description
    Given a collection or a directory, this command generates a table with all children of the given collection or directory listed on the X axis and tags or signal data on the Y axis. For example, one can get a table showing the distribution of lemmas (Y axis) across all texts in a collection (X axis), with one row per lemma and one column per sub-collection. Or you can get a table showing the di…
    copy
    delete
    exist-client
    export
    filter
    help
    ims2uam
    info
    lemmatize
    list
    matchlist
    pear
    query, mquery
    uam2annolab
    webserver
I. XQuery Extensions Reference
    ds:layer
    ds:meta
    ds:signal
    manage:delete
    manage:import
    seq:containing
    tree:following-sibling
    txt:find
Glossary
Bibliography

List of Figures

2.1. Set the ANNOLAB_HOME environment variable
2.2. Setting the PATH varia…
find the annotated text in the changed signal and transfer the annotations accordingly.
Examples
    The following command re-integrates the layer Transitivity from the UAM Corpus Tool project in the directory meaning/C1. The --remap option is used to integrate the annotations, originally extracted from annolab:dfg-archive, into annolab:default/dfg:
        annolab uam2annolab --remap annolab:dfg-archive annolab:default/dfg --layer Transitivity meaning/C1

Name
    webserver — AnnoLab server
Synopsis
    webserver [switches and options] descriptor source destination
Description
    This command starts AnnoLab as a server. Two services exposed by the server are provided directly by the embedded eXist database: the eXist REST service and the XML-RPC service. The REST service runs at http://localhost:8080/stores/store/exist/rest. The XML-RPC service runs at http://localhost:8080/stores/store/exist/xmlrpc. In both URLs, store has to be replaced by the name of the data store that was being addressed by the AnnoLab URL given as a parameter when starting the server.

    Both services offer ways of modifying the content of the eXist database, but neither those methods nor the built-in XQuery management functions of eXist should be used. Instead, the XQuery extensions provided by AnnoLab for data store management should be used to add or remove content. Queries issued through these services have access to the same AnnoLab XQu…
<stage>, <parameter>, and <value>. A pipeline can contain any number of stages, and a stage can contain any number of parameters. A parameter may contain multiple values if the parameter allows that. Use the pear explain command to find out more about the parameters and values supported by a particular PEAR.

Figure 3.1. AnnoLab pipeline descriptor outline

    <pipeline>
      <stage component="pear id">
        <parameter name="param">
          <value>param value</value>
        </parameter>
      </stage>
    </pipeline>

For an example of a pipeline, please refer to the Getting Started chapter.

2. Mapping

The data models used by AnnoLab and UIMA are different. While AnnoLab is largely XML-oriented, UIMA employs the Common Analysis System (CAS) as its internal data structure. The mapping between the UIMA CAS and XML can be configured through an XML file. The basic structure of the file is as follows:

Figure 3.2. Outline of a mapping file

    <mapping>
      <layer>
        <element>
          <relation/>
          <feature/>
        </element>
      </layer>
      <segment/>
    </mapping>

The <layer> sections define an AnnoLab layer with annotations (elements and features) and declare which CAS types are mapped to these and how. The <segment> sections define which CAS types bear stand-off anchors and how they should be mapped to AnnoLab segments.

The output of the StanfordParserAn
included. Positional fields are in the following order: signal, part-of-speech tag, lemma, signal URI, start offset, end offset. The output encoding is UTF-8.

builtin:wconcord
    Format suitable for WConcord. The output format is signal_pos-tag. The encoding is UTF-8.

Module configuration (annolab.xconf)

This command requires a local installation of TreeTagger. After you have installed TreeTagger on your system, AnnoLab has to be instructed where to find it and which tagging models are available. To this end, the following section has to be added to the modules section of the annolab.xconf file. You have to adapt the following example to your installation. Put the absolute path to your TreeTagger executable into the executable section. Maintain a model section for each tagging model you have. For each model, specify a name (name section), the absolute path to the model file (file section), the model encoding (encoding section), and a simple full sentence in the language on which the model was trained (flushSequence section). Note that there has to be a space between each token of that sentence, including before the final full stop. Optionally, a substitution file can be specified (substitutions section).

    <module class="org.annolab.module.treetagger.TreeTaggerModule">
      <executable>/Applications/treetagger/bin/tree-tagger</executable>
      <models>
        <model>
          <name>english</name>
          <file>/Applications/treetagger/
lib/english.par</file>
          <encoding>Latin1</encoding>
          <flushSequence>This is a dummy sentence .</flushSequence>
          <substitutions>/Applications/treetagger/lib/en.xml</substitutions>
        </model>
      </models>
    </module>

Substitution file

A substitution file can be used to substitute characters or sequences of characters that are known to be broken or to be misinterpreted by an analysis component. They can be replaced by what is known to be the correct character or sequence, or by some sensible substitute the analysis component can deal with.

For example, a tagging model may not know about the Unicode quotes “ and ” and consequently tag them wrong. The example below substitutes such quotes with a regular quote, written as the XML entity &quot; here, because in an XML file literal quotes have to be written as XML entities. The example also substitutes some Greek letters with the letter x. The model does not know about Greek letters, but knows that x is usually a mathematical symbol and thus tags it as SYM.

Figure 19. Example substitutions file

    <?xml version="1.0" encoding="UTF-8"?>
    <substitutionTable>
      <substitution orig="“" subst="&quot;"/>
      <substitution orig="”" subst="&quot;"/>
      <!-- x is tagged as SYM -->
      <substitution orig="…" subst="x"/>
      <substitution orig="…" subst="x"/>
      <substitution orig="…"
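How a substitution table is applied to input text before tagging can be sketched as follows. This is a hypothetical Python illustration, not AnnoLab code; the table below is a made-up example of the kind shown in the figure.

```python
# Sketch of a substitution table: each "orig" string is replaced by its
# "subst" string before the text reaches the tagger.

SUBSTITUTIONS = [
    ("\u201c", '"'),  # left double quotation mark  -> plain quote
    ("\u201d", '"'),  # right double quotation mark -> plain quote
    ("\u03b1", "x"),  # Greek alpha -> x (tagged as SYM by the model)
]

def apply_substitutions(text: str) -> str:
    for orig, subst in SUBSTITUTIONS:
        text = text.replace(orig, subst)
    return text

print(apply_substitutions("\u201cE = m\u03b1\u201d"))  # -> "E = mx"
```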
recursively loaded signals from the given AnnoLab URI(s) are analysed, and the analysis results are saved as new layers.

If the cpe command is invoked, any collection reader specified in the CPE descriptor will be ignored and the AnnoLab collection reader will be used instead. An AnnoLab-internal CAS consumer is automatically added to the CPE and used in addition to any CAS consumers already present.

If the destination of this command is a location on the file system, the command runs in export mode. In this mode it accepts all parameters and switches also accepted by the export command. Those are marked with (export mode) below.

Arguments

descriptor
    For each command the descriptor has a slightly different meaning. With ae it has to specify an Analysis Engine (AE) descriptor file; with cpe it has to specify a Collection Processing Engine (CPE) descriptor file; and with pipeline it has to specify an AnnoLab pipeline descriptor file.
source
    Zero or more AnnoLab URIs from which the signals are loaded. If no sources are specified, the destination will be used as source and destination. This makes it easy to add new annotations to existing signals. If the source and destination locations differ, the signals will be copied to the new destination and the generated annotations will be anchored on those new signals.
destination
    An AnnoLab URI to which the signals and analysis results are saved. If no s…

Options

--filter layer
--in layer
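The source/destination defaulting described above can be sketched as follows. This is a hypothetical Python illustration, not AnnoLab code; the function name is made up.

```python
# Sketch of the defaulting rule: with no sources given, the destination
# doubles as the source, so new annotations are added to the existing
# signals in place rather than to copies.

def resolve_locations(sources, destination):
    if not sources:
        return [destination], destination
    return sources, destination

print(resolve_locations([], "annolab:default"))
# -> (['annolab:default'], 'annolab:default')
print(resolve_locations(["annolab:src"], "annolab:dst"))
# -> (['annolab:src'], 'annolab:dst')  # signals are copied, layers anchored on the copies
```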
sentence-ae so a new version can be installed:
        annolab pear uninstall java-sentence-ae
    Install the PEAR java-sentence-ae from the file java-sentence-ae.pear:
        annolab pear install java-sentence-ae.pear
    Get a list of all installed PEARs:
        annolab pear list
    Get more information about the java-sentence-ae PEAR:
        annolab pear explain java-sentence-ae

Name
    query, mquery — query a data store
Synopsis
    query [--unanchor] [--query file] [--repeat X] [-V var=value] [--xslt file] template source destination
    mquery [--unanchor] [--query file] [--repeat X] [-V var=value] [--xslt file] source destination
Description
    These commands allow performing queries from the command line. Queries can be run completely manually using the mquery command, or using query templates with the query command.

    In manual mode, the query is read in from the terminal after the command has been started. Alternatively, you can create a text file containing the query and feed it to the command using input redirection.

    In template mode, a query template descriptor file has to be specified as the template argument.
Arguments
    source
        A data store location. The addressed data store has to support querying. Currently only the RepoDatastore and the ExistDatastore support querying.
    destination
        The output fi…
Options
    --query file, --repeat X, -V var=value, --xslt file
Switches
    --unanchor
sentence boundaries
• Annotate token boundaries
• Create a syntactic parse annotation
• Annotate the Theme of the sentences

3. Create a Pipeline

Create a new directory called pipelines in your AnnoLab home, and within it create a new text file called theme-pipe.xml. Figure 2.7 shows this file. The example uses the editor PSPad, but you can also use the Windows Notepad application or any other text editor. The pipeline definition contains five stages.

Figure 2.7. Example pipeline definition

    <stage component="lang-setter-ae">
      <parameter name="language">
        <value>en</value>
      </parameter>
    </stage>
    …
    <stage component="stanford-parser-ae">
      …
    <stage component="tree-rule-processor-uima-ae">
      …
    </pipeline>

1. The first stage configures the language for the documents. This is a very simple component allowing to manually define as which language input documents should be treated. A more sophisticated component might try to detect the language of a document automatically. The following stages use this information to determine which model they should use for boundary detection and parsing. In this example the language is set
specified in the template descriptor and saving the results to results.html:
        annolab query --xslt html word-template annolab:default results.html

Name
    uam2annolab — partial annotation re-integration
Synopsis
    uam2annolab [--remap from to] [--layer name] [--as layer name] source
Description
    This is the sister command to ims2uam. It re-integrates the partial annotations made in an UAM Corpus Tool project generated with ims2uam back into the corpus.
Arguments
    source
        The directory containing the UAM Corpus Tool project.
Options
    --layer name
        The layer from the UAM Corpus Tool project that should be re-integrated.
    --as layer name
        The name of the layer into which the data should be integrated. By default this is the same as the one that has been specified for --layer.
    --remap from to
        Defines a re-mapping of AnnoLab URIs during the integration. This can be used if the project was generated from a different data store than the one it will be re-integrated into.
Switches
    --offsets
        Disable if no offsets are present in the IMS CWB results. Without offsets it is likely that re-integration will not be possible. Default: on.

Tolerance to changes

The data store into which the annotations should be integrated already has to contain the signals. For best results, the signals should not have changed between the time the IMS CWB database has been created and the time the command is used. If the data has changed, AnnoLab tries its best to
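The URI re-mapping performed by the --remap from to option can be sketched as a prefix rewrite. This is a hypothetical Python illustration, not AnnoLab code; the URIs and the function name are made up.

```python
# Sketch of --remap: every AnnoLab URI starting with the "from" prefix is
# rewritten to start with the "to" prefix, so annotations extracted from
# one data store can be re-integrated into another.

def remap_uri(uri: str, from_prefix: str, to_prefix: str) -> str:
    if uri.startswith(from_prefix):
        return to_prefix + uri[len(from_prefix):]
    return uri  # URIs outside the old data store are left untouched

print(remap_uri("annolab:archive/texts/ex1", "annolab:archive", "annolab:default"))
# -> annolab:default/texts/ex1
```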
that anchor an annotation layer to a signal; they are wrapped in a gam:layer tag that carries attributes such as gam:id and name that are necessary to address and handle a layer within the framework.

All GAM tags traditionally reside in the XML namespace http://www.linglit.tu-darmstadt.de/PACE/GAM and use the namespace prefix gam. GAM is used in the context of AnnoLab data stores and when exporting data from AnnoLab. Depending on the context, different elements are defined.

1 http://www.wagsoft.com/Coder
2 http://cswww.essex.ac.uk/Research/nle/GuiTAR
3 http://www.tei-c.org

3.1. GAM in data stores

This section explains the GAM format used in data stores.

3.1.1. Segment

A segment identifies an area of a signal using a number of anchors. The GAM tag representing a segment is gam:seg. It carries two mandatory attributes:

gam:sig
    the ID of the data store and of the signal the segment anchors to, separated by a colon
gam:type
    determines the type of segment

Depending on the type of segment, additional attributes or child tags representing anchors are required. The following figure shows the XML serialisation of an abstract untyped segment.

Figure 3.8. XML for an abstract GAM segment

    <gam:seg gam:type="…" gam:sig="default:03faa92e"/>

AnnoLab implements only the segment type seq, which stands for sequential. Sequential segments bear two additional attributes: gam:s, th…
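How a sequential segment addresses a span of a signal can be sketched as follows. This is a hypothetical Python illustration, not AnnoLab code; the attribute names sig, start, and end are illustrative, not GAM's exact serialisation.

```python
# Sketch of a sequential segment: it stores character offsets into the
# signal, and the covered text is recovered by slicing the signal.
from dataclasses import dataclass

@dataclass
class SequentialSegment:
    sig: str    # "store:signal-id", e.g. "default:03faa92e"
    start: int  # first character covered
    end: int    # one past the last character covered

def covered_text(signal: str, seg: SequentialSegment) -> str:
    return signal[seg.start:seg.end]

signal = "The quick brown fox."
seg = SequentialSegment("default:03faa92e", 4, 9)
print(covered_text(signal, seg))  # -> quick
```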
the results on the file system in the directory some-directory:
        annolab copy test.pdf some-directory
    Convert the XHTML file file.html to text and HTML using AnnoLab's XML importer and save the results on the file system in the directory some-directory:
        annolab copy file.html some-directory

Name
    delete — delete from data store
Synopsis
    delete [--layer name] location
Description
    This command recursively deletes resources. By default, all resources are deleted. The --layer option can be used to delete only particular layers.
Arguments
    location
        One or more data store locations to delete. For safety reasons, AnnoLab will not delete locations on the file system.
Options
    --layer name
        A comma-separated list of layers to be deleted. If this option is specified, all specified layers are deleted recursively on any signal within the given location. No signals are deleted if this option is present.
Examples
    Delete all contents within the data store default:
        annolab delete annolab:default
    Delete only the layer Token from all signals in the data store default:
        annolab delete --layer Token annolab:default

Name
    exist-client — eXist client
Synopsis
    exist-client location
Description
    AnnoLab comes with an embedded eXist XML database. This command starts the eXist GUI client that ships with the embedded eXist and configures it to access the data store underlying the location given. The given
to English (en). Check the documentation of the components to see what languages are supported.
2. This stage invokes the sentence boundary detector.
3. This stage invokes the token boundary detector. It depends on the output of the previous stage.
4. This stage invokes the Stanford Parser. It depends on the two previous stages.
5. The final stage invokes the Tree Rule Processor. This component by default uses a rule set that annotates the Theme of a sentence based on the syntactic parse annotation created by the Stanford Parser.

4. Run the pipeline

Now create another directory called output in your AnnoLab home. Finally, we can run the pipeline as shown in Figure 2.8. The pipeline command requires three arguments: the name of the pipeline definition file, a file or directory containing the input texts (here examples/texts/ex1.html), and an output directory name.

Figure 2.8. Running the pipeline

    annolab pipeline --in Layout pipelines/theme-pipe.xml examples/texts/ex1.html output
    INFO: initializing Java Sentence Annotator
    INFO: Loading parser from serialized file file:/C:/Dokumente…
types.Constituent">
          <relation casName="children" mapAs="dominance"/>
        </element>
        <element casType="org.annolab.uima.pear.stanford.Token">
          <relation casName="segments" mapAs="segments"/>
        </element>
      </layer>
    </mapping>

Finally, we need to define the segments. Thus we add a <segment> section under <mapping>. The StanfordParserAnnotator uses the type GAMSequentialSegment to encode segments. This type bears two features, start and end, which are mapped to the start and end anchors of a sequential segment.

Figure 3.7. Step 5: Define how the segments are encoded

    <mapping>
      <layer type="tree" name="parse" segType="sequential">
        <element casType="de.julielab.jules.types.Constituent">
          <relation casName="children" mapAs="dominance"/>
        </element>
        <element casType="org.annolab.uima.pear.stanford.Token">
          <relation casName="segments" mapAs="segments"/>
        </element>
      </layer>
      <segment type="sequential" casType="uima.tcas.Annotation">
        <anchor gamName="start" casName="start"/>
        <anchor gamName="end" casName="end"/>
      </segment>
    </mapping>

For the casName there are two special values, namely "…", which matches the name of any relation or feature that is not matched by any other <relation> or <feature> section, and "this", which matches the current element.

Well, and that's essentially it. We could add furth…
was adopted as a framework for linguistic processing. Linguistic analysis components were turned into UIMA components. The following plug-ins were developed (some are currently not released due to licensing issues):

Import
    Text, XML, XHTML, FLOB/FROWN corpus format, PDF
Storage
    BibTeX-based read-only storage with meta-data support (unreleased); eXist-based storage with query support
Processing (non-UIMA)
    TreeTagger integration (unreleased); partial annotation support with IMS Corpus Workbench and UAM Corpus Tool
Processing (UIMA)
    Tokeniser; sentence splitter; TreeTagger wrapper; Stanford parser wrapper (unreleased); Tree Rule Processor with rules for Theme/Rheme annotation based on the Stanford parser output

In the context of a cooperation with the project C2 "Sustainability of Linguistic Data", plug-ins to run AnnoLab in a server mode and XQuery extensions for managing data store contents were developed [Rehm07a] [Rehm07b].

1.4. AnnoLab and UIMA

In the course of the development of AnnoLab it became increasingly obvious that using XML as the predominant data model required more compromises than the development effort saved by using XML databases or transformation engines could compensate. In particular, the work with the UIMA data model, the Common Analysis System, and with UIMA itself showed that a sound model and framework for the processing of annotated data had emerged here. Still, the AnnoLab framework provides extended functi…
xhtml", "annolab:default/SomeText", "application/xhtml+xml", "Layout")
    <CTRL+D>

Name
    seq:containing, seq:contained-in, seq:same-extent, seq:overlapping, seq:left-overlapping, seq:right-overlapping — containment and overlap
    http://annolab.org/module/exist/nativexq/sequential
Synopsis
    seq:containing($A as element()*, $B as element()*) as element()*
    seq:contained-in($A as element()*, $B as element()*) as element()*
    seq:same-extent($A as element()*, $B as element()*) as element()*
    seq:overlapping($A as element()*, $B as element()*) as element()*
    seq:left-overlapping($A as element()*, $B as element()*) as element()*
    seq:right-overlapping($A as element()*, $B as element()*) as element()*
Description
    These functions can be used to filter a set of elements $A with respect to a set of elements $B and a relation R between the two. These relations can be: containing, contained-in, same-extent, overlapping, left-overlapping, and right-overlapping. The functions all work following the same principle: return each a in $A which is in the given relation R with any b in $B.

    When an element a and/or b of the sequences $A or $B is not a segment, a segment is calculated from the left-most and the right-most positions addressed by any descendant segment of the element and used for comparing the two elements.
Arguments
    $A  A sequence of elements being filtered.
    $B  A sequence of elements against each of the elements in the seque…
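The interval relations behind the seq:* functions can be sketched on plain (start, end) offset pairs. This is a hypothetical Python illustration, not AnnoLab's implementation; it mirrors the documented principle of keeping each a in A that stands in relation R to some b in B.

```python
# Sketch of three of the seq:* relations on (start, end) offset pairs.

def containing(a, b):    # a contains b
    return a[0] <= b[0] and b[1] <= a[1]

def same_extent(a, b):
    return a == b

def overlapping(a, b):   # a and b share at least one position
    return a[0] < b[1] and b[0] < a[1]

def filter_rel(A, B, rel):
    """Return each a in A that is in relation rel with any b in B."""
    return [a for a in A if any(rel(a, b) for b in B)]

A = [(0, 10), (12, 20), (25, 30)]
B = [(2, 5), (18, 27)]
print(filter_rel(A, B, containing))   # [(0, 10)]
print(filter_rel(A, B, overlapping))  # [(0, 10), (12, 20), (25, 30)]
```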
AnnoLab User Manual

Richard Eckart de Castilho
Technische Universität Darmstadt
Department of Linguistic and Literary Studies
English Linguistics

Table of Contents

1. Introduction
    1. History of AnnoLab
        1.1. PACE-Ling era (2004–2005)
        1.2. Diploma Thesis era (2005–2006)
        1.3. EmePro era (2006–2008)
        1.4. AnnoLab and UIMA
2. Getting Started
    1. Installation
    2. Install PEARs
    3. Create a Pipeline
    4. Run the pipeline
3. Analysis pipelines
    1. Pipeline descriptors
    2. Mapping
    3. Generalised Annotation Markup
        3.1. GAM in data stores
        3.2. GAM Integrated representation
I. Command Reference
    add-layer
    copy
[Windows Run dialog: type cmd and press OK to open a command prompt]

At the command line prompt, type annolab help and press the Enter key to run AnnoLab. This should cause AnnoLab to display a list of all available commands.

Figure 2.4. Testing AnnoLab

    annolab help
    To get help for a specific command use: help <command>
    add-layer    Add a layer to a signal
    copy         Copy into/from AnnoLab
    delete       Delete resources
    export       Export module
    filter       Dump filtered signals
    help         Explain a command
    list         List Module
    query        Query module, template
    mquery       Query module, manual
    info         AnnoLab system information
    uam2annolab  Reintegrate partial annotation from UAM
    ims2uam      IMS 2 UAM
    ae           UIMA AE runner
    cpe          UIMA CPE runner
    pipeline     UIMA pipeline
    pear         PEAR management

2. Install PEARs

Now it is time to install the PEARs. These are the components we will later use to create a processing pipeline. Open the Windows Explorer, navigate to the AnnoLab home, and create a new directory called pears there. Now go back to the AnnoLab home page and download the following PEAR packages (using right click → save link as) to this pears directory you have just created:

• Sentence Boundary Annotator
• Token Boundary Annotator
• Manual Language Setter
• Stanford Parser Annotator
• T…
This example gets the contents of a signal at annolab:default/SomeText:

  annolab mquery annolab:default
  Enter query. Press <CTRL+D> when done to start execution.
  ds:signal("annolab:default/SomeText")
  <CTRL+D>

Name
manage:delete — delete from a data store (http://annolab.org/annolab/manage)

Synopsis
manage:delete($uri as xs:string) as xs:string*
manage:delete($uri as xs:string, $name as xs:string) as xs:string*

Description
Delete the signal at the specified location. The second argument can be the name of a layer; in that case the layer is deleted from the signal. The signal is deleted as well if the layer being deleted is the last one on that signal. The return value of the command is a sequence of messages stating which signals and layers have been deleted.

Arguments
$name — A layer name.
$uri — An AnnoLab URI addressing a signal in the data store.

Examples
This example deletes the layer "Layout" from the signal annolab:default/SomeText:

  annolab mquery annolab:default
  Enter query. Press <CTRL+D> when done to start execution.
  manage:delete("annolab:default/SomeText", "Layout")
  <CTRL+D>

Name
manage:import — import into a data store (http://annolab.org/annolab/manage)

Synopsis
manage:import($source as xs:string, $dest as xs:string, $mimetype as xs:string, $name as xs:string) as xs:string*

Description
Import a layer from a URL or from a lo
al with the data store against which the query is run. The query and mquery commands define the variable $QUERY_CONTEXT that can be used here. The value of the variable is the source argument to those commands.

Examples
This example queries for all XHTML headers (h1) in the layer "Layout" within annolab:default:

  annolab mquery annolab:default
  Enter query. Press <CTRL+D> when done to start execution.
  declare namespace xhtml = "http://www.w3.org/1999/xhtml";
  ds:layer($QUERY_CONTEXT, "Layout")//xhtml:h1
  <CTRL+D>

Name
ds:meta — access meta data (http://annolab.org/module/repo/exist/xq/datastore)

Synopsis
ds:meta($uri as xs:string) as element()*

Description
Get the meta data of the resource addressed by the given AnnoLab URI. Meta data can be stored with a layer when it is imported using the manage:import function.

Arguments
$uri — An AnnoLab URI.

Examples
This example gets the meta data of a signal at annolab:default/SomeText:

  annolab mquery annolab:default
  Enter query. Press <CTRL+D> when done to start execution.
  ds:meta("annolab:default/SomeText")
  <CTRL+D>

Name
ds:signal — access signals (http://annolab.org/module/repo/exist/xq/datastore)

Synopsis
ds:signal($uri as xs:string) as xs:string*

Description
Get the contents of the signal addressed by the given AnnoLab URI.

Arguments
$uri — An AnnoLab URI addressing a signal.

Examples
and easily readable section of the signal that includes whitespace and line breaks:

  annolab copy SomeText.pdf annolab:default
  annolab mquery annolab:default
  Enter query. Press <CTRL+D> when done to start execution.
  declare namespace x = "http://www.w3.org/1999/xhtml";
  element results {
    for $pb in ds:layer($QUERY_CONTEXT, "Layout")//x:div[@class = "pagebreak"]
    return element result { txt:get-text(seq:grow($pb)) }
  }
  <CTRL+D>

Glossary

Layer — A layer contains information that is overlaid on a signal. Technically speaking, a layer is an XML file where text nodes have been replaced by stand-off anchors addressing parts of a signal.

Resource — A collective term for objects in a data store: collections, signals and layers. When objects on the file system are read, AnnoLab automatically tries to convert them to resources that can be stored in a data store. The conversion is done through importers.

Signal — A primary data object that is annotated through layers.

Bibliography

[Eckart06a] Richard Eckart. A Framework For Storing, Managing and Querying Multi-Layer Annotated Corpora. July 2006. Technische Universität Darmstadt, Department of Linguistic and Literary Studies, English Linguistics.

[Eckart06b] Richard Eckart. Systemic Functional Corpus Resources: Issues in Development and Deployment. Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions. Association for Computational Lin
List of Figures

Access the Command line
Testing AnnoLab
Installing PEARs
Listing the installed PEARs
Example pipeline definition
Running the pipeline
AnnoLab pipeline descriptor outline
Outline of mapping file
Step 1: Define the layer
Step 2: Define the elements
Step 3: Define the dominance relation of the tree
Step 4: Define where the segments are located
Step 5: Define how the segments are encoded
XML for an abstract GAM segment
Linking between text and annotations
eXist settings
Example substitutions file
Query template descriptor file (query-example.template)
cation within the XML database. If there is already a signal at the destination URI, the layer will be anchored on that signal. Otherwise the text is extracted from the layer and used to create a new signal at the destination URI. The return value of the command is a sequence of messages stating which signals and layers have been imported.

Arguments
$dest — An AnnoLab URI indicating the destination. The destination can be an existing signal to which a new layer should be added. It can also be the name of a non-existing signal, which is then extracted from the layer being imported and stored at the given location.
$mimetype — A MIME type can be stored with each layer. If in doubt, use application/xml.
$name — The name with which the layer is being added to the signal.
$source — A URL or XML-DB URL from which to read the layer. It is possible to access any URL type known to Java, such as http, ftp or file. The source has to be a valid XML file.

Examples
This example shows how to import the file README.xhtml from the current directory to the data store location annolab:default/README. Since no signal existed at this location before, a new signal is created and the XHTML is added as the layer "Layout" to that signal, with the MIME type for XHTML data (application/xhtml+xml):

  annolab mquery annolab:default
  Enter query. Press <CTRL+D> when done to start execution.
  manage:import("file:SomeText
cations from which to read data to be annotated. The sources can be files or directories on the local file system, or data store locations.

destination — A location on the file system. This can be a file name or the path to an existing directory.

Options
filter layer — Specifies a layer to be used for filtering. If a filter layer is specified, per default it will remove any parts of the signal that have been annotated with the feature "class" carrying one of the values keywords, bibliography, figure, footnote, formula, ignore, pagebreak, table or affiliation. This command does not allow changing the filter rules.
suffix suffix — Per default, exported files will be created with the suffix .xml. However, if you transform the integrated representation using the parameter xslt, you may want to specify another suffix.
xslt file — An XSLT file used to transform the integrated representation before writing it to a file.

Switches
split — When enabled, one output file is generated for each signal. This is turned on per default if the destination is a directory.

Built-in XSLT style sheets
The command comes with a few built-in XSLT style sheets:

builtin:bnc — Formats output in BNC SGML format, suitable for example for the WordSmith Tools. The output encoding is UTF-16LE, which is the encoding required by WordSmith.
builtin:imscwb — Formats output in a tab-separated format suitable for importing into the IMS Corpus Workbench. The
e. In the example below, the annotations dominated by a Constituent annotation are determined by finding all Constituent annotations referencing it in their parent attribute.

(tree layer only) If the attribute inverted is set to true, the specified feature does not have to be present in the mapped type. Instead, it has to be present in some other type and reference the mapped type. In the example below, the feature with the name "constituent" is present in the UIMA type Theme, and it references the Constituent type. Whenever a Theme annotation references a Constituent annotation, the value of the primitive feature "label" (as indicated by the select attribute) is mapped to the XML attribute "theme" (as per the xmlName attribute).

<mapping>
  <layer type="tree" name="Parse Theme" segType="sequential">
    <element casType="de.julielab.jules.types.Constituent">
      <feature casName="begin" mapAs="ignore"/>
      <feature casName="end" mapAs="ignore"/>
      <relation casName="parent" mapAs="dominance" inverted="true"/>
      <relation casName="cat" mapAs="feature"/>
      <relation casName="constituent" xmlName="theme" mapAs="feature"
                inverted="true" select="label"/>
    </element>
    <element casType="org.annolab.uima.pear.stanford.Token">
      <feature casName="begin" mapAs="ignore"/>
      <feature casName="end" mapAs="ignore"/>
      <feature casName="componentId" mapAs="ignore"/>
e offset of the first character addressed by the segment.
gam:e — the offset of the character following the last character addressed by the segment.

If both attributes are equal, the segment has a width of zero.

3.2. GAM Integrated representation

This section explains the GAM format used in the integrated representation XML files produced by the export command.

3.2.1. Overall structure

The following figure shows the outline of the overall structure of the integrated representation:

<gam:root>
  <gam:headers>
    <gam:header/>
  </gam:headers>
  <gam:annotations>
    <gam:layer>
      <gam:a/>
    </gam:layer>
  </gam:annotations>
  <gam:layout>
    <gam:root>
      <gam:seg>
        <gam:content/>
        <gam:ref/>
      </gam:seg>
    </gam:root>
  </gam:layout>
</gam:root>

The integrated representation has three principal sections below the root element:

gam:headers — This section may contain one gam:header child for each layer or signal contained in the integrated representation. Its gam:id and gam:name attributes correspond with the respective attributes of a signal or layer. Headers are used to store arbitrary meta data about a resource, e.g. about authors, origin, licenses etc. A gam:header element has a single child, which is the root of the meta-data XML document.

gam:annotations — This section contains one gam:layer child for each included annotation layer. Each of these children must bear gam:id and gam:name attributes. A gam:layer can have a single child ele
ed. If the data has changed, AnnoLab tries its best to fix the offsets so that the generated UAM Corpus Tool project is aligned to the changed signals.

Examples
After you have prepared an IMS CWB database fulfilling the above requirements, you can log into it with cqp and perform queries as usual. Once you wish to export a query result to UAM Corpus Tool, you need to configure IMS CWB to produce the output format needed by ims2uam. Turn on only the display of the three positional parameters mentioned above (uri, s and e) and turn off any other parameters:

  MYCORPUS> show +uri +s +e

Run the query again, storing the results in a variable. Then use the cat command to save the results to a file. Here we did a query extracting all sentences containing the word "algorithm" within all texts in annolab:default/A:

  MYCORPUS> results = "algorithm" :: match.document_uri = "annolab:default/A" within s
  MYCORPUS> cat results > queryresults.txt

Now leave cqp and run ims2uam on the saved results. First create an empty directory to hold the project. For this example the directory is called "target":

  mkdir target
  annolab ims2uam queryresults.txt target

After this you will find a UAM Corpus Tool project named "project" in the target directory.

Tip
The original location information taken from the positional parameters (uri, s and e) is stored in the comment field of the segments in the annolabSync layer. They are encoded as a JSON string
er <relation> and <feature> sections to fine-tune the mapping.

Table 3.1. Examples of <relation> and <feature> sections

<relation casName="parent" mapAs="ignore"/> — Completely ignore the "parent" relation.

<relation casName="segments" mapAs="error"/> — Treat the presence of the "segments" relation as an error. For example, the Constituent type should not exhibit any segments; only the Token type should.

<feature casName="value" mapAs="ignore"/> — Completely ignore the feature "value".

<relation casName="this" mapAs="segments"/> — Indicates that the element has a double function: as an element bearing features and relations, as well as stand-off anchors.

<feature casName="…" mapAs="error"/> — If the element bears any features that are not explicitly defined, trigger an error.

The following values are valid for mapAs:

Table 3.2. Valid mapAs values

segments (relation: yes / feature: no) — This contains the stand-off anchors.
ignore (relation: yes / feature: yes) — Don't do anything with this.
error (relation: yes / feature: yes) — Trigger an error if this is present.
reference (relation: yes / feature: no) — Map this as a reference.
feature (relation: yes / feature: yes) — Map this as a feature.
default (relation: yes / feature: yes) — Automatically determine whether something should be mapped as reference or feature.
dominance (relation: yes / feature: no) — In a tree layer this indicates the relat
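As an illustration of the table above, an element that carries its own stand-off anchors and rejects undeclared features could be declared as follows. This is a constructed sketch, not an example from the manual; it only combines the mapAs values defined above with the Token type used elsewhere in this chapter:

```xml
<element casType="org.annolab.uima.pear.stanford.Token">
  <!-- the element itself provides the stand-off anchors -->
  <relation casName="this" mapAs="segments"/>
  <!-- any feature not listed explicitly is treated as an error -->
  <feature casName="…" mapAs="error"/>
</element>
```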
ery extensions that are available for the mquery and query commands. For more information on how to use these services, please refer to the eXist documentation at http://exist.sourceforge.net.

Per default the server can only be accessed locally. To access the server from remote machines, or to change the port the server is running on, please refer to the file configuration/config.ini in the AnnoLab installation directory. There you can modify the two configuration items org.eclipse.equinox.http.jetty.http.host and org.osgi.service.http.port.

Press CTRL+C to terminate the server.

Arguments
source — An AnnoLab URI pointing to a data store which will be exposed by the server. Only one data store can be exposed at a time. The addressed data store has to be eXist-based.

Examples
To start a server for the data store "default", use:

  $ annolab webserver annolab:default

XQuery Extensions Reference

Name
ds:layer — access annotation layers (http://annolab.org/module/repo/exist/xq/datastore)

Synopsis
ds:layer($name as xs:string) as xs:string*
ds:layer($uri as xs:string, $name as xs:string) as element()*

Description
Find all layers of the given name. Layers can be searched for in the whole data store, or only within a particular collection and its sub-collections.

Arguments
$name — The layer name.
$uri — An AnnoLab URI addressing a collection in the data store. The addressed data store has to be identic
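The two configuration items mentioned above are plain key-value entries in configuration/config.ini. A fragment opening the server to remote machines on a different port might look like this; the host and port values shown here are illustrative assumptions, not defaults taken from the manual:

```
# configuration/config.ini (fragment, values are examples)
org.eclipse.equinox.http.jetty.http.host=0.0.0.0
org.osgi.service.http.port=8080
```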
ery template (Eckart06b, Teich06).

1.3. LingPro era (2006-2008)

Following the diploma thesis, I joined the project "Linguistic profiles of interdisciplinary registers" (LingPro). The requirements in this project were similar to the requirements of the PACE Ling project, but more ambitious with respect to quality, detail and volume of the analysed data. While it had already been shown that it is in principle possible to extend XML to support linguistic annotations while remaining substantially compatible with existing XML tools (Eckart07), there were issues regarding performance and handling. In addition, the LingPro project required the integration of additional tools for automatic annotation and support for processing PDF files. Since a large corpus of texts had to be repeatedly processed, the focus shifted from interactive operation to unsupervised batch processing.

During this time, AnnoLab underwent a major refactoring. The web front end was dropped due to time constraints and the shift of focus. It was replaced by a command line interface. Much of the Avalon-related code that facilitated integration with Cocoon was also dropped to simplify the code base. The OSGi framework

  http://www.linglit.tu-darmstadt.de/index.php?id=pace_ling
  http://www.linglit.tu-darmstadt.de/index.php?id=lingpro_projekt

was adopted as a component framework. Data storage mechanisms and file format support were refactored into plug-ins. Apache UIMA
f the CAS type system contains the required types.

— Specifies the rules to be used for filtering. The parameter accepts the name of a text file which specifies the attribute/value combinations that cause parts of the signal to be filtered out. See the examples below for more information.
— Per default, exported files will be created with the suffix .xml. However, if you transform the integrated representation using the parameter xslt, you may want to specify another suffix. (export mode only)
— An XSLT file used to transform the integrated representation before writing it to a file. (export mode only)
— When enabled, all segments containing only white space are excluded from the output. This results in a smaller integrated representation, but the original signal may not be extractable from it. Default: disabled. (export mode only)
— When enabled, one output file is generated for each signal. This is turned on per default if the destination is a directory. (export mode only)
— Enable/disable display of a performance report for each processed resource.
— When enabled, all trailing and leading whitespace in segments is trimmed, i.e. the segment boundaries are changed to the first and last non-whitespace characters. This switch does not affect the signal; the original signal can still be extracted from the integrated representation. Default: disabled. (export mode only)

The mapping controls how annotations are translated from XML to the UIMA CAS mode
g in csv.

Options
f field — A comma-separated list of field names to be used as row labels. Note that field names are case-sensitive. These are the available fields:

Table 3. Available fields
posTag — Part of Speech
lemma — Lemma
OFS_START — Start offset of the signal in characters
OFS_END — End offset of the signal in characters
SIGNAL — Signal data
SIGNAL_ID — ID of the signal. This is unique per data store. The ID is not available when the source location is on the file system.
SIGNAL_NAME — Name of the signal
SIGNAL_URI — AnnoLab URI of the signal

m model — One of the models found by the "lemmatize list" command.

Switches
lowercase — Enable to force all data on the Y axes to be lowercase. That means that, e.g., the numbers for "Be" and "be" are conflated into a single row.

Examples
The following example accesses the data store "default". TreeTagger is invoked with the model "english" to generate the part-of-speech tags. Data on the horizontal axis consists of the signal and part-of-speech tags and is lowercase. The output is written to the file matchlist.csv:

  annolab matchlist lowercase f SIGNAL,posTag m english annolab:default matchlist.csv

Name
pear — PEAR management

Synopsis
pear install file...
pear uninstall name
pear explain name
pear list

Description
This co
guistics, 183-190. Sydney, Australia, July 2006. http://www.aclweb.org/anthology/P/P06/P06-2024

[Eckart07] Richard Eckart and Elke Teich. An XML-based data model for flexible representation and query of linguistically interpreted corpora. Data Structures for Linguistic Resources and Applications: Proceedings of the Biennial GLDV Conference 2007, 327-336. Georg Rehm, Andreas Witt, Lothar Lemnitzer (eds.). Gunter Narr Verlag Tübingen, Tübingen, Germany, 2007.

[Rehm07a] Georg Rehm, Richard Eckart, Christian Chiarcos and Johannes Dellert. Ontology-Based XQuerying of XML-Encoded Language Resources on Multiple Annotation Layers. Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), 510-514. European Language Resources Association (ELRA). Marrakech, Morocco, May 2008.

[Rehm07b] Georg Rehm, Richard Eckart and Christian Chiarcos. An OWL- and XQuery-Based Mechanism for the Retrieval of Linguistic Patterns from XML Corpora. Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2007), 510-514. Borovets, Bulgaria, 2007.

[Teich05] Elke Teich, Peter Fankhauser, Richard Eckart, Sabine Bartsch and Mónica Holtz. Representing SFL-annotated corpus resources. Proceedings of the 1st Computational Systemic Functional Workshop. Sydney, Australia, 2005.

[Teich06] Elke Teich, Richard Eckart and Mónica Holtz. Systemic Functional Corpus Resources: Issues in Develop
[Screenshot: the Windows environment-variables dialog, listing ANNOLAB_HOME = C:\Programme\annolab-cli along with APR_ICONV_PATH and JAVA_HOME; the edit dialog ("Benutzervariable bearbeiten" — "Edit user variable") shows the variable name ANNOLAB_HOME and the value C:\Programme\annolab-cli. Enter the path; click OK to close the environment-variables dialog.]

Next we need to add the AnnoLab installation directory to the Path environment variable.

Getting started

Figure 2.2. Setting the PATH variable

[Screenshot: the edit dialog for the user variable "Path" with the value %ANNOLAB_HOME%;%Path%.]

If the variable does not yet exist, you need to create it with the value %ANNOLAB_HOME%;%Path%. Otherwise, add %ANNOLAB_HOME% to the beginning of the semicolon-separated list of path entries.

Now try running AnnoLab. Go to the Start menu and select "Run...". Enter the command cmd in the dialog that opens and click OK.

Figure 2.3. Access the command line

[Screenshot: the Start menu with "Run..." ("Ausführen") selected.]
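For the current command-prompt session, the same effect can be achieved without the dialogs. This is a sketch assuming the installation path shown in the figures above; the changes made with "set" are not persistent across sessions:

```bat
rem set the AnnoLab home (path taken from the screenshots above)
set ANNOLAB_HOME=C:\Programme\annolab-cli

rem prepend it to the search path for this session only
set Path=%ANNOLAB_HOME%;%Path%

rem verify that the annolab command is now found
annolab help
```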
ile contains required environment variables for this component.
InstallationController: component "java-sentence-ae" installation completed.
INFO UIMAModule - Successfully installed PEAR C:\Programme\annolab-cli\pears\java-sentence-ae.pear
org.annolab.uima.pear.java.sentence.SentenceAnnotator - initialize
INFO initializing Java Sentence Annotator
INFO UIMAModule - Successfully verified PEAR C:\Programme\annolab-cli\pears\java-sentence-ae.pear
Process complete.

Figure 2.5 shows the command to install the PEARs and its output.

1. The command "annolab pear install pears" installs all PEARs in the directory "pears".
2. This message indicates that the PEAR was successfully extracted and registered.
3. This message indicates that the PEAR could be successfully initialised. If this message is not present, the PEAR will most probably not work.

You can use the "pear list" command to get a list of the installed PEARs (see Figure 2.6).

Getting started

Figure 2.6. Listing the installed PEARs

> annolab pear list
Installed PEARs:
  lang-setter-ae 6.1
  java-token-ae 1.8
  tree-rule-processor-uima-ae 1.8
  java-sentence-ae 1.8
  stanford-parser-ae

3. Create a pipeline

Now we will create a simple pipeline. The pipeline will do the following things:

- Annotate
ion that makes up the tree.

3. Generalised Annotation Markup

Generalised Annotation Markup (GAM) is a set of XML tags and attributes that can be used to extend XML formats so they can be used in a multi-layer environment such as AnnoLab. The format has been inspired by existing XML annotation formats such as the CD3 format used by Systemic Coder, the MAS-XML format used by GuiTAR, the TEI XML format, as well as HTML. All use different tag sets and encode different semantics by the XML document structure. However, they also have some similarities:

- they are used to annotate text
- the complete text exists in the XML document
- the text is not contained in attributes, but in text nodes
- iterating over all text nodes from the beginning to the end of the XML document yields the full text in the correct order

Any XML format conforming to these four points may be called document-centric XML, as the text being marked up by the XML tags provides the dominant structure. Document-centric XML formats can easily be converted for use within AnnoLab by adding stand-off anchors, allowing the XML annotations and the underlying text to exist independently of each other. The idea is to leave the original XML format as untouched as possible and to keep the conversion process from or to AnnoLab as simple as possible. During the conversion process, two changes are applied:

- the text nodes are replaced by gam:seg tags representing segments
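As an illustration of the first change, consider a minimal document-centric fragment. This is a constructed sketch, not an example from the manual; the signal URI and the offsets are hypothetical:

```xml
<!-- before conversion: document-centric XML, the text lives in a text node -->
<p>Hello world</p>

<!-- after conversion: the text node is replaced by a stand-off segment;
     gam:s is the offset of the first character, gam:e the offset of the
     character following the last one (here: characters 0-10 of the signal) -->
<p>
  <gam:seg gam:type="seq" gam:sig="annolab:default/SomeText"
           gam:s="0" gam:e="11"/>
</p>
```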
irectory.

Assuming the above is the content of a file named filterRules.properties, the following command uses this file to control the filter:

  annolab filter rules filterRules.properties filter Layout annolab:default filteredOutputDirectory

Name
help — self-documentation

Synopsis
help [command]

Description
Invoking this command without any arguments prints a list of the available commands. Optionally, a command can be given as the only argument. In this case, detailed help for the command will be printed.

Arguments
command — A command for which detailed information should be printed.

Examples
Get a list of all available commands:

  annolab help

Get help for the copy command:

  annolab help copy

Name
ims2uam — selective annotation

Synopsis
ims2uam [offsets] [remap from to] source destination

Description
If a corpus is large, the resources to annotate it completely and exhaustively may not be available. A query can be used to extract particularly interesting sections of the corpus for further annotation.

Arguments
source — The file containing the results exported from the IMS CWB.
destination — The directory in which the new UAM Corpus Tool project should be created. The directory has to exist and should be empty.

Options
remap from to — Defines a re-mapping of AnnoLab URIs during the integration. This can be used if the project was generated from a different data sto
is command copies data from the file system into a data store, from a data store into another data store, or from a data store to the file system. Importers are used to convert data originating from the file system. AnnoLab ships with importers for plain text, HTML, XML, PDF and FLOB/FROWN. The info command shows all installed importers.

If the destination is on the file system, AnnoLab will dump signals as raw data (e.g. as text files) and layers as XML files.

If the source is a file system directory or a collection within a data store, it is copied recursively.

Arguments
source — One or more locations from which to copy data to the destination. The sources can be files or directories on the local file system, or data store locations.
destination — One location to which the data is copied. The location can be a directory on the local file system or a collection in a data store. If the destination is a directory or collection, it should always end in a "/".

Switches
fanout — Enable to create a sub-directory for each signal. This directory will contain the signal data and all layers. Default behaviour is to dump all signals and layers into one directory. This should only be used when copying to the file system. Default: off.

Examples
Import the PDF file test.pdf into the store "default", to the collection "test":

  annolab copy test.pdf annolab:default/test/

Convert the PDF file test.pdf to text and HTML using AnnoLab's PDF importer and save
l and vice versa. While the CAS is based on an object-graph formalism to encode annotations, AnnoLab uses either a list or a tree. Thus the CAS has to be decomposed into a set of list and tree layers for AnnoLab to make use of it. The decomposition is controlled by the mapping file.

The mapping file is an XML file and its root is the <mapping> element. Children of it are any number of <layer> and <segment> sections. The <layer> sections specify how types from the CAS are converted to and from XML. The <segment> sections specify how stand-off anchors are extracted from the CAS.

A layer has a name, a type and a segment type (segType). The name specifies the name of the layer extracted from the CAS and can be anything. The type has to be either tree or positional. A positional layer can hold a list of annotations on non-overlapping segments; it can be used e.g. for simple part-of-speech annotations. A tree layer is more flexible: annotated segments may overlap and annotation elements may form a hierarchy. If in doubt, use the type tree. AnnoLab currently only supports the segment type sequential: segments with integer start and end offsets.

A <layer> contains any number of <element> sections. Each of these defines how one UIMA type, determined by the casType attribute, is mapped to an XML element. The last component of the UIMA type name is used as the XML element name; for de.julielab.jules.types.Con
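Putting these pieces together, a minimal mapping for a simple positional part-of-speech layer might look like the following sketch. The layer name, the UIMA type and its features are hypothetical; the structure follows the <mapping>/<layer>/<element> nesting described above:

```xml
<mapping>
  <!-- a positional layer: a flat list of annotations on non-overlapping segments -->
  <layer type="positional" name="POS" segType="sequential">
    <!-- the last component of casType, "Token", becomes the XML element name -->
    <element casType="org.example.types.Token">
      <!-- as in the tree-layer example elsewhere in this chapter,
           the begin/end offsets are not emitted as XML attributes -->
      <feature casName="begin" mapAs="ignore"/>
      <feature casName="end" mapAs="ignore"/>
      <!-- the part-of-speech tag is mapped as an ordinary feature -->
      <feature casName="posTag" mapAs="feature"/>
    </element>
  </layer>
</mapping>
```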
le. If omitted, the output goes to the terminal.
— File from which the query is read in manual mode.
— Repeat the query X times. With this option no output is generated. Instead, run-time statistics are printed at the end. The serialisation pipeline is completely disabled, so that only the performance of the query itself, without any XSLT transformation or serialisation overhead, can be measured.
— This parameter can be used to set the value of a free query variable. E.g. "Vword=algorithm" sets the variable "word" in the query to the value "algorithm". The parameter can be specified multiple times to set multiple variables. If a template requires a variable not set in this way, it will prompt the user for a value.
— In manual mode, this is the path to an XSLT file used to transform the query output. In template mode, it is the name of one of the output formats available in the template. Specifying "none" turns off XSLT transformation.
— Enable/disable replacement of all segments in the query results with the content they are referring to. This allows performing queries returning segments instead of using e.g. txt:get-text() in the query. Unfortunately, it means that the complete query output needs to be buffered in memory; thus, for very large results this may not work.

Query templates
A query template consists of a template descriptor file (actually a Java property file), a query file, and optionally XSLT files that can be used to t
litter.

Tip
Importing annotations into and exporting annotations from the pipeline is imperfect. This is mainly relevant for the Layout layer. While HTML <h1> and <p> tags will be imported as header and paragraph boundaries, nothing else will be imported. Thus, when the Layout layer is exported afterwards, it will only contain these two tags. Usually you will want to use the out parameter to control which layers should be exported from the pipeline, e.g. out Token Sentence Parse.

Chapter 3. Analysis pipelines

For doing corpus analysis, AnnoLab integrates the Apache UIMA framework. This framework allows composing so-called Analysis Engines (AE) into a pipeline. Each AE performs a particular analysis task, e.g. tokenisation, sentence boundary detection, part-of-speech analysis, deep parsing, etc. AnnoLab offers a simplified XML syntax for configuring analysis pipelines, but it can also use native UIMA aggregate AE or Collection Processing Engine (CPE) descriptors. AEs that can be used by AnnoLab must have been packaged as PEARs (Processing Engine ARchive). This is a UIMA standard for packaging and automatically deploying components. These PEARs have to be installed into AnnoLab using the "pear install" command.

1. Pipeline descriptors

In addition to the standard UIMA Analysis Engine descriptors and CPE descriptors, AnnoLab supports a simplified pipeline descriptor format. The descriptor uses four sections: <pipeline
low processing using an XSLT file. At the time, a naive Java implementation was created to merge multiple XML annotation files into one GAM file. Research-relevant data was then extracted from GAM files using XSLT style sheets.

1.2. Diploma Thesis era (2005-2006)

In 2005, AnnoLab became the topic of my diploma thesis (Eckart06a). The research question underlying AnnoLab development at the time was: Can the XML data model be extended to support linguistic annotations while remaining substantially compatible with existing XML tools? The XML tools in question were XML parsers, XSLT engines and XQuery engines.

At this time, AnnoLab evolved into a web application implemented in Java, based on Apache Cocoon for the web front end and using Apache Avalon in the back end. The front end allowed uploading texts and annotations in XML format, as well as performing queries using XQuery with several GAM-specific extensions. GAM was extended to support two modes: the integrated mode, known already from the PACE Ling era, as well as a non-integrated mode. The latter mode facilitated uploading and managing XML annotation files and annotated text as separate objects. Relations between these objects were derived dynamically at the time of querying.

The web front end allowed users to easily interact with AnnoLab: to browse text and annotations, to display annotations side by side for comparison, and to issue queries using predefined templates, e.g. a Keyword-in-Context qu
</gam:seg>
<gam:seg gam:type="seq" gam:sig="default:5" gam:s="9" gam:e="10">
  <gam:ref gam:aid="…"/>
  <gam:ref gam:aid="665"/>
</gam:seg>
<gam:seg gam:type="seq" gam:sig="default:5" gam:s="10" gam:e="13">
  <gam:ref gam:aid="0"/>
  <gam:ref gam:aid="362"/>
</gam:seg>

Command Reference

Name
add-layer - add a layer to an existing signal

Synopsis
add-layer source destination

Description
This command can be used to add an annotation layer to an existing signal. The annotation has to be available as an XML file containing the same text (whitespace may vary) as the target signal. The layer will be added with the name layout. An existing layer with the same name will be overwritten if present.

Arguments
source
    An XML annotation file that can be anchored on the specified signal.
destination
    The signal to anchor the annotation on.

Examples
Add an HTML annotation from the file manual.html to the signal at annolab default manual:

annolab add-layer manual.html annolab default manual

Name
ae, cpe, pipeline - analyse data

Synopsis
ae [switches and options] descriptor source destination
cpe [switches and options] descriptor source destination
pipeline [switches and options] descriptor source destination

Description
These commands offer different ways of employing UIMA Analysis Engines to create annotation layers. Signals are
ment, which is the root of the XML annotation. All text nodes originally contained in the annotation are replaced with gam:a elements, each bearing a gam:id attribute. The original text is moved to the gam:layout section, which references the annotation layers via gam:ref elements. This section contains the common segmentation induced by all annotation layers. Each segment is represented by a gam:seg element, which contains the annotated text in its gam:content child. gam:ref elements link to the gam:a elements in the annotation layers. Each segment contains one reference to each annotation layer annotating the portion of text represented by the segment.

Figure 3.9. Linking between text and annotations

<gam:layer gam:id="9" gam:name="Layout" gam:type="application/xhtml+xml">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <link rel="stylesheet" type="text/css" href="Style/style.css"></link>
    </head>
    <body>
      <div class="pagebreak">
        <gam:a gam:id="1"/>
        <gam:a gam:id="2"/>
        <gam:a gam:id="3"/>
      ...

<gam:layout gam:sig="annolab default pdf Adler2005">
  <gam:root>
    <gam:seg gam:type="seq" gam:sig="default:5" gam:s="0" gam:e="9">
      <gam:content>Adler2005</gam:content>
      <gam:ref gam:aid="362"/>
      <gam:ref gam:aid="664"/>
mmand allows managing UIMA PEARs in AnnoLab. A PEAR is a packaged analysis component that can be used in a pipeline. Before a PEAR can be used in AnnoLab, it has to be installed. When it is no longer needed, or before a new version is installed, a PEAR has to be uninstalled. It is also possible to get a list of the currently installed PEARs and to get detailed information about an installed PEAR.

The install sub-command is used to install a PEAR into AnnoLab. When a PEAR is installed, it receives a unique name. To work with an installed PEAR in a pipeline or with any commands, this name has to be used to address the PEAR.

The uninstall sub-command is used to uninstall a PEAR, either because it is no longer needed or as a preparation to install a newer version.

The explain sub-command prints detailed information about a PEAR. In particular, it prints a list of configuration parameters that can be changed in a pipeline descriptor file to configure the analysis component. Also, a list of input and output capabilities is printed. Within a pipeline, the analysis components need to be ordered in such a way that all data a particular PEAR lists as its inputs has been produced by previous pipeline stages. The data produced by an analysis component is listed as its outputs.

The list sub-command prints a list of all installed PEARs.

Arguments
file
    A PEAR file name.
name
    The name of a PEAR as shown by the pear list command.

Examples
Uninstall the PEAR java
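The ordering constraint described above amounts to a topological sort over the input/output capabilities that pear explain prints. The following sketch (plain Python, not AnnoLab code; component and capability names are invented) illustrates the idea:

```python
# Sketch: ordering pipeline components so that every component's declared
# inputs have been produced by earlier stages, as required by the
# capability lists shown by `pear explain`. All names here are invented.

def order_components(components):
    """components: dict name -> {"inputs": set, "outputs": set}"""
    ordered, produced = [], set()
    remaining = dict(components)
    while remaining:
        # A component is ready once all of its inputs have been produced.
        ready = [n for n, c in remaining.items() if c["inputs"] <= produced]
        if not ready:
            raise ValueError("unsatisfiable inputs: " + ", ".join(remaining))
        for name in sorted(ready):  # sorted for a deterministic result
            produced |= remaining.pop(name)["outputs"]
            ordered.append(name)
    return ordered

pipeline = {
    "tokenizer": {"inputs": set(), "outputs": {"Token"}},
    "sentences": {"inputs": {"Token"}, "outputs": {"Sentence"}},
    "parser": {"inputs": {"Token", "Sentence"}, "outputs": {"Parse"}},
}
print(order_components(pipeline))  # ['tokenizer', 'sentences', 'parser']
```

If no valid order exists (a component's inputs are never produced), the sketch reports which components could not be placed, which mirrors the kind of mistake this ordering rule is meant to prevent.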
nce $A is matched.

Examples
Assume the layer Speaker contains turn elements with the speaker encoded in the attribute speaker. Assume also that the layer Token contains segment elements with the part of speech encoded in the attribute posTag. The following query extracts all nouns spoken by the speaker Chad:

annolab mquery annolab default
Enter query. Press <CTRL+D> when done to start execution.
element results {
  seq:contained-in(
    ds:layer($QUERY_CONTEXT, "Token")//segment[starts-with(@posTag, "N")],
    ds:layer($QUERY_CONTEXT, "Speaker")//turn[@speaker = "Chad"])
}
<CTRL+D>

Name
seq:grow - calculating covering segment
http://annolab.org/module/exist/nativexq/sequential

Synopsis
seq:grow($A as element()*) as element()

Description
The function seq:grow calculates a segment from the left-most and the right-most positions addressed by any descendant segment of the sequence of elements passed to it. The result is a single segment that covers all the data addressed by the sequence. This function is commonly used in conjunction with txt:get-text to retrieve a meaningful and easily readable section of the signal that includes whitespace and line breaks.

Arguments
$A
    The sequence of elements for which the covering segment is calculated.

Examples
See the example for the function txt:get-text.

Name
tree:following-sibling, tree:preceding-sibling - sibling navigation
http
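The behaviour of seq:grow can be pictured with a small sketch (plain Python, not AnnoLab code): modelling segments as (start, end) offset pairs, the covering segment simply runs from the smallest start to the largest end, so the intervening whitespace is included.

```python
# Sketch: the covering segment computed by seq:grow, modelled on plain
# (start, end) offset pairs rather than real GAM segment elements.

def grow(segments):
    """Return the single (start, end) pair covering all given segments."""
    starts, ends = zip(*segments)
    return (min(starts), max(ends))

# Two word segments with a space between them:
print(grow([(10, 14), (15, 24)]))  # (10, 24) - includes the whitespace
```

This is why passing the result to txt:get-text yields a readable stretch of the signal rather than a concatenation of isolated words.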
AnnoLab asking for the value of an unset query variable ..... 39

List of Tables
3.1. Examples of <relation> and <feature> sections
3.2. Valid mapAs values
3.3. Available fields

Chapter 1. Introduction

AnnoLab is an XML-based framework for working with annotated text. In the area of linguistic research, text is annotated for linguistic features as a preparation step for linguistic analysis. XML is widely used to encode these annotations. AnnoLab was born from the question: Can XML and associated technology be easily extended to offer comprehensive support for linguistic annotations?

1. History of AnnoLab

1.1. PACE-Ling era (2004-2005)

Work on the foundations of AnnoLab started in 2004 in the context of the PACE-Ling project at Technische Universität Darmstadt, Germany. The annotations created manually, semi-automatically, and automatically using various tools had to be unified and made available as an integrated resource, to allow linguistic analysis of relations between all annotations regardless of their source [Teich05]. The Generalised Annotation Markup (GAM) was developed. This markup format allows integrating XML annotations from various sources into a single XML file to al
onality in comparison with the UIMA framework. UIMA provides an abstraction for accessing all kinds of data sources, so-called Collection Readers. AnnoLab provides an abstraction for managing data stores and for transparently importing all kinds of data formats into a UIMA pipeline. A Collection Reader is provided by AnnoLab which interfaces with its import handler infrastructure to transparently load data from a data store or import it from text files, XML files, PDF files, etc. The CAS Consumer abstraction provided by UIMA is designed to persist CAS data for use after the processing is done. AnnoLab provides a CAS Consumer to transparently store data from the CAS, either in the file system or in a data store, from which it may be queried or read again for further processing. Thus, AnnoLab adds transparent data import, a data repository, and a query mechanism.

http://www.sfb441.uni-tuebingen.de/c2/

Chapter 2. Getting started

This section will take you headlong into AnnoLab. It will take you on a short tour from the installation of AnnoLab itself, over the installation of analysis components in the form of PEARs (Processing Engine ARchive), towards the creation of a pipeline involving these analysis components, and finally running the pipeline.

All screenshots for this section were taken on Windows XP SP3. For other versions of Windows, you may get a different visual experience. AnnoLab also runs on Mac OS X and on Linux; for those platf
orms, you should choose paths appropriate for your platform, e.g. on Mac OS X use /Applications/annolab-cli instead of C:\Programme\annolab-cli.

1. Installation

Extract the contents of the AnnoLab archive to the folder C:\Program Files, or to the equivalent location for localised versions of Windows (e.g. C:\Programme for a German version). This creates a folder named annolab-cli.

Note
Setting the ANNOLAB_HOME environment variable is not necessary on Linux or Mac OS X. It is still necessary to add the AnnoLab home directory to the PATH.

Now configure the AnnoLab home directory. Right-click on your Desktop icon and select the Properties item. In the dialog that opens, use the button labelled Environment. This opens a dialog allowing you to create a new user environment variable. Create a new environment variable called ANNOLAB_HOME with the value C:\Program Files\annolab-cli, or the equivalent on your localised version. Remember this directory: it is the home directory of your AnnoLab installation. When I later refer to the AnnoLab home, I mean this directory.

Figure 2.1. Set the ANNOLAB_HOME environment variable

[Screenshot of the (German) Windows XP system properties dialog: the Environment Variables dialog is opened from the system properties to create the new variable; an arrow marks the button to click to create the new variable.]
ources have been specified, the signals are loaded from the destination URI, and analysis results are added to these signals.

Options
filter layer
    Name of an annotation layer that contains filtering information. Filtered data will be hidden from the pipeline: analysis components will not be able to see this data. This is useful for hiding parts of a document, e.g. tables, figures, or other parts of the document that are likely to be analysed incorrectly.
layers
    A comma-separated list of layer names. These layers are loaded into the CAS prior to running the pipeline. In this way, a pipeline can make use of annotations present in the data store. If this parameter is not specified, no layers are loaded into the CAS.

    Caution: Translating annotations from existing layers into the CAS is less well implemented and tested than translating annotations from the CAS to XML.
map file
    Mapping specification. Specify a mapping file that is to be used instead of the default built-in mapping file. This mapping controls how the annotations are mapped from XML to the UIMA CAS model and vice versa.
out layer
    A comma-separated list of layer names. These layers are extracted from the CAS, translated to XML, and added to the processed signals. If this parameter is not specified, all the layers specified by the mapping are extracted i

Further options: rules file, suffix suffix, xslt file.
Switches: drop-empty, split, report, trim.
ransform the query results. For our example, we will create a query template file called query-example.template and a directory query-example. Into the latter we put the query file and the XSLT files.

Each line of the template descriptor file consists of a property name, followed by an equals sign, followed by the property value. For properties pointing to files, paths are always treated as being relative to the template descriptor file. The query-example.template file should look like this:

Figure 20. Query template descriptor file query-example.template

summary=This is an example query searching for a word
description=Here we would put a much longer description
query=query-example/query.xq
xslt.default=query-example/html.xslt
xslt.html=query-example/html.xslt
xslt.csv=query-example/csv.xslt
variable.word.prompt=Search word
variable.word.description=Word to search for

The property summary specifies a short summary of what the query does. A longer description can be given using the property description. The property query defines the name of the file containing the query; this is relative to the location of the template descriptor file. For more information on writing the query file (query.xq) itself, see the chapter on querying.

Following are optional properties starting with xslt. These specify XSLT style sheets that can be used to transform the output of the query. If the property xslt.defaul
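To make the property-file conventions concrete, here is a small sketch (plain Python, not AnnoLab code) of parsing such a descriptor and resolving the file-valued properties relative to the descriptor's directory. Which properties are file-valued is taken from the example above; the directory name is invented.

```python
# Sketch: parsing a "property=value" query template descriptor and
# resolving file-valued properties relative to the descriptor location.
import os

FILE_PROPS = ("query",)       # properties whose values are file paths
FILE_PREFIXES = ("xslt.",)    # xslt.* properties also point to files

def parse_template(text, template_dir):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if name in FILE_PROPS or name.startswith(FILE_PREFIXES):
            value = os.path.join(template_dir, value)
        props[name] = value
    return props

descriptor = """\
summary=This is an example query searching for a word
query=query-example/query.xq
xslt.default=query-example/html.xslt
variable.word.prompt=Search word
"""
props = parse_template(descriptor, "/templates")
print(props["query"])                 # e.g. /templates/query-example/query.xq
print(props["variable.word.prompt"])  # Search word
```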
re than it will be re-integrated into.

Switches
offsets
    Turn on when offsets are present in the IMS CWB results. If they are present, they can later be used to get the original signal. If this is turned off, it is unlikely that the manual annotations made in the UAM Corpus Tool can be re-integrated into the corpus. Default: on.

IMS CWB database requirements

This command can be used to convert a search result from the IMS CWB to a UAM Corpus Tool project. It retains information about the original location of the extracted data in a special layer of the UAM Corpus Tool project, which can be used later to integrate the annotation with the complete corpus.

The IMS CWB corpus database must have been created with at least the following positional attributes:

uri
    AnnoLab URI of the source signal
s
    start offset of the segment within the signal
e
    end offset of the segment within the signal

These must appear in exactly the given relative order: uri, s, e. There can be other positional attributes present in the database. To integrate the partial annotations made in the generated UAM Corpus Tool project back onto the full texts, use the command uam2annolab.

Tolerance to changes

The data store from which the IMS CWB database has been created has to be available when this command is used. For best results, the signals should not have changed between the time the IMS CWB database was created and the time the command is used.
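As an illustration of the uri/s/e convention (plain Python, not AnnoLab code; the tab-separated column layout and the example URI are invented for illustration, only the attribute order uri, s, e follows the manual):

```python
# Sketch: reading the uri/s/e positional attributes back from a
# hypothetical tab-separated CWB result line. Only the relative order
# of the attributes (uri, then s, then e) follows the manual.

def parse_match(line):
    word, uri, s, e = line.rstrip("\n").split("\t")
    return {"word": word, "uri": uri, "start": int(s), "end": int(e)}

m = parse_match("knowledge\tannolab default manual\t15\t24\n")
print(m["start"], m["end"])  # 15 24
```

With the start and end offsets preserved, the extracted segment can later be located again in the original signal, which is what makes re-integration via uam2annolab possible.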
ree Rule Processing Engine Annotator.

Now go back to the command line prompt. Change to the AnnoLab home using the command cd %ANNOLAB_HOME%. Now we will use AnnoLab's pear install command to make the PEARs available in AnnoLab pipelines.

Figure 2.5. Installing PEARs: annolab pear install pears

[Console output: for each PEAR file under the pears directory, the UIMA InstallationController extracts the archive (e.g. C:\Programme\annolab-cli\pears\java-sentence-ae.pear, 583179 bytes extracted), processes its metadata\install.xml in the AnnoLab workspace, notes that metadata\setenv.txt contains required environment variables for the component, and finally reports, e.g., "component java-sentence-ae installation completed".]
sed by the sequence of segments passed as the first argument. The functions txt:get-text-left and txt:get-text-right retrieve a number of characters left or right of each item in the sequence of segments passed as the first argument. The number of characters is specified as the second argument.

Arguments
$elements
    The segments addressing the text to be retrieved.
$window
    The number of characters left or right of the segment.

Examples
In this example, the PDF file SomeText.pdf is copied into the data store default. AnnoLab automatically extracts some layout information from the PDF and stores it as XHTML annotations in the layer Layout. Then a query is issued against that data store looking for all page-break sections (xhtml:div with class pagebreak). For each, an XML element result is created containing the text extracted from the page-break sections by the txt:get-text function. Because the serialiser can only serialise XML fragments with a single root element, all is wrapped in the element results.

The function txt:get-text fetches the text addressed by each segment. Often segments address words, but not the whitespace between them. In order to get an actual piece of text, the seq:grow function can be used. It calculates a segment from the left-most and the right-most positions addressed by any descendant segment of the sequence of elements passed to it. Thus, this function is commonly used in conjunction with txt:get-text to retrieve a meaningful
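The three access functions can be pictured with a sketch (plain Python, not AnnoLab code), modelling the signal as a string and a segment as a (start, end) offset pair:

```python
# Sketch: what txt:get-text, txt:get-text-left and txt:get-text-right
# conceptually return, modelled on a plain string and (start, end)
# offset pairs instead of real GAM segments.

def get_text(signal, seg):
    start, end = seg
    return signal[start:end]

def get_text_left(signal, seg, window):
    start, _ = seg
    return signal[max(0, start - window):start]

def get_text_right(signal, seg, window):
    _, end = seg
    return signal[end:end + window]

signal = "zero knowledge proof"
seg = (5, 14)                          # addresses "knowledge"
print(get_text(signal, seg))           # knowledge
print(get_text_left(signal, seg, 5))   # 'zero ' (5 chars of left context)
print(get_text_right(signal, seg, 6))  # ' proof'
```

The left/right variants are what make keyword-in-context style output possible: each match is shown together with a fixed window of surrounding signal text.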
segments given in the first parameter. Returns a set of segments indicating the matches.

Arguments
$elements
    The XML elements in which to search.
$pattern
    A regular expression.

Examples
Find all occurrences of the string "zero knowledge" within XHTML paragraphs (xhtml:p) of the layer Layout:

annolab mquery annolab default
Enter query. Press <CTRL+D> when done to start execution.
declare namespace xhtml = "http://www.w3.org/1999/xhtml";
txt:find(ds:layer($QUERY_CONTEXT, "Layout")//xhtml:p, "zero knowledge")
<CTRL+D>

Find all XHTML paragraphs (xhtml:p) in the layer Layout containing the string "zero knowledge". The wildcards at the beginning and the end of the search pattern cause the whole paragraph to be returned as a result:

annolab mquery annolab default
Enter query. Press <CTRL+D> when done to start execution.
declare namespace xhtml = "http://www.w3.org/1999/xhtml";
txt:find(ds:layer($QUERY_CONTEXT, "Layout")//xhtml:p, ".*zero knowledge.*")
<CTRL+D>

Name
txt:get-text, txt:get-text-left, txt:get-text-right - access textual signals
http://annolab.org/annolab/textual

Synopsis
txt:get-text($elements as element()*) as element()*
txt:get-text-left($elements as element()*, $window as xs:string) as element()*
txt:get-text-right($elements as element()*, $window as xs:string) as element()*

Description
The function txt:get-text gets the contents addres
stituent it is (Constituent).

A feature declared in a UIMA type is either of a primitive type (integer, string, etc.) or of another UIMA type. Features of primitive types are mapped by a <feature> section. The attribute casName specifies the feature to be mapped, while the mapAs attribute controls how it is mapped to XML:

ignore
    The feature is not mapped.
error
    If the feature is present on this type, an error is generated.
feature
    The feature is mapped to an XML attribute. This is also the default for all primitive features that are not explicitly mapped.

Non-primitive features are mapped using a <relation> section. Per default, non-primitive features are not mapped to XML. They can be explicitly mapped as:

ignore
    The feature is not mapped.
error
    If the feature is present on this type, an error is generated.
feature
    The feature is mapped to an XML attribute. If this mapping type is used, the attribute select has to be present and specify the name of a primitive feature of the mapped type whose value will be used as the attribute value.
leaf-segments
    If the annotation being mapped is a leaf, segments are extracted from the specified feature. An annotation is a leaf if it dominates no other annotation (tree layer only).
segments
    Segments are extracted from the specified feature.
dominance
    The annotation specified by the feature is dominated by the annotation being mapped. This is usually used with the attribute inverted being set to tru
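The three mapAs values for primitive features can be summarised with a sketch (plain Python, not AnnoLab code; feature names are invented) of turning a CAS annotation's primitive features into XML attributes:

```python
# Sketch: interpreting the mapAs values for primitive features when
# turning a CAS annotation into an XML element. 'feature' is the default
# for primitive features that are not explicitly mapped.

def map_features(features, map_as):
    """features: feature name -> primitive value; map_as: name -> mode."""
    attributes = {}
    for name, value in features.items():
        mode = map_as.get(name, "feature")
        if mode == "ignore":
            continue                          # dropped from the output
        if mode == "error":
            raise ValueError("feature not allowed here: " + name)
        attributes[name] = str(value)         # mode == "feature"
    return attributes

attrs = map_features({"posTag": "NN", "id": 7}, {"id": "ignore"})
print(attrs)  # {'posTag': 'NN'}
```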
stribution of part-of-speech tags (Y axis) across a set of directories (X axis). The command uses TreeTagger to generate the part of speech and lemma. The output is a CSV file in UTF-8 (works well with OpenOffice Calc).

It is mandatory to specify a model using the parameter m. To get a list of available models, use the command lemmatize list. This command requires the same set-up as the lemmatize command. Please refer to the documentation of that command to find out how to set up AnnoLab to work with a local installation of TreeTagger.

The source has to be a collection in a data store or a directory on the file system. For each child of the source, a column will be created in the output table. Assuming the following data store structure, the source annolab default A results in a table with the columns Text 1 and Text 2, while the source annolab default results in a table with the columns A and B:

default
  A
    Text 1
    Text 2
  B
    Text 3

Per default, the Y axis of the table shows the lemma and part of speech. This can be changed using the option f, which takes a comma-separated list of field names. Note that field names are case-sensitive.

Arguments
source
    One location from which to read data to be annotated. The source can be a directory on the local file system or a collection in a data store.
destination
    The name of a file on the file system to which to write the results. For your convenience, use a file name endin
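The shape of the resulting table can be sketched as follows (plain Python, not AnnoLab code; the tagging step is replaced by a pre-tagged token list, and the tag set here is invented):

```python
# Sketch: a part-of-speech frequency table with one column per child of
# the source collection, as described above. Real runs would obtain the
# tags from TreeTagger; here they are hard-coded for illustration.
from collections import Counter

corpus = {  # child of the source -> (word, tag) tokens
    "Text 1": [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    "Text 2": [("a", "DT"), ("dog", "NN"), ("and", "CC"), ("a", "DT")],
}

columns = sorted(corpus)
counts = {name: Counter(tag for _, tag in toks)
          for name, toks in corpus.items()}
tags = sorted(set().union(*counts.values()))

# CSV-like output: tags on the Y axis, one column per child of the source.
print("tag," + ",".join(columns))
for tag in tags:
    print(tag + "," + ",".join(str(counts[c][tag]) for c in columns))
```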
t is present, the specified XSLT file is always used, unless an xslt parameter is used to explicitly specify another. In the given case, we could specify xslt csv to use the XSLT style sheet specified with the property xslt.csv instead of the default.

Since you have complete control over the query results, any discussion of how to write the XSLT style sheets is omitted here. It is suggested to keep the query simple and the results plain, and to use XSLT style sheets to do aggregation and/or formatting.

Free query variables need to be declared using properties starting with variable. The example above declares a variable named word. When the user does not specify a variable value using the V parameter, the value is asked for: the description from variable.word.description is shown on the screen, and the user can enter the value after the prompt specified in variable.word.prompt.

Figure 21. AnnoLab asking for the value of an unset query variable

Unset variable
Description: Word to search for
Search word: _

Examples
Run a query against the data store default using input redirection. The file query.xq contains the actual query:

annolab mquery annolab default < query.xq

Profile the query template word.template by running it 10 times in a row, searching for the word "be":

annolab query repeat 10 Vword be word.template annolab default

Run the query template word.template using the html XSLT style sheet
t will remove any parts of the signal that have been annotated with the feature class with the values keywords, bibliography, figure, footnote, formula, ignore, pagebreak, table, or affiliation. To change this, use the rules option.

rules file
    Specifies the rules to be used. The parameter accepts the name of a text file in which you can specify the attribute-value combinations that cause parts of the signal to be filtered out. See the examples below for more information.

Rules specification

Each line of the filter specification file corresponds to one rule. The first part of the line, before the colon, is the name of a feature (XML attribute). After the colon follows a regular expression. The rule matches if an annotation element bears the given feature and the feature value matches the regular expression.

The following example of a filter specification defines two rules. The first rule matches all XML elements bearing an attribute class with either the value table or abstract. The second rule matches all XML elements bearing an attribute speaker ending in Smith. Data covered by XML elements matching either of these rules is not included in the output:

class: table|abstract
speaker: .*Smith

Examples
The following command recursively copies a filtered version of the complete data store annolab default to the directory filteredOutputDirectory:

annolab filter filter Layout annolab default filteredOutputDirectory
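The rule-matching logic described above can be sketched in a few lines (plain Python, not AnnoLab code; whether AnnoLab matches the whole value or a substring is an assumption, whole-value matching is used here since the example rules rely on explicit wildcards):

```python
# Sketch: "feature: regex" filter rules. An element is filtered out when
# it bears the named feature and the whole value matches the regex
# (whole-value matching is an assumption made for this sketch).
import re

def parse_rules(spec):
    rules = []
    for line in spec.splitlines():
        if not line.strip():
            continue
        feature, _, pattern = line.partition(":")
        rules.append((feature.strip(), re.compile(pattern.strip())))
    return rules

def is_filtered(attributes, rules):
    return any(f in attributes and rx.fullmatch(attributes[f])
               for f, rx in rules)

rules = parse_rules("class: table|abstract\nspeaker: .*Smith")
print(is_filtered({"class": "table"}, rules))         # True
print(is_filtered({"speaker": "John Smith"}, rules))  # True
print(is_filtered({"class": "pagebreak"}, rules))     # False
```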
ta in the data store default to that directory:

mkdir integrated
annolab export annolab default integrated

In the next example, we use an XSLT style sheet called plaintext.xslt to extract only the signal text from the integrated representation. The suffix of the exported files is explicitly set to txt:

annolab export xslt plaintext.xslt suffix txt annolab default text

Name
filter - extract filtered signals

Synopsis
filter [filter layer] [rules file] source destination

Description
This command copies signals without layers, optionally filtering the signals by removing parts of the signal annotated for particular features in the specified layer. The features causing a part of the signal to be removed are either a built-in default set or the set of features given in the optional rule file. If the source is a collection, all signals are filtered recursively.

Arguments
source
    One or more locations from which to copy data to the destination. The sources can be files or directories on the file system, or data store locations.
destination
    One location to which the data is copied. The location can be a directory on the file system or a collection in a data store. The location should always terminate in a / to indicate that it is a collection.

Options
filter layer
    Specifies a layer to be used for filtering. If a filter layer is specified, per default i
to each other and how to interpret features they bear. If you know about the CAS, you may wonder what the terms relation and feature mean here, because in UIMA lingo there are only feature structures. What we call a feature here is a feature structure of a primitive type (integers, strings, etc.). When a feature of a feature structure is another feature structure, we say there is a relation between the two feature structures.

For the moment, we are only interested in the parse tree produced by the StanfordParserAnnotator. This tree is encoded in the children relation that is present in Constituent. To treat this relation as the dominance relation of the tree layer, we add a <relation> section.

Figure 3.5. Step 3: Define the dominance relation of the tree

<mapping>
  <layer type="tree" name="parse" segType="sequential">
    <element casType="de.julielab.jules.types.Constituent">
      <relation casName="children" mapAs="dominance"/>
    </element>
    <element casType="org.annolab.uima.pear.stanford.Token"/>
  </layer>
</mapping>

Now we need to define that the segments relation of Token points to the stand-off anchors. So we add a <relation> section to the respective <element> section.

Figure 3.6. Step 4: Define where the segments are located

<mapping>
  <layer type="tree" name="parse" segType="sequential">
    <element casType="de.julielab.jules
[Console output continues: the Stanford parser annotator loads its model englishPCFG.ser.gz from the AnnoLab workspace and the theme annotator initialises; the sentence, token, and parser annotators then run in turn. A warning is printed for a sentence whose length (155) exceeds the configured maximum sentence length: "The term rapid prototyping (RP) refers to a class of technologies that can automatically construct physical models from Computer Aided Design (CAD) data ...".]

After the command completes, you will find the following files in the output directory:

- ex1.html.txt: the text extracted from the input file
- ex1.html_Layout.xml: the layout layer
- ex1.html_Sentence.xml: the sentence annotation layer
- ex1.html_Token.xml: the token annotation layer
- ex1.html_Parse.xml: the parse annotation layer containing the syntactic parse and the theme annotations

The in Layout parameter causes the layout information from the HTML to be imported before running the pipeline. Thus, header and paragraph boundaries are available to the sentence sp