D-EE2.6 NERT User Manual

1. NERT can create a file that contains a list of all found NEs with the following command:

    -NElist FILE

Input and output formats

NERT 2.0 can handle three text formats: BIO, text and xml. By default it expects BIO format as input and uses it for output as well. When you are using files in text or xml format, or you want a particular output in the extraction process, you need to tell NERT:

    training, in the properties file:    format=txt/xml/bio
    extracting, on the command line:     -in bio/txt/xml -out bio/txt/xml

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

BIO format

BIO is an acronym that stands for the kind of tags used: B(egin), I(nside) and O(ut). Basically, each word is on a separate line, followed by the tag:

    Arjen POS B-PER
    Robben POS I-PER
    should POS O
    have POS O
    scored POS O
    against POS O
    Spain POS B-LOC
    . POS O

The middle (POS) tag is optional; it is not used by the tool. However, if you leave it out, it is necessary to tell the tool in the properties file what the structure of your BIO input is:

    Default:
        format=bio
        map=word=0,tag=1,answer=2
    Without the POS tag:
        format=bio
        map=word=0,answer=1

It is recommended to add a whitespace after each sentence and tokenize your data, so that periods, com
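The BIO layout described above can be read back into a list of entities with a few lines of code. The following is a minimal sketch, not part of NERT: the function name `read_bio` and its parameters are hypothetical, and it assumes tags are written as `B-PER`, `I-PER` and `O`, with the word in the first column and the answer tag in the last.

```python
def read_bio(lines, word_col=0, answer_col=-1):
    """Collect (entity text, entity type) pairs from BIO-tagged lines."""
    entities, words, etype = [], [], None

    def flush():
        # Store the entity collected so far, if any, and reset the state.
        nonlocal words, etype
        if words:
            entities.append((" ".join(words), etype))
        words, etype = [], None

    for line in lines:
        cols = line.split()
        if not cols:                      # empty line: sentence boundary
            flush()
            continue
        word, tag = cols[word_col], cols[answer_col]
        if tag.startswith("B-"):          # a new entity begins
            flush()
            words, etype = [word], tag[2:]
        elif tag.startswith("I-") and words and tag[2:] == etype:
            words.append(word)            # the current entity continues
        else:                             # O tag (or a stray I- tag)
            flush()
    flush()
    return entities


sample = [
    "Arjen POS B-PER",
    "Robben POS I-PER",
    "should POS O",
    "have POS O",
    "scored POS O",
    "against POS O",
    "Spain POS B-LOC",
    ". POS O",
]
found = read_bio(sample)
```

Because the answer tag is taken from the last column, the same sketch works for input without the middle POS column.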
2. </PER>, etc.)

    starttag=<NE type="TAG">    (for <NE type="PER">, possibly followed by attributes, e.g. <NE type="PER" id="2">)

    extraction, on the command line:
    -starttag <NE TAG>    (or -starttag <NE type="TAG">)
    -endtag </TAG>

If a wrong starttag and/or endtag is given, NERT will most likely crash.

In extraction, when a text file is given that has tags, NERT will use the structure of these tags for its own output, while marking the original reference tags with the tags <REF_ORG>Timbouctou</REF>. For example, <PER>John</PER> and <PER>Yoko</PER> with starttag <TAG> and endtag </TAG> will be output as

    <PER><REF_PER>John</REF></PER> and <PER><REF_PER>Yoko</REF></PER>

in which the inner tags represent the original tags and the outer tags the ones supplied by NERT.

As a final note: although NERT tries to preserve the original outline of a text document, there will most probably be differences in the output of whitespaces.

Xml format

When using xml format, the same principles apply as for txt regarding the tags. NERT deals with xml input
3. As for the surname identifiers, an example file for Dutch can be found in the matcher directory. If string2roman is not specified or left empty, the matcher will still find roman numbers, but not the ones that are spelled out.

7 License and IPR protection

The Stanford tool is licensed under the GNU GPL v2 or later.

References

Finkel, Jenny Rose, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
4. file A from that directory:

    dir=dir_A
    onlyShowFile=dir_A/file_A

This is particularly useful if we would like to have the matcher group variants from a set of lists, but we are only interested in the output of one of them. If you want the output of more than one file, use onlyShowFiles with semicolon-separated filenames.

The NE matcher has different ways to print its output. The default output is as follows:

    Amsteldam Amsterdam
    Amsteldam Amstelredam
    Amstelredam Amsteldam
    Amstelredam Amsterdam
    Amsterdam Amsteldam
    Amsterdam Amstelredam

This is called the pairview output, since each line shows one pair of NEs. If you would rather have the matcher list all variants of a single NE per line, use the groupview flag in your properties file:

    groupview=true

This will print:

    Amsteldam Amsterdam Amstelredam
    Amstelredam Amsteldam Amsterdam
    Amsterdam Amsteldam Amstelredam

The flag showScores can be used to let the NE matcher also print the matching scores for each variant. showScores=true in the properties file gives:

    Leeuwarden Leewarden 100 Gemeente Leeuwarden 100 Lieuwarden 76

The flag showPhoneticTranscription can be used to have the NE matcher p
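The relation between the pairview and groupview outputs shown above can be illustrated with a short sketch. This is not NERT code: the function name `group_pairs` is hypothetical, and the sketch only assumes that each pairview line holds one NE followed by one of its variants.

```python
def group_pairs(pairs):
    """Turn (NE, variant) pairs into one [NE, variant, variant, ...] list per NE."""
    groups = {}
    for ne, variant in pairs:
        variants = groups.setdefault(ne, [])   # keeps first-seen order of NEs
        if variant not in variants:
            variants.append(variant)
    return [[ne] + variants for ne, variants in groups.items()]


pairview = [
    ("Amsteldam", "Amsterdam"),
    ("Amsteldam", "Amstelredam"),
    ("Amstelredam", "Amsteldam"),
    ("Amstelredam", "Amsterdam"),
    ("Amsterdam", "Amsteldam"),
    ("Amsterdam", "Amstelredam"),
]
groupview = group_pairs(pairview)
for group in groupview:
    print(" ".join(group))
```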
5. provided that it is told to consider only text between specific tags. Say we have the xml file below:

    <?xml version="1.0" encoding="UTF-8"?>
    [...]
    <Text>Sally sells sea shells at the sea shore</Text>
    <Text>Peter Piper picked a pack of pickled peppers</Text>
    [...]

We have to tell NERT to only consider the text between the <Text> tags. This is done as follows:

    training, in the properties file:
    xmltags=Text
    or, with multiple tags:
    xmltags=Text Unicode

    extracting, on the command line:
    -xmltags Text
    or, with multiple tags:
    -xmltags Text Unicode

NERT deals with XML by simply skipping all text that is not between the specified tag(s). The relevant chunks are processed one after another. Note that this means that in the above example it will first train on (or extract from) the first sentence and then the following one. Any NEs that stretch over these two chunks would therefore be missed. Thus, the xml format is recommended only when large chunks of text are covered by a specific tag. In other cases it is necessary to convert the text to either text or BIO format.

The spelling variation reduction module

In training, NERT learns to recognize NEs by trying to identify relevant clues about both the NEs and their context. Examples of clues are use of capitals, position in the sentence, or preceding or following words or groups of words, such as in + LOCATION. Th
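The chunk-by-chunk handling of xml input described above can be sketched as follows. This is only an illustration of the principle, not NERT's implementation: the function name `text_chunks` and the root element `Document` are hypothetical, and the sketch uses Python's standard `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET


def text_chunks(xml_string, tags):
    """Return the text of every element whose tag is listed, in document order."""
    root = ET.fromstring(xml_string)
    return [(el.text or "").strip() for el in root.iter() if el.tag in tags]


# "Document" and "Unicode" are placeholder element names for this example.
sample_xml = (
    "<Document>"
    "<Unicode>ignored here</Unicode>"
    "<Text>Sally sells sea shells at the sea shore</Text>"
    "<Text>Peter Piper picked a pack of pickled peppers</Text>"
    "</Document>"
)
chunks = text_chunks(sample_xml, {"Text"})
```

Each returned chunk would be handled on its own, which is why an NE that spans two chunks cannot be found.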
6. 1 to limit the matches to any minimal score.

The option showDoubles=false can be used to have the matcher only print out unique NEs and their matches.

Types

The matcher can also handle NE types (e.g. LOC, ORG, PER). For this, it needs its input in the following way:

    LOC Amsterdam
    LOC Leeuwarden
    PER Piet Jansen

with NE type and NE separated by a whitespace. You need to tell the matcher that your input file(s) contain types by adding the following line to your properties file:

    hasType=true

Note that this only tells the matcher how to read the input files. The matcher will still match all NEs, regardless of their type. If you want the matcher to match only PERs with PERs and LOCs with LOCs, use the following:

    useType=true

By default, the types will disappear in the matcher's output, but you can tell the matcher to print them anyway by adding the following line to the properties file:

    printTypes=true

This will print:

    LOC Amsterdam LOC Amsteldam

Finally, the verbose flag can be used for some more general output to STDERR.

The flag punishDiffinitial is used to punish the matching scores of NEs that do not start with the same character. Its value is subtracted from the fina
7. as prepositions and titles. For example, Dutch naer and naar ('to'). For those words it pays to write a specific rule, e.g. \bnaer\b=>naar.

Regarding the latter remark, a script find_NE_identifiers.sh is added to the scripts directory, which can be used to help identify useful words. When run on a text in BIO format in which the NEs are tagged (like the training file), it lists all words preceding the NEs. These preceding words are often important predictors for NEs, and performance generally improves when the amount of variation in them is reduced. The list will generally contain many prepositions and titles. The script is run as follows:

    sh find_NE_identifiers.sh [file] > [outputfile]

NERT can print a list of created rewrite rules (variant=>word) to a file when using the following command:

    training, in the properties file:
    printSpelVarPairs=FILE

    extraction, on the command line:
    -svlist FILE

Creating training, target and properties files

Training and target files

A first step is to select and produce an appropriate training file. NERT's performance depends strongly on the similarity between the training file and the test file: when they are exactly alike, the tool can reach an f1-score of 100. Generally speaking, the more the two files differ, the lower the performance will become, although other factors also affect the tool's performance. We therefore recommend using part of a particular batch of te
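The idea behind listing the words that precede NEs can be sketched in a few lines. This is not the find_NE_identifiers.sh script itself, whose internals are not shown in this manual: the function name `preceding_words` is hypothetical, and the sketch assumes BIO input with the word in the first column, the tag in the last column, and entity starts marked with a `B-` prefix.

```python
from collections import Counter


def preceding_words(bio_lines):
    """Count the words that directly precede the start of a named entity."""
    counts = Counter()
    previous = None
    for line in bio_lines:
        cols = line.split()
        if not cols:                  # empty line: sentence boundary
            previous = None
            continue
        word, tag = cols[0], cols[-1]
        if tag.startswith("B-") and previous is not None:
            counts[previous] += 1
        previous = word
    return counts


sample = [
    "reis O", "naer O", "Leyden B-LOC", "", 
    "terug O", "naar O", "Leiden B-LOC", "",
    "brief O", "naer O", "Amsterdam B-LOC",
]
identifiers = preceding_words(sample)
```

In this made-up sample the counts show both naer and naar in front of locations, which is the kind of variation a rewrite rule such as \bnaer\b=>naar is meant to reduce.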
8. Improving Access to Text (IMPACT)

USER MANUAL for NERT (Named Entity Recognition Tool)
Partner: INL
Deliverable: D-EE2.3 (update), part of D-EE2.6
Version 3.0, November 2011

Table of contents

    1 Background
    2 Differences with earlier versions
    3 NERT requirements
    4 The NERT package
    5 Extracting named entities with NERT
    6 Using the NERT named entity matcher module
    7 License and IPR protection
    References

1 Background

NERT is a tool that can mark and extract named entities (persons, locations and organizations) from a text file. It uses a supervised learning technique, which means it has to be trained with a manually tagged training file before it is applied to other text. In addition, version 2.0 of the tool and higher also comes with a named entity matcher module, with which it is possible to group variants or to assign modern word forms of named entities to old spelling variants.

As a basis for the tool in this package, the named entity recognizer from Stanford University is used. This tool has bee
9. at version 1.0 used Java 1.5).

4 The NERT package

NERT consists of a directory with the tool itself, example data and scripts:

    NERT: data, matcher, models, phontrans, props, sample_extract, sample_train, doc, out, scripts, tool

Figure 1: contents of the NERT package

The directory tool contains two jar files: nert3.0.jar and stanford-ner.jar. Both are needed to run NERT. If you don't use the NERT package but simply have the jar file nert3.0.jar and you get the jar file from Stanford yourself, it is necessary to rename the latter to stanford-ner.jar and put it in the same directory as nert3.0.jar to run NERT. Another option is to unpack nert3.0.jar and change the classpath and main class settings in the manifest.mf file.

5 Extracting named entities with NERT

At the very least, three files are needed for NE extraction with NERT. If you have those three, you are ready to go:

    1) a tagged training file
    2) a (tagged or untagged) target file from which NEs are extracted
    3) a properties file

'Tagged' means that all NEs in the file have been tagged. The target file can be either tagged or untagged. If it is tagged, it is possible to calculate the tool's performance with the conlleval script from the CoNLL conferences, provided that the output is set to BIO format (see below). This script can be downloaded at http://www.cnts.ua.ac.be/con
10. below) arguments are compulsory. There are several optional extraction settings that can be added; they are discussed below:

    java -jar tools/nert3.0.jar -loadClassifier [model] -testFile [testfile] -in [txt|bio|xml] -out [txt|bio|xml] -nelist [file] -xmltags [tags] -starttag [tag] -endtag [tag] -sv -svphontrans [file] -svlist [file]

NERT sends its output to STDOUT. Again, a higher amount of memory can be used as well.

For extraction, a properties file is not needed. In principle, the settings from the training file are passed on through the model. A set of relevant parameter settings can be passed to the tool via the command line. They are discussed in the next section.

Settings

Input and output files

For training, one or more files or a reference to a directory with relevant files can be used, and the path has to be given in the properties file. There are three options:

    trainFile=FILE
    trainFiles=FILE1 FILE2 FILE3
    trainDirs=DIR

For extraction, a single file or a reference to a directory can be used on the command line:

    -testFile [target file]
    -testDirs [directory]

Note that NERT prints the results to standard output. This means that when using a directory, all files within this directory are printed one after another as a whole. In order to be able to distinguish the original target files, NERT starts the output of each target file with a print of the filename when the flag -testDirs is used
11. e matcher will try to match the names by their surname first. If it finds a match, it will then look at the initials. If these match as well, it will assume that we are dealing with a variant. In this strategy, P de Vries and PCJ de Vries match, but P de Vries and J de Vries do not, while de Vries matches with any of the above-mentioned NEs for lack of an initial.

A list of these signalling words can be added in a file and given to the matcher:

    surnameIdentifiers=FILE

with the file containing a simple list of words, one on each line. An example file for Dutch can be found in the matcher directory.

If the matcher cannot find any clue as to which is the surname, it will only consider the last word of the NE and use this for matching. This is also the case when the perFilter is used but no file is specified, e.g. perFilter=true and surnameIdentifiers= (or without the entire latter line).

The perFilter gets into trouble with person names such as Johannes X or Frederik de 2e, since the matcher will only use X and 2e as its matching strings (the latter because of the word de). For this reason, the matcher checks the NE for use of roman numbers first. If it finds any, it will consider the first name instead of the last. Note that Frederik de 2e and Frederik de Tweede should also be considered this way. For this reason, the user can provide the matcher with a file containing rewrite rules for words and their roman counterparts, such as tweede=>II:

    string2roman=FILE
12. eNGrams=true
    maxNGramLeng=6
    usePrev=true
    useNext=true
    useSequences=true
    usePrevSequences=true
    maxLeft=1
    useTypeSeqs=true
    useTypeSeqs2=true
    useTypeySequences=true
    wordShape=chris2useLC
    useDisjunctive=true

Note: in order for the spelling variation reduction module to work properly, useWord=true is necessary, and if gazetteers are used, sloppyGazette=true is necessary as well.

6 Using the NERT named entity matcher module

The matcher module matches variants of named entities (NEs), such as Leijden and Leyden. It can also be used to match historical spelling variants of NEs to their modern form, such as Leyden to Leiden. It compares phonetic transcriptions of NEs and calculates the distance between them by breaking them up into chunks and counting the number of chunks two NEs have in common. This value is then corrected for string length and normalized on a scale from 0 to 100, with 100 being a perfect match.

Phonetic transcription takes place on the basis of a set of rules, which have to be given to the matcher. Examples of phonetic transcriptions are mastrigt for the NE Maastricht and franserepublik for Fransche Republiek. NERT comes with a set of default rules that have pr
13. es with an example file with the phonetic transcription rules for Dutch in the matcher directory. Note that these rules do not have to be passed to the matcher, because they are the default rules.

Dealing with person names

With the exception of those strings that the matcher is told to ignore with the phonetic transcription rules, it uses the entire NE for matching. For person names, this might easily lead to false negatives for names such as Kurt Vonnegut, Vonnegut and Kurt Vonnegut jr., because of the differences in string length.

The matcher has a built-in option to make a simple estimate of the structure of person names, and thus to recognize that P de Vries, Piet de Vries and Pieter Cornelis Sebastianus de Vries are possible variants. This option is set by the following flag:

    perFilter=true

This is done by letting the matcher look for possible clues for surnames. In the given example, the word de is such a clue, and the matcher will consider all names preceding de as given names and all names following de as surnames. The given names are abbreviated, and only the initial(s) is/are used in matching. Thus, the three examples above are reduced to P de Vries, P de Vries and PCS de Vries. Th
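The reduction of person names described above can be sketched as follows. This is a simplified illustration, not the matcher's actual code: the function name `reduce_person_name` is hypothetical, the only surname identifier used is de, and the fallback to the last word follows the manual's description of what happens when no clue is found.

```python
def reduce_person_name(name, surname_identifiers=("de",)):
    """Abbreviate given names to initials, keeping the surname part intact."""
    words = name.split()
    for i, word in enumerate(words):
        if word.lower() in surname_identifiers:
            # Everything before the identifier counts as given names.
            initials = "".join(given[0].upper() for given in words[:i])
            surname = " ".join(words[i:])
            return (initials + " " + surname).strip()
    # No clue for the surname: only the last word is used.
    return words[-1] if words else ""


reduced = [
    reduce_person_name("P de Vries"),
    reduce_person_name("Piet de Vries"),
    reduce_person_name("Pieter Cornelis Sebastianus de Vries"),
]
```

The sketch leaves out the roman-number check and the comparison of initials, which the matcher performs as separate steps.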
14. is means that the tool is sensitive to variations in the spelling of words. For example, the sentences come from London, come fro London and come frcm London all have different words preceding the location London for the tool, although they are all (made up) variants of the word from. Thus, the tool would benefit if these variations were reduced, and this is what the spelling variation reduction module intends to do.

The module tries to reduce spelling variation in the input data by matching potential variants, creating internal rewrite rules and executing these rewrite rules before the tool actually uses the input. The actual output remains unchanged. In the above example, it would identify the words from, fro and frc as variants and create the rewrite rules fro=>from and frc=>from. These rewrite rules are applied to the input data, the tool is run, and in the case of extraction the original text is used for output.

In extraction, the module looks at the target file, the words from the original training file and, if present, gazetteer lists, which are all stored in the used model. For example, if a model has been trained with the word fro, it pays to create a rewrite rule in which
15. l score. The default value is 10.

The flag perFilter (default value: true) sets the use of the PER filter, which tries to handle person names more intelligently (see explanation above).

Phonetic transcription rules

As mentioned earlier, the matcher uses default rules to convert each NE to a phonetic transcription. These rules can be overridden by supplying the matcher with a file with other rules and by putting the path to this file in the properties file:

    phonTrans=myfiles/phonTransRules.txt

The rules are simple rewrite rules, which Java applies to each NE one by one with a single replaceAll method. For example, look at the following two rules:

    ch=>g      (replace any ch with g)
    d\b=>t     (replace any d at a word boundary with t)

Before the matcher applies these rules, the string is converted to lowercase. For example, if the above rules are applied, the NE Cattenburch becomes cattenburg and Feijenoord becomes feijenoort.

Since the matcher goes over the rules one by one, it is important to take the order of the rules into account. Consider for example:

    z=>s
    tz=>s

The latter of the two rules will never be used, since every z has already been turned into s by the first rule.

The rules can also be used to simply remove characters or whole substrings from the NE, e.g.:

    gemeente=>    (replaces gemeente with nothing)
    \W=>          (replaces all non-word characters with nothing)

NERT com
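The effect of the transcription rules above, including the importance of their order, can be tried out with a short sketch. The manual states that the matcher applies the rules with Java's replaceAll; the sketch below instead uses Python's `re.sub`, which behaves the same way for these simple patterns. The function name `transcribe` and the example name Fritz are assumptions made for illustration.

```python
import re


def transcribe(ne, rules):
    """Lowercase the NE, then apply each (pattern, replacement) rule in order."""
    result = ne.lower()
    for pattern, replacement in rules:
        result = re.sub(pattern, replacement, result)
    return result


rules = [("ch", "g"), (r"d\b", "t")]
cattenburch = transcribe("Cattenburch", rules)
feijenoord = transcribe("Feijenoord", rules)

# Rule order matters: once z=>s has run, tz=>s can no longer match.
z_first = transcribe("Fritz", [("z", "s"), ("tz", "s")])
tz_first = transcribe("Fritz", [("tz", "s"), ("z", "s")])

# Rules with an empty replacement remove the matched text.
stripped = transcribe("Gemeente Leeuwarden", [("gemeente", ""), (r"\W", "")])
```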
16. ll2000/chunking/output.html. However, note that for the actual extraction of NEs, tags in the target file are not necessary.

The properties file consists of a list of features, parameter settings and locations of the necessary files. This file is discussed below. In the directory data/props, an example properties file is included.

The script run_nert.sh in the scripts directory can be used as an example. It trains a model with Dutch example data, using the properties file from the directory data/props. It then uses its model to identify NEs in a target file.

Stanford is a statistical NE tool. This means it needs to be trained on tagged material, which is what the training file is for. For good performance, it is key to train on material that is as close to the actual target data as possible in terms of time period and genre. More information on how to create training and target files is given below.

Training and extracting are two separate commands. After training, the tool produces a classifier ('model'), which is stored as a file. This model can then be used for extracting at any later stage.

Training the model is done by running the jar file nert3.0.jar in the directory tool with the following c
17. mas, etc. are on separate lines instead of being glued to the end of a word, since this improves performance.

If the BIO format is needed, the script tag2biotag.pl in the scripts directory can be used. For input, it needs a text file with each word on a new line and NEs tagged as <NE_PER|ORG|LOC>Named Entity</NE>, e.g. <NE_PER>Arjen Robben</NE> should have scored against <NE_LOC>Spain</NE>.

Txt format

NERT can also handle text format, in which the tags are wrapped around the NEs:

    <NE_PER>Arjen Robben</NE> should have scored against <NE_LOC>Spain</NE>

Again, NERT needs to know which format you are using, both in training and extraction:

    training, in the properties file:    format=txt
    extraction, on the command line:     -in txt

With text format, NERT expects the tags in the example above as default: <NE_PER>JOHN</NE>. If different tags are used, these need to be specified. In this specification, the actual tag (e.g. PER, LOC or ORG) is represented by the word TAG in capitals:

    training, in the properties file:
    starttag=<NE TAG>    (for <NE PER>, <NE LOC>, <NE ORG>)
    endtag=</TAG>        (for </LOC>,
18. n extended for use in IMPACT. Among the extensions are the aforementioned matcher module and a module that reduces spelling variation within the used data, thus leading to improved performance. For more information on the working of the Stanford tool, see Finkel, Grenager and Manning (2005) or visit the tool's website: http://nlp.stanford.edu/software/CRF-NER.shtml. The Stanford tool is licensed under the GNU GPL v2 or later.

2 Differences with earlier versions

- Some bug fixes regarding error handling.
- Added setting to show the actual phonetic transcription used in the matcher.
- In NERT 2.0 and up, the IMPACT extensions are separate modules from the Stanford package. That is, one can download the tool from Stanford apart from the IMPACT modules. However, the IMPACT module only works together with the Stanford package.
- The present version can handle text and (simple) xml formats as input, in addition to the BIO format from version 1.0. Its spelling variation reduction module has been improved, and there have been some changes in how to pass arguments and parameter settings. Finally, a matcher module has been added.

3 NERT requirements

NERT is a Java application and requires Java 1.6 (note th
19. ok out for are errors due to faulty OCR and tokenization, as shown below:

    Where should be Where Is Is Dr dr Who Who 0 should be O J J Simpson Simpson The should be The New New TOM Tom W Waits A album l T S Al Bum

The NER package comes with a few Perl scripts that can fix most of the above, but it is always a good idea to double-check the results. Note also that using these scripts affects your source text. The scripts work with BIO text input and print to standard output. The scripts can be used as follows:

    perl convertToLowercase.pl < [BIO file]
        changes all CAPITALIZED WORDS to words with Initial Capitals

    perl fixInitials.pl < [BIO file]
        detects periods that are preceded by a single capitalized letter and a whitespace, or by words listed in the script (mr, Mr, dr, Dr, st, St, ir, Ir, jr, Jr, wed, Wed)

    fixAbbrev.pl < [BIO file]
        a script specific for Dutch; changes v. to van and d. to de
20. ommand:

    Training:
    java -jar nert3.0.jar -t -props [properties file]

If necessary, memory can be increased as follows:

    java -mx4000m -jar nert3.0.jar -t -props [properties file]

4000MB should be enough for the training of the model, but if necessary and available, more memory can be used as well. When the tool does not successfully create a model during training, insufficient memory might be a reason.

The properties file gives the tool the location of the file or files it has to train with, parameter settings, and the location where to write its model to (see below for more detail).

In the examples below, nert3.0.jar is called from the main directory. Note that the paths to all files in the training, extraction and matching examples are relative, so make sure that the paths are correct.

Basic extraction with BIO input and BIO output is done as follows:

    java -jar tools/nert3.0.jar -loadClassifier [model] -testFile [testfile]

We experienced cases in which the tool crashed during extraction because of an out-of-memory error. This was solved by increasing memory in the same way as for the training process.

The -loadClassifier and -testFile (or -testDirs, see
21. on word characters with nothing, thus removing them from the string. One can also use the rewrite rules to replace or remove complete words.

For each word, the rules are applied one by one, in the order in which they appear in the file. It is important to consider this order: sz=>s after the rule z=>s is useless, because every z will already have been removed from the string.

Tests on Dutch historical data have shown that the module is capable of improving the scores by up to a few percent. However, having the proper rewrite rules is key here. We found that more rules did not necessarily lead to better performance, due to the fact that more rules lead to more wrong variant matches. In general, the following advice can be given:

- Remove non-word characters such as dashes, whitespaces, commas and periods (\W=>).
- Check the data for commonly occurring variations. For example, Dutch mensch vs. mens and gaen vs. gaan.
- Check the effect of the rewrite rules: sch=>s would work for mensch, but would also wrongfully change schip ('ship') into sip. sch\b=>s works better, but skips the plural menschen.
- Focus on words that identify named entities, such
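Checking the effect of a candidate rewrite rule, as advised above, can be done on a small word list before the rule is added to the rules file. The sketch below uses Python's `re.sub` as a stand-in for the tool's own rule handling; the function name `apply_rule` is hypothetical.

```python
import re


def apply_rule(pattern, replacement, words):
    """Show what a single rewrite rule does to each word in a test list."""
    return {word: re.sub(pattern, replacement, word) for word in words}


test_words = ["mensch", "schip", "menschen"]

# sch=>s fixes mensch but also damages schip.
broad = apply_rule("sch", "s", test_words)

# sch\b=>s only fires at the end of a word, so schip is safe,
# but the plural menschen is left untouched.
narrow = apply_rule(r"sch\b", "s", test_words)
```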
22. oven to work well for Dutch. However, for other languages, some of the rules might have to be altered.

Using the matcher

You can tell NERT to start the matcher by using the -m flag as a first flag, and use the -props flag to tell the matcher the location of a properties file. This properties file holds the values of a set of parameters and the location of all relevant files:

    java -jar tools/nert3.0.jar -m -props propsfile.props

The matcher needs the following data:

- One or more files with NEs (format: one NE on each line).
- A properties file.
- For lemmatizing: one or more files with NEs (format: one on each line).
- A file with phonetic transcription rules (optional).
- A file with surname identifiers for person names (optional).
- A file with transcription rules for roman numbers (optional).

The exact use of this data and all possible settings in the properties file are discussed below.

Examples

Say we have a single file with NEs and we would like the matcher to group all NEs within that file that are variants. The file is myfiles/NE/NE-file.txt. In the properties file we then put the following:

    file=myfiles/NE/NE-file.txt

If you have your NEs in more than one file, they can be refe
23. Creating a properties file

A properties file consists of a list of features, parameter settings and locations of the necessary files, and a link to its location should be added as an argument when training the model. An example properties file can be found at data/props.

Below, the contents of a properties file are shown, with a short description of the most important features:

    trainFile=[path and name of single training file]
    trainFiles=[training file1] [training file2] [training file3]
    trainDirs=[directory with training files]
    serializeTo=[path and name of the model that will be created]
    map=word=0,tag=1,answer=2                    (structure of the BIO format)
    useSpelVar=true                              (use any of the spelvar modules below)
    svphontrans=[path and name of file]          (file with phonetic transcription rules)
    printSpelVarPairs=[path and name of file]    (print all created and listed rewrite rules to file)
    useGazettes=true                             (use gazetteers listed below)
    sloppyGazette=true
    gazette=[path/to/list1;list2;list3]          (location of gazetteer lists)
    format=[txt/bio/xml]                         (input format; default: bio)
    starttag=<NE_TAG>                            (shape of NE tags in txt/xml format)
    endtag=</NE>
    xmltags=[tag1] [tag2] [tag3]                 (relevant xml tags; leave out <>)

The following features can be left like this:

    noMidNGrams=false
    useDistSim=false
    useReverse=true
    useTitle=true
    useClassFeature=true
    useWord=true
    us
24. r training and the rest for testing.

The script splitFiles.pl in the scripts directory can be used to create such a random set of sentences. For input, it needs a text file with each sentence beginning on a new line, and the desired number of words. It then creates two output files: one with the desired number of words and one with the remaining text. These files can then be used as training and target files.

    perl splitFiles.pl [textfile] [number of words of output file 1] [num]

The third argument [num] is the total number of files that are created. Use 1 to create 1 training file and 1 target file. The script splitFiles_BIO.pl works the same as splitFiles.pl, but uses a file in BIO format as input.

For the tagging of the training file we used the Attestation Tool from deliverable EE2.4, but other tools can of course be used as well. In the documentation of the current deliverable EE2.3, a document with NE keying guidelines is included that can be useful. Although it is written for use with the Attestation Tool, its guidelines are generally applicable.

If the BIO format is needed, the script tag2biotag.pl in the scripts directory can be used. For input, it needs a text file with each word on a new line and NEs tagged as <NE_PER|ORG|LOC>Named Entity</NE>.

Improving data

When using OCR'd data, tool performance on person names generally increases when the training and target files are cleaned up a bit. Generally, the main things to lo
25. rint the actual phonetic transcription used in the matching process. For example:

Braddock   bbrraddokk
Bradock    bbrraddokk
Braddocke  bbrraddokk

By default, the NE matcher shows all matches with a score higher than or equal to 50. Generally, scores lower than 70-75 will contain many false positives, so you can alter the minimal score by using minScore in the properties file:

minScore 75

Note that it might be a good idea to use a minimal score that is not too high, since it is easier to filter out false positives than to figure out the false negatives, that is, the matches it has overlooked. The matcher's score can be used to quickly track the false positives.

You can also tell the matcher to only print the N best scores. For this, use the following flag:

nBest 5

The matcher looks at both the settings of minScore and nBest. Say we have a word with 8 matches, with scores 100, 100, 80, 80, 80, 75, 75 and 50. With minScore 50 and nBest 2, we only see the first 2 results. With minScore 80 and nBest 8, we only see the first 5 results, because scores lower than 80 are not considered.

- use minScore 0 and any nBest > 0 to always show the N best results, regardless of their score
- use nBest
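The interplay of minScore and nBest can be illustrated with a small Python sketch. This is not the matcher's own code: the function name `select_matches` is hypothetical, and the sketch assumes that the threshold is inclusive (as the default of "higher than or equal to 50" suggests) and that nBest is applied after the threshold.

```python
def select_matches(scores, min_score=50, n_best=None):
    """Illustrative filter: keep scores >= min_score, then the n_best highest.

    Assumption: min_score is inclusive and is applied before n_best.
    """
    # Drop everything below the threshold and rank the rest, best first.
    kept = sorted((s for s in scores if s >= min_score), reverse=True)
    # With no n_best given, every surviving match is shown.
    return kept if n_best is None else kept[:n_best]


scores = [100, 100, 80, 80, 80, 75, 75, 50]
print(select_matches(scores, min_score=50, n_best=2))  # [100, 100]
print(select_matches(scores, min_score=0, n_best=3))   # [100, 100, 80]
```

Setting the threshold to 0 makes nBest the only limit, which corresponds to the "always show the N best results" case.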
26. rred to by their directory:

dir myfiles/NE

If you want your NEs in NE-file.txt not to be matched to each other, but to NEs in a different file, e.g. NE-file2.txt, you can use the lemmaFile or lemmaDir option:

file myfiles/NE/NE-file1.txt
lemmaFile myfiles/lemmata/NE-file2.txt

The matcher's output will be the NEs from NE-file1.txt with their possible variants from NE-file2.txt.

The matcher can be told in which column in the input to look for the relevant data:

line type 0 ne 1
lemmaLine type 2 ne 3

The first line indicates that in the general file(s), the type of NE can be found in the first column and the actual NE in the second. The second line indicates that in the lemma file(s), the type is in the third column and the NE in the fourth. The matcher prints all output preceding the first indicated column.

The option ignoreWordsWithTag can be used when you would like the matcher to ignore parts of an NE's string:

ignoreWordsWithTag

For example, in the NE Jean Philippe d'Yvoy Baron van Ittersum tot Leersum, the matcher will ignore the part Baron. It is important that both opening and closing tags are used, otherwise the ignore option will be skipped.

Output options

The Matcher outputs only those files that are listed in the option onlyShowFile, and this can deviate from the actual input. With the settings shown below, the Matcher will look in directory dir A for files, but will only process
27. variants of this word in the target file are rewritten to fro. Similarly, if the gazetteer lists contain the location London while the target file has the location Londen, a rewrite rule Londen > London is created, thus enabling the tool to recognize Londen as a gazetteer entry.

The module works by transforming all words to a phonetic transcription and comparing these versions of the words with each other. Words with the same phonetic transcription are considered variants. This means that the rules for phonetic transcription are crucial for the proper working of this module. The module has a set of default rules, but users can load their own set if needed.

training, in the properties file:

useSpelVar true
svPhonTrans FILE

extraction, on the command line:

sv svphontrans FILE

The arguments useSpelVar true and sv are the ones that initiate the spelling variation reduction module. The rules are read by the tool and used in a simple Java replaceAll function. Thus, regular expressions can be used in them, but this is not necessary.

sz > s
\bsz\b > s
\W >
\bcometh\b > come

Before the module applies the rules, each word is put in lowercase, so only lowercase characters should be used on the left-hand side of the rules. The first example rule transforms all occurrences of sz to s. The second uses \b, which means it will only consider sz at word boundaries. The third example rule replaces all n
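The mechanism described here (lowercase each word, apply the rewrite rules in order, treat words with the same result as variants) can be sketched in Python. This is an illustration only, not NERT's code: it uses Python's `re.sub` in place of Java's replaceAll, the two rules in `RULES` are hypothetical stand-ins rather than the tool's default rule set, and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical rules as (pattern, replacement) pairs; not NERT's defaults.
RULES = [
    (r"sz", "s"),               # rewrite every occurrence of "sz" to "s"
    (r"\bcometh\b", "come"),    # rewrite the whole word "cometh" only
]


def transcribe(word, rules=RULES):
    """Lowercase the word, then apply each rule in order."""
    form = word.lower()
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form


def group_variants(words, rules=RULES):
    """Group words that share a transcription; singletons are dropped."""
    groups = defaultdict(list)
    for word in words:
        groups[transcribe(word, rules)].append(word)
    return {key: members for key, members in groups.items() if len(members) > 1}


print(group_variants(["Cometh", "come", "Hasz", "has", "goes"]))
```

Because lowercasing happens before the rules run, a pattern containing uppercase characters would never match, which is why the left-hand sides should be lowercase. Note that Java and Python differ in replacement syntax for group references (`$1` versus `\1`), so rules using groups would not carry over unchanged.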
28. xts for training. That is, if you have a 1 million words dataset of 19th century newspapers and a 1.5 million words dataset of 18th century books, we recommend keeping them separate and creating two training files.

The size of the training file affects performance as well: the larger the better. Below, the F1 scores for a training file of 100,000 words on different Dutch text types are shown to give an indication (Table 1). The parliamentary proceedings score best, because OCR quality is good but mainly because it is a very homogeneous text type.

Dataset                            Time period  OCR quality  Time period  F1 score
prose, poetry, plays, non-fiction  18th c.      n/a          18th c.      70.80
                                   19th c.      n/a          19th c.      78.68
Parliamentary proceedings          19th c.      okay         19th c.      83.31
                                   20th c.      okay         20th c.      88.50
various Dutch newspapers           18th c.      poor         18th c.      73.49
                                   19th c.      poor         19th c.      83.92

Table 1: F1 scores of various datasets with a training file of 100,000 words, without the use of the spelling variation module.

Another way of giving the training file a better coverage of the target file is to randomly select sentences from the data. We found that this method leads to a better performance than when, for example, the first 100,000 words from the data are used fo
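Random sentence selection of this kind can be sketched in a few lines of Python. The sketch is an assumption about the general idea only and does not reproduce the splitFiles.pl script from the scripts directory: the function name `split_sentences`, the `seed` parameter and the stopping rule (stop once the word budget is reached) are all hypothetical.

```python
import random


def split_sentences(sentences, target_words, seed=None):
    """Randomly pick sentences until roughly target_words words are collected.

    Returns (training, target); both keep the original sentence order.
    """
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)  # visit sentences in random order

    chosen = set()
    word_count = 0
    for index in order:
        if word_count >= target_words:
            break
        chosen.add(index)
        word_count += len(sentences[index].split())

    training = [s for i, s in enumerate(sentences) if i in chosen]
    target = [s for i, s in enumerate(sentences) if i not in chosen]
    return training, target


sentences = ["a b c", "d e", "f g h i", "j"]
training, target = split_sentences(sentences, target_words=4, seed=1)
print(len(training), len(target))
```

Since whole sentences are selected, the training part may slightly exceed the requested number of words; the remaining sentences form the target part.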
