CS-97-02 Information Extraction
Contents
1. CS-97-02: Information Extraction: a User Guide. Hamish Cunningham.

Information Extraction: a User Guide
Hamish Cunningham
January 1997
Research memo CS-97-02
Institute for Language, Speech and Hearing (ILASH), and Department of Computer Science, University of Sheffield, UK
h.cunningham@dcs.shef.ac.uk
http://www.dcs.shef.ac.uk/research/groups/nlp/extraction
http://www.dcs.shef.ac.uk/~hamish

IE: A User Guide

Contents
1 Introduction
2 Types of IE
3 Performance levels
4 Named Entity recognition
5 Coreference resolution
6 Template Element production
7 Scenario Template extraction
8 An Extended Example
9 Multilingual IE

1 Introduction
This note gives a user-oriented view of Information Extraction (IE). No knowledge of language processing is assumed. For a more technical overview see [CL96].
Information Extraction is a process which takes unseen texts as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, or may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in Information Retrieval (IR) applications.
It is instructive to compare IE and IR: whereas IR simply finds texts and presents them to the user, the typical IE application analyses texts and presents only the specific information from them that the user is interested
2. extraction tasks currently available, as defined by the leading forum for this research, the Message Understanding Conferences [GS96]:
Named Entity recognition (NE): finds and classifies names, places, etc.
Coreference Resolution (CO): identifies identity relations between entities in texts.
Template Element construction (TE): adds descriptive information to NE results.
Scenario Template production (ST): fits TE results into specified event scenarios.
From a user point of view, NE, TE and ST are the most relevant IE tasks. CO, as noted below, is necessary as an adjunct to the other tasks, but is of limited direct usefulness to the IE system user. NE, TE and ST provide progressively higher-level information about texts. These are described in more detail below, after a discussion of the current performance levels of IE technology.

3 Performance levels
Each of the four types of IE has been the subject of rigorous performance evaluation in MUC-6 (1995) and other MUCs, so it is possible to say quite precisely how well the current level of technology performs. Below we will quote percentage figures quantifying performance levels; they should be interpreted as a combined measure of precision and recall (see the section on evaluation in [Adv95]).
Several caveats should be noted: most of the evaluation has been on English (with some Japanese, Chinese and Spanish); som
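As an illustration of what a "combined measure of precision and recall" can look like, here is a minimal sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall, with an optional weighting parameter); the function names and the counts in the example are hypothetical and are not taken from the memo or from any MUC scoring software.

```python
def precision_recall(correct: int, returned: int, relevant: int) -> tuple[float, float]:
    """Precision = correct / returned items; recall = correct / relevant items."""
    precision = correct / returned if returned else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (both in [0, 1]).

    beta = 1 weights the two equally; beta > 1 favours recall.
    """
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical counts: the system returns 50 items, 40 of them correct,
# out of 80 items that a human annotator marked as relevant.
p, r = precision_recall(correct=40, returned=50, relevant=80)
score = f_measure(p, r)
print(f"precision={p:.2f} recall={r:.2f} combined={score:.3f}")
```

A system can trade one quantity for the other, which is why a single combined figure is convenient when comparing systems.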
3. 00% level (measured in MUC by inter-annotator comparisons), NE recognition can now be said to function at human performance levels, and applications of the technology are increasing rapidly as a result.

[Figure 2: Named entity recognition. GATE viewer screenshot of the example text with named entities highlighted.]

A recent evaluation of NE for Spanish, Japanese and Chinese [MOC96] produced the following scores:
language    best system
Spanish     93.04%
Japanese    92.12%
Chinese     84.51%
The process is weakly domain dependent, i.e. changing the subject matter of the texts being processed from financial news to other types of news would involve some changes to the system, and changing from news to scientific papers would involve quite large changes.

5 Coreference resolution
Coreference resolution (CO) involves identifying identity relations between entities in texts. These entities are both those identified by NE recognition and anaphoric references to those entities. For example, in "Alas, poor Yorick, I knew him well" coreference resolution
4. only around 80%), but note that this hides the difference between proper-noun coreference identification (same object, different spelling or compounding, e.g. IBM, IBM Europe, International Business Machines Ltd) and anaphora resolution, the former being a significantly easier problem. CO systems are domain dependent.

[Figure 3: Coreference resolution. GATE viewer screenshot of the example text with co-referred items and a selected reference chain highlighted.]

6 Template Element production
The TE task builds on NE recognition and coreference resolution. In addition to locating and typing (i.e. classifying, or assigning to a type such as personal name, date, etc.) entities in docum
5. T tasks separately. TE scores should improve in future as developers gain more experience with the task.
As in NE recognition, the production of TEs is weakly domain dependent, i.e. changing the subject matter of the texts being processed from financial news to other types of news would involve some changes to the system, and changing from news to scientific papers would involve quite large changes.

7 Scenario Template extraction
Scenario templates (STs) are the prototypical outputs of IE systems. They tie together TE entities into event and relation descriptions. For example, TE may have identified Isabelle, Dominique and Francoise as people entities present in the Robert edition of Napoleon's love letters. ST might then identify facts such as that Isabelle moved to Paris in August 1802 from Lyon to be nearer to the little chap, that Dominique then burnt down Isabelle's apartment block, and that Francoise ran off with one of Gerard Depardieu's ancestors. A slightly more pertinent example is given in figure 5. The same comments regarding format apply as for the TE task.
ST is a difficult IE task. The current Sheffield system scores 49% for ST production; the best MUC-6 system scored 56%. The human score was 81%, which illustrates the complexity involved. These figures should be taken into account when considering appropriate applications of ST technology. Note, however, that it is possible to increase prec
6. current technology has difficulty attaining scores much above 60% accuracy for this task, however.

[Figure 5: Scenario template. GATE viewer screenshot of the scenario template for the example text: succession events linking an organisation, a post, and the persons moving in and out of that post.]

8 An Extended Example
So far we have discussed IE from a general perspective. In this section we look at the capabilities that might be delivered as part of an application designed to support a
7. e applications of the technology may be either easier or more difficult in other languages.
The performance of each IE task, and the ease with which it may be developed, is to varying degrees dependent on:
Text type: the kinds of texts we are working with, for example Wall Street Journal articles, or email messages, or HTML documents from the World Wide Web.
Domain: the broad subject matter of those texts, e.g. financial news, or requests for technical support, or tourist information.
Scenario: the particular event types that the IE user is interested in, for example mergers between companies, or problems experienced with a particular software package, or descriptions of how to locate parts of a city.
For example, a particular IE application might be configured to process financial news articles from a particular news provider and find information about mergers between companies and various other scenarios. The performance of the application would be predictable for only this conjunction of factors. If it was later required to extract facts from the love letters of Napoleon Bonaparte as published on wall posters in the 1871 Paris Commune, performance levels would no longer be predictable. Tailoring an IE system to new requirements is a task that varies in scale dependent on the degree of variation in the three factors listed above.

4 Named Entity recognition
The simplest and most reliable IE tec
8. ents, TE associates descriptive information with the entities. For example, from the figure 1 text the system finds out that Burns Fry Ltd is located in Toronto, and it adds the information that this is in Canada.
Template elements for the figure 1 text are given in figure 4. The format is a somewhat arbitrary one, developed at the behest of the American intelligence community, the original target user group of the MUC competitions. It is difficult to read; the main point to note is that it is essentially a database record, and could just as well be formatted for SQL store operations, or reading into a spreadsheet, or (with some extra processing) for multilingual presentation. Section 8 gives a simplified example.

<ORGANIZATION-9404130062-1>
    ORG_NAME     BURNS FRY Ltd
    ORG_ALIAS    Burns Fry Ltd
    ORG_TYPE     COMPANY
    ORG_LOCALE   Toronto CITY
    ORG_COUNTRY  Canada
<ORGANIZATION-9404130062-2>
    ORG_NAME     Merrill Lynch Canada Inc
    ORG_ALIAS    Merrill Lynch & Co
    ORG_TYPE     COMPANY
<PERSON-9404130062-1>
    PER_NAME     Mark Kassirer
<PERSON-9404130062-2>
    PER_NAME     Donald Wright
    PER_ALIAS    Wright
    PER_TITLE    Mr
Figure 4: Template elements

The current Sheffield system scores 71% for TE production; the best MUC-6 system scored 80%. Humans achieved 93%. MUC-6 was the first MUC to evaluate TE and S
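The remark that a template element "is essentially a database record" can be made concrete with a short sketch. The example below stores the first organisation from figure 4 in an in-memory SQLite database using Python's standard sqlite3 module. The table name, column names and the hyphenated id spelling are assumptions made for illustration; they are not part of the MUC format or of the Sheffield system.

```python
import sqlite3

# One template element from figure 4, held as a plain mapping.
# The key names are hypothetical column names chosen for this sketch.
organisation = {
    "id": "ORGANIZATION-9404130062-1",
    "org_name": "BURNS FRY Ltd",
    "org_alias": "Burns Fry Ltd",
    "org_type": "COMPANY",
    "org_locale": "Toronto CITY",
    "org_country": "Canada",
}


def store_organisation(conn: sqlite3.Connection, record: dict) -> None:
    """Create the (assumed) table if needed and store one record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS organisation ("
        "id TEXT PRIMARY KEY, org_name TEXT, org_alias TEXT, "
        "org_type TEXT, org_locale TEXT, org_country TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO organisation VALUES "
        "(:id, :org_name, :org_alias, :org_type, :org_locale, :org_country)",
        record,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_organisation(conn, organisation)

# Once stored, the extracted facts can be queried like any other data.
row = conn.execute(
    "SELECT org_name, org_country FROM organisation "
    "WHERE org_locale LIKE 'Toronto%'"
).fetchone()
print(row)
```

The same mapping could equally be written out as a spreadsheet row; the point is only that the slots of a template element map directly onto columns.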
9. etrators   ENTITY-5, ENTITY-6
    status       on trial

joint venture
    id           EVENT-2
    type         transport
    companies    ENTITY-6, ENTITY-11
    status       past

These results correspond to the ST task.

9 Multilingual IE
The results described above may then be translated for presentation to the user or for storage in existing databases. In general this task is much easier than translation of ordinary text, and is close to software localisation, the process of making a program's messages and labels on menus and buttons multilingual. Localisation involves storing lists of direct translations for known items. In our case these lists would store translations for words such as entity, location, date, heroin. We also need ways to display dates and numbers in local formats, but code libraries are available for this type of problem.
Problems can arise where arbitrary pieces of text are used in the entity description structures, for example the descriptor slot in MUC-6 TE objects. Here a noun phrase from the text is extracted, with whatever qualifiers, relative clauses, etc. happen to be there, so the language is completely unrestricted and would need a full translation mechanism.

References
[Adv95] Advanced Research Projects Agency. Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann, 1995.
[CL96] J. Cowie and W. Lehnert. Information extraction. Communications of t
10. he ACM, 39(1):80-91, 1996.
[CWG96] H. Cunningham, Y. Wilks, and R.J. Gaizauskas. New Methods, Current Trends and Software Infrastructure for NLP. In Proceedings of the Conference on New Methods in Natural Language Processing (NeMLaP-2), Bilkent University, Turkey, September 1996. Also available as http://xxx.lanl.gov/ps/cmp-lg/9607025.
[GS96] R. Grishman and B. Sundheim. Message understanding conference - 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, June 1996.
[GWH+95] R. Gaizauskas, T. Wakao, K. Humphreys, H. Cunningham, and Y. Wilks. Description of the LaSIE system as used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann, 1995.
[MOC96] R. Merchant, M.E. Okurowski, and N. Chinchor. The Multi-Lingual Entity Task (MET) Overview. In Advances in Text Processing: TIPSTER Programme Phase II. DARPA, Morgan Kaufmann, 1996.
11. hnology is Named Entity recognition (NE). NE systems identify all the names of people, places, organisations, dates, and amounts of money. So, for example, if we run the Wall Street Journal text in figure 1 through an NE recogniser, the result is as in figure 2 (this looks better in colour). The viewers shown here and below are part of the GATE language engineering architecture and development environment (see [CWG96]).

Who's News: Burns Fry Ltd (WALL STREET JOURNAL (J), 04/13/94, page B10)
BURNS FRY Ltd (Toronto): Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr Wright resigned as president of Merrill Lynch Canada Inc, a unit of Merrill Lynch & Co, to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it hasn't named a successor to Mr Wright, who is expected to begin his new position by the end of the month.
Figure 1: An example text

NE recognition can be performed at 96% accuracy; the current Sheffield system [GWH+95] performs at 92% accuracy. Given that human annotators do not perform to the 1
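To give a feel for the kind of output an NE recogniser produces, here is a deliberately simple sketch that tags names by looking them up in a fixed list (a gazetteer). This is an assumption-laden toy, not the method used by the Sheffield system or by any MUC entrant; the names are taken from the figure 1 text, and the function and variable names are hypothetical.

```python
import re

# Hypothetical gazetteer built from names that appear in the figure 1 text.
GAZETTEER = {
    "organisation": ["Burns Fry Ltd", "Merrill Lynch Canada Inc", "Merrill Lynch & Co"],
    "location": ["Toronto"],
    "person": ["Donald Wright", "Mark Kassirer"],
}


def tag_entities(text: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, type, matched text) for each gazetteer hit."""
    candidates = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
                candidates.append((match.start(), match.end(), entity_type, match.group(0)))
    # Sort by position, longest first, then drop any hit that overlaps an
    # earlier accepted one.
    candidates.sort(key=lambda span: (span[0], -(span[1] - span[0])))
    accepted, last_end = [], 0
    for span in candidates:
        if span[0] >= last_end:
            accepted.append(span)
            last_end = span[1]
    return accepted


sample = (
    "Donald Wright, 46 years old, was named executive vice president at "
    "Burns Fry Ltd (Toronto). Mr Wright resigned as president of "
    "Merrill Lynch Canada Inc, a unit of Merrill Lynch & Co."
)
for start, end, entity_type, surface in tag_entities(sample):
    print(f"{start:3d}-{end:3d} {entity_type:12s} {surface}")
```

Note that the bare "Mr Wright" is not tagged by this lookup; linking it to "Donald Wright" is the kind of problem that coreference resolution addresses.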
12. in.
For example, a user of an IR system wanting information on the share price movements of companies with holdings in Bolivian raw materials would typically type in a list of relevant words and receive in return a set of documents (e.g. newspaper articles) which contain likely matches. The user would then read the documents and extract the requisite information themselves. They might then enter the information in a spreadsheet and produce a chart for a report or presentation. In contrast, an IE system user could, with a properly configured application, automatically populate their spreadsheet directly with the names of companies and the price movements.
There are advantages and disadvantages to IE with respect to IR. IE systems are more difficult and knowledge-intensive to build, and are to varying degrees tied to particular domains and scenarios (see next section). They are also, for most tasks, less accurate than human readers. IE is more computationally intensive than IR. However, in applications where there are large text volumes IE is potentially much more efficient than IR because of the possibility of reducing the amount of time analysts spend reading texts. Also, where results need to be presented in several languages, the fixed-format, unambiguous nature of IE results makes this straightforward in comparison with providing full translation facilities.

2 Types of IE
There are four types of information extraction, or information
13. ision at the expense of recall: we can develop ST systems that don't make many mistakes, but that miss quite a lot of occurrences of relevant scenarios. Alternatively we can push up recall and miss less, but at the expense of making more mistakes.
The ST task is both domain dependent and, by definition, tied to the scenarios of interest to the users. Note however that the results of NE and TE feed into ST. Note also that in MUC-6 the developers were given the specifications for the ST task only 1 month before the systems were scored. This was because it was noted that an IE system that required very lengthy revision to cope with new scenarios was of less worth than one that could meet new specifications relatively rapidly. As a result of this, the scores for ST in MUC-6 were probably slightly lower than they might have been with a longer development period. Experience from previous MUCs suggests that
14. nalysts tracking international drug dealing.
When the system is specified, our imaginary analyst states that the operational domains that user interests are centred around are: drug enforcement, money laundering, organised crime, terrorism, legislation. The entities of interest within these domains are cited as: person, company, bank, financial entity, transportation means, locality, place, organisation, time, telephone, narcotics, legislation, activity. A number of relations (or links) are also specified, for example between people, between people and companies, etc. These relations are not typed, i.e. the kind of relation involved is not specified. Some relations take the form of properties of entities, e.g. the location of a company, whilst others denote events, e.g. a person visiting a ship.
Working from this starting point an IE system is designed that:
1. is tailored to texts dealing with drug enforcement, money laundering, organised crime, terrorism, and legislation;
2. recognises entities in those texts and assigns them to one of a number of categories drawn from the set of entities of interest (person, company, ...);
3. associates certain types of descriptive information with these entities, e.g. the location of companies;
4. identifies a set (relatively small to begin with) of events of interest by tying entities together into event relations.
For example, conside
15. ous type-specific information is available, e.g. for dates a normalisation giving the date in standard format.

Reuter
    id             ENTITY-1
    type           company
    business       news
New York
    id             ENTITY-2
    type           location
    subtype        city
    is_in          US
Wednesday 12 July 1996
    id             ENTITY-3
    type           date
    normalisation  12/07/1996
New York police
    id             ENTITY-4
    type           organisation
    location       ENTITY-2
Frederick J Thompson
    id             ENTITY-5
    type           person
    aliases        Thompson, Fred
    domicile       ENTITY-7
    profession     managing director
    employer       ENTITY-6
Jay Street Imports Inc
    id             ENTITY-6
    type           organisation
    aliases        Jay Street
    business       import export
Manhattan
    id             ENTITY-7
    type           location
    subtype        city
    is_in          ENTITY-2
Robert Guliani
    id             ENTITY-8
    type           person
    aliases        Guliani
1989
    id             ENTITY-9
    type           date
    normalisation  2 1989
Latin America
    id             ENTITY-10
    type           location
    subtype        country
Downing Jones
    id             ENTITY-11
    type           organisation
    business       transportation
heroin
    id             ENTITY-12
    type           drug
    class          A
United States
    id             ENTITY-13
    type           location
    subtype        country

These results correspond to the combination of NE and TE tasks; if we removed all but the type slots we would be left with the NE data.
Second, relations of event type, or scenarios:

narcotics smuggling
    id             EVENT-1
    destination    ENTITY-13
    source         unknown
    perp
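The entity records of this example lend themselves to a simple keyed data structure. The sketch below holds a few of them as Python dictionaries, follows an id cross-reference, and strips the records down to their type slots to recover the NE-level view. The hyphenated id spelling, the "text" key and the helper names are assumptions made for illustration only.

```python
# A few entity records from the extended example, keyed by id.
# The "text" key (the name as it appears in the article) is an added
# convenience for this sketch.
entities = {
    "ENTITY-2": {"text": "New York", "type": "location",
                 "subtype": "city", "is_in": "US"},
    "ENTITY-5": {"text": "Frederick J Thompson", "type": "person",
                 "aliases": ["Thompson", "Fred"], "domicile": "ENTITY-7",
                 "profession": "managing director", "employer": "ENTITY-6"},
    "ENTITY-6": {"text": "Jay Street Imports Inc", "type": "organisation",
                 "aliases": ["Jay Street"], "business": "import export"},
    "ENTITY-7": {"text": "Manhattan", "type": "location",
                 "subtype": "city", "is_in": "ENTITY-2"},
}


def resolve(records: dict, entity_id: str, slot: str):
    """Return a slot value, following it if it is the id of another entity."""
    value = records[entity_id].get(slot)
    target = records.get(value) if isinstance(value, str) else None
    return target["text"] if target else value


def ne_view(records: dict) -> dict:
    """Keep only the name and type of each entity: the NE-level data."""
    return {eid: {"text": rec["text"], "type": rec["type"]}
            for eid, rec in records.items()}


print(resolve(entities, "ENTITY-5", "employer"))
print(resolve(entities, "ENTITY-7", "is_in"))
print(ne_view(entities)["ENTITY-6"])
```

Keeping relations as id references rather than copied names means an event record can point at the same entities without duplicating their descriptions.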
16. r the following text:

Reuter, New York, Wednesday 12 July 1996. New York police announced today the arrest of Frederick J Thompson, head of Jay Street Imports Inc, on charges of drug smuggling. Thompson was taken from his Manhattan apartment in the early hours yesterday. His attorney, Robert Giuliani, issued a statement denying any involvement with narcotics on the part of his client. "No way did Fred ever have dealings with dope", Guliani said.
A Jay Street spokesperson said the company had ceased trading as of today. The company, a medium-sized import-export concern established in 1989, had been the main contractor in several collaborative transport ventures involving Latin American produce. Several associates of the firm moved yesterday to distance themselves from the scandal, including the mid-western transportation company Downing Jones.
Thompson is understood to be accused of importing heroin into the United States.

From this IE might produce information such as the following (in some format to be determined according to user requirements, e.g. SQL statements addressing some database schema).
First, a list of entities and associated descriptive information. Relations of property type are made explicit. Each entity has an id, e.g. ENTITY-2, which can be used for cross-referencing between entities and for describing events involving entities. Each also has a type or category, e.g. company, person. Additionally, vari
17. would tie Yorick with him (and I with Hamlet, if that information was present in the surrounding text).
This process is less relevant to users than other IE tasks (i.e. whereas the other tasks produce output that is of obvious utility for the application user, this task is more relevant to the needs of the application developer). For text browsing purposes we might use CO to highlight all occurrences of the same object or provide hypertext links between them. CO technology might also be used to make links between documents, though this is not currently part of the MUC programme. The main significance of this task, however, is as a building block for TE and ST (see below). CO enables the association of descriptive information scattered across texts with the entities to which it refers. To continue the hackneyed Shakespeare example, coreference resolution might allow us to situate Yorick in Denmark. Figure 3 shows results for our example text.
CO resolution is an imprecise process when applied to the solution of anaphoric reference. The Sheffield system scored 51% recall and 71% precision at MUC-6; other systems scored e.g. 59% recall and 72% precision, or 63% recall and 63% precision. (Footnote 1: For statistical reasons the combined precision and recall measure we use elsewhere is inappropriate here.) These scores are low (although problems with completing the task definition on schedule complicated matters, and led to human scores of