Home

Acquiring Hyponymy Relations from Web Documents

1. Nissan Basically our method extracts the set of expressions associated with the same path as an HCS In the above example we can obtain the HCS Toyota Honda We extract an itemization only when its size is n and 3 lt n lt 20 This is because the processing of large itemizations particularly the downloading of the related documents is time consuming and small itemizations are often used to obtain a proper layout in HTML doc uments We actually need to distinguish different occurrences of the tags in some cases to prevent distinct itemizations from being recognized as a single itemization We found some words that are often inserted into an item ization but do not have common semantic properties with other items in the same itemization during the experiments using a development set Y 7 links and Jl F help are ex amples of such words We prepared a list of such words con sisting of 70 items and removed them from the HCSs obtained in Step 1 2 2 Step 2 Selection of a hypernym candidate by df and idf In Step 1 we can obtain a set of hyponym candidates an HCS that may have a common hypernym In Step 2 we select a common hypernym candidate for an HCS First we prepare two sets of documents We randomly select a large number of HTML documents and download them We call this set of documents a global document set We assume this document set indicates the general tendencies of word frequencies
2. Re ranking of top 4 hypernym candidates Re ranking of top 5 hypernym candidates L 0 500 1000 1500 2000 2500 of hypernym hyponym pairs 20 Figure 5 Contribution of re ranking each step produced only the top 200 pairs in the sorted pairs Since the output of Step 4 is the final output this means that we also assumed that only the top 200 pairs of a hypernym and an HCS would be produced as final output with our procedure In other words the remaining 1 800 2 000 200 pairs were discarded The resulting hypernyms were checked by the authors according to the definition of the hypernym given in Miller et al 1990 i e we checked if the expression a hyponym candidate is a kind of a hypernym candidate is acceptable Then we computed the precision which is the ratio of the correct hypernym hyponym pairs against all the pairs obtained from the top n pairs of an HCS and its hypernym candidate The x axis of the graph indicates the number of hypernym hyponym pairs obtained from the top n pairs of an HCS and its hypernym candidate while the y axis indicates the precision More precisely the curve for Step 2 plots the following points where 1 lt 7 lt 200 J J DD Cel es eres h Ce k 1 k 1 Cr correct C h C indicates the number of hyponym candidates in C that are true hyponyms of h C Note that after Step 4 the precision reached about 75 for 701 hyponym candidates which w
3. Thx X Y2N member Es FY Far Ss Andromeda Galaxy SRA The Galaxy aH 10 Jay ebERYAL F local group of galaxies galaxy TF YW Brazil 7 J EY Philippine E Korea 4 X F India F 2 Y US A 4 Thailand HA 7 rhe China 2 Peru 7 3 F FY F Australia Japan 80 aalt Tver F vy Argentina ZN Y Spain indicates a hyponym candidate that is a true hyponym of the provided hypernym candidate in the Fired Rules column indicates a firing rule while specifies the rule that doesn t fire
4. We simply judged if the produced hypernyms are acceptable or not But we used different evaluation method for the other alternatives We checked if the correct hypernyms pro duced by our method can be found by these alternatives This is simply for the sake of easiness of the evaluation Note that we evaluated Alternative 1 and Alternative 2 in the second evaluation scheme when they are combined and are used as a part of Alternative 4 More detailed explanations on the alternative methods are given below Alternative 1 Recall that Japanese is a head final lan guage and we have explained that common suffixes of hyponym candidates are good candidates to be common hyponyms Alternative 1 computes a hypernym candidate according to this principle Alternative 2 This method uses the captions of the itemizations which are likely to contain a hypernym of the items in the itemization We manually found cap tions or titles that are in the position such that they can explain the content of the itemization and picked up the caption closest to the itemization and the second closest to it Then we checked if the picked up captions included the proper hypernyms Note that the precision obtained by this method is just an upper bound of real performance because we do not have a method to extract hypernyms from captions at least at the current stage of our research Alternative 3 We prepared the lexicosyntactic patterns in Fig 6 which are similar
5. existing search engine as the number of documents including an expres sion As for Rule 2 note that Japanese is a head final language and a semantic head of a com plex noun phrase is the last noun Consider the following two Japanese complex nouns amerika eiga nihon eiga American movie Japanese movie Apparently an American movie is a kind of movie as is a Japanese movie There are many multi word expres sions whose hypernyms are their suffixes and if some expressions share a common suffix it is likely to be their hypernym However if a hypernym candidate appears in a position other than as a suffix of a hyponym candidate the hypernym candidate is likely to be an erroneous one In addition if a hypernym candidate is a common suffix of only a small portion of an HCS then the HCS tends not to have semantic uniformity and such a hypernym candidate should be eliminated from the output We em pirically determined one half as a threshold in our ex periments on the development set As for Rule 3 in our experiments on a development set we found that our procedure could not provide precise hy pernyms for place names such as Kyoto and Tokyo In the case of Kyoto and Tokyo our procedure produced Japan as a hypernym candidate Although Japan is consistent with most of our assumptions regarding hy pernyms it is a holonym of Kyoto and Tokyo but their hypernym In general when a set of place
6. s x Casia 4 E g BRS umag 3 407 X CEs ee k a KHOR L i og x ee S KKK eee Eee ES 20 0 500 1000 1500 2000 of hypernym hyponym pairs Figure 7 Comparison with alternative methods Alternative 4 We also compared our procedure with the combination of all the above methods Alternative 4 Again we checked whether the combination could find the correct hypernym hyponym pairs provided by our method The difference between the precision of our method and that of Alternative 4 reflects the number of hypernym hyponym pairs that our method could acquire and that Alternative 4 could not We assumed that for a given HCS a hypernym was successfully acquired if one of the above methods could find the correct hypernym In other words the performance of Alternative 4 would be achieved only when there were a technique to combine the output of the above methods in an optimal way Figure 7 shows the comparison between our procedure and the alternative methods We plotted the graph as suming the pairs of hypernym candidates and hyponym candidates were sorted in the same order as the order ob tained by our procedure The results suggest that our method can acquire a significant number of hypernyms that the alternative methods cannot obtain when we gave rather small amount of texts a maximum of 100 docu ments per hyponym candidate as in our current experi mental settings There is possibility that the difference particularly
7. the difference from the peformance of Alter native 3 becomes smaller when we give more texts to the alternative methods But the comparison in such settings is actually a difficult task because of the time required for downloading It is our possible future work 5 Concluding Remarks and Future Work We have proposed a method for acquiring hyponymy re lations from Web documents and have shown its effec tiveness through experimental results We also showed More precisely we sorted only the hyponym candidates in the order used by our procedure for sorting and attached the hypernym candidates produced by each alternative to the hy ponym candidates that our method could find a significant number of hy ponymy relations that alternative methods could not at least when the amount of documents used was rather small The first goal of our future work is to further improve the precision of our method One possible approach will be to combine our methods with alternative techniques which were actually examined in our experiments Our second goal is to extend our method so that it can han dle multi word hypernyms Currently our method pro duces just company as a hypernym of Toyota If we can obtain a multi word hypernym such as automobile manufacturer it can provide more useful information to various types of natural language processing systems References Maya Ando Satoshi Sekine and Shun Ishizaki 2003 Aut
8. to the ones used in the pre vious studies of hypernym acquisition in Japanese Ima sumi 2001 Ando et al 2003 One difference from the previous studies was that we used a regular expression instead of a parser This may have caused some errors but our patterns were more generous than those used in the previous studies and did not miss the expressions matched to the patterns from the previous studies In other words the accuracy obtained with our patterns was an upper bound on the performance obtained by the previ ous proposal Another difference was that the procedure was given correct pairs of a hypernym and a hyponym computed beforehand using our proposed method and it only checked whether given pairs could be found by us ing the lexicosyntactic patterns from given texts In other words this alternative method checked if the lexicosyn tactic patterns could find the hypernym hyponym pairs successfully obtained by our procedure The texts used were local document sets from which our procedure com puted a hypernym candidate If our procedure has better figures than this method this means that our procedure can produce hypernyms that cannot be acquired by pat terns at least from a rather small number of texts 1 e a maximum of 100 documents per hyponym candidate 100 Proposed Method Alternative 1 x a Alternative 2 3 80 i Alternative 3 8 a TS Alternative 4 m Hy aos ae X60 pg es
9. 0 F 20 0 500 1000 1500 2000 2500 of hypernym hyponym pairs Figure 4 Contribution of each step and rule randomly picked 2 000 HCSs from among the extracted HCS as our test set The test set contained 13 790 hy ponym candidates Besides these HCSs we used a de velopment set consisting of about 4 000 HCSs to develop our algorithm For each single hyponym candidate we downloaded the top 100 documents in the ranking pro duced by a search engine as a local document set if the engine found more than 100 documents Otherwise all the documents were downloaded Note that a local doc ument set for an HCS may contain more than 100 doc uments As a global document set we used the down loaded 1 00 x 106 HTML documents 1 26 GB without HTML tags Fig 3 shows the accuracy of hypernyms obtained after Steps 2 3 and 4 We as sumed each step produced the sorted pairs of an HCS and a hypernym which are denoted by h C1 C1 h C2 C2 h Cm Cm The sorting was done by the score sim h Ci Ci df h C LD C idf h C G after Steps 3 and 4 as described before while the output of Step 2 was sorted by the df idf score In addition we assumed gt The search engine goo http www goo ne jp 100 Paar 80 Pee AS SS i 60 F S W accuracy 40 Proposed Method Re ranking of top 2 hypernym candidates x 7 Re ranking of top 3 hypernym candidates
10. Acquiring Hyponymy Relations from Web Documents Keiji Shinzato Kentaro Torisawa School of Information Science Japan Advanced Institute of Science and Technology JAIST 1 1 Asahidai Tatsunokuchi Nomi gun Ishikawa 923 1292 JAPAN skeiji torisawa jaist ac jp Abstract This paper describes an automatic method for acquiring hyponymy relations from HTML documents on the WWW Hyponymy relations can play a crucial role in various natural lan guage processing systems Most existing ac quisition methods for hyponymy relations rely on particular linguistic patterns such as NP such as NP Our method however does not use such linguistic patterns and we expect that our procedure can be applied to a wide range of expressions for which existing meth ods cannot be used Our acquisition algo rithm uses clues such as itemization or listing in HTML documents and statistical measures such as document frequencies and verb noun co occurrences 1 Introduction The goal of this work is to become able to automatically acquire hyponymy relations for a wide range of words or phrases from HTML documents on the WWW We do not use particular lexicosyntactic patterns as previous at tempts have Hearst 1992 Caraballo 1999 Imasumi 2001 Fleischman et al 2003 Morin and Jacquemin 2003 Ando et al 2003 The frequencies of use for such lexicosyntactic patterns are relatively low and there can be many words or phrases that do not appe
11. CSs based on semantic similarities between hypernym and hyponym candidates lt UL gt lt LI gt Car Specification lt LI gt lt UL gt lt LI gt Toyota lt LI gt lt LI gt Honda lt LI gt lt LI gt Nissan lt LI gt lt UL gt lt UL gt Figure 2 An example of HTML documents Step 4 Application of a few additional heuristics to elab orate computed hypernym candidates and hyponym candidates 2 1 Step 1 Extraction of hyponym candidates The objective of Step is to extract an HCS which is a set of hyponym candidates that may have a common hyper nym from the itemizations or lists in HTML documents Many methods can be used to do this Our approach is a simple one Each expression in an HTML document can be associated with a path which specifies both the HTML tags that enclose the expression and the order of the tags Consider the HTML document in Figure 2 The expression Car Specification is enclosed by the tags lt LI gt lt LI gt and lt UL gt lt UL gt If we sort these tags according to their nesting order we obtain a path UL LI and this path specifies the information regarding the place of the expression We write UL LI Car Specification if UL LI is a path for the expression Car Specifica tion We can then obtain the following paths for the ex pressions from the document UL LI Car Specification UL UL LI Toyota UL UL LI Honda UL UL LI
12. Diapensia lapponica F 7 YY Sasa kurilensis 116 NFL Y 7 FF Rhododendron aureum t 280 t SPY HILY Polygonatum lasianthum flower flower VA YF shiitake mushroom FY 27 7 Hericium ramosum ty 77 Pseudocolus schellenbergiae AA AJA 127 Sn hAYY Rhodophyllus murraii mash 306 mash vuy Amanita virgineoides Bas room room tv r Pseudocolus schellenbergiae Tae music WEE movie Y A cartoon AE As 139 EAV encounter AE A artiste web site 324 web site Fr ta Z7 Ryunosuke Akutagawa IY Tsugi Takano IPK Bokusui Wakayama FF2E2 E8 Motojiro Kajii tAE Roka Tokutomi S RAAT Yuriko Miyamoto HAWA Soseki Natume HH EKE Kantaro Tanaka fet YEr 150 ERH k Doppo Kunikida 4487 AE Kyusaku Yumeno work 343 work TVA 7974 VF A William Blake ibti Kan Kikuchi ZAY AEE HEF parse error These are novelists x F May Day 7 Y AY A Christmas Day 4 A Y Easter P the New Year 77 ffi All Saints Day EWfifi Epifania Hg 172 flikita H Emancipation Day WAREZ HA Immacolata concezione H 391 place WAF 7 7 7 OH Stefano s Day HAEF KAE Ferragosto Japan name These are national holidays in Italy BDH SA mother Bat warm current 222 cloud drift 184 FEZI blue sky girl 39LC IE V beauty has guilt RE fais Wet These are Japanese movies movie movie SEV HE group of galaxies
13. Then we download the documents including each hyponym candidate in a given HCS This document set is called a local document set and we use it to know the strength of the association of nouns with the hyponym candidates Let us denote a given HCS as C a local document set obtained from all the items in C as LD C and a global document set as G We also assume that N is a set of words which can be candidates of hypernym A hypernym candidate denoted as h C for C is ob tained through the following formula where df n D is the number of documents that include a noun n in a doc ument set D h C argmaxnen df n LD C idf n G IG idf n G log df n The score has a large value for a noun that appears in a large number of documents in the local document set and is found in a relatively small number of documents in the global document set In general nouns strongly associated with many items in a given HCS tend to be selected through the above for mula Since hyponym candidates tend to share a common semantic property and their hypernym is one of the words strongly associated with the common property the hyper nym is likely to be picked up through the above formula Note that a process of generalization is performed auto matically by treating all the hyponym candidates in an HCS simultaneously That is words strongly connected with only one hyponym candidate for instance Lexus for Toyota have relatively l
14. ar in such pat terns even if we look at a large number of texts The effort of searching for other clues indicating hyponymy rela tions is thus significant We try to acquire hyponymy re lations by combining three different types of clue obtain able from a wide range of words or phrases The first type of clue is inclusion in itemizations or lists found in typi cal HTML documents on the WWW The second consists of statistical measures such as the document frequency df and the inverse document frequency idf which are Car Specification Toyota Honda Nissan Figure 1 An example of itemization popular in the IR literature The third is verb noun co occurrence in normal corpora In our acquisition we made the following assumptions Assumption A Expressions included in the same item ization or listing in an HTML document are likely to have a common hypernym Assumption B Given a set of hyponyms that have a common hypernym the hypernym appears in many documents that include the hyponyms Assumption C Hyponyms and their hypernyms are se mantically similar Our acquisition process computes a common hyper nym for expressions in the same itemizations It pro ceeds as follows First we download a large number of HTML documents from the WWW and extract a set of natural language expressions that are listed in the same itemized region of documents Consider the item ization in Fig 1 We extract the set of expressio
15. as slightly more than 5 of all the given hyponym candidates For 1398 hyponym candidates about 10 of all the candidates the preci sion was about 61 Another important point is that Step 2 tf in the graph refers to an alternative to our Step 2 procedure i e the Step 2 procedure in which df h C LD C was re placed by tf h C LD C One can see the Step 2 pro cedure with df works better than that with tf Table 1 shows some examples of the acquired HCSs and their common hypernyms Recall that a common suf fix of an HCS is a good candidate to be a hypernym The examples were taken from cases where a common suffix hypernym hyponym hyponym LAY hypernym hyponym ZR E D hypernym hyponym OK 5 hypernym hyponym lt M7 hypernym hyponym amp v2 9 hypernym hyponym amp IFIEN Z hypernym hyponym 5 6 hypernym The hypernym and hyponym may be bracketed by or 669 Figure 6 lexicosyntactic patterns of an HCS was not produced as a hypernym This list is actually the output of Step 3 and shows which HCSs and their hypernym candidates were eliminated modified from the output in Step 4 and which rule was fired to eliminate modify them Next we eliminated some steps from the whole pro cedure Figure 4 shows the accuracy when one of the steps was eliminated from the procedure Step X or Rule X refers to the accuracies obtained through the proc
16. e However we found no significant improvement when this alternative was used in our experiments as we later explain 2 4 Step 4 Application of other heuristic rules The procedure described up to now can produce a hyper nym for hyponym candidates with a certain precision We found though that we can improve accuracy by using a few more heuristic rules which are listed below Rule 1 If the number of documents that include a hyper nym candidate is less than the sum of the numbers of the documents that include an item in the HCS then discard both the hypernym candidate and the HCS from the output Rule 2 If a hypernym candidate appears as substrings of an item in its HCS and it is not a suffix of the item then discard both the hypernym candidate and the HCS from the output If a hypernym candidate is a suffix of its hyponym candidate then half of the members of an HCS must have the hypernym can didate as their suffixes Otherwise discard both the hypernym candidate and its HCS from the output Rule 3 If a hypernym candidate is an expression belong ing to the category of place names then replace it by place name In general we can expect that a hypernym is used in a wider range of contexts than those of its hyponyms and that the number of documents including the hyper nym candidate should be larger than the number of web documents including hyponym candidates This justifies Rule 1 We use the hit counts given by an
17. e hyponym candidates in an HCS C occupying an argument position p of a verb v as fhypo C p v As sume that all possible argument positions are denoted as pi p and all the verbs as v1 Um We then define the co occurrence vector of hyponym candidates as follows hypov C _ frypo C p1 v1 Faypo C P2 v1 ae Frypo C pi 1 Um fhypo C Pl Um In the same way we can define the co occurrence vec tor of a hypernym candidate n hyperv n a f n p1 v1 ee f n pi Um Here f n p v is the frequency of a noun n occupying an argument position p of a verb v obtained from the pars ing results of a large number of documents 33 years of Japanese newspaper articles Yomiuri newspaper 1987 2001 Mainichi newspaper 1991 1999 and Nikkei news paper 1990 1998 3 01 GB in total in our experimental setting The semantic similarities between hyponym candi dates in C and a hypernym candidate n are then computed by a cosine measure between the vectors hypov C hyperv n hypou C hypero n sim n C Our procedure sorts the hypernym HCS pairs h C Ci using the value sim h Ci Ci df h Ci LD Ci idf h Ci G Note that we consider not only the similarity but also the df idf score used in Step 2 in the sorting An evident alternative to the above method is the al gorithm that re ranks the top 7 hypernym candidates ob tained by df idf for a given HCS by using the same scor
18. edure from which step X or rule X were eliminated Note that both graphs indicate that every step and rule contributed to the improvement of the precision Figure 5 compares our method and an alternative method which was the algorithm that re ranks the top 7 hypernym candidates for a given HCS by using the score sim h C df h LD C idf h G where h is a hy pernym candidate in Step 3 Recall that our algorithm uses the score only for sorting pairs of HCSs and their hy pernym In other words we do not re rank the hypernym candidates for a single HCS We found no significant im provement when the alternative was used 4 Comparison with alternative methods We have shown that our assumptions are effective for ac quiring hypernyms However there are other alternative methods applicable under our settings We evaluated the followings methods and compared the results with those of our procedure Alternative 1 Compute the non null suffixes that are shared by the maximum number of hyponym can didates and regard the longest as a hypernym can didate Alternative 2 Extract hypernyms for hyponym candi dates by looking at the captions or titles of the item izations from which hyponym candidates are ex tracted Alternative 3 Extract hypernyms by using lexicosyntac tic patterns Alternative 4 Combinations of Alternative 1 3 The evaluation method for Alternative 1 and Alterna tive 2 is the same as the one for our method
19. names is given as an HCS the procedure tends to produce the name of the region or area that includes all the places designated by the hyponym candidates We then added the rule to re place such place names by the expression place name which is a true hypernym in many of such cases Recall that we obtained the ranked pairs of an HCS and its common hypernym in Step 3 By applying the above rules some pairs are removed from the ranked pairs or are modified For some given integer k the top k pairs of the obtained ranked pairs become the final output of our procedure as mentioned before 3 Experimental Results We downloaded about 8 71 x 10 HTML documents 10 4 GB with HTML tags and extracted 9 02 x 104 HCSs through the method described in Section 2 1 We To judge if a hypernym candidate is a place name we used the output of a morphological analyzer Matsumoto et al 1993 100 k n Step 2 80 pis Step 2 tf 8 E Hh irn oe 80 pa im x on SOO OOK O82 HOPE SE Sox accuracy 40 F a a Fy x a yy ECR RAK Ra RK EAK ek OK RK j L aon i gg EHIE a i Epe 500 1000 1500 2000 2500 of hypernym hyponym pairs Figure 3 Contribution of each step 100 z z i Proposed Method eter RA Step 4 0 i hada Rule 1 fal E ta AEK Rule 2 raa F e Rule 3 pd RK om O amg 3 ba so 13 P re BeBe Ea s POG SS OmgOxKxXtT accuracy 4
20. ns Toyota Honda Nissan from it From Assumption A we can treat these expressions as candidates of hyponyms that have a common hypernym such as company We call such expressions in the same itemization hyponym candidates Particularly a set of the hyponym candi dates extracted from a single itemization or list is called a hyponym candidate set HCS For the example docu ment we would treat Toyota Honda and Nissan as hy ponym candidates and regard them as members of the same HCS We then download documents that include at least one hyponym candidate by using an existing search engine and pick up a noun that appears in the documents and that has the largest score The score was designed so that words appearing in many downloaded documents are highly ranked according to Assumption B We call the selected noun a hypernym candidate for the given hy ponym candidates Note that if we download documents including Toy ota or Honda many will include the word company which is a true hypernym of Toyota However words which are not hypernyms but which are closely associ ated with Toyota or Honda e g price will also be in cluded in many of the downloaded documents The next step of our procedure is designed to exclude such non hypernyms according to Assumption C We compute the similarity between hypernym candidates and hyponym candidates in an HCS and eliminate the HCS and its hy pernym candida
21. omatic extraction of hyponyms from newspaper us ing lexicosyntactic patterns In JPSJ SIG Technical Re port 2003 NL 157 pages 77 82 in Japanese Sharon A Caraballo 1999 Automatic construction of a hypernym labeled noun hierarchy from text In Pro ceedings of 37th Annual Meeting of the Association for Computational Linguistics pages 120 126 Michael Fleischman Eduard Hovy and Abdessamad Echihabi 2003 Offline strategies for online ques tion answering Answering questions before they are asked In Proceedings of the 41st Annural Meeting of the Association for Computational Linguistics pages 1 7 Marti A Hearst 1992 Automatic acquisition of hy ponyms from large text corpora In Proceedings of the 14th International Conference on Computational Lin guistics pages 539 545 Kyosuke Imasumi 2001 Automatic acquisition of hy ponymy relations from coordinated noun phrases and appositions Master s thesis Kyushu Institute of Tech nology Hiroshi Kanayama Kentaro Torisawa Yutaka Mitsuishi and Jun ichi Tsujii 2000 A hybrid Japanese parser with hand crafted grammar and statistics In Proceed ings of COLING 2000 pages 411 417 Yuji Matsumoto Sadao Kurohashi Takehito Utsuro Hi roshi Taeki and Makoto Nagao 1993 Japanese Morphological Analyzer JUMAN user s manual in Japanese George A Miller Richard Beckwith Christiane Fell baum Derek Gross and Katherine J Miller 1990 Introduction to wordnet An
22. on line lexical database Journal of Lexicography 3 4 235 244 Emmanuel Morin and Christian Jacquemin 2003 Auto matic acquisition and expansion of hypernym links In Computer and the Humanities 2003 forthcoming Table 1 Examples of the acquired pairs of a hypernym candidate and HCS Rank Hypernyms Rank Fired Hypernyms by Hyponym candidate sets obtained in by Rules obtained in Step4 Step3 Step3 1 2 Step4 ALA murder BK arson 1R rape fe Ave burglary 29 Z ASS burgle robbery JE A HE theft without breaking in ALF 68 ALF JE AGRE robbery without breaking in crime crime AZT Moscow 17 Kiev Y XV 7 Y F Tashkend SVAZY Minsk KEY V Thilisi F xXx Dushanbe BY 7 he 69 ey a7 Bishkek PAY F Astana a 7 Kishinev Russia 169 place ZU Ny Erevan 7 7 Baku 7 7 7S F Ashkhabad name 7 N Seguignol EIFEEE Yasuo Fujii HR Yuji Goshima 7k 7XH12 Tomotaka Tamaki A a 78 4240 Hiroki Fukutome PIF Keiichi Hirano EF 196 EF Vak FY Sheldon HF Kazuhiko Shiotani player player These are baseball players TA VU AHF wireless card NEJ X 2 Y F 4 security rR hes 81 HEAL A radio FEJESA a kind of instrument AGAR 200 AR P H S 2 ESj a kind of department wireless wireless 4 77 amp
23. ow score values since we ob tain statistical measures from all the local document sets for all the hyponym candidates in an HCS Nevertheless this scoring method is a weak method in one sense There could be many non hypernyms that are 3In our experiments NV is a set consisting of 37 639 words each of which appeared more than 500 times in 33 years of Japanese newspaper articles Yomiuri newspaper 1987 2001 Mainichi newspaper 1991 1999 and Nikkei newspaper 1983 1990 3 01 GB in total We excluded 116 nouns that we ob served never be hypernyms from N An example of such noun is D We found them in the experiments using a develop ment set strongly associated with many of the hyponym candidates for instance price for Toyota and Honda Such non hypernyms are dealt with in the next step An evident alternative to this method is to use tf n LD C which is the frequency of a noun n in the local document set instead of df n LD C We tried using this method in our experiments but it produced less accurate results as we show in Section 3 2 3 Step 3 Ranking of hypernym candidates and HCSs by semantic similarity Thus our procedure can produce pairs consisting of a hypernym candidate and an HCS which are denoted by h C1 C1 h C2 C2 ace h Cm Cm Here C1 Cm are HCSs and h C is a common hy pernym candidate for hyponym candidates in an HCS C i In Step 3 our procedure ranks these pair
24. s by using the semantic similarity between h C and the items in Cj The final output of our procedure is the top k pairs in this ranking after some heuristic rules are applied to it in Step 4 In other words the procedure discards the remaining m k pairs in the ranking because they tend to include erroneous hypernyms As mentioned we cannot exclude non hypernyms that are strongly associated with hyponym candidates from the hypernym candidate obtained by h C For exam ple the value of h C may be a non hypernym price rather than company when C Toyota Honda The objective of Step 3 is to exclude such non hypernyms from the output of our procedure We expect such non hypernyms to have relatively low semantic similarities to the hyponym candidates while the behavior of true hy pernyms should be semantically similar to the hyponyms If we rank the pairs of hypernym candidates and HCSs according to their semantic similarities the low ranked pairs are likely to have an erroneous hypernym candidate We can then obtain relatively precise hypernyms by dis carding the low ranked pairs The similarities are computed through the following steps First we parse all the texts in the local document set and check the argument positions of verbs where hy ponym candidates appear To parse texts we use a down graded version of an existing parser Kanayama et al 2000 throughout this work Let us denote the frequency of th
25. te from the output if they are not seman tically similar For instance if the previous step of our procedure produces price as a hypernym candidate for Toyota and Honda then the hypernym candidate and the hyponym candidates are removed from the output We empirically show that this helps to improve overall preci sion Finally we further elaborate computed hypernym can didates by using additional heuristic rules Though we admit that these rules are rather ad hoc they worked well in our experiments We have tested the effectiveness of our methods through a series of experiments in which we used HTML documents downloaded from actual web sites We ob served that our method can find a significant number of hypernyms that at least some of alternative hypernym acquisition procedures cannot acquire at least when only a rather small amount of texts are available In this paper Section 2 describes our acquisition al gorithm Section 3 gives our experimental results which we obtained using Japanese HTML documents and Sec tion 4 discusses the benefit obtained through our method based on a comparison with alternative methods 2 Acquisition Algorithm Our acquisition algorithm consists of four steps as ex plained in this section Step 1 Extraction of hyponym candidates from itemized expressions in HTML documents Step 2 Selection of a hypernym candidate with respect to df and idf Step 3 Ranking of hypernym candidates and H

Acquiring Hyponymy Relations from Web Documents

Contents

Download Pdf Manuals

Related Search

Related Contents