
ChaSen Morphological Analyzer version 2.4.0 User's Manual

1. [3] [illegible] JUMAN [illegible] version 2.0. NAIST Technical Report NAIST-IS-TR94025, 1994.
[4] [illegible] ViJUMAN version 1.0 [illegible]. NAIST Technical Report NAIST-IS-TR96005, 1996.
[5] [illegible] ViJUMAN [illegible] 96-NL-115, pp. 29-34, September 1996.
[6] [illegible] NAIST-IS-MT9551092, March 1997.
[7] [illegible] NAIST-IS-MT9551119, March 1997.
[8] [illegible] 96-NL-119, May 1997.
[9] [illegible] pp. 437-440, 1997.
[10] [illegible] 1998.
[11] [illegible] NAIST-IS-MT9851103, March 1999.
[12] [illegible] Vol. 40, No. 5, pp. 2325-2337, May 1999.
[13] [illegible] 99-NL-134, pp. 23-30, Nov. 1999.
[14] Masayuki Asahara. Extended Statistical Model for Morphological Analysis. [illegible]
2. %M      surface form (base form)
%y      first reading candidate (conjugated form)
%Y      first reading candidate (base form)
%y0     all readings (conjugated form)
%Y0     all readings (base form)
%a      first pronunciation candidate (conjugated form)
%A      first pronunciation candidate (base form)
%a0     all pronunciations (conjugated form)
%A0     all pronunciations (base form)
%rABC   surface form with ruby, i.e. "A Kanji B kana C" (Note 1)
%i      first semantic information candidate
%i0     all semantic information
%Ic     semantic information (if NIL, print c) (Note 1)
%Pc     part-of-speech names of all layers in the part-of-speech hierarchy, joined together by c
%Pnc    part-of-speech names of the first n layers (n: 1-9) in the part-of-speech hierarchy, joined together
%h      part-of-speech code
%H      part-of-speech name
%Hn     part-of-speech name at the nth layer (n: 1-9), or at the deepest layer
%b      0 (only for backwards compatibility)
%BB     sub part-of-speech name (if NIL, print POS)
%Bc     sub part-of-speech code (if NIL, print c) (Note 1)
%t      conjugated type code
%Tc     conjugated type name (if NIL, print c) (Note 1)
%f      conjugated form code
%Fc     conjugated form name (if NIL, print c) (Note 1)
%c      cost of morpheme
%S      the input sentence
%pb     "*" if optimal path, " " otherwise
%pi     the index of the path of the output lattice
%ps     starting position of the morpheme in the path
%pe     ending position of the path's morphemes + 1
%pc     cost of path
%ppiC   indices of the elements in the preceding path, joined together by C
%ppcC   costs of the elements in the preceding path, joined tog
3. Represent each morpheme as a Prolog compound term and output them as a list
Detailed display mode for VisualMorphs
Output in the format specified in the format string F
Display the help for output formatting options
Treat full stops and empty lines as sentence boundaries
Specify output file
Manually set cost threshold
Use rc_file as the chasenrc file
Read the default chasenrc file (PREFIX/etc/chasenrc)
Specify input language
Show a list of POS category codes and their names
Show a list of inflection category codes and their names
Show a list of inflection type code, inflected form code, inflected form name
Select the input encoding (e: EUC-JP, s: Shift_JIS, w: UTF-8, u: UTF-8, a: ISO-8859-1)
Show the help message
Show the version number
Restricted analysis

About the -j option

Normally, ChaSen treats the end of a line as the end of an input sentence. Because of this, when analyzing a file where newlines appear in the middle of a sentence, the correct results are often not obtained. In these cases, adding the -j option will cause full stops and other sentence-final punctuation (a default set is predefined), or empty lines, to be used for identifying sentence boundaries. The characters used to split sentences with the -j option can also be specified by setting the delimiter value appropriately in chasenrc.

1.4 Output Formats

The output format of analysis results can be chang
4. 's author, Taku Kudoh, and we will simultaneously release versions for use with both ChaSen and MeCab.

One problem that none of JUMAN, ChaSen, or MeCab has addressed is unknown word processing, i.e. the handling of words not in the dictionary. Machine learning models to solve this problem are currently under development at NAIST [32, 33]. Sometime in the future, we would like to release a morphological analyzer with a different framework than ChaSen that can support unknown word processing.

3 http://mecab.sourceforge.net/
4 http://mecab.sourceforge.net/soft.html
5 http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html
5. Cost threshold

In the process of morphological analysis, there may be situations where users want to allow all analyses within a beam search cost width. This setting is used to specify that cost width. To output all solutions within the cost width, use the -m and -p options.

    (COST_WIDTH 0)    ; cost width: default value

The cost width can also be specified with the -w option, overriding the value set in the chasenrc file.

Undefined connectivity cost

This setting specifies the connectivity cost for morpheme sequences not defined in the connection rule file. If an undefined connectivity cost is not given, or it is set to 0, then morpheme sequences not in the connection rule file will never be permitted. The default value is 0.

    (DEF_CONN_COST 500)    ; undefined connectivity cost of 500

Output format

This setting lets users change the output format of ChaSen's results.

    (OUTPUT_FORMAT "%m\t%y\t%P-\n")

The output format can also be specified using the -F flag, overriding any value set in chasenrc. For more information on formatting, see Section 1.4.

BOS string

This setting specifies the string to display at the beginning of the results for a sentence. Using %S will display the entire input sentence. The default is the empty string.

    (BOS_STRING "Input sentence: %S\n")    ; BOS string is "Input sentence: ..."

EOS string

This setting specifies the string to display at the end of the results for
6. [illegible] NAIST-IS-MT9851001, March 2000.
[15] [illegible] GUI [illegible] VisualMorphs [illegible] 2000-NL-137, p. 98, June 2000.
[16] [illegible] HMM [illegible] 2000-NL-137, pp. 39-46, June 2000.
[17] Masayuki Asahara, Yuji Matsumoto. Extended Models and Tools for High-performance Part-of-Speech Tagger. Proceedings of COLING 2000, July 2000.
[18] [illegible] 2000-NL-139, pp. 25-32, Sep. 2000.
[19] [illegible] Vol. 41, No. 11, pp. 1208-1214, Nov. 2000.
[20] [illegible] Feb. 2001.
[21] [illegible] pp. 39-46, Feb. 2002.
[22] [illegible] SIGNL-154, pp. 47-54, 2003.
[23] [illegible] Support Vector Machine [illegible] Vol. 44, No. 5, pp. 1354-1367, May 2003.
[24] Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto. Applying Conditional Random Fields to Japanese Morphological Analysis. EMNLP 2004, 2004.
[25] [illegible] Vol. 19, No. 3, pp. 334-339, 2004.
7. sions of their file names. Multiple dictionary sets may also be specified. Relative paths (i.e. paths not starting with /) are assumed to start in the same directory as the grammar files. Here is an example:

    (DADIC chadic /home/rikyu/mydic/chadic)

In this example, two sets of dictionaries are read in:
(a) chadic.da, chadic.lex, and chadic.dat in the grammar file directory
(b) chadic.da, chadic.lex, and chadic.dat in /home/rikyu/mydic
When dictionary lookups are done, both of the above dictionary sets will be used (see footnote 2).

The setting DADIC is used to specify a double-array dictionary for Darts.

    (DADIC chadic)

In the above example, chadic.da, chadic.lex, and chadic.dat in the same directory as the grammar files will be read. The maximum number of usable dictionaries is set to 32.

3 Unknown word part of speech

When an unknown word is detected, this setting indicates what part of speech to treat it as while applying ChaSen's connection rules. If multiple parts of speech are given, then the connection rules for each part of speech are applied.

    (UNKNOWN_POS ([illegible]))                  ; one part of speech
    (UNKNOWN_POS ([illegible]) ([illegible]))    ; multiple parts of speech

4 Part of speech cost

The morphological analyzer calculates analysis precedences as costs. When there is ambiguity while analyzing, the result with the lowest total cost is given precedence. The part of speech cost setting is used to define the magnitude of cost ass
8. Please send any inquiries regarding ChaSen to the following address:

Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara 630-0192, Japan
Tel: +81-743-72-5240, Fax: +81-0743-72-5249
E-mail: chasen@is.naist.jp
URL: http://chasen-legacy.sourceforge.jp/

1 ChaSen User's Manual

1.1 Installation

1. Install the necessary tools
The following tools are necessary to compile ChaSen:
• Darts version 0.3 or later
• libiconv (if not part of your system's standard installation)

2. Run configure

    ./configure

• When specifying the location of Darts header files:

    ./configure --with-darts=/usr/local/include

• When using libiconv:

    ./configure --with-libiconv=yes

• When specifying the location of libiconv:

    ./configure --with-libiconv=/usr/local

The compiler and options will be determined automatically. For more information on how to use configure, consult INSTALL or the output of configure --help.

3. Run make

    make

ChaSen's executable is created in chasen/chasen, the libraries in lib/, and the dictionary creation program in mkchadic/. Sometimes compilation will fail when using the OS's standard make. In that case, GNU make should be used.

4. Run make install

    make install

The installation directory
9. we decided to fork into separate projects, and Kyoto University's expanded version was soon released as JUMAN 3.0 beta in June of 1996. NAIST's fork was renamed ChaSen, and version 1.0 was released in February of 1997. The planned improvements to JUMAN were made through the release of versions 1.5 through 2.3, and with the release of ChaSen 2.4 almost all of the planned features had been added. Development progressed on the following schedule:

1. ChaSen 1.0: Development of system-independent dictionaries (replacement of NDBM with binary trees)
2. ChaSen 1.0: Refactored and improved performance of system
3. ChaSen 1.0: Support for undefined connectivity costs, compound parts of speech, and user definition of output formatting
4. ChaSen 1.0: Support for JIS encoding
5. ChaSen 1.0: Definitions for readings of inflectional endings
6. WinCha 1.0: Support for Windows
7. ChaSen 1.5: Converted to library
8. ChaSen 1.5: Converted to server
9. ChaSen 2.0: Stratification of POS definitions
10. ChaSen 2.0: Variable-length connection rules
11. ChaSen 2.0: Created a dictionary for words with half-width characters; dictionary using SUFARY
12. ChaSen 2.0: Expansion of output formats
13. ChaSen 2.0: Training of models using variable-length connection costs
14. ChaSen 2.4: Restricted analysis

C The Future of Morphological Analyzers

A morphological analyzer called MeCab has been released by Taku Kudoh. MeCab us
10. [26] Chooi-Ling Goh, Masayuki Asahara, and Yuji Matsumoto. Chinese Word Segmentation by Classification of Characters. International Journal of Computational Linguistics and Chinese Language Processing, Vol. 10, No. 3, pp. 381-396, September 2005.
[27] Chooi-Ling Goh, Masayuki Asahara, and Yuji Matsumoto. Training Multi-Classifiers for Chinese Unknown Word Detection. Journal of Chinese Language and Computing, Vol. 15, No. 1, pp. 1-12, 2005.
[28] [illegible] pp. 245-248, 2005.
[29] [illegible] pp. 604-607, 2005.
[30] [illegible] 2005.
[31] Chooi-Ling Goh, Masayuki Asahara, and Yuji Matsumoto. Machine Learning-based Methods to Chinese Unknown Word Detection and POS Tag Guessing. Journal of Chinese Language and Computing, Vol. 16, No. 4, pp. 185-206, 2006.
[32] [illegible] SIGNL-173, pp. 67-74, 2006.
[33] [illegible] SIGNL-179, 2007.

Appendix

A Regarding Copyright and Usage Restrictions

The ChaSen morphological analyzer was developed as free software to widely aid research on natural language processing. ChaSen's copyright is held by Computational Li
11. [Japanese example sentence illegible] specifying that [illegible] is a noun, or that [illegible] should be treated as a single morpheme. Analysis candidates that violate the constraints, such as the fourth character [illegible] being treated as an independent morpheme, or [illegible] being split into [illegible] and [illegible], will be rejected.

Input format

The input for constrained analysis is the same as ChaSen's standard output format, but reading and lemmatization information is ignored. In the following examples, tabs are represented by \t.

[Example input illegible: tab-separated morpheme lines, some with UNSPEC in the part-of-speech column, terminated by EOS.]

Each line consists of a segment. A segment can be one of the following:
• morpheme specification
• sentence fragment
• end of sentence
• comment

• morpheme specification
This segment represents a single morpheme, a unit that will not be split any further. Morpheme specification segments have part-of-speech information from the fourth column onward. The format is the same as ChaSen's standard output. If you write UNSPEC instead of part-of-speech information, ChaSen will look up the segment in its dictionary and use the corresponding entry as its result. If there is no entry, the segment will be labeled as an unknown word.

• sentence fragment
A segment without any part-of-speech information represents a sentence fragment. The co
12. ARE DISCLAIMED. IN NO EVENT SHALL THE Nara Institute of Science and Technology BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

JUMAN
    version 0.6      17 February 1992
    version 0.8      14 April 1992
    version 1.0      25 February 1993
    version 2.0      11 July 1994
ChaSen
    version 1.0      19 February 1997
    version 1.5      7 July 1997
    version 2.0      15 December 1999
    version 2.2.0    06 December 2000
    version 2.3.0    16 February 2003
    version 2.4.0    30 March 2007
ChaSen for Windows
    version 1.0      29 March 1997
    version 2.0      15 December 1999
    version 2.4.0    30 March 2007
NAIST Technical Report
    1st edition (NAIST-IS-TR99008)    20 April 1999
    2nd edition (NAIST-IS-TR99012)    15 December 1999

Contents

1 ChaSen User's Manual
    1.1 Installation
    1.2 Running ChaSen
    1.3 Runtime Options
    1.4 Output Formats
    1.5 Constrained Analysis
2 The chasenrc Resource File
3
13. ChaSen Morphological Analyzer version 2.4.0 User's Manual

Yuji Matsumoto, Kazuma Takaoka, and Masayuki Asahara

2007-03-19

Copyright (c) 2007 Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology

Morphological Analysis System ChaSen 2.4.0 User's Manual
Yuji Matsumoto, Kazuma Takaoka, and Masayuki Asahara

This translation of the ChaSen user's manual was made with support from the non-profit organization GSK by Eric Nichols.

Copyright (c) 2007 Nara Institute of Science and Technology. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The name Nara Institute of Science and Technology may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY Nara Institute of Science and Technology "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
14. The ChaSen Library
4 Using ChaSen from Other Systems
    4.1 Using ChaSen from Perl
Bibliography
Appendix
    A Regarding Copyright and Usage Restrictions
    B The Connection between JUMAN 3.0 and ChaSen
    C The Future of Morphological Analyzers

Introduction

The computational analysis of Japanese, unlike that of American and European languages, faces two problems from the outset.

The first is the problem of morphological analysis. With the spread of word processors, a big problem in the input of Japanese has gone away, but in the computational analysis of Japanese the individual morphemes in the input sentence must first be recognized. For this we need a dictionary as large as can practically be supported, so there is also the problem of how to maintain this dictionary.

The second problem is that Japanese has no widely accepted or agreed-upon grammar or grammatical terminology. The grammars taught in schools do provide word classifications and grammatical terminology; however, researchers do not hold them in very high regard, and they are not suitable for computers.

Although morphological analyzers, a tool of foremost necessity in Japanese analysis, have already been developed by many research groups, and many technological problems have been brought to light, there is no common tool in circulation in the world. This
15. a sentence. Using %S will display the entire input sentence. The default is "EOS\n".

    (EOS_STRING "END\n")    ; EOS string is "END"

11 Whitespace part of speech

ChaSen treats the half-width space character (ASCII code 32) and tab (ASCII code 9) as whitespace and ignores them during analysis. Normally, whitespace information is not included in ChaSen's output, but this can be changed by using the SPACE_POS setting. For example, the setting given below will output "punct-whitespace" for whitespace.

    (SPACE_POS (punct whitespace))    ; whitespace part of speech is "punct-whitespace"

Furthermore, by setting the output format to "%m" and specifying a whitespace part of speech, users can get output that corresponds exactly to the input sentence, whitespace included.

12 Annotations

This setting allows strings that begin and end with a certain sequence to be treated as an annotation and ignored during morphological analysis. In the results, the annotation string will be output as a single morpheme. Each annotation definition consists of a list of a start string and a stop string, followed by optional part-of-speech information or a formatting string. The stop string can also be omitted, in which case the start string itself will be treated as the annotation. If the part-of-speech information and format string are omitted, then absolutely no information about the annotation's morpheme will be output.
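The start-and-stop matching described for annotations can be sketched in C. This is a minimal illustration under stated assumptions, not ChaSen's implementation: the function name annotation_length is hypothetical, strings are compared byte by byte, and an annotation whose stop string never appears is treated as no annotation at all, a case the manual does not specify.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the ChaSen API.
 * If `text` begins with the annotation start string, return the length in
 * bytes of the whole annotation; otherwise return 0.
 * A NULL `stop` models an omitted stop string: the start string alone is
 * the annotation. An unterminated annotation yields 0 (an assumption). */
size_t annotation_length(const char *text, const char *start, const char *stop)
{
    size_t start_len;
    const char *end;

    assert(text != NULL && start != NULL);
    start_len = strlen(start);

    if (start_len == 0 || strncmp(text, start, start_len) != 0)
        return 0;               /* text does not open an annotation */
    if (stop == NULL)
        return start_len;       /* the start string is the whole annotation */

    end = strstr(text + start_len, stop);
    if (end == NULL)
        return 0;               /* no stop string found after the start */
    return (size_t)(end - text) + strlen(stop);
}
```

With a start string of "[" and a stop string of "]", the call annotation_length("[ChaSen] is", "[", "]") covers the eight bytes of "[ChaSen]". A caller would then skip that span during analysis and emit it as a single morpheme, or emit nothing if no part of speech or format string was configured.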
16. atenated together and displayed as punctuation [illegible].

Compound word output

ChaSen can be configured to treat compound words defined in the morphological dictionary file (*.dic) in two different ways:
(a) compound: the morphological information for the entire compound word is output
(b) compositional: the compound word is decomposed into individual words, and the morphological information for each word is output
The default setting is compound.

    (OUTPUT_COMPOUND COMPOUND)    ; output compound morphological information

Compound word output can also be controlled by the -Oc and -Os options.

Delimiters

This setting allows users to define the characters that are used as sentence delimiters when the -j option is set (see 1.3). Both half-width and full-width characters can be used as delimiters. For example, the following definition treats the full-width characters [illegible], the half-width characters [illegible], and whitespace as sentence delimiters.

    (DELIMITER "[illegible]")

Encodings

The character encoding that ChaSen supports can be changed by re-encoding the morphological file and recompiling ChaSen. The ENCODE setting is used to indicate the encoding that ChaSen will use. For example, the following definition denotes Unicode.

    (ENCODE u)

The supported encodings are:
• e: EUC-JP
• s: Shift_JIS
• w: UTF-8
• u: UTF-8
• a: ISO-8859-1

3 The ChaSen Library

The ChaSen module can be i
17. e_tostr(char *str_in);

These functions perform morphological analysis on the input. If ChaSen has not been initialized, it is initialized before proceeding. There are four functions, differing in whether the input and output are strings or file pointers.

chasen_fparse and chasen_fparse_tostr perform morphological analysis on strings read from a file pointer. When the -j option has been set with chasen_getopt_argv, ChaSen splits the input into sentences at the delimiters before parsing. chasen_sparse and chasen_sparse_tostr perform morphological analysis on the string str_in.

chasen_fparse and chasen_sparse write the results of morphological analysis to the file pointer fp_out. The return value of these functions is 0.

chasen_fparse_tostr and chasen_sparse_tostr store the results of morphological analysis in ChaSen's internal memory and return a pointer to that region of memory. The region can be accessed until chasen_fparse_tostr or chasen_sparse_tostr is called again.

4 Using ChaSen from Other Systems

4.1 Using ChaSen from Perl

ChaSen can be called from Perl by using the perl/ChaSen.pm Perl module. Consult the perl/README file for information on installation and usage.

Bibliography

[1] [illegible] 1992.
[2] [illegible] 1991.
18. ed by using the -F option or by setting the output format value in chasenrc.

If there is a \n at the end of the output format string, a newline will be inserted at the end of each piece of morphological information, and EOS will be output at the end of each sentence. If there is no \n at the end of the output format string, then the morphological information for one sentence will be output on one line, with a newline at the end. Also, if the output format string contains "-f", "-e", or "-c", the output will match that of the corresponding option.

Here are some examples of output format string usage:
• Same as default (-f option):
    "%m\t%y\t%M\t%U(%P-)\t%T \t%F \n" or [illegible]
• Input word, reading, and POS, delimited by tabs:
    "%m\t%y\t%P-\n"
• Only the input word:
    "%m\n"
• Wakatigaki mode (input words divided by spaces):
    "%m "
• Kanji-kana conversion:
    "%y"
• Ruby mode (output in the form of "Kanji(kana)"):
    "%r ()"

Below we give a list of all output format conversion strings and their meanings.

Conversion strings, in table order: %m %M %y %Y %y0 %Y0 %a %A %a0 %A0 %rABC %i %i0 %Ic %Pc %Pnc %h %H %Hn %b %BB %Bc %t %Tc %f %Fc %c %S %pb %pi %ps %pe %pc %ppiC %ppcC %?B/STR1/STR2/ %?I/STR1/STR2/ %?T/STR1/STR2/ %?F/STR1/STR2/ %?U/STR1/STR2/ %U/STR/ %%

Conversion string   Function
%m      surface form (conjugated form)
19. es a discriminative training model known as a Conditional Random Field, as opposed to the generative Hidden Markov Model used by ChaSen. In [24], MeCab's model is shown to have better accuracy than ChaSen's. MeCab's other characteristic is that it can output Soft Wakatigaki [30]. In ChaSen's current framework, it is not possible to support new models of analysis and freely design training features as in MeCab.

Recently, there have also been various improvements relating to dictionaries. For the new JUMAN dictionary, together with the selection of a fundamental lexicon of Japanese, information about orthographical variations is also being prepared. UniDic, a recently released dictionary developed by Professor Den's group at Chiba University, is said to be easy to use not just for natural language processing researchers but also for researchers in the arts and humanities and in speech processing. At NAIST, we plan to screen the entries in IPADIC and release a Japanese dictionary annotated with information about orthographical variations and compound words. We plan to rename the new dictionary and remove the ICOT entries that were a pending problem for IPADIC. We are also planning to release a dictionary for Chinese morphological analysis with Penn Chinese Treebank part-of-speech information once issues regarding usage rights have been settled. We have discussed the release of the Chinese morphological analysis dictionary with MeCab
20. ether by C
%?B/STR1/STR2/   STR1 if detailed POS category, STR2 otherwise (Note 2)
%?I/STR1/STR2/   STR1 if not the empty string (even if auxiliary information is NIL), STR2 otherwise (Note 2)
%?T/STR1/STR2/   STR1 if conjugated, STR2 otherwise (Note 2)
%?F/STR1/STR2/   same as %?T/STR1/STR2/
%?U/STR1/STR2/   STR1 if unknown word, STR2 otherwise (Note 2)
%U/STR/          "未知語" if unknown word, STR otherwise (Note 2)
%%               percent sign

Conversion string   Function
.       specifies field width
-       specifies field width
1-9     specifies field width
\n      newline
\t      tab
\\      backslash
\'      single quotation mark
\"      double quotation mark

Note 1: In ipadic, when morphemes have multiple readings, as in the case of [illegible], the readings are displayed with half-width braces and backslashes, like so: [illegible]. In the standard output format (i.e. that of %y), the word's first reading candidate [illegible] is output, and with output format %y0 all of the readings [illegible] are output. When A, B, C, or c are empty strings, nothing is displayed.

Note 2: The string divider can be an arbitrary string. Brackets like <> are also usable. For example:
• %?T#STR1#STR2#
• %?B STR1 STR2 [delimiters illegible]
• %?U STR1 STR2 [delimiters illegible]
• %U STR [delimiters illegible]

1.5 Constrained Analysis

Constrained analysis refers to a special kind of analysis that satisfies constraints, used when the morphological information or boundaries for a portion of the input sentence are already known. For example, it is possible to analyze the sentence
21. (ANNOTATION (("<" ">") "%m\n")             ; output as is
            (("[illegible]") ([illegible]))    ; punctuation
            (("[illegible]") ([illegible]))    ; punctuation
            (("\"" "\"") ([illegible]))        ; noun, quotation string
            (("[" "]")))                       ; nothing will be output

For example, when using the above annotation definition, ChaSen will output its results in the following format:
• text starting with < and ending with >, such as <img src="cha.gif">, will be output as is
• [illegible] will be output for [illegible] and [illegible]
• [illegible] will be output for strings in double quotes, like "hello again"
• strings enclosed in square brackets, like [ChaSen], will be ignored in morphological analysis, and no information will be included in the output

13 Part of speech concatenation

This setting is used to concatenate morphemes of certain parts of speech that appear in succession and output them as a single morpheme.

    (COMPOSIT_POS [illegible])

For example, with the above declaration of COMPOSIT_POS, parts of speech are concatenated in the following manner:
(a) Consecutive nouns, noun prefixes, and numeric prefixes [Japanese part-of-speech names illegible] are concatenated together and displayed as a compound noun [illegible]. However, this part of speech must be defined in the part-of-speech definition file (grammar.cha).
(b) Consecutive punctuation [illegible] is conc
22. has changed from version 2.1 onward. Now it installs to the locations below by default. (PREFIX can be specified with configure --prefix; the default is /usr/local.)

    PREFIX/bin/chasen          the ChaSen executable
    PREFIX/libexec/chasen/     dictionary construction programs
    PREFIX/lib/libchasen.*     the ChaSen libraries
    PREFIX/include/chasen.h    ChaSen's header file
    PREFIX/share/chasen/doc/   documentation

However, the following is not installed:

    perl/ChaSen.pm             Perl module

chasenrc is not installed with ChaSen. Instead, when the dictionary (ipadic 2.6.0 or above) is installed, chasenrc's path is taken from chasen-config, and if there is no chasenrc in PREFIX/etc, a copy is made automatically. When PREFIX/etc already contains a chasenrc file, it is not copied and must be manually updated.

1.2 Running ChaSen

The morphological analyzer's executable is installed into PREFIX/bin/chasen by the make install command.

• Running the morphological analyzer
ChaSen is started by running the chasen command in the following manner:

    chasen [options] [filename ...]

ChaSen reads files from standard input, or those specified by command-line arguments, one line at a time, and conducts morphological analysis on each sentence.

1 http://cl.aist-nara.ac.jp/~taku-ku/software/darts/

• Processing details
ChaSen finds the lowest cost solution (the solution where each morpheme's boundary has a variation from the minimum cost
23. is true of machine-readable Japanese dictionaries as well. This system was developed to offer the many researchers aiming at computational analysis of Japanese a commonly usable morphological analyzer. Under these circumstances, we took into account the above two problems and gave special consideration to making it easy for users to change the definition of the grammar and the connective relations between words.

This system was developed at a university by a small number of people, and there are still areas that are not perfect, but we plan to keep making improvements as far as possible. We hope that you will bear this in mind when using ChaSen.

The ChaSen system is based on the Japanese morphological analyzer JUMAN version 2.0, developed at Nagao Laboratory of Kyoto University and the Graduate School of Information Science at Nara Institute of Science and Technology. JUMAN was made with the cooperation of many students and staff of Kyoto University and NAIST. For the dictionary, we used the dictionary from the Kana-Kanji conversion system Wnn and a Japanese dictionary publicly released by ICOT, adding our own modifications. We are especially grateful to Sadao Kurohashi of Tokyo University, with whom we developed JUMAN 2.0, and Yutaka Myoki, who is currently working at Canon. First, we would like to thank Professor Makoto Nagao for creating the opportunity to develop JUMAN. We are also grateful to Takehito Utsuro of Tsukuba Uni
24. ncluded in other programs using the ChaSen libraries libchasen.a and libchasen.so. To do so, include the header file chasen.h. The following library functions and variables are accessible.

    #include <chasen.h>
    int chasen_getopt_argv(char **argv, FILE *fp);
    extern int Cha_optind;

Passes options to ChaSen. If ChaSen has not been initialized, it is initialized before the options are set. If ChaSen's default options are acceptable, calling this function can be omitted.

argv is a NULL-terminated array of strings containing the command-line options for ChaSen. argv[0] always contains the program name. When there is an error in the options, an error message is output to the file pointer fp. No output is produced when fp is set to NULL.

When there are no errors in the option settings, 0 is returned. When there is an error, 1 is returned. The number of processed options, including argv[0], is stored in the external variable Cha_optind.

The following is a usage example. In the program chawan, the options -r /home/rikyu/chasenrc.proj -j are passed to ChaSen. After chasen_getopt_argv is called, Cha_optind is assigned 4.

    char *option[] = {"chawan", "-r", "/home/rikyu/chasenrc.proj", "-j", NULL};
    chasen_getopt_argv(option, stderr);

    #include <chasen.h>
    int chasen_fparse(FILE *fp_in, FILE *fp_out);
    int chasen_sparse(char *str_in, FILE *fp_out);
    char *chasen_fparse_tostr(FILE *fp_in);
    char *chasen_spars
25. nguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology. There are no particular restrictions imposed on the use and modification of this software; however, the following conditions apply to its redistribution.

B The Connection between JUMAN 3.0 and ChaSen

Since JUMAN 2.0 was released in July of 1994, Nagao Laboratory at Kyoto University and Matsumoto Laboratory at Nara Institute of Science and Technology have been trying different approaches to its expansion. At Kyoto University, researchers have been working on adding functionality for processing multi-word expressions and parsing bracketed expressions, in order to describe connective relations that cannot be represented by existing bi-gram models, and they have produced expanded versions of the grammar files and morphological dictionaries with large-scale updates.

At NAIST, anticipating the accumulation of a large amount of tagged Japanese data, we focused on adding functionality for automatically learning connection rules that go beyond bi-grams (including word and part-of-speech label tagging) and on the development of dictionaries that do not depend on the NDBM Unix hash database. The latter improvement aimed at addressing requests to use the software on operating systems other than UNIX and at improving the compilation time and search speed of the dictionaries.

Because the two approaches to connectivity rules going beyond bi-grams are fairly different,
26. ntents of this segment will be processed without any constraints. However, no candidates that cross the segment boundaries will be generated.

- End of sentence: Lines starting with EOS BOS EOS or XX, and lines containing nothing but a newline, mark the end of a sentence.
- Annotations: Putting ANNO in the part-of-speech information column makes that segment an annotation. Annotations are displayed in ChaSen's output, but they are not used in its analysis. The display is determined in chasenrc.

Example analysis

An example of restricted analysis is given below.

[Example input and output: Japanese text lost in extraction]

Areas of caution in restricted analysis

- During restricted analysis, even if ANNO is set, no output will be displayed unless comments are enabled in chasenrc.
- During restricted analysis, whitespace part-of-speech tagging and whitespace skipping are disabled (this is to support comments).

2 The chasenrc Resource File

The chasenrc resource file is used to define the various options necessary for running the ChaSen morphological analyzer. These definition
27. ociated with each part of speech, as well as set the cost of unknown words. Costs must be integer values.

    POS_COST
      any part of speech    1      default cost (1x)
      unknown words         500    cost 500x
      nouns                 2      cost 2x
      proper nouns          3      cost 3x

When multiple costs are defined for a part of speech, the last cost is given precedence. In the above example, the cost of nouns is 2, but the morpheme cost of proper nouns increases to 3. The setting at the top indicates that the morpheme cost for parts of speech not explicitly defined should be set to 1 (i.e., no change in the total cost of the path). The cost of unknown words is set to 500.

Footnote: The same morpheme cannot be registered in a single dictionary set multiple times, but a given morpheme may appear in multiple dictionary sets. In this case, there will be duplicates of the morpheme.

Relative weights of connectivity and morpheme costs

The cost in morphological analysis is calculated as the sum of the morpheme cost and the connectivity cost. This setting lets users assign weights to these two kinds of cost. The cost of an analysis result is calculated as the sum of each cost multiplied by its weight. If a weight is omitted, it defaults to 1.

    CONN_WEIGHT 1     connectivity cost weight of 1
    MORPH_WEIGHT 1    morpheme cost weight of 1
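The weighting rule above can be illustrated with a small C sketch. This is not ChaSen's internal code: the struct path_step type and the weighted_path_cost function are hypothetical names introduced only for this illustration, and the sketch assumes that each step of an analysis path carries one morpheme cost and one connectivity cost.

```c
#include <stddef.h>

/* Hypothetical type for this sketch (not a libchasen structure): one step
 * of an analysis path, holding the two kinds of cost the manual describes. */
struct path_step {
    long morph_cost; /* morpheme cost of the word chosen at this step */
    long conn_cost;  /* connectivity cost of linking it to the previous word */
};

/* Hypothetical helper: total cost of a path, computed as the sum of each
 * cost multiplied by its weight (the CONN_WEIGHT and MORPH_WEIGHT settings). */
long weighted_path_cost(const struct path_step *steps, size_t n,
                        long conn_weight, long morph_weight)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += conn_weight * steps[i].conn_cost
               + morph_weight * steps[i].morph_cost;
    }
    return total;
}
```

With both weights left at 1, the result is simply the plain sum of all morpheme and connectivity costs on the path. Raising one weight makes the analyzer's ranking of paths more sensitive to that kind of cost.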
28. s are usually kept in PREFIX/etc/chasenrc, but they can also be stored in the file .chasenrc in the user's home directory. The chasenrc file can also be specified by an option when ChaSen is initialized. The following precedence order is used to determine which chasenrc file is loaded when ChaSen is run:

1. (Unix, Windows) the file specified by the -r option at initialization time
2. (Unix, Windows) the file set in the CHASENRC environment variable
3. (Windows) the chasenrc set in the registry key chasenrc in HKEY_CURRENT_USER\Software\NAIST\ChaSen
4. (Unix) the .chasen2rc file in the user's home directory
5. (Unix) the file .chasenrc in the user's home directory
6. (Unix) PREFIX/etc/chasenrc (not installed by default)

A list of settings is given below. Of these settings, DADIC, UNKNOWN_POS, and POS_COST must always be defined.

1 The grammar file directory setting

This setting specifies the directory where the grammar files (grammar.cha, ctypes.cha, cforms.cha, connect.cha) reside.

    (GRAMMAR /usr/local/lib/chasen/ipadic/dic)

This setting can be omitted, in which case the directory is assumed to be the one in which the chasenrc file resides. In the chasenrc file distributed with version 1.01 or later of ChaSen's dictionary ipadic, GRAMMAR is omitted.

2 System dictionaries

This setting is used to specify double-array dictionaries (chadic.da, .lex, .dat), omitting the exten
29. that is within the established cost threshold and display the results following the formatting options. The meaning of each option is summarized in the next section.

- Example usage: The input file can be given as an argument to ChaSen. For example:

[Example command and output: Japanese text lost in extraction]

1.3 Runtime Options

ChaSen supports several runtime options. They are summarized below. For options that take arguments, such as -r, the argument may optionally be separated from the option with whitespace.

- Display options for ambiguous input (all display methods produce the same output for unambiguous results)

    -b    Show only the solution with the rightmost longest match (default)
    -m    Show multiple morphemes for only the ambiguous areas
    -p    Expand the ambiguous combinations and show all possible solutions separately

- Display options for individual morphemes

    -f    Display arranged in columns (default)
    -e    Show all morphological information by category name
    -c    Show all morphological information by category code
    -F format

- Other options

    -j, -o file, -w width, -r rc_file, -L lang, -lp, -lt, -lf
30. versity who helped us in many ways with JUMAN's development. Ken ichi Chinen gave us many suggestions about the ChaSen system's development while he was at NAIST. We received a variety of assistance from Osamu Imaichi, Tomoaki Imamura, and Akira Kitauchi while they were at NAIST during the development of ChaSen versions 1.0 and 2.0, and from Tatsuo Yamashita, Yoshitaka Hirano, and Hiroshi Matsuda during the development of versions 2.0 and 2.2. We are extremely grateful to both of these groups and to all of the other members of Matsumoto Laboratory who helped with ChaSen's development.

The Japanese Speech Dictation Software Development Group, whose representative member is Professor Kiyohiro Shikano of NAIST, carried out large-scale maintenance of the IPA POS dictionary. In particular, we would like to thank Katunobu Itou of Housei University and Kaoru Yamada of ASTEM for their assistance. We are grateful to Yasuharu Den of Chiba University for his advice on dictionary maintenance, focusing on the analysis of spoken language. We also received a lot of advice about converting ChaSen to autoconf/automake and making RPM packages from Tetsu Takabayashi and Taku Kudoh while they were at NAIST. Chooi Ling Goh, Cheng Yuchang, and Jia Lu helped with the maintenance of the Chinese dictionary.

Finally, although there are too many to name individually, we would like to thank all of ChaSen's users for the many comments and questions tt
