(0: arbitrary, 1: ITG, 2: BITG)
structural constraint 0

speed-up setting
enable pre-pruning 1
pre-pruning threshold 0.0

Lines 1-5 show the knowledge sources used by the aligner, which are the source and target vocabularies and the translation probability tables in two directions generated by GIZA++. Lines 7-23 specify the feature weights of the log-linear model; we use 16 features in TsinghuaAligner. The other parameters in the configuration file are related to the search algorithm. We strongly recommend enabling pre-pruning, which improves the aligning speed by an order of magnitude without sacrificing accuracy significantly. The default structural constraint used in search is "arbitrary". You may try other constraints such as ITG by modifying the configuration file as follows:

structural constraint (0: arbitrary, 1: ITG, 2: BITG)
structural constraint 1

The aligner itself in the bin folder is easy to use:

Usage: TsinghuaAligner [--help]
Required arguments:
  --ini-file <ini_file>    initialization file
  --src-file <src_file>    source file
  --trg-file <trg_file>    target file
  --agt-file <agt_file>    alignment file
Optional arguments:
  --nbest-list [1, +∞)     n-best list size (default: 1)
  --verbose {0, 1}         displays run-time information: 0 document level (default), 1 sentence level
  --posterior {0, 1}       outputs
In supervised learning, the log-linear model for word alignment is given by

P(a|f,e; θ) = exp(θ · h(f,e,a)) / Σ_{a'} exp(θ · h(f,e,a'))    (2)

where h(f,e,a) ∈ R^d is a real-valued vector of feature functions that capture the characteristics of bilingual word alignment and θ ∈ R^d is the corresponding vector of feature weights. In unsupervised learning, the latent-variable log-linear model for word alignment is defined as

P(f,e; θ) = Σ_a exp(θ · h(f,e,a)) / Z(θ)    (3)

Z(θ) = Σ_{f',e'} Σ_{a'} exp(θ · h(f',e',a'))    (4)

The major advantage of log-linear models is the ability to define useful features that capture various characteristics of word alignments. The features used in TsinghuaAligner mostly derive from Liu et al. (2010).

5.1.1 Translation Probability Product

To determine the correspondence of words in two languages, word-to-word translation probabilities are always the most important knowledge source. To model a symmetric alignment, a straightforward way is to compute the product of the translation probabilities of each link in two directions. For example, suppose that there is an alignment {(1,2)} for a source-language sentence f1 f2 and a target-language sentence e1 e2. The translation probability product is t(e2|f1) × t(f1|e2), where t(e|f) is the probability that f is translated to e and t(f|e) is the probability that e is translated to f, respectively. Unfortu-
wo ye xihuan yinyue
ni xihuan dushu
ni ye xihuan yinyue
wo he ni dou xihuan dushu
ta xihuan dushu ma
ta xihuan yinyue ma

>>> devset.e <<<

i like reading
i also like music
you like reading
you like music too
both you and me like reading
does he like reading
does he like music

>>> devset.wa <<<

1 4 1 2 3 1 3 2 1 4 1 1 5 5 1 6 6 1
1 2 1 2 3 1 3 4 1 4 1 0 5 5 1
...

The source and target files are similar to the training set. The gold-standard file devset.wa contains manual annotations, one line of link triples per sentence pair. "1 1 1" denotes a sure link that connects the first source word and the first target word; this usually happens for content words such as yinyue and music. In contrast, "4 1 0" denotes a possible link that connects the fourth source word (e.g., ma) and the first English word (e.g., does). Function words are usually aligned as possible links (see the illustrative parsing sketch below). Our supervised trainer relies on the development set to learn the parameters of log-linear models. The arguments of supervisedTraining.py in the scripts folder are listed as follows:

Usage: supervisedTraining.py [--help]
Required arguments:
  --src-vcb-file <file>    source
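As an aside, the devset.wa annotation format is easy to process programmatically. The following is a minimal illustrative sketch (not part of the toolkit); it assumes each line is a sequence of "j i s" triples with 1-based positions and s = 1 for sure links, s = 0 for possible links, as described above.

def parse_wa_line(line):
    """Parse one devset.wa line of 'j i s' triples into sure and possible link sets."""
    tokens = list(map(int, line.split()))
    assert len(tokens) % 3 == 0, "expected a sequence of (j, i, s) triples"
    sure, possible = set(), set()
    for j, i, s in zip(tokens[0::3], tokens[1::3], tokens[2::3]):
        (sure if s == 1 else possible).add((j, i))
    # sure links are conventionally also counted among the possible links for AER scoring
    return sure, possible | sure

For example, parse_wa_line("1 1 1 2 3 1 4 1 0") returns the sure set {(1, 1), (2, 3)} and the possible set {(1, 1), (2, 3), (4, 1)}.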
h_cc(f,e,a) = Σ_{(j,i)∈a} Σ_{(j',i')∈a} δ((j − j') × (i − i') < 0)    (13)

G_cc(f,e,a,j,i) = Σ_{(j',i')∈a} δ((j − j') × (i − i') < 0)    (14)

where δ(expr) is an indicator function that takes a boolean expression expr as its argument:

δ(expr) = 1 if expr is true, 0 otherwise    (15)

5.1.5 Neighbor Count

Moore (2005) finds that word alignments between closely related languages tend to be approximately monotonic. Even for distantly related languages, the number of crossing links is far less than chance, since phrases tend to be translated as contiguous chunks. In Figure 1, the dark points are positioned approximately in parallel with the diagonal line, indicating that the alignment is approximately monotonic. To capture such monotonicity, we follow Lacoste-Julien et al. (2006) and encourage strictly monotonic alignments by adding a bonus for a pair of links (j,i) and (j',i') such that j' = j + 1 and i' = i + 1. In Figure 1, there is one such link pair: (2,6) and (3,7). We call them monotonic neighbors. Formally, the feature function of neighbor count is given by

h_nc(f,e,a) = Σ_{(j,i)∈a} Σ_{(j',i')∈a} δ(j' = j + 1 ∧ i' = i + 1)    (16)

G_nc(f,e,a,j,i) = Σ_{(j',i')∈a} [ δ(j' = j + 1 ∧ i' = i + 1) + δ(j = j' + 1 ∧ i = i' + 1) ]    (17)

TsinghuaAligner also uses swapping neighbors in a similar way: j' = j + 1 ∧ i' = i − 1.

5.1.6 Linked Word Count

We observe that there should not be too many unaligned words in good alignments. For example, there are only two unaligned words on the target side in Figure 1: "The" and "a".
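As an illustration of the two non-local features above (this is a sketch, not the toolkit's implementation), both can be computed directly from an alignment represented as a set of (j, i) position pairs:

def cross_count(links):
    """Number of crossing link pairs, (j - j') * (i - i') < 0 (Eq. 13).
    Each crossing is counted twice, matching the double sum over ordered pairs."""
    links = list(links)
    return sum(1 for (j, i) in links for (j2, i2) in links if (j - j2) * (i - i2) < 0)

def neighbor_count(links, swap=False):
    """Number of monotonic neighbors (j+1, i+1), or swapping neighbors (j+1, i-1) if swap=True (Eq. 16)."""
    links = set(links)
    step = -1 if swap else 1
    return sum(1 for (j, i) in links if (j + 1, i + step) in links)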
3. beam size: the beam size of the search algorithm. The default value is 1.

4. structural constraint: TsinghuaAligner supports three kinds of structural constraints: arbitrary, ITG, and block ITG. The default value is 0.

5. enabling pre-pruning: pre-pruning is a technique that improves the aligning speed by constraining the search space. The default value is 0.

6. pre-pruning threshold: the threshold for pre-pruning. The default value is 0.

You need not specify optional arguments when running the script unless you want to change the default setting. An example command for supervised training is

supervisedTraining.py --src-vcb-file source.vcb --trg-vcb-file target.vcb --s2t-ttable-file source_target.tTable --t2s-ttable-file target_source.tTable --dev-src-file dev.f --dev-trg-file dev.e --dev-agt-file dev.wa

The resulting file is a configuration file, TsinghuaAligner.ini, for the aligner:

knowledge sources
source vocabulary file source.vcb
target vocabulary file target.vcb
source-to-target TTable file source_target.tTable
target-to-source TTable file target_source.tTable

feature weights
translation probability product feature weight 0.0504511
link count feature weight 0.0661723
relative position absolute distance feature weight 0.264923
cross count featur
let x' be a noisy training example derived from an observed example x. Our training data is composed of pairs of observed and noisy examples: D = {⟨x^(i), x'^(i)⟩}_{i=1}^{I}. The training objective is to maximize the difference of probabilities between observed and noisy training examples:

θ̂ = argmax_θ J(θ)    (40)

  = argmax_θ log Π_{i=1}^{I} P(x^(i); θ) / P(x'^(i); θ)    (41)

  = argmax_θ Σ_{i=1}^{I} [ log Σ_{y∈Y(x^(i))} exp(θ · h(x^(i), y)) − log Σ_{y'∈Y(x'^(i))} exp(θ · h(x'^(i), y')) ]    (42)

Accordingly, the partial derivative of J(θ) with respect to the k-th feature weight θ_k is given by

∂J(θ)/∂θ_k = Σ_{i=1}^{I} [ Σ_{y∈Y(x^(i))} P(y|x^(i); θ) h_k(x^(i), y) − Σ_{y'∈Y(x'^(i))} P(y'|x'^(i); θ) h_k(x'^(i), y') ]    (43)

           = Σ_{i=1}^{I} [ E_{y|x^(i);θ}[h_k(x^(i), y)] − E_{y'|x'^(i);θ}[h_k(x'^(i), y')] ]    (44)

The key difference is that our approach cancels out the partition function Z(θ), which poses the major computational challenge in unsupervised learning of log-linear models. However, it is still intractable to calculate the expectation with respect to the posterior distribution, E_{y|x;θ}[h_k(x, y)], for non-local features, due to the exponential search space. One possible solution is to use Gibbs sampling to draw samples from the posterior distribution P(y|x; θ) (DeNero et al., 2008), but the Gibbs sampler usually takes a long time to converge to the equilibrium distribution. Fortunately, by definition, only the alignments with the highest probabilities play a central role in calculating the expectations. If the probability mass of the log-linear model for
parametrized by a real-valued vector θ ∈ R^d is given by

P(x; θ) = Σ_{y∈Y(x)} P(x, y; θ)    (32)

        = Σ_{y∈Y(x)} exp(θ · h(x, y)) / Z(θ)    (33)

where h(x, y) ∈ R^d is a feature vector and Z(θ) is a partition function for normalization:

Z(θ) = Σ_{x∈X} Σ_{y∈Y(x)} exp(θ · h(x, y))    (34)

We use X to denote all possible pairs of source and target strings and Y(x) to denote the set of all possible alignments for a sentence pair x. Given a set of training examples {x^(i)}_{i=1}^{I}, the standard training objective is to find the parameter that maximizes the log-likelihood of the training set:

θ̂ = argmax_θ L(θ)    (35)

  = argmax_θ log Π_{i=1}^{I} P(x^(i); θ)    (36)

  = argmax_θ Σ_{i=1}^{I} [ log Σ_{y∈Y(x^(i))} exp(θ · h(x^(i), y)) − log Z(θ) ]    (37)

Standard numerical optimization methods such as L-BFGS and stochastic gradient descent (SGD) require calculating the partial derivative of the log-likelihood L(θ) with respect to the k-th feature weight θ_k:

∂L(θ)/∂θ_k = Σ_{i=1}^{I} Σ_{y∈Y(x^(i))} P(y|x^(i); θ) h_k(x^(i), y) − …

[Figure 3: (a) An observed romanized Chinese sentence ("zai huiyi shang fabiao yanjiang"), an English sentence ("he made a speech at the meeting"), and the word alignment between them; (b) a noisy training example derived from (a) by randomly permutating and substituting words.]

As the training data only consists of sentence pairs, word alignment serves as a latent variable in the log-linear model. In our approach,
the latent-variable log-linear model is expected to assign higher probabilities to observed training examples than to noisy examples.

∂L(θ)/∂θ_k = Σ_{i=1}^{I} [ Σ_{y∈Y(x^(i))} P(y|x^(i); θ) h_k(x^(i), y) − Σ_{x∈X} Σ_{y∈Y(x)} P(x, y; θ) h_k(x, y) ]    (38)

           = Σ_{i=1}^{I} [ E_{y|x^(i);θ}[h_k(x^(i), y)] − E_{x,y;θ}[h_k(x, y)] ]    (39)

As there are exponentially many sentences and alignments, the two expectations in Eq. (39) are intractable to calculate for non-local features, which are critical for measuring the fertility and non-monotonicity of alignments (Liu et al., 2010). Consequently, existing approaches have to use only local features so that dynamic programming algorithms can calculate the expectations efficiently on lattices (Dyer et al., 2011). Therefore, how to calculate the expectations of non-local features accurately and efficiently remains a major challenge for unsupervised word alignment.

5.3.2 Contrastive Learning with Top-n Sampling

Instead of maximizing the log-likelihood of the observed training data, we propose a contrastive approach to unsupervised learning of log-linear models (Liu and Sun, 2015). For example, given an observed training example as shown in Figure 3(a), it is possible to generate a noisy example as shown in Figure 3(b) by randomly shuffling and substituting words on both sides. Intuitively, we expect the probability of the observed example to be higher than that of the noisy example. This is called contrastive learning, which has been advocated by a number of authors. More formally,
[Figure 7 traces the shift-reduce algorithm on an example alignment, listing at each step the operation applied (S, R_m, or R_r) and the contents of the block stack.]

Figure 7: A shift-reduce algorithm for judging ITG alignment. The algorithm scans links from left to right on the source side. Each link (j, i) is treated as an atomic block ([j-1, j], [i-1, i]). Three operators are defined: S shifts a block, R_m merges two blocks in a monotone order, and R_r merges two blocks in an inverted order. The algorithm runs in a reduce-eager manner: blocks are merged as soon as possible. Unaligned words are attached to the nearest aligned words to the left (e.g., ([5,7], [1,3]) in step 9).

this can be reduced to O(bn) for the beam search algorithm that searches for ITG word alignments. Algorithm 2 shows the beam search algorithm for ITG alignment. (Footnote 8: If the Viterbi alignment is a full alignment and the beam size is 1, a prohibitively large number of nodes would be explored; apparently this can hardly happen in practice. For ITG alignments, however, our algorithm can reach at most the min(J, I)-th layer, because ITG only allows for one-to-one links.) The best alignment is set to empty at the beginning (line 2). The algorithm c
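The shift-reduce test described in Figure 7 can be sketched compactly. The following illustrative Python version (not the toolkit's implementation) handles the core case of one-to-one links over fully aligned sentences and omits the unaligned-word attachment step for brevity:

def is_itg(links, J, I):
    """Shift-reduce test of the ITG constraint for a set of one-to-one links.
    links: (j, i) pairs, 1-based; J, I: sentence lengths. Assumes every word is aligned."""
    stack = []  # blocks as (src_lo, src_hi, trg_lo, trg_hi), half-open spans
    for j, i in sorted(links):
        stack.append((j - 1, j, i - 1, i))            # S: shift an atomic block
        while len(stack) >= 2:                        # reduce eagerly
            s1, e1, t1l, t1h = stack[-2]
            s2, e2, t2l, t2h = stack[-1]
            if e1 == s2 and t1h == t2l:               # R_m: monotone merge
                stack[-2:] = [(s1, e2, t1l, t2h)]
            elif e1 == s2 and t2h == t1l:             # R_r: inverted merge
                stack[-2:] = [(s1, e2, t2l, t1h)]
            else:
                break
    return stack == [(0, J, 0, I)]

For example, is_itg({(1, 2), (2, 1), (3, 3)}, 3, 3) returns True, while the non-binarizable permutation {(1, 3), (2, 1), (3, 4), (4, 2)} with J = I = 4 returns False.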
3 0.868177
3 2 0.570701
1 0 0.476397
2 0 0.090127
4 4 0.996101
5 5 0.976339
6 6 0.919651
3 2 0.777563
0 3 0.646714
...
1 0 0.415113

Note that each link is assigned a probability within [0, 1] to indicate the confidence that the two words are aligned. The resulting file just contains the sets of links collected during the aligning process and does not necessarily form reasonable alignments; you must specify a threshold to prune unlikely links and obtain high-quality alignments.

3.6 Visualization

We provide a visualization tool called AlignViz to display the alignment results, which is located in the GUI folder. The input of AlignViz is three files:

>>> source.txt <<<

ta he wo dou xihuan yinyue
ni he ta dou xihuan dushu

>>> target.txt <<<

both he and i like music
both he and you like reading

>>> alignment.txt <<<

2-2 3-3 0-4 4-5 5-5 ... 6-6
0-3 1-2 2-1 3-0 4-4 5-5 6-6

To launch AlignViz, simply use the following command:

java -jar AlignViz.jar

The aligned sentence pair is shown in Figure 2.

[Figure 2: Visualization of word alignment. The AlignViz window shows the source words (ta, he, wo, dou, xihuan, yinyue) and the target words (both, he, and, i, like, music) connected by their alignment links.]

3.7 Evaluation

To evaluate our system, we need a test set to calculate the alignment error rate (AER). The test set in the exa
Note that /User/Jack/TsinghuaAligner is the directory where TsinghuaAligner is installed. You may use the command pwd to get the full path name of the root directory. Do the same to the other two Python scripts, supervisedTraining.py and unsupervisedTraining.py, to complete the installation.

3 User Guide

3.1 Quick Start

To quickly see how TsinghuaAligner works, you may take the following steps. First, download the model-ce.tar.gz file from the system website (http://nlp.csai.tsinghua.edu.cn/~ly/systems/TsinghuaAligner/TsinghuaAligner.html), which is a Chinese-English model that can be used directly by TsinghuaAligner. Unpack the file

tar xvfz model-ce.tar.gz

and move the model-ce directory to the quickstart directory. In the quickstart directory there are three files: chinese.txt (Chinese text), english.txt (English text), and TsinghuaAligner.ini (system configuration file). Then simply run the following command:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file chinese.txt --trg-file english.txt --agt-file alignment.txt

The resulting file alignment.txt contains the word alignment in the Moses format:

0-0 1-2 2-3 3-5 ...

where 0-0 denotes that the first Chinese word is aligned to the first English word. In the following, we will introduce how to train the alignment models and how to align unseen parallel text.

3.2 Running GIZA++

As
noted before, TsinghuaAligner is based on log-linear models that are capable of incorporating arbitrary features to capture various characteristics of word alignment. One of the most important features in TsinghuaAligner is the translation probability product (see Section 5), derived from GIZA++, the state-of-the-art generative alignment system. As a result, we need to run GIZA++ before training the log-linear models. This can be done by running the GIZA.py script in the scripts directory. The input of this script is a parallel corpus, which can be found in the example/trnset folder.

The source file trnset.f contains source-language sentences. Each line corresponds to a tokenized source-language sentence. We strongly recommend using UTF-8 encoding for all the training, development, and test sets. In addition, each sentence should contain no more than 100 words, because GIZA++ usually truncates long sentences before training, which might lead to unexpected problems.

wo xihuan dushu
wo ye xihuan yinyue
ni xihuan dushu
ni ye xihuan yinyue
wo he ni dou xihuan dushu
ta xihuan dushu ma
ta xihuan yinyue ma

The target file trnset.e contains target-language sentences. Each line corresponds to a tokenized target-language sentence. For English, lowercasing is usually applied to reduce data sparseness and improve accuracy. Note that the source and target sentences with the same line number are assumed to be tr
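Because overly long sentences can cause trouble for GIZA++, it may help to filter the parallel corpus before training. The following is purely an illustrative sketch (not part of the toolkit); the 100-word limit follows the recommendation above, and the file names are placeholders:

# drop sentence pairs in which either side exceeds max_len words
max_len = 100
with open("trnset.f", encoding="utf-8") as fsrc, \
     open("trnset.e", encoding="utf-8") as ftrg, \
     open("trnset.filtered.f", "w", encoding="utf-8") as osrc, \
     open("trnset.filtered.e", "w", encoding="utf-8") as otrg:
    for src, trg in zip(fsrc, ftrg):
        if len(src.split()) <= max_len and len(trg.split()) <= max_len:
            osrc.write(src)
            otrg.write(trg)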
posterior link probabilities (default: 0)
  --help                   prints this message to STDOUT

There are three optional arguments:

1. n-best list size: the size of the n-best list. The default value is 1.

2. verbose: display run-time information. The default value is 0.

3. posterior: output posterior link probabilities. The default value is 0.

Suppose we have unseen source and target sentences as follows:

>>> source.txt <<<

ta he wo dou xihuan yinyue
ni he ta dou xihuan dushu

>>> target.txt <<<

both he and i like music
both he and you like reading

Let's run the aligner to produce an alignment for the sentences:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file source.txt --trg-file target.txt --agt-file alignment.txt

The resulting file alignment.txt is in the Moses format:

1-0 2-3 3-2 4-4 5-5 6-6
0-3 3-2 4-4 5-5 6-6

where 1-0 denotes that the second source word is aligned to the first target word. Note that the index of the first word is 0 rather than 1.

Sometimes we are interested in how well two words are aligned. TsinghuaAligner is able to output link posterior probabilities to indicate the degree of alignedness. This can be done simply by turning the posterior option on:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file source.txt --trg-file target.txt --agt-file alignment.txt --posterior 1

The resulting file is as follows:

4 4 0.997164
5 5 0.981469
6 6 0.941433
2 ...
Let the cardinality of N(x; θ) be n. We refer to Eq. (47) as top-n sampling, because the approximate posterior distribution is normalized over the top-n alignments:

P̃(y|x; θ) = exp(θ · h(x, y)) / Σ_{y'∈N(x;θ)} exp(θ · h(x, y'))    (49)

In this paper, we use the beam search algorithm proposed by Liu et al. (2010) to retrieve the top-n alignments from the full search space. Starting with an empty alignment, the algorithm keeps adding links until the alignment score no longer increases. During this process, local and non-local feature values can be calculated incrementally and efficiently. The algorithm generally runs in O(blm) time, where b is the beam size. As it is intractable to calculate the objective function in Eq. (42) exactly, we use the stochastic gradient descent (SGD) algorithm for parameter optimization, which requires calculating the partial derivatives with respect to the feature weights on single training examples, with the expectations approximated using P̃(y|x; θ) from Eq. (49).

5.4 Search

5.4.1 The Beam Search Algorithm

Given a source-language sentence f and a target-language sentence e, we try to find the candidate alignment with the highest model score:

â = argmax_a P(a|f, e)    (50)

  = argmax_a θ · h(f, e, a)    (51)

To do this, we begin with an empty alignment and keep adding new links until the model score of the current alignment no longer increases. Graphically speaking, the search space of a sentence pair can be organized as a directed acyclic graph. Each node in the graph is a candidate alignm
Aligner.tar.gz

2.2.2 Step 2: Compiling

Entering the TsinghuaAligner directory, you will find five folders (code, doc, example, GUI, and scripts) and one script file, install.sh. The code folder includes the source code, the doc folder includes the documentation, the example folder contains example data, the GUI folder contains the visualization tool AlignViz, and the scripts folder includes Python scripts for training the system.

First, change the mode of the installation script:

chmod +x install.sh

Then install the system by running the script:

./install.sh

If everything goes well, you should see the following information:

Creating a directory to store binary executables... done.
Compiling TsinghuaAligner... done.
Compiling waEval... done.
Compiling convertNBestListFormat... done.
Compiling genIni... done.
Compiling mergeNBestList... done.
Compiling optimizeAER... done.
Chmoding genNoise... done.
Compiling trainer... done.
Chmoding scripts... done.
The system is installed successfully.

The script creates a folder bin to place all binary executables:

1. TsinghuaAligner: the main component of the system, which produces word alignments for parallel corpora.

2. optimizeAER: the main component for supervised training.

3. trainer: the main component for unsupervised training.

4. convertNBestListFormat: converts the n-best list format for supervised tra
Columbia, Canada.

Moore, R. C., Yih, W.-t., and Bode, A. (2006). Improved discriminative bilingual word alignment. In Proceedings of COLING/ACL 2006, pages 513-520, Sydney, Australia.

Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003, pages 160-167, Sapporo, Japan.

Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Taskar, B., Lacoste-Julien, S., and Klein, D. (2005). A discriminative matching approach to word alignment. In Proceedings of HLT/EMNLP 2005, pages 73-80, Vancouver, British Columbia, Canada.

Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics.

Xiong, D., Liu, Q., and Lin, S. (2006). Maximum entropy based phrase reordering model for statistical machine translation. In Proceedings of ACL 2006.
Sampling alignment structure under a Bayesian translation model. In Proceedings of EMNLP 2008.

Dyer, C., Chahuneau, V., and Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL 2013.

Dyer, C., Clark, J. H., Lavie, A., and Smith, N. A. (2011). Unsupervised word alignment with arbitrary features. In Proceedings of ACL 2011.

Haghighi, A., Blitzer, J., DeNero, J., and Klein, D. (2009). Better word alignments with supervised ITG models. In Proceedings of ACL 2009.

Lacoste-Julien, S., Taskar, B., Klein, D., and Jordan, M. I. (2006). Word alignment via quadratic assignment. In Proceedings of HLT-NAACL 2006, pages 112-119, New York City, USA.

Li, P., Liu, Y., and Sun, M. (2012). A beam search algorithm for ITG word alignment. In Proceedings of COLING 2012.

Liu, Y., Liu, Q., and Lin, S. (2005). Log-linear models for word alignment. In Proceedings of ACL 2005.

Liu, Y., Liu, Q., and Lin, S. (2010). Discriminative word alignment by linear modeling. Computational Linguistics.

Liu, Y. and Sun, M. (2015). Contrastive unsupervised word alignment with non-local features. In Proceedings of AAAI 2015.

Liu, Y., Xia, T., Xiao, X., and Liu, Q. (2009). Weighted alignment matrices for statistical machine translation. In Proceedings of EMNLP 2009.

Moore, R. C. (2005). A discriminative framework for bilingual word alignment. In Proceedings of HLT/EMNLP 2005, pages 81-88, Vancouver, British
Table

The resulting file of unsupervised training is also a configuration file, TsinghuaAligner.ini, for the aligner.

3.5 Aligning Unseen Parallel Corpus

TsinghuaAligner takes a configuration file TsinghuaAligner.ini as input:

knowledge sources
source vocabulary file source.vcb
target vocabulary file target.vcb
source-to-target TTable file source_target.tTable
target-to-source TTable file target_source.tTable

feature weights
translation probability product feature weight 0.0504511
link count feature weight 0.0661723
relative position absolute distance feature weight 0.264923
cross count feature weight 0.0588821
mono neighbor count feature weight 0.137836
swap neighbor count feature weight 0.049596
source linked word count feature weight 0.00257702
target linked word count feature weight 0.0229796
source maximal fertility feature weight 0.072508
target maximal fertility feature weight 0.0126342
source sibling distance feature weight 0.072326
target sibling distance feature weight 0.0100039
one-to-one link count feature weight 0.0212899
one-to-many link count feature weight 0.0310621
many-to-one link count feature weight 0.0334263
many-to-many link count feature weight 0.0933321

search setting
beam size 1
structural constraint
TsinghuaAligner: A Statistical Bilingual Word Alignment System

Yang Liu
Tsinghua University
liuyang2011@tsinghua.edu.cn

April 22, 2015

1 Introduction

Word alignment is a natural language processing task that aims to identify the correspondence between words in two languages.

[Figure 1: Example of word alignment. A romanized Chinese sentence "zongtong zai niuyue fabiao jianghua", an English sentence "The President made a speech at New York", and the links between them.]

Figure 1 shows a romanized Chinese sentence, an English sentence, and the word alignment between them. The links highlighted in blue indicate the correspondence between Chinese and English words. Word alignment is a challenging task because both the lexical choices and the word orders in the two languages are significantly different. For example, the English function words "The" and "a" have no counterparts in Chinese. In addition, a verb phrase (e.g., "made a speech") is usually followed by a prepositional phrase (e.g., "at New York") in English, but the order is reversed in Chinese.

TsinghuaAligner is a statistical bilingual word alignment system that takes a set of parallel sentences as input and produces word alignments automatically. It has the following features:

1. Language independence: The system is language-independent and can be used for any language pair.

2. Extensibility: Our system is based on log-linear models, which are capable of incorporating arbitrary information sources as features. Therefore, th
Unaligned words are usually function words that have little lexical meaning but instead serve to express grammatical relationships with other words or to specify the attitude or mood of the speaker. To control the number of unaligned words, we follow Moore et al. (2006) and introduce a linked word count feature that simply counts the number of aligned words:

h_lwc(f,e,a) = Σ_{j=1}^{J} δ(ψ_j > 0) + Σ_{i=1}^{I} δ(φ_i > 0)    (18)

G_lwc(f,e,a,j,i) = δ(ψ_j = 0) + δ(φ_i = 0)    (19)

where ψ_j and φ_i are the fertilities before adding the link l = (j,i). TsinghuaAligner separates the two terms in the above two equations to distinguish between source linked word count and target linked word count.

5.1.7 Maximal Fertility

To control the maximal number of source words aligned to the same target word, and vice versa, we introduce the maximal fertility feature:

h_mf(f,e,a) = max_j ψ_j + max_i φ_i    (20)

It is not straightforward to calculate the feature gain. In practice, TsinghuaAligner maintains the positions of the maximal fertilities and calculates the feature gain without checking all links. TsinghuaAligner distinguishes between source maximal fertility and target maximal fertility.

5.1.8 Sibling Distance

In word alignments, there are usually several words connected to the same word on the other side. For example, in Figure 1, the two English words "New" and "York" are aligned to one Chinese word, niuyue. We call the words aligned to the same word on the other side siblings. A word often tends to produce a series of words in
alculate the difference of feature values instead of the values themselves, which we call the feature gain. The feature gain for translation probability product is

G_tpp(f,e,a,j,i) = log t(e_i|f_j) + log t(f_j|e_i)
                   − δ(ψ_j, 0) × log t(f_j|e_0)
                   − δ(φ_i, 0) × log t(e_i|f_0)    (8)

where ψ_j and φ_i are the fertilities before adding the link (j,i). (For clarity, we write G_tpp(f,e,a,j,i) instead of G_tpp(f,e,a,l), because j and i appear in the equation.) Although this feature is symmetric, we obtain the translation probabilities t(f|e) and t(e|f) by training the IBM models with GIZA++ (Och and Ney, 2003).

5.1.2 Link Count

Given a source sentence with J words and a target sentence with I words, there are J × I possible links. However, the actual number of links in a reference alignment is usually far smaller. For example, there are only 6 links in Figure 1, although the maximum is 5 × 8 = 40. The number of links has an important effect on alignment quality: more links result in higher recall, while fewer links result in higher precision. A good trade-off between recall and precision usually results from a reasonable number of links. Using the number of links as a feature can also alleviate the bias problem posed by the translation probability product feature (see Section 5.1.1). A negative weight for the link count feature leads to fewer links, while a positive weight favors more links. Formall
another language that belong together, while others tend to produce a series of words that should be separate. To model this tendency, we introduce a feature that sums up the distances between siblings. Formally, we use ω_{j,k} to denote the position of the k-th target word aligned to a source word f_j, and τ_{i,k} to denote the position of the k-th source word aligned to a target word e_i. Obviously, ω_{j,k+1} is always greater than ω_{j,k} by definition. As "New" and "York" are siblings, we define the distance between them as ω_{3,2} − ω_{3,1} − 1 = 0. The distance sum of f_j can be efficiently calculated as

Δ(j) = Σ_{k=1}^{ψ_j − 1} (ω_{j,k+1} − ω_{j,k} − 1)  if ψ_j > 1;  0 otherwise    (21)

Accordingly, the distance sum of e_i is

∇(i) = Σ_{k=1}^{φ_i − 1} (τ_{i,k+1} − τ_{i,k} − 1)  if φ_i > 1;  0 otherwise    (22)

Formally, the feature function of sibling distance is given by

h_sd(f,e,a) = Σ_{j=1}^{J} Δ(j) + Σ_{i=1}^{I} ∇(i)    (23)

The corresponding feature gain is

G_sd(f,e,a,j,i) = Δ(j, ψ_j + 1) − Δ(j, ψ_j) + ∇(i, φ_i + 1) − ∇(i, φ_i)    (24)

where ψ_j and φ_i are the fertilities before adding the link (j,i), and Δ(j, ·) and ∇(i, ·) denote the distance sums computed with the corresponding numbers of aligned words. TsinghuaAligner distinguishes between source sibling distance and target sibling distance.

5.1.9 Link Type Count

Due to the different fertilities of words, there are different types of links. For instance, one-to-one links indicate that one source word (e.g., zongtong) is translated into exactly one target word (e.g., President), while many-to-many links exist for phrase-to-phrase translation. The
anslations of each other. For example, the first sentence in trnset.f and the first sentence in trnset.e constitute a parallel sentence.

i like reading
i also like music
you like reading
you like music too
both you and me like reading
does he like reading
does he like music

To run GIZA++, simply use the GIZA.py script in the scripts folder with the two above files as input:

GIZA.py trnset.f trnset.e

GIZA++ training usually takes a long time, especially for large parallel corpora (i.e., millions of sentence pairs). We recommend using nohup to keep the training script running after you exit from a shell prompt:

nohup GIZA.py trnset.f trnset.e &

When GIZA++ training is complete, there should be four resulting files:

1. source.vcb: source-language vocabulary

2. target.vcb: target-language vocabulary

3. source_target.tTable: source-to-target translation probability table

4. target_source.tTable: target-to-source translation probability table

These files will be used in both supervised and unsupervised training.

3.3 Supervised Training

In supervised training, we require a set of parallel sentences manually annotated with gold-standard alignments, which we refer to as the development set. The example development set in the example/devset folder contains three files:

>>> devset.f <<<

wo xihuan dushu
at minimize error scores on a representative hand-aligned training corpus. Given a reference alignment r and a candidate alignment a, we use a loss function E(r, a) to measure alignment performance; note that E(r, a) can be any error function. Given a bilingual corpus {⟨f_s, e_s⟩}_{s=1}^{S} with reference alignments {r_s}_{s=1}^{S} and a set of M different candidate alignments C_s = {a_s^1, ..., a_s^M} for each sentence pair ⟨f_s, e_s⟩, our goal is to find a set of feature weights θ̂ that minimizes the overall loss on the training corpus:

θ̂ = argmin_θ Σ_{s=1}^{S} E(r_s, â(f_s, e_s; θ))    (29)

  = argmin_θ Σ_{s=1}^{S} Σ_{m=1}^{M} E(r_s, a_s^m) δ(â(f_s, e_s; θ), a_s^m)    (30)

where â(f_s, e_s; θ) is the best candidate alignment produced by the linear model:

â(f_s, e_s; θ) = argmax_{a∈C_s} θ · h(f_s, e_s, a)    (31)

The basic idea of MERT is to optimize only one parameter (i.e., feature weight) at a time while keeping all other parameters fixed. This process runs iteratively over all the parameters until the overall loss on the training corpus no longer decreases. Please refer to Liu et al. (2010) for more details.

5.3 Unsupervised Training

5.3.1 Unsupervised Learning of Log-Linear Models

To allow for unsupervised word alignment with arbitrary features, latent-variable log-linear models have been studied in recent years (Berg-Kirkpatrick et al., 2010; Dyer et al., 2011, 2013). Let x = ⟨f, e⟩ be a pair of source and target sentences and y be the word alignment. A latent-variable log-linear model
distribution of link types differs for different language pairs. For example, one-to-one links occur more frequently in closely related language pairs (e.g., French-English), while one-to-many links are more common in distantly related language pairs (e.g., Chinese-English). To capture the distribution of link types independently of languages, we use features to count the different types of links. Following Moore (2005), we divide the links in an alignment into four categories:

1. one-to-one links, in which neither the source nor the target word participates in other links;

2. one-to-many links, in which only the source word participates in other links;

3. many-to-one links, in which only the target word participates in other links;

4. many-to-many links, in which both the source and target words participate in other links.

In Figure 1, (1,2), (2,6), (4,3), and (5,5) are one-to-one links, and the others are one-to-many links. As a result, we introduce four features:

h_o2o(f,e,a) = Σ_{(j,i)∈a} δ(ψ_j = 1 ∧ φ_i = 1)    (25)

h_o2m(f,e,a) = Σ_{(j,i)∈a} δ(ψ_j > 1 ∧ φ_i = 1)    (26)

h_m2o(f,e,a) = Σ_{(j,i)∈a} δ(ψ_j = 1 ∧ φ_i > 1)    (27)

h_m2m(f,e,a) = Σ_{(j,i)∈a} δ(ψ_j > 1 ∧ φ_i > 1)    (28)

Their feature gains cannot be calculated in a straightforward way, because the addition of a link may change the link types of its siblings on both the source and target sides. Please refer to Liu et al. (2010) for the algorithm that calculates the four feature gains.

5.2 Supervised Traini
e-file <file>                  target-to-source TTable
Optional arguments:
  --training-corpus-size [1, +∞)     training corpus size (default: 10)
  --sent-length-limit [1, +∞)        sentence length limit (default: 100)
  --shuffle {0, 1}                   shuffling words randomly (default: 1)
  --replace {0, 1}                   replacing words randomly (default: 0)
  --insert {0, 1}                    inserting words randomly (default: 0)
  --delete {0, 1}                    deleting words randomly (default: 0)
  --beam-size [1, +∞)                beam size (default: 5)
  --sample-size [1, +∞)              sample size (default: 10)
  --sgd-iter-limit [1, +∞)           SGD iteration limit (default: 100)
  --sgd-converge-threshold <float>   SGD convergence threshold (default: 0.01)
  --sgd-converge-limit [1, +∞)       SGD convergence limit (default: 3)
  --sgd-lr-numerator <float>         SGD learning rate numerator (default: 1.0)
  --sgd-lr-denominator <float>       SGD learning rate denominator (default: 1.0)

The optional arguments include:

1. training corpus size: the number of training examples used for training. It turns out that our unsupervised trainer works pretty well using only a small number of training examples. The default value is 10.

2. sentence length limit: the maximal length of sentences in the training corpus. The default value is 100.

3. shuffle: our unsupervised training algorithm is based on a contrastive learning approach, which differe
e system is highly extensible: it is possible to design and include new features to capture the characteristics of specific languages.

3. Supervised learning: Provided with a manually annotated parallel corpus, the system is able to learn to minimize the difference between its alignments and the manual alignments.

4. Unsupervised learning: The system is also capable of learning from unlabeled data automatically and delivers pretty good performance.

5. Structural constraints: TsinghuaAligner supports a variety of structural constraints, such as many-to-many, ITG, and block ITG. These constraints prove to be effective in modeling the structural divergence between natural languages.

6. Link posterior probabilities: The system is capable of producing the posterior probability for each link in an alignment to indicate the confidence that two words are aligned.

2 Installation

2.1 System Requirements

TsinghuaAligner supports Linux i686 and Mac OS X. You need to install the following free third-party software to build TsinghuaAligner:

1. GIZA++: it can be downloaded at https://code.google.com/p/giza-pp

2. g++: version 4.6.3 or higher

3. Python: version 2.7.3

4. JRE: 1.6 or higher (optional; only used for the visualization tool AlignViz)

2.2 Installing TsinghuaAligner

Here is a brief guide on how to install TsinghuaAligner.

2.2.1 Step 1: Unpacking

Unpack the package using the following command:

tar xvfz Tsinghua
e weight 0.0588821
mono neighbor count feature weight 0.137836
swap neighbor count feature weight 0.049596
source linked word count feature weight 0.00257702
target linked word count feature weight 0.0229796
source maximal fertility feature weight 0.072508
target maximal fertility feature weight 0.0126342
source sibling distance feature weight 0.072326
target sibling distance feature weight 0.0100039
one-to-one link count feature weight 0.0212899
one-to-many link count feature weight 0.0310621
many-to-one link count feature weight 0.0334263
many-to-many link count feature weight 0.0933321

search setting
beam size 1
structural constraint (0: arbitrary, 1: ITG, 2: BITG)
structural constraint 0

speed-up setting
enable pre-pruning 0
pre-pruning threshold 0.0

We will explain the configuration file in detail in Section 3.5.

3.4 Unsupervised Training

As manual annotation is labor-intensive, it is appealing to learn the feature weights directly from unlabeled data. Our unsupervised trainer, unsupervisedTraining.py in the scripts folder, only uses the training set for parameter estimation.

Usage: unsupervisedTraining.py [--help]
Required arguments:
  --src-file <file>        source file
  --trg-file <file>        target file
  --src-vcb-file <file>    source vocabulary
  --trg-vcb-file <file>    target vocabulary
  --s2t-ttable-file <file> source-to-target TTable
  --t2s-ttabl
ent, and each edge corresponds to a link. We define that the alignments with the same number of links constitute a level. There are 2^{J×I} possible nodes and J × I + 1 levels in a graph. Our goal is to find the node with the highest model score in the search graph. As the search space of word alignment is exponential, although enumerable, it is computationally prohibitive to explore the whole graph. Instead, we can search efficiently in a greedy way.

During the above search process, we expect that the addition of a single link l to the current best alignment a will result in a new alignment a ∪ {l} with a higher score:

θ · h(f, e, a ∪ {l}) − θ · h(f, e, a) > 0    (52)

As a result, we can remove most of the computational overhead by calculating only the difference of the scores instead of the scores themselves. The difference of alignment scores resulting from the addition of a link, which we refer to as a link gain, is defined as

G(f, e, a, l) = θ · g(f, e, a, l)    (53)

Algorithm 1: A beam search algorithm for word alignment

1:  procedure ALIGN(f, e)
2:    open ← ∅                         ▷ a list of active alignments
3:    N ← ∅                            ▷ n-best list
4:    a ← ∅                            ▷ begin with an empty alignment
5:    ADD(open, a, β, b)               ▷ initialize the list
6:    while open ≠ ∅ do
7:      closed ← ∅                     ▷ a list of promising alignments
8:      for all a ∈ open do
9:        for all l ∈ J × I − a do     ▷ enumerate all possible new links
10:         a' ← a ∪ {l}               ▷ produce a new alignment
11:         g ← GAIN(f, e, a, l)       ▷ compute the link gain
12:         if g
es we are interested in the confidence of aligning a specific word pair rather than in the overall word alignment. To do so, we define the link posterior probability as

P(l|f, e) = Σ_a P(a|f, e) × δ(l ∈ a)    (62)

          = E_{a|f,e}[ δ(l ∈ a) ]    (63)

This is especially useful for building bilingual lexicons and inducing weighted alignment matrices (Liu et al., 2009). However, due to the exponential search space of word alignment, it is intractable to calculate the posteriors exactly with log-linear models. Alternatively, TsinghuaAligner resorts to an approximate approach:

P(l|f, e) ≈ Σ_{a∈Ω} P(a|f, e) × δ(l ∈ a) / Σ_{a∈Ω} P(a|f, e)    (64)

          = Σ_{a∈Ω} exp(θ · h(f, e, a)) × δ(l ∈ a) / Σ_{a∈Ω} exp(θ · h(f, e, a))    (65)

where Ω is the explored search space, i.e., the set of nodes visited in the search graph during the aligning process. (Footnote 9: In practice, the algorithm can be made even more efficient by recording the sequence of blocks in each hypothesis without unaligned-word attachment; block merging therefore need not start from scratch for each hypothesis.)

References

Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., and Klein, D. (2010). Painless unsupervised learning with features. In Proceedings of NAACL 2010.

Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.

DeNero, J., Bouchard-Côté, A., and Klein, D. (2008).
gorithm proposed by Liu et al. (2010) keeps adding single links to the current alignments until none of the expanded alignments has a higher probability. For example, adding single links to the empty alignment a0 results in four expanded alignments; suppose only one of them, say a', has a higher probability than a0. Expanding a' then gives three new alignments. If all of them have lower probabilities than a', the algorithm returns a' as the optimal alignment. From a graphical point of view, the search space is organized as a directed acyclic graph that consists of 2^{J×I} nodes and J × I × 2^{J×I−1} edges. The nodes are divided into J × I + 1 layers. The number of nodes in the k-th layer (k = 0, ..., J × I) is the binomial coefficient C(J × I, k), so the maximum layer width is C(J × I, ⌊J × I / 2⌋). The goal of word alignment is to find the node that has the highest probability in the graph. The major difference of our algorithm from Algorithm 1 is that we only consider ITG alignments, which are highlighted by shading in Figure 6. Wu (1997) shows that ITG alignments account for only a small fraction of the full search space, and the percentage is even lower for long sentences. As the worst-case running time is O(bn) (where b is the beam size) for the beam search algorithm of Liu et al. (2010),
> 0 then                        ▷ ensure that the score will increase
13:          ADD(closed, a', β, b)     ▷ update promising alignments
14:        end if
15:        ADD(N, a', β, n)            ▷ update the n-best list
16:      end for
17:    end for
18:    open ← closed                   ▷ update active alignments
19:  end while
20:  return N                          ▷ return the n-best list
21: end procedure

where g(f, e, a, l) is a feature gain vector, i.e., the vector of the increments of the feature values after adding a link l to the current alignment a:

g(f, e, a, l) = h(f, e, a ∪ {l}) − h(f, e, a)    (54)

In our experiments, we use a beam search algorithm that is more general than the above greedy algorithm. In the greedy algorithm, we retain at most one candidate in each level of the search graph while traversing top-down; in the beam search algorithm, we retain at most b candidates at each level. Algorithm 1 shows the beam search algorithm. The input is a source-language sentence f and a target-language sentence e (line 1). The algorithm maintains a list of active alignments, open (line 2), and an n-best list N (line 3). The aligning process begins with an empty alignment a (line 4), and the procedure ADD(open, a, β, b) adds a to open (line 5). The procedure prunes the search space by discarding any alignment whose score is worse than (1) β multiplied by the best score in the list, or (2) the score of the b-th best alignment in the list. For each iteration (line 6), we use a list closed to store promising alignments that have higher scores than the cur
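To make the search procedure concrete, here is a small illustrative sketch (not the toolkit's implementation) of the greedy special case (beam size 1) described above: starting from the empty alignment, it repeatedly adds the link with the largest positive score gain. It assumes a caller-supplied function score(links) that returns θ · h(f, e, a) for a set of links; the real system instead computes incremental feature gains and keeps b candidates per level.

def greedy_align(src_len, trg_len, score):
    """Greedy hill-climbing over the alignment graph (beam size 1).
    score: a function mapping a frozenset of 1-based (j, i) links to the model score."""
    links = frozenset()
    current = score(links)
    candidates = [(j, i) for j in range(1, src_len + 1) for i in range(1, trg_len + 1)]
    while True:
        best_gain, best_link = 0.0, None
        for l in candidates:
            if l in links:
                continue
            gain = score(links | {l}) - current    # link gain of adding l
            if gain > best_gain:
                best_gain, best_link = gain, l
        if best_link is None:                      # no link increases the score: stop
            return links
        links = links | {best_link}
        current += best_gain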
h translation probability product is degenerate as a generative model, the bias problem can be alleviated when this feature is combined with other features such as link count (see Section 5.1.2). (Footnote 5: We use the logarithmic form of the translation probability product to avoid manipulating very small numbers, e.g., 4.3 × e^{-100}, purely for practical reasons.)

Table 1: Calculating feature values of translation probability product for a source sentence f1 f2 and a target sentence e1 e2.

alignment            feature value
∅                    log [ t(e1|f0) t(e2|f0) t(f1|e0) t(f2|e0) ]
{(1,2)}              log [ t(e1|f0) t(e2|f1) t(f1|e2) t(f2|e0) ]
{(1,2), (2,2)}       log [ t(e1|f0) t(e2|f1) t(f1|e2) t(e2|f2) t(f2|e2) ]

Similarly, the fertility of a target word e_i is the number of aligned source words:

φ_i = Σ_{(j',i')∈a} δ(i', i)    (7)

For example, as only one English word (President) is aligned to the first Chinese word zongtong in Figure 1, the fertility of zongtong is ψ_1 = 1. Similarly, the fertility of the third Chinese word niuyue is ψ_3 = 2, because there are two aligned English words. The fertility of the first English word The is φ_1 = 0. Obviously, the words with zero fertilities (e.g., The and a in Figure 1) are unaligned. In Eq. (5), the first term calculates the product over aligned words, the second term deals with unaligned source words, and the third term deals with unaligned target words. Table 1 shows the feature values for some word alignments. For efficiency, we need to c
ining.

5. genIni: generates configuration files for supervised training.

6. mergeNBestList: merges the n-best lists of multiple iterations for supervised learning.

7. genNoise.py: generates noises for unsupervised learning.

8. waEval: evaluates the system in terms of alignment error rate.

2.2.3 Step 3: Compiling GIZA++

TsinghuaAligner takes the translation probabilities derived from GIZA++ as the central feature in the log-linear model. As a result, GIZA++ needs to be properly installed to run our system. Please visit https://code.google.com/p/giza-pp to download GIZA++ and compile it according to its user manual. After installing GIZA++, copy the following binary executables to the bin folder:

1. GIZA++: the main component for training IBM models

2. mkcls: trains word classes on a monolingual corpus

3. plain2snt.out: converts plain files to the snt format

4. snt2cooc.out: collects co-occurrence statistics

Note that there should be 12 binary executables in the bin folder if the system is properly installed.

2.2.4 Step 4: Locating the Executables

Training TsinghuaAligner is mainly done with the Python scripts in the scripts folder. You need to enable these scripts to locate the binary executables by modifying the root_dir variable in each script. For example, you may change line 7 of the GIZA.py script,

root_dir = ...

to

root_dir = /User/Jack/TsinghuaAligner
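If you prefer not to edit the scripts by hand, a small helper like the following can set root_dir in all three scripts at once. This is only an illustrative sketch: it assumes each script assigns root_dir with a plain assignment line (root_dir = ...) and that the path below is replaced with your own install directory.

import pathlib
import re

root = "/User/Jack/TsinghuaAligner"   # replace with the output of pwd in your install directory
for name in ("GIZA.py", "supervisedTraining.py", "unsupervisedTraining.py"):
    path = pathlib.Path("scripts") / name
    text = path.read_text()
    # rewrite the first line that assigns root_dir
    text = re.sub(r"^root_dir\s*=.*$", 'root_dir = "%s"' % root, text, count=1, flags=re.M)
    path.write_text(text)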
mple/tstset folder is:

>>> tstset.f <<<

ta he wo dou xihuan yinyue
ni he ta dou xihuan dushu

>>> tstset.e <<<

both he and i like music
both he and you like reading

>>> tstset.wa <<<

1 2 1 2 3 1 3 4 1 4 1 1 5 5 1 6 6 1
1 2 1 2 3 1 3 4 1 4 1 1 5 5 1 6 6 1

Now let's run the aligner to produce an alignment for the test set:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file tstset.f --trg-file tstset.e --agt-file alignment.txt

The resulting file alignment.txt is in the Moses format:

1-0 2-3 3-2 4-4 5-5 6-6
0-3 3-2 4-4 5-5 6-6

Finally, you may run the waEval program in the bin folder to calculate the AER score:

waEval tstset.wa alignment.txt

The evaluation result is shown as follows:

total matched sure: 5
total matched possible: 5
total actual: 11
total sure: 12
Precision: 0.454545
Recall: 0.416667
AER: 0.565217
Top 10 wrong predictions:
2 0.636364
1 0.5

waEval outputs not only the overall AER score but also the AER score for each sentence pair. It lists the top 10 wrong sentence pairs to facilitate error analysis.

4 Additional Datasets

In the TsinghuaAligner.tar.gz package, we only offer toy training, development, and test sets to show how to use the system. In practice, you need a large training corpus and manually annotated development and test sets for ru
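For reference, the numbers above follow the standard AER definitions: Precision = |A ∩ P| / |A|, Recall = |A ∩ S| / |S|, and AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted link set, S the sure links, and P the possible links. The following is an illustrative sketch (not the waEval source) that reproduces the scores from link sets:

def aer(sure, possible, predicted):
    """Alignment error rate from sets of (j, i) links (S ⊆ P assumed)."""
    match_s = len(predicted & sure)
    match_p = len(predicted & possible)
    precision = match_p / len(predicted)
    recall = match_s / len(sure)
    error = 1.0 - (match_s + match_p) / (len(predicted) + len(sure))
    return precision, recall, error

With |A| = 11, |S| = 12, and 5 matches of each kind, this yields Precision 0.4545, Recall 0.4167, and AER 0.5652, matching the output above.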
nately, the underlying model is biased: the more links are added, the smaller the product becomes. For example, if we add a link (2,2) to the current alignment and obtain a new alignment {(1,2), (2,2)}, the resulting product will decrease after being multiplied by t(e2|f2) × t(f2|e2):

t(e2|f1) × t(f1|e2) × t(e2|f2) × t(f2|e2)

The problem results from the absence of empty cepts. Following Brown et al. (1993), a cept in an alignment is either a single source word or empty. They assign cepts to positions in the source sentence and reserve position zero for the empty cept; all unaligned target words are assumed to be aligned to the empty cept. For example, in the current example alignment {(1,2)}, the unaligned target word e1 is said to be aligned to the empty cept f0. As our model is symmetric, we use f0 to denote the empty cept on the source side and e0 to denote the empty cept on the target side, respectively. If we take empty cepts into account, the product for {(1,2)} can be rewritten as

t(e2|f1) × t(f1|e2) × t(e1|f0) × t(f2|e0)

Similarly, the new product for {(1,2), (2,2)} becomes

t(e2|f1) × t(f1|e2) × t(e2|f2) × t(f2|e2) × t(e1|f0)

Note that after adding the link (2,2), the new product still has more factors than the old product. However, the new product is not necessarily smalle
ng

In supervised training, we are given the gold-standard alignments for the parallel corpus. Using the minimum error rate training (MERT) algorithm (Och, 2003), training log-linear models actually reduces to training linear models.

Table 2: Example feature values and error scores.

candidate    h1    h2    h3    AER
a1           85     4    10    0.21
a2           89     3    12    0.20
a3           93     6    11    0.22

Suppose we have three candidate alignments a1, a2, and a3. Their error scores are 0.21, 0.20, and 0.22, respectively; therefore a2 is the best candidate alignment, a1 is the second best, and a3 is the third best. We use three features to score each candidate, and Table 2 lists the feature values. If the set of feature weights is (1.0, 1.0, 1.0), the linear model scores of the three candidates are 71, 74, and 76, respectively. In this case, the alignment with the maximal model score is not the one preferred by the reference. This is unpleasant, because the model fails to agree with the reference. If we change the feature weights to (1.0, 2.0, 2.0), the model scores become 73, 71, and 83, respectively, and the model now agrees with the reference in choosing the best candidate.

If a set of feature weights manages to make the model predictions agree with the reference alignments on the training examples, we would expect the model to achieve good alignment quality on unseen data as well. To do this, the MERT algorithm can be used to find feature weights th
nning the system. We offer the following additional Chinese-English datasets for free on our system website:

1. model-ce.tar.gz: the model files (see Section 3.1), trained by GIZA++ on millions of Chinese-English sentence pairs. Note that all Chinese words are composed of halfwidth characters and all English words are lowercased.

2. Chinese-English training set: the training set comprises parallel corpora from the United Nations website (UN) and the Hong Kong Government website (HK). The UN part contains 43K sentence pairs and the HK part contains 630K sentence pairs.

3. Chinese-English evaluation set: the evaluation set comprises two parts, the development set (450 sentences) and the test set (450 sentences).

Note that we use the UTF-8 encoding for all Chinese and English files.

5 Tutorial

5.1 Log-Linear Models for Word Alignment

TsinghuaAligner originates from our early work on introducing log-linear models into word alignment (Liu et al., 2005). Given a source-language sentence f = f1 ... fj ... fJ and a target-language sentence e = e1 ... ei ... eI, we define a link (j, i) to exist if fj and ei are translations (or part of a translation) of one another. Then, an alignment is defined as a subset of the Cartesian product of the word positions:

a ⊆ { (j, i) : j = 1, ..., J; i = 1, ..., I }    (1)

Note that the above definition allows for arbitrary alignments, while IBM models impose the many-to-one constraint (Brown et al., 1993).
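In the illustrative sketches scattered through this tutorial, an alignment in the sense of Eq. (1) is simply a set of 1-based (j, i) pairs. For instance, the alignment of Figure 1 can be written as follows (purely for illustration, not toolkit code):

# Figure 1's alignment as a set of (source position, target position) links
J, I = 5, 8   # "zongtong zai niuyue fabiao jianghua" / "The President made a speech at New York"
alignment = {(1, 2), (2, 6), (3, 7), (3, 8), (4, 3), (5, 5)}
assert alignment <= {(j, i) for j in range(1, J + 1) for i in range(1, I + 1)}  # Eq. (1)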
ntiates the observed examples from noises. Turning this option on will generate noisy examples by shuffling words. The default value is 1.

4. replace: generates noises by replacing words randomly. The default value is 0.

5. insert: generates noises by inserting words randomly. The default value is 0.

6. delete: generates noises by deleting words randomly. The default value is 0.

7. beam size: the beam size for the search algorithm. The default value is 5.

8. sample size: the sample size for the top-n sampling that approximates the expectations of features. The default value is 10.

9. SGD iteration limit: we use stochastic gradient descent for optimization; this argument specifies the limit on SGD iterations. The default value is 100.

10. SGD convergence threshold: the threshold for judging convergence in SGD. The default value is 0.01.

11. SGD convergence limit: the limit for judging convergence in SGD. The default value is 3.

12. SGD learning rate numerator: the numerator for computing the learning rate in SGD. The default value is 1.0.

13. SGD learning rate denominator: the denominator for computing the learning rate in SGD. The default value is 1.0.

An example command for unsupervised training is

unsupervisedTraining.py --src-file trnset.f --trg-file trnset.e --src-vcb-file source.vcb --trg-vcb-file target.vcb --s2t-ttable-file source_target.tTable --t2s-ttable-file target_source.t
nts

One major challenge in word alignment is modeling the permutations of words between source and target sentences. Due to the diversity of natural languages, the word orders of source and target sentences are usually quite different, especially for distantly related language pairs such as Chinese and English. Inversion transduction grammar (ITG) (Wu, 1997) is a synchronous grammar for synchronous parsing of source and target language sentences. It builds a synchronous parse tree that indicates the correspondence as well as the permutation of blocks (i.e., consecutive word sequences), based on the following production rules:

X → [X X]    (56)

X → ⟨X X⟩    (57)

X → f/e    (58)

X → f/ε    (59)

X → ε/e    (60)

where X is a non-terminal, f is a source word, e is a target word, and ε is an empty word. While rule (56) merges two blocks in a monotone order, rule (57) merges them in an inverted order. Rules (58)-(60) are responsible for aligning source and target words.

[Figure 6: The search space of word alignment. ITG alignments are highlighted by shading.]

Figure 5 shows an ITG derivation for a Chinese-English sentence pair. The decision rule for finding the Viterbi ITG alignment of a sentence pair ⟨f, e⟩ is given by

â = argmax_a { Π_{(j,i)∈a} p(f_j, e_i) × Π_{j: ψ_j=0} p(f_j, ε) × Π_{i: φ_i=0} p(ε, e_i) }    (61)

Traditionally, this can be done in O(n^6) time using bilingual parsing (W
ollects promising links before alignment expansion (line 3). By promising we mean that adding the link will increase the probability of the current alignment. For each alignment, the algorithm calls a procedure ITG(a') to verify whether it is an ITG alignment or not (line 12). We use a shift-reduce algorithm for ITG verification. As shown in Figure 7, the shift-reduce algorithm scans links from left to right on the source side. Each link (j, i) is treated as an atomic block ([j-1, j], [i-1, i]). The algorithm maintains a stack of blocks, on which three operators are defined:

1. S: shift a block onto the stack;

2. R_m: merge two blocks in a monotone order;

3. R_r: merge two blocks in an inverted order.

The algorithm runs in a reduce-eager manner: blocks are merged as soon as possible. Unaligned words are attached to the nearest aligned words to the left, deterministically. The alignment satisfies the ITG constraint if and only if the algorithm manages to find a block corresponding to the whole input sentence pair. The shift-reduce algorithm runs in linear time. At each level, the algorithm retains at most b alignments (line 13). As ITG only allows for one-to-one links, the beam search algorithm runs for at most min(J, I) iterations (lines 7-21). Therefore, the running time of our beam search algorithm is O(bn). Similarly, TsinghuaAligner also supports the block ITG constraint, as described in Haghighi et al. (2009).

5.4.4 Link Posterior Probabilities

Sometim
r than the old one. In this case, the new product divided by the old product is

t(e2|f2) × t(f2|e2) / t(f2|e0)

Whether the new product increases or not depends on the actual translation probabilities. Depending on whether they are aligned or not, we divide the words in a sentence pair into two categories: aligned and unaligned. For each aligned word, we use the translation probabilities conditioned on its counterpart, in two directions (i.e., t(e|f) and t(f|e)). For each unaligned word, we use the translation probabilities conditioned on the empty cept on the other side, in two directions (i.e., t(e_i|f_0) and t(f_j|e_0)). Formally, the feature function for translation probability product is given by

h_tpp(f,e,a) = Σ_{(j,i)∈a} [ log t(e_i|f_j) + log t(f_j|e_i) ]
             + Σ_{j=1}^{J} δ(ψ_j, 0) × log t(f_j|e_0)
             + Σ_{i=1}^{I} δ(φ_i, 0) × log t(e_i|f_0)    (5)

where δ(x, y) is the Kronecker function, which is 1 if x = y and 0 otherwise. We define the fertility of a source word f_j as the number of aligned target words:

ψ_j = Σ_{(j',i')∈a} δ(j', j)    (6)

(Footnote 4: Even though we take empty cepts into account, the bias problem still exists, because the product will decrease when new links are added if there are no unaligned words. For example, the product will go down if we further add a link (1,1) to {(1,2), (2,2)}, as all source words are already aligned. This might not be a bad bias, because reference alignments usually neither have all words aligned nor contain too many links.) Althoug
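To make Eq. (5) concrete, the following is an illustrative sketch (not toolkit code) that evaluates the translation probability product feature for an alignment, given two dictionaries of log probabilities. The dictionary keys are an assumption of this sketch: log_t_ef[(i, j)] stands for log t(e_i|f_j) and log_t_fe[(j, i)] for log t(f_j|e_i), with index 0 reserved for the empty cepts.

def h_tpp(alignment, J, I, log_t_ef, log_t_fe):
    """Translation probability product feature (Eq. 5).
    alignment: set of 1-based (j, i) links; J, I: sentence lengths."""
    aligned_src = {j for j, _ in alignment}
    aligned_trg = {i for _, i in alignment}
    score = sum(log_t_ef[(i, j)] + log_t_fe[(j, i)] for j, i in alignment)
    # unaligned source words are scored against the target-side empty cept e_0
    score += sum(log_t_fe[(j, 0)] for j in range(1, J + 1) if j not in aligned_src)
    # unaligned target words are scored against the source-side empty cept f_0
    score += sum(log_t_ef[(i, 0)] for i in range(1, I + 1) if i not in aligned_trg)
    return score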
rent alignment. For every possible link l (line 9), we produce a new alignment a' (line 10) and calculate the link gain G by calling the procedure GAIN(f, e, a, l). If a' has a higher score (line 12), it is added to closed (line 13). We also update N to keep the top-n alignments explored during the search (line 15); the n-best list will be used for training the feature weights by MERT. This process iterates until there are no promising alignments. The theoretical running time of this algorithm is O(bJI).

[Figure 5: An ITG derivation for a Chinese-English sentence pair ("He held a meeting with Musharraf at Islamabad").]

5.4.2 Pre-Pruning

In Algorithm 1, enumerating all possible new links (line 9) leads to the major computational overhead and can be replaced by pre-pruning. Given an alignment a ∈ open to be extended, we define C ⊆ J × I − a as the candidate link set. Instead of enumerating all possible candidates, pre-pruning only considers the highly likely candidates:

{ (j, i) : log t(e_i|f_j) − log t(e_i|f_0) + log t(f_j|e_i) − log t(f_j|e_0) > γ }    (55)

where γ is a pre-pruning threshold that balances search accuracy and efficiency. The default value of γ is 0 in TsinghuaAligner. Experiments show that pre-pruning dramatically improves the aligning speed, by an order of magnitude, without sacrificing accuracy significantly.

5.4.3 The ITG and Block ITG Constrai
u, 1997). We extend the beam search algorithm shown in Algorithm 1 to search for Viterbi ITG word alignments (Li et al., 2012). Figure 6 illustrates the search. For simplicity, we assume the distribution over the two binary rules X → [X X] and X → ⟨X X⟩ is uniform. Xiong et al. (2006) propose a maximum entropy model to distinguish between the two merging options based on lexical evidence; we leave this for future work.

Algorithm 2: A beam search algorithm for ITG alignment

1:  procedure ALIGNITG(f, e)
2:    â ← ∅                            ▷ the alignment with the highest probability
3:    L ← {(j, i) : p(f_j, e_i) > p(f_j, ε) × p(ε, e_i)}   ▷ a set of promising links
4:    open ← ∅                         ▷ a list of active alignments
5:    a ← ∅                            ▷ begin with an empty alignment
6:    ADD(open, a, β, b)               ▷ initialize the list
7:    while open ≠ ∅ do
8:      closed ← ∅                     ▷ a list of expanded alignments
9:      for all a ∈ open do
10:       for all l ∈ L − a do         ▷ enumerate all possible new links
11:         a' ← a ∪ {l}               ▷ produce a new alignment
12:         if ITG(a') then            ▷ ensure the ITG constraint
13:           ADD(closed, a', β, b)    ▷ update expanded alignments
14:           if a' > â then
15:             â ← a'                 ▷ update the best alignment
16:           end if
17:         end if
18:       end for
19:     end for
20:     open ← closed                  ▷ update active alignments
21:   end while
22:   return â                         ▷ return the alignment with the highest probability
23: end procedure

space of word alignment. Starting from an empty word alignment, the beam search al
vocabulary
  --trg-vcb-file <file>    target vocabulary
  --s2t-ttable-file <file> source-to-target TTable
  --t2s-ttable-file <file> target-to-source TTable
  --dev-src-file <file>    devset source file
  --dev-trg-file <file>    devset target file
  --dev-agt-file <file>    devset alignment file
Optional arguments:
  --iter-limit [1, +∞)                MERT iteration limit (default: 10)
  --nbest-list-size [1, +∞)           n-best list size (default: 100)
  --beam-size [1, +∞)                 beam size (default: 1)
  --structural-constraint {0, 1, 2}   structural constraint: 0 arbitrary, 1 ITG, 2 BITG (default: 0)
  --enable-prepruning {0, 1}          enable pre-pruning: 0 disable, 1 enable (default: 0)
  --prepruning-threshold <float>      pre-pruning threshold (default: 0)
  --help                              prints this message

(Footnote 2: For simplicity, the source and target files in the training and development sets are the same here. In practice, training sets are usually much larger than development sets.)

We distinguish between required and optional arguments. Required arguments must be explicitly specified when running the supervisedTraining.py script; the 7 required arguments are all files, including the resulting files from GIZA++ and the development set. The optional arguments include:

1. iteration limit: the limit on minimum error rate training iterations. The default value is 10.

2. n-best list size: the n-best list size. The default value is 100.

3.
word alignment is concentrated on a small number of alignments, it will be efficient and accurate to use only the most likely alignments to approximate the expectation. Figure 4 plots the distributions of log-linear models parametrized by 1,000 random feature weight vectors; we used all 16 features.

[Figure 4: Distributions of log-linear models for alignment on short sentences (< 4 words). The plot shows the probability assigned to the n-th best alignment for n = 0, ..., 16.]

The true distributions were calculated by enumerating all possible alignments for short Chinese and English sentences (< 4 words). We find that the top-5 alignments usually account for over 99% of the probability mass. More importantly, we also tried various sentence lengths, language pairs, and feature groups, and found this concentration property to hold consistently. One possible reason is that the exponential function enlarges the differences between variables dramatically (i.e., if a > b then exp(a) ≫ exp(b)). Therefore, we propose to approximate the expectation using the most likely alignments:

E_{y|x;θ}[h_k(x, y)] = Σ_{y∈Y(x)} P(y|x; θ) h_k(x, y)    (45)

                     = Σ_{y∈Y(x)} [ exp(θ · h(x, y)) / Σ_{y'∈Y(x)} exp(θ · h(x, y')) ] h_k(x, y)    (46)

                     ≈ Σ_{y∈N(x;θ)} [ exp(θ · h(x, y)) / Σ_{y'∈N(x;θ)} exp(θ · h(x, y')) ] h_k(x, y)    (47)

where N(x; θ) ⊆ Y(x) contains the most likely alignments depending on θ:

∀ y1 ∈ N(x; θ), ∀ y2 ∈ Y(x) − N(x; θ): θ · h(x, y1) ≥ θ · h(x, y2)    (48)
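The approximation in Eq. (47) is straightforward to express in code. The following illustrative sketch (not the trainer's implementation) renormalizes over a given top-n list of alignments and returns the approximate expected feature vector:

import math

def topn_expectation(nbest, feature_fn, theta):
    """Approximate E_{y|x;theta}[h(x,y)] over a top-n list of alignments (Eq. 47).
    nbest: list of alignments (e.g., sets of links);
    feature_fn: alignment -> list of feature values; theta: list of feature weights."""
    feats = [feature_fn(y) for y in nbest]
    scores = [sum(w * v for w, v in zip(theta, h)) for h in feats]
    m = max(scores)                                   # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(theta)
    expectation = [0.0] * dim
    for w, h in zip(weights, feats):
        for k in range(dim):
            expectation[k] += (w / z) * h[k]
    return expectation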
y, the feature function of link count is

h_lc(f,e,a) = |a|    (9)

G_lc(f,e,a,l) = 1    (10)

where |a| is the cardinality of a, i.e., the number of links in a.

5.1.3 Relative Position Absolute Distance

The difference between word orders in two languages can be captured by calculating the relative position absolute distance (Taskar et al., 2005):

h_rpad(f,e,a) = Σ_{(j,i)∈a} | j/J − i/I |    (11)

G_rpad(f,e,a,j,i) = | j/J − i/I |    (12)

5.1.4 Cross Count

Due to the diversity of natural languages, word orders between two languages are usually different. For example, subject-verb-object (SVO) languages such as Chinese and English often put an object after a verb, while subject-object-verb (SOV) languages such as Japanese and Turkish often put an object before a verb. Even between SVO languages such as Chinese and English, word orders can be quite different. In Figure 1, while zai is the second Chinese word, its counterpart at is the sixth English word. Meanwhile, the fourth Chinese word fabiao is aligned to the third English word made, which precedes at. We say that there is a cross between the two links (2,6) and (4,3) because (2 − 4) × (6 − 3) < 0. In Figure 1, there are six crosses, reflecting the significant structural divergence between Chinese and English. As a result, we can use the number of crosses in an alignment to capture the divergence of word orders between two languages. Formally, the feature function of cross count is given by
