
User manual



Table 1: Calculating feature values of the translation probability product for a source sentence f = f1 f2 and a target sentence e = e1 e2.

alignment         feature value
{}                log [ t(e1|f0) t(e2|f0) t(f1|e0) t(f2|e0) ]
{(1,2)}           log [ t(e1|f0) t(e2|f1) t(f1|e2) t(f2|e0) ]
{(1,2), (2,2)}    log [ t(e1|f0) t(e2|f1) t(e2|f2) t(f1|e2) t(f2|e2) ]

Formally, the feature function for the translation probability product is given by

h_tpp(f, e, a) = Σ_{(j,i)∈a} [ log t(e_i|f_j) + log t(f_j|e_i) ]
               + Σ_{j=1}^{J} δ(ψ_j, 0) × log t(f_j|e_0)
               + Σ_{i=1}^{I} δ(φ_i, 0) × log t(e_i|f_0)    (5)

where δ(x, y) is the Kronecker function, which is 1 if x = y and 0 otherwise. We define the fertility of a source word f_j as the number of aligned target words:

ψ_j = Σ_{(j',i)∈a} δ(j', j)    (6)

Similarly, the fertility of a target word e_i is the number of aligned source words:

φ_i = Σ_{(j,i')∈a} δ(i', i)    (7)

For example, as only one English word (President) is aligned to the first Chinese word zongtong in Figure 1, the fertility of zongtong is ψ_1 = 1. Similarly, the fertility of the third Chinese word niuyue is ψ_3 = 2 because there are two aligned English words. The fertility of the first English word The is φ_1 = 0. Obviously, words with zero fertility (e.g., The and a in Figure 1) are unaligned.
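The fertility computation of Eqs. (6)–(7) can be sketched in a few lines of Python (a minimal illustration, not part of TsinghuaAligner; the Figure 1 link set used below is inferred from the running example and is an assumption):

```python
def fertilities(alignment, J, I):
    """Source fertilities psi[j] and target fertilities phi[i] (Eqs. 6-7)
    for an alignment given as a set of 1-indexed (j, i) links."""
    psi = [0] * (J + 1)  # psi[j]: number of target words aligned to f_j
    phi = [0] * (I + 1)  # phi[i]: number of source words aligned to e_i
    for j, i in alignment:
        psi[j] += 1
        phi[i] += 1
    return psi, phi

# Inferred Figure 1 links: zongtong-President, zai-at, niuyue-New/York,
# fabiao-made, jianghua-speech (1-indexed positions on both sides)
figure1 = {(1, 2), (2, 6), (3, 7), (3, 8), (4, 3), (5, 5)}
psi, phi = fertilities(figure1, J=5, I=8)
# psi[1] == 1, psi[3] == 2, phi[1] == 0 (the English word The is unaligned)
```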
the model scores become 7.3, 7.1, and 8.3, respectively. Now the model correctly chooses a_3 as the best candidate.

If a set of feature weights manages to make model predictions agree with reference alignments in training examples, we would expect the model to achieve good alignment quality on unseen data as well. To do this, the MERT algorithm can be used to find feature weights that minimize error scores on a representative hand-aligned training corpus. Given a reference alignment r and a candidate alignment a, we use a loss function E(r, a) to measure alignment performance. Note that E(r, a) can be any error function. Given a bilingual corpus {f^(s), e^(s)} (s = 1, …, S) with reference alignments {r^(s)} and a set of M different candidate alignments C^(s) = {a_1^(s), …, a_M^(s)} for each sentence pair, our goal is to find a set of feature weights θ* that minimizes the overall loss on the training corpus:

θ* = argmin_θ { Σ_{s=1}^{S} E(r^(s), â(f^(s), e^(s), θ)) }    (29)
   = argmin_θ { Σ_{s=1}^{S} Σ_{m=1}^{M} E(r^(s), a_m^(s)) × δ(a_m^(s), â(f^(s), e^(s), θ)) }    (30)

where â(f, e, θ) is the best candidate alignment produced by the linear model:

â(f, e, θ) = argmax_{a ∈ C^(s)} { θ · h(f, e, a) }    (31)

The basic idea of MERT is to optimize only one parameter (i.e., feature weight) at a time while keeping all other parameters fixed. This process runs iteratively over all parameters until the overall loss on the training corpus no longer decreases. Please refer to Liu et al. (2010) for more details.
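A toy sketch of Eqs. (29)–(31) in Python (not the actual MERT line search: it exhaustively scores a tiny hand-made grid of weight vectors, and the loss E(r, a) is taken to be the symmetric difference between link sets — both are assumptions for illustration):

```python
def best_alignment(theta, candidates):
    # Eq. (31): pick the candidate with the highest linear model score
    return max(candidates, key=lambda c: sum(t * h for t, h in zip(theta, c["h"])))

def total_loss(theta, corpus):
    # Eq. (30): overall loss; E(r, a) here is the symmetric difference |r ^ a|
    return sum(len(s["ref"] ^ best_alignment(theta, s["cands"])["links"])
               for s in corpus)

# one training sentence with two candidate alignments (toy data)
corpus = [{
    "ref": {(1, 1)},
    "cands": [{"links": {(1, 1)}, "h": (1.0, 0.0)},
              {"links": {(1, 2)}, "h": (0.0, 1.0)}],
}]
grid = [(1.0, 0.0), (0.0, 1.0)]
theta_star = min(grid, key=lambda theta: total_loss(theta, corpus))
# theta_star == (1.0, 0.0): it makes the model prediction match the reference
```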
The corresponding feature gain counts the monotonic neighbors of a newly added link (j, i):

g_nc(f, e, a, j, i) = Σ_{(j',i')∈a} [[j' = j − 1 ∧ i' = i − 1]] + Σ_{(j',i')∈a} [[j' = j + 1 ∧ i' = i + 1]]    (17)

TsinghuaAligner also uses swapping neighbors in a similar way: j' = j + 1 ∧ i' = i − 1.

5.1.6 Linked Word Count

We observe that there should not be too many unaligned words in good alignments. For example, there are only two unaligned words on the target side in Figure 1: The and a. Unaligned words are usually function words that have little lexical meaning but instead serve to express grammatical relationships with other words or specify the attitude or mood of the speaker. To control the number of unaligned words, we follow Moore et al. (2006) and introduce a linked word count feature that simply counts the number of aligned words:

h_lwc(f, e, a) = Σ_{j=1}^{J} [[ψ_j > 0]] + Σ_{i=1}^{I} [[φ_i > 0]]    (18)

g_lwc(f, e, a, j, i) = [[ψ_j = 0]] + [[φ_i = 0]]    (19)

where ψ_j and φ_i are the fertilities before adding the link (j, i). TsinghuaAligner separates the two terms in the above two equations to distinguish between source linked word count and target linked word count.

5.1.7 Maximal Fertility

To control the maximal number of source words aligned to the same target word, and vice versa, we introduce the maximal fertility feature:

h_mf(f, e, a) = max_{1≤j≤J} ψ_j + max_{1≤i≤I} φ_i    (20)

It is not straightforward to calculate the feature gain. In practice, TsinghuaAligner maintains the positions of maximal fertilities and calculates the feature gain without checking all links. TsinghuaAligner distinguishes between source and target maximal fertilities.
  --src-file <src_file>    source file
  --trg-file <trg_file>    target file
  --agt-file <agt_file>    alignment file

Optional arguments:
  --nbest-list [1, +∞)     n-best list size (default: 1)
  --verbose {0, 1}         displays run-time information: 0, document level (default); 1, sentence level
  --posterior {0, 1}       outputs posterior link probabilities (default: 0)
  --help                   prints this message to STDOUT

There are three optional arguments:

1. nbest-list: n-best list size. The default value is 1.
2. verbose: display run-time information. The default value is 0.
3. posterior: output posterior link probabilities. The default value is 0.

Suppose we have unseen source and target sentences as follows.

> source.txt <
ta he wo dou xihuan yinyue
ni he ta dou xihuan dushu

> target.txt <
both he and i like music
both he and you like reading

Let's run the aligner to produce an alignment for the sentences:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file source.txt --trg-file target.txt --agt-file alignment.txt

The resulting file alignment.txt is in the Moses format:

1-0 2-3 3-2 4-4 5-5 6-6
0-3 3-2 4-4 5-5 6-6

where 1-0 denotes that the second source word is aligned to the first target word. Note that the subscript of the first word is 0 rather than 1.

Sometimes we are interested in how well two words are aligned. TsinghuaAligner is able to output link posterior probabilities.
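The Moses alignment format shown above is easy to consume programmatically. A minimal Python sketch (an illustration; not part of the TsinghuaAligner distribution):

```python
def parse_moses(line):
    """Parse one Moses-format line ("1-0 2-3 ...") into a set of
    0-indexed (source_position, target_position) pairs."""
    return {tuple(int(x) for x in pair.split("-")) for pair in line.split()}

links = parse_moses("1-0 2-3 3-2 4-4 5-5 6-6")
# (1, 0): the second source word is aligned to the first target word
```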
where N is the explored search space, i.e., the set of nodes visited in the search graph during the aligning process (Eqs. 64–65).

References

Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., and Klein, D. (2010). Painless unsupervised learning with features. In Proceedings of NAACL 2010.
Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics.
DeNero, J., Bouchard-Côté, A., and Klein, D. (2008). Sampling alignment structure under a Bayesian translation model. In Proceedings of EMNLP 2008.
Dyer, C., Chahuneau, V., and Smith, N. A. (2013). A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL 2013.
Dyer, C., Clark, J. H., Lavie, A., and Smith, N. A. (2011). Unsupervised word alignment with arbitrary features. In Proceedings of ACL 2011.
Haghighi, A., Blitzer, J., DeNero, J., and Klein, D. (2009). Better word alignments with supervised ITG models. In Proceedings of ACL 2009.
Lacoste-Julien, S., Taskar, B., Klein, D., and Jordan, M. I. (2006). Word alignment via quadratic assignment. In Proceedings of HLT-NAACL 2006, pages 112–119, New York City, USA.
Li, P., Liu, Y., and Sun, M. (2012). A beam search algorithm for ITG word alignment. In Proceedings of COLING 2012.
Liu, Y., Liu, Q., and Lin, S. (2005). Log-linear models for word alignment. In Proceedings of ACL 2005.
Our supervised trainer relies on the development set to learn the parameters of log-linear models.¹ The arguments of supervisedTraining.py in the scripts folder are listed as follows.

¹ For simplicity, the source and target files in the training and development sets are the same. In practice, training sets are usually much larger than development sets.

Usage: supervisedTraining.py [--help]

Required arguments:
  --src-vcb-file <file>     source vocabulary
  --trg-vcb-file <file>     target vocabulary
  --s2t-ttable-file <file>  source-to-target TTable
  --t2s-ttable-file <file>  target-to-source TTable
  --dev-src-file <file>     devset source file
  --dev-trg-file <file>     devset target file
  --dev-agt-file <file>     devset alignment file

Optional arguments:
  --iter-limit [1, +∞)               MERT iteration limit (default: 10)
  --nbest-list-size [1, +∞)          n-best list size (default: 100)
  --beam-size [1, +∞)                beam size (default: 1)
  --structural-constraint {0, 1, 2}  structural constraint: 0, arbitrary (default); 1, ITG; 2, BITG
  --enable-prepruning {0, 1}         enable pre-pruning: 0, disable (default); 1, enable
  --prepruning-threshold [0, +∞)     pre-pruning threshold (default: 0)
  --help                             prints this message

We distinguish between required and optional arguments. Required arguments must be explicitly specified by the user.
the system is highly extensible. It is possible to design and include new features to capture the characteristics of specific languages.

Supervised learning: Provided with a manually annotated parallel corpus, the system is able to learn to minimize the difference between its alignments and the manual alignments.

Unsupervised learning: The system is also capable of learning from unlabeled data automatically and delivers competitive performance.

Structural constraints: TsinghuaAligner supports a variety of structural constraints, such as many-to-many, ITG, and block ITG. These constraints prove to be effective in modeling the structural divergence between natural languages.

Link posterior probabilities: The system is capable of producing the posterior probability for each link in an alignment to indicate the confidence that two words are aligned.

2 Installation

2.1 System Requirements

TsinghuaAligner supports Linux (i686) and Mac OS X. You need to install the following free third-party software to build TsinghuaAligner:

1. GIZA++: it can be downloaded at https://code.google.com/p/giza-pp
2. g++: version 4.6.3 or higher
3. Python: version 2.7.3
4. JRE: 1.6 or higher (optional; only used for the visualization tool AlignViz)

2.2 Installing TsinghuaAligner

Here is a brief guide on how to install TsinghuaAligner.

2.2.1 Step 1: Unpacking

Unpack the package using the following command:

tar xvfz TsinghuaAligner.tar.gz
The link gain g(f, e, a, l) is the change in model score after adding a link l to the current alignment a:

g(f, e, a, l) = θ · h(f, e, a ∪ {l}) − θ · h(f, e, a)    (54)

In our experiments, we use a beam search algorithm that is more general than the above greedy algorithm. In the greedy algorithm, we retain at most one candidate at each level of the search graph while traversing top-down; in the beam search algorithm, we retain at most b candidates at each level.

Algorithm 1 shows the beam search algorithm. The input is a source-language sentence f and a target-language sentence e (line 1). The algorithm maintains a list of active alignments, open (line 2), and an n-best list, N (line 3). The aligning process begins with an empty alignment a (line 4), and the procedure ADD(open, a, β, b) adds a to open (line 5). The procedure prunes the search space by discarding any alignment whose score is worse than (1) β multiplied with the best score in the list, or (2) the score of the b-th best alignment in the list. For each iteration (line 6), we use a list closed to store promising alignments that have higher scores than the current alignment. For every possible link l (line 9), we produce a new alignment a′ (line 10) and calculate the link gain g (line 11).

Algorithm 1 A beam search algorithm for word alignment
 1: procedure ALIGN(f, e)
 2:   open ← ∅                ▷ a list of active alignments
 3:   N ← ∅                   ▷ n-best list
 4:   a ← ∅                   ▷ begin with an empty alignment
 5:   ADD(open, a, β, b)
 6:   while open ≠ ∅ do
 7:     closed ← ∅            ▷ a list of promising alignments
words, e.g., [5,7) × [1,3) in step 9. The algorithm manages to find a block corresponding to the input sentence pair; the shift-reduce algorithm runs in linear time.⁸

⁸ In practice, the algorithm can be even more efficient by recording the sequence of blocks in each hypothesis without unaligned word attachment. Therefore, block merging need not start from scratch for each hypothesis.

At each level, the algorithm retains at most b alignments (line 13). As ITG only allows for one-to-one links, the beam search algorithm runs for at most min(J, I) iterations (lines 7–21). Therefore, the running time of our beam search algorithm is O(b × J × I × min(J, I)).

Similarly, TsinghuaAligner also supports the block ITG constraint, as described in Haghighi et al. (2009).

5.4.4 Link Posterior Probabilities

Sometimes we are interested in the confidence of aligning a specific word pair rather than in the overall word alignment. To do so, we define the link posterior probability as

P(j, i | f, e) = Σ_a P(a | f, e) × [[(j, i) ∈ a]]    (62, 63)

This is especially useful for building bilingual lexicons and inducing weighted alignment matrices (Liu et al., 2009). However, due to the exponential search space of word alignment, it is intractable to calculate the posteriors exactly under log-linear models. Alternatively, TsinghuaAligner resorts to an approximate approach:

P(j, i | f, e) ≈ Σ_{a∈N} P(a | f, e) × [[(j, i) ∈ a]] / Σ_{a∈N} P(a | f, e)    (64)
             = Σ_{a∈N} exp(θ · h(f, e, a)) × [[(j, i) ∈ a]] / Σ_{a∈N} exp(θ · h(f, e, a))    (65)
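The approximation in Eqs. (64)–(65) amounts to a softmax over the scores of the explored alignments. A minimal Python sketch (an illustration of the formula, not TsinghuaAligner's implementation):

```python
import math

def link_posteriors(scored_alignments):
    """Approximate link posteriors over an explored set N of alignments.
    scored_alignments: list of (score, links) with score = theta . h(f, e, a)."""
    m = max(score for score, _ in scored_alignments)
    # subtract the maximum score before exponentiation for numerical stability
    weights = [(math.exp(score - m), links) for score, links in scored_alignments]
    z = sum(w for w, _ in weights)
    posterior = {}
    for w, links in weights:
        for link in links:
            posterior[link] = posterior.get(link, 0.0) + w / z
    return posterior

post = link_posteriors([(0.0, {(1, 1)}), (0.0, {(1, 2)})])
# two equally scored alignments: each link gets posterior 0.5
```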
∂L(θ)/∂θ = Σ_{i=1}^{N} ( E_{y|x^(i);θ}[h(x^(i), y)] − E_{x,y;θ}[h(x, y)] )    (39)

As there are exponentially many sentences and alignments, the two expectations in Eq. (39) are intractable to calculate for non-local features, which are critical for measuring the fertility and non-monotonicity of an alignment (Liu et al., 2010). Consequently, existing approaches have to use only local features to allow dynamic programming algorithms to calculate expectations efficiently on lattices (Dyer et al., 2011). Therefore, how to calculate the expectations of non-local features accurately and efficiently remains a major challenge for unsupervised word alignment.

5.3.2 Contrastive Learning with Top-n Sampling

Instead of maximizing the log-likelihood of the observed training data, we propose a contrastive approach to unsupervised learning of log-linear models (Liu and Sun, 2014).

Figure 3: (a) An observed romanized Chinese sentence ("zai huiyi shang fabiao yanjiang"), an English sentence ("he made a speech at the meeting"), and the word alignment between them; (b) a noisy training example derived from (a) by randomly permuting and substituting words.

As the training data only consist of sentence pairs, word alignment serves as a latent variable in the log-linear model. In our approach, the latent-variable log-linear model is expected to assign higher probabilities to observed training examples than to noisy examples.
target sibling distance feature weight     0.0100039
one-to-one link count feature weight       0.0212899
one-to-many link count feature weight      0.0310621
many-to-one link count feature weight      0.0334263
many-to-many link count feature weight     0.0933321

search setting
beam size                                  1
structural constraint (0: arbitrary, 1: ITG, 2: BITG)
structural constraint                      0

speed-up setting
enable pre-pruning                         1
pre-pruning threshold                      0.0

Lines 1–5 show the knowledge sources used by the aligner: the source and target vocabularies and the translation probability tables in two directions generated by GIZA++. Lines 7–23 specify the feature weights of the log-linear model; we use 16 features in TsinghuaAligner. The other parameters in the configuration file are related to the search algorithm. We strongly recommend enabling pre-pruning, which improves the aligning speed by an order of magnitude without sacrificing accuracy significantly.

The default structural constraint used in search is arbitrary. You may try other constraints, such as ITG, by modifying the configuration file as follows:

structural constraint (0: arbitrary, 1: ITG, 2: BITG)
structural constraint                      1

The aligner itself, in the bin folder, is easy to use.

Usage: TsinghuaAligner [--help]

Required arguments:
  --ini-file <ini_file>    initialization file
 8:     for all a ∈ open do
 9:       for all l ∈ J × I − a do      ▷ enumerate all possible new links
10:         a′ ← a ∪ {l}                ▷ produce a new alignment
11:         g ← GAIN(f, e, a, l)        ▷ compute the link gain
12:         if g > 0 then               ▷ ensure that the score will increase
13:           ADD(closed, a′, β, b)     ▷ update promising alignments
14:         end if
15:         ADD(N, a′, 0, n)            ▷ update n-best list
16:       end for
17:     end for
18:     open ← closed                   ▷ update active alignments
19:   end while
20:   return N                          ▷ return n-best list
21: end procedure

Figure 5: An ITG derivation for a Chinese-English sentence pair ("He held a meeting with Musharraf at Islamabad").

The link gain is computed by calling the procedure GAIN(f, e, a, l). If a′ has a higher score (line 12), it is added to closed (line 13). We also update N to keep the top-n alignments explored during the search (line 15); the n-best list will be used when training feature weights with MERT. This process iterates until there are no promising alignments. The theoretical running time of this algorithm is O(b × J² × I²).

5.4.2 Pre-Pruning

In Algorithm 1, enumerating all possible new links (line 9) leads to the major computational overhead and can be replaced by pre-pruning. Given an alignment a ∈ open to be extended, we define C ⊆ J × I − a as the candidate link set. Instead of enumerating all possible new links, the algorithm only considers the candidates in C.
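Algorithm 1 can be sketched compactly in Python (a simplified illustration: the score function stands in for θ · h(f, e, a), and the score-ratio pruning of ADD is reduced to plain top-b truncation — both simplifications are assumptions):

```python
def beam_search(J, I, score, beam=4, nbest=10):
    """A simplified sketch of Algorithm 1. score(a) plays the role of
    theta . h(f, e, a); links are 1-indexed (j, i) pairs."""
    all_links = [(j, i) for j in range(1, J + 1) for i in range(1, I + 1)]
    open_list = [frozenset()]       # active alignments
    explored = {frozenset()}        # alignments kept for the n-best list
    while open_list:
        closed = []                 # promising alignments at this level
        for a in open_list:
            for l in all_links:
                if l in a:
                    continue
                a2 = a | {l}
                if score(a2) - score(a) > 0:   # the link gain must be positive
                    closed.append(a2)
                explored.add(a2)
        closed.sort(key=score, reverse=True)
        open_list = closed[:beam]   # retain at most b candidates per level
    return sorted(explored, key=score, reverse=True)[:nbest]

# toy score: reward diagonal links, penalize off-diagonal ones
diag = lambda a: sum(1 if j == i else -1 for j, i in a)
best = beam_search(2, 2, diag)[0]
# best == frozenset({(1, 1), (2, 2)})
```

The search stops exactly as in Algorithm 1: when no remaining link has a positive gain, closed stays empty and the while loop terminates.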
These files will be used in both supervised and unsupervised training.

3.2 Supervised Training

In supervised training, we require a set of parallel sentences manually annotated with gold-standard alignments, which we refer to as the development set. The example development set in the example/devset folder contains three files.

> devset.f <
wo xihuan dushu
wo ye xihuan yinyue
ni xihuan dushu
ni ye xihuan yinyue
wo he ni dou xihuan dushu
ta xihuan dushu ma
ta xihuan yinyue ma

> devset.e <
i like reading
i also like music
you like reading
you like music too
both you and me like reading
does he like reading
does he like music

> devset.wa <
...
1-1-1 2-2-1 3-3-1 4-4-1
1-4-1 2-3-1 3-2-1 4-1-1 5-5-1 6-6-1
1-2-1 2-3-1 3-4-1 4-1-0

The source and target files are similar to the training set. The gold-standard file devset.wa contains the manual annotations: 1-1-1 denotes a sure link that connects the first source word and the first target word. This usually happens for content words, such as yinyue and music. In contrast, 4-1-0 denotes a possible link that connects the fourth source word (e.g., ma) and the first English word (e.g., does). Function words are usually aligned as possible links.
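Sure and possible links can be read off a devset.wa line in a few lines of Python (an illustration only; the j-i-flag triple layout is reconstructed from the description above, so the exact on-disk separator is an assumption):

```python
def parse_wa(line):
    """Split one devset.wa line into (sure, possible) link sets.
    Assumes j-i-flag triples where flag 1 marks a sure link and
    flag 0 marks a possible link."""
    sure, possible = set(), set()
    for triple in line.split():
        j, i, flag = (int(x) for x in triple.split("-"))
        (sure if flag == 1 else possible).add((j, i))
    return sure, possible

sure, possible = parse_wa("1-2-1 2-3-1 3-4-1 4-1-0")
# sure == {(1,2), (2,3), (3,4)}; possible == {(4,1)}, the ma-does link
```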
TsinghuaAligner: A Statistical Bilingual Word Alignment System

Yang Liu
Tsinghua University
liuyang2011@tsinghua.edu.cn

December 13, 2014

1 Introduction

Word alignment is a natural language processing task that aims to identify the correspondence between words in two languages.

zongtong zai niuyue fabiao jianghua
The President made a speech at New York

Figure 1: Example of word alignment.

Figure 1 shows a romanized Chinese sentence, an English sentence, and the word alignment between them. The links (highlighted in blue) indicate the correspondence between Chinese and English words. Word alignment is a challenging task because both the lexical choices and the word orders in two languages can be significantly different. For example, the English function words the and a have no counterparts in Chinese. In addition, a verb phrase (e.g., made a speech) is usually followed by a prepositional phrase (e.g., at New York) in English, but the order is reversed in Chinese.

TsinghuaAligner is a statistical bilingual word alignment system that takes a set of parallel sentences as input and produces word alignments automatically. It has the following features:

Language independence: The system is language independent and can be used for any language pair.

Extensibility: Our system is based on log-linear models, which are capable of incorporating arbitrary information sources as features. Therefore,
ta xihuan yinyue ma

The target file trnset.e contains target-language sentences. Each line corresponds to a tokenized target-language sentence. For English, lowercasing is usually applied to reduce data sparseness and improve accuracy. Note that the source and target sentences with the same line number are assumed to be translations of each other. For example, the first sentence in trnset.f and the first sentence in trnset.e constitute a parallel sentence pair.

i like reading
i also like music
you like reading
you like music too
both you and me like reading
does he like reading
does he like music

To run GIZA++, simply use the GIZA++.py script in the scripts folder with the two above files as input:

GIZA++.py trnset.f trnset.e

GIZA++ training usually takes a long time, especially for large parallel corpora (i.e., millions of sentence pairs). We recommend using nohup to keep executing the training script after you exit from a shell prompt:

nohup GIZA++.py trnset.f trnset.e &

When GIZA++ training is complete, there should be four resulting files:

1. source.vcb: source-language vocabulary
2. target.vcb: target-language vocabulary
3. source_target.tTable: source-to-target translation probability table
4. target_source.tTable: target-to-source translation probability table
This remains a central challenge in unsupervised learning of log-linear models. However, it is still intractable to calculate the expectation with respect to the posterior distribution, E_{y|x;θ}[h(x, y)], for non-local features, due to the exponential search space. One possible solution is to use Gibbs sampling to draw samples from the posterior distribution P(y|x) (DeNero et al., 2008), but the Gibbs sampler usually runs for a long time to converge to the equilibrium distribution.

Fortunately, by definition, only the alignments with the highest probabilities play a central role in calculating expectations. If the probability mass of the log-linear model for word alignment is concentrated on a small number of alignments, it will be efficient and accurate to use only the most likely alignments to approximate the expectation.

Figure 4 plots the distributions of log-linear models parametrized by 1,000 random feature weight vectors. We used all 16 features. The true distributions were calculated by enumerating all possible alignments for short Chinese and English sentences (≤ 4 words). We find that the top-5 alignments usually account for over 99% of the probability mass. More importantly, we also tried various sentence lengths, language pairs, and feature groups, and found this concentration property to hold consistently. One possible reason is that the exponential function dramatically enlarges the differences between variables, i.e., a > b ⇒ exp(a) ≫ exp(b).
Therefore, we propose to approximate the expectation using the most likely alignments:

E_{y|x;θ}[h(x, y)] = Σ_{y∈Y(x)} P(y|x; θ) h(x, y)    (45)
  = Σ_{y∈Y(x)} exp(θ · h(x, y)) h(x, y) / Σ_{y'∈Y(x)} exp(θ · h(x, y'))    (46)
  ≈ Σ_{y∈N(x,θ)} exp(θ · h(x, y)) h(x, y) / Σ_{y'∈N(x,θ)} exp(θ · h(x, y'))    (47)

where N(x, θ) ⊆ Y(x) contains the most likely alignments depending on θ:

∀y1 ∈ N(x, θ), ∀y2 ∈ Y(x) \ N(x, θ): θ · h(x, y1) ≥ θ · h(x, y2)    (48)

Let the cardinality of N(x, θ) be n. We refer to Eq. (47) as top-n sampling, because the approximate posterior distribution is normalized over the top-n alignments:

exp(θ · h(x, y)) / Σ_{y'∈N(x,θ)} exp(θ · h(x, y'))

In this paper, we use the beam search algorithm proposed by Liu et al. (2010) to retrieve the top-n alignments from the full search space. Starting with an empty alignment, the algorithm keeps adding links until the alignment score does not increase. During this process, local and non-local feature values can be calculated incrementally and efficiently. The algorithm generally runs in O(blm) time, where b is the beam size and l and m are the lengths of the source and target sentences.

As it is intractable to calculate the objective function in Eq. (42), we use the stochastic gradient descent (SGD) algorithm for parameter optimization, which requires calculating the partial derivatives of log P(y|x; θ) with respect to the feature weights on single training examples (49).

5.4 Search

5.4.1 The Beam Search Algorithm

Given a source-language sentence f and a target-language sentence e, we search for the alignment with the highest model score.
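Top-n sampling (Eq. 47) is a softmax over the scores of the n retrieved alignments, applied to their feature vectors. A minimal Python sketch (an illustration of the formula, not the trainer's implementation):

```python
import math

def topn_expectation(scored):
    """Approximate E_{y|x;theta}[h(x, y)] over the top-n alignments (Eq. 47).
    scored: list of (score, h) pairs with score = theta . h(x, y)."""
    m = max(score for score, _ in scored)
    # normalize over the top-n alignments only, with a max-shift for stability
    probs = [math.exp(score - m) for score, _ in scored]
    z = sum(probs)
    dim = len(scored[0][1])
    return [sum((p / z) * h[k] for p, (_, h) in zip(probs, scored))
            for k in range(dim)]

expect = topn_expectation([(0.0, [1.0, 0.0]), (0.0, [0.0, 1.0])])
# two equally likely alignments: the expected feature vector is [0.5, 0.5]
```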
Figure 2: Visualization of word alignment with AlignViz.

3.6 Evaluation

To evaluate our system, we need a test set to calculate the alignment error rate (AER). The test set in the example/tstset folder is:

> tstset.f <
ta he wo dou xihuan yinyue
ni he ta dou xihuan dushu

> tstset.e <
both he and i like music
both he and you like reading

> tstset.wa <
1-2-1 2-3-1 3-4-1 4-1-1 5-5-1 6-6-1
1-2-1 2-3-1 3-4-1 4-1-1 5-5-1 6-6-1

Now let's run the aligner to produce alignments for the test set:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file tstset.f --trg-file tstset.e --agt-file alignment.txt

The resulting file alignment.txt is in the Moses format:

1-0 2-3 3-2 4-4 5-5 6-6
0-3 3-2 4-4 5-5 6-6

Finally, you may run the waEval program in the bin folder to calculate the AER score:

waEval tstset.wa alignment.txt

The evaluation result is shown as follows:

1: 3 3 6 6 -> 0.5
2: 2 2 5 6 -> 0.636364
total matched sure: 5
total matched possible: 5
total actual: 11
total sure: 12
Precision: 0.454545
Recall: 0.416667
AER: 0.565217

Top 10 wrong predictions:
2 0.636364
1 0.5

waEval outputs not only the overall AER score but also the AER score for each sentence pair.
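The numbers reported by waEval follow the standard AER definition: Precision = 5/11, Recall = 5/12, and AER = 1 − (5 + 5)/(11 + 12) ≈ 0.565217. A minimal Python sketch of the metric (an illustration, not waEval itself; the synthetic link sets below merely reproduce the counts above):

```python
def aer(sure, possible, predicted):
    """Alignment error rate: 1 - (|A & S| + |A & P|) / (|A| + |S|),
    where the possible set P is taken to include the sure links S."""
    p_all = sure | possible
    matched_sure = len(predicted & sure)
    matched_possible = len(predicted & p_all)
    return 1.0 - (matched_sure + matched_possible) / (len(predicted) + len(sure))

# synthetic sets mirroring waEval's totals: 5 matched sure, 5 matched
# possible, 11 predicted links, 12 sure links
sure = {(j, 0) for j in range(12)}
predicted = {(j, 0) for j in range(5)} | {(j, 1) for j in range(6)}
# aer(sure, set(), predicted) == 1 - 10/23 ~= 0.565217
```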
After adding the link (1,2), the product becomes t(e2|f1) × t(f1|e2) × t(e1|f0) × t(f2|e0). Similarly, the new product for {(1,2), (2,2)} becomes t(e2|f1) × t(f1|e2) × t(e2|f2) × t(f2|e2) × t(e1|f0). Note that after adding the link (2,2), the new product still has more factors than the old product. However, the new product is not necessarily always smaller than the old one: in this case, the new product divided by the old product is

t(e2|f2) × t(f2|e2) / t(f2|e0)

Whether a new product increases or not depends on the actual translation probabilities.

Depending on whether they are aligned or not, we divide the words in a sentence pair into two categories: aligned and unaligned. For each aligned word, we use the translation probabilities conditioned on its counterpart in two directions, i.e., t(e|f) and t(f|e). For each unaligned word, we use the translation probabilities conditioned on the empty cept on the other side, i.e., t(e_i|f_0) and t(f_j|e_0).³

³ Even though we take empty cepts into account, the bias problem still exists, because the product will decrease when adding new links if there are no unaligned words. For example, the product will go down if we further add a link (1,1) to {(1,2), (2,2)}, as all source words are then aligned. This might not be a bad bias, because reference alignments usually neither have all words aligned nor contain too many links. Although the translation probability product is degenerate as a generative model, the bias problem can be alleviated when this feature is combined with other features such as link count (see Section 5.1.2).
The alignment satisfies the ITG constraint if and only if the whole sentence pair can finally be reduced to a single block.⁹

⁹ If the Viterbi alignment is a full alignment and the beam size is 1, the maximal number of nodes will be explored. Apparently, this can hardly happen in practice. For ITG alignments, however, our algorithm can reach at most the min(J, I)-th layer, because ITG only allows for one-to-one links.

step  operation  stack
 1    S      [0,1)×[0,1)
 2    S      [0,1)×[0,1)  [1,2)×[6,7)
 3    S      [0,1)×[0,1)  [1,2)×[6,7)  [2,3)×[7,8)
 4    Rm     [0,1)×[0,1)  [1,3)×[6,8)
 5    S      [0,1)×[0,1)  [1,3)×[6,8)  [3,4)×[4,5)
 6    S      [0,1)×[0,1)  [1,3)×[6,8)  [3,4)×[4,5)  [4,5)×[5,6)
 7    Rm     [0,1)×[0,1)  [1,3)×[6,8)  [3,5)×[4,6)
 8    Rr     [0,1)×[0,1)  [1,5)×[4,8)
 9    S      [0,1)×[0,1)  [1,5)×[4,8)  [5,7)×[1,3)
10    S      [0,1)×[0,1)  [1,5)×[4,8)  [5,7)×[1,3)  [7,8)×[3,4)
11    Rm     [0,1)×[0,1)  [1,5)×[4,8)  [5,8)×[1,4)
12    Rr     [0,1)×[0,1)  [1,8)×[1,8)
13    Rm     [0,8)×[0,8)

Figure 7: A shift-reduce algorithm for judging ITG alignments. The algorithm scans links from left to right on the source side. Each link (j, i) is treated as an atomic block [j−1, j) × [i−1, i). Three operators are defined: S (shift a block), Rm (merge two blocks in a monotone order), and Rr (merge two blocks in an inverted order). The algorithm runs in a reduce-eager manner: blocks are merged as soon as possible. Unaligned words are attached to the nearest aligned words to the left.
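The shift-reduce check can be sketched in Python (a simplified illustration: it handles one-to-one links only and does not model the manual's unaligned-word attachment):

```python
def is_itg(links):
    """Reduce-eager shift-reduce check of the ITG constraint for a set of
    one-to-one (j, i) links, every word assumed aligned."""
    stack = []  # blocks (j1, j2, i1, i2): half-open spans [j1,j2) x [i1,i2)
    for j, i in sorted(links):
        stack.append((j - 1, j, i - 1, i))  # S: shift an atomic block
        while len(stack) >= 2:              # merge blocks as soon as possible
            a1, a2, b1, b2 = stack[-2]
            c1, c2, d1, d2 = stack[-1]
            if a2 == c1 and b2 == d1:       # Rm: monotone merge
                stack[-2:] = [(a1, c2, b1, d2)]
            elif a2 == c1 and d2 == b1:     # Rr: inverted merge
                stack[-2:] = [(a1, c2, d1, b2)]
            else:
                break
    return len(stack) == 1  # ITG iff everything reduces to a single block

# an inverted pair is ITG; the "inside-out" permutation (2, 4, 1, 3) is not
# is_itg({(1, 1), (2, 3), (3, 2)}) -> True
# is_itg({(1, 2), (2, 4), (3, 1), (4, 3)}) -> False
```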
We say that there is a cross between the two links (2,6) and (4,3), because (2 − 4) × (6 − 3) < 0. In Figure 1 there are six crosses, reflecting the significant structural divergence between Chinese and English. As a result, we can use the number of crosses in an alignment to capture the divergence of word orders between two languages. Formally, the feature function of cross count is given by

h_cc(f, e, a) = Σ_{(j,i)∈a} Σ_{(j',i')∈a} [[(j − j') × (i − i') < 0]]    (13)

g_cc(f, e, a, j, i) = Σ_{(j',i')∈a} [[(j − j') × (i − i') < 0]]    (14)

where [[expr]] is an indicator function that takes a boolean expression expr as its argument:

[[expr]] = 1 if expr is true, 0 otherwise    (15)

5.1.5 Neighbor Count

Moore (2005) finds that word alignments between closely related languages tend to be approximately monotonic. Even for distantly related languages, the number of crossing links is far less than chance, since phrases tend to be translated as contiguous chunks. In Figure 1, the dark points are positioned approximately in parallel with the diagonal line, indicating that the alignment is approximately monotonic.

To capture such monotonicity, we follow Lacoste-Julien et al. (2006) and encourage strictly monotonic alignments by adding a bonus for a pair of links (j, i) and (j', i') such that j' = j + 1 ∧ i' = i + 1. In Figure 1 there is one such link pair, (2,6) and (3,7); we call them monotonic neighbors. Formally, the feature function of neighbor count is given by

h_nc(f, e, a) = Σ_{(j,i)∈a} Σ_{(j',i')∈a} [[j' = j + 1 ∧ i' = i + 1]]    (16)
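The cross count is straightforward to compute. A minimal Python sketch (an illustration, counting each unordered pair of links once; the Figure 1 link set is inferred from the running example and is an assumption):

```python
def cross_count(alignment):
    """Number of crossing link pairs, i.e., pairs with
    (j - j') * (i - i') < 0, each unordered pair counted once."""
    links = sorted(alignment)
    return sum(1
               for x in range(len(links))
               for y in range(x + 1, len(links))
               if (links[x][0] - links[y][0]) * (links[x][1] - links[y][1]) < 0)

# Inferred Figure 1 links (1-indexed): zongtong-President, zai-at,
# niuyue-New/York, fabiao-made, jianghua-speech
figure1 = {(1, 2), (2, 6), (3, 7), (3, 8), (4, 3), (5, 5)}
# cross_count(figure1) == 6, matching the six crosses mentioned in the text
```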
In Eq. (5), the first term calculates the product of the translation probabilities of aligned words, the second term deals with unaligned source words, and the third term deals with unaligned target words. Table 1 shows the feature values for some word alignments.⁴

⁴ We use the logarithmic form of the translation probability product to avoid manipulating very small numbers (e.g., 4.3 × e⁻¹⁰⁰), just for practical reasons.

For efficiency, we need to calculate the difference of feature values instead of the values themselves, which we call the feature gain. The feature gain for the translation probability product is

g_tpp(f, e, a, j, i) = log t(e_i|f_j) + log t(f_j|e_i)
                     − δ(ψ_j, 0) × log t(f_j|e_0)
                     − δ(φ_i, 0) × log t(e_i|f_0)    (8)

where ψ_j and φ_i are the fertilities before adding the link (j, i). Although this feature is symmetric, we obtain the translation probabilities t(f|e) and t(e|f) by training the IBM models using GIZA++ (Och and Ney, 2003).

5.1.2 Link Count

Given a source sentence with J words and a target sentence with I words, there are J × I possible links. However, the actual number of links in a reference alignment is usually far smaller. For example, there are only 6 links in Figure 1, although the maximum is 5 × 8 = 40. The number of links has an important effect on alignment quality: more links result in higher recall, while fewer links result in higher precision. A good trade-off between recall and precision usually results from a reasonable number of links.
Note that User/Jack/TsinghuaAligner is the directory where TsinghuaAligner is installed. You may use the command pwd to get the full path name of the root directory. Do the same for the other two Python scripts, supervisedTraining.py and unsupervisedTraining.py, to complete the installation.

3 User Guide

3.1 Running GIZA++

As noted before, TsinghuaAligner is based on log-linear models, which are capable of incorporating arbitrary features that capture various characteristics of word alignment. One of the most important features in TsinghuaAligner is the translation probability product (see Section 5), derived from GIZA++, the state-of-the-art generative alignment system. As a result, we need to run GIZA++ before training the log-linear models. This can be done by running the GIZA++.py script in the scripts directory.

The input of this script is a parallel corpus, which can be found in the example/trnset folder. The source file trnset.f contains source-language sentences. Each line corresponds to a tokenized source-language sentence. We strongly recommend using UTF-8 encoding for all the training, development, and test sets. In addition, each sentence should contain no more than 100 words, because GIZA++ usually truncates long sentences before training, which might lead to unexpected problems.

wo xihuan dushu
wo ye xihuan yinyue
ni xihuan dushu
ni ye xihuan yinyue
wo he ni dou xihuan dushu
ta xihuan dushu ma
It is appealing to directly learn the feature weights from unlabeled data. Our unsupervised trainer, unsupervisedTraining.py in the scripts folder, only uses the training set for parameter estimation.

Usage: unsupervisedTraining.py [--help]

Required arguments:
  --src-file <file>         source file
  --trg-file <file>         target file
  --src-vcb-file <file>     source vocabulary
  --trg-vcb-file <file>     target vocabulary
  --s2t-ttable-file <file>  source-to-target TTable
  --t2s-ttable-file <file>  target-to-source TTable

Optional arguments:
  --training-corpus-size [1, +∞)    training corpus size (default: 10)
  --sent-length-limit [1, +∞)       sentence length limit (default: 100)
  --shuffle {0, 1}                  shuffle words randomly (default: 1)
  --replace {0, 1}                  replace words randomly (default: 0)
  --insert {0, 1}                   insert words randomly (default: 0)
  --delete {0, 1}                   delete words randomly (default: 0)
  --beam-size [1, +∞)               beam size (default: 5)
  --sample-size [1, +∞)             sample size (default: 10)
  --sgd-iter-limit [1, +∞)          SGD iteration limit (default: 100)
  --sgd-converge-threshold [0, +∞)  SGD convergence threshold (default: 0.01)
  --sgd-converge-limit [1, +∞)      SGD convergence limit (default: 3)
  --sgd-lr-numerator [0, +∞)        SGD learning rate numerator (default: 1.0)
  --sgd-lr-denominator [0, +∞)      SGD learning rate denominator (default: 1.0)

The optional arguments include the following.
The major difference of our algorithm from Algorithm 1 is that we only consider ITG alignments, which are highlighted by shading in Figure 5. Wu (1997) shows that ITG alignments account for only a tiny fraction of the full search space, and the percentage is even lower for long sentences. While the worst-case running time of the beam search algorithm of Liu et al. (2010) is O(bn⁴), where b is the beam size, it can be reduced to O(bn³) for the beam search algorithm that searches for ITG word alignments.⁷

Algorithm 2 shows the beam search algorithm for ITG alignment. The best alignment is set to empty at the beginning (line 2). The algorithm collects promising links before alignment expansion (line 3); by promising we mean that adding the link will increase the probability of the current alignment. For each alignment, the algorithm calls a procedure ITG(a) to verify whether it is an ITG alignment or not (line 12).

We use a shift-reduce algorithm for ITG verification. As shown in Figure 7, the shift-reduce algorithm scans links from left to right on the source side. Each link (j, i) is treated as an atomic block [j−1, j) × [i−1, i). The algorithm maintains a stack of blocks, on which three operators are defined:

1. S: shift a block onto the stack;
2. Rm: merge two blocks in a monotone order;
3. Rr: merge two blocks in an inverted order.

The algorithm runs in a reduce-eager manner: blocks are merged as soon as possible. Unaligned words are attached to the nearest aligned words to the left deterministically.
f links as a feature could also alleviate the bias problem posed by the translation probability product feature (see Section 5.1.1). A negative weight for the link count feature leads to fewer links, while a positive weight favors more links. Formally, the feature function for link count is

h_lc(f, e, a) = |a|    (9)

g_lc(f, e, a, l) = 1    (10)

where |a| is the cardinality of a, i.e., the number of links in a.

5.1.3 Relative Position Absolute Distance

The difference between word orders in two languages can be captured by calculating the relative position absolute distance (Taskar et al., 2005):

h_rpad(f, e, a) = Σ_{(j,i)∈a} |j/J − i/I|    (11)

g_rpad(f, e, a, (j, i)) = |j/J − i/I|    (12)

(Footnote 5: For clarity, we use g_tpp(f, e, a, (j, i)) instead of g_tpp(f, e, a, l) because j and i appear in the equation.)

5.1.4 Cross Count

Due to the diversity of natural languages, word orders in two languages are usually different. For example, subject-verb-object (SVO) languages such as Chinese and English often put an object after a verb, while subject-object-verb (SOV) languages such as Japanese and Turkish often put an object before a verb. Even between SVO languages such as Chinese and English, word orders can be quite different. In Figure 1, while zai is the second Chinese word, its counterpart at is the sixth English word. Meanwhile, the fourth Chinese word fabiao is aligned to the third English word made, which precedes at. We say that there is a
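The cross count idea can be illustrated with a small helper that counts crossing link pairs: two links (j, i) and (j', i') cross when j < j' but i > i'. This is a sketch for intuition; the aligner's actual feature and gain computation may differ in detail.

```python
from itertools import combinations

def cross_count(alignment):
    """Count pairs of links that cross each other: with links sorted by
    source position, a pair crosses iff the target positions move in the
    opposite direction to the source positions."""
    links = sorted(alignment)
    return sum(1 for (j1, i1), (j2, i2) in combinations(links, 2)
               if (j2 - j1) * (i2 - i1) < 0)
```

For the zai/at and fabiao/made example, the links (2, 6) and (4, 3) form exactly one crossing.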
= f_1 ... f_J and a target language sentence e = e_1 ... e_I, we define a link (j, i) to exist if f_j and e_i are translations (or part of a translation) of one another. An alignment is then defined as a subset of the Cartesian product of the word positions:

a ⊆ {(j, i) : 1 ≤ j ≤ J, 1 ≤ i ≤ I}    (1)

Note that the above definition allows for arbitrary alignments, while IBM models impose the many-to-one constraint (Brown et al., 1993).

In supervised learning, the log-linear model for word alignment is given by

P(a | f, e) = exp(θ · h(f, e, a)) / Σ_{a'} exp(θ · h(f, e, a'))    (2)

where h ∈ R^d is a real-valued vector of feature functions that capture the characteristics of bilingual word alignment, and θ ∈ R^d is the corresponding vector of feature weights.

In unsupervised learning, the latent-variable log-linear model for word alignment is defined as

P(f, e) = Σ_a P(f, e, a)    (3)
        = Σ_a exp(θ · h(f, e, a)) / Σ_{f', e', a'} exp(θ · h(f', e', a'))    (4)

The major advantage of log-linear models is the ability to define useful features that capture various characteristics of word alignments. The features used in TsinghuaAligner mostly derive from Liu et al. (2010).

5.1.1 Translation Probability Product

To determine the correspondence of words in two languages, word-to-word translation probabilities are always the most important knowledge source. To model a symmetric alignment, a straightforward way is to compute the product of the translation probabilities of each link in two direct
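As a toy illustration of the conditional model in Eq. (2), the probability of an alignment can be computed over an explicitly enumerated candidate set. Real search spaces are exponential and are never enumerated like this; the helper below is purely for intuition.

```python
import math

def loglinear_prob(theta, feats_by_cand, cand):
    """P(a|f,e) = exp(theta . h(f,e,a)) / sum over candidates a' of
    exp(theta . h(f,e,a')), with the candidate set given explicitly as
    a dict mapping candidate name -> feature vector."""
    def score(h):
        return sum(t * v for t, v in zip(theta, h))
    z = sum(math.exp(score(h)) for h in feats_by_cand.values())
    return math.exp(score(feats_by_cand[cand])) / z
```

By construction the probabilities over the candidate set sum to one, and candidates with higher weighted feature scores receive higher probability.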
ghuaAligner distinguishes between source sibling distance and target sibling distance.

5.1.9 Link Type Count

Due to the different fertilities of words, there are different types of links. For instance, one-to-one links indicate that one source word (e.g., zongtong) is translated into exactly one target word (e.g., President), while many-to-many links arise from phrase-to-phrase translation. The distribution of link types differs across language pairs. For example, one-to-one links occur more frequently in closely related language pairs (e.g., French-English), while one-to-many links are more common in distantly related language pairs (e.g., Chinese-English).

To capture the distribution of link types independently of languages, we use features to count the different types of links. Following Moore (2005), we divide the links in an alignment into four categories:

1. one-to-one links, in which neither the source nor the target word participates in other links;
2. one-to-many links, in which only the source word participates in other links;
3. many-to-one links, in which only the target word participates in other links;
4. many-to-many links, in which both the source and target words participate in other links.

Table 2: Example feature values and error scores

  candidate    h_1    h_2    h_3    E(r, a)
  a_1          85     4      10     0.21
  a_2          89     3      12     0.20
  a_3          93     6      11     0.22

In Figure 1, (1, 2), (2, 6), (4, 3), and (5, 5) are one-to-one lin
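The four-way classification can be derived directly from the fertilities of a link's two endpoints. The sketch below (a hypothetical helper, not the aligner's code) labels each link by whether its source and target words participate in other links.

```python
from collections import Counter

def link_types(alignment):
    """Classify each link as one-to-one, one-to-many, many-to-one, or
    many-to-many from the fertilities psi_j (source) and phi_i (target)."""
    src_fert = Counter(j for j, i in alignment)   # psi_j
    trg_fert = Counter(i for j, i in alignment)   # phi_i
    names = {(False, False): 'one-to-one', (True, False): 'one-to-many',
             (False, True): 'many-to-one', (True, True): 'many-to-many'}
    return {(j, i): names[(src_fert[j] > 1, trg_fert[i] > 1)]
            for j, i in alignment}
```

For the niuyue/New York example, both links of the shared Chinese word come out as one-to-many.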
huaAligner.tar.gz

2.2.2 Step 2: Compiling

Entering the TsinghuaAligner directory, you will find five folders (code, doc, example, GUI, and scripts) and one script file (install.sh). The code folder includes the source code, the doc folder includes the documentation, the example folder contains example data, the GUI folder contains the visualization tool AlignViz, and the scripts folder includes Python scripts for training the system.

First, change the mode of the installing script:

chmod +x install.sh

Then install the system by running the script:

./install.sh

If everything goes well, you should see the following information:

Creating a directory to store binary executables... done
Compiling TsinghuaAligner... done
Compiling waEval... done
Compiling convertNBestListFormat... done
Compiling genIni... done
Compiling mergeNBestList... done
Compiling optimizeAER... done
Chmoding genNoise... done
Compiling trainer... done
Chmoding scripts... done
The system is installed successfully.

The script creates a folder bin to place all binary executables:

1. TsinghuaAligner: the main component of the system, which produces word alignments for parallel corpora;
2. optimizeAER: the main component for supervised training;
3. trainer: the main component for unsupervised training;
4. convertNBestListFormat: converting the n-best list format for supervised
ions. For example, suppose there is an alignment {(1, 2)} for a source language sentence f_1 f_2 and a target language sentence e_1 e_2. The translation probability product is

t(e_2 | f_1) × t(f_1 | e_2)

where t(e | f) is the probability that f is translated to e and t(f | e) is the probability that e is translated to f, respectively.

Unfortunately, the underlying model is biased: the more links are added, the smaller the product becomes. For example, if we add a link (2, 2) to the current alignment and obtain a new alignment {(1, 2), (2, 2)}, the resulting product will decrease after being multiplied by t(e_2 | f_2) × t(f_2 | e_2):

t(e_2 | f_1) × t(f_1 | e_2) × t(e_2 | f_2) × t(f_2 | e_2)

The problem results from the absence of empty cepts. Following Brown et al. (1993), a cept in an alignment is either a single source word or empty. They assign cepts to positions in the source sentence and reserve position zero for the empty cept. All unaligned target words are assumed to be aligned to the empty cept. For example, in the current example alignment {(1, 2)}, the unaligned target word e_1 is said to be aligned to the empty cept f_0. As our model is symmetric, we use f_0 to denote the empty cept on the source side and e_0 to denote the empty cept on the target side, respectively. If we take empty cepts into account, the product for {(1, 2)} can be rewritten as

t(e_2 | f_1) × t(f_1 | e_2) × t(e_1 | f_0) × t(f_2 | e_0)

Similarly, the product for {(1, 2), (2, 2)} now
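The empty-cept bookkeeping above can be sketched as a log-domain feature computation. The table layout (t-tables keyed by (j, i), with j = 0 for f_0 and i = 0 for e_0) is an assumption made for illustration, not the system's actual data format.

```python
import math

def tpp_feature(t_s2t, t_t2s, alignment, J, I):
    """Log translation probability product with empty cepts:
    sum log t(e_i|f_j) + log t(f_j|e_i) over links, plus log t(e_i|f_0)
    for each unaligned target word and log t(f_j|e_0) for each
    unaligned source word."""
    aligned_src = {j for j, i in alignment}
    aligned_trg = {i for j, i in alignment}
    score = 0.0
    for j, i in alignment:
        score += math.log(t_s2t[(j, i)]) + math.log(t_t2s[(j, i)])
    for i in range(1, I + 1):
        if i not in aligned_trg:
            score += math.log(t_s2t[(0, i)])   # t(e_i | f_0)
    for j in range(1, J + 1):
        if j not in aligned_src:
            score += math.log(t_t2s[(j, 0)])   # t(f_j | e_0)
    return score
```

With uniform probabilities of 0.5 everywhere, the alignment {(1, 2)} contributes four factors and {(1, 2), (2, 2)} contributes five, matching the running example above.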
ks, and the others are one-to-many links. As a result, we introduce four features:

h_o2o(f, e, a) = Σ_{(j,i)∈a} δ(ψ_j, 1) · δ(φ_i, 1)    (25)

h_o2m(f, e, a) = Σ_{(j,i)∈a} [ψ_j > 1] · δ(φ_i, 1)    (26)

h_m2o(f, e, a) = Σ_{(j,i)∈a} δ(ψ_j, 1) · [φ_i > 1]    (27)

h_m2m(f, e, a) = Σ_{(j,i)∈a} [ψ_j > 1] · [φ_i > 1]    (28)

where [·] is 1 if the condition inside holds and 0 otherwise. Their feature gains cannot be calculated in a straightforward way because the addition of a link might change the link types of its siblings on both the source and target sides. Please refer to Liu et al. (2010) for the algorithm that calculates the four feature gains.

5.2 Supervised Training

In supervised training, we are given gold-standard alignments for the parallel corpus. Using the minimum error rate training (MERT) algorithm (Och, 2003), training log-linear models actually reduces to training linear models.

Suppose we have three candidate alignments a_1, a_2, and a_3, whose error scores are 0.21, 0.20, and 0.22, respectively. Therefore, a_2 is the best candidate alignment, a_1 is the second best, and a_3 is the third best. We use three features to score each candidate; Table 2 lists the feature values for each candidate. If the set of feature weights is (1.0, -1.0, -1.0), the linear model scores of the three candidates are 71, 74, and 76, respectively. While the reference alignment considers a_2 the best candidate, a_3 has the maximal model score. This is unpleasant because the model fails to agree with the reference. If we change the feature weights to (1.0,
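The weight-tuning example can be reproduced in a few lines using the feature values from Table 2 (a sketch for intuition; MERT itself optimizes one weight at a time, as described below).

```python
def model_scores(weights, feature_table):
    """Linear model score theta . h(f,e,a) for each candidate alignment."""
    return {cand: sum(w * v for w, v in zip(weights, feats))
            for cand, feats in feature_table.items()}

# Feature values (h_1, h_2, h_3) per candidate, from Table 2.
table = {'a1': (85, 4, 10), 'a2': (89, 3, 12), 'a3': (93, 6, 11)}
scores = model_scores((1.0, -1.0, -1.0), table)
```

With weights (1.0, -1.0, -1.0) the scores are 71, 74, and 76, so the model prefers a_3 even though the reference error scores favor a_2 — exactly the disagreement MERT is meant to repair.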
l possible candidates, pre-pruning only considers the highly likely candidates:

C = {(j, i) | log t(e_i | f_j) − log t(e_i | f_0) + log t(f_j | e_i) − log t(f_j | e_0) > γ}    (55)

where γ is a pre-pruning threshold that balances search accuracy and efficiency. The default value of γ is 0 in TsinghuaAligner. Experiments show that pre-pruning dramatically improves the aligning speed, by an order of magnitude, without significantly sacrificing accuracy.

5.4.3 The ITG and Block ITG Constraints

One major challenge in word alignment is modeling the permutations of words between source and target sentences. Due to the diversity of natural languages, the word orders of source and target sentences are usually quite different, especially for distantly related language pairs such as Chinese and English. Inversion transduction grammar (ITG) (Wu, 1997) is a synchronous grammar for synchronous parsing of source and target language sentences. It builds a synchronous parse tree that indicates the correspondence as well as the permutation of blocks (i.e., consecutive word sequences) based on the following production rules:

X → [X X]    (56)
X → ⟨X X⟩    (57)
X → f/e    (58)
X → f/ε    (59)
X → ε/e    (60)

where X is a non-terminal, f is a source word, e is a target word, and ε is an empty word. While rule (56) merges two blocks in a monotone order, rule (57) merges them in an inverted order. Rules (58)-(60) are responsible for aligning source and target words. Figure 5 shows a
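A minimal sketch of the pre-pruning filter, assuming the gain form reconstructed in Eq. (55): a link survives only if pairing f_j with e_i beats leaving both words on their empty cepts by more than γ. The t-table layout (keys (j, i), with index 0 for the empty cepts) is an illustrative assumption.

```python
import math

def promising_links(t_s2t, t_t2s, J, I, gamma=0.0):
    """Keep link (j, i) iff
    log t(e_i|f_j) - log t(e_i|f_0) + log t(f_j|e_i) - log t(f_j|e_0) > gamma."""
    keep = []
    for j in range(1, J + 1):
        for i in range(1, I + 1):
            gain = (math.log(t_s2t[(j, i)]) - math.log(t_s2t[(0, i)])
                    + math.log(t_t2s[(j, i)]) - math.log(t_t2s[(j, 0)]))
            if gain > gamma:
                keep.append((j, i))
    return keep
```

Raising γ above 0 trades accuracy for speed by shrinking the candidate set further.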
lation. In Proceedings of ACL 2006.
n ITG derivation for a Chinese-English sentence pair.

The decision rule for finding the Viterbi ITG alignment of a sentence pair (f_1^J, e_1^I) is given by

â = argmax_a { Π_{(j,i)∈a} p(f_j, e_i) × Π_{j∉a} p(f_j, ε) × Π_{i∉a} p(ε, e_i) }    (61)

Traditionally, this can be done in O(n^6) time using bilingual parsing (Wu, 1997). (Footnote 6: For simplicity, we assume the distribution over the binary rules X → [X X] and X → ⟨X X⟩ is uniform. Xiong et al. (2006) propose a maximum entropy model to distinguish between the two merging options based on lexical evidence. We leave this for future work.)

Figure 6: The search space of word alignment. ITG alignments are highlighted by shading.

Algorithm 2: A beam search algorithm for ITG alignment

1:  procedure ALIGNITG(f, e)
2:    â ← ∅                          ▷ the alignment with highest probability
3:    L ← {(j, i) | p(f_j, e_i) > p(f_j, ε) × p(ε, e_i)}   ▷ a set of promising links
4:    open ← ∅                       ▷ a list of active alignments
5:    a ← ∅                          ▷ begin with an empty alignment
6:    ADD(open, a, β, b)             ▷ initialize the list
7:    while open ≠ ∅ do
8:      closed ← ∅                   ▷ a list of expanded alignments
9:      for all a ∈ open do
10:       for all l ∈ L \ a do       ▷ enumerate all possible new links
11:         a′ ← a ∪ {l}             ▷ produce a new alignment
12:         if ITG(a′) then          ▷ en
nominator for computing the learning rate in SGD. The default value is 1.0.

An example command for unsupervised training is:

unsupervisedTraining.py --src-file trnset.f --trg-file trnset.e --src-vcb-file source.vcb --trg-vcb-file target.vcb --s2t-ttable-file source_target.tTable --t2s-ttable-file target_source.tTable

The resulting file of unsupervised training is also a configuration file (TsinghuaAligner.ini) for the aligner.

3.4 Aligning Unseen Parallel Corpora

TsinghuaAligner takes a configuration file (TsinghuaAligner.ini) as input:

knowledge sources
source vocabulary file source.vcb
target vocabulary file target.vcb
source to target TTable file source_target.tTable
target to source TTable file target_source.tTable
feature weights
translation probability product feature weight 0.0504511
link count feature weight 0.0661723
relative position absolute distance feature weight 0.264923
cross count feature weight 0.0588821
mono neighbor count feature weight 0.137836
swap neighbor count feature weight 0.049596
source linked word count feature weight 0.00257702
target linked word count feature weight 0.0229796
source maximal fertility feature weight 0.072508
target maximal fertility feature weight 0.0126342
source sibling distance feature weight 0.07
or each sentence pair. It lists the top 10 wrong sentence pairs to facilitate error analysis.

4 Additional Datasets

In the TsinghuaAligner.tar.gz package, we only offer toy training, development, and test sets to show how to use the system. In practice, you need a large training corpus and manually annotated development and test sets to run the system. We offer the following additional Chinese-English datasets for FREE on our system website:

1. model.ce.tar.gz: the model files (see Section 3.1), trained by GIZA++ on millions of Chinese-English sentence pairs. Note that all Chinese words are composed of halfwidth characters and all English words are lowercased.

2. Chinese-English training set: the training set comprises parallel corpora from the United Nations website (UN) and the Hong Kong Government website (HK). The UN part contains 43K sentence pairs and the HK part contains 630K sentence pairs.

3. Chinese-English evaluation set: the evaluation set comprises two parts, the development set (450 sentences) and the test set (450 sentences).

Note that we use the UTF-8 encoding for all Chinese and English files.

http://nlp.csai.tsinghua.edu.cn/~ly/systems/TsinghuaAligner/TsinghuaAligner.html

5 Tutorial

5.1 Log-Linear Models for Word Alignment

TsinghuaAligner originates from our early work on introducing log-linear models into word alignment (Liu et al., 2005). Given a source language sentence f
pus size: the number of training examples used for training. It turns out our unsupervised trainer works quite well using only a small number of training examples. The default value is 10.

2. sentence length limit: the maximal length of sentences in the training corpus. The default value is 100.

3. shuffle: our unsupervised training algorithm is based on a contrastive learning approach, which differentiates observed examples from noises. Turning this option on generates noisy examples by shuffling words randomly. The default value is 1.

4. replace: generating noises by replacing words randomly. The default value is 0.

5. insert: generating noises by inserting words randomly. The default value is 0.

6. delete: generating noises by deleting words randomly. The default value is 0.

7. beam size: the beam size for the search algorithm. The default value is 5.

8. sample size: the sample size for top-n sampling that approximates the expectations of features. The default value is 10.

9. SGD iteration limit: we use stochastic gradient descent (SGD) for optimization. This argument specifies the limit on SGD iterations. The default value is 100.

10. SGD convergence threshold: the threshold for judging convergence in SGD. The default value is 0.01.

11. SGD convergence limit: the limit for judging convergence in SGD. The default value is 3.

12. SGD learning rate numerator: the numerator for computing the learning rate in SGD. The default value is 1.0.

13. SGD learning rate denominator: the de
r more details.

5.3 Unsupervised Training

5.3.1 Unsupervised Learning of Log-Linear Models

To allow for unsupervised word alignment with arbitrary features, latent-variable log-linear models have been studied in recent years (Berg-Kirkpatrick et al., 2010; Dyer et al., 2011, 2013). Let x = (f, e) be a pair of source and target sentences and y be the word alignment. A latent-variable log-linear model parametrized by a real-valued vector θ ∈ R^d is given by

P(x; θ) = Σ_{y∈Y(x)} P(x, y; θ)    (32)
        = Σ_{y∈Y(x)} exp(θ · h(x, y)) / Z(θ)    (33)

where h ∈ R^d is a feature vector and Z(θ) is a partition function for normalization:

Z(θ) = Σ_{x∈X} Σ_{y∈Y(x)} exp(θ · h(x, y))    (34)

We use X to denote all possible pairs of source and target strings and Y(x) to denote the set of all possible alignments for a sentence pair x. Given a set of training examples {x^(i)}_{i=1}^I, the standard training objective is to find the parameters that maximize the log-likelihood of the training set:

θ* = argmax_θ { L(θ) }    (35)
   = argmax_θ { log Π_{i=1}^I P(x^(i); θ) }    (36)
   = argmax_θ { Σ_{i=1}^I [ log Σ_{y∈Y(x^(i))} exp(θ · h(x^(i), y)) − log Z(θ) ] }    (37)

Standard numerical optimization methods such as L-BFGS and stochastic gradient descent (SGD) require calculating the partial derivative of the log-likelihood L(θ) with respect to the k-th feature weight θ_k:

∂L(θ)/∂θ_k = Σ_{i=1}^I Σ_{y∈Y(x^(i))} P(y | x^(i); θ) h_k(x^(i), y) − Σ_{x∈X} Σ_{y∈Y(x)} P(x, y; θ) h_k(x, y)    (38)
           = Σ_{i=1}^I E_{y|x^(i); θ}[h_k(x^(i), y)] − E_{x,y; θ}[h_k(x, y)]    (39)
rd alignment. In Proceedings of ACL 2005.

Liu, Y., Liu, Q., and Lin, S. (2010). Discriminative word alignment by linear modeling. Computational Linguistics.

Liu, Y. and Sun, M. (2014). Contrastive unsupervised word alignment with non-local features. Under review.

Liu, Y., Xia, T., Xiao, X., and Liu, Q. (2009). Weighted alignment matrices for statistical machine translation. In Proceedings of EMNLP 2009.

Moore, R. C. (2005). A discriminative framework for bilingual word alignment. In Proceedings of HLT/EMNLP 2005, pages 81-88, Vancouver, British Columbia, Canada.

Moore, R. C., Yih, W.-t., and Bode, A. (2006). Improved discriminative bilingual word alignment. In Proceedings of COLING/ACL 2006, pages 513-520, Sydney, Australia.

Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003, pages 160-167, Sapporo, Japan.

Och, F. J. and Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.

Taskar, B., Lacoste-Julien, S., and Klein, D. (2005). A discriminative matching approach to word alignment. In Proceedings of HLT/EMNLP 2005, pages 73-80, Vancouver, British Columbia, Canada.

Wu, D. (1997). Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics.

Xiong, D., Liu, Q., and Lin, S. (2006). Maximum entropy based phrase reordering model for statistical machine trans
source.vcb
target vocabulary file target.vcb
source to target TTable file source_target.tTable
target to source TTable file target_source.tTable
feature weights
translation probability product feature weight 0.0504511
link count feature weight 0.0661723
relative position absolute distance feature weight 0.264923
cross count feature weight 0.0588821
mono neighbor count feature weight 0.137836
swap neighbor count feature weight 0.049596
source linked word count feature weight 0.00257702
target linked word count feature weight 0.0229796
source maximal fertility feature weight 0.072508
target maximal fertility feature weight 0.0126342
source sibling distance feature weight 0.072326
target sibling distance feature weight 0.0100039
one-to-one link count feature weight 0.0212899
one-to-many link count feature weight 0.0310621
many-to-one link count feature weight 0.0334263
many-to-many link count feature weight 0.0933321
search setting
beam size 1
structural constraint (0: arbitrary, 1: ITG, 2: BITG)
structural constraint 0
speed-up setting
enable pre-pruning 0
pre-pruning threshold 0.0

We will explain the configuration file in detail in Section 3.4.

3.3 Unsupervised Training

As manual annotation is labor-intensive, it is app
specified when running the supervisedTraining.py script. The 7 required arguments are all files, including the resulting files from GIZA++ and the development set. Optional arguments include:

1. iteration limit: the limit on minimum error rate training iterations. The default value is 10.

2. n-best list size: the n-best list size. The default value is 100.

3. beam size: the beam size of the search algorithm. The default value is 1.

4. structural constraint: TsinghuaAligner supports three kinds of structural constraints: arbitrary, ITG, and block ITG. The default value is 0.

5. enabling pre-pruning: pre-pruning is a technique that improves the aligning speed by constraining the search space. The default value is 0.

6. pre-pruning threshold: the threshold for pre-pruning. The default value is 0.

You need not specify optional arguments when running the script unless you want to change the default setting. An example command for supervised training is:

supervisedTraining.py --src-vcb-file source.vcb --trg-vcb-file target.vcb --s2t-ttable-file source_target.tTable --t2s-ttable-file target_source.tTable --dev-src-file dev.f --dev-trg-file dev.e --dev-agt-file dev.wa

The resulting file is a configuration file (TsinghuaAligner.ini) for the aligner:

knowledge sources
source vocabulary file
sure the ITG constraint
13:          ADD(closed, a′, β, b)   ▷ update expanded alignments
14:          if P(a′) > P(â) then
15:            â ← a′               ▷ update the best alignment
16:          end if
17:        end if
18:      end for
19:    end for
20:    open ← closed                ▷ update active alignments
21:  end while
22:  return â                       ▷ return the alignment with highest probability
23: end procedure

We extend the beam search algorithm shown in Algorithm 1 to search for Viterbi ITG word alignments (Li et al., 2012). Figure 6 illustrates the search space of word alignment. Starting from an empty word alignment, the beam search algorithm proposed by Liu et al. (2010) keeps adding single links to the current alignments until none of the expanded alignments has a higher probability. For example, adding single links to the empty alignment a_1 results in four expanded alignments a_2, a_3, a_4, and a_5. Suppose only a_3 has a higher probability than a_1. Then expanding a_3 gives three new alignments a_7, a_9, and a_10. If all of them have lower probabilities than a_3, the algorithm returns a_3 as the optimal alignment.

From a graphical point of view, the search space is organized as a directed acyclic graph that consists of 2^(J×I) nodes and J × I × 2^(J×I−1) edges. The nodes are divided into J × I + 1 layers. The number of nodes in the k-th layer (k = 0, ..., J × I) is the binomial coefficient C(J×I, k), so the maximum layer width is C(J×I, ⌊J×I/2⌋). The goal of word alignment is to find the node that has the highest probability in the graph.

The major differ
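The level-by-level search just described can be sketched in Python. This is a simplified reading of the Liu et al. (2010) procedure, with the scoring function abstracted away; the termination rule (stop when no expansion beats the best alignment so far) follows the description above.

```python
def beam_search(score, links, beam_size=5):
    """Start from the empty alignment and keep adding single links.
    At each level, keep at most `beam_size` expansions; continue only
    with expansions that improved on the previous best alignment."""
    best = frozenset()
    open_list = [best]
    while open_list:
        # expand every active alignment by one new link
        expansions = {a | {l} for a in open_list for l in links if l not in a}
        closed = sorted(expansions, key=score, reverse=True)[:beam_size]
        improved = [a for a in closed if score(a) > score(best)]
        if improved:
            best = max(improved, key=score)
        open_list = improved  # stop once no expansion scores higher
    return best
```

With a toy additive score (per-link gains of +2, +1, and -1), the search correctly keeps the two beneficial links and skips the harmful one.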
terior probabilities to indicate the degree of alignedness. This can simply be done by turning the posterior option on:

TsinghuaAligner --ini-file TsinghuaAligner.ini --src-file source.txt --trg-file target.txt --agt-file alignment.txt --posterior 1

The result file is as follows:

4-4/0.997164 5-5/0.981469 6-6/0.941433 2-3/0.868177 3-2/0.570701 1-0/0.476397 2-0/0.090127
4-4/0.996101 5-5/0.976339 6-6/0.919651 3-2/0.777563 0-3/0.646714 1-0/0.415113

Note that each link is assigned a probability within [0, 1] to indicate the confidence that the two words are aligned. The result file just contains the sets of links collected during the aligning process, and they do not form reasonable alignments. You must specify a threshold to prune unlikely links and obtain high-quality alignments.

3.5 Visualization

We provide a visualization tool called AlignViz to display the alignment results, which is located in the GUI folder. The input of AlignViz consists of three files:

> source.txt <
ta he wo dou xihuan yinyue
ni he ta dou xihuan dushu

> target.txt <
both he and i like music
both he and you like reading

> alignment.txt <
2-2 3-3 0-4 4-5 0-1 1-5 6-6
2-2 1-3 0-4 4-5 0-1 1-5 6-6

To launch AlignViz, simply use the following command:

java -jar AlignViz.jar

The aligned sentence pair is shown in Figure 2.
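Thresholding the posterior links, as the text requires, is a one-liner. The triple layout (j, i, probability) is an assumption made for illustration; the threshold value is up to the user.

```python
def prune_links(posterior_links, threshold=0.5):
    """Keep only links whose posterior probability reaches `threshold`.
    `posterior_links` is an iterable of (j, i, prob) triples."""
    return [(j, i) for j, i, p in posterior_links if p >= threshold]
```

Applied to the first result line above with a threshold of 0.5, the low-confidence links (1, 0) and (2, 0) are dropped.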
training;
5. genIni: generating configuration files for supervised training;
6. mergeNBestList: merging n-best lists of multiple iterations for supervised learning;
7. genNoise.py: generating noises for unsupervised learning;
8. waEval: evaluating the system in terms of alignment error rate.

2.2.3 Step 3: Compiling GIZA++

TsinghuaAligner takes the translation probabilities derived from GIZA++ as the central feature in the log-linear model. As a result, GIZA++ needs to be properly installed to run our system. Please visit https://code.google.com/p/giza-pp to download GIZA++ and compile it according to its user manual.

After installing GIZA++, copy the following binary executables to the bin folder:

1. GIZA++: the main component for training IBM models;
2. mkcls: training word classes on a monolingual corpus;
3. plain2snt.out: converting plain files to the snt format;
4. snt2cooc.out: collecting co-occurrence statistics.

Note that there should be 12 binary executables in the bin folder if the system is properly installed.

2.2.4 Step 4: Locating the Executables

Training TsinghuaAligner is mainly done by using the Python scripts in the scripts folder. You need to enable these scripts to locate the binary executables by modifying the root_dir variable in each script. For example, you may change line 7 of the GIZA.py script from

root_dir = ""

to

root_dir = "/User/Jack/TsinghuaAligner"

Not
try to find the best candidate alignment with the highest model score:

â = argmax_a { P(a | f, e) }    (50)
  = argmax_a { θ · h(f, e, a) }    (51)

To do this, we begin with an empty alignment and keep adding new links until the model score of the current alignment no longer increases. Graphically speaking, the search space of a sentence pair can be organized as a directed acyclic graph. Each node in the graph is a candidate alignment and each edge corresponds to a link. We define that alignments with the same number of links constitute a level. There are 2^(J×I) possible nodes and J × I + 1 levels in a graph. Our goal is to find the node with the highest model score in the search graph.

As the search space of word alignment is exponential (although enumerable), it is computationally prohibitive to explore the whole graph. Instead, we can search efficiently in a greedy way. During the above search process, we expect that the addition of a single link l to the current best alignment a will result in a new alignment a ∪ {l} with a higher score:

θ · h(f, e, a ∪ {l}) − θ · h(f, e, a) > 0    (52)

As a result, we can remove most of the computational overhead by calculating only the difference of scores instead of the scores themselves. The difference of alignment scores caused by the addition of a link, which we refer to as a link gain, is defined as

G(f, e, a, l) = θ · g(f, e, a, l)    (53)

where g(f, e, a, l) is a feature gain vector, i.e., a vector of the increments of feature
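Greedy hill-climbing with link gains can be sketched as follows. The helper `gain_of(a, l)`, which returns the feature gain vector g(f, e, a, l), is a hypothetical stand-in for the per-feature gain computations described in Section 5.1.

```python
def link_gain(theta, gain_vec):
    """G(f,e,a,l) = theta . g(f,e,a,l): the change in model score from
    adding link l, computed from per-feature gains instead of rescoring
    the whole alignment (Eq. 53)."""
    return sum(t * g for t, g in zip(theta, gain_vec))

def greedy_align(theta, links, gain_of):
    """Start from the empty alignment; repeatedly add the link with the
    largest positive gain; stop when no addition improves the score."""
    a = set()
    while True:
        candidates = [(link_gain(theta, gain_of(a, l)), l)
                      for l in links if l not in a]
        if not candidates:
            break
        best_gain, best_link = max(candidates)
        if best_gain <= 0:
            break
        a.add(best_link)
    return a
```

Because only score differences are computed, each step costs a dot product per candidate link rather than a full feature extraction over the alignment.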
ween source maximal fertility and target maximal fertility.

5.1.8 Sibling Distance

In word alignments, there are usually several words connected to the same word on the other side. For example, in Figure 1, the two English words New and York are aligned to one Chinese word niuyue. We call the words aligned to the same word on the other side siblings. A word often tends to produce a series of words in another language that belong together, while others tend to produce a series of words that should be separate. To model this tendency, we introduce a feature that sums up the distances between siblings.

Formally, we use ω_{j,k} to denote the position of the k-th target word aligned to a source word f_j, and τ_{i,k} to denote the position of the k-th source word aligned to a target word e_i. Obviously, ω_{j,k+1} is always greater than ω_{j,k} by definition. As New and York are siblings, we define the distance between them to be ω_{3,2} − ω_{3,1} − 1 = 0. The distance sum of f_j can be efficiently calculated as

Δ(j, ψ_j) = ω_{j,ψ_j} − ω_{j,1} − (ψ_j − 1)   if ψ_j > 1
Δ(j, ψ_j) = 0                                otherwise    (21)

Accordingly, the distance sum of e_i is

∇(i, φ_i) = τ_{i,φ_i} − τ_{i,1} − (φ_i − 1)   if φ_i > 1
∇(i, φ_i) = 0                                 otherwise    (22)

Formally, the feature function for sibling distance is given by

h_sd(f, e, a) = Σ_{j=1}^J Δ(j, ψ_j) + Σ_{i=1}^I ∇(i, φ_i)    (23)

The corresponding feature gain is

G_sd(f, e, a, (j, i)) = Δ(j, ψ_j + 1) − Δ(j, ψ_j) + ∇(i, φ_i + 1) − ∇(i, φ_i)    (24)

where ψ_j and φ_i are the fertilities before adding the link (j, i). Tsin
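The closed form in Eqs. (21)-(22) is just the telescoped sum of the gaps between consecutive siblings, which a short helper makes concrete (an illustrative sketch, not the aligner's code):

```python
def sibling_distance_sum(positions):
    """Distance sum for one word: with its aligned positions sorted as
    w_1 < ... < w_psi, the gaps (w_{k+1} - w_k - 1) telescope to
    w_psi - w_1 - (psi - 1); words of fertility 0 or 1 contribute 0."""
    if len(positions) <= 1:
        return 0
    ps = sorted(positions)
    return ps[-1] - ps[0] - (len(ps) - 1)
```

For adjacent siblings such as New (position 4) and York (position 5), the distance sum is 0, while a two-position gap between siblings yields 2.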
xample, given an observed training example as shown in Figure 3(a), it is possible to generate a noisy example as shown in Figure 3(b) by randomly shuffling and substituting words on both sides. Intuitively, we expect that the probability of the observed example is higher than that of the noisy example. This is called contrastive learning, which has been advocated by a number of authors.

More formally, let x̃ be a noisy training example derived from an observed example x. Our training data is composed of pairs of observed and noisy examples, D = {(x^(i), x̃^(i))}_{i=1}^I. The training objective is to maximize the difference of probabilities between observed and noisy training examples:

θ* = argmax_θ { J(θ) }    (40)
   = argmax_θ { log Π_{i=1}^I P(x^(i); θ) / P(x̃^(i); θ) }    (41)
   = argmax_θ { Σ_{i=1}^I [ log Σ_{y∈Y(x^(i))} exp(θ · h(x^(i), y)) − log Σ_{y∈Y(x̃^(i))} exp(θ · h(x̃^(i), y)) ] }    (42)

Accordingly, the partial derivative of J(θ) with respect to the k-th feature weight θ_k is given by

∂J(θ)/∂θ_k = Σ_{i=1}^I [ Σ_{y∈Y(x^(i))} P(y | x^(i); θ) h_k(x^(i), y) − Σ_{y∈Y(x̃^(i))} P(y | x̃^(i); θ) h_k(x̃^(i), y) ]    (43)
           = Σ_{i=1}^I [ E_{y|x^(i); θ}[h_k(x^(i), y)] − E_{y|x̃^(i); θ}[h_k(x̃^(i), y)] ]    (44)

Figure 4: Distributions of log-linear models for alignment on short sentences (< 4 words).

The key difference is that our approach cancels out the partition function Z(θ), which poses the major computational
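The simplest noise channel described here, shuffling words on both sides (the trainer's --shuffle option), can be sketched as follows; the function name and seeding are illustrative assumptions.

```python
import random

def shuffle_noise(src_words, trg_words, seed=None):
    """Generate a noisy contrastive example by randomly shuffling the
    words of both sentences. The observed example should end up more
    probable under the model than its shuffled counterpart."""
    rng = random.Random(seed)
    noisy_src, noisy_trg = list(src_words), list(trg_words)
    rng.shuffle(noisy_src)
    rng.shuffle(noisy_trg)
    return noisy_src, noisy_trg
```

Replacement, insertion, and deletion noise (the --replace, --insert, and --delete options) would be implemented analogously.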
