D9.1: 1st Report on system validation & evaluation
Contents
Table 4: Validators' responses for the translation functionality
Table 5: Validators' comments for the translation functionality
Table 6: Validators' responses for the post-processing functionality
Table 7: Validators' comments for the post-processing functionality

Figures
Figure 1: The PRESEMT GUI
Figure 2: Scatter plot of BLEU results for the EN-DE language pair
Figure 3: Scatter plot of NIST results for the EN-DE language pair
Figure 4: Box plot of BLEU results for the EN-DE language pair
Figure 5: Box plot of NIST results for the EN-DE language pair
3.3 Evaluation results
3.3.1 Analysis of the evaluation results
3.3.2 Comments and future work
4. References
5. Appendix I: Validation forms
6. Appendix II: Validation schedule
7. Appendix III: Validation results - Translation process
8. Appendix IV: Validation results - Post-processing

Tables
Table 1: Language pairs evaluated
Table 2: Evaluation data details
Table 3: Evaluation results
Figure 3: Scatter plot of NIST results for the EN-DE language pair (NIST score vs. sentence length in tokens)

Furthermore, a boxplot diagram is used to indicate, for each of the aforementioned bins, the characteristics of the BLEU scores, as shown in Figure 4. It can be seen that the average BLEU score does not vary much for bins 2 to 7, indicating that the BLEU score is not substantially affected by sentence size, at least up to a size of 35 tokens (bin 7). The variance is largest for bin 3, while a few outliers appear.

Figure 4: Box plot of BLEU results for the EN-DE language pair

Finally, in Figure 5 the same diagram is created for the NIST metric. In this case the best translation accuracy seems to be obtained for bin 3, though again similar results are obtained for sizes up to 35 tokens. It is only for bin 8 and thereafter, i.e. for sentences with more than 35 tokens, that the NIST score is reduced.
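The binning used for Figures 4 and 5 is straightforward to reproduce. The following minimal Python sketch groups per-sentence scores into 5-token bins (bin 1 covering 1-5 tokens, bin 2 covering 6-10 tokens, and so on) and draws one box per bin; the lengths and scores are illustrative placeholders, not the project's actual per-sentence results.

```python
# Group per-sentence scores into 5-token length bins and draw a box plot.
# The (length, score) pairs below are illustrative placeholders only.
import matplotlib.pyplot as plt

def bin_by_length(lengths, scores, width=5):
    bins = {}
    for length, score in zip(lengths, scores):
        bins.setdefault((length - 1) // width + 1, []).append(score)
    return bins

lengths = [4, 8, 9, 13, 17, 22, 24, 28, 33, 38, 44]   # tokens per sentence
scores = [0.21, 0.14, 0.18, 0.12, 0.16, 0.11, 0.13, 0.10, 0.12, 0.07, 0.05]

bins = bin_by_length(lengths, scores)
keys = sorted(bins)
plt.boxplot([bins[k] for k in keys], labels=[str(k) for k in keys])
plt.xlabel("Sentence-length bin (5 tokens per bin)")
plt.ylabel("BLEU score")
plt.show()
```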
SEVENTH FRAMEWORK PROGRAMME
D9.1.1: Report on system validation & evaluation

Grant Agreement number: ICT-248307
Project acronym: PRESEMT
Project title: Pattern REcognition-based Statistically Enhanced MT
Funding Scheme: Small or medium-scale focused research project (STREP), CP-FP-INFSO
Deliverable title: D9.1.1 Report on system validation & evaluation
Version: 6
Responsible partner: ILSP
Dissemination level: Public
Due delivery date: 30/11/2011 (+60 days)
Actual delivery date: 20/1/2012
Project coordinator name & title: Dr George Tambouratzis
Project coordinator organisation: Institute for Language and Speech Processing, RC "Athena"
Tel: +30 210 6875411
Fax: +30 210 6854270
E-mail: iorg_t@ilsp.gr
Project website address: www.presemt.eu

Contents
1. Executive summary
2. Validation activities
2.1 Description of the validation process
2.2 Validation results
3. Evaluation activities
3.1 Compiling the evaluation data
3.2 Automatic evaluation metrics used
Source languages: Czech, English, German, Greek, Norwegian
Target languages: English, German
Reference translations per sentence: 1
Corpora (one per source language): 5
Sentences per corpus: 1,000
Sentence length: 7-40 tokens
Sentences selected per corpus: 180-200

3.2 Automatic evaluation metrics used

For the current evaluation phase, four (4) automatic evaluation metrics were employed, i.e. BLEU, NIST, Meteor and TER.

BLEU: The Bilingual Evaluation Understudy metric was developed by Papineni et al. (2002) and is currently one of the most widely used metrics in the MT field, although it was primarily designed for assessing the translation quality of statistical MT systems. Its basic function is to calculate the number of common n-grams between a translation produced by the system (the candidate translation) and the whole of the reference translations provided. The BLEU score may range between 0 and 1, with 1 denoting a perfect match, i.e. a perfect translation.

NIST: NIST (2002), developed by the National Institute for Standards and Technology, encompasses a similar philosophy to that of BLEU, in that it also counts the matching n-grams between candidate and reference translations. However, it additionally introduces information weights for less frequently occurring, hence more informative, n-grams. The score has no fixed upper bound (its range is [0, ∞)), and a higher score signifies a better translation quality.

Footnotes:
3. The same process is planned to be followed for compiling the test set.
4. The number of reference translations will be increased in the future.
5. ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v13a-20091001.tar.gz
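To make the mechanics of the two metrics concrete, the sketch below computes sentence-level BLEU and NIST scores with NLTK. This is an illustration only: the scores reported in this deliverable were produced with the official mteval tool referenced in footnote 5, and the example sentences are invented.

```python
# Sentence-level BLEU and NIST with NLTK (an illustrative stand-in for
# the official mteval-v13a scoring tool used in the actual evaluation).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.nist_score import sentence_nist

reference = "this containment absorbs radiation and prevents leaks".split()
candidate = "this containment absorbs the radiation and prevents leaks".split()

# BLEU: geometric mean of modified n-gram precisions, in [0, 1];
# smoothing avoids a zero score when a higher n-gram order has no match.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

# NIST: n-gram matches weighted by their information content; the score
# is unbounded above, and higher means better.
nist = sentence_nist([reference], candidate, n=4)

print(f"BLEU = {bleu:.4f}, NIST = {nist:.4f}")
```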
Every form is in .doc format and should be completed electronically. A different copy of the form should be completed for each new experiment. The following form-naming convention should be used:
- Translation: ValidForm_ExperXX.doc, where XX stands for the number of a given experiment.
- Post-processing: PostProcessing ValidForm_ExperXX.doc, where XX stands for the number of a given experiment.

After the validation process: After the validation process is over, the completed forms should be uploaded to the PRESEMT website, in the Archive under the folder "Validation", where each partner will create their own folder.

The validation process is summarised in the following table:

Validators: partner staff, not members of the development team | At least 2 per partner | Translation process | All partners | 5/12/2011
Validators: partner staff, not members of the development team | At least 2 per partner | Post-processing | All partners | 5/12/2011

8. The validation forms can be found in the Archive under the folder "Validation".

7. Appendix III: Validation results - Translation process

In this section the validation results for the translation functionality are presented. Table 4 contains the responses of the validators and is followed by their comments, as these were recorded in the corresponding forms. The comments are presented per partner. The numbers enclosed in brackets denote the form from which the comments originate.

Table 4: Validators' responses for the translation functionality
(3) System didn't translate the questions correctly, but delivered a word-by-word translation. System often chooses the wrong translation in the current context, but provides the correct translation in the list of lexical alternatives.
(4) System translated names despite upper-case writing ("BBC Travel", "Lonely Planet"). System doesn't recognise imperative sentences (source: "Verify critical information before travel").
(5) System often chooses the wrong translation for the current context but provides the correct translation in the list of lexical alternatives, e.g. source: "nuclear reactors", translation: "nukleare Apparaten" -> lexical alternatives: "nukleare Reaktor". Word-by-word translations, e.g. source: "This containment absorbs radiation and prevents radioactive material from being released into the environment", translation: "diese Begrenzung absorbiert Strahlung und verhindert radioaktives Material aus lautesten gelöst in die Umgebung". No adaptation for the genitive case: source: "reactor core's heat", translation: "Reaktorkern s Lauf".
(6) Word-by-word translation; however, the system recognised the superlative form correctly. System often chooses the wrong translation for the current context but provides the correct translation in the list of lexical alternatives, e.g. source: "translation tool", translation: "Umsetzung" -> lexical alternative: "Übersetzung Tool".
(7) Word-by-word translation. Source:
Validation forms

Help notes:
1. You should complete the form above and save a different copy for each new experiment. Please use the following naming: Translation ValidForm_ExperXX.doc, where XX stands for the number of a given experiment.
2. Please fill in the date, the serial number of the experiment and the site you work at in the respective fields.
3. Next, proceed with your personal details.
4. In the Input section you should state whether you input a sentence or a text for translation, and specify the number of words in the case of sentences and the number of sentences in the case of texts.
5. Next, use the drop-down lists for selecting the source and target languages of the experiment.
6. The fields "Source text" and "Translation" should be filled with the text that you input to the system and the system translation respectively.
7. Please describe any possible problems that the system may have encountered with the size of the input text.
8. If the overall process was unsuccessful, please state so and describe the problem.
9. Finally, add any comments.

Help notes (post-processing form):
1. You should complete the form above and save a different copy for each new experiment. Please use the following naming: PostProcessing ValidForm_ExperXX.doc, where XX stands for the number of a given experiment.
2. Please fill in the date, the serial number
All the aforementioned comments have been forwarded to the development team for revising the technical and design characteristics of the prototype, as appropriate.

3. Evaluation activities

Evaluation within PRESEMT involved assessing the quality of the system's translation output. Within the reporting period, the results evaluated were obtained with the 1st PRESEMT prototype, which handles the following eight (8) language pairs:

Table 1: Language pairs evaluated

Source language → Target language
English → German
German → English
Greek → German
Greek → English
Czech → German
Czech → English
Norwegian → German
Norwegian → English

At the current development phase, the evaluation of the translation output was performed consortium-internally and relied solely on automatic evaluation metrics, using data compiled from material drawn from the web.

3.1 Compiling the evaluation data

Before compiling the evaluation data, it was decided to collect two sets of data: (a) development data and (b) test data.

The development data would be evaluated with automatic metrics and used consortium-internally to study the system's performance. In other words, these data would be utilised for discovering possible problems in the translation engine. In a similar vein, this set is planned to be used as input to the Optimisation module
for optimising the system parameters.

The second category of data involves a sentence set which is planned to be used both consortium-internally and consortium-externally, and will be evaluated on the basis of automatic metrics as well as assessed by humans.

The process of creating both data categories (up to this point, only the development data have been compiled) was subject to appropriately defined specifications (cf. Table 2). All data originate from the web. More specifically, the web was crawled to retrieve 1,000 sentences of a specific length for each project source language. Thus five (5) corpora were collected, one per source language. (Footnote 2: It should be noted that it is intended to use primarily benchmark data for consortium-external evaluation, e.g. data sets compiled for MT competition purposes. However, the lack of such data for some project languages, namely Greek and Norwegian, necessitates the creation of these data sets.)

Subsequently, 200 sentences were randomly chosen out of each corpus (these sentences constituting the development set) and manually translated into the project target languages, namely English and German. The correctness of these translations, which would serve as reference translations, was then checked by native speakers. Table 2 summarises the particulars of the evaluation data.

Table 2: Evaluation data details
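The selection step described above, keeping web-crawled sentences of the specified length and drawing a random sample per corpus, can be sketched as follows. The file name, the whitespace tokenisation and the function itself are illustrative assumptions, not the project's actual crawling and selection tooling:

```python
# Hedged sketch of the development-set sampling: keep sentences within
# the 7-40 token range of Table 2 and draw a 200-sentence random sample.
import random

def select_dev_sentences(corpus_path, sample_size=200,
                         min_tokens=7, max_tokens=40):
    with open(corpus_path, encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    eligible = [s for s in sentences
                if min_tokens <= len(s.split()) <= max_tokens]
    return random.sample(eligible, min(sample_size, len(eligible)))

# e.g. for one of the five crawled corpora (hypothetical file name):
dev_set = select_dev_sentences("crawled_sentences_el.txt")
```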
substantial.

- The linguistic resources may provide only limited coverage. For instance, the lexica used for most language pairs are not particularly large. In addition, by design, the small bilingual corpus from which the TL structure is extracted is limited in size. (On the contrary, the monolingual corpus is sufficiently large as it stands.) Therefore, it is intended to investigate the effect of each linguistic resource in more detail, to provide coverage information; this shall be reported in the next evaluation report.

- Also, it may be that the reference translations are not sufficient: only one reference translation is currently provided per sentence.

Therefore, it has been decided to perform a more detailed evaluation of the aforementioned results. This will include a study to indicate the main sources of errors. For the relevant translation stages that cause the largest problems, a specific study will be performed. The time available to provide the present deliverable has been limited due to the constraints of the review date, so relevant work will continue along the lines described above.

NOTE: In the next version of this deliverable, for objective measures such as BLEU, NIST, METEOR and TER, it is planned to also test other systems, to provide reference values. Candidates to serve as reference systems include commercial systems as well as freely available ones;
With Google Chrome the translation results are presented to the user in a top-to-bottom way; with Mozilla Firefox the results are presented as they should be.
(2) Wrong translation.
(3) Wrong translation.
(4) Wrong translation.
(5) Bad quality in translation.
(6) Bad translation.
(7) Bad quality in translation.
(8) Not every word can be selected.
(9) New sentences do not start with a capital letter.
(10) Process successful, but wrong translation; bad translation.

GFAI
Validator 1
The source-language text consists of many paragraphs separated by one or more blank lines: the server crashes, an error message is displayed and the client has to be restarted.

Validator 2
(1) System often chooses the wrong translation for the current context but provides the correct translation in the list of lexical alternatives, e.g. source: "civilizations", translation: "Kulturen" -> lexical alternative: "Zivilisation"; source: "scale", translation: "Dimensionen" -> lexical alternatives: "Ausmaß".
(2) System often chooses the wrong translation for the current context but provides the correct translation in the list of lexical alternatives, e.g. source: "low", translation: "geringes" -> lexical alternative: "Tiefdruckgebiet"; source: "24 hours", translation: "24 Zeiten" -> lexical alternative: "Stunden". System delivers a word-by-word translation most of the time, e.g. source: "A deep low pressure system", translation: "ein tiefes geringes Belastung System".
1. Executive summary

The current deliverable, falling within Tasks T9.1 and T9.2 of WP9 "Validation & Evaluation", provides an outline of the validation and evaluation activities that were carried out within the PRESEMT project after the release of the 1st system prototype (1st Validation & Evaluation cycle). These activities concern the assessment of the system in terms of performance and conformance to the system design principles (validation, which is a consortium-internal process) and of translation quality (evaluation).

The validation process on which this deliverable reports concerned the testing of two system functionalities: (a) the translation process and (b) post-processing. It was performed consortium-internally at each partner's site, by personnel members not belonging to the PRESEMT development team, and it followed a concrete plan and time schedule. Validators experimented with both system functionalities and documented their experience on purpose-built validation forms.

The evaluation of the translation output, using data compiled for development purposes, involved eight (8) language pairs (those covered by the 1st system prototype) and was also performed consortium-internally, based on automatic evaluation metrics. The language pairs were: English → German, German → English, Greek → German, Greek → English, Czech → German, Czech → English, Norwegian → German and Norwegian → English.
and in a reasonable amount of time.

2. Functionality 2: Optimisation of the translation system. In this case the system optimisation process will be examined, by utilising a set of reference translations provided by the user in order to automatically modify the translation system parameters.

3. Functionality 3: Post-processing of translations using the PRESEMT GUI. In this case the aim is to ensure that the PRESEMT GUI allows the user to modify the system-generated translation in an effective manner, according to their preferences.

4. Functionality 4: Adaptation of the translation system. The aim here is to test whether the system is able to be adapted towards the user-specified corrections.

Within the aforementioned timeframe, only functionalities 1 & 3 underwent a validation process, since the Optimisation module (functionality 2) had not yet been finalised when the validation was initiated, while the User adaptation module (functionality 4) was still under development.

When testing functionality 1, the aim was naturally to check whether the system produced a translation, but additional aspects were also of interest, such as the system behaviour when handling long texts, operation time, display features, the relation of text size to the system performance time etc. For functionality 3, we wanted to test whether the user-oriented post-processing provisions implemented were functional, such as the lexical substitution and the free post-editing.
Sentences do not start with capital letters. The system faces problems in recognising compound words.
(5) Process successful, bad translation though. Sentences do not start with capital letters. The system faces problems in recognising compound words. The system often uses "that" instead of the …
(6) Process successful, bad translation though. Not able to recognise compound words, or not able to correctly translate the compound words from the source language.
(7) Process successful, bad translation though. Tense is not really translated as it should be. Not every translated word is highlighted when hovering the mouse over it.
(8) Process successful, bad translation though. Compound words of the source language are not correctly translated. Not every word is highlighted when hovering the mouse over it.
(9) Process successful, bad translation though. Compound words of the source language are not correctly translated. Not every word is highlighted when hovering the mouse over it.
(10) Process successful, bad translation though. Compound words of the source language are not correctly translated. Not every word is highlighted when hovering the mouse over it.

NTNU
Validator 1
(6) Google Translate provides a better user experience (UX) for word substitution. The post-editing UX can be improved by preserving the formatting of the text in several lines, rather than presenting it in a textbox in one line. It is possible to press the Free
Display features were also of interest, as well as the validators' opinion of the post-processing process as a whole.

2.1 Description of the validation process

ILSP was responsible for coordinating the validation process, which took place at each partner's site. A relevant schedule was drawn up (see Appendix II: Validation schedule), according to which validators (by definition not belonging to the development teams of the project) assessed the performance of two functionalities, i.e. the translation process and the post-processing, which are available via the PRESEMT GUI (http://147.102.3.151:8080/presemt_interface).

It should be noted that, when those functionalities were tested, only two language pairs had been integrated into the main system platform, namely German → English and English → German. Hence all validators used these language pairs.

Figure 1: The PRESEMT GUI (source/target language selection, and the "Translate", "Free Post-Editing" and "Reset" controls)

The validators were requested to document their experimentation with the system and report on any problems, by filling in the appropriate validation forms (see Appendix I: Validation forms), which were compiled for this purpose. The validators' profiles included almost exclusively computer analysts and linguists, as expected, since the process was a consortium-internal
The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing.
(8) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing.
(9) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing.
(10) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence;
Of course, these observations are to be verified by extending the analysis to other language pairs.

Figure 5: Box plot of NIST results for the EN-DE language pair

3.3.2 Comments and future work

Even if the scores obtained are not particularly high, there are a number of factors that need to be taken into account, as listed below.

- One of them is the trade-off between translation accuracy and the ease of development of new language pairs. For instance, a higher accuracy could result in more demanding specifications regarding the linguistic resources to be provided, as well as the linguistic knowledge required. At least the proposed methodology is easily applicable to other language pairs, while it should be noted that PRESEMT aims to provide a translation quality suitable for gisting purposes.

- The second one concerns the chain of modules responsible for the translation. Currently, for a new language pair, this involves the phrase alignment of the parallel corpus, the PGM-derived parser for the input sentence, the first translation phase and the second translation phase. All of these probably introduce small errors (in comparison to dedicated resources for a selected language pair), and it is likely that these errors multiply; thus, the final accuracy may be reduced quite considerably. On the other hand, by improving the accuracy of even a single stage, the actual improvement may be
Please use the textarea element for editing.

LCL
Validator 1
- When free-editing, longer sentences do not fit in the input box.
- When the user manages to press free editing during a long computation, the form gets filled with the previous results.
(2) The translation isn't really helpful.
(3) The sentence does not fit in the input box during free editing.
(4) The text does not fit into the input field when free-editing.
Please use the textarea element for editing.
(4) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing. Double-quote characters (") were wrongly converted into question marks, thus messing up the sentence borders in the Free Post-Edit mode.
(5) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing.
(6) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing.
(7) Cannot go back from the Free Post-Editing view to the view with lexical alternatives.
when the input is text, the number refers to sentences.
SL-TL: The language pair selected for a given experiment.
LP selection: It corresponds to the question "Can you select the language pair?"
Translation: It corresponds to the question "Does the system produce a translation?"
Display: It corresponds to the question "Does the system display the source text & its translation next to each other?"
Time: It corresponds to the question "Translation time (approximately)?"; the time is measured in seconds.
Long text: It corresponds to the question "Problems with longer texts?"
Reset: It corresponds to the question "Does the Reset button clear the screen?"
Process: It indicates whether the whole experiment was successful or not.
Comments: It indicates whether there were comments inserted by the validator.

Table 5: Validators' comments for the translation functionality

ILSP
Validator 1
(1) "Client cannot be found" is the message that appears.
(3) The output is in column format.
(5) The output is in column format.
(6) When pasting the output, it is in column format.
(7) When pasting the output, it is in column format.
(8) When I paste the output, it turns into column format.
(9) When I paste the output, it turns into column format.
(10) When I paste the output, it turns into column format.

Validator 2
(1) Not the right translation though.
The response grid of Table 4 records, per experiment: the experiment number, site, site number, validator profile, input type and size, language pair, and the responses on LP selection, translation, display, time, long texts, reset and the overall process, together with any comments; three experiments are marked as unsuccessful.

Notes:
Experiment: The given experiment's serial number.
Site: The partner responsible for the corresponding experiment.
Profile: The validator's profile.
Input: The type of text input for translation.
Number: The number of words or sentences constituting the input. When the input is a sentence, the number refers to words;
The deliverable has the following structure: Section 2 is dedicated to the validation process and provides a unified account of the validators' comments and suggestions; Section 3 describes the evaluation data used and reports on the results obtained; finally, a series of appendices gives more details on the validation process (namely the validation forms & schedule) and a comprehensive presentation of the results obtained.

2. Validation activities

Validation within PRESEMT involves the testing of PRESEMT modules and functionalities, and is aimed at ascertaining that they function in accordance with the general system design principles and those of the individual modules.

According to deliverable D2.2 "Evaluation Set-up", which outlined the validation and evaluation plan of the project, 3 validation sessions have been foreseen, the first of which had been estimated to take place around M20, following the release of the 1st PRESEMT prototype. During the first validation session, the following four system functionalities were scheduled to be tested:

1. Functionality 1: Translation process for an already created language pair. The aim of this activity is to ensure that the PRESEMT prototype can perform the translation of given sentences or given pieces of text, the main concern here being to ensure that a non-trivial working translation is generated
The calculated score, with a range of [0, ∞), derives from the total number of edits, namely insertions, deletions and substitutions of single words, as well as shifts of word sequences. Hence a zero score (number of edits = 0) denotes a perfect translation. Another variant of this metric, TER-Plus (TERp), additionally provides more options (paraphrasing, stemming etc.).

3.3 Evaluation results

The following table illustrates the scores obtained per metric and language pair.

Table 3: Evaluation results

Language pair        | Sentence set | Source | Reference translations | BLEU   | NIST   | Meteor | TER
English → German     | 189          | web    | 1                      | 0.1052 | 3.8433 | 0.1789 | 83.233
German → English     | 195          | web    | 1                      | 0.1305 | 4.5401 | 0.2324 | 74.804
Greek → German       | 200          | web    | 1                      |        |        |        |
Greek → English      | 200          | web    | 1                      |        |        |        |
Czech → German       | 183          | web    | 1                      | 0.0168 | 2.1878 | 0.1007 | 99.383
Czech → English      | 183          | web    | 1                      | 0.0424 | 2.5880 | 0.1739 | 99.798
Norwegian → German   | 200          | web    | 1                      | 0.0604 | 3.2351 | 0.1484 | 84.728
Norwegian → English  | 200          | web    | 1                      | 0.0942 | 3.6830 | 0.2110 | 78.078

According to the results summarised in Table 3, it can be seen that the best results are obtained for the German-to-English and English-to-German corpora, both for NIST and BLEU. For these two language pairs, the BLEU scores are approximately 0.10 to 0.13, while the NIST scores are in the range of 3.8 to 4.5. Similarly, the METEOR results are around the 0.20 mark, while the TER results are above 70.0.

7. http://www.cs.umd.edu/~snover/tercom/
indicatively, one can mention Google Translate, Systran and Moses.

4. References

Banerjee, S. & Lavie, A. (2005). METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), Ann Arbor, Michigan, pp. 65-72.

Denkowski, M. & Lavie, A. (2011). Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems. Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, Edinburgh, Scotland, pp. 85-91.

Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10, 707-710.

NIST (2002). Automatic Evaluation of Machine Translation Quality Using n-gram Co-occurrence Statistics.

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Philadelphia, U.S.A., pp. 311-318.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Makhoul, J. (2006). A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the Association for Machine Translation in the Americas.

5. Appendix I: Validation forms
disabled.
(5) Very slow.

Validator 2
- 20 minutes felt a bit long given the size of the text; I am not sure whether or not this can be characterised as a problem, though.
- It took some time, but I got a result in the end.
(4) This is two times the text from Exper02. When changing languages, I did at one point get an error message every time I tried to change languages. The message said something like "unknown", without any additional information. I am not able to recreate the situation at will, so I am guessing it has something to do with the page (GUI) and its communication with the web service. It got fixed after I refreshed the page.

MU
Validator 1
(1) There's no indicator that the system actually does something once you hit Translate. In another preliminary experiment I encountered a pop-up error message ("The client could not be found"); I needed to reload the page several times to fix it, and I cannot reproduce the problem now.
(2) There's no indicator that the system actually does something once you hit Translate. I was able to reproduce the problem with the error pop-up: when the browser (Firefox) with the PRESEMT interface open is left idle for an hour or so, it will reject any input to translate with the "The client could not be found" error message. Reloading the page solves the problem.
(3) There's no indicator that the system actually does something once you hit Translate. I was able to reproduce the problem
(9) There's no indicator that the system actually does something once you hit Translate.
(10) There's no indicator that the system actually does something once you hit Translate.

LCL
Validator 1
- The first letter of the translated sentences was often in lower case.
- The Translate button was of a different height compared to the other buttons, and the location of the buttons kept changing while filling in the text.
- Changing the language causes the source text to be cleared.
- When the user opens a new tab while translating a long text in the first one, the second tab gets broken when the translation arrives in the first tab.

Validator 2
(3) A message appeared: "Client could not be found".

8. Appendix IV: Validation results - Post-processing

In this section the validation results for the post-processing functionality are presented. Table 6 contains the responses of the validators and is followed by their comments, as these were recorded in the corresponding forms. The comments are presented per partner. The numbers enclosed in brackets denote the form from which the comments originate.

Table 6: Validators' responses for the post-processing functionality (response grid with columns: s/n, Experiment, Site number, Site, Profile, SL-TL, Highlight, Lexical alternatives, Substitution, Post-editing, Process, Comments)
Since the development of the PRESEMT translation system started with these two language pairs, it may be expected that these results are better than those achieved, for instance, for language pairs involving Norwegian and Czech. Still, it is very promising that, by using the same modules, it was possible to build the MT systems in a short period of time. As indicated by the BLEU results for the language pairs involving Czech and Norwegian, there is definitely scope for further improvement for these language pairs. The same applies, of course, to the pairs German-to-English and English-to-German.

3.3.1 Analysis of the evaluation results

In the present section the aim is to visualise the evaluation results. In Figure 2 the BLEU results are indicated in a scatter plot as a function of the sentence size. As can be seen, there does not seem to be a dominant relation between the size in tokens and the BLEU score. Even by grouping together different sizes to create fewer classes (where the first bin is generated by grouping together sentences with between 1 and 5 tokens, the second contains sentences from 6 to 10 tokens etc.), no trend is clearly shown.

Figure 2: Scatter plot of BLEU results for the EN-DE language pair (BLEU score vs. sentence length in tokens)
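A plot of this kind takes only a few lines with matplotlib; in the sketch below the per-sentence lengths and BLEU scores are placeholders, not the project's actual results:

```python
# Figure 2-style scatter plot: per-sentence BLEU vs. sentence length.
# The data points are illustrative placeholders only.
import matplotlib.pyplot as plt

lengths = [7, 9, 12, 16, 19, 23, 27, 31, 36, 42]   # tokens per sentence
bleu = [0.32, 0.15, 0.11, 0.22, 0.09, 0.13, 0.08, 0.12, 0.06, 0.10]

plt.scatter(lengths, bleu)
plt.xlabel("Sentence length in tokens")
plt.ylabel("BLEU score")
plt.show()
```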
and size of the buttons, while it was noted that sometimes the interface buttons were disabled, thus preventing the user from launching the translation process. Finally, it was suggested that the source text should remain intact and not be cleared when changing language.

Browser: A few validators observed that the text rendering was faulty when using Google Chrome, or that the interface did not work at all with that browser; so the validators turned to either Internet Explorer or Mozilla Firefox.

Text formatting & character rendering: Almost all validators pointed out the fact that, when trying to copy and paste the system translation output, each word appeared in a new line, with multiple empty lines in between. In a similar vein, a few validators noticed that the first letter of sentences was not capitalised. Similarly, it was pointed out that some characters (e.g. double-quote characters or the hyphen) were replaced by a question mark in the translated text.

POST-PROCESSING

Translation server & GUI: It was noted that the small size of the input box makes the free editing of long texts inconvenient, so it was suggested that the textarea element should be used. It was noted that it is possible to press the Free Post-Editing button before the completion of the translation process, thus resulting in a post-editing GUI without text; so the suggestion was expressed that the Free Post-Editing button should be disabled until the translation process is terminated.
one involving personnel members at the partners' sites.

2.2 Validation results

The comprehensive results of the validation, as these were depicted in the corresponding forms, are to be found in Appendix III (Validation results - Translation process) & Appendix IV (Validation results - Post-processing). The comments of the validators, highlighting problems they encountered during the validation and including suggestions for improvement, relate to the GUI layout and the function of the translation server, to potential incompatibilities with specific browsers, and to the text formatting. The comments are summarised as follows. It should be noted that the same validation pattern is to be followed in the future for the remaining system functionalities.

TRANSLATION

Translation server & GUI: Almost all validators noted that the server crashed after a few minutes of continuous use, thus forcing them to restart the browser and reinitiate the whole process. There is a general consensus about the system being slow; the suggestion was expressed that there should be a progress indicator. One validator noted that, when opening a new tab while waiting for the translation of a long text to be completed in the first tab, the second one got broken when the translation completed. Furthermore, a few comments related to the interface layout and the positioning
Notes:
Experiment: The given experiment's serial number.
Site: The partner responsible for the corresponding experiment.
Profile: The validator's profile.
SL-TL: The language pair selected for a given experiment.
Highlight: It corresponds to the question "Are the words highlighted when moving the cursor over them?"
Lexical alternatives: It corresponds to the question "Does the system provide lexical alternatives?"
Substitution: It corresponds to the question "Can you substitute a word with a lexical alternative?"
Post-editing: It corresponds to the question "Can you freely post-edit the text?"
Process: It indicates whether the whole experiment was successful or not.
Comments: It indicates comments inserted by the validator.

Table 7: Validators' comments for the post-processing functionality

ILSP
Validator 1
(2) Client could not be found.
(4) The output is in column format.
(5) The output is in column format.
(6) The output is in column format.
(7) The output is in column format.
(8) The output is in column format.
(9) The output is in column format.
(10) The output is in column format.

Validator 2
(1) Not every word is highlighted when moving the cursor over them.
(2) Bad translation.
(3) Process successful, bad translation though.
(4) Process successful, bad translation though.
of the experiment, and the site you work at, in the respective fields.
3. Next, proceed with your personal details.
4. In the Input section, use the drop-down lists for selecting the source and target languages of the experiment.
5. The fields "Source text" and "Translation" should be filled with the text that you input to the system and the system translation respectively.
6. If the overall process was unsuccessful, please state so and describe the problem.
7. Finally, add any comments.

6. Appendix II: Validation schedule

All partners will ask members of their personnel who do not belong to the development teams to validate two system functionalities: (a) the translation process and (b) the post-processing. The whole process should have been completed by early December.

The validators will access the PRESEMT web interface for performing the corresponding activity. The interface version tested will be the one implemented by the 10th of November 2011.

Before the validation process: Before the actual process, the validators should preferably read the user manual (Deliverable D7.3.1) or receive the corresponding guidance from the partner. Besides, every validation form includes accompanying help notes which guide the validators.

Validation process details: The validators will be asked to document the whole process by filling in the corresponding validation forms.
Post-Editing button before the translation is completed, which results in a post-editing GUI without text. It is better to disable this button before the translation process is completed.

Validator 2
(5) I noticed that, after translation, all hyphens with a space on each side were replaced by question marks in both texts.

MU
Validator 1
(2) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence; please use the textarea element for editing.
(3) Cannot go back from the Free Post-Editing view to the view with lexical alternatives. The translated text is technically a vertical text; for viewing and editing, it might be useful to convert it into a paragraph. When free-editing the translated text, only a single-line input field is available, which is inconvenient even for a longer sentence;
6. http://www.nist.gov/speech/tests/mt

Meteor: The Metric for Evaluation of Translation with Explicit ORdering was developed at CMU (Banerjee & Lavie, 2005; Denkowski & Lavie, 2011), with the aim of explicitly addressing weaknesses in BLEU, such as the lack of recall (Banerjee & Lavie, 2005), hoping to achieve a higher correlation with human judgements. It evaluates a machine translation hypothesis against a reference translation by calculating a similarity score based on an alignment between the two strings. When multiple references are provided, the hypothesis is scored against each one, and the reference producing the highest score is used. It additionally offers various options, such as stemming or paraphrasing, for achieving a better matching. Its score range is 0 to 1, where 1 signifies a perfect translation.

TER: Translation Error Rate, developed at the University of Maryland, resembles the philosophy of the Levenshtein distance (Levenshtein, 1966) in that it calculates the minimum number of edits needed to change a hypothesis (i.e. a candidate translation) so that it exactly matches one of the reference translations, normalised by the average length of the references (Snover et al., 2006). In case more than one reference is provided, only the reference translation closest to the hypothesis is taken into account, since this entails the minimum number of edits.
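The edit-distance core of TER can be illustrated with a word-level Levenshtein computation. Note that this is a simplification: full TER (Snover et al., 2006) also allows shifts of word sequences, so the sketch below can only over-estimate the true TER, and it assumes a single reference:

```python
# Word-level Levenshtein edits (insertions, deletions, substitutions)
# normalised by the reference length: a TER-like score without shifts.
def ter_like(candidate, reference):
    hyp, ref = candidate.split(), reference.split()
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i                      # delete all hypothesis words
    for j in range(len(ref) + 1):
        d[0][j] = j                      # insert all reference words
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / max(len(ref), 1)

print(ter_like("the reactor core s heat", "the heat of the reactor core"))
```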
"We suggest that you print this tutorial manual as you follow the step-by-step instructions to complete the various exercises"; translation: "wir vorstellen dass euch Eindruck diese Anleitung Anleitung wie folgen ihrer die schrittweisen Anleitungen den vielfältigen Aufgaben zu erledigen".
(8) System crashed after 5 min (1st try) and after 10 min (2nd try).
(9) System crashed after 5 min (3 times).
(10) System often chooses the wrong translation for the current context but provides the correct translation in the list of lexical alternatives, e.g. source: "growth", translation: "Entwicklung" -> lexical alternatives: "Wachstum". Word-by-word translation, but a good result in this case.

NTNU
Validator 1
(1) The layout could be much nicer (look at the Google Translate UI). There is no automatic language detection. When changing languages, the text field is cleared, which is inconvenient when a user has already typed in or pasted a text into the text field. The behaviour of the buttons is inconsistent: sometimes translation is impossible because the Translate button is disabled. The interface doesn't work in Google Chrome; I had to switch to Internet Explorer. Very slow, with no indicator of the progress. When copying the translated text and pasting it, each word appears in a new line, with multiple empty lines in between.
(3) I had to repeat this experiment 2 times. The first time, when I pasted a text into the text field, the Translate button as well as all other buttons remained
with the error pop-up: when the browser (Firefox) with the PRESEMT interface open is left idle for an hour or so, it will reject any input to translate with the "The client could not be found" error message. Reloading the page solves the problem.
(4) There's no indicator that the system actually does something once you hit Translate. I was able to reproduce the problem with the error pop-up: when the browser (Firefox) with the PRESEMT interface open is left idle for an hour or so, it will reject any input to translate with the "The client could not be found" error message. Reloading the page solves the problem.
(5) There's no indicator that the system actually does something once you hit Translate. I was able to reproduce the problem with the error pop-up: when the browser (Firefox) with the PRESEMT interface open is left idle for an hour or so, it will reject any input to translate with the "The client could not be found" error message. Reloading the page solves the problem.
(6) There's no indicator that the system actually does something once you hit Translate.
(7) There's no indicator that the system actually does something once you hit Translate. The process took too long for the input sentence.
(8) No result was produced after 2 minutes of waiting, and no feedback was provided (see the comment below). When I gave up and reset the form, the translation finally appeared. There's no indicator that the system actually does something once you hit Translate.