Zuzanna Parcheta, Germán Sanchis-Trilles, Francisco Casacuberta. Filtering of Noisy Parallel Corpora Based on Hypothesis Generation. Proceedings of the Fourth Conference on Machine Translation (WMT), 2019. pp. 980-986. ACL.

The filtering task of noisy parallel corpora in WMT 2019 aims to challenge participants to create filtering methods to be useful for training machine translation systems. In this work, we introduce a noisy parallel corpora filtering system based on generating hypotheses by means of a translation model. We train translation models in both language pairs: Nepali–English and Sinhala–English using provided parallel corpora. To create the best possible translation model, we first join all provided parallel corpora (Nepali, Sinhala and Hindi to English) and after that, we applied bilingual cross-entropy selection for both language pairs (Nepali–English and Sinhala–English). Once the translation models are trained, we translate the noisy corpora and generate a hypothesis for each sentence pair. We compute the smoothed BLEU score between the target sentence and generated hypothesis. In addition, we apply several rules to discard very noisy or inadequate sentences which can lower the translation score. These heuristics are based on sentence length, source and target similarity and source language detection. We compare our results with the baseline published on the shared task website, which uses the Zipporah model, over which we achieve significant improvements in one of the conditions in the shared task. The designed filtering system is domain independent and all experiments are conducted using neural machine translation.
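
The scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say which BLEU smoothing variant or which rule thresholds were used, so the add-one smoothing, the function names (`smoothed_bleu`, `passes_rules`, `score_pair`) and the values of `max_ratio` and `max_overlap` below are all assumptions chosen for illustration. The hypothesis is assumed to come from a separately trained translation model, and the source language detection rule is left out because it would need an external library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing for orders above 1.

    The smoothing variant is an assumption; the abstract does not name one.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:  # add-one smoothing keeps higher orders from zeroing the score
            matches, total = matches + 1, total + 1
        if matches == 0:  # no unigram overlap at all
            return 0.0
        log_precision += math.log(matches / total) / max_n
    # Brevity penalty: penalise hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_precision)


def passes_rules(source, target, max_ratio=3.0, max_overlap=0.6):
    """Illustrative length and source/target similarity rules.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    src, tgt = source.split(), target.split()
    if not src or not tgt:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_ratio:
        return False
    # Pairs whose two sides share most tokens are likely untranslated copies.
    overlap = len(set(src) & set(tgt)) / min(len(set(src)), len(set(tgt)))
    return overlap <= max_overlap


def score_pair(source, target, hypothesis):
    """Return 0.0 for pairs rejected by the rules, else smoothed BLEU
    between the target side and the machine-translated hypothesis."""
    if not passes_rules(source, target):
        return 0.0
    return smoothed_bleu(target, hypothesis)
```

Under this sketch, each noisy sentence pair would be ranked by `score_pair(source, target, hypothesis)`, with the highest-scoring pairs kept for training. Scoring rejected pairs as 0.0 rather than dropping them keeps the output aligned with the input, which is convenient when a score is expected for every line.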