Publications

Abstract

Ismael García-Varea, Daniel Ortiz-Martínez, Francisco J. Nevado, Pedro A. Gómez, Francisco Casacuberta. Automatic segmentation of bilingual corpora: A comparison of different techniques. Proceedings of the Second Iberian Conference on Pattern Recognition and Image Analysis, 2005. pp. 614-621.

Segmentation of bilingual text corpora is an important problem in machine translation. In this paper we present a new method for bilingual segmentation of a parallel corpus, SPBalign, which is based on phrase-based statistical translation models. The proposed technique is compared with two existing techniques that are also based on statistical translation methods: RECalign, which is based on the concept of recursive alignment, and GIATIalign, which is based on simple word alignments. Experimental results are obtained for the EuTrans-I English-Spanish task, with the aim of creating new, shorter bilingual segments to be included in a translation memory database. The three methods are evaluated by comparing the bilingual segmentations they produce against a manually segmented bilingual test corpus. The results show that the proposed method outperforms the two previously proposed bilingual segmentation techniques in all cases.