Francisco J. Nevado, Francisco Casacuberta, Enrique Vidal. Parallel corpora segmentation by using anchor words. Proceedings of the EACL 2003 Workshop on EAMT, 2003.

A new technique for monotone segmentation of parallel corpora is introduced. The segmentation is based on a manually defined set of anchor words, and the parallel segments are computed with a dynamic programming algorithm. To assess the technique, finite-state transducers are inferred from both unsegmented and segmented corpora. Experiments were carried out on Spanish-English and Italian-English translation tasks. The technique proved useful for improving results over those obtained with unsegmented corpora.
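
The abstract does not spell out the dynamic programming formulation, so the following is only a minimal sketch of how anchor-based monotone segmentation could work, not the authors' algorithm. It assumes that anchors are given as (source word, target word) pairs, that the goal is to maximize the number of anchor matches that keep the same order on both sides, and that each matched anchor closes a segment. The function name `segment_by_anchors` and the example sentences are hypothetical.

```python
from typing import List, Set, Tuple

Segment = Tuple[List[str], List[str]]


def segment_by_anchors(src: List[str], tgt: List[str],
                       anchors: Set[Tuple[str, str]]) -> List[Segment]:
    """Split a sentence pair at monotonically matched anchor words.

    Assumption (not from the paper): the best segmentation is the one
    with the largest number of order-preserving anchor matches.
    """
    n, m = len(src), len(tgt)
    # best[i][j] = max number of monotone anchor matches
    # between src[i:] and tgt[j:] (an LCS-style table).
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best[i][j] = max(best[i + 1][j], best[i][j + 1])
            if (src[i], tgt[j]) in anchors:
                best[i][j] = max(best[i][j], 1 + best[i + 1][j + 1])

    # Trace back through the table to recover the matched positions.
    cuts = []
    i = j = 0
    while i < n and j < m:
        if (src[i], tgt[j]) in anchors and best[i][j] == 1 + best[i + 1][j + 1]:
            cuts.append((i, j))
            i, j = i + 1, j + 1
        elif best[i + 1][j] >= best[i][j + 1]:
            i += 1
        else:
            j += 1

    # Each matched anchor closes a segment on both sides.
    segments = []
    ps = pt = 0
    for i, j in cuts:
        segments.append((src[ps:i + 1], tgt[pt:j + 1]))
        ps, pt = i + 1, j + 1
    # Whatever follows the last anchor forms the final segment
    # (one side may be empty if only the other has trailing words).
    if ps < n or pt < m:
        segments.append((src[ps:], tgt[pt:]))
    return segments


if __name__ == "__main__":
    pairs = segment_by_anchors(
        "por favor , quiero una reserva".split(),
        "please , I want a reservation".split(),
        {(",", ",")},
    )
    for s, t in pairs:
        print(" ".join(s), "|||", " ".join(t))
```

With a single comma anchor, the example pair is split into two shorter parallel segments, which is the kind of input the abstract describes feeding to transducer inference.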