Jesús Tomás, Joan M. Val, Ferran Fabregat, Francisco Casacuberta, David Picó, Alberto Sanchis, Enrique Vidal. Automatic Development of Spanish-Catalan Corpora for Machine Translation. Proceedings of the Second International Workshop on Spanish Language Processing and Language Technologies, 2001. pp. 175-179.

To translate a text successfully using example-based techniques, it is necessary to have a large computerized database of parallel sentences. In this paper, we describe an automatic procedure for constructing a bilingual corpus from the Internet. We also describe how we obtained two Spanish-Catalan corpora from two periodical publications (a legal bulletin and a newspaper). The corpus construction process consists of four main phases. First, the information is automatically retrieved from the Internet, and no significant information is eliminated. Second, the corpus is segmented into linguistic units (tokens, sentences, paragraphs, etc.) by specific rules. Third, a procedure detects certain translation units that behave in a specific way (numbers, abbreviations, proper nouns, etc.). Finally, the sentences of the two languages are aligned. We introduce a new iterative algorithm for aligning parallel texts, which is based on dynamic programming. The output of each phase was verified by a manual test. At the end of the paper, we discuss the results.
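
The abstract does not give the details of the iterative alignment algorithm, so the following is only a minimal sketch of how sentence alignment by dynamic programming can work in general. It is a generic length-based aligner, not the authors' method: the bead types, the penalty values, the character-length cost, and the names `BEADS`, `bead_cost`, and `align` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences consumed, target sentences consumed).
# The penalty attached to each bead type is an illustrative assumption.
BEADS = {(1, 1): 0.0, (1, 0): 10.0, (0, 1): 10.0, (2, 1): 3.0, (1, 2): 3.0}


def bead_cost(src: List[str], tgt: List[str], penalty: float) -> float:
    """Character-length mismatch between the two sides plus the bead penalty."""
    src_len = sum(len(s) for s in src)
    tgt_len = sum(len(t) for t in tgt)
    return abs(src_len - tgt_len) + penalty


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return the minimum-cost sequence of beads covering both sentence lists."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i source and j target sentences.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            # Try to extend the partial alignment with every bead type.
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(src[i:ni], tgt[j:nj], penalty)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end to recover the aligned groups.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    spanish = ["Buenos días.", "El periódico se publica cada día.", "Es gratuito."]
    catalan = ["Bon dia.", "El diari es publica cada dia i és gratuït."]
    for src_group, tgt_group in align(spanish, catalan):
        print(src_group, "<->", tgt_group)
```

The 2-1 and 1-2 beads let the aligner handle a sentence that is split in one language and kept whole in the other, while the 1-0 and 0-1 beads absorb sentences with no counterpart. An iterative scheme could rerun such a pass with a cost refined from the previous alignment, but whether the paper does so is not stated in the abstract.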