Publications

Abstract

Martha-Alicia Rocha, Joan-Andreu Sánchez. Translating the Penn Treebank with an Interactive-Predictive MT System. Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011), 2011.

The Penn Treebank corpus is widely used in the Computational Linguistics community. It is manually annotated with lexical and syntactic information and has been extensively used for Language Modeling, Probabilistic Parsing, PoS Tagging, etc. In recent years, with the increasing use of Syntactic Machine Translation approaches, the Penn Treebank corpus has also been used to extract monolingual linguistic information for further use in these Machine Translation systems. Making this corpus available in adequate translations to other languages can therefore be considered a challenging problem. Translating the Penn Treebank corpus correctly by applying Machine Translation techniques and then amending the errors in a post-editing phase can require a large human effort. Since no parallel text exists for this dataset, translating the corpus can be considered a translation problem in the absence of in-domain training data. Adaptation techniques have previously been considered to tackle this problem. In this work, we explore the translation of this corpus using Interactive-Predictive Machine Translation techniques, which have proved to be very efficient in reducing the human effort needed to obtain a correct translation.