Daniel Ortiz-Martínez. Online Learning for Statistical Machine Translation. Computational Linguistics, 2016. Vol. 42 (1), pp. 121-161.

We present online learning techniques for statistical machine translation (SMT). Large training data sets that grow constantly over time are increasingly common in SMT, for example in translation agencies or in the daily translation of government proceedings. When new knowledge is to be incorporated into the SMT models, batch learning techniques require very time-consuming estimation processes over the whole training set, which may take days or weeks to execute. With online learning, new training samples can instead be processed individually in real time. For this purpose, we define a state-of-the-art SMT model composed of a set of submodels, as well as a set of incremental update rules for each of these submodels. To test our techniques, we study two well-known SMT applications that can be used in translation agencies: post-editing and interactive machine translation. In both scenarios, the SMT system collaborates with the user to generate high-quality translations, and these user-validated translations can be used to extend the SMT models by means of online learning. Empirical results in both scenarios show that frequent updates have a great impact on system performance. We also measured the time cost of such updates, comparing a batch learning SMT system with an online learning system: online learning is able to work in real time, whereas the time cost of batch retraining soon becomes infeasible. Empirical results also show that the performance of online learning is comparable to that of batch learning. Moreover, the proposed techniques are able to learn either from previously estimated models or from scratch. We also propose two new measures to predict the effectiveness of online learning in SMT tasks.
The translation system with online learning capabilities presented here is implemented in the open-source Thot toolkit for SMT.