Miguel Domingo, Francisco Casacuberta. Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents. Proceedings of the International Conference on Image Analysis and Processing. International Workshop on Pattern Recognition for Cultural Heritage (PatReCH 2019), 2019.

The nature of human language and the lack of a spelling convention make historical documents hard to handle for natural language processing. Spelling normalization tackles this problem by adapting their spelling to modern standards in order to achieve orthographic consistency. In this work, we compare several character-based machine translation approaches and propose a method that uses modern documents to enrich neural machine translation models. We tested our proposal on four different data sets and observed that the enriched models successfully improved the normalization quality of the neural models. Statistical models, however, yielded better results.