The development of a hybrid neural machine translation platform has reached its first milestone. The project's goal is to design and develop advanced machine translation software that applies hybridization techniques to a neural machine translation core, offering added value to professional users and end clients. The project is being developed by the Machine Translation group of the PRHLT Research Center in collaboration with the company Pangeanic. It is part of HYBRID NEURAL MACHINE TRANSLATION PLATFORM, which has financial support from CDTI and the European Union through the Programa Operativo de Crecimiento Inteligente (EXPEDIENT: IDI-20170964).
Neural machine translation
Neural machine translation has become the state of the art in the last few years, as can be seen in the recent increase in scientific publications about it. These systems offer the great advantage of analyzing context at the sentence level, unlike statistical machine translation systems, whose context was limited to a window of 5 to 7 words. Moreover, all of the system's components are trained simultaneously, which allows for an increase in translation quality. Big companies such as Google [1] and Microsoft [2] are interested in these systems and assert that they are achieving machine translation results similar to human translation.
Because neural systems are so new, and their architecture is radically different from that of statistical systems, all the functionalities that exist for statistical machine translation systems need to be researched again. This task is not trivial: it requires in-depth study and an understanding of the training models, as well as of the quantity of data and examples needed for training.
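The difference in available context can be pictured with a deliberately simplified sketch. It is only an illustration of the idea of a fixed window versus a whole sentence; neither kind of system is actually implemented this way, and the function names and the example sentence are assumptions made for this sketch.

```python
from typing import List


def window_context(tokens: List[str], position: int, window: int) -> List[str]:
    """Return the tokens inside a fixed-size window ending just before `position`."""
    start = max(0, position - window)
    return tokens[start:position]


def sentence_context(tokens: List[str], position: int) -> List[str]:
    """Return every other token of the sentence, regardless of distance."""
    return tokens[:position] + tokens[position + 1:]


# Hypothetical example sentence: "bank" and "reeds" are ten words apart.
sentence = "the bank that the fishermen visited every morning was covered in reeds".split()
position = sentence.index("reeds")

# With a 5-word window, "bank" is no longer visible when "reeds" is processed.
print(window_context(sentence, position, 5))

# With sentence-level context, every other word of the sentence is available.
print(sentence_context(sentence, position))
```

In this toy sentence the words "bank" and "reeds" help disambiguate each other, but they only appear together when the whole sentence is taken as context.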
Current state of the project
During the first part of the project, pre-processing and post-processing procedures that are independent of the neural architecture have been developed. These procedures were designed for statistical systems and also work correctly with neural systems. Additionally, an algorithm that combines existing alignment methods has been developed, with the aim of being able to add labels to the translation. This allows a text to be translated automatically without losing information about its format. The tools to use in the project and the specific data to train each domain have also been selected. Finally, the project's design has been created, using the principal standard model: bidirectional sequence-to-sequence recurrent neural networks.
After trying out different toolkits for training neural machine translation systems, we selected OpenNMT. This toolkit has the advantage of being open source, and it offers multiple functionalities and thorough documentation. Furthermore, it has the support of Harvard and Systran, as well as a large user community.
Once the decision to use OpenNMT was made, we ran a range of experiments to study which network parameters and architecture are needed for the quantity of data that we have.
A paper is currently under review that reports the study we conducted to assess the impact of tokenization on the quality of the final translation. We also plan to write several articles covering the research done during this second part of the project. These articles will be submitted to relevant conferences and workshops taking place in 2019. Finally, we shall prepare a demo of the platform developed during the project and submit it to the most relevant conferences taking place next year.
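The idea of using word alignments to carry format labels over to the translation can be sketched as follows. This is a minimal illustration and not the algorithm developed in the project, whose details are not given here. It assumes that two word aligners each produce a set of (source index, target index) links, that the links are combined by keeping the agreed ones and falling back to either aligner for source words left uncovered, and that a label is a simple opening and closing tag. All function names and the example data are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# A word alignment as a set of (source_index, target_index) links.
Alignment = Set[Tuple[int, int]]


def combine_alignments(a: Alignment, b: Alignment) -> Alignment:
    """Keep the links both aligners agree on; for source words that are
    left without any agreed link, fall back to links from either aligner."""
    agreed = a & b
    covered = {s for s, _ in agreed}
    fallback = {(s, t) for s, t in (a | b) if s not in covered}
    return agreed | fallback


def project_span(src_span: Tuple[int, int], alignment: Alignment) -> Optional[Tuple[int, int]]:
    """Map a source span [start, end) to the smallest target span [start, end)
    covering all aligned target words, or None if nothing is aligned."""
    start, end = src_span
    targets = [t for s, t in alignment if start <= s < end]
    if not targets:
        return None
    return min(targets), max(targets) + 1


def apply_tag(target_tokens: List[str], span: Tuple[int, int], tag: str) -> List[str]:
    """Wrap the target tokens inside `span` with an opening and a closing tag."""
    start, end = span
    return (
        target_tokens[:start]
        + [f"<{tag}>"]
        + target_tokens[start:end]
        + [f"</{tag}>"]
        + target_tokens[end:]
    )


# Hypothetical example: "red" is bold in the source sentence.
source = ["the", "red", "house"]
target = ["la", "casa", "roja"]
aligner_a = {(0, 0), (1, 2), (2, 1)}
aligner_b = {(0, 0), (1, 2), (2, 2)}

combined = combine_alignments(aligner_a, aligner_b)
span = project_span((1, 2), combined)  # source span covering "red"
if span is not None:
    print(" ".join(apply_tag(target, span, "b")))
```

A real system would also have to handle labels whose aligned target words are not contiguous, nested labels, and source words with no alignment at all; this sketch only returns None in the last case.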
Project CDTI for the hybrid neural machine translation platform.
[1] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[2] Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., Huang, X., Junczys-Dowmunt, M., Lewis, W., Li, M., Liu, S., Liu, T.-Y., Luo, R., Menezes, A., Qin, T., Seide, F., Tan, X., Tian, F., Wu, L., Wu, S., Xia, Y., Zhang, D., Zhang, Z., Zhou, M. (2018). Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.