Peris Abril. Interactivity, Adaptation and Multimodality in Neural Sequence-to-sequence Learning. Universitat Politècnica de València. 2019. Supervised by Francisco Casacuberta

The sequence-to-sequence problem consists of transforming an input sequence into an output sequence. A variety of problems can be posed in these terms, including machine translation, speech recognition and multimedia captioning. In recent years, the application of deep neural networks has revolutionized these fields, achieving impressive advances. Despite these improvements, however, the output of automatic systems is still far from perfect. To achieve high-quality predictions, fully automatic systems need to be supervised by a human agent, who corrects the errors. This is a common procedure in the translation industry. This thesis is mainly framed within the machine translation problem, tackled using fully neural systems. Our main objective is to develop more efficient neural machine translation systems that allow for a more productive usage and deployment of the technology. To this end, we base our contributions on two main cornerstones: how to make better use of the system and how to better leverage the data generated during its usage. In the first case, we apply the so-called interactive-predictive framework to neural machine translation. This embeds the human agent and the system in a cooperative correction process that seeks to reduce the human effort spent on obtaining high-quality translations. We develop different interactive protocols for neural machine translation, namely a prefix-based protocol and a segment-based protocol. They are implemented by modifying the search space of the model. Moreover, we introduce mechanisms for achieving fine-grained interaction while maintaining the decoding speed of the system. We carried out extensive experiments that show the potential of our contributions. The previous state of the art is surpassed by a large margin, and the current systems react better to human interactions.
Next, we study how to improve a neural system using the data generated as a byproduct of this correction process. To this end, we rely on two main learning paradigms: online and active learning. Under the first one, the system is updated on the fly, as soon as a sentence is corrected. Hence, the system continuously learns from the corrections, avoiding previous errors and specializing towards a given user or domain. Extensive experiments stressed the adaptive systems under different conditions and domains, demonstrating their capabilities. Moreover, we also carried out a human evaluation of the system, involving professional users. They were very pleased with the adaptive system and worked more efficiently using it. The second paradigm, active learning, is devised for the translation of huge amounts of data that cannot feasibly be supervised in full. In this scenario, the system selects the samples that are worth supervising and leaves the rest automatically translated. Applying this framework, we obtained reductions of approximately a quarter of the effort required to reach a desired translation quality. The neural approach also obtained large improvements compared with previous translation technologies. Finally, we address another challenging problem: visual captioning. It consists of generating a natural-language description of a visual object, namely an image or a video. We follow the sequence-to-sequence framework from a multimodal perspective. We start by tackling the task of generating captions for general-domain videos. Next, we move on to a more specific case: describing events from egocentric images acquired throughout the day. Since these events are consecutive, we aim to extract relationships between events in order to generate more informed captions. To this end, we propose a context-augmented system that can consider the previous events while analyzing the current one.
The results show that the context-aware model improved the generation quality with respect to a regular one. As a final point, we apply the interactive-predictive protocol to these multimodal captioning systems. As in the machine translation case, this protocol reduced the effort required to correct the outputs of an automatic system.