Duration: 1 September 2024 to 31 August 2028
Supported by: Generalitat Valenciana (GVA) under reference CIPROM/2023/17
PI: Roberto Paredes, Alberto Albiol
Members: José Miguel Benedí, Paolo Rosso, Joan Andreu Sánchez, Carlos D. Martínez

This project focuses on innovating Vision Encoder-Decoder (VED) models in artificial intelligence, specifically targeting their size and efficiency challenges. VED models, which bridge visual perception and language processing, are transformed by replacing their transformer-based decoders with MLP-Mixer layers and by training with the Connectionist Temporal Classification (CTC) loss. This approach reduces model complexity and computational demands, making VED models more efficient for tasks such as image captioning and text recognition. Additionally, the integration of reinforcement learning as a training method, using classifiers as reward functions, represents a significant shift from traditional training methods. This novel strategy is expected to enhance the models' ability to generate high-quality, domain-specific text.
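To make the decoder replacement concrete, the sketch below shows one possible shape of the idea: a single MLP-Mixer block applied to the visual encoder's output features, producing per-frame logits that a CTC loss could then align against a target transcript. This is a minimal illustrative sketch, not the project's implementation; all sizes (`T`, `D`, `H`, `V`) and parameter names are hypothetical, and a real model would stack several blocks and train the weights.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) axis.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # Two-layer MLP with ReLU, as used inside Mixer blocks.
    return np.maximum(x @ w1, 0) @ w2

def mixer_block(x, p):
    # Token mixing: an MLP shared across channels, applied along the sequence axis.
    y = layer_norm(x)
    x = x + mlp(y.swapaxes(-1, -2), p["tok_w1"], p["tok_w2"]).swapaxes(-1, -2)
    # Channel mixing: an MLP shared across positions, applied along the feature axis.
    y = layer_norm(x)
    return x + mlp(y, p["ch_w1"], p["ch_w2"])

rng = np.random.default_rng(0)
T, D, H, V = 32, 64, 128, 80      # frames, features, hidden size, vocab incl. CTC blank
params = {
    "tok_w1": rng.normal(0, 0.02, (T, H)), "tok_w2": rng.normal(0, 0.02, (H, T)),
    "ch_w1":  rng.normal(0, 0.02, (D, H)), "ch_w2":  rng.normal(0, 0.02, (H, D)),
}
feats = rng.normal(size=(T, D))   # stand-in for visual-encoder output
w_out = rng.normal(0, 0.02, (D, V))
logits = mixer_block(feats, params) @ w_out   # shape (T, V): per-frame logits for CTC
```

Because every frame gets its own logit vector, the output can be fed directly to a standard CTC loss, avoiding the step-by-step autoregressive decoding of a transformer decoder.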

The project methodology is systematically structured into seven Work Packages (WPs), covering literature review, fundamental research on VED model innovation, advanced training techniques, and specific applications. These applications include line-based and full-page handwritten text recognition, medical report generation from X-ray images, and lip reading.
Each WP is designed to thoroughly explore the potential and limitations of the new VED model approach. The main advantage of this approach is its balance between computational efficiency and the capability to accurately process complex visual-textual data, which is crucial for practical and powerful AI solutions in real-world scenarios.