Grant PID2021-124719OB-I00 funded by MCIN/AEI/10.13039/501100011033 and by ERDF, EU A way of making Europe
PI: Carlos D. Martínez
Members: Francisco Casacuberta, Roberto Paredes, Moisés Pastor, Emilio Granell

Speech recognition is usually performed by processing the associated audio signal. However, speech is a physically complex process that involves articulations of the phonic system, many of which are visible. In this sense, lip reading is an alternative way of decoding speech, one that can have a noticeable impact on speech recognition and understanding. Moreover, many people with severe hearing difficulties rely on lip reading as a fundamental means of decoding and understanding the speech of others. Consequently, automatic speech recognition from lip movement images constitutes a highly interesting task, both for its social applications (helping people with hearing problems understand speech, enabling speech synthesis for people who have lost their phonic capacity, …) and for its practical and leisure applications (improving speech recognition performance, silent speech passwords, transcription of old video footage without sound, …).

The approach proposed in this project is based on the machine learning paradigm, which has been highly successful in speech recognition from audio signals. In this paradigm, the relation between the input object (the audio signal in regular speech recognition, the video signal in lip reading) and the output object (the uttered word sequence) is captured by a set of models whose parameters can be estimated automatically. This estimation requires examples that pair inputs with outputs, i.e., audio or video signals with their corresponding transcriptions. Statistical estimation algorithms then fit the model parameters, and the resulting models can subsequently be applied to unknown sequences to produce their transcriptions.
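As a minimal illustration of this paradigm, the following toy sketch estimates a conditional model by relative-frequency counts from labeled examples and then decodes an unseen sequence. All names, the discretized "lip-shape" features, and the unigram-style model are illustrative assumptions, not the actual models the project would use:

```python
from collections import Counter, defaultdict

def estimate_model(examples):
    """Estimate P(word | feature) by relative-frequency counts
    from (feature, word) training pairs."""
    counts = defaultdict(Counter)
    for feature, word in examples:
        counts[feature][word] += 1
    model = {}
    for feature, word_counts in counts.items():
        total = sum(word_counts.values())
        model[feature] = {w: c / total for w, c in word_counts.items()}
    return model

def decode(model, features):
    """Transcribe an unseen feature sequence by picking, for each
    frame-level feature, the most probable word under the model."""
    return [max(model[f], key=model[f].get) for f in features if f in model]

# Toy training pairs: discretized lip-shape features with their words.
training = [("open", "a"), ("open", "a"), ("closed", "m"),
            ("round", "o"), ("round", "o"), ("closed", "b")]
model = estimate_model(training)
print(decode(model, ["open", "round", "closed"]))  # ['a', 'o', 'm']
```

Real systems model whole sequences (e.g. with hidden Markov models or neural networks) rather than isolated frames, but the principle is the same: counts over paired examples determine the parameters, which are then reused on unknown input.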

Along these lines, this project proposes to collect and annotate a dataset of videos in Spanish, together with their transcriptions, from which these models can be estimated. The videos can be obtained from different sources presenting realistic scenarios and would be selected to obtain the most appropriate set of sequences. These videos would also need to be properly processed, since raw images contain far more information than the task requires. Several options for extracting the relevant information from raw images would be proposed and tested on the available dataset, in search of those most robust to adverse conditions (illumination, facial physiognomy, environment, etc.). In parallel, different machine learning models would be studied, implemented, estimated, and evaluated on the available dataset. As final steps, lip reading would be integrated with audio-based speech recognition, and final systems with lip-reading-only capabilities and with audiovisual (lip reading plus audio recognition) capabilities would be implemented.
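The information-reduction step can be sketched as follows: given a grayscale frame stored as a 2D list, crop a mouth region and downsample it by block averaging into a compact feature grid. The fixed crop coordinates and block size here are illustrative assumptions; an actual system would locate the mouth with a face detector and use learned, robustness-oriented features:

```python
def crop(frame, top, left, height, width):
    """Extract a rectangular region of interest (e.g. the mouth area)
    from a grayscale frame stored as a 2D list of pixel intensities."""
    return [row[left:left + width] for row in frame[top:top + height]]

def downsample(region, block):
    """Reduce a region by averaging non-overlapping block x block
    patches, yielding a much smaller feature grid."""
    h, w = len(region), len(region[0])
    return [[sum(region[r + i][c + j]
                 for i in range(block) for j in range(block)) / (block * block)
             for c in range(0, w - block + 1, block)]
            for r in range(0, h - block + 1, block)]

# A toy 6x8 "frame"; the mouth is assumed to occupy rows 2-5, cols 2-7.
frame = [[(r * 8 + c) % 256 for c in range(8)] for r in range(6)]
mouth = crop(frame, top=2, left=2, height=4, width=6)
features = downsample(mouth, block=2)
print(len(features), len(features[0]))  # 2 3
```

The 4x6 mouth region collapses into a 2x3 grid of averages, which is the kind of dimensionality reduction that makes the subsequent model estimation tractable.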