Multi-modal learning for video understanding


Speciality: Image, Vision and Robotics

5 October 2022, 14:00 - Mr Valentin Gabeur (Grenoble-Alpes University), Maison Jean Kuntzmann (Grenoble University Campus), Amphitheatre (1st floor)

Keywords:
  • multi-modal
  • audio-visual
  • video retrieval
  • speech recognition
  • computer vision
  • deep learning
  • artificial intelligence

With the ever-increasing consumption of audio-visual media on the internet, video understanding has become an important problem for providing users with the right content. Compared to more traditional media, the video signal is inherently multi-modal: its information is scattered across several modalities such as speech, audio and vision. This thesis aims at designing and training deep learning models that exploit these different modalities to automatically understand video content.

While the community has developed high-performance deep learning models for uni-modal problems, processing multi-modal inputs such as video has received less attention. In the first part of this thesis, we introduce a transformer-based architecture that leverages pre-extracted uni-modal features obtained at different moments in the video and fuses them into a single video representation. Taking advantage of the attention mechanism, we show that our model processes the video signal across both modalities and time. We couple our video encoder with a caption encoder into a cross-modal architecture and obtain state-of-the-art results for the text-to-video retrieval task on three datasets.
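The fusion idea above can be illustrated with a minimal sketch: per-modality, per-timestep features are concatenated into one token sequence, a (here untrained, single-head) attention layer mixes information across modalities and time, and the result is pooled into one video embedding. All names, dimensions and the parameter-free attention are assumptions for illustration, not the thesis architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(tokens, d_k):
    """Single-head scaled dot-product self-attention with no learned
    weights (illustration only): every token attends to every other
    token, across modalities and timesteps alike."""
    scores = tokens @ tokens.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def fuse_video(features_by_modality):
    """Concatenate per-modality feature sequences into one token list,
    let attention mix them, then mean-pool into one video embedding."""
    tokens = np.concatenate(list(features_by_modality.values()), axis=0)
    attended = self_attention(tokens, d_k=tokens.shape[1])
    return attended.mean(axis=0)

# Hypothetical pre-extracted features: 4 timesteps per modality, dim 8.
feats = {m: rng.standard_normal((4, 8)) for m in ("speech", "audio", "vision")}
video_emb = fuse_video(feats)
```

For retrieval, a caption encoder would map text into the same space, and video/caption embeddings would be compared by a similarity score such as the dot product.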

Training such a model unfortunately requires a large amount of manually captioned videos. In order to leverage the billions of unlabelled videos on the internet, the common approach has been to use the accompanying speech as supervision to pretrain a video encoder on the other modalities. While the resulting encoder is capable of processing visual information and sound, it has not been trained to attend to the spoken language in videos. In the second part of this thesis, we propose a method to pretrain a multi-modal encoder on all video modalities, including speech. In each training batch, we completely mask out a different modality from the video encoder input and use it as supervision. We fine-tune our model on the task of text-to-video retrieval and show that our approach is particularly suitable for datasets where user queries relate to the spoken content in the video.
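The masked-modality objective described above can be sketched as follows: in each step one modality is fully removed from the encoder input and its (here pooled) features serve as the target. The toy mean-pooling encoder and the L2 loss are stand-ins chosen for illustration, not the actual model or objective.

```python
import numpy as np

rng = np.random.default_rng(1)
MODALITIES = ("speech", "audio", "vision")

def encode(features_by_modality):
    """Toy encoder: mean-pool whatever modality features are present.
    A stand-in for the multi-modal transformer encoder."""
    present = [f for f in features_by_modality.values() if f is not None]
    return np.mean(np.concatenate(present, axis=0), axis=0)

def pretraining_step(video_feats, masked):
    """One masked-modality batch: the chosen modality is removed from
    the input entirely, and its pooled features become the target."""
    inputs = {m: (None if m == masked else f) for m, f in video_feats.items()}
    target = video_feats[masked].mean(axis=0)
    prediction = encode(inputs)
    return float(np.mean((prediction - target) ** 2))  # L2 surrogate loss

video = {m: rng.standard_normal((4, 8)) for m in MODALITIES}
# Cycle the masked modality across successive batches.
losses = [pretraining_step(video, m) for m in MODALITIES]
```

Because every modality takes a turn as the target, the encoder is pushed to model speech as well as audio and vision, rather than treating speech only as a supervision signal.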

Extracting spoken language from the audio signal can be challenging, particularly when the audio is noisy. In the case of audio-visual media, the visual modality can provide precious cues to better extract speech from videos. In the third part of this thesis, we introduce an encoder-decoder architecture as well as a pretraining strategy to train a speech recognition model not only on the audio signal but also on the visual signal. We evaluate our approach on a challenging test set, demonstrating the benefit of the visual modality under adverse audio conditions.
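The audio-visual recognition setup can be sketched as an encoder that stacks both feature streams into one memory, followed by a greedy decoder over a toy vocabulary. Everything here (vocabulary, random projection, crude state update) is an untrained illustration of the encoder-decoder shape, not the thesis model.

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB = ["<eos>", "hello", "world"]

def encode_av(audio_feats, visual_feats):
    """Toy audio-visual encoder: stack both streams so the decoder can
    draw on visual cues (e.g. lip movements) when the audio is noisy."""
    return np.concatenate([audio_feats, visual_feats], axis=0)

def greedy_decode(memory, proj, max_len=5):
    """Greedy decoding over the toy vocabulary; `proj` maps the pooled
    encoder memory to vocabulary logits (random here, untrained)."""
    pooled = memory.mean(axis=0)
    out = []
    for _ in range(max_len):
        logits = proj @ pooled
        tok = int(np.argmax(logits))
        if VOCAB[tok] == "<eos>":
            break
        out.append(VOCAB[tok])
        pooled = pooled + proj[tok]  # crude state update, illustration only
    return out

audio = rng.standard_normal((6, 8))   # e.g. 6 audio frames
visual = rng.standard_normal((3, 8))  # e.g. 3 lip-region frames
memory = encode_av(audio, visual)
proj = rng.standard_normal((len(VOCAB), 8))
hypothesis = greedy_decode(memory, proj)
```

In a real system the decoder would attend over the full audio-visual memory at each step; when the audio stream is corrupted, that attention can shift toward the visual tokens.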


Dr Eric Gaussier (Grenoble-Alpes University)