Motion in action: optical flow estimation and action localization in videos


Specialty: Mathematics and Computer Science

23/09/2016 - 17:00 Mr Philippe Weinzaepfel (Université Grenoble Alpes) Main lecture hall (Grand Amphi), INRIA Rhône-Alpes, Montbonnot

With the recent overwhelming growth of digital video content, automatic video understanding has become increasingly important. In this presentation, I will describe several contributions to two automatic video understanding tasks: optical flow estimation and human action localization.
Optical flow estimation consists of computing the displacement of every pixel in a video. It faces several challenges, including large non-rigid displacements, occlusions and motion boundaries. We first introduce an optical flow approach based on a variational model that incorporates a new matching method. The proposed matching algorithm is built upon a hierarchical multilayer correlational architecture and effectively handles non-rigid deformations and repetitive textures. It improves flow estimation in the presence of significant appearance changes and large displacements. We also introduce a novel scheme for estimating optical flow based on an edge-preserving sparse-to-dense interpolation of matches. This method leverages an edge-aware geodesic distance tailored to respect motion boundaries and to handle occlusions. Furthermore, we propose a learning-based approach for detecting motion boundaries. Motion boundary patterns are predicted at the patch level using structured random forests. We experimentally show that our approach outperforms the flow gradient baseline on both synthetic data and real-world videos, including a newly introduced dataset of consumer videos.
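To make the flow gradient baseline mentioned above concrete, here is a minimal sketch of one way such a baseline could be computed: a pixel is marked as a motion boundary when the gradient magnitude of the flow field exceeds a threshold. The function names, the forward-difference scheme and the threshold value are assumptions made for illustration; the abstract does not specify how the baseline is implemented.

```python
import math


def flow_gradient_magnitude(u, v):
    """Per-pixel gradient magnitude of a flow field (u, v).

    u and v are lists of rows holding the horizontal and vertical
    displacement of each pixel. Forward differences are used, and the
    difference is zero on the last row and column (assumed convention).
    """
    h, w = len(u), len(u[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp neighbor indices so that border differences are zero.
            xn = min(x + 1, w - 1)
            yn = min(y + 1, h - 1)
            du_dx = u[y][xn] - u[y][x]
            du_dy = u[yn][x] - u[y][x]
            dv_dx = v[y][xn] - v[y][x]
            dv_dy = v[yn][x] - v[y][x]
            mag[y][x] = math.sqrt(
                du_dx ** 2 + du_dy ** 2 + dv_dx ** 2 + dv_dy ** 2
            )
    return mag


def motion_boundaries(u, v, threshold=1.0):
    """Boolean mask: True where the flow gradient exceeds the threshold."""
    mag = flow_gradient_magnitude(u, v)
    return [[m > threshold for m in row] for row in mag]


if __name__ == "__main__":
    # Toy flow: the left half is static, the right half moves 3 pixels right.
    u = [[0.0, 0.0, 3.0, 3.0] for _ in range(4)]
    v = [[0.0] * 4 for _ in range(4)]
    for row in motion_boundaries(u, v):
        print(row)
```

A baseline of this kind responds to any strong change in the estimated flow, so it inherits the errors of the flow itself near occlusions, which is one reason a learned detector may do better.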
Human action localization consists of recognizing the actions that occur in a video, such as 'drinking' or 'phoning', as well as their temporal and spatial extent. We first propose a novel approach based on deep convolutional neural networks. The method extracts class-specific tubes by leveraging recent advances in detection and tracking. Tube description is enhanced by spatio-temporal local features. Temporal detection is performed using a sliding window scheme inside each tube. Our approach outperforms the state of the art on challenging action localization benchmarks. Second, we introduce a weakly-supervised action localization method, i.e., one that does not require bounding box annotations. Action proposals are computed by extracting tubes around the humans. This is performed using a human detector that is robust to unusual poses and occlusions and is learned on a human pose benchmark. A high recall is reached with only a few human tubes, which allows Multiple Instance Learning to be applied effectively. Furthermore, we introduce a new dataset for human action localization. It overcomes the limitations of existing benchmarks, such as the limited diversity and duration of their videos. Our weakly-supervised approach obtains results close to fully-supervised ones while significantly reducing the required amount of annotations.
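The sliding window scheme used for temporal detection inside a tube can be illustrated with a simplified sketch. It assumes that each frame of a tube already has a scalar action score and that a window is scored by the mean of its frame scores. Both assumptions, along with the function name and parameters, are hypothetical; the abstract does not describe how windows are actually scored.

```python
def best_temporal_window(frame_scores, window_lengths, stride=1):
    """Return (start, end, score) of the best window, with end exclusive.

    frame_scores: per-frame action scores along one tube (assumed given).
    window_lengths: candidate window lengths, in frames.
    Returns None if no candidate window fits inside the tube.
    """
    n = len(frame_scores)
    # Prefix sums give the sum of any window in constant time.
    prefix = [0.0]
    for s in frame_scores:
        prefix.append(prefix[-1] + s)

    best = None
    for length in window_lengths:
        if length <= 0 or length > n:
            continue
        for start in range(0, n - length + 1, stride):
            end = start + length
            score = (prefix[end] - prefix[start]) / length
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


if __name__ == "__main__":
    scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
    print(best_temporal_window(scores, window_lengths=[2, 3]))
```

Scoring by the mean tends to favor short windows centered on the highest-scoring frames, so a practical system would likely use a learned window classifier or a length-aware score instead.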


  • Ms Cordelia Schmid (Research Director - INRIA Grenoble)
  • Mr Zaid Harchaoui (Research Scientist - New York University)


  • Mr Martial Hebert (Professor - Carnegie Mellon University, USA)
  • Mr Ivan Laptev (Research Director - Inria Paris)


  • Mr Jitendra Malik (Professor - University of California at Berkeley)
  • Mr Jean Ponce (Professor - ENS Paris)