Driving scene understanding from automotive-grade sensors


Speciality: Image, Vision et Robotique

30/05/2023, 10:00 - Florent Bartoccioni (Inria / Valeo AI), Inria Grenoble

Autonomous driving technology has the potential to revolutionize transportation, making it safer, more efficient, and more accessible for everyone. However, achieving full autonomy requires a complex system that can perceive and understand the environment in real time. In the context of mass-produced passenger cars, automotive-grade sensors, such as cameras and few-beam LiDARs, are crucial components of such a system. Although these sensors provide rich and diverse information about the scene, they also present significant challenges: few-beam LiDARs produce noisy and sparse measurements, while estimating scene geometry from cameras alone is difficult. To overcome these challenges, this thesis proposes two novel approaches that leverage automotive-grade sensors for driving scene understanding.

The first part of the thesis revisits the task of depth estimation from a monocular camera, a key capability for autonomous systems, which are often equipped with multiple independent cameras. Existing methods rely either on costly LiDARs (32 or 64 beams) or on a monocular camera signal alone, which suffers from several ambiguities. To circumvent these limitations, we propose a new approach that combines a monocular camera with a lightweight LiDAR, such as a 4-beam scanner, typical of today's mass-produced automotive laser scanners. Our self-supervised approach overcomes the scale ambiguity and infinite-depth problems associated with camera-only methods. It produces a rich 3D representation of the environment without requiring ground truth during training. Moreover, as our approach relies on sensors already found in automated cars on the market, it finds direct applications in Advanced Driver Assistance Systems (ADAS).
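
To make the training signal concrete, the sketch below illustrates, in a PyTorch style, how a photometric self-supervision term can be combined with a sparse 4-beam LiDAR term. It is a minimal sketch under stated assumptions, not the thesis implementation: the tensor layout, the view-synthesis input, and the loss weighting are all illustrative.

```python
# Minimal sketch (an assumption, not the thesis code): combining photometric
# self-supervision with a sparse 4-beam LiDAR term. All tensors are assumed
# to be aligned in the target camera frame; names are illustrative.
import torch
import torch.nn.functional as F

def self_supervised_depth_loss(depth, img_t, img_t_hat, lidar_depth, lidar_mask,
                               lidar_weight=0.1):
    """depth       : predicted dense depth for the target frame, (B, 1, H, W)
       img_t       : target camera frame, (B, 3, H, W)
       img_t_hat   : target frame re-synthesized from a source frame using the
                     predicted depth and relative pose (view-synthesis step)
       lidar_depth : sparse depth from the 4-beam scanner projected into the
                     target image, (B, 1, H, W), zero where there is no return
       lidar_mask  : boolean mask of valid LiDAR returns, (B, 1, H, W)
    """
    # Photometric term: if depth and pose are correct, the re-synthesized view
    # should match the real target image (self-supervision, no ground truth).
    loss_photo = (img_t_hat - img_t).abs().mean()

    # Sparse LiDAR term: the few valid returns anchor the metric scale and
    # counter the infinite-depth failure mode of camera-only training.
    loss_lidar = F.l1_loss(depth[lidar_mask], lidar_depth[lidar_mask])

    # The weighting between the two terms is an assumption for illustration.
    return loss_photo + lidar_weight * loss_lidar
```

In such a setup, the photometric term alone would only constrain depth up to an unknown scale factor; the handful of metric LiDAR returns is what resolves that ambiguity.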

The second part of the thesis presents a transformer-based architecture for vehicle and driveable-area segmentation in Bird's-Eye-View (BEV) from multiple cameras. This setup is particularly challenging, as both the geometry and the semantics of the scene must be extracted from 2D visual signals alone. Although BEV maps have become a common intermediate representation in autonomous driving, predicting them in real time requires complex operations, such as multi-camera feature extraction and projection into a common top-view grid. These operations are usually performed either with error-prone geometric methods (e.g., homography or back-projection from monocular depth estimation) or with expensive dense mappings between image pixels and BEV cells (e.g., with MLPs or attention). The proposed model addresses these issues by using a compact collection of latent vectors to deeply fuse information from multiple sensors. The resulting internal representation of the scene is then reprojected into BEV space to segment vehicles and driveable areas. We also provide evidence that the model can accumulate knowledge about the scene over time directly in the latent space, paving the way for efficient reasoning and planning.
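
As an illustration of this latent-bottleneck idea, the PyTorch sketch below shows how a small set of learned latent vectors could cross-attend to multi-camera features and then be queried by a BEV grid for segmentation. The class name, shapes, attention layout, and hyperparameters are assumptions for illustration, not the exact thesis architecture (which, among other things, also encodes camera geometry).

```python
# Minimal sketch (assumed layout, not the thesis architecture): a compact set
# of latent vectors fuses multi-camera features, then a BEV query grid reads
# the latents out and is decoded into per-cell class logits.
import torch
import torch.nn as nn

class LatentBEVSegmenter(nn.Module):
    def __init__(self, num_latents=128, dim=256, bev_size=100, num_classes=2):
        super().__init__()
        self.bev_size = bev_size
        self.latents = nn.Parameter(torch.randn(num_latents, dim))        # compact scene state
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, dim))
        self.read_cams = nn.MultiheadAttention(dim, 8, batch_first=True)  # latents <- cameras
        self.write_bev = nn.MultiheadAttention(dim, 8, batch_first=True)  # BEV grid <- latents
        self.head = nn.Linear(dim, num_classes)                           # vehicle / driveable-area logits

    def forward(self, cam_feats):
        # cam_feats: (B, N_cams * H * W, dim) image features from a shared backbone.
        B = cam_feats.shape[0]
        lat = self.latents.expand(B, -1, -1)
        # Fuse all camera tokens into the fixed-size latent collection.
        lat, _ = self.read_cams(lat, cam_feats, cam_feats)
        # Each BEV cell queries the latent scene representation.
        bev, _ = self.write_bev(self.bev_queries.expand(B, -1, -1), lat, lat)
        logits = self.head(bev)
        return logits.view(B, self.bev_size, self.bev_size, -1)
```

In this sketch, the cameras are fused into a fixed number of latents rather than mapped densely into BEV, so the camera-fusion cost does not grow with the resolution of the output grid, and the latent state is a natural place to accumulate information over time.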

The proposed models are validated on real-world datasets and on prototype cars, demonstrating the potential of automotive-grade sensors for driving scene understanding. By addressing the challenges associated with these sensors, our approaches provide a viable path towards their deployment in autonomous driving systems.

Director:

  • Karteek Alahari (Inria)

Rapporteurs:

  • Vincent Lepetit (ENPC)
  • Alexandre Alahi (EPFL)

Examiners:

  • Matthieu Cord (Sorbonne Univ.)
  • Aurélie Bugeau (Univ. Bordeaux)
  • Cordelia Schmid (Inria)
  • Jean-Sébastien Franco (Grenoble INP)
  • Patrick Pérez (Valeo AI)