Transfer Talk - Venkatesh Gurram Munirathnam - 22nd June 2020

Title: 3D Object Detection for Autonomous Vehicles Using Multimodal Temporal Data

This research fuses multimodal and temporal information from multiple sensors into deep learning models to improve 3D object detection for autonomous and instrumented vehicles. The race to autonomy in instrumented vehicles and self-driving cars has stimulated significant research into systems that emulate human-like perception and behaviour in driving environments. Object detection plays a key role in such perception systems in autonomous vehicle (AV) technology by identifying the locations of vehicles, pedestrians, roads, and obstacles in the video stream. The advent of deep learning in computer vision and the availability of new sensing modalities such as LiDAR and RADAR have led to state-of-the-art architectures for estimating object location and depth. However, vehicle navigation in complex driving environments, such as crowded city traffic, remains an open challenge, particularly when only a single sensor modality is used. This limitation can potentially be addressed by fusing different modalities to exploit their complementary properties, and this forms the basis for this research.
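One common way to combine complementary modalities is feature-level (middle) fusion: project the features of each sensor branch onto a shared spatial grid and concatenate them along the channel axis, so that subsequent layers can learn cross-modal interactions. The sketch below illustrates only this shape arithmetic with NumPy; the channel counts, grid size, and function name are illustrative assumptions, not the architecture used in this research.

```python
import numpy as np

def fuse_modalities(cam_feat, lidar_feat):
    """Feature-level fusion sketch: both modality branches are assumed to be
    aligned on one spatial grid (H, W); their channels are concatenated so
    later layers can exploit complementary information.

    cam_feat:   (C1, H, W) image-branch features
    lidar_feat: (C2, H, W) LiDAR-branch features projected onto the same grid
    Returns a (C1 + C2, H, W) fused feature map.
    """
    assert cam_feat.shape[1:] == lidar_feat.shape[1:], "spatial grids must match"
    return np.concatenate([cam_feat, lidar_feat], axis=0)

# Illustrative sizes only: 64-channel camera features, 32-channel LiDAR
# features, both on a 100x88 grid.
cam = np.random.rand(64, 100, 88)
lidar = np.random.rand(32, 100, 88)
fused = fuse_modalities(cam, lidar)
print(fused.shape)  # (96, 100, 88)
```

Early fusion (combining raw inputs) and late fusion (combining per-modality detections) are the usual alternatives; middle fusion is shown here because it makes the complementary-features idea concrete in a few lines.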

Apart from using multiple sensing modalities, another information source that can be exploited is temporal cues (the variation of spatial features over time). Although temporal information is used extensively to improve moving object detection (MOD) accuracy in video streams, it has rarely been explored for 3D object detection, and most existing work extracts temporal information from a single modality. This research therefore hypothesises that multimodal temporal information can help bridge the gap between 2D images and 3D space.

The aim of this research is to extract temporal cues from multiple sensing modalities and fuse them within deep network architectures for 3D object detection. In a pilot study on three object classes (Car, Pedestrian, Cyclist), incorporating temporal information into a 3D object detection network yielded overall detection accuracy comparable to the baseline, but with class-level shifts: an improvement of ~8% for pedestrian detection, a slight decline (~1%) for vehicle detection, and a decline of ~6% for cyclist detection. This study paves the way for further experiments to determine how best to exploit temporal cues from multimodal data, and which fusion schemes work best, in 3D object detection networks.
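A simple way temporal cues can be injected into a 3D detector, sketched below with NumPy, is to stack the bird's-eye-view (BEV) maps of several consecutive LiDAR sweeps along the channel axis, so a conventional 2D backbone sees object motion as channel-wise variation. The sweep count, grid size, and channel layout here are assumptions for illustration, not the configuration of the pilot study.

```python
import numpy as np

def stack_temporal_bev(frames):
    """Early temporal fusion sketch: concatenate per-sweep BEV maps on the
    channel axis.

    frames: list of k arrays, each (C, H, W) -- one rasterised BEV map per
            LiDAR sweep.
    Returns a (k*C, H, W) input a standard 2D detector backbone can consume,
    in which temporal change appears as variation across channels.
    """
    return np.concatenate(frames, axis=0)

# Three consecutive sweeps, each rasterised to a 2-channel 200x176 BEV grid
# (illustrative sizes only).
sweeps = [np.random.rand(2, 200, 176) for _ in range(3)]
fused = stack_temporal_bev(sweeps)
print(fused.shape)  # (6, 200, 176)
```

More elaborate alternatives, such as recurrent layers or feature alignment across frames, trade this simplicity for explicit motion modelling; the stacking scheme is shown only because it makes "temporal cues as network input" concrete.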