Survey of Real-time 3D Detection in Autonomous Driving

Summary

In this blog, I summarize 3D detection methods for autonomous driving, including implementations that use inference optimization frameworks such as TensorRT.

Based on the performance comparison, the following multi-camera 3D detection models stand out:

  • StreamPETR (ResNet50): This is a lightweight model, making it suitable for a wide range of applications.
  • StreamPETR (ResNet101): This model strikes a good balance between detection performance and inference time.
  • Far3D (V2-99): This model may be too computationally heavy for some environments, so it may be more appropriate as an offline model.

Based on the performance comparison, the following LiDAR-based 3D detection models stand out:

  • CenterPoint (pillar): The pillar-based encoder backbone is stable in a TensorRT environment, so this model is suitable for a wide range of applications.

  • TransFusion-L (spconv): This model offers a moderate balance between detection performance and inference time.

  • BEVFusion-CL (spconv+ResNet50): This model uses both camera images and LiDAR point clouds. If your system can reliably obtain both sensor streams, it may be a suitable choice.

  • Notes on the summary table

    • The values in italics are estimated.
    • Mo.: Modality of input (C = Camera input, L = LiDAR input, CL = Camera + LiDAR input).
    • mAP-Nv: mAP on the nuScenes validation set (fp32, PyTorch).
    • mAP-Nt: mAP on the nuScenes test set (fp32, PyTorch).
    • mAP-Av: mAP on the Argoverse validation set (fp32, PyTorch).
    • TensorRT: Inference time and GPU memory with TensorRT (RTX3090, fp16).
    • Note that inference time also increases as the camera input resolution increases.

| Model | Mo. | mAP-Nv | mAP-Nt | mAP-Av | TensorRT |
|---|---|---|---|---|---|
| StreamPETR (ResNet50) | C | 43.2 | | | *14ms* |
| StreamPETR (V2-99, 320×800) | C | 48.2 | | | *32ms* |
| Far3D (V2-99) | C | | | 24.4 | *100ms* |
| CenterPoint (pillar) | L | 48.4 | | | |
| CenterPoint (spconv) | L | 59.5 | 60.3 | 27.4 | 10ms |
| TransFusion-L (spconv) | L | 64.9 | | | 10ms |
| BEVFusion-CL (spconv+ResNet50) | CL | 67.9 | | | 15ms |
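
The TensorRT column above refers to fp16 engines. As a rough reference, the sketch below shows one common way to build such an engine from an ONNX export with the TensorRT Python API; the file names are placeholders and this is not necessarily the exact pipeline used for the measurements. Models such as BEVFormer also need custom TensorRT plugins (e.g., for deformable attention), so a plain ONNX parse is not sufficient for every model in the table.

```python
# Minimal sketch: build an fp16 TensorRT engine from an ONNX export.
# "model.onnx" / "model.engine" are placeholder names.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # fp16 engine, as in the table above

engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine)
```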

Performance Analysis

Performance

  • Notes on the performance table
    • Pre.: Pretrained model.
    • Ten.: TensorRT (C++) implementation.
    • Mo.: Modality of input (C = Camera input, L = LiDAR input, CL = Camera + LiDAR input).
    • mAP-Nv: mAP on the nuScenes validation set (fp32, PyTorch).
    • mAP-Nt: mAP on the nuScenes test set (fp32, PyTorch).
    • mAP-Av: mAP on the Argoverse validation set (fp32, PyTorch).
    • PyTorch: Inference time on nuScenes data with PyTorch (RTX3090, fp32).
    • TensorRT: Inference time and GPU memory with TensorRT (RTX3090, fp16).
    • TensorRT other: Inference time measured under other conditions (device and precision noted in parentheses).

| Model | Pre. | Ten. | Mo. | mAP-Nv | mAP-Nt | mAP-Av | PyTorch | TensorRT | TensorRT other |
|---|---|---|---|---|---|---|---|---|---|
| BEVFormer tiny (ResNet50) | Yes | Yes | C | 25.2 | | | 62ms | 14ms, VRAM 1.7GB | 26ms (RTX3090, fp32) |
| BEVFormer small (ResNet101) | Yes | Yes | C | 37.5 | 40.9 | | 196ms | 78ms, VRAM 3.7GB | 151ms (RTX3090, fp32) |
| BEVFormer base (ResNet101) | Yes | Yes | C | 41.6 | 44.5 | | 417ms | 555ms, VRAM 11.8GB | 666ms (RTX3090, fp32) |
| BEVFormer (V2-99) | | | C | | 48.1 | | | | |
| StreamPETR (ResNet50) | Yes | Yes | C | 43.2 | | | 36ms | | 33ms (Orin) |
| StreamPETR (ResNet101) | | | C | 50.4 | | | 156ms | | |
| StreamPETR (V2-99, 320×800) | Yes | | C | 48.2 | | | 80ms | | |
| StreamPETR (V2-99, 960×640) | | | C | | | 20.3 | | | |
| StreamPETR (V2-99, 640×1600) | | | C | | 55.0 | | | | |
| StreamPETR (ViT-L) | | | C | | 62.0 | | | | |
| Far3D (ResNet101) | | | C | 51.0 | | | | | |
| Far3D (ViT-L, 1536×1536) | | | C | | 63.5 | | | | |
| Far3D (V2-99, 960×640) | Yes | Yes | C | | | 24.4 | | | 366ms (Orin, fp16) |
| CenterPoint (pillar) | Yes | Yes | L | 48.4 | | | | | |
| CenterPoint (spconv) | Yes | Yes | L | 59.5 | 60.3 | 27.4 | 80ms | 10ms, VRAM 1.4GB | 43ms (Orin, fp16) |
| TransFusion-L (spconv) | Yes | Yes | L | 64.9 | 67.5 | | 80ms | 10ms, VRAM 1.5GB | |
| TransFusion-CL (spconv) | Yes | | CL | | 68.9 | | 156ms | | |
| BEVFusion-CL (spconv+Swin-T) | Yes | Yes | CL | 68.5 | 70.2 | | 119ms | 20ms | |
| BEVFusion-CL (spconv+ResNet50) | Yes | Yes | CL | 67.9 | | | | 15ms, VRAM 3GB | 55ms (Orin, fp16) |

Comparison of Models

In this section, I estimate the TensorRT (RTX3090, fp16) inference speed of models that were not measured directly.

We can compare PyTorch (RTX3090) and TensorRT (RTX3090, fp16) using

  • BEVFormer small (ResNet101) TensorRT (RTX3090, fp16): 78ms
  • BEVFormer small (ResNet101) PyTorch (RTX3090): 196ms

We can compare TensorRT (Orin, fp16) and TensorRT (RTX3090, fp16) using

  • BEVFusion-CL (spconv+ResNet50) TensorRT (RTX3090, fp16): 15ms
  • BEVFusion-CL (spconv+ResNet50) TensorRT (Orin, fp16): 55ms

From these comparisons, we can estimate the remaining times as follows (the sketch after this list makes the arithmetic explicit).

  • StreamPETR (ResNet50) TensorRT (RTX3090, fp16): 36ms * (78ms / 196ms) = 14ms
    • StreamPETR (ResNet50) PyTorch (RTX3090): 36ms
  • StreamPETR (V2-99, 320*800) TensorRT (RTX3090, fp16): 80ms * (78ms / 196ms) = 32ms
    • StreamPETR (V2-99, 320*800) PyTorch (RTX3090): 80ms
  • StreamPETR (ResNet101) TensorRT (RTX3090, fp16): 156ms * (78ms / 196ms) = 62ms
    • StreamPETR (ResNet101) PyTorch (RTX3090): 156ms
  • Far3D (V2-99) TensorRT (RTX3090, fp16): 366ms * (15ms / 55ms) = 99.8ms
    • Far3D (V2-99) TensorRT (Orin, fp16): 366ms
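
The following sketch just makes the ratio-scaling arithmetic above explicit; the ratios and measured times are the ones listed in this section.

```python
# Estimate missing TensorRT (RTX3090, fp16) times by ratio scaling.
# Ratios come from models measured in both settings (see above).
TRT_FP16_OVER_PYTORCH = 78.0 / 196.0   # BEVFormer small: TensorRT fp16 / PyTorch fp32 on RTX3090
RTX3090_OVER_ORIN = 15.0 / 55.0        # BEVFusion-CL: RTX3090 fp16 / Orin fp16

def scale(measured_ms: float, ratio: float) -> float:
    """Scale a measured inference time by an observed ratio."""
    return measured_ms * ratio

print(round(scale(36.0, TRT_FP16_OVER_PYTORCH)))   # StreamPETR (ResNet50):  ~14 ms
print(round(scale(80.0, TRT_FP16_OVER_PYTORCH)))   # StreamPETR (V2-99):     ~32 ms
print(round(scale(156.0, TRT_FP16_OVER_PYTORCH)))  # StreamPETR (ResNet101): ~62 ms
print(round(scale(366.0, RTX3090_OVER_ORIN)))      # Far3D (V2-99):          ~100 ms
```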

Backbone Models

VoVNet

VoVNet is used as the image encoder in camera-based methods. “V2-99” refers to the VoVNetV2-99 image encoder.

Sparse Convolution

Sparse convolution is used as the LiDAR point cloud encoder. It improves the efficiency and scalability of processing sparse 3D data by computing convolutions only at occupied voxels. Because sparse convolution is both efficient and high-performing, it is used as the backbone of many models. Note that sparse convolution libraries often have issues related to the CUDA environment, so if your system is unstable (for example, the OS or CUDA version is updated frequently), I cannot recommend using sparse convolution.
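
As a concrete illustration, here is a minimal sketch of a sparse 3D convolution block using the spconv library (v2.x API). The voxel grid, channel counts, and random inputs are illustrative only, not taken from a specific model.

```python
import torch
import spconv.pytorch as spconv  # spconv 2.x

# Illustrative voxelized LiDAR features: N voxels with C = 4 features each.
n = 1000
features = torch.randn(n, 4).cuda()
coords = torch.stack([torch.randint(0, 41, (n,)),            # z index
                      torch.randint(0, 1440, (n,)),          # y index
                      torch.randint(0, 1440, (n,))], dim=1)  # x index
indices = torch.cat([torch.zeros(n, 1, dtype=torch.long), coords], dim=1).int().cuda()

x = spconv.SparseConvTensor(features, indices, spatial_shape=[41, 1440, 1440], batch_size=1)

net = spconv.SparseSequential(
    spconv.SubMConv3d(4, 16, kernel_size=3, padding=1),               # submanifold conv keeps the sparsity pattern
    spconv.SparseConv3d(16, 32, kernel_size=3, stride=2, padding=1),  # strided sparse conv downsamples
).cuda()

out = net(x)          # still a SparseConvTensor
dense = out.dense()   # (B, C, D, H, W) dense tensor if a dense head needs it
```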

PointPillars

PointPillars is the first standard model for LiDAR-based 3D object detection. It represented a breakthrough in real-time 3D detection by efficiently encoding LiDAR point clouds into a structured grid of pillars and processing them with fast 2D convolutions. Since the Pillar Feature Net is an efficient backbone, it has been used in many models. Note that the TensorRT implementation of the Pillar Feature Net is more stable than that of sparse convolution.
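
The sketch below illustrates the core pillar idea: points are grouped into vertical pillars on an x-y grid, a per-pillar feature is computed, and the result is scattered into a dense BEV pseudo-image that ordinary 2D convolutions can process. It is a toy version only; the real Pillar Feature Net also applies a small PointNet per pillar, and the grid size and coordinate range here are arbitrary.

```python
import torch

# Toy pillar encoding: 20k points with (x, y, z, intensity) in [0, 1) for simplicity.
points = torch.rand(20000, 4)
grid = 512                                   # 512 x 512 BEV grid over the unit square
cell = 1.0 / grid

ix = (points[:, 0] / cell).long().clamp(0, grid - 1)
iy = (points[:, 1] / cell).long().clamp(0, grid - 1)
pillar_id = iy * grid + ix                   # flat pillar index per point

# Per-pillar feature: here simply the mean of the raw point features.
num_pillars = grid * grid
feat_sum = torch.zeros(num_pillars, 4).index_add_(0, pillar_id, points)
counts = torch.zeros(num_pillars).index_add_(0, pillar_id, torch.ones(len(points)))
pillar_feat = feat_sum / counts.clamp(min=1).unsqueeze(1)

# Scatter to a (C, H, W) pseudo-image and run a standard 2D convolution on it.
bev = pillar_feat.t().reshape(4, grid, grid)
out = torch.nn.Conv2d(4, 64, kernel_size=3, padding=1)(bev.unsqueeze(0))
```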

Camera-based Models

BEVFormer

The BEVFormer framework offers 3D detection based on Bird’s-Eye View (BEV) representations for autonomous driving systems using multi-camera inputs and spatiotemporal transformers.

StreamPETR

StreamPETR introduces an object-centric temporal modeling approach for multi-view 3D object detection, offering a more efficient and scalable solution. This is particularly useful for autonomous driving, where real-time performance and high accuracy are essential.

Far3D

Far3D provides a significant advancement in surround-view 3D object detection by expanding the detection horizon to include distant objects.

LiDAR-based Models

Per-class AP (CV = construction vehicle, Tra = trailer, Bar = barrier, Mot = motorcycle, Bic = bicycle, Ped = pedestrian, Cone = traffic cone):

| Model | Car | Truck | CV | Bus | Tra | Bar | Mot | Bic | Ped | Cone |
|---|---|---|---|---|---|---|---|---|---|---|
| CenterPoint (spconv) | 84.5 | 47.9 | 7.9 | 58.9 | 28.1 | 62.4 | 44.1 | 16.0 | 76.3 | 51.9 |
| TransFusion-L (spconv) | 87.7 | 59.7 | 19.3 | 72.8 | 40.9 | 70.0 | 72.0 | 54.7 | 86.5 | 73.9 |
| BEVFusion-CL (spconv) | 88.3 | 53.6 | 21.8 | 73.3 | 38.1 | 72.4 | 76.2 | 61.2 | 87.4 | 78.0 |

CenterPoint

CenterPoint is a simple and efficient model for 3D detection, often considered the standard in autonomous driving. CenterPoint can use either spconv (sparse convolution) or pillar (Pillar Feature Net) as the backbone of its LiDAR encoder.
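
As a rough illustration of that choice, the snippet below sketches how the two LiDAR encoder options typically look in an mmdetection3d-style config. The component names follow the mmdetection3d registry, but the channel sizes and shapes are illustrative and not copied from the official config files.

```python
# Pillar backbone: Pillar Feature Net + scatter into a BEV pseudo-image.
centerpoint_pillar_encoder = dict(
    pts_voxel_encoder=dict(type='PillarFeatureNet', in_channels=5, feat_channels=[64]),
    pts_middle_encoder=dict(type='PointPillarsScatter', in_channels=64, output_shape=[512, 512]),
)

# spconv backbone: simple voxel feature encoder + 3D sparse convolution encoder.
centerpoint_spconv_encoder = dict(
    pts_voxel_encoder=dict(type='HardSimpleVFE', num_features=5),
    pts_middle_encoder=dict(type='SparseEncoder', in_channels=5, sparse_shape=[41, 1440, 1440]),
)
```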

TransFusion

TransFusion introduces camera-LiDAR fusion for 3D object detection using a transformer-based architecture and a cross-modal attention mechanism. While the camera-LiDAR fusion variant is not particularly efficient, the LiDAR-only variant using the TransFusion head is an efficient model. TransFusion can use either spconv (sparse convolution) or pillar (Pillar Feature Net) as the backbone of its LiDAR encoder.

BEVFusion

BEVFusion introduces a robust and efficient framework for multi-sensor fusion in autonomous systems. By leveraging a unified Bird’s-Eye View (BEV) representation, the framework simplifies and improves the process of combining data from various sensors (LiDAR, cameras, radar) while supporting multiple tasks such as object detection, tracking, and segmentation in a single pipeline. BEVFusion uses spconv (sparse convolution) as the backbone of its LiDAR encoder and ResNet or Swin Transformer as the backbone of its image encoder.
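
The sketch below shows the fusion step in its simplest form: once camera features and LiDAR features are both in the same BEV grid, they can be concatenated along the channel dimension and fused with a small convolutional block. Channel sizes and the BEV resolution are illustrative, and the camera-to-BEV view transform that BEVFusion performs beforehand is omitted.

```python
import torch
import torch.nn as nn

# Illustrative BEV feature maps from the two branches (batch, channels, H, W).
camera_bev = torch.randn(1, 80, 180, 180)    # camera features already lifted to BEV
lidar_bev = torch.randn(1, 256, 180, 180)    # LiDAR (spconv) features flattened to BEV

# Simple channel-concatenation fusion followed by a conv block.
fuser = nn.Sequential(
    nn.Conv2d(80 + 256, 256, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
)
fused_bev = fuser(torch.cat([camera_bev, lidar_bev], dim=1))
# fused_bev then feeds the detection head (and, in BEVFusion, other task heads).
```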