Scepter914 Website

MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception (arxiv2022/11)

Summary

https://github.com/Megvii-BaseDetection/BEVDepth
BEVDepthの後継、軽量version
軽量なBEV-base Camera 3d detection
- CPUでも動作するレベルで軽量
- CPUでも数10msとかで動作、CUDAなら数ms
contribution: Lift-Splat paradigm を活かしながら計算量を削減
- Prime Extraction, Ring & Ray Decomposition による計算量の削減
- Feature Transportation Matrix (FTM) によるメモリ・計算量削減

Method

全体
- Backbone: extract image features
- DepthNet: a depth predictor is adopted to predict categorical depth distribution for each feature pixel to obtain the depth prediction.
- Prime Extraction module: obtain the Prime Feature and the Prime Depth, which is the compressed feature and depth
- BEV Feature: Use by Prime Feature, Prime Depth, and the pre-defined Ring & Ray Matrices

既存手法との違い

splat operation

横の方がcontextを含んでいる
- Prime Feature Extractor (PFE) についての説明

“Ring and Ray” Decomposition
- a polar coordinate の考えを導入
- the image feature required for a specific BEV grid can be located by direction and distance
- 計算量の削減
  - from: WI × Nd × HB × WB
  - to: (WI + Nd) × HB × WB
  - 30 to 50 times 少ない

Experiment

Nuscenes val set
- test setじゃないのが気になる
- 汎化性能が犠牲になっている可能性はある
- BEVDepthと同じ性能を出せている
  - Low resolution: Res-50 + BEV feature size 128 × 128, C= 80
  - High resolution: V2-99 + BEV feature size 256 × 256, Cは不明
- MF = Multi frame fusion

BEV segmentationでも高性能
- BEVFormerと同等の性能

軽量化
- 恐らく上記の結果はC=80でS3
- CPUでも数10msとかで動作する
- CUDAなら数ms

param

Discussion