
MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception (arxiv2022/11)

Summary

  • https://github.com/Megvii-BaseDetection/BEVDepth
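  • Successor to BEVDepth; a lightweight version
  • Lightweight BEV-based camera 3D detection
    • Light enough to run on a CPU
    • Runs in roughly tens of ms on a CPU and a few ms with CUDA
  • Contribution: reduces computation while keeping the Lift-Splat paradigm
    • Prime Extraction and Ring & Ray Decomposition reduce computation
    • The Feature Transporting Matrix (FTM) reduces memory and computation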

Method

  • Overview
    • Backbone: extract image features
    • DepthNet: a depth predictor estimates a categorical depth distribution for each feature pixel
    • Prime Extraction module: obtains the Prime Feature and the Prime Depth, which are the compressed feature and depth
    • BEV feature: computed from the Prime Feature, the Prime Depth, and the pre-defined Ring & Ray matrices
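
The last step of the overview can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the function name `bev_from_prime`, the nested-list layout, and the hand-written one-hot ring and ray matrices are all assumptions made for the example. Each BEV cell picks an image column through its ray row and a depth bin through its ring row, then takes that column's feature weighted by the depth probability.

```python
def bev_from_prime(prime_feat, prime_depth, ring, ray):
    """Combine compressed inputs into BEV features (illustrative sketch).

    prime_feat:  [W][C]   one feature vector per image column
    prime_depth: [W][Nd]  depth distribution per image column
    ring:        [G][Nd]  which depth bin(s) each BEV cell uses
    ray:         [G][W]   which image column(s) each BEV cell uses
    returns:     [G][C]   one feature vector per BEV cell
    """
    n_ch = len(prime_feat[0])
    bev = []
    for ring_row, ray_row in zip(ring, ray):
        cell = [0.0] * n_ch
        for w, ray_weight in enumerate(ray_row):
            if ray_weight == 0.0:
                continue  # this column does not point at the cell
            # Probability mass that column w places at the cell's distance.
            p = sum(r * d for r, d in zip(ring_row, prime_depth[w]))
            for c in range(n_ch):
                cell[c] += ray_weight * p * prime_feat[w][c]
        bev.append(cell)
    return bev


# Toy example (hypothetical values): 2 columns, 2 channels, 2 depth bins, 3 BEV cells.
prime_feat = [[1.0, 2.0], [3.0, 4.0]]
prime_depth = [[0.75, 0.25], [0.5, 0.5]]
ring = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
ray = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
bev = bev_from_prime(prime_feat, prime_depth, ring, ray)
print(bev)  # [[0.75, 1.5], [0.25, 0.5], [1.5, 2.0]]
```

In a real model the same computation would be written as batched matrix multiplications rather than Python loops; the loops are kept here only to make the indexing visible.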

  • Differences from existing methods

  • splat operation

  • The horizontal (width) direction carries more context than the vertical one
    • Explanation of the Prime Feature Extractor (PFE)
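
To make the shape change concrete, here is a minimal sketch of compressing along the height axis. These notes do not spell out the PFE internals, so the fixed max pooling (features) and mean pooling (depth) below are assumptions standing in for the learned module; the name `prime_extract` and the nested-list layout are also hypothetical.

```python
def prime_extract(feat, depth):
    """Collapse the height axis (stand-in for the learned PFE).

    feat:  [H][W][C]   image features
    depth: [H][W][Nd]  per-pixel categorical depth distribution
    returns (prime_feat [W][C], prime_depth [W][Nd])
    """
    h, w = len(feat), len(feat[0])
    n_ch, n_depth = len(feat[0][0]), len(depth[0][0])
    # Assumption: keep the strongest response in each column.
    prime_feat = [
        [max(feat[r][col][c] for r in range(h)) for c in range(n_ch)]
        for col in range(w)
    ]
    # Assumption: average the depth distributions in each column.
    prime_depth = [
        [sum(depth[r][col][d] for r in range(h)) / h for d in range(n_depth)]
        for col in range(w)
    ]
    return prime_feat, prime_depth


# Toy input (hypothetical values): H=2, W=2, C=1, Nd=2.
feat = [[[1.0], [5.0]], [[3.0], [2.0]]]
depth = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
prime_feat, prime_depth = prime_extract(feat, depth)
print(prime_feat)   # [[3.0], [5.0]]
print(prime_depth)  # [[0.5, 0.5], [0.5, 0.5]]
```

The point of the sketch is only that the H axis disappears: everything downstream works on one feature vector and one depth distribution per image column.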

  • “Ring and Ray” Decomposition
    • Introduces the idea of polar coordinates
    • the image feature required for a specific BEV grid can be located by direction and distance
    • Reduction in computation
      • from: W_I × N_d × H_B × W_B
      • to: (W_I + N_d) × H_B × W_B
      • 30 to 50 times less
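
A quick arithmetic check of the two expressions above. The concrete sizes are illustrative assumptions, not values taken from these notes: 44 or 88 feature columns, 112 depth bins, and the two BEV grid sizes mentioned in the experiments. The H_B × W_B factor cancels, so the ratio depends only on W_I and N_d.

```python
def dense_cost(w_i, n_d, h_b, w_b):
    """Cost before decomposition: W_I x N_d x H_B x W_B."""
    return w_i * n_d * h_b * w_b


def ring_ray_cost(w_i, n_d, h_b, w_b):
    """Cost after Ring & Ray Decomposition: (W_I + N_d) x H_B x W_B."""
    return (w_i + n_d) * h_b * w_b


def reduction(w_i, n_d, h_b, w_b):
    """How many times smaller the decomposed form is."""
    return dense_cost(w_i, n_d, h_b, w_b) / ring_ray_cost(w_i, n_d, h_b, w_b)


# Hypothetical configurations (w_i, n_d, h_b, w_b); only the BEV sizes come from the notes.
low_res = reduction(44, 112, 128, 128)
high_res = reduction(88, 112, 256, 256)
print(round(low_res, 1))   # 31.6
print(round(high_res, 1))  # 49.3
```

Under these assumed sizes the ratio lands in the 30 to 50 times range quoted above.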

Experiment

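  • nuScenes val set
    • It is a concern that the results are not on the test set
    • Generalization performance may have been sacrificed
    • Achieves the same performance as BEVDepth
      • Low resolution: ResNet-50 + BEV feature size 128 × 128, C = 80
      • High resolution: V2-99 + BEV feature size 256 × 256, C not stated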
    • MF = Multi frame fusion

  • Also performs well on BEV segmentation
    • Performance comparable to BEVFormer

  • Efficiency
    • The results above are probably for C = 80 with S3
    • Runs in roughly tens of ms even on a CPU
    • A few ms with CUDA

  • param

Discussion