Summary

Background

  • 3D processing
    • Keeping the data in 3D: high computational cost
    • BEV: loses too much information, so small objects cannot be detected
  • Related works
    • BEV 3D detection: PIXOR
    • Voxel-based 3D detection: VoxelNet, SECOND
    • BEV LV fusion detection: MV3D, AVOD
    • 3D semantic segmentation: PointNet, STD
    • “End-to-end multi-view fusion for 3d object detection in lidar point clouds”
      • Related multi-view work
      • Separate branches run in parallel, which makes debugging complicated
      • “In contrast, our approach uses explicit features and representations for perspective and topdown view, which makes the system easy to train and debug; our network performs multi-class object detection and is an order of magnitude faster”
    • Range-image-based semantic segmentation: RangeNet++

Method

  • Multi-class, so no per-class training is needed
  • 7 classes: cars, trucks, pedestrians, cyclists, road surface, sidewalks, and unknown
    • “We experimented with more or fewer classes but found the best results were obtained with this choice”

Segmentation

  • input: Lidar range image (3, 64, 2048)
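    • (point presence, depth, intensity)?
    • The paper does not seem to mention a point-presence channel?
      • Judging from the figures, missing values do exist
      • Considering [[stdcv_00032_baidu_seg]] and similar work, that seems likely?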
  • output: 7 class semantic image (7, 64, 2048)
    • Drivable space is also output

BEV detection

  • input:
    • BEV semantic image (7, 1024, 1024)
    • BEV lidar feature (min height, max height, mean intensity) (3, 1024, 1024)
  • network output
    • class head, 3 classes (3, 256, 256)
    • bbox head (6, 256, 256)
      • (δx, δy) centroid, (wo, lo) object size, (sin θ, cos θ) yaw
  • final output
    • class, (δx, δy) centroid, (wo, lo) object size, (sin θ, cos θ) yaw
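
The final output listed above could be read off a single bbox-head cell roughly as follows. This is a minimal sketch, assuming (δx, δy) is an offset in metres from the centre of the output cell, that the 256 × 256 grid covers the 80 m × 80 m area noted under Experiment, and that the grid is centred on the sensor. The function name `decode_cell` and these conventions are illustrative assumptions, not taken from the paper.

```python
import math


def decode_cell(row, col, dx, dy, w, l, sin_t, cos_t,
                cell_size=80.0 / 256, grid_extent=80.0):
    """Turn one bbox-head cell into a metric box (x, y, w, l, yaw).

    Assumed conventions (not from the paper): the grid is centred on the
    sensor, and (dx, dy) is the offset from the cell centre in metres.
    """
    # Centre of the output cell in metric BEV coordinates.
    cx = (col + 0.5) * cell_size - grid_extent / 2.0
    cy = (row + 0.5) * cell_size - grid_extent / 2.0
    # atan2 recovers yaw from (sin, cos); regressing the pair avoids the
    # wrap-around discontinuity of regressing a raw angle.
    yaw = math.atan2(sin_t, cos_t)
    return cx + dx, cy + dy, w, l, yaw


box = decode_cell(128, 128, 0.0, 0.0, 1.8, 4.5, 1.0, 0.0)
```

Encoding yaw as (sin θ, cos θ) keeps the regression target continuous; the angle itself is only recovered at decode time.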

Projection

  • What follows is my own speculation
  • Input for the BEV semantic image: lidar features from the raw data are fed in
    • “Using class probabilities (rather than the most likely class) enables the network to perform more complex reasoning about the data (e.g., a person on a bicycle); we experimented with both and found this approach to yield better results”

    • "probabilities" here presumably means something like per-class scores
      • If several points fall into one cell, is it the mean of their scores?
  • Guessed processing steps

    • Lidar range image (10, 64, 2048): point presence, depth, intensity, 7 class scores
    • for each point
      • (u, v) coordinates -> compute x, y, z
      • BEV point_num += 1
      • add the scores to the BEV semantic image, and add the intensity
      • depth: min? max? mean?
      • update max height and min height
    • intensity / point_num, score / point_num
  • Resulting outputs

    • BEV semantic image (7, 1024, 1024)
    • BEV lidar feature (min height, max height, mean intensity) (3, 1024, 1024)
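
The guessed processing steps above can be sketched as a simple accumulation loop. This is a sketch of the note's own speculation, not the paper's confirmed method: it assumes the (u, v) -> (x, y, z) conversion has already been done, that the BEV window is centred on the sensor, and it stores only occupied cells in a dictionary instead of dense (7, 1024, 1024) and (3, 1024, 1024) arrays. The name `project_to_bev` and the output keys are hypothetical.

```python
def project_to_bev(points, grid=1024, extent=80.0, num_classes=7):
    """Accumulate per-point features into a sparse BEV grid.

    points: iterable of (x, y, z, intensity, scores), where scores is a
    sequence of num_classes class probabilities from the segmentation stage.
    Returns {(row, col): {"mean_scores", "min_height", "max_height",
    "mean_intensity"}} for occupied cells only.
    """
    cell = extent / grid
    half = extent / 2.0
    acc = {}
    for x, y, z, intensity, scores in points:
        if not (-half <= x < half and -half <= y < half):
            continue  # point lies outside the BEV window
        col = min(grid - 1, int((x + half) / cell))
        row = min(grid - 1, int((y + half) / cell))
        a = acc.setdefault((row, col), {
            "n": 0, "scores": [0.0] * num_classes,
            "min_z": z, "max_z": z, "intensity": 0.0})
        a["n"] += 1                                   # BEV point_num += 1
        a["scores"] = [s + p for s, p in zip(a["scores"], scores)]
        a["intensity"] += intensity
        a["min_z"] = min(a["min_z"], z)               # min height update
        a["max_z"] = max(a["max_z"], z)               # max height update

    # Divide the running sums by the point count to get per-cell means.
    return {
        key: {
            "mean_scores": [s / a["n"] for s in a["scores"]],
            "min_height": a["min_z"],
            "max_height": a["max_z"],
            "mean_intensity": a["intensity"] / a["n"],
        }
        for key, a in acc.items()
    }
```

Averaging the class scores per cell is one possible answer to the open question in the notes; taking a maximum or a vote would be equally plausible choices.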

Network

  • Loss function
    • focal loss for the classification head and L1 loss for the regression head
  • Network
    • Each input is turned into a (32, 512, 512) feature map, and the two are concatenated
    • After that, the network splits into class and bbox heads
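
For reference, the two losses named above can be written per element as below. This is a minimal sketch of the standard binary focal loss and a mean L1 loss; the values `alpha=0.25` and `gamma=2.0` are the common defaults from the focal loss paper and are an assumption here, since the notes do not record which values this network uses.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if target == 1 else 1.0 - p      # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)^gamma down-weights examples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def l1_loss(pred, target):
    """Mean absolute error over the regression targets of one cell."""
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)


easy = focal_loss(0.9, 1)   # confident and correct: small loss
hard = focal_loss(0.6, 1)   # less confident: larger loss
```

The down-weighting matters for BEV detection because most cells are empty background, which would otherwise dominate the classification loss.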

Clustering

  • DBSCAN algorithm
    • M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in International Conference on Knowledge Discovery and Data Mining (KDD), 1996.
  • Applied to cells whose class score exceeds a threshold
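
The thresholding-then-clustering step can be illustrated with a small pure-Python DBSCAN. This is a sketch only: the function names, the `(x, y, score)` cell format, and the values of `score_threshold`, `eps`, and `min_samples` are hypothetical, and the notes do not record which features or parameters the paper actually clusters on.

```python
def dbscan(points, eps, min_samples):
    """Minimal O(n^2) DBSCAN on 2D points; returns a label per point (-1 = noise)."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)  # None = not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)       # includes the point itself
        if len(seeds) < min_samples:
            labels[i] = -1         # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reachable from a core point
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_samples:
                queue.extend(nj)       # j is a core point, keep growing
    return labels


def cluster_detections(cells, score_threshold=0.5, eps=0.5, min_samples=2):
    """cells: list of (x, y, score). Keep confident cells, then cluster them."""
    kept = [(x, y) for x, y, s in cells if s > score_threshold]
    return kept, dbscan(kept, eps, min_samples)
```

Each resulting cluster would then correspond to one object candidate, with isolated confident cells treated as noise.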

Experiment

  • The figures make it very easy to show others at a glance what the 3D detection is doing
    • This kind of consideration for the reader is another sign of a good paper

  • The two stages could be trained end to end, but no dataset supports this, so they are trained separately
    • segmentation dataset: SemanticKITTI
    • object detection dataset: KITTI
  • Results are reported on both SemanticKITTI and an in-house dataset
  • Parameters
    • BEV size: 80 m × 80 m
    • cell size (1024 grid): 7.8 cm × 7.8 cm -> output (256 grid): 31.3 cm
  • Uses TensorRT
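
The cell sizes listed under the parameters follow directly from dividing the 80 m extent by the grid resolution. A small check, where the helper name `cell_size_cm` is only illustrative:

```python
def cell_size_cm(extent_m, cells):
    """Side length of one grid cell in centimetres."""
    return extent_m / cells * 100.0


print(cell_size_cm(80.0, 1024))  # 7.8125 (input grid)
print(cell_size_cm(80.0, 256))   # 31.25 (output grid)
```

The output grid is 4× coarser than the input grid per side, so each output cell covers a 4 × 4 block of input cells.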

Discussion

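  • A very sensible framework
    • With range image + BEV, everything can be reasoned about in 2D, so debugging is very easy
    • Range-image-based and BEV-based approaches ...
  • Among two-stage methods it is highly extensible, so it should be easy to use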