Summary
- Link
- Two-stage LiDAR 3D multi-class detection framework
- "multi-view" = "perspective view" + BEV
- "perspective view"-based semantic segmentation = semantic segmentation on the range image
- The semantic segmentation also produces drivable space
- BEV-based detection and classification of objects
- A framework that performs 3D detection as two 2D stages
- Detects objects with a single lidar even in scenes containing 100+ vehicles and pedestrians
- Runs at 150 fps on an embedded GPU
- Highly extensible architecture that runs fast; very instructive
Background
- 3D processing approaches
- Staying fully in 3D: high computational cost
- BEV only: too much information is lost, so small objects cannot be detected
- Related works
- BEV 3D detection: PIXOR
- Voxel-based 3D detection: VoxelNet, SECOND
- BEV LV fusion detection: MV3D, AVOD
- 3D semantic segmentation: PointNet, STD
- "End-to-end multi-view fusion for 3d object detection in lidar point clouds"
- Related multi-view work
- Separate branches run side by side, which makes debugging complicated
- "In contrast, our approach uses explicit features and representations for perspective and topdown view, which makes the system easy to train and debug; our network performs multi-class object detection and is an order of magnitude faster"
- Range-image-based semantic segmentation: RangeNet++
Method
- Since it is multi-class, separate per-class training is unnecessary
- 7 classes: cars, trucks, pedestrians, cyclists, road surface, sidewalks, and unknown
- “We experimented with more or fewer classes but found the best results were obtained with this choice”
Segmentation
- input: Lidar range image (3, 64, 2048)
- (point validity, depth, intensity)?
- There is no explicit mention of a validity channel?
- Judging from the figures, missing values do exist
- Considering notes like [[stdcv_00032_baidu_seg]], a validity channel seems plausible
- output: 7-class semantic image (7, 64, 2048)
- drivable space is also part of the output
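The paper does not spell out how the range image is built, so here is a minimal numpy sketch of a spherical projection, assuming the (validity, depth, intensity) channel layout speculated above and an HDL-64E-like vertical FOV; `build_range_image` and all parameter values are my own naming, not the paper's.

```python
import numpy as np

def build_range_image(points, intensity, h=64, w=2048,
                      fov_up=3.0, fov_down=-25.0):
    """Project an (N, 3) point cloud to a (3, H, W) range image.

    Channels: [validity, depth, intensity]. The channel layout and the
    vertical FOV (typical HDL-64E values) are assumptions; the paper
    does not specify them.
    """
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    fov = fov_up - fov_down

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                           # azimuth -> column
    pitch = np.arcsin(z / np.maximum(depth, 1e-6))   # elevation -> row

    u = ((0.5 * (1.0 - yaw / np.pi)) * w).astype(np.int64) % w
    v = np.clip((1.0 - (pitch - fov_down) / fov) * h, 0, h - 1).astype(np.int64)

    img = np.zeros((3, h, w), dtype=np.float32)
    order = np.argsort(-depth)          # write far points first so the
    img[0, v[order], u[order]] = 1.0    # closest return wins per pixel
    img[1, v[order], u[order]] = depth[order]
    img[2, v[order], u[order]] = intensity[order]
    return img
```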
BEV detection
- input:
- BEV semantic image (7, 1024, 1024)
- BEV lidar feature (min height, max height, mean intensity) (3, 1024, 1024)
- network output
- class head: 3 classes (3, 256, 256)
- bbox head: (6, 256, 256)
- (δx, δy) centroid offset, (w_o, l_o) object size, (sin θ, cos θ) yaw
- final output
- class, (δx, δy) centroid offset, (w_o, l_o) object size, (sin θ, cos θ) yaw
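A decoding sketch for the two heads under the shapes listed above. Whether (δx, δy) are in cell units or metres is my assumption, and `decode_bev_heads` is a hypothetical helper, not the paper's code; 0.3125 m is the 80 m / 256 output cell size from the params below.

```python
import numpy as np

def decode_bev_heads(cls_map, box_map, cell_m=0.3125, score_thr=0.5):
    """Decode (3, 256, 256) class scores and a (6, 256, 256) box map
    into per-cell box candidates (before clustering).

    Box channels are (dx, dy, w, l, sin_t, cos_t) as listed above; the
    offset convention (cell units) is my guess.
    """
    scores = cls_map.max(axis=0)
    labels = cls_map.argmax(axis=0)
    ys, xs = np.where(scores > score_thr)

    dx, dy, w, l, s, c = box_map[:, ys, xs]
    cx = (xs + dx) * cell_m    # centroid in metres, assuming offsets
    cy = (ys + dy) * cell_m    # are expressed in cell units
    yaw = np.arctan2(s, c)
    boxes = np.stack([cx, cy, w, l, yaw], axis=1)
    return boxes, labels[ys, xs], scores[ys, xs]
```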
Projection
- The following is my own speculation
- Input to the BEV stage: raw lidar features are fed in alongside the semantic image
- "Using class probabilities (rather than the most likely class) enables the network to perform more complex reasoning about the data (e.g., a person on a bicycle); we experimented with both and found this approach to yield better results"
- Are the "probabilities" essentially the same thing as class scores?
- If multiple points land in a cell, is it the mean of their scores?
Predicted processing (a code sketch follows the resulting tensors below)
- Lidar range image (10, 64, 2048): point validity, depth, intensity, 7 class scores
- for each point:
- compute x, y, z from the (u, v) image coordinates
- BEV point_num += 1
- add the class scores and the intensity to the BEV cell
- is depth aggregated by min, max, or average?
- update max height and min height
- finally divide: intensity / point_num, score / point_num
Resulting tensors
- BEV semantic image (7, 1024, 1024)
- BEV lidar feature (min height, max height, mean intensity) (3, 1024, 1024)
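Putting the predicted processing above into code: a numpy sketch that scatters points into the two BEV tensors, averaging class scores and intensity per cell and keeping min/max height. This follows my reading of the projection step, not released code.

```python
import numpy as np

def project_to_bev(xyz, intensity, class_scores, grid=1024, extent=80.0):
    """Scatter lidar points into a (7, G, G) mean-score image and a
    (3, G, G) feature image (min height, max height, mean intensity).
    xyz: (N, 3), intensity: (N,), class_scores: (N, 7).
    """
    half, cell = extent / 2.0, extent / grid
    ix = ((xyz[:, 0] + half) / cell).astype(np.int64)
    iy = ((xyz[:, 1] + half) / cell).astype(np.int64)
    keep = (ix >= 0) & (ix < grid) & (iy >= 0) & (iy < grid)
    ix, iy, z = ix[keep], iy[keep], xyz[keep, 2]
    flat = iy * grid + ix                       # 1-D cell index

    counts = np.bincount(flat, minlength=grid * grid)
    counts = np.maximum(counts, 1)              # avoid divide-by-zero

    sem = np.zeros((7, grid * grid), dtype=np.float32)
    for c in range(7):                          # mean score per cell
        sem[c] = np.bincount(flat, weights=class_scores[keep, c],
                             minlength=grid * grid) / counts

    zmin = np.full(grid * grid, np.inf, dtype=np.float32)
    zmax = np.full(grid * grid, -np.inf, dtype=np.float32)
    np.minimum.at(zmin, flat, z)                # min height
    np.maximum.at(zmax, flat, z)                # max height
    zmin[np.isinf(zmin)] = 0.0                  # empty cells -> 0
    zmax[np.isinf(zmax)] = 0.0
    imean = np.bincount(flat, weights=intensity[keep],
                        minlength=grid * grid) / counts

    feat = np.stack([zmin, zmax, imean]).astype(np.float32)
    return sem.reshape(7, grid, grid), feat.reshape(3, grid, grid)
```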
Network
- Loss function
- focal loss for the classification head and L1 loss for the regression head (sketched below)
- Network
- Each input is encoded into a (32, 512, 512) feature map, and the two maps are concatenated
- The network then splits into class and bbox branches (see the sketches below)
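The notes only pin down the loss types and the tensor shapes. First, a minimal PyTorch sketch of the stated loss combination, assuming one-hot class targets and L1 applied only to object cells (a common convention, not confirmed here); alpha/gamma are the usual RetinaNet defaults.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_target, box_pred, box_target, pos_mask,
                   alpha=0.25, gamma=2.0):
    """Focal loss (class head) + L1 (regression head).

    Assumed shapes: cls_logits/cls_target (B, 3, H, W) with one-hot
    targets, box_pred/box_target (B, 6, H, W), pos_mask (B, H, W) bool
    marking cells that contain an object.
    """
    ce = F.binary_cross_entropy_with_logits(cls_logits, cls_target,
                                            reduction="none")
    p = torch.sigmoid(cls_logits)
    p_t = p * cls_target + (1 - p) * (1 - cls_target)
    a_t = alpha * cls_target + (1 - alpha) * (1 - cls_target)
    num_pos = pos_mask.sum().clamp(min=1)
    focal = (a_t * (1 - p_t) ** gamma * ce).sum() / num_pos

    # L1 on the 6 box channels, only where an object is present
    l1 = F.l1_loss(box_pred.permute(0, 2, 3, 1)[pos_mask],
                   box_target.permute(0, 2, 3, 1)[pos_mask])
    return focal + l1
```

And a structural sketch of the two-branch BEV network; only the input/output shapes follow the notes, while layer counts and widths are placeholders of my own.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class BevDetector(nn.Module):
    """Two encoders bring each input to (32, 512, 512); the maps are
    concatenated and a trunk feeds class/box heads at (*, 256, 256)."""
    def __init__(self):
        super().__init__()
        self.sem_enc = conv_block(7, 32, stride=2)   # (7,1024,1024) -> (32,512,512)
        self.lid_enc = conv_block(3, 32, stride=2)   # (3,1024,1024) -> (32,512,512)
        self.trunk = nn.Sequential(conv_block(64, 64, stride=2),  # -> (64,256,256)
                                   conv_block(64, 64))
        self.cls_head = nn.Conv2d(64, 3, 1)          # (3,256,256)
        self.box_head = nn.Conv2d(64, 6, 1)          # (6,256,256)

    def forward(self, sem, lidar):
        x = torch.cat([self.sem_enc(sem), self.lid_enc(lidar)], dim=1)
        x = self.trunk(x)
        return self.cls_head(x), self.box_head(x)
```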
Clustering
- DBSCAN algorithm
- “M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in International Conference on Knowledge Discovery and Data Mining (KDD), 1996.”
- Run on cells whose class score exceeds a threshold
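A sketch of this step with scikit-learn's DBSCAN, clustering above-threshold cells per class; `eps`/`min_samples` and the per-class split are my guesses, since the notes only say DBSCAN runs on thresholded cells.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_detections(scores, labels, score_thr=0.5,
                       eps=3.0, min_samples=3):
    """Group above-threshold BEV cells into object instances.

    scores/labels: (H, W) max class score and argmax class per cell.
    Returns a list of (class_id, member_cells) pairs.
    """
    instances = []
    for cls in np.unique(labels):
        ys, xs = np.where((labels == cls) & (scores > score_thr))
        if len(xs) == 0:
            continue
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        ids = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
        for k in set(ids) - {-1}:               # -1 = DBSCAN noise
            instances.append((cls, pts[ids == k]))
    return instances
```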
Experiment
- The figures make it very easy to show others at a glance what the 3D detection pipeline is doing
- This kind of attention to the reader is also what marks a good paper
- The two stages could be trained end to end, but no dataset covers both tasks, so they are trained separately
- segmentation dataset: semantic kitti
- object detection dataset: kitti
- Results are shown on both SemanticKITTI and the authors' own in-house dataset
- params
- BEV size: 80 m × 80 m
- cell size (1024 grid): 7.8 cm × 7.8 cm -> output (256 grid): 31.3 cm
- Uses TensorRT
Discussion
- A framework that makes a great deal of sense
- With range image + BEV, everything can be reasoned about in 2D, which makes debugging much easier
- The range-image-based and BEV-based stages can presumably each be swapped out independently
- Among 2-stage approaches it is highly extensible, so it should be easy to adopt