Summary
- https://arxiv.org/pdf/2205.09743.pdf
- https://github.com/zhangyp15/BEVerse
- Multi-Camera BEV perception
- Multi-task learning of 3D detection, motion prediction, and semantic map construction
- The semantic map covers map elements only; it does not include objects
- Motion prediction is expressed as segmentation
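- Multi-frame, transformer-based
- After the encoder-decoder, a transformer converts the image features into BEV features
- BEV features × n frames
- Detection only runs at 12.6 fps on an RTX 3090 GPU, which is surprisingly close to usable
- The multi-task model only reaches 4.4 fps, though, which is marginal
- Motion prediction is the heavy part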
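To make "motion prediction expressed as segmentation" concrete, here is a minimal sketch of the output representation only: one BEV grid of instance IDs per future frame. It is not the BEVerse model or its code. The function `rasterize_future`, the 8×8 grid, the 1 m cell size, and the toy tracks are all hypothetical, and each instance is reduced to a single centre cell instead of a box footprint.

```python
from typing import Dict, List, Tuple

Grid = List[List[int]]  # grid[row][col] holds an instance ID, 0 = background


def rasterize_future(
    tracks: Dict[int, List[Tuple[float, float]]],
    grid_size: int = 8,    # assumed toy grid, not the paper's resolution
    cell_m: float = 1.0,   # assumed metres per cell
) -> List[Grid]:
    """Convert per-instance future (x, y) centres into one ID grid per future frame."""
    n_frames = max((len(p) for p in tracks.values()), default=0)
    frames = [
        [[0] * grid_size for _ in range(grid_size)] for _ in range(n_frames)
    ]
    for inst_id, positions in tracks.items():
        for t, (x, y) in enumerate(positions):
            col, row = int(x // cell_m), int(y // cell_m)
            # Skip positions that fall outside the BEV grid.
            if 0 <= row < grid_size and 0 <= col < grid_size:
                frames[t][row][col] = inst_id
    return frames


# Two toy instances: ID 1 moves along +x, ID 2 moves along -y.
tracks = {
    1: [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],
    2: [(5.0, 6.0), (5.0, 5.0), (5.0, 4.0)],
}
future = rasterize_future(tracks)
for t, frame in enumerate(future):
    occupied = [(r, c, v) for r, row in enumerate(frame) for c, v in enumerate(row) if v]
    print(f"t+{t + 1}: {occupied}")
```

Reading the same ID across consecutive grids recovers each instance's trajectory, which is why a segmentation output can stand in for explicit motion forecasts.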
data:image/s3,"s3://crabby-images/85c65/85c659a216e782043889bf4d3a9b24a037ac5878" alt=""
Methods
data:image/s3,"s3://crabby-images/d5575/d5575a6c6102b76a947290877fff6b1b3674f2e3" alt=""
data:image/s3,"s3://crabby-images/0670a/0670ad5a9ad84d0f46fa152ce1264a18799c13b7" alt=""
Experiment
data:image/s3,"s3://crabby-images/3462b/3462b18520640ef9265abfa84df46202ba7d3bd1" alt=""
data:image/s3,"s3://crabby-images/739c0/739c04cee47a60eefdcb48fc157f5ae821ef1ddc" alt=""
- Shared model
- BEVerse-Det: 55.8M params, 12.6 fps (about 80 ms per frame)
- With all tasks enabled: 102.5M params, 4.4 fps, which is quite heavy
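The fps figures above are easier to compare as per-frame latency. The short sketch below does that conversion and computes the extra cost of the full multi-task model over the detection-only one. The helper `latency_ms` and the dictionary layout are my own; only the parameter counts and fps values come from the notes above.

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given throughput in frames per second."""
    return 1000.0 / fps


# Numbers taken from the bullets above.
det = {"params_m": 55.8, "fps": 12.6}    # BEVerse-Det (detection only)
full = {"params_m": 102.5, "fps": 4.4}   # all tasks enabled

det_ms = latency_ms(det["fps"])
full_ms = latency_ms(full["fps"])
extra_ms = full_ms - det_ms
extra_params_m = full["params_m"] - det["params_m"]

print(f"detection only: {det_ms:.1f} ms/frame")
print(f"all tasks:      {full_ms:.1f} ms/frame")
print(f"added by map + motion heads: {extra_ms:.1f} ms, {extra_params_m:.1f}M params")
```

The difference lumps the semantic map and motion heads together; the notes attribute most of it to motion prediction, but these two numbers alone cannot separate the two.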
data:image/s3,"s3://crabby-images/94367/943677df883109b492a167b385baf89d9e83895f" alt=""
data:image/s3,"s3://crabby-images/036bb/036bb91acf3ef20d8d6bf531f01923e154fb2afe" alt=""
data:image/s3,"s3://crabby-images/04a8d/04a8dde380c3405741703d14ba1631ec95eb0840" alt=""
- Future instance segmentation
data:image/s3,"s3://crabby-images/c3590/c35903421c014343bc7eba75642544c491c9143e" alt=""