Scepter914 Website

Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data (arxiv 2024/01)

Summary

https://github.com/LiheYoung/Depth-Anything
Links
- https://huggingface.co/spaces/LiheYoung/Depth-Anything/tree/main (model)
- https://github.com/spacewalk01/depth-anything-tensorrt (TensorRT implementation)
from TikTok
General-purpose zero-shot monocular relative depth estimation
- Gains stronger generalization by combining a foundation model with unsupervised learning techniques

Background

multi-dataset joint training
MiDaS: relative depth information
ZoeDepth is reportedly strong as well

Method

Data
- 1.5M labeled images from 6 public datasets
- 62 M unlabeled images
- NYUv2 and KITTI were left out of training so they can be used for zero-shot evaluation
- Movies and WSVD were also left out (noted as low quality)

Learning Labeled Images
- A reproduction of the MiDaS training setup
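MiDaS-style training on mixed datasets is usually described with a scale-and-shift-invariant loss, so that depth labels with different scales can be trained together. The note above does not spell out the loss, so the following is only a minimal sketch under that assumption: each depth map is shifted by its median and scaled by its mean absolute deviation before an L1 comparison. The function names `normalize_depth` and `affine_invariant_loss` are hypothetical.

```python
import numpy as np

def normalize_depth(d, mask):
    """Shift by the median and scale by the mean absolute deviation of valid pixels."""
    valid = d[mask]
    t = np.median(valid)                 # shift term
    s = np.mean(np.abs(valid - t))       # scale term
    return (d - t) / max(s, 1e-6)        # guard against a constant map

def affine_invariant_loss(pred, target, mask=None):
    """Mean L1 distance between normalized prediction and normalized target."""
    if mask is None:
        mask = np.ones_like(target, dtype=bool)
    pred_n = normalize_depth(pred, mask)
    target_n = normalize_depth(target, mask)
    return float(np.mean(np.abs(pred_n - target_n)[mask]))
```

With this normalization, a prediction that differs from the target only by a positive scale and a shift gives a loss of (numerically) zero, which is the property relative depth training relies on.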
Unlabeled images
- perturbations
  - Mainly two kinds; simple but effective
  - color distortion: color jittering + Gaussian blurring
  - spatial distortion: CutMix
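    - CutMix is applied with a probability of 50%
- Once the model has been trained on labeled images, it is hard to gain anything further from unlabeled ones
  - This tendency is stronger when the teacher and student start from the same pre-trained weights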
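The spatial distortion above can be sketched as follows. This is an illustrative NumPy version, not the authors' code: with probability `p` (0.5, matching the note) a random rectangle from a second unlabeled image is pasted into the first, and the binary mask is returned so that pseudo labels from the teacher can be mixed the same way. The function name `cutmix_pair` and the uniform sampling of the rectangle are assumptions.

```python
import numpy as np

def cutmix_pair(img_a, img_b, rng, p=0.5):
    """Paste a random rectangle of img_b into img_a with probability p.

    Images are (H, W, C) arrays of equal shape.
    Returns (mixed, mask); mask is True where pixels come from img_b.
    """
    h, w = img_a.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    if rng.random() < p:
        rh = rng.integers(1, h + 1)          # rectangle height in [1, h]
        rw = rng.integers(1, w + 1)          # rectangle width in [1, w]
        top = rng.integers(0, h - rh + 1)    # top-left corner keeps the box inside
        left = rng.integers(0, w - rw + 1)
        mask[top:top + rh, left:left + rw] = True
    mixed = np.where(mask[..., None], img_b, img_a)
    return mixed, mask
```

The student would see `mixed`, while the target could be assembled from the teacher's predictions on the two clean images with the same mask, for example `np.where(mask, label_b, label_a)`. Color jittering and Gaussian blurring would be applied to the student input in a similar spirit and are omitted here.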
Semantic-Assisted Depth estimation
- Uses the RAM [85] + GroundingDINO [37] + HQ-SAM [26] models

Experiment

Zero-Shot Relative Depth Estimation

Method	Params	KITTI		NYUv2		Sintel		DDAD		ETH3D		DIODE
		AbsRel	$\delta_1$	AbsRel	$\delta_1$	AbsRel	$\delta_1$	AbsRel	$\delta_1$	AbsRel	$\delta_1$	AbsRel	$\delta_1$
MiDaS	345.0M	0.127	0.850	0.048	0.980	0.587	0.699	0.251	0.766	0.139	0.867	0.075	0.942
Ours-S	24.8M	0.080	0.936	0.053	0.972	0.464	0.739	0.247	0.768	0.127	0.885	0.076	0.939
Ours-B	97.5M	0.080	0.939	0.046	0.979	0.432	0.756	0.232	0.786	0.126	0.884	0.069	0.946
Ours-L	335.3M	0.076	0.947	0.043	0.981	0.458	0.760	0.230	0.789	0.127	0.882	0.066	0.952
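For reading the table: AbsRel is the mean absolute relative error (lower is better) and $\delta_1$ is the fraction of pixels whose prediction is within a factor of 1.25 of the ground truth (higher is better). A minimal sketch of both metrics, assuming `pred` and `gt` are positive arrays that contain only valid pixels; for relative depth the prediction is normally aligned to the ground truth in scale and shift first, which is omitted here.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over all pixels."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))
```

For example, with `gt = [1, 2, 4]` and `pred = [1.1, 2.0, 8.0]`, the per-pixel relative errors are 0.1, 0 and 1, so AbsRel is about 0.367, and two of the three ratios (1.1, 1.0, 2.0) fall below 1.25, so $\delta_1$ is about 0.667.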

Fine-tuning Absolute depth estimation

Semantic segmentation for Cityscapes

Semantic segmentation for ADE20K

Inference time of pre-trained models (ms)

Model	Params	V100	A100	RTX4090 (TensorRT)
Depth-Anything-Small	24.8M	12	8	3
Depth-Anything-Base	97.5M	13	9	6
Depth-Anything-Large	335.3M	20	13	12
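The numbers above are taken as reported. To check latency on other hardware, a small timing harness along these lines could be used. It is a generic sketch, not the procedure behind the table: `median_latency_ms` is a hypothetical helper, and `fn` stands for any zero-argument callable that runs one forward pass.

```python
import statistics
import time

def median_latency_ms(fn, warmup=3, runs=10):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()                      # warm-up: caches, lazy initialization
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

On a GPU, `fn` would also need to wait for the device to finish (in PyTorch, by calling `torch.cuda.synchronize()` at the end of the call); otherwise only the kernel launch time is measured.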

Discussion