Zhenyao Wu

Date of Award

Summer 2022

Document Type

Open Access Dissertation


Computer Science and Engineering

First Advisor

Song Wang

Second Advisor

Lili Ju


Estimating depth from images is a widely studied task in computer vision that aims to recover the 3D structure of a scene from 2D images and to identify its important geometric properties. In recent years, convolutional neural networks have significantly improved performance on this task, surpassing traditional methods by a large margin. However, natural scenes are usually complicated, and it is hard to establish correspondence between pixels across frames in regions affected by moving objects, illumination changes, occlusions, or reflections. This research explores rich and comprehensive spatial correspondence across images and designs three new network architectures for depth estimation, whose inputs can be a single image, a stereo pair, or a monocular video.

First, we propose a novel semantic stereo network named SSPCV-Net, which includes newly designed pyramid cost volumes for describing semantic and spatial correspondence on multiple levels. The semantic features are inferred from a semantic segmentation subnetwork, while the spatial features are constructed by hierarchical spatial pooling. We then design a 3D multi-cost aggregation module that integrates the extracted multilevel correspondence and regresses accurate disparity maps. We conduct comprehensive experiments and comparisons with recent stereo matching networks on the Scene Flow, KITTI 2012 and 2015, and Cityscapes benchmark datasets, and the results show that the proposed SSPCV-Net significantly advances the state of the art in stereo matching.
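As a rough illustration of the two generic ingredients named above, a cost volume and disparity regression, the following NumPy sketch builds a single-level cost volume from left and right feature maps and regresses disparity as a softmax-weighted expectation. It is a minimal sketch under stated assumptions, not the SSPCV-Net implementation: the function names, the squared-difference matching score, and the `temperature` parameter are hypothetical, and the pyramid levels, semantic branch, and 3D aggregation module are omitted.

```python
import numpy as np


def build_cost_volume(left_feat, right_feat, max_disp):
    """Build a single-level matching volume of shape (max_disp, H, W).

    left_feat, right_feat: feature maps of shape (C, H, W).
    Entry [d, y, x] is the negative mean squared difference between the
    left feature at (y, x) and the right feature at (y, x - d), so a
    higher value means a better match. (Assumed score, for illustration.)
    """
    _, h, w = left_feat.shape
    # A very low score marks positions with no valid match (x - d < 0).
    volume = np.full((max_disp, h, w), -1e9)
    for d in range(max_disp):
        diff = left_feat[:, :, d:] - right_feat[:, :, : w - d]
        volume[d, :, d:] = -np.mean(diff ** 2, axis=0)
    return volume


def regress_disparity(volume, temperature=1.0):
    """Softmax over the disparity axis, then the expected disparity."""
    scores = volume / temperature
    # Subtract the per-pixel maximum for numerical stability.
    scores = scores - scores.max(axis=0, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=0, keepdims=True)
    disparities = np.arange(volume.shape[0]).reshape(-1, 1, 1)
    return (probs * disparities).sum(axis=0)
```

Because the regression is an expectation rather than a hard argmax, it yields sub-pixel disparities and stays differentiable, which is why this form is common in learned stereo matching.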

Second, we present a novel network, SC-GAN, with end-to-end adversarial training for depth estimation from monocular videos without estimating the camera pose or its change over time. To exploit cross-frame relations, SC-GAN includes a spatial correspondence module that uses Smolyak sparse grids to efficiently match features across adjacent frames and an attention mechanism to learn the importance of features in different directions. The generator in SC-GAN learns to estimate depth from the input frames, while the discriminator learns to distinguish between the ground-truth and estimated depth maps for the reference frame. Experiments on the KITTI and Cityscapes datasets show that the proposed SC-GAN produces much more accurate depth maps than many existing state-of-the-art methods on monocular videos.
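The generator and discriminator roles described above can be sketched with standard binary cross-entropy adversarial losses. This is only an illustrative sketch: the abstract does not state the SC-GAN objective, so the loss form, the added L1 depth term, the weight `lam`, and all function names are assumptions.

```python
import numpy as np


def bce(prob, target):
    """Binary cross-entropy between probabilities and 0/1 targets."""
    eps = 1e-7
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob)
                          + (1.0 - target) * np.log(1.0 - prob)))


def discriminator_loss(d_real, d_fake):
    """d_real / d_fake: discriminator probabilities assigned to
    ground-truth / estimated depth maps. The discriminator is rewarded
    for scoring real maps near 1 and estimated maps near 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))


def generator_loss(d_fake, pred_depth, gt_depth, lam=10.0):
    """Adversarial term (fool the discriminator) plus an assumed
    L1 depth term weighted by the hypothetical factor lam."""
    adversarial = bce(d_fake, np.ones_like(d_fake))
    reconstruction = float(np.mean(np.abs(pred_depth - gt_depth)))
    return adversarial + lam * reconstruction
```

In a training loop the two losses would be minimized in alternation, each updating only its own network's parameters.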

Finally, we propose a new method for single-image depth estimation that utilizes the spatial correspondence from stereo matching. To this end, we incorporate a pre-trained stereo network as a teacher that provides depth cues for the features and output generated by the student network, which is a monocular depth estimation network. To further leverage the depth cues, we develop a new depth-aware convolution operation that can adaptively choose subsets of relevant features for convolution at each location. Specifically, we compute hierarchical depth features as guidance and then estimate the depth map using the depth-aware convolution, which leverages the guidance to adapt the filters. Experimental results on the KITTI online benchmark and the Eigen split show that the proposed method achieves state-of-the-art performance for single-image depth estimation.
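One simple way to picture a convolution that selects a subset of relevant features at each location is to mask out neighbours whose guidance depth differs too much from the centre pixel. The sketch below does this for a single-channel feature map. It is not the operator proposed in the dissertation: the hard threshold `tau`, the normalization by the sum of retained weights, and the function name are illustrative assumptions, and the kernel is assumed to be non-negative with a positive centre weight.

```python
import numpy as np


def depth_aware_conv(feat, guide_depth, kernel, tau=0.5):
    """Convolve feat using only neighbours at a similar guidance depth.

    feat, guide_depth: arrays of shape (H, W).
    kernel: non-negative (k, k) array with k odd and a positive centre.
    tau: maximum allowed depth difference to the centre (assumed rule).
    """
    k = kernel.shape[0]
    r = k // 2
    h, w = feat.shape
    # Replicate border values so every pixel has a full k x k window.
    padded_feat = np.pad(feat, r, mode="edge")
    padded_depth = np.pad(guide_depth, r, mode="edge")
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            feat_win = padded_feat[y:y + k, x:x + k]
            depth_win = padded_depth[y:y + k, x:x + k]
            # Keep only neighbours whose depth is close to the centre's.
            mask = np.abs(depth_win - guide_depth[y, x]) <= tau
            weights = kernel * mask
            # The centre always passes the mask, so the sum is positive.
            out[y, x] = np.sum(weights * feat_win) / np.sum(weights)
    return out
```

With a box kernel and a small `tau`, features on either side of a depth discontinuity are averaged separately, so the boundary is not blurred; with a very large `tau` the operation reduces to an ordinary normalized box filter.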