Obtaining a depth map is not an easy task: over the past 30 years a wide array of methods has been developed, with varied results. This blog post discusses those methods.
Daniel Paz
Computer vision
Basic concepts
Conventional camera systems usually capture two-dimensional information from a three-dimensional scene. In contrast, the human visual system uses two eyes to extract depth information, which is known as “binocular vision”.
Stereo systems try to simulate human vision, where the fields of view of both eyes overlap, which creates a sense of depth.
Fig. 0
To generate the 3D representation, it is necessary to find the correspondences between both images and measure their displacement. That information is known as the disparity map. There are different procedures to generate it, ranging from per-pixel operations to neural networks or optimization, although the last of these is outside the scope of this publication.
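As a minimal sketch of how a disparity map relates to depth, the snippet below applies the standard pinhole relation Z = f * B / d for a rectified, parallel stereo pair. The function name `disparity_to_depth`, the focal length, and the baseline are illustrative assumptions, not values from the reference images used in this post.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert disparity (pixels) to depth (metres) using Z = f * B / d.

    Assumes a rectified, parallel stereo pair. Pixels with zero or
    negative disparity have no valid match and get infinite depth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: 800 px focal length, 10 cm baseline.
disparity = np.array([[64.0, 32.0], [16.0, 0.0]])
depth = disparity_to_depth(disparity, focal_length_px=800.0, baseline_m=0.1)
```

Larger disparities correspond to closer points, which is why nearby objects appear to shift more between the two images.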
Classic Approaches
Naive stereo algorithms try to obtain depth information by minimizing the difference between subsets of pixels. There are three main approaches to solving this problem algorithmically: local, global, and semi-global methods, all of which are described in the following sections.
Reference Images
To demonstrate the features of the different algorithms, we chose the following images as references. The idea is to compare the disparity maps obtained by the different methods on the same scene.
Fig 1: Left/Right parallel images
The images must be aligned (rectified). That way, correspondences only need to be searched along the X axis in a horizontal stereo setup, or along the Y axis in a vertical one.
Local Methods
The disparity is calculated by matching a pixel and a set of its neighbors from one image to the other. The idea is to minimize the difference between these subsets of pixels. Common measures for this comparison include normalized cross-correlation and the sum of squared differences (SSD), among others. These methods have been improved with dynamically sized windows or images taken at different baselines. As the images are parallel, it is only necessary to search for a horizontal shift.
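The window search described above can be sketched as a brute-force SSD block matcher. This is an unoptimized illustration, not the implementation behind the figures; the function name `ssd_block_matching` and its parameters `max_disparity` and `half_window` are assumptions made for the example.

```python
import numpy as np

def ssd_block_matching(left, right, max_disparity, half_window=2):
    """Brute-force local stereo matching with a sum-of-squared-differences cost.

    For every pixel of the left image, a square window is compared against
    windows shifted horizontally in the right image, and the shift with the
    lowest SSD is kept. Border pixels without a full window are left at 0.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    height, width = left.shape
    w = half_window
    disparity = np.zeros((height, width), dtype=np.int32)
    for y in range(w, height - w):
        for x in range(w, width - w):
            patch_left = left[y - w:y + w + 1, x - w:x + w + 1]
            best_cost, best_d = np.inf, 0
            # The candidate match in the right image sits at column x - d.
            for d in range(0, min(max_disparity, x - w) + 1):
                patch_right = right[y - w:y + w + 1, x - d - w:x - d + w + 1]
                cost = np.sum((patch_left - patch_right) ** 2)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity

# Synthetic check: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right_image = rng.random((12, 24))
left_image = np.roll(right_image, 3, axis=1)
disparity_map = ssd_block_matching(left_image, right_image, max_disparity=6)
```

On richly textured synthetic data like this the minimum is unambiguous. Real images with flat regions give many near-equal costs, which is where the noise discussed in this section comes from.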
Fig 2: Ground Truth (Left), raw output of SSD (Center), post-processed output of SSD (Right)
Local matching methods are usually fast because each disparity value is calculated from a small window of nearby pixels. Their main disadvantages are the problems they encounter with occlusions, plain textures, and edges. Furthermore, the resulting disparity map may have excess noise and low precision (Fig 2).
Fig 3: Clearly matched pixel (Left) vs Noise (Right)
This noise is caused by plain textures and occlusions. As can be seen in Fig 3, in the first case the shift is clearly defined, whereas in the second image it is muddled and harder to find.
The noise can be reduced by using a matching threshold and post-processing the output map (Fig 2), but some precision is lost.
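One possible form of this clean-up is sketched below: matches whose cost exceeds a threshold are rejected, and a 3x3 median filter then removes isolated outliers. The choice of a median filter, the function name `clean_disparity`, and the `best_cost` map (the per-pixel cost of the winning match) are assumptions for illustration; the post-processing used for Fig 2 may differ.

```python
import warnings
import numpy as np

def clean_disparity(disparity, best_cost, max_cost, invalid=-1.0):
    """Reject weak matches, then median-filter the remaining disparities.

    Pixels whose matching cost exceeds max_cost are marked as `invalid`
    and are ignored when filtering their neighbours.
    """
    disp = np.asarray(disparity, dtype=np.float64).copy()
    rejected = np.asarray(best_cost) > max_cost
    disp[rejected] = np.nan
    h, w = disp.shape
    padded = np.pad(disp, 1, mode="edge")
    # Stack the nine shifted views that make up each 3x3 neighbourhood.
    neighbours = np.stack([padded[dy:dy + h, dx:dx + w]
                           for dy in range(3) for dx in range(3)])
    with warnings.catch_warnings():
        # A neighbourhood made only of rejected pixels triggers a warning.
        warnings.simplefilter("ignore", RuntimeWarning)
        filtered = np.nanmedian(neighbours, axis=0)
    filtered[rejected] = invalid
    return filtered

# A flat surface at disparity 4 with one noisy pixel and one weak match.
raw = np.full((5, 5), 4.0)
raw[1, 1] = 0.0            # isolated outlier with a low (confident) cost
raw[3, 3] = 9.0            # wrong value with a high matching cost
costs = np.zeros((5, 5))
costs[3, 3] = 100.0
cleaned = clean_disparity(raw, costs, max_cost=10.0)
```

The rejected pixel stays marked as invalid rather than being filled in, which is one way the precision loss mentioned above shows up in practice.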
Global Methods
Unlike local procedures, global methods are rooted in an optimization process, which consists of minimizing a function that involves all the pixels of the image. One of the most widely used is Graph Cut optimization.
Fig 4: Ground truth (Left), Graph Cut (Right)
This produces a cleaner output than local methods at the cost of computation time and memory, so it cannot be run in real time. Even so, it is not perfect: it still shows artifacts at the edges and very marked steps in depth.
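To make the idea of a function over all pixels concrete, the sketch below evaluates one common form of such an energy: a data term (how well each chosen disparity matches) plus a smoothness term (how much neighboring disparities differ). The absolute-difference smoothness term, the `smoothness_weight` parameter, and the `cost_volume` layout are assumptions for illustration. The code only evaluates the energy and does not perform the Graph Cut minimization itself.

```python
import numpy as np

def global_energy(disparity, cost_volume, smoothness_weight=1.0):
    """Energy of a full disparity map: data term + weighted smoothness term.

    cost_volume[y, x, d] is the matching cost of assigning disparity d to
    pixel (y, x). Smoothness sums |d_p - d_q| over 4-connected neighbours.
    """
    disparity = np.asarray(disparity, dtype=np.int64)
    rows, cols = np.indices(disparity.shape)
    data_term = cost_volume[rows, cols, disparity].sum()
    smooth_term = (np.abs(np.diff(disparity, axis=0)).sum()
                   + np.abs(np.diff(disparity, axis=1)).sum())
    return float(data_term + smoothness_weight * smooth_term)

# Toy cost volume: disparity 0 is free, disparity 1 costs 1 at every pixel.
cost_volume = np.zeros((2, 3, 2))
cost_volume[..., 1] = 1.0
flat = np.zeros((2, 3), dtype=np.int64)
noisy = flat.copy()
noisy[0, 1] = 1
energy_flat = global_energy(flat, cost_volume, smoothness_weight=2.0)
energy_noisy = global_energy(noisy, cost_volume, smoothness_weight=2.0)
```

A global method searches for the disparity map with the lowest such energy, so a single noisy pixel is penalized both for its matching cost and for disagreeing with its neighbors.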
Mixed Methods
There are also widely used mixed methods like SGM (Semi-Global Matching) or SGBM (Semi-Global Block Matching), which aim to reduce the computational cost of global methods.
Fig 5: Ground truth (Left), SGM (Center), SGBM (Right)
Semi-Global Matching performs its optimization over pairs of neighboring disparity values. The initial values are obtained with standard local matching methods like SSD. By comparing each disparity value with its neighbors and penalizing large changes, it is possible to reduce the noise and improve the result at the edges.
Semi-Global Block Matching, on the other hand, is a modified version of SGM that focuses on speed.
These methods improve upon local methods at the cost of more computation time and heavy memory use, but they can still be used in real time.
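The neighbor penalty used by SGM can be sketched for a single scanline and a single direction. The recurrence below follows the usual SGM formulation, with a small penalty `p1` for a disparity change of one and a larger penalty `p2` for bigger jumps. The function name and the penalty values are assumptions, and a full SGM implementation sums this aggregation over several directions across the image.

```python
import numpy as np

def aggregate_left_to_right(cost_row, p1=1.0, p2=4.0):
    """Aggregate matching costs along one scanline, left to right.

    cost_row has shape (width, num_disparities). Each pixel adds its own
    cost to the cheapest way of arriving from the previous pixel: same
    disparity (free), a change of one (p1), or any larger jump (p2).
    """
    cost_row = np.asarray(cost_row, dtype=np.float64)
    width, num_disp = cost_row.shape
    aggregated = np.empty_like(cost_row)
    aggregated[0] = cost_row[0]
    for x in range(1, width):
        prev = aggregated[x - 1]
        prev_min = prev.min()
        from_below = np.concatenate(([np.inf], prev[:-1])) + p1  # came from d - 1
        from_above = np.concatenate((prev[1:], [np.inf])) + p1   # came from d + 1
        jump = np.full(num_disp, prev_min + p2)                  # any other change
        best = np.minimum.reduce([prev, from_below, from_above, jump])
        # Subtracting prev_min keeps the values from growing along the path.
        aggregated[x] = cost_row[x] + best - prev_min
    return aggregated

# True disparity is 2 everywhere; the middle pixel has a misleading raw cost.
raw_costs = np.array([
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.5, 1.0],
    [1.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 1.0],
])
aggregated_costs = aggregate_left_to_right(raw_costs)
raw_choice = raw_costs.argmin(axis=1)
smoothed_choice = aggregated_costs.argmin(axis=1)
```

In this toy scanline the raw minimum picks the wrong disparity at the middle pixel, while the aggregated costs keep it consistent with its neighbors.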
Complete Comparison
The classic stereo matching approaches are a tradeoff between computation time and quality.
Even if the fastest methods can produce a map that somewhat resembles the ground truth, to get a reliable and sharp disparity map, complex and heavy procedures are almost mandatory.
For a real-time application, the method that makes the most sense is SGBM, as it can produce accurate results quickly, even if they are somewhat blurred.
For 3D mapping, the best approach would be Graph Cut, as it gives the sharpest result and its output can be processed afterwards.
Recently, this field has been advancing rapidly thanks to neural networks, which reduce the noise and improve the sharpness of the resulting disparity map.