In each video, an area of interest on a rigid object is tracked using a perspective camera. A reference template is selected manually in the first image of the sequence. Since the shape of the rigid object is unknown, it is recovered online during the visual tracking. In the videos, a red grid is superimposed on the current area of interest to show the 3D structure of the object.
In each video, a part of a deformable object is tracked using a perspective camera. A reference template is selected manually in the first image of the sequence. The top-left corner shows the undeformed image, i.e. the current image reprojected into the reference frame. The undeformed image remains almost unchanged throughout the sequence, showing that the ESM visual tracking is able to estimate the deformation of each pixel of the reference template.
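To illustrate what "reprojecting the current image into the reference frame" means, here is a minimal sketch of such an unwarping step. It assumes the tracker has produced a dense per-pixel displacement field `flow` (a hypothetical layout, with `(dx, dy)` per pixel); nearest-neighbour resampling is used for brevity, not the interpolation an actual tracker would use.

```python
import numpy as np

def undeform(image, flow):
    """Resample `image` at (x, y) + flow to view it in the reference frame.
    `flow` is a hypothetical dense (h, w, 2) field of per-pixel (dx, dy)
    displacements estimated by the tracker; nearest-neighbour for brevity."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xi = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, image.shape[1] - 1)
    yi = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, image.shape[0] - 1)
    return image[yi, xi]

# toy check: the current image is the template shifted 2 px right, 1 px down
template = np.arange(100.0).reshape(10, 10)
current = np.zeros_like(template)
current[1:, 2:] = template[:-1, :-2]
flow = np.zeros((10, 10, 2))
flow[..., 0] = 2.0   # dx
flow[..., 1] = 1.0   # dy
recovered = undeform(current, flow)
```

If the deformation is estimated correctly, the recovered image matches the reference template wherever the warp stays inside the image, which is exactly why a near-constant undeformed image in the video indicates accurate tracking.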
We propose a new approach to the direct image alignment of either Lambertian or non-Lambertian objects in the presence of shadows, inter-reflections and glints, as well as ambient, diffuse and specular reflections that may vary in strength, type, number and spatial distribution. The method combines a proposed model of illumination changes with an appropriate geometric model of image motion. The parameters of both models are obtained simultaneously through the ESM optimization technique, which directly minimizes the intensity discrepancies. Comparisons with existing direct methods show significant improvements in tracking performance.
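A minimal sketch of the idea of jointly estimating geometric and photometric parameters by directly minimizing intensity discrepancies. This is a 1-D toy with a single shift and an affine illumination change (gain and bias), not the full homography and illumination model of the paper; the Jacobian's geometric column uses an ESM-style average of template and current gradients.

```python
import numpy as np

def align_1d(template, image, iters=50):
    """Jointly estimate a sub-pixel shift plus illumination gain/bias by
    Gauss-Newton on the photometric residual r = gain*I(x+shift)+bias - T.
    A 1-D illustration only, under assumed smooth signals."""
    n = len(template)
    dT = np.gradient(template)                 # template gradient (ESM average)
    p = np.array([0.0, 1.0, 0.0])              # shift, gain, bias
    for _ in range(iters):
        shift, gain, bias = p
        x = np.arange(n) + shift
        x0 = np.clip(np.floor(x).astype(int), 0, len(image) - 2)
        w = x - x0
        I = (1 - w) * image[x0] + w * image[x0 + 1]   # warped intensities
        dI = image[x0 + 1] - image[x0]                # gradient of the warp
        r = gain * I + bias - template                # intensity discrepancy
        J = np.stack([0.5 * (gain * dI + dT),         # ESM-style mean gradient
                      I,                              # d r / d gain
                      np.ones(n)], axis=1)            # d r / d bias
        p -= np.linalg.lstsq(J, r, rcond=None)[0]
    return p

# synthetic check: template is a shifted, gain/bias-altered copy of the image
image = np.exp(-((np.arange(400) - 200.0) ** 2) / 1600.0)
template = 1.3 * np.interp(np.arange(400) + 0.8, np.arange(400), image) + 0.05
shift, gain, bias = align_1d(template, image)
```

Because geometry and illumination are solved for in the same least-squares step, neither has to be corrected in a separate pass, which is the property the paragraph above describes.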
When the models used in the visual tracking are not accurate enough, the optimization may fail. For example, if the target is partially occluded, the overall motion of the area of interest will not be coherent. In the general case, outlier measurements can be discarded by using robust cost functions. We have tested the use of M-estimators in the ESM visual tracking algorithm. The video below shows an example of the robustness of the algorithm when tracking a planar object under severe illumination changes and specular reflections. In this case, illumination changes and specular reflections are not explicitly modeled (see the section above) but treated as outliers. Although M-estimators make it possible to handle partial occlusions, which can hardly be modeled explicitly, the price to pay is a higher computation time and a lower convergence rate.
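The M-estimator principle can be sketched with iteratively reweighted least squares (IRLS) using Huber weights. The toy problem below is a line fit with gross outliers standing in for occluded pixels, not the tracker's actual cost; the Huber threshold and MAD scale estimate are standard robust-statistics choices, assumed here rather than taken from the paper.

```python
import numpy as np

def huber_weights(u, k=1.345):
    """IRLS weights for the Huber M-estimator: 1 inside [-k, k], k/|u| outside."""
    a = np.maximum(np.abs(u), 1e-12)
    return np.minimum(1.0, k / a)

def robust_lstsq(A, b, iters=20):
    """Iteratively reweighted least squares: residuals are rescaled by a
    robust (MAD) scale estimate, then downweighted, so gross outliers
    progressively lose influence on the fit."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]          # plain LS initialization
    for _ in range(iters):
        r = A @ x - b
        s = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12  # MAD scale
        sw = np.sqrt(huber_weights(r / s))
        x = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]
    return x

# toy check: fit a line whose data contain gross "occlusion" outliers
xs = np.linspace(0.0, 10.0, 50)
A = np.stack([xs, np.ones_like(xs)], axis=1)
b = 2.0 * xs + 1.0
b[::10] += 30.0                                        # 5 corrupted samples
slope, intercept = robust_lstsq(A, b)
```

The extra reweighting passes are also where the cost mentioned above comes from: each IRLS iteration repeats the full least-squares solve.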
The ESM technique has been successfully applied to visual tracking using a stereo pair. The video below shows an example of the visual tracking of a sphere. The user selects a region of interest in the left image (the blue rectangle in this case). Once the corresponding region in the right image has been found, the visual tracking starts.
The ESM visual tracking can be used to estimate the displacement of a robot with respect to a reference frame (e.g. the initial position). The translation can be estimated only up to a scale factor; additional information (e.g. a known distance) allows the scale factor to be recovered. Compared with the odometry of a well-calibrated robot, we obtain very good precision. In the first video, the pose of the robot is estimated directly from the images (i.e. a rigidity constraint is imposed on the planes in the scene), given the camera parameters and a model of the scene. In the second video, the pose of the robot and a piecewise-planar model of the scene are estimated simultaneously. This is a first step towards a visual SLAM approach for single-viewpoint cameras.
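How a single known distance resolves the monocular scale ambiguity can be shown in a few lines. All numbers below are hypothetical up-to-scale estimates, not values from the videos:

```python
import numpy as np

# Monocular reconstruction is only known up to scale: the translation and
# scene points below are hypothetical up-to-scale estimates from a tracker.
t_est = np.array([0.04, 0.00, 0.02])                  # camera translation
pts_est = np.array([[0.2, 0.1, 1.0],
                    [0.5, 0.1, 1.0]])                 # two scene points

# One known metric distance (here, assumed measured between the two scene
# points) fixes the global scale factor for the whole reconstruction.
known_distance = 0.9                                  # metres (assumption)
s = known_distance / np.linalg.norm(pts_est[1] - pts_est[0])
t_metric = s * t_est                                  # translation in metres
pts_metric = s * pts_est                              # structure in metres
```

The same single factor rescales both the camera translation and every reconstructed point, which is why one measured distance is enough.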
The stereo ESM visual tracking can be used to estimate the displacement of a car with respect to a reference frame (e.g. the initial position). The algorithm was tested on real full-scale sequences, as can be seen in the videos below. Several test sequences from different streets in Versailles, France, were used to validate the results. The estimated 3D trajectory from a stereo pair of cameras was superimposed on satellite images of the area. The first sequence is that of a relatively straight road. The distance traveled by the car was measured using road markings in the images and satellite views with a resolution of 2.9 cm/pixel for the Versailles region. The length of the path measured on Google Earth was about 436 m and the estimated length from the tracker is 420 m, giving an approximate drift of 4%. Throughout the sequence several moving vehicles pass in front of the cameras, and at one stage a car is overtaken. The second sequence is particularly illustrative since a full loop of a round-about was performed; in particular, this enables the drift to be measured at the crossing point in the trajectory. The drift at the crossing point was approximately 2 m in the direction normal to the road plane. Considering that the trajectory around the round-about is approximately 200 m long (measured using Google Earth), this amounts to a drift of 1% on the measurable axis.
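The drift percentages quoted above follow directly from the reported lengths:

```python
# Arithmetic behind the drift figures reported for the Versailles sequences.
measured_m = 436.0     # straight-road path length from Google Earth
estimated_m = 420.0    # path length recovered by the stereo tracker
drift_straight = abs(measured_m - estimated_m) / measured_m   # ~0.037, i.e. ~4%

loop_m = 200.0         # round-about loop length from Google Earth
crossing_gap_m = 2.0   # gap at the trajectory crossing point
drift_loop = crossing_gap_m / loop_m                          # 0.01, i.e. 1%
```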
We have built an efficient method (named direct visual SLAM) that directly computes the 3D camera displacement and the scene structure. The method is robust to arbitrary illumination variations. Feature extraction is not needed, since the intensities of all available pixels are used directly. The system is automatically initialized from the first image. Motion and structure parameters are estimated simultaneously with the ESM technique, for faster processing and avoidance of irrelevant local minima; the method can therefore cope with large inter-frame displacements. Rigidity and visibility constraints on the structure are enforced. All these factors contribute significantly to accurate SLAM results. The video below shows the SLAM results on a real-world urban sequence captured in Versailles, France. The left frame shows the input images overlaid with the tracked regions, while the right frame shows the 3D pose and scene structure being incrementally recovered.