One of the key technologies in autonomous navigation is estimating odometry from various types of sensors, including cameras and IMUs (inertial measurement units); when this is done primarily from imagery, it is known as Visual Odometry (VO). Recently, a number of learning-based visual odometry techniques have been introduced, thanks to the rapid growth of deep learning.

## The conventional methods

Most conventional methods for computing a robot's position and orientation rely on geometric constraints extracted from imagery. They can be further divided into sparse feature-based methods and dense/direct methods, according to whether the images are pre-processed with sparse feature extraction and matching.

### Sparse Feature Method

Geometry-based methods face several challenging problems: they drift over time and require sophisticated camera calibration. In addition, feature-based methods are sensitive to outliers in feature matching, which can be caused by image motion blur, appearance similarity, occlusion, etc. Because sparse feature-based methods use only salient features, they do not benefit from the rich information in the whole image. This is one reason why dense/direct methods have been proposed.
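
The outlier sensitivity mentioned above is why robust estimators such as RANSAC are standard in feature-based pipelines. The following is an illustrative sketch (not from the paper, and deliberately simplified to a 2-D translation model) of how a handful of bad matches would corrupt a naive estimate but can be rejected by consensus:

```python
import numpy as np

rng = np.random.default_rng(0)

true_t = np.array([3.0, -1.5])                 # ground-truth inter-frame shift
pts_a = rng.uniform(0, 100, size=(50, 2))      # "features" in frame A
pts_b = pts_a + true_t + rng.normal(0, 0.05, (50, 2))  # inliers in frame B
pts_b[:10] = rng.uniform(0, 100, size=(10, 2)) # 10 bad matches (outliers)

def ransac_translation(a, b, iters=200, thresh=0.5):
    """Pick the translation supported by the most correspondences."""
    best_t, best_count = None, 0
    for _ in range(iters):
        i = rng.integers(len(a))               # minimal sample: one match
        t = b[i] - a[i]
        residuals = np.linalg.norm(b - (a + t), axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_count:
            best_count = inliers.sum()
            best_t = (b[inliers] - a[inliers]).mean(axis=0)  # refit on inliers
    return best_t

t_est = ransac_translation(pts_a, pts_b)
print(np.round(t_est, 2))   # close to [3.0, -1.5] despite 20% outliers
```

A real pipeline would estimate an essential matrix over five-point samples rather than a translation over one-point samples, but the consensus logic is the same.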

### Dense/Direct Method

The dense/direct methods do not rely on manually designed feature detection, description, or matching. Because they exploit all the pixels in the images, dense/direct methods tend to work better than sparse feature-based ones in texture-less and large environments, and they are increasingly gaining favour in the field. However, their accuracy degrades severely when the robot moves fast and the camera frame rate is relatively low.
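
As an illustrative sketch (not from the paper), the quantity a direct method minimises is a photometric error over every pixel. Here the motion model is reduced to a pure horizontal pixel shift between two synthetic frames:

```python
import numpy as np

rng = np.random.default_rng(1)
frame0 = rng.uniform(0, 1, size=(32, 32))
frame1 = np.roll(frame0, shift=2, axis=1)     # camera "moved" 2 px right

def photometric_error(img0, img1, dx):
    """Sum of squared intensity differences after shifting img0 by dx."""
    warped = np.roll(img0, shift=dx, axis=1)
    return np.sum((warped - img1) ** 2)

errors = {dx: photometric_error(frame0, frame1, dx) for dx in range(-4, 5)}
best_dx = min(errors, key=errors.get)
print(best_dx)   # 2 — the shift that best explains the whole image
```

Real direct methods warp by a full SE(3) motion using depth, but this toy search shows why they use every pixel: the error is aggregated over the whole image rather than a few keypoints. It also hints at the frame-rate limitation in the text: if the true shift falls outside the small search/linearisation range, the minimum is missed.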

## Learning-based Method

### Traditional learning based methods

For monocular VO, the optical flow of the whole image has been used to train K-Nearest Neighbour (KNN), Gaussian Process (GP), and Support Vector Machine (SVM) regression models. This work showed that learning-based methods can potentially estimate VO without camera calibration. However, these methods are inefficient when dealing with extensive high-dimensional data.
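
As an illustrative sketch (not from the paper), KNN regression for VO amounts to: describe each frame pair by a flow descriptor, then predict motion by averaging the labels of the nearest training descriptors. The 8-D "flow" below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

speeds = rng.uniform(0.0, 2.0, size=200)                 # training labels
# toy flow descriptor: flow magnitude scales with speed, plus noise
flows = speeds[:, None] * np.ones((200, 8)) + rng.normal(0, 0.05, (200, 8))

def knn_predict(train_x, train_y, query, k=5):
    """Average the labels of the k closest training descriptors."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return train_y[nearest].mean()

query_flow = 1.3 * np.ones(8)
print(round(knn_predict(flows, speeds, query_flow), 2))  # ≈ 1.3
```

The inefficiency noted above is visible even here: every query compares against the entire training set, which scales poorly once descriptors are full-image flow fields.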

### Deep learning based methods

Since DL automatically learns suitable feature representations, it has been widely used in computer vision and robotics, showing remarkable results. The following DL techniques are mainly used to solve the VO problem.

- CNN:
  - Used to predict depth and surface normals from a single image.
  - Multiple stacked encoder-decoder convolutional networks are designed to estimate depth and motion concurrently.
  - Some geometric computations need to be implemented as network layers in order to explicitly incorporate geometric transformations.

- RNN:
  - Sequential learning and modeling are essential for applications involving time-series data, such as the VO problem, which deals with a series of image frames and solves for a sequence of poses and uncertainties.

- Uncertainty estimation in DNNs:
  - A recent paper attempts to compute uncertainty from DNNs by making multiple predictions with dropout.
  - It is also common to estimate the uncertainty of DNNs with Mixture Density Networks (MDNs).
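
The dropout-based idea above can be sketched in a few lines (illustrative only, with toy random weights): keep dropout active at test time, run several stochastic forward passes, and read the spread of the predictions as uncertainty.

```python
import numpy as np

rng = np.random.default_rng(3)

W1 = rng.normal(0, 1, (16, 32))   # toy fixed weights of a 2-layer MLP
W2 = rng.normal(0, 1, (32, 1))

def forward_with_dropout(x, p=0.5):
    h = np.maximum(0, x @ W1)                 # ReLU hidden layer
    mask = rng.random(h.shape) > p            # dropout stays ON at test time
    h = h * mask / (1 - p)                    # inverted-dropout scaling
    return (h @ W2).item()

x = rng.normal(0, 1, 16)
samples = np.array([forward_with_dropout(x) for _ in range(500)])
mean, std = samples.mean(), samples.std()
print(f"prediction {mean:.2f} ± {std:.2f}")  # std acts as the uncertainty
```

Each pass drops a different random subset of hidden units, so the 500 predictions disagree; their standard deviation is the Monte-Carlo dropout uncertainty estimate.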

## Overall Architectures

### Feature Extraction: CNN

The CNN is developed to perform effective feature extraction on the concatenation of two consecutive images. Its structure is inspired by the network for optical flow estimation in “FlowNet: Learning Optical Flow with Convolutional Networks”.

- Composed of:
  - 9 convolutional layers
  - Each layer followed by a Rectified Linear Unit (ReLU), except Conv6
  - (please refer to Table 1)
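
To get a feel for what the 9-layer stack does spatially, the sketch below walks the standard convolution output-size formula through a FlowNet-style configuration. The kernel/stride values and the input resolution are assumed here (typical FlowNet/KITTI settings); the paper's Table 1 has the authoritative numbers.

```python
def conv_out(size, kernel, stride, pad):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, kernel, stride); padding chosen to preserve size at stride 1
layers = [("conv1", 7, 2), ("conv2", 5, 2), ("conv3", 5, 2),
          ("conv3_1", 3, 1), ("conv4", 3, 2), ("conv4_1", 3, 1),
          ("conv5", 3, 2), ("conv5_1", 3, 1), ("conv6", 3, 2)]

h, w = 384, 1280          # a KITTI-like input resolution (assumed)
for name, k, s in layers:
    p = (k - 1) // 2
    h, w = conv_out(h, k, s, p), conv_out(w, k, s, p)
    print(f"{name}: {h} x {w}")
# the six stride-2 layers reduce 384x1280 to a 6x20 feature map
```

Each stride-2 layer halves the resolution, so the RNN receives a compact feature map rather than raw pixels.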

### Sequential Learning: Deep RNN

An RNN is well suited to the VO problem, which involves both a temporal model (motion model) and sequential data (the image sequence). However, an RNN is not suitable for directly learning a sequential representation from high-dimensional raw images. Thus, the CNN features are adopted as its input instead.

To explore and exploit correlations among images taken in long trajectories, Long Short-Term Memory (LSTM) has been employed in RNN.
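
The LSTM's ability to carry information across long trajectories comes from its gating mechanism. The following is an illustrative sketch (toy weights, not the paper's network) of a single LSTM step in numpy:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """x: input, h: hidden state, c: cell state. W maps [x, h] to 4 gates."""
    z = np.concatenate([x, h]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g          # forget old memory, write new candidate
    h_new = o * np.tanh(c_new)     # expose a gated view of the memory
    return h_new, c_new

hidden = 8
x = rng.normal(size=4)
h, c = np.zeros(hidden), np.zeros(hidden)
W = rng.normal(0, 0.1, (4 + hidden, 4 * hidden))
b = np.zeros(4 * hidden)
for _ in range(5):                 # unroll over a short "sequence"
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)   # (8,)
```

The additive cell update `f * c + i * g` is what lets gradients flow over many timesteps, which is why LSTM copes with long image sequences where a plain RNN would not.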

### Fully-connected layer

To collect the features of the RNN and map them into motion and uncertainty, two fully-connected layers are introduced. The first layer has 128 hidden states with ReLU activation. The other one has 12 dimensions to represent both motion and uncertainty.

- 6 dimensions for motion:
  - travelled distance: ∆x, ∆y, ∆z
  - heading change: ∆φ, ∆θ, ∆ψ

- 6 dimensions for the corresponding variances

**With the aid of the uncertainty estimates, the predicted motion can be used as odometry or dead-reckoning and fused with other sensors by a filtering technique such as the Extended Kalman Filter.**
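
As an illustrative sketch (not from the paper; the 12-D vector and the wheel-odometry numbers below are made up), this is how the split output and its variances would feed a simple Kalman-style fusion, here for a single Δx component:

```python
import numpy as np

output = np.array([0.9, 0.0, 0.0, 0.0, 0.0, 0.01,       # Δx Δy Δz, Δangles
                   0.04, 0.05, 0.05, 1e-4, 1e-4, 1e-4])  # their variances
motion, variance = output[:6], output[6:]

# fuse Δx from VO with a wheel-odometry Δx, weighting by uncertainty
vo_dx, vo_var = motion[0], variance[0]
wheel_dx, wheel_var = 1.1, 0.10
gain = wheel_var / (vo_var + wheel_var)        # Kalman-style gain
fused_dx = wheel_dx + gain * (vo_dx - wheel_dx)
fused_var = gain * vo_var                      # = vo_var*wheel_var/(sum)
print(round(fused_dx, 3), round(fused_var, 4))
```

The fused estimate leans toward whichever sensor reports the smaller variance, and the fused variance is smaller than either input's: this is exactly why predicting uncertainty alongside motion makes the VO output usable in a filter.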

### SE(3) composition layer

In order to derive global poses with respect to the initial coordinate frame, the relative transformations of position and orientation need to be composed over time. To do this, an SE(3) composition layer is introduced.
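
The composition itself is just repeated multiplication of 4x4 homogeneous transforms. An illustrative sketch (not from the paper; restricted to yaw-only rotation for brevity), driving a square and arriving back at the origin:

```python
import numpy as np

def se3(dx, dy, dz, yaw):
    """4x4 homogeneous transform for a planar motion (yaw-only rotation)."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = [dx, dy, dz]
    return T

T_global = np.eye(4)                       # initial coordinate frame
relative_motions = [(1.0, 0.0, 0.0, np.pi / 2)] * 4   # drive a square
for dx, dy, dz, yaw in relative_motions:
    T_global = T_global @ se3(dx, dy, dz, yaw)

# four "move 1 m, turn 90°" steps compose back (numerically) to the origin
print(np.round(T_global[:3, 3], 6))
```

The network's SE(3) composition layer performs this accumulation inside the graph, so the loss can be applied to global poses rather than only to frame-to-frame motions.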

**For more details on Special Euclidean Group, SE(n), please refer to the link here**

### Uncertainty estimation of VO

(TBD)

## Cost function and Optimisation

The neural network of ESP-VO predicts both pose and uncertainty. The cost function for training is therefore a hybrid of two parts, one for each type of prediction.

### For pose & orientation estimation

Pose and orientation estimation can be framed as computing the conditional probability of the poses given the sequence of images up to time t, as shown in equation 7:

Optimal network parameters θ∗ for pose estimation can be learnt by maximising equation 7.
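
As a hedged reconstruction of this formulation (following the standard sequence-to-sequence VO setup, so the symbols are indicative rather than quoted verbatim from the paper):

```latex
% Conditional probability of the poses Y_t given the monocular images X_t
p(\mathbf{Y}_t \mid \mathbf{X}_t) = p(\mathbf{y}_1, \ldots, \mathbf{y}_t \mid \mathbf{x}_1, \ldots, \mathbf{x}_t)

% Optimal network parameters maximise this probability over the training data
\theta^{*} = \operatorname*{arg\,max}_{\theta} \; p(\mathbf{Y}_t \mid \mathbf{X}_t; \theta)
```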

### For Motion and uncertainty estimation
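
A common way to train motion and uncertainty jointly is a Gaussian negative log-likelihood in which each residual is weighted by its predicted variance. The sketch below illustrates that idea only; it is not necessarily the paper's exact loss:

```python
import numpy as np

def gaussian_nll(pred_motion, pred_logvar, true_motion):
    """Per-dimension NLL: 0.5 * (residual^2 / var + log var)."""
    var = np.exp(pred_logvar)
    residual_sq = (pred_motion - true_motion) ** 2
    return 0.5 * np.mean(residual_sq / var + pred_logvar)

pred, true = 0.9, 1.0
overconfident = gaussian_nll(pred, -8.0, true)  # tiny predicted variance
calibrated    = gaussian_nll(pred,  0.0, true)  # unit predicted variance
print(overconfident > calibrated)
```

The `log var` term stops the network from claiming infinite variance everywhere, while the `residual^2 / var` term makes a confident-but-wrong prediction expensive, so the predicted variances are pushed toward being calibrated.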

## Experimental Results

Various types of datasets are utilized to analyze the accuracy and robustness of the ESP-VO algorithm, as listed below:

- KITTI VO dataset
- Malaga dataset
- Cityscapes dataset
- EuRoC MAV dataset
- NYU depth dataset

To evaluate the proposed VO method's performance, other state-of-the-art VO algorithms are compared in the subsequent experiments. Since a visual SLAM system without loop closure reduces to VO, some visual SLAM algorithms were also included in the comparison.

- VISO2
- Sparse feature based ORB-SLAM
- Direct method based LSD-SLAM

## Conclusions

The author proposes a novel end-to-end, sequence-to-sequence VO algorithm for monocular images based on deep learning. The proposed algorithm automatically learns feature representations from raw monocular images and solves for odometry and the corresponding variances, without the burdens that conventional VO algorithms suffer. Hence, there is no need to carefully tune the parameters of the VO system or recalibrate the camera every time the mechanical system is modified.

Despite the potential of the proposed algorithm in the VO area, the author emphasizes that it is not expected to replace geometry-based approaches; rather, it would be a viable complement to be incorporated with them.