This paper was accepted to a CVPR 2018 workshop on deep-learning-based visual odometry. It contributes a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems. Unlike traditional methods, its fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors. As a result, the proposed model repeatably detects a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector.
Overview
Interest point detection is ill-defined compared to semantic tasks such as human-body keypoint estimation. In addition, existing datasets of 2D ground-truth keypoint locations are all labeled by human annotators, which is impractical for generic interest points. In this paper, the authors suggest a self-supervised solution using self-training.
To generate the pseudo-ground-truth interest points, a fully-convolutional network is first trained on a synthetic dataset called Synthetic Shapes, as shown in (a) below, consisting of simple geometric shapes with no ambiguity in the interest point locations. The resulting trained detector ("Base Detector" in (a)) is called MagicPoint. Even though MagicPoint performs surprisingly well on real images, it misses many potential interest point locations.
To close the performance gap on real images, a multi-scale, multi-transform technique called Homographic Adaptation is developed. It is designed to enable self-supervised training of interest point detectors: it warps the input image multiple times to help an interest point detector see the scene from many different viewpoints and scales. Homographic Adaptation is used in conjunction with the MagicPoint detector to boost performance and generate the pseudo-ground-truth interest points (as shown in (b)). The resulting detector, called SuperPoint, is more repeatable and performs better on a larger set of stimuli.
Finally, the interest point network is combined with an additional subnetwork that computes interest point descriptors, attaching a fixed-dimensional descriptor vector to each point for higher-level semantic tasks such as image matching.
The authors claim that the proposed method is the only one that computes both interest points and descriptors in a single network in real time. The comparison chart is pictured below.
SuperPoint Architecture
Shared Encoder
 Uses a VGG-style encoder
 The encoder uses three max-pooling layers
 Hc = H/8 and Wc = W/8 for an input image of size H x W
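The spatial arithmetic above can be checked with a minimal NumPy sketch (the `max_pool2x2` helper is illustrative, not the paper's code): three 2 x 2 pooling stages take an H x W image to H/8 x W/8.

```python
import numpy as np

def max_pool2x2(x):
    # 2x2 max pooling via reshape (assumes H and W are even)
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.random.rand(240, 320)   # H x W input
feat = img
for _ in range(3):               # three pooling stages, as in the encoder
    feat = max_pool2x2(feat)
print(feat.shape)                # (30, 40) = (H/8, W/8)
```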
Interest Point Decoder
 To avoid the high computation of learned upsampling layers, the interest point detection head is designed with an explicit (non-learned) decoder.
 The interest point head computes a tensor X in R^{H_c x W_c x 65}.
 The 65 channels correspond to local, non-overlapping 8 x 8 grid regions of pixels plus an extra "no interest point" dustbin.
 After a channel-wise softmax, the dustbin dimension is removed and the result is reshaped from R^{H_c x W_c x 64} to a full-resolution heatmap in R^{H x W}.
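A minimal NumPy sketch of this decoder, assuming a 240 x 320 input (so H_c = 30, W_c = 40) and a row-major ordering of the 64 channels within each 8 x 8 cell (the ordering is an assumption of this sketch):

```python
import numpy as np

def point_decoder(logits):
    # logits: (Hc, Wc, 65) raw scores from the detector head
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)    # channel-wise softmax
    probs = probs[..., :-1]                      # drop the "dustbin" channel
    Hc, Wc, _ = probs.shape
    # depth-to-space: each cell's 64 values become an 8x8 pixel block
    # (row-major ordering within the cell is assumed here)
    heatmap = probs.reshape(Hc, Wc, 8, 8).transpose(0, 2, 1, 3).reshape(Hc * 8, Wc * 8)
    return heatmap

heatmap = point_decoder(np.random.randn(30, 40, 65))
print(heatmap.shape)  # (240, 320)
```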
Descriptor Decoder
 The descriptor head computes a tensor D in R^{H_c x W_c x D} (the paper uses D = 256).
 To output a dense map of L2-normalized, fixed-length descriptors, a model similar to UCN is used.
 It outputs a semi-dense grid of descriptors (one every 8 pixels), which reduces training memory and keeps the run-time tractable.
 The decoder then performs bicubic interpolation of the descriptors and L2-normalizes the activations to unit length.
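A sketch of this decoder in NumPy; nearest-neighbor upsampling stands in for the paper's bicubic interpolation to keep the example dependency-free:

```python
import numpy as np

def descriptor_decoder(coarse, cell=8):
    # coarse: (Hc, Wc, D) semi-dense descriptors, one per 8x8 cell.
    # Upsample to full resolution (nearest-neighbor here for brevity;
    # the paper interpolates bicubically), then L2-normalize per pixel.
    dense = coarse.repeat(cell, axis=0).repeat(cell, axis=1)
    norms = np.linalg.norm(dense, axis=-1, keepdims=True)
    return dense / np.clip(norms, 1e-8, None)

desc = descriptor_decoder(np.random.randn(30, 40, 256))
print(desc.shape)  # (240, 320, 256)
```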
Loss Functions
 The sum of two intermediate losses:
 L_p: Loss for the interest point detector
 L_d: Loss for the descriptor
 Use pairs of synthetically warped images which have both (a) pseudo-ground-truth interest point locations and (b) the ground-truth correspondence from a randomly generated homography, as shown in figure (c).
 Use \lambda to balance the final loss:

L(X, X', D, D'; Y, Y', S) = L_p(X, Y) + L_p(X', Y') + \lambda L_d(D, D', S)
L_p is a fully-convolutional cross-entropy loss over the cells x_{hw} in X, averaged over all H_c x W_c cells.
 The descriptor loss is a hinge loss applied to all pairs of descriptor cells from the two images, weighted by whether the cells actually correspond under the homography.
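The two terms can be sketched in NumPy as below. The margins m_pos = 1, m_neg = 0.2 and weight lambda_d = 250 follow the paper's reported settings; for brevity the descriptor loss here takes N pre-paired cells rather than iterating over all (H_c W_c)^2 pairs.

```python
import numpy as np

def detector_loss(logits, labels):
    # logits: (Hc, Wc, 65) head output; labels: (Hc, Wc) true-bin index per cell
    e = np.exp(logits - logits.max(-1, keepdims=True))
    log_p = np.log(e / e.sum(-1, keepdims=True))
    Hc, Wc, _ = logits.shape
    rows, cols = np.arange(Hc)[:, None], np.arange(Wc)[None, :]
    return -log_p[rows, cols, labels].mean()     # cross-entropy over cells

def descriptor_loss(d, d_p, s, m_pos=1.0, m_neg=0.2, lambda_d=250.0):
    # d, d_p: (N, D) descriptor cells from the two images; s: (N,) 1 if matched
    dot = (d * d_p).sum(-1)
    hinge = (lambda_d * s * np.maximum(0.0, m_pos - dot)
             + (1 - s) * np.maximum(0.0, dot - m_neg))
    return hinge.mean()
```

The final loss then sums the detector loss on both images plus lambda times the descriptor loss, as in the equation above.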
Synthetic PreTraining
Synthetic Shapes
Due to the lack of a database of interest-point-labeled images, the authors create a large-scale synthetic dataset called Synthetic Shapes. It consists of simplified 2D geometry produced by synthetic rendering of quadrilaterals, triangles, lines, and ellipses.
 Images are generated on-the-fly
 Homographic warps are applied to each image
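An intentionally over-simplified sketch of the idea: render a shape whose corner locations are known exactly, so interest point labels come for free. Only an axis-aligned rectangle is drawn here; the real renderer draws quadrilaterals, triangles, lines, and ellipses with noise and warps.

```python
import numpy as np

def synthetic_rect(rng, size=64):
    # Draw one filled rectangle; its four corners are the exact
    # ground-truth interest points -- no human labeling needed.
    x0, y0 = rng.integers(4, size // 2, 2)
    x1, y1 = rng.integers(size // 2, size - 4, 2)
    img = np.zeros((size, size), dtype=np.float32)
    img[y0:y1, x0:x1] = 1.0
    corners = np.array([[x0, y0], [x1, y0], [x0, y1], [x1, y1]])
    return img, corners

img, pts = synthetic_rect(np.random.default_rng(0))
```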
MagicPoint
The MagicPoint detector performs very well on Synthetic Shapes compared to traditional methods, as shown in Table 2 (with and without noise). However, on the space of all natural images it relatively underperforms. This motivated the authors to design a self-supervised approach for training on real-world images, called Homographic Adaptation.
Homographic Adaptation
A process that applies random homographies to the input image, runs the detector on each warped copy, and combines the results.
Formulation
Homographies are used at the core of the self-supervised approach for the reasons below:
 Homographies give exact or almost-exact image-to-image transformations for camera motion with only rotation about the camera center.
 Since most of the world is reasonably planar, a homography is a good model for how the same 3D point appears from different viewpoints.
 They do not require 3D information and are easily applied to any 2D image.
\hat{F}(I; f_\theta) = \frac{1}{N_h} \sum_{i=1}^{N_h} H_i^{-1} f_\theta(H_i(I))

where \hat{F} is the resulting SuperPoint detector, f_\theta the initial interest point function, I the input image, x = f_\theta(I) the resulting interest points, H_i a random homography, and N_h the number of homographic warps (a hyperparameter).
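In code, the aggregation is a simple average over unwarped detections. Here `detector`, `sample_h`, `warp`, and `unwarp` are placeholder callables (in practice the warping would use e.g. `cv2.warpPerspective` with H_i and its inverse):

```python
import numpy as np

def homographic_adaptation(image, detector, sample_h, warp, unwarp, n_h=100):
    # F_hat(I) = (1 / N_h) * sum_i  H_i^{-1} f_theta(H_i(I))
    acc = detector(image)                 # i = 1: identity homography
    for _ in range(n_h - 1):
        H = sample_h()                    # random homography
        acc += unwarp(detector(warp(image, H)), H)
    return acc / n_h

# smoke test with identity warps; with real warps the per-view heatmaps
# disagree slightly and the average becomes the boosted detection map
img = np.random.rand(24, 32)
det = lambda x: (x > 0.5).astype(float)
identity = lambda x, H: x
out = homographic_adaptation(img, det, lambda: np.eye(3), identity, identity, n_h=10)
```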
Choosing Homographies
 To sample good homographies, decompose a potential homography into simpler, less expressive transformation classes (translation, scale, in-plane rotation, and perspective distortion), sample each within preset ranges, and compose them.
 The added benefit of using more than 100 homographies is minimal.
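A sketch of such compositional sampling in NumPy (the ranges below are illustrative defaults, not the paper's exact truncated-normal parameters):

```python
import numpy as np

def sample_homography(rng, max_translation=0.1, max_rotation=0.3,
                      max_scale=0.2, max_perspective=0.001):
    # Compose translation, rotation, scale, and perspective components,
    # each sampled within a preset range.
    tx, ty = rng.uniform(-max_translation, max_translation, 2)
    theta = rng.uniform(-max_rotation, max_rotation)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    px, py = rng.uniform(-max_perspective, max_perspective, 2)
    T = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]])        # translation
    R = np.array([[np.cos(theta), -np.sin(theta), 0],
                  [np.sin(theta),  np.cos(theta), 0],
                  [0, 0, 1]])                                # rotation
    S = np.diag([s, s, 1.0])                                 # scale
    P = np.array([[1, 0, 0], [0, 1, 0], [px, py, 1]])        # perspective
    return T @ R @ S @ P

H = sample_homography(np.random.default_rng(0))
print(H.shape)  # (3, 3)
```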