SuperPoint: Self-Supervised Interest Point Detection and Description

(김태규) #1

This paper was accepted to a CVPR 2018 workshop on deep-learning-based VO (visual odometry). It contributes a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems. Unlike traditional methods, its fully convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors. As a result, the proposed model can repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector.


Interest point detection is ill-defined compared with semantic tasks such as human-body keypoint estimation, where datasets of 2D ground truth locations can be labeled by human annotators. Since such labeling is not practical for interest points, the authors propose a self-supervised solution using self-training.

To generate the pseudo-ground truth interest points, a fully convolutional neural network is first trained on a synthetic dataset called Synthetic Shapes, as shown in (a) below, which consists of simple geometric shapes with no ambiguity in the interest point locations. The resulting trained detector (“Base Detector” in (a)) is called MagicPoint. Although MagicPoint performs surprisingly well on real images, it misses many potential interest point locations.

To close the performance gap on real images, the authors develop a multi-scale, multi-transform technique called Homographic Adaptation, designed to enable self-supervised training of interest point detectors. It warps the input image multiple times so that the detector sees the scene from many different viewpoints and scales. Homographic Adaptation is used in conjunction with the MagicPoint detector to boost performance and generate the pseudo-ground truth interest points (as shown in (b)). The resulting detector, called SuperPoint, is more repeatable and fires on a larger set of stimuli.

Once repeatable interest points are detected, the usual next step is to attach a fixed-dimensional descriptor vector to each point for higher-level tasks such as image matching. To that end, the interest point network is combined with an additional subnetwork that computes interest point descriptors.

The authors claim that the proposed method is the only one that computes both interest points and descriptors in a single network in real time. The comparison chart is pictured below.

SuperPoint Architecture


Shared Encoder

  • Uses VGG-style encoder
  • Encoder uses three max-pooling layers, each halving the spatial resolution
    • Hc = H / 8 and Wc = W / 8 for an image sized H x W

Interest Point Decoder

  • To avoid the high computational cost of upsampling layers, the interest point detection head is designed with an explicit decoder.
  • The interest point head computes a coarse tensor X and outputs a full-resolution tensor:
X \in R^{H_c \times W_c \times 65} \to R^{H \times W}
  • The 65 channels correspond to local, non-overlapping 8 x 8 grid regions of pixels plus an extra “no interest point” dustbin.
  • After a channel-wise softmax, the dustbin dimension is removed and the remaining 64 channels are reshaped:
R^{H_c \times W_c \times 64} \to R^{H \times W}
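
The softmax, dustbin removal and reshape described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `decode_interest_points`, the position of the dustbin as the last channel, and the row-major ordering of the 64 pixel positions inside each cell are all assumptions.

```python
import numpy as np

def decode_interest_points(logits, cell=8):
    """Turn (Hc, Wc, cell*cell + 1) raw scores into an (Hc*cell, Wc*cell) heatmap."""
    hc, wc, channels = logits.shape
    assert channels == cell * cell + 1
    # Channel-wise softmax over the 64 pixel positions plus the dustbin.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Drop the dustbin channel (assumed here to be the last one).
    probs = probs[..., :-1]
    # Depth-to-space: channel k is assumed to map to pixel (k // cell, k % cell)
    # inside its cell.
    probs = probs.reshape(hc, wc, cell, cell)
    probs = probs.transpose(0, 2, 1, 3)  # (Hc, cell, Wc, cell)
    return probs.reshape(hc * cell, wc * cell)
```

Because the reshape has no learned parameters, the only cost of going from the coarse grid to full resolution is a memory rearrangement. A cell whose dustbin score dominates ends up with near-zero probability on all 64 of its pixels.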

Descriptor Decoder

  • The descriptor head computes a coarse tensor and outputs a dense descriptor map:
R^{H_c \times W_c \times D} \to R^{H \times W \times D}
  • To output a dense map of L2-normalized, fixed-length descriptors, it uses a model similar to UCN.
  • The network first outputs a semi-dense grid of descriptors (one every 8 pixels), which reduces training memory and computation.
  • The decoder then performs bicubic interpolation of the descriptors and L2-normalizes the activations to unit length.
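
As a rough sketch of the interpolate-then-normalize step, the snippet below samples the coarse descriptor grid at given pixel locations and L2-normalizes the result. Note the swap: it uses bilinear interpolation to stay short, whereas the text above describes bicubic interpolation. The function name `sample_descriptors`, the (y, x) keypoint order and the cell-center alignment are assumptions made for illustration.

```python
import numpy as np

def sample_descriptors(coarse, keypoints, cell=8):
    """Bilinearly sample (Hc, Wc, D) descriptors at (N, 2) pixel keypoints (y, x)."""
    hc, wc, _ = coarse.shape
    kp = np.asarray(keypoints, dtype=float)
    # Assumed alignment: the descriptor of cell (i, j) sits at the cell center.
    cy = np.clip((kp[:, 0] + 0.5) / cell - 0.5, 0, hc - 1)
    cx = np.clip((kp[:, 1] + 0.5) / cell - 0.5, 0, wc - 1)
    y0 = np.floor(cy).astype(int)
    x0 = np.floor(cx).astype(int)
    y1 = np.minimum(y0 + 1, hc - 1)
    x1 = np.minimum(x0 + 1, wc - 1)
    wy = (cy - y0)[:, None]
    wx = (cx - x0)[:, None]
    # Blend the four neighbouring coarse descriptors.
    top = (1 - wx) * coarse[y0, x0] + wx * coarse[y0, x1]
    bottom = (1 - wx) * coarse[y1, x0] + wx * coarse[y1, x1]
    desc = (1 - wy) * top + wy * bottom
    # L2-normalize each descriptor to unit length.
    norms = np.linalg.norm(desc, axis=1, keepdims=True)
    return desc / np.maximum(norms, 1e-12)
```

Sampling only at detected keypoints, rather than building the full H x W x D map, is a design choice of this sketch that keeps memory low. The normalization is the same in both cases.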

Loss Functions

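  • The final loss is the sum of two intermediate losses:
    • L_p: Loss for the interest point detector
    • L_d: Loss for the descriptor
  • Training uses pairs of synthetically warped images which have both (a) pseudo-ground truth interest point locations and (b) the ground truth correspondence from a randomly generated homography, as shown in figure (c).
  • \lambda balances the two terms of the final loss, where X and D are the detector and descriptor outputs, Y the interest point labels, S the homography-induced correspondences, and primes denote the warped image:
L(X, X', D, D'; Y, Y', S) = L_p(X, Y) + L_p(X', Y') + \lambda L_d(D, D', S)
  • L_p is a fully convolutional cross-entropy loss over the cells x_{hw} \in X.
  • The descriptor loss L_d is applied to all pairs of descriptor cells, one from each image.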
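
To make the detector term concrete, here is a minimal NumPy sketch of the per-cell cross-entropy and of the \lambda-weighted sum. The helper names (`keypoints_to_labels`, `detector_loss`, `total_loss`) are hypothetical, and so are the label encoding (0..63 for a pixel position in row-major order, 64 for the dustbin) and the tie-breaking rule when several keypoints share a cell. The descriptor loss is taken as a precomputed number because its exact form is not spelled out in these notes.

```python
import numpy as np

def keypoints_to_labels(keypoints, height, width, cell=8):
    """Build the (Hc, Wc) class map Y: 0..63 for a pixel position, 64 for the dustbin."""
    hc, wc = height // cell, width // cell
    labels = np.full((hc, wc), cell * cell, dtype=int)  # default: no interest point
    for y, x in keypoints:
        # If several keypoints fall in one cell, the last one wins in this sketch.
        labels[y // cell, x // cell] = (y % cell) * cell + (x % cell)
    return labels

def detector_loss(logits, labels):
    """Mean softmax cross-entropy over all cells; logits are (Hc, Wc, 65)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return float(-picked.mean())

def total_loss(lp, lp_warped, ld, lam):
    """Combine both detector losses with the lambda-weighted descriptor loss."""
    return lp + lp_warped + lam * ld
```

For example, a single keypoint at pixel (y=11, x=21) in a 16 x 24 image lands in cell (1, 2) with class 3 * 8 + 5 = 29, and every other cell is labeled as dustbin.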

Synthetic Pre-Training

Synthetic Shapes

Because no large database of interest-point-labeled images exists, the authors create a large-scale synthetic dataset called Synthetic Shapes. It consists of simplified 2D geometry produced by synthetic rendering of quadrilaterals, triangles, lines and ellipses.


The MagicPoint detector performs very well on Synthetic Shapes compared with traditional methods, both with and without noise, as shown in Table 2. Across the space of all natural images, however, it underperforms by comparison. This motivated the authors to design a self-supervised approach for training on real-world images, called Homographic Adaptation.

Homographic Adaptation

A process that applies random homographies to warp copies of the input image, runs the detector on each warped copy, and combines the results.


Homographies are used at the core of the self-supervised approach due to the reasons below:

  • Homographies give almost exact image-to-image transformations for camera motion.
  • Since most of the world is reasonably planar, a homography is a good model for how the same 3D point appears from different viewpoints.
  • Homographies do not require 3D information, so they can easily be applied to any 2D image.

\hat{F}(I; f_\theta) = \frac{1}{N_h} \sum_{i=1}^{N_h} H_i^{-1} f_\theta(H_i(I))

where \hat{F} is the SuperPoint detector, f_\theta the initial interest point function, I the input image, x = f_\theta(I) the resulting interest points, H_i a random homography, and N_h the number of homographic warps (a hyperparameter).
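
The averaging over N_h warps can be sketched as below: warp the image with each homography, run the base detector f_\theta, warp the response back, and take the mean. This is a simplified illustration under stated assumptions: `warp` is a hypothetical nearest-neighbour helper, H maps source pixel coordinates (x, y, 1) to target coordinates, the homographies are mild enough that no pixel maps behind the image plane, and pixels that leave the image are simply averaged in as zeros.

```python
import numpy as np

def warp(image, H):
    """Warp a 2D array by homography H using nearest-neighbour inverse mapping."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    target = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    # For each target pixel, look up the source pixel it came from.
    src = np.linalg.inv(H) @ target
    z = np.where(np.abs(src[2]) < 1e-12, 1e-12, src[2])  # guard against division by zero
    sx = np.rint(src[0] / z).astype(int)
    sy = np.rint(src[1] / z).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=float)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w)

def homographic_adaptation(image, detector, homographies):
    """Average the un-warped detector responses over all given homographies."""
    total = np.zeros(image.shape, dtype=float)
    for H in homographies:
        heatmap = detector(warp(image, H))         # f_theta(H(I))
        total += warp(heatmap, np.linalg.inv(H))   # H^{-1} applied to the response
    return total / len(homographies)
```

Here `detector` is any callable that maps an image to a heatmap of the same shape, for instance a trained MagicPoint model wrapped in a function. Passing the identity matrix as one of the homographies keeps the unwarped response in the average.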

Choosing Homographies

  • To sample good homographies, a candidate homography is decomposed into simpler, less expressive transformation classes (such as translation, scale, in-plane rotation and perspective distortion), which are then composed.
  • The added benefit of using more than 100 homographies is minimal.
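
One way to picture the decomposition is to sample each simple transformation separately and multiply the matrices together. The sketch below does this with uniform sampling. The function name `sample_homography`, the choice of transformation classes, the sampling distribution and every range parameter are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sample_homography(rng, size=(240, 320), max_shift=0.1, max_scale=0.2,
                      max_angle=np.pi / 6, max_persp=0.001):
    """Compose simple transforms about the image center into one 3x3 homography."""
    h, w = size
    # Each class is sampled independently within a small, assumed range.
    tx, ty = rng.uniform(-max_shift, max_shift, 2) * (w, h)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    a = rng.uniform(-max_angle, max_angle)
    px, py = rng.uniform(-max_persp, max_persp, 2)

    center = np.array([[1, 0, -w / 2], [0, 1, -h / 2], [0, 0, 1.0]])
    scale = np.diag([s, s, 1.0])
    rotation = np.array([[np.cos(a), -np.sin(a), 0],
                         [np.sin(a), np.cos(a), 0],
                         [0, 0, 1.0]])
    perspective = np.array([[1, 0, 0], [0, 1, 0], [px, py, 1.0]])
    translation = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1.0]])
    uncenter = np.linalg.inv(center)

    # Apply scale and rotation about the center, then perspective, then shift.
    H = uncenter @ translation @ perspective @ rotation @ scale @ center
    return H / H[2, 2]  # normalize so that H[2, 2] == 1
```

A typical call would be `sample_homography(np.random.default_rng(0))`, repeated N_h times to build the list of warps. Bounding each class separately makes it easy to keep the combined warp within a plausible range of camera motions.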

Iterative Homographic Adaptation