Deep Metric and Representation Learning

To understand visual content, computers need to learn what makes images similar. This similarity learning directly implies a representation of the visual content that captures the inherent structure of the data. We present several approaches that can be applied on top of arbitrary deep metric learning methods and various network architectures. Key issues that these works tackle include improving generalization and transfer to novel data, shared feature learning, and adaptive sampling strategies based on reinforcement learning to effectively utilize large amounts of training data.

In the following, we provide a selective overview of our research on visual similarity learning. For a comprehensive list, please visit our publication page.

S2SD: Simultaneous Similarity-based Self-Distillation for Deep Metric Learning
Roth, K, Milbich, T, Ommer, B, Cohen, J.P. and Ghassemi, M
International Conference on Machine Learning (ICML) 2021

Deep Metric Learning (DML) provides a crucial tool for visual similarity and zero-shot retrieval applications by learning generalizing embedding spaces, although recent work in DML has shown strong performance saturation across training objectives. However, generalization capacity is known to scale with the embedding space dimensionality. Unfortunately, high-dimensional embeddings also create higher retrieval cost for downstream applications. To remedy this, we propose S2SD - Simultaneous Similarity-based Self-Distillation. S2SD extends DML with knowledge distillation from auxiliary, high-dimensional embedding and feature spaces to leverage complementary context during training, while retaining test-time cost and adding only negligible training overhead. Experiments and ablations across different objectives and standard benchmarks show that S2SD offers notable improvements of up to 7% in Recall@1, while also setting a new state of the art.
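
To make the distillation idea concrete, here is a minimal, hypothetical PyTorch sketch of similarity-based self-distillation: a low-dimensional test-time embedding is trained to match the batch similarity distribution of an auxiliary high-dimensional head. All names, dimensions, and the temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def similarity_distillation_loss(z_small, z_large, temperature=1.0):
    """KL divergence between row-wise softmaxed cosine-similarity matrices.

    z_small: test-time embedding (e.g. 128-d); z_large: auxiliary
    high-dimensional embedding (e.g. 2048-d) used only during training.
    """
    s_small = F.normalize(z_small, dim=1) @ F.normalize(z_small, dim=1).T
    s_large = F.normalize(z_large, dim=1) @ F.normalize(z_large, dim=1).T
    log_p_small = F.log_softmax(s_small / temperature, dim=1)
    p_large = F.softmax(s_large / temperature, dim=1).detach()  # teacher signal
    return F.kl_div(log_p_small, p_large, reduction="batchmean")

# Toy usage: two heads on shared backbone features.
feats = torch.randn(32, 512)
head_small, head_large = torch.nn.Linear(512, 128), torch.nn.Linear(512, 2048)
loss = similarity_distillation_loss(head_small(feats), head_large(feats))
```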

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

Diverse Visual Feature Aggregation for Deep Metric Learning
Milbich, T, Roth, K, Bharadhwaj, H, Sinha, S, Bengio, Y, Ommer, B and Cohen, J.P.
European Conference on Computer Vision (ECCV) 2020

Visual similarity plays an important role in many computer vision applications. Deep metric learning (DML) is a powerful framework for learning such similarities which not only generalize from training data to identically distributed test distributions, but in particular also translate to unknown test classes. However, its prevailing learning paradigm is class-discriminative supervised training, which typically results in representations specialized in separating training classes. For effective generalization, however, such an image representation needs to capture a diverse range of data characteristics. To this end, we propose and study multiple complementary learning tasks, targeting conceptually different data relationships by only resorting to the available training samples and labels of a standard DML setting. Through simultaneous optimization of our tasks we learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance on multiple established DML benchmark datasets.
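
As a rough illustration of training one embedding with several complementary signals, the sketch below combines a standard discriminative triplet loss with a stand-in auxiliary term. The paper defines its own tasks and weighting; this only shows the aggregation pattern.

```python
import torch
import torch.nn.functional as F

emb_net = torch.nn.Linear(512, 128)          # shared embedding head
triplet = torch.nn.TripletMarginLoss(margin=0.2)

def multi_task_step(anchor, pos, neg, aux_weight=0.5):
    za, zp, zn = (F.normalize(emb_net(x), dim=1) for x in (anchor, pos, neg))
    discriminative = triplet(za, zp, zn)     # separates training classes
    # stand-in auxiliary task: spread anchors to keep non-class structure
    auxiliary = -torch.pdist(za).mean()
    return discriminative + aux_weight * auxiliary

loss = multi_task_step(*(torch.randn(16, 512) for _ in range(3)))
loss.backward()
```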

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

Sharing Matters for Generalization in Deep Metric Learning
Milbich, T, Roth, K, Brattoli, B and Ommer, B
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2020

Learning the similarity between images constitutes the foundation for numerous vision tasks. The common paradigm is discriminative metric learning, which seeks an embedding that separates different training classes. However, the main challenge is to learn a metric that not only generalizes from training samples to novel, but related, test samples, but also transfers to different object classes. So what complementary information is missed by the discriminative paradigm? Besides finding characteristics that separate classes, we also need characteristics that are likely to occur in novel categories, which is indicated if they are shared across training classes. This work investigates how to learn such characteristics without the need for extra annotations or training data. By formulating our approach as a novel triplet sampling strategy, it can be easily applied on top of recent ranking loss frameworks. Experiments show that, independent of the underlying network architecture and the specific ranking loss, our approach significantly improves performance in deep metric learning, leading to new state-of-the-art results on various standard benchmark datasets.
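
The sampling idea can be pictured roughly as follows. This hypothetical snippet merely biases triplet construction toward cross-class nearest neighbours as carriers of shared characteristics; it does not reproduce the paper's exact strategy.

```python
import torch

def cross_class_nearest(emb, labels):
    """For every anchor, find the closest sample with a *different* label;
    such pairs are candidates for characteristics shared across classes."""
    d = torch.cdist(emb, emb)
    same = labels[:, None] == labels[None, :]
    d = d.masked_fill(same, float("inf"))    # ignore same-class samples
    return d.argmin(dim=1)                   # nearest other-class sample index

emb = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
labels = torch.randint(0, 8, (64,))
shared_candidates = cross_class_nearest(emb, labels)
```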

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

Revisiting Training Strategies and Generalization Performance in Deep Metric Learning
Roth, K, Milbich, T, Sinha, S, Gupta, P, Ommer, B and Cohen, J.P.
International Conference on Machine Learning (ICML) 2020

Deep Metric Learning (DML) is arguably one of the most influential lines of research for learning visual similarities, with many new approaches proposed every year. Although the field benefits from this rapid progress, the divergence in training protocols, architectures, and parameter choices makes an unbiased comparison difficult. To provide a consistent reference point, we revisit the most widely used DML objective functions and conduct a study of the crucial parameter choices as well as the commonly neglected mini-batch sampling process. Under consistent comparison, DML objectives show much higher saturation than indicated by the literature. Further, based on our analysis, we uncover a correlation between the density and compression of the embedding space and the generalization performance of DML models. Exploiting these insights, we propose a simple, yet effective, training regularization to reliably boost the performance of ranking-based DML models on various standard benchmark datasets.
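
A density proxy can be computed, for instance, as the ratio of mean intra-class to mean inter-class embedding distance. The sketch below is an assumption-laden simplification for illustration, not the paper's exact measure.

```python
import torch

def density_proxy(emb, labels):
    """Mean intra-class over mean inter-class distance; smaller values
    indicate more strongly compressed classes."""
    d = torch.cdist(emb, emb)
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(emb), dtype=torch.bool)
    intra = d[same & ~eye].mean()
    inter = d[~same].mean()
    return (intra / inter).item()

emb = torch.nn.functional.normalize(torch.randn(128, 64), dim=1)
labels = torch.randint(0, 10, (128,))
print(density_proxy(emb, labels))
```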

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

Policy-Adapted Sampling for Visual Similarity Learning
Milbich, T, Roth, K and Ommer, B
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Learning visual similarity requires learning relations, typically between triplets of images. Although triplet approaches are powerful, their computational complexity mostly limits training to only a subset of all possible training triplets. Thus, sampling strategies that decide when to use which training sample during learning are crucial. Currently, the prominent paradigm is fixed or curriculum sampling strategies that are predefined before training starts. However, the problem truly calls for a sampling process that adjusts based on the actual state of the similarity representation during training. We therefore employ reinforcement learning and have a teacher network adjust the sampling distribution based on the current state of the learner network, which represents visual similarity. Experiments on benchmark datasets using standard triplet-based losses show that our adaptive sampling strategy significantly outperforms fixed sampling strategies. Moreover, although our adaptive sampling is only applied on top of basic triplet-learning frameworks, we reach results competitive with state-of-the-art approaches that employ diverse additional learning signals or strong ensemble architectures.
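
The teacher-learner loop can be caricatured as a small policy over negative-sampling bins, nudged by a REINFORCE-style gradient computed from the learner's validation reward. Everything below (bins, reward, optimizer) is an illustrative assumption, not the paper's implementation.

```python
import torch

class SamplingPolicy:
    """Distribution over anchor-negative distance bins, adapted by reward."""
    def __init__(self, n_bins=10, lr=0.1):
        self.logits = torch.zeros(n_bins, requires_grad=True)
        self.opt = torch.optim.SGD([self.logits], lr=lr)

    def sample_bin(self):
        probs = torch.softmax(self.logits, dim=0)
        b = torch.multinomial(probs, 1).item()
        return b, torch.log(probs[b])

    def update(self, log_prob, reward, baseline=0.0):
        loss = -(reward - baseline) * log_prob   # REINFORCE update
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()

policy = SamplingPolicy()
b, lp = policy.sample_bin()        # bin to draw the next negatives from
policy.update(lp, reward=0.8)      # e.g. validation Recall@1 as reward
```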

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

Unsupervised Representation Learning by Discovering Reliable Image Relations
Milbich, T, Ghori, O and Ommer, B
Pattern Recognition (PR) 102, 2020

Learning robust representations that allow us to reliably establish relations between images is of paramount importance for virtually all of computer vision. Annotating the quadratic number of pairwise relations between training images is simply not feasible, while unsupervised inference is prone to noise, thus leaving the vast majority of these relations unreliable. To nevertheless find those relations which can be reliably utilized for learning, we follow a divide-and-conquer strategy: we find reliable similarities by extracting compact groups of images and reliable dissimilarities by partitioning these groups into subsets, converting the complicated overall problem into few reliable local subproblems. For each of the subsets we obtain a representation by learning a mapping to a target feature space so that their reliable relations are kept. Transitivity relations between the subsets are then exploited to consolidate the local solutions into a concerted global representation. While iterating between grouping, partitioning, and learning, we can successively use more and more reliable relations which, in turn, improves our image representation. In experiments, our approach shows state-of-the-art performance on unsupervised classification on ImageNet with 46.0% and competes favorably on different transfer learning tasks on PASCAL VOC.
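
A toy way to harvest such reliable relations is to keep only mutual nearest neighbours as trustworthy similarities and discard everything ambiguous. The snippet below illustrates this filtering idea only, under assumed data and parameters.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.randn(300, 32).astype("float32")
nn = NearestNeighbors(n_neighbors=2).fit(X)
_, idx = nn.kneighbors(X)                 # idx[:, 1] = nearest other sample

# keep a pair only if each sample is the other's nearest neighbour
reliable_pairs = [(i, j) for i, j in enumerate(idx[:, 1]) if idx[j, 1] == i]
```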

Representation Learning
Unsupervised Learning
arXiv
Project page
Code

Divide and Conquer the Embedding Space for Metric Learning
Sanakoyeu, A, Tschernezki, V, Büchler, U and Ommer, B
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019

Learning the embedding space, where semantically similar objects are located close together and dissimilar objects far apart, is a cornerstone of many computer vision applications. Existing approaches usually learn a single metric in the embedding space for all available data points, which may have a very complex non-uniform distribution with different notions of similarity between objects, e.g. appearance, shape, color or semantic meaning. Approaches for learning a single distance metric often struggle to encode all different types of relationships and do not generalize well. In this work, we propose a novel easy-to-implement divide and conquer approach for deep metric learning, which significantly improves the state-of-the-art performance of metric learning. Our approach utilizes the embedding space more efficiently by jointly splitting the embedding space and data into K smaller sub-problems. It divides both the data and the embedding space into K subsets and learns K separate distance metrics in the non-overlapping subspaces of the embedding space, defined by groups of neurons in the embedding layer of the neural network. The proposed approach increases the convergence speed and improves generalization since the complexity of each sub-problem is reduced compared to the original one. We show that our approach outperforms the state-of-the-art by a large margin in retrieval, clustering and re-identification tasks on the CUB200-2011, CARS196, Stanford Online Products, In-shop Clothes and PKU VehicleID datasets.
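
In pseudocode terms, the recipe is: cluster the data into K groups and give each group its own non-overlapping slice of the embedding dimensions. The sketch below uses K-means and equal-width slices as illustrative choices.

```python
import torch
from sklearn.cluster import KMeans

K, dim = 4, 128
emb = torch.nn.functional.normalize(torch.randn(1000, dim), dim=1)
clusters = KMeans(n_clusters=K, n_init=10).fit_predict(emb.numpy())

def subspace(batch_emb, k):
    """Slice of the embedding assigned to learner k (non-overlapping)."""
    lo, hi = k * dim // K, (k + 1) * dim // K
    return batch_emb[:, lo:hi]

# learner k applies its metric loss only to subspace(emb, k) on cluster-k data
mask = torch.as_tensor(clusters == 0)
sub0 = subspace(emb[mask], 0)
```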

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

MIC: Mining Interclass Characteristics for Improved Metric Learning
Brattoli, B, Roth, K and Ommer, B
IEEE International Conference on Computer Vision (ICCV) 2019

Metric learning seeks to embed images of objects such that class-defined relations are captured by the embedding space. However, variability in images is not just due to different depicted object classes, but also depends on other latent characteristics such as viewpoint or illumination. In addition to these structured properties, random noise further obstructs the visual relations of interest. The common approach to metric learning is to enforce a representation that is invariant under all factors but the ones of interest. In contrast, we propose to explicitly learn the latent characteristics that are shared by and go across object classes. We can then directly explain away structured visual variability, rather than assuming it to be unknown random noise. We propose a novel surrogate task to learn visual characteristics shared across classes with a separate encoder. This encoder is trained jointly with the encoder for class information by reducing their mutual information. On five standard image retrieval benchmarks the approach significantly improves upon the state-of-the-art.
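
One common mechanism for reducing shared information between two encoders is adversarial training via gradient reversal. The sketch below shows that mechanism in isolation, as an assumed stand-in for the paper's mutual-information objective.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad          # flip gradients flowing into the encoder

def shared_info_penalty(z_class, z_char, predictor):
    """Predict the class embedding from the characteristics embedding;
    reversed gradients push the encoders to share less information."""
    pred = predictor(GradReverse.apply(z_char))
    return torch.nn.functional.mse_loss(pred, z_class.detach())

predictor = torch.nn.Linear(64, 128)
penalty = shared_info_penalty(torch.randn(8, 128),
                              torch.randn(8, 64, requires_grad=True), predictor)
penalty.backward()
```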

Visual Similarity Learning
Deep Metric Learning
arXiv
Project page
Code

Deep Unsupervised Learning of Visual Similarities
Sanakoyeu, A, Bautista, M and Ommer, B
Pattern Recognition (PR) 78, 2018

Exemplar learning of visual similarities in an unsupervised manner is a problem of paramount importance to computer vision. In this context, however, the recent breakthrough in deep learning could not yet unfold its full potential. With only a single positive sample, a great imbalance between one positive and many negatives, and unreliable relationships between most samples, training of convolutional neural networks is impaired. In this paper, we use weak estimates of local similarities and propose a single optimization problem to extract batches of samples with mutually consistent relations. Conflicting relations are distributed over different batches and similar samples are grouped into compact groups. Learning visual similarities is then framed as a sequence of categorization tasks. The CNN then consolidates transitivity relations within and between groups and learns a single representation for all samples without the need for labels. The proposed unsupervised approach has shown competitive performance on detailed posture analysis and object classification.
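
The surrogate-classification setup can be pictured with a toy grouping step: form compact groups from nearest neighbours and use the group index as a surrogate label. The paper's batch-assignment optimization is omitted here, and all parameters are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

feats = np.random.randn(200, 64).astype("float32")
_, idx = NearestNeighbors(n_neighbors=5).fit(feats).kneighbors(feats)

surrogate = -np.ones(len(feats), dtype=int)   # -1 = not yet assigned
next_id = 0
for i in np.random.permutation(len(feats)):
    if surrogate[i] == -1:                    # grow a group around sample i
        free = [j for j in idx[i] if surrogate[j] == -1]
        surrogate[free] = next_id
        next_id += 1
# a classifier trained on `surrogate` then stands in for similarity labels
```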

Visual Similarity Learning
Unsupervised Learning
arXiv
Project page
Code

Unsupervised Similarity Learning using Partially Ordered Sets
Bautista, M, Sanakoyeu, A and Ommer, B
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017

Unsupervised learning of visual similarities is of paramount importance to computer vision, particularly due to the lack of training data for fine-grained similarities. Deep learning of similarities is often based on relationships between pairs or triplets of samples. Many of these relations are unreliable and mutually contradictory, implying inconsistencies when trained without supervision information that relates different tuples or triplets to each other. To overcome this problem, we use local estimates of reliable (dis-)similarities to initially group samples into compact surrogate classes and use local partial orders of samples to classes to link classes to each other. Similarity learning is then formulated as a partial ordering task with soft correspondences of all samples to classes. Adopting a strategy of self-supervision, a CNN is trained to optimally represent samples in a mutually consistent manner while updating the classes. The similarity learning and grouping procedure are integrated in a single model and optimized jointly. The proposed unsupervised approach shows competitive performance on detailed pose estimation and object classification.
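
Soft correspondences can be sketched as a temperature-scaled softmax of each sample's similarity to surrogate-class prototypes. Prototype construction and the temperature below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def soft_assignments(emb, prototypes, temperature=0.1):
    """Probability of each sample belonging to each surrogate class."""
    sims = F.normalize(emb, dim=1) @ F.normalize(prototypes, dim=1).T
    return torch.softmax(sims / temperature, dim=1)

assign = soft_assignments(torch.randn(32, 128), torch.randn(10, 128))
```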

Visual Similarity Learning
Unsupervised Learning
arXiv
Project page
Code

Unsupervised Video Understanding by Reconciliation of Posture Similarities
Milbich, T, Bautista, M, Sutter, E and Ommer, B
IEEE International Conference on Computer Vision (ICCV) 2017

Understanding human activity and being able to explain it in detail surpasses mere action classification by far in both complexity and value. The challenge is thus to describe an activity on the basis of its most fundamental constituents, the individual postures and their distinctive transitions. Supervised learning of such a fine-grained representation based on elementary poses is very tedious and does not scale. Therefore, we propose a completely unsupervised deep learning procedure based solely on video sequences, which starts from scratch without requiring pre-trained networks, predefined body models, or keypoints. A combinatorial sequence matching algorithm proposes relations between frames from subsets of the training data, while a CNN reconciles the transitivity conflicts of the different subsets to learn a single concerted pose embedding despite changes in appearance across sequences. Without any manual annotation, the model learns a structured representation of postures and their temporal development. The model not only enables retrieval of similar postures but also temporal super-resolution. Additionally, based on a recurrent formulation, next frames can be synthesized.

Visual Similarity Learning
Unsupervised Learning
arXiv
Project page
Code

CliqueCNN: Deep Unsupervised Exemplar Learning
Bautista, M, Sanakoyeu, A, Sutter, E and Ommer, B
Advances in Neural Information Processing Systems (NeurIPS) 2016

Exemplar learning is a powerful paradigm for discovering visual similarities in an unsupervised manner. In this context, however, the recent breakthrough in deep learning could not yet unfold its full potential. With only a single positive sample, a great imbalance between one positive and many negatives, and unreliable relationships between most samples, training of convolutional neural networks is impaired. Given weak estimates of local distances, we propose a single optimization problem to extract batches of samples with mutually consistent relations. Conflicting relations are distributed over different batches and similar samples are grouped into compact cliques. Learning exemplar similarities is framed as a sequence of clique categorization tasks. The CNN then consolidates transitivity relations within and between cliques and learns a single representation for all samples without the need for labels. The proposed unsupervised approach has shown competitive performance on detailed posture analysis and object classification.

Visual Similarity Learning
Unsupervised Learning
arXiv
Project page
Code