Image Retrieval

This page collects the findings and insights for our new network.

Scope

For now, the focus is on the model and the quality of the retrieved images, not on scaling the system. Product quantization is therefore out of scope until we find a new model that allows us to decompose a product into its visual concepts.

Issue: Model vs. Non-Model Images

If the query image is a model image, the retrieved images are also model images, and thus the similarity score is partly driven by the "human features" of the image. To mitigate this, the average "human features" are calculated and subtracted.

An example is the CVPR 2020 tutorial "Image Retrieval in the Wild":
Video: https://www.youtube.com/watch?v=6nLnUAw23u4
Slides: https://matsui528.github.io/cvpr2020_tutorial_retrieval/

There are different ways to subtract the model features, though. Most likely the mean vector should be stored per category, since the silhouette of model images likely depends on the category itself.
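As a minimal sketch of this per-category correction (the array layout, the helper names, and the plain mean-subtraction-plus-renormalization step are assumptions for illustration, not a fixed design):

```python
import numpy as np

def build_category_means(embeddings, categories):
    """Mean embedding of the model images per category.

    embeddings: (N, D) array of model-image embeddings
    categories: length-N list of category labels
    """
    means = {}
    for cat in set(categories):
        mask = np.array([c == cat for c in categories])
        means[cat] = embeddings[mask].mean(axis=0)
    return means

def remove_model_bias(embedding, category, category_means):
    """Subtract the per-category mean 'human features' vector and re-normalize."""
    adjusted = embedding - category_means[category]
    return adjusted / (np.linalg.norm(adjusted) + 1e-12)
```

The same correction would presumably be applied to both the query and the index embeddings before computing cosine similarity.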

Additional references:
Author's website: http://yusukematsui.me/

Conditional Similarity

https://vision.cornell.edu/se3/wp-content/uploads/2017/04/CSN_CVPR-1.pdf

The rough idea is to learn to disentangle different aspects in the embedding space. A triplet loss is used to define what should be more similar with respect to a certain aspect.

However, triplets require an oracle to construct them, and quite often the learning signal vanishes because the triplets are already separated by the margin. Hard negative mining might be a solution, but it is not free.
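A rough sketch of such an aspect-conditioned triplet loss (PyTorch-style; the element-wise mask and the margin value are illustrative assumptions rather than the exact setup of the paper):

```python
import torch
import torch.nn.functional as F

def conditional_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    """Triplet loss restricted to one aspect of the embedding.

    anchor, positive, negative: (B, D) embeddings
    mask: (D,) non-negative weights selecting the dimensions that
          encode the aspect (e.g. color, category, closure type)
    """
    # Project all embeddings onto the aspect-specific subspace.
    a, p, n = anchor * mask, positive * mask, negative * mask
    d_ap = F.pairwise_distance(a, p)
    d_an = F.pairwise_distance(a, n)
    # The loss is zero once the negative is farther away than the positive
    # by the margin, which is exactly where the learning signal vanishes.
    return F.relu(d_ap - d_an + margin).mean()
```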

Retrieval

AP (Average Precision) is usually the metric of choice, but it is not differentiable. Thus, we need differentiable proxies:
https://arxiv.org/abs/2007.12163 (SmoothAP)
https://arxiv.org/abs/2110.01445 (Robust and Decomposable Average Precision for Image Retrieval)

Since these proxies operate on rankings, training requires ranked lists, which are usually obtained from click-through data. Without such data, manual labels are required.
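As a hedged sketch of the SmoothAP-style relaxation for a single query with binary relevance labels (the exact normalization, batching, and temperature in the paper differ; the function below only shows how the step function in the rank is replaced by a sigmoid):

```python
import torch

def smooth_ap_loss(scores, labels, tau=0.01):
    """Differentiable Average Precision proxy in the spirit of SmoothAP.

    scores: (N,) similarity scores between the query and N gallery items
    labels: (N,) binary tensor, 1 for relevant items
    tau:    temperature of the sigmoid replacing the Heaviside step
    """
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)   # diff[j, i] = s_j - s_i
    sig = torch.sigmoid(diff / tau)
    pos = labels.bool()
    # Soft ranks: 1 + sum of sigmoids over items scoring higher than item i.
    # Subtracting the diagonal (sigmoid(0) = 0.5) removes the self-comparison;
    # for rank_pos this is only valid for positive i, which are the only
    # entries used below.
    rank_all = 1.0 + sig.sum(dim=0) - sig.diagonal()
    rank_pos = 1.0 + sig[pos].sum(dim=0) - sig.diagonal()
    ap = (rank_pos[pos] / rank_all[pos]).mean()
    return 1.0 - ap  # minimize (1 - AP)
```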

Survey paper: https://arxiv.org/pdf/2101.11282.pdf

Combining local and global features: https://arxiv.org/pdf/2001.05027.pdf

Training vision transformers for image retrieval: https://arxiv.org/pdf/2102.05644.pdf

Combining text and images: https://publications.idiap.ch/attachments/papers/2021/Liu_ICCV_2021.pdf

Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning: https://arxiv.org/abs/2103.15454