====== Image Retrieval ======

This page is used to collect the findings and insights for our new network.

===== Scope =====

For now, the focus is clearly on the model and the quality of the retrieved images, not on scaling the system. Thus, product quantization is out of scope until we find a new model that allows us to decompose a product into its visual concepts.

===== Issue: Model vs. Non-Model Images =====

If the query image is a model image, the retrieved images are also model images, and thus the similarity score is partly driven by the "human features" of the image. To mitigate this, the average "human features" are calculated and subtracted (see the sketch in the Code Sketches section below).

An example is the CVPR 2020 tutorial "Image Retrieval in the Wild":\\
Video: https://www.youtube.com/watch?v=6nLnUAw23u4\\
Slides: https://matsui528.github.io/cvpr2020_tutorial_retrieval/

There are different ways to subtract the model features. Very likely the mean vector should be stored per category, since the silhouette of model images likely depends on the category itself.

Additional references:\\
Website of the author: http://yusukematsui.me/

===== Conditional Similarity =====

https://vision.cornell.edu/se3/wp-content/uploads/2017/04/CSN_CVPR-1.pdf

The rough idea is to learn to disentangle different aspects in the embedding space. A triplet loss defines what should be more similar with respect to a certain aspect (see the masking sketch in the Code Sketches section below). However, triplets require an oracle to build them, and quite often the learning signal vanishes because the triplets are already separated by the margin. Hard negatives might be a solution, but they are not free.

===== Retrieval =====

Average Precision (AP) is usually the metric, but it is not differentiable. Thus, we need proxies (see the Smooth-AP sketch in the Code Sketches section below):\\
https://arxiv.org/abs/2007.12163 (Smooth-AP)\\
https://arxiv.org/abs/2110.01445 (Robust and Decomposable Average Precision for Image Retrieval)

Since those metrics operate on rankings, a ranked list is required for training, which is usually obtained from click-through data. Without this data, manual labels are required.

Survey paper: https://arxiv.org/pdf/2101.11282.pdf

Combining local and global features: https://arxiv.org/pdf/2001.05027.pdf

Training vision transformers for image retrieval: https://arxiv.org/pdf/2102.05644.pdf

Combining text and images: https://publications.idiap.ch/attachments/papers/2021/Liu_ICCV_2021.pdf

Proxy Synthesis: Learning with Synthetic Classes for Deep Metric Learning: https://arxiv.org/abs/2103.15454
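
===== Code Sketches =====

A minimal sketch of the per-category "human feature" subtraction discussed above, assuming the embeddings are L2-normalized vectors and that category labels are available for the model images. All function and variable names here are hypothetical, not from an existing codebase:

<code python>
import numpy as np
from collections import defaultdict

def category_mean_vectors(embeddings, categories):
    """Mean 'human feature' vector per category, estimated from model-image embeddings."""
    buckets = defaultdict(list)
    for emb, cat in zip(embeddings, categories):
        buckets[cat].append(emb)
    return {cat: np.mean(vecs, axis=0) for cat, vecs in buckets.items()}

def debias(embedding, category, means):
    """Subtract the per-category mean and re-normalize before computing similarities."""
    centered = embedding - means[category]
    return centered / (np.linalg.norm(centered) + 1e-12)

# Usage: debias both the query and the index embeddings with the mean of their
# respective categories, then rank by cosine similarity as usual.
</code>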
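
A rough sketch of the conditional-similarity idea: a shared embedding is multiplied by a learned, aspect-specific mask, and a triplet loss is applied per aspect. This is a simplified PyTorch reading of the CSN paper, not its reference implementation; class and function names are made up:

<code python>
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalEmbedding(nn.Module):
    """Shared embedding with learned per-aspect masks (CSN-style sketch)."""
    def __init__(self, backbone, embed_dim, num_aspects):
        super().__init__()
        self.backbone = backbone  # any module mapping images -> (batch, embed_dim)
        self.masks = nn.Parameter(torch.rand(num_aspects, embed_dim))

    def forward(self, images, aspect):
        z = self.backbone(images)            # shared embedding
        m = torch.relu(self.masks[aspect])   # non-negative mask selects the aspect's dimensions
        return F.normalize(z * m, dim=1)     # masked, L2-normalized embedding

def aspect_triplet_loss(model, anchor, positive, negative, aspect, margin=0.2):
    """Triplet loss where similarity is judged only with respect to the chosen aspect."""
    za, zp, zn = (model(x, aspect) for x in (anchor, positive, negative))
    return F.triplet_margin_loss(za, zp, zn, margin=margin)
</code>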
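
For the AP proxy, the Smooth-AP idea is to replace the hard ranking indicator inside AP with a temperature-scaled sigmoid so the objective becomes differentiable. Below is a minimal single-query PyTorch sketch in that spirit, not the paper's official code:

<code python>
import torch

def smooth_ap_loss(scores, labels, temperature=0.01):
    """Differentiable AP approximation for one query (Smooth-AP style).

    scores: (N,) similarities between the query and N gallery items.
    labels: (N,) 1 for relevant items, 0 otherwise (at least one positive assumed).
    """
    pos = labels.bool()
    n = scores.numel()
    # diff[i, j] = s_j - s_i: how strongly item j outranks item i
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    sig = torch.sigmoid(diff / temperature)
    sig = sig * (1.0 - torch.eye(n, device=scores.device))        # ignore j == i
    rank_all = 1.0 + sig.sum(dim=1)                               # soft rank among all items
    rank_pos = 1.0 + (sig * pos.float().unsqueeze(0)).sum(dim=1)  # soft rank among positives
    ap = (rank_pos[pos] / rank_all[pos]).mean()
    return 1.0 - ap  # minimize 1 - AP
</code>

In practice this is computed per query over a mini-batch and averaged; the smaller the temperature, the closer the sigmoid gets to the hard ranking indicator, at the cost of sparser gradients.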