
Visual Codebooks

Git: https://git.picalike.corpex-kunden.de/incubator/image-codebooks

The idea is to learn latent concepts of images via codebooks / latents. Those latents should be sufficient to reconstruct the images, which forces the model to capture the underlying concepts.

Later, these codebooks can be grouped and related to find similar products, or to relate users to users, users to products, and products to products.

Our Approaches

First, we started with the VQ-VAE architecture to learn discretized codebooks that represent images; the quantization step is sketched below.

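To make the idea concrete, here is a minimal sketch of the vector-quantization bottleneck, loosely following the referenced Sonnet and pytorch-vq-vae implementations. The hyperparameters (number of codes, code dimension, beta) are illustrative, not our production settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Discretizes encoder outputs by snapping them to the nearest codebook vector."""

    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment loss

    def forward(self, z_e):
        # z_e: encoder output of shape (batch, height, width, code_dim)
        flat = z_e.reshape(-1, z_e.shape[-1])
        # squared L2 distance from every latent vector to every codebook entry
        distances = (flat.pow(2).sum(1, keepdim=True)
                     - 2 * flat @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(1))
        indices = distances.argmin(dim=1)                  # discrete codes
        z_q = self.codebook(indices).view_as(z_e)          # quantized latents
        # codebook loss + commitment loss (VQ-VAE paper, Eq. 3)
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # straight-through estimator: copy decoder gradients straight to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

The returned indices are the discrete codebook entries we hoped to use as compact image descriptors.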

Second, we switched to the MAE architecture, which feeds a heavily masked image into a transformer-based autoencoder that has to reconstruct the entire unmasked image; the masking step is sketched below.

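As an illustration of the masking step, here is a minimal sketch in the spirit of the MAE paper; the mask ratio and tensor shapes are assumptions, not our exact configuration.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, dim) sequence of embedded image patches
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # one random score per patch
    ids_shuffle = noise.argsort(dim=1)              # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]              # patches the encoder gets to see
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    # binary mask (1 = masked) so the loss is computed only on hidden patches
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle
```

The encoder only processes the visible patches, while the decoder receives mask tokens for the rest and is trained to reconstruct the masked patches.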

Third, we experimented with a standard Variational Autoencoder; its training objective is sketched below.

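A minimal sketch of the VAE objective (reparameterization trick plus ELBO); the encoder and decoder networks and the KL weight are placeholders, not our actual models.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder, kl_weight=1.0):
    # encoder(x) is assumed to return the mean and log-variance of q(z|x)
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)             # reparameterization trick
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x)                     # pixel-wise MSE (prone to blur)
    # KL divergence between q(z|x) and the standard normal prior
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl_weight * kl
```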

Overall, using autoencoders to obtain a fixed-length representation works when reconstruction is the goal, but extracting high-level concepts in an unsupervised manner proved to be very challenging.

Conclusion

Learning image concepts in an unsupervised manner is very hard. Even though there are papers that produce astonishing image reconstructions, the models are usually very large and trained for a very long time. The mean-squared-error objective is ill-suited for our needs because a blurry image is a local minimum. Perceptual losses, on the other hand, can introduce a lot of artifacts. More recently, diffusion models have taken over the generative image modeling scene; they avoid the blurriness problem altogether, but they do not provide latent encodings or concepts of an image.
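For reference, a perceptual loss of the kind mentioned above compares images in the feature space of a frozen, pretrained network instead of pixel space. The choice of VGG16, the layer cut-off and the ImageNet weights in this sketch are assumptions, not the exact setup of the referenced paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    """Compares images via intermediate features of a frozen VGG16."""

    def __init__(self, layer_idx=16):
        super().__init__()
        # feature extractor up to an intermediate conv block, kept frozen
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer_idx].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, x_hat, x):
        # distance in feature space instead of pixel space
        return F.mse_loss(self.features(x_hat), self.features(x))
```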

Why is it so hard? Likely because an image contains a huge amount of information. Even when our models were able to more or less reconstruct the image, the latent vector was still mostly useless for capturing abstract concepts. The model has to capture all details of an image, while we (and our customers) are only interested in a small subset of them. Therefore, without supervision, this kind of training is unlikely to work in the future either.

The successor of this project is v0.1-recommender-system.

References:
https://ml.berkeley.edu/blog/posts/vq-vae/ (VQ-VAE)
https://ml.berkeley.edu/blog/posts/dalle2/ (DALL-E)
https://arxiv.org/abs/2111.12710 (PeCo)
https://arxiv.org/abs/1711.00937 (Neural Discrete Representation Learning)
https://arxiv.org/abs/1606.06582 (Layer-wise reconstructions)
https://arxiv.org/abs/1906.00446 (VQ-VAE-2)
https://arxiv.org/abs/2110.04627 (Vector-quantized Image Modeling with Improved VQGAN)

Perceptual Loss + VAE:
https://arxiv.org/abs/2001.03444 (Image Autoencoder Embeddings with Perceptual Loss)

Code:
https://github.com/deepmind/sonnet/blob/v2/sonnet/src/nets/vqvae.py
https://github.com/zalandoresearch/pytorch-vq-vae
https://colab.research.google.com/github/zalandoresearch/pytorch-vq-vae/blob/master/vq-vae.ipynb#scrollTo=jZFxblfP8EvR
https://github.com/MishaLaskin/vqvae/blob/master/models/quantizer.py
https://arxiv.org/abs/2111.06377 (Masked Auto-Encoders)
https://www.jeremyjordan.me/variational-autoencoders/
https://openreview.net/forum?id=Sy2fzU9gl
https://arxiv.org/pdf/2001.03444.pdf
https://arxiv.org/abs/1512.00570

Zoo:
https://github.com/AntixK/PyTorch-VAE
https://github.com/guspih/Perceptual-Autoencoders