
Visual Codebooks

Git: https://git.picalike.corpex-kunden.de/incubator/image-codebooks

The idea is to learn latent concepts of images via codebooks / latents. Those latents should be sufficient to reconstruct the images, which forces the model to capture the underlying concepts.

Later, these codebooks can be grouped and related to find similar products, or to relate users to users, users to products, and products to products.

Our Approaches

First, we started with the VQ-VAE architecture to learn discretized codebooks that represent images; the quantization step is sketched below.

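To make the idea concrete, here is a minimal sketch of the vector-quantization bottleneck, loosely following the referenced Sonnet and pytorch-vq-vae implementations. The hyperparameters (number of codes, code dimension, beta) are illustrative, not our production settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Discretizes encoder outputs by snapping them to the nearest codebook vector."""

    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment loss

    def forward(self, z_e):
        # z_e: encoder output of shape (batch, height, width, code_dim)
        flat = z_e.reshape(-1, z_e.shape[-1])
        # squared L2 distance from every latent vector to every codebook entry
        distances = (flat.pow(2).sum(1, keepdim=True)
                     - 2 * flat @ self.codebook.weight.t()
                     + self.codebook.weight.pow(2).sum(1))
        indices = distances.argmin(dim=1)                  # discrete codes
        z_q = self.codebook(indices).view_as(z_e)          # quantized latents
        # codebook loss + commitment loss (VQ-VAE paper, Eq. 3)
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # straight-through estimator: copy decoder gradients straight to the encoder
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

The returned indices are the discrete codebook entries we hoped to use as compact image descriptors.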

Second, we switched to the MAE architecture, which feeds a heavily masked image into a transformer-based autoencoder that has to reconstruct the entire unmasked image; the masking step is sketched below.

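As an illustration of the masking step, here is a minimal sketch in the spirit of the MAE paper; the mask ratio and tensor shapes are assumptions, not our exact configuration.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, dim) sequence of embedded image patches
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # one random score per patch
    ids_shuffle = noise.argsort(dim=1)              # random permutation of patches
    ids_keep = ids_shuffle[:, :n_keep]              # patches the encoder gets to see
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    # binary mask (1 = masked) so the loss is computed only on hidden patches
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle
```

The encoder only processes the visible patches, while the decoder receives mask tokens for the rest and is trained to reconstruct the masked patches.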

Third, we experimented with a standard Variational Autoencoder; its training objective is sketched below.

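A minimal sketch of the VAE objective (reparameterization trick plus ELBO); the encoder and decoder networks and the KL weight are placeholders, not our actual models.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder, kl_weight=1.0):
    # encoder(x) is assumed to return the mean and log-variance of q(z|x)
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)             # reparameterization trick
    x_hat = decoder(z)
    recon = F.mse_loss(x_hat, x)                     # pixel-wise MSE (prone to blur)
    # KL divergence between q(z|x) and the standard normal prior
    kl = -0.5 * torch.mean(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl_weight * kl
```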

Overall, using autoencoders to obtain a fixed-length representation works when reconstruction is the goal, but extracting high-level concepts in an unsupervised manner proved to be very challenging.

Conclusion

Learning image concepts in an unsupervised manner is very hard. Even though there are papers that produce astonishing image reconstructions, the models are usually very large and trained for a very long time. The mean-squared-error objective is ill-suited for our needs because a blurry image is a local minimum. Perceptual losses, on the other hand, can introduce a lot of artifacts. More recently, diffusion models have taken over the generative image modeling scene; they avoid the blurriness problem altogether, but they do not provide latent encodings or concepts of an image.
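For reference, a perceptual loss of the kind mentioned above compares images in the feature space of a frozen, pretrained network instead of pixel space. The choice of VGG16, the layer cut-off and the ImageNet weights in this sketch are assumptions, not the exact setup of the referenced paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    """Compares images via intermediate features of a frozen VGG16."""

    def __init__(self, layer_idx=16):
        super().__init__()
        # feature extractor up to an intermediate conv block, kept frozen
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer_idx].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, x_hat, x):
        # distance in feature space instead of pixel space
        return F.mse_loss(self.features(x_hat), self.features(x))
```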

Why is it so hard? Likely because an image contains a huge amount of information. Even when our models were able to more or less reconstruct the image, the latent vector was still mostly useless for capturing abstract concepts. The model has to capture all details of an image, while we (and our customers) are only interested in a small subset of them. Therefore, without supervision, this kind of training is unlikely to work in the future either.

The successor of this project is v0.1-recommender-system.

References:
https://ml.berkeley.edu/blog/posts/vq-vae/ (VQ-VAE)
https://ml.berkeley.edu/blog/posts/dalle2/ (DALL-E)
https://arxiv.org/abs/2111.12710 (PeCo)
https://arxiv.org/abs/1711.00937 (Neural Discrete Representation Learning)
https://arxiv.org/abs/1606.06582 (Layer-wise reconstructions)
https://arxiv.org/abs/1906.00446 (VQ-VAE-2)
https://arxiv.org/abs/2110.04627 (Vector-quantized Image Modeling with Improved VQGAN)

Perceptual Loss + VAE:
https://arxiv.org/abs/2001.03444 (Image Autoencoder Embeddings with Perceptual Loss)

Code:
https://github.com/deepmind/sonnet/blob/v2/sonnet/src/nets/vqvae.py
https://github.com/zalandoresearch/pytorch-vq-vae
https://colab.research.google.com/github/zalandoresearch/pytorch-vq-vae/blob/master/vq-vae.ipynb#scrollTo=jZFxblfP8EvR
https://github.com/MishaLaskin/vqvae/blob/master/models/quantizer.py
https://arxiv.org/abs/2111.06377 (Masked Auto-Encoders)
https://www.jeremyjordan.me/variational-autoencoders/
https://openreview.net/forum?id=Sy2fzU9gl
https://arxiv.org/pdf/2001.03444.pdf
https://arxiv.org/abs/1512.00570

Zoo:
https://github.com/AntixK/PyTorch-VAE
https://github.com/guspih/Perceptual-Autoencoders