AI dataset overview
We use several kinds of datasets. The datasets, as well as the code, experiments and models, live in two environments, which are two users on grapex, the GPU server we have right now:
aipex: matheus@grapex.duckdns.org
grapex: picalike@grapex.duckdns.org
Internal
Solr
One dataset is called 'solr' because of the folder name.
It is located at grapex:/home/picalike/sk_datasets/dataset
It consists of two parts. The data part is stored as memmaps (trainset and testset) with the following shapes:
x_train = (185736, 248, 248, 3)
x_test = (48083, 248, 248)
The metadata part is values.json.
The JSON itself contains a position / offset, a binary version of the features (128 dim) “shape”, a single category 'categories' and a list of attributes 'attributes'. The source of this dataset is unknown as of February 2023. Still, it has almost 200k images and plenty of attribute annotations of reasonable quality.
The features themselves are not used, only the annotations. Where those features come from is also unknown as of February 2023.
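A minimal sketch of how one of the memmap splits could be opened with NumPy, without loading it into RAM. The dtype (uint8) and the file name in the usage comment are assumptions; neither is stated on this page, so check them against the actual files first.

```python
import numpy as np


def open_split(path, shape, dtype=np.uint8):
    """Open one split as a read-only memmap.

    dtype=np.uint8 is an assumption (typical for raw images); a wrong
    dtype or shape will make NumPy fail or return garbage.
    """
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)


# Usage (hypothetical file name, shape taken from this page):
# x_train = open_split("/home/picalike/sk_datasets/dataset/trainset",
#                      (185736, 248, 248, 3))
# first_image = np.asarray(x_train[0])  # only this slice is read from disk
```

Indexing the memmap reads only the requested slice from disk, which is why this format works for a dataset of this size.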
Data from attribute QA service
Please check the QA service first. The QA is manually done, and the results from this QA can be fetched to create a dataset.
Provide an attribute group (OSA attribute names and groups) to the group_labels endpoint (please check here). The returned data is a list of URLs, an attribute, and a 0/1 visible flag: 0 means the attribute is not visible in the image, 1 means it is visible. Important: the QA service only presents an image for QA under a given attribute if the image is already labelled with that attribute. This means there are MANY images that show, for instance, the attribute green but carry no green label, and these are NOT PRESENT when fetching labels for colour. Therefore, the 1/0 flag can be used to train classifiers, but it cannot be assumed that all positive instances are labelled as 1.
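The caveat above can be sketched in code: only images that went through QA give verified positives (flag 1) and verified negatives (flag 0), and an image missing from the response must not be treated as a negative. The record layout used here, a (url, attribute, visible) triple, is an assumption about the group_labels response and should be adapted to the real payload.

```python
from collections import defaultdict


def split_by_flag(records):
    """Group QA records into verified positives and negatives per attribute.

    records: iterable of (url, attribute, visible) triples (assumed layout).
    Images that are absent from the records are unknown, not negative.
    """
    positives = defaultdict(set)
    negatives = defaultdict(set)
    for url, attribute, visible in records:
        target = positives if visible == 1 else negatives
        target[attribute].add(url)
    return positives, negatives


# Usage with made-up records:
# pos, neg = split_by_flag([("http://example.com/a.jpg", "green", 1),
#                           ("http://example.com/b.jpg", "green", 0)])
```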
How can this be used? Any classifier can be built on top of this data. We currently have a few-shot classifier implemented in ai-similarity-model/train/train_fewshot.py
The script contains instructions on how to download the data and train the classifier for each attribute group. Please note this is currently only possible for mutually exclusive attribute groups, that is, groups in which two labels are not possible simultaneously. These classifiers may be used to annotate a much larger amount of data if needed, for any purpose.
External
Deep Fashion
Files: grapex:/home/picalike/sk_datasets/deepfashion
There are different deepfashion datasets. In this case, we're using the Category and Attribute Prediction Benchmark dataset. It contains varied annotations, but we are only using the category and the attribute information. The list of attributes can be found in the file 'list_attr_cloth.txt'. The images are available in a low-resolution and a high-resolution version.
Please note that the other deepfashion datasets may also be valuable.
Quality: there are some all-white images and some with a very low resolution. Furthermore, it is likely that variations are present in the images and that the background is quite complex. In summary, we have tried very similar approaches using the iMaterialist and the Deepfashion datasets. Please check the iMaterialist entry below for more details.
iMaterialist
Downloaded to: aipex:/home/matheus/imaterialist
The annotations are in train.json and validation.json. The labels are listed in 'label_imaterialist.csv'. An image can contain more than one label (attribute):
{ "labelId": ["18", "66", "22"], "imageId": "9896" }
Each image is annotated with an arbitrary number of attributes. This dataset was used in a Kaggle competition to predict attributes. The standard approach is to use independent predictors with sigmoid output (0-1 range) to predict each attribute. However, in our case, we tried to use it for image retrieval (similarity search). The problems were that a classification loss does not organize the embedding space consistently, making the embedded features useless for retrieval. Furthermore, images with a high label overlap share attributes but are not necessarily similar visually. Thus, our attempts to use this dataset for similarity purposes failed. Still, it's a valuable source of data for the future.
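As an illustration of the annotation format, the sketch below turns a labelId list into a multi-hot target vector, which is the kind of target independent sigmoid predictors are trained against. The Jaccard helper is one possible way to quantify "label overlap" between two images; which measure was actually used in our experiments is not recorded here, so treat it as an assumption. All function names are hypothetical.

```python
import numpy as np


def build_label_index(annotations):
    """Map every label id seen in the annotations to a column index."""
    labels = sorted({label for ann in annotations for label in ann["labelId"]})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(label_ids, label_index):
    """Encode a list of label ids as a 0/1 vector (one slot per label)."""
    vec = np.zeros(len(label_index), dtype=np.float32)
    for label in label_ids:
        vec[label_index[label]] = 1.0
    return vec


def jaccard(a, b):
    """Label overlap of two images: |intersection| / |union| of label sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Usage with the annotation shown above:
# ann = {"labelId": ["18", "66", "22"], "imageId": "9896"}
# index = build_label_index([ann])
# target = multi_hot(ann["labelId"], index)
```

A high Jaccard value only says that two images share attributes; as described above, it does not imply that they look alike.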
Dataset formats
Bucketized dataset
This format was chosen because it still works with a very large number of images. Keeping a very large number of files in a single folder is generally discouraged, so the filenames are hashed and the first two digits of the hash are used as a “bucket”. The image is saved to a subfolder named after this bucket value.
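A minimal sketch of the bucketing idea. The hash function (MD5 here) and the choice to keep the original filename inside the bucket are assumptions; the template script in the repository may do either differently.

```python
import hashlib
from pathlib import Path


def bucket_path(root, filename):
    """Return root/<bucket>/<filename>, where bucket is the first two
    hex digits of the filename's MD5 digest (MD5 is an assumption)."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = digest[:2]
    return Path(root) / bucket / filename


# Usage (hypothetical names):
# target = bucket_path("my_shop", "product_123.jpg")
# target.parent.mkdir(parents=True, exist_ok=True)
# target.write_bytes(image_bytes)
```

With two hex digits there are at most 256 buckets, so the files are spread over up to 256 subfolders instead of one.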
This has become the de facto standard format for internal Picalike data (especially images from the OSA mongo) since the ai-similarity-model project began. For example, on grapex:/home/matheus/ai_sim_model you'll find folders with shop names. Those folders are in fact bucketized datasets with images.
How to create your bucketized dataset?
There are many ways to create one, because the metadata you want to fetch differs from case to case. Still, there's a template in the ai-similarity-model repository, at scripts_dataset/bucketized_dataset.py. Unfortunately, since the requirements change constantly, there is no “one solution fits all” here. Use this script as a template.
The choice of filenames is important: depending on the chosen filename, it might not be possible to reverse lookup the mongo entry (e.g. with hashed filenames). In this case, a CSV or SQLite database containing these references is advised.
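One possible shape for such a reference database, sketched with Python's built-in sqlite3 module. The table name, the column names and the idea of storing the mongo id as text are all assumptions for illustration, not an existing schema.

```python
import sqlite3


def create_index(db_path):
    """Open (or create) the lookup database with a filename -> mongo id table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_refs ("
        "filename TEXT PRIMARY KEY, mongo_id TEXT NOT NULL)"
    )
    return conn


def add_ref(conn, filename, mongo_id):
    """Record which mongo entry a saved image file belongs to."""
    conn.execute(
        "INSERT OR REPLACE INTO image_refs (filename, mongo_id) VALUES (?, ?)",
        (filename, mongo_id),
    )
    conn.commit()


def lookup(conn, filename):
    """Return the mongo id for a filename, or None if it is unknown."""
    row = conn.execute(
        "SELECT mongo_id FROM image_refs WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None


# Usage (hypothetical values):
# conn = create_index("my_shop/refs.sqlite")
# add_ref(conn, "3f2a9c.jpg", "some-mongo-id")
# lookup(conn, "3f2a9c.jpg")
```

Writing the reference at the same time the image is saved keeps the dataset and the lookup table in sync.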
mytheresa_de_crawler
This shop has clean images (~130k images) with clean product names and short_teases. For this reason, it has become almost a standard dataset to train image models. It has been used in different contexts, but there's no one-script-fits-all solution since the metadata and the data format always change. What is kept is the bucketized dataset format to save the images.