V4 Cookbooks

After the incident in which a customer got almost no v4 features, an in-depth analysis revealed that a lot of hidden knowledge exists. The aim of these recipes is to share this knowledge.

Old Image Cloud

Although a cache entry should always be non-zero in size, plenty of empty slots existed. These slots prevent correct v4 feature extraction. The condition can be detected with these steps on cloud01:

cd /home/picalike/var/cache

find ./ -type f -size 0 -print

If there are any matches, they should be removed:

find ./ -type f -size 0 -delete -print

Another symptom is errors like the following in the logs:

ERROR [image_cloud] {cloud01.picalike.corpex-kunden.de} download for 'https://a.cdnsh.de/i/sheego/O_8822312_94?w=512' failed: cannot identify image file <StringIO.StringIO instance at 0x7fef95a61b48>

This happens whenever a file is not a valid image, which is true for the empty string but also for HTTP responses that carry no image data, such as a text/plain 'File not found' body.
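
Zero-size files are not the only poisoned entries: a cached error page like the 'File not found' response above passes the size check. A minimal sketch for spotting such slots, assuming the standard 'file' utility is available on cloud01:

cd /home/picalike/var/cache
# List cached files whose detected MIME type is not image/*; these are
# candidates for removal (e.g. cached text/plain error responses).
find ./ -type f -exec file --mime-type {} + | grep -v ': image/'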

Note that the cache only holds https images; http content is cached by polipo instead. However, the trend is that customers move to https-only.

Feature Re-Extraction

If a customer is in an inconsistent state, it might be necessary to kick off the feature extraction again. Stale cache slots may also need to be removed first, using the recipe above. The toolchain lives on

index-prelive: /home/picalike/bin/refresh_customer_live.sh $URL_FILE

The input to the script is a list of image URLs, one URL per line. The script extracts features for shape, color, and labels, and stores them in the feature DB.
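
As a minimal sketch of a refresh run (the file name is made up, and the URL is borrowed from the log example above):

# On index-prelive: one image URL per line, then hand the file to the script.
cat > /tmp/sheego_urls.txt <<'EOF'
https://a.cdnsh.de/i/sheego/O_8822312_94?w=512
EOF
/home/picalike/bin/refresh_customer_live.sh /tmp/sheego_urls.txt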

After this step has completed successfully, a re-export to the lilly should no longer show “no v4 features” for this customer.

This recipe can also be used when all features of a customer are faulty; in that case, the features should be deleted first. For this purpose, there is a script at the same place: /home/picalike/bin/delete_features

The syntax is as follows:

./delete_features ${FEATDB_HOST} "shapenet.out.5" "$1"

Here, FEATDB_HOST is currently 'http://cloud01.picalike.corpex-kunden.de:5001', the second parameter is the network spec, and the last is again a list of URLs. The features must be deleted for every network spec in use; see refresh_customer_xx.sh for a complete example.
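
Since every network spec needs its own delete call, a small loop helps. Only shapenet.out.5 is confirmed here; the full spec list is an assumption, so take the authoritative one from refresh_customer_xx.sh:

FEATDB_HOST='http://cloud01.picalike.corpex-kunden.de:5001'
URL_FILE=/tmp/sheego_urls.txt   # same hypothetical URL list as above
# Extend the spec list with the other networks used in production.
for SPEC in shapenet.out.5; do
    /home/picalike/bin/delete_features "$FEATDB_HOST" "$SPEC" "$URL_FILE"
done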

It is also possible that faulty features are already stored in the feature database; in that case, the features must be deleted first and then re-extracted. This can be done with the refresh_customer_with_delete.sh script, which first deletes all features given by the URL list.
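
Assuming refresh_customer_with_delete.sh takes the same single URL-file argument as refresh_customer_live.sh (an assumption, not verified here), the combined run is just:

# Delete-then-refresh in one step; the URL file name is again hypothetical.
/home/picalike/bin/refresh_customer_with_delete.sh /tmp/sheego_urls.txt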

After refreshing the customer, change to SG02 and re-export the shop:

python /home/picalike/.local/bin/lilly_data_from_mongo /mnt/storage/var/etc/v3/lilly_data_from_mongo.json <UID>

python /mnt/storage/var/live/lilly_export/bin/force_sync.py /mnt/storage/var/etc/v3/lilly_data_from_mongo.json <UID>
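
As a sketch, both steps with a concrete placeholder UID (the value is made up; bash reserves the name UID, hence SHOP_UID):

# On SG02: re-export one shop end-to-end.
SHOP_UID=1234   # hypothetical, replace with the real customer UID
python /home/picalike/.local/bin/lilly_data_from_mongo /mnt/storage/var/etc/v3/lilly_data_from_mongo.json "$SHOP_UID"
python /mnt/storage/var/live/lilly_export/bin/force_sync.py /mnt/storage/var/etc/v3/lilly_data_from_mongo.json "$SHOP_UID"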

Central Log

The v3/v4 hybrid uses central logging: all components on all hosts send their log output to a single instance. The log 'materializes' on index03 at /mnt/storage/var/log/central.log. It contains information about the feed update, import, enrichment, export, and intermediate steps, so in case of a v3-related problem, it is always a good idea to start here.
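
To narrow the log down to one customer, plain grep is enough. A minimal sketch, assuming log lines start with an ISO date (check the actual format first) and using the sheego domain from the example above:

# On index03: today's central-log entries for one customer domain.
grep "$(date +%Y-%m-%d)" /mnt/storage/var/log/central.log | grep -i sheego | tail -n 50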

Tracking of Missing Features

There is a script 'check_no_v4.py' in ~/bin on index03 that accumulates the number of missing features per domain. Domains do not map exactly to customers, but this still gives a good overview of what is going on. There is also an option to generate a URL list as required by the refresh script:

python3 ./check_no_v4.py --output /tmp/v4_urls --url-list "ateliergs|witt|hirmer"  < /mnt/storage/var/log/central.log

The script only scans log entries from today; whenever a URL matches any of the given strings, it is written to /tmp/v4_urls. The code can be found as a git snippet: https://git.picalike.corpex-kunden.de/snippets/5
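
One way to close the loop, assuming SSH access from index03 to index-prelive under that host alias, is to feed the generated list straight into the re-extraction recipe above:

# Copy the URL list over and kick off the refresh on index-prelive.
scp /tmp/v4_urls index-prelive:/tmp/v4_urls
ssh index-prelive /home/picalike/bin/refresh_customer_live.sh /tmp/v4_urls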