Solr Updater

Creates Solr documents (see "Bedeutung der Felder", i.e. the meaning of the fields) containing metadata, features, trend score, and cluster info.
More information: V5 Similarity-API/Solr Performance Tests

Host

http://pci01.picalike.corpex-kunden.de:12345

Git

https://git.picalike.corpex-kunden.de/picalike/solr_updater

Usage

The Solr updater receives commands from, and sends stats to, the Shop Conveyor Belt through port 12345. In exceptional cases, some other commands can be sent through the FastAPI interface at http://pci01.picalike.corpex-kunden.de:12345/docs (not yet implemented).
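
As a rough illustration only, a command could eventually be sent through the FastAPI interface as below. The endpoint name and payload are pure assumptions, since the routes are not yet implemented:

```python
import requests

BASE_URL = "http://pci01.picalike.corpex-kunden.de:12345"

# Hypothetical endpoint and payload; this sketches the intended
# interaction, not an existing API.
response = requests.post(
    f"{BASE_URL}/update_shop",
    json={"shop_id": "example_shop"},
    timeout=10,
)
response.raise_for_status()
print(response.json())
```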

Celery tasks

update_shop

This is the main task: it creates a ShopUpdater and uses it to update all Solr documents for a given shop.
The products are processed in chunks. Every chunk bundles the metadata, product trend (for crawlers), history, features, and cluster information (for feeds) that are available in the respective collections; chunking reduces the number of queries to MongoDB and of requests to osa_cluster.
For every product in the chunk, the ShopUpdater tries to create a Solr document; every document built successfully is appended to a batch that is written to Solr at the end of the chunk.
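
A minimal sketch of this chunked flow, assuming a hypothetical ShopUpdater interface and broker URL (the real task lives in the repository linked above):

```python
from typing import Any, Optional

from celery import Celery

app = Celery("solr_updater", broker="redis://localhost:6379/0")  # assumed broker


class ShopUpdater:
    """Illustrative stand-in for the real ShopUpdater (described below)."""

    def __init__(self, shop_id: str) -> None:
        self.shop_id = shop_id

    def get_data_chunk(self) -> list[dict[str, Any]]:
        return []  # the real method queries MongoDB for the next chunk

    def process_item(self, product: dict[str, Any]) -> Optional[dict[str, Any]]:
        return product  # the real method builds a full Solr document or fails

    def write_batch(self, batch: list[dict[str, Any]]) -> None:
        pass  # the real method posts the batch to Solr


@app.task(name="update_shop")
def update_shop(shop_id: str) -> None:
    updater = ShopUpdater(shop_id)
    # Process products chunk by chunk; each chunk yields one Solr batch.
    while chunk := updater.get_data_chunk():
        batch = [doc for doc in map(updater.process_item, chunk) if doc is not None]
        updater.write_batch(batch)
```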

This task also sends started, failed and done messages to the Shop Conveyor Belt.
The failed case is decided by a function named has_failed(). Please refer to the in-code documentation, since this logic may change later (and, if you change it, please update the documentation accordingly).
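
A hedged sketch of what such a decision could look like; the actual rule and thresholds are defined by has_failed() in the code and may differ:

```python
def has_failed(total_products: int, written_docs: int,
               min_success_ratio: float = 0.9) -> bool:
    """Illustrative failure rule: too few written documents means failure.

    The 0.9 ratio is an assumption for this sketch, not the real threshold.
    """
    if total_products == 0:
        return True
    return written_docs / total_products < min_success_ratio
```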

Shop Updater

An instance of this object is created for each shop: it handles shop processing and stats.

get_data_chunk()

A fixed-size chunk of products, sorted by picalike_id, is retrieved from the metadb, skipping products that were already processed. The ids in the chunk are then used to query the other databases.
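
A sketch of the chunk query, assuming pymongo and illustrative collection and field names (products, shop_id, picalike_id); the skip-already-processed strategy shown here, filtering on picalike_id, is also an assumption:

```python
from typing import Any, Optional

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
metadb = client["metadb"]


def get_data_chunk(shop_id: str, last_id: Optional[str],
                   chunk_size: int = 500) -> list[dict[str, Any]]:
    """Fetch the next fixed-size chunk of products, sorted by picalike_id.

    Filtering on picalike_id > last_id skips products that were already
    processed in earlier chunks.
    """
    query: dict[str, Any] = {"shop_id": shop_id}
    if last_id is not None:
        query["picalike_id"] = {"$gt": last_id}
    cursor = (
        metadb["products"]
        .find(query)
        .sort("picalike_id", ASCENDING)
        .limit(chunk_size)
    )
    return list(cursor)
```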

process_item()

This method attempts to build a Solr document. Each field in the pydantic definition Solr_doc is filled either directly with information from the data chunks or by calling a function that prepares that information. A sketch of this assembly follows; the two most complex helper functions are then described in the sections below.
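
An illustrative pydantic sketch of how process_item() could assemble a document; Solr_doc exists in the repository, but these particular fields and chunk keys are assumptions:

```python
from typing import Any, Optional

from pydantic import BaseModel, ValidationError


class Solr_doc(BaseModel):
    picalike_id: str
    shop_id: str
    trend_score: Optional[float] = None
    features: list[float] = []
    cluster_id: Optional[str] = None


def process_item(product: dict[str, Any],
                 chunks: dict[str, dict[str, Any]]) -> Optional[Solr_doc]:
    pid = product["picalike_id"]
    try:
        return Solr_doc(
            picalike_id=pid,
            shop_id=product["shop_id"],
            trend_score=chunks["trend"].get(pid),
            features=chunks["features"].get(pid, []),
            cluster_id=chunks["clusters"].get(pid),
        )
    except (KeyError, ValidationError):
        return None  # products that cannot be built are skipped and counted
```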

get_image_info()

This function selects the main image and returns information about it. If only one image is available, it is returned. If several images are available, only those whose category shares the prefix of the product's picalike category (given by the category mapping) are considered. Among those, the image with the highest score within the most frequent category is selected.

The returned attributes are the union of the attributes of ALL images: if any image has an attribute, the product has that attribute.
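
A sketch of this selection rule, assuming image dicts of the form {"category": str, "score": float, "attributes": [...]} and a simple category-prefix convention (both assumptions):

```python
from collections import Counter
from typing import Any


def get_image_info(images: list[dict[str, Any]],
                   picalike_category: str) -> dict[str, Any]:
    if len(images) == 1:
        chosen = images[0]
    else:
        # Keep only images whose category shares the product's category
        # prefix; the prefix convention and the fallback to all images
        # are assumptions for this sketch.
        prefix = picalike_category.split("/")[0]
        candidates = [img for img in images
                      if img["category"].startswith(prefix)] or images
        # Most frequent category among the candidates ...
        top_category, _ = Counter(
            img["category"] for img in candidates
        ).most_common(1)[0]
        # ... and within it, the image with the highest score.
        chosen = max(
            (img for img in candidates if img["category"] == top_category),
            key=lambda img: img["score"],
        )
    # Attributes are the union over ALL images, not just the chosen one.
    attributes = sorted({attr for img in images for attr in img["attributes"]})
    return {"main_image": chosen, "attributes": attributes}
```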

get_cluster_info()

Cluster info comes from several collections in the reports db. If those collections contain info about a given product, it is stored in the clusters_chunk; otherwise, a request to osa_cluster ensures that the relevant data is calculated, stored in the appropriate collections, and returned to get_cluster_info().
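
A hedged sketch of this lookup-then-compute flow; the osa_cluster URL and request format below are placeholders, not the real interface:

```python
from typing import Any, Optional

import requests

OSA_CLUSTER_URL = "http://osa-cluster.example/compute"  # placeholder URL


def get_cluster_info(picalike_id: str,
                     clusters_chunk: dict[str, Any]) -> Optional[dict[str, Any]]:
    info = clusters_chunk.get(picalike_id)
    if info is not None:
        return info  # every reports collection already covered this product
    # Otherwise ask osa_cluster to compute the data; it persists the
    # results in the appropriate collections and returns them.
    response = requests.post(
        OSA_CLUSTER_URL, json={"picalike_id": picalike_id}, timeout=30
    )
    response.raise_for_status()
    return response.json()
```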