====== Solr Updater ======

Creates Solr documents ([[v5_solr_schema|meaning of the fields]]) containing metadata, features, trend score and cluster info.\\
More information: [[solr_performace_test|V5 Similarity-API/Solr Performance Tests]]

===== Host =====

http://pci01.picalike.corpex-kunden.de:12345

===== Git =====

https://git.picalike.corpex-kunden.de/picalike/solr_updater

===== Usage =====

The Solr updater receives commands and sends stats to the [[shop_conveyor_belt|shop conveyor belt]] through port 12345. Some other commands can be sent in exceptional cases through the FastAPI interface at http://pci01.picalike.corpex-kunden.de:12345/docs (not yet implemented).

===== Celery tasks =====

==== update_shop ====

This is the main task: it creates a ''%%ShopUpdater%%'' and uses it to update all Solr documents for a given shop.\\
The products are processed in chunks. Every chunk contains the metadata, product trend (for crawlers), history, features and cluster information (for feeds) that are available in the respective collections; fetching them in bulk reduces the number of queries to MongoDB and of requests to ''%%osa_cluster%%''.\\
For every product in the chunk, the ''%%ShopUpdater%%'' tries to create a Solr document; each successfully created document is appended to a batch that is written to Solr at the end of the chunk. This task also sends ''%%started%%'', ''%%failed%%'' and ''%%done%%'' messages to the [[shop_conveyor_belt|Shop Conveyor Belt]]. A rough sketch of this loop can be found under Code sketches below.\\
The ''%%failed%%'' case is decided by a function named ''%%has_failed()%%''. Please refer to the in-code documentation, since this logic may change later (and if you do change it, update the documentation accordingly).

===== Shop Updater =====

An instance of this object is created for each shop: it handles shop processing and stats.

==== get_data_chunk() ====

A fixed-size chunk of products, sorted by ''%%picalike_id%%'', is retrieved from the ''%%metadb%%'', skipping the products that were already processed (a minimal pagination sketch is given under Code sketches below). The ids in the chunk are then used to query the other dbs:

  * features (attributes, the category with the highest score, feature vectors)
  * history (where ''%%session%%'' is not older than ''%%H_DAYS%%''; only the last two are needed)
  * trends (for crawlers only, last session)
  * cluster (for feeds only)

==== process_item() ====

This method attempts to build a Solr document. Each field in the pydantic definition ''%%Solr_doc%%'' is filled either directly with information from the data chunks or by calling a function that prepares that information (a toy example is given under Code sketches below).\\
The two most complex of those functions are described below.

==== get_image_info() ====

This function selects the main image and returns information about it. If only one image is available, that image is returned. If several images are available, only those whose category shares the prefix of the product's picalike category (given by the category mapping) are considered; among those, the image with the highest score in the most frequent category is selected. The returned attributes are all attributes that appear in ANY image: if one image has an attribute, the product has that attribute.

==== get_cluster_info() ====

Cluster info comes from several collections of the ''%%reports%%'' db. If every one of those collections contains info about a given product, that info is stored in the ''%%clusters_chunk%%''; otherwise, a request to ''%%osa_cluster%%'' ensures that the relevant data is calculated, stored in the appropriate collections and returned to ''%%get_cluster_info()%%''.
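===== Code sketches =====

The snippets below are minimal sketches of the steps described above, not the actual implementation. First, the chunked loop of ''%%update_shop%%'': only the method names are taken from this page, all bodies are placeholders.

<code python>
from celery import Celery

app = Celery("solr_updater")


class ShopUpdater:
    """Placeholder: only the method names below are taken from this page."""

    def __init__(self, shop_id: str):
        self.shop_id = shop_id
        self.errors = 0

    def send_status(self, status: str) -> None: ...
    def get_data_chunk(self) -> list: return []
    def process_item(self, item: dict): return None
    def write_to_solr(self, batch: list) -> None: ...
    def has_failed(self) -> bool: return self.errors > 0


@app.task
def update_shop(shop_id: str) -> None:
    updater = ShopUpdater(shop_id)
    updater.send_status("started")            # message to the shop conveyor belt
    while chunk := updater.get_data_chunk():  # fixed-size chunks; empty when done
        batch = []
        for item in chunk:
            doc = updater.process_item(item)  # None if the document cannot be built
            if doc is not None:
                batch.append(doc)
        updater.write_to_solr(batch)          # one Solr write per chunk
    updater.send_status("failed" if updater.has_failed() else "done")
</code>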
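One way to implement the "sorted by ''%%picalike_id%%'', skip what was already processed" retrieval of ''%%get_data_chunk()%%'' is keyset pagination followed by one bulk query per auxiliary collection. A pymongo sketch; db, collection and field names are assumptions:

<code python>
# Keyset pagination by picalike_id plus one bulk query per chunk
# (db, collection and field names are assumptions).
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
CHUNK_SIZE = 500  # hypothetical chunk size


def get_data_chunk(shop_id: str, last_id: str | None) -> tuple[list[dict], dict]:
    query: dict = {"shop_id": shop_id}
    if last_id is not None:
        # Skip already-processed products by continuing after the last id.
        query["picalike_id"] = {"$gt": last_id}
    chunk = list(
        client["metadb"]["products"]
        .find(query)
        .sort("picalike_id", ASCENDING)
        .limit(CHUNK_SIZE)
    )
    # One bulk query per auxiliary collection, keyed by the chunk ids
    # (shown here for features only).
    ids = [p["picalike_id"] for p in chunk]
    features = {
        f["picalike_id"]: f
        for f in client["features"]["features"].find({"picalike_id": {"$in": ids}})
    }
    return chunk, features
</code>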
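The fill-and-validate step of ''%%process_item()%%'' might look roughly like this; the fields are invented for illustration, the real schema lives in ''%%Solr_doc%%'':

<code python>
# Toy version of building a Solr document via pydantic validation;
# the fields shown here are invented, see the real Solr_doc for the schema.
from typing import Optional

from pydantic import BaseModel, ValidationError


class Solr_doc(BaseModel):
    picalike_id: str
    category: str
    attributes: list[str] = []
    trend_score: Optional[float] = None


def process_item(item: dict, features: dict) -> Optional[Solr_doc]:
    try:
        return Solr_doc(
            picalike_id=item["picalike_id"],        # straight from the data chunk
            category=features["category"],          # prepared by a helper in reality
            attributes=features.get("attributes", []),
            trend_score=item.get("trend_score"),
        )
    except (KeyError, ValidationError):
        # Missing or invalid data: no document for this product.
        return None
</code>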
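The selection rule of ''%%get_image_info()%%'' (prefix filter, most frequent category, highest score, union of attributes) in sketch form; the image dict layout and the prefix convention are assumptions:

<code python>
# Sketch of the main-image selection rule; the image dict layout is assumed.
from collections import Counter


def get_image_info(images: list[dict], picalike_category: str) -> dict:
    if len(images) == 1:
        main = images[0]
    else:
        prefix = picalike_category.split("/")[0]  # assumed prefix convention
        candidates = [
            img for img in images if img["category"].split("/")[0] == prefix
        ] or images  # fallback to all images if the filter removes everything (assumption)
        # Most frequent category among the candidates ...
        top_category = Counter(img["category"] for img in candidates).most_common(1)[0][0]
        # ... then the highest-scoring image within that category.
        main = max(
            (img for img in candidates if img["category"] == top_category),
            key=lambda img: img["score"],
        )
    # Attributes are the union over ALL images, not just the main one.
    attributes = sorted({a for img in images for a in img.get("attributes", [])})
    return {"main_image": main["url"], "attributes": attributes}
</code>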
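Finally, the lookup-or-recompute pattern of ''%%get_cluster_info()%%''; the collection names, the ''%%osa_cluster%%'' endpoint and the request payload are assumptions:

<code python>
# Lookup-or-recompute for cluster info (collection names and the
# osa_cluster endpoint/payload are assumptions).
import requests
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reports = client["reports"]
CLUSTER_COLLECTIONS = ["cluster_a", "cluster_b"]        # hypothetical names
OSA_CLUSTER_URL = "http://osa-cluster.example/compute"  # hypothetical endpoint


def get_cluster_info(picalike_id: str) -> dict:
    docs = {
        name: reports[name].find_one({"picalike_id": picalike_id})
        for name in CLUSTER_COLLECTIONS
    }
    if all(docs.values()):
        # Every collection has data for this product: use it directly.
        return docs
    # Otherwise ask osa_cluster to compute, store and return the data.
    resp = requests.post(OSA_CLUSTER_URL, json={"picalike_id": picalike_id}, timeout=30)
    resp.raise_for_status()
    return resp.json()
</code>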