====== Simtwins ======
GIT: https://git.picalike.corpex-kunden.de/incubator/v5-sim-api
For every product of each feed shop, this service checks whether any of its competitor shops contains a 'very similar' product (above a certain similarity threshold). The result is stored in the simtwins table of the live v5 backend database and made accessible via the v5_sim API.
Each product of a feed is checked against each competitor separately, once per week. For example, if a reference shop has 3 competitors, the product appears in the table at least 3 times per week.
Two questions are answered during the process:
* Is there a twin? A twin is a current product in the competitor shop with the same picalike gender and category whose similarity is above the threshold of 0.95.
* Is there a duplicate? A duplicate is a twin that also shares the picalike brand.
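The two checks above can be sketched as follows. This is only an illustration, not the service's actual code: the ''Product'' record and the ''classify'' function are hypothetical names, and treating "above the threshold" as a strict comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

TWIN_THRESHOLD = 0.95  # threshold named in the text


@dataclass(frozen=True)
class Product:
    # Hypothetical minimal record; the real schema is not shown on this page.
    product_id: str
    gender: str
    category: str
    brand: str


def classify(reference: Product, candidate: Product, similarity: float) -> Tuple[bool, bool]:
    """Return (is_twin, is_duplicate) for one reference/candidate pair."""
    is_twin = (
        similarity > TWIN_THRESHOLD  # assumption: strictly above the threshold
        and reference.gender == candidate.gender
        and reference.category == candidate.category
    )
    # A duplicate is a twin that additionally shares the brand.
    is_duplicate = is_twin and reference.brand == candidate.brand
    return is_twin, is_duplicate
```

In this sketch every duplicate is also a twin, which matches the definition that a duplicate has to fulfil the twin criteria first.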
===== Clustering =====
Similar to the new interest scores, the similarity function uses a new cluster scheme. The update runs automatically for prelive / live after the prelive import has finished.
The responsible script is ''%%sim_hash_update.sh%%'' (v5_extractor), and the table in psql is named ''%%sim_clusters%%'', not to be confused with simtwin_clusters!
//Since a materialized view is usually used, products missing from sim_clusters also mean that no simtwin or interest score calculation is possible.//
===== Modes =====
To implement all requirements, we store different kinds of data:
* pure simtwins which means a sim query with threshold=0.95 and the usual filters (table ''%%simtwins%%'')
* to allow bucketizing, we need a lower threshold, 0.85, and the actual score (table ''%%simtwins_buckets%%'')
* just store the cluster size, with threshold=0.95 (table ''%%simtwins_clusters%%'')
The solution is not very clever and uses three times the space, but it was necessary to deliver the service in time.
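As a rough illustration of how the three kinds of data relate, the sketch below derives all of them from one list of scored candidates for a single reference product and competitor. The function name ''split_modes'' and the input shape are hypothetical, the usual filters are assumed to have been applied already, and interpreting the cluster size as the number of candidates above 0.95 is an assumption.

```python
from typing import List, Tuple

TWIN_THRESHOLD = 0.95    # used for simtwins and simtwins_clusters
BUCKET_THRESHOLD = 0.85  # lower threshold used for simtwins_buckets


def split_modes(scored: List[Tuple[str, float]]):
    """Split (product_id, score) pairs into the three kinds of data."""
    # Pure simtwins: only the ids above the high threshold.
    twins = [pid for pid, score in scored if score > TWIN_THRESHOLD]
    # Buckets: lower threshold, and the actual score is kept.
    buckets = [(pid, score) for pid, score in scored if score > BUCKET_THRESHOLD]
    # Clusters: only the size is stored (assumed to be the twin count).
    cluster_size = len(twins)
    return twins, buckets, cluster_size
```

The sketch also shows where the redundancy comes from: the twins and the cluster size could both be derived from the bucket data, since 0.95 is above 0.85.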
===== Operations =====
The update of the simtwins and simtwins_counts tables is triggered each day via a cronjob on dev02, ''%%update_simtwins.sh%%''. The data is stored in the psql02 live database and finally copied to psql prelive by the script ''%%push_simtwins.py%%''. The simtwins are no longer copied to the report engine, since OSA asks the simtwins API for twin information and uses it later in report engine queries.
===== API =====
There are four endpoints to investigate simtwins data:
* put in a product and a competitor and get the twin product (if one exists) plus the information whether it is a duplicate
* give a reference shop + competitor + optional filters and get statistics on the number of twins/duplicates
* get all distinct product ids of a (competitor) shop that are a twin of any reference shop product → used by the crawler to know which products must be updated more frequently
* return the top-list of products which means those with the largest clusters
The endpoints are part of the v5_sim API and can thus be looked up via http://dev01.picalike.corpex-kunden.de:1113/docs.