====== Simtwins ======

GIT: https://git.picalike.corpex-kunden.de/incubator/v5-sim-api

For every product of each feed shop, this service checks whether any of the shop's competitor shops offers a 'very similar' product, i.e. one above a certain similarity threshold. The results are stored in the simtwins table of the live v5 backend database and made accessible via the v5_sim API.

This means that, for each product of a feed, every competitor is checked separately once per week. For example, if a reference shop has 3 competitors, each of its products appears in the table at least 3 times per week.

Two questions are answered for each product/competitor pair during the process:

<HTML><ol></HTML>
  * Is there a twin? A twin exists if the competitor shop has a current product of the same picalike gender and category whose similarity is above the threshold of 0.95.
  * Is there a duplicate? A duplicate is a twin that additionally shares the picalike brand.<HTML></ol></HTML>
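
The two checks above can be sketched as follows. This is only an illustration, not the service's actual code: the ''Product'' class, its field names and the ''classify'' function are hypothetical, and reading "above" as a strict comparison is an assumption.

```python
from dataclasses import dataclass

# Threshold taken from the description above.
TWIN_THRESHOLD = 0.95


@dataclass(frozen=True)
class Product:
    """Hypothetical, minimal product record (field names are assumptions)."""
    product_id: str
    gender: str
    category: str
    brand: str


def classify(reference: Product, candidate: Product, similarity: float) -> tuple[bool, bool]:
    """Return (is_twin, is_duplicate) for one reference/competitor product pair."""
    # Twin: same picalike gender and category, similarity above the threshold.
    # Assumption: "above" is read as strictly greater than 0.95.
    is_twin = (
        similarity > TWIN_THRESHOLD
        and reference.gender == candidate.gender
        and reference.category == candidate.category
    )
    # Duplicate: a twin that also shares the picalike brand.
    is_duplicate = is_twin and reference.brand == candidate.brand
    return is_twin, is_duplicate
```

With this reading, a duplicate is always also a twin, while a twin from a different brand is not a duplicate.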


===== Clustering =====

Similar to the new interest scores, the similarity function uses a new cluster scheme. The update runs automatically for prelive / live after the prelive import has finished.

The responsible script is ''%%sim_hash_update.sh%%'' (v5_extractor), and the table in psql is named ''%%sim_clusters%%'', not to be confused with ''%%simtwins_clusters%%''!

//Since a materialized view is usually used, products missing from sim_clusters also mean that no simtwin or interest score calculation is possible.//


===== Modes =====

To implement all requirements, we store three different kinds of data:


  * pure simtwins: a sim query with threshold=0.95 and the usual filters (table ''%%simtwins%%'')
  * bucket data: to allow bucketizing, a lower threshold of 0.85 is used and the actual score is stored (table ''%%simtwins_buckets%%'')
  * cluster sizes: only the cluster size is stored, with threshold=0.95 (table ''%%simtwins_clusters%%'')

The solution is not very clever and uses three times the space, but it was necessary to deliver the service in time.


===== Operations =====

The update of the simtwins and simtwins_counts tables is triggered each day via a cronjob on dev02, ''%%update_simtwins.sh%%''. The data is stored in the psql02 live database and finally copied to psql prelive via the script ''%%push_simtwins.py%%''. The simtwins are no longer copied to the report engine, because OSA asks the simtwins API for twin information and uses it later in report engine queries.


===== API =====

There are four endpoints to investigate simtwins data:


  * pass in a product and a competitor and get the twin product (if one exists), plus the information whether it is a duplicate
  * pass in a reference shop, a competitor and optional filters and get statistics on the number of twins/duplicates
  * get all distinct product ids of a (competitor) shop that are a twin of any reference shop product → used by the crawler to know which products must be updated more frequently
  * get the top-list of products, i.e. those with the largest clusters

The endpoints are part of the v5_sim API and can thus be looked up via http://dev01.picalike.corpex-kunden.de:1113/docs.