====== Simtwins ======

GIT: https://git.picalike.corpex-kunden.de/incubator/v5-sim-api

This service checks, for every product of each feed shop and for each of its competitor shops, whether products exist that are 'very similar' (above a certain similarity threshold). It stores this information in the simtwins table of the live v5 backend database and makes it accessible via the v5_sim API.

That means each product of a feed is checked separately against each competitor once per week, i.e. if there are 3 competitors for a reference shop, the product appears in the table at least 3 times per week.

Two statements are made during the process:
  * is there a twin? This means a current product exists in the competitor shop with the same picalike gender and category and a similarity above the threshold of 0.95.
  * is there a duplicate? This is the case if the conditions above are met and the products also share the picalike brand.
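
The two statements above can be sketched as a small decision function. This is only an illustration and not the service's actual code: the ''%%Product%%'' record, its field names and the ''%%classify%%'' function are hypothetical, and treating the threshold as a strict "greater than" comparison is an assumption based on the wording "above".

```python
from dataclasses import dataclass

# Threshold stated in the text; strict comparison is an assumption.
TWIN_THRESHOLD = 0.95


@dataclass(frozen=True)
class Product:
    # Hypothetical minimal record; the real schema is not shown in this page.
    product_id: str
    gender: str    # picalike gender
    category: str  # picalike category
    brand: str     # picalike brand


def classify(reference: Product, candidate: Product, similarity: float):
    """Return (is_twin, is_duplicate) for one reference/candidate pair."""
    is_twin = (
        similarity > TWIN_THRESHOLD
        and reference.gender == candidate.gender
        and reference.category == candidate.category
    )
    # A duplicate is a twin that additionally shares the brand.
    is_duplicate = is_twin and reference.brand == candidate.brand
    return is_twin, is_duplicate
```

A duplicate is therefore always a twin, but a twin is only a duplicate if the brands match as well.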
===== Clustering =====

Similar to the new interest scores, the similarity function uses a new cluster scheme. The update runs automatically for prelive / live after the prelive import has finished. The responsible script is ''%%sim_hash_update.sh%%'' (v5_extractor) and the table in psql is named ''%%sim_clusters%%'', not to be confused with simtwin_clusters!

//Since a materialized view is usually used, products missing from sim_clusters also mean that no simtwin or interest score calculation is possible for those products.//

===== Modes =====

To implement all requirements, we store different kinds of data:
  * pure simtwins, which means a sim query with threshold=0.95 and the usual filters (table ''%%simtwins%%'')
  * to allow bucketizing, we need a lower threshold, 0.85, and the actual score (table ''%%simtwins_buckets%%'')
  * just the cluster size, with threshold=0.95 (table ''%%simtwins_clusters%%'')

The solution is not very clever and uses three times the space, but it was necessary to deliver the service in time.

===== Operations =====

The update of the simtwins and simtwins_counts tables is triggered each day via cronjob on dev02, ''%%update_simtwins.sh%%''. The data is stored in the psql02 live database and finally copied to psql prelive via the script ''%%push_simtwins.py%%''. The simtwins are no longer copied to the report engine, as OSA asks the simtwins API for twin information and uses it later on in report engine queries.
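
As a rough illustration of the three kinds of data listed under Modes, the following sketch derives all three from one list of scored candidates. It is not taken from the service: the function name ''%%derive_modes%%'', the input shape (a list of ''%%(product_id, score)%%'' pairs that already passed the usual filters) and the strict comparison against the thresholds are assumptions made for the example.

```python
# Thresholds stated in the Modes section.
TWIN_THRESHOLD = 0.95
BUCKET_THRESHOLD = 0.85


def derive_modes(scored_candidates):
    """Derive the three kinds of stored data from one candidate list.

    scored_candidates: list of (product_id, score) pairs for a single
    reference product and competitor (hypothetical input shape).
    """
    twins = [(pid, s) for pid, s in scored_candidates if s > TWIN_THRESHOLD]
    buckets = [(pid, s) for pid, s in scored_candidates if s > BUCKET_THRESHOLD]
    return {
        # pure simtwins: only the ids above the high threshold
        "simtwins": [pid for pid, _ in twins],
        # bucket data: lower threshold, actual score kept
        "simtwins_buckets": buckets,
        # cluster data: only the size at the high threshold
        "simtwins_clusters": len(twins),
    }
```

The sketch also shows why the three tables overlap: the twin ids and the cluster size could both be derived from the bucket data, which is the redundancy the Modes section mentions.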
===== API =====

There are four endpoints to investigate simtwins data:
  * put in a product and a competitor and get the twin product (if one exists) plus the information whether it is a duplicate
  * give a reference shop + competitor + optional filters and get statistics on the number of twins/duplicates
  * get all distinct product ids of a (competitor) shop that are a twin of any reference shop product → used by the crawler to know which products must be updated more frequently
  * return the top list of products, i.e. those with the largest clusters

The endpoints are part of the v5_sim API and can thus be looked up via http://dev01.picalike.corpex-kunden.de:1113/docs.