Simtwins

GIT: https://git.picalike.corpex-kunden.de/incubator/v5-sim-api

For every product of each feed shop, this service checks whether any of the shop's competitor shops has products that are 'very similar' (above a certain similarity threshold). The results are stored in the simtwins table of the live v5 backend database and exposed via the v5_sim API.

That means each product of a feed is checked separately against each competitor once per week. For example, if a reference shop has 3 competitors, each of its products appears in the table at least 3 times per week.

Two questions are answered during the process:

Is there a twin? A twin is a current product in the competitor shop that has the same picalike gender and category and a similarity above the threshold of 0.95.
Is there a duplicate? A duplicate is a twin that also shares the picalike brand.
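
The twin and duplicate rules can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not the service's code: the `Product` record, its field names, and the `classify` function are hypothetical, and treating the 0.95 threshold as inclusive is an assumption.

```python
from dataclasses import dataclass

TWIN_THRESHOLD = 0.95  # similarity threshold named in the text


@dataclass(frozen=True)
class Product:
    # Hypothetical minimal record; the real product schema is not shown here.
    product_id: str
    gender: str    # picalike gender
    category: str  # picalike category
    brand: str     # picalike brand


def classify(reference: Product, candidate: Product, similarity: float) -> str:
    """Return 'duplicate', 'twin', or 'none' for one reference/candidate pair."""
    same_segment = (
        reference.gender == candidate.gender
        and reference.category == candidate.category
    )
    # Assumption: a score exactly at the threshold counts as a match.
    if not same_segment or similarity < TWIN_THRESHOLD:
        return "none"
    # A duplicate fulfils all twin criteria and additionally shares the brand.
    if reference.brand == candidate.brand:
        return "duplicate"
    return "twin"
```

In this sketch every duplicate is also a twin, so a caller that only asks "is there a twin?" would treat both return values as a yes.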

Clustering

Similar to the new interest scores, the similarity function uses a new cluster scheme. The update runs automatically for prelive / live after the prelive import has finished.

The responsible script is sim_hash_update.sh (v5_extractor), and the table in psql is named sim_clusters, not to be confused with simtwin_clusters!

Because a materialized view is usually used, products missing from sim_clusters also mean that no simtwin or interest score calculation is possible for them.

Modes

To cover all requirements, we store different kinds of data:

Pure simtwins: a sim query with threshold=0.95 and the usual filters (table simtwins).
Buckets: to allow bucketizing, a lower threshold of 0.85 is used and the actual score is stored (table simtwins_buckets).
Cluster sizes: only the cluster size is stored, with threshold=0.95 (table simtwins_clusters).

The solution is not very clever and uses three times the space, but it was necessary to deliver the service on time.
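
To make the difference between the three kinds of data concrete, here is a minimal sketch that derives all of them from one list of scored candidates. The `derive_modes` function and its input format are hypothetical, the real service runs separate sim queries, and inclusive thresholds are an assumption.

```python
from typing import Dict, List, Tuple

TWIN_THRESHOLD = 0.95    # used for simtwins and simtwins_clusters
BUCKET_THRESHOLD = 0.85  # lower threshold used for simtwins_buckets

Scored = Tuple[str, float]  # (candidate product id, similarity score)


def derive_modes(scored: List[Scored]) -> Dict[str, object]:
    """Derive the three kinds of stored data from one scored candidate list."""
    # Pure simtwins: only the ids above the strict threshold.
    twins = [pid for pid, score in scored if score >= TWIN_THRESHOLD]
    # Buckets: lower threshold, and the actual score is kept for bucketizing.
    buckets = [(pid, score) for pid, score in scored if score >= BUCKET_THRESHOLD]
    return {
        "simtwins": twins,
        "simtwins_buckets": buckets,
        # Cluster size: just the number of products above the strict threshold.
        "simtwins_clusters": len(twins),
    }
```

The sketch also shows where the threefold space usage comes from: the same candidates end up, in different shapes, in each of the three results.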

Operations

The update of the simtwins and simtwins_counts tables is triggered each day via a cronjob on dev02 (update_simtwins.sh). The data is stored in the psql02 live database and finally copied to psql prelive by the script push_simtwins.py. The simtwins are no longer copied to the report engine, because OSA asks the simtwins API for twin information and uses it later in report engine queries.

API

There are four endpoints to investigate simtwins data:

Pass in a product and a competitor and get the twin product (if one exists), plus whether it is a duplicate.
Pass in a reference shop, a competitor, and optional filters and get statistics on the number of twins/duplicates.
Get all distinct product IDs of a (competitor) shop that are a twin of any reference shop product. The crawler uses this to know which products must be updated more frequently.
Get the top list of products, i.e. those with the largest clusters.

The endpoints are part of the v5_sim API and can thus be looked up via http://dev01.picalike.corpex-kunden.de:1113/docs.
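
As a rough illustration of what the statistics endpoint and the crawler endpoint return, the following sketch computes the same kind of result over in-memory rows. It does not call the real API: the `TwinRow` layout and both function names are hypothetical, and counting duplicates as a subset of twins is an assumption based on the definitions above.

```python
from typing import Dict, List, NamedTuple, Set


class TwinRow(NamedTuple):
    # Hypothetical row layout; the real simtwins columns are not listed here.
    reference_shop: str
    reference_product: str
    competitor_shop: str
    twin_product: str
    is_duplicate: bool


def twin_stats(rows: List[TwinRow], reference_shop: str, competitor_shop: str) -> Dict[str, int]:
    """Count twins and duplicates for one reference shop / competitor pair."""
    matching = [
        r for r in rows
        if r.reference_shop == reference_shop and r.competitor_shop == competitor_shop
    ]
    # Assumption: every duplicate is also counted as a twin.
    return {
        "twins": len(matching),
        "duplicates": sum(1 for r in matching if r.is_duplicate),
    }


def twin_product_ids(rows: List[TwinRow], competitor_shop: str) -> Set[str]:
    """Distinct competitor product ids that are a twin of any reference product."""
    return {r.twin_product for r in rows if r.competitor_shop == competitor_shop}
```

The actual request paths and parameters should be taken from the /docs page linked above.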
