GIT: https://git.picalike.corpex-kunden.de/incubator/v5-sim-api
For every product of each feed shop, this service checks whether any of the shop's competitor shops offer products that are 'very similar' (above a certain similarity threshold). The results are stored in the simtwins table of the live v5 backend database and made accessible via the v5_sim API.
That means each product of a feed is checked against each competitor separately, once per week. For example, if a reference shop has 3 competitors, the product appears in the table at least 3 times per week.
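The per-product, per-competitor check described above can be sketched as follows. This is only an illustration and not the service's actual code: the function name, the row layout, the threshold value and the similarity callable are all assumptions, and it assumes that one row is recorded per check even when no twin is found.

```python
from typing import Callable, Dict, List

# Hypothetical value; the real similarity threshold is not stated on this page.
SIMILARITY_THRESHOLD = 0.9


def weekly_twin_check(
    feed_products: List[str],
    competitors: Dict[str, List[str]],
    similarity: Callable[[str, str], float],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[dict]:
    """Check every feed product against every competitor shop separately.

    Returns one row per (product, competitor) pair, so a shop with
    3 competitors yields 3 rows per product for each weekly run.
    """
    rows = []
    for product in feed_products:
        for shop, candidates in competitors.items():
            # Keep only candidates whose similarity is above the threshold.
            twins = [c for c in candidates if similarity(product, c) > threshold]
            rows.append({"product": product, "competitor": shop, "twins": twins})
    return rows


def toy_similarity(a: str, b: str) -> float:
    """Stand-in similarity for the example: identical ids count as twins."""
    return 1.0 if a == b else 0.0


rows = weekly_twin_check(
    ["a", "b"],
    {"shop1": ["a"], "shop2": ["x"], "shop3": ["a", "b"]},
    toy_similarity,
)
```

With 2 products and 3 competitors the sketch produces 6 rows, 3 per product, which matches the "at least 3 times per week" figure above.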
Two statements are made during the process:
Similar to the new interest scores, the similarity function uses a new cluster scheme. The update runs automatically for prelive / live after the prelive import has finished.
The responsible script is sim_hash_update.sh (v5_extractor), and the table in psql is named sim_clusters (not to be confused with simtwin_clusters!).
Since a materialized view is usually used, products missing from sim_clusters also mean that no simtwin or interest score calculation is possible for them.
To implement all requirements, we keep different kinds of data:
simtwins
simtwins_buckets
simtwins_clusters
The solution is not very clever and uses three times the space, but it was necessary to deliver the service in time.
The update of the simtwins and simtwins_counts tables is triggered each day via a cronjob on dev02 (update_simtwins.sh). The data is stored in the psql02 live database and finally copied to psql prelive by the script push_simtwins.py. The simtwins are no longer copied to the report engine, because OSA asks the simtwins API for twin information and uses it later on in report engine queries.
There are three endpoints to investigate simtwins data:
The endpoints are part of the v5_sim API and can thus be looked up via http://dev01.picalike.corpex-kunden.de:1113/docs.
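If the /docs page is a Swagger UI backed by an OpenAPI schema, the available endpoints can also be listed from a script. The sketch below is an assumption, not something this page confirms: it supposes the schema is served at /openapi.json under the same base URL, and the helper names are made up for illustration.

```python
import json
from urllib.request import urlopen

DOCS_BASE = "http://dev01.picalike.corpex-kunden.de:1113"

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema: dict) -> list:
    """Return (METHOD, path) pairs from an OpenAPI schema dict, sorted by path."""
    pairs = []
    for path, item in schema.get("paths", {}).items():
        for method in item:
            # Path items may also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                pairs.append((method.upper(), path))
    return sorted(pairs, key=lambda p: (p[1], p[0]))


def fetch_schema(base_url: str = DOCS_BASE) -> dict:
    """Download the schema. Assumes it is exposed at /openapi.json."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as resp:
        return json.load(resp)


# Offline example with a made-up schema; the paths are placeholders only.
sample = {"paths": {"/b": {"get": {}, "parameters": []}, "/a": {"post": {}}}}
endpoints = list_endpoints(sample)
```

Calling `list_endpoints(fetch_schema())` from inside the network would then print the real routes, provided the assumption about /openapi.json holds.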