The idea is to further cluster the categories into buckets to avoid wasteful computations for top-k similarity queries when no filter is given. For example, there are roughly 1M dresses, and if big competitors are set, it is possible that 200K distances need to be computed just to return the limit, which is around 12 products.
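To make the saving concrete, here is a minimal sketch of the intended query path, assuming per-product bucket ids have already been precomputed with the scheme described below (the numpy representation and all names here are hypothetical, not the repo's actual API): instead of brute-forcing the whole category, only products in the query's bucket are ranked.

    import numpy as np

    def top_k(query_vec, query_bucket, embeddings, buckets, k=12):
        # embeddings: (N, D) product vectors; buckets: (N,) precomputed bucket ids
        candidates = np.flatnonzero(buckets == query_bucket)  # small fraction of N
        dists = np.linalg.norm(embeddings[candidates] - query_vec, axis=1)
        order = np.argsort(dists)[:k]
        return candidates[order], dists[order]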
Everything is on grapex:
/home/picalike/tdog_sandbox/sim_buckets
The folder is a local git repository, and the only external dependency is the psql01 database, which is used to fetch the initial data (data.py).
A category is described by randomly sampling K products and determining, for each of them, the product that is maximally distant from it. The pairs are then sorted by their distance and NUM_BUCKETS of them are kept. Those 'bases' are used to decide whether a reference product is closer to A (bit 0) or to B (bit 1), which leads to a binary string of NUM_BUCKETS bits. The last, optional, step is to convert the binary string into an integer, combined with the cluster ID (softmax) of the product; see the sketch below.
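A minimal sketch of the scheme, assuming numpy embeddings and Euclidean distance (the metric, the seed handling, and the function names are assumptions for illustration, not the actual implementation):

    import numpy as np

    def build_bases(embeddings, k, num_buckets, seed=0):
        # Sample k products and pair each with its maximally distant product.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(embeddings), size=k, replace=False)
        pairs = []
        for i in idx:
            dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
            j = int(np.argmax(dists))
            pairs.append((float(dists[j]), int(i), j))
        pairs.sort(reverse=True)  # sort the pairs by their distance
        return [(embeddings[i], embeddings[j]) for _, i, j in pairs[:num_buckets]]

    def bucket_id(x, bases, cluster_id=None):
        # One bit per base pair (A, B): 0 if x is closer to A, 1 if closer to B.
        bits = 0
        for a, b in bases:
            bits = (bits << 1) | int(np.linalg.norm(x - b) < np.linalg.norm(x - a))
        # Optional last step: the packed integer is combined with the
        # product's cluster ID (softmax) to form the final bucket key.
        return bits if cluster_id is None else (cluster_id, bits)

Given the 'hamming' keyword below, nearby buckets could presumably also be matched via the Hamming distance between the bit strings instead of requiring an exact match, but that is not spelled out here.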
TODO: there are still very large buckets, probably for popular products, but usually the products are uniformly distributed across the buckets and the quality suffices with respect to the full brute-force approach.
Keywords: category similarity buckets discretization groupings hash hamming