V5 Similarity-API/Solr Performance Tests

solr_feature_search Git (dev branch) under /solr_performace_test:

How to:

<HTML><ol></HTML>

Files:

All results are printed to stdout (print is the best log).
solr_performace_test.py runs over each shop and queries a random product a few times with the configuration defined in the code. The configuration currently used in the live system (“cluster”:-3, “auto_reduct_cluster_until”:-1) performs fine-ish:
name, number of products, average query time (ms)
s24_de_feed 1054684 4414
baur_onmacon_de_feed 804669 3538
spider_customer_de_feed 280305 1652

bonprix_de_crawler 41997 241

takko_de_crawler 3649 47

The recommended configuration (“cluster”:-3, “auto_reduct_cluster_until”:15000) performs better:
(recommended by me. now.) (more on auto_reduct_cluster_until later.)
s24_de_feed 1054684 382
baur_onmacon_de_feed 804669 330
spider_customer_de_feed 280305 145
schwab_de_crawler 253893 191
about_you_de_crawler 218957 161
herrenausstatter_onmacon_de_feed 201719 138
galeriakaufhof_de_crawler 112738 114
otto_de_crawler 98330 125
zalando_de_crawler 80584 143
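
The measurement described for solr_performace_test.py (query a random product a few times, report the average in ms) could be sketched roughly like this. This is not the actual script: `average_query_ms`, `query_fn` and `repeats` are hypothetical names, and drawing a fresh random product for every repetition is an assumption.

```python
import random
import time


def average_query_ms(product_ids, query_fn, repeats=5, rng=None):
    """Average wall-clock time in ms of query_fn over a few random products.

    query_fn is a hypothetical stand-in for whatever issues the similarity
    query for one product id; it is not part of the real code base.
    """
    rng = rng or random.Random()
    timings_ms = []
    for _ in range(repeats):
        # Assumption: a new random product is drawn for every repetition.
        product_id = rng.choice(product_ids)
        start = time.perf_counter()
        query_fn(product_id)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return sum(timings_ms) / len(timings_ms)
```

Calling this once per shop and printing the shop name, the number of products and the returned average gives rows in the same shape as the tables above.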

check_clustering_performace.py checks how accurate a configuration is. The results are also printed, like this:

 Run 66 took 41.593730211257935 seconds
 _avg c0#-1 100.0 % 30491 ms
 _avg c-3#-1 100.0 % 22335 ms (note, on a bigger test this will be around ~99.5%)
 _avg c-3#15000 85.61 % 2396 ms

c-3#-1 is the live configuration, c-3#15000 the recommended configuration, and c0#-1 the brute-force method.
The data here has to be interpreted as follows:
The number after the c is the cluster config, here “-3” and “0”.
After the # comes the auto_reduct_cluster_until config, here -1 and 15000.
After that come the accuracy and the average query time over all products (around 4 million).
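
If the printed result lines need to be processed further, a small parser along these lines might help. It is only a sketch based on the sample output above; `parse_result_line` and the returned key names are made up for illustration and assume the line layout `_avg c<cluster>#<until> <accuracy> % <time> ms`.

```python
import re

# Label format as described above: c<cluster>#<auto_reduct_cluster_until>
LABEL_RE = re.compile(r"^c(-?\d+)#(-?\d+)$")


def parse_result_line(line):
    """Split one '_avg' result line into its config and measurements."""
    tokens = line.split()
    # Expected tokens: _avg, label, accuracy, '%', time, 'ms' (extras ignored)
    if len(tokens) < 6 or tokens[3] != "%" or tokens[5] != "ms":
        raise ValueError("unexpected result line: %r" % line)
    match = LABEL_RE.match(tokens[1])
    if match is None:
        raise ValueError("unexpected config label: %r" % tokens[1])
    return {
        "cluster": int(match.group(1)),
        "auto_reduct_cluster_until": int(match.group(2)),
        "accuracy_percent": float(tokens[2]),
        "avg_query_ms": int(tokens[4]),
    }
```

Anything after the `ms` token, such as the inline note on the c-3#-1 line, is ignored by this sketch.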

Inner workings, cluster and auto_reduct_cluster_until:
This is implemented in the solr_client in picalike_v5.
Each product has 5 GMM clusters (in order). A negative number in the cluster parameter means that only products that overlap with the reference product in the top n clusters are considered. Example with n=3 (cluster=-3):
ref has clusters: [32,1,12,…]
prod a has: [12,55,3,..]
prod a will be considered, because the overlap is >=1.
A positive cluster value means the clusters have to match exactly on the top n. With cluster=1:
prod a will not be considered, because 32!=12.
cluster=0 just means: ignore clusters and consider everything.
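
The three cases of the cluster parameter can be written down as a small filter. This is a minimal sketch of the rule as described here, not the solr_client code: `is_candidate` is a hypothetical name, and reading "match exactly" as a position-by-position comparison of the top n clusters is an assumption.

```python
def is_candidate(ref_clusters, prod_clusters, cluster):
    """Return True if a product passes the cluster filter (sketch only)."""
    if cluster == 0:
        # 0: ignore clusters and consider everything
        return True
    n = abs(cluster)
    ref_top = list(ref_clusters[:n])
    prod_top = list(prod_clusters[:n])
    if cluster < 0:
        # negative: at least one shared cluster within the top n
        return len(set(ref_top) & set(prod_top)) >= 1
    # positive (assumption): top n clusters must be equal, position by position
    return ref_top == prod_top


# Only the first three clusters given in the example above are used here.
ref = [32, 1, 12]
prod_a = [12, 55, 3]
print(is_candidate(ref, prod_a, -3))  # True, cluster 12 is shared
print(is_candidate(ref, prod_a, 1))   # False, 32 != 12
```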

auto_reduct_cluster_until is a number N (it should have been “reduce”, but if you repeat a typo often enough it just stays the way it is). If the cluster is set to a value between -5 and -2, the following happens:

<HTML><ol></HTML>