V5 Similarity-API/Solr Performance Tests
In the solr_feature_search Git repo (dev branch), under /solr_performace_test:
How to:
- start a Solr instance locally (solr_feature_search → start_local_test.sh)
- start a sim API instance locally (similarity_api → start_local_test.sh)
- copy data from live (use copy_data_from_solr01.py if you want)
- run the tests
Files:
- check_clustering_performace.py
  - checks the accuracy and the speed of a clustering configuration (beware of caching)
- copy_data_from_solr01.py
  - copies data from live to local
- solr_performace_test.py
  - checks the query speed of a clustering configuration per shop
All results are simply printed (best to redirect them to a log).
solr_performace_test.py runs over each shop and queries a random product a few times with the configuration defined in the code. The configuration we use in the live system right now ("cluster": -3, "auto_reduct_cluster_until": -1) performs fine-ish:
name, num products, avg ms query
s24_de_feed 1054684 4414
baur_onmacon_de_feed 804669 3538
spider_customer_de_feed 280305 1652
…
bonprix_de_crawler 41997 241
…
takko_de_crawler 3649 47
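The measurement loop can be sketched roughly like this (a minimal sketch: the function and parameter names are made up, and the real script sends actual Solr queries instead of the `query_fn` placeholder):

```python
import random
import time

# configuration as in the live system (from the text above)
LIVE_CONFIG = {"cluster": -3, "auto_reduct_cluster_until": -1}

def avg_query_ms(query_fn, product_ids, runs=5):
    """Query a few random products of one shop and return the average time in ms.

    query_fn(product_id) stands in for the actual Solr request; in the real
    script the configuration (here LIVE_CONFIG) is sent along with the query.
    """
    timings = []
    for _ in range(runs):
        product_id = random.choice(product_ids)
        start = time.monotonic()
        query_fn(product_id)
        timings.append((time.monotonic() - start) * 1000.0)
    return sum(timings) / len(timings)
```

Per shop, this average is what ends up in the "avg ms query" column above.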
The recommended configuration ("cluster": -3, "auto_reduct_cluster_until": 15000) performs better:
(Recommended by me, as of now. More on auto_reduct_cluster_until later.)
name, num products, avg ms query
s24_de_feed 1054684 382
baur_onmacon_de_feed 804669 330
spider_customer_de_feed 280305 145
schwab_de_crawler 253893 191
about_you_de_crawler 218957 161
herrenausstatter_onmacon_de_feed 201719 138
galeriakaufhof_de_crawler 112738 114
otto_de_crawler 98330 125
zalando_de_crawler 80584 143
check_clustering_performace.py checks how accurate a configuration is. Results are also printed, like this:
Run 66 took 41.593730211257935 seconds
_avg c0#-1 100.0 % 30491 ms
_avg c-3#-1 100.0 % 22335 ms (note: on a bigger test this will be around ~99.5%)
_avg c-3#15000 85.61 % 2396 ms
c-3#-1 is the live configuration, c-3#15000 the recommended configuration, c0#-1 the brute force method.
The data here has to be interpreted as follows:
the number behind the c is the cluster config, here -3 and 0.
after the # comes the auto_reduct_cluster_until config, here -1 and 15000.
after that come the accuracy and the average query speed over all products (around 4 million).
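For clarity, the label format can be parsed like this (a small sketch; `parse_label` is a made-up helper, not part of the scripts):

```python
def parse_label(label):
    """Parse a result label like 'c-3#15000' into (cluster, auto_reduct_cluster_until)."""
    cluster_str, until_str = label[1:].split("#")  # drop the leading 'c', split at '#'
    return int(cluster_str), int(until_str)

# parse_label("c-3#15000") -> (-3, 15000)   (the recommended configuration)
# parse_label("c0#-1")     -> (0, -1)       (the brute force method)
```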
Inner workings, cluster and auto_reduct_cluster_until:
This is implemented in the solr_client in picalike_v5.
Each product has 5 GMM clusters (in order). A negative number in the cluster parameter means that only products that have an overlap with the reference product in the top n clusters will be considered. Example with n=3 (cluster=-3):
ref has clusters: [32,1,12,…]
prod a has: [12,55,3,..]
prod a will be considered, because the overlap is >=1.
A positive cluster value means the clusters have to match exactly in the top n. With cluster=1:
prod a will not be considered, because 32 != 12.
cluster=0 simply means: ignore clusters and consider everything.
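The matching rule can be sketched as follows (my reading of the description above; in particular, the positive case assumes a position-wise comparison of the top n clusters):

```python
def is_candidate(ref_clusters, prod_clusters, cluster):
    """Decide whether a product is considered, given the cluster parameter.

    cluster < 0: at least one overlap between the two top-n lists (n = -cluster).
    cluster > 0: the top-n clusters must match exactly, position by position.
    cluster == 0: ignore clusters, consider everything.
    """
    if cluster == 0:
        return True
    n = abs(cluster)
    if cluster < 0:
        return len(set(ref_clusters[:n]) & set(prod_clusters[:n])) >= 1
    return ref_clusters[:n] == prod_clusters[:n]

ref = [32, 1, 12, 7, 9]    # reference product, clusters in order
prod = [12, 55, 3, 8, 4]   # "prod a" from the example above
# is_candidate(ref, prod, -3) -> True   (overlap: 12)
# is_candidate(ref, prod, 1)  -> False  (32 != 12)
# is_candidate(ref, prod, 0)  -> True   (everything is considered)
```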
auto_reduct_cluster_until is a number N (it should have been "reduce", but if you repeat a typo often enough it just stays the way it is). If cluster is set to a value between -5 and -2, the following happens:
- check how many products would have to be considered for the given parameter.
- if the number of products is > N, increase the cluster parameter and try again (-4 to -3, for example)
- if -3 still results in a number of products > N, repeat until cluster=-1
- OR if -3 results in a number of products <= N, go back to -4 and use it.
- since -1 is the strictest filter that is going to be used, the accuracy of that clustering level is our worst case. It's about 85% (see above). And let's be honest: if you are searching in a pool of >15k (or even >1000k) products, losing 15% will not kill you.
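The reduction loop, roughly in code (a sketch: `count_candidates` is a hypothetical helper standing in for the real Solr count query, and the back-off detail of step 4 above is left out):

```python
def auto_reduce_cluster(cluster, n_until, count_candidates):
    """Tighten the cluster filter until the candidate pool is <= n_until.

    count_candidates(cluster) returns how many products the given cluster
    value would consider (in the real solr_client this is a Solr count query).
    Only applies for cluster values between -5 and -2; n_until < 0 disables
    the feature, as in the live configuration.
    """
    if n_until < 0 or not -5 <= cluster <= -2:
        return cluster  # feature disabled or parameter out of range
    while cluster < -1 and count_candidates(cluster) > n_until:
        cluster += 1  # e.g. -4 -> -3: stricter filter, smaller pool
    return cluster
```

With N=15000 this is what turns the live setting into the faster behavior measured above: large shops get pushed toward cluster=-1, small shops keep their original value.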