====== V5 Similarity-API/Solr Performance Tests ======
[[https://git.picalike.corpex-kunden.de/picalike/solr_feature_search|solr_feature_search Git]] (dev branch), under /solr_performace_test:
How to:
* start a local Solr instance (solr_feature_search → start_local_test.sh)
* start a local Similarity-API instance (similarity_api → start_local_test.sh)
* copy data from live (use copy_data_from_solr01.py if you want)
* run the tests
Files:
* check_clustering_performace.py
  * checks the accuracy and the speed of a clustering configuration (beware of caching)
* copy_data_from_solr01.py
  * copies data from live to local
* solr_performace_test.py
  * checks the query speed of a clustering configuration per shop
All results go to stdout (best to redirect them into a log).\\
**solr_performace_test.py** runs over each shop and queries a random product a few times with the configuration defined in the code. The configuration we currently use in the live system (“cluster”: -3, “auto_reduct_cluster_until”: -1) performs fine-ish:
^ name ^ num products ^ avg query (ms) ^
| s24_de_feed | 1054684 | 4414 |
| baur_onmacon_de_feed | 804669 | 3538 |
| spider_customer_de_feed | 280305 | 1652 |
| … | … | … |
| bonprix_de_crawler | 41997 | 241 |
| … | … | … |
| takko_de_crawler | 3649 | 47 |
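For illustration, the per-shop timing loop boils down to something like this. A minimal sketch, not the actual script: query_similar() and the product-id list are hypothetical stand-ins for the real Similarity-API client call and the shop data.
<code python>
import random
import time

def avg_query_ms(shop, product_ids, query_similar, runs=5):
    """Query a few random products of one shop and return the average
    latency in milliseconds. query_similar(shop, product_id, config)
    is a hypothetical stand-in for the real Similarity-API call."""
    config = {"cluster": -3, "auto_reduct_cluster_until": 15000}
    timings = []
    for product_id in random.sample(product_ids, k=min(runs, len(product_ids))):
        start = time.perf_counter()
        query_similar(shop, product_id, config)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)
</code>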
The recommended configuration (“cluster”: -3, “auto_reduct_cluster_until”: 15000) performs better (recommended by me, as of now; more on auto_reduct_cluster_until later):
^ name ^ num products ^ avg query (ms) ^
| s24_de_feed | 1054684 | 382 |
| baur_onmacon_de_feed | 804669 | 330 |
| spider_customer_de_feed | 280305 | 145 |
| schwab_de_crawler | 253893 | 191 |
| about_you_de_crawler | 218957 | 161 |
| herrenausstatter_onmacon_de_feed | 201719 | 138 |
| galeriakaufhof_de_crawler | 112738 | 114 |
| otto_de_crawler | 98330 | 125 |
| zalando_de_crawler | 80584 | 143 |
**check_clustering_performace.py** checks how accurate a configuration is. Results are printed as well, like this:
<code>
Run 66 took 41.593730211257935 seconds
_avg c0#-1      100.0 %  30491 ms
_avg c-3#-1     100.0 %  22335 ms
_avg c-3#15000  85.61 %   2396 ms
</code>
(Note: on a bigger test, c-3#-1 will be around ~99.5 %.)
c-3#-1 is the live configuration, c-3#15000 the recommended configuration, and c0#-1 the brute-force method.\\
The output is read as follows:\\
the number after the c is the cluster config, here “-3” and “0”.\\
After the # comes the auto_reduct_cluster_until config, here -1 and 15000.\\
Then follow the accuracy and the average query time over all products (around 4 million).
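How the accuracy percentage is computed is not spelled out here; a plausible reading (an assumption, not taken from the script) is the overlap between the brute-force results (c0#-1) and the results of the clustered configuration:
<code python>
def accuracy_percent(brute_force_ids, clustered_ids):
    """Assumed metric: share of the brute-force (c0#-1) result ids that
    the clustered configuration also returns, in percent."""
    reference = set(brute_force_ids)
    hits = sum(1 for pid in clustered_ids if pid in reference)
    return 100.0 * hits / len(reference)
</code>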
**Inner workings, cluster and auto_reduct_cluster_until:**\\
This is implemented in the solr_client in picalike_v5.\\
Each product has 5 GMM clusters (in order). Negative numbers in the cluster parameter mean that only those products are considered that have an overlap in the top n clusters with the reference product. Example with n=3 (cluster=-3):\\
ref has clusters: [32,1,12,…]\\
prod a has: [12,55,3,…]\\
prod a will be considered, because the overlap is >=1.\\
A positive cluster value means the clusters have to match exactly in the top n. With cluster=1:\\
prod a will not be considered, because 32 != 12.\\
cluster=0 simply means: ignore clusters and consider everything. A sketch of this filter logic follows below.
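A minimal sketch of that filter logic, illustrating the rules above (not the actual solr_client code):
<code python>
def passes_cluster_filter(ref, prod, cluster):
    """ref/prod: the ordered GMM cluster ids of the reference product
    and of a candidate product; cluster: the cluster parameter."""
    if cluster == 0:
        return True                       # ignore clusters, consider everything
    n = abs(cluster)
    if cluster < 0:
        # negative: any overlap between the top-n clusters is enough
        return len(set(ref[:n]) & set(prod[:n])) >= 1
    # positive: the top-n clusters have to match exactly, position by position
    return ref[:n] == prod[:n]

# the example from the text (only the top 3 matter for cluster=-3 / cluster=1):
assert passes_cluster_filter([32, 1, 12], [12, 55, 3], -3)     # overlap on 12
assert not passes_cluster_filter([32, 1, 12], [12, 55, 3], 1)  # 32 != 12
</code>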
auto_reduct_cluster_until is a number N (it should have been “reduce”, but if you repeat a typo often enough, it just stays the way it is). If cluster is set to a value between -5 and -2, the following happens (see the sketch after this list):
* Check how many products have to be considered for the given parameter.
* If the number of products is > N, increase the cluster parameter and try again (-4 to -3, for example).
* If -3 still yields more than N products, repeat until cluster=-1.
* OR, if -3 yields <= N products, go back to -4 and use it.
* Since -1 is the hardest filter that will ever be used, the accuracy of that clustering level is our worst case: about 85% (see above). And let's be honest, if you are searching in a pool of more than 15k (or even >1000k) products, losing 15% will not kill you.
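Put into code, the reduction loop could look like this. This is a reconstruction of the steps above, not the picalike_v5 solr_client implementation; count_candidates() is a hypothetical helper that asks Solr how many products pass the filter at a given level.
<code python>
def auto_reduce_cluster(cluster, n_until, count_candidates):
    """Pick the cluster level that is actually used for the query.
    Only kicks in for cluster values between -5 and -2 and a
    positive threshold N (n_until)."""
    if n_until < 0 or not -5 <= cluster <= -2:
        return cluster
    level = cluster
    while level < -1:
        if count_candidates(level) <= n_until:
            # tightening dropped the pool to <= N: step back one level
            # so more than N products remain (unless we never tightened)
            return level - 1 if level > cluster else level
        level += 1                        # tighten the filter, e.g. -4 -> -3
    return level                          # -1 is the hardest filter ever used
</code>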