====== V5 Similarity-API/Solr Performance Tests ======

[[https://git.picalike.corpex-kunden.de/picalike/solr_feature_search|solr_feature_search Git]] (dev branch) under /solr_performace_test.

How to:
    * start a local Solr instance (solr_feature_search → start_local_test.sh)
    * start a local sim-api instance (similarity_api → start_local_test.sh)
    * copy data from live (use copy_data_from_solr01.py if you want)
    * run the tests
Files:
    * check_clustering_performace.py — checks the accuracy and speed of a clustering configuration (beware of caching)
    * copy_data_from_solr01.py — copies data from live to local
    * solr_performace_test.py — checks the speed of a clustering configuration per shop

All results go to print (best log).

**solr_performace_test.py** runs over each shop and queries a random product a few times with the configuration defined in the code. The configuration we use in the live system right now ("cluster": -3, "auto_reduct_cluster_until": -1) performs fine-ish:

^ name ^ num products ^ avg ms query ^
| s24_de_feed | 1054684 | 4414 |
| baur_onmacon_de_feed | 804669 | 3538 |
| spider_customer_de_feed | 280305 | 1652 |
| … | | |
| bonprix_de_crawler | 41997 | 241 |
| … | | |
| takko_de_crawler | 3649 | 47 |

The recommended configuration ("cluster": -3, "auto_reduct_cluster_until": 15000) performs better (recommended by me, now; more on auto_reduct_cluster_until later):

^ name ^ num products ^ avg ms query ^
| s24_de_feed | 1054684 | 382 |
| baur_onmacon_de_feed | 804669 | 330 |
| spider_customer_de_feed | 280305 | 145 |
| schwab_de_crawler | 253893 | 191 |
| about_you_de_crawler | 218957 | 161 |
| herrenausstatter_onmacon_de_feed | 201719 | 138 |
| galeriakaufhof_de_crawler | 112738 | 114 |
| otto_de_crawler | 98330 | 125 |
| zalando_de_crawler | 80584 | 143 |

**check_clustering_performace.py** checks how accurate a configuration is. Results will also be printed,
like this:

  Run 66 took 41.593730211257935 seconds
  _avg c0#-1     100.0 %  30491 ms
  _avg c-3#-1    100.0 %  22335 ms   (note: on a bigger test this will be around ~99.5%)
  _avg c-3#15000 85.61 %  2396 ms

c-3#-1 is the live configuration, c-3#15000 the recommended configuration, and c0#-1 the brute-force method.\\
Read the data as follows:\\
the number after the c is the cluster config, here "-3" and "0".\\
After the # comes the auto_reduct_cluster_until config, here -1 and 15000.\\
After that come the accuracy and the average query speed over all products (around 4 million).

**Inner workings, cluster and auto_reduct_cluster_until:**\\
This is implemented in the solr_client in picalike_v5.\\
Each product has 5 GMM clusters (in order). A negative number in the cluster parameter means that only products are considered that have an overlap with the reference product in the top n clusters. Example with n=3 (cluster=-3):\\
ref has clusters: [32,1,12,…]\\
prod a has: [12,55,3,…]\\
prod a will be considered, because the overlap is >=1.\\
A positive cluster value means the clusters have to match exactly in the top n. With cluster=1:\\
prod a will not be considered, because 32 != 12.\\
cluster=0 just means: ignore clusters and consider everything.

auto_reduct_cluster_until is a number N (it should have been "reduce", but if you repeat a typo often enough it just stays the way it is). If the cluster is set to a value between -5 and -2, the following happens:
    * check how many products have to be considered for the given parameter
    * if the number of products is > N, increase the cluster parameter and try again (e.g. -4 to -3)
    * if -3 still results in more than N products, repeat until cluster=-1
    * OR, if -3 results in <= N products, go back to -4 and use it
    * since -1 is the hardest filter that will ever be used, the accuracy of that clustering level is our worst case. It's about 85% (see above). And let's be honest: if you are searching in a pool of more than 15k (or even 1000k) products, losing 15% will not kill you.
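The cluster parameter semantics described above can be sketched in Python. This is a sketch of the described behaviour, not the actual solr_client code; ''passes_cluster_filter'' is a hypothetical helper name:

```python
def passes_cluster_filter(ref_clusters, prod_clusters, cluster):
    """Sketch of the cluster parameter semantics (hypothetical helper,
    not the real picalike_v5 solr_client implementation).

    ref_clusters / prod_clusters: the 5 ordered GMM cluster ids per product.
    cluster: 0 = ignore, -n = overlap in top n, +n = exact match on top n.
    """
    if cluster == 0:
        return True  # ignore clusters, consider everything
    n = abs(cluster)
    if cluster < 0:
        # negative: at least one shared cluster id within the top n
        return len(set(ref_clusters[:n]) & set(prod_clusters[:n])) >= 1
    # positive: the top n clusters have to match exactly (in order)
    return ref_clusters[:n] == prod_clusters[:n]

ref = [32, 1, 12, 7, 9]     # reference product (example from above)
prod_a = [12, 55, 3, 8, 2]  # candidate product

passes_cluster_filter(ref, prod_a, -3)  # True: 12 is in both top-3 lists
passes_cluster_filter(ref, prod_a, 1)   # False: 32 != 12
passes_cluster_filter(ref, prod_a, 0)   # True: clusters ignored
```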
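The auto_reduct_cluster_until escalation could look roughly like this. This is just my reading of the steps above, the real solr_client code in picalike_v5 may differ; ''count_candidates'' is a hypothetical helper:

```python
def auto_reduct(cluster, count_candidates, n):
    """Sketch of the auto_reduct_cluster_until logic described above
    (not the actual implementation).

    cluster: configured value between -5 and -2.
    count_candidates(level): hypothetical helper returning how many
        products pass the overlap filter at that cluster level.
    n: the auto_reduct_cluster_until value N.
    """
    if count_candidates(cluster) <= n:
        # already few enough candidates: go back one (weaker) level and use it
        return max(cluster - 1, -5)
    # too many candidates: tighten the filter step by step; -1 is the hardest
    while cluster < -1 and count_candidates(cluster) > n:
        cluster += 1
    return cluster
```

For example, with (made-up) candidate counts per level of {-5: 100000, -4: 50000, -3: 20000, -2: 8000, -1: 3000} and N=15000, a configured cluster of -3 escalates to -2, since -3 still leaves 20000 > N candidates.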