v5_int_scores

Interest Scores V1

The interest scores are stored in the psql02 live database and uploaded to the reporting DB once a week. The procedure is described in the v5_sim_api git repository and, for now, is carried out with two scripts:

  • calc_intscores.py
  • push_intscores.py

Interest Scores V2

V2 is a migration from the V1 procedure to the current one. The new procedure consists of an “inference” step, which determines the cluster of every product, followed by the calculation of the interest score itself.

Inference

The script for this step lives in the v5_extractor git repository [1], since it is a kind of feature extraction. The deployed version is located on dev02:

/home/picalike/v5/v5_extractor/sim_hash

and is triggered by the /home/picalike/v5/scripts/refresh_daily_prelive.sh script, since the step needs a finished import for both live and prelive. The script determines all 'unhashed' products and performs the cluster step for them. The result goes into the relation sim_clusters.

[1] https://git.picalike.corpex-kunden.de/incubator/v5-extractor/-/blob/master/sim_hash/scripts/simi_hash.py

Calculation

The new method reads from the relation 'simshot' instead of 'oneshot'. 'simshot' is a materialized view that is refreshed once the import is done, also from the refresh_daily_prelive.sh script.

To perform a full int score calculation for all feeds with competitors, just call:

python3 scripts/db_intscores.py --db-uri postgresql://docker:live_sfHjZ0i6GYKc2hIh@v220201062212128885.bestsrv.de:5401/products --full

The actual procedure is very similar to the old one and in most cases fully backward compatible. The only filters applied are:

  • picalike_category
  • picalike_gender
  • updated_weeks is set to the previous week
  • similarity threshold of 0.85 (of 512 bits)
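The similarity filter can be illustrated with a minimal sketch. It assumes that "similarity" means the fraction of the 512 hash bits on which two products agree (one minus the normalized Hamming distance); the page does not spell out the exact definition, and the function names below are hypothetical, not taken from the v5 code.

```python
HASH_BITS = 512  # hash width stated in the filter list


def bit_similarity(hash_a: int, hash_b: int, bits: int = HASH_BITS) -> float:
    """Fraction of bit positions on which both hashes agree (assumed definition)."""
    differing = bin(hash_a ^ hash_b).count("1")  # Hamming distance
    return 1.0 - differing / bits


def is_similar(hash_a: int, hash_b: int, threshold: float = 0.85) -> bool:
    """Apply the 0.85 threshold from the filter list."""
    return bit_similarity(hash_a, hash_b) >= threshold


# Illustrative hashes: start from all bits set and flip the lowest n bits.
base = (1 << HASH_BITS) - 1
flipped_76 = base ^ ((1 << 76) - 1)
flipped_77 = base ^ ((1 << 77) - 1)

print(is_similar(base, flipped_76))  # 76 differing bits
print(is_similar(base, flipped_77))  # 77 differing bits
```

Under this assumption, 0.85 of 512 bits corresponds to at most 76 differing bits, since 76/512 is about 0.148 and 77/512 is about 0.150.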

The results are stored in the table new_intscores, which is 100% compatible with the old table interest_score.

Database

The table to store the values is interest_score.

Backup

Just in case, a simple dump script is available on psql02 that exports the scores from the database as a sequence of COPY commands.

python3 scripts/dump_intscores.py > backup/intscores_16dec2020.sql
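To give an idea of what such a dump looks like, here is a minimal sketch that renders rows in PostgreSQL's COPY text format (tab-separated values, \N for NULL, terminated by a line containing \.). It is not the code of dump_intscores.py; the helper names are hypothetical, and the column names are an assumption borrowed from the INSERT statement shown under "Possible Issues".

```python
def copy_escape(value) -> str:
    """Encode one value for the COPY text format."""
    if value is None:
        return r"\N"  # NULL marker
    text = str(value)
    return (
        text.replace("\\", "\\\\")  # backslash must be escaped first
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_copy(table: str, columns: list, rows: list) -> str:
    """Render rows as one 'COPY ... FROM stdin;' block."""
    header = f"COPY {table} ({', '.join(columns)}) FROM stdin;"
    body = ["\t".join(copy_escape(v) for v in row) for row in rows]
    return "\n".join([header, *body, "\\.", ""])


# Hypothetical example data; column names are assumed, not confirmed.
dump = rows_to_copy(
    "interest_score",
    ["picalike_id", "report_date", "interest_score"],
    [("p1", "2020-12-14", 0.42), ("p2", "2020-12-14", None)],
)
print(dump)
```

A file produced this way can be replayed with psql, which is what makes a sequence of COPY commands convenient as a backup format.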

Cronjob

dev02 → crontab -l

The interest scores are calculated daily after the live import. Usually this yields only a small number of new interest scores, because most of them are calculated on Monday by the weekly refresh script: a new calendar week means a new timespan for the interest scores, so new interest scores are calculated for all products at that point.

scripts:

/home/picalike/v5/scripts/refresh_weekly.sh
/home/picalike/v5/scripts/refresh_daily_live.sh
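Why Monday triggers a full recalculation can be seen from a small sketch of the "previous week" timespan. It assumes calendar weeks that run from Monday to Sunday; how the real scripts encode the week (for example in updated_weeks) is not documented here, and previous_week is a hypothetical helper.

```python
from datetime import date, timedelta


def previous_week(today: date) -> tuple:
    """Return (Monday, Sunday) of the calendar week before the one containing 'today'."""
    this_monday = today - timedelta(days=today.weekday())  # weekday(): Monday == 0
    start = this_monday - timedelta(days=7)
    end = this_monday - timedelta(days=1)
    return start, end


# Sunday and the following Monday fall into different calendar weeks,
# so the "previous week" timespan moves forward by seven days overnight.
sunday_span = previous_week(date(2020, 12, 13))
monday_span = previous_week(date(2020, 12, 14))
print(sunday_span, monday_span)
```

From Tuesday to Sunday the timespan stays the same as on Monday, which is consistent with the daily runs only finding a small number of new scores.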

The upload script checks for new int scores every x hours and uploads them on demand.

/home/picalike/v5/scripts/upload_int_scores.sh

Possible Issues

If the upload of int scores is cancelled like in the case below, nothing has to be done, since the next upload will automatically import the missing scores.

Traceback (most recent call last):
  File "scripts/push_intscores.py", line 169, in <module>
    main()
  File "scripts/push_intscores.py", line 155, in main
    num_inserted = incremental_transfer(args.source_uri, target_uri, args.report_date, verbose=args.verbose)
  File "scripts/push_intscores.py", line 65, in incremental_transfer
    target_cur.executemany("INSERT INTO interest_scores_import (picalike_id,report_date,interest_score) VALUES(%s, %s, %s) ON CONFLICT DO NOTHING", inserts)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

The calculation of interest scores is different: its next scheduled run is not until the following day, so if a calculation run fails, consider restarting it manually.
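The reason an interrupted upload heals itself is visible in the INSERT statement from the traceback: ON CONFLICT DO NOTHING skips rows that already exist, so sending the same rows again is harmless. The sketch below demonstrates this with an in-memory SQLite database as a stand-in for the PostgreSQL reporting DB. The primary key on (picalike_id, report_date) is an assumption, and transfer is a hypothetical helper, not the real incremental_transfer.

```python
import sqlite3

# Same statement shape as in push_intscores.py, with SQLite placeholders.
INSERT_SQL = (
    "INSERT INTO interest_scores_import (picalike_id, report_date, interest_score) "
    "VALUES (?, ?, ?) ON CONFLICT DO NOTHING"
)


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM interest_scores_import").fetchone()[0]


def transfer(conn: sqlite3.Connection, rows: list) -> int:
    """Insert rows, skipping existing ones; return how many were actually added."""
    before = row_count(conn)
    conn.executemany(INSERT_SQL, rows)
    conn.commit()
    return row_count(conn) - before


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interest_scores_import ("
    "picalike_id TEXT, report_date TEXT, interest_score REAL, "
    "PRIMARY KEY (picalike_id, report_date))"  # assumed uniqueness constraint
)

rows = [("p1", "2020-12-14", 0.4), ("p2", "2020-12-14", 0.7), ("p3", "2020-12-14", 0.1)]
first = transfer(conn, rows[:2])  # simulated interrupted upload: only two rows arrived
second = transfer(conn, rows)     # next run sends everything again
print(first, second, row_count(conn))
```

The second run adds only the missing row and leaves the existing ones untouched, which matches the "nothing has to be done" advice for the upload.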

keywords: v5 osa interest scores psql02

v5_int_scores.txt · Last modified: 2024/04/11 14:23 by 127.0.0.1