====== dHash / Placeholder Service ======
The goal is to identify placeholder or dummy images per shop by counting dHash values: a hash that occurs for many different products in the same shop is likely a placeholder image.
git: https://git.picalike.corpex-kunden.de/incubator/dhash-dedup
service: http://report-db01.picalike.corpex-kunden.de:18390
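The counting idea can be sketched as follows. This is a minimal illustration, not the service's actual code: it assumes images have already been resized to a small grayscale matrix (a full dHash implementation would first resize to 9x8 grayscale, e.g. with Pillow), and the threshold of 10 is a made-up example value.

```python
from collections import Counter

def dhash_bits(gray_rows):
    """Difference hash of an already-resized grayscale matrix.

    Each bit records whether a pixel is brighter than its right
    neighbour; with a 9x8 input this yields the usual 64-bit dHash.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits

def placeholder_candidates(hash_by_url, min_count=10):
    """Flag URLs whose hash appears for at least `min_count` products.

    `min_count=10` is a hypothetical cutoff for illustration; the
    service's real logic lives in scripts/process_shops.py.
    """
    counts = Counter(hash_by_url.values())
    return [url for url, h in hash_by_url.items() if counts[h] >= min_count]
```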
====== Requirements ======
Aside from the dependencies listed in requirements.txt, this service fetches data from MongoDB (meta_db database, meta_db collection). Before running the Docker image, the internal databases with image dHashes for all shops must be generated.
To generate the internal databases, run scripts/cron/process_all_shops.sh, or scripts/process_shops.py for more options. This may take many hours when starting from scratch.
For testing purposes, please run:
python3 scripts/process_shops.py --shops-file tests/test_shop.txt --out-folder "data/databases" --force --find-placeholders
which creates a database for a single small shop ('soliver_de_crawler').
====== API endpoints ======
There is only one endpoint + health:
  * get_placeholders: takes a shop_name as input (e.g. 'soliver_de_crawler') and returns the list of known placeholder URLs for that shop. The list may be empty.
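A minimal client sketch for the endpoint. Note that the HTTP method (GET) and the query parameter name (`shop_name`) are assumptions; check the service code for the actual interface.

```python
import json
import urllib.parse
import urllib.request

SERVICE = "http://report-db01.picalike.corpex-kunden.de:18390"

def placeholders_url(shop_name, base=SERVICE):
    """Build the get_placeholders request URL (parameter name is assumed)."""
    query = urllib.parse.urlencode({"shop_name": shop_name})
    return f"{base}/get_placeholders?{query}"

def get_placeholders(shop_name):
    """Fetch the list of known placeholder URLs for a shop; may be empty."""
    with urllib.request.urlopen(placeholders_url(shop_name)) as resp:
        return json.loads(resp.read())

# Example (requires network access to the service):
# urls = get_placeholders("soliver_de_crawler")
```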
The internal hash databases are refilled once a week, and any new placeholders are detected at that time. After QA, they are added to the list of known placeholders.
Keywords: phash dhash dedup near duplicate
===== QA process =====
After each cron job, new placeholders for a shop are stored but not returned in the get_placeholders endpoint yet. They need to go through QA.
To perform QA, go to report-db01:~/dhash/ and run:
export PYTHONPATH=.
python3 scripts/qa.py --data-folder 'data/databases'
If the script reports that an HTML file was generated, there are new placeholders. Inspect placeholders.html (in the same directory) and follow the on-screen instructions to perform QA. You will be asked to enter the shop names and URLs of the placeholders you do not want to include ("bad" placeholders); all of them are shown in the inspection HTML. When you are finished, leave the fields empty (as instructed on screen) and you will be prompted to confirm that you are done. All remaining placeholders are then included as valid.
**If you quit in the middle of the script**: no problem, just run it again with:
python3 scripts/qa.py --data-folder 'data/databases' --qa-only
and QA will start over.
=== Updating the data ===
Updates to the local SQLite databases are done automatically via a cronjob that calls the scripts/cron/process_all_shops.sh script. The data comes from the OSA MongoDB (meta_db_collection).
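Since the databases are refilled weekly, the cronjob presumably looks something like the entry below. Both the schedule and the install path are hypothetical; check the actual crontab on report-db01.

```
# Hypothetical: full refresh every Sunday at 03:00.
# Replace /path/to/dhash-dedup with the real checkout location.
0 3 * * 0  cd /path/to/dhash-dedup && bash scripts/cron/process_all_shops.sh
```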