This page documents the status quo of the migration of the v3 sim to the new v5 world.
THIS IS WORK IN PROGRESS AND SUBJECT TO CHANGE AT ANY TIME
The pipeline does not operate on CSV files, but on feeds in the old v3 Mongo:

host: mongodb0{1, 2}.live.picalike.corpex-kunden.de:27017
database: picalike3
username: picalike3
password: [get me someplace else]
connection string: mongodb://picalike3:<password>@mongodb01.live.picalike.corpex-kunden.de:27017,mongodb02.live.picalike.corpex-kunden.de:27017/picalike3?authSource=picalike3&replicaSet=picalike-live0
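The pieces above assemble into a standard replica-set connection string. A minimal sketch, assuming the helper name and the `V3_MONGO_PASSWORD` environment variable (both are illustrative; the password itself still has to be obtained out of band):

```python
import os

# Hypothetical helper: assembles the v3 connection string from the
# host/database/user details listed above.
def v3_connection_string(password: str) -> str:
    hosts = ",".join(
        f"mongodb0{i}.live.picalike.corpex-kunden.de:27017" for i in (1, 2)
    )
    return (
        f"mongodb://picalike3:{password}@{hosts}/picalike3"
        "?authSource=picalike3&replicaSet=picalike-live0"
    )

uri = v3_connection_string(os.environ.get("V3_MONGO_PASSWORD", "<password>"))
print(uri)
```

The resulting URI can be passed directly to `pymongo.MongoClient`.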
We start with an ic collection, for example ic_3486, which is sportscheck. Every v3 feed needs to be converted into a distinct v5 shop ID. For now, we use the format: ic_{digits} → {$name_v3_de_feed}. An example of such a mapping is in v5_color_extractor:v3_image_urls.py.
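As a sketch, the mapping can be thought of as a plain lookup table. Only the ic_3486 → sportscheck entry is taken from this page; the authoritative table lives in v3_image_urls.py:

```python
# Illustrative lookup table; the real mapping is defined in
# v5_color_extractor:v3_image_urls.py.
V3_TO_V5_SHOP_ID = {
    "ic_3486": "sportscheck_v3_de_feed",  # example from this page
}

def v5_shop_id(ic_collection: str) -> str:
    """Translate a v3 ic_{digits} collection name into its v5 shop ID."""
    return V3_TO_V5_SHOP_ID[ic_collection]
```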
The actual transformation step is done with v3_v5_feed.py (in v5_color_extractor). It requires a valid mongo_live.json settings file, which can be found on index03.
Then it is just one call:
mkdir -p /tmp/feeds; python3 ./v3_v5_feed.py --output-path /tmp/feeds/
The resulting JSON files in the folder can be used directly by the import_data script.
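The output format is one JSON document per line. A minimal reader for such a file, as a sketch (the field names inside the documents are not specified on this page and are therefore not assumed):

```python
import json
from pathlib import Path
from typing import Iterator

def iter_feed(path: str) -> Iterator[dict]:
    # One JSON document per line, as produced by v3_v5_feed.py.
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```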
The goal of the shop extractor is to send extraction requests to different extractor instances and communicate the extraction status to the shop conveyor belt. It also ensures that all ‘recent’ shop images have features for a given shop. This is a work in progress and was never fully tested or deployed.
Link: GPU machines
The base folder is here:
/home/picalike/v5
The environment is a Python 3.8 environment that is managed via
pip3 install --user
But be careful: the environment is shared with other users. Furthermore, access to grapex is limited.
The v5_extractor service does not support v3 feeds yet, so a dedicated bulk extractor is used. The situation is further complicated because, due to a lack of hardware, grapex is used for the extraction.
Color:
ssh grapex
cd /home/picalike/v5/v3_colors
./color_enrichment.sh
which expects color_urls.distinct in the v3_mongo folder
Shape:
ssh grapex
cd /home/picalike/v5/v5_extractor
./export_features.sh
which expects urls.jsons in the same folder
Both files are generated with the v3_image_urls.py script from v5_color_extractor.
The import_data.py script is located in the v5_backend repository:
python3 scripts/import_data.py --db_uri postgresql://docker:pgsql@localhost:5401/products --source_uri /tmp/feeds --shop_ids ags_v3_de_feed hirmergg_v3_de_feed madeleine_v3_de_feed sheego_v3_de_feed sportscheck_v3_de_feed witt_v3_de_feed
The call uses a local backend database and one-JSON-per-line files as input. All v3 feeds should carry the extra _v3_ marker in their shop ID.
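A quick sanity check of that naming convention, using the shop IDs from the import call above (the helper name is illustrative):

```python
# Shop IDs as passed to import_data.py; all carry the _v3_ marker.
SHOP_IDS = [
    "ags_v3_de_feed",
    "hirmergg_v3_de_feed",
    "madeleine_v3_de_feed",
    "sheego_v3_de_feed",
    "sportscheck_v3_de_feed",
    "witt_v3_de_feed",
]

def is_v3_shop_id(shop_id: str) -> bool:
    return "_v3_" in shop_id

print(all(is_v3_shop_id(s) for s in SHOP_IDS))  # prints True
```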
HINT: Only products with existing shape features are imported. The lookup is done via --replica_uri, and that database is filled by the feature extraction step.
And now the v3 color features:
python3 scripts/import_v3_color.py --db-uri postgresql://docker:pgsql@localhost:5401/products --shop-ids ags_v3_de_feed hirmergg_v3_de_feed madeleine_v3_de_feed sheego_v3_de_feed sportscheck_v3_de_feed witt_v3_de_feed

==== Create V3 SIM (materialized) view ====

<code>
cd /home/$USER/repos/v5_backend
export PYTHONPATH=.
python3 scripts/dump_schema.py --v3-only | psql postgresql://docker:pgsql@localhost:5401/products
</code>
This step allows the v5_sim service to access and use the data.
After imports, you need to refresh the view:

REFRESH MATERIALIZED VIEW CONCURRENTLY v3_sim;