====== Migration of v3 sim/trend Customers ======

This page describes the status quo of the migration of the v3 sim to the new v5 world.

**THIS IS WORK IN PROGRESS AND SUBJECT TO CHANGE AT ANY TIME**

===== GIT repos =====

  * https://git.picalike.corpex-kunden.de/incubator/v5-color-extractor
  * https://git.picalike.corpex-kunden.de/incubator/v5-backend
  * https://git.picalike.corpex-kunden.de/incubator/v5-sim-api
  * https://git.picalike.corpex-kunden.de/incubator/shop-extractor

===== Alpha Pipeline =====

The pipeline does not operate on CSV files, but on feeds in the old v3 Mongo:

<code>
host: mongodb0{1,2}.live.picalike.corpex-kunden.de:27017
database: picalike3
username: picalike3
password: [get it someplace else]
connection string: mongodb://picalike3:@mongodb01.live.picalike.corpex-kunden.de:27017,mongodb02.live.picalike.corpex-kunden.de:27017/picalike3?authSource=picalike3&replicaSet=picalike-live0
</code>
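To sanity-check access and peek at one of the feed collections, here is a minimal pymongo sketch (not part of the pipeline; the <password> placeholder must be replaced with the real password):

<code python>
# Minimal sketch: check access to the v3 Mongo and peek at one feed collection.
# Assumes pymongo is installed; <password> is a placeholder for the real password.
from pymongo import MongoClient

URI = (
    "mongodb://picalike3:<password>"
    "@mongodb01.live.picalike.corpex-kunden.de:27017,"
    "mongodb02.live.picalike.corpex-kunden.de:27017/picalike3"
    "?authSource=picalike3&replicaSet=picalike-live0"
)

client = MongoClient(URI)
db = client["picalike3"]

# All ic_* collections are candidate v3 feeds; ic_3486 is sportscheck.
print(sorted(c for c in db.list_collection_names() if c.startswith("ic_")))
print(db["ic_3486"].find_one())
</code>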
The pipeline consists of the following steps:

  * Convert an ic_{digits} mongo collection into a v5-compatible input feed
  * Ensure that all color/shape features are present
  * Import the feed into the v5 backend (only shape features are used)
  * Import the color features for all imported v3 feeds
==== Feed v3->v5 ====

We start with an ic collection, e.g. ic_3486, which is sportscheck. Every v3 feed needs to be converted into a distinct v5 shop ID. For now, we use the format ic_{digits} -> {shop_name}_v3_de_feed (e.g. ic_3486 -> sportscheck_v3_de_feed). An example of such a mapping is in v5_color_extractor:v3_image_urls.py.

The actual transformation step is done with v3_v5_feed.py (v5_color_extractor). It requires a valid mongo_live.json settings file, which can be found at index03. Then it is just one call:

<code>
mkdir -p /tmp/feeds; python3 ./v3_v5_feed.py --output-path /tmp/feeds/
</code>

The resulting JSON files in that folder can be used directly by the import_data script.

==== Shop Extractor ====

The goal of the shop extractor is to send extraction requests to the different extractor instances and to report the extraction status to the shop conveyor belt. It also ensures that all "recent" images of a given shop have features. This is a work in progress and was never fully tested nor deployed.

==== Grapex Environment ====

Link: [[gpu_machines|GPU machines]]

The base folder is /home/picalike/v5. It is a Python 3.8 environment that is managed via

<code>
pip3 install --user
</code>

Be careful: the environment is shared with other users. Furthermore, access to grapex is limited.

==== [optional] Feature Enrichment: v3 color / v5 shape ====

The v5_extractor service does not support v3 feeds yet, so a dedicated bulk extractor is used instead. The situation is further complicated because, due to a lack of hardware, grapex is used for the extraction.

Color:

<code>
ssh grapex
cd /home/picalike/v5/v3_colors
./color_enrichment.sh
</code>

which expects color_urls.distinct in the v3_mongo folder.

Shape:

<code>
ssh grapex
cd /home/picalike/v5/v5_extractor
./export_features.sh
</code>

which expects urls.jsons in the same folder.

Both input files are generated with the v3_image_urls.py script from v5_color_extractor.

==== Import v3 feeds into v5 ====

The import_data.py script is located in v5_backend:

<code>
python3 scripts/import_data.py --db_uri postgresql://docker:pgsql@localhost:5401/products --source_uri /tmp/feeds --shop_ids ags_v3_de_feed hirmergg_v3_de_feed madeleine_v3_de_feed sheego_v3_de_feed sportscheck_v3_de_feed witt_v3_de_feed
</code>

The call uses a local backend database and one-JSON-per-line files as input. All v3 feed shop IDs carry the extra _v3_ marker.

HINT: only products with existing shape features are imported. The lookup is done via --replica_uri, and that database is filled by the feature extraction step.

And now the v3 color features:

<code>
scripts/import_v3_color.py --db-uri postgresql://docker:pgsql@localhost:5401/products --shop-ids ags_v3_de_feed hirmergg_v3_de_feed madeleine_v3_de_feed sheego_v3_de_feed sportscheck_v3_de_feed witt_v3_de_feed
</code>

==== Create V3 SIM (materialized) view ====

<code>
cd /home/$USER/repos/v5_backend
export PYTHONPATH=.
python3 scripts/dump_schema.py --v3-only | psql postgresql://docker:pgsql@localhost:5401/products
</code>

This step allows the v5_sim service to access and use the data. After each import, you need to refresh the view:

<code>
refresh materialized view CONCURRENTLY v3_sim;
</code>
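The refresh can also be scripted at the end of an import run. Below is a minimal sketch using psycopg2 against the same local backend database; psycopg2 is just one possible client (an assumption, not prescribed by the scripts above), and autocommit is enabled so the statement is not wrapped in an explicit transaction block:

<code python>
# Minimal sketch: refresh the v3_sim materialized view after an import run.
# Assumption: psycopg2 is installed; the DSN matches the local backend database used above.
import psycopg2

DSN = "postgresql://docker:pgsql@localhost:5401/products"

conn = psycopg2.connect(DSN)
conn.autocommit = True  # run the refresh outside an explicit transaction block
try:
    with conn.cursor() as cur:
        # CONCURRENTLY keeps v3_sim readable for the v5_sim service during the refresh.
        cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY v3_sim;")
finally:
    conn.close()
</code>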