
Migration of v3 sim/trend Customers

This page describes the current status of the migration of the v3 sim to the new v5 world.

THIS IS WORK IN PROGRESS AND SUBJECT TO CHANGE AT ANY TIME

Git repos

Alpha Pipeline

The pipeline does not operate on CSV files, but on feeds stored in the old v3 MongoDB:

host: mongodb0{1, 2}.live.picalike.corpex-kunden.de:27017
database: picalike3
username: picalike3
password: [get me someplace else]

connection string: mongodb://picalike3:<password>@mongodb01.live.picalike.corpex-kunden.de:27017,mongodb02.live.picalike.corpex-kunden.de:27017/picalike3?authSource=picalike3&replicaSet=picalike-live0
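For scripting against the v3 Mongo, the same connection string can be assembled from its parts. A minimal sketch (the password is a placeholder; fetch the real one as noted above):

```python
from urllib.parse import quote_plus

# Assemble the v3 Mongo connection string from its components.
# PASSWORD is a placeholder -- get the real one someplace else, as above.
HOSTS = [
    "mongodb01.live.picalike.corpex-kunden.de:27017",
    "mongodb02.live.picalike.corpex-kunden.de:27017",
]
USER = "picalike3"
PASSWORD = "<password>"  # placeholder; quote_plus handles special characters
DATABASE = "picalike3"

uri = (
    f"mongodb://{USER}:{quote_plus(PASSWORD)}@{','.join(HOSTS)}"
    f"/{DATABASE}?authSource={DATABASE}&replicaSet=picalike-live0"
)
print(uri)
```

The resulting URI can be passed directly to e.g. `pymongo.MongoClient`.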

  - Convert an ic_{digits} Mongo collection into a v5-compatible input feed
  - Ensure that all color/shape features are present
  - Import the feed into the v5 backend (only shape features are used)
  - Import the color features for all imported v3 feeds

Feed v3->v5

We start with an ic collection, like ic_3486, which is sportscheck. Every v3 feed needs to be converted into a distinct v5 shop ID. For now, we use the format ic_{digits} → {name}_v3_de_feed. An example of such a mapping is in v5_color_extractor:v3_image_urls.py.
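The mapping can be sketched as a plain lookup table. The authoritative mapping lives in v5_color_extractor:v3_image_urls.py; the dict below is an assumed example, not the real table:

```python
# Illustrative sketch of the ic_{digits} -> {name}_v3_de_feed mapping.
# The real mapping is defined in v5_color_extractor/v3_image_urls.py;
# this dict only shows the shape of it.
V3_TO_V5_SHOP = {
    "ic_3486": "sportscheck_v3_de_feed",
}

def v5_shop_id(ic_collection: str) -> str:
    """Translate a v3 ic_* collection name into its v5 shop ID."""
    try:
        return V3_TO_V5_SHOP[ic_collection]
    except KeyError:
        raise KeyError(f"no v5 shop ID mapped for {ic_collection!r}")

print(v5_shop_id("ic_3486"))  # sportscheck_v3_de_feed
```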

The actual transformation step is done with v3_v5_feed.py (in v5_color_extractor). It requires a valid mongo_live.json settings file, which can be found on index03.

Then it is just one call:

mkdir -p /tmp/feeds; python3 ./v3_v5_feed.py --output-path /tmp/feeds/

The resulting JSON files in the folder can be used directly by the import_data script.
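These feed files are one JSON document per line ("jsons"). A minimal, self-contained sketch of writing and reading that format; the field names are assumptions, not the actual v5 feed schema:

```python
import json
import os
import tempfile

# One-json-per-line ("jsons") feed format, as consumed by import_data.
# Field names below are illustrative only, not the real v5 schema.
products = [
    {"id": "123", "shop_id": "sportscheck_v3_de_feed", "image_url": "https://example.com/1.jpg"},
    {"id": "456", "shop_id": "sportscheck_v3_de_feed", "image_url": "https://example.com/2.jpg"},
]

path = os.path.join(tempfile.mkdtemp(), "sportscheck_v3_de_feed.jsons")
with open(path, "w") as fh:
    for product in products:
        fh.write(json.dumps(product) + "\n")  # one document per line

with open(path) as fh:
    loaded = [json.loads(line) for line in fh]
print(len(loaded))  # 2
```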

Shop Extractor

The goal of the shop extractor is to send extraction requests to different extractor instances and communicate the extraction status to the shop conveyor belt. It also ensures that all ‘recent’ shop images have features for a given shop. This is a work in progress and was never fully tested nor deployed.

Grapex Environment

Link: GPU machines

The base folder is here:

/home/picalike/v5

The environment is a Python 3.8 environment that is managed via

pip3 install --user

But be careful since the environment is shared with other users. Furthermore, the access to grapex is limited.

[optional] Feature Enrichment: v3 color / v5 shape

The v5_extractor service does not support v3 feeds yet, so a dedicated bulk extractor is used. The situation is further complicated because, due to a lack of hardware, grapex is used for the extraction.

Color:

ssh grapex
cd /home/picalike/v5/v3_colors
./color_enrichment.sh

which expects color_urls.distinct in the v3_mongo folder

Shape:

ssh grapex
cd /home/picalike/v5/v5_extractor
./export_features.sh

which expects urls.jsons in the same folder

Both files are generated with the v3_image_urls.py script from v5_color_extractor.

Import v3 feeds into v5

The import_data.py script is located in the v5_backend:

python3 scripts/import_data.py --db_uri postgresql://docker:pgsql@localhost:5401/products --source_uri /tmp/feeds  --shop_ids ags_v3_de_feed hirmergg_v3_de_feed madeleine_v3_de_feed sheego_v3_de_feed sportscheck_v3_de_feed witt_v3_de_feed

The call uses a local backend database and one-JSON-per-line files as input. All v3 feeds carry the _v3_ marker in their shop ID (e.g. sportscheck_v3_de_feed).

HINT: only products with existing shape features are imported. The lookup is done via --replica_uri, and that database is filled by the feature extraction step.
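The effect of that filter can be sketched in a few lines. In reality the lookup goes through --replica_uri against the feature database; here a plain set of IDs stands in for that query, and all names are hypothetical:

```python
# Sketch of the "only products with shape features" filter applied by
# import_data.py. A set of IDs stands in for the --replica_uri lookup;
# IDs and field names are assumptions for illustration.
shape_feature_ids = {"123", "789"}  # assumed: IDs present in the replica DB

feed = [
    {"id": "123", "shop_id": "sportscheck_v3_de_feed"},
    {"id": "456", "shop_id": "sportscheck_v3_de_feed"},  # no shape features -> dropped
    {"id": "789", "shop_id": "witt_v3_de_feed"},
]

importable = [p for p in feed if p["id"] in shape_feature_ids]
print([p["id"] for p in importable])  # products 123 and 789 survive
```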

And now the v3 color features:

python3 scripts/import_v3_color.py --db-uri postgresql://docker:pgsql@localhost:5401/products --shop-ids ags_v3_de_feed hirmergg_v3_de_feed madeleine_v3_de_feed sheego_v3_de_feed sportscheck_v3_de_feed witt_v3_de_feed

Create V3 SIM (materialized) view

<code>
cd /home/$USER/repos/v5_backend
export PYTHONPATH=.
python3 scripts/dump_schema.py --v3-only | psql postgresql://docker:pgsql@localhost:5401/products
</code>

This step allows the v5_sim service to access and use the data.

After imports, you need to refresh the view:

REFRESH MATERIALIZED VIEW CONCURRENTLY v3_sim;
v3_v5.txt · Last modified: 2024/04/11 14:23 by 127.0.0.1