
Vogue and Tagwalk Scraper

GIT: https://git.picalike.corpex-kunden.de/incubator/vogue-scraper

FIXME: the scraper is currently broken due to website changes.

Quick setup

Use build.sh to build the docker container, then start it with run.sh (or run_local.sh if running locally). Calling this script runs the entire pipeline (crawling + extracting object colors) and stores the results in the mongo collection. A full run will likely take a long time (> 24h).

Remarks


Cronjob

The service is triggered as a cronjob on dev02. It is currently disabled, but it can be activated by adding this entry to the crontab jobs file (it runs every Monday at 06:00):

  # -- CATWALK VOGUE/TAGWALK CRAWLER --
  0 6 * * 1 /home/picalike/docker_bin/vogue_scraper/start_vogue_scraper.sh

Details

The goal of the project is twofold:

1) Crawl vogue and tagwalk websites to collect all the image links

This can be done locally by calling bin/crawl_vogue.sh or bin/crawl_tagwalk.sh directly.

These scripts check if there's a previous crawl log available and, if so, load previously crawled URLs to prevent crawling them again.
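The skip-already-crawled check described above could look roughly like the sketch below. It is only an illustration: the function names are hypothetical, and it assumes a log with one URL per line, which may not match the actual format of the crawl log.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_crawled_urls(log_path: Path) -> Set[str]:
    """Return the URLs recorded in a previous crawl log (empty set if there is none)."""
    if not log_path.exists():
        return set()
    with log_path.open(encoding="utf-8") as handle:
        # Assumption: one URL per line; blank lines are ignored.
        return {line.strip() for line in handle if line.strip()}


def filter_new_urls(candidates: Iterable[str], crawled: Set[str]) -> List[str]:
    """Keep only URLs that were not crawled before, preserving input order."""
    seen = set(crawled)
    fresh = []
    for url in candidates:
        if url not in seen:
            seen.add(url)  # also drops duplicates within the candidate list
            fresh.append(url)
    return fresh
```

With a missing log file the first function returns an empty set, so the first crawl simply processes every candidate URL.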

Crawling is done in two steps: first, page URLs are crawled from the website; second, image URLs are crawled from each valid page and stored in a mongo db.

Both steps are performed by the scripts/crawl_(vogue|tagwalk).py script, which these shell scripts call. A crawling log is stored at data/vogue_crawl.log for inspection.

Please check 'scripts/crawl_(vogue|tagwalk).py -h' for available options, but default parameters should work. Use of --save-json is encouraged to make subsequent crawls much faster.

Use --force if you want to crawl URLs that are stored as errors in the error collection. Without it, once a URL is stored as an error it won't be crawled again.
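The effect of --force can be pictured with the following minimal sketch. The function name and the in-memory set of error URLs are assumptions made for illustration; the real logic lives in src/crawlers.py and src/scraper.py and reads the error collection from mongo.

```python
from typing import Iterable, List, Set


def select_urls_to_crawl(
    candidates: Iterable[str], error_urls: Set[str], force: bool = False
) -> List[str]:
    """Drop URLs previously stored as errors unless force is set."""
    if force:
        # With force, known-bad URLs get another attempt.
        return list(candidates)
    return [url for url in candidates if url not in error_urls]
```

Retrying with force is mainly useful after a transient failure, since URLs that failed for a permanent reason will just be stored as errors again.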

For detailed information about parameters and inner logic, check src/crawlers.py for specific crawling logic and src/scraper.py for the generic scraping logic.

The result is written into a mongo collection that is used as the input for step (2).

2) Extract objects & colors from all images

After crawling image URLs, run bin/extraction_pipeline.sh to execute the entire extraction pipeline.

The pipeline works as follows:


External services


Output data

Important: the “objects” array might be empty. It may therefore be useful to keep only documents with at least one object by filtering with:

  db.collection.find({"objects.0": {$exists: true}})

There's an index to speed this up.
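From Python, the same filter can be written as a plain dictionary. The sketch below also includes a hypothetical helper, has_objects, that applies the equivalent check to a document that has already been loaded. The commented pymongo call assumes a collection handle named collection.

```python
from typing import Any, Dict

# Same filter as the shell query above: matches documents whose
# "objects" array has an element at index 0, i.e. is non-empty.
NON_EMPTY_OBJECTS_FILTER: Dict[str, Any] = {"objects.0": {"$exists": True}}

# With pymongo this would be passed to find(), for example:
#   for doc in collection.find(NON_EMPTY_OBJECTS_FILTER): ...


def has_objects(document: Dict[str, Any]) -> bool:
    """Client-side equivalent: True if the document lists at least one object."""
    objects = document.get("objects")
    return isinstance(objects, list) and len(objects) > 0
```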

A sample of the data structure produced as output (currently saved to object_colors_merge collection):

  {
      "_id" : ObjectId("620679ad075cc3e802ac9294"),
      "id" : "https://assets.vogue.com/photos/61e1b65e52a47864beb085f3/master/w_2560%2Cc_limit/00001-A-Cold-Wall-Menswear-Fall-2022-Credit-Brand.jpg",
      "objects" : [ 
          {
              "name" : "00001-A-Cold-Wall-Menswear-Fall-2022-Credit-Brand_00_01_attrshoe.jpg",
              "num" : 1,
              "category" : "shoe",
              "Colors" : {
                  "#a2b9c2" : 0.000335233,
                  "#af9294" : 5.5872166e-05,
                  "#bed3bb" : 5.5872166e-05,
                  "#3a395f" : 0.0025701195,
                  "#66829a" : 0.32171193,
                  "#2e3d30" : 0.12275115,
                  "#5b4f3b" : 5.5872166e-05,
                  "#06b7e" : 0.0015644206,
                  "#46515a" : 0.4865348,
                  "#4f6b58" : 0.058721647,
                  "#82776b" : 0.005643089
              }
          },
      ],
      "attributes" : [ 
          "lapel", 
          "neckline", 
          "zipper", 
          "collar", 
          "epaulette", 
          "sleeve", 
          "flower", 
          "hood"
      ],
      "url_reference" : "https://www.vogue.com/fashion-shows/fall-2022-menswear/a-cold-wall",
      "designer" : "A-Cold-Wall",
      "season" : "FALL 2022 MENSWEAR",
      "source" : "vogue"
  }
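As an illustration of how such a document might be consumed, the sketch below picks the highest-weighted color for each detected object. It assumes the values under "Colors" are relative weights (in the sample above they sum to roughly 1). The function name is hypothetical and not part of the project.

```python
from typing import Any, Dict


def dominant_colors(document: Dict[str, Any]) -> Dict[str, str]:
    """Map each object name to the hex color with the largest weight."""
    result: Dict[str, str] = {}
    for obj in document.get("objects", []):
        colors = obj.get("Colors") or {}
        if not colors:
            continue  # skip objects without color information
        result[obj["name"]] = max(colors, key=colors.get)
    return result


# Abbreviated version of the sample document shown above.
sample = {
    "objects": [
        {
            "name": "00001-A-Cold-Wall-Menswear-Fall-2022-Credit-Brand_00_01_attrshoe.jpg",
            "category": "shoe",
            "Colors": {"#66829a": 0.32171193, "#2e3d30": 0.12275115, "#46515a": 0.4865348},
        }
    ]
}
print(dominant_colors(sample))
```

Documents with an empty "objects" array simply yield an empty mapping.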