vogue-scraper

Vogue and Tagwalk Scraper

GIT: https://git.picalike.corpex-kunden.de/incubator/vogue-scraper

FIXME: it is currently broken due to website changes

Quick setup

Use build.sh to build the Docker container, then start it with run.sh (or run_local.sh when running locally). The script runs the entire pipeline (crawling and extracting object colors) and stores the results in the mongo collection. A full run will likely take more than 24 hours.

Remarks

  • As of February 2023, the Vogue crawler is no longer working. It needs maintenance, but since it is part of OSA, this is not a priority right now.
  • The Tagwalk website started to require a login for access in August 2022. Therefore only Vogue is crawled, but the code to crawl Tagwalk is still available.
  • The Vogue website has changed multiple times. When a big change happens, the crawler must be adapted or rewritten to match the new website logic. This project is not part of a live service (it belongs to OSA), so do not assume that the crawler works before going live with it.

Cronjob

The service runs as a cronjob on dev02. It is currently disabled, but it can be re-enabled by adding the following entry to the crontab file (the schedule runs it every Monday at 06:00):

  # -- CATWALK VOGUE/TAGWALK CRAWLER --
  0 6 * * 1 /home/picalike/docker_bin/vogue_scraper/start_vogue_scraper.sh

Details

The goal of the project is twofold:

1) Crawl the Vogue and Tagwalk websites to collect all image links

This can be done locally by calling bin/crawl_vogue.sh or bin/crawl_tagwalk.sh directly.

These scripts check if there's a previous crawl log available and, if so, load previously crawled URLs to prevent crawling them again.

Crawling is done in two steps: first, page URLs are collected from the website; then, image URLs are collected from each valid page and stored in a mongo database.

Both steps are performed by scripts/crawl_(vogue|tagwalk).py, which the shell scripts above call. A crawl log is stored at data/vogue_crawl.log for inspection.
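
The resume behaviour and the two crawl steps described above could look roughly like the following. This is a minimal sketch, not the code in src/crawlers.py: the function names, the callable `fetch_image_urls`, and the assumed log format (the URL is the last whitespace-separated token on a line) are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def load_crawled_urls(log_lines: Iterable[str]) -> Set[str]:
    """Collect URLs from a previous crawl log.

    Assumption: each relevant log line ends with the crawled URL.
    """
    seen: Set[str] = set()
    for line in log_lines:
        tokens = line.split()
        if tokens and tokens[-1].startswith("http"):
            seen.add(tokens[-1])
    return seen


def crawl_new_pages(
    page_urls: Iterable[str],
    fetch_image_urls: Callable[[str], List[str]],
    already_crawled: Set[str],
) -> Dict[str, List[str]]:
    """Fetch image URLs only for pages that were not crawled before."""
    results: Dict[str, List[str]] = {}
    for page_url in page_urls:
        if page_url in already_crawled:
            continue  # skip pages found in the previous crawl log
        results[page_url] = fetch_image_urls(page_url)
        already_crawled.add(page_url)
    return results
```

In the real project the results go to a mongo collection; the sketch returns a plain dictionary so that it runs without a database.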

Run scripts/crawl_(vogue|tagwalk).py -h to see the available options; the default parameters should work. Using --save-json is encouraged, as it makes subsequent crawls much faster.

Use --force to re-crawl URLs that are stored in the error collection. Without it, a URL that has been stored as an error is never crawled again.
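
The effect of --force on URLs in the error collection can be summarised as a small decision function. This is only an illustration of the rule stated above; `should_crawl` and its parameters are hypothetical names, and plain sets stand in for the mongo collections.

```python
from typing import Set


def should_crawl(url: str, crawled: Set[str], errors: Set[str], force: bool = False) -> bool:
    """Decide whether a URL should be crawled in this run."""
    if url in crawled:
        return False  # already crawled successfully
    if url in errors and not force:
        return False  # previously failed; skipped unless --force is given
    return True
```

For example, a URL that failed earlier is skipped by default and retried only when `force=True`.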

For detailed information about parameters and inner logic, check src/crawlers.py for specific crawling logic and src/scraper.py for the generic scraping logic.

The result is written into a mongo collection that is used as the input for step (2).

2) Extract objects & colors from all images

After crawling image URLs, run bin/extraction_pipeline.sh to execute the entire extraction pipeline.

The pipeline works as follows:

  • image URLs are read from the read collection
  • each URL is sent to the object detector API, which returns a zip with each object in a different image
  • if the result is valid, these images are uploaded using the upload_image endpoint
  • for images with a successful upload, the URL is sent to the color extractor, which returns a dictionary with colors and their weights
  • a dictionary containing objects, attributes and colors is stored in the write collection
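
The steps above can be sketched for a single image as follows. This is a hedged outline, not the project's implementation: the three callables (`detect`, `upload`, `extract_colors`) stand in for the external services, and their signatures and return shapes are assumptions. Only the output keys ("id", "objects", "attributes", "name", "category", "Colors") are taken from the sample output on this page.

```python
from typing import Any, Callable, Dict, Optional


def process_image(
    image_url: str,
    detect: Callable[[str], Optional[Dict[str, Any]]],
    upload: Callable[[str, bytes], Optional[str]],
    extract_colors: Callable[[str], Dict[str, float]],
) -> Dict[str, Any]:
    """Run detection, upload and color extraction for one image URL."""
    record: Dict[str, Any] = {"id": image_url, "objects": [], "attributes": []}

    detection = detect(image_url)
    if detection is None:
        # Assumption: an invalid detector result yields a record with no objects.
        return record

    record["attributes"] = detection.get("attributes", [])
    for obj in detection.get("objects", []):
        uploaded_url = upload(obj["name"], obj["image"])
        if uploaded_url is None:
            continue  # colors are only extracted for successful uploads
        record["objects"].append(
            {
                "name": obj["name"],
                "category": obj["category"],
                "Colors": extract_colors(uploaded_url),
            }
        )
    return record
```

Passing the services in as callables keeps the sketch runnable with stubs and makes each step easy to test in isolation.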

External services

Output data

Important: “objects” might be empty. Therefore it may be useful to filter using:

  db.collection.find({"objects.0": {$exists: true}})

There's an index to speed this up.
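
From Python, the same filter can be expressed as a dictionary and passed to a collection's find method. The snippet below is a sketch under assumptions: the constant and helper names are hypothetical, and the helper only mirrors the filter locally for documents that are already loaded.

```python
from typing import Any, Dict

# Matches documents whose "objects" array has at least one element.
NON_EMPTY_OBJECTS_FILTER: Dict[str, Any] = {"objects.0": {"$exists": True}}


def has_objects(doc: Dict[str, Any]) -> bool:
    """Local equivalent of the filter for an already loaded document."""
    return len(doc.get("objects") or []) > 0


# With pymongo this would be used roughly as (connection details omitted):
#   for doc in collection.find(NON_EMPTY_OBJECTS_FILTER):
#       ...
```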

A sample of the data structure produced as output (currently saved to object_colors_merge collection):

  {
      "_id" : ObjectId("620679ad075cc3e802ac9294"),
      "id" : "https://assets.vogue.com/photos/61e1b65e52a47864beb085f3/master/w_2560%2Cc_limit/00001-A-Cold-Wall-Menswear-Fall-2022-Credit-Brand.jpg",
      "objects" : [ 
          {
              "name" : "00001-A-Cold-Wall-Menswear-Fall-2022-Credit-Brand_00_01_attrshoe.jpg",
              "num" : 1,
              "category" : "shoe",
              "Colors" : {
                  "#a2b9c2" : 0.000335233,
                  "#af9294" : 5.5872166e-05,
                  "#bed3bb" : 5.5872166e-05,
                  "#3a395f" : 0.0025701195,
                  "#66829a" : 0.32171193,
                  "#2e3d30" : 0.12275115,
                  "#5b4f3b" : 5.5872166e-05,
                  "#06b7e" : 0.0015644206,
                  "#46515a" : 0.4865348,
                  "#4f6b58" : 0.058721647,
                  "#82776b" : 0.005643089
              }
          },
      ],
      "attributes" : [ 
          "lapel", 
          "neckline", 
          "zipper", 
          "collar", 
          "epaulette", 
          "sleeve", 
          "flower", 
          "hood"
      ],
      "url_reference" : "https://www.vogue.com/fashion-shows/fall-2022-menswear/a-cold-wall",
      "designer" : "A-Cold-Wall",
      "season" : "FALL 2022 MENSWEAR",
      "source" : "vogue"
  }
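
As an illustration of how the "Colors" dictionary in the sample above can be consumed, the following sketch ranks an object's colors by weight. The function name `dominant_colors` is hypothetical; the data is copied from the sample (shortened to five entries).

```python
from typing import Dict, List, Tuple


def dominant_colors(colors: Dict[str, float], top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n (hex color, weight) pairs, heaviest first."""
    ranked = sorted(colors.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


sample_colors = {
    "#a2b9c2": 0.000335233,
    "#66829a": 0.32171193,
    "#2e3d30": 0.12275115,
    "#46515a": 0.4865348,
    "#4f6b58": 0.058721647,
}

print(dominant_colors(sample_colors))
# [('#46515a', 0.4865348), ('#66829a', 0.32171193), ('#2e3d30', 0.12275115)]
```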
vogue-scraper.txt · Last modified: 2024/04/11 14:23 by 127.0.0.1