Krawla Load Balancer

The Load Balancer manages tasks, workers and proxies.

  • Tasks are received via a persistent queue (from the controller)
  • Workers “log in” via a ZeroMQ Router Socket by sending a “ping” message which contains basic info about the worker
  • Proxies are currently hard coded into the Load Balancer
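The worker login can be illustrated with a small registry that processes the frames a ROUTER socket delivers (sender identity first, payload last). This is only a sketch: the JSON payload, the `command` and `info` keys, and the `WorkerRegistry` class are assumptions for illustration, not the message format used by krawla.

```python
import json
import time
from typing import Dict, List, Optional


class WorkerRegistry:
    """Remembers workers that have sent a ping (hypothetical helper)."""

    def __init__(self) -> None:
        self.workers: Dict[bytes, dict] = {}  # socket identity -> worker info

    def handle_frames(self, frames: List[bytes], now: Optional[float] = None) -> bool:
        """Register the sender if the message is a ping; return True if it was."""
        # A ROUTER socket prepends the sender identity to every message.
        identity, payload = frames[0], frames[-1]
        message = json.loads(payload.decode("utf-8"))  # assumed: JSON payload
        if message.get("command") != "ping":
            return False
        info = dict(message.get("info", {}))
        info["last_seen"] = time.time() if now is None else now
        self.workers[identity] = info
        return True
```

With pyzmq, the frames would typically come from `socket.recv_multipart()` on the ROUTER socket.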

When there are more workers than tasks we perform a worker selection. Otherwise we perform a task selection. For each task that we send out, we need to perform a proxy selection.

Tasks are organized by tokens. A token identifies a krawla session and contains the shop_id, a hashed config key and a session counter.
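For illustration, a token could be modeled as a small immutable value object. This is a minimal sketch: the field names `config_hash` and `session_counter`, the use of SHA-1, and the string layout are assumptions, not the format used in the krawla code.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Identifies one krawla session (hypothetical layout)."""
    shop_id: str
    config_hash: str      # hash of the shop config (assumed: SHA-1 hex digest)
    session_counter: int  # assumed: distinguishes sessions of the same shop

    def __str__(self) -> str:
        return f"{self.shop_id}:{self.config_hash}:{self.session_counter}"


def make_token(shop_id: str, config: dict, session_counter: int) -> Token:
    # Serialize with sorted keys so equal configs always hash to the same value.
    raw = json.dumps(config, sort_keys=True).encode("utf-8")
    return Token(shop_id, hashlib.sha1(raw).hexdigest(), session_counter)
```

Because the dataclass is frozen, tokens are hashable and can be used directly as dictionary keys for per-token task queues.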

Git: https://git.picalike.corpex-kunden.de/krawla/utils

Worker Selection

Currently we perform a task selection for each idle worker while there are tasks available.
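The loop described here might look roughly like the following sketch. The function name `dispatch` and the callback parameters are hypothetical; they only stand in for whatever the Load Balancer uses internally.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def dispatch(
    idle_workers: Iterable[str],
    has_tasks: Callable[[], bool],
    select_task: Callable[[str], Optional[dict]],
) -> List[Tuple[str, dict]]:
    """Run one task selection per idle worker while tasks are available."""
    assignments = []
    for worker in idle_workers:
        if not has_tasks():
            break  # queue drained, remaining workers stay idle
        task = select_task(worker)
        if task is not None:  # None: the task selection found nothing to send
            assignments.append((worker, task))
    return assignments
```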

Task Selection

  1. Take all tokens.
  2. Calculate the priority for each token.
  3. Starting from the token with the highest priority, try to find a proxy:
       • if no proxy is found, move on to the next token
       • else send the task with the proxy information to the worker
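The steps above can be sketched as a single function. This is not the actual `find_task_for_worker` implementation: the name `find_task`, the per-token queue dictionary and the two callbacks are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple


def find_task(
    queues: Dict[str, List[dict]],
    priority: Callable[[str], float],
    find_proxy: Callable[[str], Optional[str]],
) -> Optional[Tuple[dict, str]]:
    """Return (task, proxy) for the best token that has a usable proxy."""
    # Take all tokens that have waiting tasks and rank them by priority.
    tokens = [token for token, queue in queues.items() if queue]
    for token in sorted(tokens, key=priority, reverse=True):
        # Try to find a proxy, starting from the highest priority.
        proxy = find_proxy(token)
        if proxy is None:
            continue  # no proxy found, move on to the next token
        return queues[token].pop(0), proxy  # task to send, with its proxy
    return None  # no token has a usable proxy right now
```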

Priority is calculated taking the following metrics into account:

  1. last sent message
  2. age of waiting message
  3. length of queue
  4. fraction of blocked proxies

For a detailed view of the priority calculation, see src/krawla/utils/lb_master.py in the Git repository (search for find_task_for_worker).
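To make the idea concrete, here is one way the four metrics could be combined into a single score. The weights, the linear formula and the direction in which each metric influences the score are invented for illustration; the real calculation is the one in lb_master.py.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    last_sent: float         # timestamp of the last task sent for this token
    oldest_waiting: float    # timestamp at which the oldest waiting task arrived
    queue_length: int        # number of waiting tasks
    blocked_fraction: float  # share of proxies currently blocked, 0.0 to 1.0


def priority(stats: TokenStats, now: float) -> float:
    """Combine the four metrics into one score (weights are made up)."""
    idle_time = now - stats.last_sent
    waiting_age = now - stats.oldest_waiting
    score = 1.0 * idle_time + 2.0 * waiting_age + 0.1 * stats.queue_length
    # Assumption: tokens whose proxies are mostly blocked are scaled down.
    return score * (1.0 - stats.blocked_fraction)
```

In this sketch a token that has waited longer outranks a recently served one, and a token with all proxies blocked scores zero.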

Proxy Selection

The proxy assignment is currently done by a class called ProxyProviderHelper in git:/src/krawla/utils/proxy_provider_helper.py.

We keep track of the following information:

  • list of available proxies
  • last usage of proxy per shop_id
  • temporary blocks per shop_id
  • number of times a proxy has been returned (roughly equivalent to the number of tasks that the proxy had to do)
  • number of times a proxy has been blocked because of a timeout

When selecting a proxy we consider the following information:

  • shop_id
  • crawl_delay
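The bookkeeping and the selection inputs listed above could fit together as in the following sketch. `ProxyPool` is a hypothetical stand-in and does not mirror the real ProxyProviderHelper API; preferring the least-returned proxy is likewise an assumption.

```python
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (proxy, shop_id)


class ProxyPool:
    """Hypothetical stand-in for ProxyProviderHelper, not its real API."""

    def __init__(self, proxies: List[str]) -> None:
        self.proxies = list(proxies)              # available proxies
        self.last_used: Dict[Key, float] = {}     # last usage per shop_id
        self.blocked_until: Dict[Key, float] = {} # temporary blocks per shop_id
        self.times_returned: Dict[str, int] = defaultdict(int)
        self.times_blocked: Dict[str, int] = defaultdict(int)

    def block(self, proxy: str, shop_id: str, seconds: float,
              now: Optional[float] = None) -> None:
        """Temporarily block a proxy for one shop, e.g. after a timeout."""
        now = time.time() if now is None else now
        self.blocked_until[(proxy, shop_id)] = now + seconds
        self.times_blocked[proxy] += 1

    def get_proxy(self, shop_id: str, crawl_delay: float,
                  now: Optional[float] = None) -> Optional[str]:
        """Return a proxy usable for shop_id, or None if none qualifies."""
        now = time.time() if now is None else now
        # Assumption: try the proxies that were returned least often first.
        for proxy in sorted(self.proxies, key=lambda p: self.times_returned[p]):
            key = (proxy, shop_id)
            if self.blocked_until.get(key, 0.0) > now:
                continue  # temporarily blocked for this shop
            if now - self.last_used.get(key, float("-inf")) < crawl_delay:
                continue  # would violate the crawl delay for this shop
            self.last_used[key] = now
            self.times_returned[proxy] += 1
            return proxy
        return None
```

Keying usage and blocks by (proxy, shop_id) means a proxy that is resting or blocked for one shop can still serve another.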

LB_master <-> LB_client protocol

LB_master → LB_client

LB_master sends the tasks that it receives from the controller to the LB_client, which passes them on to a worker.

LB_client → LB_master

LB_client forwards all messages from the worker to the LB_master. If the LB_master receives the command 'done' or 'error', then the task is considered finished.
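The finishing rule can be sketched as follows. Only the command names 'done' and 'error' come from the text above; the dictionary message layout, the `task_id` key and the `TaskTracker` class are assumptions for illustration.

```python
from typing import Dict, Set

FINAL_COMMANDS = {"done", "error"}  # commands that finish a task


class TaskTracker:
    """Tracks which tasks are still running (hypothetical helper)."""

    def __init__(self) -> None:
        self.running: Set[str] = set()

    def on_task_sent(self, task_id: str) -> None:
        self.running.add(task_id)

    def on_worker_message(self, message: Dict[str, str]) -> bool:
        """Return True if the message finishes its task."""
        if message.get("command") in FINAL_COMMANDS:
            self.running.discard(message.get("task_id", ""))
            return True
        return False  # any other command leaves the task open
```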

Proxy Services

  • PrivateNetKey
      • url: …
      • date: 2020-02
      • advantages: many IPs for little money
      • disadvantages: timeouts
  • Oxylabs
      • date: 2020-03
      • advantages: connections from any number of workers
      • disadvantages: expensive
  • Luminati: complex integration
krawla_load_balancer.txt · Last modified: 2024/04/11 14:23 by 127.0.0.1