Krawla Load Balancer
The Load Balancer manages tasks, workers and proxies.
- Tasks are received via a persistent queue (from the controller)
- Workers “log in” via a ZeroMQ Router Socket by sending a “ping” message which contains basic info about the worker
- Proxies are currently hard coded into the Load Balancer
When there are more workers than tasks we perform a worker selection. Otherwise we perform a task selection. For each task that we send out, we need to perform a proxy selection.
Tasks are organized by tokens. A token identifies a krawla session and contains the shop_id, a hashed config key and a session counter.
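As a rough illustration of the token described above, the three parts could be modelled as an immutable record so that a token can be used as a dictionary key for per-session task queues. The names `Token`, `make_token` and `config_hash`, the field types, and the use of SHA-1 are assumptions made for this sketch; the actual krawla representation may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Identifies one krawla session (field names and types are assumptions)."""
    shop_id: int
    config_hash: str       # hashed config key
    session_counter: int


def make_token(shop_id: int, config_key: str, session_counter: int) -> Token:
    # SHA-1 is an illustrative choice; the page does not name the hash function.
    digest = hashlib.sha1(config_key.encode("utf-8")).hexdigest()
    return Token(shop_id, digest, session_counter)
```

Because the dataclass is frozen, two tokens built from the same values compare equal and hash identically, so tasks of the same session end up in the same queue.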
Worker Selection
Currently we perform a task selection for each idle worker while there are tasks available.
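The loop described here might look roughly like the following sketch. `assign_tasks` and the `select_task` callback are hypothetical names introduced for illustration; the callback is assumed to remove and return one task from the pending list, or to return None when nothing can be sent to that worker.

```python
from collections import deque


def assign_tasks(idle_workers, pending_tasks, select_task):
    """Run one task selection per idle worker while tasks are available.

    select_task(worker, pending_tasks) is a hypothetical callback that
    removes and returns a task, or returns None if nothing can be sent.
    """
    assignments = []
    idle = deque(idle_workers)
    while idle and pending_tasks:
        worker = idle.popleft()
        task = select_task(worker, pending_tasks)
        if task is None:
            continue  # nothing sendable for this worker, try the next one
        assignments.append((worker, task))
    return assignments
```

The loop always terminates because every iteration consumes one idle worker, even when the selection yields no task.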
Task Selection
- take all tokens
- calculate the priority for each token
- starting from the token with the highest priority: try to find a proxy
  - if no proxy is found, move on to the next token
  - else send task with proxy information to worker
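The steps above can be sketched as follows. This is not the code from lb_master.py; `find_task`, the `queues` dictionary, and the `priority` and `find_proxy` callables are hypothetical stand-ins chosen for this example.

```python
def find_task(queues, priority, find_proxy):
    """Return (token, task, proxy) for the best sendable token, else None.

    queues:     dict mapping token -> list of waiting tasks
    priority:   hypothetical callable, token -> number (higher goes first)
    find_proxy: hypothetical callable, token -> proxy or None
    """
    # Only tokens that actually have waiting tasks are candidates.
    candidates = [token for token, tasks in queues.items() if tasks]
    for token in sorted(candidates, key=priority, reverse=True):
        proxy = find_proxy(token)
        if proxy is None:
            continue  # no proxy for this token, move on to the next one
        task = queues[token].pop(0)
        return token, task, proxy
    return None
```

A token whose proxies are all unavailable is skipped rather than blocking the queue, so lower-priority tokens can still make progress.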
Priority is calculated taking the following metrics into account:
- last sent message
- age of waiting message
- length of queue
- fraction of blocked proxies
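One possible way to combine these four metrics is a weighted sum, sketched below. The weights, the direction in which each metric moves the score, and the names `TokenStats` and `priority` are all assumptions for illustration; the real formula is not given here and may look quite different.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    seconds_since_last_sent: float   # time since the last sent message
    oldest_waiting_seconds: float    # age of the oldest waiting message
    queue_length: int                # number of waiting tasks
    blocked_proxy_fraction: float    # between 0.0 and 1.0


def priority(stats, w_idle=1.0, w_age=1.0, w_len=0.1, w_blocked=100.0):
    # Assumed direction: idle, old and long queues raise the priority.
    score = (w_idle * stats.seconds_since_last_sent
             + w_age * stats.oldest_waiting_seconds
             + w_len * stats.queue_length)
    # Assumed direction: many blocked proxies lower the priority.
    return score - w_blocked * stats.blocked_proxy_fraction
```

Under these assumed weights, a token with half of its proxies blocked ranks below an otherwise identical token with none blocked.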
For a detailed view of the priority calculation, check out src/krawla/utils/lb_master.py in the git repository (search for: find_task_for_worker).
Proxy Selection
The proxy assignment is currently done by a class called ProxyProviderHelper in git:/src/krawla/utils/proxy_provider_helper.py.
We keep track of the following information:
- list of available proxies
- last usage of proxy per shop_id
- temporary blocks per shop_id
- number of times a proxy has been returned (roughly equivalent to the number of tasks that the proxy had to do)
- number of times a proxy has been blocked because of a timeout
When selecting a proxy we consider the following information:
- shop_id
- crawl_delay
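The bookkeeping and selection described above could be sketched as follows. `ProxyPicker` is an illustrative stand-in and not the actual ProxyProviderHelper; its method names, the tie-breaking rule (least often returned proxy first) and the explicit `now` parameter are assumptions made to keep the example small and testable.

```python
import time


class ProxyPicker:
    """Illustrative stand-in, not the actual ProxyProviderHelper."""

    def __init__(self, proxies):
        self.proxies = list(proxies)        # list of available proxies
        self.last_used = {}                 # (shop_id, proxy) -> timestamp
        self.blocked_until = {}             # (shop_id, proxy) -> timestamp
        self.times_returned = {p: 0 for p in self.proxies}
        self.timeout_blocks = {p: 0 for p in self.proxies}

    def block_after_timeout(self, shop_id, proxy, seconds, now=None):
        now = time.time() if now is None else now
        self.blocked_until[(shop_id, proxy)] = now + seconds
        self.timeout_blocks[proxy] += 1

    def select(self, shop_id, crawl_delay, now=None):
        now = time.time() if now is None else now
        usable = []
        for proxy in self.proxies:
            key = (shop_id, proxy)
            if self.blocked_until.get(key, 0.0) > now:
                continue  # temporarily blocked for this shop
            last = self.last_used.get(key)
            if last is not None and now - last < crawl_delay:
                continue  # crawl delay for this shop has not elapsed
            usable.append(proxy)
        if not usable:
            return None
        # Assumed tie-breaker: prefer the proxy returned least often so far.
        proxy = min(usable, key=lambda p: self.times_returned[p])
        self.last_used[(shop_id, proxy)] = now
        self.times_returned[proxy] += 1
        return proxy
```

Keying usage and blocks by (shop_id, proxy) means a proxy that is waiting out the crawl delay for one shop can still serve another shop.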
LB_master <-> LB_client protocol
LB_master → LB_client
LB_master sends tasks that it receives from the controller to the LB_client where it is passed on to a worker.
LB_client → LB_master
LB_client sends all messages from the worker to the LB_master. If the LB_master receives the command 'done' or 'error', then the task is considered finished.
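On the LB_master side, the handling of these messages might look like the sketch below. The message layout (a dict with 'command' and 'task_id' keys), the `in_flight` mapping and the function name are assumptions; the actual wire format is not documented here.

```python
FINISHING_COMMANDS = {"done", "error"}


def handle_worker_message(in_flight, message):
    """Update in-flight tasks; return True if the message finished a task.

    in_flight: dict mapping task_id -> worker (hypothetical bookkeeping)
    message:   assumed to be a dict with 'command' and 'task_id' keys
    """
    if message["command"] in FINISHING_COMMANDS:
        # Both outcomes finish the task, whether it succeeded or failed.
        in_flight.pop(message["task_id"], None)
        return True
    return False
```

Any other command leaves the task in flight, so intermediate messages from the worker do not free it up prematurely.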
Proxy Services
- PrivateNetKey
  - url: …
  - date: 2020-02
  - advantages: many IPs for a low amount of money
  - disadvantages: timeouts
- Oxylabs
  - url: https://oxylabs.io
  - date: 2020-03
  - advantages: connection from any amount of workers
  - disadvantages: expensive
- Luminati: complex integration