====== Krawla Load Balancer ======
The Load Balancer manages tasks, workers and proxies.
* Tasks are received via a persistent queue (from the [[krawla_controller|controller]])
* Workers “log in” via a ZeroMQ Router Socket by sending a “ping” message which contains basic info about the worker
* Proxies are currently hard coded into the Load Balancer
When there are more workers than tasks we perform a **worker selection**. Otherwise we perform a **task selection**. For each task that we send out, we need to perform a **proxy selection**.
Tasks are organized by **tokens**. A token identifies a krawla session; it contains the shop_id, a hashed config key and a session counter.
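As a rough illustration of the token idea, the following sketch models a token as an immutable value and groups tasks by it. The field names, the integer ''%%shop_id%%'', the choice of SHA-1 and the helper names ''%%make_token%%'' and ''%%group_by_token%%'' are assumptions made for this example; they are not taken from the krawla code.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Token:
    """Identifies one krawla session (field names are illustrative)."""
    shop_id: int
    config_key_hash: str   # hash of the session's config key
    session_counter: int   # distinguishes successive sessions


def make_token(shop_id: int, config_key: str, session_counter: int) -> Token:
    # SHA-1 is an assumption; the text only says that the config key is hashed.
    digest = hashlib.sha1(config_key.encode("utf-8")).hexdigest()
    return Token(shop_id, digest, session_counter)


def group_by_token(tasks):
    """Group (token, payload) pairs into per-token queues, keeping arrival order."""
    queues = {}
    for token, payload in tasks:
        queues.setdefault(token, []).append(payload)
    return queues
```

Because the dataclass is frozen, tokens are hashable and can be used directly as dictionary keys for the per-token queues.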
Git: https://git.picalike.corpex-kunden.de/krawla/utils
==== Worker Selection ====
Currently we perform a task selection for each idle worker while there are tasks available.
==== Task Selection ====
* take all tokens
* calculate the **priority** for each token
* starting from the token with the highest priority: try to find a proxy
* if no proxy is found, move on to the next token
* else send task with proxy information to worker
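The steps above can be sketched as a short loop. This is a minimal sketch, not the code from ''%%lb_master.py%%'': ''%%select_task%%'', ''%%priority_of%%'' and ''%%find_proxy%%'' are hypothetical names, and the priority and proxy lookups are passed in as plain callables.

```python
def select_task(tokens, priority_of, find_proxy):
    """Return (token, proxy) for the highest-priority token that has a proxy, else None.

    tokens      -- iterable of tokens with waiting tasks
    priority_of -- callable token -> number (higher means more urgent)
    find_proxy  -- callable token -> proxy, or None if no proxy is usable right now
    """
    # Walk the tokens from highest to lowest priority.
    for token in sorted(tokens, key=priority_of, reverse=True):
        proxy = find_proxy(token)
        if proxy is None:
            continue  # no proxy for this token, move on to the next one
        return token, proxy
    return None  # no token could be served


# Usage with made-up data: shop_b has the highest priority but no free proxy,
# so the next token in priority order (shop_a) is chosen.
priorities = {"shop_a": 3.0, "shop_b": 5.0, "shop_c": 1.0}
proxies = {"shop_a": "proxy-1", "shop_b": None, "shop_c": "proxy-2"}
result = select_task(priorities, priorities.get, proxies.get)
```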
The **priority** is calculated from the following metrics:
* last sent message
* age of waiting message
* length of queue
* fraction of blocked proxies
For the details of the priority calculation, see ''%%src/krawla/utils/lb_master.py%%'' (search for ''%%find_task_for_worker%%'') in the Git repository.
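To make the idea of combining these metrics concrete, here is a purely illustrative score. The real formula lives in ''%%find_task_for_worker%%'' and may look very different: the linear form, the weights, the sign of each term and all names below (''%%TokenStats%%'', ''%%priority%%'') are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    seconds_since_last_sent: float   # time since a task of this token was last sent
    oldest_waiting_age: float        # age in seconds of the oldest waiting message
    queue_length: int                # number of waiting messages
    blocked_proxy_fraction: float    # share of blocked proxies, 0.0 to 1.0


def priority(stats, w_idle=1.0, w_age=1.0, w_queue=0.1, w_blocked=100.0):
    # Hypothetical weighting: longer waits and longer queues raise the score,
    # a high fraction of blocked proxies lowers it.
    return (w_idle * stats.seconds_since_last_sent
            + w_age * stats.oldest_waiting_age
            + w_queue * stats.queue_length
            - w_blocked * stats.blocked_proxy_fraction)
```

Under these assumed weights, a token whose proxies are mostly blocked drops behind tokens that can actually be served, which would avoid retrying it on every selection round.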
==== Proxy Selection ====
The proxy assignment is currently done by a class called ProxyProviderHelper in ''%%git:/src/krawla/utils/proxy_provider_helper.py%%''.
We keep track of the following information:
* list of available proxies
* last usage of proxy per shop_id
* temporary blocks per shop_id
* number of times a proxy has been returned (roughly equivalent to the number of tasks that the proxy had to do)
* number of times a proxy has been blocked because of a timeout
When selecting a proxy we consider the following information:
* shop_id
* crawl_delay
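The bookkeeping and the selection inputs listed above could interact roughly as follows. This is a sketch under stated assumptions, not the real ProxyProviderHelper: the class name ''%%ProxyTracker%%'', its methods, the rule that a proxy is skipped while it is blocked for the shop or was used for that shop less than ''%%crawl_delay%%'' seconds ago, and the "least often returned" tie-break are all hypothetical.

```python
import time


class ProxyTracker:
    """Illustrative per-shop proxy bookkeeping; not the real ProxyProviderHelper."""

    def __init__(self, proxies, clock=time.monotonic):
        self.proxies = list(proxies)  # list of available proxies
        self.clock = clock            # injectable clock, useful for testing
        self.last_used = {}           # (proxy, shop_id) -> time of last usage
        self.blocked_until = {}       # (proxy, shop_id) -> end of temporary block
        self.returned = {}            # proxy -> number of times it was handed out
        self.timeouts = {}            # proxy -> number of blocks caused by timeouts

    def block(self, proxy, shop_id, seconds):
        """Temporarily block a proxy for one shop, e.g. after a timeout."""
        self.blocked_until[(proxy, shop_id)] = self.clock() + seconds
        self.timeouts[proxy] = self.timeouts.get(proxy, 0) + 1

    def select(self, shop_id, crawl_delay):
        """Return a usable proxy for shop_id, or None if none qualifies."""
        now = self.clock()
        candidates = []
        for proxy in self.proxies:
            key = (proxy, shop_id)
            until = self.blocked_until.get(key)
            if until is not None and until > now:
                continue  # still blocked for this shop
            last = self.last_used.get(key)
            if last is not None and now - last < crawl_delay:
                continue  # would violate the crawl delay for this shop
            candidates.append(proxy)
        if not candidates:
            return None
        # Assumption: prefer the proxy that has been handed out least often.
        proxy = min(candidates, key=lambda p: self.returned.get(p, 0))
        self.last_used[(proxy, shop_id)] = now
        self.returned[proxy] = self.returned.get(proxy, 0) + 1
        return proxy
```

Keying usage and blocks by ''%%(proxy, shop_id)%%'' means a proxy that is resting for one shop can still serve another shop in the meantime.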
==== LB_master <-> LB_client protocol ====
LB_master → LB_client
LB_master sends the tasks that it receives from the controller to the LB_client, which passes them on to a worker.
LB_client → LB_master
LB_client sends all messages from the worker to the LB_master. If the LB_master receives the command 'done' or 'error', then the task is considered finished.
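The completion rule can be sketched as a small piece of bookkeeping on the LB_master side. The message layout (a dict with ''%%command%%'' and ''%%task_id%%'' keys) and the class ''%%InFlightTasks%%'' are assumptions made for this example; only the two final commands come from the description above.

```python
FINAL_COMMANDS = {"done", "error"}  # commands that finish a task


class InFlightTasks:
    """Illustrative LB_master-side record of tasks that were sent out."""

    def __init__(self):
        self.in_flight = {}  # task_id -> worker_id

    def sent(self, task_id, worker_id):
        """Remember that a task was handed to a worker."""
        self.in_flight[task_id] = worker_id

    def on_worker_message(self, message):
        """Handle a worker message forwarded by LB_client.

        Returns True if the message finished a task that was in flight.
        """
        if message.get("command") in FINAL_COMMANDS:
            # pop() with a default tolerates duplicate or unknown final messages
            return self.in_flight.pop(message.get("task_id"), None) is not None
        return False  # any other command leaves the task running
```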
==== Proxy Services ====
* PrivateNetKey
  * url: …
  * date: 2020-02
  * advantages: many IPs at a low price
  * disadvantages: timeouts
* Oxylabs
  * url: https://oxylabs.io
  * date: 2020-03
  * advantages: connections from any number of workers
  * disadvantages: expensive
* Luminati: complex integration