====== Krawla Load Balancer ======

The Load Balancer manages tasks, workers and proxies.

  * Tasks are received via a persistent queue (from the [[krawla_controller|controller]])
  * Workers “log in” via a ZeroMQ ROUTER socket by sending a “ping” message which contains basic info about the worker
  * Proxies are currently hard coded into the Load Balancer

When there are more workers than tasks we perform a **worker selection**. Otherwise we perform a **task selection**. For each task that we send out, we need to perform a **proxy selection**.

Tasks are organized by **tokens**. A token identifies a krawla session; it contains the shop_id, a hashed config key and a session counter.

Git: https://git.picalike.corpex-kunden.de/krawla/utils

==== Worker Selection ====

Currently we perform a task selection for each idle worker while there are tasks available.

==== Task Selection ====

  * take all tokens
  * calculate the **priority** for each token
  * starting from the token with the highest priority: try to find a proxy
    * if no proxy is found, move on to the next token
    * else send task with proxy information to worker

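
The steps above can be sketched roughly as follows. This is only an illustration, not the code from ''%%lb_master.py%%'': the function name ''%%select_task%%'' and the two callbacks ''%%priority_of%%'' and ''%%find_proxy%%'' are hypothetical names chosen for this example.

```python
from typing import Callable, Optional, Sequence, Tuple


def select_task(
    tokens: Sequence[str],
    priority_of: Callable[[str], float],        # hypothetical: priority for a token
    find_proxy: Callable[[str], Optional[str]],  # hypothetical: free proxy or None
) -> Optional[Tuple[str, str]]:
    """Return (token, proxy) for the highest-priority token that has a proxy."""
    # Walk the tokens from highest to lowest priority.
    for token in sorted(tokens, key=priority_of, reverse=True):
        proxy = find_proxy(token)
        if proxy is None:
            # No proxy available for this token: move on to the next one.
            continue
        # A proxy was found: this task can be sent to the worker.
        return token, proxy
    # No token could be matched with a proxy.
    return None


# Toy data: shop2 has the highest priority but no free proxy.
priorities = {"shop1": 0.2, "shop2": 0.9, "shop3": 0.5}
free_proxies = {"shop1": "proxy-a", "shop3": "proxy-b"}
print(select_task(list(priorities), priorities.get, free_proxies.get))
# prints ('shop3', 'proxy-b')
```

In the toy data the highest-priority token is skipped because no proxy is available for it, so the next token in priority order gets the worker.
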
The **priority** is calculated from the following metrics:

  * last sent message
  * age of waiting message
  * length of queue
  * fraction of blocked proxies

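
One possible way to combine these four metrics into a single number is a weighted sum. The sketch below is purely illustrative: the weights, the sign of each term and all names (''%%TokenStats%%'', ''%%priority_sketch%%'') are assumptions made for this example and are not taken from the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class TokenStats:
    """Hypothetical per-token metrics; field names are assumptions."""
    seconds_since_last_sent: float  # time since a message was last sent for this token
    oldest_waiting_age: float       # age in seconds of the oldest waiting message
    queue_length: int               # number of waiting tasks for this token
    blocked_proxy_fraction: float   # share of proxies blocked for this shop, 0.0 to 1.0


def priority_sketch(
    stats: TokenStats,
    w_idle: float = 1.0,      # assumed weight
    w_age: float = 1.0,       # assumed weight
    w_queue: float = 0.1,     # assumed weight
    w_blocked: float = 100.0,  # assumed weight
) -> float:
    """Weighted sum: assumed to rise with waiting time and queue length,
    and to fall when many proxies are blocked."""
    return (
        w_idle * stats.seconds_since_last_sent
        + w_age * stats.oldest_waiting_age
        + w_queue * stats.queue_length
        - w_blocked * stats.blocked_proxy_fraction
    )


healthy = TokenStats(10.0, 5.0, 3, 0.0)
half_blocked = TokenStats(10.0, 5.0, 3, 0.5)
# Under these assumptions the token with blocked proxies ranks lower.
print(priority_sketch(healthy) > priority_sketch(half_blocked))
```
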
For a detailed view of the priority calculation, see ''%%src/krawla/utils/lb_master.py%%'' (search for ''%%find_task_for_worker%%'') in the git repository.

==== Proxy Selection ====

The proxy assignment is currently done by a class called ProxyProviderHelper in ''%%git:/src/krawla/utils/proxy_provider_helper.py%%''.

We keep track of the following information:

  * list of available proxies
  * last usage of proxy per shop_id
  * temporary blocks per shop_id
  * number of times a proxy has been returned (roughly equivalent to the number of tasks that the proxy had to do)
  * number of times a proxy has been blocked because of a timeout

When selecting a proxy we consider the following information:

  * shop_id
  * crawl_delay

==== LB_master <-> LB_client protocol ====

LB_master → LB_client

LB_master sends tasks that it receives from the controller to the LB_client, where they are passed on to a worker.

LB_client → LB_master

LB_client sends all messages from the worker to the LB_master. If the command 'done' or 'error' is received in the LB_master, then the task is considered finished.

==== Proxy Services ====

  * PrivateNetKey
    * url: …
    * date: 2020-02
    * advantages: many IPs for a low amount of money
    * disadvantages: timeouts
  * Oxylabs
    * url: https://oxylabs.io
    * date: 2020-03
    * advantages: connection from any number of workers
    * disadvantages: expensive
  * Luminati: complex integration
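
To illustrate the bookkeeping listed under Proxy Selection, here is a minimal sketch of a proxy pool that respects a per-shop crawl delay and temporary blocks. It is not the real ProxyProviderHelper: the class ''%%ProxyPoolSketch%%'', its methods and the first-fit selection strategy are assumptions made for this example.

```python
import time
from typing import Dict, List, Optional, Tuple


class ProxyPoolSketch:
    """Hypothetical stand-in for the proxy bookkeeping; not the real class."""

    def __init__(self, proxies: List[str]) -> None:
        self.proxies = list(proxies)                             # available proxies
        self.last_used: Dict[Tuple[str, str], float] = {}        # (proxy, shop_id) -> timestamp
        self.blocked_until: Dict[Tuple[str, str], float] = {}    # (proxy, shop_id) -> timestamp
        self.times_returned: Dict[str, int] = {p: 0 for p in self.proxies}

    def block(self, proxy: str, shop_id: str, seconds: float,
              now: Optional[float] = None) -> None:
        """Temporarily block a proxy for one shop."""
        now = time.time() if now is None else now
        self.blocked_until[(proxy, shop_id)] = now + seconds

    def select(self, shop_id: str, crawl_delay: float,
               now: Optional[float] = None) -> Optional[str]:
        """Return the first proxy that is neither blocked nor inside the crawl delay."""
        now = time.time() if now is None else now
        for proxy in self.proxies:
            key = (proxy, shop_id)
            if self.blocked_until.get(key, 0.0) > now:
                continue  # still temporarily blocked for this shop
            last = self.last_used.get(key)
            if last is not None and now - last < crawl_delay:
                continue  # used too recently for this shop
            self.last_used[key] = now
            self.times_returned[proxy] += 1
            return proxy
        return None  # no proxy available right now


pool = ProxyPoolSketch(["p1", "p2"])
print(pool.select("shop_a", crawl_delay=10, now=0))  # prints p1
print(pool.select("shop_a", crawl_delay=10, now=1))  # prints p2
print(pool.select("shop_a", crawl_delay=10, now=2))  # prints None
```

The ''%%now%%'' parameter exists only to make the example deterministic; when it is omitted the sketch falls back to ''%%time.time()%%''.
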