====== Incident Handling ======
There is an emergency plan that defines who is responsible for handling incidents during a given week: https://docs.google.com/spreadsheets/d/11oJ5IeFDJ_d4FGLNWdWtcMBWudWBqfGXMyzXy_u-hio/edit#gid=0
The “alerts” Slack channel is used to report (fatal) errors that need to be analyzed. The only exception is “krawla-qa-service”: its messages are emitted by the krawla QA and are not really alerts that can be handled.
===== Alerting =====
There are different ways to send an alert. We have a Slack hook that a service can use to send errors directly, but there are also alerting scripts.
dev01:~/bin/v5_slack_monitoring.py
contains a list of critical services that are monitored directly via /health or some other command, for example for databases.
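The actual script is Python; the following is only a minimal bash sketch of the idea, assuming a Slack incoming webhook (the webhook URL is a placeholder, the service URLs are taken from the example alerts below):
<code bash>
# Sketch of the monitoring idea (NOT the real v5_slack_monitoring.py):
# poll each service's /health endpoint and post failures to a Slack webhook.
SLACK_HOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder
SERVICES="http://sandy.picalike.corpex-kunden.de:8042 http://dev01.picalike.corpex-kunden.de:3000"

for url in $SERVICES; do
  status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url/health")
  if [ "$status" != "200" ]; then
    # send the alert text as a JSON payload to the Slack incoming webhook
    curl -s -X POST -H 'Content-Type: application/json' \
      -d "{\"text\": \"ERROR $url down: $url/health returned status code $status\"}" \
      "$SLACK_HOOK"
  fi
done
</code>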
==== Responsibilities ====
An incident handler has to monitor the alerts channel for errors during the workday (8h). The emergency plan lists contacts in case an incident affects customer-facing services outside working hours. Corpex is responsible for hosted services, while the handler is responsible for our Docker services.
Since we have a lot of services, it is hard to keep track of all of them. Nevertheless, even if the handler is not familiar with the service that emitted an alert, something has to be done!
Steps (a compact sketch follows the list):
  - Find out where the service is running (log message)
  - Find out what kind of service it is (docker ps, map port)
  - Take a look into the recent logs (docker logs)
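A hypothetical walk-through of these steps on the affected host; the port and container name are placeholders, the detailed example further below uses the same commands:
<code bash>
# assuming the alert mentions port 8042 on this host
docker ps | grep 8042                      # which container serves that port?
docker logs --tail 200 name_of_container   # recent logs, look for exceptions
</code>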
If no technical contact for the service is available, a fallback is always to restart the container: docker restart name_of_container
If this does not fix the problem, the goal should be to isolate it (see the checks below):
  * an external service that is required does not respond
  * check the free hard disk space / RAM
  * an exception might contain a reference to the code / function / line
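A few quick checks for the first two points; the dependency URL is just a placeholder:
<code bash>
df -h       # is a disk / partition full?
free -h     # is the machine out of RAM?
# does a required external service respond? (URL is an example)
curl -sv --max-time 10 http://some-dependency:8080/health
</code>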
It is always useful to identify the git repository of the service, since the README might give hints on known issues and what to do in such cases.
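One way to find the repository is via the image the container was started from, since the image name usually matches the repository name (the container name is a placeholder):
<code bash>
# print the image of the container
docker inspect --format '{{.Config.Image}}' name_of_container
</code>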
The last functional step before an incident can be closed is to “ping” the service again, either by connecting to it directly or by using the /health endpoint.
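For example, with host and port taken from the alert:
<code bash>
# a 200 response means the service is reachable again
curl -s -o /dev/null -w '%{http_code}\n' http://sandy.picalike.corpex-kunden.de:8042/health
</code>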
At the end of the process, the incident must be documented in the DokuWiki.
=== Examples of alerts ===
In general it might be useful to consult the Grafana monitoring to see if there were spikes in the CPU / RAM / IO usage at the time of the alert.
(1)
ERROR http://sandy.picalike.corpex-kunden.de:8042 down: http://sandy.picalike.corpex-kunden.de:8042/health returned status code 500
ERROR http://dev01.picalike.corpex-kunden.de:3000 down: http://dev01.picalike.corpex-kunden.de:3000/health returned status code 500
or
(2)
sandy.picalike.corpex-kunden.de]: http://localhost:10007/person/detector: no response -> removed extractor from the missing worker
How to read this format? It is not unified, but it should always contain the following information:
  * the full hostname the service is running on
  * the port of the service, if possible
  * a textual summary of the problem, or at least the status
For incident (1) we can see that two hosts are involved, sandy and dev01, and the ports of the services are also present: 8042 and 3000.
How to get more information?
Log in to the specific machine and use
docker ps | grep port
This will give you the affected service.
How to get the logs? Once you have identified the responsible Docker container, you can use
docker logs --tail 1000 -tf name_of_container
This will likely give you a hint, e.g. an exception or the error log.
For system services, such as the “global” MongoDB servers, the responsible DevOps contact might be at Corpex. The logs in /var/log/ are usually not readable for all developers.
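If you do have sufficient permissions on the machine, a quick look at the service status and its logs can still help. This is only a sketch; the unit name and log path are assumptions and depend on the host:
<code bash>
# check whether the system service is running (unit name is an assumption)
systemctl status mongod
# system logs usually require elevated permissions (path is an assumption)
sudo tail -n 100 /var/log/mongodb/mongod.log
</code>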