====== Incidents: Best Practice ======

  * create a ticket (if not easily solvable)
  * list all affected server(s)
  * check the Grafana monitoring of the server(s)
  * also check the affected service(s)
  * perform a health / running check of the service(s)
  * search for the incident in the Wiki: [[incidents|incidents]]
  * check the logs of the service(s)
  * keep a logbook while you are investigating, then condense the text into the incident wiki format

===== Common services =====

  * Port 9000 → live SIM API

===== Netcup machines =====

  * docker ps → was the container recently restarted? all of them, or only this one?
  * docker logs → fresh timestamps and no errors?
  * check vpn: ps aux | grep vpn → empty? restart vpn as root
  * check vpn: connection to corpex possible? → no: check vpn logs as root
  * also check unattended-upgrades as root to see whether an update broke a package
  * if unsure whether the internal container state is valid → restart frontend

===== Heartbeat Monitor =====

http://sg01.picalike.corpex-kunden.de:5002/by_service should be checked to see if any “non-laptop” live frontend is in ERROR state.

==== Communication ====

**Always** send a reply to corpex so they know what is going on. Mention as many details as possible in the incident: any oddity, minor detail, etc.

Keywords: docker monitoring corpex netcup
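
Two of the Netcup machine checks above (is a vpn process running, which containers were restarted recently) could be scripted roughly as follows. This is only a sketch: the function names vpn_process_present and recently_restarted are hypothetical and not part of any existing tooling, and the status patterns assume Docker's usual human-readable status strings such as "Up 5 seconds" or "Restarting (1) 2 seconds ago".

```shell
#!/bin/sh
# Hypothetical helpers for the Netcup checks; names are illustrative only.

# Reads a process listing (e.g. from `ps aux`) on stdin.
# Succeeds if at least one line mentions "vpn" and is not the grep itself.
vpn_process_present() {
    grep -i 'vpn' | grep -qv 'grep'
}

# Reads "NAME<TAB>STATUS" lines on stdin, as produced by
#   docker ps --format '{{.Names}}\t{{.Status}}'
# Prints the names of containers that are restarting or have been up
# for about a minute or less (assumed to mean "recently restarted").
recently_restarted() {
    awk -F '\t' '
        $2 ~ /^Restarting/ ||
        $2 ~ /^Up (Less than a second|[0-9]+ seconds?|About a minute)/ { print $1 }
    '
}
```

Possible usage on a machine: pipe ps aux into vpn_process_present and restart the vpn as root if it fails; pipe docker ps --format '{{.Names}}\t{{.Status}}' into recently_restarted to see whether all containers or only one were restarted.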