Incidents: Best Practice
- create a ticket (if not easily solvable)
- list all affected server(s)
- check the Grafana monitoring of the server(s)
- also check the affected service(s)
- perform a health / running check of service(s)
- search for incident in the Wiki: incidents
- check logs of service(s)
- keep a logbook while you are investigating, then condense the text into the incident wiki format
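A logbook is easier to condense later if every note carries a timestamp. The following is a minimal sketch of one way to do that from the shell; the function name `log_note` and the default file path are hypothetical and not part of any existing tooling.

```shell
# Append a UTC-timestamped note to a plain-text incident logbook.
# LOGBOOK is a hypothetical path; point it to wherever notes are kept.
LOGBOOK="${LOGBOOK:-./incident_logbook.txt}"

log_note() {
    # All arguments are joined with spaces into a single logbook line.
    printf '%s  %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOGBOOK"
}
```

Example: `log_note "docker logs of frontend: timestamps stop at 13:58"` adds one line to the logbook file.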
Common services
- Port 9000 → live SIM API
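A quick way to see whether a port such as 9000 accepts connections is a plain TCP connect. The sketch below assumes bash and the coreutils `timeout` command are available; `port_open` is a hypothetical helper name, and the host has to be supplied by the caller because this page does not name it.

```shell
# port_open HOST PORT
# Succeeds (exit code 0) if a TCP connection can be opened within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example (the host is a placeholder):
#   port_open some-host.example 9000 && echo "port open" || echo "not reachable"
```

A successful connect only shows that something is listening on the port; it does not prove that the live SIM API answers requests correctly.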
Netcup machines
- docker ps → was the container restarted recently? All containers, or only this one?
- docker logs → fresh timestamps and no errors?
- check vpn: ps aux | grep vpn → no VPN process listed (other than the grep itself)? → restart the VPN as root
- check vpn: is a connection to corpex possible? → no: check the VPN logs as root
- also check unattended-upgrades as root to see whether an update broke a package
- if unsure whether the internal container state is valid → restart the frontend
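The "restarted recently?" question from the docker ps step can be answered at a glance by filtering on the status column. The sketch below is one possible filter; `recently_restarted` is a hypothetical helper, and the definition of "recent" (uptime still counted in seconds or minutes, or a container in the Restarting state) is an assumption.

```shell
# Reads lines of the form "<name> <status>" on stdin, as produced by
#   docker ps --format '{{.Names}} {{.Status}}'
# and prints the containers that look recently (re)started.
recently_restarted() {
    awk '
        # Container is stuck in a restart loop or is restarting right now.
        /^[^ ]+ Restarting/ { print; next }
        # Uptime is still reported in seconds or minutes.
        /^[^ ]+ Up (Less than a second|About a minute|[0-9]+ (second|minute)s?)( |$)/ { print }
    '
}
```

Usage: `docker ps --format '{{.Names}} {{.Status}}' | recently_restarted`. If every container shows up, the whole machine or the docker daemon was probably restarted; if only one does, the problem is more likely local to that container.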
Heartbeat Monitor
Check
http://sg01.picalike.corpex-kunden.de:5002/by_service
to see whether any “non-laptop” live frontend is in ERROR state.
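The check can also be done from a terminal. The sketch below rests on two assumptions that are not confirmed by this page: that the monitor output contains one line per service with the state written as plain text, and that laptop frontends can be recognised by the word "laptop" in their line. `error_frontends` is a hypothetical helper name.

```shell
# Reads the monitor output on stdin and prints lines that are in ERROR state,
# skipping any line that mentions "laptop" (assumed naming convention).
error_frontends() {
    grep 'ERROR' | grep -vi 'laptop'
}
```

Usage: `curl -fsS http://sg01.picalike.corpex-kunden.de:5002/by_service | error_frontends`. No output means no matching line was found; if the page has a different layout, the filter has to be adapted.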
Communication
Always send a reply to corpex so they know what is going on, and mention as many details as possible in the incident: any oddity, minor detail, etc.
Keywords: docker monitoring corpex netcup
incidents_best_practice.txt · Last modified: 2024/04/11 14:23 by 127.0.0.1