Incidents: Best Practice

create a ticket (if the problem is not easily solved)
list all affected server(s)
check the Grafana monitoring of the server(s)
also check the affected service(s)
perform a health / running check of service(s)
search for incident in the Wiki: incidents
check logs of service(s)
keep a logbook while you are investigating, then condense the text into the incident wiki format
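
One way to keep such a logbook is to append timestamped lines to a plain text file during the investigation. The following is only a minimal sketch: the function names, the file path and the line layout are assumptions made for illustration, not part of the incident wiki format.

```python
from datetime import datetime, timezone
from typing import Optional


def logbook_entry(message: str, now: Optional[datetime] = None) -> str:
    """Format one logbook line as 'YYYY-MM-DD HH:MM:SS UTC  message'."""
    now = now or datetime.now(timezone.utc)
    return f"{now.strftime('%Y-%m-%d %H:%M:%S')} UTC  {message}"


def append_entry(path: str, message: str) -> None:
    """Append a timestamped line to the logbook file (path is hypothetical)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(logbook_entry(message) + "\n")


# Example (not executed here):
# append_entry("incident-logbook.txt", "checked Grafana, CPU normal")
```

Timestamps in UTC make it easier to line up the notes with service logs later.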

Common services

Port 9000 → live SIM API
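
A quick running check for a service like this is to test whether the port accepts TCP connections. The sketch below assumes nothing about the API itself; the function name is made up for illustration, and the host has to be supplied by the caller because this page does not name it.

```python
import socket


def port_is_open(host: str, port: int = 9000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


# Example (not executed here; replace the placeholder with the real server):
# port_is_open("<affected-server>", 9000)
```

An open port only shows that something is listening; it does not prove that the service answers correctly, so the logs still need to be checked.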

Netcup machines

docker ps → was the container recently restarted? all of them, or only this one?
docker logs → fresh timestamps and no errors?
check vpn: ps aux | grep vpn → empty? restart vpn as root
check vpn: is a connection to corpex possible? → no: check vpn logs as root
also check unattended-upgrades as root, in case an update broke a package
if unsure whether the internal container state is valid → restart frontend
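
The first check (which containers were restarted recently, all or only one) can be sketched as a small script. It relies on docker ps with a --format template of container name and status; the helper names and the rule "Up for seconds or minutes means recent" are assumptions for illustration.

```python
import subprocess


def parse_docker_ps(output: str) -> dict:
    """Map container name to status from tab-separated name/status lines."""
    containers = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        containers[name.strip()] = status.strip()
    return containers


def recently_restarted(containers: dict, units=("second", "minute")) -> list:
    """Names whose status is 'Restarting ...' or 'Up' for only seconds/minutes."""
    hits = []
    for name, status in containers.items():
        fresh_uptime = status.startswith("Up") and any(u in status for u in units)
        if status.startswith("Restarting") or fresh_uptime:
            hits.append(name)
    return hits


def list_containers() -> dict:
    """Run docker ps and parse its output (requires access to the docker daemon)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_ps(result.stdout)


# Example (not executed here):
# print(recently_restarted(list_containers()))
```

If the result contains every container, the host itself probably rebooted or the docker daemon restarted; a single name points at that one service.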

Heartbeat Monitor

http://sg01.picalike.corpex-kunden.de:5002/by_service

Check this page to see whether any “non-laptop” live frontend is in the ERROR state.
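
The same check can be scripted. The sketch below is based on a guess: the response format of the by_service page is not described here, so the code simply treats it as text and reports lines that contain ERROR but not "laptop". If the page returns JSON or HTML, the filter has to be adapted. The function names are hypothetical.

```python
from urllib.request import urlopen

HEARTBEAT_URL = "http://sg01.picalike.corpex-kunden.de:5002/by_service"


def error_lines(text: str, ignore: str = "laptop") -> list:
    """Return lines mentioning ERROR, skipping lines that mention `ignore`."""
    return [
        line.strip()
        for line in text.splitlines()
        if "ERROR" in line and ignore not in line.lower()
    ]


def fetch_heartbeat(url: str = HEARTBEAT_URL, timeout: float = 5.0) -> str:
    """Download the heartbeat page as text (assumes it is reachable from here)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example (not executed here):
# for line in error_lines(fetch_heartbeat()):
#     print(line)
```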

Communication

Always send a reply to corpex so they know what is going on. Include as many details as possible in the incident report: any oddity, minor observations, etc.

Keywords: docker monitoring corpex netcup

Picalike Dokuwiki Archive
