
Incidents: Best Practice

  • create a ticket (if not easily solvable)
  • list all affected server(s)
  • check the Grafana monitoring of the server(s)
  • also check the affected service(s)
  • perform a health / running check of service(s)
  • search for the incident in the Wiki: incidents
  • check logs of service(s)
  • keep a logbook while you are investigating, then condense the text into the incident wiki format
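
One way to keep the logbook is a small shell helper that appends timestamped notes to a text file. This is only a sketch: the function name log_note, the LOGBOOK variable and the default file name are assumptions, not an existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a UTC-timestamped note to a plain-text logbook.
# LOGBOOK is an assumed variable name; point it at any writable file.
LOGBOOK="${LOGBOOK:-./incident_logbook.txt}"

log_note() {
    # All arguments are joined into a single note line.
    printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOGBOOK"
}
```

Example: log_note "checked Grafana, CPU looks normal". The resulting file can later be condensed into the incident wiki format.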

Common services

  • Port 9000 → live SIM API
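
A quick reachability check for such a port could look like the sketch below. It assumes bash and the coreutils timeout command are available; the function name check_port is made up for illustration, and the host to test is not given on this page, so it has to be supplied by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a TCP port accepts connections.
# Uses bash's built-in /dev/tcp redirection, so no extra tools are needed.
check_port() {
    # $1 = host, $2 = TCP port
    if timeout 3 bash -c ': < "/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
        echo open
    else
        # Covers refused, filtered and timed-out connections alike.
        echo closed
    fi
}
```

Example: check_port <host> 9000. An "open" result only shows that something is listening; it does not prove that the live SIM API answers correctly.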

Netcup machines

  • docker ps → was the container restarted recently? all containers, or only this one?
  • docker logs → are the timestamps fresh and are there no errors?
  • check vpn: ps aux | grep vpn → if empty (apart from the grep process itself), restart the vpn as root
  • check vpn: is a connection to corpex possible? → if not, check the vpn logs as root
  • also check unattended-upgrades as root, in case an update broke a package
  • if unsure whether the internal container state is valid → restart the frontend
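
The first checks above can be sketched as follows. The function name recently_restarted is an assumption, and so is the "under an hour" threshold; it relies on the usual wording of the docker ps STATUS column ("Up 12 seconds", "Up About a minute", "Restarting (1) 3 seconds ago").

```shell
#!/usr/bin/env bash
# Hypothetical helper: list containers that look recently (re)started.
# Reads "name<TAB>status" lines on stdin, as produced by:
#   docker ps -a --format '{{.Names}}\t{{.Status}}'
recently_restarted() {
    awk -F '\t' '
        # Containers stuck in a restart loop.
        $2 ~ /^Restarting/ { print $1; next }
        # Containers that have been up for less than an hour.
        $2 ~ /^Up (Less than a second|[0-9]+ seconds?|About a minute|[0-9]+ minutes?)/ { print $1 }
    '
}

# Usage on a host (needs access to the Docker socket):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | recently_restarted
#   docker logs --tail 50 --timestamps <container>
#   pgrep -af vpn    # unlike `ps aux | grep vpn`, never matches itself
```

If every container shows up in the output, the whole host or the Docker daemon was probably restarted; a single name points at that one service. On Debian or Ubuntu hosts with default settings, unattended-upgrades writes its log to /var/log/unattended-upgrades/unattended-upgrades.log.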

Heartbeat Monitor

http://sg01.picalike.corpex-kunden.de:5002/by_service

Check this page to see whether any “non-laptop” live frontend is in ERROR state.
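
This check could be scripted roughly as below. The response format of the heartbeat monitor is not documented here, so the sketch assumes line-oriented text in which each frontend appears on one line together with its state, and in which laptop frontends carry "laptop" in their name. The function name count_live_errors is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines that mention ERROR but not "laptop".
# Reads the heartbeat monitor output on stdin (format is an assumption).
count_live_errors() {
    awk '/ERROR/ && tolower($0) !~ /laptop/ { n++ } END { print n + 0 }'
}

# Usage:
#   curl -fsS 'http://sg01.picalike.corpex-kunden.de:5002/by_service' | count_live_errors
```

A result greater than 0 means at least one non-laptop entry reports ERROR and should be looked at by hand.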

Communication

Always send a reply to corpex so they know what is going on. Mention as many details as possible in the incident: any oddity, minor detail, etc.

Keywords: docker monitoring corpex netcup

incidents_best_practice.txt · Last modified: 2024/04/11 14:23 by 127.0.0.1