Kibana retains roughly a week of logs, which makes it a powerful resource for investigating recent problems.

Monitoring

It is important to monitor the application so that problems can be identified and handled before they cause an outage or otherwise affect users. A Wiki page outlines who is responsible for monitoring. The important things to monitor are:

  • #chefs-sysdig in rocket.chat: tells you when storage is nearing capacity or application pods are using a lot of CPU

  • Kibana: occasionally check the logs for anything other than http-level and info-level messages, using a query such as:

    • kubernetes.namespace_name:"a12c97-prod" AND NOT (message:"level : http" OR message:"level : info")

Other items not discussed in the Wiki are:

  • #devops-alerts in rocket.chat: stay familiar with what is going on in OCP

  • #devops-sos in rocket.chat: hear about large OCP problems as they happen

  • Check the database backup pod and verify from its logs that everything ran OK. Ideally these results would be sent to a Teams or rocket.chat channel instead.

  • Check the certbot logs. It is unclear when certbot decides to renew, but we should make sure the renewal works when it does try.

  • Check the CHEFS cronjob logs. A user once managed to create a form that broke the job. That form has been deleted, but we don’t have an easy way to make sure the job is working.

  • It’s good to check at least daily that the Patroni cluster replicas are not lagging. This should ideally be part of Sysdig but that’s future work. In the meantime check using patronictl for each of the three namespaces:

    • oc -n <NAMESPACE> exec statefulset/patroni-master -it -- bash -c "patronictl list"
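
Until the lag check is part of Sysdig, the daily Patroni check can be wrapped in a small helper so that all namespaces are covered in one pass. The following is only a sketch: the function name check_patroni is hypothetical, and it assumes that oc is already logged in to the cluster and that every namespace has a patroni-master statefulset, as in the command above. patronictl list typically reports replica lag in a "Lag in MB" column, which is the value to look at.

```shell
#!/usr/bin/env bash
# Sketch only: check_patroni is a hypothetical helper, not an existing script.

# Run "patronictl list" in every namespace passed as an argument.
# Returns non-zero if any namespace could not be queried.
check_patroni() {
  local ns failed=0
  for ns in "$@"; do
    echo "=== ${ns} ==="
    # Same command as in the list above, without -it, so that it also
    # runs when no terminal is attached (for example from a scheduled job).
    if ! oc -n "${ns}" exec statefulset/patroni-master -- bash -c "patronictl list"; then
      echo "WARNING: could not query Patroni in ${ns}" >&2
      failed=1
    fi
  done
  return "${failed}"
}
```

To use it, source the file and pass the three namespaces, for example: check_patroni a12c97-dev a12c97-test a12c97-prod. Only a12c97-prod appears on this page; the dev and test names are assumed to follow the same pattern and should be confirmed before use.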