...
If you use up the 128Gi of requests.storage
in the top image with netapp-block-standard
storage, then you cannot request storage in any other class, even if you're below that class's individual quota - it's the overall quota that is at the limit.
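Quota usage can also be checked from the command line. A sketch, assuming the standard oc CLI and the `<NAMESPACE>` placeholder used elsewhere on this page:

```shell
# Show each ResourceQuota in the namespace, with Used vs. Hard amounts.
# requests.storage at its hard limit blocks new claims in every storage class.
oc -n <NAMESPACE> describe quota
```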
Events and Logs
There are two basic types of troubleshooting information in OpenShift:
events, which happen to a pod
logs, which happen within a pod
Events
Events happen during the lifetime of a pod and consist of items like:
pulling a container image for a pod
setting up storage for a pod
probe failure for a pod
The events for a namespace are found under the "Home" menu:
...
Events occur so often that the cluster only stores them for a few hours before deleting them. Most events are generated by deployments, and many of those are not a concern. Error events do happen, though, and the namespace events page is a good place to find them.
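The same namespace events are available from the command line; a sketch, assuming the oc CLI:

```shell
# List recent events in the namespace, oldest first.
oc -n <NAMESPACE> get events --sort-by=.lastTimestamp

# Show only Warning-type events (errors surface as warnings).
oc -n <NAMESPACE> get events --field-selector type=Warning
```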
If you want to see the events for a specific pod, such as when it’s having problems starting, there is also a way to view the events specific to a pod:
...
Note that if a pod is having problems starting and nothing is obvious in its events, check the namespace events - some don’t show up in the pod!
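The events for a single pod can also be pulled from the command line; a sketch, assuming the oc CLI:

```shell
# The Events section at the bottom of the output lists the pod's events,
# such as image pulls, storage setup, and probe failures.
oc -n <NAMESPACE> describe pod <POD_NAME>
```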
Logs
Logs are the standard output (stdout) produced by the process that the container runtime runs for the container in a pod. In other words, for us it is the output of the CHEFS API. You can view the logs in the pod:
...
Sometimes it’s easier to view the raw logs or download a copy:
...
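The console's Logs tab has command-line equivalents; a sketch, assuming the oc CLI:

```shell
# Follow a pod's live logs (equivalent to the console's Logs tab).
oc -n <NAMESPACE> logs -f <POD_NAME>

# Download a copy of the logs to a file.
oc -n <NAMESPACE> logs <POD_NAME> > pod.log

# For a pod that has restarted, the previous container's logs often hold the error.
oc -n <NAMESPACE> logs --previous <POD_NAME>
```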
Kibana
If you really want to dig into the logs, Kibana is the tool for the job. The easiest way is to click the link from a pod’s logs:
...
There is an annoyance in the Kibana login process: if you are asked to log in, do so, but once in you'll notice that there's no query term. Close the window and click the link from the pod logs again, and this time you'll see the query term and the matching logs:
...
Kibana retains around a week of logs, which makes it a good and powerful resource for digging into past issues.
Monitoring
It is important to monitor the application so that problems can be identified and handled before they cause an outage or other impact to the users. There is a Wiki page that outlines who is responsible for monitoring. The important things to monitor are:
#chefs-sysdig in rocket.chat: tells you when storage is nearing capacity or application pods are using a lot of CPU
Kibana: occasionally check the logs, such as:
kubernetes.namespace_name:"a12c97-prod" AND NOT (message:"level : http" OR message:"level : info")
Other items not discussed in the Wiki are:
#devops-alerts in rocket.chat: stay familiar with what is going on in OCP
#devops-sos in rocket.chat: hear about large OCP problems as they happen
Check the logs of the database backup pod and the backup verification pod to make sure everything ran OK - we should send these to a Teams or rocket.chat channel instead
Check the certbot logs - the renewal schedule isn't obvious, so we should make sure renewal works when it does run
Check the CHEFS cronjob logs - a user once managed to create a form that broke the job. That form has been deleted, but we don't have an easy way to confirm the job keeps working
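The backup and cronjob runs above can be inspected from the command line; a sketch, assuming the oc CLI, with `<JOB_NAME>` standing in for whatever the listing shows:

```shell
# Backups and cronjobs create Job objects; list recent runs, newest last.
oc -n <NAMESPACE> get jobs --sort-by=.metadata.creationTimestamp

# View the logs of a particular run to confirm it completed cleanly.
oc -n <NAMESPACE> logs job/<JOB_NAME>
```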
It’s good to check at least daily that the Patroni cluster replicas are not lagging. Ideally this would be part of Sysdig, but that’s future work. In the meantime, check using patronictl in each of the three namespaces:
oc -n <NAMESPACE> exec statefulset/patroni-master -it -- bash -c "patronictl list"
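The per-namespace checks can be scripted in one loop. A sketch: only a12c97-prod appears on this page, so the dev and test namespace names below are assumptions, and the interactive flags are dropped since no TTY is needed in a script:

```shell
# Check Patroni replica lag across all three namespaces.
# a12c97-dev and a12c97-test are assumed names; adjust to the real ones.
for ns in a12c97-dev a12c97-test a12c97-prod; do
  echo "=== $ns ==="
  # "patronictl list" shows each member's role, state, and Lag in MB.
  oc -n "$ns" exec statefulset/patroni-master -- patronictl list
done
```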