DevOps Information

Namespace Resources

There are three main types of namespace resources that have to be monitored: CPU, memory, and storage.

CPU

Each pod has a request and a limit for the amount of CPU (computing power) it needs. The request roughly represents the amount the pod uses under normal load, and the limit is the amount the pod is allowed to spike to during high load. CPU is measured in m (millicores); if no unit is shown then the value is in cores (1 core = 1000 millicores). The YAML for DeploymentConfigs and StatefulSets contains the CPU settings:
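
Roughly, the relevant part of that YAML looks something like the sketch below (placeholder names and values, not the actual CHEFS settings):

    spec:
      template:
        spec:
          containers:
            - name: app                # placeholder container name
              resources:
                requests:
                  cpu: 50m             # normal usage: 50 millicores
                limits:
                  cpu: 250m            # allowed spike: 250 millicores (a quarter core)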

If the application is sluggish, check the CPU metrics for the pods. You can view the metrics through the OCP console:

Note:

  • the orange horizontal line in the metrics is the CPU request

  • the blue horizontal line at the top of the metrics is the CPU limit

  • the data in the graphs is downsampled to an average and may hide short spikes

  • you can drill down into the metrics by clicking the graph

If any of the pods have their CPU pegged, that is probably the cause of the problem. Adding CPU won’t necessarily fix it, though, so look for the underlying cause. For example, if the database is at 100% CPU then perhaps it needs some indexes added for long-running queries.

If you do need to adjust the CPU, note that changes made in the OCP console only last until the next deployment of CHEFS. It’s a good way to try something out without too much effort. To make the change permanent, though, you will need to update the values in the /openshift files in the repo.

There is not an endless supply of CPU; we need to stay within the bounds of what is allocated to the namespace. This amount can be increased if needed, but we will be asked to justify the increase, and as good cluster citizens we have to make an effort to conserve resources. The compute-long-running-quota looks something like:
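
As a rough sketch only (the numbers here are placeholders, not our actual quota), a long-running compute quota is a ResourceQuota along these lines, and covers memory as well as CPU:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-long-running-quota
    spec:
      hard:
        requests.cpu: "4"       # placeholder: total CPU requests across long-running pods
        limits.cpu: "8"         # placeholder: total CPU limits
        requests.memory: 8Gi    # placeholder: total memory requests
        limits.memory: 16Gi     # placeholder: total memory limits
      scopes:
        - NotTerminating        # applies only to long-running (non-job) pods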

Memory

Each pod has a request and a limit for the amount of memory it needs. The request roughly represents the amount the pod uses under normal load, and the limit is the amount the pod is allowed to spike to during high load. Memory is typically measured in Mi (mebibytes) or Gi (gibibytes). The YAML for DeploymentConfigs and StatefulSets contains the memory settings:
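
The memory settings sit in the same resources block as the CPU settings; again a sketch with placeholder values:

    resources:
      requests:
        memory: 256Mi    # normal usage
      limits:
        memory: 1Gi      # allowed spike under load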

If the application is sluggish, check the memory metrics for the pods. You can view the metrics through the OCP console:

Note:

  • the orange horizontal line in the metrics is the memory request

  • the blue horizontal line at the top of the metrics is the memory limit

  • the data in the graphs is downsampled to an average and may hide short spikes

  • you can drill down into the metrics by clicking the graph

If any of the pods have their memory pegged, that is probably the cause of the problem. Adding memory won’t necessarily fix it, though, so look for the underlying cause. For example, if the database is at 100% memory then perhaps it needs some indexes added for long-running queries.

If you do need to adjust the memory, note that changes made in the OCP console only last until the next deployment of CHEFS. It’s a good way to try something out without too much effort. To make the change permanent, though, you will need to update the values in the /openshift files in the repo.

There is not an endless supply of memory; we need to stay within the bounds of what is allocated to the namespace. This amount can be increased if needed, but we will be asked to justify the increase, and as good cluster citizens we have to make an effort to conserve resources. The compute-long-running-quota looks something like:

Storage

Most storage in the pods is ephemeral and disappears when the pod is deleted. However, some pods have persistent storage that survives pod restarts, which is needed for things like database data. When storage fills to 100% recovery becomes much harder, so it is best to expand storage long before it hits capacity.
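
Persistent storage is requested through a PersistentVolumeClaim (PVC) that the pod then mounts as a volume; a sketch with placeholder names and sizes, not the actual CHEFS objects:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: app-data                            # placeholder PVC name
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: netapp-block-standard   # block storage class used for databases
      resources:
        requests:
          storage: 5Gi                          # placeholder size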

You can view the capacity of storage in the OCP console:

Note that the top two storage items don’t show the “Used” amount. This is because those PVCs are not currently mounted to a pod - these PVCs are used for cron jobs which only run for a few minutes per day. However, they are monitored and will produce an alert in the #chefs-sysdig channel on rocket.chat if they reach 90%.

You can expand the size of a PVC either by editing its YAML or by clicking “Expand” in its Actions menu:
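
If you go the YAML route, the field to bump is spec.resources.requests.storage on the PVC; a sketch with placeholder sizes:

    spec:
      resources:
        requests:
          storage: 10Gi    # was 5Gi; increases only - this cannot be reduced later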

Note:

  • Expanding a PVC is nearly instantaneous

  • Expanding a PVC does not cause a pod restart - there is no effect on the users

  • Expanding a PVC is a one-way operation - there is no corresponding way to shrink a PVC

  • Shrinking instead requires a very time-consuming manual process (take down the app, back up the PVC, delete it, recreate it smaller, restore the backup, bring the app back up), so ensure that you truly need to expand

As with CPU and memory, there is a limit to the amount of storage that is available. There is an overall quota, storage-quota, that covers all “classes” of storage:
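
As a sketch (the 128Gi matches the figure discussed below; the PVC count is a placeholder), the overall quota looks along these lines:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: storage-quota
    spec:
      hard:
        requests.storage: 128Gi        # total storage requested across all classes
        persistentvolumeclaims: "20"   # placeholder: total number of PVCs allowed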

Storage is tricky, though, in that each class of storage also has its own quota. That is, the netapp-file-standard storage for general use has a different quota than the netapp-block-standard storage that performs better for databases. For example:
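
The per-class quotas use keys of the form <storage-class>.storageclass.storage.k8s.io/requests.storage; a sketch with placeholder values:

    spec:
      hard:
        netapp-file-standard.storageclass.storage.k8s.io/requests.storage: 100Gi   # placeholder
        netapp-block-standard.storageclass.storage.k8s.io/requests.storage: 50Gi   # placeholder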

If you use up the full 128Gi of requests.storage in the overall storage-quota with netapp-block-standard storage alone, then you cannot claim storage in the other classes, even though you’re below each class’s own quota. It’s the overall quota that is at the limit.

Events and Logs

There are two basic types of troubleshooting information in OpenShift:

  • events, which happen to a pod

  • logs, which happen within a pod

Events

Events happen during the lifetime of a pod and consist of items like:

  • pulling a container image for a pod

  • setting up storage for a pod

  • probe failure for a pod

The events for a namespace are found under the “Home” menu:

Events occur so often that the cluster only stores them for a few hours before they are deleted. Most often events are due to a deployment, and many of those are not a concern. Error events do happen, though, and the namespace event list is a good place to find them.

If a pod is having problems starting, you can also view the events for that specific pod:

Note that if a pod is having problems starting and nothing is obvious in its events, check the namespace events - some events don’t show up at the pod level!

Logs

Logs are the standard output (stdout) of the process that the container runtime runs for the container in a pod. In other words, for us it is the output of the CHEFS API. You can view the logs in the pod:

Sometimes it’s easier to view the raw logs or download a copy:

Kibana

If you really want to dig into the logs, Kibana is the tool for the job. The easiest way to get there is to click the link from a pod’s logs:

There is an annoyance in the Kibana login process - if you are asked to log in, do so, but once in you’ll notice that there’s no query term. Close the window, click the link from the pod logs again, and this time you’ll see the query term and the matching logs:

Kibana stores around a week of logs, so it is a good and powerful resource for digging into issues after the fact.

Monitoring

It is important to monitor the application so that problems can be identified and handled before they cause an outage or other impact to the users. There is a Wiki page that outlines who is responsible for monitoring. The important things to monitor are:

  • #chefs-sysdig in rocket.chat: tells you when storage is nearing capacity or application pods are using a lot of CPU

  • Kibana: occasionally check the logs, such as:

    • kubernetes.namespace_name:"a12c97-prod" AND NOT (message:"level : http" OR message:"level : info")

Other items not discussed in the Wiki are:

  • #devops-alerts in rocket.chat: stay familiar with what is going on in OCP

  • #devops-sos in rocket.chat: hear about large OCP problems as they happen

  • Check the logs of the database backup pod and the backup verification pod to make sure everything ran OK - we should send these to a Teams or rocket.chat channel instead

  • Check the certbot logs - no idea when it decides to renew, but we should make sure it works when it does try

  • Check the CHEFS cronjob logs - a user somehow managed to create a form that broke the job. That form has been deleted, but we don’t have an easy way to make sure the job is working

  • It’s good to check at least daily that the Patroni cluster replicas are not lagging. This should ideally be part of Sysdig but that’s future work. In the meantime check using patronictl for each of the three namespaces:

    • oc -n <NAMESPACE> exec statefulset/patroni-master -it -- bash -c "patronictl list"