Digital.gov.bc.ca currently uses two services for metrics: Web Analytics, a service provided by GDX that collects user-level information, and Sysdig Monitor, a service that provides hosting analytics specifically geared for Kubernetes and can send alerts when something goes wrong. The two services serve different purposes and both are useful. While there is a lot of data to look at, we can focus on a few key metrics to help ensure we are providing valuable content and a reliable service.
Web Analytics
Web Analytics, as provided by GDX, comes packaged as a JavaScript library, called Snowplow, that can be easily integrated into a web application. The data it collects is then aggregated into a Looker dashboard.
...
Data can be refined through filtering. Filter controls are available on the top menu bar:
...
Sysdig
<Stub>
...
Sysdig Monitor
Sysdig Monitor is a monitoring, alerting, and data collection SaaS tool that has been integrated into OpenShift. Sysdig agents are installed on each node of the gold and silver clusters to collect information about applications on the platform; the dashboard itself is hosted off platform in the Sysdig cloud. The benefits of this service include application health monitoring, event notifications, and metric visualization. The current configuration is geared towards infrastructure metrics, which provide important data about resource utilization, but it can be extended to display application-level data by using Prometheus and adding its client library to the application code.
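To give a sense of what application-level data looks like once it is exposed for Prometheus, the sketch below renders a simple counter in the Prometheus text exposition format using only the Python standard library. It is not the Prometheus client library, which would normally handle this, and it does not reflect how the site is instrumented today. The metric name `page_views_total` and the `path` label are hypothetical.

```python
from collections import Counter

# Hypothetical metric name and help text; a real application would choose its own.
METRIC_NAME = "page_views_total"
METRIC_HELP = "Total page views served, by path."

# In-memory counter keyed by request path.
page_views = Counter()


def record_page_view(path: str) -> None:
    """Increment the counter for one served page."""
    page_views[path] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format.

    Label values are not escaped here; a real client library escapes
    quotes, backslashes, and newlines.
    """
    lines = [
        f"# HELP {METRIC_NAME} {METRIC_HELP}",
        f"# TYPE {METRIC_NAME} counter",
    ]
    for path, count in sorted(page_views.items()):
        lines.append(f'{METRIC_NAME}{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


record_page_view("/")
record_page_view("/")
record_page_view("/about")
print(render_metrics())
```

An application would typically serve this text from a `/metrics` endpoint for scraping. In practice the official client library for the application's language is the better choice, since it also handles escaping, concurrency, and other metric types.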
Much of the information presented in the default dashboard is also available in the OpenShift web console, under the Developer → Observe section. Sysdig's dashboard tends to be more visually appealing and is much more customizable and extensible. Many other dashboards are available, each offering a different lens on the data.
Sysdig can also be configured to send alerts based on queryable events; pod counts and CPU usage, for example, make useful alerts for notifying developers about potential issues. Currently, the site has an alert for each of its three deployments that fires when fewer than three pods are available (i.e. a pod goes down).
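The alert rule described above amounts to a threshold check per deployment. The following is a minimal sketch of that logic in Python, not the Sysdig alert configuration itself. The deployment names and pod counts are hypothetical.

```python
from typing import Dict, List

# Threshold taken from the alert described above: fewer than three pods available.
MIN_AVAILABLE_PODS = 3


def deployments_to_alert(
    available_pods: Dict[str, int], minimum: int = MIN_AVAILABLE_PODS
) -> List[str]:
    """Return the deployments whose available pod count is below the minimum."""
    return sorted(
        name for name, count in available_pods.items() if count < minimum
    )


# Hypothetical snapshot of available pods per deployment.
snapshot = {"frontend": 3, "backend": 2, "cms": 3}
print(deployments_to_alert(snapshot))  # ['backend']
```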
Golden Signals
In Site Reliability Engineering, there is a concept called the ‘Golden Signals’: four key metrics that can be used to identify potential problems that may cause site reliability issues or degrade user experience. Seeing which of these signals an individual pod is showing, and in what combination, can help diagnose the root cause more quickly. These metrics can be described as:
Latency - the round-trip time it takes for the site to serve a resource.
Errors - the set of errors produced by the servers, which can vary in severity.
Traffic - the number of requests coming in, versus expectations.
Saturation - the actual load versus expected capacity.
These metrics should be considered the baseline for determining the health of a service. They are available in Sysdig under the Pod Status & Performance dashboard.
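To make the four signals concrete, the sketch below derives each one from a small batch of request records. It only illustrates the definitions and is not how Sysdig computes its dashboard values. The `Request` record, the sample data, and the use of requests per second against an assumed capacity as the saturation measure are all assumptions; saturation is often measured from CPU or memory utilization instead.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    duration_ms: float
    status: int


def golden_signals(
    requests: List[Request], window_seconds: float, capacity_rps: float
) -> Dict[str, float]:
    """Summarize latency, errors, traffic, and saturation for one time window."""
    if not requests:
        raise ValueError("at least one request is required")

    # Latency: 95th percentile duration, using the nearest-rank method.
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(durations)))
    latency_p95 = durations[rank - 1]

    # Errors: share of requests that ended in a server error (HTTP 5xx).
    error_count = sum(1 for r in requests if r.status >= 500)

    # Traffic: requests per second over the window.
    traffic_rps = len(requests) / window_seconds

    return {
        "latency_p95_ms": latency_p95,
        "error_rate": error_count / len(requests),
        "traffic_rps": traffic_rps,
        # Saturation: actual load as a fraction of the assumed capacity.
        "saturation": traffic_rps / capacity_rps,
    }


# Hypothetical sample: ten requests observed over five seconds.
sample_durations = [20, 25, 30, 35, 40, 45, 50, 60, 80, 400]
sample = [Request(duration_ms=d, status=200) for d in sample_durations]
sample[-1].status = 500  # the slowest request also failed

signals = golden_signals(sample, window_seconds=5.0, capacity_rps=4.0)
print(signals)
```

In this sample the slow request and the failed request are the same one, which is the kind of correlation between signals that helps narrow down a root cause.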
...