Task
Design and Deploy a Monitoring Solution that will enable us to:
Choose a solution that is cloud agnostic while offering broad support for the tools we're using.
Have a view of multi-cluster metrics, including (but not limited to) Kubernetes state metrics, workloads, node metrics, container scanning metrics, ArgoCD metrics, etc.
Design unified monitoring through a "single pane of glass", with the ability to filter and view each environment (dev, stage, production), multiple clusters, and services outside of the GKE/GCP ecosystem.
Create a repeatable/reusable monitoring architecture which can easily be redeployed in any new environment.
Take advantage of tools/software that can be composed into a highly available monitoring stack.
Monitoring Stack Diagram
Below is a high-level monitoring stack diagram (Monitoring-Stack-001)
which illustrates the current folders and projects, and where the different parts of the monitoring stack are, or will be, deployed.
Zooming in, diagram Monitoring-Stack-002
illustrates how Grafana uses Thanos to query metrics from the different clusters' Prometheuses.
Finally, diagram Monitoring-Stack-003
provides the bigger picture of how Thanos manages the different metrics sources (stores) and sends data to object storage while maintaining high availability.
Kubernetes Prometheus Stack
This is deployed using the Helm chart packaged by Bitnami, which deploys the following:
Component | Purpose | Kind |
---|---|---|
Prometheus Operator | The main purpose of this operator is to simplify and automate the configuration and management of the Prometheus monitoring stack running on a Kubernetes cluster. Essentially it is a custom controller that monitors the new object types introduced through its CRDs (e.g. Prometheus, Alertmanager, ServiceMonitor, PodMonitor, PrometheusRule). | |
Prometheus | Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. | |
Alertmanager | The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts. | Currently disabled and not deployed in our monitoring stack |
Node Exporters | Node exporter is the official Prometheus exporter for capturing Linux system-related metrics. It collects the hardware and operating-system-level metrics that are exposed by the kernel. | |
Kube State Metrics | kube-state-metrics (KSM) is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects. It is not focused on the health of the individual Kubernetes components, but rather on the health of the various objects inside, such as deployments, nodes and pods. | |
Service Monitors | The Prometheus Operator includes a Custom Resource Definition that allows the definition of a ServiceMonitor. A ServiceMonitor defines an application you wish to scrape metrics from within Kubernetes; the controller actions the ServiceMonitors we define and automatically builds the required Prometheus configuration. Within the ServiceMonitor we specify the Kubernetes labels that the Operator uses to identify the Kubernetes Service, which in turn identifies the Pods we wish to monitor. | `apiVersion: monitoring.coreos.com/v1` `kind: ServiceMonitor` |
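To make the ServiceMonitor flow above concrete, here is a minimal sketch of one; the application name, namespace, labels, and port name are illustrative (not taken from our repo), and the `release` label must match whatever selector our Prometheus Operator is configured with:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical app, for illustration only
  namespace: monitoring
  labels:
    release: kube-prometheus   # must match the Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: example-app         # labels on the Kubernetes Service we want scraped
  endpoints:
    - port: metrics            # named port on the Service
      interval: 30s
      path: /metrics
```

The Operator watches for objects of this kind and regenerates the Prometheus scrape configuration automatically; no manual edit of `prometheus.yml` is needed.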
See the ArgoCD ApplicationSet yaml definition here and an actual sample ArgoCD Application here, deployed in the monitoring dev cluster.
MultiCluster Monitoring with Thanos
What is Thanos and why did we choose it?
The Thanos project turns Prometheus into a highly available metrics platform with unlimited metrics storage. This article about Thanos is a great and easy read to better understand the limitations of Kube Prometheus and how Thanos aims to solve them.
The three key features of Thanos are as follows:
Global query view of metrics.
Unlimited retention of metrics.
High availability of components, including Prometheus.
Components
Following the KISS and Unix philosophies, Thanos is made of a set of components with each filling a specific role.
Sidecar: connects to Prometheus, reads its data for query and/or uploads it to cloud storage.
Store Gateway: serves metrics inside of a cloud storage bucket.
Compactor: compacts, downsamples and applies retention on the data stored in cloud storage bucket.
Receiver: receives data from Prometheus's remote-write WAL, exposes it, and/or uploads it to cloud storage.
Ruler/Rule: evaluates recording and alerting rules against data in Thanos for exposition and/or upload.
Querier/Query: implements Prometheus’s v1 API to aggregate data from the underlying components.
Query Frontend: implements Prometheus's v1 API and proxies requests to Query, caching responses and optionally splitting queries by day.
This is currently exposed in Dev at https://thanos.dev.safibank.online/graph
Deployment with Sidecar:
More info can be found from Thanos official documentation.
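As a rough sketch of the sidecar deployment model, the Bitnami kube-prometheus chart exposes a `prometheus.thanos` section; the values below are illustrative (the internal LB annotation and external label are assumptions, not copied from our repo):

```yaml
prometheus:
  thanos:
    create: true                 # run the Thanos sidecar alongside Prometheus
    service:
      type: LoadBalancer         # expose the sidecar's gRPC endpoint (10901)
      annotations:
        networking.gke.io/load-balancer-type: "Internal"  # internal GKE LB only
  externalLabels:
    cluster: safi-dev-apps       # illustrative; lets Thanos tell metric sources apart
```

The sidecar reads the local Prometheus TSDB blocks, serves them over gRPC to the Thanos Querier, and can upload completed blocks to object storage.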
Grafana
Grafana is an open-source solution for running data analytics, pulling up metrics that make sense of massive amounts of data, and monitoring our apps with the help of customizable dashboards.
We chose Grafana to create and visualize dashboards from the metrics we pull from GKE clusters and other sources through Thanos, which communicates via the Thanos sidecar that runs alongside Prometheus.
Below is a snapshot from our Grafana Dev Kubernetes Global Dashboard.
Monitoring as Code
How each of the above components is currently deployed via Argo CD and Terraform
The Monitoring GKE cluster was deployed first, as this is the cluster where Thanos and Grafana are hosted. Each of our environments (dev, stage, production) has a dedicated Monitoring GKE cluster, and all of the GKE clusters in all environments have Prometheus (with the Thanos sidecar) installed, all in the monitoring namespace.
Pre-requisites:
Add the project and the required APIs here
The Monitoring CIDR network was added here for the dev env.
The Monitoring GKE resource is then added and created through Terraform here.
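For orientation, a minimal sketch of such a GKE cluster resource in Terraform is shown below; the name, region, CIDR, and project are illustrative placeholders, not values copied from our tf files:

```hcl
resource "google_container_cluster" "monitoring" {
  name     = "safi-dev-monitoring"     # illustrative name
  location = "asia-southeast1"         # illustrative region
  project  = "safi-env-dev-monitoring" # illustrative project id

  # Pod/service ranges drawn from the dedicated monitoring CIDR added earlier
  ip_allocation_policy {
    cluster_ipv4_cidr_block = "172.16.0.0/20" # illustrative CIDR
  }

  # Workload Identity, so pods (e.g. Thanos) can impersonate GCP service accounts
  workload_identity_config {
    workload_pool = "safi-env-dev-monitoring.svc.id.goog"
  }
}
```

The Workload Identity pool is what later allows the `iam.gke.io/gcp-service-account` annotations in the Thanos values.yaml to work without exported keys.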
Continuous Deployment
Prometheus with Thanos sidecar
Prometheus is deployed as Helm Chart via ArgoCD ApplicationSet - See the yaml definition here
If you inspect the yaml file, you will notice that each cluster defined in the list generator has the following:
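The list-generator pattern can be sketched as follows; the cluster entries, chart version, and repo URL here are illustrative stand-ins for what is in the actual ApplicationSet yaml:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: kube-prometheus
spec:
  generators:
    - list:
        elements:
          - cluster: safi-dev-apps        # illustrative entry; the real list of
            url: https://10.0.0.1         # clusters lives in our ApplicationSet yaml
  template:
    metadata:
      name: 'kube-prometheus-{{cluster}}' # one Application rendered per element
    spec:
      project: default
      source:
        repoURL: https://charts.bitnami.com/bitnami
        chart: kube-prometheus
        targetRevision: 8.x.x             # illustrative version
        helm:
          values: |
            prometheus:
              thanos:
                create: true              # every cluster gets the Thanos sidecar
      destination:
        server: '{{url}}'
        namespace: monitoring
      syncPolicy:
        automated: {}
```

Adding a cluster to the generator list is then enough for ArgoCD to roll the whole Prometheus stack out to it.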
Thanos
Thanos and Grafana are deployed via a Kustomize overlay, using public Helm charts as sources, through an ArgoCD Application.
SaFiMono/devops/argocd/environments/dev/monitoring
├── base
│   ├── grafana-dashboards.yaml
│   ├── grafana.yaml
│   └── thanos.yaml
├── grafana
│   ├── Chart.yaml
│   └── values.yaml
├── kustomization.yaml
└── thanos
    ├── Chart.yaml
    └── values.yaml

3 directories, 8 files
If we inspect the file SaFiMono/devops/argocd/environments/dev/monitoring/thanos/values.yaml, we will see the following:
thanos:
  objstoreConfig: |-
    type: GCS
    config:
      bucket: safi-thanos-dev
  querier:
    stores:
      # safi-cicd
      - 172.19.0.223:10901
      # safi-dev-apps
      - 172.16.47.237:10901
      # safi-dev-tyk
      - 172.16.96.59:10901
      # safi-dev-hcv
      - 172.16.64.13:10901
      # safi-dev-monitoring
      - kube-prometheus-prometheus-thanos.monitoring.svc:10901
  bucketweb:
    enabled: true
    serviceAccount:
      annotations:
        iam.gke.io/gcp-service-account: safi-thanos-gcs-dev@safi-env-dev-monitoring.iam.gserviceaccount.com
  compactor:
    enabled: true
    serviceAccount:
      annotations:
        iam.gke.io/gcp-service-account: safi-thanos-gcs-dev@safi-env-dev-monitoring.iam.gserviceaccount.com
  storegateway:
    enabled: true
    serviceAccount:
      annotations:
        iam.gke.io/gcp-service-account: safi-thanos-gcs-dev@safi-env-dev-monitoring.iam.gserviceaccount.com
  ruler:
    enabled: true
    alertmanagers:
      - http://prometheus-operator-alertmanager.monitoring.svc.cluster.local:9093
    config: |-
      groups:
        - name: "metamonitoring"
          rules:
            - alert: "PrometheusDown"
              expr: absent(up{prometheus="monitoring/prometheus-operator"})
From the values.yaml above, we can see that:
We are using Google Cloud Storage as our object storage for the metrics data uploaded by the Thanos sidecars and served back by the Store Gateway. The bucket and its config are deployed via this terraform code.
The IP addresses under querier.stores are the prometheus-thanos sidecar services exposed via the internal GKE LB of each cluster.
The service accounts indicated in the values.yaml above are also created by Terraform via this tf file.
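A sketch of what that Terraform code can look like is below; the resource names, location, and the Kubernetes service-account binding in the Workload Identity member string are illustrative assumptions, not copied from the actual tf files:

```hcl
resource "google_storage_bucket" "thanos" {
  name     = "safi-thanos-dev"
  project  = "safi-env-dev-monitoring"
  location = "ASIA-SOUTHEAST1"   # illustrative location
}

resource "google_service_account" "thanos" {
  account_id = "safi-thanos-gcs-dev"
  project    = "safi-env-dev-monitoring"
}

# Let the annotated Kubernetes service accounts (see values.yaml above)
# impersonate the GCP service account via Workload Identity
resource "google_service_account_iam_member" "thanos_wi" {
  service_account_id = google_service_account.thanos.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:safi-env-dev-monitoring.svc.id.goog[monitoring/thanos-storegateway]" # illustrative KSA
}

# Grant the GCP service account read/write access to the metrics bucket
resource "google_storage_bucket_iam_member" "thanos_rw" {
  bucket = google_storage_bucket.thanos.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.thanos.email}"
}
```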
Grafana with kiwigrid sidecar
Grafana is deployed using the Grafana Helm chart with the following values.yaml.
If we inspect the values.yaml, we will see that:
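As a sketch of the relevant parts of such a values.yaml (the sidecar settings follow the Grafana Helm chart's conventions; the datasource name, Thanos service name, and port are illustrative assumptions):

```yaml
sidecar:
  dashboards:
    enabled: true              # kiwigrid/k8s-sidecar watches for ConfigMaps...
    label: grafana_dashboard   # ...carrying this label and loads them as dashboards
    searchNamespace: monitoring
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Thanos
        type: prometheus       # Thanos Query speaks the Prometheus HTTP API
        url: http://thanos-query.monitoring.svc:9090  # illustrative service name
        isDefault: true
```

Pointing the default datasource at Thanos Query, rather than at any single Prometheus, is what gives the dashboards their global, multi-cluster view.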
Grafana Dashboards
Grafana Dashboards is a custom Helm chart created to automate the uploading of Grafana dashboards as ConfigMaps.
How it works
This private helm chart in our Safi Chart Museum is installed in our Monitoring Cluster in the monitoring namespace where Grafana is also deployed.
Grafana-Dashboards chart is deployed via this Application on ArgoCD
New dashboards in json format can be added in this dashboards folder
Make sure to bump the chart version in Chart.yaml so ArgoCD will pick up the change and continuously deploy it to the target cluster as a ConfigMap.
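Under the hood, the chart renders each json file into a ConfigMap carrying the label the kiwigrid sidecar watches for; a minimal sketch (the ConfigMap name, file name, and dashboard body are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: k8s-global-dashboard   # illustrative name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"     # label the kiwigrid sidecar is configured to watch
data:
  # dashboard json taken from the dashboards folder
  k8s-global.json: |
    {"title": "Kubernetes Global", "panels": []}
```

The sidecar picks up the ConfigMap and provisions the dashboard into Grafana without a restart.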
References
Attachments:
monitoring 1st level.jpg (image/jpeg)
monitoring 1st level (1).jpg (image/jpeg)
Monitoring-Stack-002.drawio.png (image/png)
Monitoring-Stack-003.drawio.png (image/png)
thanos-with-sidecar.png (image/png)
image-20220722-012007.png (image/png)
image-20220722-012606.png (image/png)
image-20220722-012944.png (image/png)
image-20220722-014104.png (image/png)
image-20220722-014125.png (image/png)
image-20220722-014138.png (image/png)
image-20220722-014355.png (image/png)
image-20220722-014946.png (image/png)
image-20220722-015017.png (image/png)
image-20220722-015308.png (image/png)
image-20220722-015402.png (image/png)
image-20220722-020604.png (image/png)
image-20220722-020705.png (image/png)
image-20220722-020736.png (image/png)
Monitoring-Stack-003.drawio (1).png (image/png)