Fleet Observability
In Development (Q4 2026)
This feature is currently in development. Fleet observability patterns described here are subject to change.
Purpose: For platform engineers and operators, explains how to aggregate metrics, logs, and alerts across a fleet for unified operational visibility.
Architecture
Spoke Clusters                        Hub Cluster
┌──────────────┐                      ┌──────────────────┐
│ Prometheus   │──remote-write──────► │ Thanos / Mimir   │
│ Loki         │──log-forward───────► │ Centralized Loki │
│ Tempo        │──trace-forward─────► │ Centralized Tempo│
└──────────────┘                      │ Grafana (fleet)  │
                                      └──────────────────┘
Federated Metrics
Each spoke cluster's Prometheus remote-writes to a central Thanos/Mimir receiver on the hub:
- Cluster-identifying labels added automatically (`cluster`, `region`, `environment`)
- Retention: 2h local (spoke), 90d centralized (hub)
- Query: Grafana on hub queries across all clusters transparently
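Because every series carries a `cluster` label, a single query against the hub can break results down per cluster. The following is a minimal sketch, assuming the hub exposes the Prometheus-compatible HTTP API at `/api/v1/query` (Thanos Query and Mimir both serve this API, though the path prefix depends on the deployment). The hostname, the sample response, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical hub endpoint; replace with the real query frontend address.
HUB_URL = "https://metrics.hub.example.com"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def values_by_cluster(response: dict) -> dict[str, float]:
    """Map each series' `cluster` label to its sample value."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = {}
    for series in response["data"]["result"]:
        cluster = series["metric"].get("cluster", "<unlabeled>")
        _timestamp, value = series["value"]  # the API returns the value as a string
        result[cluster] = float(value)
    return result


# Count healthy scrape targets per cluster across the whole fleet.
url = build_query_url(HUB_URL, "sum by (cluster) (up)")

# A response shaped like an instant-vector result (made-up data, no network call).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "spoke-a"}, "value": [1767225600, "42"]},
            {"metric": {"cluster": "spoke-b"}, "value": [1767225600, "37"]},
        ],
    },
}

print(url)
print(values_by_cluster(sample))
```

In a real setup the URL would be fetched with an HTTP client and the decoded JSON passed to `values_by_cluster`; the sketch keeps the two steps separate so the parsing can be exercised offline.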
Cross-Cluster Dashboards
Pre-built fleet dashboards include:
| Dashboard | Shows |
|---|---|
| Fleet Overview | Cluster health, node count, pod count, version matrix |
| Resource Utilization | CPU/memory by cluster, group, and namespace |
| GitOps Status | Reconciliation status, drift alerts, failed Kustomizations |
| Policy Compliance | Kyverno violations by cluster and policy |
| SLO Tracking | Availability and error budget by cluster and service |
Fleet Alerting
Fleet-level alert rules fire when:
- A cluster becomes unreachable (agent heartbeat timeout)
- Policy compliance drops below threshold (e.g., < 95%)
- Resource utilization exceeds capacity planning targets
- GitOps reconciliation fails for > 15 minutes
- A cluster's Kubernetes version falls below the fleet minimum
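The conditions above can be sketched as a plain evaluation function. This is an illustration only, not the feature's rule engine: the `ClusterStatus` fields and the alert names are hypothetical, and the heartbeat timeout, CPU target, and minimum version are assumed values. Only the 95% compliance threshold and the 15-minute reconciliation window come from the list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClusterStatus:
    """Hypothetical per-cluster snapshot; field names are illustrative."""
    name: str
    last_heartbeat: datetime
    compliance_pct: float
    cpu_utilization_pct: float
    reconcile_failing_since: Optional[datetime]  # None when reconciliation is healthy
    k8s_version: tuple  # (major, minor)


HEARTBEAT_TIMEOUT = timedelta(minutes=5)   # assumed value
COMPLIANCE_MIN_PCT = 95.0                  # from the list above
CPU_TARGET_PCT = 80.0                      # assumed capacity planning target
RECONCILE_GRACE = timedelta(minutes=15)    # from the list above
FLEET_MIN_VERSION = (1, 30)                # assumed fleet minimum


def fleet_alerts(c: ClusterStatus, now: datetime) -> list:
    """Return the names of the fleet-level alerts that would fire for one cluster."""
    alerts = []
    if now - c.last_heartbeat > HEARTBEAT_TIMEOUT:
        alerts.append("ClusterUnreachable")
    if c.compliance_pct < COMPLIANCE_MIN_PCT:
        alerts.append("PolicyComplianceLow")
    if c.cpu_utilization_pct > CPU_TARGET_PCT:
        alerts.append("CapacityTargetExceeded")
    if (c.reconcile_failing_since is not None
            and now - c.reconcile_failing_since > RECONCILE_GRACE):
        alerts.append("GitOpsReconcileFailing")
    if c.k8s_version < FLEET_MIN_VERSION:  # tuples compare major first, then minor
        alerts.append("KubernetesVersionBelowMinimum")
    return alerts


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
healthy = ClusterStatus("spoke-a", now - timedelta(seconds=30), 99.0, 55.0, None, (1, 31))
degraded = ClusterStatus(
    "spoke-b", now - timedelta(minutes=10), 91.5, 88.0,
    now - timedelta(minutes=20), (1, 29),
)

print(fleet_alerts(healthy, now))
print(fleet_alerts(degraded, now))
```

In practice these checks would more likely be expressed as alerting rules evaluated on the hub's metrics backend; the function form is used here only to make each threshold explicit.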
SLO Tracking
Define fleet-level SLOs that span clusters:
| SLO | Target | Measurement |
|---|---|---|
| Cluster availability | 99.9% | Agent heartbeat |
| GitOps reconciliation success | 99.5% | FluxCD status |
| Policy compliance | 100% critical, 95% overall | Kyverno audit |
| Certificate validity | 100% (> 30d remaining) | cert-manager |
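To make the targets concrete, each percentage can be converted into an error budget: the amount of time an SLO may be violated within a measurement window. The sketch below assumes a 30-day rolling window, which the table does not specify, and the function names are illustrative.

```python
def error_budget_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of SLO violation allowed within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


def budget_remaining(target_pct: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(target_pct, window_days)
    if budget == 0:
        # A 100% target leaves no budget: any violation breaches the SLO.
        return 1.0 if bad_minutes == 0 else float("-inf")
    return 1.0 - bad_minutes / budget


# Cluster availability at 99.9% over an assumed 30-day window.
print(round(error_budget_minutes(99.9), 1))   # 43.2
# GitOps reconciliation success at 99.5% over the same window.
print(round(error_budget_minutes(99.5), 1))   # 216.0
# 10.8 minutes of downtime consumes a quarter of the 99.9% budget.
print(round(budget_remaining(99.9, 10.8), 2))  # 0.75
```

Under this assumption a 99.9% availability target allows roughly 43 minutes of missed heartbeats per cluster per 30 days, while the 100% targets (critical policy compliance, certificate validity) leave no budget at all, so any violation counts as a breach.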