Fleet Observability

In Development (Q4 2026)

This feature is currently in development. Fleet observability patterns described here are subject to change.

Purpose: For platform engineers and operators, explains how to aggregate metrics, logs, and alerts across a fleet for unified operational visibility.

Architecture

Spoke Clusters                        Hub Cluster
┌──────────────┐                      ┌──────────────────┐
│ Prometheus   │──remote-write──────► │ Thanos / Mimir   │
│ Loki         │──log-forward───────► │ Centralized Loki │
│ Tempo        │──trace-forward─────► │ Centralized Tempo│
└──────────────┘                      │ Grafana (fleet)  │
                                      └──────────────────┘

Federated Metrics

Each spoke cluster's Prometheus remote-writes to a central Thanos/Mimir receiver on the hub:

  • Labels: cluster-identifying labels (cluster, region, environment) are added automatically
  • Retention: 2 hours locally on each spoke, 90 days centrally on the hub
  • Query: Grafana on the hub queries across all clusters transparently
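
As a rough illustration of the remote-write setup above, the following sketch builds a Prometheus configuration fragment for one spoke. It assumes the cluster-identifying labels are attached through Prometheus `external_labels`, which is one common mechanism and may not be how this feature ends up doing it. The cluster name, region, and hub URL are hypothetical, and the `/api/v1/push` path assumes a Mimir receiver.

```python
import json


def spoke_prometheus_config(cluster, region, environment, hub_url):
    """Build a Prometheus config fragment for one spoke cluster (sketch)."""
    return {
        "global": {
            # external_labels are attached to every series sent via remote write,
            # so the hub can tell the spokes apart.
            "external_labels": {
                "cluster": cluster,
                "region": region,
                "environment": environment,
            },
        },
        # Hypothetical hub receiver endpoint; adjust to the actual receiver.
        "remote_write": [{"url": hub_url}],
    }


config = spoke_prometheus_config(
    cluster="spoke-eu-1",          # hypothetical cluster name
    region="eu-west-1",            # hypothetical region
    environment="production",
    hub_url="https://metrics.hub.example.com/api/v1/push",
)

# JSON is valid YAML, so this output can be read as a Prometheus config file.
print(json.dumps(config, indent=2))
```

Local retention is not part of the configuration file; in Prometheus it is set with the `--storage.tsdb.retention.time` command-line flag (for example `2h` on a spoke).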

Cross-Cluster Dashboards

Pre-built fleet dashboards include:

Dashboard             Shows
Fleet Overview        Cluster health, node count, pod count, version matrix
Resource Utilization  CPU/memory by cluster, group, and namespace
GitOps Status         Reconciliation status, drift alerts, failed Kustomizations
Policy Compliance     Kyverno violations by cluster and policy
SLO Tracking          Availability and error budget by cluster and service
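
To give a feel for how such panels can be driven by the cluster label, here is a small sketch that builds PromQL queries for a few Fleet Overview panels. It assumes kube-state-metrics is running on every spoke (the source of `kube_node_info` and `kube_pod_info`) and that each series carries a `cluster` label. The helper and panel names are hypothetical and are not taken from the pre-built dashboards.

```python
def count_by_cluster(metric, extra_labels=()):
    """Return a PromQL query counting series of `metric` per cluster."""
    labels = ", ".join(("cluster", *extra_labels))
    return f"count by ({labels}) ({metric})"


# Hypothetical panel names mapped to PromQL queries.
FLEET_OVERVIEW_QUERIES = {
    "node_count": count_by_cluster("kube_node_info"),
    "pod_count": count_by_cluster("kube_pod_info"),
    # kube_node_info carries a kubelet_version label, which gives a
    # per-cluster version breakdown.
    "version_matrix": count_by_cluster("kube_node_info", ["kubelet_version"]),
}

for panel, query in FLEET_OVERVIEW_QUERIES.items():
    print(f"{panel}: {query}")
```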

Fleet Alerting

Fleet-level alert rules fire when:

  • A cluster becomes unreachable (agent heartbeat timeout)
  • Policy compliance drops below threshold (e.g., < 95%)
  • Resource utilization exceeds capacity planning targets
  • GitOps reconciliation fails for > 15 minutes
  • A cluster's Kubernetes version falls behind the fleet minimum
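
The conditions above can be sketched as a plain evaluation function. This is only an illustration of the logic, not the rule format the feature will use. The 95% compliance threshold and the 15-minute reconciliation window come from the list; the heartbeat timeout, CPU target, fleet minimum version, alert names, and the `ClusterStatus` fields are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ClusterStatus:
    """Hypothetical snapshot of one cluster's fleet-level signals."""
    name: str
    seconds_since_heartbeat: float
    compliance_pct: float
    cpu_utilization_pct: float
    reconcile_failing_minutes: float
    kubernetes_version: str  # e.g. "1.29" or "v1.29.3"


HEARTBEAT_TIMEOUT_S = 120    # assumption: not specified in the text
COMPLIANCE_MIN_PCT = 95.0    # from the text (example threshold)
CPU_TARGET_PCT = 80.0        # assumption: capacity planning target
RECONCILE_MAX_MIN = 15.0     # from the text
FLEET_MIN_VERSION = "1.28"   # assumption: fleet minimum version


def parse_version(version):
    """Parse 'v1.29.3' or '1.29' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in version.lstrip("v").split(".")[:2])


def fleet_alerts(status):
    """Return the names of the fleet-level alerts that would fire."""
    alerts = []
    if status.seconds_since_heartbeat > HEARTBEAT_TIMEOUT_S:
        alerts.append("ClusterUnreachable")
    if status.compliance_pct < COMPLIANCE_MIN_PCT:
        alerts.append("PolicyComplianceLow")
    if status.cpu_utilization_pct > CPU_TARGET_PCT:
        alerts.append("ResourceUtilizationHigh")
    if status.reconcile_failing_minutes > RECONCILE_MAX_MIN:
        alerts.append("GitOpsReconciliationFailing")
    if parse_version(status.kubernetes_version) < parse_version(FLEET_MIN_VERSION):
        alerts.append("KubernetesVersionBehind")
    return alerts


status = ClusterStatus("spoke-eu-1", 30, 93.5, 61.0, 20, "1.27")
print(fleet_alerts(status))
```

Comparing versions as integer tuples rather than strings avoids ordering mistakes such as treating "1.9" as newer than "1.10".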

SLO Tracking

Define fleet-level SLOs that span clusters:

SLO                            Target                      Measurement
Cluster availability           99.9%                       Agent heartbeat
GitOps reconciliation success  99.5%                       FluxCD status
Policy compliance              100% critical, 95% overall  Kyverno audit
Certificate validity           100% (> 30d remaining)      cert-manager
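
To make the targets concrete, the sketch below converts an SLO target into an error budget and reports how much of it remains. The 30-day window is an assumption, since the text does not state a measurement window, and the function names are hypothetical.

```python
def error_budget_minutes(target_pct, window_days=30):
    """Minutes of allowed failure for a target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - target_pct) / 100.0


def budget_remaining_pct(target_pct, bad_minutes, window_days=30):
    """Percentage of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(target_pct, window_days)
    if budget == 0:
        # A 100% target has no budget: any failure exhausts it.
        return 100.0 if bad_minutes == 0 else 0.0
    return 100.0 * (budget - bad_minutes) / budget


# Cluster availability at 99.9% over 30 days allows about 43.2 minutes.
print(f"{error_budget_minutes(99.9):.1f} minutes")

# GitOps reconciliation success at 99.5% allows 216 minutes.
print(f"{error_budget_minutes(99.5):.1f} minutes")

# After 10 minutes of unreachability, most of the 99.9% budget remains.
print(f"{budget_remaining_pct(99.9, bad_minutes=10):.1f}% remaining")
```

A 100% target, as in the critical policy and certificate rows, leaves no budget at all, so any single violation counts as a breach.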