Metrics (kube-prometheus-stack)

Purpose: For platform engineers; shows how to add scrape targets, create alerting and recording rules, and tune retention in kube-prometheus-stack.

Task Summary

Prometheus is deployed as part of kube-prometheus-stack via FluxCD. This guide covers how to add scrape targets for your services, create alerting rules, and tune retention settings.

Prerequisites

  • kube-prometheus-stack deployed (default in openCenter clusters)
  • kubectl access to the cluster
  • Familiarity with PromQL basics

Add a Scrape Target

Prometheus discovers scrape targets through ServiceMonitor and PodMonitor CRDs. To expose metrics from your application:

Step 1: Expose a /metrics endpoint

Your application must serve Prometheus-format metrics on an HTTP endpoint (typically /metrics).

Step 2: Create a ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Must match Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics
      interval: 30s
      path: /metrics

The release: kube-prometheus-stack label is required for Prometheus to discover the ServiceMonitor. Without it, the target is ignored.
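The ServiceMonitor selects a Service by label, and the endpoint port refers to a named Service port. A companion Service might look like the following sketch (the port number 8080 is an assumed example; adjust it to wherever your app serves /metrics):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
  labels:
    app: my-app          # Matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: my-app          # Selects the application pods
  ports:
    - name: http-metrics # Must match the ServiceMonitor endpoint's "port"
      port: 8080
      targetPort: 8080
```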

Step 3: Verify the target

# Check ServiceMonitor was created
kubectl get servicemonitor -n my-app

# Port-forward to Prometheus UI
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Open http://localhost:9090/targets — your target should appear as UP
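For pods that are not fronted by a Service, a PodMonitor can be used instead of a ServiceMonitor. A minimal sketch, assuming the same labels and port name as above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Same discovery label requirement applies
spec:
  selector:
    matchLabels:
      app: my-app                   # Selects pods directly by label
  podMetricsEndpoints:
    - port: http-metrics            # Named container port on the pod
      interval: 30s
      path: /metrics
```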

Create Alerting Rules

Alerting rules are defined via PrometheusRule CRDs:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5..", job="my-app"}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx error rate on {{ $labels.instance }}"
            description: "Error rate is {{ $value }} req/s over the last 5 minutes."

Verify the rule is loaded:

kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Open http://localhost:9090/rules — your rule group should appear

Recording Rules

Recording rules pre-compute expensive queries and store the result as a new time series:

spec:
  groups:
    - name: my-app-recording
      rules:
        - record: my_app:http_request_duration_seconds:p99
          expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="my-app"}[5m]))

Use recording rules for dashboard queries that aggregate across many series or use histogram_quantile.
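Once recorded, the pre-computed series can be referenced directly in dashboards or alerts. A sketch of an alert built on the recorded series above (the 1-second threshold is an assumed example, not a recommended value):

```yaml
- alert: HighP99Latency
  # References the recorded series instead of re-running histogram_quantile
  expr: my_app:http_request_duration_seconds:p99 > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p99 request latency above 1s for {{ $labels.job }}"
```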

Retention and Storage

Retention is configured in the HelmRelease values. The default is 15 days. To change it, add an override in the customer overlay:

# applications/overlays/<cluster>/services/kube-prometheus-stack/override-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 100Gi

Verification

# Check Prometheus is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

# Check all targets are healthy
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Visit http://localhost:9090/targets

# Check alerting rules are loaded
# Visit http://localhost:9090/rules

Troubleshooting

ServiceMonitor target not appearing: Verify the release: kube-prometheus-stack label is present on the ServiceMonitor. Check that the Service's labels match the ServiceMonitor's spec.selector.matchLabels, and that the endpoint's port name matches a named port on the Service.

"out of memory" on Prometheus pod: Reduce the number of scraped series or increase memory limits in the HelmRelease values. Check cardinality with: prometheus_tsdb_head_series metric.

Further Reading