Metrics (kube-prometheus-stack)

Purpose: For platform engineers; shows how to add scrape targets, create alerting and recording rules, and tune retention in kube-prometheus-stack.

Task Summary

Prometheus is deployed as part of kube-prometheus-stack via FluxCD. This guide covers how to add scrape targets for your services, create alerting rules, and tune retention settings.

Prerequisites

  • kube-prometheus-stack deployed (default in openCenter clusters)
  • kubectl access to the cluster
  • Familiarity with PromQL basics

Add a Scrape Target

Prometheus discovers scrape targets through ServiceMonitor and PodMonitor CRDs. To expose metrics from your application:

Step 1: Expose a /metrics endpoint

Your application must serve Prometheus-format metrics on an HTTP endpoint (typically /metrics).

Step 2: Create a ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Must match Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics
      interval: 30s
      path: /metrics

The release: kube-prometheus-stack label is required for Prometheus to discover the ServiceMonitor. Without it, the target is ignored.
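The ServiceMonitor selects a Service by label, and the endpoint port refers to a named Service port. A companion Service might look like the following sketch (the port number 8080 is an assumed example; adjust it to wherever your app serves /metrics):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
  labels:
    app: my-app          # Matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: my-app          # Selects the application pods
  ports:
    - name: http-metrics # Must match the ServiceMonitor endpoint's "port"
      port: 8080
      targetPort: 8080
```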

Step 3: Verify the target

# Check ServiceMonitor was created
kubectl get servicemonitor -n my-app

# Port-forward to Prometheus UI
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Open http://localhost:9090/targets — your target should appear as UP
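For pods that are not fronted by a Service, a PodMonitor can be used instead of a ServiceMonitor. A minimal sketch, assuming the same labels and port name as above:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Same discovery label requirement applies
spec:
  selector:
    matchLabels:
      app: my-app                   # Selects pods directly by label
  podMetricsEndpoints:
    - port: http-metrics            # Named container port on the pod
      interval: 30s
      path: /metrics
```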

Create Alerting Rules

Alerting rules are defined via PrometheusRule CRDs:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: my-app
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_total{status=~"5..", job="my-app"}[5m]) > 0.1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High 5xx error rate on {{ $labels.instance }}"
            description: "Error rate is {{ $value }} req/s over the last 5 minutes."

Verify the rule is loaded:

kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Open http://localhost:9090/rules — your rule group should appear

Recording Rules

Recording rules pre-compute expensive queries and store the result as a new time series:

spec:
  groups:
    - name: my-app-recording
      rules:
        - record: my_app:http_request_duration_seconds:p99
          expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="my-app"}[5m]))

Use recording rules for dashboard queries that aggregate across many series or use histogram_quantile.
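Once recorded, the pre-computed series can be referenced directly in dashboards or alerts. A sketch of an alert built on the recorded series above (the 1-second threshold is an assumed example, not a recommended value):

```yaml
- alert: HighP99Latency
  # References the recorded series instead of re-running histogram_quantile
  expr: my_app:http_request_duration_seconds:p99 > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "p99 request latency above 1s for {{ $labels.job }}"
```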

Retention and Storage

Retention is configured in the HelmRelease values. The default is 15 days. To change it, add an override in the customer overlay:

# applications/overlays/<cluster>/services/kube-prometheus-stack/override-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          resources:
            requests:
              storage: 100Gi

Verification

# Check Prometheus is running
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus

# Check all targets are healthy
kubectl port-forward svc/kube-prometheus-stack-prometheus -n monitoring 9090:9090
# Visit http://localhost:9090/targets

# Check alerting rules are loaded
# Visit http://localhost:9090/rules

Troubleshooting

ServiceMonitor target not appearing: Verify the release: kube-prometheus-stack label is present on the ServiceMonitor. Check that the Service's labels match the ServiceMonitor's spec.selector.matchLabels, and that the endpoint's port name matches a named port on the Service.

"out of memory" on Prometheus pod: Reduce the number of scraped series or increase memory limits in the HelmRelease values. Check cardinality with: prometheus_tsdb_head_series metric.

Further Reading