
Managed Kafka: Monitoring

Purpose: Shows platform engineers how to configure JMX exporters, Prometheus metrics, and Grafana dashboards for Kafka.

Task Summary

This guide covers the monitoring pipeline for openCenter Managed Kafka: JMX exporter configuration on Strimzi brokers, Prometheus scraping via ServiceMonitor, and Grafana dashboard provisioning. The goal is broker-level visibility into health, throughput, and resource pressure.

Prerequisites

  • A running Kafka cluster deployed per Deploying Kafka
  • kube-prometheus-stack deployed (Prometheus + Grafana) from openCenter-gitops-base
  • kubectl access to the data-kafka and monitoring namespaces

Step 1: Enable JMX Exporter on Brokers

The Kafka CR's metricsConfig field enables the JMX Prometheus exporter as a sidecar on each broker pod. This was included in the deployment tutorial, but here is the relevant section:

# In the Kafka CR spec.kafka block
metricsConfig:
  type: jmxPrometheusExporter
  valueFrom:
    configMapKeyRef:
      name: kafka-metrics
      key: kafka-metrics-config.yml

The ConfigMap kafka-metrics defines which JMX MBeans are exported as Prometheus metrics. The default configuration exports broker-level gauges and counters:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: data-kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    - pattern: "kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count"
      name: kafka_server_brokertopicmetrics_$1_total
      type: COUNTER
    - pattern: "kafka.server<type=ReplicaManager, name=(.+)><>Value"
      name: kafka_server_replicamanager_$1
      type: GAUGE
    - pattern: "kafka.controller<type=KafkaController, name=(.+)><>Value"
      name: kafka_controller_kafkacontroller_$1
      type: GAUGE
    - pattern: "kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>Count"
      name: kafka_network_requestmetrics_$1_total
      type: COUNTER
      labels:
        request: "$2"

The exporter listens on port 9404 inside each broker pod.

Step 2: Create a ServiceMonitor

Prometheus discovers scrape targets through ServiceMonitor CRDs. Create one for the Kafka brokers:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-brokers
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      strimzi.io/cluster: production
      strimzi.io/kind: Kafka
  endpoints:
  - port: tcp-prometheus
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - data-kafka

The release: kube-prometheus-stack label ensures the Prometheus operator picks up this ServiceMonitor. Adjust the label if your Prometheus instance uses a different selector.
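If targets never appear, the selector is the usual culprit. The Prometheus CR created by kube-prometheus-stack selects ServiceMonitors roughly like this (an illustrative excerpt; inspect your own CR with kubectl get prometheus -n monitoring -o yaml, as field values depend on your Helm release name):

# Excerpt from the Prometheus CR (illustrative)
spec:
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  serviceMonitorNamespaceSelector: {}   # empty selector = watch all namespaces

The ServiceMonitor's release label must match serviceMonitorSelector, and data-kafka must be covered by serviceMonitorNamespaceSelector.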

Verify Prometheus is scraping the brokers:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets — look for data-kafka/kafka-brokers targets

Step 3: Key Metrics to Monitor

Metric (Type): What it tells you

  • kafka_server_replicamanager_underreplicatedpartitions (Gauge): Partitions where replicas are behind the leader. Non-zero means data risk.
  • kafka_server_replicamanager_isrshrinkspersec (Counter): Rate of ISR shrinks. Sustained shrinks indicate broker or disk problems.
  • kafka_server_brokertopicmetrics_messagesinpersec_total (Counter): Inbound message rate across all topics.
  • kafka_server_brokertopicmetrics_bytesinpersec_total (Counter): Inbound byte rate. Use for capacity planning.
  • kafka_server_brokertopicmetrics_bytesoutpersec_total (Counter): Outbound byte rate (consumer fetch traffic).
  • kafka_controller_kafkacontroller_activecontrollercount (Gauge): Must be exactly 1. Zero means no controller; more than 1 means split-brain.
  • kafka_controller_kafkacontroller_offlinepartitionscount (Gauge): Partitions with no available leader. Must be 0 in a healthy cluster.
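The counters above are cumulative totals; dashboards and alerts normally wrap them in rate(). If you want the common rates precomputed, a PrometheusRule with recording rules is one option. A sketch, assuming the same kube-prometheus-stack setup (the rule names here are illustrative, not part of the openCenter base):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-recording-rules
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: kafka.recording.rules
    rules:
    # Per-second inbound message rate, averaged over 5 minutes
    - record: kafka:messages_in:rate5m
      expr: sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m]))
    # Inbound/outbound byte rates, useful for capacity planning panels
    - record: kafka:bytes_in:rate5m
      expr: sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_total[5m]))
    - record: kafka:bytes_out:rate5m
      expr: sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_total[5m]))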

Step 4: Provision Grafana Dashboards

Create a ConfigMap with the Grafana dashboard JSON. Grafana's sidecar auto-discovers dashboards labeled with grafana_dashboard: "1":

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kafka-overview.json: |
    {
      "dashboard": {
        "title": "Kafka Broker Overview",
        "panels": [
          {
            "title": "Under-Replicated Partitions",
            "type": "stat",
            "targets": [{"expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"}]
          },
          {
            "title": "Messages In/sec",
            "type": "timeseries",
            "targets": [{"expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m]))"}]
          },
          {
            "title": "Active Controller Count",
            "type": "stat",
            "targets": [{"expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)"}]
          }
        ]
      }
    }

This is a simplified example. The full openCenter Kafka dashboard includes panels for disk usage, request latency percentiles, consumer lag (where platform-visible), JVM heap pressure, and certificate expiry.
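As one example of extending the panel list, the byte-rate counters from Step 3 can back a throughput panel using the same pattern (a sketch; the panel title and legend labels are illustrative):

{
  "title": "Broker Throughput (bytes/sec)",
  "type": "timeseries",
  "targets": [
    {"expr": "sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_total[5m]))", "legendFormat": "bytes in"},
    {"expr": "sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_total[5m]))", "legendFormat": "bytes out"}
  ]
}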

Step 5: Configure Alerts

Add PrometheusRule resources for critical Kafka conditions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: kafka.rules
    rules:
    - alert: KafkaUnderReplicatedPartitions
      expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Kafka has under-replicated partitions"
        description: "{{ $value }} partitions are under-replicated for more than 5 minutes."
    - alert: KafkaNoActiveController
      expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Kafka active controller count is not 1"
    - alert: KafkaOfflinePartitions
      expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Kafka has offline partitions"
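The ISR-shrink counter from Step 3 fits the same pattern. A sketch of an additional rule for the rules list above (the 15m window and warning severity are illustrative starting points, not tuned values):

    - alert: KafkaSustainedIsrShrinks
      expr: sum(rate(kafka_server_replicamanager_isrshrinkspersec[5m])) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Kafka ISR is shrinking repeatedly"
        description: "Sustained ISR shrinks usually point at a slow broker, network issues, or disk problems."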

Verification

  1. Confirm JMX exporter is running on broker pods:
    kubectl exec -n data-kafka production-kafka-0 -- curl -s localhost:9404/metrics | head -20
  2. Check Prometheus targets page for data-kafka/kafka-brokers — all 3 brokers should show UP.
  3. Open Grafana and search for "Kafka Broker Overview" dashboard.
  4. Trigger a test alert by scaling down a broker (non-prod only) and confirm that the KafkaUnderReplicatedPartitions alert fires.

Troubleshooting

Prometheus shows no Kafka targets — Verify the ServiceMonitor label matches the Prometheus operator's serviceMonitorSelector, and confirm the ServiceMonitor exists: kubectl get servicemonitor kafka-brokers -n data-kafka.

Metrics return empty — Confirm the kafka-metrics ConfigMap is mounted correctly. Check broker pod logs for JMX exporter startup errors.

Grafana dashboard not appearing — Verify the ConfigMap has the grafana_dashboard: "1" label and is in the namespace Grafana's sidecar watches (typically monitoring).