
Managed Kafka: Monitoring

Purpose: Shows platform engineers how to configure JMX exporters, Prometheus metrics, and Grafana dashboards for Kafka.

Task Summary

This guide covers the monitoring pipeline for openCenter Managed Kafka: JMX exporter configuration on Strimzi brokers, Prometheus scraping via ServiceMonitor, and Grafana dashboard provisioning. The goal is broker-level visibility into health, throughput, and resource pressure.

Prerequisites

  • A running Kafka cluster deployed per Deploying Kafka
  • kube-prometheus-stack deployed (Prometheus + Grafana) from openCenter-gitops-base
  • kubectl access to the data-kafka and monitoring namespaces

Step 1: Enable JMX Exporter on Brokers

The Kafka CR's metricsConfig field enables the JMX Prometheus exporter as a sidecar on each broker pod. This was included in the deployment tutorial, but here is the relevant section:

# In the Kafka CR spec.kafka block
metricsConfig:
  type: jmxPrometheusExporter
  valueFrom:
    configMapKeyRef:
      name: kafka-metrics
      key: kafka-metrics-config.yml

The ConfigMap kafka-metrics defines which JMX MBeans are exported as Prometheus metrics. The default configuration exports broker-level gauges and counters:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: data-kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
    - pattern: "kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count"
      name: kafka_server_brokertopicmetrics_$1_total
      type: COUNTER
    - pattern: "kafka.server<type=ReplicaManager, name=(.+)><>Value"
      name: kafka_server_replicamanager_$1
      type: GAUGE
    - pattern: "kafka.controller<type=KafkaController, name=(.+)><>Value"
      name: kafka_controller_kafkacontroller_$1
      type: GAUGE
    - pattern: "kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>Count"
      name: kafka_network_requestmetrics_$1_total
      type: COUNTER
      labels:
        request: "$2"

The exporter listens on port 9404 inside each broker pod.

Step 2: Create a ServiceMonitor

Prometheus discovers scrape targets through ServiceMonitor CRDs. Create one for the Kafka brokers:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-brokers
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      strimzi.io/cluster: production
      strimzi.io/kind: Kafka
  endpoints:
  - port: tcp-prometheus
    interval: 30s
    path: /metrics
  namespaceSelector:
    matchNames:
    - data-kafka

The release: kube-prometheus-stack label ensures the Prometheus operator picks up this ServiceMonitor. Adjust the label if your Prometheus instance uses a different selector.
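If targets never appear, the selector is the usual culprit. The Prometheus CR created by kube-prometheus-stack selects ServiceMonitors roughly like this (an illustrative excerpt; inspect your own CR with kubectl get prometheus -n monitoring -o yaml, as field values depend on your Helm release name):

# Excerpt from the Prometheus CR (illustrative)
spec:
  serviceMonitorSelector:
    matchLabels:
      release: kube-prometheus-stack
  serviceMonitorNamespaceSelector: {}   # empty selector = watch all namespaces

The ServiceMonitor's release label must match serviceMonitorSelector, and data-kafka must be covered by serviceMonitorNamespaceSelector.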

Verify Prometheus is scraping the brokers:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets — look for data-kafka/kafka-brokers targets

Step 3: Key Metrics to Monitor

Metric (Type): What it tells you

  • kafka_server_replicamanager_underreplicatedpartitions (Gauge): Partitions where replicas are behind the leader. Non-zero means data risk.
  • kafka_server_replicamanager_isrshrinkspersec (Counter): Rate of ISR shrinks. Sustained shrinks indicate broker or disk problems.
  • kafka_server_brokertopicmetrics_messagesinpersec_total (Counter): Inbound message rate across all topics.
  • kafka_server_brokertopicmetrics_bytesinpersec_total (Counter): Inbound byte rate. Use for capacity planning.
  • kafka_server_brokertopicmetrics_bytesoutpersec_total (Counter): Outbound byte rate (consumer fetch traffic).
  • kafka_controller_kafkacontroller_activecontrollercount (Gauge): Must be exactly 1. Zero means no controller; more than 1 means split-brain.
  • kafka_controller_kafkacontroller_offlinepartitionscount (Gauge): Partitions with no available leader. Must be 0 in a healthy cluster.
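The counters above are cumulative totals; dashboards and alerts normally wrap them in rate(). If you want the common rates precomputed, a PrometheusRule with recording rules is one option. A sketch, assuming the same kube-prometheus-stack setup (the rule names here are illustrative, not part of the openCenter base):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-recording-rules
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: kafka.recording.rules
    rules:
    # Per-second inbound message rate, averaged over 5 minutes
    - record: kafka:messages_in:rate5m
      expr: sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m]))
    # Inbound/outbound byte rates, useful for capacity planning panels
    - record: kafka:bytes_in:rate5m
      expr: sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_total[5m]))
    - record: kafka:bytes_out:rate5m
      expr: sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_total[5m]))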

Step 4: Provision Grafana Dashboards

Create a ConfigMap with the Grafana dashboard JSON. Grafana's sidecar auto-discovers dashboards labeled with grafana_dashboard: "1":

apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kafka-overview.json: |
    {
      "dashboard": {
        "title": "Kafka Broker Overview",
        "panels": [
          {
            "title": "Under-Replicated Partitions",
            "type": "stat",
            "targets": [{"expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"}]
          },
          {
            "title": "Messages In/sec",
            "type": "timeseries",
            "targets": [{"expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m]))"}]
          },
          {
            "title": "Active Controller Count",
            "type": "stat",
            "targets": [{"expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)"}]
          }
        ]
      }
    }

This is a simplified example. The full openCenter Kafka dashboard includes panels for disk usage, request latency percentiles, consumer lag (where platform-visible), JVM heap pressure, and certificate expiry.
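As one example of extending the panel list, the byte-rate counters from Step 3 can back a throughput panel using the same pattern (a sketch; the panel title and legend labels are illustrative):

{
  "title": "Broker Throughput (bytes/sec)",
  "type": "timeseries",
  "targets": [
    {"expr": "sum(rate(kafka_server_brokertopicmetrics_bytesinpersec_total[5m]))", "legendFormat": "bytes in"},
    {"expr": "sum(rate(kafka_server_brokertopicmetrics_bytesoutpersec_total[5m]))", "legendFormat": "bytes out"}
  ]
}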

Step 5: Configure Alerts

Add PrometheusRule resources for critical Kafka conditions:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: kafka.rules
    rules:
    - alert: KafkaUnderReplicatedPartitions
      expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Kafka has under-replicated partitions"
        description: "{{ $value }} partitions are under-replicated for more than 5 minutes."
    - alert: KafkaNoActiveController
      expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Kafka active controller count is not 1"
    - alert: KafkaOfflinePartitions
      expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Kafka has offline partitions"
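The ISR-shrink counter from Step 3 fits the same pattern. A sketch of an additional rule for the rules list above (the 15m window and warning severity are illustrative starting points, not tuned values):

    - alert: KafkaSustainedIsrShrinks
      expr: sum(rate(kafka_server_replicamanager_isrshrinkspersec[5m])) > 0
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Kafka ISR is shrinking repeatedly"
        description: "Sustained ISR shrinks usually point at a slow broker, network issues, or disk problems."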

Verification

  1. Confirm JMX exporter is running on broker pods:
    kubectl exec -n data-kafka production-kafka-0 -- curl -s localhost:9404/metrics | head -20
  2. Check Prometheus targets page for data-kafka/kafka-brokers — all 3 brokers should show UP.
  3. Open Grafana and search for "Kafka Broker Overview" dashboard.
  4. Trigger a test alert by scaling down a broker (non-prod only) and confirm that the KafkaUnderReplicatedPartitions alert fires.

Troubleshooting

Prometheus shows no Kafka targets — Verify the ServiceMonitor label matches the Prometheus operator's serviceMonitorSelector, and confirm the ServiceMonitor exists: kubectl get servicemonitor kafka-brokers -n data-kafka.

Metrics return empty — Confirm the kafka-metrics ConfigMap is mounted correctly. Check broker pod logs for JMX exporter startup errors.

Grafana dashboard not appearing — Verify the ConfigMap has the grafana_dashboard: "1" label and is in the namespace Grafana's sidecar watches (typically monitoring).