Managed Kafka: Monitoring
Purpose: For platform engineers, shows how to configure JMX exporters, Prometheus metrics, and Grafana dashboards for Kafka.
Task Summary
This guide covers the monitoring pipeline for openCenter Managed Kafka: JMX exporter configuration on Strimzi brokers, Prometheus scraping via ServiceMonitor, and Grafana dashboard provisioning. The goal is broker-level visibility into health, throughput, and resource pressure.
Prerequisites
- A running Kafka cluster deployed per Deploying Kafka
- kube-prometheus-stack deployed (Prometheus + Grafana) from openCenter-gitops-base
- `kubectl` access to the `data-kafka` and `monitoring` namespaces
Step 1: Enable JMX Exporter on Brokers
The Kafka CR's `metricsConfig` field enables the JMX Prometheus exporter, which runs as a Java agent inside each broker JVM. This was included in the deployment tutorial, but here is the relevant section:
```yaml
# In the Kafka CR spec.kafka block
metricsConfig:
  type: jmxPrometheusExporter
  valueFrom:
    configMapKeyRef:
      name: kafka-metrics
      key: kafka-metrics-config.yml
```
The ConfigMap `kafka-metrics` defines which JMX MBeans are exported as Prometheus metrics. The default configuration exports broker-level gauges and counters:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: data-kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      - pattern: "kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count"
        name: kafka_server_brokertopicmetrics_$1_total
        type: COUNTER
      - pattern: "kafka.server<type=ReplicaManager, name=(.+)><>Value"
        name: kafka_server_replicamanager_$1
        type: GAUGE
      - pattern: "kafka.controller<type=KafkaController, name=(.+)><>Value"
        name: kafka_controller_kafkacontroller_$1
        type: GAUGE
      - pattern: "kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>Count"
        name: kafka_network_requestmetrics_$1_total
        type: COUNTER
        labels:
          request: "$2"
```
The exporter listens on port 9404 inside each broker pod.
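The rewrite rules above can be sanity-checked without a cluster. Below is a minimal Python sketch (not part of the exporter itself) that mimics how a `pattern`/`name` rule plus `lowercaseOutputName: true` maps an MBean name to the Prometheus metric name you will see on port 9404:

```python
import re

# Simplified model of two of the JMX exporter rules from the ConfigMap above.
# The real exporter handles many more cases (attribute names, value types, etc.).
rules = [
    (r"kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count",
     r"kafka_server_brokertopicmetrics_\1_total"),
    (r"kafka.server<type=ReplicaManager, name=(.+)><>Value",
     r"kafka_server_replicamanager_\1"),
]

def metric_name(mbean: str):
    """Return the exported Prometheus metric name for an MBean, or None."""
    for pattern, name in rules:
        m = re.fullmatch(pattern, mbean)
        if m:
            # lowercaseOutputName: true lowercases the final metric name
            return m.expand(name).lower()
    return None

print(metric_name("kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>Count"))
# kafka_server_brokertopicmetrics_messagesinpersec_total
```

This is why the metric names in Step 3 are all-lowercase even though the underlying MBean names are CamelCase.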
Step 2: Create a ServiceMonitor
Prometheus discovers scrape targets through ServiceMonitor CRDs. Create one for the Kafka brokers:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-brokers
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      strimzi.io/cluster: production
      strimzi.io/kind: Kafka
  endpoints:
    - port: tcp-prometheus
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - data-kafka
```
The `release: kube-prometheus-stack` label ensures the Prometheus operator picks up this ServiceMonitor. Adjust the label if your Prometheus instance uses a different selector.
Verify Prometheus is scraping the brokers:
```shell
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Open http://localhost:9090/targets and look for data-kafka/kafka-brokers targets
```
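The same check can be scripted against Prometheus's `/api/v1/targets` endpoint rather than eyeballed in the UI. A sketch of the health check, run here against an illustrative payload trimmed to only the fields the check needs (a real response has many more):

```python
import json

# Trimmed, illustrative /api/v1/targets response: three broker targets, one down.
sample = json.loads("""
{
  "data": {
    "activeTargets": [
      {"labels": {"job": "kafka-brokers", "namespace": "data-kafka"}, "health": "up"},
      {"labels": {"job": "kafka-brokers", "namespace": "data-kafka"}, "health": "up"},
      {"labels": {"job": "kafka-brokers", "namespace": "data-kafka"}, "health": "down"}
    ]
  }
}
""")

def broker_targets_down(targets_response, job="kafka-brokers"):
    """Count scrape targets for the given job whose health is not 'up'."""
    targets = targets_response["data"]["activeTargets"]
    return sum(1 for t in targets
               if t["labels"].get("job") == job and t["health"] != "up")

print(broker_targets_down(sample))  # 1
```

Against the live port-forward, the payload would come from `http://localhost:9090/api/v1/targets` instead of the inline sample.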
Step 3: Key Metrics to Monitor
| Metric | Type | What It Tells You |
|---|---|---|
| `kafka_server_replicamanager_underreplicatedpartitions` | Gauge | Partitions where replicas are behind the leader. Non-zero means data risk. |
| `kafka_server_replicamanager_isrshrinkspersec` | Counter | Rate of ISR shrinks. Sustained shrinks indicate broker or disk problems. |
| `kafka_server_brokertopicmetrics_messagesinpersec_total` | Counter | Inbound message rate across all topics. |
| `kafka_server_brokertopicmetrics_bytesinpersec_total` | Counter | Inbound byte rate. Use for capacity planning. |
| `kafka_server_brokertopicmetrics_bytesoutpersec_total` | Counter | Outbound byte rate (consumer fetch traffic). |
| `kafka_controller_kafkacontroller_activecontrollercount` | Gauge | Must be exactly 1. Zero means no controller; >1 means split-brain. |
| `kafka_controller_kafkacontroller_offlinepartitionscount` | Gauge | Partitions with no available leader. Must be 0 in a healthy state. |
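The counters in this table only ever increase; dashboards and alerts wrap them in PromQL's `rate()` to turn them into per-second values. Conceptually, `rate()` divides the counter increase by the elapsed time. A simplified sketch (the real function also handles counter resets and extrapolation at window edges):

```python
def per_second_rate(prev_sample, curr_sample, interval_seconds):
    """Per-second increase between two counter scrapes (no reset handling)."""
    return (curr_sample - prev_sample) / interval_seconds

# Two scrapes of kafka_server_brokertopicmetrics_messagesinpersec_total,
# taken 30 seconds apart (the ServiceMonitor's scrape interval):
print(per_second_rate(1_200_000, 1_290_000, 30))  # 3000.0 messages/sec
```

Gauges like `underreplicatedpartitions`, by contrast, are read directly: the current value is the signal.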
Step 4: Provision Grafana Dashboards
Create a ConfigMap with the Grafana dashboard JSON. Grafana's sidecar auto-discovers dashboards labeled with `grafana_dashboard: "1"`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-grafana-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  kafka-overview.json: |
    {
      "dashboard": {
        "title": "Kafka Broker Overview",
        "panels": [
          {
            "title": "Under-Replicated Partitions",
            "type": "stat",
            "targets": [{"expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"}]
          },
          {
            "title": "Messages In/sec",
            "type": "timeseries",
            "targets": [{"expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m]))"}]
          },
          {
            "title": "Active Controller Count",
            "type": "stat",
            "targets": [{"expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)"}]
          }
        ]
      }
    }
```
This is a simplified example. The full openCenter Kafka dashboard includes panels for disk usage, request latency percentiles, consumer lag (where platform-visible), JVM heap pressure, and certificate expiry.
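A dashboard file with broken JSON is simply skipped by the sidecar, which can be confusing to debug. A small pre-commit check catches this before the ConfigMap ships; this is a hedged sketch, and the required-keys list here is an assumption about our conventions, not a Grafana schema:

```python
import json

def check_dashboard(raw: str):
    """Return a list of problems found in a dashboard JSON string."""
    problems = []
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    dashboard = doc.get("dashboard", {})
    if not dashboard.get("title"):
        problems.append("missing dashboard.title")
    if not dashboard.get("panels"):
        problems.append("missing dashboard.panels")
    return problems

print(check_dashboard('{"dashboard": {"title": "Kafka Broker Overview", "panels": [{}]}}'))
# []
print(check_dashboard('{"dashboard": {}}'))
# ['missing dashboard.title', 'missing dashboard.panels']
```

In CI, this would run against the `kafka-overview.json` key extracted from the ConfigMap manifest.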
Step 5: Configure Alerts
Add PrometheusRule resources for critical Kafka conditions:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: data-kafka
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: kafka.rules
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated for more than 5 minutes."
        - alert: KafkaNoActiveController
          expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka active controller count is not 1"
        - alert: KafkaOfflinePartitions
          expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
```
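The `for:` field means the expression must hold on every evaluation for the whole window before the alert fires; a single healthy sample resets the clock. A small Python sketch of that semantics (an approximation for intuition, not the Prometheus implementation):

```python
def fires(samples, threshold, for_seconds, eval_interval_seconds):
    """True if value > threshold holds on enough consecutive evaluations
    to cover the `for:` window. A single sample at/below threshold resets it."""
    needed = for_seconds // eval_interval_seconds
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= needed:
            return True
    return False

# Under-replicated partitions evaluated every 30s; for: 5m = 10 evaluations.
print(fires([2] * 10, 0, 300, 30))       # True: breached for the full window
print(fires([2] * 9 + [0], 0, 300, 30))  # False: recovered before 5 minutes
```

This is why `KafkaUnderReplicatedPartitions` uses `for: 5m` (tolerate brief replication lag during rolling restarts) while the controller and offline-partition alerts use `for: 1m` (no legitimate transient state lasts that long).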
Verification
- Confirm the JMX exporter is running on broker pods:

  ```shell
  kubectl exec -n data-kafka production-kafka-0 -- curl -s localhost:9404/metrics | head -20
  ```

- Check the Prometheus targets page for `data-kafka/kafka-brokers`; all 3 brokers should show `UP`.
- Open Grafana and search for the "Kafka Broker Overview" dashboard.
- Trigger a test alert by scaling down a broker (non-prod only) and confirming the `KafkaUnderReplicatedPartitions` alert fires.
Troubleshooting
**Prometheus shows no Kafka targets:** Verify the ServiceMonitor label matches the Prometheus operator's `serviceMonitorSelector`, and confirm the resource exists with `kubectl get servicemonitor -n data-kafka`.

**Metrics return empty:** Confirm the `kafka-metrics` ConfigMap is mounted correctly, and check the broker pod logs for JMX exporter startup errors.

**Grafana dashboard not appearing:** Verify the ConfigMap has the `grafana_dashboard: "1"` label and is in the namespace Grafana's sidecar watches (typically `monitoring`).