Set Up Service Observability
Purpose: Shows platform engineers how to configure metrics, logs, and traces for a service, covering Prometheus scraping, Loki log shipping, and Tempo distributed tracing.
Prerequisites
- Observability stack deployed (kube-prometheus-stack, Loki, Tempo, OpenTelemetry)
- Service deployed in cluster
- Service exposes metrics endpoint
- kubectl access to cluster
Observability Requirements
All platform services must provide:
- Metrics: RED/USE metrics (Rate, Errors, Duration / Utilization, Saturation, Errors)
- Logs: Structured JSON logs with correlation IDs
- Traces: Distributed tracing via OpenTelemetry OTLP
- Dashboards: Grafana dashboard JSON in repository
- Alerts: PrometheusRule with actionable runbooks
Steps
The examples below assume a common consumer layout where cluster-local service manifests live under applications/overlays/<cluster>/services/. If your cluster repository uses a different root, apply the same resources from the equivalent service overlay path in that repo.
1. Configure Prometheus metrics scraping
Create a ServiceMonitor so that Prometheus discovers and scrapes the service automatically.
In your cluster repo, create the manifest in the service overlay path. In the common layout used in these examples, that file is applications/overlays/<cluster>/services/my-service/servicemonitor.yaml:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
  namespace: my-service
  labels:
    app.kubernetes.io/name: my-service
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-service
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
      # Optional: relabel target labels before scraping
      relabelings:
        - sourceLabels: [__meta_kubernetes_pod_name]
          targetLabel: pod
        - sourceLabels: [__meta_kubernetes_namespace]
          targetLabel: namespace
For a different port name or metrics path:
endpoints:
  - port: http
    path: /actuator/prometheus # Spring Boot example
    interval: 30s
Apply:
kubectl apply -f applications/overlays/<cluster>/services/my-service/servicemonitor.yaml
Verify scraping:
# Check ServiceMonitor
kubectl get servicemonitor my-service -n my-service
# Check Prometheus targets
kubectl port-forward -n observability svc/kube-prometheus-stack-prometheus 9090:9090
# Open browser: http://localhost:9090/targets
# Search for "my-service"
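The ServiceMonitor above only works if the service actually serves metrics in the Prometheus text exposition format. As a rough illustration of what that endpoint returns, here is a standard-library-only Go sketch. The counterVec type and its methods are hypothetical helpers written for this example; a real service would normally use a Prometheus client library rather than rendering the format by hand.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
	"sync"
)

// counterVec is a hypothetical, minimal labelled counter used only to show
// the exposition format. It is not part of any Prometheus library.
type counterVec struct {
	mu     sync.Mutex
	counts map[string]float64 // key: rendered label pairs
}

func newCounterVec() *counterVec {
	return &counterVec{counts: map[string]float64{}}
}

// Inc adds one to the counter for a method/status pair.
// %q quoting is adequate for simple label values such as "GET" or "200".
func (c *counterVec) Inc(method, status string) {
	key := fmt.Sprintf("method=%q,status=%q", method, status)
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[key]++
}

// Render returns the counter as HELP/TYPE lines followed by one sample line
// per label set, sorted for stable output.
func (c *counterVec) Render(name, help string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.counts))
	for k := range c.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n# TYPE %s counter\n", name, help, name)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s} %g\n", name, k, c.counts[k])
	}
	return b.String()
}

var requests = newCounterVec()

// metricsHandler serves the counter at the path the ServiceMonitor scrapes.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, requests.Render("http_requests_total", "Total HTTP requests."))
}

func main() {
	requests.Inc("GET", "200")
	requests.Inc("GET", "200")
	requests.Inc("POST", "500")

	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", metricsHandler)

	// Exercise the handler in-process. A real service would instead call
	// http.ListenAndServe on the port named "metrics" in its Service.
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

The method and status labels match the legendFormat and status filters used in the dashboard and alert examples in this guide.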
2. Configure structured logging
Update the service to write JSON logs to stdout.
Example application configuration:
# For Go applications
logging:
  format: json
  level: info
  output: stdout
# For Node.js (Winston)
logging:
  format: json
  transports:
    - type: console
      level: info
# For Python (structlog)
logging:
  format: json
  level: INFO
  handlers:
    - stream: ext://sys.stdout
Required log fields:
{
  "timestamp": "2024-02-14T10:30:00Z",
  "level": "info",
  "message": "Request processed",
  "service": "my-service",
  "trace_id": "abc123def456",
  "span_id": "ghi789jkl012",
  "user_id": "user-123",
  "request_id": "req-456",
  "duration_ms": 45,
  "status_code": 200
}
3. Configure Loki log collection
The OpenTelemetry Collector collects pod logs automatically. Verify that it is running and inspect its configuration:
# Check OpenTelemetry collector
kubectl get pods -n observability -l app.kubernetes.io/name=opentelemetry-collector
# Check collector configuration
kubectl get configmap opentelemetry-collector -n observability -o yaml
Add log parsing annotations to the pod:
apiVersion: v1
kind: Pod
metadata:
  name: my-service
  annotations:
    # Parse JSON logs
    loki.grafana.com/scrape: "true"
    loki.grafana.com/format: "json"
spec:
  containers:
    - name: my-service
      image: my-service:1.0.0
Query logs in Grafana:
# Port-forward to Grafana
kubectl port-forward -n observability svc/kube-prometheus-stack-grafana 3000:80
# Open browser: http://localhost:3000
# Navigate to Explore > Loki
# Query: {namespace="my-service"}
4. Configure distributed tracing
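Add the OpenTelemetry SDK to the application.
Example for Go:
import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0" // use the version that matches your SDK
)

// initTracer registers a global tracer provider that exports spans to the
// in-cluster OpenTelemetry Collector over OTLP/gRPC.
func initTracer() *trace.TracerProvider {
	exporter, err := otlptracegrpc.New(
		context.Background(),
		otlptracegrpc.WithEndpoint("opentelemetry-collector.observability.svc.cluster.local:4317"),
		otlptracegrpc.WithInsecure(), // plaintext gRPC inside the cluster
	)
	if err != nil {
		log.Fatalf("create OTLP trace exporter: %v", err)
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp // call tp.Shutdown(ctx) on exit to flush pending spans
}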
Configure the service to send traces:
# Environment variables for OpenTelemetry
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://opentelemetry-collector.observability.svc.cluster.local:4317"
  - name: OTEL_SERVICE_NAME
    value: "my-service"
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1" # Sample 10% of traces
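To make the parentbased_traceidratio setting less opaque, the following Go sketch illustrates the general idea: a span with a parent inherits the parent's sampling decision, and a root span is kept when a number derived from its trace ID falls below the configured ratio. shouldSample is a hypothetical function written for this explanation; it is a simplified model and not the SDK's actual implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample models parent-based trace-ID ratio sampling.
// Assumption: the decision for root spans is derived from the low 8 bytes of
// the trace ID, so every service that sees the same trace ID with the same
// ratio reaches the same decision.
func shouldSample(traceID [16]byte, hasParent, parentSampled bool, ratio float64) bool {
	if hasParent {
		return parentSampled // follow the caller's decision
	}
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	// Map the ratio onto the 63-bit range and compare the trace ID against it.
	bound := uint64(ratio * float64(1<<63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

func main() {
	var low, high [16]byte // low stays all zero bytes
	for i := range high {
		high[i] = 0xff
	}
	fmt.Println("root span, low trace ID: ", shouldSample(low, false, false, 0.1))
	fmt.Println("root span, high trace ID:", shouldSample(high, false, false, 0.1))
	fmt.Println("child of sampled parent: ", shouldSample(high, true, true, 0.1))
}
```

Because the decision depends only on the trace ID and the parent flag, a trace is either recorded end to end or dropped end to end, which is why the ratio can be lowered for high-traffic services without producing partial traces.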
Verify traces in Grafana:
# Navigate to Explore > Tempo
# Query by trace ID or service name
5. Create Grafana dashboard
Export the dashboard JSON from Grafana and commit it to the repository.
Store the dashboard JSON in the service overlay path in your cluster repo. In the common layout used in these examples, that file is applications/overlays/<cluster>/services/my-service/dashboard.json:
{
  "dashboard": {
    "title": "My Service",
    "tags": ["my-service", "platform"],
    "timezone": "browser",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{service=\"my-service\"}[5m])",
            "legendFormat": "{{method}} {{status}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{service=\"my-service\",status=~\"5..\"}[5m])",
            "legendFormat": "5xx errors"
          }
        ]
      },
      {
        "title": "Request Duration (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service=\"my-service\"}[5m]))",
            "legendFormat": "p95"
          }
        ]
      }
    ]
  }
}
Import the dashboard via a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-service-dashboard
  namespace: observability
  labels:
    grafana_dashboard: "1"
data:
  my-service.json: |
    # Paste dashboard JSON here
6. Create Prometheus alert rules
In your cluster repo, create the alert rule manifest in the service overlay path. In the common layout used in these examples, that file is applications/overlays/<cluster>/services/my-service/prometheusrule.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts
  namespace: my-service
  labels:
    prometheus: kube-prometheus-stack
spec:
  groups:
    - name: my-service
      interval: 30s
      rules:
        # High error rate: share of 5xx responses among all requests
        - alert: MyServiceHighErrorRate
          expr: |
            sum(rate(http_requests_total{service="my-service",status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total{service="my-service"}[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
            service: my-service
          annotations:
            summary: "High error rate for my-service"
            description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook_url: "https://runbooks.example.com/my-service/high-error-rate"
        # High latency
        - alert: MyServiceHighLatency
          expr: |
            histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="my-service"}[5m])) > 1
          for: 5m
          labels:
            severity: warning
            service: my-service
          annotations:
            summary: "High latency for my-service"
            description: "P95 latency is {{ $value }}s (threshold: 1s)"
            runbook_url: "https://runbooks.example.com/my-service/high-latency"
        # Service down
        - alert: MyServiceDown
          expr: |
            up{job="my-service"} == 0
          for: 2m
          labels:
            severity: critical
            service: my-service
          annotations:
            summary: "My service is down"
            description: "Service has been down for more than 2 minutes"
            runbook_url: "https://runbooks.example.com/my-service/service-down"
Apply:
kubectl apply -f applications/overlays/<cluster>/services/my-service/prometheusrule.yaml
Verify:
# Check PrometheusRule
kubectl get prometheusrule my-service-alerts -n my-service
# Check in Prometheus UI
kubectl port-forward -n observability svc/kube-prometheus-stack-prometheus 9090:9090
# Open: http://localhost:9090/alerts
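The 5% threshold in the high error rate alert is meant as a share of failed requests, which means dividing the growth of the 5xx counter by the growth of the total counter over the same window. The Go sketch below works through that arithmetic on two scrapes. The sample type and errorRatio function are hypothetical, and the sketch ignores counter resets and the extrapolation that Prometheus applies in rate().

```go
package main

import "fmt"

// sample is one scrape of the request counters for a service.
type sample struct {
	seconds float64 // scrape time in seconds
	total   float64 // http_requests_total summed over all statuses
	errors  float64 // http_requests_total summed over 5xx statuses
}

// errorRatio returns the fraction of requests that failed between two
// scrapes. Both per-second rates share the same time window, so the window
// length cancels and the ratio reduces to a ratio of counter increases.
// ok is false when the window is empty or saw no requests.
func errorRatio(first, last sample) (ratio float64, ok bool) {
	dt := last.seconds - first.seconds
	dTotal := last.total - first.total
	if dt <= 0 || dTotal <= 0 {
		return 0, false
	}
	return (last.errors - first.errors) / dTotal, true
}

func main() {
	first := sample{seconds: 0, total: 1000, errors: 10}
	last := sample{seconds: 300, total: 2000, errors: 70}

	ratio, ok := errorRatio(first, last)
	if !ok {
		fmt.Println("no traffic in window")
		return
	}
	const threshold = 0.05 // the 5% threshold from the alert description
	fmt.Printf("error ratio %.1f%%, alert condition met: %v\n", ratio*100, ratio > threshold)
}
```

A raw rate of 5xx responses, by contrast, is measured in requests per second and grows with traffic, so comparing it with 0.05 would not correspond to a percentage.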
7. Configure Alertmanager routing
Update Alertmanager configuration:
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-kube-prometheus-stack-alertmanager
  namespace: observability
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'default'
      routes:
        # Route my-service alerts to specific receiver
        - match:
            service: my-service
          receiver: 'my-service-team'
          continue: false
    receivers:
      - name: 'default'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
            channel: '#alerts'
      - name: 'my-service-team'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
            channel: '#my-service-alerts'
            title: '{{ .GroupLabels.alertname }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
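The routing tree above sends an alert to the first child route whose labels match and falls back to the default receiver otherwise. The Go sketch below models that behaviour for a single level of routes with continue: false, which is the case shown here. The route type and pickReceiver function are hypothetical; nested routes, regex matchers, and continue: true are deliberately left out.

```go
package main

import "fmt"

// route is a simplified child route: exact label matches and a receiver.
type route struct {
	match    map[string]string
	receiver string
}

// pickReceiver returns the receiver of the first route whose match labels
// are all present on the alert with equal values, or def if none match.
func pickReceiver(alertLabels map[string]string, routes []route, def string) string {
	for _, r := range routes {
		matched := true
		for k, v := range r.match {
			if alertLabels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return r.receiver // continue: false stops at the first match
		}
	}
	return def
}

func main() {
	routes := []route{
		{match: map[string]string{"service": "my-service"}, receiver: "my-service-team"},
	}
	alert := map[string]string{
		"alertname": "MyServiceHighLatency",
		"service":   "my-service",
		"severity":  "warning",
	}
	fmt.Println(pickReceiver(alert, routes, "default"))
	fmt.Println(pickReceiver(map[string]string{"service": "other"}, routes, "default"))
}
```

This is why each alert rule in this guide sets a service label: without it, the alert would not match the child route and would land in the default channel.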
Verification
Complete observability checklist:
# 1. Metrics are being scraped
kubectl get servicemonitor my-service -n my-service
# Check Prometheus targets UI
# 2. Logs are being collected
# Query in Grafana Loki: {namespace="my-service"}
# 3. Traces are being collected
# Query in Grafana Tempo by service name
# 4. Dashboard is available
# Check Grafana dashboards list
# 5. Alerts are configured
kubectl get prometheusrule my-service-alerts -n my-service
# Check Prometheus alerts UI
Required Metrics
Troubleshooting
Metrics not appearing in Prometheus
Check ServiceMonitor selector:
kubectl get servicemonitor my-service -n my-service -o yaml
kubectl get service my-service -n my-service -o yaml
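The ServiceMonitor's spec.selector.matchLabels must match the Service's metadata.labels, and the endpoint port must be the name of a port defined on the Service.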
Check Prometheus logs:
kubectl logs -n observability -l app.kubernetes.io/name=prometheus
Logs not in Loki
Check OpenTelemetry collector:
kubectl logs -n observability -l app.kubernetes.io/name=opentelemetry-collector
Verify log format is JSON:
kubectl logs -n my-service -l app.kubernetes.io/name=my-service
Traces not in Tempo
Check OTLP endpoint is reachable:
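# Port 4317 serves gRPC, so this only confirms that the TCP connection opens;
# curl may report a protocol error after it has connected.
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://opentelemetry-collector.observability.svc.cluster.local:4317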
Check application trace configuration:
kubectl get pod -n my-service -l app.kubernetes.io/name=my-service -o yaml | grep OTEL
Alerts not firing
Check PrometheusRule is loaded:
kubectl get prometheusrule -n my-service
Check alert expression in Prometheus UI:
# Navigate to Prometheus > Alerts
# Click on alert to see evaluation
Check Alertmanager configuration:
kubectl get secret alertmanager-kube-prometheus-stack-alertmanager -n observability -o yaml
Best Practices
- Use consistent metric names: follow Prometheus naming conventions
- Add correlation IDs: link logs, metrics, and traces
- Sample traces appropriately: 1-10% for high-traffic services
- Create actionable alerts: include runbook links
- Test alerts: trigger alerts in non-production
- Monitor the monitors: alert on observability stack health
- Set SLOs: define service-level objectives