Troubleshoot FluxCD Reconciliation

Purpose: Shows platform engineers how to debug FluxCD reconciliation issues: status checks, log analysis, common errors, and remediation steps.

Prerequisites

  • FluxCD installed in cluster

  • flux CLI installed (verify with flux version --client)

  • kubectl access to cluster

  • Basic understanding of FluxCD resources

Quick Diagnostics

Check overall Flux health

# Check all Flux components
flux check

# Check Flux controllers
kubectl get pods -n flux-system

# Check Flux version
flux version

Expected final line from flux check:

✔ all checks passed

Check resource status

# Check all Flux resources
flux get all

# Check specific resource types
flux get sources git
flux get sources helm
flux get helmreleases
flux get kustomizations
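On a busy cluster these tables get long. The following is a minimal sketch of a filter that keeps only rows whose READY column is not True. The helper name not_ready is hypothetical, and the sketch assumes the tabular layout that flux get prints (a header line starting with NAME or NAMESPACE, with aligned columns).

```shell
# not_ready: read `flux get ...` table output on stdin and print only the
# rows whose READY column is not "True". The column offset is taken from
# each header line, so an empty REVISION cell does not shift the result.
# Usage (assumed): flux get all --all-namespaces | not_ready
not_ready() {
  awk '
    /^NAME/ { pos = index($0, "READY"); next }   # header: remember READY offset
    pos > 0 && NF > 0 {
      split(substr($0, pos), cell, /[ \t]+/)     # first word at that offset
      if (cell[1] != "True") print
    }
  '
}
```

Because the offset is recomputed at every header, the same filter works on the multi-table output of flux get all.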

Common Issues and Solutions

Issue 1: GitRepository Authentication Failure

Symptom:

flux get sources git
NAME                    READY   MESSAGE
opencenter-base         False   fetch failed

Diagnosis:

kubectl describe gitrepository opencenter-base -n flux-system

Look for:

Message: failed to checkout and determine revision
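To read that message without scanning the full describe output, the Ready condition can be queried directly. This is a small sketch; ready_message is a hypothetical helper name, and the namespace defaults to flux-system as an assumption.

```shell
# ready_message KIND NAME [NAMESPACE]
# Print the message of the resource's Ready condition.
# NAMESPACE defaults to flux-system (assumption for this sketch).
ready_message() {
  kubectl get "$1" "$2" -n "${3:-flux-system}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].message}'
}
```

For example, ready_message gitrepository opencenter-base prints only the failure text for the source above.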

Solution:

Use an HTTPS URL for the opencenter-base GitRepository and do not attach a secretRef:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: opencenter-base
  namespace: flux-system
spec:
  interval: 15m
  url: https://github.com/opencenter-cloud/openCenter-gitops-base
  ref:
    branch: main

Force reconciliation:

flux reconcile source git opencenter-base

Issue 2: HelmRelease Stuck in "Installing"

Symptom:

flux get helmreleases -n cert-manager
NAME            READY   MESSAGE
cert-manager    False   install retries exhausted

Diagnosis:

kubectl describe helmrelease cert-manager -n cert-manager

Check events:

kubectl get events -n cert-manager --sort-by='.lastTimestamp'

View Helm controller logs:

flux logs --kind=HelmRelease --name=cert-manager --namespace=cert-manager

Common Causes:

  1. Helm repository not accessible

flux get sources helm
kubectl describe helmrepository cert-manager -n flux-system

  2. Chart version not found

Check HelmRelease chart version:

kubectl get helmrelease cert-manager -n cert-manager -o jsonpath='{.spec.chart.spec.version}'

Check available versions (the repository must already be added to the local Helm client with helm repo add):

helm repo update
helm search repo cert-manager --versions

  3. Values validation failed

Check values secrets:

kubectl get secret cert-manager-values-base -n cert-manager

Decode and validate:

kubectl get secret cert-manager-values-base -n cert-manager -o jsonpath='{.data.values\.yaml}' | base64 -d | yq eval '.' -
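For the chart-version cause above, the two checks can be combined. This is a sketch that tests whether an exact version appears in helm search repo output. has_chart_version is a hypothetical helper, jetstack/cert-manager is an assumed local repository alias, and the check only covers exact versions, not semver ranges.

```shell
# has_chart_version CHART VERSION
# Reads `helm search repo <name> --versions` output on stdin and succeeds
# only if VERSION is listed for CHART (exact string match, no ranges).
# Usage (assumed alias "jetstack"):
#   helm search repo cert-manager --versions | has_chart_version jetstack/cert-manager v1.18.2
has_chart_version() {
  awk -v chart="$1" -v want="$2" '
    # NR > 1 skips the header; appending "" forces string comparison
    NR > 1 && ($1 "") == (chart "") && ($2 "") == (want "") { found = 1 }
    END { exit !found }
  '
}
```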

Solution:

Suspend and resume HelmRelease:

flux suspend helmrelease cert-manager -n cert-manager
flux resume helmrelease cert-manager -n cert-manager

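Or, as a last resort, delete the HelmRelease and let Flux recreate it. Deleting a HelmRelease makes helm-controller uninstall the Helm release, so the workloads it deployed are removed until Flux reinstalls them:

kubectl delete helmrelease cert-manager -n cert-manager
flux reconcile kustomization cert-manager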
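The suspend/resume cycle can be followed by an explicit wait, so the shell reports whether the release actually became Ready. This is a sketch; retry_helmrelease is a hypothetical helper and the 5m default timeout is an arbitrary assumption.

```shell
# retry_helmrelease NAME NAMESPACE [TIMEOUT]
# Suspend and resume a HelmRelease, then block until its Ready condition
# is true or TIMEOUT (default 5m, chosen arbitrarily) expires.
retry_helmrelease() {
  flux suspend helmrelease "$1" -n "$2" &&
    flux resume helmrelease "$1" -n "$2" &&
    kubectl wait "helmrelease/$1" -n "$2" \
      --for=condition=Ready --timeout="${3:-5m}"
}
```

A non-zero exit status means one of the three steps failed, which makes the helper usable in scripts.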

Issue 3: Kustomization Drift Detected

Symptom:

flux get kustomizations
NAME            READY   MESSAGE
cert-manager    True    Applied revision: main@sha1:abc123, drift detected

Diagnosis:

kubectl describe kustomization cert-manager -n flux-system

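Check the drift detection mode. The driftDetection field belongs to the HelmRelease API; a Kustomization has no such field because kustomize-controller re-applies the Git state on every reconciliation:

kubectl get helmrelease cert-manager -n cert-manager -o jsonpath='{.spec.driftDetection.mode}'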

Cause:

Resources were modified outside of Git (manual kubectl apply or Helm upgrade).

Solution:

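View drifted resources by diffing a local checkout of the manifests against the cluster:

flux diff kustomization cert-manager --path <cluster-service-overlay-path>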
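In scripts, the diff result can be read from the exit status. The sketch below assumes the convention described in the flux CLI help (0 for no differences, 1 for differences, higher for errors) and that the command takes a --path to local manifests. Verify both against flux diff kustomization --help for your version. drift_status is a hypothetical helper name.

```shell
# drift_status NAME LOCAL_PATH
# Run `flux diff kustomization` and translate its exit status.
# Assumed convention: 0 = no differences, 1 = differences, >1 = error.
drift_status() {
  _rc=0
  flux diff kustomization "$1" --path "$2" || _rc=$?
  case "$_rc" in
    0) echo "in sync" ;;
    1) echo "differences found" ;;
    *) echo "diff failed (exit $_rc)" ;;
  esac
  return "$_rc"
}
```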

Force reconciliation to restore Git state:

flux reconcile kustomization cert-manager --with-source

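Prevent drift from persisting. For a Kustomization, keep pruning enabled; force recreates resources when an immutable field change blocks the apply:

spec:
  prune: true
  force: true  # Recreate resources whose immutable fields changed

For a HelmRelease, enable drift detection so helm-controller corrects manual changes:

spec:
  driftDetection:
    mode: enabled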

Issue 4: SOPS Decryption Failed

Symptom:

flux get kustomizations
NAME            READY   MESSAGE
my-service      False   decryption failed

Diagnosis:

kubectl describe kustomization my-service -n flux-system

Look for:

Message: failed to decrypt secret: no age key found

Solution:

Check age key secret exists:

kubectl get secret sops-age -n flux-system

If missing, create:

kubectl create secret generic sops-age \
  --from-file=age.agekey=${HOME}/.config/sops/age/<cluster>_keys.txt \
  -n flux-system

Verify Kustomization references secret:

kubectl get kustomization my-service -n flux-system -o jsonpath='{.spec.decryption}'

Should show:

{"provider":"sops","secretRef":{"name":"sops-age"}}

Force reconciliation:

flux reconcile kustomization my-service
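The two checks above (the Kustomization references a secret, and that secret exists) can be chained. This is a sketch under the assumption that the Kustomization and its decryption secret live in the same namespace; check_sops_secret is a hypothetical helper name.

```shell
# check_sops_secret KUSTOMIZATION [NAMESPACE]
# Look up spec.decryption.secretRef.name and confirm the secret exists
# in the same namespace (default flux-system).
check_sops_secret() {
  _ns=${2:-flux-system}
  _secret=$(kubectl get kustomization "$1" -n "$_ns" \
    -o jsonpath='{.spec.decryption.secretRef.name}') || return 2
  if [ -z "$_secret" ]; then
    echo "no decryption secretRef on $1"
    return 1
  fi
  if kubectl get secret "$_secret" -n "$_ns" >/dev/null 2>&1; then
    echo "ok: $_secret exists"
  else
    echo "missing secret: $_secret"
    return 1
  fi
}
```

The helper does not inspect the secret's contents. With age keys, also confirm that the data key name ends in .agekey, as in the create command above.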

Issue 5: Dependency Wait Timeout

Symptom:

flux get kustomizations
NAME                READY   MESSAGE
cert-manager-certs  False   dependency 'cert-manager' is not ready

Diagnosis:

kubectl describe kustomization cert-manager-certs -n flux-system

Check dependency status:

flux get kustomizations | grep cert-manager

Solution:

Check dependency is healthy:

kubectl get kustomization cert-manager -n flux-system

If dependency is stuck, troubleshoot it first.

If dependency is ready but not detected, force reconciliation:

flux reconcile kustomization cert-manager
flux reconcile kustomization cert-manager-certs

Increase timeout if needed:

spec:
  dependsOn:
    - name: cert-manager
  timeout: 10m  # Defaults to the value of spec.interval when unset
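When a Kustomization has several dependencies, grepping by name stops working. The sketch below lists each dependsOn entry with its Ready status. It assumes every dependency lives in the same namespace as the dependent (a dependsOn entry may name another namespace, which this sketch ignores); dependency_status is a hypothetical helper name.

```shell
# dependency_status KUSTOMIZATION [NAMESPACE]
# Print "<dependency><TAB><Ready status>" for every dependsOn entry.
# Assumes all dependencies are in NAMESPACE (default flux-system).
dependency_status() {
  _ns=${2:-flux-system}
  for _dep in $(kubectl get kustomization "$1" -n "$_ns" \
      -o jsonpath='{.spec.dependsOn[*].name}'); do
    _ready=$(kubectl get kustomization "$_dep" -n "$_ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    printf '%s\t%s\n' "$_dep" "${_ready:-Unknown}"
  done
}
```

Any line that does not end in True names a dependency to troubleshoot first.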

Issue 6: Image Pull Errors

Symptom:

HelmRelease shows ready, but pods fail to start:

kubectl get pods -n cert-manager
NAME                           READY   STATUS             RESTARTS   AGE
cert-manager-5d7f9c8b6-abc12   0/1     ImagePullBackOff   0          2m

Diagnosis:

kubectl describe pod cert-manager-5d7f9c8b6-abc12 -n cert-manager

Look for:

Failed to pull image "registry.example.com/cert-manager:v1.18.2": rpc error: code = Unknown desc = failed to pull and unpack image

Solution:

Check image exists:

# For public images
docker pull registry.example.com/cert-manager:v1.18.2

# For private registries
kubectl get secret -n cert-manager | grep regcred

Create image pull secret if needed:

kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass \
  -n cert-manager

Update HelmRelease values:

imagePullSecrets:
  - name: regcred
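To find every affected pod in a namespace rather than one at a time, the pod listing can be filtered on its STATUS column. This is a minimal sketch; image_pull_failures is a hypothetical helper name, and it expects the default kubectl get pods table including its header line.

```shell
# image_pull_failures: read `kubectl get pods` output (with header) on
# stdin and print the names of pods whose STATUS ends in ImagePullBackOff
# or ErrImagePull (init containers appear as "Init:ImagePullBackOff").
# Usage: kubectl get pods -n cert-manager | image_pull_failures
image_pull_failures() {
  awk 'NR > 1 && $3 ~ /(ImagePullBackOff|ErrImagePull)$/ { print $1 }'
}
```

Each name it prints can then be passed to kubectl describe pod as shown in the diagnosis step.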

Issue 7: Resource Quota Exceeded

Symptom:

flux logs --kind=HelmRelease --name=my-service --namespace=my-service
Error: admission webhook denied the request: exceeded quota

Diagnosis:

kubectl describe resourcequota -n my-service

Solution:

Increase quota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: my-service-quota
  namespace: my-service
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi

Or reduce resource requests in Helm values.

Issue 8: Webhook Timeout

Symptom:

flux logs --kind=Kustomization --name=my-service
Error: context deadline exceeded

Diagnosis:

Check admission webhooks:

kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

Solution:

Check webhook service is running:

kubectl get pods -n kyverno
kubectl get pods -n cert-manager

If webhook is down, suspend Kustomization temporarily:

flux suspend kustomization my-service
# Fix webhook
flux resume kustomization my-service

Increase timeout:

spec:
  timeout: 10m
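Rather than guessing which namespaces host webhooks, the backing services can be listed from the webhook configurations themselves. This is a sketch; webhook_services is a hypothetical helper name, and webhooks configured with a URL instead of a service are skipped.

```shell
# webhook_services: print the unique "<namespace>/<service>" pairs that
# back the cluster's validating and mutating admission webhooks.
webhook_services() {
  for _kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    kubectl get "$_kind" -o jsonpath='{range .items[*].webhooks[*]}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\n"}{end}'
  done | sort -u | awk '$0 != "/"'   # "/" alone means a URL-based webhook
}
```

For each pair, kubectl get endpoints <service> -n <namespace> shows whether any pod is currently serving the webhook.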

Debugging Commands

View Flux controller logs

# All controllers
flux logs

# Logs for a specific resource
flux logs --kind=Kustomization --name=my-service

# Follow logs
flux logs --follow

# Last 100 lines
flux logs --tail=100
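The filters above can be combined to show only recent errors across all namespaces. This is a small sketch using the --level and --since flags of flux logs; recent_flux_errors is a hypothetical helper name and the 10m default window is an arbitrary assumption.

```shell
# recent_flux_errors [WINDOW]
# Show error-level controller log lines from the last WINDOW
# (default 10m, chosen arbitrarily) for resources in every namespace.
recent_flux_errors() {
  flux logs --all-namespaces --level=error --since="${1:-10m}"
}
```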

Force reconciliation

# Reconcile source
flux reconcile source git opencenter-base

# Reconcile Kustomization
flux reconcile kustomization my-service

# Reconcile with source update
flux reconcile kustomization my-service --with-source

# Reconcile HelmRelease
flux reconcile helmrelease my-service -n my-service

Suspend and resume

# Suspend (stop reconciliation)
flux suspend kustomization my-service

# Resume
flux resume kustomization my-service

Export and inspect resources

# Export GitRepository
flux export source git opencenter-base

# Export HelmRelease
flux export helmrelease cert-manager -n cert-manager

# Export Kustomization
flux export kustomization my-service

Trace reconciliation

# Trace Kustomization
flux trace kustomization my-service

# Shows:
# - Source
# - Dependencies
# - Applied resources
# - Health checks

Verification Checklist

After resolving issues:

# 1. All sources are ready
flux get sources git
flux get sources helm

# 2. All Kustomizations are ready
flux get kustomizations

# 3. All HelmReleases are ready
flux get helmreleases --all-namespaces

# 4. No suspended resources (every SUSPENDED column should read False)
flux get all --all-namespaces

# 5. Check recent events
kubectl get events -n flux-system --sort-by='.lastTimestamp' | tail -20
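For use in a script or CI job, steps 1 to 4 can be reduced to a single pass/fail result. The sketch below counts not-ready and suspended rows in flux get output. flux_health_summary is a hypothetical helper name, and it assumes the tabular layout with READY and SUSPENDED header columns.

```shell
# flux_health_summary: read `flux get ...` table output on stdin, print
# "not_ready=<n> suspended=<m>", and exit non-zero if either count is > 0.
# Usage (assumed): flux get all --all-namespaces | flux_health_summary
flux_health_summary() {
  awk '
    /^NAME/ { r = index($0, "READY"); s = index($0, "SUSPENDED"); next }
    NF == 0 { next }                       # blank line between tables
    {
      if (r > 0) { split(substr($0, r), c, /[ \t]+/); if (c[1] != "True") bad++ }
      if (s > 0) { split(substr($0, s), c, /[ \t]+/); if (c[1] == "True") susp++ }
    }
    END {
      printf "not_ready=%d suspended=%d\n", bad, susp
      exit ((bad + susp) > 0 ? 1 : 0)
    }
  '
}
```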

Prevention Best Practices

  1. Pin versions - Use specific tags/versions, not latest

  2. Test in non-production - Validate changes before production

  3. Use health checks - Configure readiness/liveness probes

  4. Set resource limits - Prevent resource exhaustion

  5. Monitor Flux - Set up alerts for reconciliation failures

  6. Backup age keys - Store SOPS keys securely

  7. Document dependencies - Clear dependency chains

  8. Use drift detection - Catch manual changes

  9. Implement retries - Configure remediation policies

  10. Regular upgrades - Keep Flux up to date

Emergency Procedures

Complete Flux failure

If all Flux controllers are down:

# Check controller pods
kubectl get pods -n flux-system

# Restart controllers
kubectl rollout restart deployment -n flux-system

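# If that fails, reinstall Flux (uninstall removes the Flux controllers and
# custom resources; workloads already deployed by Flux stay in place)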
flux uninstall --silent
flux bootstrap git \
  --url=ssh://git@github.com/${GIT_REPO}.git \
  --branch=main \
  --path=<cluster-repo-bootstrap-path>

Rollback to previous version

# Find previous commit
git log --oneline

# Revert the most recent commit (adds a new commit that undoes it)
git revert HEAD
git push origin main

# Force reconciliation
flux reconcile source git opencenter-base
flux reconcile kustomization my-service

Manual intervention required

If Flux cannot recover:

# Suspend the Kustomization so Flux stops reconciling it
flux suspend kustomization my-service

# Apply manually
kubectl apply -f <cluster-service-overlay-path>/

# Resume the Kustomization
flux resume kustomization my-service
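A risk with this procedure is leaving the Kustomization suspended after a failed apply. The sketch below always resumes, whatever the apply result; manual_apply is a hypothetical helper name. Once resumed, Flux re-applies the Git state, so any manual change must also be committed to Git or it will be reverted.

```shell
# manual_apply KUSTOMIZATION PATH
# Suspend the Kustomization, apply manifests from PATH by hand, and
# resume even if the apply fails. Returns the first failing exit status.
manual_apply() {
  flux suspend kustomization "$1" || return 1
  _rc=0
  kubectl apply -f "$2" || _rc=$?
  flux resume kustomization "$1" || { [ "$_rc" -ne 0 ] || _rc=1; }
  return "$_rc"
}
```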

Next Steps

  • Set up Flux monitoring (see setup-observability.md)

  • Configure Flux notifications (Slack, PagerDuty)

  • Implement automated testing for Flux resources

  • Create runbooks for common Flux issues