Edit

Purpose: For platform teams, explains configuration management from initialization through updates, covering configuration flow through drift management.

Understanding the configuration lifecycle helps you manage clusters effectively and maintain consistency. This explanation covers how configuration evolves from creation to production.

Configuration as Code

openCenter treats configuration as code with these principles:

  1. Single Source of Truth: One YAML file defines entire cluster

  2. Version Controlled: Configuration stored in Git

  3. Declarative: Describe desired state, not steps

  4. Validated: Multi-layered validation before deployment

  5. Auditable: All changes tracked in Git history

Why configuration as code: Reproducible deployments, audit trail, rollback capability, collaboration via pull requests.

Evidence: .kiro/steering/product.md:30-35, Session 2 B0 section 2

Configuration Stages

Stage 1: Initialization

Purpose: Create initial configuration with sensible defaults.

Command:

opencenter cluster init my-cluster --org my-org --type openstack

Process:

1. CLI: Load built-in defaults (internal/config/defaults.go)
2. CLI: Load CLI defaults (~/.config/opencenter/config.yaml)
3. CLI: Apply command-line flags (--org, --type)
4. CLI: Generate configuration file
5. CLI: Write to ~/.config/opencenter/clusters/my-org/.my-cluster-config.yaml

Generated Configuration:

schema_version: "2.0"

opencenter:
  meta:
    name: my-cluster
    environment: production
    region: sjc3
    organization: my-org

  infrastructure:
    provider: openstack
    openstack:
      region: sjc3
      availability_zone: az1
      # ... (100+ default fields)

  cluster:
    kubernetes:
      version: "1.33.5"
    master_count: 3
    worker_count: 2
    # ... (50+ default fields)

  services:
    cert-manager:
      enabled: true
    keycloak:
      enabled: true
    # ... (20+ services)

Why this design: Defaults provide production-ready configuration. Users only customize what’s needed. Reduces configuration complexity.

Evidence: cmd/cluster_init.go, internal/config/defaults.go:48-451, Session 2 B0 section 2

Stage 2: Customization

Purpose: Customize configuration for specific requirements.

Methods:

  1. Direct Editing: Edit configuration file with text editor

  2. Interactive Mode: opencenter cluster edit my-cluster

  3. CLI Flags: opencenter cluster set my-cluster cluster.worker_count=5

Common Customizations:

# Node counts
opencenter:
  cluster:
    master_count: 3  # High availability
    worker_count: 5  # Increased capacity

# Networking
opencenter:
  cluster:
    networking:
      pod_subnet: "10.42.0.0/16"
      service_subnet: "10.43.0.0/16"
      cni_plugin: cilium  # Changed from Calico

# Services
opencenter:
  services:
    loki:
      enabled: false  # Disabled for cost optimization
    harbor:
      enabled: true  # Enabled for container registry
      hostname: harbor.example.com

Why this design: Flexibility for different use cases. Interactive mode for guided editing. CLI flags for automation.

Evidence: cmd/cluster_edit.go, cmd/cluster_set.go, Session 2 B0 section 3

Stage 3: Validation

Purpose: Verify configuration correctness before deployment.

Command:

opencenter cluster validate my-cluster

Validation Layers:

1. Schema Validation (JSON schema compliance)
    ↓
2. Business Rules (cross-field dependencies)
    ↓
3. Provider Validation (provider-specific constraints)
    ↓
4. Connectivity Validation (API reachability, optional)

Example Validation:

# Configuration
opencenter:
  cluster:
    networking:
      use_octavia: false
      vrrp_enabled: true
      # vrrp_ip: missing

# Validation Error
Error: When use_octavia=false and vrrp_enabled=true, vrrp_ip must be set
Location: opencenter.cluster.networking.vrrp_ip
Severity: error

Why this design: Fail fast (catch errors early). Progressive validation (fast checks first). Specific error messages (easy to fix).

Evidence: internal/config/validator.go, tests/features/workflow.feature:38-43, Session 1 A3

Stage 4: Setup (GitOps Repository Generation)

Purpose: Generate complete GitOps repository structure.

Command:

opencenter cluster generate my-cluster

Process:

1. Template Engine: Load embedded templates
2. Template Engine: Inject configuration values
3. Template Engine: Render to GitOps repository
4. SOPS Manager: Encrypt secrets
5. Git: Initialize repository (optional)

Generated Structure:

<git_dir>/
├── .gitignore
├── .sops.yaml
├── README.md
│
├── applications/
│   └── overlays/my-cluster/
│       ├── flux-system/          # FluxCD bootstrap
│       ├── services/              # Platform services
│       └── managed-services/      # Customer applications
│
└── infrastructure/
    └── clusters/my-cluster/
        ├── main.tf                # Terraform/OpenTofu
        ├── inventory/             # Kubespray Ansible
        └── kubeconfig.yaml        # Generated after deployment

Why this design: Standardized structure (consistency). Templates ensure correctness. Secrets encrypted before commit.

Evidence: internal/gitops/, tests/features/workflow.feature:58-65, Session 2 B0 section 15

Stage 5: Deployment

Purpose: Provision infrastructure and deploy Kubernetes.

Command:

opencenter cluster deploy my-cluster

Process:

1. Terraform: Provision infrastructure (VMs, networks, storage)
    ↓
2. Kubespray: Deploy Kubernetes (control plane, workers, CNI)
    ↓
3. FluxCD: Bootstrap GitOps (install controllers, create sources)
    ↓
4. FluxCD: Reconcile services (deploy platform services)
    ↓
5. CLI: Cluster ready

Duration: 20-40 minutes (depends on provider and cluster size)

Why this design: Automated end-to-end deployment. No manual steps. Idempotent (safe to re-run).

Evidence: cmd/cluster_bootstrap.go, Session 2 B0 section 2

Stage 6: Operation

Purpose: Manage running cluster.

Operations:

  • Monitor: Check cluster health and service status

  • Update: Apply configuration changes

  • Scale: Add/remove worker nodes

  • Backup: Backup cluster data (Velero)

  • Troubleshoot: Diagnose and fix issues

Evidence: Session 2 B0 section 2

Stage 7: Updates

Purpose: Apply configuration changes to running cluster.

Workflow:

1. User: Edit configuration file
2. User: opencenter cluster validate my-cluster
3. User: opencenter cluster generate my-cluster
4. User: git commit -m "Update configuration"
5. User: git push
6. FluxCD: Detects change (within 15m)
7. FluxCD: Reconciles new state
8. Services: Updated in cluster

Update Types:

  • Service Configuration: Enable/disable services, change settings

  • Node Scaling: Add/remove worker nodes

  • Networking: Change CNI plugin, load balancer settings

  • Security: Update RBAC policies, network policies

Why this design: GitOps workflow (Git as source of truth). Automated reconciliation (no manual kubectl). Auditable (Git history).

Evidence: Ecosystem.md GitOps workflow, Session 2 B0 section 2

Stage 8: Decommission

Purpose: Safely delete cluster and clean up resources.

Command:

opencenter cluster destroy my-cluster

Process:

1. Backup: Backup cluster data (optional)
2. Terraform: Destroy infrastructure (VMs, networks, storage)
3. CLI: Delete configuration file (optional)
4. Git: Archive repository (optional)

Why this design: Clean resource deletion. No orphaned resources. Optional backup for recovery.

Evidence: cmd/cluster_destroy.go, Session 2 B0 section 2

Configuration Precedence

Precedence Order (Highest to Lowest)

  1. Command-line flags: opencenter cluster set my-cluster cluster.worker_count=5

  2. Configuration file: ~/.config/opencenter/clusters/my-org/.my-cluster-config.yaml

  3. CLI defaults: ~/.config/opencenter/config.yaml

  4. Built-in defaults: internal/config/defaults.go

Example:

# Built-in default: worker_count = 2
# CLI default: worker_count = 3
# Configuration file: worker_count = 4
# Cluster set command: opencenter cluster set my-cluster cluster.worker_count=5

# Result: worker_count = 5 (command-line flag wins)

Why this design: Flexibility for different use cases. Override at multiple levels. Sensible defaults reduce configuration.

Evidence: internal/config/manager.go, Session 2 B0 section 3

Configuration Storage

File Locations

Cluster Configuration:

~/.config/opencenter/clusters/<organization>/<cluster>/.<cluster>-config.yaml

CLI Defaults:

~/.config/opencenter/config.yaml

Secrets:

~/.config/opencenter/clusters/<organization>/secrets/age/<cluster>-key.txt
~/.config/opencenter/clusters/<organization>/secrets/ssh/<cluster>-key

GitOps Repository:

<git_dir>/  # User-specified location

Why this design: Organization-based structure (multi-tenancy). Secrets separate from configuration. GitOps repository user-controlled.

Evidence: .kiro/steering/structure.md:118-128, Session 2 B0 section 14

Configuration Validation

Validation Layers

Layer 1: Schema Validation

Purpose: Verify structure, types, and formats.

Example:

# Invalid: worker_count is string, should be integer
opencenter:
  cluster:
    worker_count: "five"

# Validation Error
Error: Invalid type for opencenter.cluster.worker_count
Expected: integer
Actual: string

Why this layer: Fast (instant). Catches 80% of errors. Clear error messages.

Evidence: internal/config/validator.go, Session 1 A3

Layer 2: Business Rules

Purpose: Verify cross-field dependencies and logical consistency.

Example:

# Invalid: VRRP enabled but no VRRP IP
opencenter:
  cluster:
    networking:
      use_octavia: false
      vrrp_enabled: true
      # vrrp_ip: missing

# Validation Error
Error: When use_octavia=false and vrrp_enabled=true, vrrp_ip must be set

Why this layer: Catches logical errors. Prevents deployment failures. Specific error messages.

Evidence: internal/config/validator.go, tests/features/workflow.feature:38-43

Layer 3: Provider Validation

Purpose: Verify provider-specific constraints.

Example:

# Invalid: Image ID doesn't exist in OpenStack
opencenter:
  infrastructure:
    openstack:
      image_id: "invalid-image-id"

# Validation Error
Error: Image ID not found in OpenStack region sjc3
Image ID: invalid-image-id
Available images: [list of valid images]

Why this layer: Catches provider-specific errors. Prevents deployment failures. Provides helpful suggestions.

Evidence: internal/config/*_validator.go, Session 1 A3

Layer 4: Connectivity Validation (Optional)

Purpose: Verify API reachability and credentials.

Example:

# Enable connectivity validation
opencenter cluster validate my-cluster --validation online

Checks:

  • OpenStack API reachable

  • Credentials valid

  • Quotas sufficient

  • Networks available

Why optional: Requires credentials and network access. Slower than other layers. But catches deployment-time failures.

Evidence: internal/config/validator.go, Session 1 A3

Configuration Updates

Update Strategies

Strategy 1: In-Place Update

Use Case: Service configuration changes (enable/disable services, change settings)

Workflow:

1. Edit configuration file
2. Validate configuration
3. Render GitOps repository
4. Commit and push
5. FluxCD reconciles (5-15 minutes)

Example:

# Before
opencenter:
  services:
    loki:
      enabled: true

# After
opencenter:
  services:
    loki:
      enabled: false

Why this strategy: No cluster rebuild. Fast updates. GitOps workflow.

Strategy 2: Node Scaling

Use Case: Add/remove worker nodes

Workflow:

1. Edit configuration file (change worker_count)
2. Validate configuration
3. Render GitOps repository
4. Run Terraform apply (provision new nodes)
5. Run Kubespray (join new nodes to cluster)

Example:

# Before
opencenter:
  cluster:
    worker_count: 2

# After
opencenter:
  cluster:
    worker_count: 5

Why this strategy: Horizontal scaling. No downtime. Automated provisioning.

Strategy 3: Cluster Rebuild

Use Case: Provider change, major Kubernetes version upgrade

Workflow:

1. Backup cluster data (Velero)
2. Create new configuration file
3. Deploy new cluster
4. Restore data to new cluster
5. Validate applications
6. Cutover traffic
7. Decommission old cluster

Why this strategy: Clean slate. No migration complexity. But requires downtime.

Configuration Drift

Drift Detection

Definition: Difference between configuration file and actual infrastructure/cluster state.

Types:

  1. Infrastructure Drift: VMs deleted, networks changed, storage modified

  2. Configuration Drift: Manual kubectl changes, direct API modifications

  3. Service Drift: Service versions changed, settings modified

Detection:

# Detect infrastructure drift
opencenter cluster drift my-cluster

# Detect configuration drift (FluxCD)
kubectl get kustomizations -A

Why drift happens: Manual changes, external automation, infrastructure failures.

Evidence: cmd/cluster_drift.go, Session 1 A8

Drift Prevention

Strategy 1: GitOps Discipline

Practice: All changes through Git, no manual kubectl.

Enforcement:

  • RBAC (limit direct cluster access)

  • Audit logging (track manual changes)

  • FluxCD reconciliation (revert manual changes)

Strategy 2: Immutable Infrastructure

Practice: Replace infrastructure instead of modifying.

Example: Deploy new nodes instead of upgrading existing nodes.

Strategy 3: Automated Reconciliation

Practice: FluxCD continuously reconciles Git → Cluster.

Interval: 5-15 minutes (configurable)

Why this strategy: Self-healing. Prevents configuration drift. Enforces GitOps discipline.

Evidence: Ecosystem.md drift detection

Drift Remediation

Process:

1. Detect drift (opencenter cluster drift)
2. Analyze changes (what changed, why)
3. Decide action:
   a. Accept drift (update configuration file)
   b. Revert drift (re-apply configuration)
4. Document decision (Git commit message)

Example:

# Drift detected: worker_count changed from 5 to 3

# Option A: Accept drift (update configuration)
vim ~/.config/opencenter/clusters/my-org/.my-cluster-config.yaml
# Change worker_count to 3
git commit -m "Accept worker_count drift (cost optimization)"

# Option B: Revert drift (re-apply configuration)
opencenter cluster generate my-cluster
terraform apply
# Provisions 2 new workers to reach 5 total

Configuration Versioning

Git-Based Versioning

Practice: Store configuration in Git with semantic versioning.

Workflow:

# Initial configuration
git add .my-cluster-config.yaml
git commit -m "Initial cluster configuration"
git tag v1.0.0

# Update configuration
vim .my-cluster-config.yaml
git commit -m "Enable Harbor registry"
git tag v1.1.0

# Rollback to previous version
git checkout v1.0.0
opencenter cluster generate my-cluster

Why this practice: Audit trail (Git history). Rollback capability (Git checkout). Collaboration (pull requests).

Configuration Snapshots

Practice: Backup configuration before major changes.

Workflow:

# Before major change
cp .my-cluster-config.yaml .my-cluster-config.yaml.backup-$(date +%Y%m%d)

# Make changes
vim .my-cluster-config.yaml

# If rollback needed
cp .my-cluster-config.yaml.backup-20260217 .my-cluster-config.yaml

Why this practice: Quick rollback. No Git required. Local backup.

Best Practices

1. Validate Before Deploy

Practice: Always validate configuration before deployment.

Workflow:

# Edit configuration
vim .my-cluster-config.yaml

# Validate
opencenter cluster validate my-cluster

# Only deploy if validation passes
opencenter cluster generate my-cluster

Rationale: Catch errors early. Prevent deployment failures. Faster feedback loop.

2. Use Small, Incremental Changes

Practice: One logical change per commit.

Example:

Good: "Enable Harbor registry"
Bad: "Enable Harbor, update Kubernetes version, add 3 workers"

Rationale: Easier to review. Easier to rollback. Clearer history.

3. Test in Dev First

Practice: Test configuration changes in dev before production.

Workflow:

1. Apply change to dev cluster
2. Validate functionality
3. Monitor for issues
4. Apply to production

Rationale: Catch issues in dev, not prod. Validate changes before production.

4. Document Configuration Decisions

Practice: Document why configuration choices were made.

Example:

# Configuration
opencenter:
  services:
    loki:
      enabled: false  # Disabled for cost optimization (logs to S3 instead)

Rationale: Context for future maintainers. Audit trail for decisions.

5. Backup Before Major Changes

Practice: Backup cluster data before major configuration changes.

Workflow:

# Backup cluster
velero backup create pre-upgrade-backup --include-namespaces app1,app2

# Make changes
vim .my-cluster-config.yaml
opencenter cluster generate my-cluster

# If rollback needed
velero restore create --from-backup pre-upgrade-backup

Rationale: Safety net for major changes. Quick recovery if issues.

Common Misconceptions

"Configuration changes require cluster rebuild"

Reality: Most configuration changes can be applied in-place (service configuration, node scaling). Only provider changes require rebuild.

"Validation guarantees successful deployment"

Reality: Validation catches most errors, but not all. Runtime issues (quota exhaustion, network failures) can still occur.

"Manual changes are permanent"

Reality: FluxCD reverts manual changes on next reconciliation. All changes must go through Git.

"Configuration file contains all cluster state"

Reality: Configuration file defines desired state. Actual state is in cluster. Drift can occur.

"GitOps means no manual intervention"

Reality: Manual intervention is sometimes necessary (debugging, emergencies). But changes should be committed to Git afterward.

Further Reading

  • architecture.md[Architecture] - System design and components

  • gitops-workflow.md[GitOps Workflow] - Repository structure and reconciliation

  • ../operations/validate-configuration.md[Validate Configuration] - Validation procedures

  • ../operations/troubleshoot-deployment.md[Troubleshoot Deployment] - Fix deployment issues

  • ../reference/configuration-schema.md[Configuration Schema] - Complete field reference


Evidence

This explanation is based on:

  • Configuration lifecycle: Session 2 B0 section 2

  • Configuration structure: Session 2 B0 section 3

  • Validation layers: internal/config/validator.go, Session 1 A3

  • GitOps workflow: Ecosystem.md GitOps flow

  • Drift detection: cmd/cluster_drift.go, Session 1 A8

  • Configuration precedence: internal/config/manager.go

  • File locations: .kiro/steering/structure.md:118-128