
Air-Gap State Management

Purpose: Explains to contributors how build state is persisted so that air-gap builds can resume after an interruption.

Why state management matters

Air-gap builds are long-running operations that pull hundreds of container images, bundle Helm charts, and generate Zarf packages. A network interruption or disk-full condition mid-build shouldn't force a restart from scratch. The state management system tracks progress so builds can resume from the last successful step.

How it works

The build pipeline in orchestrator.py executes a sequence of discrete steps. Each step is wrapped by the transaction.py module, which records completion in the state.py persistence layer.

orchestrator.py
├── Step 1: Load config → state.mark_complete("config")
├── Step 2: Resolve manifests → state.mark_complete("manifests")
├── Step 3: Pull images → state.mark_complete("images")
├── Step 4: Bundle charts → state.mark_complete("charts")
├── Step 5: Generate zarf.yaml → state.mark_complete("zarf_gen")
└── Step 6: Package → state.mark_complete("package")

On resume, the orchestrator reads the state file and skips steps already marked complete.
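As a rough illustration of the skip-on-resume behavior, the sketch below filters the pipeline down to the steps that still need to run. The step names are taken from the diagram above; the function name steps_to_run and the use of a plain set for completed steps are assumptions made for this example, not the actual orchestrator.py API.

```python
# Hypothetical sketch: step names come from the diagram above; the
# function and variable names are illustrative, not the real API.
STEPS = ["config", "manifests", "images", "charts", "zarf_gen", "package"]


def steps_to_run(completed):
    """Return the steps that still need to run, in pipeline order."""
    return [step for step in STEPS if step not in completed]


# A build interrupted during the image pull has finished two steps.
remaining = steps_to_run({"config", "manifests"})
print(remaining)
```

Keeping the step list ordered and filtering it, rather than iterating over the completed set, preserves pipeline order no matter how the completed steps were stored.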

State file

The state.py module persists build state as a YAML file in the build output directory. It records:

  • Which steps have completed
  • Timestamps for each step
  • The configuration hash (to detect config changes that invalidate the state)
  • Partial progress within long steps (e.g., which images have been pulled)

If the configuration changes between runs, the state is invalidated and the build restarts from the beginning.
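The invalidation rule can be sketched roughly as follows. This is not the real state.py: the function names, the state layout, and the EXAMPLE_VERSION key are hypothetical, and JSON stands in for YAML only to keep the example free of third-party dependencies.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def config_hash(paths):
    """Hash the name and contents of every config file, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def load_state(state_path, current_hash):
    """Return saved state, or a fresh one if the config has changed."""
    fresh = {"config_hash": current_hash, "completed": {}, "partial": {}}
    if not state_path.exists():
        return fresh
    saved = json.loads(state_path.read_text())
    if saved.get("config_hash") != current_hash:
        return fresh  # config changed: discard all recorded progress
    return saved


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "versions.env"
    state_file = Path(tmp) / "state.json"  # JSON stand-in for the YAML file
    cfg.write_text("EXAMPLE_VERSION=1\n")  # hypothetical key

    state = load_state(state_file, config_hash([cfg]))
    state["completed"]["config"] = "2024-01-01T00:00:00Z"
    state_file.write_text(json.dumps(state))

    # Same config: recorded progress is kept.
    same = load_state(state_file, config_hash([cfg]))

    # Edited config: recorded progress is dropped.
    cfg.write_text("EXAMPLE_VERSION=2\n")
    changed = load_state(state_file, config_hash([cfg]))
```

Hashing file contents rather than modification times means that touching a file without changing it does not force a rebuild.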

Transactions

The transaction.py module wraps individual steps so that a failure leaves the build in a known state. Depending on the step type, partial work is either rolled back or recorded for resume. For example, if the image pull fails after 50 of 200 images, the transaction records that partial progress so the next run can continue from image 51 rather than image 1.

Key behaviors:

  • Each transaction wraps a single build step
  • On success, the step is marked complete in the state file
  • On failure, partial artifacts are cleaned up (or preserved for resume, depending on the step type)
  • The orchestrator catches the exception types defined in exceptions.py and decides whether to retry the step or abort the build
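One way to picture these behaviors is a context manager that marks a step complete only when its body finishes without raising, and otherwise leaves the partial progress in place. The sketch below is an assumption-laden illustration: BuildState, transaction, and pull_images are hypothetical names, state is kept in memory rather than on disk, and the real transaction.py may be structured differently.

```python
from contextlib import contextmanager


class BuildState:
    """In-memory stand-in for the persisted state (hypothetical)."""

    def __init__(self):
        self.completed = set()
        self.partial = {}  # step name -> items already finished

    def mark_complete(self, step):
        self.completed.add(step)
        self.partial.pop(step, None)


@contextmanager
def transaction(state, step):
    """Mark the step complete only if the body finishes without error."""
    done = state.partial.setdefault(step, [])
    yield done  # an exception in the body propagates from here
    state.mark_complete(step)  # reached only on success


def pull_images(state, images, pull):
    with transaction(state, "images") as done:
        for image in images:
            if image in done:
                continue  # pulled on a previous run
            pull(image)
            done.append(image)


# Simulate a pull that fails once on the third image.
images = ["a", "b", "c", "d"]
state = BuildState()
calls = []


def flaky_pull(image):
    if image == "c" and "c-failed" not in calls:
        calls.append("c-failed")
        raise ConnectionError("network dropped")
    calls.append(image)


try:
    pull_images(state, images, flaky_pull)
except ConnectionError:
    partial_after_failure = list(state.partial["images"])

pull_images(state, images, flaky_pull)  # second run resumes at "c"
```

In this sketch the second run never calls pull for "a" or "b", which mirrors the "continue from image 51" behavior described above.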

The orchestrator loop

orchestrator.py is the central coordinator. It:

  1. Loads configuration via config.py
  2. Checks for existing state via state.py
  3. Determines which steps to run (skipping completed ones)
  4. Executes each step inside a transaction
  5. Collects metrics via metrics.py (timing, counts, sizes)
  6. Reports progress to the terminal via Rich console output
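The loop above could look roughly like the following sketch. It is illustrative only: RetryableError is a hypothetical stand-in for whatever exceptions.py defines, the limit of three attempts is an assumption, plain print calls replace the Rich console output, and the metrics are reduced to per-step timings.

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for a transient error type."""


def run_build(steps, completed, max_attempts=3):
    """Run pending steps in order; return per-step durations in seconds."""
    timings = {}
    for name, func in steps:
        if name in completed:
            print(f"skip  {name}")
            continue
        for attempt in range(1, max_attempts + 1):
            started = time.perf_counter()
            try:
                func()
            except RetryableError:
                if attempt == max_attempts:
                    raise  # give up: abort the build
                print(f"retry {name} (attempt {attempt} failed)")
                continue
            timings[name] = time.perf_counter() - started
            completed.add(name)
            print(f"done  {name}")
            break
    return timings


attempts = {"images": 0}


def pull_images():
    attempts["images"] += 1
    if attempts["images"] == 1:
        raise RetryableError("registry timeout")


steps = [
    ("config", lambda: None),
    ("images", pull_images),
    ("package", lambda: None),
]
completed = {"config"}  # as if loaded from an existing state file
timings = run_build(steps, completed)
```

Any exception other than RetryableError propagates immediately, which corresponds to the abort path; only the transient type is retried.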

Trade-offs

  • State files add complexity but prevent expensive re-downloads. A full image pull for a typical deployment can take 30+ minutes on a fast connection.
  • Config hash invalidation is conservative: any change to versions.env or the main config file triggers a full rebuild. This avoids subtle version mismatches in the final package.
  • Resuming an interrupted download of a single image depends on the container registry supporting range requests. If the registry doesn't support them, that image is re-pulled from scratch.
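The range-request trade-off can be illustrated with standard HTTP semantics: a client sends a Range header for the missing bytes, and a 206 response means the server honored it while a 200 response means the full body is being sent again. The helper names below are hypothetical, and the sketch makes no claim about how the project's image puller actually talks to a registry.

```python
import os
import tempfile


def resume_headers(partial_path):
    """Ask for only the bytes that are missing from a partial download."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    return {"Range": f"bytes={offset}-"} if offset else {}


def open_mode(status_code):
    """Pick how to open the local file based on the server's reply."""
    if status_code == 206:  # Partial Content: range honored, append
        return "ab"
    if status_code == 200:  # full body sent: range ignored, start over
        return "wb"
    raise RuntimeError(f"unexpected status {status_code}")


with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "layer.partial")
    with open(partial, "wb") as fh:
        fh.write(b"12345")  # pretend five bytes arrived before the failure
    headers_existing = resume_headers(partial)
    headers_missing = resume_headers(os.path.join(tmp, "absent.partial"))
```

Falling back to "wb" on a 200 response is what makes the no-range-support case safe: the partial file is overwritten rather than corrupted by appending a full body to it.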

Common misconceptions

  • The state file is not a cache. It tracks completion status, not cached data. Images and charts are stored in the build output directory.
  • Deleting the state file is safe. It forces a full rebuild but doesn't corrupt anything.
  • State is local to a single build directory. Running builds in different directories creates independent state.

Further reading

  • Air-Gap Code Structure for module details
  • src/opencenter_build/state.py for the state persistence implementation
  • src/opencenter_build/transaction.py for the transaction wrapper