Air-Gap State Management
Purpose: Explains to contributors how build state is persisted so that air-gap builds can be resumed.
Why state management matters
Air-gap builds are long-running operations that pull hundreds of container images, bundle Helm charts, and generate Zarf packages. A network interruption or disk-full condition mid-build shouldn't force a restart from scratch. The state management system tracks progress so builds can resume from the last successful step.
How it works
The build pipeline in orchestrator.py executes a sequence of discrete steps. Each step is wrapped by the transaction.py module, which records completion in the state.py persistence layer.
orchestrator.py
├── Step 1: Load config → state.mark_complete("config")
├── Step 2: Resolve manifests → state.mark_complete("manifests")
├── Step 3: Pull images → state.mark_complete("images")
├── Step 4: Bundle charts → state.mark_complete("charts")
├── Step 5: Generate zarf.yaml → state.mark_complete("zarf_gen")
└── Step 6: Package → state.mark_complete("package")
On resume, the orchestrator reads the state file and skips steps already marked complete.
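The skip-on-resume behavior can be sketched roughly as follows. This is a minimal illustration, not the actual orchestrator.py code: the step names come from the diagram above, while `BuildState`, `is_complete`, and `pending_steps` are hypothetical names introduced here, and the state is kept in memory rather than on disk.

```python
from dataclasses import dataclass, field

# Step names taken from the pipeline diagram above.
STEPS = ["config", "manifests", "images", "charts", "zarf_gen", "package"]


@dataclass
class BuildState:
    """Hypothetical in-memory stand-in for the state.py persistence layer."""
    completed: set = field(default_factory=set)

    def mark_complete(self, step: str) -> None:
        self.completed.add(step)

    def is_complete(self, step: str) -> bool:
        return step in self.completed


def pending_steps(state: BuildState) -> list:
    """Return the steps that still need to run, in pipeline order."""
    return [s for s in STEPS if not state.is_complete(s)]


# Simulate a build that was interrupted during the image pull.
state = BuildState()
state.mark_complete("config")
state.mark_complete("manifests")
remaining = pending_steps(state)
print(remaining)
```

Because completion is recorded per step, a resumed run only needs the ordered step list and the set of completed names to work out where to pick up.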
State file
The state.py module persists build state as a YAML file in the build output directory. It records:
- Which steps have completed
- Timestamps for each step
- The configuration hash (to detect config changes that invalidate the state)
- Partial progress within long steps (e.g., which images have been pulled)
If the configuration changes between runs, the state is invalidated and the build restarts from the beginning.
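One way to picture the config-hash check is the sketch below. It assumes a SHA-256 digest over the raw bytes of the config files. The real hashing scheme, the state field names (`config_hash`, `completed`), and the sample `versions.env` contents are all assumptions made for illustration. YAML serialization is left out so the example needs only the standard library.

```python
import hashlib
import tempfile
from pathlib import Path


def config_hash(paths):
    """Hash the contents of every config file, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def state_is_valid(state: dict, paths) -> bool:
    """The saved state is usable only if the config hash still matches."""
    return state.get("config_hash") == config_hash(paths)


with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "versions.env"
    cfg.write_text("K8S_VERSION=1.29\n")  # illustrative contents only

    # State as it might look after an interrupted run (field names are illustrative).
    state = {"config_hash": config_hash([cfg]), "completed": ["config", "manifests"]}
    valid_before = state_is_valid(state, [cfg])

    cfg.write_text("K8S_VERSION=1.30\n")  # config edited between runs
    valid_after = state_is_valid(state, [cfg])

print(valid_before, valid_after)
```

Any byte-level change to a hashed file flips the comparison, which matches the conservative "restart from the beginning" behavior described above.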
Transactions
The transaction.py module provides rollback semantics for individual steps. If a step fails partway through (e.g., 50 of 200 images pulled), the transaction records partial progress so the next run can continue from image 51 rather than image 1.
Key behaviors:
- Each transaction wraps a single build step
- On success, the step is marked complete in the state file
- On failure, partial artifacts are cleaned up (or preserved for resume, depending on the step type)
- The orchestrator catches exceptions from exceptions.py and decides whether to retry or abort
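The behaviors listed above could be sketched with a context manager along these lines. This is an assumption-laden illustration rather than the transaction.py implementation: `StepState`, `transaction`, and `pull_images` are hypothetical names, state lives in memory, and the "pull" is simulated by appending to a list.

```python
from contextlib import contextmanager


class StepState:
    """Hypothetical stand-in for the persisted state (kept in memory here)."""

    def __init__(self):
        self.completed = set()
        self.partial = {}  # step name -> list of finished items

    def mark_complete(self, step):
        self.completed.add(step)
        self.partial.pop(step, None)


@contextmanager
def transaction(state, step):
    """Mark the step complete on success; keep partial progress on failure."""
    done = state.partial.setdefault(step, [])
    try:
        yield done
    except Exception:
        # Partial progress stays in state.partial so the next run can resume.
        raise
    else:
        state.mark_complete(step)


def pull_images(state, images, fail_on=None):
    with transaction(state, "images") as done:
        for image in images:
            if image in done:
                continue  # already pulled in an earlier run
            if image == fail_on:
                raise ConnectionError(f"network dropped while pulling {image}")
            done.append(image)  # a real step would pull the image here


state = StepState()
images = ["app:1.0", "db:2.1", "proxy:0.9"]

try:
    pull_images(state, images, fail_on="db:2.1")
except ConnectionError:
    pass
after_failure = list(state.partial["images"])

pull_images(state, images)  # resume: only the remaining images are processed
print(after_failure, state.completed)
```

The first call fails after one image and leaves that image recorded as partial progress. The second call skips it, finishes the rest, and only then marks the step complete.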
The orchestrator loop
orchestrator.py is the central coordinator. It:
- Loads configuration via config.py
- Checks for existing state via state.py
- Determines which steps to run (skipping completed ones)
- Executes each step inside a transaction
- Collects metrics via metrics.py (timing, counts, sizes)
- Reports progress to the terminal via Rich console output
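A rough sketch of such a loop, covering the retry-or-abort decision and per-step timing, might look like this. The exception classes `RetryableError` and `FatalError`, the `run_pipeline` function, and the `max_attempts` parameter are hypothetical. The actual classes in exceptions.py and the metrics.py interface are not shown in this document, and Rich progress output is omitted.

```python
import time


class RetryableError(Exception):
    """Hypothetical: a transient failure worth retrying (e.g. a network drop)."""


class FatalError(Exception):
    """Hypothetical: a failure that should abort the build."""


def run_pipeline(steps, completed, max_attempts=3):
    """Run pending steps in order; return per-step durations in seconds."""
    timings = {}
    for name, func in steps:
        if name in completed:
            continue  # finished in an earlier run
        for attempt in range(1, max_attempts + 1):
            start = time.perf_counter()
            try:
                func()
            except RetryableError:
                if attempt == max_attempts:
                    raise  # out of retries: abort
                continue
            # FatalError (and anything else) propagates and aborts the build.
            timings[name] = time.perf_counter() - start
            completed.add(name)
            break
    return timings


calls = {"images": 0}


def flaky_pull():
    """Simulated step that fails once, then succeeds."""
    calls["images"] += 1
    if calls["images"] < 2:
        raise RetryableError("registry timed out")


steps = [("config", lambda: None), ("images", flaky_pull), ("package", lambda: None)]
completed = {"config"}  # pretend an earlier run finished this step
timings = run_pipeline(steps, completed)
print(sorted(timings))
```

Here the already-completed `config` step is skipped, the flaky step succeeds on its second attempt, and only the steps that actually ran get a timing entry.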
Trade-offs
- State files add complexity but prevent expensive re-downloads. A full image pull for a typical deployment can take 30+ minutes on a fast connection.
- Config hash invalidation is conservative: any change to versions.env or the main config file triggers a full rebuild. This avoids subtle version mismatches in the final package.
- Partial image resume depends on the container registry supporting range requests. If the registry doesn't support them, the image is re-pulled from scratch.
Common misconceptions
- The state file is not a cache. It tracks completion status, not cached data. Images and charts are stored in the build output directory.
- Deleting the state file is safe. It forces a full rebuild but doesn't corrupt anything.
- State is local to a single build directory. Running builds in different directories creates independent state.
Further reading
- Air-Gap Code Structure for module details
- src/opencenter_build/state.py for the state persistence implementation
- src/opencenter_build/transaction.py for the transaction wrapper