Air-Gap State Management

Purpose: Explains to contributors how build state is persisted so that air-gap builds can resume after an interruption.

Why state management matters

Air-gap builds are long-running operations that pull hundreds of container images, bundle Helm charts, and generate Zarf packages. A network interruption or disk-full condition mid-build shouldn't force a restart from scratch. The state management system tracks progress so builds can resume from the last successful step.

How it works

The build pipeline in orchestrator.py executes a sequence of discrete steps. Each step is wrapped by the transaction.py module, which records completion in the state.py persistence layer.

orchestrator.py
  ├── Step 1: Load config          → state.mark_complete("config")
  ├── Step 2: Resolve manifests    → state.mark_complete("manifests")
  ├── Step 3: Pull images          → state.mark_complete("images")
  ├── Step 4: Bundle charts        → state.mark_complete("charts")
  ├── Step 5: Generate zarf.yaml   → state.mark_complete("zarf_gen")
  └── Step 6: Package              → state.mark_complete("package")

On resume, the orchestrator reads the state file and skips steps already marked complete.
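
The skip-on-resume behavior can be sketched roughly as follows. This is a minimal illustration, not the actual orchestrator.py code: the step names come from the diagram above, but `InMemoryState`, `is_complete`, and `run_pipeline` are hypothetical names, and the state is kept in memory rather than written to disk.

```python
from typing import Callable, List, Set, Tuple

# Step names mirror the diagram above; the step bodies are supplied by the caller.
STEP_NAMES = ["config", "manifests", "images", "charts", "zarf_gen", "package"]


class InMemoryState:
    """Hypothetical stand-in for the state.py persistence layer."""

    def __init__(self) -> None:
        self.completed: Set[str] = set()

    def is_complete(self, step: str) -> bool:
        return step in self.completed

    def mark_complete(self, step: str) -> None:
        self.completed.add(step)


def run_pipeline(
    state: InMemoryState,
    steps: List[Tuple[str, Callable[[], None]]],
) -> List[str]:
    """Run steps in order, skipping completed ones; return the names executed."""
    executed: List[str] = []
    for name, action in steps:
        if state.is_complete(name):
            continue  # finished in an earlier run, so skip it
        action()  # an exception here stops the run before the step is marked
        state.mark_complete(name)
        executed.append(name)
    return executed
```

If the "images" step raises during the first run, only "config" and "manifests" are marked complete, so a second call to `run_pipeline` with the same state starts again at "images".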

State file

The state.py module persists build state as a YAML file in the build output directory. It records:

Which steps have completed
Timestamps for each step
The configuration hash (to detect config changes that invalidate the state)
Partial progress within long steps (e.g., which images have been pulled)

If the configuration changes between runs, the state is invalidated and the build restarts from the beginning.
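
One way to picture the recorded fields and the invalidation check is the sketch below. It assumes SHA-256 over the raw bytes of the configuration inputs; the real hash algorithm, field names, and YAML layout used by state.py are not specified here, so `config_hash`, `BuildState`, and `load_or_reset` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def config_hash(config_inputs: Dict[str, bytes]) -> str:
    """Hash the config inputs in a stable order (sorted by name)."""
    digest = hashlib.sha256()
    for name in sorted(config_inputs):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so names and contents cannot run together
        digest.update(config_inputs[name])
        digest.update(b"\0")
    return digest.hexdigest()


@dataclass
class BuildState:
    """Hypothetical in-memory shape of the persisted state."""

    config_hash: str
    completed: Dict[str, str] = field(default_factory=dict)  # step -> timestamp
    partial: Dict[str, List[str]] = field(default_factory=dict)  # step -> finished items


def load_or_reset(saved: Optional[BuildState], current_hash: str) -> BuildState:
    """Reuse the saved state only if it was produced from the same configuration."""
    if saved is None or saved.config_hash != current_hash:
        return BuildState(config_hash=current_hash)  # invalidated: start over
    return saved
```

Because the hash covers every input byte, even a one-character change produces a different digest and an empty `BuildState`, which matches the restart-from-the-beginning behavior described above.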

Transactions

The transaction.py module provides rollback and resume semantics for individual steps. If a step fails partway through (e.g., after 50 of 200 images have been pulled), the transaction records the partial progress so the next run can continue from image 51 rather than image 1.
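
The image example can be sketched as below, assuming partial progress is tracked as a set of finished image references. `pull_remaining` and the `pull_one` callback are hypothetical names, not part of transaction.py.

```python
from typing import Callable, Iterable, List, Set


def pull_remaining(
    images: Iterable[str],
    already_pulled: Set[str],
    pull_one: Callable[[str], None],
) -> List[str]:
    """Pull only the images not yet recorded; return the ones pulled in this run."""
    pulled_now: List[str] = []
    for image in images:
        if image in already_pulled:
            continue  # recorded by an earlier run
        pull_one(image)  # may raise; images recorded so far stay recorded
        already_pulled.add(image)  # recorded only after the pull succeeds
        pulled_now.append(image)
    return pulled_now
```

Recording each image only after its pull succeeds means an interrupted image is retried on the next run instead of being treated as done.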

Key behaviors:

Each transaction wraps a single build step
On success, the step is marked complete in the state file
On failure, partial artifacts are cleaned up (or preserved for resume, depending on the step type)
The orchestrator catches the exception types defined in exceptions.py and decides whether to retry or abort
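
The behaviors listed above map naturally onto a context manager. The following is a minimal sketch under that assumption; the `transaction` function, its parameters, and the `preserve_partial` flag are hypothetical and may not match the real transaction.py interface.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Set


@contextmanager
def transaction(
    step: str,
    completed: Set[str],
    cleanup: Callable[[], None],
    preserve_partial: bool = False,
) -> Iterator[None]:
    """Wrap one build step: mark it complete on success, clean up on failure."""
    try:
        yield
    except Exception:
        if not preserve_partial:
            cleanup()  # discard partial artifacts for steps that cannot resume
        raise  # let the caller decide whether to retry or abort
    else:
        completed.add(step)  # reached only if the step body did not raise
```

A step that supports resume (such as image pulls) would pass `preserve_partial=True` so its partial artifacts survive a failure; other steps would leave the default and have their output removed.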

The orchestrator loop

orchestrator.py is the central coordinator. It:

Loads configuration via config.py
Checks for existing state via state.py
Determines which steps to run (skipping completed ones)
Executes each step inside a transaction
Collects metrics via metrics.py (timing, counts, sizes)
Reports progress to the terminal via Rich console output
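
Put together, the loop might look like the sketch below. It is an illustration only: `RetryableBuildError` and `FatalBuildError` are hypothetical stand-ins for whatever exceptions.py defines, the retry limit is an assumed parameter, the metrics are reduced to per-step timings, and the Rich console output is omitted.

```python
import time
from typing import Callable, Dict, List, Set, Tuple


class RetryableBuildError(Exception):
    """Hypothetical: a failure worth retrying, such as a network timeout."""


class FatalBuildError(Exception):
    """Hypothetical: a failure that should abort the build."""


def orchestrate(
    steps: List[Tuple[str, Callable[[], None]]],
    completed: Set[str],
    max_attempts: int = 3,
) -> Dict[str, float]:
    """Run pending steps with retries; return elapsed seconds per executed step."""
    timings: Dict[str, float] = {}
    for name, action in steps:
        if name in completed:
            continue  # skip steps finished in an earlier run
        for attempt in range(1, max_attempts + 1):
            started = time.perf_counter()
            try:
                action()
            except RetryableBuildError:
                if attempt == max_attempts:
                    raise  # out of retries: abort the build
                continue  # try the same step again
            # Any other exception, including FatalBuildError, propagates and aborts.
            timings[name] = time.perf_counter() - started
            completed.add(name)
            break
    return timings
```

Only the successful attempt is timed in this sketch; a real metrics collector might also record counts, sizes, and failed attempts.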

Trade-offs

State files add complexity but prevent expensive re-downloads. A full image pull for a typical deployment can take 30+ minutes on a fast connection.
Config hash invalidation is conservative: any change to versions.env or the main config file triggers a full rebuild. This avoids subtle version mismatches in the final package.
Partial image resume depends on the container registry supporting range requests. If the registry doesn't support them, the image is re-pulled from scratch.
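
The range-request fallback can be illustrated with standard HTTP semantics, assuming the pull happens over plain HTTP: a client asks for the remaining bytes with a `Range` header, and a server that honors it answers 206 Partial Content, while one that ignores it answers 200 with the full body. Whether the build tooling issues these requests itself or delegates to another tool is not stated here, so `resume_headers` and `plan_write` are hypothetical helpers.

```python
from typing import Dict, Tuple


def resume_headers(bytes_on_disk: int) -> Dict[str, str]:
    """Build request headers asking the server to skip bytes already downloaded."""
    if bytes_on_disk <= 0:
        return {}  # nothing saved yet, so request the whole blob
    return {"Range": f"bytes={bytes_on_disk}-"}


def plan_write(status_code: int, bytes_on_disk: int) -> Tuple[str, int]:
    """Return (file mode, starting offset) for writing the response body."""
    if status_code == 206 and bytes_on_disk > 0:
        return ("ab", bytes_on_disk)  # range honored: append to the partial file
    if status_code == 200:
        return ("wb", 0)  # range ignored or not sent: overwrite from scratch
    raise ValueError(f"unexpected status {status_code}")
```

The second branch is the re-pull described above: a 200 response carries the entire blob, so the partial file is discarded and rewritten.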

Common misconceptions

The state file is not a cache. It tracks completion status and progress, not the artifacts themselves. Images and charts are stored as separate files in the build output directory.
Deleting the state file is safe. It forces a full rebuild but doesn't corrupt anything.
State is local to a single build directory. Running builds in different directories creates independent state.
