Runner-Local DR Execution Model

Purpose

Describe the intended operating path for HybridOps DR and burst workflows when execution is performed by a shared runner in or near the target environment.

Why this exists

Workstation-driven DR is not an acceptable default because:

  • the operator laptop may not have private cloud reachability
  • the operator laptop may be unavailable during an incident
  • public IP on every target VM is not the right steady-state answer
  • DR should remain executable even when on-prem is partially or fully unavailable

Execution flow

The intended end-to-end sequence is:

  1. Observability signals a condition.
  2. Decision service evaluates policy and emits a decision artifact.
  3. Workflow runner validates the decision, approvals, and prerequisites.
  4. Runner executes the selected HyOps blueprint/module path.
  5. Data services are restored or promoted.
  6. Required secrets are synchronized.
  7. GitOps reconciles stateless workloads.
  8. Traffic cutover occurs.
  9. Backup and observability are re-established.
  10. Evidence is published.

Traffic cutover SHOULD prefer updating internal DNS records on the shared internal authority over rewriting raw IPs application by application.
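The runner-side portion of the flow above (steps 3 through 10) can be sketched as an ordered phase list behind a hard decision gate. Everything here is illustrative; the `Decision` shape and phase names are hypothetical stand-ins, not real HyOps interfaces.

```python
# Illustrative sketch of the runner-side DR flow; the Decision structure and
# phase names are hypothetical, not the actual HyOps API.
from dataclasses import dataclass

PHASES = [
    "validate_decision",                  # step 3: decision, approvals, prerequisites
    "execute_blueprint",                  # step 4: selected blueprint/module path
    "restore_data_services",              # step 5
    "sync_secrets",                       # step 6
    "gitops_reconcile",                   # step 7: stateless workloads
    "traffic_cutover",                    # step 8: prefer internal DNS updates
    "reestablish_backup_observability",   # step 9
    "publish_evidence",                   # step 10
]

@dataclass
class Decision:
    dr_mode: str
    target_cloud: str
    approved: bool

def run_dr(decision: Decision) -> list[str]:
    """Run all phases in order; refuse to start if the decision is not approved."""
    if not decision.approved:
        raise RuntimeError("preflight failed: decision not approved")
    completed = []
    for phase in PHASES:
        completed.append(phase)  # a real runner would dispatch blueprint steps here
    return completed
```

The point of the gate is that nothing executes on an unapproved decision artifact; the ordering makes the dependency between data restore, secret sync, and cutover explicit.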

Control-plane pattern

The runner SHOULD be outside the primary on-prem failure domain.

Examples:

  • shared cloud runner
  • on-prem runner on the restored management network
  • edge execution node
  • dedicated automation VM in the cloud hub

Current HybridOps reference implementation:

  • bootstrap a private GCP runner in the hub core subnet via networking/gcp-ops-runner@v1
  • bootstrap an on-prem runner on the management network via networking/onprem-ops-runner@v1
  • bootstrap the runner host software via platform/linux/ops-runner
  • dispatch env-scoped blueprint overlays to that runner via:
      • hyops runner blueprint preflight
      • hyops runner blueprint deploy
  • use that runner for blueprints that declare execution_plane: runner-local

The runner SHOULD have private reachability to the chosen target subnet/VPC.
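A blueprint that opts into this path would declare its execution plane in its manifest. The fragment below is illustrative only: `execution_plane: runner-local` is taken from the text above, while every other key and value is a hypothetical schema, not the real HyOps blueprint format.

```yaml
# Illustrative blueprint manifest fragment; only execution_plane is from this doc.
blueprint: dr-cloud-restore            # hypothetical blueprint name
execution_plane: runner-local          # executed on the shared runner, not a workstation
runner:
  module: networking/gcp-ops-runner@v1 # private GCP runner in the hub core subnet
access_mode: private                   # explicit, visible in workflow inputs and evidence
```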

GCP project-role model

For GCP, the runner-local model is clearer when project roles are explicit:

  • host/network project
      • Shared VPC
      • Cloud Router / NAT
      • runner VM placement
      • private DR VM placement
  • control project
      • env-scoped secret authority such as GCP Secret Manager
      • env-scoped object repositories such as backup buckets
  • workload project
      • optional later role for env-specific service projects

This means the runner can execute in the host/network project while still reading secrets or backup state from the env control project.
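The cross-project relationship can be made concrete with a small role map. The project names and role labels below are hypothetical; only the split itself (runner in host/network, secret and backup authority in control) comes from the text.

```python
# Illustrative GCP project-role map; project names and role labels are invented.
ROLES = {
    "host-network": {"shared_vpc", "cloud_router_nat", "runner_vm", "dr_vm"},
    "control":      {"secret_manager", "backup_buckets"},
    "workload":     {"env_service_project"},  # optional later role
}

def can_read_secrets(exec_project: str, secret_project: str) -> bool:
    """Cross-project read: the runner executes in one project and reads
    secret/backup state from the env control project."""
    return ("runner_vm" in ROLES[exec_project]
            and "secret_manager" in ROLES[secret_project])
```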

Access patterns

Preferred

  • runner-local private reachability

Acceptable fallback

  • explicit bastion
  • provider-specific access such as GCP IAP

Last-resort drill mode

  • temporary public IPs with restricted firewall rules
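The preference order above can be sketched as a strict fallback chain that refuses to proceed rather than silently exposing targets. The `AccessMode` names and the boolean probe inputs are hypothetical, assumed for illustration.

```python
# Illustrative access-mode selection mirroring the stated preference order;
# the mode names and probe flags are hypothetical.
from enum import Enum

class AccessMode(Enum):
    RUNNER_LOCAL = "runner-local"     # preferred: private reachability
    BASTION = "bastion"               # acceptable: explicit bastion or GCP IAP
    TEMP_PUBLIC = "temporary-public"  # last resort, drill mode only

def select_access_mode(private_ok: bool, bastion_ok: bool, drill: bool) -> AccessMode:
    if private_ok:
        return AccessMode.RUNNER_LOCAL
    if bastion_ok:
        return AccessMode.BASTION
    if drill:
        return AccessMode.TEMP_PUBLIC  # temporary public IPs, restricted firewall rules
    raise RuntimeError("no acceptable access path; refuse rather than expose targets")
```

Failing closed in the final branch is the point: outside a drill, the absence of a private or bastion path is an error, not a cue to open public access.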

Normal hybrid vs DR

Normal workload bursting may rely on:

  • BGP/IPsec
  • routed overlays
  • private site-to-cloud connectivity

DR must still work if the source on-prem site is down.

Therefore:

  • normal hybrid connectivity is valuable
  • but DR execution cannot depend on that source site remaining alive

Practical HyOps implications

HybridOps blueprints and modules SHOULD be designed so that:

  • cloud DR inventories do not assume workstation reachability
  • access mode is explicit
  • repository, network, and project data are consumed from state
  • step preflight fails before execution when access prerequisites are missing
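The design rules above reduce to a preflight gate that rejects a run before anything executes. The field names (`access_mode`, the `state` keys) are hypothetical here, chosen only to mirror the bullets.

```python
# Illustrative preflight gate for the rules above; field names are hypothetical.
def preflight(inputs: dict) -> list[str]:
    """Return a list of blocking errors; an empty list means execution may proceed."""
    errors = []
    if inputs.get("access_mode") is None:
        errors.append("access mode must be explicit")
    state = inputs.get("state", {})
    for key in ("repository", "network", "project"):
        if key not in state:
            # consumed from state, never assumed from workstation context
            errors.append(f"missing {key} in state")
    return errors
```

A dispatcher would refuse to stage the run whenever `preflight()` returns a non-empty list, which is what "fails before execution" means in practice.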

HybridOps runner dispatch SHOULD:

  • treat the local runtime as the authority
  • stage a disposable runtime bundle onto the runner
  • execute hyops on the runner against that staged runtime
  • sync resulting state, logs, meta, and artifacts back to the local runtime
  • write local runner evidence for every dispatch
  • remove the local staged runner bundle and dispatch env after successful runs by default, so one-shot synced secrets do not linger in the runtime work directory
  • allow explicit one-shot env forwarding for secrets, resolving requested keys from the selected runtime vault first and allowing operator-shell overrides during the transition to fully managed secret synchronization
  • allow a pre-dispatch secret-source refresh so the runtime vault cache can be repopulated from an external authority such as HashiCorp Vault or GCP Secret Manager before the runner job is staged
  • project Terraform Cloud authentication from the selected runtime when local hyops init terraform-cloud has already established a usable token, so TFC-backed remote jobs do not require a second interactive login on the target runner
  • when Terraform Cloud is used as the backend for runner-executed workflows, treat it as the state backend and ensure HyOps-managed workspaces execute in local mode on the runner rather than on Terraform Cloud infrastructure
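The dispatch lifecycle in the SHOULD list above can be sketched as stage, execute, sync back, write evidence, then clean up, with cleanup guaranteed even on failure. Every function and object name here is a hypothetical stand-in, not the real hyops dispatch internals.

```python
# Illustrative dispatch lifecycle; runner/runtime methods are hypothetical
# stand-ins for the real hyops dispatch machinery.
def dispatch(runner, runtime, cleanup: bool = True) -> dict:
    bundle = runtime.stage_bundle(runner)    # stage a disposable runtime bundle
    try:
        result = runner.execute(bundle)      # run hyops on the runner against it
        runtime.sync_back(result)            # state, logs, meta, artifacts
        runtime.write_evidence(result)       # local runner evidence per dispatch
        return result
    finally:
        if cleanup:
            # default removal so one-shot synced secrets do not linger
            runner.remove(bundle)
```

Keeping removal in a `finally` block enforces the "do not linger" default even when a phase fails mid-run; `cleanup=False` would be an explicit operator opt-out.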

Future integration points

Expected future integrations:

  • decision service emits dr_mode, target_cloud, execution_plane, and related policy fields
  • runner consumes those fields and selects the right blueprint
  • secret sync from managed secret stores is performed as a distinct execution phase
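Consuming the decision artifact might look like the sketch below. The field names `dr_mode`, `target_cloud`, and `execution_plane` come from the text above; the JSON shape and the blueprint naming convention are invented for illustration.

```python
# Illustrative decision-artifact consumption; only the field names come from
# this doc, the blueprint naming convention is hypothetical.
import json

def select_blueprint(artifact_json: str) -> str:
    decision = json.loads(artifact_json)
    if decision["execution_plane"] != "runner-local":
        raise ValueError("this runner only serves runner-local blueprints")
    # hypothetical convention: blueprint named <dr_mode>-<target_cloud>
    return f"{decision['dr_mode']}-{decision['target_cloud']}"
```

The guard keeps the runner honest about its execution plane: a decision targeting some other plane is rejected instead of being run in the wrong place.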

Guardrails

Do:

  • keep target VMs private by default
  • use a shared runner or explicit bastion
  • keep access mode visible in workflow inputs and evidence

Do not:

  • rely on the operator workstation as the product default
  • assume convenience bastion inference is valid for cloud DR
  • make public IP per target node the standard operating posture

Outcome

If this model is followed, HybridOps can offer:

  • a clean DR story for SMEs and schools
  • a supportable default execution posture
  • a path to more advanced decision-driven recovery later