DR Runner Control-Plane Anti-Drift Note
Purpose: Prevent HybridOps DR and burst architecture from drifting back toward workstation-driven, bastion-first, or CI-direct private reachability patterns after the runner-local model has been adopted.
This note complements:
- DR Execution and Access Model
- Decision-Driven DR Orchestration Contract
- Internal DNS Authority and Service Endpoint Model
- Runner-Local DR Execution Model
1. Fixed direction
HybridOps DR and burst operations are moving toward:
- one execution runner per supported execution plane
- private-only target infrastructure by default
- decision-driven workflow dispatch
- HyOps execution from the selected runner, not from the operator workstation
This direction is intentional and must remain stable as more clouds and workflows are added.
2. Anti-drift rules
2.1 Runners are first-class control-plane assets
Runners SHOULD be provisioned and maintained as durable platform components, not treated as ad hoc SSH conveniences.
Meaning:
- GCP DR uses a GCP runner
- Azure DR uses an Azure runner
- AWS DR uses an AWS runner
- on-prem steady-state and failback use an on-prem or Proxmox-adjacent runner
The implementation pattern MUST stay layered:
- provider-specific egress adapter
- provider-specific VM lifecycle
- generic runner bootstrap
- CLI/orchestration-layer runner dispatch
- external secret authority sync before dispatch when required
Runner dispatch MUST preserve the logical execution namespace explicitly. When remote jobs are executed with --root against an extracted runtime bundle, the dispatcher MUST still export HYOPS_ENV so env-scoped naming and provider contracts do not silently degrade on the remote side.
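The namespace rule above can be sketched as a small command builder. This is a minimal illustration, not the HyOps dispatcher: the `hyops` binary name, the `build_remote_command` helper, and the job arguments are assumptions; only `--root` and `HYOPS_ENV` come from this note.

```python
import shlex


def build_remote_command(hyops_env: str, bundle_root: str, job_args: list) -> str:
    """Build a remote shell command that always exports HYOPS_ENV.

    Hypothetical helper: the 'hyops' binary name and job_args are assumed
    for illustration. Only --root and HYOPS_ENV come from the note.
    """
    if not hyops_env:
        # Refuse to dispatch rather than let env-scoped naming degrade silently.
        raise ValueError("HYOPS_ENV must be set explicitly for remote dispatch")
    argv = ["hyops", "--root", bundle_root, *job_args]
    quoted = " ".join(shlex.quote(arg) for arg in argv)
    # 'env NAME=value cmd ...' sets the variable for the remote process only.
    return f"env HYOPS_ENV={shlex.quote(hyops_env)} {quoted}"


if __name__ == "__main__":
    print(build_remote_command("dr-gcp", "/opt/bundle", ["run", "restore"]))
```

Failing fast on an empty namespace is a design choice in this sketch: an unset `HYOPS_ENV` becomes a dispatch error instead of a silent remote-side default.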
When a workflow creates fresh cloud VMs and then immediately runs a configuration or restore step against them, the VM lifecycle layer MUST expose an explicit SSH-readiness handoff. A completed create operation is not by itself a reliable signal that the next runner-executed step can begin.
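One way to express such a handoff is a probe that waits until the target actually presents an SSH banner. The sketch below is an assumption about how a readiness check could look, not the HyOps VM lifecycle contract; the function name, timeouts, and the banner-based criterion are illustrative.

```python
import socket
import time


def wait_for_ssh(host: str, port: int = 22,
                 timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Return True once host:port sends an SSH identification banner.

    Hypothetical readiness probe. A TCP accept alone is not treated as
    ready, because sshd may still be starting behind an open port.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s) as sock:
                sock.settimeout(interval_s)
                # SSH servers send an identification string starting with "SSH-".
                if sock.recv(64).startswith(b"SSH-"):
                    return True
        except OSError:
            # Refused, unreachable, or timed out: the VM is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A lifecycle step could call this for each new VM and only report success to the next runner-executed step when every probe returns `True`.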
For GCP specifically, runner placement and runner dependencies MUST keep project roles explicit:
- runner VM, router, and NAT may live in the host/network project
- env-scoped secrets and backup repositories may live in the control project
- future workload service projects must remain a separate role, not an implicit assumption
2.2 DR should not begin by creating the runner
The default operating posture SHOULD assume the runner already exists.
Creating or rebuilding a runner during an incident is acceptable only as:
- initial platform bootstrap
- runner recovery
- lab fallback
It MUST NOT be the normal production DR assumption because it increases RTO unnecessarily.
2.3 GitHub Actions or CI is an orchestrator, not the private network control plane
CI systems SHOULD:
- validate decisions
- enforce approval policy
- select the correct runner
- bootstrap the runner only if missing
- collect evidence
CI systems SHOULD NOT be designed as the component that must directly SSH into every private DR target subnet.
2.4 Bastion is fallback, not the product default
Explicit bastions remain valid as a fallback or break-glass path.
They MUST NOT become the default architecture for:
- cloud DR execution
- multi-cloud burst execution
- failback orchestration
2.5 Convenience bastion inference must stay constrained
Auto-bastion or convenience SSH proxy inference MAY exist for:
- labs
- local on-prem bootstrap
- operator convenience on known local topologies
It MUST NOT be allowed to silently define the cloud DR control-plane model.
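The constraint above can be illustrated with a small guard that only honors an inferred bastion in the permitted contexts. This is a sketch under assumptions: the context labels and the `resolve_ssh_proxy` function are hypothetical names, not part of HyOps.

```python
from typing import Optional

# Hypothetical context labels mirroring the three permitted cases above.
AUTO_BASTION_CONTEXTS = frozenset({"lab", "onprem-bootstrap", "local-operator"})


def resolve_ssh_proxy(context: str,
                      explicit_bastion: Optional[str],
                      inferred_bastion: Optional[str]) -> Optional[str]:
    """Pick the SSH proxy host, if any, for a given execution context."""
    if explicit_bastion:
        # An explicitly configured bastion stays valid as fallback or break-glass.
        return explicit_bastion
    if context in AUTO_BASTION_CONTEXTS:
        # Convenience inference is allowed only in the listed contexts.
        return inferred_bastion
    # Everywhere else (for example cloud DR) inference is ignored.
    return None
```

With this shape, a cloud DR workflow can still use a bastion, but only when someone configured it deliberately.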
2.6 Failback must use the on-prem execution plane
When a decision or workflow selects failback to on-prem, HybridOps SHOULD dispatch to the on-prem execution runner rather than trying to drive failback from an unrelated cloud runner or an operator laptop.
2.7 Execution-plane mapping must remain explicit
Decision artifacts and workflow contracts SHOULD always make the execution-plane mapping visible.
Examples:
- target_cloud=gcp -> runner_ref=gcp-ops-runner
- target_cloud=azure -> runner_ref=azure-ops-runner
- decision_type=failback_onprem -> runner_ref=proxmox-ops-runner
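A lookup that keeps this mapping visible and fails loudly on unknown inputs might look like the sketch below. The table contents come from the examples above; the dict-shaped decision artifact, the function name, and the rule that `decision_type` takes precedence over `target_cloud` are assumptions made for illustration.

```python
# Mapping tables taken from the examples in this section.
RUNNER_BY_DECISION_TYPE = {"failback_onprem": "proxmox-ops-runner"}
RUNNER_BY_TARGET_CLOUD = {
    "gcp": "gcp-ops-runner",
    "azure": "azure-ops-runner",
}


def resolve_runner_ref(decision: dict) -> str:
    """Return the runner_ref for a decision artifact (assumed to be a dict)."""
    decision_type = decision.get("decision_type")
    if decision_type in RUNNER_BY_DECISION_TYPE:
        # Assumption: failback decisions win over any target_cloud value.
        return RUNNER_BY_DECISION_TYPE[decision_type]
    target_cloud = decision.get("target_cloud")
    if target_cloud in RUNNER_BY_TARGET_CLOUD:
        return RUNNER_BY_TARGET_CLOUD[target_cloud]
    # No silent default: an unmapped decision is an error, not a guess.
    raise LookupError(f"no runner mapping for decision: {decision!r}")
```

Raising on an unmapped decision keeps the mapping explicit: adding a new cloud requires adding a visible table entry rather than relying on a fallback runner.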
2.8 Secret authority must stay outside the runner and outside on-prem
DR-critical secrets SHOULD come from an external secret authority and be synchronized into the runtime vault cache before runner execution.
Meaning:
- external HashiCorp Vault is preferred as the neutral authority
- cloud-native secret stores remain valid adapters
- the runtime vault bundle is only the per-env execution cache
HybridOps MUST NOT drift toward treating:
- the operator shell
- the runner filesystem
- or the on-prem site
as the only source of DR secrets.
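The ordering implied here (authority first, runtime cache second, dispatch last) can be sketched with injected callables. All names below are hypothetical; the sketch does not use a real Vault or cloud secret-store client and only shows the sequencing and the fail-closed behavior.

```python
from typing import Callable, Iterable, Mapping


def sync_then_dispatch(
    env: str,
    required_keys: Iterable[str],
    fetch_from_authority: Callable[[str, list], Mapping[str, str]],
    write_runtime_cache: Callable[[str, Mapping[str, str]], None],
    dispatch: Callable[[str], None],
) -> None:
    """Sync secrets from the external authority, then dispatch the runner job.

    Hypothetical orchestration step: the three callables stand in for an
    authority adapter, the per-env runtime vault cache, and runner dispatch.
    """
    keys = list(required_keys)
    secrets = fetch_from_authority(env, keys)
    missing = [key for key in keys if key not in secrets]
    if missing:
        # Fail closed: never dispatch against a stale or partial cache.
        raise RuntimeError(f"authority did not return secrets: {missing}")
    write_runtime_cache(env, secrets)
    dispatch(env)
```

Because the authority adapter is passed in, external HashiCorp Vault and a cloud-native secret store can be swapped without changing the dispatch path.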
3. Drift smells
The following are signs of architectural drift and should be treated as defects:
- new DR workflows assume operator workstation private reachability
- new GCP documentation implies one project owns runner placement, secrets, backups, and workloads by default
- cloud DR blueprints require public IP on every target VM by default
- GitHub Actions secrets and network access are expanded so CI can reach all private targets directly
- cloud DR starts by provisioning a runner because no standing runner exists
- failback is routed through cloud-specific access patterns instead of the on-prem execution plane
- bastion logic becomes the hidden default for cloud recovery
- provider-specific egress or access logic is added directly into the generic runner bootstrap module
- runner job dispatch is re-embedded into modules or thin one-off blueprints instead of staying in the HyOps CLI/orchestration layer
4. Implementation order
To keep the rollout clean:
- finish the GCP runner path first, because it unblocks the currently active GCP DR lane
- add the on-prem or Proxmox runner next, because it closes the failback execution path cleanly
- add Azure and AWS runners only when those DR lanes are actually being enabled
This ordering keeps HybridOps focused on one working default path at a time.
5. Practical product posture
The intended long-term posture is:
- runners are persistent and pre-provisioned
- CI dispatches to runners instead of replacing them
- bastion remains supported but secondary
- private-only targets remain the default
- decision service chooses target and mode, not transport hacks
That is the cleanest posture for SMEs, schools, and enterprise customers who want a credible growth path.