Deploy Edge Control Plane (HyOps Blueprint)
- Purpose: Deploy a governed edge control plane for WAN, observability, DNS, and decision control-loop services.
- Owner: Network/platform engineering
- Trigger: New environment bring-up, edge rebuild, DR/burst rehearsal
- Impact: Creates edge foundation and connects cloud/on-prem routing control-plane
- Severity: P2
- Pre-reqs: hyops init hetzner and hyops init gcp completed for target env, vault decrypt working, required secrets set.
- Rollback strategy: Destroy module states in reverse dependency order or rerun with corrected inputs and skip_if_state_ok behavior.
Context
Blueprint ref: networking/edge-control-plane@v1
Location: hybridops-core/blueprints/networking/edge-control-plane@v1/blueprint.yml
Important usage model:
- the shipped blueprint is a reusable scaffold
- copy it into runtime config with hyops blueprint init
- replace every CHANGE_ME_* value in the runtime copy before preflight or deploy
- run blueprint commands against the runtime copy with --file, not the shipped source ref
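A quick way to confirm that no CHANGE_ME_* value survives in the runtime copy is to scan it before preflight. The following is a minimal sketch, not part of HyOps: the function name find_placeholders and the assumption that placeholders look like CHANGE_ME_ followed by letters, digits, or underscores are illustrative.

```python
import re

# Assumption: placeholders are CHANGE_ME_ plus an identifier-like suffix.
PLACEHOLDER = re.compile(r"CHANGE_ME_[A-Za-z0-9_]*")


def find_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, placeholder) pairs for every CHANGE_ME_* token."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits
```

To use it, read the runtime copy (for example ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml) into a string, pass it to find_placeholders, and treat a non-empty result as a reason to stop before running preflight or deploy.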
Step flow:
- core/hetzner/vyos-image-seed
- org/hetzner/shared-private-network
- org/hetzner/vyos-edge-foundation
- org/hetzner/shared-control-host
- org/gcp/wan-hub-network
- org/gcp/wan-cloud-router
- org/gcp/wan-vpn-to-edge
- platform/network/vyos-edge-wan
- platform/network/edge-observability
- platform/network/dns-routing
- platform/network/decision-service
- platform/network/decision-dispatcher
- platform/network/decision-consumer
- platform/network/decision-executor
Rerun safety:
- org/gcp/wan-vpn-to-edge now reconciles on blueprint reruns so refreshed edge public IPs are pushed back into GCP HA VPN/Cloud Router state
- the shared control-plane operations layer now reruns on deploy:
  - platform/network/edge-observability
  - platform/network/dns-routing
  - platform/network/decision-service
  - platform/network/decision-dispatcher
  - platform/network/decision-consumer
  - platform/network/decision-executor
- those steps are treated as current-host convergence, not historical evidence, because host rebuilds or package drift on edge_control_host can invalidate a previously successful state record
- steps marked skip_if_state_ok: true now also refuse to skip when explicit step inputs changed since the last successful apply, so overlay edits such as refreshed ipsec_source_cidrs force a real rerun instead of silently trusting historical state
- the VyOS artifact bootstrap step is now pinned to the documented runner-based ISO build contract; if you keep build-vyos-qcow2.sh as the build path, the overlay must provide a pinned source_iso_url and allow_iso_build: true
- because that build now runs on the managed ops runner, the blueprint uses the runner-local installed path under /opt/hybridops/core/app/tools/build/vyos/ rather than assuming the controller source checkout exists on the target host
- the GCP WAN hub network bootstrap step now prefers project_state_ref=org/gcp/project-factory instead of a hardcoded project_id, so project recovery or replacement flows do not leave the blueprint pinned to stale tenant metadata
- platform/network/edge-observability now verifies container readiness, not only systemd/compose startup, so Grafana or Alertmanager restart loops fail the step instead of publishing a false green state
- shipped blueprints no longer carry literal HA VPN PSK placeholders; the GCP VPN step consumes the vaulted WAN_IPSEC_PSK env-backed contract instead
- the decision layer is split deliberately:
  - platform/network/decision-service emits decision records
  - platform/network/decision-dispatcher turns those records into approval-gated dispatch requests
  - platform/network/decision-consumer turns approved requests into execution-ready records
  - platform/network/decision-executor turns those records into dry-run execution attempts
  - later runner execution remains a separate concern
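The skip rules described above can be summarized as a small decision function. This is an illustrative sketch only: the state-record fields (status, inputs_fingerprint), the SHA-256 fingerprint, and the names should_skip and inputs_fingerprint are assumptions made for the example and do not describe how HyOps stores or compares state.

```python
import hashlib
import json

# Steps the runbook says always rerun on deploy (current-host convergence).
ALWAYS_RERUN = {
    "platform/network/edge-observability",
    "platform/network/dns-routing",
    "platform/network/decision-service",
    "platform/network/decision-dispatcher",
    "platform/network/decision-consumer",
    "platform/network/decision-executor",
}


def inputs_fingerprint(inputs: dict) -> str:
    """Stable hash of step inputs; key order does not affect the result."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def should_skip(module_ref, skip_if_state_ok, prior_state, inputs) -> bool:
    """Skip only when opted in, prior state is ok, and inputs are unchanged."""
    if not skip_if_state_ok or module_ref in ALWAYS_RERUN:
        return False
    if prior_state is None or prior_state.get("status") != "ok":
        return False
    # Hypothetical field: fingerprint recorded at the last successful apply.
    return prior_state.get("inputs_fingerprint") == inputs_fingerprint(inputs)
```

Under these assumptions, editing ipsec_source_cidrs in the overlay changes the fingerprint, so the step reruns even though its last recorded state was ok.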
Target-state role split:
- Hetzner shared private network: dedicated lifecycle owner
- routed edge default: VyOS
- shared control host: Linux
- GCP routing hub: Cloud Router + NCC
Hetzner ownership model:
- org/hetzner/shared-private-network owns the reusable 10.80.0.0/24 private network contract
- org/hetzner/vyos-edge-foundation consumes that network for the edge pair
- org/hetzner/shared-control-host consumes the same network for PowerDNS, decision service, and supporting control-plane services
- this split keeps edge destroy/rebuild separate from shared-control host lifecycle
- blueprint preflight now verifies live Hetzner server presence before skipping state-ok edge/control-host steps, so out-of-band server deletion turns the affected step back into a real deploy instead of being silently skipped
Current DNS model:
- platform/network/dns-routing remains the DNS cutover/update layer
- first-class internal DNS authority target is now provider: powerdns-api
- manual-command remains available as a fallback adapter
- the recommended internal DNS authority topology is:
  - PowerDNS primary in the shared Hetzner / WAN-edge control plane
  - PowerDNS secondary on-prem for local resolution resilience
Important boundary:
- this blueprint currently prepares the shared control plane around WAN edge, observability, and cutover logic
- the first executable DNS authority path now lives in:
  - networking/powerdns-shared-primary@v1
  - networking/powerdns-onprem-secondary@v1
- the first implementation intentionally colocates:
  - the writable primary on edge01
  - the read-only secondary on the on-prem runner host
- this keeps cost down while preserving the clean authority/cutover separation
- this aligns with the routed topology in the Network routing contract
- the deprecated platform/network/wan-edge path remains only for Linux-edge compatibility labs and is not part of this blueprint
Current Hetzner image model:
- the blueprint now uses core/hetzner/vyos-image-seed as the default state-first image path
- if a matching Hetzner snapshot already exists, the step skips seeding and publishes state
- if no matching snapshot exists, the step can seed one by using hcloud-upload-image
- the recommended Hetzner seed source is now a direct public raw.xz artifact URL
- if the operator supplies only a qcow2 URL, HyOps can still auto-wrap it for Hetzner image seeding, but only when the execution host has qemu-img and a publicly reachable base URL configured
- image_state_ref is authoritative for the downstream org/hetzner/vyos-edge-foundation step and must override stale saved image ids
- the foundation cloud-init now pins Hetzner's routed public host route and default route via 172.31.1.1 on eth0
- the foundation also uses Hetzner's routed private network model on eth1: private_ip/32 plus an explicit route to private_network_cidr via the standard private gateway
- the foundation now also performs one intentional first-boot reboot so the cloud-init-written VyOS config.boot becomes the active runtime configuration
- this means fresh edge bring-up has a short settle window before public SSH becomes usable; treat that as expected first-boot behavior, not image corruption
- core/hetzner/vyos-image-register remains available as a compatibility path for externally managed images
- if the Hetzner edge foundation is reapplied and new public IPs are assigned, rerun org/gcp/wan-vpn-to-edge before rerunning platform/network/vyos-edge-wan
- the baseline blueprint now keeps platform/network/vyos-edge-wan.advertise_prefixes empty; route origination is added later by the spoke/on-prem layer after those routes are actually present on the edge
Preconditions and safety checks
- Validate init readiness:
  hyops init status --env dev
- Ensure required runtime secrets exist:
  hyops secrets ensure --env dev WAN_IPSEC_PSK
  hyops secrets ensure --env dev WAN_EDGE_SSH_PRIVATE_KEY
  hyops secrets ensure --env dev EDGE_OBS_GRAFANA_ADMIN_PASSWORD
To retrieve the Grafana admin password for a live operator session:
hyops vault password >/dev/null
hyops secrets show --env dev EDGE_OBS_GRAFANA_ADMIN_PASSWORD --raw
Default blueprint behavior is env-backed for control-host to edge SSH. HyOps writes a
transient key file on the shared control host at runtime, so the reusable path does not
depend on a manually staged /home/opsadmin/.ssh/id_ed25519.
Baseline profile note:
- This blueprint keeps edge_observability in bootstrap mode (edge_obs_enable_receive=false, edge_obs_enable_store_gateway=false, edge_obs_enable_ruler=false), so object-store secret wiring is not required for first-pass E2E.
- Initialize the runtime blueprint overlay:
  hyops blueprint init --env dev \
    --ref networking/edge-control-plane@v1 \
    --force
- Edit the runtime copy and replace all CHANGE_ME_* values:
  $EDITOR ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml
  Minimum values to set before first deploy:
  - pinned VyOS artifact version and source ISO URL
  - Hetzner network zone, SSH key name, location, server type, and private addressing
  - GCP context id, region, subnet CIDRs, router name, and HA VPN gateway names
  - operator CIDR for shared control host SSH
  - observability probe URLs
  - decision-service runtime root and Thanos query URL
  - Cloudflare zone, hostname, worker, DNS target, origin URLs, and steering state ref
  - dispatcher target environment
- Validate blueprint definition and preflight gate against the runtime copy:
  hyops blueprint validate \
    --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml
  hyops blueprint preflight --env dev \
    --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml
Steps
- Execute blueprint
  hyops blueprint deploy --env dev \
    --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml \
    --execute
- Track run records while running
  - HyOps prints the active step and run-record directory.
  - Module logs are written under:
    ~/.hybridops/envs/<env>/logs/module/<module_ref_sanitized>/<run_id>/
- Verify final state
  hyops state show --env dev --module org/hetzner/vyos-edge-foundation
  hyops state show --env dev --module org/gcp/wan-vpn-to-edge
  hyops state show --env dev --module platform/network/vyos-edge-wan
  hyops state show --env dev --module platform/network/edge-observability
  hyops state show --env dev --module platform/network/decision-service
  hyops state show --env dev --module platform/network/decision-dispatcher
  hyops state show --env dev --module platform/network/decision-consumer
  hyops state show --env dev --module platform/network/decision-executor
Live control-plane verification
Use these checks when you need current edge truth for WAN, observability, and the decision loop:
hyops show module org/hetzner/vyos-edge-foundation --env dev
hyops show module org/gcp/wan-vpn-to-edge --env dev
hyops show module platform/network/vyos-edge-wan --env dev
hyops show module platform/network/edge-observability --env dev
hyops show module platform/network/decision-service --env dev
For live query-path validation:
curl -fsS -G 'https://thanos.hybridops.tech/api/v1/query' \
  --data-urlencode 'query=probe_success{job="edge_blackbox_http"}' \
  | jq '.data.result[] | {probe_target: .metric.probe_target, value: .value[1]}'
curl -fsS 'https://thanos.hybridops.tech/api/v1/query?query=hyops_decision_mode' \
| jq '.data.result'
Expected current signals:
- the fixed public edge pair and GCP HA VPN are both status=ok
- edge observability publishes the live Grafana and Thanos hosts
- decision service publishes mode, reason, and the last successful action
- Thanos shows both primary and burst probe targets responding
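For scripted checks, the probe_success query result can be reduced to a per-target pass/fail map. The sketch below assumes the endpoint returns the standard Prometheus HTTP API instant-query shape (status, data.result, each sample carrying metric and a [timestamp, "value"] pair) and that probe targets are labeled probe_target, as in the jq filter above. The function name probe_status is illustrative.

```python
import json


def probe_status(response_text: str) -> dict[str, bool]:
    """Map each probe_target label to True when its probe_success value is 1."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    status = {}
    for sample in payload["data"]["result"]:
        target = sample["metric"].get("probe_target", "<unknown>")
        # Instant-query samples carry [unix_timestamp, "string_value"].
        status[target] = float(sample["value"][1]) == 1.0
    return status
```

Feed it the raw body of the first curl call; any target mapped to False, or an expected target missing from the map, is worth investigating before trusting the decision loop.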
Keep manual demo-signal injection out of the public runbook. That belongs in controlled demo notes, not in the general operator procedure.
Lifecycle test pattern
Use this sequence to validate module destroy/reapply behavior for the operations phase:
# destroy in reverse dependency order
hyops destroy --env dev --module platform/network/decision-executor --inputs /tmp/hyops-decision-executor.dev.yml
hyops destroy --env dev --module platform/network/decision-consumer --inputs /tmp/hyops-decision-consumer.dev.yml
hyops destroy --env dev --module platform/network/decision-dispatcher --inputs /tmp/hyops-decision-dispatcher.dev.yml
hyops destroy --env dev --module platform/network/decision-service --inputs /tmp/hyops-decision-service.dev.yml
hyops destroy --env dev --module platform/network/dns-routing --inputs /tmp/hyops-dns-routing.dev.yml
hyops destroy --env dev --module platform/network/edge-observability --inputs /tmp/hyops-edge-observability.dev.yml
hyops destroy --env dev --module platform/network/vyos-edge-wan --inputs /tmp/hyops-vyos-edge-wan.dev.yml
# re-apply via runtime blueprint overlay
hyops blueprint deploy --env dev \
--file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml \
--execute
For destroy inputs, set explicit absent state where applicable:
- dns_state: absent for platform/network/dns-routing
- edge_obs_state: absent for platform/network/edge-observability
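The destroy sequence above is the blueprint step flow reversed and limited to the platform/network/ layer. As a hedged sketch, the same command list can be derived programmatically. The helper name destroy_commands is hypothetical, and the /tmp/hyops-<module>.<env>.yml inputs path simply mirrors the naming used in the commands above.

```python
# Step flow in deploy order, as listed in the Context section.
STEP_FLOW = [
    "core/hetzner/vyos-image-seed",
    "org/hetzner/shared-private-network",
    "org/hetzner/vyos-edge-foundation",
    "org/hetzner/shared-control-host",
    "org/gcp/wan-hub-network",
    "org/gcp/wan-cloud-router",
    "org/gcp/wan-vpn-to-edge",
    "platform/network/vyos-edge-wan",
    "platform/network/edge-observability",
    "platform/network/dns-routing",
    "platform/network/decision-service",
    "platform/network/decision-dispatcher",
    "platform/network/decision-consumer",
    "platform/network/decision-executor",
]


def destroy_commands(env: str, scope_prefix: str = "platform/network/") -> list[str]:
    """Build destroy commands in reverse dependency order for one layer."""
    commands = []
    for ref in reversed(STEP_FLOW):
        if not ref.startswith(scope_prefix):
            continue
        name = ref.rsplit("/", 1)[-1]
        commands.append(
            f"hyops destroy --env {env} --module {ref} "
            f"--inputs /tmp/hyops-{name}.{env}.yml"
        )
    return commands
```

Printing destroy_commands("dev") yields the seven commands shown above, starting with decision-executor and ending with vyos-edge-wan; the inputs files still need the explicit absent state where applicable.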
Advanced observability mode
To validate receive/store-gateway/ruler mode (with object-store), apply platform/network/edge-observability with explicit object-store config:
inventory_state_ref: org/hetzner/vyos-edge-foundation
inventory_vm_groups:
  edge:
    - edge01
edge_obs_enable_receive: true
edge_obs_enable_store_gateway: true
edge_obs_enable_ruler: true
edge_obs_hashring_endpoints:
  - 127.0.0.1:10907
edge_obs_objstore_config: |
  type: FILESYSTEM
  config:
    directory: /opt/hybridops/edge-observability/data/objstore
edge_obs_grafana_admin_password: "ChangeMe-Observability-Strong1!"
hyops preflight --env dev --strict --module platform/network/edge-observability --inputs /tmp/hyops-edge-observability.advanced.yml
hyops apply --env dev --module platform/network/edge-observability --inputs /tmp/hyops-edge-observability.advanced.yml
Verification
Success indicators:
- Blueprint summary ends with status=ok.
- All required steps report status=ok or skipped with valid prior state.
- platform/network/decision-service state is ok and policy/action inputs are rendered.
- platform/network/decision-dispatcher state is ok and reports record-only execution mode.
- platform/network/decision-consumer state is ok and reports approval-only execution mode.
- platform/network/decision-executor state is ok and reports dry-run execution mode.
- platform/network/edge-observability keeps Grafana, Alertmanager, and Thanos Query healthy on the shared control host; container restart loops are treated as a step failure.
- a synthetic control-loop validation produces an awaiting-approval dispatch request, then an approved-ready execution record, then a dry-run-ready execution-attempt record without executing any workflow.
- if DNS automation is enabled, platform/network/dns-routing state records the desired target and provider (manual-command or powerdns-api)
- for the baseline WAN underlay, GCP Cloud Router BGP peers are Established; learned spoke routes may remain 0 until an on-prem/spoke route-originating layer is deployed
- after route-export changes, one HA VPN/BGP leg may reconverge more slowly than the other; allow the full HyOps convergence window before treating a single-leg Connect state as a real failure
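The synthetic control-loop indicator above implies an ordering: awaiting-approval, then approved-ready, then dry-run-ready. A minimal sketch of checking that ordering over a list of observed record statuses follows. How those statuses are collected from the decision modules is not covered here, and the function name control_loop_complete is hypothetical.

```python
# Status labels in the order the verification list expects them.
EXPECTED_SEQUENCE = ["awaiting-approval", "approved-ready", "dry-run-ready"]


def control_loop_complete(observed: list[str]) -> bool:
    """True when the expected statuses appear, in order, within observed."""
    remaining = iter(observed)
    # `status in remaining` consumes the iterator up to the first match,
    # so each later status must occur after the one before it.
    return all(status in remaining for status in EXPECTED_SEQUENCE)
```

Other statuses may be interleaved without affecting the result; only the relative order of the three expected labels matters in this sketch.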
Troubleshooting
- If platform/network/vyos-edge-wan fails with both tunnels stuck at NO_INCOMING_PACKETS and the VyOS edges show BGP neighbors in Connect, check the Hetzner edge firewall allowlist first.
- org/hetzner/vyos-edge-foundation must allow the current GCP HA VPN public peer IPs in ipsec_source_cidrs. If those IPs changed after a GCP VPN gateway recreate, update the edge foundation inputs and rerun the blueprint.
- Hetzner token invalid: rerun init and replace token.
  hyops init hetzner --env dev --force
- inputs.ssh_public_key contains placeholder: set a real public key before first Hetzner foundation apply.
- missing required env var: WAN_IPSEC_PSK: set secret with hyops secrets ensure and rerun.
- If you enable receive/store-gateway/ruler in edge_observability, also provide object-store config and include EDGE_OBS_OBJSTORE_CONFIG in required env.
- For step-by-step debugging, run affected modules directly with hyops preflight/apply --module ... and inspect run-record path shown by CLI.
- For internal DNS cutover, prefer the PowerDNS API path and the module example:
  $HYOPS_CORE_ROOT/modules/platform/network/dns-routing/examples/inputs.powerdns.yml
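For the firewall allowlist check in the first two items, it can help to test the current GCP HA VPN peer IPs against ipsec_source_cidrs before rerunning anything. This is a small sketch using Python's standard ipaddress module; the function name uncovered_peers is illustrative, and the peer IPs and CIDRs must be collected by the operator from GCP and the edge foundation inputs.

```python
import ipaddress


def uncovered_peers(peer_ips: list[str], ipsec_source_cidrs: list[str]) -> list[str]:
    """Return the peer IPs that no entry in ipsec_source_cidrs contains."""
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in ipsec_source_cidrs]
    missing = []
    for ip in peer_ips:
        address = ipaddress.ip_address(ip)
        # Membership is False when address and network IP versions differ.
        if not any(address in network for network in networks):
            missing.append(ip)
    return missing
```

A non-empty result points to the stale-allowlist case described above: update ipsec_source_cidrs in the edge foundation inputs and rerun the blueprint.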