Skip to content

Deploy Edge Control Plane (HyOps Blueprint)

  • Purpose: Deploy a governed edge control plane for WAN, observability, DNS, and decision control-loop services. Owner: Network/platform engineering

  • Trigger: New environment bring-up, edge rebuild, DR/burst rehearsal

  • Impact: Creates edge foundation and connects cloud/on-prem routing control-plane
  • Severity: P2 Pre-reqs: hyops init hetzner and hyops init gcp completed for target env, vault decrypt working, required secrets set.

  • Rollback strategy: Destroy module states in reverse dependency order or rerun with corrected inputs and skip_if_state_ok behavior.

Context

Blueprint ref: networking/edge-control-plane@v1 Location: hybridops-core/blueprints/networking/edge-control-plane@v1/blueprint.yml

Important usage model:

  • the shipped blueprint is a reusable scaffold
  • copy it into runtime config with hyops blueprint init
  • replace every CHANGE_ME_* value in the runtime copy before preflight or deploy
  • run blueprint commands against the runtime copy with --file, not the shipped source ref

Step flow:

  1. core/hetzner/vyos-image-seed
  2. org/hetzner/shared-private-network
  3. org/hetzner/vyos-edge-foundation
  4. org/hetzner/shared-control-host
  5. org/gcp/wan-hub-network
  6. org/gcp/wan-cloud-router
  7. org/gcp/wan-vpn-to-edge
  8. platform/network/vyos-edge-wan
  9. platform/network/edge-observability
  10. platform/network/dns-routing
  11. platform/network/decision-service
  12. platform/network/decision-dispatcher
  13. platform/network/decision-consumer
  14. platform/network/decision-executor

Rerun safety:

  • org/gcp/wan-vpn-to-edge now reconciles on blueprint reruns so refreshed edge public IPs are pushed back into GCP HA VPN/Cloud Router state
  • the shared control-plane operations layer now reruns on deploy:
  • platform/network/edge-observability
  • platform/network/dns-routing
  • platform/network/decision-service
  • platform/network/decision-dispatcher
  • platform/network/decision-consumer
  • platform/network/decision-executor
  • those steps are treated as current-host convergence, not historical evidence, because host rebuilds or package drift on edge_control_host can invalidate a previously successful state record
  • steps marked skip_if_state_ok: true now also refuse to skip when explicit step inputs changed since the last successful apply, so overlay edits such as refreshed ipsec_source_cidrs force a real rerun instead of silently trusting historical state
  • the VyOS artifact bootstrap step is now pinned to the documented runner-based ISO build contract; if you keep build-vyos-qcow2.sh as the build path, the overlay must provide a pinned source_iso_url and allow_iso_build: true
  • because that build now runs on the managed ops runner, the blueprint uses the runner-local installed path under /opt/hybridops/core/app/tools/build/vyos/ rather than assuming the controller source checkout exists on the target host
  • the GCP WAN hub network bootstrap step now prefers project_state_ref=org/gcp/project-factory instead of a hardcoded project_id, so project recovery or replacement flows do not leave the blueprint pinned to stale tenant metadata
  • platform/network/edge-observability now verifies container readiness, not only systemd/compose startup, so Grafana or Alertmanager restart loops fail the step instead of publishing a false green state
  • shipped blueprints no longer carry literal HA VPN PSK placeholders; the GCP VPN step consumes the vaulted WAN_IPSEC_PSK env-backed contract instead
  • the decision layer is split deliberately:
  • platform/network/decision-service emits decision records
  • platform/network/decision-dispatcher turns those records into approval-gated dispatch requests
  • platform/network/decision-consumer turns approved requests into execution-ready records
  • platform/network/decision-executor turns those records into dry-run execution attempts
  • later runner execution remains a separate concern

Target-state role split:

  • Hetzner shared private network: dedicated lifecycle owner
  • routed edge default: VyOS
  • shared control host: Linux
  • GCP routing hub: Cloud Router + NCC

Hetzner ownership model:

  • org/hetzner/shared-private-network owns the reusable 10.80.0.0/24 private network contract
  • org/hetzner/vyos-edge-foundation consumes that network for the edge pair
  • org/hetzner/shared-control-host consumes the same network for PowerDNS, decision service, and supporting control-plane services
  • this split keeps edge destroy/rebuild separate from shared-control host lifecycle
  • blueprint preflight now verifies live Hetzner server presence before skipping state-ok edge/control-host steps, so out-of-band server deletion turns the affected step back into a real deploy instead of being silently skipped

Current DNS model:

  • platform/network/dns-routing remains the DNS cutover/update layer
  • first-class internal DNS authority target is now provider: powerdns-api
  • manual-command remains available as a fallback adapter
  • the recommended internal DNS authority topology is:
  • PowerDNS primary in the shared Hetzner / WAN-edge control plane
  • PowerDNS secondary on-prem for local resolution resilience

Important boundary:

  • this blueprint currently prepares the shared control plane around WAN edge, observability, and cutover logic
  • the first executable DNS authority path now lives in:
  • networking/powerdns-shared-primary@v1
  • networking/powerdns-onprem-secondary@v1
  • the first implementation intentionally colocates:
  • the writable primary on edge01
  • the read-only secondary on the on-prem runner host
  • this keeps cost down while preserving the clean authority/cutover separation
  • this aligns with the routed topology in the Network routing contract
  • the deprecated platform/network/wan-edge path remains only for Linux-edge compatibility labs and is not part of this blueprint

Current Hetzner image model:

  • the blueprint now uses core/hetzner/vyos-image-seed as the default state-first image path
  • if a matching Hetzner snapshot already exists, the step skips seeding and publishes state
  • if no matching snapshot exists, the step can seed one by using hcloud-upload-image
  • the recommended Hetzner seed source is now a direct public raw.xz artifact URL
  • if the operator supplies only a qcow2 URL, HyOps can still auto-wrap it for Hetzner image seeding, but only when the execution host has qemu-img and a publicly reachable base URL configured
  • image_state_ref is authoritative for the downstream org/hetzner/vyos-edge-foundation step and must override stale saved image ids
  • the foundation cloud-init now pins Hetzner's routed public host route and default route via 172.31.1.1 on eth0
  • the foundation also uses Hetzner's routed private network model on eth1: private_ip/32 plus an explicit route to private_network_cidr via the standard private gateway
  • the foundation now also performs one intentional first-boot reboot so the cloud-init-written VyOS config.boot becomes the active runtime configuration
  • this means fresh edge bring-up has a short settle window before public SSH becomes usable; treat that as expected first-boot behavior, not image corruption
  • core/hetzner/vyos-image-register remains available as a compatibility path for externally managed images
  • if the Hetzner edge foundation is reapplied and new public IPs are assigned, rerun org/gcp/wan-vpn-to-edge before rerunning platform/network/vyos-edge-wan
  • the baseline blueprint now keeps platform/network/vyos-edge-wan.advertise_prefixes empty; route origination is added later by the spoke/on-prem layer after those routes are actually present on the edge

Preconditions and safety checks

  1. Validate init readiness:

    hyops init status --env dev
    
  2. Ensure required runtime secrets exist:

    hyops secrets ensure --env dev WAN_IPSEC_PSK
    hyops secrets ensure --env dev WAN_EDGE_SSH_PRIVATE_KEY
    hyops secrets ensure --env dev EDGE_OBS_GRAFANA_ADMIN_PASSWORD
    

To retrieve the Grafana admin password for a live operator session:

hyops vault password >/dev/null
hyops secrets show --env dev EDGE_OBS_GRAFANA_ADMIN_PASSWORD --raw

Default blueprint behavior is env-backed for control-host to edge SSH. HyOps writes a transient key file on the shared control host at runtime, so the reusable path does not depend on a manually staged /home/opsadmin/.ssh/id_ed25519.

Baseline profile note: - This blueprint keeps edge_observability in bootstrap mode (edge_obs_enable_receive=false, edge_obs_enable_store_gateway=false, edge_obs_enable_ruler=false), so object-store secret wiring is not required for first-pass E2E.

  1. Initialize the runtime blueprint overlay:

    hyops blueprint init --env dev \
      --ref networking/edge-control-plane@v1 \
      --force
    
  2. Edit the runtime copy and replace all CHANGE_ME_* values:

    $EDITOR ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml
    

Minimum values to set before first deploy:

  • pinned VyOS artifact version and source ISO URL
  • Hetzner network zone, SSH key name, location, server type, and private addressing
  • GCP context id, region, subnet CIDRs, router name, and HA VPN gateway names
  • operator CIDR for shared control host SSH
  • observability probe URLs
  • decision-service runtime root and Thanos query URL
  • Cloudflare zone, hostname, worker, DNS target, origin URLs, and steering state ref
  • dispatcher target environment

  • Validate blueprint definition and preflight gate against the runtime copy:

    hyops blueprint validate \
      --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml
    
    hyops blueprint preflight --env dev \
      --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml
    

Steps

  1. Execute blueprint

    hyops blueprint deploy --env dev \
      --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml \
      --execute
    
  2. Track run records while running

  3. HyOps prints the active step and run-record directory.

  4. Module logs are written under:
  5. ~/.hybridops/envs/<env>/logs/module/<module_ref_sanitized>/<run_id>/

  6. Verify final state

    hyops state show --env dev --module org/hetzner/vyos-edge-foundation
    hyops state show --env dev --module org/gcp/wan-vpn-to-edge
    hyops state show --env dev --module platform/network/vyos-edge-wan
    hyops state show --env dev --module platform/network/edge-observability
    hyops state show --env dev --module platform/network/decision-service
    hyops state show --env dev --module platform/network/decision-dispatcher
    hyops state show --env dev --module platform/network/decision-consumer
    hyops state show --env dev --module platform/network/decision-executor
    

Live control-plane verification

Use these checks when you need current edge truth for WAN, observability, and the decision loop:

hyops show module org/hetzner/vyos-edge-foundation --env dev
hyops show module org/gcp/wan-vpn-to-edge --env dev
hyops show module platform/network/vyos-edge-wan --env dev
hyops show module platform/network/edge-observability --env dev
hyops show module platform/network/decision-service --env dev

For live query-path validation:

curl -fsS 'https://thanos.hybridops.tech/api/v1/query?query=probe_success{job="edge_blackbox_http"}' \
  | jq '.data.result[] | {probe_target: .metric.probe_target, value: .value[1]}'
curl -fsS 'https://thanos.hybridops.tech/api/v1/query?query=hyops_decision_mode' \
  | jq '.data.result'

Expected current signals:

  • the fixed public edge pair and GCP HA VPN are both status=ok
  • edge observability publishes the live Grafana and Thanos hosts
  • decision service publishes mode, reason, and the last successful action
  • Thanos shows both primary and burst probe targets responding

Keep manual demo-signal injection out of the public runbook. That belongs in controlled demo notes, not in the general operator procedure.

Lifecycle test pattern

Use this sequence to validate module destroy/reapply behavior for operations phase:

# destroy in reverse dependency order
hyops destroy --env dev --module platform/network/decision-executor --inputs /tmp/hyops-decision-executor.dev.yml
hyops destroy --env dev --module platform/network/decision-consumer --inputs /tmp/hyops-decision-consumer.dev.yml
hyops destroy --env dev --module platform/network/decision-dispatcher --inputs /tmp/hyops-decision-dispatcher.dev.yml
hyops destroy --env dev --module platform/network/decision-service --inputs /tmp/hyops-decision-service.dev.yml
hyops destroy --env dev --module platform/network/dns-routing --inputs /tmp/hyops-dns-routing.dev.yml
hyops destroy --env dev --module platform/network/edge-observability --inputs /tmp/hyops-edge-observability.dev.yml
hyops destroy --env dev --module platform/network/vyos-edge-wan --inputs /tmp/hyops-vyos-edge-wan.dev.yml

# re-apply via runtime blueprint overlay
hyops blueprint deploy --env dev \
  --file ~/.hybridops/envs/dev/config/blueprints/edge-control-plane.yml \
  --execute

For destroy inputs, set explicit absent state where applicable:

  • dns_state: absent for platform/network/dns-routing
  • edge_obs_state: absent for platform/network/edge-observability

Advanced observability mode

To validate receive/store-gateway/ruler mode (with object-store), apply platform/network/edge-observability with explicit object-store config:

inventory_state_ref: org/hetzner/vyos-edge-foundation
inventory_vm_groups:
  edge:
    - edge01
edge_obs_enable_receive: true
edge_obs_enable_store_gateway: true
edge_obs_enable_ruler: true
edge_obs_hashring_endpoints:
  - 127.0.0.1:10907
edge_obs_objstore_config: |
  type: FILESYSTEM
  config:
    directory: /opt/hybridops/edge-observability/data/objstore
edge_obs_grafana_admin_password: "ChangeMe-Observability-Strong1!"
hyops preflight --env dev --strict --module platform/network/edge-observability --inputs /tmp/hyops-edge-observability.advanced.yml
hyops apply --env dev --module platform/network/edge-observability --inputs /tmp/hyops-edge-observability.advanced.yml

Verification

Success indicators:

  • Blueprint summary ends with status=ok.
  • All required steps report status=ok or skipped with valid prior state.
  • platform/network/decision-service state is ok and policy/action inputs are rendered.
  • platform/network/decision-dispatcher state is ok and reports record-only execution mode.
  • platform/network/decision-consumer state is ok and reports approval-only execution mode.
  • platform/network/decision-executor state is ok and reports dry-run execution mode.
  • platform/network/edge-observability keeps Grafana, Alertmanager, and Thanos Query healthy on the shared control host; container restart loops are treated as a step failure.
  • a synthetic control-loop validation produces an awaiting-approval dispatch request, then an approved-ready execution record, then a dry-run-ready execution-attempt record without executing any workflow.
  • if DNS automation is enabled, platform/network/dns-routing state records the desired target and provider (manual-command or powerdns-api)
  • for the baseline WAN underlay, GCP Cloud Router BGP peers are Established; learned spoke routes may remain 0 until an on-prem/spoke route-originating layer is deployed
  • after route-export changes, one HA VPN/BGP leg may reconverge more slowly than the other; allow the full HyOps convergence window before treating a single-leg Connect state as a real failure

Troubleshooting

  • If platform/network/vyos-edge-wan fails with both tunnels stuck at NO_INCOMING_PACKETS and the VyOS edges show BGP neighbors in Connect, check the Hetzner edge firewall allowlist first.
  • org/hetzner/vyos-edge-foundation must allow the current GCP HA VPN public peer IPs in ipsec_source_cidrs. If those IPs changed after a GCP VPN gateway recreate, update the edge foundation inputs and rerun the blueprint.

  • Hetzner token invalid: rerun init and replace token.

    hyops init hetzner --env dev --force
    
  • inputs.ssh_public_key contains placeholder: set a real public key before first Hetzner foundation apply.

  • missing required env var: WAN_IPSEC_PSK: set secret with hyops secrets ensure and rerun.
  • If you enable receive/store-gateway/ruler in edge_observability, also provide object-store config and include EDGE_OBS_OBJSTORE_CONFIG in required env.
  • For step-by-step debugging, run affected modules directly with hyops preflight/apply --module ... and inspect run-record path shown by CLI.
  • For internal DNS cutover, prefer the PowerDNS API path and the module example:
  • $HYOPS_CORE_ROOT/modules/platform/network/dns-routing/examples/inputs.powerdns.yml

References