
Failover PostgreSQL HA to GCP (HyOps Blueprint)

  • Purpose: Restore a new PostgreSQL HA primary in GCP from pgBackRest backups (GCS or S3).

  • Owner: Platform engineering / SRE

  • Trigger: DR event (on-prem DB unavailable) or scheduled failover drill

  • Impact: A new PostgreSQL primary becomes available in GCP; applications must be cut over explicitly (DNS / config / traffic).
  • Severity: P1

  • Pre-reqs: GCP init is ready for the target env, pgBackRest backups exist and are recent, the shared PowerDNS primary is already provisioned, required secrets are present in the runtime vault, and operators have fenced the on-prem primary to prevent split-brain.

  • Rollback strategy: If failover is aborted, destroy the DR VMs and do not cut traffic; if failover is completed, use the dedicated failback runbook to return service to on-prem.

Context

Blueprint ref: dr/postgresql-ha-failover-gcp@v1
Location: hybridops-core/blueprints/dr/postgresql-ha-failover-gcp@v1/blueprint.yml

Default step flow:

  1. org/gcp/wan-cloud-nat (provision private egress for the workloads subnet)
  2. platform/gcp/platform-vm (provision DR nodes)
  3. platform/postgresql-ha with apply_mode=restore (restore cluster from pgBackRest repo)
  4. platform/postgresql-ha-backup (re-enable backups on the DR primary)
  5. platform/network/dns-routing (optional internal DNS cutover when endpoint_dns_name is set)
  6. platform/network/dns-routing (apply_mode=status, live PowerDNS verification)

State isolation:

  • platform/postgresql-ha#postgresql_restore_gcp_dr
  • platform/postgresql-ha-backup#postgresql_backup_config_gcp_dr
  • platform/network/dns-routing#postgresql_dns_cutover_gcp_dr
  • platform/network/dns-routing#postgresql_dns_status_ha_failover_gcp

This separation is intentional. The DR failover lane must not overwrite the primary on-prem PostgreSQL HA state or a generic DNS-routing state during drills or repeat executions.

The DR VM step now carries an explicit post-apply SSH readiness gate before the restore step is allowed to start. This is intentional: a successful cloud VM create is not enough for a tarball-safe DR flow if the next step immediately depends on SSH into those fresh nodes.
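When debugging a stuck gate, the behavior can be approximated manually. The sketch below is an assumed equivalent of the built-in gate, not the actual HyOps implementation; the host, user, and key in the usage comment are placeholders:

```shell
# Minimal readiness-gate sketch: poll a check command until it succeeds
# or the attempt budget is exhausted.
wait_for_ready() {
  attempts=$1; shift   # max polls
  delay=$1; shift      # seconds between polls
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" 2>/dev/null; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempt(s)" >&2
  return 1
}

# Example gate on a fresh DR node (placeholder address):
# wait_for_ready 30 10 ssh -o BatchMode=yes -o ConnectTimeout=5 admin@10.0.0.5 true
```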

The restore and backup-configuration steps now declare:

  • execution_plane: runner-local

Meaning:

  • this blueprint is intended to run from a shared runner with private reachability to the GCP target subnet
  • it is not designed to assume workstation-direct access to the private DR VM IPs

The shipped DR blueprint also sets restore_delta: true. That is intentional for this lane: these are destructive restore targets created for failover, and package/bootstrap tasks may leave PGDATA non-empty before pgBackRest runs. The module default stays conservative, but the DR blueprint opts into delta restore under the explicit restore_confirm=true guard.

When the pgBackRest repository contains divergent timelines from earlier drills or promotions, set the restore lane explicitly:

  • restore_set
  • restore_target_timeline
  • optionally restore_target_time for PITR

If the lane is already using backup_state_ref, leave restore_target_timeline blank unless you have inspected the repository and are deliberately pinning a specific lineage.
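To inspect the repository before pinning a lineage, the pgBackRest CLI itself can be used from a node with repo access. The stanza name is a placeholder, and the option mapping shown is how these lane inputs line up with standard pgBackRest restore options:

```shell
# List backup sets, their timelines, and WAL ranges for the stanza:
pgbackrest info --stanza=<stanza>

# The restore-lane inputs map onto standard pgBackRest restore options, roughly:
#   restore_set             -> --set=<backup-label>
#   restore_target_timeline -> --target-timeline=<tli>
#   restore_target_time     -> --type=time --target="<timestamp>"
```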

The DNS step is intentionally endpoint-contract driven:

  • it consumes endpoint_dns_name and endpoint_target from the restored PostgreSQL HA state
  • it remains a no-op while endpoint_dns_name is blank
  • when endpoint_dns_name is set, it updates the internal DNS authority via platform/network/dns-routing

For the A-record target, the shipped blueprint now uses endpoint_host, not endpoint_target. This is deliberate:

  • endpoint_target is the client-facing endpoint contract and may itself be the stable DNS name
  • endpoint_host is the concrete active data-plane address that PowerDNS should publish for the record
  • for PostgreSQL HA this is the VIP when one is configured; otherwise it is the resolved leader address rather than a transport-specific instance name

Internal DNS authority should be treated as a separate shared control-plane service:

  • NetBox remains IPAM / inventory source of truth
  • PowerDNS is the internal authoritative DNS engine
  • platform/network/dns-routing is the cutover automation layer that updates PowerDNS records

For HybridOps, the clean first posture is to self-host PowerDNS in the shared control plane rather than bundling DNS authority into the NetBox bootstrap path.

The DNS cutover step now prefers:

  • powerdns_state_ref: platform/network/powerdns-authority#shared_primary

That keeps the shipped blueprint state-driven and avoids hardcoding the PowerDNS API endpoint by default.

For runner-safe execution, the DNS step also uses:

  • ssh_private_key_env: WAN_EDGE_SSH_PRIVATE_KEY

This keeps the DR lane tarball-safe. The runner does not need a baked private key file on disk; HybridOps materializes a transient key file from the synced env value and derives the surviving PowerDNS control host from powerdns_state_ref.
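The transient-key pattern can be sketched as follows. This is an assumed approximation of what HyOps does, useful for manual debugging only; it expects WAN_EDGE_SSH_PRIVATE_KEY to already be exported in the shell:

```shell
# Materialize a private key from an env var into a 0600 temp file; the caller
# is responsible for removing it (e.g. via trap) so no key survives on disk.
materialize_key() {
  keyfile=$(mktemp) || return 1
  chmod 600 "$keyfile"
  printf '%s\n' "$WAN_EDGE_SSH_PRIVATE_KEY" > "$keyfile"
  echo "$keyfile"
}

# Usage sketch (placeholder host):
# keyfile=$(materialize_key)
# trap 'rm -f "$keyfile"' EXIT
# ssh -i "$keyfile" admin@<powerdns-control-host> ...
```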

GCP project roles

For this blueprint, keep the GCP project split explicit:

  • host/network project
      • runner placement
      • DR VM placement
      • Shared VPC
      • Cloud Router / NAT for both the runner and DR workloads subnets
  • control project
      • env-scoped object repository state such as org/gcp/object-repo
      • env-scoped GCP Secret Manager secrets
  • workload project
      • optional future role if env-specific workload compute is separated from the host/network project

It is valid for the DR VMs to be restored into the host/network project while backup bucket state and runner-synced secrets are sourced from the env control project.

DR operating model (important)

This blueprint is the restore-path DR implementation for a GCP target.

  • It is compatible with both:
      • backup-restore DR (Mode A), and
      • warm-standby promotion programs (Mode B), where this runbook is still used for rebuild/reseed.
  • It is not a dual-readonly pattern by itself.

For the platform-wide strategy (including Azure/GCP selection and secondary backup copy policy), see:

  • PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)

Preconditions and safety checks

  1. Fence the on-prem primary (split-brain guard)

This blueprint does not stop the on-prem cluster for you.

Before failover you must ensure the on-prem primary cannot accept writes. Typical fencing options:

  • Power off the on-prem DB VMs (or the on-prem site).
  • Block egress/ingress to the on-prem DB VIP at the network layer.
  • Disable write traffic at the app tier (maintenance mode) and verify no clients can reach the old VIP.

If you cannot prove the on-prem primary is fenced, do not proceed.
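A quick probe for the "cannot prove it is fenced" case. This sketch uses bash's /dev/tcp so it needs no extra tools; the VIP and port in the usage comment are placeholders. A reachable result means fencing has failed and failover must stop:

```shell
# Returns 0 (fenced) only when the old endpoint does NOT accept TCP connections.
probe_fenced() {  # usage: probe_fenced <host> <port>
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    return 1   # still reachable: fencing has failed
  fi
  return 0
}

# Example (placeholder on-prem DB VIP):
# probe_fenced 10.10.0.50 5432 && echo "fenced" || echo "STILL REACHABLE - abort"
```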

  2. Ensure object repository state exists (GCS/S3)

This blueprint consumes repository settings from module state via repo_state_ref (default in the blueprint: org/gcp/object-repo#pgbackrest_primary). Ensure that the referenced object repository module is already ready and that backups/WAL are present.

Recommended role split:

  • org/gcp/object-repo#pgbackrest_primary lives in the env control project
  • platform/gcp/platform-vm#gcp_pg_vms places DR compute in the host/network project

Optional (infra provisioning):

# GCS object repo (bucket + service account plumbing; no SA keys in state)
hyops apply --env dev \
  --module org/gcp/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/gcp/object-repo/examples/inputs.min.yml"

# S3 object repo (bucket + IAM user; no access keys in state)
hyops apply --env dev \
  --module org/aws/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/aws/object-repo/examples/inputs.min.yml"

# Azure object repo (storage account + container; no account keys in state)
hyops apply --env dev \
  --module org/azure/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/azure/object-repo/examples/inputs.min.yml"

  3. Ensure pgBackRest repository credentials exist in the runtime vault

For GCS repositories, HyOps expects the service account JSON in an env key (default: PG_BACKUP_GCS_SA_JSON).

Example (store SA JSON into vault from the current shell):

export PG_BACKUP_GCS_SA_JSON="$(cat /path/to/gcs-sa.json)"
hyops secrets set --env dev --from-env PG_BACKUP_GCS_SA_JSON

For S3 repositories, store the access/secret keys into vault (defaults):

export PG_BACKUP_S3_ACCESS_KEY_ID="..."
export PG_BACKUP_S3_SECRET_ACCESS_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_S3_ACCESS_KEY_ID PG_BACKUP_S3_SECRET_ACCESS_KEY

For Azure repositories, store the storage account key into vault (default key name):

export PG_BACKUP_AZURE_ACCOUNT_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_AZURE_ACCOUNT_KEY

Optional (when secondary_enabled=true in backup step): store secondary backend credentials as well, for example:

# Secondary Azure repo2 key example
export PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY

  4. Ensure Patroni and application DB secrets exist in the runtime vault
    hyops secrets ensure --env dev \
      PATRONI_SUPERUSER_PASSWORD \
      PATRONI_REPLICATION_PASSWORD \
      NETBOX_DB_PASSWORD
    

If you want the blueprint to update internal DNS during cutover, also ensure:

hyops secrets ensure --env dev POWERDNS_API_KEY
hyops secrets ensure --env dev WAN_EDGE_SSH_PRIVATE_KEY

  5. If you already have an env-local blueprint file, validate and preflight it
    hyops blueprint validate --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"
    hyops blueprint preflight --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"
    

hyops blueprint preflight now runs step-level module driver preflight when upstream step state is already present. That means it fails early on missing required secret env vars or a locked runtime vault, before deploy --execute starts the restore step. If the current shell has not unlocked GPG for the runtime vault, unlock it first:

hyops vault password >/dev/null

  6. Ensure the runner can reach DR VMs over SSH (network model)

Best practice (enterprise): run HyOps from a runner with private reachability to the target VPC. The recommended bootstrap path is networking/gcp-ops-runner@v1, which provisions a private runner VM in the hub core subnet for runner-local execution.

Important:

  • ssh_proxy_jump_auto only auto-infers a Proxmox bastion for on-prem inventories.
  • For this GCP DR blueprint, HyOps will not auto-route private GCP VM addresses through the Proxmox host.
  • If your runner cannot route to the GCP subnet directly, you must either:
      • provide an explicit cloud-reachable bastion via ssh_proxy_jump_host, or
      • set assign_public_ip: true for the DR VMs and restrict SSH ingress for the drill.

Community/drill fast-path: set assign_public_ip: true for DR VMs in your blueprint file and apply a restricted SSH firewall rule.

Example:

SRC_CIDR="203.0.113.10/32"  # replace with your operator source
gcloud compute firewall-rules create hyops-dr-allow-ssh-postgres \
  --network <your-vpc-network> \
  --allow tcp:22 \
  --source-ranges "$SRC_CIDR" \
  --target-tags postgres
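After the drill, remove the temporary rule so SSH to the DR subnet is not left open:

```shell
# Delete the drill-scoped firewall rule created above.
gcloud compute firewall-rules delete hyops-dr-allow-ssh-postgres --quiet
```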

Steps

  1. Prepare a blueprint file with real values

The built-in blueprint contains placeholders and now fails fast until they are replaced. Required at minimum for the GCP VM step:

  • context_id
  • zone

Expected upstream state in the selected env:

  • org/gcp/project-factory
  • org/gcp/wan-hub-network

Optional overrides if you do not want the default state-driven path:

  • project_id
  • network
  • subnetwork

The shipped GCP VM step uses ssh_keys_from_init: true. If you override that path with explicit ssh_keys, also set ssh_keys_from_init: false.

Use the blueprint --file option so you can keep a local, environment-specific copy:

hyops blueprint init --env dev \
  --ref dr/postgresql-ha-failover-gcp@v1 \
  --dest-name dr-postgresql-ha-failover-gcp.yml

# edit ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml
# replace every CHANGE_ME_* value before preflight or deploy
# (set any environment-specific overrides such as context_id, zone, or explicit project/network values)
# by default the DR VM step uses ssh_keys_from_init=true and consumes the key from ~/.hybridops/envs/<env>/meta/gcp.ready.json
# default repository contract is repo_state_ref=org/gcp/object-repo

hyops blueprint validate --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"
hyops blueprint preflight --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"

Note:

  • If platform/gcp/platform-vm#gcp_pg_vms is already ok, the blueprint now skips re-provisioning that step (skip_if_state_ok=true).
  • Restore still requires explicit confirmation (restore_confirm: true) and will remain blocked in preflight until set.
  • Backup step supports optional secondary repository copy (secondary_enabled=true, secondary_repo_state_ref or secondary_backend + secondary_* fields).

  2. Execute the failover restore

This is guarded. The PostgreSQL HA module will refuse to run restore unless:

  • inputs.apply_mode=restore
  • inputs.restore_confirm=true

    hyops blueprint deploy --env dev \
      --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" \
      --execute
    
    # Optional override at runtime:
    # HYOPS_INPUT_repo_state_ref=org/aws/object-repo hyops blueprint deploy --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" --execute
    # HYOPS_INPUT_repo_state_ref=org/azure/object-repo hyops blueprint deploy --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" --execute
    
  3. Observe progress and capture run records

  • Each step prints a run record path.

  • Logs are written under:
      • $HOME/.hybridops/envs/<env>/logs/module/...

Verification

  1. Confirm module state and outputs
    cat "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_dr.json"
    

Expected:

  • status: ok
  • outputs.cap.db.postgresql_ha: ready
  • Prefer the explicit endpoint contract from module state:
      • endpoint_target
      • endpoint_target_type
      • endpoint_port
  • db_host remains the active host/VIP for diagnostics and backward compatibility.
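A quick way to pull those fields without reading the whole JSON by eye (assumes jq is installed; the key paths shown are illustrative and may differ from the actual state schema):

```shell
# Print the endpoint contract fields from an instance state file.
show_endpoint_contract() {  # usage: show_endpoint_contract <state.json>
  jq -r '.status, .outputs.endpoint_target, .outputs.endpoint_port' "$1"
}

# show_endpoint_contract \
#   "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_dr.json"
```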

  2. Confirm the DR endpoint accepts connections

From a host that has network reachability to the DR DB endpoint:

nc -vz <endpoint_target> <endpoint_port>
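For a deeper check than a TCP probe, a write-capability test with psql (assumes psql is installed; the placeholders come from module state, and user=postgres is an assumption — substitute your Patroni superuser):

```shell
# "f" means the node is a writable primary, not a replica in recovery.
PGPASSWORD="$PATRONI_SUPERUSER_PASSWORD" psql \
  "host=<endpoint_target> port=<endpoint_port> user=postgres dbname=postgres" \
  -tAc 'SELECT pg_is_in_recovery();'
```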

  3. Confirm live internal DNS now matches the restored DR target

Expected:

  • platform/network/dns-routing#postgresql_dns_status_ha_failover_gcp is status: ok
  • published dns.status is live-ok
  • published dns.targets matches the restored DR endpoint_host
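To double-check the live answer directly against the internal authority, query it with dig (the record name and server address are placeholders):

```shell
# Ask the internal PowerDNS authority directly for the published A record.
dig +short <endpoint_dns_name> @<powerdns-server-ip>
# The answer should equal the endpoint_host published by the restored lane.
```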

Current recovery-lane verification

Use these checks when you need the current failover lane truth rather than the original restore logs:

hyops show module platform/postgresql-ha#postgresql_restore_gcp_dr --env dev
hyops show module platform/network/dns-routing#postgresql_dns_status_ha_failover_gcp --env dev

Expected:

  • the failover lane publishes the current cloud endpoint
  • the DNS status instance reports the live observed target for that lane
  • the state is suitable for reviewer proof without rerunning the disruptive drill

Post-actions and clean-up

  • Perform application cutover explicitly against endpoint_target (DNS / config / traffic switching).
  • Recommended DR pattern:
  • set endpoint_dns_name in the PostgreSQL HA inputs for a stable cross-site DB name
  • keep applications pointed at that DNS name
  • move the DNS record during failover/failback instead of teaching apps raw leader IPs
  • Do not re-enable the on-prem primary until you have an approved failback plan.
  • If the DR event is a drill and you did not cut traffic, destroy the DR VMs to avoid ongoing cost.

Troubleshooting

  • If restore fails at preflight with missing env keys, ensure inputs.required_env includes the repo credential key(s) and that they are present in vault.
  • If restore fails with SSH reachability errors, ensure you can reach the DR VMs (public IP, VPN, bastion, or IAP).
  • If pgBackRest repo cannot be accessed, verify bucket name/path and permissions for the service account / S3 keys.

References