Failover PostgreSQL HA to GCP (HyOps Blueprint)

Purpose: Restore a new PostgreSQL HA primary in GCP from pgBackRest backups (GCS or S3).
Owner: Platform engineering / SRE
Trigger: DR event (on-prem DB unavailable) or scheduled failover drill
Impact: A new PostgreSQL primary becomes available in GCP; applications must be cut over explicitly (DNS / config / traffic).
Severity: P1
Pre-reqs: GCP init is ready for the target env, pgBackRest backups exist and are recent, shared PowerDNS primary is already provisioned, required secrets are present in runtime vault, and operators have fenced the on-prem primary to prevent split-brain.
Rollback strategy: If failover is aborted, destroy the DR VMs and do not cut traffic; if failover is completed, use the dedicated failback runbook to return service to on-prem.

Context

Blueprint ref: dr/postgresql-ha-failover-gcp@v1
Location: hybridops-core/blueprints/dr/postgresql-ha-failover-gcp@v1/blueprint.yml

Default step flow:

  1. org/gcp/wan-cloud-nat (provision private egress for the workloads subnet)
  2. platform/gcp/platform-vm (provision DR nodes)
  3. platform/onprem/postgresql-ha with apply_mode=restore (restore cluster from pgBackRest repo)
  4. platform/onprem/postgresql-ha-backup (re-enable backups on the DR primary)
  5. platform/network/dns-routing (optional internal DNS cutover when endpoint_dns_name is set)

The DR VM step now carries an explicit post-apply SSH readiness gate before the restore step is allowed to start. This is intentional: a successful cloud VM create is not enough for a tarball-safe DR flow if the next step immediately depends on SSH into those fresh nodes.
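The gate itself can be approximated from the runner. A minimal sketch in bash (host addresses, retry counts, and timings are illustrative, not the blueprint's actual implementation):

```shell
# Probe TCP/22 on a fresh DR node before allowing the restore step to start.
# Uses bash's /dev/tcp redirection so no extra tooling is needed on the runner.
wait_for_ssh() {
  local host=$1 port=${2:-22} tries=${3:-30} delay=${4:-10}
  local i
  for ((i = 1; i <= tries; i++)); do
    if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "ssh port open on ${host}:${port}"
      return 0
    fi
    sleep "$delay"
  done
  echo "ssh port never opened on ${host}:${port}" >&2
  return 1
}

# Example: gate on every DR node before restore (addresses are placeholders)
# for node in 10.20.0.11 10.20.0.12; do wait_for_ssh "$node" || exit 1; done
```

This only proves the TCP port is open, not that SSH auth succeeds; the blueprint's actual gate may check more.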

The restore and backup-configuration steps now declare:

  • execution_plane: runner-local

Meaning:

  • this blueprint is intended to run from a shared runner with private reachability to the GCP target subnet
  • it is not designed to assume workstation-direct access to the private DR VM IPs

The DNS step is intentionally endpoint-contract driven:

  • it consumes endpoint_dns_name and endpoint_target from the restored PostgreSQL HA state
  • it remains a no-op while endpoint_dns_name is blank
  • when endpoint_dns_name is set, it updates the internal DNS authority via platform/network/dns-routing

Internal DNS authority should be treated as a separate shared control-plane service:

  • NetBox remains IPAM / inventory source of truth
  • PowerDNS is the internal authoritative DNS engine
  • platform/network/dns-routing is the cutover automation layer that updates PowerDNS records

For HybridOps, the clean first posture is to self-host PowerDNS in the shared control plane rather than bundling DNS authority into the NetBox bootstrap path.

The DNS cutover step now prefers:

  • powerdns_state_ref: platform/network/powerdns-authority#shared_primary

That keeps the shipped blueprint state-driven and avoids hardcoding the PowerDNS API endpoint by default.
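For illustration only, the DNS step inputs might be shaped like the fragment below. The field nesting and surrounding keys are assumptions; only powerdns_state_ref and the endpoint_* contract come from this document:

```yaml
# Hypothetical fragment, not a verbatim copy of the shipped blueprint.
- module: platform/network/dns-routing
  inputs:
    powerdns_state_ref: platform/network/powerdns-authority#shared_primary
    # consumed from restored PostgreSQL HA state; the step is a no-op while blank
    endpoint_dns_name: ""
    endpoint_target: ""
```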

GCP project roles

For this blueprint, keep the GCP project split explicit:

  • host/network project
      • runner placement
      • DR VM placement
      • Shared VPC
      • Cloud Router / NAT for both runner and DR workloads subnets
  • control project
      • env-scoped object repository state such as org/gcp/object-repo
      • env-scoped GCP Secret Manager secrets
  • workload project
      • optional future role if env-specific workload compute is separated from the host/network project

It is valid for the DR VMs to be restored into the host/network project while backup bucket state and runner-synced secrets are sourced from the env control project.

DR operating model (important)

This blueprint is the restore-path DR implementation for a GCP target.

  • It is compatible with both:
      • backup-restore DR (Mode A), and
      • warm-standby promotion programs (Mode B), where this runbook is still used for rebuild/reseed.
  • It is not a dual-readonly pattern by itself.

For the platform-wide strategy (including Azure/GCP selection and secondary backup copy policy), see: PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud).

Preconditions and safety checks

  1. Fence the on-prem primary (split-brain guard)

This blueprint does not stop the on-prem cluster for you.

Before failover you must ensure the on-prem primary cannot accept writes. Typical fencing options:

  • Power off the on-prem DB VMs (or the on-prem site).
  • Block egress/ingress to the on-prem DB VIP at the network layer.
  • Disable write traffic at the app tier (maintenance mode) and verify no clients can reach the old VIP.

If you cannot prove the on-prem primary is fenced, do not proceed.
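One way to spot-check fencing from the runner (a sketch; the VIP address and port below are placeholders, and a refused or timed-out connection is evidence of fencing, not proof):

```shell
# Succeeds only when the old on-prem VIP no longer accepts TCP connections
# on the PostgreSQL port. A closed port is necessary but not sufficient proof.
assert_vip_fenced() {
  local vip=$1 port=${2:-5432}
  if timeout 5 bash -c "exec 3<>/dev/tcp/${vip}/${port}" 2>/dev/null; then
    echo "FENCING FAILED: ${vip}:${port} still accepts connections" >&2
    return 1
  fi
  echo "ok: ${vip}:${port} is unreachable"
}

# Example (placeholder VIP):
# assert_vip_fenced 192.0.2.10 5432 || exit 1
```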

  2. Ensure object repository state exists (GCS/S3)

This blueprint consumes repository settings from module state via repo_state_ref (default in blueprint: org/gcp/object-repo).
Ensure the referenced object repository module is already in a ready state and that backups/WAL are present.

Recommended role split:

  • org/gcp/object-repo lives in the env control project
  • platform/gcp/platform-vm#gcp_pg_vms places DR compute in the host/network project

Optional (infra provisioning):

# GCS object repo (bucket + service account plumbing; no SA keys in state)
hyops apply --env dev \
  --module org/gcp/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/gcp/object-repo/examples/inputs.min.yml"

# S3 object repo (bucket + IAM user; no access keys in state)
hyops apply --env dev \
  --module org/aws/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/aws/object-repo/examples/inputs.min.yml"

# Azure object repo (storage account + container; no account keys in state)
hyops apply --env dev \
  --module org/azure/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/azure/object-repo/examples/inputs.min.yml"

  3. Ensure pgBackRest repository credentials exist in runtime vault

For GCS repositories, HyOps expects the service account JSON in an env key (default: PG_BACKUP_GCS_SA_JSON).

Example (store SA JSON into vault from the current shell):

export PG_BACKUP_GCS_SA_JSON="$(cat /path/to/gcs-sa.json)"
hyops secrets set --env dev --from-env PG_BACKUP_GCS_SA_JSON

For S3 repositories, store the access/secret keys into vault (defaults):

export PG_BACKUP_S3_ACCESS_KEY_ID="..."
export PG_BACKUP_S3_SECRET_ACCESS_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_S3_ACCESS_KEY_ID PG_BACKUP_S3_SECRET_ACCESS_KEY

For Azure repositories, store the storage account key into vault (default key name):

export PG_BACKUP_AZURE_ACCOUNT_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_AZURE_ACCOUNT_KEY

Optional (when secondary_enabled=true in backup step): store secondary backend credentials as well, for example:

# Secondary Azure repo2 key example
export PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY

  4. Ensure Patroni and application DB secrets exist in runtime vault

hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD

If you want the blueprint to update internal DNS during cutover, also ensure:

hyops secrets ensure --env dev POWERDNS_API_KEY

  5. Validate and preflight the blueprint

hyops blueprint validate --ref dr/postgresql-ha-failover-gcp@v1
hyops blueprint preflight --env dev --ref dr/postgresql-ha-failover-gcp@v1

hyops blueprint preflight now runs step-level module driver preflight when upstream step state is already present. That means it fails early on missing required secret env vars or a locked runtime vault, before deploy --execute starts the restore step. If the current shell has not unlocked GPG for the runtime vault, unlock it first:

hyops vault password >/dev/null

  6. Ensure the runner can reach DR VMs over SSH (network model)

Best practice (enterprise): run HyOps from a runner with private reachability to the target VPC. The recommended bootstrap path is networking/gcp-ops-runner@v1, which provisions a private runner VM in the hub core subnet for runner-local execution.

Important:

  • ssh_proxy_jump_auto only auto-infers a Proxmox bastion for on-prem inventories.
  • For this GCP DR blueprint, HyOps will not auto-route private GCP VM addresses through the Proxmox host.
  • If your runner cannot route to the GCP subnet directly, you must either:
      • provide an explicit cloud-reachable bastion via ssh_proxy_jump_host, or
      • set assign_public_ip: true for the DR VMs and restrict SSH ingress for the drill.

Community/drill fast-path: set assign_public_ip: true for DR VMs in your blueprint file and apply a restricted SSH firewall rule.

Example:

SRC_CIDR="203.0.113.10/32"  # replace with your operator source
gcloud compute firewall-rules create hyops-dr-allow-ssh-postgres \
  --network <your-vpc-network> \
  --allow tcp:22 \
  --source-ranges "$SRC_CIDR" \
  --target-tags postgres

Steps

  1. Prepare a blueprint file with real values

The built-in blueprint contains placeholders and now fails fast until they are replaced. At minimum, the GCP VM step expects the following upstream state in the selected env:

  • org/gcp/project-factory
  • org/gcp/wan-hub-network

Optional overrides if you do not want the default state-driven path:

  • project_id
  • network
  • subnetwork

The shipped GCP VM step uses ssh_keys_from_init: true. If you override that path with explicit ssh_keys, also set ssh_keys_from_init: false.
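As an illustrative sketch of such an override in the local blueprint copy (the key material and field nesting here are placeholders, not a verbatim schema):

```yaml
# Hypothetical override: disable init-derived keys and pass explicit ones.
inputs:
  ssh_keys_from_init: false
  ssh_keys:
    - "ssh-ed25519 AAAA... operator@example"
```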

Use the blueprint --file option so you can keep a local, environment-specific copy:

hyops blueprint init --env dev \
  --ref dr/postgresql-ha-failover-gcp@v1 \
  --dest-name dr-postgresql-ha-failover-gcp.yml

# edit ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml
# (set any environment-specific overrides such as zone or explicit project/network values)
# by default the DR VM step uses ssh_keys_from_init=true and consumes the key from ~/.hybridops/envs/<env>/meta/gcp.ready.json
# default repository contract is repo_state_ref=org/gcp/object-repo

hyops blueprint validate --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"
hyops blueprint preflight --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"

Note:

  • If platform/gcp/platform-vm#gcp_pg_vms is already ok, the blueprint now skips re-provisioning that step (skip_if_state_ok=true).
  • Restore still requires explicit confirmation (restore_confirm: true) and will remain blocked in preflight until set.
  • Backup step supports optional secondary repository copy (secondary_enabled=true, secondary_repo_state_ref or secondary_backend + secondary_* fields).

  2. Execute failover restore

This is guarded. The PostgreSQL HA module will refuse to run restore unless:

  • inputs.apply_mode=restore
  • inputs.restore_confirm=true

hyops blueprint deploy --env dev \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" \
  --execute

# Optional override at runtime:
# HYOPS_INPUT_repo_state_ref=org/aws/object-repo hyops blueprint deploy --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" --execute
# HYOPS_INPUT_repo_state_ref=org/azure/object-repo hyops blueprint deploy --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" --execute

  3. Observe progress and capture evidence

  • Each step prints an evidence: path.
  • Logs are written under $HOME/.hybridops/envs/<env>/logs/module/...
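A small helper for finding the newest module logs under that directory (a sketch; assumes GNU find on the runner):

```shell
# Print the N most recently modified files under a log root.
latest_logs() {
  local root=$1 count=${2:-5}
  find "$root" -type f -printf '%T@ %p\n' 2>/dev/null \
    | sort -rn | head -n "$count" | cut -d' ' -f2-
}

# Example:
# latest_logs "$HOME/.hybridops/envs/dev/logs/module" 10
```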

Verification

  1. Confirm module state and outputs

cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"

Expected:

  • status: ok
  • outputs.cap.db.postgresql_ha: ready
  • Prefer the explicit endpoint contract from module state:
      • endpoint_target
      • endpoint_target_type
      • endpoint_port
  • db_host remains the active host/VIP for diagnostics and backward compatibility.
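To pull the endpoint contract out of the state file, something like the following works (a sketch; assumes jq is installed and that the fields live under outputs as shown above):

```shell
# Print host:port from a PostgreSQL HA module state file.
endpoint_from_state() {
  local state_file=$1
  jq -r '"\(.outputs.endpoint_target):\(.outputs.endpoint_port)"' "$state_file"
}

# Example:
# endpoint_from_state "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"
```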

  2. Confirm the DR endpoint accepts connections

From a host that has network reachability to the DR DB endpoint:

nc -vz <endpoint_target> <endpoint_port>

Post-actions and clean-up

  • Perform application cutover explicitly against endpoint_target (DNS / config / traffic switching).
  • Recommended DR pattern:
  • set endpoint_dns_name in the PostgreSQL HA inputs for a stable cross-site DB name
  • keep applications pointed at that DNS name
  • move the DNS record during failover/failback instead of teaching apps raw leader IPs
  • Do not re-enable the on-prem primary until you have an approved failback plan.
  • If the DR event is a drill and you did not cut traffic, destroy the DR VMs to avoid ongoing cost.

Troubleshooting

  • If restore fails at preflight with missing env keys, ensure inputs.required_env includes the repo credential key(s) and that they are present in vault.
  • If restore fails with SSH reachability errors, ensure you can reach the DR VMs (public IP, VPN, bastion, or IAP).
  • If pgBackRest repo cannot be accessed, verify bucket name/path and permissions for the service account / S3 keys.

References