Failover PostgreSQL HA to GCP (HyOps Blueprint)¶
Purpose: Restore a new PostgreSQL HA primary in GCP from pgBackRest backups (GCS or S3).
Owner: Platform engineering / SRE
Trigger: DR event (on-prem DB unavailable) or scheduled failover drill
Impact: A new PostgreSQL primary becomes available in GCP; applications must be cut over explicitly (DNS / config / traffic).
Severity: P1
Pre-reqs: GCP init is ready for the target env, pgBackRest backups exist and are recent, shared PowerDNS primary is already provisioned, required secrets are present in runtime vault, and operators have fenced the on-prem primary to prevent split-brain.
Rollback strategy: If failover is aborted, destroy the DR VMs and do not cut traffic; if failover is completed, use the dedicated failback runbook to return service to on-prem.
Context¶
Blueprint ref: dr/postgresql-ha-failover-gcp@v1
Location: hybridops-core/blueprints/dr/postgresql-ha-failover-gcp@v1/blueprint.yml
Default step flow:
1. `org/gcp/wan-cloud-nat` (provision private egress for the workloads subnet)
2. `platform/gcp/platform-vm` (provision DR nodes)
3. `platform/onprem/postgresql-ha` with `apply_mode=restore` (restore cluster from pgBackRest repo)
4. `platform/onprem/postgresql-ha-backup` (re-enable backups on the DR primary)
5. `platform/network/dns-routing` (optional internal DNS cutover when `endpoint_dns_name` is set)
The DR VM step now carries an explicit post-apply SSH readiness gate before the restore step is allowed to start. This is intentional: a successful cloud VM create is not enough for a tarball-safe DR flow if the next step immediately depends on SSH into those fresh nodes.
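The gate's behavior can be approximated with a small polling sketch (illustrative only; the blueprint's actual gate implementation may differ):

```shell
#!/usr/bin/env bash
# Illustrative sketch of a post-apply SSH readiness gate: poll a TCP port on
# each fresh node and fail if it never opens. Not the blueprint's actual code.
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-30}"
  for _ in $(seq 1 "$attempts"); do
    # /dev/tcp is a bash feature; timeout bounds each individual probe
    if timeout 2 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
      echo "ready: ${host}:${port}"
      return 0
    fi
    sleep 1
  done
  echo "gate timed out: ${host}:${port}"
  return 1
}

# Example (hypothetical node IPs): gate on SSH before starting the restore step
# for host in 10.20.0.11 10.20.0.12; do wait_for_port "$host" 22 60 || exit 1; done
```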
The restore and backup-configuration steps now declare:
execution_plane: runner-local
Meaning:
- this blueprint is intended to run from a shared runner with private reachability to the GCP target subnet
- it is not designed to assume workstation-direct access to the private DR VM IPs
The DNS step is intentionally endpoint-contract driven:
- it consumes `endpoint_dns_name` and `endpoint_target` from the restored PostgreSQL HA state
- it remains a no-op while `endpoint_dns_name` is blank
- when `endpoint_dns_name` is set, it updates the internal DNS authority via `platform/network/dns-routing`
Internal DNS authority should be treated as a separate shared control-plane service:
- `NetBox` remains the IPAM / inventory source of truth
- `PowerDNS` is the internal authoritative DNS engine
- `platform/network/dns-routing` is the cutover automation layer that updates PowerDNS records
For HybridOps, the clean first posture is to self-host PowerDNS in the shared control plane rather than bundling DNS authority into the NetBox bootstrap path.
The DNS cutover step now prefers:
powerdns_state_ref: platform/network/powerdns-authority#shared_primary
That keeps the shipped blueprint state-driven and avoids hardcoding the PowerDNS API endpoint by default.
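As a sketch, the endpoint contract plus the PowerDNS state reference might look like this in a local blueprint copy (the field names are taken from this runbook; the surrounding YAML shape and the DNS name are illustrative assumptions):

```yaml
# Illustrative fragment only -- the enclosing blueprint structure is assumed.
# endpoint_dns_name drives the optional DNS cutover step; leaving it blank
# keeps platform/network/dns-routing as a no-op.
inputs:
  endpoint_dns_name: db.dr.internal.example   # hypothetical internal name
  powerdns_state_ref: platform/network/powerdns-authority#shared_primary
```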
GCP project roles¶
For this blueprint, keep the GCP project split explicit:
- host/network project
  - runner placement
  - DR VM placement
  - Shared VPC
  - Cloud Router / NAT for both runner and DR workloads subnets
- control project
  - env-scoped object repository state such as `org/gcp/object-repo`
  - env-scoped GCP Secret Manager secrets
- workload project
  - optional future role if env-specific workload compute is separated from the host/network project
It is valid for the DR VMs to be restored into the host/network project while backup bucket state and runner-synced secrets are sourced from the env control project.
DR operating model (important)¶
This blueprint is the restore-path DR implementation for a GCP target.
- It is compatible with both:
- backup-restore DR (Mode A), and
- warm-standby promotion programs (Mode B) where this runbook is still used for rebuild/reseed.
- It is not a dual-readonly pattern by itself.
For the platform-wide strategy (including Azure/GCP selection and secondary backup copy policy), see:
- PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)
Preconditions and safety checks¶
- Fence the on-prem primary (split-brain guard)
This blueprint does not stop the on-prem cluster for you.
Before failover you must ensure the on-prem primary cannot accept writes. Typical fencing options:
- Power off the on-prem DB VMs (or the on-prem site).
- Block egress/ingress to the on-prem DB VIP at the network layer.
- Disable write traffic at the app tier (maintenance mode) and verify no clients can reach the old VIP.
If you cannot prove the on-prem primary is fenced, do not proceed.
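One way to sanity-check fencing from an operator host that previously had reachability is a negative TCP probe (a rough sketch; it supplements, but does not replace, verifying the fence at the firewall or hypervisor layer):

```shell
#!/usr/bin/env bash
# Rough fencing sanity check: succeed only if the old on-prem DB VIP no longer
# accepts TCP connections. A passing probe does not prove fencing by itself --
# also confirm the block at the firewall / hypervisor layer.
vip_is_fenced() {
  local host="$1" port="$2"
  if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "NOT fenced: ${host}:${port} still accepts connections"
    return 1
  fi
  echo "fenced: ${host}:${port} unreachable"
}

# Example (hypothetical VIP): vip_is_fenced 10.0.0.50 5432
```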
- Ensure object repository state exists (GCS/S3)
This blueprint consumes repository settings from module state via repo_state_ref (default in blueprint: org/gcp/object-repo).
Ensure that the referenced object repository module is already `ready` and that recent backups and WAL archives are present.
Recommended role split:
- `org/gcp/object-repo` lives in the env control project
- `platform/gcp/platform-vm#gcp_pg_vms` places DR compute in the host/network project
Optional (infra provisioning):
# GCS object repo (bucket + service account plumbing; no SA keys in state)
hyops apply --env dev \
--module org/gcp/object-repo \
--inputs "$HOME/.hybridops/core/app/modules/org/gcp/object-repo/examples/inputs.min.yml"
# S3 object repo (bucket + IAM user; no access keys in state)
hyops apply --env dev \
--module org/aws/object-repo \
--inputs "$HOME/.hybridops/core/app/modules/org/aws/object-repo/examples/inputs.min.yml"
# Azure object repo (storage account + container; no account keys in state)
hyops apply --env dev \
--module org/azure/object-repo \
--inputs "$HOME/.hybridops/core/app/modules/org/azure/object-repo/examples/inputs.min.yml"
- Ensure pgBackRest repository credentials exist in runtime vault
For GCS repositories, HyOps expects the service account JSON in an env key (default: PG_BACKUP_GCS_SA_JSON).
Example (store SA JSON into vault from the current shell):
export PG_BACKUP_GCS_SA_JSON="$(cat /path/to/gcs-sa.json)"
hyops secrets set --env dev --from-env PG_BACKUP_GCS_SA_JSON
For S3 repositories, store the access/secret keys into vault (defaults):
export PG_BACKUP_S3_ACCESS_KEY_ID="..."
export PG_BACKUP_S3_SECRET_ACCESS_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_S3_ACCESS_KEY_ID PG_BACKUP_S3_SECRET_ACCESS_KEY
For Azure repositories, store the storage account key into vault (default key name):
export PG_BACKUP_AZURE_ACCOUNT_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_AZURE_ACCOUNT_KEY
Optional (when secondary_enabled=true in backup step): store secondary backend credentials as well, for example:
# Secondary Azure repo2 key example
export PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY="..."
hyops secrets set --env dev --from-env PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY
- Ensure Patroni and application DB secrets exist in runtime vault
hyops secrets ensure --env dev \
PATRONI_SUPERUSER_PASSWORD \
PATRONI_REPLICATION_PASSWORD \
NETBOX_DB_PASSWORD
If you want the blueprint to update internal DNS during cutover, also ensure:
hyops secrets ensure --env dev POWERDNS_API_KEY
- Validate and preflight the blueprint
hyops blueprint validate --ref dr/postgresql-ha-failover-gcp@v1
hyops blueprint preflight --env dev --ref dr/postgresql-ha-failover-gcp@v1
hyops blueprint preflight now runs step-level module driver preflight when upstream step state is already present.
That means it fails early on missing required secret env vars or a locked runtime vault, before deploy --execute starts the restore step.
If the current shell has not unlocked GPG for the runtime vault, unlock it first:
hyops vault password >/dev/null
- Ensure the runner can reach DR VMs over SSH (network model)
Best practice (enterprise): run HyOps from a runner with private reachability to the target VPC. The recommended bootstrap path is networking/gcp-ops-runner@v1, which provisions a private runner VM in the hub core subnet for runner-local execution.
Important:
- `ssh_proxy_jump_auto` only auto-infers a Proxmox bastion for on-prem inventories.
- For this GCP DR blueprint, HyOps will not auto-route private GCP VM addresses through the Proxmox host.
- If your runner cannot route to the GCP subnet directly, you must either:
  - provide an explicit cloud-reachable bastion via `ssh_proxy_jump_host`, or
  - set `assign_public_ip: true` for the DR VMs and restrict SSH ingress for the drill.
Community/drill fast-path: set assign_public_ip: true for DR VMs in your blueprint file and apply a restricted SSH firewall rule.
Example:
SRC_CIDR="203.0.113.10/32" # replace with your operator source
gcloud compute firewall-rules create hyops-dr-allow-ssh-postgres \
--network <your-vpc-network> \
--allow tcp:22 \
--source-ranges "$SRC_CIDR" \
--target-tags postgres
Steps¶
- Prepare a blueprint file with real values
The built-in blueprint contains placeholders and now fails fast until they are replaced. Required at minimum for the GCP VM step:
Expected upstream state in the selected env:
- `org/gcp/project-factory`
- `org/gcp/wan-hub-network`
Optional overrides if you do not want the default state-driven path:
`project_id`, `network`, `subnetwork`
The shipped GCP VM step uses ssh_keys_from_init: true.
If you override that path with explicit ssh_keys, also set ssh_keys_from_init: false.
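A sketch of what those overrides might look like in the local blueprint copy (the keys are the ones this runbook names; the enclosing schema and all values are illustrative assumptions, not verbatim blueprint content):

```yaml
# Illustrative override fragment for the GCP VM step. Only set these if you
# are opting out of the default state-driven values.
inputs:
  project_id: my-hostnet-project        # hypothetical explicit project
  network: hub-vpc                      # hypothetical explicit network
  subnetwork: dr-workloads              # hypothetical explicit subnetwork
  ssh_keys_from_init: false             # required when ssh_keys is set explicitly
  ssh_keys:
    - "ubuntu:ssh-ed25519 AAAA... operator@example"
```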
Use the blueprint --file option so you can keep a local, environment-specific copy:
hyops blueprint init --env dev \
--ref dr/postgresql-ha-failover-gcp@v1 \
--dest-name dr-postgresql-ha-failover-gcp.yml
# edit ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml
# (set any environment-specific overrides such as zone or explicit project/network values)
# by default the DR VM step uses ssh_keys_from_init=true and consumes the key from ~/.hybridops/envs/<env>/meta/gcp.ready.json
# default repository contract is repo_state_ref=org/gcp/object-repo
hyops blueprint validate --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"
hyops blueprint preflight --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml"
Note:
- If `platform/gcp/platform-vm#gcp_pg_vms` is already `ok`, the blueprint now skips re-provisioning that step (`skip_if_state_ok=true`).
- Restore still requires explicit confirmation (`restore_confirm: true`) and will remain blocked in preflight until set.
- The backup step supports an optional secondary repository copy (`secondary_enabled=true`, plus `secondary_repo_state_ref` or `secondary_backend` + `secondary_*` fields).

- Execute failover restore
This is guarded. The PostgreSQL HA module will refuse to run restore unless:
- `inputs.apply_mode=restore`
- `inputs.restore_confirm=true`
hyops blueprint deploy --env dev \
--file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" \
--execute
# Optional override at runtime:
# HYOPS_INPUT_repo_state_ref=org/aws/object-repo hyops blueprint deploy --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" --execute
# HYOPS_INPUT_repo_state_ref=org/azure/object-repo hyops blueprint deploy --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp.yml" --execute
- Observe progress and capture evidence
  - Each step prints an `evidence:` path.
  - Logs are written under: `$HOME/.hybridops/envs/<env>/logs/module/...`
Verification¶
- Confirm module state and outputs
cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"
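A sketch of pulling the endpoint contract out of that state file with jq (the JSON shape below is inferred from the output names this runbook lists, so verify it against your actual `latest.json` before relying on it):

```shell
#!/usr/bin/env bash
# Extract the endpoint contract from module state with jq. The sample JSON
# mirrors the output names this runbook lists; the real file layout may differ.
STATE="$(mktemp)"
cat > "$STATE" <<'EOF'
{
  "status": "ok",
  "outputs": {
    "endpoint_target": "10.20.0.11",
    "endpoint_target_type": "ip",
    "endpoint_port": 5432
  }
}
EOF

# Print "host:port" for use in cutover tooling
jq -r '"\(.outputs.endpoint_target):\(.outputs.endpoint_port)"' "$STATE"

# In a real run, point jq at:
#   $HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json
```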
Expected:
- `status: ok`
- `outputs.cap.db.postgresql_ha: ready`
- Prefer the explicit endpoint contract from module state: `endpoint_target`, `endpoint_target_type`, `endpoint_port`.
- `db_host` remains the active host/VIP for diagnostics and backward compatibility.

- Confirm the DR endpoint accepts connections
From a host that has network reachability to the DR DB endpoint:
nc -vz <endpoint_target> <endpoint_port>
Post-actions and clean-up¶
- Perform application cutover explicitly against `endpoint_target` (DNS / config / traffic switching).
- Recommended DR pattern:
  - set `endpoint_dns_name` in the PostgreSQL HA inputs for a stable cross-site DB name
  - keep applications pointed at that DNS name
  - move the DNS record during failover/failback instead of teaching apps raw leader IPs
- Do not re-enable the on-prem primary until you have an approved failback plan.
- If the DR event is a drill and you did not cut traffic, destroy the DR VMs to avoid ongoing cost.
Troubleshooting¶
- If restore fails at preflight with missing env keys, ensure `inputs.required_env` includes the repo credential key(s) and that they are present in vault.
- If restore fails with SSH reachability errors, ensure you can reach the DR VMs (public IP, VPN, bastion, or IAP).
- If pgBackRest repo cannot be accessed, verify bucket name/path and permissions for the service account / S3 keys.