Failback PostgreSQL HA to On-Prem (HyOps Blueprint)

Purpose: Restore service back to an on-prem PostgreSQL HA cluster using the same pgBackRest repository used during failover.
Owner: Platform engineering / SRE
Trigger: End of DR event (on-prem site restored) or scheduled failback drill
Impact: A new on-prem PostgreSQL primary becomes available; applications must be cut back explicitly.
Severity: P1
Pre-reqs:

  • On-prem infrastructure is restored (Proxmox + network).
  • Backups are current (a final backup from the DR primary is strongly recommended).
  • The shared PowerDNS primary is already provisioned.
  • Required secrets are present in the runtime vault.
  • Operators have fenced the DR primary before cut-back.
  • An on-prem runner is available for runner-local execution.
Rollback strategy: If cut-back fails, keep DR primary fenced-off but available, and re-point apps back to DR while investigating.

Context

Blueprint ref: dr/postgresql-ha-failback-onprem@v1
Location (example file): ~/.hybridops/envs/<env>/config/blueprints/dr-postgresql-ha-failback-onprem.yml

Default step flow:

  1. core/onprem/network-sdn (optional converge)
  2. core/onprem/template-image (Rocky 9)
  3. platform/onprem/platform-vm (rebuild on-prem DB nodes)
  4. platform/onprem/postgresql-ha with apply_mode=restore and execution_plane=runner-local
  5. platform/onprem/postgresql-ha-backup with execution_plane=runner-local (re-enable backups on the on-prem primary)
  6. platform/network/dns-routing (optional internal DNS cutback when endpoint_dns_name is set)

Internal DNS authority remains a separate shared control-plane service:

  • NetBox is still IPAM / inventory metadata
  • PowerDNS is the internal authoritative DNS engine
  • platform/network/dns-routing performs the cutback record update

Do not treat NetBox as the authoritative DNS server for this failback path.

The DNS cutback step now prefers:

  • powerdns_state_ref: platform/network/powerdns-authority#shared_primary

That keeps the shipped blueprint state-driven and avoids hardcoding the PowerDNS API endpoint by default.

DR operating model (important)

This blueprint is the controlled failback path after a cloud DR operation.

  • It assumes one active cloud DR primary at failback time.
  • It does not require dual-readonly cloud clusters.
  • It works with GCP, Azure, or S3-backed repositories via repo_state_ref.
  • Internal DNS cutback remains inactive until endpoint_dns_name is set in the blueprint overlay.
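For illustration, a minimal blueprint overlay under this model might look like the following. This is a sketch only: apart from repo_state_ref, endpoint_dns_name, and powerdns_state_ref (all named elsewhere in this runbook), the key layout is an assumption; check the shipped blueprint schema before use.

```yaml
# Hypothetical overlay sketch -- structure is illustrative, not authoritative.
inputs:
  repo_state_ref: org/aws/object-repo            # S3-backed repository instead of the GCS default
  endpoint_dns_name: db.internal.example.com     # setting this activates the DNS cutback step
  powerdns_state_ref: platform/network/powerdns-authority#shared_primary
```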

For target-cloud and repository policy guidance, see:

  • PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)

Preconditions and safety checks

  1. Take a final backup from the DR primary (recommended)

This is the cleanest way to minimize data loss before cutting back.

Example (on-demand backup entrypoint):

HYOPS_INPUT_apply_mode=backup \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha-backup \
  --inputs modules/platform/onprem/postgresql-ha-backup/examples/inputs.gcs.yml

  2. Fence the DR primary (split-brain guard)

Before failback cut-back, you must ensure the DR primary cannot accept writes.

This runbook assumes you will fence via one of:

  • Power off DR DB VMs
  • Block network access to the DR DB VIP/IP
  • Freeze writes at app tier and verify the DR endpoint is not reachable
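Whichever fencing method you choose, verify it before proceeding. A minimal bash sketch (the DR host and port defaults below are placeholders, not values from this environment):

```shell
#!/usr/bin/env bash
# Returns success (0) when host:port does not answer a TCP connect,
# i.e. the DR primary looks fenced from this vantage point.
is_fenced() {
  local host="$1" port="$2"
  if timeout 3 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
    return 1   # connect succeeded -> NOT fenced
  fi
  return 0     # connect failed or timed out -> fenced
}

# Placeholder DR VIP/port; substitute your real endpoint.
if is_fenced "${DR_DB_HOST:-203.0.113.10}" "${DR_DB_PORT:-5432}"; then
  echo "fenced"
else
  echo "NOT fenced: DR endpoint still accepts connections" >&2
  exit 1
fi
```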

  3. Ensure repository credentials and Patroni secrets exist in the runtime vault

If the object repository module state does not exist yet (or you are formalizing it as code), provision it first:

# GCS object repo (bucket + service account plumbing; no SA keys in state)
hyops apply --env dev \
  --module org/gcp/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/gcp/object-repo/examples/inputs.min.yml"

# S3 object repo (bucket + IAM user; no access keys in state)
hyops apply --env dev \
  --module org/aws/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/aws/object-repo/examples/inputs.min.yml"

# Azure object repo (storage account + container; no account keys in state)
hyops apply --env dev \
  --module org/azure/object-repo \
  --inputs "$HOME/.hybridops/core/app/modules/org/azure/object-repo/examples/inputs.min.yml"

Also ensure the Patroni and database secrets exist:

hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD

If you want the blueprint to update internal DNS during cutback, also ensure:

hyops secrets ensure --env dev POWERDNS_API_KEY

For GCS repos:

hyops secrets ensure --env dev PG_BACKUP_GCS_SA_JSON

For Azure repos:

hyops secrets ensure --env dev PG_BACKUP_AZURE_ACCOUNT_KEY

Optional (when secondary_enabled=true in backup step): ensure secondary backend credentials exist, for example:

hyops secrets ensure --env dev PG_BACKUP_SECONDARY_AZURE_ACCOUNT_KEY

Steps

  1. Prepare a blueprint file with real values

The failback blueprint requires:

  • Real on-prem VM IPs (static inputs) or a known-good IPAM workflow. The restored PostgreSQL nodes should live on the shared data VLAN (vnetdata), not the management VLAN.
  • A valid repo_state_ref (default: org/gcp/object-repo) and repo settings.
  • Explicit restore confirmation (restore_confirm=true).

Example:

hyops blueprint init --env dev \
  --ref networking/onprem-ops-runner@v1 \
  --dest-name onprem-ops-runner.yml

hyops blueprint init --env dev \
  --ref dr/postgresql-ha-failback-onprem@v1 \
  --dest-name dr-postgresql-ha-failback-onprem.yml

# edit ~/.hybridops/envs/dev/config/blueprints/onprem-ops-runner.yml
# if the on-prem runner does not already exist
#
# edit ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failback-onprem.yml
# (set vnetdata IPs such as 10.12.0.31/24, 10.12.0.32/24, 10.12.0.33/24,
# restore_confirm, repo_state_ref if overriding, etc)

hyops blueprint validate --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failback-onprem.yml"
hyops blueprint preflight --env dev --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failback-onprem.yml"

  2. Execute failback restore

HYOPS_CORE_ROOT=/path/to/hybridops-core \
hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#onprem_ops_runner_bootstrap \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failback-onprem.yml" \
  --execute

# Optional override at runtime:
# HYOPS_INPUT_repo_state_ref=org/aws/object-repo HYOPS_CORE_ROOT=/path/to/hybridops-core hyops runner blueprint deploy --env dev --runner-state-ref platform/linux/ops-runner#onprem_ops_runner_bootstrap --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failback-onprem.yml" --execute
# HYOPS_INPUT_repo_state_ref=org/azure/object-repo HYOPS_CORE_ROOT=/path/to/hybridops-core hyops runner blueprint deploy --env dev --runner-state-ref platform/linux/ops-runner#onprem_ops_runner_bootstrap --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failback-onprem.yml" --execute

Note: the backup configuration step also supports an optional secondary repository copy (secondary_enabled=true, with secondary_repo_state_ref or explicit secondary_* inputs).

  3. Observe progress and capture evidence

  • Each step prints an evidence: path.
  • Logs are written under $HOME/.hybridops/envs/<env>/logs/module/...

Verification

  1. Confirm module state and outputs
cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"

Expected:

  • status: ok
  • outputs.cap.db.postgresql_ha: ready

  2. Validate client connectivity to the on-prem endpoint

nc -vz <endpoint_target> <endpoint_port>
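The two checks above can be scripted for drill evidence. A sketch, assuming the state-file layout shown above and the JSON field names from the expected output (nothing else here is from the product):

```shell
#!/usr/bin/env bash
# Assert the module state reports a healthy cluster before cutting traffic back.
check_state() {
  local state_file="$1"
  python3 - "$state_file" <<'PY'
import json, sys

state = json.load(open(sys.argv[1]))
# Field names taken from the expected output documented above.
assert state.get("status") == "ok", f"status={state.get('status')}"
cap = state.get("outputs", {}).get("cap", {}).get("db", {})
assert cap.get("postgresql_ha") == "ready", f"cap={cap}"
print("state: ok, postgresql_ha ready")
PY
}

# Real invocation (path as documented above):
# check_state "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"
```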

Post-actions and clean-up

  • Prefer the explicit endpoint contract from module state: endpoint_target, endpoint_target_type, endpoint_port.
  • db_host remains the current active host/VIP for diagnostics and backward compatibility.
  • Cut application traffic back to endpoint_target explicitly.
  • Recommended failback pattern: keep applications pointed at a stable DNS name via endpoint_dns_name, and update that DNS record during failback instead of handing applications raw node IPs.
  • After cut-back is stable, destroy DR resources to avoid ongoing cost.
  • Review RPO/RTO evidence and file a post-incident report (or drill report).
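When following the endpoint_dns_name pattern, it is worth confirming the record actually resolves to the on-prem target after cutback. A sketch using Python's resolver (the name and IP in the example are placeholders):

```shell
#!/usr/bin/env bash
# Confirm a DNS name resolves to the expected on-prem address after cutback.
dns_points_at() {
  local name="$1" expected_ip="$2"
  local resolved
  resolved=$(python3 -c "import socket, sys; print(socket.gethostbyname(sys.argv[1]))" "$name") || return 1
  [ "$resolved" = "$expected_ip" ]
}

# Example with placeholder values:
# dns_points_at db.internal.example.com 10.12.0.31 && echo "cutback DNS verified"
```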

Troubleshooting

  • If restore fails preflight, ensure required_env includes the repo credential env key(s) and vault decryption works.
  • If restore fails during Patroni bootstrap, verify the pgBackRest repository contains a valid stanza for the cluster and that WAL archive is present.
  • If restore fails due to non-empty PGDATA, rebuild the VMs or use restore_delta=true only with explicit approval.
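For the non-empty-PGDATA case, confirm emptiness before reaching for restore_delta=true. A minimal sketch (the PGDATA path in the example is a placeholder):

```shell
#!/usr/bin/env bash
# Report whether a data directory is empty (safe for a plain restore).
pgdata_empty() {
  local dir="$1"
  [ -d "$dir" ] || return 0                # missing dir counts as empty
  [ -z "$(ls -A "$dir" 2>/dev/null)" ]     # any entry means not empty
}

# Example with a placeholder path:
# pgdata_empty /var/lib/pgsql/data && echo "PGDATA empty, plain restore is safe"
```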

References