
Failback PostgreSQL Cloud SQL DR to On-Prem (HyOps Blueprint)

  • Purpose: Gate the failback decision, then repoint the stable PostgreSQL service endpoint back to the on-prem PostgreSQL HA lane.

  • Owner: Platform engineering / SRE

  • Trigger: End of a managed-cloud DR event or failback drill

  • Impact: Applications are redirected back to the on-prem PostgreSQL HA endpoint.
  • Severity: P1

  • Pre-reqs: The managed cloud primary has been fenced, the on-prem PostgreSQL HA lane has already been rebuilt or reseeded, and DNS authority credentials are available.

  • Rollback strategy: If the manual gate is not confirmed, do nothing. If cutback is unsafe, keep service on the managed DR primary until the on-prem target is re-verified.

Context

Blueprint ref: dr/postgresql-cloudsql-failback-onprem@v1

Location: hybridops-core/blueprints/dr/postgresql-cloudsql-failback-onprem@v1/blueprint.yml

Default step flow:

  1. core/shared/manual-gate
  2. platform/network/dns-routing
  3. platform/network/dns-routing (apply_mode=status, live PowerDNS verification)

Important:

  • this blueprint does not rebuild the on-prem cluster for you
  • rebuild, reseed, or reverse-sync work must already be complete before the manual gate is confirmed
  • the blueprint deliberately leaves that work manual until reverse-replication automation is shipped and tested
  • DNS cutback consumes endpoint_host from platform/postgresql-ha#postgresql_restore_onprem_failback because the route uses an A record
  • if the rebuilt failback lane reused the original on-prem source addresses and those hosts were later fenced during the managed-cloud promote drill, restart Patroni and republish platform/postgresql-ha#postgresql_restore_onprem_failback before trusting its endpoint_host for DNS cutback
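
Because the route uses an A record, the value taken from endpoint_host has to be an IPv4 address rather than a hostname. A minimal operator-side sanity check could look like the sketch below. The function name is_ipv4 is hypothetical, and how endpoint_host is read out of the published module state is not covered here; the value is assumed to already be in hand.

```shell
# Hypothetical helper (not part of hyops): succeeds only for a dotted-quad IPv4 address.
is_ipv4() {
    # Reject empty input, anything other than digits and dots, and stray dots.
    case "$1" in
        ""|*[!0-9.]*|.*|*.|*..*) return 1 ;;
    esac

    # Split on dots into positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs

    [ "$#" -eq 4 ] || return 1
    for octet in "$@"; do
        # Each octet is at most three digits and no larger than 255.
        [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || return 1
    done
    return 0
}
```

For example, `is_ipv4 "$endpoint_host" || echo "endpoint_host is not usable for an A record" >&2` would flag a hostname before DNS cutback is attempted.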

Manual gate expectations

Set the manual gate only after all of these are already true:

  • managed_primary_fenced=true
  • onprem_target_rebuilt=true
  • onprem_primary_writable=true
  • failback_approved=true
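
The all-or-nothing nature of the gate can be sketched as a small operator-side check. This is only an illustration: gate_ready is a hypothetical helper, and how the four values are actually supplied to core/shared/manual-gate is not described in this runbook.

```shell
# Hypothetical helper: returns 0 only when all four gate conditions
# are passed in as name=true; reports each unmet condition on stderr.
gate_ready() {
    required="managed_primary_fenced onprem_target_rebuilt onprem_primary_writable failback_approved"
    missing=0
    for name in $required; do
        found=0
        for pair in "$@"; do
            if [ "$pair" = "$name=true" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "gate condition not met: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `gate_ready managed_primary_fenced=true onprem_target_rebuilt=true onprem_primary_writable=false failback_approved=true` fails and names onprem_primary_writable, which is the cue to leave the gate unconfirmed.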

Validate and execute

hyops blueprint validate --ref dr/postgresql-cloudsql-failback-onprem@v1
hyops blueprint preflight --env dev --ref dr/postgresql-cloudsql-failback-onprem@v1
hyops blueprint deploy --env dev --ref dr/postgresql-cloudsql-failback-onprem@v1 --execute
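
The three commands above are meant to run in order, and a failed validate or preflight should stop the run before deploy. One way to enforce that is a thin wrapper like the sketch below. The function name run_failback and its return codes are assumptions for illustration; the hyops invocations are exactly the ones listed above.

```shell
REF="dr/postgresql-cloudsql-failback-onprem@v1"

# Hypothetical wrapper: runs validate, preflight, and deploy in sequence
# and stops at the first step that fails.
run_failback() {
    env_name="$1"
    if [ -z "$env_name" ]; then
        echo "usage: run_failback <env>" >&2
        return 64
    fi
    hyops blueprint validate --ref "$REF" || return 1
    hyops blueprint preflight --env "$env_name" --ref "$REF" || return 2
    hyops blueprint deploy --env "$env_name" --ref "$REF" --execute || return 3
}
```

Calling `run_failback dev` reproduces the sequence shown above; a non-zero return code indicates which step failed (1 for validate, 2 for preflight, 3 for deploy).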

Verify

Confirm:

  • manual gate state is cap.control.manual_gate = confirmed
  • DNS cutback state is cap.network.dns_routing = ready
  • DNS status step succeeded with dns.status = live-ok
  • the published record now targets the on-prem PostgreSQL HA endpoint contract
  • application writes land only on the restored on-prem primary
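
The last two checks can also be spot-checked from an operator shell, independently of the blueprint's own status step. The sketch below assumes dig and psql are installed; the function names, the record name, the address, and the database user are all hypothetical placeholders. Note that pg_is_in_recovery() only shows that the on-prem node is acting as a primary. It does not by itself prove that no writes still reach the managed DR instance, which depends on that instance staying fenced.

```shell
# Print the A record(s) currently published for a name.
resolve_a() {
    dig +short A "$1"
}

# Succeed when the expected address is one of the resolved lines (exact match).
record_targets() {
    expected="$1"
    resolved="$2"
    printf '%s\n' "$resolved" | grep -Fxq "$expected"
}

# Succeed when the PostgreSQL node at host $1 (user $2) is not in recovery,
# i.e. pg_is_in_recovery() returns "f".
is_primary() {
    [ "$(psql -h "$1" -U "$2" -d postgres -tAc 'SELECT pg_is_in_recovery()')" = "f" ]
}
```

Example usage, with placeholder values: `record_targets 10.20.30.40 "$(resolve_a pg.example.internal)" && is_primary 10.20.30.40 postgres`.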