Skip to content

PostgreSQL HA DR Cycle

Overview

PostgreSQL HA DR Cycle shows the higher-control recovery path in the platform. The primary estate stays on-prem, recovery is rebuilt into GCP from pgBackRest, a fresh backup is pinned from the recovered cluster, and service returns on-prem through a controlled failback path.

It suits teams that want repeatable recovery proof, direct control over the database estate, and evidence that service can return cleanly.

Case study

  • Context: the on-prem PostgreSQL HA estate needed current public proof of a full recovery cycle rather than older run records alone.
  • Challenge: the platform had to prove restore into GCP, backup continuity from the recovered lane, and controlled return on-prem without relying on stale runtime state.
  • Approach: pgBackRest object storage remained the recovery backbone. dr/postgresql-ha-failover-gcp@v1 drove restore into GCP, a fresh backup was taken from the recovered cluster, and dr/postgresql-ha-failback-onprem@v1 handled the controlled return.
  • Outcome: the March 31, 2026 redrill restored the service into GCP on 10.72.16.23, pinned backup set 20260331-173211F, returned service on-prem on 10.12.0.31, and retained the full run record set.

The evidence below captures recovery into GCP, fresh backup continuity, controlled return on-prem, and final DNS truth after failback.

Covers pgBackRest restore into GCP, fresh backup from the recovered cluster, controlled failback on-prem, and final DNS truth after the full cycle.

Outcome

This recovery path proves database recovery, not just infrastructure recreation.

  • Recovery into GCP is driven from the backup and restore chain, not from assumptions about standby readiness.
  • A fresh backup is pinned from the recovered GCP cluster before failback begins.
  • Service returns on-prem through a deliberate path with final DNS truth preserved.

Operating model

  • The primary service is a self-managed on-prem PostgreSQL HA estate.
  • pgBackRest object storage is the recovery backbone.
  • GCP is used as the recovery environment for the drill.
  • Return on-prem is a planned step, not an implicit side effect.

Compared with the managed Cloud SQL path, this option keeps more database control and architectural symmetry at the cost of higher operational involvement.

The recorded redrill covers the full path from on-prem primary to GCP recovery and back again, with Patroni state, backup continuity, and final service position treated as part of the result.

Architecture

PostgreSQL HA DR cycle showing the on-prem primary writing to pgBackRest, recovery being restored into GCP, a fresh backup being pinned there, and service then returning on-prem through controlled failback.

The recovery route is driven from the backup chain, not from assumptions about standby readiness. The GCP lane is backed up again before service returns on-prem.

Recovery sequence

  1. The on-prem source and recovery repository are prepared and verified.
  2. PostgreSQL is restored into the GCP recovery environment.
  3. Patroni state and DNS cutover confirm the recovered GCP service position.
  4. A fresh GCP backup is taken from the recovered primary.
  5. Service returns on-prem and DNS truth is checked after failback.
Stage Run started State published Elapsed
Restore to GCP 2026-03-31T17:03:59Z 2026-03-31T17:30:57Z 26 min 58 sec
Fresh GCP backup 2026-03-31T17:31:52Z 2026-03-31T17:32:19Z 27 sec
Failback to on-prem 2026-03-31T17:34:58Z 2026-03-31T17:44:36Z 9 min 38 sec

The March 31, 2026 timings reflect the recorded redrill path that restored into GCP, pinned a fresh backup, and returned service on-prem.

Platform state

GCP Compute Engine instances list from the redrill, showing platform-dev-pgdr-01, platform-dev-pgdr-02, and platform-dev-pgdr-03 running in europe-west2-a GCS bucket browser for hyops-dev-pgbackrest-a1, showing the pgbackrest backup path used by the recovery cycle Proxmox search view showing dev-pgha-01, dev-pgha-02, and dev-pgha-03 running after the service returned on-prem Recorded Patroni cluster state after failback, showing one leader and two streaming replicas on the on-prem cluster

IP addresses, hostnames, and instance identifiers visible in screenshots and recordings reflect the ephemeral infrastructure provisioned during the recorded exercise.

Implementation

  • Recovery backbone: pgBackRest object storage drives restore into GCP; recovery is from the backup chain, not from a standby assumption.
  • Failover path: dr/postgresql-ha-failover-gcp@v1 restores the cluster into GCP and validates Patroni state and DNS position.
  • Backup continuity: a fresh backup is taken from the recovered GCP primary before any failback begins.
  • Failback path: dr/postgresql-ha-failback-onprem@v1 returns service on-prem with DNS cutover as a final controlled step.
  • DNS layer: platform/network/dns-routing manages the cutover record across both directions.

Where it fits

  • when the database platform itself is part of the protected service
  • when recovery proof, cluster health, and final DNS truth matter as much as rebuild speed
  • when architectural symmetry matters more than minimising ongoing database operations

Key components

  • Primary blueprint: onprem/postgresql-ha@v1
  • Failover blueprint: dr/postgresql-ha-failover-gcp@v1
  • Failback blueprint: dr/postgresql-ha-failback-onprem@v1
  • Recovery backbone: pgBackRest object storage
  • Continuity step: fresh GCP backup before failback
  • DNS cutover layer: platform/network/dns-routing

References

Further reading
Implementation references
  • platform/postgresql-ha
  • platform/postgresql-ha-backup
  • platform/network/dns-routing
  • org/gcp/object-repo#pgbackrest_primary

What was verified

Verified during the recorded March 31, 2026 self-managed PostgreSQL HA redrill, including GCP restore, fresh backup continuity, on-prem failback, Patroni streaming confirmation, and final DNS truth.