PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)

Purpose

Define a professional, deterministic DR model for PostgreSQL that:

  • avoids split-brain risk,
  • keeps operations predictable for packaged/tarball users,
  • supports decision-service-driven failover target selection,
  • scales to multi-cloud without forcing dual-primary or dual-warm complexity.

The recommended operating model:

  1. Single active primary (on-prem during normal operations).
  2. Continuous pgBackRest backup + WAL archive to one primary object repository backend (gcs or azure or s3).
  3. One warm-standby failover target (GCP or Azure), selected by policy/decision service.
  4. Optional secondary backup copy (cross-cloud object replication/copy), for repository survivability.

This gives a strong DR posture without introducing unnecessary operational risk.
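
The single-active-primary rule (item 1) can be checked mechanically before any promotion. The following is a minimal sketch, assuming each node's role has already been collected by running PostgreSQL's `SELECT pg_is_in_recovery()` on that node (true means the node is a standby). The function name `assert_single_primary` and the node names are hypothetical and not part of any shipped tooling.

```python
from typing import Mapping


def assert_single_primary(in_recovery: Mapping[str, bool]) -> str:
    """Return the name of the only primary, or raise if there is not exactly one.

    `in_recovery` maps a node name to the result of `SELECT pg_is_in_recovery()`
    on that node: False means the node accepts writes (primary), True means it
    is a standby in recovery.
    """
    primaries = sorted(name for name, standby in in_recovery.items() if not standby)
    if len(primaries) != 1:
        raise RuntimeError(
            f"expected exactly one primary, found {len(primaries)}: {primaries}"
        )
    return primaries[0]


# Hypothetical node names, for illustration only.
nodes = {"onprem-pg-1": False, "onprem-pg-2": True, "gcp-standby-1": True}
print(assert_single_primary(nodes))
```

A check of this kind could run as a gate before promotion: if it raises, the workflow stops rather than risk two writable nodes.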

DR modes

Mode A: Backup-restore DR (baseline, always required)

  • On DR event, provision/reuse target cluster and restore from pgBackRest repository.
  • Best for lower cost and simpler operations.
  • Higher RTO than warm-standby promotion.

Mode B: Warm-standby promotion DR (enterprise fast path)

  • Keep one target cloud cluster in read-only recovery posture.
  • On DR event, promote the standby and cut traffic.
  • Lower RTO, higher steady-state cost.

Mode C: Dual warm-standby (advanced, not default)

  • Two warm standbys in different clouds.
  • Not recommended by default due to cost, orchestration complexity, and increased operational blast radius.

Multi-cloud configuration

If the decision service may choose between GCP and Azure, use:

  • One active warm standby target at a time (selected by policy),
  • plus secondary backup copy to the other cloud.

Do not run dual read-only PostgreSQL clusters unless you have a strict, tested requirement and the operational capacity to support them.
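
The pairing described above (one warm standby in the selected cloud, secondary backup copy in the other) can be sketched as a small function. This is an illustrative sketch only: `plan_targets` and the keys of the returned dictionary are hypothetical names, and the two-cloud list reflects the GCP/Azure choice discussed here.

```python
# Clouds the decision service may choose between, per this section.
CLOUDS = ("gcp", "azure")


def plan_targets(selected_cloud: str) -> dict:
    """Map the selected cloud to one warm-standby target and one backup-copy target."""
    if selected_cloud not in CLOUDS:
        raise ValueError(f"unsupported cloud {selected_cloud!r}; expected one of {CLOUDS}")
    # The secondary backup copy goes to whichever cloud was not selected.
    other = next(cloud for cloud in CLOUDS if cloud != selected_cloud)
    return {
        "warm_standby_cloud": selected_cloud,
        "secondary_backup_copy_cloud": other,
    }


print(plan_targets("gcp"))
```

With `"gcp"` selected, the plan places the warm standby in GCP and the secondary backup copy in Azure; there is never more than one standby target in the result.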

Tiered DR configurations (SME, schools, enterprise)

PostgreSQL DR is available in three tiered configurations to match cost, recovery objectives, and operational capacity.

Default lane: backup-restore DR

Recommended for:

  • SMEs
  • schools
  • cost-sensitive environments
  • teams without 24x7 database specialists

Shape:

  • On-prem Patroni HA primary
  • Continuous pgBackRest backup + WAL archive to one object repository
  • Restore into self-managed cloud VMs during DR
  • Controlled failback to on-prem

Why this is the default:

  • lowest steady-state cost
  • easiest to explain and support
  • cleanest tarball-safe story
  • strongest fit for packaged deployments and reviewable drills

Premium lane: managed warm-standby DR

Recommended for:

  • enterprises with stricter RTO/RPO needs
  • customers willing to pay for lower recovery time and higher steady-state cost

Shape:

  • On-prem Patroni HA primary
  • One managed cloud PostgreSQL standby target
  • Controlled promotion during DR
  • Explicit failback by reseed or reverse replication

This is an upgrade path, not the default posture for all customers.

Advanced lane: multi-cloud resilience

Recommended only when justified by policy, budget, and operational maturity.

Shape:

  • one active warm standby target at a time
  • optional secondary backup copy to another cloud
  • decision-service-driven target selection (to be added later)

Do not make dual warm-standby or dual read-only the default product shape.

Decision service contract (target state)

The decision service should output at least:

  • dr_mode: restore | warm_promote | deny
  • dr_target_cloud: gcp | azure
  • repo_backend: gcs | azure | s3
  • enable_secondary_backup_copy: true | false
  • rationale: string
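
The field list above can be expressed as a validated record. The sketch below is one possible shape, not the service's actual schema: the class name `DecisionOutput` and the `from_payload` helper are hypothetical, and two behaviors are assumptions not stated in the contract (extra fields are ignored because the contract says "at least", and `rationale` must be non-empty).

```python
from dataclasses import dataclass, fields

DR_MODES = {"restore", "warm_promote", "deny"}
DR_TARGET_CLOUDS = {"gcp", "azure"}
REPO_BACKENDS = {"gcs", "azure", "s3"}


@dataclass(frozen=True)
class DecisionOutput:
    dr_mode: str
    dr_target_cloud: str
    repo_backend: str
    enable_secondary_backup_copy: bool
    rationale: str

    def __post_init__(self) -> None:
        if self.dr_mode not in DR_MODES:
            raise ValueError(f"invalid dr_mode: {self.dr_mode!r}")
        if self.dr_target_cloud not in DR_TARGET_CLOUDS:
            raise ValueError(f"invalid dr_target_cloud: {self.dr_target_cloud!r}")
        if self.repo_backend not in REPO_BACKENDS:
            raise ValueError(f"invalid repo_backend: {self.repo_backend!r}")
        if not isinstance(self.enable_secondary_backup_copy, bool):
            raise ValueError("enable_secondary_backup_copy must be a boolean")
        # Assumption: an empty rationale is rejected so every decision is reviewable.
        if not isinstance(self.rationale, str) or not self.rationale.strip():
            raise ValueError("rationale must be a non-empty string")

    @classmethod
    def from_payload(cls, payload: dict) -> "DecisionOutput":
        """Build from a decoded JSON object, ignoring fields beyond the minimum set."""
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in payload]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return cls(**{name: payload[name] for name in names})


decision = DecisionOutput.from_payload({
    "dr_mode": "restore",
    "dr_target_cloud": "gcp",
    "repo_backend": "gcs",
    "enable_secondary_backup_copy": True,
    "rationale": "primary site unreachable; restore lane selected by policy",
})
print(decision.dr_mode, decision.dr_target_cloud)
```

Rejecting malformed output at this boundary keeps the downstream DR workflows deterministic: they only ever see one of the enumerated values.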

HybridOps DR workflows then consume these outputs to select:

  • failover blueprint/inputs,
  • repository state reference (repo_state_ref),
  • whether to run secondary copy automation.
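
How a workflow might consume these outputs can be sketched as follows. Everything beyond the contract field names is an assumption made for illustration: the function name `select_workflow_inputs`, the blueprint naming scheme, the `repo_state_refs` lookup table, and the choice to return no plan when `dr_mode` is `deny`.

```python
from typing import Mapping, Optional


def select_workflow_inputs(
    decision: Mapping[str, object],
    repo_state_refs: Mapping[str, str],
) -> Optional[dict]:
    """Translate a decision-service output into DR workflow inputs.

    `repo_state_refs` maps a repository backend name to its repo_state_ref.
    Returns None when the decision denies failover (assumed meaning of `deny`).
    """
    if decision["dr_mode"] == "deny":
        return None
    backend = decision["repo_backend"]
    if backend not in repo_state_refs:
        raise KeyError(f"no repo_state_ref registered for backend {backend!r}")
    return {
        # Hypothetical naming scheme: "<dr_mode>-<dr_target_cloud>".
        "blueprint": f"{decision['dr_mode']}-{decision['dr_target_cloud']}",
        "repo_state_ref": repo_state_refs[backend],
        "run_secondary_copy": bool(decision["enable_secondary_backup_copy"]),
    }


# Hypothetical state references, for illustration only.
refs = {"gcs": "state/repo-gcs", "azure": "state/repo-azure"}
plan = select_workflow_inputs(
    {
        "dr_mode": "restore",
        "dr_target_cloud": "gcp",
        "repo_backend": "gcs",
        "enable_secondary_backup_copy": False,
        "rationale": "example",
    },
    refs,
)
print(plan)
```

Keeping this translation a pure function of the decision output makes it straightforward to review and to replay during drills.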

Implementation guidance

  • Treat PostgreSQL DR as two product lanes:
      • baseline self-managed restore lane
      • premium managed warm-standby lane
  • Keep existing failover/failback blueprints as deterministic restore paths.
  • Keep platform/postgresql-ha-backup as the standard backup configuration layer.
  • Keep the client-facing endpoint contract the same across both lanes:
      • endpoint_dns_name
      • endpoint_target
      • endpoint_target_type
      • endpoint_host
      • endpoint_port
      • endpoint_cutover_required
  • Keep the current GCP/Azure restore blueprints as the first shipped DR path to prove end-to-end recovery before introducing managed-DB replication.
  • Use optional pgBackRest repo2 in platform/postgresql-ha-backup for secondary backup copy:
      • secondary_enabled: true
      • secondary_repo_state_ref (recommended) or secondary_backend + secondary_* inputs
  • Keep secondary copy explicit and policy-driven, not coupled to primary DB write path.
  • Provision object repositories with provider-native Terraform resources (official AWS/GCP/Azure providers), then consume them via repo_state_ref/secondary_repo_state_ref.
  • For rebuilt clusters reusing the same backup path, keep repo_mismatch_action=fail by default and use repo_mismatch_action=reset only for controlled stanza re-initialization.
  • Treat warm-standby as an upgrade path, not mandatory for all users.
  • Treat failback from managed PostgreSQL as a controlled reseed/new-lineage operation unless and until reverse replication is explicitly implemented and tested.
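
The input rules for the backup layer (secondary copy settings and `repo_mismatch_action`) can be sketched as a validation step. This is a hedged illustration, not the module's actual logic: the function name `validate_backup_inputs` is hypothetical, and treating `secondary_enabled` as false when absent is an assumption based on the secondary copy being described as optional.

```python
from typing import Mapping

REPO_MISMATCH_ACTIONS = {"fail", "reset"}


def validate_backup_inputs(inputs: Mapping[str, object]) -> dict:
    """Check backup inputs against the guidance above and apply defaults."""
    # Default to "fail" so a rebuilt cluster never silently re-initializes a stanza.
    action = inputs.get("repo_mismatch_action", "fail")
    if action not in REPO_MISMATCH_ACTIONS:
        raise ValueError(f"repo_mismatch_action must be one of {sorted(REPO_MISMATCH_ACTIONS)}")

    # Assumption: the secondary copy is off unless explicitly enabled.
    secondary_enabled = bool(inputs.get("secondary_enabled", False))
    if secondary_enabled:
        has_state_ref = bool(inputs.get("secondary_repo_state_ref"))
        has_backend = bool(inputs.get("secondary_backend"))
        if not (has_state_ref or has_backend):
            raise ValueError(
                "secondary_enabled requires secondary_repo_state_ref "
                "(recommended) or secondary_backend"
            )
    return {"repo_mismatch_action": action, "secondary_enabled": secondary_enabled}


# Hypothetical state reference value, for illustration only.
print(validate_backup_inputs({
    "secondary_enabled": True,
    "secondary_repo_state_ref": "state/repo-secondary",
}))
```

Because `reset` must be passed explicitly, stanza re-initialization stays a deliberate, reviewable action rather than a side effect of rebuilding a cluster.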

References