PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)

Purpose

Define a professional, deterministic DR model for PostgreSQL that:

  • avoids split-brain risk,
  • keeps operations predictable for packaged/tarball users,
  • supports decision-service-driven failover target selection,
  • scales to multi-cloud without forcing dual-primary or dual-warm complexity.

The recommended default model has four parts:

  1. Single active primary (on-prem during normal operations).
  2. Continuous pgBackRest backup + WAL archive to one primary object repository backend (gcs or azure or s3).
  3. One warm-standby failover target (GCP or Azure), selected by policy/decision service.
  4. Optional secondary backup copy (cross-cloud object replication/copy), for repository survivability.

This gives a strong DR posture without introducing unnecessary operational risk.

DR modes

Mode A: Backup-restore DR (baseline, always required)

  • On DR event, provision/reuse target cluster and restore from pgBackRest repository.
  • Best for lower cost and simpler operations.
  • Higher RTO than warm-standby promotion.

Mode B: Warm-standby promotion DR (enterprise fast path)

  • Keep one target cloud cluster in read-only recovery posture.
  • On DR event, promote the standby and cut traffic.
  • Lower RTO, higher steady-state cost.

Mode C: Dual warm-standby (advanced, not default)

  • Two warm standbys in different clouds.
  • Not recommended by default due to cost, orchestration complexity, and increased operational blast radius.

Multi-cloud guidance

If the decision service may choose between GCP and Azure, use:

  • One active warm standby target at a time (selected by policy),
  • plus secondary backup copy to the other cloud.

Do not run dual read-only PostgreSQL clusters unless you have a strict, tested requirement and the operational capacity for it.
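
The "one warm standby at a time, backup copy to the other cloud" rule can be sketched as a small helper. This is a minimal illustration, not HybridOps code: the names `MultiCloudPlan` and `plan_multi_cloud` are hypothetical, and the sketch assumes the only candidate clouds are GCP and Azure, as in the guidance above.

```python
from dataclasses import dataclass

# Candidate failover clouds named in this document.
CLOUDS = ("gcp", "azure")


@dataclass(frozen=True)
class MultiCloudPlan:
    """Hypothetical result type: exactly one warm standby, one backup-copy cloud."""
    warm_standby_cloud: str
    secondary_backup_cloud: str


def plan_multi_cloud(selected_target: str) -> MultiCloudPlan:
    """Place the warm standby in the selected cloud and the backup copy in the other."""
    if selected_target not in CLOUDS:
        raise ValueError(f"unsupported target cloud: {selected_target!r}")
    other = next(cloud for cloud in CLOUDS if cloud != selected_target)
    return MultiCloudPlan(warm_standby_cloud=selected_target,
                          secondary_backup_cloud=other)


plan = plan_multi_cloud("gcp")
print(plan.warm_standby_cloud, plan.secondary_backup_cloud)
```

Because the plan type holds a single standby field, a dual warm-standby layout cannot be expressed by accident; it would need a deliberate change to the type.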

Customer-fit guidance (SME, schools, enterprise)

HybridOps should present PostgreSQL DR as tiered lanes, not one universal design.

Default lane: backup-restore DR

Recommended for:

  • SMEs
  • schools
  • cost-sensitive environments
  • teams without 24x7 database specialists

Shape:

  • On-prem Patroni HA primary
  • Continuous pgBackRest backup + WAL archive to one object repository
  • Restore into self-managed cloud VMs during DR
  • Controlled failback to on-prem

Why this is the default:

  • lowest steady-state cost
  • easiest to explain and support
  • cleanest tarball-safe story
  • strongest fit for packaged deployments and evidence-based drills

Premium lane: managed warm-standby DR

Recommended for:

  • enterprises with stricter RTO/RPO needs
  • customers willing to pay for lower recovery time and higher steady-state cost

Shape:

  • On-prem Patroni HA primary
  • One managed cloud PostgreSQL standby target
  • Controlled promotion during DR
  • Explicit failback by reseed or reverse replication

This is an upgrade path, not the default posture for all customers.

Advanced lane: multi-cloud resilience

Recommended only when justified by policy, budget, and operational maturity.

Shape:

  • one active warm standby target at a time
  • optional secondary backup copy to another cloud
  • decision-service-driven target selection in a later phase

Do not make dual warm-standby or dual read-only the default product shape.

Decision service contract (target state)

The decision service should output at least:

  • dr_mode: restore | warm_promote | deny
  • dr_target_cloud: gcp | azure
  • repo_backend: gcs | azure | s3
  • enable_secondary_backup_copy: true | false
  • rationale: string
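
As a rough sketch of how a consumer might validate this contract before acting on it, the fields above can be modelled as a frozen dataclass. The field names and allowed values come from the list above; the `DrDecision` type, the `parse_decision` helper, and the choice to reject any unknown value outright are assumptions made for illustration.

```python
from dataclasses import dataclass

# Allowed values, taken from the contract fields listed above.
DR_MODES = {"restore", "warm_promote", "deny"}
TARGET_CLOUDS = {"gcp", "azure"}
REPO_BACKENDS = {"gcs", "azure", "s3"}

REQUIRED_FIELDS = ("dr_mode", "dr_target_cloud", "repo_backend",
                   "enable_secondary_backup_copy", "rationale")


@dataclass(frozen=True)
class DrDecision:
    """Hypothetical in-memory form of the decision service output."""
    dr_mode: str
    dr_target_cloud: str
    repo_backend: str
    enable_secondary_backup_copy: bool
    rationale: str


def parse_decision(payload: dict) -> DrDecision:
    """Validate a decision payload and fail loudly on anything unexpected."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    if payload["dr_mode"] not in DR_MODES:
        raise ValueError(f"invalid dr_mode: {payload['dr_mode']!r}")
    if payload["dr_target_cloud"] not in TARGET_CLOUDS:
        raise ValueError(f"invalid dr_target_cloud: {payload['dr_target_cloud']!r}")
    if payload["repo_backend"] not in REPO_BACKENDS:
        raise ValueError(f"invalid repo_backend: {payload['repo_backend']!r}")
    if not isinstance(payload["enable_secondary_backup_copy"], bool):
        raise ValueError("enable_secondary_backup_copy must be a boolean")
    if not isinstance(payload["rationale"], str):
        raise ValueError("rationale must be a string")
    return DrDecision(**{name: payload[name] for name in REQUIRED_FIELDS})


decision = parse_decision({
    "dr_mode": "warm_promote",
    "dr_target_cloud": "gcp",
    "repo_backend": "gcs",
    "enable_secondary_backup_copy": True,
    "rationale": "example rationale text",
})
print(decision.dr_mode, decision.dr_target_cloud)
```

Rejecting unknown values instead of defaulting them keeps the workflow deterministic: a malformed decision stops the run rather than silently choosing a target.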

HyOps DR workflows then consume these outputs to select:

  • failover blueprint/inputs,
  • repository state reference (repo_state_ref),
  • whether to run secondary copy automation.
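
The selection step might look roughly like the following. This is a sketch under stated assumptions: the blueprint identifiers and the `select_workflow` helper are hypothetical placeholders rather than real HyOps names, and the mapping from `repo_backend` to a `repo_state_ref` is assumed to be supplied by the caller.

```python
from typing import Optional

# Hypothetical blueprint identifiers keyed by (dr_mode, dr_target_cloud).
# The real blueprint names are not given in this document.
BLUEPRINTS = {
    ("restore", "gcp"): "example/postgresql-restore-gcp",
    ("restore", "azure"): "example/postgresql-restore-azure",
    ("warm_promote", "gcp"): "example/postgresql-promote-gcp",
    ("warm_promote", "azure"): "example/postgresql-promote-azure",
}


def select_workflow(decision: dict, repo_state_refs: dict) -> Optional[dict]:
    """Turn a decision into workflow inputs, or None when the decision is 'deny'."""
    if decision["dr_mode"] == "deny":
        return None  # the decision service refused failover; do nothing
    key = (decision["dr_mode"], decision["dr_target_cloud"])
    if key not in BLUEPRINTS:
        raise ValueError(f"no blueprint for {key}")
    backend = decision["repo_backend"]
    if backend not in repo_state_refs:
        raise ValueError(f"no repo_state_ref known for backend {backend!r}")
    return {
        "blueprint": BLUEPRINTS[key],
        "repo_state_ref": repo_state_refs[backend],
        "run_secondary_copy": bool(decision["enable_secondary_backup_copy"]),
    }


inputs = select_workflow(
    {"dr_mode": "restore", "dr_target_cloud": "azure", "repo_backend": "gcs",
     "enable_secondary_backup_copy": False, "rationale": "example"},
    repo_state_refs={"gcs": "example-gcs-repo-state"},  # placeholder reference
)
print(inputs)
```

Returning `None` for `deny` makes the refusal explicit to the caller, so a denied failover cannot fall through to a default blueprint.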

Implementation notes for HybridOps

  • Treat PostgreSQL DR as two product lanes:
      • baseline self-managed restore lane
      • premium managed warm-standby lane
  • Keep existing failover/failback blueprints as deterministic restore paths.
  • Keep platform/onprem/postgresql-ha-backup as the standard backup configuration layer.
  • Keep the current GCP/Azure restore blueprints as the first shipped DR path to prove end-to-end recovery before introducing managed-DB replication.
  • Use optional pgBackRest repo2 in platform/onprem/postgresql-ha-backup for secondary backup copy:
      • secondary_enabled: true
      • secondary_repo_state_ref (recommended) or secondary_backend + secondary_* inputs
  • Keep secondary copy explicit and policy-driven, not coupled to primary DB write path.
  • Object repositories are provisioned with Terraform provider-native resources (AWS/GCP/Azure official providers), then consumed via repo_state_ref/secondary_repo_state_ref.
  • For rebuilt clusters reusing the same backup path, keep repo_mismatch_action=fail by default and use repo_mismatch_action=reset only for controlled stanza re-initialization.
  • Treat warm-standby as an upgrade path, not mandatory for all users.
  • Treat failback from managed PostgreSQL as a controlled reseed/new-lineage operation unless and until reverse replication is explicitly implemented and tested.
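
The repo_mismatch_action behaviour described above can be sketched as a guard that runs before a rebuilt cluster reuses an existing backup path. How HybridOps actually detects a mismatch is not stated here; comparing a stored system identifier with the rebuilt cluster's identifier is an assumption made for illustration, and `resolve_repo_mismatch` and `RepoMismatchError` are hypothetical names.

```python
from typing import Optional


class RepoMismatchError(RuntimeError):
    """Raised when the repository belongs to a different cluster lineage."""


def resolve_repo_mismatch(repo_system_id: Optional[str],
                          cluster_system_id: str,
                          repo_mismatch_action: str = "fail") -> str:
    """Decide what to do with the backup path; the default action is 'fail'."""
    if repo_mismatch_action not in ("fail", "reset"):
        raise ValueError(f"invalid repo_mismatch_action: {repo_mismatch_action!r}")
    if repo_system_id is None:
        return "create"  # nothing stored at this path yet
    if repo_system_id == cluster_system_id:
        return "reuse"  # same lineage, safe to continue archiving
    if repo_mismatch_action == "reset":
        return "reinitialize"  # controlled stanza re-initialization
    raise RepoMismatchError(
        f"repository lineage {repo_system_id} does not match cluster "
        f"{cluster_system_id}; rerun with repo_mismatch_action=reset only "
        "for a controlled re-initialization"
    )


print(resolve_repo_mismatch("1111", "1111"))
print(resolve_repo_mismatch("1111", "2222", repo_mismatch_action="reset"))
```

Keeping `fail` as the default means an operator has to opt in to `reset`, which protects an existing backup history from being overwritten by a freshly rebuilt cluster.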

References