PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)
Purpose
Define a deterministic disaster recovery (DR) model for PostgreSQL that:
- avoids split-brain risk,
- keeps operations predictable for packaged/tarball users,
- supports decision-service-driven failover target selection,
- scales to multi-cloud without forcing dual-primary or dual-warm-standby complexity.
Recommended default (current standard)
- Single active primary (on-prem during normal operations).
- Continuous pgBackRest backup + WAL archive to one primary object repository backend (gcs, azure, or s3).
- One warm-standby failover target (GCP or Azure), selected by policy/decision service.
- Optional secondary backup copy in another cloud (cross-cloud object replication or copy), so backups survive the loss of the primary repository.
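This combination provides both a fast promotion path and a restore-from-repository fallback, without the split-brain and orchestration risk of running multiple primaries or multiple standbys.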
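The repository layout above can be sketched as a small Python helper that renders a pgBackRest configuration with repo1 as the primary object repository and an optional repo2 for the secondary copy. The repo<N>-type values and the bucket/container option names follow pgBackRest's configuration reference; the function names, the stanza name, and the bucket values are hypothetical placeholders, and credentials (keys, key secrets) are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass

# Location options per backend, named as in pgBackRest's configuration
# reference (repo<N>-gcs-bucket, repo<N>-azure-container, ...).
_LOCATION_KEYS: dict[str, tuple[str, ...]] = {
    "gcs": ("gcs-bucket",),
    "azure": ("azure-account", "azure-container"),
    "s3": ("s3-bucket", "s3-endpoint", "s3-region"),
}


@dataclass
class Repo:
    backend: str                # "gcs" | "azure" | "s3"
    location: dict[str, str]    # e.g. {"gcs-bucket": "example-pg-dr"}
    path: str = "/pgbackrest"   # path inside the bucket/container


def render_repo(index: int, repo: Repo) -> list[str]:
    """Return the repo<index>-* option lines for one repository."""
    keys = _LOCATION_KEYS.get(repo.backend)
    if keys is None:
        raise ValueError(f"unsupported repo backend: {repo.backend!r}")
    missing = [k for k in keys if k not in repo.location]
    if missing:
        raise ValueError(f"repo{index} ({repo.backend}) is missing {missing}")
    lines = [f"repo{index}-type={repo.backend}", f"repo{index}-path={repo.path}"]
    lines += [f"repo{index}-{k}={repo.location[k]}" for k in keys]
    return lines


def render_config(stanza: str, pg_path: str, primary: Repo,
                  secondary: Repo | None = None) -> str:
    """Render a pgbackrest.conf with repo1 (primary) and an optional repo2."""
    lines = ["[global]", *render_repo(1, primary)]
    if secondary is not None:
        # repo2 holds the optional secondary copy in another cloud.
        lines += render_repo(2, secondary)
    lines += ["", f"[{stanza}]", f"pg1-path={pg_path}"]
    return "\n".join(lines) + "\n"
```

For example, render_config("main", "/var/lib/postgresql/16/main", Repo("gcs", {"gcs-bucket": "example-pg-dr"})) yields a [global] section with repo1-type=gcs and a [main] stanza section; PostgreSQL then ships WAL to the repository with archive_command = 'pgbackrest --stanza=main archive-push %p'.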
DR modes
Mode A: Backup-restore DR (baseline, always required)
- On a DR event, provision (or reuse) the target cluster and restore it from the pgBackRest repository.
- Best fit where low cost and simple operations matter most.
- Higher RTO than warm-standby promotion.
Mode B: Warm-standby promotion DR (enterprise fast path)
- Keep one target cloud cluster running as a read-only standby in continuous recovery.
- On a DR event, promote the standby and cut traffic over to it.
- Lower RTO, higher steady-state cost.
Mode C: Dual warm-standby (advanced, not default)
- Two warm standbys in different clouds.
- Not recommended by default due to cost, orchestration complexity, and increased operational blast radius.
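The operational difference between Mode A and Mode B can be sketched as the commands each one issues. This is a minimal sketch, assuming a self-managed PostgreSQL target that Patroni does not control and a pgBackRest stanza already configured on it: Mode A uses pgBackRest's restore command (with --delta, so a reused data directory does not have to be emptied first) and Mode B uses pg_ctl promote. The function name, the dual_warm mode label, and the stanza and path values are hypothetical; a Patroni-managed cluster or a managed cloud standby is promoted through its own tooling, and traffic cutover is a separate step.

```python
from __future__ import annotations

import shlex


def dr_plan(mode: str, stanza: str, data_dir: str) -> list[str]:
    """Return the shell commands a DR run would issue for the given mode.

    Hypothetical helper: it only builds command strings and runs nothing.
    Assumes a self-managed PostgreSQL target that Patroni does not control.
    """
    stanza_q = shlex.quote(stanza)
    data_q = shlex.quote(data_dir)
    if mode == "restore":
        # Mode A: rebuild the data directory from the pgBackRest repository.
        # --delta lets pgBackRest reuse a non-empty data directory (the
        # "reuse target cluster" case) instead of requiring an empty one.
        return [
            f"pgbackrest --stanza={stanza_q} --delta restore",
            # On start, PostgreSQL replays archived WAL through the
            # restore_command pgBackRest configured, then opens for writes.
            f"pg_ctl start -D {data_q}",
        ]
    if mode == "warm_promote":
        # Mode B: the standby is already in recovery; promotion ends
        # recovery and makes it the writable primary.
        return [f"pg_ctl promote -D {data_q}"]
    if mode == "dual_warm":
        # Mode C is deliberately not offered as a default DR mode.
        raise ValueError("dual warm-standby is not a default DR mode")
    raise ValueError(f"unknown DR mode: {mode!r}")
```

The sketch makes the RTO trade-off visible: Mode B is a single promotion step, while Mode A must first copy data out of the repository and replay WAL.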
Multi-cloud configuration
If the decision service may choose between GCP and Azure, use:
- one active warm-standby target at a time (selected by policy),
- plus a secondary backup copy in the other cloud.
Do not run dual read-only PostgreSQL clusters by default unless you have a strict, tested requirement and the operational capacity to run them.
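The one-standby-at-a-time rule can be expressed as a validation step that a provisioning workflow might run before applying a topology. This Python sketch is illustrative only: Topology, plan_topology, and validate are hypothetical names, and the only rules encoded are the two stated above (a single warm standby in the selected cloud, and the secondary backup copy in the other cloud).

```python
from __future__ import annotations

from dataclasses import dataclass

CLOUDS = ("gcp", "azure")  # candidate clouds named in this document


@dataclass(frozen=True)
class Topology:
    warm_standby_clouds: tuple[str, ...]  # clouds running a warm standby
    secondary_backup_cloud: str | None    # cloud holding the secondary copy


def plan_topology(selected_cloud: str) -> Topology:
    """One warm standby in the selected cloud; secondary copy in the other."""
    if selected_cloud not in CLOUDS:
        raise ValueError(f"unsupported cloud: {selected_cloud!r}")
    other = next(c for c in CLOUDS if c != selected_cloud)
    return Topology(warm_standby_clouds=(selected_cloud,),
                    secondary_backup_cloud=other)


def validate(topology: Topology, allow_dual_standby: bool = False) -> None:
    """Raise if a topology breaks the default multi-cloud rules."""
    if len(topology.warm_standby_clouds) > 1 and not allow_dual_standby:
        raise ValueError("more than one warm standby needs an explicit, tested exception")
    if topology.secondary_backup_cloud in topology.warm_standby_clouds:
        raise ValueError("secondary backup copy must be in a different cloud than the standby")
```

The allow_dual_standby flag is the explicit opt-in the text calls for: dual standbys are possible, but never the silent default.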
Tiered DR configurations (SME, schools, enterprise)
PostgreSQL DR is available in three tiered configurations to match cost, recovery objectives, and operational capacity.
Default lane: backup-restore DR
Recommended for:
- SMEs
- schools
- cost-sensitive environments
- teams without 24x7 database specialists
Shape:
- On-prem Patroni HA primary
- Continuous pgBackRest backup + WAL archive to one object repository
- Restore into self-managed cloud VMs during DR
- Controlled failback to on-prem
Why this is the default:
- lowest steady-state cost
- easiest to explain and support
- cleanest tarball-safe story
- strongest fit for packaged deployments and reviewable drills
Premium lane: managed warm-standby DR
Recommended for:
- enterprises with stricter RTO/RPO needs
- customers willing to accept higher steady-state cost in exchange for lower recovery time
Shape:
- On-prem Patroni HA primary
- One managed cloud PostgreSQL standby target
- Controlled promotion during DR
- Explicit failback by reseed or reverse replication
This is an upgrade path, not the default posture for all customers.
Advanced lane: multi-cloud resilience
Recommended only when justified by policy, budget, and operational maturity.
Shape:
- one active warm standby target at a time
- optional secondary backup copy to another cloud
- decision-service-driven target selection (planned for a later phase)
Do not make dual warm-standby or dual read-only the default product shape.
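The lane guidance above can be sketched as a selection function that stays on the default lane unless a customer both needs faster recovery and accepts the standby cost. This is an assumption-laden sketch, not a published selection rule: the three boolean inputs are hypothetical stand-ins for the qualitative "recommended for" criteria, and the Lane labels are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    DEFAULT_RESTORE = "backup-restore"
    PREMIUM_WARM_STANDBY = "managed-warm-standby"
    ADVANCED_MULTI_CLOUD = "multi-cloud"


@dataclass(frozen=True)
class CustomerProfile:
    # Hypothetical inputs standing in for the qualitative criteria above.
    needs_fast_recovery: bool    # RTO/RPO stricter than a restore can meet
    accepts_standby_cost: bool   # willing to carry steady-state standby cost
    multi_cloud_justified: bool  # policy, budget, and maturity all sign off


def select_lane(p: CustomerProfile) -> Lane:
    """Pick the least complex lane that satisfies the profile."""
    if not (p.needs_fast_recovery and p.accepts_standby_cost):
        # Default lane, including customers who want faster recovery but
        # will not pay for an always-on standby.
        return Lane.DEFAULT_RESTORE
    if p.multi_cloud_justified:
        return Lane.ADVANCED_MULTI_CLOUD
    return Lane.PREMIUM_WARM_STANDBY
```

Ordering the checks from simplest to most complex lane mirrors the text: warm standby is an upgrade path, and multi-cloud only follows when it is separately justified.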
Decision service contract (target state)
The decision service should output at least:
- dr_mode: restore | warm_promote | deny
- dr_target_cloud: gcp | azure
- repo_backend: gcs | azure | s3
- enable_secondary_backup_copy: true | false
- rationale: string
HybridOps DR workflows then consume these outputs to select:
- failover blueprint/inputs,
- repository state reference (repo_state_ref),
- whether to run secondary copy automation.
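A consuming workflow can validate the decision payload against this contract before acting on it. This minimal Python sketch checks the listed fields and their allowed values, ignores extra keys (the contract says "at least"), and maps a decision to workflow inputs. The blueprint naming scheme and the workflow_inputs shape are hypothetical, and repo_state_ref resolution is left out because its format is not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Allowed values, copied from the decision service contract above.
DR_MODES = frozenset({"restore", "warm_promote", "deny"})
TARGET_CLOUDS = frozenset({"gcp", "azure"})
REPO_BACKENDS = frozenset({"gcs", "azure", "s3"})
REQUIRED_FIELDS = ("dr_mode", "dr_target_cloud", "repo_backend",
                   "enable_secondary_backup_copy", "rationale")


@dataclass(frozen=True)
class DrDecision:
    dr_mode: str
    dr_target_cloud: str
    repo_backend: str
    enable_secondary_backup_copy: bool
    rationale: str


def parse_decision(raw: dict[str, Any]) -> DrDecision:
    """Validate a decision payload; extra keys are ignored."""
    missing = [f for f in REQUIRED_FIELDS if f not in raw]
    if missing:
        raise ValueError(f"decision is missing fields: {missing}")
    d = DrDecision(**{f: raw[f] for f in REQUIRED_FIELDS})
    if d.dr_mode not in DR_MODES:
        raise ValueError(f"invalid dr_mode: {d.dr_mode!r}")
    if d.dr_target_cloud not in TARGET_CLOUDS:
        raise ValueError(f"invalid dr_target_cloud: {d.dr_target_cloud!r}")
    if d.repo_backend not in REPO_BACKENDS:
        raise ValueError(f"invalid repo_backend: {d.repo_backend!r}")
    if not isinstance(d.enable_secondary_backup_copy, bool):
        raise ValueError("enable_secondary_backup_copy must be a boolean")
    if not isinstance(d.rationale, str) or not d.rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    return d


def workflow_inputs(d: DrDecision) -> dict[str, Any] | None:
    """Translate a decision into workflow inputs; None means DR is denied."""
    if d.dr_mode == "deny":
        return None
    return {
        # Hypothetical naming scheme for the blueprint to run.
        "blueprint": f"postgresql-dr-{d.dr_mode}-{d.dr_target_cloud}",
        "repo_backend": d.repo_backend,
        "run_secondary_copy": d.enable_secondary_backup_copy,
    }
```

Rejecting unknown values up front keeps the DR workflow deterministic: an unexpected decision fails validation instead of selecting an untested path.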
Implementation guidance
- Treat PostgreSQL DR as two product lanes:
  - baseline self-managed restore lane
  - premium managed warm-standby lane
- Keep existing failover/failback blueprints as deterministic restore paths.
- Keep platform/postgresql-ha-backup as the standard backup configuration layer.
- Keep the client-facing endpoint contract the same across both lanes:
  - endpoint_dns_name
  - endpoint_target
  - endpoint_target_type
  - endpoint_host
  - endpoint_port
  - endpoint_cutover_required
- Keep the current GCP/Azure restore blueprints as the first shipped DR path to prove end-to-end recovery before introducing managed-DB replication.
- Use the optional pgBackRest repo2 in platform/postgresql-ha-backup for the secondary backup copy:
  - secondary_enabled: true
  - secondary_repo_state_ref (recommended), or secondary_backend + secondary_* inputs
- Keep the secondary copy explicit and policy-driven, not coupled to the primary DB write path.
- Object repositories are provisioned with Terraform provider-native resources (official AWS/GCP/Azure providers), then consumed via repo_state_ref / secondary_repo_state_ref.
- For rebuilt clusters that reuse the same backup path, keep repo_mismatch_action=fail by default, and use repo_mismatch_action=reset only for controlled stanza re-initialization.
- Treat warm standby as an upgrade path, not mandatory for all users.
- Treat failback from managed PostgreSQL as a controlled reseed/new-lineage operation unless and until reverse replication is explicitly implemented and tested.
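Keeping the endpoint contract identical across lanes can be enforced with a simple conformance check over each lane's outputs. In this sketch the six field names come from the list above; the lane labels, the type expectations (an integer port, a boolean cutover flag), and the function name are assumptions.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any

# Field names from the endpoint contract above; every lane must emit all of them.
ENDPOINT_CONTRACT = (
    "endpoint_dns_name",
    "endpoint_target",
    "endpoint_target_type",
    "endpoint_host",
    "endpoint_port",
    "endpoint_cutover_required",
)


def contract_violations(
    lane_outputs: Mapping[str, Mapping[str, Any]],
) -> dict[str, list[str]]:
    """Return {lane: [problems]} for every lane whose outputs break the contract."""
    problems: dict[str, list[str]] = {}
    for lane, outputs in lane_outputs.items():
        issues = [f"missing {f}" for f in ENDPOINT_CONTRACT if f not in outputs]
        port = outputs.get("endpoint_port")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if port is not None and (isinstance(port, bool) or not isinstance(port, int)):
            issues.append("endpoint_port must be an integer")
        cutover = outputs.get("endpoint_cutover_required")
        if cutover is not None and not isinstance(cutover, bool):
            issues.append("endpoint_cutover_required must be a boolean")
        if issues:
            problems[lane] = issues
    return problems
```

An empty result means clients can consume either lane's outputs unchanged, which is the point of a shared contract.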
References
- Repeatable PostgreSQL App-Data DR Drill
- Cleanup the PostgreSQL App-Data DR Validation Lanes
- Establish PostgreSQL Cloud SQL Standby in GCP (HyOps Blueprint)
- Promote PostgreSQL Cloud SQL DR in GCP (HyOps Blueprint)
- Failback PostgreSQL Cloud SQL DR to On-Prem (HyOps Blueprint)
- Failover PostgreSQL HA to GCP (HyOps Blueprint)
- Failback PostgreSQL HA to On-Prem (HyOps Blueprint)
- Runbook - Operate PostgreSQL HA Backup (pgBackRest) (HyOps)
- PostgreSQL DR Product Lanes Standard