PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)¶
Purpose¶
Define a professional, deterministic DR model for PostgreSQL that:
- avoids split-brain risk,
- keeps operations predictable for packaged/tarball users,
- supports decision-service-driven failover target selection,
- scales to multi-cloud without forcing dual-primary or dual-warm complexity.
Recommended default (current standard)¶
- Single active primary (on-prem during normal operations).
- Continuous pgBackRest backup + WAL archive to one primary object repository backend (gcs or azure or s3).
- One warm-standby failover target (GCP or Azure), selected by policy/decision service.
- Optional secondary backup copy (cross-cloud object replication/copy), for repository survivability.
This gives a strong DR posture (a restore path that is always available, plus one fast promotion target) without the operational risk of running multiple standbys.
DR modes¶
Mode A: Backup-restore DR (baseline, always required)¶
- On DR event, provision/reuse target cluster and restore from pgBackRest repository.
- Best for lower cost and simpler operations.
- Higher RTO than warm-standby promotion.
Mode B: Warm-standby promotion DR (enterprise fast path)¶
- Keep one target cloud cluster in read-only recovery posture.
- On DR event, promote the standby and cut traffic.
- Lower RTO, higher steady-state cost.
Mode C: Dual warm-standby (advanced, not default)¶
- Two warm standbys in different clouds.
- Not recommended by default due to cost, orchestration complexity, and increased operational blast radius.
Multi-cloud guidance¶
If decision service may choose between GCP and Azure, use:
- One active warm standby target at a time (selected by policy),
- plus secondary backup copy to the other cloud.
Do not run dual read-only PostgreSQL clusters unless you have a strict, tested requirement and the operational capacity for it.
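The rule above (one warm standby in the selected cloud, backup copy in the other) can be sketched as a small helper. This is a minimal illustration, not HybridOps code: `MultiCloudPlan` and `plan_multi_cloud` are hypothetical names, and the sketch assumes only the two clouds named in this section.

```python
from dataclasses import dataclass

# Clouds the decision service may choose between, per the guidance above.
SUPPORTED_CLOUDS = ("gcp", "azure")


@dataclass(frozen=True)
class MultiCloudPlan:
    """Hypothetical result type: where the standby runs and where the copy goes."""
    warm_standby_cloud: str      # the single active warm-standby target
    secondary_backup_cloud: str  # receives a backup copy only, no running standby


def plan_multi_cloud(selected_cloud: str) -> MultiCloudPlan:
    """Place the warm standby in the selected cloud and the backup copy in the other."""
    if selected_cloud not in SUPPORTED_CLOUDS:
        raise ValueError(f"unsupported DR target cloud: {selected_cloud!r}")
    other_cloud = next(c for c in SUPPORTED_CLOUDS if c != selected_cloud)
    return MultiCloudPlan(
        warm_standby_cloud=selected_cloud,
        secondary_backup_cloud=other_cloud,
    )
```

Calling `plan_multi_cloud("gcp")` yields a plan with the standby in GCP and the secondary backup copy in Azure, so at no point are two read-only clusters running.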
Customer-fit guidance (SME, schools, enterprise)¶
HybridOps should present PostgreSQL DR as tiered lanes, not one universal design.
Default lane: backup-restore DR¶
Recommended for:
- SMEs
- schools
- cost-sensitive environments
- teams without 24x7 database specialists
Shape:
- On-prem Patroni HA primary
- Continuous pgBackRest backup + WAL archive to one object repository
- Restore into self-managed cloud VMs during DR
- Controlled failback to on-prem
Why this is the default:
- lowest steady-state cost
- easiest to explain and support
- cleanest tarball-safe story
- strongest fit for packaged deployments and evidence-based drills
Premium lane: managed warm-standby DR¶
Recommended for:
- enterprises with stricter RTO/RPO needs
- customers willing to pay for lower recovery time and higher steady-state cost
Shape:
- On-prem Patroni HA primary
- One managed cloud PostgreSQL standby target
- Controlled promotion during DR
- Explicit failback by reseed or reverse replication
This is an upgrade path, not the default posture for all customers.
Advanced lane: multi-cloud resilience¶
Recommended only when justified by policy, budget, and operational maturity.
Shape:
- one active warm standby target at a time
- optional secondary backup copy to another cloud
- decision-service-driven target selection (to be introduced later)
Do not make dual warm-standby or dual read-only the default product shape.
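The lane guidance above can be read as a simple decision order. The sketch below is one possible encoding, assuming that each criterion reduces to a yes/no answer. The function name, parameter names, and lane labels are hypothetical, and a real assessment would weigh more than three flags.

```python
def recommend_dr_lane(
    strict_rto_rpo: bool,
    accepts_higher_steady_state_cost: bool,
    multi_cloud_justified: bool,
) -> str:
    """Return a lane label following the tiering described above.

    multi_cloud_justified stands for "justified by policy, budget, and
    operational maturity"; how that is assessed is outside this sketch.
    """
    # Default lane: anything without both a strict RTO/RPO need and the
    # willingness to pay for a standing target stays on backup-restore.
    if not (strict_rto_rpo and accepts_higher_steady_state_cost):
        return "backup-restore"
    # Advanced lane: assumed here to build on the warm-standby lane,
    # and chosen only when explicitly justified.
    if multi_cloud_justified:
        return "multi-cloud"
    return "managed-warm-standby"
```

Every path that does not clearly qualify for an upgrade falls back to `"backup-restore"`, which matches the intent that the baseline lane is the default.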
Decision service contract (target state)¶
Decision service should output at least:
- dr_mode: restore|warm_promote|deny
- dr_target_cloud: gcp|azure
- repo_backend: gcs|azure|s3
- enable_secondary_backup_copy: true|false
- rationale: string
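A consumer of this contract will usually want to reject malformed output before acting on it. The following is a minimal validation sketch using only the field names and allowed values listed above. `DrDecision` and `parse_decision` are hypothetical names, and requiring a non-empty rationale is an assumption, not part of the stated contract.

```python
from dataclasses import dataclass, fields

DR_MODES = ("restore", "warm_promote", "deny")
DR_TARGET_CLOUDS = ("gcp", "azure")
REPO_BACKENDS = ("gcs", "azure", "s3")


def _require_one_of(name: str, value: str, allowed: tuple) -> None:
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")


@dataclass(frozen=True)
class DrDecision:
    """Hypothetical in-memory form of the decision service output."""
    dr_mode: str
    dr_target_cloud: str
    repo_backend: str
    enable_secondary_backup_copy: bool
    rationale: str

    def __post_init__(self) -> None:
        _require_one_of("dr_mode", self.dr_mode, DR_MODES)
        _require_one_of("dr_target_cloud", self.dr_target_cloud, DR_TARGET_CLOUDS)
        _require_one_of("repo_backend", self.repo_backend, REPO_BACKENDS)
        if not isinstance(self.enable_secondary_backup_copy, bool):
            raise TypeError("enable_secondary_backup_copy must be a boolean")
        # Assumption: an empty rationale is treated as invalid.
        if not isinstance(self.rationale, str) or not self.rationale.strip():
            raise ValueError("rationale must be a non-empty string")


def parse_decision(payload: dict) -> DrDecision:
    """Build a DrDecision from a dict, ignoring extra keys ("at least" these fields)."""
    expected = [f.name for f in fields(DrDecision)]
    missing = sorted(set(expected) - set(payload))
    if missing:
        raise ValueError(f"decision is missing fields: {missing}")
    return DrDecision(**{name: payload[name] for name in expected})
```

Extra keys are ignored rather than rejected because the contract is phrased as a minimum set of outputs.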
HyOps DR workflows then consume these outputs to select:
- failover blueprint/inputs,
- repository state reference (repo_state_ref),
- whether to run secondary copy automation.
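How a workflow might turn a decision into those three selections is sketched below. It is illustrative only. The blueprint naming scheme, the `repo_state_refs` lookup table, and the returned keys are assumptions made for the example, not actual HyOps identifiers.

```python
def select_workflow_inputs(decision: dict, repo_state_refs: dict) -> dict:
    """Map a decision-service output to DR workflow selections.

    decision        -- dict with the contract fields (dr_mode, dr_target_cloud, ...)
    repo_state_refs -- hypothetical lookup: repo backend name -> repo_state_ref value
    """
    if decision["dr_mode"] == "deny":
        # A denied decision selects nothing; surface the rationale instead.
        return {"run_failover": False, "reason": decision["rationale"]}

    backend = decision["repo_backend"]
    if backend not in repo_state_refs:
        raise KeyError(f"no repo_state_ref known for backend {backend!r}")

    return {
        "run_failover": True,
        # Hypothetical naming scheme: "<mode>-<cloud>", e.g. "restore-azure".
        "blueprint": f"{decision['dr_mode']}-{decision['dr_target_cloud']}",
        "repo_state_ref": repo_state_refs[backend],
        "run_secondary_copy": bool(decision["enable_secondary_backup_copy"]),
    }
```

Keeping this mapping a pure function of the decision makes the workflow deterministic: the same decision always selects the same blueprint and repository reference.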
Implementation notes for HybridOps¶
- Treat PostgreSQL DR as two product lanes:
  - baseline self-managed restore lane
  - premium managed warm-standby lane
- Keep existing failover/failback blueprints as deterministic restore paths.
- Keep platform/onprem/postgresql-ha-backup as the standard backup configuration layer.
- Keep the current GCP/Azure restore blueprints as the first shipped DR path to prove end-to-end recovery before introducing managed-DB replication.
- Use optional pgBackRest repo2 in platform/onprem/postgresql-ha-backup for secondary backup copy:
  - secondary_enabled: true
  - secondary_repo_state_ref (recommended) or secondary_backend + secondary_* inputs
- Keep secondary copy explicit and policy-driven, not coupled to primary DB write path.
- Object repositories are provisioned with Terraform provider-native resources (AWS/GCP/Azure official providers), then consumed via repo_state_ref / secondary_repo_state_ref.
- For rebuilt clusters reusing the same backup path, keep repo_mismatch_action=fail by default and use repo_mismatch_action=reset only for controlled stanza re-initialization.
- Treat warm-standby as an upgrade path, not mandatory for all users.
- Treat failback from managed PostgreSQL as a controlled reseed/new-lineage operation unless and until reverse replication is explicitly implemented and tested.
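Two of the notes above (secondary copy inputs and the repo_mismatch_action default) lend themselves to a pre-flight check. The sketch below assumes the module inputs arrive as a plain dict. `validate_backup_inputs` and the `reset_acknowledged` flag are hypothetical, introduced only to illustrate keeping `reset` a deliberate, controlled choice.

```python
def validate_backup_inputs(inputs: dict) -> list:
    """Return a list of problems found in backup module inputs (empty if none)."""
    errors = []

    # repo_mismatch_action defaults to "fail"; "reset" is for controlled
    # stanza re-initialization only.
    action = inputs.get("repo_mismatch_action", "fail")
    if action not in ("fail", "reset"):
        errors.append(f"repo_mismatch_action must be 'fail' or 'reset', got {action!r}")
    elif action == "reset" and not inputs.get("reset_acknowledged", False):
        # Hypothetical guard: require an explicit acknowledgement for reset.
        errors.append("repo_mismatch_action=reset requires reset_acknowledged=true")

    # Secondary copy needs either a state reference (recommended) or an
    # explicit backend; the accompanying secondary_* inputs are not checked here.
    if inputs.get("secondary_enabled", False):
        has_state_ref = bool(inputs.get("secondary_repo_state_ref"))
        has_backend = bool(inputs.get("secondary_backend"))
        if not (has_state_ref or has_backend):
            errors.append(
                "secondary_enabled=true requires secondary_repo_state_ref "
                "or secondary_backend"
            )

    return errors
```

Returning a list of problems instead of raising on the first one lets an operator see every misconfiguration in a single run.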