Use GCP Cloud SQL as the Managed PostgreSQL DR Target
Status
Accepted: GCP Cloud SQL (PostgreSQL 15) is the managed target for on-prem-to-GCP DR failover. Self-managed Patroni on GCP VMs is not used for DR; it remains available as an option for burst workloads that require identical cluster topology to on-prem.
1. Context
ADR-0501 established Patroni-managed PostgreSQL HA as the on-prem primary database platform. When a DR failover to GCP is required (planned or emergency), the platform requires a PostgreSQL target that:
- Is reachable via private IP from the GCP WAN hub (VPC peering or Cloud SQL Auth Proxy).
- Can receive WAL-based replication or be seeded from a pgBackRest backup.
- Does not require the platform team to operate Patroni, etcd, or cluster member provisioning inside GCP during an incident.
- Has a predictable cost model so DR drills can be authorised against a cost budget (ADR-0801).
Two options were evaluated:
Option A: Cloud SQL (managed)
GCP-managed PostgreSQL instance. Provisioned via org/gcp/cloudsql-postgresql. External replica readiness assessed via org/gcp/cloudsql-external-replica.
Option B: Self-managed Patroni on GCP VMs
Identical topology to on-prem HA. Provisioned manually or via Terraform + Ansible during a DR event.
2. Decision
Use Cloud SQL (Option A) for all GCP DR targets.
The three DR blueprints that implement this decision:
| Blueprint | Purpose |
|---|---|
| dr/postgresql-cloudsql-standby-gcp@v1 | Provision Cloud SQL instance and assess external replica readiness |
| dr/postgresql-cloudsql-promote-gcp@v1 | Promote Cloud SQL to primary and confirm application failover |
| dr/postgresql-cloudsql-failback-onprem@v1 | Seed on-prem HA from Cloud SQL state and revert DNS/connection |
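For orientation, a minimal sketch of how one of these blueprints might be dispatched, assuming the GitHub Actions orchestration model from ADR-0701. The workflow layout, input names, and the run-blueprint entry point are illustrative assumptions, not the blueprints' actual interface:

```yaml
# Hypothetical dispatch of the standby blueprint as a GitHub Actions
# workflow (ADR-0701). Input names and the entry-point command are
# assumptions, not the blueprint's real interface.
name: dr-postgresql-cloudsql-standby-gcp
on:
  workflow_dispatch:
    inputs:
      environment:
        description: Target environment
        required: true
      cost_ack:
        description: Cost acknowledgement token (ADR-0801)
        required: true
jobs:
  standby:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Provision Cloud SQL, then assess external replica readiness.
      - name: Run blueprint (entry point is hypothetical)
        run: |
          ./run-blueprint dr/postgresql-cloudsql-standby-gcp@v1 \
            --environment "${{ inputs.environment }}" \
            --cost-ack "${{ inputs.cost_ack }}"
```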
3. Rationale
Against Option B (self-managed Patroni in GCP)
Running Patroni in GCP during a DR event adds operational scope at exactly the moment operator capacity is under pressure. Standing up etcd, bootstrapping cluster members, and validating replication adds 30–60 minutes to RTO. Keeping a standby Patroni cluster idle in GCP to avoid this adds ~$200–400/month in VM costs for a scenario that is exercised infrequently.
For Option A (Cloud SQL)
- Provisioning time: org/gcp/cloudsql-postgresql applies in under five minutes. A pre-provisioned standby instance (kept stopped between drills) costs under $20/month on a db-g1-small tier.
- No cluster management: GCP handles HA, automatic failover within Cloud SQL, and patch maintenance. The operator manages the data, not the cluster.
- pgBackRest integration: the platform/onprem/postgresql-ha-backup module writes WAL archives to a GCS bucket (org/gcp/pgbackrest-repo). Cloud SQL can be seeded from this bucket using the dr/postgresql-cloudsql-standby-gcp@v1 blueprint.
- External replica assessment: org/gcp/cloudsql-external-replica probes whether the Cloud SQL instance is in a state that supports promotion. This produces a structured state record used as a preflight gate in the promote blueprint.
- Cost signal: Cloud SQL instance cost is surfaced via the cost-signal framework (ADR-0801), so DR drills are not authorised without a cost acknowledgement.
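The exact shape of that state record is not defined in this ADR; as a hedged illustration, it might resemble the following (all field names are assumptions, only the ok/not-ok gating role is implied above):

```yaml
# Hypothetical structured state record emitted by
# org/gcp/cloudsql-external-replica. Field names are illustrative.
module: org/gcp/cloudsql-external-replica
state: ok                      # preflight gate for the promote blueprint
checks:
  instance_running: true
  major_version: 15            # must match the on-prem Patroni baseline
  replication_current: true    # WAL from the GCS archive fully applied
  private_ip_reachable: true
```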
Networking constraint
Cloud SQL private IP access from the on-prem or Hetzner environment requires the GCP hub VPC to be peered with the Cloud SQL service producer network via servicenetworking.googleapis.com (private services access). This peering is a prerequisite of org/gcp/wan-hub-network. Do not attempt Cloud SQL failover without confirming that org/gcp/wan-hub-network state is ok and the google-managed-services-* peering is established.
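A minimal preflight sketch for that check, assuming a GitHub Actions step per ADR-0701. The network name wan-hub is a placeholder; the gcloud invocation itself is a real command that lists service-networking peerings on the VPC:

```yaml
# Sketch of a networking preflight step. Verify the
# google-managed-services-* peer before any failover attempt.
- name: Confirm Cloud SQL VPC peering before failover
  run: |
    gcloud services vpc-peerings list \
      --network=wan-hub \
      --service=servicenetworking.googleapis.com
    # Operators should confirm the peering is present and ACTIVE
    # before continuing with the promote blueprint.
```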
Compatibility constraint
Cloud SQL PostgreSQL major version must match the on-prem Patroni cluster major version. Current baseline: PostgreSQL 15. Version upgrades to the on-prem cluster must trigger a Cloud SQL version review before the next DR drill.
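One way to automate that review is a parity check before each drill. A sketch, assuming connection strings for both sides are available as secrets; the secret names are placeholders, while the psql queries are standard PostgreSQL:

```yaml
# Sketch: fail fast on a major-version mismatch. ONPREM_DSN and
# CLOUDSQL_DSN are placeholder secrets. server_version_num is
# e.g. 150006 for PostgreSQL 15, so the first two digits are the
# major version.
- name: Check PostgreSQL major version parity
  run: |
    onprem=$(psql "$ONPREM_DSN" -Atc "SHOW server_version_num" | cut -c1-2)
    cloudsql=$(psql "$CLOUDSQL_DSN" -Atc "SHOW server_version_num" | cut -c1-2)
    if [ "$onprem" != "$cloudsql" ]; then
      echo "Major version mismatch: on-prem=$onprem cloudsql=$cloudsql" >&2
      exit 1
    fi
  env:
    ONPREM_DSN: ${{ secrets.ONPREM_DSN }}
    CLOUDSQL_DSN: ${{ secrets.CLOUDSQL_DSN }}
```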
4. Consequences
- org/gcp/cloudsql-postgresql and org/gcp/pgbackrest-repo are required in every environment that participates in GCP DR. They must reach state ok before the DR blueprints can run.
- Cloud SQL instances must be kept in a known, provisionable state between drills. Use skip_if_state_ok: true on the standby blueprint step (see the sketch after this list) so idle instances are not reprovisioned unnecessarily.
- The on-prem pgBackRest stanza must archive to the GCS bucket continuously, not only during DR events. A break in WAL archiving invalidates the Cloud SQL restore path.
- Application connection strings must use the HA proxy or a DNS name that can be repointed at cutover. Hard-coded on-prem IP addresses in application config block DR.
- Failback (dr/postgresql-cloudsql-failback-onprem@v1) requires the on-prem Patroni cluster to be in a state that can accept a pgBackRest restore. Confirm postgresql-ha state before initiating failback.
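A sketch of the standby step with that guard, assuming the blueprint steps are declared in YAML; every key except skip_if_state_ok is an illustrative assumption about the blueprint format:

```yaml
# Sketch of the standby blueprint step with the idempotence guard
# named in this ADR. Keys other than skip_if_state_ok are assumed.
- id: cloudsql-standby
  module: org/gcp/cloudsql-postgresql
  skip_if_state_ok: true   # leave an idle, healthy instance untouched
```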
5. Implementation notes
Provisioning order within an environment (a manifest sketch follows the list):
1. org/gcp/project-factory: GCP project.
2. org/gcp/wan-hub-network: hub VPC with Cloud SQL VPC peering.
3. org/gcp/pgbackrest-repo: GCS bucket for WAL archives.
4. org/gcp/cloudsql-postgresql: Cloud SQL instance (private IP, PostgreSQL 15).
5. platform/onprem/postgresql-ha-backup: on-prem stanza configured to archive to the GCS bucket above.
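The same ordering expressed as an environment manifest; the manifest format is an assumption, but the module names and their order come straight from the list above:

```yaml
# Hypothetical environment manifest; format assumed, order per ADR.
modules:
  - org/gcp/project-factory              # 1. GCP project
  - org/gcp/wan-hub-network              # 2. hub VPC + Cloud SQL peering
  - org/gcp/pgbackrest-repo              # 3. GCS bucket for WAL archives
  - org/gcp/cloudsql-postgresql          # 4. private IP, PostgreSQL 15
  - platform/onprem/postgresql-ha-backup # 5. archive to the bucket above
```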
The cloudsql-external-replica assessment module runs as step 1 of the promote blueprint; it is not applied independently in normal operations.
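A sketch of that ordering, again assuming YAML blueprint definitions. The step ids, the requires key, and the promote step's module name are illustrative; only the assessment-first ordering is stated in this ADR:

```yaml
# Sketch of the promote blueprint's opening steps. Only the
# assessment-first ordering comes from this ADR; other names
# are hypothetical.
steps:
  - id: assess-replica
    module: org/gcp/cloudsql-external-replica  # step 1: preflight gate
  - id: promote
    module: org/gcp/cloudsql-promote           # hypothetical promote step
    requires: assess-replica                   # runs only if assessment is ok
```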
References
- ADR-0501 – PostgreSQL on Dedicated VM with DR Replication
- ADR-0801 – Cost as a First-Class Signal for DR and Cloud Bursting
- ADR-0701 – GitHub Actions as Stateless DR Orchestrator
- Module: org/gcp/cloudsql-postgresql
- Module: org/gcp/cloudsql-external-replica
- Runbook – DR Cutover: On-Prem to Cloud
License: MIT-0 for code, CC-BY-4.0 for documentation