Use GCP Cloud SQL as the Managed PostgreSQL DR Target

Status

Accepted: GCP Cloud SQL (PostgreSQL 15) is the managed target for on-prem-to-GCP DR failover. Self-managed Patroni on GCP VMs is not used for DR; it remains available as an option for burst workloads that require identical cluster topology to on-prem.

1. Context

ADR-0501 established Patroni-managed PostgreSQL HA as the on-prem primary database platform. When a DR failover to GCP is required (planned or emergency), the platform needs a PostgreSQL target that:

  • Is reachable via private IP from the GCP WAN hub (VPC peering or Cloud SQL Auth Proxy); a reachability sketch follows this list.
  • Can receive WAL-based replication or be seeded from a pgBackRest backup.
  • Does not require the platform team to operate Patroni, etcd, or cluster member provisioning inside GCP during an incident.
  • Has a predictable cost model so DR drills can be authorised against a cost budget (ADR-0801).

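A minimal reachability check for the first requirement, assuming the Cloud SQL Auth Proxy v2 binary and gcloud are available on a host attached to the GCP WAN hub. The instance connection name, host names, and credentials below are placeholders, not values taken from the DR blueprints:

    # Placeholder instance connection name; real values come from the environment config.
    INSTANCE="my-project:europe-west1:dr-postgres"

    # Via the Cloud SQL Auth Proxy over private IP.
    cloud-sql-proxy --private-ip --port 5432 "${INSTANCE}" &
    sleep 2
    psql "host=127.0.0.1 port=5432 user=postgres dbname=postgres" -c "SELECT 1;"

    # Or directly over VPC peering, using the instance's private address
    # (assumes the instance has no public IP, so index 0 is the private entry).
    PRIVATE_IP="$(gcloud sql instances describe dr-postgres \
      --format='value(ipAddresses[0].ipAddress)')"
    psql "host=${PRIVATE_IP} user=postgres dbname=postgres" -c "SELECT 1;"
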
Two options were evaluated:

Option A: Cloud SQL (managed). A GCP-managed PostgreSQL instance, provisioned via org/gcp/cloudsql-postgresql, with external replica readiness assessed via org/gcp/cloudsql-external-replica.

Option B: Self-managed Patroni on GCP VMs. An identical topology to the on-prem HA cluster, provisioned manually or via Terraform + Ansible during a DR event.

2. Decision

Use Cloud SQL (Option A) for all GCP DR targets.

The three DR blueprints that implement this decision:

  • dr/postgresql-cloudsql-standby-gcp@v1: provision the Cloud SQL instance and assess external replica readiness.
  • dr/postgresql-cloudsql-promote-gcp@v1: promote Cloud SQL to primary and confirm application failover (a promotion sketch follows this list).
  • dr/postgresql-cloudsql-failback-onprem@v1: seed on-prem HA from Cloud SQL state and revert DNS/connection.
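
The promote blueprint carries its own preflight gates and produces its own state records; purely as an illustration of the underlying operations (instance name and private IP are placeholders), promotion and post-promotion confirmation look roughly like this:

    # Placeholder values; the blueprint parameterises these per environment.
    INSTANCE="dr-postgres"
    HOST="10.20.0.5"   # Cloud SQL private IP

    # Promote the Cloud SQL replica to a standalone, writable primary.
    gcloud sql instances promote-replica "${INSTANCE}"

    # The instance should report RUNNABLE once promotion completes.
    gcloud sql instances describe "${INSTANCE}" --format='value(state)'

    # It should no longer be in recovery; applications can then be cut over.
    psql "host=${HOST} user=postgres dbname=postgres" -tAc "SELECT pg_is_in_recovery();"   # expect 'f'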

3. Rationale

Against Option B (self-managed Patroni in GCP)

Running Patroni in GCP during a DR event adds operational scope at exactly the moment operator capacity is under pressure. Standing up etcd, bootstrapping cluster members, and validating replication adds 30–60 minutes to RTO. Keeping a standby Patroni cluster idle in GCP to avoid this adds ~$200–400/month in VM costs for a scenario that is exercised infrequently.

For Option A (Cloud SQL)

  • Provisioning time: org/gcp/cloudsql-postgresql applies in under five minutes. A pre-provisioned standby instance (kept stopped between drills) costs under $20/month on a db-g1-small tier. A minimal provisioning sketch follows this list.
  • No cluster management: GCP handles HA, automatic failover within Cloud SQL, and patch maintenance. The operator manages the data, not the cluster.
  • pgBackRest integration: the platform/onprem/postgresql-ha-backup module writes WAL archives to a GCS bucket (org/gcp/pgbackrest-repo). Cloud SQL can be seeded from this bucket using the dr/postgresql-cloudsql-standby-gcp@v1 blueprint.
  • External replica assessment: org/gcp/cloudsql-external-replica probes whether the Cloud SQL instance is in a state that supports promotion. This produces a structured state record used as a preflight gate in the promote blueprint.
  • Cost signal: Cloud SQL instance cost is surfaced via the cost-signal framework (ADR-0801) so DR drills are not authorised without a cost acknowledgement.

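The org/gcp/cloudsql-postgresql module owns the real provisioning interface; as a rough sketch of what it creates (instance, project, region, and network names below are placeholders), the equivalent gcloud sequence is approximately:

    # Placeholder names; the module supplies the real project, region, and hub VPC.
    gcloud sql instances create dr-postgres \
      --database-version=POSTGRES_15 \
      --tier=db-g1-small \
      --region=europe-west1 \
      --network=projects/my-project/global/networks/hub-vpc \
      --no-assign-ip

    # Keep the instance stopped between drills to hold idle cost down.
    gcloud sql instances patch dr-postgres --activation-policy=NEVER
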
Networking constraint

Cloud SQL private IP access from the on-prem or Hetzner environment requires the GCP hub VPC to be peered with the Cloud SQL VPC (servicenetworking.googleapis.com). This is a prerequisite of org/gcp/wan-hub-network. Do not attempt Cloud SQL failover without confirming org/gcp/wan-hub-network state is ok and the google-managed-services-* peer is established.
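
This precondition can be confirmed from the hub project before a failover is attempted; the network name below is a placeholder:

    # Placeholder hub VPC name.
    NETWORK="hub-vpc"

    # The private services access connection used by Cloud SQL must be listed here...
    gcloud services vpc-peerings list --network="${NETWORK}"

    # ...and the corresponding VPC peering must show as ACTIVE.
    gcloud compute networks peerings list --network="${NETWORK}"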

Compatibility constraint

Cloud SQL PostgreSQL major version must match the on-prem Patroni cluster major version. Current baseline: PostgreSQL 15. Version upgrades to the on-prem cluster must trigger a Cloud SQL version review before the next DR drill.
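
A quick pre-drill check that the major versions still match (host and instance names are placeholders):

    # On-prem: major version of the Patroni-managed primary.
    psql "host=onprem-pg-primary dbname=postgres" -tAc "SHOW server_version;"

    # Cloud SQL: expect POSTGRES_15 for the current baseline.
    gcloud sql instances describe dr-postgres --format='value(databaseVersion)'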

4. Consequences

  • org/gcp/cloudsql-postgresql and org/gcp/pgbackrest-repo are required in every environment that participates in GCP DR. They must reach state ok before the DR blueprints can run.
  • Cloud SQL instances must be kept in a known, provisionable state between drills. Use skip_if_state_ok: true on the standby blueprint step so idle instances are not reprovisioned unnecessarily.
  • The on-prem pgBackRest stanza must archive to the GCS bucket continuously, not only during DR events. A break in WAL archiving invalidates the Cloud SQL restore path; a verification sketch follows this list.
  • Application connection strings must use the HA proxy or a DNS name that can be repointed at cutover. Hard-coded on-prem IP addresses in application config block DR.
  • Failback (dr/postgresql-cloudsql-failback-onprem@v1) requires the on-prem Patroni cluster to be in a state that can accept a pgBackRest restore. Confirm postgresql-ha state before initiating failback.

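The WAL archiving path from the third consequence can be verified from the on-prem backup host at any time; the stanza name below is a placeholder for whatever platform/onprem/postgresql-ha-backup defines:

    # Verify the stanza can reach the GCS repository and archiving is healthy.
    pgbackrest --stanza=main check

    # List backups and archived WAL ranges held in the repository.
    pgbackrest --stanza=main info
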
5. Implementation notes

Provisioning order within an environment:

  1. org/gcp/project-factory: GCP project.
  2. org/gcp/wan-hub-network: hub VPC with Cloud SQL VPC peering.
  3. org/gcp/pgbackrest-repo: GCS bucket for WAL archives.
  4. org/gcp/cloudsql-postgresql: Cloud SQL instance (private IP, PostgreSQL 15).
  5. platform/onprem/postgresql-ha-backup: on-prem stanza configured to archive to the GCS bucket above (an illustrative stanza fragment follows this list).

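Step 5 is the piece most easily misconfigured. An illustrative pgBackRest fragment for a GCS repository follows; the bucket name, key path, stanza name, and data directory are placeholders, not values taken from the actual module:

    # Illustrative only; real values come from platform/onprem/postgresql-ha-backup.
    cat > /etc/pgbackrest/pgbackrest.conf <<'EOF'
    [global]
    repo1-type=gcs
    repo1-gcs-bucket=org-pgbackrest-repo
    repo1-gcs-key=/etc/pgbackrest/gcs-key.json
    repo1-path=/pgbackrest

    [main]
    pg1-path=/var/lib/postgresql/15/main
    EOF

    # The primary must push WAL continuously, e.g. in postgresql.conf:
    #   archive_command = 'pgbackrest --stanza=main archive-push %p'
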
The cloudsql-external-replica assessment module runs as step 1 of the promote blueprint. It is not applied independently in normal operations.
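
The module's own checks are not reproduced here; as an illustration of the kind of probe involved (connection details are placeholders, and this is not the module's implementation):

    # Placeholder private IP of the Cloud SQL replica.
    HOST="10.20.0.5"

    # Prior to promotion the replica should still be in recovery.
    psql "host=${HOST} user=postgres dbname=postgres" -tAc "SELECT pg_is_in_recovery();"   # expect 't'

    # Approximate replay delay, in seconds, behind the on-prem primary.
    psql "host=${HOST} user=postgres dbname=postgres" -tAc \
      "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()));"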

References


License: MIT-0 for code, CC-BY-4.0 for documentation