Repeatable PostgreSQL App-Data DR Validation Drill¶
- Purpose: Prove the self-managed PostgreSQL DR path end to end by seeding deterministic application data, failing over to an isolated GCP lane, taking a fresh validation backup, and restoring back into a separate on-prem validation lane.
- Owner: Platform engineering / SRE
- Trigger: Scheduled resilience drill or release-readiness verification for the PostgreSQL HA DR lane
- Impact: Creates and validates temporary validation clusters in GCP and on-prem drill lanes. The live primary database lane and live DNS remain untouched.
- Severity: P2
- Pre-reqs: GCP ops runner is healthy, the on-prem drill source cluster is healthy, runtime vault secrets exist in `dev` and `drill`, and operators have workstation access to the Proxmox bastion and GCP IAP.
- Rollback strategy: Use the dedicated cleanup runbook to destroy only the validation lanes and preserve the live primary, source drill cluster, shared DNS, and shared networking state.
Context¶
This drill proves two separate claims:
- infrastructure recovery works across the shipped self-managed PostgreSQL HA failover and failback blueprints
- application data survives the full cycle, not just the cluster rebuild
The run uses three distinct lanes:
- source drill lane: existing on-prem drill PostgreSQL HA cluster at `10.12.0.41`
- GCP validation lane: isolated failover target restored into `platform/gcp/platform-vm#gcp_pg_vms_app_validation`
- on-prem failback validation lane: isolated return target restored into `platform/onprem/platform-vm#postgres_ha_vms_app_validation_drill`
The deterministic validation dataset lives in:
- `$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql`
- `$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql`
Expected validation output for a successful run:
```
tenants=20
services=100
events=1000
tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
event_checksum=3895de221035a795e853fad560a31078
```
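The comparison against these pinned values can be automated. A minimal sketch, assuming the verify script emits exactly the `key=value` lines shown above (`check_transcript` is a hypothetical helper, not part of the shipped tooling):

```shell
# Pinned expected values, copied from this runbook.
expected='tenants=20
services=100
events=1000
tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
event_checksum=3895de221035a795e853fad560a31078'

check_transcript() {
  # usage: check_transcript <transcript-file>
  # Keep only the key=value result lines so psql banners do not interfere.
  actual="$(grep -E '^(tenants|services|events|tenant_checksum|service_checksum|event_checksum)=' "$1")"
  if [ "$actual" = "$expected" ]; then echo PASS; else echo FAIL; fi
}

# Demonstration against a throwaway transcript:
t="$(mktemp)"
printf '%s\n' "$expected" > "$t"
check_transcript "$t"   # prints PASS
rm -f "$t"
```

Run it against the saved operator transcript from each lane; any non-`PASS` result means the drill must not be signed off.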
Preconditions and safety checks¶
- Confirm the isolated source drill lane is healthy.

    ```shell
    jq '.outputs.db_host, .outputs.cap_db_postgresql_ha' \
      "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/latest.json"
    ```

    Expected result: `10.12.0.41` and `ready`.

- Confirm the GCP ops runner exists.

    ```shell
    jq '.status' \
      "$HOME/.hybridops/envs/dev/state/modules/platform__linux__ops-runner/instances/gcp_ops_runner_bootstrap.json"
    ```

    Expected result: `"ok"`.

- Confirm the required secrets exist before starting.

    ```shell
    cd /home/user/hybridops-tech/hybridops-core
    ./.venv/bin/hyops secrets ensure --env dev \
      PATRONI_SUPERUSER_PASSWORD \
      PATRONI_REPLICATION_PASSWORD \
      NETBOX_DB_PASSWORD \
      PG_BACKUP_GCS_SA_JSON
    ./.venv/bin/hyops secrets ensure --env drill \
      PATRONI_SUPERUSER_PASSWORD \
      PATRONI_REPLICATION_PASSWORD \
      NETBOX_DB_PASSWORD \
      PG_BACKUP_GCS_SA_JSON
    ```

- Confirm the validation overlays exist and are not pointed at the live primary lane.

    ```shell
    sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-validation.yml"
    sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-validation.yml"
    sed -n '1,260p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml"
    ```

    Check specifically for these isolated state instances:

    - `gcp_pg_vms_app_validation`
    - `postgresql_restore_gcp_app_validation`
    - `postgresql_backup_run_gcp_app_validation`
    - `postgres_ha_vms_app_validation_drill`
    - `postgresql_restore_onprem_app_validation_drill`
    - `postgresql_backup_config_onprem_app_validation_drill`
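A quick mechanical guard for this check might look like the following sketch (`check_overlay` is a hypothetical helper, not part of `hyops`; the real overlay paths are the three files printed above):

```shell
# Hypothetical helper: assert that each expected isolated instance name
# appears in a given overlay file, so a typo cannot silently point the
# drill at the live primary lane.
check_overlay() {
  # usage: check_overlay <overlay-file> <instance-name>...
  overlay="$1"; shift
  rc=0
  for ref in "$@"; do
    grep -q "$ref" "$overlay" || { echo "MISSING: $ref in $overlay"; rc=1; }
  done
  [ "$rc" -eq 0 ] && echo "OK: $overlay"
  return "$rc"
}

# Demonstration against a throwaway sample overlay (use the real paths above).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
state_instance: gcp_pg_vms_app_validation
restore_instance: postgresql_restore_gcp_app_validation
EOF
check_overlay "$sample" gcp_pg_vms_app_validation postgresql_restore_gcp_app_validation
rm -f "$sample"
```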
Steps¶
- Seed the deterministic application dataset on the source drill leader

    Action: load the seeded validation schema and data into the isolated on-prem source drill lane.

    Command or procedure:

    ```shell
    ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
      < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql"
    ```

    Expected result:

    - `drvalidation_app` is recreated cleanly on `10.12.0.41`
    - no changes are made to the live primary `dev` lane

    Run record:

    - operator shell transcript
    - `$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql`
- Verify the source drill dataset before failover

    Action: confirm row counts and checksums on the source drill leader.

    Command or procedure:

    ```shell
    ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
      < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql"
    ```

    Expected result:

    - counts match `20 / 100 / 1000`
    - checksums match the values listed in this runbook

    Run record:

    - operator shell transcript
    - `$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql`
- Restore the isolated GCP validation lane

    Action: run the failover validation blueprint from the GCP runner.

    Command or procedure:

    ```shell
    cd /home/user/hybridops-tech/hybridops-core
    HYOPS_CORE_ROOT=/home/user/hybridops-tech/hybridops-core \
    ./.venv/bin/hyops runner blueprint deploy --env dev \
      --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
      --sync-env PATRONI_SUPERUSER_PASSWORD \
      --sync-env PATRONI_REPLICATION_PASSWORD \
      --sync-env NETBOX_DB_PASSWORD \
      --sync-env PG_BACKUP_GCS_SA_JSON \
      --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-validation.yml" \
      --execute
    ```

    Expected result:

    - `platform/gcp/platform-vm#gcp_pg_vms_app_validation` reaches `status=ok`
    - `platform/postgresql-ha#postgresql_restore_gcp_app_validation` reaches `status=ok`
    - restored leader endpoint is `10.72.16.27`

    Run record:

    - `$HOME/.hybridops/envs/dev/logs/runner/`
    - `$HOME/.hybridops/envs/dev/state/modules/platform__gcp__platform-vm/instances/gcp_pg_vms_app_validation.json`
    - `$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_validation.json`
- Verify the restored GCP validation lane

    Action: confirm the restored validation data on the GCP leader through IAP.

    Command or procedure:

    ```shell
    GCP_PROJECT_ID="$(jq -r '.context.project_id' "$HOME/.hybridops/envs/dev/meta/gcp.ready.json")"
    gcloud compute ssh platform-validation-pgapp-01 \
      --project "$GCP_PROJECT_ID" \
      --zone europe-west2-a \
      --tunnel-through-iap \
      --command 'sudo -u postgres psql' \
      < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql"
    ```

    Expected result:

    - counts match `20 / 100 / 1000`
    - checksums match the source drill lane

    Run record:

    - operator shell transcript
    - `$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_validation.json`
- Take a fresh pinned backup from the GCP validation leader

    Action: run an on-demand validation backup after the GCP restore is validated.

    Command or procedure:

    ```shell
    cd /home/user/hybridops-tech/hybridops-core
    HYOPS_CORE_ROOT=/home/user/hybridops-tech/hybridops-core \
    ./.venv/bin/hyops runner blueprint deploy --env dev \
      --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
      --sync-env PATRONI_SUPERUSER_PASSWORD \
      --sync-env PATRONI_REPLICATION_PASSWORD \
      --sync-env PG_BACKUP_GCS_SA_JSON \
      --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-validation.yml" \
      --execute
    ```

    Expected result:

    - `platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation` reaches `status=ok`
    - a new pgBackRest backup label exists in the repository
    - the backup label and timeline are published in module state for the failback step

    Run record:

    - `$HOME/.hybridops/envs/dev/logs/runner/`
    - `$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha-backup/instances/postgresql_backup_run_gcp_app_validation.json`
- Point the failback overlay at the fresh backup-run state

    Action: update the failback overlay to consume the latest backup metadata from the backup-run state.

    Command or procedure:

    ```shell
    sed -n '1,220p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml"
    ```

    Set these fields in the failback overlay:

    - `backup_state_ref: platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation`
    - `backup_state_env: dev`
    - `allow_cross_env_state: true`

    Leave these fields blank when a backup-run state is supplying the latest safe restore selector:

    - `restore_set`
    - `restore_target_timeline`

    Only set `restore_target_timeline` when you are intentionally pinning a non-default lineage after inspecting the repository history.

    Expected result:

    - failback overlay points at the backup-run state created in the previous step
    - no stale backup label or timeline is left behind by mistake

    Run record:

    - operator shell transcript
    - `$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml`
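As an illustration, the relevant fragment of the failback overlay after this step would look roughly like the following. This is a sketch: surrounding overlay keys are omitted, and only the field names named in this runbook are shown.

```yaml
# dr-postgresql-ha-failback-onprem-app-validation.yml (fragment, sketch)
backup_state_ref: platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation
backup_state_env: dev
allow_cross_env_state: true

# Left unset so the backup-run state supplies the latest safe selector.
# Only pin restore_target_timeline after inspecting repository history.
# restore_set:
# restore_target_timeline:
```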
- Restore the isolated on-prem failback validation lane

    Action: rebuild and restore the separate on-prem validation lane from the fresh GCP validation backup.

    Command or procedure:

    ```shell
    cd /home/user/hybridops-tech/hybridops-core
    HYOPS_CORE_ROOT=/home/user/hybridops-tech/hybridops-core \
    ./.venv/bin/hyops blueprint deploy --env drill \
      --file "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml" \
      --execute
    ```

    Expected result:

    - `platform/onprem/platform-vm#postgres_ha_vms_app_validation_drill` reaches `status=ok`
    - `platform/postgresql-ha#postgresql_restore_onprem_app_validation_drill` reaches `status=ok`
    - restored leader endpoint is `10.12.0.51`

    Run record:

    - `$HOME/.hybridops/envs/drill/logs/module/platform__onprem__platform-vm/`
    - `$HOME/.hybridops/envs/drill/logs/module/platform__postgresql-ha/`
    - `$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_validation_drill.json`
- Verify the on-prem failback validation lane

    Action: confirm the same dataset now exists on the isolated failback validation leader.

    Command or procedure:

    ```shell
    ssh -J root@192.168.0.27 opsadmin@10.12.0.51 'sudo -u postgres psql' \
      < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql"
    ```

    Expected result:

    - counts match `20 / 100 / 1000`
    - checksums match the source drill lane and GCP validation lane

    Run record:

    - operator shell transcript
    - `$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_validation_drill.json`
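Before sign-off, the three saved verify transcripts (source drill, GCP validation, on-prem failback validation) can be diffed mechanically. A minimal sketch (`same_results` is a hypothetical helper; it assumes the transcripts contain the `key=value` result lines):

```shell
# Hypothetical helper: confirm three captured verify transcripts carry
# identical key=value results.
same_results() {
  # usage: same_results <transcript1> <transcript2> <transcript3>
  a="$(grep -E '^[a-z_]+=' "$1")"
  b="$(grep -E '^[a-z_]+=' "$2")"
  c="$(grep -E '^[a-z_]+=' "$3")"
  if [ "$a" = "$b" ] && [ "$b" = "$c" ]; then echo MATCH; else echo MISMATCH; fi
}

# Demonstration with three identical throwaway transcripts:
t1="$(mktemp)"; t2="$(mktemp)"; t3="$(mktemp)"
printf 'tenants=20\nevents=1000\n' | tee "$t1" "$t2" "$t3" > /dev/null
same_results "$t1" "$t2" "$t3"   # prints MATCH
rm -f "$t1" "$t2" "$t3"
```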
Verification¶
Confirm all three checkpoints line up:

- source drill leader: `10.12.0.41`
- GCP validation leader: `10.72.16.27`
- on-prem failback validation leader: `10.12.0.51`

Final success criteria:

- all three locations return identical counts and checksums
- `platform/postgresql-ha#postgresql_restore_gcp_app_validation` is `ok`
- `platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation` is `ok`
- `platform/postgresql-ha#postgresql_restore_onprem_app_validation_drill` is `ok`
- `platform/postgresql-ha-backup#postgresql_backup_config_onprem_app_validation_drill` is `ok`
Useful state checks:

```shell
jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_validation.json"
jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_validation_drill.json"
```
Post-actions and clean-up¶
- Do not repoint live DNS from this drill.
- Do not destroy the source drill lane at `10.12.0.41` from this procedure.
- Preserve the validation logs and state until the drill report is accepted.
- When the drill is complete, use the companion cleanup flow:
    - Cleanup the PostgreSQL App-Data DR Drill Lanes
Related¶
Related reading¶
- Cleanup the PostgreSQL App-Data DR Validation Lanes
- PostgreSQL DR Operating Model (Restore vs Warm Standby vs Multi-Cloud)
- Failover PostgreSQL HA to GCP (HyOps Blueprint)
- Failback PostgreSQL HA to On-Prem (HyOps Blueprint)
- PostgreSQL HA DR Cycle
References¶
- `~/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql`
- `~/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql`
- `~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-validation.yml`
- `~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-validation.yml`
- `~/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml`
License: MIT-0 for code, CC-BY-4.0 for documentation