Repeatable PostgreSQL App-Data DR Validation Drill

  • Purpose: Prove the self-managed PostgreSQL DR path end to end by seeding deterministic application data, failing over to an isolated GCP lane, taking a fresh validation backup, and restoring back into a separate on-prem validation lane.

  • Owner: Platform engineering / SRE

  • Trigger: Scheduled resilience drill or release-readiness verification for the PostgreSQL HA DR lane

  • Impact: Creates and validates temporary validation clusters in GCP and on-prem drill lanes. The live primary database lane and live DNS remain untouched.
  • Severity: P2

  • Pre-reqs: GCP ops runner is healthy, on-prem drill source cluster is healthy, runtime vault secrets exist in dev and drill, and operators have workstation access to the Proxmox bastion and GCP IAP.

  • Rollback strategy: Use the dedicated cleanup runbook to destroy only the validation lanes and preserve the live primary, source drill cluster, shared DNS, and shared networking state.

Context

This drill proves two separate claims:

  • infrastructure recovery works across the shipped self-managed PostgreSQL HA failover and failback blueprints
  • application data survives the full cycle, not just the cluster rebuild

The run uses three distinct lanes:

  • source drill lane: existing on-prem drill PostgreSQL HA cluster at 10.12.0.41
  • GCP validation lane: isolated failover target restored into platform/gcp/platform-vm#gcp_pg_vms_app_validation
  • on-prem failback validation lane: isolated return target restored into platform/onprem/platform-vm#postgres_ha_vms_app_validation_drill

The deterministic validation dataset lives in:

  • $HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql
  • $HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql

Expected validation output for a successful run:

  • tenants=20
  • services=100
  • events=1000
  • tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd
  • service_checksum=3825e7553d2dba7b6bef2dbca2b2be79
  • event_checksum=3895de221035a795e853fad560a31078
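
The expected values above can be checked mechanically against a saved verification transcript. The following is a minimal sketch, assuming the verify script prints each result as a literal key=value token exactly as listed above (the actual output layout of drvalidation_verify.sql is not shown in this runbook). The helper name check_drvalidation_output is hypothetical.

```shell
# Hypothetical helper: check_drvalidation_output TRANSCRIPT
# Assumption: every expected value appears in the transcript as a literal
# key=value token. Returns non-zero if any expected value is missing.
check_drvalidation_output() {
  local transcript="$1" missing=0 expected
  for expected in \
    "tenants=20" \
    "services=100" \
    "events=1000" \
    "tenant_checksum=d7fe13ac01e157e8e5f01f4c0469debd" \
    "service_checksum=3825e7553d2dba7b6bef2dbca2b2be79" \
    "event_checksum=3895de221035a795e853fad560a31078"; do
    # -F: fixed string; -w: whole word, so tenants=200 does not satisfy tenants=20
    if ! grep -qFw -- "$expected" "$transcript"; then
      echo "MISSING: $expected" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

If the verify output is laid out differently (for example as a psql table), the match strings would need to be adapted to that layout.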

Preconditions and safety checks

  1. Confirm the isolated source drill lane is healthy.
    jq -r '.outputs.db_host, .outputs.cap_db_postgresql_ha' \
      "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/latest.json"
    

Expected result:

  • 10.12.0.41
  • ready

  2. Confirm the GCP ops runner exists.

    jq '.status' \
      "$HOME/.hybridops/envs/dev/state/modules/platform__linux__ops-runner/instances/gcp_ops_runner_bootstrap.json"
    

Expected result:

  • "ok"

  3. Confirm the required secrets exist before starting.

    cd /home/user/hybridops-tech/hybridops-core
    
    ./.venv/bin/hyops secrets ensure --env dev \
      PATRONI_SUPERUSER_PASSWORD \
      PATRONI_REPLICATION_PASSWORD \
      NETBOX_DB_PASSWORD \
      PG_BACKUP_GCS_SA_JSON
    
    ./.venv/bin/hyops secrets ensure --env drill \
      PATRONI_SUPERUSER_PASSWORD \
      PATRONI_REPLICATION_PASSWORD \
      NETBOX_DB_PASSWORD \
      PG_BACKUP_GCS_SA_JSON
    
  4. Confirm the validation overlays exist and are not pointed at the live primary lane.

    sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-validation.yml"
    sed -n '1,220p' "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-validation.yml"
    sed -n '1,260p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml"
    

Check specifically for these isolated state instances:

  • gcp_pg_vms_app_validation
  • postgresql_restore_gcp_app_validation
  • postgresql_backup_run_gcp_app_validation
  • postgres_ha_vms_app_validation_drill
  • postgresql_restore_onprem_app_validation_drill
  • postgresql_backup_config_onprem_app_validation_drill
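
Scanning the overlays by eye is error-prone, so the check above can be scripted. This is a minimal sketch, assuming the instance names appear as literal strings somewhere in the overlay text. The helper name require_instances is hypothetical and not part of the hyops CLI.

```shell
# Hypothetical helper: require_instances FILE NAME...
# Fails if any NAME does not appear as a literal string in FILE.
# It only proves the names are present; it does not parse the YAML.
require_instances() {
  local file="$1" name rc=0
  shift
  for name in "$@"; do
    if ! grep -qF -- "$name" "$file"; then
      echo "MISSING in $file: $name" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

One way to use it is to concatenate the three overlays into a temporary file and pass that file together with the six instance names listed above. A missing name suggests an overlay may still reference a non-validation state instance and should be inspected before the drill starts.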

Steps

  1. Seed the deterministic application dataset on the source drill leader

Action: load the seeded validation schema and data into the isolated on-prem source drill lane.

Command or procedure:

ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql"

Expected result:

  • drvalidation_app is recreated cleanly on 10.12.0.41
  • no changes are made to the live primary dev lane

Run record:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql

  2. Verify the source drill dataset before failover

Action: confirm row counts and checksums on the source drill leader.

Command or procedure:

ssh -J root@192.168.0.27 opsadmin@10.12.0.41 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql"

Expected result:

  • counts match 20 / 100 / 1000
  • checksums match the values listed in this runbook

Run record:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql

  3. Restore the isolated GCP validation lane

Action: run the failover validation blueprint from the GCP runner.

Command or procedure:

cd /home/user/hybridops-tech/hybridops-core

HYOPS_CORE_ROOT=/home/user/hybridops-tech/hybridops-core \
./.venv/bin/hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
  --sync-env PATRONI_SUPERUSER_PASSWORD \
  --sync-env PATRONI_REPLICATION_PASSWORD \
  --sync-env NETBOX_DB_PASSWORD \
  --sync-env PG_BACKUP_GCS_SA_JSON \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-validation.yml" \
  --execute

Expected result:

  • platform/gcp/platform-vm#gcp_pg_vms_app_validation reaches status=ok
  • platform/postgresql-ha#postgresql_restore_gcp_app_validation reaches status=ok
  • restored leader endpoint is 10.72.16.27

Run record:

  • $HOME/.hybridops/envs/dev/logs/runner/
  • $HOME/.hybridops/envs/dev/state/modules/platform__gcp__platform-vm/instances/gcp_pg_vms_app_validation.json
  • $HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_validation.json

  4. Verify the restored GCP validation lane

Action: confirm the restored validation data on the GCP leader through IAP.

Command or procedure:

GCP_PROJECT_ID="$(jq -r '.context.project_id' "$HOME/.hybridops/envs/dev/meta/gcp.ready.json")"

gcloud compute ssh platform-validation-pgapp-01 \
  --project "$GCP_PROJECT_ID" \
  --zone europe-west2-a \
  --tunnel-through-iap \
  --command 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql"

Expected result:

  • counts match 20 / 100 / 1000
  • checksums match the source drill lane

Run record:

  • operator shell transcript
  • $HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_validation.json

  5. Take a fresh pinned backup from the GCP validation leader

Action: run an on-demand validation backup after the GCP restore is validated.

Command or procedure:

cd /home/user/hybridops-tech/hybridops-core

HYOPS_CORE_ROOT=/home/user/hybridops-tech/hybridops-core \
./.venv/bin/hyops runner blueprint deploy --env dev \
  --runner-state-ref platform/linux/ops-runner#gcp_ops_runner_bootstrap \
  --sync-env PATRONI_SUPERUSER_PASSWORD \
  --sync-env PATRONI_REPLICATION_PASSWORD \
  --sync-env PG_BACKUP_GCS_SA_JSON \
  --file "$HOME/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-validation.yml" \
  --execute

Expected result:

  • platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation reaches status=ok
  • a new pgBackRest backup label exists in the repository
  • the backup label and timeline are published in module state for the failback step

Run record:

  • $HOME/.hybridops/envs/dev/logs/runner/
  • $HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha-backup/instances/postgresql_backup_run_gcp_app_validation.json

  6. Point the failback overlay at the fresh backup-run state

Action: inspect the failback overlay, then edit it so that it consumes the latest backup metadata from the backup-run state.

Command or procedure:

sed -n '1,220p' "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml"

Set these fields in the failback overlay:

  • backup_state_ref: platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation
  • backup_state_env: dev
  • allow_cross_env_state: true

Leave these fields blank when a backup-run state is supplying the latest safe restore selector:

  • restore_set
  • restore_target_timeline

Only set restore_target_timeline when you are intentionally pinning a non-default lineage after inspecting the repository history.
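
The field rules above can be checked before running the failback. This is a minimal sketch, assuming each field is stored as a single-line "key: value" entry in the overlay YAML (the actual nesting of the overlay is not shown in this runbook). All three helper names are hypothetical.

```shell
# Hypothetical helpers for checking the failback overlay.
# Assumption: each field is a single-line "key: value" entry.

# overlay_has_field FILE KEY VALUE: true if FILE has a line "KEY: VALUE"
# (VALUE may be quoted). VALUE is embedded in a regular expression, so it
# must not contain regex metacharacters.
overlay_has_field() {
  grep -qE "^[[:space:]]*${2}:[[:space:]]*[\"']?${3}[\"']?[[:space:]]*\$" "$1"
}

# overlay_field_is_set FILE KEY: true if KEY carries a non-empty value.
overlay_field_is_set() {
  grep -qE "^[[:space:]]*${2}:[[:space:]]*[\"']?[^[:space:]\"'#]" "$1"
}

# check_failback_overlay FILE: non-zero if any rule from this step is broken.
check_failback_overlay() {
  local file="$1" key rc=0
  overlay_has_field "$file" backup_state_ref \
    "platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation" \
    || { echo "backup_state_ref is not the backup-run state" >&2; rc=1; }
  overlay_has_field "$file" backup_state_env dev \
    || { echo "backup_state_env is not dev" >&2; rc=1; }
  overlay_has_field "$file" allow_cross_env_state true \
    || { echo "allow_cross_env_state is not true" >&2; rc=1; }
  # The restore selectors must be absent or blank.
  for key in restore_set restore_target_timeline; do
    if overlay_field_is_set "$file" "$key"; then
      echo "$key is set; expected blank" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

The check is line-based on purpose so that it needs no YAML parser. It would not catch values spread over multiple lines, and it deliberately flags an intentionally pinned restore_target_timeline, so treat a failure as a prompt to re-read the overlay rather than as a hard error.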

Expected result:

  • failback overlay points at the backup-run state created in the previous step
  • no old backup label or timeline is left behind by mistake

Run record:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml

  7. Restore the isolated on-prem failback validation lane

Action: rebuild and restore the separate on-prem validation lane from the fresh GCP validation backup.

Command or procedure:

cd /home/user/hybridops-tech/hybridops-core

HYOPS_CORE_ROOT=/home/user/hybridops-tech/hybridops-core \
./.venv/bin/hyops blueprint deploy --env drill \
  --file "$HOME/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml" \
  --execute

Expected result:

  • platform/onprem/platform-vm#postgres_ha_vms_app_validation_drill reaches status=ok
  • platform/postgresql-ha#postgresql_restore_onprem_app_validation_drill reaches status=ok
  • restored leader endpoint is 10.12.0.51

Run record:

  • $HOME/.hybridops/envs/drill/logs/module/platform__onprem__platform-vm/
  • $HOME/.hybridops/envs/drill/logs/module/platform__postgresql-ha/
  • $HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_validation_drill.json

  8. Verify the on-prem failback validation lane

Action: confirm the same dataset now exists on the isolated failback validation leader.

Command or procedure:

ssh -J root@192.168.0.27 opsadmin@10.12.0.51 'sudo -u postgres psql' \
  < "$HOME/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql"

Expected result:

  • counts match 20 / 100 / 1000
  • checksums match the source drill lane and GCP validation lane

Run record:

  • operator shell transcript
  • $HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_validation_drill.json

Verification

Confirm all three checkpoints line up:

  • source drill leader: 10.12.0.41
  • GCP validation leader: 10.72.16.27
  • on-prem failback validation leader: 10.12.0.51
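
If the output of each verify run was saved to a file (for example by appending a tee to the three verify commands, which is an assumption and not part of the procedure above), the three checkpoints can be compared directly. This is a minimal sketch with a hypothetical helper name, lanes_match.

```shell
# Hypothetical helper: lanes_match SOURCE_LOG GCP_LOG ONPREM_LOG
# Compares saved verify transcripts byte for byte. Assumes the verify
# output is deterministic (no timestamps, timing lines, or host names).
lanes_match() {
  local source_log="$1" gcp_log="$2" onprem_log="$3"
  cmp -s "$source_log" "$gcp_log" \
    || { echo "GCP validation lane differs from source" >&2; return 1; }
  cmp -s "$source_log" "$onprem_log" \
    || { echo "on-prem failback lane differs from source" >&2; return 1; }
  echo "all three lanes match"
}
```

A byte-for-byte comparison is stricter than comparing counts and checksums alone, so a mismatch caused only by formatting differences should be resolved by reading the transcripts side by side.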

Final success criteria:

  • all three locations return identical counts and checksums
  • platform/postgresql-ha#postgresql_restore_gcp_app_validation is ok
  • platform/postgresql-ha-backup#postgresql_backup_run_gcp_app_validation is ok
  • platform/postgresql-ha#postgresql_restore_onprem_app_validation_drill is ok
  • platform/postgresql-ha-backup#postgresql_backup_config_onprem_app_validation_drill is ok
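
The state-based criteria above can be checked in one pass. This is a minimal sketch, assuming every module state file exposes a top-level .status field as the jq commands in this runbook suggest. The helper name assert_state_ok is hypothetical.

```shell
# Hypothetical helper: assert_state_ok FILE...
# Fails unless every state file exists, parses as JSON, and has
# .status equal to "ok".
assert_state_ok() {
  local file rc=0
  for file in "$@"; do
    # jq -e sets the exit status from the last output value (false -> non-zero)
    if jq -e '.status == "ok"' "$file" >/dev/null 2>&1; then
      echo "ok      $file"
    else
      echo "NOT OK  $file" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Pass it the instance state files named in the Run record entries of the restore and backup steps. The state file path for postgresql_backup_config_onprem_app_validation_drill is not given in this runbook, so locate it in the drill environment before adding it to the call.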

Useful state checks:

jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/dev/state/modules/platform__postgresql-ha/instances/postgresql_restore_gcp_app_validation.json"

jq '.status, .outputs.db_host' \
  "$HOME/.hybridops/envs/drill/state/modules/platform__postgresql-ha/instances/postgresql_restore_onprem_app_validation_drill.json"

Post-actions and clean-up

  • Do not repoint live DNS from this drill.
  • Do not destroy the source drill lane at 10.12.0.41 from this procedure.
  • Preserve the validation logs and state until the drill report is accepted.
  • When the drill is complete, use the companion cleanup runbook: Cleanup the PostgreSQL App-Data DR Drill Lanes

References

  • ~/.hybridops/envs/drill/config/drvalidation/drvalidation_seed.sql
  • ~/.hybridops/envs/drill/config/drvalidation/drvalidation_verify.sql
  • ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-failover-gcp-app-validation.yml
  • ~/.hybridops/envs/dev/config/blueprints/dr-postgresql-ha-backup-gcp-app-validation.yml
  • ~/.hybridops/envs/drill/config/blueprints/dr-postgresql-ha-failback-onprem-app-validation.yml

License: MIT-0 for code, CC-BY-4.0 for documentation