# Operate PostgreSQL HA Cluster (HyOps)
- Purpose: Deploy and operate an on-prem PostgreSQL HA cluster using Patroni + etcd via `platform/onprem/postgresql-ha`.
- Owner: Platform engineering
- Trigger: New environment bring-up, DB endpoint changes (VIP/allowlist), controlled rebuild, or maintenance window
- Impact: Deploys/configures a multi-node Patroni cluster and publishes stable connection outputs for downstream modules
- Severity: P2
- Pre-reqs: Target hosts reachable via SSH (typically provisioned by `platform/onprem/platform-vm`), runtime vault decrypt works, Ansible deps installed.
- Rollback strategy: Run `hyops destroy` with the same module and input overlay (or rebuild from clean VMs).
## Context
This runbook covers module-level operations for:

- Module: `platform/onprem/postgresql-ha`
- Driver: `config/ansible`
- Upstream automation: `vitabaks.autobase` (Ansible Galaxy)
- Scope: configure PostgreSQL HA on existing Linux hosts (no VM provisioning)
Key behavior:

- Default `inputs.apply_mode=auto`:
  - First run: bootstraps the cluster.
  - Subsequent runs (state is `ok`): switches to maintenance reconcile (safe day-2 operations).
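The `auto` selection above can be sketched as simple pseudologic. This is an illustration of the documented behavior only, not the actual HyOps implementation:

```bash
# Pseudologic for inputs.apply_mode=auto (illustration only, not HyOps source).
# In practice, the prior status would be read from the module's latest.json.
STATE_STATUS="ok"

if [ "$STATE_STATUS" = "ok" ]; then
  MODE="maintenance"   # prior run succeeded: safe day-2 reconcile
else
  MODE="bootstrap"     # no healthy prior state: initialize the cluster
fi
echo "selected apply_mode=$MODE"
```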
## Preconditions and safety checks
- Installed `hyops` (via `install.sh`) can be run from any working directory.
- If you want to use the shipped example overlays, set:

  ```bash
  export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"
  ```

  For source checkout usage, set `HYOPS_CORE_ROOT` to your hybridops-core checkout root instead.
- Correct environment selected (`--env dev|staging|prod`).
- Target hosts are reachable via SSH from the runner (or configure a bastion/proxy jump).
- Ansible Galaxy deps installed for this module:

  ```bash
  # If you installed via install.sh (default runs setup-all), this is already done.
  # To (re)install Ansible Galaxy deps for an env:
  hyops setup ansible --env dev
  ```
Secrets (recommended: runtime vault):

```bash
hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD
```
## Steps
- Select an overlay

  Use one of:

  - `$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml` (3 nodes; etcd co-located)
  - `$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.full.yml` (dedicated etcd nodes)
  - `$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml` (DR/failback restore from GCS; requires `apply_mode=restore`)
  - `$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.s3.yml` (DR/failback restore from S3; requires `apply_mode=restore`)
- Preflight

  ```bash
  hyops preflight --env dev --strict \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"
  ```
- Bootstrap (first deployment)

  Option A: let auto decide (recommended):

  ```bash
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"
  ```

  Option B: force bootstrap explicitly:

  ```bash
  HYOPS_INPUT_apply_mode=bootstrap \
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"
  ```
- Maintenance reconcile (day-2 updates)

  Re-run apply (auto selects maintenance when the prior state is `ok`), or force maintenance:

  ```bash
  HYOPS_INPUT_apply_mode=maintenance \
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"
  ```
  Validated gate sequence (recommended before production promotion):

  ```bash
  # idempotency
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

  # lifecycle: destroy -> apply
  hyops destroy --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

  # variant: maintenance mode
  HYOPS_INPUT_apply_mode=maintenance \
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml
  ```
- Restore (DR / failback) (optional)

  This bootstraps a new Patroni cluster from a pgBackRest repository (S3 or GCS).

  Safety guardrails:

  - Requires `HYOPS_INPUT_apply_mode=restore`
  - Requires `HYOPS_INPUT_restore_confirm=true`
  - Recommended: restore onto fresh VMs (do not run against a live/primary cluster)
  Example (restore from GCS):

  ```bash
  # Optional: provision a GCS repo bucket + service account (infra-only; no SA keys)
  hyops apply --env dev \
    --module org/gcp/pgbackrest-repo \
    --inputs "$HYOPS_CORE_ROOT/modules/org/gcp/pgbackrest-repo/examples/inputs.min.yml"

  # Store GCS SA JSON into vault (one-time per env)
  export PG_BACKUP_GCS_SA_JSON="$(cat sa.json)"
  hyops secrets set --env dev --from-env PG_BACKUP_GCS_SA_JSON

  HYOPS_INPUT_apply_mode=restore \
  HYOPS_INPUT_restore_confirm=true \
  hyops preflight --env dev --strict \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml"

  HYOPS_INPUT_apply_mode=restore \
  HYOPS_INPUT_restore_confirm=true \
  hyops apply --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml"
  ```
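  Before attempting a restore, it helps to confirm the repository actually contains backups. A sketch using `pgbackrest info --output=json`; the JSON here is an illustrative sample of that output's shape (verify the exact field names against your pgBackRest version), not data from a real repository:

  ```bash
  # Illustrative sample of `pgbackrest info --output=json` output (shape assumed).
  # On a repo host you would pipe the real command instead:
  #   pgbackrest info --stanza=<stanza> --output=json
  INFO='[{"name":"demo","status":{"code":0,"message":"ok"},"backup":[{"type":"full","label":"20240101-000000F"}]}]'

  # Stanza status code 0 means the stanza is healthy; count full backups.
  CODE="$(printf '%s' "$INFO" | jq -r '.[0].status.code')"
  FULLS="$(printf '%s' "$INFO" | jq '[.[0].backup[] | select(.type == "full")] | length')"

  [ "$CODE" = "0" ] || { echo "stanza not healthy (code=$CODE)"; exit 1; }
  echo "full_backups=$FULLS"
  ```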
- Verify outputs and evidence

  ```bash
  cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"
  ```

  Check:

  - `status` is `ok`
  - `outputs.cluster_vip` is present when configured
  - `outputs.apps.<app>.db_*` is present for each app contract
  - `evidence_dir` exists and includes `ansible.log`
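  The manual checks above can be scripted. A minimal sketch using `jq` against a file that mimics the `latest.json` fields named above (the sample values are made up for illustration):

  ```bash
  # Hypothetical latest.json (sample values only; field names follow the checks above).
  STATE_FILE="$(mktemp)"
  cat > "$STATE_FILE" <<'EOF'
  {"status":"ok","outputs":{"cluster_vip":"10.0.0.50"},"evidence_dir":"/tmp/evidence"}
  EOF

  # Fail fast unless the last run ended in "ok"; surface the VIP if published.
  STATUS="$(jq -r '.status' "$STATE_FILE")"
  VIP="$(jq -r '.outputs.cluster_vip // empty' "$STATE_FILE")"

  [ "$STATUS" = "ok" ] || { echo "module not ok: $STATUS"; exit 1; }
  echo "status=$STATUS vip=$VIP"
  ```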
- Destroy (best-effort cleanup)

  ```bash
  hyops destroy --env dev \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"
  ```
- Rebuild (destroy then apply)

  ```bash
  hyops rebuild --env dev --yes \
    --confirm-module platform/onprem/postgresql-ha \
    --module platform/onprem/postgresql-ha \
    --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"
  ```
## Verification

Primary state:

```
$HOME/.hybridops/envs/<env>/state/modules/platform__onprem__postgresql-ha/latest.json
```

Primary logs:

```
$HOME/.hybridops/envs/<env>/logs/module/platform__onprem__postgresql-ha/<run_id>/ansible.log
```

Functional checks (example):

- From a host that can reach the management network: `nc -vz <cluster_vip> 5432`
- On the leader: `sudo patronictl -c /etc/patroni/patroni.yml list`
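`patronictl list` also supports JSON output via `-f json`, which makes the leader check scriptable. The member list below is an illustrative sample of that shape (key names assumed; verify on your Patroni version), not output from a real cluster:

```bash
# On the node you would run:
#   sudo patronictl -c /etc/patroni/patroni.yml list -f json
# Sample member list (illustrative shape only).
MEMBERS='[{"Member":"pg-node-1","Role":"Leader","State":"running"},{"Member":"pg-node-2","Role":"Replica","State":"streaming"}]'

# Pick out the member whose Role is "Leader".
LEADER="$(printf '%s' "$MEMBERS" | jq -r '.[] | select(.Role == "Leader") | .Member')"
echo "leader=$LEADER"
```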
## Common issues

### Whoops! data directory ... is already initialized

Cause: running bootstrap against hosts that already have an initialized data dir.

Fix:

- Use `inputs.apply_mode=maintenance`, or
- Run `hyops destroy` and reprovision fresh VMs.
### No healthy leader node detected

Cause: the Patroni leader REST API is not reachable / cluster unhealthy.

Fix:

- Confirm node reachability and that Patroni is running.
- Inspect leader node logs: `journalctl -u patroni -u etcd -u vip-manager`.
- If the cluster is partially deployed, recover manually or rebuild from clean VMs.
### SSH timeouts / management network unreachable

Cause: the runner lacks L3 access to the management network.

Fix:

- Get the runner onto the management network (VPN/route), or
- Configure bastion/proxy jump inputs (HyOps can auto-detect Proxmox as bastion when a proxmox init exists):
  - `ssh_proxy_jump_auto: true` (default)
  - or set `ssh_proxy_jump_host`, `ssh_proxy_jump_user`, `ssh_proxy_jump_port`.
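If you prefer to configure the jump host outside HyOps, the equivalent manual OpenSSH client configuration looks like this (host names and addresses are placeholders, not values from this runbook):

```
# ~/.ssh/config — manual equivalent of the proxy-jump inputs above
Host pg-bastion
    HostName 203.0.113.10   # placeholder bastion address
    User ops
    Port 22

Host pg-node-*
    ProxyJump pg-bastion
```

With this in place, plain `ssh pg-node-1` (and Ansible over SSH) transparently hops through the bastion.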