
Operate PostgreSQL HA Cluster (HyOps)

Purpose: Deploy and operate an on-prem PostgreSQL HA cluster using Patroni + etcd via platform/onprem/postgresql-ha.
Owner: Platform engineering
Trigger: New environment bring-up, DB endpoint changes (VIP/allowlist), controlled rebuild, or maintenance window
Impact: Deploys/configures a multi-node Patroni cluster and publishes stable connection outputs for downstream modules
Severity: P2
Pre-reqs: Target hosts reachable via SSH (typically provisioned by platform/onprem/platform-vm), runtime vault decrypt works, Ansible deps installed.
Rollback strategy: Run hyops destroy with the same module and input overlay (or rebuild from clean VMs).

Context

This runbook covers module-level operations for:

  • Module: platform/onprem/postgresql-ha
  • Driver: config/ansible
  • Upstream automation: vitabaks.autobase (Ansible Galaxy)
  • Scope: configure PostgreSQL HA on existing Linux hosts (no VM provisioning)

Key behavior:

  • Default inputs.apply_mode=auto:
      • First run: bootstraps the cluster.
      • Subsequent runs (prior state is ok): switches to maintenance reconcile (safe day-2 operations).
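The auto selection can be sketched roughly as follows (illustrative only: hyops makes this decision internally, reading the module's prior state from latest.json; here the status is hard-coded):

```shell
# Sketch of the auto apply_mode selection described above. Not the actual
# implementation; in practice prior_status comes from the module's latest.json.
prior_status="ok"
if [ "$prior_status" = "ok" ]; then
  mode="maintenance"   # prior run succeeded: safe day-2 reconcile
else
  mode="bootstrap"     # first run, or prior run did not finish cleanly
fi
echo "apply_mode=$mode"
```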

Preconditions and safety checks

  • hyops installed (via install.sh); it can be run from any working directory.
  • If you want to use the shipped example overlays, set:
export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"

For source checkout usage, set HYOPS_CORE_ROOT to your hybridops-core checkout root instead.

  • Correct environment selected (--env dev|staging|prod).
  • Target hosts are reachable via SSH from the runner (or configure a bastion/proxy jump).
  • Ansible Galaxy deps installed for this module:

# If you installed via install.sh (default runs setup-all), this is already done.
# To (re)install Ansible Galaxy deps for an env:
hyops setup ansible --env dev

Secrets (recommended: runtime vault):

hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD

Steps

  1. Select an overlay

Use one of:

  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml (3 nodes; etcd co-located)
  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.full.yml (dedicated etcd nodes)
  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml (DR/failback restore from GCS; requires apply_mode=restore)
  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.s3.yml (DR/failback restore from S3; requires apply_mode=restore)

  2. Preflight

hyops preflight --env dev --strict \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

  3. Bootstrap (first deployment)

Option A: let auto decide (recommended):

hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

Option B: force bootstrap explicitly:

HYOPS_INPUT_apply_mode=bootstrap \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

  4. Maintenance reconcile (day-2 updates)

Re-run apply (auto selects maintenance when prior state is ok), or force maintenance:

HYOPS_INPUT_apply_mode=maintenance \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

Validated gate sequence (recommended before production promotion):

# idempotency
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

# lifecycle: destroy -> apply
hyops destroy --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

# variant: maintenance mode
HYOPS_INPUT_apply_mode=maintenance \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

  5. Restore (DR / failback) (optional)

This bootstraps a new Patroni cluster from a pgBackRest repository (S3 or GCS).

Safety guardrails:

  • Requires HYOPS_INPUT_apply_mode=restore
  • Requires HYOPS_INPUT_restore_confirm=true
  • Recommended: restore onto fresh VMs (do not run against a live/primary cluster)
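The guardrail contract can be illustrated like this (a sketch only: hyops enforces these checks itself, this just shows what the module expects before a restore proceeds):

```shell
# Illustrative guard mirroring the restore guardrails above; not the actual
# hyops implementation. Both variables must be set for a restore to proceed.
HYOPS_INPUT_apply_mode="restore"
HYOPS_INPUT_restore_confirm="true"
if [ "$HYOPS_INPUT_apply_mode" = "restore" ] && [ "$HYOPS_INPUT_restore_confirm" != "true" ]; then
  echo "refusing restore: set HYOPS_INPUT_restore_confirm=true" >&2
  exit 1
fi
echo "restore guardrails satisfied"
```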

Example (restore from GCS):

# Optional: provision a GCS repo bucket + service account (infra-only; no SA keys)
hyops apply --env dev \
  --module org/gcp/pgbackrest-repo \
  --inputs "$HYOPS_CORE_ROOT/modules/org/gcp/pgbackrest-repo/examples/inputs.min.yml"

# Store GCS SA JSON into vault (one-time per env)
export PG_BACKUP_GCS_SA_JSON="$(cat sa.json)"
hyops secrets set --env dev --from-env PG_BACKUP_GCS_SA_JSON

HYOPS_INPUT_apply_mode=restore \
HYOPS_INPUT_restore_confirm=true \
hyops preflight --env dev --strict \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml"

HYOPS_INPUT_apply_mode=restore \
HYOPS_INPUT_restore_confirm=true \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml"

  6. Verify outputs and evidence

cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"

Check:

  • status is ok
  • outputs.cluster_vip is present when configured
  • outputs.apps.<app>.db_* is present for each app contract
  • evidence_dir exists and includes ansible.log
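The checklist above can be scripted; this is a sketch assuming python3 is available on the runner. The sample document written to STATE is illustrative (field names follow the checklist) — point STATE at the real latest.json instead:

```shell
# Sketch: a self-contained check of the latest.json fields listed above.
# The sample JSON is illustrative; set STATE to your real state file.
STATE="${STATE:-/tmp/latest.sample.json}"
cat > "$STATE" <<'EOF'
{"status": "ok",
 "outputs": {"cluster_vip": "192.0.2.10",
             "apps": {"netbox": {"db_host": "192.0.2.10", "db_port": 5432}}},
 "evidence_dir": "/tmp/evidence"}
EOF
python3 - "$STATE" <<'PY'
import json, sys
doc = json.load(open(sys.argv[1]))
assert doc["status"] == "ok", "module state is not ok"
assert doc["outputs"].get("cluster_vip"), "cluster_vip missing"
for app, contract in doc["outputs"]["apps"].items():
    assert any(k.startswith("db_") for k in contract), f"no db_* outputs for {app}"
print("state checks passed")
PY
```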

  7. Destroy (best-effort cleanup)

hyops destroy --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

  8. Rebuild (destroy then apply)

hyops rebuild --env dev --yes \
  --confirm-module platform/onprem/postgresql-ha \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

Verification

Primary state:

  • $HOME/.hybridops/envs/<env>/state/modules/platform__onprem__postgresql-ha/latest.json

Primary logs:

  • $HOME/.hybridops/envs/<env>/logs/module/platform__onprem__postgresql-ha/<run_id>/ansible.log

Functional checks (example):

  • From a host that can reach the management network: nc -vz <cluster_vip> 5432
  • On the leader: sudo patronictl -c /etc/patroni/patroni.yml list

Common issues

Whoops! data directory ... is already initialized

Cause: running bootstrap against hosts that already have an initialized data dir.

Fix:

  • Use inputs.apply_mode=maintenance, or
  • Run hyops destroy and reprovision fresh VMs.

No healthy leader node detected

Cause: Patroni leader REST API is not reachable / cluster unhealthy.

Fix:

  • Confirm node reachability and that Patroni is running.
  • Inspect leader node logs: journalctl -u patroni -u etcd -u vip-manager.
  • If the cluster is partially deployed, recover manually or rebuild from clean VMs.

SSH timeouts / management network unreachable

Cause: runner lacks L3 access to the management network.

Fix:

  • Get the runner onto the management network (VPN/route), or
  • Configure bastion/proxy jump inputs (HyOps can auto-detect Proxmox as a bastion when a proxmox init exists):
      • ssh_proxy_jump_auto: true (default), or
      • set ssh_proxy_jump_host, ssh_proxy_jump_user, ssh_proxy_jump_port.
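To confirm the jump path works before re-running the module, the equivalent manual SSH invocation can be built like this (all host/user values below are placeholders, not real inputs):

```shell
# Placeholder values; substitute your actual bastion and DB node addresses.
JUMP_USER="ops"
JUMP_HOST="bastion.mgmt.example"
JUMP_PORT="22"
TARGET_HOST="pg-node-1.mgmt.example"
# -J is OpenSSH's ProxyJump shorthand (equivalent to -o ProxyJump=...).
SSH_CMD="ssh -J ${JUMP_USER}@${JUMP_HOST}:${JUMP_PORT} ${TARGET_HOST}"
echo "$SSH_CMD"
```

If this command reaches the target host, the same jump values should work as the module's ssh_proxy_jump_* inputs.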

References