
Operate PostgreSQL HA Cluster (HyOps)

Purpose: Deploy and operate an on-prem PostgreSQL HA cluster using Patroni + etcd via platform/onprem/postgresql-ha.
Owner: Platform engineering
Trigger: New environment bring-up, DB endpoint changes (VIP/allowlist), controlled rebuild, or maintenance window
Impact: Deploys/configures a multi-node Patroni cluster and publishes stable connection outputs for downstream modules
Severity: P2
Pre-reqs: Target hosts reachable via SSH (typically provisioned by platform/onprem/platform-vm), runtime vault decrypt works, Ansible deps installed.
Rollback strategy: Run hyops destroy with the same module and input overlay (or rebuild from clean VMs).

Context

This runbook covers module-level operations for:

  • Module: platform/onprem/postgresql-ha
  • Driver: config/ansible
  • Upstream automation: vitabaks.autobase (Ansible Galaxy)
  • Scope: configure PostgreSQL HA on existing Linux hosts (no VM provisioning)

Key behavior:

  • Default inputs.apply_mode=auto:
      • First run: bootstraps the cluster.
      • Subsequent runs (prior state is ok): switches to maintenance reconcile (safe day-2 operations).
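The auto selection can be sketched roughly as follows (illustrative only: hyops makes this decision internally, reading the module's prior state from latest.json; here the status is hard-coded):

```shell
# Sketch of the auto apply_mode selection described above. Not the actual
# implementation; in practice prior_status comes from the module's latest.json.
prior_status="ok"
if [ "$prior_status" = "ok" ]; then
  mode="maintenance"   # prior run succeeded: safe day-2 reconcile
else
  mode="bootstrap"     # first run, or prior run did not finish cleanly
fi
echo "apply_mode=$mode"
```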

Preconditions and safety checks

  • hyops installed (via install.sh); it can be run from any working directory.
  • If you want to use the shipped example overlays, set:
export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"

For source checkout usage, set HYOPS_CORE_ROOT to your hybridops-core checkout root instead.

  • Correct environment selected (--env dev|staging|prod).
  • Target hosts are reachable via SSH from the runner (or configure a bastion/proxy jump).
  • Ansible Galaxy deps installed for this module:

# If you installed via install.sh (default runs setup-all), this is already done.
# To (re)install Ansible Galaxy deps for an env:
hyops setup ansible --env dev

Secrets (recommended: runtime vault):

hyops secrets ensure --env dev \
  PATRONI_SUPERUSER_PASSWORD \
  PATRONI_REPLICATION_PASSWORD \
  NETBOX_DB_PASSWORD

Steps

  1. Select an overlay

Use one of:

  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml (3 nodes; etcd co-located)
  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.full.yml (dedicated etcd nodes)
  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml (DR/failback restore from GCS; requires apply_mode=restore)
  • $HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.s3.yml (DR/failback restore from S3; requires apply_mode=restore)

  2. Preflight

hyops preflight --env dev --strict \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

  3. Bootstrap (first deployment)

Option A: let auto decide (recommended):

hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

Option B: force bootstrap explicitly:

HYOPS_INPUT_apply_mode=bootstrap \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

  4. Maintenance reconcile (day-2 updates)

Re-run apply (auto selects maintenance when prior state is ok), or force maintenance:

HYOPS_INPUT_apply_mode=maintenance \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

Validated gate sequence (recommended before production promotion):

# idempotency
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

# lifecycle: destroy -> apply
hyops destroy --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

# variant: maintenance mode
HYOPS_INPUT_apply_mode=maintenance \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs ~/.hybridops/envs/dev/work/blueprint-inputs/onprem_postgresql-ha_v1/postgresql_ha.inputs.yml

  5. Restore (DR / failback) (optional)

This bootstraps a new Patroni cluster from a pgBackRest repository (S3 or GCS).

Safety guardrails:

  • Requires HYOPS_INPUT_apply_mode=restore
  • Requires HYOPS_INPUT_restore_confirm=true
  • Recommended: restore onto fresh VMs (do not run against a live/primary cluster)
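The guardrail contract can be illustrated like this (a sketch only: hyops enforces these checks itself, this just shows what the module expects before a restore proceeds):

```shell
# Illustrative guard mirroring the restore guardrails above; not the actual
# hyops implementation. Both variables must be set for a restore to proceed.
HYOPS_INPUT_apply_mode="restore"
HYOPS_INPUT_restore_confirm="true"
if [ "$HYOPS_INPUT_apply_mode" = "restore" ] && [ "$HYOPS_INPUT_restore_confirm" != "true" ]; then
  echo "refusing restore: set HYOPS_INPUT_restore_confirm=true" >&2
  exit 1
fi
echo "restore guardrails satisfied"
```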

Example (restore from GCS):

# Optional: provision a GCS repo bucket + service account (infra-only; no SA keys)
hyops apply --env dev \
  --module org/gcp/pgbackrest-repo \
  --inputs "$HYOPS_CORE_ROOT/modules/org/gcp/pgbackrest-repo/examples/inputs.min.yml"

# Store GCS SA JSON into vault (one-time per env)
export PG_BACKUP_GCS_SA_JSON="$(cat sa.json)"
hyops secrets set --env dev --from-env PG_BACKUP_GCS_SA_JSON

HYOPS_INPUT_apply_mode=restore \
HYOPS_INPUT_restore_confirm=true \
hyops preflight --env dev --strict \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml"

HYOPS_INPUT_apply_mode=restore \
HYOPS_INPUT_restore_confirm=true \
hyops apply --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.restore.gcs.yml"

  6. Verify outputs and evidence

cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__postgresql-ha/latest.json"

Check:

  • status is ok
  • outputs.cluster_vip is present when configured
  • outputs.apps.<app>.db_* is present for each app contract
  • evidence_dir exists and includes ansible.log
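The checklist above can be scripted; this is a sketch assuming python3 is available on the runner. The sample document written to STATE is illustrative (field names follow the checklist) — point STATE at the real latest.json instead:

```shell
# Sketch: a self-contained check of the latest.json fields listed above.
# The sample JSON is illustrative; set STATE to your real state file.
STATE="${STATE:-/tmp/latest.sample.json}"
cat > "$STATE" <<'EOF'
{"status": "ok",
 "outputs": {"cluster_vip": "192.0.2.10",
             "apps": {"netbox": {"db_host": "192.0.2.10", "db_port": 5432}}},
 "evidence_dir": "/tmp/evidence"}
EOF
python3 - "$STATE" <<'PY'
import json, sys
doc = json.load(open(sys.argv[1]))
assert doc["status"] == "ok", "module state is not ok"
assert doc["outputs"].get("cluster_vip"), "cluster_vip missing"
for app, contract in doc["outputs"]["apps"].items():
    assert any(k.startswith("db_") for k in contract), f"no db_* outputs for {app}"
print("state checks passed")
PY
```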

  7. Destroy (best-effort cleanup)

hyops destroy --env dev \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

  8. Rebuild (destroy then apply)

hyops rebuild --env dev --yes \
  --confirm-module platform/onprem/postgresql-ha \
  --module platform/onprem/postgresql-ha \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/postgresql-ha/examples/inputs.min.yml"

Verification

Primary state:

  • $HOME/.hybridops/envs/<env>/state/modules/platform__onprem__postgresql-ha/latest.json

Primary logs:

  • $HOME/.hybridops/envs/<env>/logs/module/platform__onprem__postgresql-ha/<run_id>/ansible.log

Functional checks (example):

  • From a host that can reach the management network: nc -vz <cluster_vip> 5432
  • On the leader: sudo patronictl -c /etc/patroni/patroni.yml list

Common issues

Whoops! data directory ... is already initialized

Cause: running bootstrap against hosts that already have an initialized data dir.

Fix:

  • Use inputs.apply_mode=maintenance, or
  • Run hyops destroy and reprovision fresh VMs.

No healthy leader node detected

Cause: Patroni leader REST API is not reachable / cluster unhealthy.

Fix:

  • Confirm node reachability and that Patroni is running.
  • Inspect leader node logs: journalctl -u patroni -u etcd -u vip-manager.
  • If the cluster is partially deployed, recover manually or rebuild from clean VMs.

SSH timeouts / management network unreachable

Cause: runner lacks L3 access to the management network.

Fix:

  • Get the runner onto the management network (VPN/route), or
  • Configure bastion/proxy jump inputs (HyOps can auto-detect Proxmox as a bastion when a proxmox init exists):
      • ssh_proxy_jump_auto: true (default), or
      • set ssh_proxy_jump_host, ssh_proxy_jump_user, ssh_proxy_jump_port.
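To confirm the jump path works before re-running the module, the equivalent manual SSH invocation can be built like this (all host/user values below are placeholders, not real inputs):

```shell
# Placeholder values; substitute your actual bastion and DB node addresses.
JUMP_USER="ops"
JUMP_HOST="bastion.mgmt.example"
JUMP_PORT="22"
TARGET_HOST="pg-node-1.mgmt.example"
# -J is OpenSSH's ProxyJump shorthand (equivalent to -o ProxyJump=...).
SSH_CMD="ssh -J ${JUMP_USER}@${JUMP_HOST}:${JUMP_PORT} ${TARGET_HOST}"
echo "$SSH_CMD"
```

If this command reaches the target host, the same jump values should work as the module's ssh_proxy_jump_* inputs.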

References