# PostgreSQL LXC (db-01) Failure and Promotion
Purpose: Provide a clear procedure for handling failure of the primary PostgreSQL LXC (db-01), promoting a replica or alternate instance, and safely repointing dependent workloads (for example, NetBox on RKE2) while capturing evidence.
Owner: Platform / SRE team (HybridOps.Studio)
Trigger: db-01 is unavailable, failing health checks, or deemed unsafe to continue as primary.
Impact: Services that depend on PostgreSQL (for example, NetBox as Source of Truth) may be degraded or unavailable. Data integrity and recovery point objective (RPO) must be considered.
Severity: P1 – database primary outage.
This runbook aligns with:
- ADR-0013 – PostgreSQL Runs in LXC (State on Host-Mounted Storage; Backups First-Class)
- ADR-0202 – Adopt RKE2 as Primary Runtime for Platform and Applications
- ADR-0701 – Use GitHub Actions as Stateless DR Orchestrator
- ADR-0801 – Treat Cost as a First-Class Signal for DR and Cloud Bursting
## 1. Scenario overview
PostgreSQL is hosted primarily in an LXC container (db-01) on Proxmox, with:
- Host-mounted storage to make state visible and backup-friendly.
- Regular backups (for example, WAL-G, snapshot-based backups, or both).
- Optionally one of:
    - A standby/replica LXC (db-02) on another Proxmox node, and/or
    - A cloud PostgreSQL replica used for DR.
This runbook covers:
- Rapid triage of db-01 failure.
- Promotion of a replica or restoration to an alternate instance.
- Repointing workloads that depend on PostgreSQL (for example, NetBox on RKE2).
- Verification and evidence capture.
Note: If the failure is part of a wider on-prem outage, coordinate with the DR cutover runbook:
Runbook – DR Cutover: On-Prem RKE2 to Cloud Cluster
Evidence for this runbook should be stored under:

- `output/artifacts/data/postgresql/`
- Application-specific proof folders (for example, `output/artifacts/apps/netbox/`)
## 2. Preconditions and safety checks
Before making changes, establish:

1. Confirm the nature of the failure

    - Is db-01:
        - Completely unreachable (LXC not running / Proxmox node down)?
        - Running but with an unhealthy PostgreSQL service?
        - Experiencing data corruption or disk issues?
    - From the Proxmox node that should host db-01:

        ```shell
        pct status <db01-ctid>
        ```

    - If the Proxmox node itself is down or unstable, treat this as a broader infrastructure failure and consider the DR cutover runbook as well as this one.

2. Confirm backup and replica status

    - Identify the last successful backup: time, size, and method (WAL-G, snapshot, etc.).
    - Identify available replicas: the on-prem standby LXC (db-02) and/or the cloud PostgreSQL replica.

3. Identify critical dependent services

    - At minimum: NetBox (running on RKE2 or Docker, depending on current state).
    - Confirm which workloads are allowed to operate in read-only mode versus those that must remain offline until promotion completes.

4. Declare the incident and freeze risky changes

    - Open or update the incident ticket.
    - Pause non-essential schema changes and database migrations.
    - Communicate potential downtime / degraded mode to stakeholders.

5. Create the evidence folder

    - Create an event-specific folder:

        ```shell
        mkdir -p output/artifacts/data/postgresql/db01-failover-<date>/
        ```

    - Replace `<date>` with a timestamp (for example, `2025-12-02T193000Z`).
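The evidence-folder step can be scripted so the timestamp format stays consistent across events; a minimal sketch, assuming a POSIX shell (the path layout matches this runbook's conventions):

```shell
# Generate a UTC timestamp matching the example format in this runbook
# and create the event-specific evidence folder.
STAMP=$(date -u +%Y-%m-%dT%H%M%SZ)
EVIDENCE_DIR="output/artifacts/data/postgresql/db01-failover-${STAMP}"
mkdir -p "$EVIDENCE_DIR"
echo "Evidence folder: $EVIDENCE_DIR"
```

Using one `EVIDENCE_DIR` variable for the rest of the incident avoids typos in hand-copied timestamped paths.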
## 3. Phase 1 – Triage db-01
Goal: Quickly decide whether db-01 can be recovered in place or whether you must switch to a replica or a restored instance.

1. Check LXC container state

    ```shell
    # From Proxmox host
    pct status <db01-ctid>
    pct console <db01-ctid>
    ```

    - If the LXC is stopped and can safely be started:

        ```shell
        pct start <db01-ctid>
        ```

    - Capture any errors to `output/artifacts/data/postgresql/db01-failover-<date>/db01-pct-status.txt`.

2. Check PostgreSQL service state

    Inside db-01 (if reachable):

    ```shell
    systemctl status postgresql
    journalctl -u postgresql --since "30 min ago"
    ```

    Look for:
    - Out-of-disk issues.
    - Data corruption messages.
    - Repeated crash loops.

3. Check disk and filesystem

    ```shell
    df -h
    dmesg | tail -n 50
    ```

    If the disk is full, consider emergency clean-up of logs or non-critical data, then attempt a restart.

4. Decision point: in-place recovery vs failover

    - Attempt in-place recovery first if:
        - The underlying storage is healthy.
        - Issues look transient (for example, disk full, misconfiguration).
    - Switch to failover (promotion/restore) if:
        - The LXC host is down for an extended period.
        - There are strong signs of disk or data corruption.
        - Recovery risk is high relative to RPO/RTO commitments.

    Record the decision and rationale in the incident ticket and in a short text file in the evidence folder.
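The decision record can be captured as a small text file in the evidence folder; a sketch, where the timestamped path and the note's contents are example values to be replaced with the real ones:

```shell
# Write the failover decision and rationale into the evidence folder.
# Timestamp and note contents below are examples; fill in the real values.
EVIDENCE_DIR="output/artifacts/data/postgresql/db01-failover-2025-12-02T193000Z"
mkdir -p "$EVIDENCE_DIR"
cat > "$EVIDENCE_DIR/decision.txt" <<'EOF'
Decision: fail over (promote standby db-02)
Rationale: db-01 Proxmox node down > 30 min; standby healthy with low lag
Decided by: on-call SRE
EOF
echo "Recorded decision in $EVIDENCE_DIR/decision.txt"
```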
## 4. Phase 2 – Promote replica or restore to new primary
Goal: Establish a new PostgreSQL primary that can take over db-01’s role with an acceptable RPO.
### 4.1 Promote an existing standby (preferred path)
If you have a hot/warm standby (for example, LXC db-02 or a cloud replica):

1. Confirm standby health

    Check service status and replication lag:

    ```shell
    systemctl status postgresql
    # Example (may differ based on tooling):
    psql -U postgres -c "SELECT pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp();"
    ```

2. Promote the standby to primary

    Use the appropriate promotion command for your setup (for example):

    ```shell
    # Example using pg_ctlcluster on the standby:
    pg_ctlcluster <version> main promote
    ```

    Or use the tool provided by your backup/replication stack.

3. Record the promotion

    Capture the promotion command and logs to `output/artifacts/data/postgresql/db01-failover-<date>/promotion-logs.txt`.

4. Update DNS / connection endpoints (if applicable)

    If applications refer to db-01 via DNS, you may:
    - Update DNS to point to the new primary, or
    - Use a connection string/endpoint that already abstracts this.
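Before promoting, it helps to compare the observed replication lag against an explicit budget; a sketch of that check, where `LAG_SECONDS` and the 30-second `MAX_LAG` are assumed example values (on a real standby, the lag would come from `pg_last_xact_replay_timestamp()` via `psql`):

```shell
# Compare observed standby lag against an assumed RPO budget before promotion.
# LAG_SECONDS is an example value; on the standby it could be read with:
#   psql -U postgres -Atc "SELECT now() - pg_last_xact_replay_timestamp();"
LAG_SECONDS=3
MAX_LAG=30   # assumed budget in seconds; tune to your RPO commitments
if [ "$LAG_SECONDS" -le "$MAX_LAG" ]; then
  VERDICT="lag ${LAG_SECONDS}s within ${MAX_LAG}s budget; safe to promote"
else
  VERDICT="lag ${LAG_SECONDS}s exceeds ${MAX_LAG}s budget; investigate before promoting"
fi
echo "$VERDICT"
```

Promoting a badly lagged standby silently converts replication lag into data loss, so making the budget explicit keeps the RPO decision deliberate.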
### 4.2 Restore from backup (fallback path)
If no healthy standby is available:

1. Provision a new PostgreSQL LXC or instance

    - Create a db-new LXC (or equivalent) with host-mounted storage aligned to ADR-0013.
    - Install PostgreSQL at the expected version.

2. Restore from the last good backup

    Use your chosen restore mechanism (for example, WAL-G, base backup + WAL):

    ```shell
    # Pseudocode; adapt to your tooling
    wal-g backup-fetch /var/lib/postgresql/data LATEST
    ```

    Confirm PostgreSQL starts cleanly after the restore.

3. Record RPO impact

    - Determine how much time/data was lost compared to the current time.
    - Document any RPO breach in the incident ticket.

4. Prepare the new instance to act as primary

    - Confirm `pg_is_in_recovery()` returns `false` (not in recovery).
    - Confirm you can connect as the NetBox and other application users.
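The RPO step above can be approximated by comparing the last backup time with the current time; a sketch assuming GNU `date` (the backup timestamp is an example value, normally taken from your backup tooling, e.g. `wal-g backup-list`):

```shell
# Estimate the data-loss window (RPO impact) after a restore.
# LAST_BACKUP is an example; take the real value from your backup tooling.
LAST_BACKUP="2025-12-02T18:00:00Z"
NOW_EPOCH=$(date -u +%s)
BACKUP_EPOCH=$(date -u -d "$LAST_BACKUP" +%s)   # GNU date syntax; BSD date differs
LOST_SECONDS=$(( NOW_EPOCH - BACKUP_EPOCH ))
echo "Approximate data-loss window: $(( LOST_SECONDS / 60 )) minutes"
```

Attach the computed window to the incident ticket so any RPO breach is recorded with a number rather than a guess.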
## 5. Phase 3 – Repoint dependent workloads
Goal: Safely reconnect applications (for example, NetBox on RKE2) to the new PostgreSQL primary.
1. Identify affected applications

    At minimum:
    - NetBox on RKE2 (see HOWTO – Deploy NetBox on RKE2 Using PostgreSQL LXC).
    - Any other workloads configured to talk to db-01.

2. Update connection endpoints

    Depending on your architecture:
    - If using DNS (for example, `db-01.internal.local`): update DNS to point at the new primary.
    - If using explicit hostnames in K8s secrets: update the relevant Secrets or ExternalSecrets and apply via GitOps/CI.

3. Restart or roll pods

    For K8s workloads (example for NetBox):

    ```shell
    kubectl rollout restart deploy/netbox -n network-platform
    ```

    Ensure workloads pick up the new connection settings.

4. Verify application connectivity

    - Check application logs for successful DB connections.
    - Perform basic functional checks (for example, NetBox login, query, create/delete a test record).

5. Record application-level evidence

    - `kubectl get pods` outputs.
    - Application health endpoints.
    - Store under:
        - `output/artifacts/apps/netbox/`
        - `output/artifacts/data/postgresql/db01-failover-<date>/apps-checks.txt`
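The evidence step above can be gathered with a small capture script; a sketch, where the timestamped path and namespace are example values from this runbook and the `kubectl` call degrades gracefully when run off-cluster:

```shell
# Collect application-level checks into the evidence folder.
# Timestamp and namespace below are example values; adjust to your event.
OUT="output/artifacts/data/postgresql/db01-failover-2025-12-02T193000Z/apps-checks.txt"
mkdir -p "$(dirname "$OUT")"
{
  echo "== captured $(date -u +%Y-%m-%dT%H%M%SZ) =="
  kubectl get pods -n network-platform 2>&1 || echo "kubectl unavailable on this host"
} > "$OUT"
echo "Wrote $OUT"
```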
## 6. Phase 4 – Stabilisation and follow-up
1. Monitor the new primary

    Check PostgreSQL logs and metrics for:
    - Errors.
    - High replication lag (if you re-establish a standby).
    - Resource saturation.

2. Decide what to do with the old db-01

    - If db-01 returns, consider reinitialising it as a standby using a fresh base backup from the new primary.
    - Ensure you do not inadvertently bring it back as a split-brain primary.

3. Re-establish backup and replication

    - Ensure regular backup jobs point to the new primary.
    - Restore the standby/replica topology.

4. Update diagrams and documentation if the topology changed

    If the long-term primary host has changed, update:
    - ADR-0013 links or notes (if required).
    - Platform diagrams.
    - Any static documentation that assumes the db-01 hostname is the only primary.
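The split-brain warning above can be turned into a simple guard to run before ever restarting the old db-01; a sketch, where the `RECOVERY` value is a stand-in (on the real host it would come from `psql`):

```shell
# Guard: never bring the old db-01 back as a writable primary.
# RECOVERY is an example value; on the real host read it with:
#   psql -U postgres -Atc "SELECT pg_is_in_recovery();"
RECOVERY="f"   # "t" = standby (safe), "f" = would accept writes (split-brain risk)
if [ "$RECOVERY" = "f" ]; then
  ACTION="block: reinitialise db-01 as a standby (fresh base backup) before starting it"
else
  ACTION="ok: db-01 is in recovery mode and will follow the new primary"
fi
echo "$ACTION"
```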
## 7. Evidence and close-out
Before closing this runbook:

1. Ensure the following locations contain up-to-date artefacts for this event:

    - `output/artifacts/data/postgresql/db01-failover-<date>/`
    - `output/artifacts/apps/netbox/`
    - Any additional app-specific proof folders.

2. Update the incident ticket with:

    - Timeline (failure → decision → promotion/restore → repoint → stable).
    - RPO/RTO observations.
    - Root cause (if known) and contributing factors.
    - Follow-up tasks (for example, capacity, hardware, backup tuning).

3. If this was a drill:

    - Capture lessons learned.
    - Validate that steps were realistic and repeatable.
    - Align future DR drills to use this runbook as a baseline.
## 8. Validation checklist
- [ ] Nature of db-01 failure identified and documented.
- [ ] Decision (recover in place vs promote/restore) made and recorded with rationale.
- [ ] New primary (standby promoted or restored instance) is healthy and accepting connections.
- [ ] Dependent workloads (for example, NetBox on RKE2) have been repointed and validated.
- [ ] Backups and, if applicable, standby/replica topology have been re-established.
- [ ] Evidence artefacts stored under `output/artifacts/data/postgresql/` and relevant application proof folders.
- [ ] Incident ticket updated with findings and follow-up actions.
## References
- ADR-0013 – PostgreSQL Runs in LXC (State on Host-Mounted Storage; Backups First-Class)
- ADR-0202 – Adopt RKE2 as Primary Runtime for Platform and Applications
- ADR-0701 – Use GitHub Actions as Stateless DR Orchestrator
- ADR-0801 – Treat Cost as a First-Class Signal for DR and Cloud Bursting
- Evidence 3 – Source of Truth and Network Automation
- Evidence 4 – Delivery Platform, GitOps and Cluster Operations
- `output/artifacts/data/postgresql/`
- `output/artifacts/apps/netbox/`
Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation