
PostgreSQL LXC (db-01) Failure and Promotion

Purpose: Provide a clear procedure for handling failure of the primary PostgreSQL LXC (db-01), promoting a replica or alternate instance, and safely repointing dependent workloads (for example, NetBox on RKE2) while capturing evidence.
Owner: Platform / SRE team (HybridOps.Studio)
Trigger: db-01 is unavailable, failing health checks, or deemed unsafe to continue as primary.
Impact: Services that depend on PostgreSQL (for example, NetBox as Source of Truth) may be degraded or unavailable. Data integrity and recovery point objective (RPO) must be considered.
Severity: P1 – database primary outage.

This runbook aligns with:


1. Scenario overview

PostgreSQL is hosted primarily in an LXC container (db-01) on Proxmox, with:

  • Host-mounted storage to make state visible and backup-friendly.
  • Regular backups (for example, WAL-G, snapshot-based backups, or both).
  • Optionally, one or both of:
    • A standby/replica LXC (db-02) on another Proxmox node.
    • A cloud PostgreSQL replica used for DR.

This runbook covers:

  1. Rapid triage of db-01 failure.
  2. Promotion of a replica or restoration to an alternate instance.
  3. Repointing workloads that depend on PostgreSQL (for example, NetBox on RKE2).
  4. Verification and evidence capture.

Note: If the failure is part of a wider on-prem outage, coordinate with the DR cutover runbook:
Runbook – DR Cutover: On-Prem RKE2 to Cloud Cluster

Evidence for this runbook should be stored under:

  output/artifacts/data/postgresql/db01-failover-<date>/

2. Preconditions and safety checks

Before making changes, establish the following:

  1. Confirm the nature of the failure. Is db-01:

    • Completely unreachable (LXC not running / Proxmox node down)?
    • Running but with an unhealthy PostgreSQL service?
    • Experiencing data corruption or disk issues?

    From the Proxmox node that should host db-01:

      pct status <db01-ctid>

    If the Proxmox node itself is down or unstable, treat this as a broader
    infrastructure failure and consider the DR cutover runbook alongside this one.

  2. Confirm backup and replica status. Identify the last successful backup:

    • Time, size, and method (WAL-G, snapshot, etc.).

    Identify available replicas:

    • On-prem standby LXC (db-02), and/or
    • Cloud PostgreSQL replica.

  3. Identify critical dependent services. At minimum:

    • NetBox (running on RKE2 or Docker, depending on current state).

    Confirm which workloads are allowed to operate in read-only mode and which
    must remain offline until promotion completes.

  4. Declare the incident and freeze risky changes:

    • Open or update the incident ticket.
    • Pause non-essential schema changes and database migrations.
    • Communicate potential downtime / degraded mode to stakeholders.

  5. Create the evidence folder, using an event-specific path:

      mkdir -p output/artifacts/data/postgresql/db01-failover-<date>/

    Replace <date> with a UTC timestamp (for example, 2025-12-02T193000Z).
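The evidence-folder step can be scripted so the timestamp format stays consistent across incidents; a minimal sketch:

```shell
# Create the evidence folder with a UTC timestamp in the runbook's format.
STAMP=$(date -u +%Y-%m-%dT%H%M%SZ)            # e.g. 2025-12-02T193000Z
EVIDENCE_DIR="output/artifacts/data/postgresql/db01-failover-$STAMP"
mkdir -p "$EVIDENCE_DIR"
echo "evidence folder: $EVIDENCE_DIR"
```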


3. Phase 1 – Triage db-01

Goal: Quickly decide if db-01 can be recovered in place or if you must switch to a replica/restore.

  1. Check LXC container state.

    # From Proxmox host
    pct status <db01-ctid>
    pct console <db01-ctid>

    If the LXC is stopped and can safely be started:

      pct start <db01-ctid>

    Capture any errors to:

    • output/artifacts/data/postgresql/db01-failover-<date>/db01-pct-status.txt

  2. Check PostgreSQL service state. Inside db-01 (if reachable):

    systemctl status postgresql
    journalctl -u postgresql --since "30 min ago"

    Look for:

    • Out-of-disk issues.
    • Data corruption messages.
    • Repeated crash loops.

  3. Check disk and filesystem:

    df -h
    dmesg | tail -n 50

    If the disk is full, consider emergency clean-up of logs or non-critical
    data, then attempt a restart.

  4. Decision point: in-place recovery vs failover.

    Attempt in-place recovery first if:

    • The underlying storage is healthy.
    • The issues look transient (for example, disk full, misconfiguration).

    Switch to failover (promotion/restore) if:

    • The LXC host is down for an extended period.
    • There are strong signs of disk or data corruption.
    • Recovery risk is high relative to RPO/RTO commitments.

Record the decision and rationale in the incident ticket and in a short text file in the proof folder.
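Evidence capture during triage can be wrapped in a small helper so every command's output lands in the proof folder. A sketch, with df standing in for host-only commands such as pct:

```shell
# Helper: run a command and tee its output into the evidence folder.
EVIDENCE_DIR="${EVIDENCE_DIR:-output/artifacts/data/postgresql/db01-failover-manual}"
mkdir -p "$EVIDENCE_DIR"

capture() {
  # capture <outfile> <command...>: save stdout+stderr, report the exit code
  out="$EVIDENCE_DIR/$1"; shift
  "$@" > "$out" 2>&1
  echo "captured: $out (exit $?)"
}

capture db01-df.txt df -h
# On the Proxmox host you would run, for example:
#   capture db01-pct-status.txt pct status <db01-ctid>
```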


4. Phase 2 – Promote replica or restore to new primary

Goal: Establish a new PostgreSQL primary that can take over db-01’s role with an acceptable RPO.

4.1 Promote an existing standby (preferred path)

If you have a hot/warm standby (for example, LXC db-02 or a cloud replica):

  1. Confirm standby health. Check service status and replication lag:

    systemctl status postgresql
    # Example (may differ based on tooling):
    psql -U postgres -c "SELECT pg_is_in_recovery(), now() - pg_last_xact_replay_timestamp();"

  2. Promote the standby to primary, using the promotion command appropriate to
    your setup, for example:

    # Debian/Ubuntu packaging (pg_ctlcluster):
    pg_ctlcluster <version> main promote

    Or use the tool provided by your backup/replication stack.

  3. Record the promotion. Capture the promotion command and logs to:

    • output/artifacts/data/postgresql/db01-failover-<date>/promotion-logs.txt

  4. Update DNS / connection endpoints (if applicable). If applications refer to
    db-01 via DNS, you may:

    • Update DNS to point to the new primary, or
    • Use a connection string/endpoint that already abstracts this.
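After promotion, it is worth confirming the node actually left recovery before repointing anything. A minimal check, assuming local psql access as the postgres user (it degrades gracefully where psql is unavailable):

```shell
# Confirm the promoted node behaves as a writable primary.
check_promoted() {
  # pg_is_in_recovery() returns 'f' on a primary, 't' on a standby
  state=$(psql -U postgres -Atc "SELECT pg_is_in_recovery();" 2>/dev/null) || state="unknown"
  case "$state" in
    f) echo "primary: accepting writes" ;;
    t) echo "still in recovery: promotion has not completed" ;;
    *) echo "could not query PostgreSQL (state=$state)" ;;
  esac
}
check_promoted
```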

4.2 Restore from backup (fallback path)

If no healthy standby is available:

  1. Provision a new PostgreSQL LXC or instance:

    • Create a db-new LXC (or equivalent) with host-mounted storage aligned to
      ADR-0013.
    • Install PostgreSQL at the expected version.

  2. Restore from the last good backup, using your chosen restore mechanism
    (for example, WAL-G, or a base backup plus WAL):

    # Pseudocode; adapt to your tooling
    wal-g backup-fetch /var/lib/postgresql/data LATEST

    Confirm PostgreSQL starts cleanly after the restore.

  3. Record the RPO impact:

    • Determine how much time/data was lost relative to the current time.
    • Document any RPO breach in the incident ticket.

  4. Prepare the new instance to act as primary:

    • Confirm pg_is_in_recovery() returns false (not in recovery).
    • Confirm you can connect as the NetBox and other application users.
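The RPO impact in step 3 can be approximated by comparing the last-backup timestamp with the current time. A sketch; the LAST_BACKUP value here is an example, so take the real timestamp from your backup tooling:

```shell
# Hypothetical RPO calculation: seconds between the last good backup and now.
LAST_BACKUP="${LAST_BACKUP:-2025-12-02T18:45:00Z}"   # example value
now_s=$(date -u +%s)
backup_s=$(date -u -d "$LAST_BACKUP" +%s 2>/dev/null \
  || date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$LAST_BACKUP" +%s)   # GNU date, then BSD fallback
rpo_s=$(( now_s - backup_s ))
echo "approximate RPO gap: ${rpo_s}s since $LAST_BACKUP"
```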

5. Phase 3 – Repoint dependent workloads

Goal: Safely reconnect applications (for example, NetBox on RKE2) to the new PostgreSQL primary.

  1. Identify affected applications. At minimum:

    • NetBox (on RKE2 or Docker, depending on current state).

  2. Update connection endpoints. Depending on your architecture:

    • If using DNS (for example, db-01.internal.local), update DNS to point at
      the new primary.
    • If using explicit hostnames in K8s secrets, update the relevant Secrets or
      ExternalSecrets and apply via GitOps/CI.

  3. Restart or roll pods. For K8s workloads (example for NetBox):

    kubectl rollout restart deploy/netbox -n network-platform

    Ensure workloads pick up the new connection settings.
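Where the DB host lives in a plain Kubernetes Secret rather than behind DNS, the repoint-and-roll steps can be sketched as below. The secret name netbox-db and key DB_HOST are assumptions (match your chart or manifests), and in a GitOps setup you would change the manifest in Git instead of patching live:

```shell
# Repoint NetBox at the new primary via its DB Secret, then roll the deployment.
NS="${NS:-network-platform}"
NEW_DB_HOST="${NEW_DB_HOST:-db-02.internal.local}"   # assumed new primary hostname
if command -v kubectl >/dev/null 2>&1 && kubectl get ns "$NS" >/dev/null 2>&1; then
  kubectl -n "$NS" patch secret netbox-db \
    --type merge -p "{\"stringData\":{\"DB_HOST\":\"$NEW_DB_HOST\"}}"
  kubectl -n "$NS" rollout restart deploy/netbox
  kubectl -n "$NS" rollout status deploy/netbox --timeout=180s
else
  echo "kubectl/cluster not reachable here; run from a host with cluster access"
fi
```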

  4. Verify application connectivity:

    • Check application logs for successful DB connections.
    • Perform basic functional checks (for example, NetBox login, a query, and
      creating/deleting a test record).

  5. Record application-level evidence:

    • kubectl get pods output.
    • Application health endpoints.

    Store it under:

    • output/artifacts/apps/netbox/
    • output/artifacts/data/postgresql/db01-failover-<date>/apps-checks.txt
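Part of the functional check can be automated against NetBox's status API. A sketch; the URL is an assumption, and /api/status/ may require authentication depending on your LOGIN_REQUIRED setting:

```shell
# NetBox smoke test: record the HTTP status of the /api/status/ endpoint.
NETBOX_URL="${NETBOX_URL:-https://netbox.internal.local}"   # assumed URL
if command -v curl >/dev/null 2>&1; then
  code=$(curl -fsS -o /dev/null -w "%{http_code}" --max-time 5 \
    "$NETBOX_URL/api/status/" 2>/dev/null) || code="unreachable"
  echo "netbox /api/status/: $code"
else
  code="no-curl"
fi
```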

6. Phase 4 – Stabilisation and follow-up

  1. Monitor the new primary. Check PostgreSQL logs and metrics for:

    • Errors.
    • High replication lag (once you re-establish a standby).
    • Resource saturation.

  2. Decide what to do with the old db-01. If db-01 returns:

    • Consider reinitialising it as a standby using a fresh base backup from the
      new primary.
    • Ensure it cannot inadvertently come back as a split-brain primary.

  3. Re-establish backup and replication:

    • Ensure regular backup jobs point to the new primary.
    • Restore the standby/replica topology.

  4. Update diagrams and documentation if the topology changed. If the long-term
    primary host has changed, update:

    • ADR-0013 links or notes (if required).
    • Platform diagrams.
    • Any static documentation that assumes the db-01 hostname is the only primary.
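Reinitialising a returned db-01 as a standby can be sketched with pg_basebackup (PostgreSQL 12+), which prevents it from ever starting as a second primary. The replicator role is an assumption (use your replication user), and this must run inside the db-01 container:

```shell
# Rebuild the old db-01 as a standby of the new primary.
NEW_PRIMARY="${NEW_PRIMARY:-db-02.internal.local}"   # assumed new primary hostname
PGDATA="${PGDATA:-/var/lib/postgresql/data}"
if command -v pg_basebackup >/dev/null 2>&1 && [ -d "$PGDATA" ]; then
  systemctl stop postgresql
  mv "$PGDATA" "${PGDATA}.old.$(date -u +%s)"        # keep the old data dir for forensics
  pg_basebackup -h "$NEW_PRIMARY" -U replicator -D "$PGDATA" -R -X stream
  systemctl start postgresql                         # -R wrote standby.signal + primary_conninfo
else
  echo "pg_basebackup/PGDATA not present; run this inside the db-01 container"
fi
```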

7. Evidence and close-out

Before closing this runbook:

  1. Ensure the following locations contain up-to-date artefacts for this event:

    • output/artifacts/data/postgresql/
    • output/artifacts/apps/netbox/
    • Any additional app-specific proof folders.

  2. Update the incident ticket with:

    • Timeline (failure → decision → promotion/restore → repoint → stable).
    • RPO/RTO observations.
    • Root cause (if known) and contributing factors.
    • Follow-up tasks (for example, capacity, hardware, backup tuning).

  3. If this was a drill:

    • Capture lessons learned.
    • Validate that the steps were realistic and repeatable.
    • Align future DR drills to use this runbook as a baseline.

8. Validation checklist

  • [ ] Nature of db-01 failure identified and documented.
  • [ ] Decision (recover in place vs promote/restore) made and recorded with rationale.
  • [ ] New primary (standby promoted or restored instance) is healthy and accepting connections.
  • [ ] Dependent workloads (for example, NetBox on RKE2) have been repointed and validated.
  • [ ] Backups and, if applicable, standby/replica topology have been re-established.
  • [ ] Evidence artefacts stored under output/artifacts/data/postgresql/ and relevant application proof folders.
  • [ ] Incident ticket updated with findings and follow-up actions.

References


Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation