
PostgreSQL Runs on Dedicated VM with Host-Managed Storage and DR Replication

Status

Accepted — PostgreSQL for HybridOps.Studio’s central shared database tier runs on a dedicated VM on an enterprise hypervisor, with host-managed storage and a warm DR replica in the cloud. The earlier LXC-based pattern recorded in ADR-0013 is retained for labs only.


1. Context

HybridOps.Studio needs a durable, portable and auditable database tier for:

  • NetBox (source of truth and IPAM).
  • Platform services that require relational state.
  • Future Academy workloads (for example Moodle) that rely on PostgreSQL.

Early designs (ADR-0013) standardised on PostgreSQL in an LXC container with host-mounted storage. While functional and lightweight, this raised several concerns for the primary shared database:

  • Isolation and security expectations for a central DB are typically VM-level or higher.
  • Enterprise assessors and teams are more familiar with VM-based PostgreSQL than LXC-based.
  • Kernel, cgroup and privilege nuances inside LXC are an additional variable during DR drills.

At the same time, the platform:

  • Uses RKE2 as the primary runtime for stateless services (ADR-0202, ADR-0204).
  • Treats Kubernetes as stateless compute, with critical state hosted externally.
  • Relies on a clear DR story across on-prem hypervisors and cloud environments.

The database tier must therefore:

  • Live outside Kubernetes.
  • Be easy to back up, replicate and promote across sites.
  • Look and feel like standard enterprise PostgreSQL deployments.

2. Decision

HybridOps.Studio standardises on the following pattern:

  • PostgreSQL runs on a dedicated VM (db-01) on the enterprise hypervisor (for example Proxmox VE, VMware, or XCP-ng).
  • The VM uses host-managed storage:
    • A ZFS/Ceph/ext4 dataset or SAN LUN presented to the guest as a virtio disk.
    • A clear mapping between VM disks and storage pools for backup and replication.
  • A warm DR replica runs in the cloud (for example, a matching VM), fed by WAL shipping and/or logical replication.
  • Backups and DR orchestration are executed by:
    • Jenkins pipelines (ADR-0603) and/or cost-aware DR automation,
    • with evidence captured under output/artifacts/data/postgresql-vm/ and DR folders.
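As an illustration, the WAL-shipping leg of this pattern comes down to a handful of settings on the primary. This is a minimal sketch, assuming WAL-G as the archiver (see section 6); the timeout value is a placeholder, not a project default:

```ini
# postgresql.conf (primary, db-01) — minimal WAL-shipping sketch
wal_level = replica                     # sufficient for physical replication and PITR
archive_mode = on
archive_command = 'wal-g wal-push %p'   # ship each completed WAL segment to the archive
archive_timeout = 300                   # force a segment switch at least every 5 minutes
```

WAL-G reads its archive destination (for example an S3-compatible bucket) from environment variables, so the same config works unchanged on-prem and in the cloud.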

Scope boundaries:

  • This ADR covers the primary shared PostgreSQL tier used by NetBox and other core services.
  • It does not forbid:
    • Additional per-app PostgreSQL instances.
    • LXC-based PostgreSQL used for throwaway labs (still documented in ADR-0013).
  • Managed cloud databases (for example Azure Database for PostgreSQL) remain possible future options but are not the baseline.

3. Rationale

3.1 Why a dedicated VM?

  • Isolation and familiarity
    • Matches how many enterprises run PostgreSQL today.
    • Easier to reason about hardening, patching and incident response.
  • Portability
    • VM images and disks can be replicated to other hypervisors or clouds (OVA/OVF, qcow2, etc.).
    • Aligns with on-prem → cloud DR narratives.
  • Clear separation of compute and storage
    • Storage is owned by the hypervisor/storage layer, not by the guest.
    • Fits the “state outside Kubernetes” principle from ADR-0202 / ADR-0204.

3.2 Why not the LXC pattern for the primary tier?

  • The LXC pattern (ADR-0013) remains useful as a teaching and lab pattern, but:
    • Kernel and cgroup behaviour differs from a full VM.
    • Isolation expectations for a central shared DB are better met with a VM.
    • Enterprise readers are more comfortable with “DB VM + storage pool” than “DB LXC + host bind mounts”.

3.3 Why not PostgreSQL in Kubernetes?

  • For the central shared DB tier:
    • Embedding PostgreSQL inside RKE2 would blur the separation between the stateless cluster and the stateful tier.
    • DR runbooks would need to recover both the cluster and the DB simultaneously.
  • Keeping PostgreSQL on a VM:
    • Allows RKE2 clusters to be rebuilt or replaced (ADR-0202, ADR-0204).
    • Keeps DB failover and promotion logic independent of any single cluster.

4. Consequences

4.1 Positive consequences

  • Enterprise-aligned pattern
    • A dedicated DB VM with a clear storage mapping is easy to understand and defend.
    • Aligns with common ITIL/ISO-style operational controls.
  • Stronger DR story
    • Replication and backup flows target a VM, which can be moved or recreated independently of RKE2.
    • The warm replica in the cloud can be promoted even if on-prem RKE2 is partially or fully down.
  • Clear boundaries
    • Kubernetes is used for stateless and soft-state workloads.
    • The PostgreSQL VM is explicitly the source of record for key datasets.

4.2 Negative consequences / risks

  • Slightly higher resource footprint
    • A full VM vs an LXC container means more overhead in a small homelab.
  • More moving parts in DR
    • Hypervisor, storage and VM replication must be orchestrated correctly.
  • Risk of configuration drift
    • Without configuration-as-code, the VM could accumulate manual changes.

Mitigations:

  • Use Packer templates and Ansible roles for VM build and provisioning.
  • Keep PostgreSQL configuration in Git as far as practical (for example postgresql.conf, pg_hba.conf templates).
  • Automate DR validation in Jenkins and/or GitHub Actions with clear evidence outputs.
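The “configuration in Git” mitigation can be as simple as rendering both files from repository-tracked templates on every Ansible run, so manual edits on the VM are overwritten. A hypothetical task sketch (paths, template names and handler names are illustrative, not the project’s actual role layout):

```yaml
# Sketch: render PostgreSQL config from Git-tracked templates
- name: Deploy postgresql.conf from template
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: /var/lib/pgsql/data/postgresql.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Restart postgresql      # most postgresql.conf changes need a restart

- name: Deploy pg_hba.conf from template
  ansible.builtin.template:
    src: pg_hba.conf.j2
    dest: /var/lib/pgsql/data/pg_hba.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Reload postgresql       # pg_hba.conf changes take effect on reload
```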

5. Alternatives considered

  • ADR-0013 – PostgreSQL in LXC with host-mounted storage
    • Retained only for lab use; superseded for the central DB tier by this ADR.

  • PostgreSQL inside Kubernetes (RKE2) with Longhorn volumes
    • Rejected for the primary shared DB tier to avoid coupling the cluster lifecycle to the DB lifecycle.
    • Still possible for non-critical or per-app databases.

  • Managed cloud PostgreSQL as primary
    • Attractive for some organisations, but:
      • Weakens the on-prem-first, hybrid narrative of HybridOps.Studio.
      • Adds cost and provider lock-in for the baseline blueprint.
    • Kept as an option for future hybrid variants.

6. Implementation notes

Implementation is spread across:

  • Packer / templates
    • infra/packer-multi-os/:
      • Rocky Linux 9 base image used for db-01.
  • Terraform
    • infra/terraform/modules/<hypervisor>/vm-postgresql/:
      • VM definition (CPU, RAM, disk, networks).
      • Attachment to the appropriate storage pools and networks.
  • Ansible
    • deployment/baseline/database/:
      • playbooks/bootstrap-postgres.yml
      • vars/main.yml describing instance parameters, backup endpoints and retention.
  • DR and backup
    • Jenkins pipelines for:
      • Regular base backups and WAL archiving (for example via WAL-G).
      • DR promotion drills on the cloud replica.
    • Evidence stored under:
      • output/artifacts/data/postgresql-vm/
      • output/artifacts/dr/
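On the cloud replica, the warm-standby behaviour described above reduces to an empty standby.signal file in the data directory plus two settings. A sketch, assuming WAL-G again; the hostname and replication user are placeholders:

```ini
# postgresql.conf (cloud replica) — warm standby sketch
# An empty standby.signal file in the data directory puts the server in standby mode.
primary_conninfo = 'host=db-01.example.internal user=replicator sslmode=require'
restore_command = 'wal-g wal-fetch %f %p'   # replay archived WAL when streaming is unavailable
```

With both streaming (primary_conninfo) and archive restore (restore_command) configured, the replica keeps catching up from the WAL archive even when the direct connection to db-01 is down.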

7. Operational impact and validation

Operational impact:

  • Platform/SRE teams treat db-01 as a tier-1 asset:
    • Regular patching and security updates.
    • Monitoring via Prometheus (postgres_exporter) and alerting via Alertmanager.
  • DR drills include:
    • An on-prem outage with cloud promotion.
    • A cloud outage with on-prem remaining primary.
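The on-prem-outage drill hinges on a single promotion step on the replica; since PostgreSQL 12 this can be done in SQL rather than via trigger files. A sketch of the drill’s core step only, not the full runbook:

```sql
-- On the cloud replica, once the on-prem primary is confirmed down:
SELECT pg_promote();          -- returns true when promotion succeeds
SELECT pg_is_in_recovery();   -- should now return false
```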

Validation:

  • Runbooks:
    • ../ops/runbooks/data/postgresql-vm-backup-and-dr.md
  • Evidence:
    • ../evidence/evidence-03-source-of-truth-netbox-automation.md (NetBox on the PostgreSQL VM).
    • ../evidence/evidence-04-delivery-platform-gitops-cluster-operations.md (platform and apps depending on VM-based PostgreSQL during DR drills).

8. References


Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.