PostgreSQL Runs on a Dedicated VM with Host-Managed Storage and DR Replication
Status
Accepted — PostgreSQL for HybridOps.Studio's central shared database tier runs on a dedicated VM on an enterprise hypervisor, with host-managed storage and a warm DR replica in the cloud. The earlier LXC-based pattern recorded in ADR-0013 is retained for labs only.
1. Context
HybridOps.Studio needs a durable, portable and auditable database tier for:
- NetBox (source of truth and IPAM).
- Platform services that require relational state.
- Future Academy workloads (for example Moodle) that rely on PostgreSQL.
Early designs (ADR-0013) standardised on PostgreSQL in an LXC container with host-mounted storage. While functional and lightweight, this raised several concerns for the primary shared database:
- Isolation and security expectations for a central DB are typically VM-level or higher.
- Enterprise assessors and teams are more familiar with VM-based PostgreSQL than LXC-based.
- Kernel, cgroup and privilege nuances inside LXC are an additional variable during DR drills.
At the same time, the platform:
- Uses RKE2 as the primary runtime for stateless services (ADR-0202, ADR-0204).
- Treats Kubernetes as stateless compute, with critical state hosted externally.
- Relies on a clear DR story across on-prem hypervisors and cloud environments.
The database tier must therefore:
- Live outside Kubernetes.
- Be easy to back up, replicate and promote across sites.
- Look and feel like standard enterprise PostgreSQL deployments.
2. Decision
HybridOps.Studio standardises on the following pattern:
- PostgreSQL runs on a dedicated VM (`db-01`) on the enterprise hypervisor (for example Proxmox VE, VMware, or XCP-ng).
- The VM uses host-managed storage:
- A ZFS, Ceph or ext4 dataset, or a SAN LUN, presented as a virtio disk.
- A clear mapping between VM disks and storage pools for backup and replication.
- A warm DR replica runs in the cloud (for example, a matching VM), fed by WAL shipping and/or logical replication.
- Backups and DR orchestration are executed by Jenkins pipelines (ADR-0603) and/or cost-aware DR automation, with evidence captured under `output/artifacts/data/postgresql-vm/` and the DR folders.
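The WAL-shipping leg of this pattern can be sketched as two configuration fragments, assuming WAL-G as the archiving tool (the example named later in the Implementation notes); the hostname, replication user and paths are illustrative, not taken from the repository.

```conf
# Primary (db-01): postgresql.conf fragment -- illustrative values
wal_level = replica
archive_mode = on
# Hypothetical: ship completed WAL segments to the DR object store via WAL-G
archive_command = 'wal-g wal-push %p'
max_wal_senders = 5
```

```conf
# Warm DR replica: postgresql.auto.conf fragment, plus an empty standby.signal
# file in the data directory to start the instance in standby mode
primary_conninfo = 'host=db-01.example.internal user=replicator sslmode=require'
# Hypothetical: fetch archived WAL from the same store if streaming falls behind
restore_command = 'wal-g wal-fetch %f %p'
```

Promotion during a DR drill then reduces to `pg_ctl promote` (or `SELECT pg_promote()`) on the replica, independent of any Kubernetes cluster state.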
Scope boundaries:
- This ADR covers the primary shared PostgreSQL tier used by NetBox and other core services.
- It does not forbid:
- Additional per-app PostgreSQL instances.
- LXC-based PostgreSQL used for throwaway labs (still documented in ADR-0013).
- Managed cloud databases (for example Azure Database for PostgreSQL) remain possible future options but are not the baseline.
3. Rationale
3.1 Why a dedicated VM?
- Isolation and familiarity
- Matches how many enterprises run PostgreSQL today.
- Easier to reason about hardening, patching and incident response.
- Portability
- VM images and disks can be replicated to other hypervisors or clouds (OVA/OVF, qcow2, etc.).
- Aligns with on-prem → cloud DR narratives.
- Clear separation of compute and storage
- Storage is owned by the hypervisor/storage layer, not by the guest.
- Fits the “state outside Kubernetes” principle from ADR-0202 / ADR-0204.
3.2 Why not the LXC pattern for the primary tier?
- The LXC pattern (ADR-0013) remains useful as a teaching and lab pattern, but:
- Kernel and cgroup behaviour differs from a full VM.
- Isolation expectations for a central shared DB are better met with a VM.
- Enterprise readers are more comfortable with “DB VM + storage pool” than “DB LXC + host bind mounts”.
3.3 Why not PostgreSQL in Kubernetes?
- For the central shared DB tier:
- Embedding PostgreSQL inside RKE2 would blur the separation between stateless cluster and stateful tier.
- DR runbooks would need to recover both the cluster and the DB simultaneously.
- Keeping PostgreSQL on a VM:
- Allows RKE2 clusters to be rebuilt or replaced (ADR-0202, ADR-0204).
- Keeps DB failover and promotion logic independent of any single cluster.
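One way to express this boundary on the RKE2 side is an ExternalName Service, so in-cluster workloads keep a stable DNS name while the data itself lives on the VM; the namespace and hostname below are assumptions for illustration, not repository values.

```yaml
# Hypothetical: stable in-cluster alias for the external PostgreSQL VM.
# Pods resolve postgres.netbox.svc to a CNAME for db-01, so a DB failover
# or promotion only requires updating DNS, not redeploying applications.
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: netbox
spec:
  type: ExternalName
  externalName: db-01.example.internal
```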
4. Consequences
4.1 Positive consequences
- Enterprise-aligned pattern
- A dedicated DB VM with a clear storage mapping is easy to understand and defend.
- Aligns with common ITIL/ISO-style operational controls.
- Stronger DR story
- Replication and backup flows target a VM, which can be moved or recreated independently of RKE2.
- The warm replica in the cloud can be promoted even if on-prem RKE2 is partially or fully down.
- Clear boundaries
- Kubernetes is used for stateless and soft-state workloads.
- The PostgreSQL VM is explicitly the source of record for key datasets.
4.2 Negative consequences / risks
- Slightly higher resource footprint
- Full VM vs LXC means more overhead in a small homelab.
- More moving parts in DR
- Hypervisor, storage and VM replication must be orchestrated correctly.
- Risk of configuration drift
- Without configuration-as-code, the VM could accumulate manual changes.
Mitigations:
- Use Packer templates and Ansible roles for VM build and provisioning.
- Keep PostgreSQL configuration in Git as far as practical (for example `postgresql.conf` and `pg_hba.conf` templates).
- Automate DR validation in Jenkins and/or GitHub Actions with clear evidence outputs.
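The configuration-as-code mitigation can be sketched as Ansible tasks that render the templated files from Git; the template names, data directory path and handler names are hypothetical, shown only to illustrate the shape of the role.

```yaml
# Hypothetical Ansible tasks: render PostgreSQL config from Git-tracked templates
- name: Render postgresql.conf from template
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: /var/lib/pgsql/data/postgresql.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: restart postgresql

- name: Render pg_hba.conf from template
  ansible.builtin.template:
    src: pg_hba.conf.j2
    dest: /var/lib/pgsql/data/pg_hba.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: reload postgresql
```

Because every change flows through a template and a handler, manual edits on `db-01` show up as drift on the next playbook run instead of accumulating silently.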
5. Alternatives considered
- ADR-0013 – PostgreSQL in LXC with host-mounted storage
- Retained only for lab use; superseded for the central DB tier by this ADR.
- PostgreSQL inside Kubernetes (RKE2) with Longhorn volumes
- Rejected for the primary shared DB tier to avoid coupling the cluster lifecycle to the DB lifecycle.
- Still possible for non-critical or per-app databases.
- Managed cloud PostgreSQL as primary
- Attractive for some organisations, but:
- Weakens the on-prem-first, hybrid narrative of HybridOps.Studio.
- Adds cost and provider lock-in for the baseline blueprint.
- Kept as an option for future hybrid variants.
6. Implementation notes
Implementation appears across:
- Packer / templates (`infra/packer-multi-os/`):
- Rocky Linux 9 base image used for `db-01`.
- Terraform (`infra/terraform/modules/<hypervisor>/vm-postgresql/`):
- VM definition (CPU, RAM, disk, networks).
- Attachment to appropriate storage pools and networks.
- Ansible (`deployment/baseline/database/`):
- `playbooks/bootstrap-postgres.yml`.
- `vars/main.yml` describing instance parameters, backup endpoints and retention.
- DR and backup:
- Jenkins pipelines for regular base backups and WAL archiving (for example via WAL-G), and for DR promotion drills on the cloud replica.
- Evidence stored under `output/artifacts/data/postgresql-vm/` and `output/artifacts/dr/`.
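The scheduled backup leg can be sketched as a cron fragment, assuming WAL-G is installed on `db-01` and reads its credentials from its own environment; the schedule, data directory and retention count are illustrative rather than repository values.

```conf
# /etc/cron.d/walg-backups -- hypothetical schedule on db-01
# Nightly base backup of the data directory to the configured object store
30 2 * * * postgres wal-g backup-push /var/lib/pgsql/data
# Weekly retention sweep: keep the last 7 full backups
0 4 * * 0 postgres wal-g delete retain FULL 7 --confirm
```

In the pattern above, a Jenkins pipeline would wrap the same commands, capture their output as evidence, and fail the run loudly if a backup is missed, rather than relying on cron alone.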
7. Operational impact and validation
Operational impact:
- Platform/SRE teams treat `db-01` as a tier-1 asset:
- Regular patching and security updates.
- Monitoring via Prometheus (`postgres_exporter`) and alerting via Alertmanager.
- DR drills include:
- On-prem outage with cloud promotion.
- Cloud outage with on-prem remaining primary.
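The monitoring hook can be sketched as a Prometheus scrape job against `postgres_exporter`, which listens on port 9187 by default; the job name and hostname are assumptions for illustration.

```yaml
# Hypothetical Prometheus scrape job for postgres_exporter on db-01
scrape_configs:
  - job_name: postgresql
    static_configs:
      - targets: ["db-01.example.internal:9187"]  # postgres_exporter default port
```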
Validation:
- Runbooks:
- `../ops/runbooks/data/postgresql-vm-backup-and-dr.md`
- Evidence:
- `../evidence/evidence-03-source-of-truth-netbox-automation.md` (NetBox on the PostgreSQL VM).
- `../evidence/evidence-04-delivery-platform-gitops-cluster-operations.md` (platform and apps depending on VM-based PostgreSQL during DR drills).
8. References
- ADR-0001 – ADR Process and Conventions
- ADR-0013 – PostgreSQL Runs in LXC (State on Host-Mounted Storage; Backups First-Class)
- ADR-0202 – Adopt RKE2 as Primary Runtime for Platform and Applications
- ADR-0204 – RKE2 Runs on Rocky VMs on Enterprise Hypervisors
- ADR-0603 – Run Jenkins Controller on Control Node, Agents on RKE2
Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.