
PostgreSQL Runs on Dedicated VM with Host-Managed Storage and DR Replication

Status

Accepted — PostgreSQL for HybridOps.Studio’s central shared database tier runs on a dedicated VM on an enterprise hypervisor, with host-managed storage and a warm DR replica in the cloud. The earlier LXC-based pattern recorded in ADR-0013 is retained for labs only.


1. Context

HybridOps.Studio needs a durable, portable and auditable database tier for:

  • NetBox (source of truth and IPAM).
  • Platform services that require relational state.
  • Future Academy workloads (for example Moodle) that rely on PostgreSQL.

Early designs (ADR-0013) standardised on PostgreSQL in an LXC container with host-mounted storage. While functional and lightweight, this raised several concerns for the primary shared database:

  • Isolation and security expectations for a central DB are typically VM-level or higher.
  • Enterprise assessors and teams are more familiar with VM-based PostgreSQL than LXC-based.
  • Kernel, cgroup and privilege nuances inside LXC are an additional variable during DR drills.

At the same time, the platform:

  • Uses RKE2 as the primary runtime for stateless services (ADR-0202, ADR-0204).
  • Treats Kubernetes as stateless compute, with critical state hosted externally.
  • Relies on a clear DR story across on-prem hypervisors and cloud environments.

The database tier must therefore:

  • Live outside Kubernetes.
  • Be easy to back up, replicate and promote across sites.
  • Look and feel like standard enterprise PostgreSQL deployments.

2. Decision

HybridOps.Studio standardises on the following pattern:

  • PostgreSQL runs on a dedicated VM (db-01) on the enterprise hypervisor (for example Proxmox VE, VMware, or XCP-ng).
  • The VM uses host-managed storage:
    • A ZFS/Ceph/ext4 dataset or SAN LUN presented to the guest as a virtio disk.
    • A clear mapping between VM disks and storage pools for backup and replication.
  • A warm DR replica runs in the cloud (for example, a matching VM), fed by WAL shipping and/or logical replication.
  • Backups and DR orchestration are executed by:
    • Jenkins pipelines (ADR-0603) and/or cost-aware DR automation,
    • with evidence captured under output/artifacts/data/postgresql-vm/ and DR folders.
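As an illustration, the WAL-shipping leg of this pattern comes down to a handful of settings on the primary. This is a minimal sketch, assuming WAL-G as the archiver (see section 6); the timeout value is a placeholder, not a project default:

```ini
# postgresql.conf (primary, db-01) — minimal WAL-shipping sketch
wal_level = replica                     # sufficient for physical replication and PITR
archive_mode = on
archive_command = 'wal-g wal-push %p'   # ship each completed WAL segment to the archive
archive_timeout = 300                   # force a segment switch at least every 5 minutes
```

WAL-G reads its archive destination (for example an S3-compatible bucket) from environment variables, so the same config works unchanged on-prem and in the cloud.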

Scope boundaries:

  • This ADR covers the primary shared PostgreSQL tier used by NetBox and other core services.
  • It does not forbid:
    • Additional per-app PostgreSQL instances.
    • LXC-based PostgreSQL used for throwaway labs (still documented in ADR-0013).
  • Managed cloud databases (for example Azure Database for PostgreSQL) remain possible future options but are not the baseline.

3. Rationale

3.1 Why a dedicated VM?

  • Isolation and familiarity
    • Matches how many enterprises run PostgreSQL today.
    • Easier to reason about hardening, patching and incident response.
  • Portability
    • VM images and disks can be replicated to other hypervisors or clouds (OVA/OVF, qcow2, etc.).
    • Aligns with on-prem → cloud DR narratives.
  • Clear separation of compute and storage
    • Storage is owned by the hypervisor/storage layer, not by the guest.
    • Fits the “state outside Kubernetes” principle from ADR-0202 / ADR-0204.

3.2 Why not the LXC pattern for the primary tier?

  • The LXC pattern (ADR-0013) remains useful as a teaching and lab pattern, but:
    • Kernel and cgroup behaviour differs from a full VM.
    • Isolation expectations for a central shared DB are better met with a VM.
    • Enterprise readers are more comfortable with “DB VM + storage pool” than “DB LXC + host bind mounts”.

3.3 Why not PostgreSQL in Kubernetes?

  • For the central shared DB tier:
    • Embedding PostgreSQL inside RKE2 would blur the separation between the stateless cluster and the stateful tier.
    • DR runbooks would need to recover both the cluster and the DB simultaneously.
  • Keeping PostgreSQL on a VM:
    • Allows RKE2 clusters to be rebuilt or replaced (ADR-0202, ADR-0204).
    • Keeps DB failover and promotion logic independent of any single cluster.

4. Consequences

4.1 Positive consequences

  • Enterprise-aligned pattern
    • A dedicated DB VM with a clear storage mapping is easy to understand and defend.
    • Aligns with common ITIL/ISO-style operational controls.
  • Stronger DR story
    • Replication and backup flows target a VM, which can be moved or recreated independently of RKE2.
    • The warm replica in the cloud can be promoted even if on-prem RKE2 is partially or fully down.
  • Clear boundaries
    • Kubernetes is used for stateless and soft-state workloads.
    • The PostgreSQL VM is explicitly the source of record for key datasets.

4.2 Negative consequences / risks

  • Slightly higher resource footprint
    • A full VM vs an LXC container means more overhead in a small homelab.
  • More moving parts in DR
    • Hypervisor, storage and VM replication must be orchestrated correctly.
  • Risk of configuration drift
    • Without configuration-as-code, the VM could accumulate manual changes.

Mitigations:

  • Use Packer templates and Ansible roles for VM build and provisioning.
  • Keep PostgreSQL configuration in Git as far as practical (for example postgresql.conf, pg_hba.conf templates).
  • Automate DR validation in Jenkins and/or GitHub Actions with clear evidence outputs.
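The “configuration in Git” mitigation can be as simple as rendering both files from repository-tracked templates on every Ansible run, so manual edits on the VM are overwritten. A hypothetical task sketch (paths, template names and handler names are illustrative, not the project’s actual role layout):

```yaml
# Sketch: render PostgreSQL config from Git-tracked templates
- name: Deploy postgresql.conf from template
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: /var/lib/pgsql/data/postgresql.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Restart postgresql      # most postgresql.conf changes need a restart

- name: Deploy pg_hba.conf from template
  ansible.builtin.template:
    src: pg_hba.conf.j2
    dest: /var/lib/pgsql/data/pg_hba.conf
    owner: postgres
    group: postgres
    mode: "0600"
  notify: Reload postgresql       # pg_hba.conf changes take effect on reload
```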

5. Alternatives considered

  • ADR-0013 – PostgreSQL in LXC with host-mounted storage
    • Retained only for lab use; superseded for the central DB tier by this ADR.

  • PostgreSQL inside Kubernetes (RKE2) with Longhorn volumes
    • Rejected for the primary shared DB tier to avoid coupling the cluster lifecycle to the DB lifecycle.
    • Still possible for non-critical or per-app databases.

  • Managed cloud PostgreSQL as primary
    • Attractive for some organisations, but:
      • Weakens the on-prem-first, hybrid narrative of HybridOps.Studio.
      • Adds cost and provider lock-in for the baseline blueprint.
    • Kept as an option for future hybrid variants.

6. Implementation notes

Implementation is spread across:

  • Packer / templates
    • infra/packer-multi-os/:
      • Rocky Linux 9 base image used for db-01.
  • Terraform
    • infra/terraform/modules/<hypervisor>/vm-postgresql/:
      • VM definition (CPU, RAM, disk, networks).
      • Attachment to the appropriate storage pools and networks.
  • Ansible
    • deployment/baseline/database/:
      • playbooks/bootstrap-postgres.yml
      • vars/main.yml describing instance parameters, backup endpoints and retention.
  • DR and backup
    • Jenkins pipelines for:
      • Regular base backups and WAL archiving (for example via WAL-G).
      • DR promotion drills on the cloud replica.
    • Evidence stored under:
      • output/artifacts/data/postgresql-vm/
      • output/artifacts/dr/
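On the cloud replica, the warm-standby behaviour described above reduces to an empty standby.signal file in the data directory plus two settings. A sketch, assuming WAL-G again; the hostname and replication user are placeholders:

```ini
# postgresql.conf (cloud replica) — warm standby sketch
# An empty standby.signal file in the data directory puts the server in standby mode.
primary_conninfo = 'host=db-01.example.internal user=replicator sslmode=require'
restore_command = 'wal-g wal-fetch %f %p'   # replay archived WAL when streaming is unavailable
```

With both streaming (primary_conninfo) and archive restore (restore_command) configured, the replica keeps catching up from the WAL archive even when the direct connection to db-01 is down.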

7. Operational impact and validation

Operational impact:

  • Platform/SRE teams treat db-01 as a tier-1 asset:
    • Regular patching and security updates.
    • Monitoring via Prometheus (postgres_exporter) and alerting via Alertmanager.
  • DR drills include:
    • An on-prem outage with cloud promotion.
    • A cloud outage with on-prem remaining primary.
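The on-prem-outage drill hinges on a single promotion step on the replica; since PostgreSQL 12 this can be done in SQL rather than via trigger files. A sketch of the drill’s core step only, not the full runbook:

```sql
-- On the cloud replica, once the on-prem primary is confirmed down:
SELECT pg_promote();          -- returns true when promotion succeeds
SELECT pg_is_in_recovery();   -- should now return false
```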

Validation:

  • Runbooks:
    • ../ops/runbooks/data/postgresql-vm-backup-and-dr.md
  • Evidence:
    • ../evidence/evidence-03-source-of-truth-netbox-automation.md (NetBox on the PostgreSQL VM).
    • ../evidence/evidence-04-delivery-platform-gitops-cluster-operations.md (platform and apps depending on VM-based PostgreSQL during DR drills).

8. References


Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.