
Control Node Runs as a VM (Cloud-Init); LXC Reserved for Light Helpers

Status

Accepted — The primary control plane (ctrl-01) is provisioned as a full VM on an enterprise hypervisor (Proxmox VE today), bootstrapped via Cloud-Init or equivalent image metadata, while lightweight helper functions remain on LXC (or similar) containers.


1. Context

Early experiments used LXC containers for both control and execution nodes to save resources.

For the long-term HybridOps.Studio blueprint, the control node must:

  • Run Jenkins, Terraform, Packer, Ansible, and related tooling reliably.
  • Drive RKE2 clusters running on full VMs (see ADR-0014 and ADR-0204).
  • Orchestrate DR workflows, GitOps controllers, and evidence collection.

Containerised control nodes introduced subtle issues:

  • Missing or constrained cgroup / kernel features.
  • Less predictable systemd behaviour.
  • Friction when using providers or tools that assume “full OS” semantics.

This ADR defines ctrl-01 as a VM on the chosen hypervisor (Proxmox VE today), while keeping LXCs for small helper workloads only.


2. Decision

HybridOps.Studio standardises on the following pattern:

  • The primary control node (ctrl-01) runs as a full VM on an enterprise hypervisor:
    • Proxmox VE in the homelab implementation.
    • The pattern remains portable to VMware, KVM, and cloud VMs.

  • The VM is built from a cloud-init capable image (for example Ubuntu or Rocky) using Packer (see ADR-0016) and provisioned by Terraform.

  • LXC containers (or equivalent “lightweight guests”) are reserved for non-critical helpers, such as:

    • Log processing helpers.
    • Documentation generators.
    • Lightweight demo workloads that are not part of the control plane.

ctrl-01 is treated as part of the “platform control plane” alongside RKE2 clusters and external PostgreSQL, not as a disposable lab node.
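
For illustration, the cloud-init side of this decision might look like the minimal user-data sketch below. Every value here (user name, key, package list) is a placeholder, not the project's actual configuration:

```yaml
#cloud-config
# Hypothetical first-boot user-data for ctrl-01; all values are placeholders.
package_update: true
packages:
  - git
  - python3
users:
  - name: ops
    sudo: "ALL=(ALL) NOPASSWD:ALL"
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example   # placeholder public key
```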


3. Rationale

Why a full VM for ctrl-01:

  • Predictable OS behaviour
    • Full systemd, cgroups, and kernel modules available.
    • Fewer surprises when running Terraform providers, Docker, or nested tooling.

  • Portability
    • VM images can be exported/imported to other hypervisors or clouds.
    • Aligns with ADR-0014 / ADR-0204 where RKE2 itself runs on VMs.

  • DR story
    • Snapshots and backups at the VM level are straightforward.
    • In DR drills, ctrl-01 can be rebuilt from the Packer image plus automation, restoring Jenkins and orchestration tools.

Why keep LXCs at all:

  • They remain useful as lightweight helpers:
    • Cheap to spin up and tear down.
    • Good for density and “sidecar”-style utilities.
  • But they are explicitly not where the control plane or shared state lives.

4. Consequences

4.1 Positive consequences

  • Clear separation of concerns
    • Control plane tools and CI orchestration live on a VM with full OS semantics.
    • LXCs are for helpers and demos, not for platform-critical services.

  • Stronger DR and evidence story
    • VM-level snapshots and exports make it easy to demonstrate rebuilds.
    • Bootstrap logs and artefacts can be captured from a single, well-defined host.

  • Alignment with other ADRs
    • Matches the pattern in ADR-0014 and ADR-0204 (RKE2 on full VMs).
    • Provides a stable base for Jenkins per ADR-0603.

4.2 Negative consequences and risks

  • Higher resource footprint
    • VMs consume more CPU/RAM than LXCs.
    • On small homelab hardware, capacity planning matters.

  • ctrl-01 becomes a critical dependency
    • Outages on ctrl-01 impact CI orchestration and infrastructure changes.
    • It requires monitoring, backups, and change control.

Mitigations:

  • Treat ctrl-01 as part of the core platform, with:
    • Regular backups (VM and configuration as code).
    • Runbooks for bootstrap and recovery.
  • Use LXCs only for workloads whose failure is acceptable and that are easy to re-create.

5. Alternatives considered

  • Control node as LXC
    • Lower overhead, but:
      • Kernel / cgroup / systemd limitations caused toolchain issues.
      • Less portable as a DR artefact to other hypervisors.
    • Rejected for the primary control plane.

  • Multiple smaller control nodes instead of one ctrl-01
    • More complex to operate and explain in the homelab context.
    • Harder to maintain a single, clear DR story and evidence trail.

  • Running ctrl-01 directly on bare metal
    • Would remove hypervisor indirection, but:
      • Less representative of typical enterprise layouts.
      • Harder to snapshot, clone, and move between environments.
    • Rejected in favour of the “VM on a hypervisor” pattern.

6. Implementation notes

  • Image build
    • Packer templates build a cloud-init-ready base image (see ADR-0016).
    • The image is used both for the homelab and for DR replicas on other hypervisors.
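
As a sketch of what the image build could look like with the HashiCorp Proxmox plugin for Packer: the `proxmox-iso` source and its attributes are real plugin options, but every value (API URL, node, ISO path, user, template name) is a placeholder, not the project's actual template.

```hcl
packer {
  required_plugins {
    proxmox = {
      version = ">= 1.1.0"
      source  = "github.com/hashicorp/proxmox"
    }
  }
}

variable "proxmox_api_url" {
  type = string   # e.g. https://pve.example.lab:8006/api2/json (placeholder)
}

source "proxmox-iso" "ctrl_base" {
  proxmox_url  = var.proxmox_api_url
  node         = "pve01"                         # placeholder Proxmox node
  iso_file     = "local:iso/ubuntu-22.04-live-server-amd64.iso"
  cloud_init   = true                            # attach a cloud-init drive to the template
  ssh_username = "ubuntu"
  vm_name      = "ctrl-base-template"
}

build {
  sources = ["source.proxmox-iso.ctrl_base"]

  # Bake in cloud-init and the guest agent so Terraform can take over later.
  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y cloud-init qemu-guest-agent",
    ]
  }
}
```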

  • Provisioning
    • Terraform provisions the VM (CPU/RAM/disk, networks).
    • Cloud-init handles first-boot configuration (users, SSH keys, base packages).
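
A minimal Terraform sketch of this step, assuming the community Telmate Proxmox provider: the resource attributes shown are real provider options, but the node name, template name, sizing, and addressing are placeholders.

```hcl
terraform {
  required_providers {
    proxmox = {
      source = "Telmate/proxmox"
    }
  }
}

resource "proxmox_vm_qemu" "ctrl01" {
  name        = "ctrl-01"
  target_node = "pve01"                # placeholder Proxmox node
  clone       = "ctrl-base-template"   # template produced by the Packer build
  cores       = 4
  memory      = 8192
  os_type     = "cloud-init"

  # First-boot settings handed to cloud-init by the provider:
  ciuser    = "ops"
  sshkeys   = file("~/.ssh/id_ed25519.pub")
  ipconfig0 = "ip=10.0.10.10/24,gw=10.0.10.1"   # placeholder addressing
}
```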

  • Configuration
    • Ansible and/or Jenkins bootstrap ctrl-01 with:
      • The Packer, Terraform, and Ansible toolchain.
      • A Docker runtime for the Jenkins controller and helpers (ADR-0603).
      • Connectivity to Proxmox, RKE2 clusters, AKV, and PostgreSQL.
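
The Ansible side of the bootstrap could be sketched as the playbook below. The host group and package list are placeholders (Terraform and Packer would normally come from the HashiCorp apt repository rather than the distribution archive).

```yaml
# Hypothetical bootstrap playbook for ctrl-01; names are placeholders.
- name: Bootstrap ctrl-01 toolchain
  hosts: ctrl01
  become: true
  tasks:
    - name: Install base packages
      ansible.builtin.apt:
        name:
          - ansible
          - docker.io
        state: present
        update_cache: true

    - name: Ensure Docker is running for the Jenkins controller
      ansible.builtin.service:
        name: docker
        state: started
        enabled: true
```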

  • Evidence
    • Bootstrap logs and validation outputs are stored under:
      • output/artifacts/ctrl01-bootstrap/<timestamp>/
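
Evidence collection could be as simple as the shell sketch below; the log paths are illustrative, and the timestamp placeholder is filled with a UTC stamp.

```shell
# Sketch: collect bootstrap evidence under the agreed artefact path.
ts="$(date -u +%Y%m%dT%H%M%SZ)"
dest="output/artifacts/ctrl01-bootstrap/${ts}"
mkdir -p "${dest}"

# Copy whichever bootstrap logs exist on this host (paths are illustrative).
for log in /var/log/cloud-init.log /var/log/cloud-init-output.log; do
  if [ -f "${log}" ]; then
    cp "${log}" "${dest}/"
  fi
done

echo "Evidence collected in ${dest}"
```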

7. Operational impact and validation

Operational impact:

  • Platform/SRE operators must:
    • Monitor ctrl-01 health and capacity.
    • Maintain its Packer template and Terraform/Ansible definitions.
    • Include ctrl-01 in DR tests and backup validation.

Validation:

  • Runbook: bootstrap-ctrl01-node.md demonstrates:
    • VM creation from the image.
    • Successful toolchain bootstrap.
  • Additional validation:
    • Jenkins operational on ctrl-01 (ADR-0603).
    • RKE2 clusters and PostgreSQL reachable from ctrl-01.
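
These reachability checks could be codified as a small Ansible playbook; the hostnames, ports, and URL below are placeholders for the actual endpoints.

```yaml
# Hypothetical validation playbook; all endpoints are placeholders.
- name: Validate ctrl-01 connectivity
  hosts: ctrl01
  tasks:
    - name: Jenkins answers on its web UI
      ansible.builtin.uri:
        url: "http://localhost:8080/login"
        status_code: 200

    - name: RKE2 API server is reachable
      ansible.builtin.wait_for:
        host: "rke2-api.example.lab"
        port: 6443
        timeout: 10

    - name: PostgreSQL is reachable
      ansible.builtin.wait_for:
        host: "pg.example.lab"
        port: 5432
        timeout: 10
```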

8. References


Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.