Control Node Runs as a VM (Cloud-Init); LXC Reserved for Light Helpers¶
Status¶
Accepted — The primary control plane (ctrl-01) is provisioned as a full VM on an enterprise hypervisor (Proxmox VE today), bootstrapped via Cloud-Init or equivalent image metadata, while lightweight helper functions remain on LXC (or similar) containers.
1. Context¶
Early experiments used LXC containers for both control and execution nodes to save resources.
For the long-term HybridOps.Studio blueprint, the control node must:
- Run Jenkins, Terraform, Packer, Ansible, and related tooling reliably.
- Drive RKE2 clusters running on full VMs (see ADR-0014 and ADR-0204).
- Orchestrate DR workflows, GitOps controllers, and evidence collection.
Containerised control nodes introduced subtle issues:
- Missing or constrained cgroup / kernel features.
- Less predictable systemd behaviour.
- Friction when using providers or tools that assume “full OS” semantics.
This ADR defines ctrl-01 as a VM on the chosen hypervisor (Proxmox VE today), while keeping LXCs for small helper workloads only.
2. Decision¶
HybridOps.Studio standardises on the following pattern:
- The primary control node (`ctrl-01`) runs as a full VM on an enterprise hypervisor:
    - Proxmox VE in the homelab implementation.
    - The pattern remains portable to VMware, KVM, and cloud VMs.
- The VM is built from a cloud-init-capable image (for example Ubuntu or Rocky) using Packer (see ADR-0016) and provisioned by Terraform.
- LXC containers (or equivalent "lightweight guests") are reserved for non-critical helpers, such as:
    - Log processing helpers.
    - Documentation generators.
    - Lightweight demo workloads that are not part of the control plane.
ctrl-01 is treated as part of the “platform control plane” alongside RKE2 clusters and external PostgreSQL, not as a disposable lab node.
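The cloud-init first-boot step referenced above can be sketched as a minimal user-data fragment. This is an illustrative sketch only; the user name, key placeholder, and package list are assumptions, not taken from the repository:

```yaml
#cloud-config
# Minimal first-boot sketch for ctrl-01 (illustrative values only).
hostname: ctrl-01
users:
  - name: ops
    groups: [sudo]
    shell: /bin/bash
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example   # placeholder key
package_update: true
packages:
  - qemu-guest-agent   # lets the hypervisor query the guest
  - python3            # required for later Ansible runs
runcmd:
  - systemctl enable --now qemu-guest-agent
```

Terraform typically injects a fragment like this via the hypervisor's cloud-init integration, so the same image boots identically on Proxmox VE or any other cloud-init-aware platform.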
3. Rationale¶
Why a full VM for ctrl-01:
- Predictable OS behaviour
    - Full systemd, cgroups, and kernel modules available.
    - Fewer surprises when running Terraform providers, Docker, or nested tooling.
- Portability
    - VM images can be exported/imported to other hypervisors or clouds.
    - Aligns with ADR-0014 / ADR-0204, where RKE2 itself runs on VMs.
- DR story
    - Snapshots and backups at VM level are straightforward.
    - In DR drills, `ctrl-01` can be rebuilt from the Packer image plus automation, restoring Jenkins and orchestration tools.
Why keep LXCs at all:
- They remain useful as lightweight helpers:
    - Cheap to spin up and tear down.
    - Good for density and "sidecar"-style utilities.
- But they are explicitly not where the control plane or shared state lives.
4. Consequences¶
4.1 Positive consequences¶
- Clear separation of concerns
    - Control plane tools and CI orchestration live on a VM with full OS semantics.
    - LXCs are for helpers and demos, not for platform-critical services.
- Stronger DR and evidence story
    - VM-level snapshots and exports make it easy to demonstrate rebuilds.
    - Bootstrap logs and artefacts can be captured from a single, well-defined host.
- Alignment with other ADRs
    - Matches the pattern in ADR-0014 and ADR-0204 (RKE2 on full VMs).
    - Provides a stable base for Jenkins per ADR-0603.
4.2 Negative consequences and risks¶
- Higher resource footprint
    - VMs consume more CPU/RAM than LXCs.
    - On small homelab hardware, capacity planning matters.
- `ctrl-01` becomes a critical dependency
    - Outages on `ctrl-01` impact CI orchestration and infra changes.
    - Requires monitoring, backups, and change control.

Mitigations:
- Treat `ctrl-01` as part of the core platform, with:
    - Regular backups (VM and configuration as code).
    - Runbooks for bootstrap and recovery.
- Use LXCs only where failure is acceptable and the workload is easy to re-create.
5. Alternatives considered¶
- Control node as LXC
    - Lower overhead, but:
        - Kernel / cgroup / systemd limitations caused toolchain issues.
        - Less portable as a DR artefact to other hypervisors.
    - Rejected for the primary control plane.
- Multiple smaller control nodes instead of one `ctrl-01`
    - More complex to operate and explain in the homelab context.
    - Harder to maintain a single, clear DR story and evidence trail.
- Running `ctrl-01` directly on bare metal
    - Would remove hypervisor indirection, but:
        - Less representative of typical enterprise layouts.
        - Harder to snapshot, clone, and move between environments.
    - Rejected in favour of the "VM on a hypervisor" pattern.
6. Implementation notes¶
- Image build
    - Packer templates build a cloud-init-ready base image (see ADR-0016).
    - The image is used both for the homelab and for DR replicas on other hypervisors.
- Provisioning
    - Terraform provisions the VM (CPU/RAM/disk, networks).
    - Cloud-init handles first-boot config (users, SSH keys, base packages).
- Configuration
    - Ansible and/or Jenkins bootstrap `ctrl-01` with:
        - Packer, Terraform, Ansible toolchain.
        - Docker runtime for Jenkins controller and helpers (ADR-0603).
        - Connectivity to Proxmox, RKE2 clusters, AKV, and PostgreSQL.
- Evidence
    - Bootstrap logs and validation outputs are stored under `output/artifacts/ctrl01-bootstrap/<timestamp>/`.
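The Ansible side of the bootstrap above could look roughly like the following play. This is a sketch under assumptions: the package names and service handling are illustrative, and the real repository may install the toolchain differently (for example via vendor repositories):

```yaml
# Playbook sketch: bootstrap the ctrl-01 toolchain (illustrative only).
- name: Bootstrap ctrl-01 control node
  hosts: ctrl-01
  become: true
  tasks:
    - name: Install base utilities
      ansible.builtin.package:
        name:
          - git
          - unzip
        state: present

    - name: Install Docker engine (baseline per ADR-0608)
      ansible.builtin.package:
        name: docker.io        # distro package name varies; illustrative
        state: present

    - name: Ensure Docker is running for the Jenkins controller
      ansible.builtin.service:
        name: docker
        state: started
        enabled: true
```

Keeping this play idempotent means the same automation serves both initial bootstrap and DR rebuilds from the Packer image.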
7. Operational impact and validation¶
Operational impact:
- Platform/SRE operators must:
    - Monitor `ctrl-01` health and capacity.
    - Maintain its Packer template and Terraform/Ansible definitions.
    - Include `ctrl-01` in DR tests and backup validation.
Validation:
- Runbook `bootstrap-ctrl01-node.md` demonstrates:
    - VM creation from the image.
    - Successful toolchain bootstrap.
- Additional validation:
    - Jenkins operational on `ctrl-01` (ADR-0603).
    - RKE2 clusters and PostgreSQL reachable from `ctrl-01`.
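The reachability checks above can be captured as a small assertion play run from `ctrl-01`, so the output lands in the evidence directory alongside the bootstrap logs. The host names and ports here are assumptions for illustration, not values from the repository:

```yaml
# Validation sketch: connectivity checks from ctrl-01 (hosts/ports illustrative).
- name: Validate ctrl-01 connectivity
  hosts: ctrl-01
  gather_facts: false
  tasks:
    - name: Jenkins controller answers locally
      ansible.builtin.wait_for:
        host: 127.0.0.1
        port: 8080            # default Jenkins port; adjust to deployment
        timeout: 10

    - name: PostgreSQL reachable
      ansible.builtin.wait_for:
        host: pg-01.lab       # illustrative host name
        port: 5432
        timeout: 10

    - name: RKE2 API server reachable
      ansible.builtin.wait_for:
        host: rke2-api.lab    # illustrative host name
        port: 6443
        timeout: 10
```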
8. References¶
- ADR-0001 – ADR Process & Conventions
- ADR-0014 – RKE2 Runs on Full VMs
- ADR-0202 – Adopt RKE2 as Primary Runtime for Platform & Applications
- ADR-0204 – RKE2 Runs on Rocky VMs on Enterprise Hypervisors
- ADR-0603 – Run Jenkins Controller on Control Node, Agents on RKE2
- ADR-0608 – Docker Engine baseline
- Runbook – ctrl-01 Bootstrap / Verification
- Diagram – Control Plane Architecture
Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.