RKE2 Runs on Rocky VMs on Enterprise Hypervisors

Status

Accepted — all RKE2 nodes now run as full Rocky Linux 9.x VMs on an enterprise hypervisor (for example Proxmox VE, VMware vSphere, or similar) for production and DR consistency.


1. Context

ADR-0202 – Adopt RKE2 as Primary Runtime for Platform and Applications establishes RKE2 as the standard Kubernetes runtime for HybridOps.Studio.

Early prototypes deployed RKE2 on:

  • LXC containers on Proxmox VE.
  • Mixed Ubuntu/Rocky VMs on different hypervisors.
  • Ad-hoc lab nodes without a consistent OS baseline.

While lightweight, these approaches introduced issues:

  • Kernel and cgroup limits inside LXCs impacting Kubernetes components.
  • Inconsistent SELinux/AppArmor behaviour affecting Longhorn and CNI add-ons.
  • Snapshot/export incompatibilities across different enterprise hypervisors during DR tests.
  • Difficulty presenting a clean, repeatable “enterprise-style” story to assessors and Academy students.

To achieve a realistic enterprise baseline that can be reproduced on any enterprise hypervisor, HybridOps.Studio needs to standardise both:

  • The guest OS used for RKE2 nodes.
  • The expectation that RKE2 runs on full VMs, not LXCs, for production and DR clusters.

2. Decision

HybridOps.Studio standardises on:

  • Rocky Linux 9.x as the base OS for all RKE2 control-plane and worker nodes.
  • Full virtual machines on an enterprise hypervisor (for example Proxmox VE, VMware vSphere, or similar), not LXC containers, for all production and DR RKE2 clusters.

Additional details:

  • VMs are built via Packer templates and provisioned through Terraform and Ansible, in line with ADR-0202.
  • This ADR refines ADR-0202 by specifying the concrete OS and virtualisation pattern for RKE2 nodes.
  • ADR-0014 is superseded and replaced by this decision.

3. Rationale

3.1 Why Rocky Linux 9.x?

Rocky Linux 9.x provides:

  • Enterprise familiarity — RHEL-compatible stack that is recognisable to enterprise teams.
  • Long-term stability — predictable lifecycle and security update cadence.
  • No subscription cost — avoids licensing friction in homelab and small-team scenarios.
  • Tooling compatibility — works cleanly with Terraform, Packer, Ansible and RKE2 installers.

3.2 Why full VMs on enterprise hypervisors?

Running RKE2 on full VMs instead of LXCs:

  • Avoids container-in-container and cgroup edge cases that appear when running Kubernetes inside LXC.
  • Provides clean, OS-level isolation between RKE2 nodes.
  • Aligns with how many organisations deploy Kubernetes: on VMs running atop VMware, Hyper-V, or a virtualisation layer over bare metal.

Using a pattern that applies equally to Proxmox VE and VMware vSphere:

  • Makes DR scenarios in Evidence 4 easier to reason about (snapshot, replicate, restore VMs).
  • Allows the same ADR to apply whether HybridOps.Studio is running on a homelab hypervisor or a corporate platform.

3.3 Decision drivers

  • Portability — VM images are exportable as OVA/OVF or QCOW2 for use on other hypervisors or cloud.
  • Predictability — stable SELinux, systemd and kernel interfaces across all RKE2 nodes.
  • Governance — matches security controls and expectations in ITIL/ISO-aligned organisations.

4. Consequences

4.1 Positive consequences

  • Predictable enterprise-grade behaviour
      • All RKE2 nodes share the same OS baseline (Rocky 9.x) and virtualisation pattern.
      • Easier to document, support and teach.

  • Improved resilience and DR options
      • VM-based nodes snapshot and replicate cleanly across hypervisors.
      • DR stories can focus on RKE2 cluster recreation and data restoration rather than debugging LXC kernel quirks.

  • Portability to different environments
      • The same VM images and automation patterns can be used on Proxmox VE, VMware vSphere, or other enterprise hypervisors with minimal change.

4.2 Negative consequences / risks

  • Heavier resource footprint
      • Full VMs consume more RAM and disk than LXCs.
      • On small homelabs, cluster size must be tuned carefully.

  • Longer build times
      • Packer builds for Rocky 9.x base images and RKE2-specific images are slower than LXC template creation.

  • Hypervisor dependency
      • If the underlying hypervisor is misconfigured or unavailable, all RKE2 nodes are affected.

Mitigations:

  • Use resource-efficient VM sizing for non-production clusters.
  • Invest in a small but clear set of Packer templates to keep build times manageable.
  • Treat the hypervisor as part of the core control plane and include it in DR and monitoring stories.

5. Alternatives considered

RKE2 on LXC containers

  • Rejected due to kernel and cgroup boundary issues.
  • Adds complexity when troubleshooting storage or networking problems.
  • Harder to explain as an “enterprise pattern” to students or assessors.

Mixed OS base (Ubuntu + Rocky)

  • Prototype showed that mixed OS baselines increased debugging surface area.
  • Longhorn and CNI behaviour differed between distributions.
  • Single OS baseline makes upgrade and validation paths much simpler.

Vendor-specific appliance images

  • Some distributions and vendors ship appliance-style images.
  • These were rejected to keep the stack transparent and teachable: HybridOps.Studio prefers a “plain” OS plus automation over black-box appliances.

6. Implementation notes

Core stack:

  • OS: Rocky Linux 9.x.
  • Hypervisors: Enterprise hypervisors such as Proxmox VE or VMware vSphere.
  • Load Balancer: MetalLB (L2 mode) or a hypervisor-agnostic LB in front of the control plane.
  • CNI: Cilium (default), Canal as a fallback option.
  • Storage: Longhorn (default StorageClass).
  • Provisioning: Packer + Terraform + Ansible, coordinated by Jenkins per ADR-0603.
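A minimal node-level configuration matching the stack above can be sketched as follows. This is illustrative only: `cni`, `tls-san` and `node-label` are standard RKE2 `config.yaml` keys, but the hostname and label values are placeholder assumptions, and the file is written to a scratch directory here so the sketch is safe to run (on a real node it would live at `/etc/rancher/rke2/config.yaml`).

```shell
# Sketch: generate a minimal RKE2 server config selecting Cilium as the CNI,
# per the stack above. Values marked "illustrative" are assumptions.
conf_dir=$(mktemp -d)              # stand-in for /etc/rancher/rke2
cat > "$conf_dir/config.yaml" <<'EOF'
cni: cilium                        # default CNI per this ADR (Canal is the fallback)
tls-san:
  - rke2-api.example.internal      # illustrative LB/VIP name in front of the control plane
node-label:
  - "hybridops.studio/os=rocky9"   # illustrative label marking the Rocky 9.x baseline
EOF
cat "$conf_dir/config.yaml"
```

On a real node, placing this file before running the RKE2 installer is enough for the server to come up with Cilium instead of the default CNI.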

Typical sizing (subject to homelab constraints):

  • Control plane: 1–3 VMs (2 vCPU / 4–8 GB RAM / 40–60 GB disk).
  • Workers: N VMs (2 vCPU / 4–8 GB RAM / ≥60 GB disk).
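As a quick worked example of what this sizing costs the hypervisor, the aggregate footprint of a 3 control-plane + 3 worker cluster can be tallied (the per-VM figures below assume the upper end of the RAM range and the 60 GB disk point from the table above):

```shell
# Back-of-the-envelope footprint for a 3 CP + 3 worker cluster,
# assuming 2 vCPU / 8 GB RAM / 60 GB disk per VM (from the sizing above).
cp_nodes=3; worker_nodes=3
vcpu_per_vm=2; ram_gb_per_vm=8; disk_gb_per_vm=60
total_vms=$((cp_nodes + worker_nodes))
echo "vCPUs: $((total_vms * vcpu_per_vm))"        # 12 vCPUs
echo "RAM:   $((total_vms * ram_gb_per_vm)) GB"   # 48 GB
echo "Disk:  $((total_vms * disk_gb_per_vm)) GB"  # 360 GB
```

Roughly 48 GB of RAM is already beyond many small homelab hosts, which is why the negative-consequences section above calls for tuning cluster size carefully.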

Automation layout (illustrative):

  • Packer templates under infra/packer-multi-os/rocky/ for RKE2-ready base images.
  • Terraform modules under infra/terraform/modules/rke2/ for VM allocation and networking.
  • Ansible roles under deployment/containerization/kubernetes/rke2/playbooks/ for RKE2 installation and add-ons.

7. Operational impact and validation

Operational impact:

  • Platform and SRE teams standardise on Rocky 9.x for RKE2 nodes.
  • Hypervisor and RKE2 upgrades must be co-ordinated and tested.
  • Node troubleshooting assumes a consistent Rocky 9.x baseline, simplifying runbooks.

Validation:

  • Runbooks:
      • ../ops/runbooks/kubernetes/rke2-vm-deploy.md — standing up RKE2 on Rocky VMs.
      • RKE2 upgrade and failure-handling runbooks (to be added).
  • Evidence:
      • ../proof/kubernetes/rke2-vm/ — screenshots, logs and CLI transcripts of RKE2 on Rocky VMs.
  • Metrics and dashboards:
      • Prometheus and Grafana panels showing node health, control-plane status and Longhorn volume status.

Successful DR and upgrade drills that touch RKE2 clusters on Rocky VMs will validate this ADR in practice.


8. References

  • ADR-0202 – Adopt RKE2 as Primary Runtime for Platform and Applications (refined by this ADR).
  • ADR-0603 – pipeline coordination via Jenkins.
  • ADR-0014 – superseded by this decision.


Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.