Operate RKE2 Cluster Module (HyOps)

  • Purpose: Install/validate/destroy RKE2 on provisioned VMs through a single module lifecycle.
  • Owner: Platform engineering

  • Trigger: Cluster build, rebuild, or cleanup for an environment

  • Impact: Changes Kubernetes control-plane/worker runtime on target VMs
  • Severity: P2
  • Pre-reqs: platform/onprem/platform-vm state is ok; vault decrypt works; SSH reachability exists (direct or bastion).

  • Rollback strategy: Run module destroy, then re-apply using known-good inputs.

Context

Module ref: platform/onprem/rke2-cluster

This module consumes VM inventory from module state and uses the Ansible driver to converge RKE2 roles.

Preconditions and safety checks

Path behavior:

  • Installed hyops (via install.sh) can be run from any working directory.
  • If you want to use the shipped example overlays, set:
    export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"
    

For source checkout usage, set HYOPS_CORE_ROOT to your hybridops-core checkout root instead.

  1. Ensure required secret key exists:

    hyops secrets ensure --env dev RKE2_TOKEN
    
  2. Ensure VM inventory module is ready:

    cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"
    

Expect "status": "ok".

  3. Validate module inputs before apply:
    hyops preflight --env dev --strict \
      --module platform/onprem/rke2-cluster \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
    

Steps

  1. Apply (typical)

    hyops apply --env dev \
      --module platform/onprem/rke2-cluster \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
    
  2. Verify module state and kubeconfig path

    cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
    

Expect:

  • "status": "ok"
  • outputs.cap.k8s.rke2 = "ready"
  • outputs.kubeconfig_path exists
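The three expectations above can be checked in one pass. A sketch, assuming the dotted path `outputs.cap.k8s.rke2` denotes nested JSON keys (not one flat key) and that `kubeconfig_path` is an absolute path on this machine:

```python
import json
from pathlib import Path

STATE = (
    Path.home()
    / ".hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
)

def check_rke2_state(state: dict) -> list:
    """Return a list of problems; an empty list means the state looks healthy."""
    problems = []
    if state.get("status") != "ok":
        problems.append(f"status is {state.get('status')!r}, expected 'ok'")
    outputs = state.get("outputs", {})
    # Assumes nested keys: outputs -> cap -> k8s -> rke2.
    if outputs.get("cap", {}).get("k8s", {}).get("rke2") != "ready":
        problems.append("outputs.cap.k8s.rke2 is not 'ready'")
    kubeconfig = outputs.get("kubeconfig_path")
    if not kubeconfig or not Path(kubeconfig).exists():
        problems.append("outputs.kubeconfig_path missing or file does not exist")
    return problems

if __name__ == "__main__":
    if STATE.exists():
        for line in check_rke2_state(json.loads(STATE.read_text())) or ["all checks passed"]:
            print(line)
    else:
        print(f"state file not found: {STATE}")
```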

  3. Verify cluster nodes

If your workstation has direct L3 reachability to the cluster management subnet:

KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide

If the cluster only sits behind the Proxmox management path, use a temporary SSH tunnel first and point kubectl at the tunneled endpoint:

ssh -f -N -L 16443:10.50.0.61:6443 \
  -J root@<proxmox-host> \
  opsadmin@<control-plane-ip>

python3 - <<'PY'
from pathlib import Path

# Rewrite the exported kubeconfig so the API server URL points at the
# local end of the SSH tunnel (127.0.0.1:16443) instead of the cluster
# management IP, saving the result as a separate file so the original
# stays untouched.
src = Path.home() / ".hybridops/envs/dev/state/kubeconfigs/rke2.yaml"
dst = Path("/tmp/rke2-via-bastion.yaml")
lines = []
for line in src.read_text().splitlines():
    if line.lstrip().startswith("server: https://"):
        # Preserve the original YAML indentation when swapping the URL.
        indent = line[: len(line) - len(line.lstrip())]
        lines.append(f"{indent}server: https://127.0.0.1:16443")
    else:
        lines.append(line)
dst.write_text("\n".join(lines) + "\n")
print(dst)
PY

KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes -o wide

  4. Destroy (cleanup)
    hyops destroy --env dev \
      --module platform/onprem/rke2-cluster \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
    

Verification

Success indicators:

  • Apply exits 0 and writes module state to
    $HOME/.hybridops/envs/<env>/state/modules/platform__onprem__rke2-cluster/latest.json
  • The run-record path is printed during the run under
    $HOME/.hybridops/envs/<env>/logs/module/platform__onprem__rke2-cluster/<run_id>/
  • The driver log file ansible.log exists inside the run-record directory.
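The indicators above can also be verified with a script. This sketch picks the most recently modified run directory as "latest" (the actual run-id naming scheme is not specified here, so mtime is an assumption) and checks that it contains ansible.log:

```python
from pathlib import Path

LOG_ROOT = (
    Path.home() / ".hybridops/envs/dev/logs/module/platform__onprem__rke2-cluster"
)

def latest_run_log(log_root: Path):
    """Return the ansible.log of the most recently modified run directory, or None."""
    runs = [d for d in log_root.iterdir() if d.is_dir()] if log_root.is_dir() else []
    if not runs:
        return None
    # Assumption: the newest run record is the most recently modified directory.
    newest = max(runs, key=lambda d: d.stat().st_mtime)
    log = newest / "ansible.log"
    return log if log.is_file() else None

if __name__ == "__main__":
    log = latest_run_log(LOG_ROOT)
    print(log if log else "no run record with ansible.log found")
```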

Troubleshooting

  • Connectivity failures (cannot reach ...:22): ensure the workstation has L3 reachability, or configure bastion settings (ssh_proxy_jump_*).
  • kubectl hangs but the module reports status=ok: the exported kubeconfig points at the first control-plane node's management IP. Ensure your workstation can reach that subnet directly, or use a bastion/temporary SSH tunnel.
  • Inventory state not ready (status=destroyed or missing): re-apply platform/onprem/platform-vm first.
  • Long-running phases: follow the printed progress (logs=... path). Optional: export HYOPS_PROGRESS_INTERVAL_S=30. The first converge after destroy/rebuild is usually slower because each node pulls RKE2 runtime/control-plane images and waits for CNI readiness; this is normal when nodes have cold image caches. If convergence is consistently too slow, increase VM sizing (recommended baseline: control-plane 4 vCPU / 8 GiB, worker 2 vCPU / 4 GiB) and verify registry egress/DNS latency.
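For the "kubectl hangs" case, a quick TCP probe tells you whether the problem is reachability or the API server itself before you reach for tcpdump. The host and port values below are the example endpoints from the tunnel step and are assumptions; substitute your own control-plane IP and tunnel port:

```python
import socket

def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 10.50.0.61:6443 is the example control-plane endpoint from the tunnel
    # step above; 127.0.0.1:16443 is the local end of the SSH tunnel.
    for host, port in [("10.50.0.61", 6443), ("127.0.0.1", 16443)]:
        status = "reachable" if can_reach(host, port, 1.0) else "unreachable"
        print(f"{host}:{port} -> {status}")
```

If the direct endpoint is unreachable but the tunneled one responds, the kubeconfig rewrite shown earlier is the fix; if both fail, check routing and the tunnel itself.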

References