Operate RKE2 Cluster Module (HyOps)

  • Purpose: Install/validate/destroy RKE2 on provisioned VMs through a single module lifecycle.
  • Owner: Platform engineering

  • Trigger: Cluster build, rebuild, or cleanup for an environment

  • Impact: Changes Kubernetes control-plane/worker runtime on target VMs
  • Severity: P2
  • Pre-reqs: platform/onprem/platform-vm state is ok; vault decrypt works; SSH reachability exists (direct or bastion).

  • Rollback strategy: Run module destroy, then re-apply using known-good inputs.

Context

Module ref: platform/onprem/rke2-cluster

This module consumes VM inventory from module state and uses the Ansible driver to converge RKE2 roles.

Preconditions and safety checks

Path behavior:

  • Installed hyops (via install.sh) can be run from any working directory.
  • If you want to use the shipped example overlays, set:
    export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"
    

For source checkout usage, set HYOPS_CORE_ROOT to your hybridops-core checkout root instead.

  1. Ensure required secret key exists:

    hyops secrets ensure --env dev RKE2_TOKEN
    
  2. Ensure VM inventory module is ready:

    cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"
    

Expect "status": "ok".

  3. Validate module inputs before apply:
    hyops preflight --env dev --strict \
      --module platform/onprem/rke2-cluster \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
    

Steps

  1. Apply (typical)

    hyops apply --env dev \
      --module platform/onprem/rke2-cluster \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
    
  2. Verify module state and kubeconfig path

    cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
    

Expect:

  • "status": "ok"
  • outputs.cap.k8s.rke2 = "ready"
  • outputs.kubeconfig_path exists
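The three expectations above can be checked in one pass. A sketch, assuming the dotted path `outputs.cap.k8s.rke2` denotes nested JSON keys (not one flat key) and that `kubeconfig_path` is an absolute path on this machine:

```python
import json
from pathlib import Path

STATE = (
    Path.home()
    / ".hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
)

def check_rke2_state(state: dict) -> list:
    """Return a list of problems; an empty list means the state looks healthy."""
    problems = []
    if state.get("status") != "ok":
        problems.append(f"status is {state.get('status')!r}, expected 'ok'")
    outputs = state.get("outputs", {})
    # Assumes nested keys: outputs -> cap -> k8s -> rke2.
    if outputs.get("cap", {}).get("k8s", {}).get("rke2") != "ready":
        problems.append("outputs.cap.k8s.rke2 is not 'ready'")
    kubeconfig = outputs.get("kubeconfig_path")
    if not kubeconfig or not Path(kubeconfig).exists():
        problems.append("outputs.kubeconfig_path missing or file does not exist")
    return problems

if __name__ == "__main__":
    if STATE.exists():
        for line in check_rke2_state(json.loads(STATE.read_text())) or ["all checks passed"]:
            print(line)
    else:
        print(f"state file not found: {STATE}")
```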

  3. Verify cluster nodes

If your workstation has direct L3 reachability to the cluster management subnet:

KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide

If the cluster only sits behind the Proxmox management path, use a temporary SSH tunnel first and point kubectl at the tunneled endpoint:

ssh -f -N -L 16443:10.50.0.61:6443 \
  -J root@<proxmox-host> \
  opsadmin@<control-plane-ip>

python3 - <<'PY'
from pathlib import Path

# Rewrite the exported kubeconfig so the API server URL points at the
# local end of the SSH tunnel (127.0.0.1:16443) instead of the cluster
# management IP, saving the result as a separate file so the original
# stays untouched.
src = Path.home() / ".hybridops/envs/dev/state/kubeconfigs/rke2.yaml"
dst = Path("/tmp/rke2-via-bastion.yaml")
lines = []
for line in src.read_text().splitlines():
    if line.lstrip().startswith("server: https://"):
        # Preserve the original YAML indentation when swapping the URL.
        indent = line[: len(line) - len(line.lstrip())]
        lines.append(f"{indent}server: https://127.0.0.1:16443")
    else:
        lines.append(line)
dst.write_text("\n".join(lines) + "\n")
print(dst)
PY

KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes -o wide

  4. Destroy (cleanup)
    hyops destroy --env dev \
      --module platform/onprem/rke2-cluster \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
    

Verification

Success indicators:

  • Apply exits 0 and writes module state to
    $HOME/.hybridops/envs/<env>/state/modules/platform__onprem__rke2-cluster/latest.json
  • The run-record path is printed during the run under
    $HOME/.hybridops/envs/<env>/logs/module/platform__onprem__rke2-cluster/<run_id>/
  • The driver log file ansible.log exists inside the run-record directory.
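The indicators above can also be verified with a script. This sketch picks the most recently modified run directory as "latest" (the actual run-id naming scheme is not specified here, so mtime is an assumption) and checks that it contains ansible.log:

```python
from pathlib import Path

LOG_ROOT = (
    Path.home() / ".hybridops/envs/dev/logs/module/platform__onprem__rke2-cluster"
)

def latest_run_log(log_root: Path):
    """Return the ansible.log of the most recently modified run directory, or None."""
    runs = [d for d in log_root.iterdir() if d.is_dir()] if log_root.is_dir() else []
    if not runs:
        return None
    # Assumption: the newest run record is the most recently modified directory.
    newest = max(runs, key=lambda d: d.stat().st_mtime)
    log = newest / "ansible.log"
    return log if log.is_file() else None

if __name__ == "__main__":
    log = latest_run_log(LOG_ROOT)
    print(log if log else "no run record with ansible.log found")
```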

Troubleshooting

  • Connectivity failures (cannot reach ...:22): ensure the workstation has L3 reachability, or configure bastion settings (ssh_proxy_jump_*).
  • kubectl hangs but the module reports status=ok: the exported kubeconfig points at the first control-plane node's management IP. Ensure your workstation can reach that subnet directly, or use a bastion/temporary SSH tunnel.
  • Inventory state not ready (status=destroyed or missing): re-apply platform/onprem/platform-vm first.
  • Long-running phases: follow the printed progress (logs=... path). Optional: export HYOPS_PROGRESS_INTERVAL_S=30. The first converge after destroy/rebuild is usually slower because each node pulls RKE2 runtime/control-plane images and waits for CNI readiness; this is normal when nodes have cold image caches. If convergence is consistently too slow, increase VM sizing (recommended baseline: control-plane 4 vCPU / 8 GiB, worker 2 vCPU / 4 GiB) and verify registry egress/DNS latency.
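For the "kubectl hangs" case, a quick TCP probe tells you whether the problem is reachability or the API server itself before you reach for tcpdump. The host and port values below are the example endpoints from the tunnel step and are assumptions; substitute your own control-plane IP and tunnel port:

```python
import socket

def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 10.50.0.61:6443 is the example control-plane endpoint from the tunnel
    # step above; 127.0.0.1:16443 is the local end of the SSH tunnel.
    for host, port in [("10.50.0.61", 6443), ("127.0.0.1", 16443)]:
        status = "reachable" if can_reach(host, port, 1.0) else "unreachable"
        print(f"{host}:{port} -> {status}")
```

If the direct endpoint is unreachable but the tunneled one responds, the kubeconfig rewrite shown earlier is the fix; if both fail, check routing and the tunnel itself.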

References