Operate RKE2 Cluster Module (HyOps)¶
Purpose: Install/validate/destroy RKE2 on provisioned VMs through a single module lifecycle.
Owner: Platform engineering
Trigger: Cluster build, rebuild, or cleanup for an environment
Impact: Changes Kubernetes control-plane/worker runtime on target VMs
Severity: P2
Pre-reqs: platform/onprem/platform-vm state is ok; vault decrypt works; SSH reachability exists (direct or bastion).
Rollback strategy: Run module destroy, then re-apply using known-good inputs.
Context¶
Module ref: platform/onprem/rke2-cluster
This module consumes VM inventory from module state and uses the Ansible driver to converge RKE2 roles.
Preconditions and safety checks¶
Path behavior:
- Installed
hyops(viainstall.sh) can be run from any working directory. - If you want to use the shipped example overlays, set:
export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"
For source checkout usage, set HYOPS_CORE_ROOT to your hybridops-core checkout root instead.
- Ensure required secret key exists:
hyops secrets ensure --env dev RKE2_TOKEN
- Ensure VM inventory module is ready:
cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"
Expect "status": "ok".
- Validate module inputs before apply:
hyops preflight --env dev --strict \
--module platform/onprem/rke2-cluster \
--inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
Steps¶
- Apply (typical)
hyops apply --env dev \
--module platform/onprem/rke2-cluster \
--inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
- Verify module state and kubeconfig path
cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
Expect:
"status": "ok"outputs.cap.k8s.rke2 = "ready"-
outputs.kubeconfig_pathexists -
Verify cluster nodes
KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide
- Destroy (cleanup)
hyops destroy --env dev \
--module platform/onprem/rke2-cluster \
--inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/rke2-cluster/examples/inputs.typical.yml"
Verification¶
Success indicators:
- Apply exits
0and writes module state to: $HOME/.hybridops/envs/<env>/state/modules/platform__onprem__rke2-cluster/latest.json- Evidence path is printed during run under:
$HOME/.hybridops/envs/<env>/logs/module/platform__onprem__rke2-cluster/<run_id>/- Driver log file exists:
ansible.log
Troubleshooting¶
- Connectivity failures (
cannot reach ...:22): - Ensure workstation has L3 reachability or configure bastion settings (
ssh_proxy_jump_*). - Inventory state not ready (
status=destroyedor missing): - Re-apply
platform/onprem/platform-vmfirst. - Long-running phases:
- Follow printed
progress: logs=...path. - Optional:
export HYOPS_PROGRESS_INTERVAL_S=30 - First converge is usually slower after destroy/rebuild because each node pulls RKE2 runtime/control-plane images and waits for CNI readiness.
- This is normal when nodes have cold image caches.
- If convergence is consistently too slow, increase VM sizing (recommended baseline: control-plane
4 vCPU / 8 GiB, worker2 vCPU / 4 GiB) and verify registry egress/DNS latency.