# Deploy RKE2 Cluster (HyOps Blueprint)

- Purpose: Deploy a complete on-prem RKE2 cluster through a single governed blueprint run.
- Owner: Platform engineering
- Trigger: New environment bring-up, rebuild, or repeatable E2E validation
- Impact: Consumes shared SDN/NetBox authority and converges image, VM infrastructure, and RKE2 cluster software
- Severity: P2
- Pre-reqs: Proxmox init complete for the target env, vault decrypt works, NetBox authority is ready (authoritative IPAM), SSH key access exists, and `RKE2_TOKEN` is available in the vault.
- Rollback strategy: Destroy modules in reverse order, or run a controlled rebuild from the same blueprint inputs.
## Context

- Blueprint ref: `onprem/rke2@v1`
- Location: `hybridops-core/blueprints/onprem/rke2@v1/blueprint.yml`
- Default step flow:
  1. `core/onprem/template-image` (Rocky 9)
  2. `platform/onprem/platform-vm` (IPAM from NetBox + shared SDN state)
  3. `platform/onprem/rke2-cluster`
## Preconditions and safety checks

Path behavior:

- Installed `hyops` (via `install.sh`) can be run from any working directory.
- Source-checkout usage should export `HYOPS_CORE_ROOT=/path/to/hybridops-core`.

Ensure NetBox authority is ready (required for IPAM). This blueprint allocates VM IPs from NetBox (no hardcoded per-VM IPs) and consumes the shared SDN authority state.
Network placement defaults:

- RKE2 control-plane VMs are dual-homed:
  - primary NIC on `vnetmgmt` (operator/API reachability)
  - secondary NIC on `vnetenv` (env workload network)
- RKE2 worker VMs use `vnetenv`

The `vnetenv` bridge alias is resolved from `--env`:

- dev -> `vnetdev` (VLAN 20)
- staging -> `vnetstag` (VLAN 30)
- prod -> `vnetprod` (VLAN 40)

This keeps the generated kubeconfig endpoint on the management network while keeping workload traffic on the environment VLAN.
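The env-to-bridge mapping above can be sketched as a small helper. This is illustrative only: `hyops` performs this resolution internally, and the `env_bridge` function is a hypothetical name, not part of the tooling.

```shell
#!/bin/sh
# Sketch of the documented --env -> vnet bridge-alias table.
# hyops resolves this itself; this helper only mirrors the mapping.
env_bridge() {
  case "$1" in
    dev)     echo "vnetdev"  ;;  # VLAN 20
    staging) echo "vnetstag" ;;  # VLAN 30
    prod)    echo "vnetprod" ;;  # VLAN 40
    *)       echo "unknown env: $1" >&2; return 1 ;;
  esac
}

env_bridge dev   # prints: vnetdev
```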
By default, HyOps expects NetBox authority in `--env shared`. If it is not ready yet, run:

```shell
hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute
```

This also seeds the shared SDN foundation and the NetBox IPAM/inventory datasets consumed by this blueprint.
- Ensure the RKE2 token exists in the runtime vault:

  ```shell
  hyops secrets ensure --env dev RKE2_TOKEN
  ```

- Validate and plan the blueprint:

  ```shell
  hyops blueprint validate --ref onprem/rke2@v1
  hyops blueprint plan --ref onprem/rke2@v1
  ```

- Run preflight (contracts + module resolvability):

  ```shell
  hyops blueprint preflight --env dev --ref onprem/rke2@v1
  ```
Recovery note:

- `template_image_rocky9` and `rke2_vms` are skip-safe infrastructure steps.
- If Proxmox guests still exist but the `platform/onprem/platform-vm#rke2_vms` state slot was lost or marked `destroyed`, repair the slot with `hyops import` first, then rerun the blueprint.
- Do not force a raw `platform-vm` reconcile immediately after importing clone-backed Proxmox VMs; imported provider state does not preserve all create-time clone metadata and may plan a replacement. The blueprint is designed to skip the recovered VM step once the slot is healthy again.
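Before rerunning, it can help to confirm what status the repaired slot actually records. A minimal sketch, assuming only that the state record is JSON with a top-level `status` field (as the verification section shows for module state); the `slot_status` helper and the exact `platform-vm` record path are illustrative assumptions, not hyops APIs.

```python
import json
from pathlib import Path

def slot_status(record: Path) -> str:
    """Return the top-level status recorded in a module state file.

    Assumes the record is JSON with a top-level "status" key; the
    platform-vm record path below mirrors the rke2-cluster layout and
    is an assumption.
    """
    return json.loads(record.read_text()).get("status", "unknown")

# Example (path assumed, not confirmed by the runbook):
# record = Path.home() / ".hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"
# print(slot_status(record))  # rerun the blueprint only once this is "ok"
```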
## Steps

- Execute the full blueprint:

  ```shell
  hyops blueprint deploy --env dev \
    --ref onprem/rke2@v1 \
    --execute
  ```

  If HyOps detects existing step state (rerun/replacement risk), it may prompt for confirmation before executing the blueprint. Use `--yes` for non-interactive runs.
- Observe progress during long phases:

  - HyOps prints a preflight summary before step execution.
  - Each module step prints `progress: logs=...` with the active log file path.
  - Long-running streamed operations emit periodic heartbeat lines.
  - Long-running streamed operations also print a one-time log-watch hint (`tail -f ...`).

  Optional tuning:

  ```shell
  export HYOPS_PROGRESS_INTERVAL_S=30
  ```

  Optional live terminal streaming:

  ```shell
  hyops --verbose blueprint deploy --env dev \
    --ref onprem/rke2@v1 \
    --execute
  ```
- Verify cluster state:

  ```shell
  cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
  KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide
  ```
If your workstation is not routed to the cluster management subnet, tunnel the API through the Proxmox jump host first:

```shell
ssh -f -N -L 16443:10.50.0.61:6443 \
  -J root@<proxmox-host> \
  opsadmin@<control-plane-ip>
```

Then point a copy of the kubeconfig at the local tunnel endpoint:

```shell
python3 - <<'PY'
from pathlib import Path

# Rewrite the kubeconfig server endpoint to the local SSH tunnel port.
src = Path.home() / ".hybridops/envs/dev/state/kubeconfigs/rke2.yaml"
dst = Path("/tmp/rke2-via-bastion.yaml")
text = src.read_text()
lines = []
for line in text.splitlines():
    if line.lstrip().startswith("server: https://"):
        indent = line[: len(line) - len(line.lstrip())]
        lines.append(f"{indent}server: https://127.0.0.1:16443")
    else:
        lines.append(line)
dst.write_text("\n".join(lines) + "\n")
print(dst)
PY
```

```shell
KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes -o wide
```
If you need to verify directly on the control-plane host instead of tunneling the local kubeconfig, use the bundled RKE2 client path:

```shell
ssh -J root@<proxmox-host> opsadmin@<control-plane-ip> \
  'sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide'
```
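For unattended validation runs, the node check can be gated with a small parser over `kubectl get nodes --no-headers` output. This helper is a sketch, not part of hyops or kubectl:

```shell
#!/bin/sh
# Sketch: succeed only when every node's STATUS column reads exactly "Ready".
# Feed it the output of `kubectl get nodes --no-headers`.
all_nodes_ready() {
  awk 'NF == 0 { next } $2 != "Ready" { bad = 1 } END { exit bad }'
}

# Usage (e.g. with the tunneled kubeconfig from above):
# KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes --no-headers \
#   | all_nodes_ready && echo "all nodes Ready"
```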
## Verification

Success indicators:

- Blueprint summary ends with `status=ok`.
- `platform/onprem/rke2-cluster` state shows `status: ok`.
- `outputs.cap.k8s.rke2` is `ready`.
- All expected nodes report `Ready`.

Run record paths:

- Blueprint step records and module logs under: `$HOME/.hybridops/envs/<env>/logs/module/...`
- Cluster kubeconfig: `$HOME/.hybridops/envs/<env>/state/kubeconfigs/rke2.yaml`
- By default, the kubeconfig server endpoint points to the first control-plane node's management IP (reachable from the operator management path).
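The first three success indicators can be checked mechanically from the state record. A sketch, assuming the record's JSON nesting follows the dotted path `outputs.cap.k8s.rke2` listed above; the `rke2_cluster_healthy` helper is illustrative, not shipped with hyops.

```python
import json
from pathlib import Path

def rke2_cluster_healthy(record: Path) -> bool:
    """Check an rke2-cluster state record against the success indicators:
    top-level status == "ok" and outputs.cap.k8s.rke2 == "ready".
    The nested layout is inferred from the dotted path in this runbook.
    """
    state = json.loads(record.read_text())
    cap = state.get("outputs", {}).get("cap", {}).get("k8s", {}).get("rke2")
    return state.get("status") == "ok" and cap == "ready"

# record = Path.home() / ".hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
# print(rke2_cluster_healthy(record))
```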
## Troubleshooting

- If deploy stops before execution with `preflight_status=failed`, fix the required failures and rerun.
- Use `--skip-preflight` only for break-glass runs where a preflight failure is intentionally overridden.
- For SSH timeout/connectivity errors, ensure management network routing or configure a bastion/proxy jump.
- Recent driver failures include an `open: <run-record-path>/<driver>.log` hint in the error summary. Start with that file first.
- If `kubectl` hangs but the RKE2 module reports `status=ok`, check the kubeconfig `server:` endpoint and confirm your workstation can reach the management subnet (`vnetmgmt`, typically `10.10.0.0/24`).
- For NetBox API reachability errors in IPAM mode:
  - Ensure NetBox is `status=ok` in the NetBox authority env (default: `shared`).
  - If your workstation is not routed to the management subnet, HyOps may auto-tunnel the NetBox API via the Proxmox host (requires `hyops init proxmox --bootstrap` and working SSH access).
- For slow first-time convergence after destroy/rebuild:
  - Expect longer runtime while nodes pull RKE2 images and the CNI settles.
  - Check the step run record `ansible.log` for `Start rke2-server` / `Start rke2-agent` progression.
  - If this remains consistently slow, increase VM sizing (recommended baseline: control-plane `4 vCPU / 8 GiB`, worker `2 vCPU / 4 GiB`) and verify registry egress/DNS performance.
- For recovered Proxmox VMs after state drift:
  - Re-import the missing `platform/onprem/platform-vm#rke2_vms` resources into the HyOps slot with `hyops import`.
  - Confirm the slot is back to `status=ok`.
  - Rerun the blueprint so `rke2_vms` is skipped safely and `platform/onprem/rke2-cluster` reconciles the software layer on the recovered nodes.