Deploy RKE2 Cluster (HyOps Blueprint)
Purpose: Deploy a complete on-prem RKE2 cluster through a single governed blueprint run.
Owner: Platform engineering
Trigger: New environment bring-up, rebuild, or repeatable E2E validation
Impact: Consumes shared SDN/NetBox authority and converges image, VM infrastructure, and RKE2 cluster software
Severity: P2
Pre-reqs: Proxmox init complete for target env, vault decrypt works, NetBox authority is ready (authoritative IPAM), SSH key access exists, and RKE2_TOKEN is available in vault.
Rollback strategy: Destroy modules in reverse order, or run a controlled rebuild from the same blueprint inputs.
Context
Blueprint ref: onprem/rke2@v1
Location: hybridops-core/blueprints/onprem/rke2@v1/blueprint.yml
Default step flow:
- core/onprem/template-image (Rocky 9)
- platform/onprem/platform-vm (IPAM from NetBox + shared SDN state)
- platform/onprem/rke2-cluster
Preconditions and safety checks
Path behavior:
- Installed hyops (via install.sh) can be run from any working directory.
- Source-checkout usage should export HYOPS_CORE_ROOT=/path/to/hybridops-core.
- Ensure NetBox authority is ready (required for IPAM)
This blueprint allocates VM IPs from NetBox (no hardcoded per-VM IPs) and consumes the shared SDN authority state.
Network placement defaults:
- RKE2 control-plane VMs are dual-homed:
  - primary NIC on vnetmgmt (operator/API reachability)
  - secondary NIC on vnetenv (env workload network)
- RKE2 worker VMs use vnetenv
The vnetenv bridge alias is resolved from --env:
- dev -> vnetdev (VLAN 20)
- staging -> vnetstag (VLAN 30)
- prod -> vnetprod (VLAN 40)
This keeps the generated kubeconfig endpoint on the management network while keeping workload traffic on the environment VLAN.
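The alias mapping above can be sketched as a small shell function, useful when scripting checks around a deploy. This is only an illustration: resolve_vnetenv is a hypothetical helper, not part of the hyops CLI, and HyOps performs this resolution internally.

```shell
# Hypothetical helper (not a hyops command): map a --env value to the
# vnetenv bridge alias and VLAN ID listed in this runbook.
# Prints "<alias> <vlan>" or fails for an unknown environment.
resolve_vnetenv() {
  case "$1" in
    dev)     echo "vnetdev 20" ;;
    staging) echo "vnetstag 30" ;;
    prod)    echo "vnetprod 40" ;;
    *)       echo "unknown env: $1" >&2; return 1 ;;
  esac
}
```

For example, resolve_vnetenv staging prints the alias and VLAN for the staging environment, and any other value returns a non-zero status.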
By default, HyOps expects NetBox authority in --env shared. If it is not ready yet, run:
hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute
This also seeds the shared SDN foundation and NetBox IPAM/inventory datasets consumed by this blueprint.
- Ensure RKE2 token exists in runtime vault:
hyops secrets ensure --env dev RKE2_TOKEN
- Validate and plan blueprint:
hyops blueprint validate --ref onprem/rke2@v1
hyops blueprint plan --ref onprem/rke2@v1
- Run preflight (contracts + module resolvability):
hyops blueprint preflight --env dev --ref onprem/rke2@v1
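The token, validate, plan, and preflight checks above can be chained so that a failure stops the sequence early. The following is a minimal sketch that only reuses the commands shown in this section; the function name run_prechecks and the HYOPS_BIN override are assumptions introduced for illustration and are not HyOps settings.

```shell
# Sketch: run the pre-deploy checks in order, stopping at the first failure.
# HYOPS_BIN is a hypothetical override (not a hyops setting) that lets you
# substitute the binary, for example "echo" to preview the commands.
# The function body runs in a subshell so its variables stay local.
run_prechecks() (
  env_name="$1"
  ref="${2:-onprem/rke2@v1}"
  hyops_bin="${HYOPS_BIN:-hyops}"

  "$hyops_bin" secrets ensure --env "$env_name" RKE2_TOKEN &&
  "$hyops_bin" blueprint validate --ref "$ref" &&
  "$hyops_bin" blueprint plan --ref "$ref" &&
  "$hyops_bin" blueprint preflight --env "$env_name" --ref "$ref"
)
```

Calling run_prechecks dev runs the four checks against the dev environment. The NetBox bootstrap is left out on purpose, since it is only needed when the shared authority is not ready yet.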
Steps
- Execute full blueprint
hyops blueprint deploy --env dev \
--ref onprem/rke2@v1 \
--execute
If HyOps detects existing step state (rerun/replacement risk), it may prompt
for confirmation before executing the blueprint. Use --yes for
non-interactive runs.
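For unattended runs such as CI jobs, the deploy command can be wrapped with --yes so the confirmation prompt does not block. This is a sketch under assumptions: deploy_noninteractive and HYOPS_BIN are hypothetical names, and placing --yes after --execute is a guess at the accepted flag order rather than documented behavior.

```shell
# Sketch: non-interactive deploy of the blueprint for one environment.
# HYOPS_BIN is a hypothetical override (not a hyops setting); set it to
# "echo" to print the command instead of running it.
# Assumption: --yes is accepted after --execute.
deploy_noninteractive() (
  env_name="$1"
  hyops_bin="${HYOPS_BIN:-hyops}"

  "$hyops_bin" blueprint deploy --env "$env_name" \
    --ref onprem/rke2@v1 \
    --execute --yes
)
```

Because --yes skips the rerun/replacement confirmation, it is safest to use it only where the existing step state is already understood.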
- Observe progress during long phases
  - HyOps prints a preflight summary before step execution.
  - Each module step prints progress: logs=... with the active log file path.
  - Long-running streamed operations emit periodic heartbeat lines.
  - Long-running streamed operations also print a one-time log-watch hint (tail -f ...).
Optional tuning:
export HYOPS_PROGRESS_INTERVAL_S=30
Optional live terminal streaming:
hyops --verbose blueprint deploy --env dev \
--ref onprem/rke2@v1 \
--execute
- Verify cluster state
cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide
Verification
Success indicators:
- Blueprint summary ends with status=ok.
- platform/onprem/rke2-cluster state shows status: ok.
- outputs.cap.k8s.rke2 is ready.
- All expected nodes report Ready.
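The last indicator can be checked mechanically rather than by eye. The sketch below counts nodes whose STATUS column is not exactly Ready; count_not_ready is a hypothetical helper, and it assumes the default column layout of kubectl get nodes with --no-headers (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Sketch: count nodes that are not plainly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin, where the second
# column is STATUS. A cordoned node ("Ready,SchedulingDisabled") is counted
# as not ready by this strict comparison.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}
```

One way to use it is to pipe in the node list with the generated kubeconfig, for example KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes --no-headers | count_not_ready, and treat any result other than 0 as a reason to investigate.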
Evidence paths:
- Blueprint step evidence and module logs under: $HOME/.hybridops/envs/<env>/logs/module/...
- Cluster kubeconfig: $HOME/.hybridops/envs/<env>/state/kubeconfigs/rke2.yaml
  - By default, the kubeconfig server endpoint points to the first control-plane node management IP (reachable from the operator management path).
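To see which endpoint the generated kubeconfig actually targets, the server field can be read directly from the file. This is a minimal sketch: kubeconfig_server is a hypothetical helper, and it assumes an unquoted server: value and a single cluster entry, which may not hold for every kubeconfig.

```shell
# Sketch: print the first `server:` URL found in a kubeconfig file.
# Assumes the value is unquoted; with multiple clusters only the first
# entry is printed.
kubeconfig_server() {
  sed -n 's/^[[:space:]]*server:[[:space:]]*//p' "$1" | head -n 1
}
```

Running kubeconfig_server "$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" prints the API URL, whose host should be an address on the management network.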
Troubleshooting
- If deploy stops before execution with preflight_status=failed, fix required failures and rerun.
- Use --skip-preflight only for break-glass runs where preflight failure is intentionally overridden.
- For SSH timeout/connectivity errors, verify management network routing or configure a bastion/proxy jump.
- Recent driver failures include an open: <evidence>/<driver>.log hint in the error summary. Start with that file.
- If kubectl hangs but the RKE2 module reports status=ok, check the kubeconfig server: endpoint and confirm your workstation can reach the management subnet (vnetmgmt, typically 10.10.0.0/24).
- For NetBox API reachability errors in IPAM mode:
  - Ensure NetBox is status=ok in the NetBox authority env (default: shared).
  - If your workstation is not routed to the management subnet, HyOps may auto-tunnel NetBox API via the Proxmox host (requires hyops init proxmox --bootstrap and working SSH access).
- For slow first-time convergence after destroy/rebuild:
  - Expect longer runtime while nodes pull RKE2 images and CNI settles.
  - Check the step evidence ansible.log for Start rke2-server / Start rke2-agent progression.
  - If this remains consistently slow, increase VM sizing (recommended baseline: control-plane 4 vCPU / 8 GiB, worker 2 vCPU / 4 GiB) and verify registry egress/DNS performance.
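When checking convergence progress, the relevant lines can be pulled out of the step's ansible.log instead of scrolling through it. The sketch below assumes the task names contain the literal strings Start rke2-server and Start rke2-agent as described above; rke2_start_progress is a hypothetical helper and the log path is passed in by the caller.

```shell
# Sketch: list RKE2 service start lines (with line numbers) from an
# ansible.log file given as the first argument. The path is whatever the
# step evidence directory contains; it is not derived here.
rke2_start_progress() {
  grep -n -E 'Start rke2-(server|agent)' "$1"
}
```

If the server line appears but no agent line follows after a long wait, that points at the workers rather than the control plane. grep exits non-zero when nothing matches, which can be used to detect that the run has not reached the RKE2 start phase yet.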