Deploy RKE2 Cluster (HyOps Blueprint)

Purpose: Deploy a complete on-prem RKE2 cluster through a single governed blueprint run.
Owner: Platform engineering
Trigger: New environment bring-up, rebuild, or repeatable E2E validation
Impact: Consumes shared SDN/NetBox authority and converges image, VM infrastructure, and RKE2 cluster software
Severity: P2
Pre-reqs: Proxmox init is complete for the target env, vault decryption works, NetBox authority is ready (authoritative IPAM), SSH key access exists, and RKE2_TOKEN is available in the vault.
Rollback strategy: Destroy modules in reverse order or run controlled rebuild from the same blueprint inputs.

Context

Blueprint ref: onprem/rke2@v1
Location: hybridops-core/blueprints/onprem/rke2@v1/blueprint.yml

Default step flow:

core/onprem/template-image (Rocky 9)
platform/onprem/platform-vm (IPAM from NetBox + shared SDN state)
platform/onprem/rke2-cluster

Preconditions and safety checks

Path behavior:

Installed hyops (via install.sh) can be run from any working directory.
Source-checkout usage should export HYOPS_CORE_ROOT=/path/to/hybridops-core.

Ensure NetBox authority is ready (required for IPAM):

This blueprint allocates VM IPs from NetBox (no hardcoded per-VM IPs) and consumes the shared SDN authority state.

Network placement defaults:

- RKE2 control-plane VMs are dual-homed:
  - primary NIC on vnetmgmt (operator/API reachability)
  - secondary NIC on vnetenv (env workload network)
- RKE2 worker VMs use vnetenv

The vnetenv bridge alias is resolved from --env:

- dev -> vnetdev (VLAN 20)
- staging -> vnetstag (VLAN 30)
- prod -> vnetprod (VLAN 40)

Dual-homing the control plane keeps the generated kubeconfig endpoint on the management network while keeping workload traffic on the environment VLAN.
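The --env to bridge mapping above can be sketched as a small lookup, which may be handy when checking which VLAN a given environment's VMs should land on. This is only an illustration: `resolve_vnetenv` is a hypothetical helper name and its output format is an assumption, not something HyOps provides.

```shell
# Hypothetical helper (not part of HyOps): mirrors the documented mapping
# from --env to the vnetenv bridge alias and VLAN ID.
resolve_vnetenv() {
  case "$1" in
    dev)     echo "vnetdev 20" ;;
    staging) echo "vnetstag 30" ;;
    prod)    echo "vnetprod 40" ;;
    *)       echo "unknown env: $1" >&2; return 1 ;;
  esac
}
```

For example, `resolve_vnetenv staging` prints `vnetstag 30`, and an unrecognized environment returns a non-zero status.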

By default, HyOps expects NetBox authority in --env shared. If it is not ready yet, run:

hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute

This also seeds the shared SDN foundation and NetBox IPAM/inventory datasets consumed by this blueprint.

Ensure RKE2 token exists in runtime vault:

hyops secrets ensure --env dev RKE2_TOKEN

Validate and plan blueprint:

hyops blueprint validate --ref onprem/rke2@v1
hyops blueprint plan --ref onprem/rke2@v1

Run preflight (contracts + module resolvability):

hyops blueprint preflight --env dev --ref onprem/rke2@v1
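The pre-checks above could be chained so that a failure stops the sequence before any deploy is attempted. The sketch below uses only the commands shown in this section; `rke2_prechecks` is a hypothetical wrapper name, and it assumes that hyops exits non-zero when a check fails, which this runbook does not confirm.

```shell
# Hypothetical wrapper (not part of HyOps): run the documented pre-checks
# in order and stop at the first one that fails.
# Assumption: hyops returns a non-zero exit status on failure.
rke2_prechecks() {
  env_name="$1"
  ref="${2:-onprem/rke2@v1}"   # blueprint ref, defaulting to the one in this runbook
  [ -n "$env_name" ] || { echo "usage: rke2_prechecks <env> [ref]" >&2; return 2; }

  hyops secrets ensure --env "$env_name" RKE2_TOKEN &&
    hyops blueprint validate --ref "$ref" &&
    hyops blueprint plan --ref "$ref" &&
    hyops blueprint preflight --env "$env_name" --ref "$ref"
}
```

Calling `rke2_prechecks dev` would then run the four commands for the dev environment and return the status of the first failing step, if any.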

Steps

Execute full blueprint

hyops blueprint deploy --env dev \
  --ref onprem/rke2@v1 \
  --execute

If HyOps detects existing step state (rerun/replacement risk), it may prompt for confirmation before executing the blueprint. Use --yes for non-interactive runs.

Observe progress during long phases
HyOps prints a preflight summary before step execution.
Each module step prints progress: logs=... with the active log file path.
Long-running streamed operations emit periodic heartbeat lines.
Long-running streamed operations also print a one-time log-watch hint (tail -f ...).

Optional tuning:

export HYOPS_PROGRESS_INTERVAL_S=30

Optional live terminal streaming:

hyops --verbose blueprint deploy --env dev \
  --ref onprem/rke2@v1 \
  --execute

Verify cluster state

cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide

Verification

Success indicators:

Blueprint summary ends with status=ok.
platform/onprem/rke2-cluster state shows status: ok.
outputs.cap.k8s.rke2 is ready.
All expected nodes report Ready.
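The last indicator can be checked mechanically rather than by eye. The following is a minimal sketch: `not_ready_nodes` is a hypothetical helper that reads `kubectl get nodes --no-headers` output on stdin and prints any node whose STATUS column is not exactly `Ready`. Note that a cordoned node (`Ready,SchedulingDisabled`) would also be listed, which may or may not be what you want.

```shell
# Hypothetical helper: print the names of nodes whose STATUS is not exactly "Ready".
# Input: output of `kubectl get nodes --no-headers` (NAME is column 1, STATUS column 2).
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage (path taken from this runbook, dev environment):
# KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" \
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every listed node reports Ready. It does not confirm that the expected number of nodes joined, so compare the node count against your blueprint inputs separately.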

Evidence paths:

Blueprint step evidence and module logs under:
$HOME/.hybridops/envs/<env>/logs/module/...
Cluster kubeconfig:
$HOME/.hybridops/envs/<env>/state/kubeconfigs/rke2.yaml
By default, the kubeconfig server endpoint points to the first control-plane node management IP (reachable from the operator management path).
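To see which endpoint the kubeconfig actually points at, and whether your workstation can open a TCP connection to it, something like the sketch below may help. Both function names are hypothetical. The parser assumes an unquoted `server: https://host:port` line as commonly found in kubeconfig files, and the reachability check relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command being available.

```shell
# Hypothetical helper: print the first "server:" value found in a kubeconfig file.
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "$1"
}

# Hypothetical helper: succeed if a TCP connection to host:port opens within 5 seconds.
# Runs the connection attempt in bash explicitly because /dev/tcp is a bash feature.
tcp_reachable() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2"
}

# Usage (assumes an IPv4 or hostname endpoint with an explicit port):
# server="$(kubeconfig_server "$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml")"
# hostport="${server#https://}"
# tcp_reachable "${hostport%%:*}" "${hostport##*:}" && echo "API endpoint reachable"
```

A successful TCP connection only shows that the management path is routed; it says nothing about API server health or credentials.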

Troubleshooting

If deploy stops before execution with preflight_status=failed, fix required failures and rerun.
Use --skip-preflight only for break-glass runs where preflight failure is intentionally overridden.
For SSH timeout/connectivity errors, verify routing to the management network or configure a bastion/proxy jump host.
Recent driver failures include an open: <evidence>/<driver>.log hint in the error summary. Start with that file.
If kubectl hangs but the RKE2 module reports status=ok, check the kubeconfig server: endpoint and confirm your workstation can reach the management subnet (vnetmgmt, typically 10.10.0.0/24).
For NetBox API reachability errors in IPAM mode:
Ensure NetBox is status=ok in the NetBox authority env (default: shared).
If your workstation is not routed to the management subnet, HyOps may auto-tunnel NetBox API via the Proxmox host (requires hyops init proxmox --bootstrap and working SSH access).
For slow first-time convergence after destroy/rebuild:
Expect longer runtime while nodes pull RKE2 images and CNI settles.
Check the step evidence ansible.log for Start rke2-server / Start rke2-agent progression.
If convergence remains consistently slow, increase VM sizing (recommended baseline: control-plane 4 vCPU / 8 GiB, worker 2 vCPU / 4 GiB) and verify registry egress/DNS performance.
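When checking ansible.log for the Start rke2-server / Start rke2-agent progression, a filtered view is usually easier to read than the full log. A minimal sketch follows: `rke2_start_events` is a hypothetical helper, and the log path is passed in as an argument because the exact evidence path depends on the run.

```shell
# Hypothetical helper: show lines mentioning the rke2 service start steps,
# prefixed with their line numbers, from the given ansible.log.
rke2_start_events() {
  grep -nE 'Start rke2-(server|agent)' "$1"
}

# Usage: pass the ansible.log from the step evidence directory of the run.
# rke2_start_events /path/to/step/evidence/ansible.log
```

If server entries appear but agent entries never do, the run most likely stalled before workers began joining. grep returns a non-zero status when no matching line exists at all.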