# Deploy RKE2 Cluster (HyOps Blueprint)

- Purpose: Deploy a complete on-prem RKE2 cluster through a single governed blueprint run.
- Owner: Platform engineering
- Trigger: New environment bring-up, rebuild, or repeatable E2E validation
- Impact: Consumes shared SDN/NetBox authority and converges image, VM infrastructure, and RKE2 cluster software
- Severity: P2
- Pre-reqs: Proxmox init complete for the target env, vault decrypt works, NetBox authority is ready (authoritative IPAM), SSH key access exists, and `RKE2_TOKEN` is available in the vault.
- Rollback strategy: Destroy modules in reverse order, or run a controlled rebuild from the same blueprint inputs.
## Context

- Blueprint ref: `onprem/rke2@v1`
- Location: `hybridops-core/blueprints/onprem/rke2@v1/blueprint.yml`
- Default step flow:
  1. `core/onprem/template-image` (Rocky 9)
  2. `platform/onprem/platform-vm` (IPAM from NetBox + shared SDN state)
  3. `platform/onprem/rke2-cluster`
## Preconditions and safety checks

Path behavior:

- Installed `hyops` (via `install.sh`) can be run from any working directory.
- Source-checkout usage should export `HYOPS_CORE_ROOT=/path/to/hybridops-core`.

Ensure NetBox authority is ready (required for IPAM). This blueprint allocates VM IPs from NetBox (no hardcoded per-VM IPs) and consumes the shared SDN authority state.
Network placement defaults:

- RKE2 control-plane VMs are dual-homed:
  - primary NIC on `vnetmgmt` (operator/API reachability)
  - secondary NIC on `vnetenv` (env workload network)
- RKE2 worker VMs use `vnetenv`

The `vnetenv` bridge alias is resolved from `--env`:

- dev -> `vnetdev` (VLAN 20)
- staging -> `vnetstag` (VLAN 30)
- prod -> `vnetprod` (VLAN 40)

This keeps the generated kubeconfig endpoint on the management network while keeping workload traffic on the environment VLAN.
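The env-to-bridge mapping above can be sketched as a small helper. This is illustrative only: `hyops` performs this resolution internally, and the `env_bridge` function is a hypothetical name, not part of the tooling.

```shell
#!/bin/sh
# Sketch of the documented --env -> vnet bridge-alias table.
# hyops resolves this itself; this helper only mirrors the mapping.
env_bridge() {
  case "$1" in
    dev)     echo "vnetdev"  ;;  # VLAN 20
    staging) echo "vnetstag" ;;  # VLAN 30
    prod)    echo "vnetprod" ;;  # VLAN 40
    *)       echo "unknown env: $1" >&2; return 1 ;;
  esac
}

env_bridge dev   # prints: vnetdev
```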
By default, HyOps expects NetBox authority in `--env shared`. If it is not ready yet, run:

```shell
hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute
```

This also seeds the shared SDN foundation and the NetBox IPAM/inventory datasets consumed by this blueprint.
- Ensure the RKE2 token exists in the runtime vault:

  ```shell
  hyops secrets ensure --env dev RKE2_TOKEN
  ```

- Validate and plan the blueprint:

  ```shell
  hyops blueprint validate --ref onprem/rke2@v1
  hyops blueprint plan --ref onprem/rke2@v1
  ```

- Run preflight (contracts + module resolvability):

  ```shell
  hyops blueprint preflight --env dev --ref onprem/rke2@v1
  ```
Recovery note:

- `template_image_rocky9` and `rke2_vms` are skip-safe infrastructure steps.
- If Proxmox guests still exist but the `platform/onprem/platform-vm#rke2_vms` state slot was lost or marked `destroyed`, repair the slot with `hyops import` first, then rerun the blueprint.
- Do not force a raw `platform-vm` reconcile immediately after importing clone-backed Proxmox VMs; imported provider state does not preserve all create-time clone metadata and may plan a replacement. The blueprint is designed to skip the recovered VM step once the slot is healthy again.
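Before rerunning, it can help to confirm what status the repaired slot actually records. A minimal sketch, assuming only that the state record is JSON with a top-level `status` field (as the verification section shows for module state); the `slot_status` helper and the exact `platform-vm` record path are illustrative assumptions, not hyops APIs.

```python
import json
from pathlib import Path

def slot_status(record: Path) -> str:
    """Return the top-level status recorded in a module state file.

    Assumes the record is JSON with a top-level "status" key; the
    platform-vm record path below mirrors the rke2-cluster layout and
    is an assumption.
    """
    return json.loads(record.read_text()).get("status", "unknown")

# Example (path assumed, not confirmed by the runbook):
# record = Path.home() / ".hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"
# print(slot_status(record))  # rerun the blueprint only once this is "ok"
```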
## Steps

- Execute the full blueprint:

  ```shell
  hyops blueprint deploy --env dev \
    --ref onprem/rke2@v1 \
    --execute
  ```

  If HyOps detects existing step state (rerun/replacement risk), it may prompt for confirmation before executing the blueprint. Use `--yes` for non-interactive runs.
- Observe progress during long phases:

  - HyOps prints a preflight summary before step execution.
  - Each module step prints `progress: logs=...` with the active log file path.
  - Long-running streamed operations emit periodic heartbeat lines.
  - Long-running streamed operations also print a one-time log-watch hint (`tail -f ...`).

  Optional tuning:

  ```shell
  export HYOPS_PROGRESS_INTERVAL_S=30
  ```

  Optional live terminal streaming:

  ```shell
  hyops --verbose blueprint deploy --env dev \
    --ref onprem/rke2@v1 \
    --execute
  ```
- Verify cluster state:

  ```shell
  cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
  KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide
  ```
If your workstation is not routed to the cluster management subnet, tunnel the API through the Proxmox jump host first:

```shell
ssh -f -N -L 16443:10.50.0.61:6443 \
  -J root@<proxmox-host> \
  opsadmin@<control-plane-ip>
```

Then point a copy of the kubeconfig at the local tunnel endpoint:

```shell
python3 - <<'PY'
from pathlib import Path

# Rewrite the kubeconfig server endpoint to the local SSH tunnel port.
src = Path.home() / ".hybridops/envs/dev/state/kubeconfigs/rke2.yaml"
dst = Path("/tmp/rke2-via-bastion.yaml")
text = src.read_text()
lines = []
for line in text.splitlines():
    if line.lstrip().startswith("server: https://"):
        indent = line[: len(line) - len(line.lstrip())]
        lines.append(f"{indent}server: https://127.0.0.1:16443")
    else:
        lines.append(line)
dst.write_text("\n".join(lines) + "\n")
print(dst)
PY
```

```shell
KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes -o wide
```
If you need to verify directly on the control-plane host instead of tunneling the local kubeconfig, use the bundled RKE2 client path:

```shell
ssh -J root@<proxmox-host> opsadmin@<control-plane-ip> \
  'sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide'
```
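For unattended validation runs, the node check can be gated with a small parser over `kubectl get nodes --no-headers` output. This helper is a sketch, not part of hyops or kubectl:

```shell
#!/bin/sh
# Sketch: succeed only when every node's STATUS column reads exactly "Ready".
# Feed it the output of `kubectl get nodes --no-headers`.
all_nodes_ready() {
  awk 'NF == 0 { next } $2 != "Ready" { bad = 1 } END { exit bad }'
}

# Usage (e.g. with the tunneled kubeconfig from above):
# KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes --no-headers \
#   | all_nodes_ready && echo "all nodes Ready"
```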
## Verification

Success indicators:

- Blueprint summary ends with `status=ok`.
- `platform/onprem/rke2-cluster` state shows `status: ok`.
- `outputs.cap.k8s.rke2` is `ready`.
- All expected nodes report `Ready`.

Run record paths:

- Blueprint step records and module logs under: `$HOME/.hybridops/envs/<env>/logs/module/...`
- Cluster kubeconfig: `$HOME/.hybridops/envs/<env>/state/kubeconfigs/rke2.yaml`
- By default, the kubeconfig server endpoint points to the first control-plane node's management IP (reachable from the operator management path).
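The first three success indicators can be checked mechanically from the state record. A sketch, assuming the record's JSON nesting follows the dotted path `outputs.cap.k8s.rke2` listed above; the `rke2_cluster_healthy` helper is illustrative, not shipped with hyops.

```python
import json
from pathlib import Path

def rke2_cluster_healthy(record: Path) -> bool:
    """Check an rke2-cluster state record against the success indicators:
    top-level status == "ok" and outputs.cap.k8s.rke2 == "ready".
    The nested layout is inferred from the dotted path in this runbook.
    """
    state = json.loads(record.read_text())
    cap = state.get("outputs", {}).get("cap", {}).get("k8s", {}).get("rke2")
    return state.get("status") == "ok" and cap == "ready"

# record = Path.home() / ".hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
# print(rke2_cluster_healthy(record))
```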
## Troubleshooting

- If deploy stops before execution with `preflight_status=failed`, fix the required failures and rerun.
- Use `--skip-preflight` only for break-glass runs where a preflight failure is intentionally overridden.
- For SSH timeout/connectivity errors, ensure management network routing or configure a bastion/proxy jump.
- Recent driver failures include an `open: <run-record-path>/<driver>.log` hint in the error summary. Start with that file first.
- If `kubectl` hangs but the RKE2 module reports `status=ok`, check the kubeconfig `server:` endpoint and confirm your workstation can reach the management subnet (`vnetmgmt`, typically `10.10.0.0/24`).
- For NetBox API reachability errors in IPAM mode:
  - Ensure NetBox is `status=ok` in the NetBox authority env (default: `shared`).
  - If your workstation is not routed to the management subnet, HyOps may auto-tunnel the NetBox API via the Proxmox host (requires `hyops init proxmox --bootstrap` and working SSH access).
- For slow first-time convergence after destroy/rebuild:
  - Expect longer runtime while nodes pull RKE2 images and the CNI settles.
  - Check the step run record `ansible.log` for `Start rke2-server` / `Start rke2-agent` progression.
  - If this remains consistently slow, increase VM sizing (recommended baseline: control-plane `4 vCPU / 8 GiB`, worker `2 vCPU / 4 GiB`) and verify registry egress/DNS performance.
- For recovered Proxmox VMs after state drift:
  - Re-import the missing `platform/onprem/platform-vm#rke2_vms` resources into the HyOps slot with `hyops import`.
  - Confirm the slot is back to `status=ok`.
  - Rerun the blueprint so `rke2_vms` is skipped safely and `platform/onprem/rke2-cluster` reconciles the software layer on the recovered nodes.