
Deploy RKE2 Cluster (HyOps Blueprint)

  • Purpose: Deploy a complete on-prem RKE2 cluster through a single governed blueprint run.

  • Owner: Platform engineering

  • Trigger: New environment bring-up, rebuild, or repeatable E2E validation

  • Impact: Consumes shared SDN/NetBox authority and converges the image, VM infrastructure, and RKE2 cluster software

  • Severity: P2

  • Pre-reqs: Proxmox init complete for the target env, vault decrypt works, NetBox authority is ready (authoritative IPAM), SSH key access exists, and RKE2_TOKEN is available in vault.

  • Rollback strategy: Destroy modules in reverse order, or run a controlled rebuild from the same blueprint inputs.

Context

Blueprint ref: onprem/rke2@v1
Location: hybridops-core/blueprints/onprem/rke2@v1/blueprint.yml

Default step flow:

  1. core/onprem/template-image (Rocky 9)
  2. platform/onprem/platform-vm (IPAM from NetBox + shared SDN state)
  3. platform/onprem/rke2-cluster

Preconditions and safety checks

Path behavior:

  • An installed hyops (via install.sh) can be run from any working directory.
  • When running from a source checkout, export HYOPS_CORE_ROOT=/path/to/hybridops-core (example below).
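
For example, from a checkout of hybridops-core:

export HYOPS_CORE_ROOT=/path/to/hybridops-core
hyops blueprint validate --ref onprem/rke2@v1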

  • Ensure NetBox authority is ready (required for IPAM)

This blueprint allocates VM IPs from NetBox (no hardcoded per-VM IPs) and consumes the shared SDN authority state.

Network placement defaults:

  • RKE2 control-plane VMs are dual-homed:
      • primary NIC on vnetmgmt (operator/API reachability)
      • secondary NIC on vnetenv (env workload network)
  • RKE2 worker VMs use vnetenv

The vnetenv bridge alias is resolved from --env:

  • dev -> vnetdev (VLAN 20)
  • staging -> vnetstag (VLAN 30)
  • prod -> vnetprod (VLAN 40)

This keeps the generated kubeconfig endpoint on the management network while keeping workload traffic on the environment VLAN.
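
As an illustrative sketch of that resolution rule (plain shell, not HyOps internals):

# Illustrative only: the bridge/VLAN mapping applied for --env.
env_name=dev   # or staging / prod
case "$env_name" in
  dev)     vnet_env=vnetdev;  vlan=20 ;;
  staging) vnet_env=vnetstag; vlan=30 ;;
  prod)    vnet_env=vnetprod; vlan=40 ;;
esac
echo "workload bridge: $vnet_env (VLAN $vlan)"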

By default, HyOps expects NetBox authority in --env shared. If it is not ready yet, run:

hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute

This also seeds the shared SDN foundation and NetBox IPAM/inventory datasets consumed by this blueprint.

  1. Ensure RKE2 token exists in runtime vault:

    hyops secrets ensure --env dev RKE2_TOKEN
    
  2. Validate and plan blueprint:

    hyops blueprint validate --ref onprem/rke2@v1
    hyops blueprint plan --ref onprem/rke2@v1
    
  3. Run preflight (contracts + module resolvability):

    hyops blueprint preflight --env dev --ref onprem/rke2@v1
    

Recovery note:

  • template_image_rocky9 and rke2_vms are skip-safe infrastructure steps.
  • If Proxmox guests still exist but the platform/onprem/platform-vm#rke2_vms state slot was lost or marked destroyed, repair the slot with hyops import first, then rerun the blueprint.
  • Do not force a raw platform-vm reconcile immediately after importing clone-backed Proxmox VMs: imported provider state does not preserve all create-time clone metadata, so a reconcile may plan a full VM replacement. The blueprint is designed to skip the recovered VM step once the slot is healthy again; an illustrative import shape follows below.
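
The exact hyops import syntax is version-dependent; the shape below is an assumption meant to illustrate the repair flow, not confirmed syntax:

# Assumed invocation shape — verify against your hyops version first.
hyops import --env dev \
  --slot platform/onprem/platform-vm#rke2_vms \
  <existing-proxmox-vmids>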

Steps

  1. Execute full blueprint
    hyops blueprint deploy --env dev \
      --ref onprem/rke2@v1 \
      --execute
    

If HyOps detects existing step state (rerun/replacement risk), it may prompt for confirmation before executing the blueprint. Use --yes for non-interactive runs.
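
For unattended or CI-driven runs, the same deploy with confirmation pre-acknowledged:

hyops blueprint deploy --env dev \
  --ref onprem/rke2@v1 \
  --execute --yes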

  2. Observe progress during long phases

  • HyOps prints a preflight summary before step execution.
  • Each module step prints progress: logs=... with the active log file path.
  • Long-running streamed operations emit periodic heartbeat lines.
  • Long-running streamed operations also print a one-time log-watch hint (tail -f ...); see the example below.
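
The printed hint can be pasted directly; its shape resembles the following (the real path comes from the logs=... line, and <step-run> is an illustrative placeholder):

tail -f "$HOME/.hybridops/envs/dev/logs/module/<step-run>/ansible.log"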

Optional tuning of the progress heartbeat interval (seconds):

export HYOPS_PROGRESS_INTERVAL_S=30

Optional live terminal streaming:

hyops --verbose blueprint deploy --env dev \
  --ref onprem/rke2@v1 \
  --execute

  3. Verify cluster state
    cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
    KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide
    

If your workstation is not routed to the cluster management subnet, tunnel the API through the Proxmox jump host first:

ssh -f -N -L 16443:10.50.0.61:6443 \
  -J root@<proxmox-host> \
  opsadmin@<control-plane-ip>

Then point a local copy of the kubeconfig at the tunnel:

python3 - <<'PY'
# Copy the generated kubeconfig and point its server endpoint at the local
# SSH tunnel (127.0.0.1:16443) instead of the control-plane management IP.
from pathlib import Path

src = Path.home() / ".hybridops/envs/dev/state/kubeconfigs/rke2.yaml"
dst = Path("/tmp/rke2-via-bastion.yaml")
lines = []
for line in src.read_text().splitlines():
    if line.lstrip().startswith("server: https://"):
        # Preserve YAML indentation while swapping the endpoint.
        indent = line[: len(line) - len(line.lstrip())]
        lines.append(f"{indent}server: https://127.0.0.1:16443")
    else:
        lines.append(line)
dst.write_text("\n".join(lines) + "\n")
print(dst)
PY

KUBECONFIG=/tmp/rke2-via-bastion.yaml kubectl get nodes -o wide

If you need to verify directly on the control-plane host instead of tunneling the local kubeconfig, use the bundled RKE2 client path:

ssh -J root@<proxmox-host> opsadmin@<control-plane-ip> \
  'sudo /var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes -o wide'

Verification

Success indicators:

  • Blueprint summary ends with status=ok.
  • platform/onprem/rke2-cluster state shows status: ok.
  • outputs.cap.k8s.rke2 is ready.
  • All expected nodes report Ready.
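
A quick scripted check of the module-state indicators above, assuming latest.json mirrors those key names (the jq paths are assumptions):

STATE="$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
jq -r '.status' "$STATE"                 # expect: ok
jq -r '.outputs.cap.k8s.rke2' "$STATE"   # expect a ready capability entry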

Run record paths:

  • Blueprint step records and module logs: $HOME/.hybridops/envs/<env>/logs/module/...
  • Cluster kubeconfig: $HOME/.hybridops/envs/<env>/state/kubeconfigs/rke2.yaml
  • By default, the kubeconfig server endpoint points to the management IP of the first control-plane node (reachable from the operator management path).

Troubleshooting

  • If deploy stops before execution with preflight_status=failed, fix required failures and rerun.
  • Use --skip-preflight only for break-glass runs where preflight failure is intentionally overridden.
  • For SSH timeout/connectivity errors, ensure management network routing or configure a bastion/ProxyJump (see the ~/.ssh/config sketch after this list).
  • Recent driver failures include an open: <run-record-path>/<driver>.log hint in the error summary; start with that file.
  • If kubectl hangs but the RKE2 module reports status=ok, check the kubeconfig server: endpoint and confirm your workstation can reach the management subnet (vnetmgmt, typically 10.10.0.0/24).
  • For NetBox API reachability errors in IPAM mode:
      • Ensure NetBox is status=ok in the NetBox authority env (default: shared).
      • If your workstation is not routed to the management subnet, HyOps may auto-tunnel the NetBox API via the Proxmox host (requires hyops init proxmox --bootstrap and working SSH access).
  • For slow first-time convergence after destroy/rebuild:
      • Expect longer runtime while nodes pull RKE2 images and the CNI settles.
      • Check the step run record's ansible.log for Start rke2-server / Start rke2-agent progression.
      • If runs remain consistently slow, increase VM sizing (recommended baseline: control-plane 4 vCPU / 8 GiB, worker 2 vCPU / 4 GiB) and verify registry egress/DNS performance.
  • For recovered Proxmox VMs after state drift:
      • Re-import the missing platform/onprem/platform-vm#rke2_vms resources into the HyOps slot with hyops import.
      • Confirm the slot is back to status=ok.
      • Rerun the blueprint so rke2_vms is skipped safely and platform/onprem/rke2-cluster reconciles the software layer on the recovered nodes.
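
For the bastion/ProxyJump item above, a minimal ~/.ssh/config sketch (host pattern and addresses are illustrative; adjust to your management subnet):

# Route management-subnet SSH through the Proxmox host.
Host 10.10.0.*
    User opsadmin
    ProxyJump root@<proxmox-host>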

References