Skip to content

Deploy RKE2 Cluster (HyOps Blueprint)

Purpose: Deploy a complete on-prem RKE2 cluster through a single governed blueprint run.
Owner: Platform engineering
Trigger: New environment bring-up, rebuild, or repeatable E2E validation
Impact: Consumes shared SDN/NetBox authority and converges image, VM infrastructure, and RKE2 cluster software
Severity: P2
Pre-reqs: Proxmox init complete for target env, vault decrypt works, NetBox authority is ready (authoritative IPAM), SSH key access exists, and RKE2_TOKEN is available in vault.
Rollback strategy: Destroy modules in reverse order or run controlled rebuild from the same blueprint inputs.

Context

Blueprint ref: onprem/rke2@v1
Location: hybridops-core/blueprints/onprem/rke2@v1/blueprint.yml

Default step flow:

  1. core/onprem/template-image (Rocky 9)
  2. platform/onprem/platform-vm (IPAM from NetBox + shared SDN state)
  3. platform/onprem/rke2-cluster

Preconditions and safety checks

Path behavior:

  • Installed hyops (via install.sh) can be run from any working directory.
  • Source-checkout usage should export HYOPS_CORE_ROOT=/path/to/hybridops-core.

  • Ensure NetBox authority is ready (required for IPAM)

This blueprint allocates VM IPs from NetBox (no hardcoded per-VM IPs) and consumes the shared SDN authority state.

Network placement defaults: - RKE2 control-plane VMs are dual-homed: - primary NIC on vnetmgmt (operator/API reachability) - secondary NIC on vnetenv (env workload network) - RKE2 worker VMs use vnetenv

The vnetenv bridge alias is resolved from --env: - dev -> vnetdev (VLAN 20) - staging -> vnetstag (VLAN 30) - prod -> vnetprod (VLAN 40)

This keeps the generated kubeconfig endpoint on the management network while keeping workload traffic on the environment VLAN.

By default, HyOps expects NetBox authority in --env shared. If it is not ready yet, run:

hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute

This also seeds the shared SDN foundation and NetBox IPAM/inventory datasets consumed by this blueprint.

  1. Ensure RKE2 token exists in runtime vault:
hyops secrets ensure --env dev RKE2_TOKEN
  1. Validate and plan blueprint:
hyops blueprint validate --ref onprem/rke2@v1
hyops blueprint plan --ref onprem/rke2@v1
  1. Run preflight (contracts + module resolvability):
hyops blueprint preflight --env dev --ref onprem/rke2@v1

Steps

  1. Execute full blueprint
hyops blueprint deploy --env dev \
  --ref onprem/rke2@v1 \
  --execute

If HyOps detects existing step state (rerun/replacement risk), it may prompt for confirmation before executing the blueprint. Use --yes for non-interactive runs.

  1. Observe progress during long phases

  2. HyOps prints preflight summary before step execution.

  3. Each module step prints progress: logs=... with the active log file path.
  4. Long-running streamed operations emit periodic heartbeat lines.
  5. Long-running streamed operations also print a one-time log-watch hint (tail -f ...).

Optional tuning:

export HYOPS_PROGRESS_INTERVAL_S=30

Optional live terminal streaming:

hyops --verbose blueprint deploy --env dev \
  --ref onprem/rke2@v1 \
  --execute
  1. Verify cluster state
cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__rke2-cluster/latest.json"
KUBECONFIG="$HOME/.hybridops/envs/dev/state/kubeconfigs/rke2.yaml" kubectl get nodes -o wide

Verification

Success indicators:

  • Blueprint summary ends with status=ok.
  • platform/onprem/rke2-cluster state shows status: ok.
  • outputs.cap.k8s.rke2 is ready.
  • All expected nodes report Ready.

Evidence paths:

  • Blueprint step evidence and module logs under:
  • $HOME/.hybridops/envs/<env>/logs/module/...
  • Cluster kubeconfig:
  • $HOME/.hybridops/envs/<env>/state/kubeconfigs/rke2.yaml
  • By default, the kubeconfig server endpoint points to the first control-plane node management IP (reachable from the operator management path).

Troubleshooting

  • If deploy stops before execution with preflight_status=failed, fix required failures and rerun.
  • Use --skip-preflight only for break-glass runs where preflight failure is intentionally overridden.
  • For SSH timeout/connectivity errors, ensure management network routing or configure bastion/proxy jump.
  • Recent driver failures include an open: <evidence>/<driver>.log hint in the error summary. Start with that file first.
  • If kubectl hangs but the RKE2 module reports status=ok, check the kubeconfig server: endpoint and confirm your workstation can reach the management subnet (vnetmgmt, typically 10.10.0.0/24).
  • For NetBox API reachability errors in IPAM mode:
  • Ensure NetBox is status=ok in the NetBox authority env (default: shared).
  • If your workstation is not routed to the management subnet, HyOps may auto-tunnel NetBox API via the Proxmox host (requires hyops init proxmox --bootstrap and working SSH access).
  • For slow first-time convergence after destroy/rebuild:
  • Expect longer runtime while nodes pull RKE2 images and CNI settles.
  • Check the step evidence ansible.log for Start rke2-server / Start rke2-agent progression.
  • If this remains consistently slow, increase VM sizing (recommended baseline: control-plane 4 vCPU / 8 GiB, worker 2 vCPU / 4 GiB) and verify registry egress/DNS performance.

References