Operate Generic Platform VMs (HyOps)¶
Purpose: Run the generic VM lifecycle for platform/onprem/platform-vm in a controlled environment.
Owner: Platform engineering
Trigger: New VM pool request, baseline refresh, environment rebuild, or drift correction
Impact: Creates, replaces, or removes platform VMs on the target Proxmox environment
Severity: P2
Pre-reqs: Proxmox init completed for the environment, vault decrypt working, template-image state available.
Rollback strategy: Run hyops destroy with the same module and input overlay.
Context¶
This runbook covers module-level operations for:

- Module: `platform/onprem/platform-vm`
- Driver: `iac/terragrunt`
- Template source: `core/onprem/template-image` via `template_state_ref`
The expected operating model is state-ref first:

```yaml
template_state_ref: "core/onprem/template-image"
template_key: "<template-key>"
```

Direct `template_vm_id` input is override-only and not required for normal use.
Naming model (important on shared Proxmox metal):

- Logical VM names remain the same as your `inputs.vms` keys (for example `pgha-01`, `rke2-cp-01`) and are used by HyOps state + blueprint contracts.
- Physical Proxmox VM display names are env-scoped by default to avoid collisions:
    - prefix precedence: `name_prefix` -> `context_id` -> `--env`
    - example: `dev-pgha-01`, `staging-pgha-01`, `prod-pgha-01`
- NetBox VM sync uses the physical VM name exported by Terraform, so NetBox VM names also stay distinct across envs.
- If needed, set `inputs.vms.<logical_name>.vm_name` to force an exact physical name.
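As a sketch, the naming model above might look like this in an input overlay (the nesting is illustrative; only `name_prefix`, `vms`, and `vm_name` are taken from the text above):

```yaml
# Hypothetical overlay fragment for env "dev"; exact schema may differ.
name_prefix: dev          # optional; precedence: name_prefix -> context_id -> --env
vms:
  pgha-01: {}             # logical key; physical name becomes dev-pgha-01
  rke2-cp-01:
    vm_name: rke2-cp-01   # force this exact physical name (no env prefix)
```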
Preconditions and safety checks¶
- Installed `hyops` (via `install.sh`) can be run from any working directory.
- If you want to use the shipped example overlays, set:

    ```shell
    export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"
    ```

    For source checkout usage, set `HYOPS_CORE_ROOT` to your hybridops-core checkout root instead.

- Target environment selected correctly (`--env dev|staging|prod`).
- Template module state exists and is healthy:
- $HOME/.hybridops/envs/<env>/state/modules/core__onprem__template-image/latest.json
- Input overlay contains either:
- vms (for pool mode), or
- vm_name + single-VM fields (for shorthand mode).
- Proxmox API and SSH are reachable from the runner.
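For orientation, the two overlay shapes might look roughly like this (a sketch only; field names beyond `vms` and `vm_name` are assumptions):

```yaml
# Pool mode: one entry per VM under vms.
vms:
  pgha-01: {}
  pgha-02: {}

# Shorthand mode (alternative, single VM): vm_name plus the single-VM
# networking fields; those networking keys are placeholders here.
# vm_name: pgha-01
```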
IP Addressing Modes¶
`platform/onprem/platform-vm` supports:

- `addressing.mode: static` (operator provides per-VM IPs in the overlay)
- `addressing.mode: ipam` (HyOps allocates VM IPs from authoritative NetBox IPAM)
Default shipped behavior is IPAM-first:
- Example overlays in `modules/platform/onprem/platform-vm/examples/` use `require_ipam: true`.
- They define bridge mappings only; HyOps allocates IPv4 addresses from NetBox.
- `bridge: vnetenv` is allowed and resolves by runtime env: `dev -> vnetdev`, `staging -> vnetstag`, `prod -> vnetprod`.
- `bridge: vnetenvdata` is allowed and resolves by runtime env: `dev -> vnetddev`, `staging -> vnetdstg`, `prod -> vnetdprd`.
- Use explicit bridges (`vnetmgmt`, `vnetdata`, etc.) for non-workload/shared cases.
- Use static mode only as an explicit override (`require_ipam: false`) for bootstrap/break-glass paths.
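Putting the IPAM-first defaults together, an overlay fragment might look like this (the per-NIC placement of `bridge` is an assumption about the schema):

```yaml
addressing:
  mode: ipam          # HyOps allocates IPv4 addresses from NetBox
require_ipam: true    # matches the shipped example overlays
vms:
  pgha-01:
    bridge: vnetenv   # env-resolved: dev -> vnetdev, staging -> vnetstag, prod -> vnetprod

# Break-glass only: static addressing as an explicit override.
# require_ipam: false
# addressing:
#   mode: static      # operator then provides per-VM IPs in the overlay
```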
If you use `addressing.mode: ipam`, the NetBox authority must be ready (default: `shared`):

```shell
hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute
```

And the runtime env must have a NetBox API token available to HyOps:

```shell
hyops secrets ensure --env dev NETBOX_API_TOKEN
```
Steps¶
- Select an overlay

    Use one of:

    - `$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.min.yml`
    - `$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml`
    - `$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.enterprise.yml`

- Preflight
    ```shell
    hyops preflight --env dev --strict \
      --module platform/onprem/platform-vm \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"
    ```
- Deploy / converge
    ```shell
    hyops apply --env dev \
      --module platform/onprem/platform-vm \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"
    ```
    `platform-vm` apply now includes a built-in post-apply SSH readiness gate for Linux VMs (enabled and required by default). This fails the VM provisioning step early if clones never become reachable, instead of leaving downstream modules (for example `postgresql-core`, `netbox`, `rke2-cluster`) to fail later.
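    If the readiness gate needs tuning, the documented knob is the wait budget (also shown under Common issues below); any other keys here would be assumptions about the schema:

    ```yaml
    post_apply_ssh_readiness:
      connectivity_wait_s: 600   # extend only if guests legitimately boot slowly
    ```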
- Verify outputs and evidence
    ```shell
    cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"
    ```

    Check:

    - `status` is `ok`
    - `outputs.vm_ids`, `outputs.vm_keys`, and `outputs.vm_names` are present
    - `outputs.vm_keys` are the logical VM identifiers used inside HyOps
    - `outputs.vm_names` are the physical/env-prefixed VM names created on the target platform
    - `evidence_dir` exists

- Destroy
    ```shell
    hyops destroy --env dev \
      --module platform/onprem/platform-vm \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"
    ```
- Rebuild (destroy then apply)
    ```shell
    hyops rebuild --env dev --yes \
      --confirm-module platform/onprem/platform-vm \
      --module platform/onprem/platform-vm \
      --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"
    ```
Verification¶
Primary state:

`$HOME/.hybridops/envs/<env>/state/modules/platform__onprem__platform-vm/latest.json`

Primary logs:

`$HOME/.hybridops/envs/<env>/logs/module/platform__onprem__platform-vm/<run_id>/`

Terragrunt evidence artifacts to inspect:

- `terragrunt_init.stdout.txt`
- `terragrunt_apply.stdout.txt`
- `terragrunt_destroy.stdout.txt`
- `driver_result.json`
- `post_apply_ssh_readiness.json`
- `connectivity_proxy_nc.*` / `connectivity_ssh_auth.*` (when readiness probes run)
- `hook_netbox_sync.json` (apply and destroy sync paths when NetBox sync is enabled)
Post-actions and clean-up¶
- Archive the `run_id` and evidence directory in the related change/incident record.
- If this was a transient test, run `hyops destroy` with the same module and overlay.
- NetBox destroy behavior (default): VM records are soft-retired (offline + stale tag) during destroy-sync, not hard-deleted.
    - Optional hard delete (advanced): `export HYOPS_NETBOX_SYNC_DESTROY_HARD_DELETE=true` before `hyops destroy`.
- Confirm `latest.json` reflects the intended steady state after the final action.
- If VM replacement occurred, notify dependent teams of updated runtime details (for example MAC/IP changes).
Common issues¶
inputs.template_state_ref is required¶
Cause: overlay omitted both `template_state_ref` and `template_vm_id`.
Fix: include:

```yaml
template_state_ref: "core/onprem/template-image"
template_key: "ubuntu-24.04"
```
either inputs.vms (non-empty map) or inputs.vm_name is required¶
Cause: overlay does not define VM target(s).
Fix: add either:

- `vms: { ... }` for pool mode, or
- `vm_name` and single-VM networking fields for shorthand mode.
template_state_ref not found in env state¶
Cause: template-image was not built in the same environment.
Fix: run template build first:
```shell
hyops apply --env <env> \
  --module core/onprem/template-image \
  --inputs "$HYOPS_CORE_ROOT/modules/core/onprem/template-image/examples/inputs.min.yml"
```
vm set collision detected ... requested VM names differ from existing managed VM names¶
Cause: you are applying a different VM set against the same module state key (env + module_ref), which would replace currently managed VMs.
Fix:
- Preferred: use a separate environment or module scope for the new VM set.
- Intentional replacement only: set `allow_vm_set_replace: true` in your input overlay.
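For the intentional-replacement path, the override is a single overlay flag (sketch):

```yaml
# Acknowledge that the currently managed VM set for this
# env + module_ref state key will be replaced.
allow_vm_set_replace: true
```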
addressing.mode=ipam fails with missing NetBox details¶
Cause: NetBox authority is not ready (default: shared), or the NetBox API token is not available to HyOps.
Fix:
- Bootstrap the NetBox foundation:

    ```shell
    hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute
    ```

- Ensure the runtime env has a NetBox API token available to HyOps:

    ```shell
    hyops secrets ensure --env <env> NETBOX_API_TOKEN
    ```
Re-applied VM gets a different IP than before¶
Cause:
- NetBox IPAM avoids conflicts, but same-IP reuse depends on the prior reservation still existing.
- HyOps reuses reservations by a stable identity key (zone + logical VM key + bridge + NIC index).
- If the original IP record was removed, a new free IP may be allocated.
Expected default behavior:
- With the default destroy-sync soft-retire model, NetBox VM records are retired and IP conflicts are avoided.
- HyOps will reuse the same reserved IP when the NetBox IP record still exists.
- Confirm `platform/onprem/netbox` state is `status=ok` in the authority env.
- If your workstation is not routed to the management subnet, configure a bastion (`inputs.ssh_proxy_jump_host`) or rely on HyOps auto-tunnel via Proxmox when available.
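A bastion override might look like this (the hostname is a placeholder; only `ssh_proxy_jump_host` is taken from the text above):

```yaml
# Hypothetical jump host for reaching the management subnet.
ssh_proxy_jump_host: bastion.dev.example.internal
```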
post-apply SSH readiness failed¶
Cause: VM clone was created on Proxmox but never became SSH reachable within the configured wait budget (template issue, guest boot issue, or network path issue).
Fix:
- Inspect `post_apply_ssh_readiness.json` and the `connectivity_*` evidence in the module run directory.
- Verify the template itself (rebuild `core/onprem/template-image`; the automatic template smoke test should pass).
- Increase the wait budget only if boot is legitimately slow:

    ```yaml
    post_apply_ssh_readiness:
      connectivity_wait_s: 600
    ```