Operate Generic Platform VMs (HyOps)

Purpose: Run the generic VM lifecycle for platform/onprem/platform-vm in a controlled environment.
Owner: Platform engineering
Trigger: New VM pool request, baseline refresh, environment rebuild, or drift correction
Impact: Creates, replaces, or removes platform VMs on the target Proxmox environment
Severity: P2
Pre-reqs: Proxmox init completed for the environment, vault decrypt working, template-image state available.
Rollback strategy: Run hyops destroy with the same module and input overlay.

Context

This runbook covers module-level operations for:

  • Module: platform/onprem/platform-vm
  • Driver: iac/terragrunt
  • Template source: core/onprem/template-image via template_state_ref

The expected operating model is state-ref first:

  • template_state_ref: "core/onprem/template-image"
  • template_key: "<template-key>"

Direct template_vm_id input is override-only and not required for normal use.

Naming model (important on shared Proxmox metal):

  • Logical VM names remain the same as your inputs.vms keys (for example pgha-01, rke2-cp-01) and are used by HyOps state + blueprint contracts.
  • Physical Proxmox VM display names are env-scoped by default to avoid collisions:
      • prefix precedence: name_prefix -> context_id -> --env
      • example: dev-pgha-01, staging-pgha-01, prod-pgha-01
  • NetBox VM sync uses the physical VM name exported by Terraform, so NetBox VM names also stay distinct across envs.
  • If needed, set inputs.vms.<logical_name>.vm_name to force an exact physical name.
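
The naming rule above can be sketched as a small shell function. This only illustrates the documented precedence; it is not HyOps' implementation. The function name, the argument order, and the hyphen separator (inferred from the dev-pgha-01 example) are assumptions.

```shell
# Illustrative only: mirrors the documented prefix precedence
# (name_prefix -> context_id -> --env). Not HyOps' actual code.
physical_vm_name() {
  logical_name="$1"      # key from inputs.vms, e.g. pgha-01
  env_name="$2"          # value passed to --env
  context_id="${3:-}"    # optional
  name_prefix="${4:-}"   # optional, highest precedence
  vm_name="${5:-}"       # optional inputs.vms.<logical_name>.vm_name override

  # An explicit vm_name forces the exact physical name.
  if [ -n "$vm_name" ]; then
    printf '%s\n' "$vm_name"
    return 0
  fi

  # First non-empty value wins: name_prefix, then context_id, then env.
  prefix="${name_prefix:-${context_id:-$env_name}}"
  printf '%s-%s\n' "$prefix" "$logical_name"
}
```

For example, physical_vm_name pgha-01 staging prints staging-pgha-01, matching the example names above.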

Preconditions and safety checks

  • Installed hyops (via install.sh) can be run from any working directory.
  • If you want to use the shipped example overlays, set:
export HYOPS_CORE_ROOT="${HYOPS_CORE_ROOT:-$HOME/.hybridops/core/app}"

For source checkout usage, set HYOPS_CORE_ROOT to your hybridops-core checkout root instead.

  • Target environment selected correctly (--env dev|staging|prod).
  • Template module state exists and is healthy:
      • $HOME/.hybridops/envs/<env>/state/modules/core__onprem__template-image/latest.json
  • Input overlay contains either:
      • vms (for pool mode), or
      • vm_name + single-VM fields (for shorthand mode).
  • Proxmox API and SSH are reachable from the runner.
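
A quick way to check the template-state precondition before running anything is sketched below. The helper names are hypothetical. The path pattern (slashes in the module ref become double underscores) is inferred from the state paths shown in this runbook. The check only confirms that the file exists and is non-empty; it does not judge whether the state is healthy.

```shell
# Illustrative helper: map a module ref to its state file, following the
# path pattern shown in this runbook ("/" in the ref becomes "__").
module_state_file() {
  env_name="$1"
  module_ref="$2"
  state_dir="$(printf '%s' "$module_ref" | sed 's|/|__|g')"
  printf '%s\n' "$HOME/.hybridops/envs/$env_name/state/modules/$state_dir/latest.json"
}

# Returns 0 only if the template-image state file exists and is non-empty.
template_state_present() {
  [ -s "$(module_state_file "$1" core/onprem/template-image)" ]
}
```

Usage: template_state_present dev || echo "build core/onprem/template-image first".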

IP Addressing Modes

platform/onprem/platform-vm supports:

  • addressing.mode: static (operator provides per-VM IPs in the overlay)
  • addressing.mode: ipam (HyOps allocates VM IPs from authoritative NetBox IPAM)

Default shipped behavior is IPAM-first:

  • Example overlays in modules/platform/onprem/platform-vm/examples/ use require_ipam: true.
  • They define bridge mappings only; HyOps allocates IPv4 addresses from NetBox.
  • bridge: vnetenv is allowed and resolves by runtime env:
      • dev -> vnetdev, staging -> vnetstag, prod -> vnetprod
  • bridge: vnetenvdata is allowed and resolves by runtime env:
      • dev -> vnetddev, staging -> vnetdstg, prod -> vnetdprd
  • Use explicit bridges (vnetmgmt, vnetdata, etc.) for non-workload/shared cases.
  • Use static mode only as an explicit override (require_ipam: false) for bootstrap/break-glass paths.
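
The bridge aliases above amount to a lookup keyed on alias and runtime env. The sketch below restates that table as a shell function for reference. The function name is hypothetical, and the handling of an env without a listed mapping (returning an error) is an assumption, since this runbook only lists dev, staging, and prod.

```shell
# Illustrative lookup of the documented alias -> bridge mappings.
resolve_bridge() {
  bridge="$1"
  env_name="$2"
  case "$bridge:$env_name" in
    vnetenv:dev)         echo vnetdev ;;
    vnetenv:staging)     echo vnetstag ;;
    vnetenv:prod)        echo vnetprod ;;
    vnetenvdata:dev)     echo vnetddev ;;
    vnetenvdata:staging) echo vnetdstg ;;
    vnetenvdata:prod)    echo vnetdprd ;;
    vnetenv:*|vnetenvdata:*)
      # Assumption: an alias with no listed mapping is treated as an error.
      echo "no mapping for $bridge in env $env_name" >&2
      return 1 ;;
    *)
      # Explicit bridges (vnetmgmt, vnetdata, ...) pass through unchanged.
      echo "$bridge" ;;
  esac
}
```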

If you use addressing.mode: ipam, NetBox authority must be ready (default: shared):

hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute

And the runtime env must have a NetBox API token available to HyOps:

hyops secrets ensure --env dev NETBOX_API_TOKEN

Steps

  1. Select an overlay

Use one of:

  • $HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.min.yml
  • $HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml
  • $HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.enterprise.yml

  2. Preflight

hyops preflight --env dev --strict \
  --module platform/onprem/platform-vm \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"

  3. Deploy / converge

hyops apply --env dev \
  --module platform/onprem/platform-vm \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"

platform-vm apply now includes a built-in post-apply SSH readiness gate for Linux VMs (default enabled/required). This fails the VM provisioning step early if clones never become reachable, instead of leaving downstream modules (for example postgresql-core, netbox, rke2-cluster) to fail later.
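
The preflight and apply commands can be chained so that apply only runs after a clean preflight. This is a minimal sketch: the wrapper name is hypothetical, and it assumes hyops preflight exits non-zero on failure, which this runbook does not state explicitly.

```shell
# Run preflight, then apply only if preflight succeeded.
# Assumes `hyops preflight` signals failure through its exit status.
converge_platform_vm() {
  env_name="$1"
  inputs="$2"
  module="platform/onprem/platform-vm"

  hyops preflight --env "$env_name" --strict \
    --module "$module" \
    --inputs "$inputs" || {
      echo "preflight failed; skipping apply" >&2
      return 1
    }

  hyops apply --env "$env_name" \
    --module "$module" \
    --inputs "$inputs"
}
```

Usage: converge_platform_vm dev "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"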

  4. Verify outputs and evidence

cat "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json"

Check:

  • status is ok
  • outputs.vm_ids, outputs.vm_keys, and outputs.vm_names are present
  • outputs.vm_keys are the logical VM identifiers used inside HyOps
  • outputs.vm_names are the physical/env-prefixed VM names created on the target platform
  • evidence_dir exists
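
The checks above can be scripted rather than read off the cat output. The sketch below uses jq, which is not part of HyOps and must be installed separately. It assumes that status and evidence_dir are top-level keys and that vm_ids, vm_keys, and vm_names sit under outputs, as the field names in the list suggest; adjust the filters if your latest.json is laid out differently.

```shell
# Illustrative state check; requires jq. Key locations are assumptions.
check_platform_vm_state() {
  state_file="$1"
  [ -r "$state_file" ] || { echo "state file not readable: $state_file" >&2; return 1; }

  # status must be "ok" and the three output fields must be present.
  jq -e '
    .status == "ok"
    and (.outputs.vm_ids   != null)
    and (.outputs.vm_keys  != null)
    and (.outputs.vm_names != null)
  ' "$state_file" >/dev/null || {
    echo "status or outputs check failed: $state_file" >&2
    return 1
  }

  # evidence_dir must be recorded and must exist on disk.
  evidence_dir="$(jq -r '.evidence_dir // empty' "$state_file")"
  [ -n "$evidence_dir" ] && [ -d "$evidence_dir" ] || {
    echo "evidence_dir missing or not a directory" >&2
    return 1
  }
}
```

Usage: check_platform_vm_state "$HOME/.hybridops/envs/dev/state/modules/platform__onprem__platform-vm/latest.json" && echo "state looks ok"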

  5. Destroy

hyops destroy --env dev \
  --module platform/onprem/platform-vm \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"

  6. Rebuild (destroy then apply)

hyops rebuild --env dev --yes \
  --confirm-module platform/onprem/platform-vm \
  --module platform/onprem/platform-vm \
  --inputs "$HYOPS_CORE_ROOT/modules/platform/onprem/platform-vm/examples/inputs.typical.yml"

Verification

Primary state:

  • $HOME/.hybridops/envs/<env>/state/modules/platform__onprem__platform-vm/latest.json

Primary logs:

  • $HOME/.hybridops/envs/<env>/logs/module/platform__onprem__platform-vm/<run_id>/

Terragrunt evidence artifacts to inspect:

  • terragrunt_init.stdout.txt
  • terragrunt_apply.stdout.txt
  • terragrunt_destroy.stdout.txt
  • driver_result.json
  • post_apply_ssh_readiness.json
  • connectivity_proxy_nc.* / connectivity_ssh_auth.* (when readiness probes run)
  • hook_netbox_sync.json (apply and destroy sync paths when NetBox sync is enabled)
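
When attaching evidence to a change record, it helps to confirm which of the artifacts listed above are actually present. The helper below is a hypothetical sketch: pass it the directory that holds the artifacts for the run and the file names you expect. Which files a given run produces depends on the action (apply or destroy) and on whether readiness probes and NetBox sync ran, so the expected list is left to the caller.

```shell
# Report which expected artifacts exist in a run directory.
# Returns non-zero if any expected file is missing.
check_evidence() {
  run_dir="$1"
  shift
  missing=0
  for name in "$@"; do
    if [ -e "$run_dir/$name" ]; then
      echo "present  $name"
    else
      echo "missing  $name"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

Usage for an apply run: check_evidence "<run directory>" terragrunt_init.stdout.txt terragrunt_apply.stdout.txt driver_result.json post_apply_ssh_readiness.json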

Post-actions and clean-up

  • Archive run_id and evidence directory in the related change/incident record.
  • If this was a transient test, run hyops destroy with the same module and overlay.
  • NetBox destroy behavior (default): VM records are soft-retired (offline + stale tag) during destroy-sync, not hard-deleted.
  • Optional hard delete (advanced): export HYOPS_NETBOX_SYNC_DESTROY_HARD_DELETE=true before hyops destroy.
  • Confirm latest.json reflects intended steady state after the final action.
  • If VM replacement occurred, notify dependent teams of updated runtime details (for example MAC/IP changes).
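
If you use the optional hard delete, consider scoping the override to a single destroy so it does not linger in your shell and affect later runs. A minimal sketch, with a hypothetical wrapper name:

```shell
# Run one destroy with NetBox hard delete enabled, without leaving the
# variable exported in the calling shell.
destroy_with_hard_delete() {
  env_name="$1"
  inputs="$2"
  (
    # The subshell keeps the override from leaking into later commands.
    export HYOPS_NETBOX_SYNC_DESTROY_HARD_DELETE=true
    hyops destroy --env "$env_name" \
      --module platform/onprem/platform-vm \
      --inputs "$inputs"
  )
}
```

After the function returns, HYOPS_NETBOX_SYNC_DESTROY_HARD_DELETE is unchanged in the calling shell, so subsequent destroys fall back to the default soft-retire behavior.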

Common issues

inputs.template_state_ref is required

Cause: overlay omitted both template_state_ref and template_vm_id.

Fix: include:

template_state_ref: "core/onprem/template-image"
template_key: "ubuntu-24.04"

either inputs.vms (non-empty map) or inputs.vm_name is required

Cause: overlay does not define VM target(s).

Fix: add either:

  • vms: { ... } for pool mode, or
  • vm_name and single-VM networking fields for shorthand mode.

template_state_ref not found in env state

Cause: template-image was not built in the same environment.

Fix: run template build first:

hyops apply --env <env> \
  --module core/onprem/template-image \
  --inputs "$HYOPS_CORE_ROOT/modules/core/onprem/template-image/examples/inputs.min.yml"

vm set collision detected ... requested VM names differ from existing managed VM names

Cause: you are applying a different VM set against the same module state key (env + module_ref), which would replace currently managed VMs.

Fix:

  • Preferred: use a separate environment or module scope for the new VM set.
  • Intentional replacement only: set allow_vm_set_replace: true in your input overlay.

addressing.mode=ipam fails with missing NetBox details

Cause: NetBox authority is not ready (default: shared), or the NetBox API token is not available to HyOps.

Fix:

  • Bootstrap NetBox foundation: hyops blueprint deploy --env shared --ref onprem/bootstrap-netbox@v1 --execute
  • Make the NetBox API token available to HyOps in the runtime env: hyops secrets ensure --env <env> NETBOX_API_TOKEN
  • Confirm platform/onprem/netbox state is status=ok in the authority env.
  • If your workstation is not routed to the management subnet, configure a bastion (inputs.ssh_proxy_jump_host) or rely on HyOps auto-tunnel via Proxmox when available.

Re-applied VM gets a different IP than before

Cause:

  • NetBox IPAM avoids conflicts, but same-IP reuse depends on the prior reservation still existing.
  • HyOps reuses reservations by a stable identity key (zone + logical VM key + bridge + NIC index).
  • If the original IP record was removed, a new free IP may be allocated.

Expected default behavior:

  • With the default destroy-sync soft-retire model, NetBox VM records are retired and IP conflicts are avoided.
  • HyOps will reuse the same reserved IP when the NetBox IP record still exists.

post-apply SSH readiness failed

Cause: VM clone was created on Proxmox but never became SSH reachable within the configured wait budget (template issue, guest boot issue, or network path issue).

Fix:

  • Inspect post_apply_ssh_readiness.json and connectivity_* evidence in the module run directory.
  • Verify the template itself (rebuild core/onprem/template-image; automatic template smoke should pass).
  • Increase wait budget only if boot is legitimately slow:
post_apply_ssh_readiness:
  connectivity_wait_s: 600
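
Before raising the wait budget, it can be useful to probe the VM by hand from the runner. The sketch below is a manual triage aid, not the built-in readiness gate: it only checks that the TCP port accepts connections and does not test SSH authentication. It assumes bash (for /dev/tcp) and the coreutils timeout command are available; the function name and defaults are hypothetical.

```shell
# Poll a host until its SSH port accepts a TCP connection or the budget
# runs out. Checks port reachability only, not SSH authentication.
wait_for_ssh() {
  host="$1"
  budget_s="${2:-300}"    # total time to keep trying, in seconds
  interval_s="${3:-5}"    # pause between probes
  port="${4:-22}"

  deadline=$(( $(date +%s) + budget_s ))
  while :; do
    # bash opens a TCP connection when redirecting to /dev/tcp/<host>/<port>.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
      echo "reachable: $host:$port"
      return 0
    fi
    [ "$(date +%s)" -ge "$deadline" ] && break
    sleep "$interval_s"
  done

  echo "not reachable within ${budget_s}s: $host:$port" >&2
  return 1
}
```

Usage: wait_for_ssh <vm-ip> 120. If the port never opens, look at the template, guest boot, and network path before increasing connectivity_wait_s.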

References