Environment Guard Framework (EGF) for Ansible Governance
Status
Accepted — EGF is the standard way to gate Ansible automation with environment-aware risk checks, governed targeting, and evidence-first connectivity validation.
1. Context
HybridOps.Studio uses Ansible as a core automation layer across platform, non-production, and production-like environments. As the estate grows, ungoverned playbook runs present several risks:
- Accidental execution against the wrong environment (for example running a “prod” playbook against dev or vice versa).
- Lack of clear pre-change checks (reachability, risk assessment, approvals).
- Weak traceability for reviewers (no consistent correlation ID, scattered logs, and ad hoc artefacts).
- Difficulty demonstrating behaviour to stakeholders (reviewers, auditors, hiring managers) who are not inside the repository structure every day.
At the same time:
- The platform is migrating toward NetBox-backed IPAM and dynamic inventories, but still has legacy patterns (static inventories, runtime IP maps).
- The same governance story must hold for a single-node lab, a “prod-like” platform, and future multi-cloud extensions.
- Evidence production (JSON, JSONL, markdown reports, log files) must be deterministic and live under the canonical output/ tree.
A structured framework is required so that:
- Every pipeline can opt into the same set of guardrails.
- Evidence paths and correlation IDs are predictable.
- Legacy inventory/IP patterns can be used in a controlled way while NetBox becomes the primary source of truth.
2. Decision
Adopt the Environment Guard Framework (EGF) as the standard governance chain for Ansible workflows, implemented as a set of roles in the hybridops.common collection.
flowchart LR
A[env_guard<br/>Governance & risk] --> B[gen_inventory<br/>Bridge inventory]
B --> C[host_selector<br/>Target selection]
C --> D[ip_mapper<br/>Bridge IP mapping]
D --> E[connectivity_test<br/>Pre-deploy checks]
E --> F[deployment<br/>Change execution]
Core decisions:
- Governance entry point
  env_guard is the mandatory first step for governed flows. It validates the target environment, performs risk scoring, enforces maintenance windows (especially for prod), and emits structured audit logs and markdown reports with correlation IDs.
- Environment resolution (interactive vs CI)
  In non-interactive contexts (for example CI pipelines), env_guard runs in a non-interactive mode and derives the target environment from either:
  - an explicit env variable, or
  - the HOS_ENV environment variable (for example set by Jenkins parameters),
  falling back to dev when neither is provided. Interactive prompts are only used when env is not set.
- Target and address handling
  - gen_inventory is retained as a bridge role to generate inventories from structured environment data where NetBox is not yet present.
  - host_selector is the default targeting control, with four methods (A–D) and clear validation rules.
  - ip_mapper is a bridge/fallback role that resolves placeholders (XX.XX.XX.00) to real IPs in environments that have not fully moved to NetBox dynamic inventory.
- Connectivity gate
  connectivity_test is the standard pre-deploy reachability check, producing both JSON (dict keyed by hostname) and JSONL (one event per line) under output/logs/ansible/connectivity_logs/<run_id>/.
- Evidence and logging layout
  - Env Guard audit logs and reports live under output/logs/ansible/env_guard_logs/<run_id>/.
  - Connectivity artefacts live under output/logs/ansible/connectivity_logs/<run_id>/.
  - Tests and CI harnesses may copy selected artefacts under local tests/output/ folders but do not change the canonical output layout.
- Correlation ID (CID)
  - A correlation ID is generated or inherited at the env_guard boundary and propagated through dependent roles as a logging-only identifier.
  - CIDs appear in log lines and report filenames but are not used as functional inputs to hosts.
EGF becomes the default pattern for new pipelines and the reference implementation when demonstrating governance and evidence.
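The environment resolution rule from the core decisions (explicit env, then HOS_ENV, then dev) can be illustrated with a small standalone play. This is a minimal sketch, not the env_guard role's actual tasks: the resolved_env and allowed_envs names, and the list of valid environments, are assumptions made for illustration.

```yaml
# Sketch only: resolved_env and allowed_envs are hypothetical names,
# and the set of valid environments is assumed.
- name: Resolve the target environment for a non-interactive run
  hosts: localhost
  gather_facts: false
  vars:
    allowed_envs: [dev, staging, prod]
  tasks:
    - name: Prefer explicit env, then HOS_ENV, then fall back to dev
      ansible.builtin.set_fact:
        resolved_env: >-
          {{ env
             | default(lookup('env', 'HOS_ENV'), true)
             | default('dev', true)
             | lower }}

    - name: Stop before any later step if the environment is unknown
      ansible.builtin.assert:
        that:
          - resolved_env in allowed_envs
        fail_msg: "Unknown environment '{{ resolved_env }}'"
```

Passing env as an extra variable takes precedence over HOS_ENV. When neither is set, the env lookup returns an empty string, so the second default supplies dev.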
3. Rationale
3.1 Why a dedicated framework?
Without a single, named framework, governance concerns tend to be re-implemented ad hoc in each playbook:
- One playbook prompts for environment, another hardcodes it, and a third reads it from environment variables.
- Some jobs run “pre-checks” via shell scripts, others via inline tasks, and most do not emit consistent artefacts.
- There is no unified way to show “this is the gate that runs before changes”.
EGF provides:
- A concrete story: “Every governed pipeline goes through Env Guard → (optional inventory/IP bridge) → host selection → connectivity gate → deployment.”
- A reusable library of roles in hybridops.common that can be used across multiple repositories.
- A predictable evidence model anchored in output/ with JSON, JSONL, and markdown artefacts.
3.2 Why these roles and this pipeline order?
- env_guard first
  Governance must run before inventory generation, targeting, and connectivity checks; otherwise it is possible to “half-run” expensive operations before discovering that the environment or window is invalid.
- gen_inventory and ip_mapper as bridges
  These roles exist to support structured/static inventories and runtime IP maps while NetBox adoption is ongoing. Making them first-class bridge roles clearly signals that the long-term direction is NetBox → dynamic inventory.
- host_selector before connectivity_test
  Connectivity checks should only run against hosts that have been selected and approved. This keeps gates fast and predictable, and avoids surprising traffic to unwanted targets.
- connectivity_test before deployment
  Pre-flight gates provide a defensible operational story and produce consistent JSON/JSONL artefacts for evidence packs and post-incident analysis.
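The ordering argued for in this section can be sketched as a single play. The role names come from this document; the egf_use_bridge_roles toggle is a hypothetical variable, and whether the roles are meant to be chained through include_role in exactly this way is an assumption.

```yaml
# Sketch only: egf_use_bridge_roles is a hypothetical switch,
# not a documented variable of the hybridops.common collection.
- name: Governed deploy following the EGF order
  hosts: localhost
  gather_facts: false
  vars:
    egf_use_bridge_roles: true
  tasks:
    - name: Governance and risk checks run first
      ansible.builtin.include_role:
        name: hybridops.common.env_guard

    - name: Bridge inventory while NetBox is not yet authoritative
      ansible.builtin.include_role:
        name: hybridops.common.gen_inventory
      when: egf_use_bridge_roles | bool

    - name: Select and validate targets
      ansible.builtin.include_role:
        name: hybridops.common.host_selector

    - name: Bridge IP mapping for placeholder addresses
      ansible.builtin.include_role:
        name: hybridops.common.ip_mapper
      when: egf_use_bridge_roles | bool

    - name: Pre-deploy reachability gate
      ansible.builtin.include_role:
        name: hybridops.common.connectivity_test

    # Deployment roles would follow here, only after every gate has passed.
```

Guarding the two bridge roles with one condition keeps them easy to drop once NetBox dynamic inventory is in place, without touching the other steps.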
3.3 Why log-only correlation IDs?
CIDs are useful for:
- Tracing a single run across multiple roles and playbooks.
- Filtering logs and artefacts during incident review, audits, or interviews.
- Demonstrating a “single-threaded” chain of evidence for a scenario.
Using CIDs as functional inputs (for example as part of hostnames) would increase coupling and complexity. Keeping CIDs as log-only identifiers:
- Provides traceability without entangling business logic.
- Keeps the framework easy to explain in ADRs, HOWTOs, and customer-facing material.
4. Consequences
4.1 Positive consequences
- Unified governance pipeline
  All governed Ansible flows can refer to the same chain (env_guard → … → connectivity_test → deployment), which simplifies documentation, diagrams, stakeholder briefings, and onboarding.
- Consistent evidence output
  - env_guard audit logs and reports always land under output/logs/ansible/env_guard_logs/<run_id>/.
  - connectivity_test always writes JSON+JSONL to output/logs/ansible/connectivity_logs/<run_id>/.
  - CI harnesses can copy artefacts to tests/output/ without changing canonical paths.
- Improved safety
  - env_guard enforces environment validation, risk scoring, and maintenance windows (with stricter rules for production).
  - connectivity_test ensures that playbooks only run when basic reachability is proven.
  - host_selector reduces targeting errors via validated methods and readable prompts.
- Controlled bridge for legacy patterns
  - gen_inventory and ip_mapper allow legacy or structured inventory data to be used safely while NetBox adoption is ongoing.
  - The roles can be removed from the “main line” later without redesigning the rest of the pipeline.
- Stronger platform narrative
  EGF provides a concise statement:
  “Every meaningful change goes through Env Guard (governance) → host selection → connectivity gate. Here are the artefacts and the logs for multiple runs.”
  This is easy to demonstrate via screenshots, JSON snippets, and linked logs for platform stakeholders.
4.2 Negative consequences / risks
- More steps in simple flows
  One-off or ad hoc playbooks now need to either opt into EGF (and accept the overhead) or explicitly stay outside it. This can feel heavy for trivial tasks.
- Learning curve for contributors
  Engineers must understand the framework roles, especially env_guard and connectivity_test, before creating new pipelines.
- Additional maintenance surface
  The framework needs ongoing maintenance:
  - Keeping output paths in sync with platform conventions.
  - Adjusting argument specs and defaults as Ansible versions evolve.
  - Updating READMEs and ADRs when NetBox-first patterns replace bridge roles.
- Risk of partial adoption
  If some pipelines use EGF and others do not, the governance story becomes fragmented. This risk is mitigated by documenting EGF as the default and using ADRs to justify any deviations.
5. Alternatives considered
- Per-playbook governance tasks
  Implement environment checks, maintenance windows, and connectivity tests in each playbook independently.
  Rejected because it scatters logic, makes evidence inconsistent, and is harder to explain or audit.
- Rely purely on CI/CD pipeline logic (no Ansible roles)
  Use external scripts or pipeline steps (shell, Python, or Terraform) to do validation before Ansible runs.
  Rejected because it splits governance across multiple technologies and hides the checks from engineers who run playbooks locally.
- NetBox-only approach from day one
  Require NetBox dynamic inventory everywhere and remove gen_inventory/ip_mapper entirely.
  Rejected for now because the platform is mid-migration and still needs to demonstrate both pre-NetBox and NetBox-first patterns.
- Single “mega-role” instead of composable roles
  One role responsible for governance, inventory, targeting, IP mapping, and connectivity checks.
  Rejected because it would be harder to test, reuse in other contexts, and evolve as NetBox adoption increases.
6. Implementation notes
6.1 Roles and locations
The decision is implemented primarily in the hybridops.common collection:
- ansible_collections/hybridops/common/roles/env_guard/
- ansible_collections/hybridops/common/roles/gen_inventory/
- ansible_collections/hybridops/common/roles/host_selector/
- ansible_collections/hybridops/common/roles/ip_mapper/
- ansible_collections/hybridops/common/roles/connectivity_test/
Each role includes:
- defaults/main.yml with canonical paths and environment defaults (for example resolving a project root from playbook_dir).
- meta/main.yml and meta/argument_specs.yml where applicable.
- tasks/ broken into small, composable files (for example _validate.yml, run_tests.yml, compile_results.yml, save_results.yml).
6.2 Output paths and evidence
EGF roles write primary artefacts under:
    output/
      logs/
        ansible/
          env_guard_logs/<run_id>/
          connectivity_logs/<run_id>/
CI harnesses and role tests may copy selected artefacts into local tests/output/ directories for convenience, but must not change the canonical paths.
connectivity_test publishes run metadata via set_stats, including:
- connectivity_logs_root
- connectivity_run_id
- connectivity_run_dir
- connectivity_json_path
- connectivity_jsonl_path
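To show how this kind of run metadata can be published, here is a minimal sketch using the built-in set_stats module. The stat keys are the ones named in this document; the run ID value and the two file names are assumptions, and the real role may compute them differently.

```yaml
# Sketch only: key names come from this document; the run ID and
# the connectivity.json / connectivity.jsonl file names are assumed.
- name: Publish connectivity run metadata as custom stats
  hosts: localhost
  gather_facts: false
  vars:
    connectivity_logs_root: "output/logs/ansible/connectivity_logs"
    connectivity_run_id: "demo-run"
    connectivity_run_dir: "{{ connectivity_logs_root }}/{{ connectivity_run_id }}"
    connectivity_json_path: "{{ connectivity_run_dir }}/connectivity.json"
    connectivity_jsonl_path: "{{ connectivity_run_dir }}/connectivity.jsonl"
  tasks:
    - name: Record where this run wrote its artefacts
      ansible.builtin.set_stats:
        data:
          connectivity_logs_root: "{{ connectivity_logs_root }}"
          connectivity_run_id: "{{ connectivity_run_id }}"
          connectivity_run_dir: "{{ connectivity_run_dir }}"
          connectivity_json_path: "{{ connectivity_json_path }}"
          connectivity_jsonl_path: "{{ connectivity_jsonl_path }}"
        per_host: false    # one global set of stats, not one per host
        aggregate: false   # replace earlier values instead of merging them
```

Custom stats only appear in the play recap when show_custom_stats is enabled in the [defaults] section of ansible.cfg (or via ANSIBLE_SHOW_CUSTOM_STATS).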
6.3 Correlation ID propagation
- env_guard is responsible for seeding env_guard_correlation_id (UUID-based with a fallback format such as envguard-<epoch>-<hex>).
- Downstream roles use an inherited/logging-only CID:

      connectivity_use_cid: true
      connectivity_inherited_cid: "{{ correlation_id | default(egf_correlation_id | default('', true), true) | lower }}"
      connectivity_env_cid: "{{ lookup('env','EGF_CORR_ID') | default('', true) | lower }}"
      connectivity_cid_pre: "{{ connectivity_inherited_cid or connectivity_env_cid }}"

- Log messages include [cid=...] prefixes for key events (unmapped hosts, failures, summaries).
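The inherit-or-generate behaviour and the log prefix can be sketched in a standalone play. This is an illustration under stated assumptions: demo_cid is a hypothetical variable, and deriving a UUID from a random number with to_uuid stands in for whatever UUID source env_guard really uses.

```yaml
# Sketch only: demo_cid is a hypothetical name; correlation_id and
# EGF_CORR_ID are the inputs named in this document.
- name: Seed a correlation ID and use it only in log output
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Inherit a CID if one was supplied, otherwise generate one
      ansible.builtin.set_fact:
        # set_fact stores the rendered value, so the ID stays stable
        # for the rest of the play instead of being regenerated.
        demo_cid: >-
          {{ ((correlation_id | default('', true))
              or (lookup('env', 'EGF_CORR_ID') | default('', true))
              or (999999999 | random | string | to_uuid)) | lower }}

    - name: Prefix a log line with the CID
      ansible.builtin.debug:
        msg: "[cid={{ demo_cid }}] connectivity summary (placeholder message)"
```

The CID is only interpolated into messages here. No host, path, or task condition depends on it, which matches the log-only rule.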
6.4 CI harness
The ansible-galaxy-hybridops workspace provides a Makefile that can:
- Run role-level smoke tests (for example tests/smoke.yml under each role) against disposable inventories.
- Build collection artefacts for Galaxy via targets such as release.dry-run and galaxy.build.all.
- Clean up locally built tarballs and uninstall hybridops.* collections when iterating on releases.
The hybridops-platform repository is treated as the primary consumer:
- Collections are installed from Galaxy via deployment/requirements.yml.
- Ansible uses deployment/ansible.cfg (plus the default COLLECTIONS_PATHS) to locate hybridops.*.
- The EGF connectivity gate is exercised from deployment/ using the standard CI playbooks (for example ci/playbooks/cicd_connectivity_gate.yml), which in turn produce artefacts under output/logs/ansible/env_guard_logs/ and output/logs/ansible/connectivity_logs/.
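For orientation, a collection requirements file for the consumer repository might look like the following. This is a sketch: only hybridops.common is named in this document, and no version pin is shown because none is stated here.

```yaml
# deployment/requirements.yml (sketch; version constraints are not
# given in this document and are therefore omitted)
collections:
  - name: hybridops.common
```

A file of this shape is installed with ansible-galaxy collection install -r deployment/requirements.yml.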
7. Operational impact and validation
7.1 Day-to-day flows
Engineers interact with EGF primarily via:
- Governed deploy playbooks, which include:

      - hosts: localhost
        roles:
          - hybridops.common.env_guard
          - hybridops.common.host_selector
          - hybridops.common.connectivity_test
          # followed by application or platform roles

- CI workflows that:
  - Run env_guard in non-interactive mode with CI_ENV and gated windows.
  - Run connectivity_test and archive JSON/JSONL artefacts.
  - Optionally run legacy bridge roles (gen_inventory, ip_mapper) when NetBox is not yet authoritative.
7.2 Runbooks and HOWTOs
- Runbooks under docs/ops/runbooks/egf/ describe:
  - How to run the connectivity gate.
  - How to interpret Env Guard audit logs and markdown reports.
  - How to triage failed risk checks or connectivity failures.
- HOWTO guides under docs/howto/ describe:
  - Adding a new governed pipeline to an existing playbook.
  - Integrating EGF roles in CI workflows.
7.3 Evidence and metrics
Evidence is collected via:
- JSON and JSONL artefacts from connectivity_test.
- Markdown reports and audit logs from env_guard.
- CI job logs that include [cid=...] prefixes and artefact paths.
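To make the two connectivity artefact shapes concrete (a JSON dict keyed by hostname and a JSONL file with one event per line), here is a minimal reachability sketch. It is not the connectivity_test role: the demo_* variables, the event fields, the file names, the TCP port 22 probe, and the documentation-range addresses are all assumptions.

```yaml
# Sketch only: every demo_* name, the event fields, and the file names are hypothetical.
- name: Reachability gate that writes JSON and JSONL evidence
  hosts: localhost
  gather_facts: false
  vars:
    demo_run_dir: "{{ playbook_dir }}/output/logs/ansible/connectivity_logs/demo-run"
    demo_targets:
      - { name: app01, address: 192.0.2.10 }
      - { name: app02, address: 192.0.2.11 }
  tasks:
    - name: Create the run directory
      ansible.builtin.file:
        path: "{{ demo_run_dir }}"
        state: directory
        mode: "0755"

    - name: Probe TCP port 22 on each target without aborting the play
      ansible.builtin.wait_for:
        host: "{{ item.address }}"
        port: 22
        timeout: 5
      loop: "{{ demo_targets }}"
      register: demo_probe
      ignore_errors: true

    - name: Build one event per target
      ansible.builtin.set_fact:
        demo_events: >-
          {{ demo_events | default([]) + [{
               'host': item.item.name,
               'address': item.item.address,
               'reachable': not (item.failed | default(false))
             }] }}
      loop: "{{ demo_probe.results }}"
      loop_control:
        label: "{{ item.item.name }}"

    - name: Write JSON keyed by hostname
      ansible.builtin.copy:
        dest: "{{ demo_run_dir }}/connectivity.json"
        content: "{{ dict(demo_events | map(attribute='host') | zip(demo_events)) | to_nice_json }}"
        mode: "0644"

    - name: Write JSONL with one event per line
      ansible.builtin.copy:
        dest: "{{ demo_run_dir }}/connectivity.jsonl"
        content: "{{ demo_events | map('to_json') | join('\n') }}\n"
        mode: "0644"
```

Targets that do not answer within the timeout are recorded with reachable set to false rather than stopping the play, so the evidence files are written for failed runs as well. A real gate would then fail the play when any event is unreachable.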
Potential metrics:
- Number of failed versus successful Env Guard runs per environment.
- Frequency of connectivity failures by environment or host group.
- Percentage of pipelines that include the full EGF chain.
These metrics can be surfaced via dashboards or summarised for internal reviews and external conversations.
8. References
- EGF documentation
  - HOWTO – EGF Governed Deploy
- Related ADRs
- Evidence folders
  - output/logs/ansible/env_guard_logs/
  - output/logs/ansible/connectivity_logs/
- External
  - Ansible collections documentation – see the official Ansible docs.
  - NetBox documentation – see the official NetBox docs.