Environment Guard Framework (EGF) for Ansible Governance

Status

Accepted — EGF is the standard way to gate Ansible automation with environment-aware risk checks, governed targeting, and evidence-first connectivity validation.


1. Context

HybridOps.Studio uses Ansible as a core automation layer across platform, non-production, and production-like environments. As the estate grows, ungoverned playbook runs present several risks:

  • Accidental execution against the wrong environment (for example running a “prod” playbook against dev or vice versa).
  • Lack of clear pre-change checks (reachability, risk assessment, approvals).
  • Weak traceability for reviewers (no consistent correlation ID, scattered logs, and ad hoc artefacts).
  • Difficulty demonstrating behaviour to stakeholders (reviewers, auditors, hiring managers) who are not inside the repository structure every day.

At the same time:

  • The platform is migrating toward NetBox-backed IPAM and dynamic inventories, but still has legacy patterns (static inventories, runtime IP maps).
  • The same governance story must hold for a single-node lab, a “prod-like” platform, and future multi-cloud extensions.
  • Evidence production (JSON, JSONL, markdown reports, log files) must be deterministic and live under the canonical output/ tree.

A structured framework is required so that:

  • Every pipeline can opt into the same set of guardrails.
  • Evidence paths and correlation IDs are predictable.
  • Legacy inventory/IP patterns can be used in a controlled way while NetBox becomes the primary source of truth.

2. Decision

Adopt the Environment Guard Framework (EGF) as the standard governance chain for Ansible workflows, implemented as a set of roles in the hybridops.common collection.

```mermaid
flowchart LR
  A[env_guard<br/>Governance & risk] --> B[gen_inventory<br/>Bridge inventory]
  B --> C[host_selector<br/>Target selection]
  C --> D[ip_mapper<br/>Bridge IP mapping]
  D --> E[connectivity_test<br/>Pre-deploy checks]
  E --> F[deployment<br/>Change execution]
```

Core decisions:

  • Governance entry point
    env_guard is the mandatory first step for governed flows. It validates the target environment, performs risk scoring, enforces maintenance windows (especially for prod), and emits structured audit logs and markdown reports with correlation IDs.

  • Environment resolution (interactive vs CI)
    In non-interactive contexts (for example CI pipelines), env_guard runs in a non-interactive mode and derives the target environment from either:

      • an explicit env variable, or
      • the HOS_ENV environment variable (for example set by Jenkins parameters),

    falling back to dev when neither is provided. Interactive prompts are only used when env is not set.
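This precedence can be sketched as a defaults fragment (the variable name env_guard_target_env is illustrative, not necessarily the role's actual default):

```yaml
# Illustrative sketch only: resolve the target environment with the precedence
#   explicit `env` extra-var  →  HOS_ENV environment variable  →  "dev"
env_guard_target_env: >-
  {{ env | default(lookup('env', 'HOS_ENV'), true) | default('dev', true) }}
```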

  • Target and address handling

      • gen_inventory is retained as a bridge role to generate inventories from structured environment data where NetBox is not yet present.
      • host_selector is the default targeting control, with four methods (A–D) and clear validation rules.
      • ip_mapper is a bridge/fallback role that resolves placeholders (XX.XX.XX.00) to real IPs in environments that have not fully moved to NetBox dynamic inventory.
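As a sketch of the bridge pattern, the structured data consumed by ip_mapper might look like the following (the ip_map variable name and its fields are illustrative, not the role's actual interface):

```yaml
# Illustrative sketch only: placeholder-to-real-IP mapping resolved at runtime
# in environments without NetBox dynamic inventory.
ip_map:
  web01:
    placeholder: "XX.XX.XX.00"
    address: "10.10.20.11"
  db01:
    placeholder: "XX.XX.XX.00"
    address: "10.10.20.21"
```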

  • Connectivity gate
    connectivity_test is the standard pre-deploy reachability check, producing both JSON (dict keyed by hostname) and JSONL (one event per line) under output/logs/ansible/connectivity_logs/<run_id>/.

  • Evidence and logging layout

      • Env Guard audit logs and reports live under output/logs/ansible/env_guard_logs/<run_id>/.
      • Connectivity artefacts live under output/logs/ansible/connectivity_logs/<run_id>/.
      • Tests and CI harnesses may copy selected artefacts under local tests/output/ folders but do not change the canonical output layout.

  • Correlation ID (CID)

      • A correlation ID is generated or inherited at the env_guard boundary and propagated through dependent roles as a logging-only identifier.
      • CIDs appear in log lines and report filenames but are not used as functional inputs to hosts.

EGF becomes the default pattern for new pipelines and the reference implementation when demonstrating governance and evidence.


3. Rationale

3.1 Why a dedicated framework?

Without a single, named framework, governance concerns tend to be re-implemented ad hoc in each playbook:

  • One playbook prompts for environment, another hardcodes it, and a third reads it from environment variables.
  • Some jobs run “pre-checks” via shell scripts, others via inline tasks, and most do not emit consistent artefacts.
  • There is no unified way to show “this is the gate that runs before changes”.

EGF provides:

  • A concrete story: “Every governed pipeline goes through Env Guard → (optional inventory/IP bridge) → host selection → connectivity gate → deployment.”
  • A reusable library of roles in hybridops.common that can be used across multiple repositories.
  • A predictable evidence model anchored in output/ with JSON, JSONL, and markdown artefacts.

3.2 Why these roles and this pipeline order?

  • env_guard first
    Governance must run before inventory generation, targeting, and connectivity checks; otherwise it is possible to “half-run” expensive operations before discovering that the environment or window is invalid.

  • gen_inventory and ip_mapper as bridges
    These roles exist to support structured/static inventories and runtime IP maps while NetBox adoption is ongoing. Making them first-class bridge roles clearly signals that the long-term direction is NetBox → dynamic inventory.

  • host_selector before connectivity_test
    Connectivity checks should run only against hosts that have been selected and approved. This keeps gates fast and predictable, and avoids sending unexpected traffic to unintended targets.

  • connectivity_test before deployment
    Pre-flight gates provide a defensible operational story and produce consistent JSON/JSONL artefacts for evidence packs and post-incident analysis.

3.3 Why log-only correlation IDs?

CIDs are useful for:

  • Tracing a single run across multiple roles and playbooks.
  • Filtering logs and artefacts during incident review, audits, or interviews.
  • Demonstrating a “single-threaded” chain of evidence for a scenario.

Using CIDs as functional inputs (for example as part of hostnames) would increase coupling and complexity. Keeping CIDs as log-only identifiers:

  • Provides traceability without entangling business logic.
  • Keeps the framework easy to explain in ADRs, HOWTOs, and customer-facing material.

4. Consequences

4.1 Positive consequences

  • Unified governance pipeline
    All governed Ansible flows can refer to the same chain (env_guard → … → connectivity_test → deployment), which simplifies documentation, diagrams, stakeholder briefings, and onboarding.

  • Consistent evidence output

      • env_guard audit logs and reports always land under output/logs/ansible/env_guard_logs/<run_id>/.
      • connectivity_test always writes JSON+JSONL to output/logs/ansible/connectivity_logs/<run_id>/.
      • CI harnesses can copy artefacts to tests/output/ without changing canonical paths.

  • Improved safety

      • env_guard enforces environment validation, risk scoring, and maintenance windows (with stricter rules for production).
      • connectivity_test ensures that playbooks only run when basic reachability is proven.
      • host_selector reduces targeting errors via validated methods and readable prompts.

  • Controlled bridge for legacy patterns

      • gen_inventory and ip_mapper allow legacy or structured inventory data to be used safely while NetBox adoption is ongoing.
      • The roles can be removed from the “main line” later without redesigning the rest of the pipeline.

  • Stronger platform narrative
    EGF provides a concise statement:

    “Every meaningful change goes through Env Guard (governance) → host selection → connectivity gate.
    Here are the artefacts and the logs for multiple runs.”

    This is easy to demonstrate via screenshots, JSON snippets, and linked logs for platform stakeholders.

4.2 Negative consequences / risks

  • More steps in simple flows
    One-off or ad hoc playbooks now need to either opt into EGF (and accept the overhead) or explicitly stay outside it. This can feel heavy for trivial tasks.

  • Learning curve for contributors
    Engineers must understand the framework roles, especially env_guard and connectivity_test, before creating new pipelines.

  • Additional maintenance surface
    The framework needs ongoing maintenance:

      • Keeping output paths in sync with platform conventions.
      • Adjusting argument specs and defaults as Ansible versions evolve.
      • Updating READMEs and ADRs when NetBox-first patterns replace bridge roles.

  • Risk of partial adoption
    If some pipelines use EGF and others do not, the governance story becomes fragmented. This risk is mitigated by documenting EGF as the default and using ADRs to justify any deviations.


5. Alternatives considered

  1. Per-playbook governance tasks
    Implement environment checks, maintenance windows, and connectivity tests in each playbook independently.
    Rejected because it scatters logic, makes evidence inconsistent, and is harder to explain or audit.

  2. Rely purely on CI/CD pipeline logic (no Ansible roles)
    Use external scripts or pipeline steps (shell, Python, or Terraform) to do validation before Ansible runs.
    Rejected because it splits governance across multiple technologies and hides the checks from engineers who run playbooks locally.

  3. NetBox-only approach from day one
    Require NetBox dynamic inventory everywhere and remove gen_inventory / ip_mapper entirely.
    Rejected for now because the platform is mid-migration and still needs to demonstrate both pre-NetBox and NetBox-first patterns.

  4. Single “mega-role” instead of composable roles
    One role responsible for governance, inventory, targeting, IP mapping, and connectivity checks.
    Rejected because it would be harder to test, reuse in other contexts, and evolve as NetBox adoption increases.


6. Implementation notes

6.1 Roles and locations

The decision is implemented primarily in the hybridops.common collection:

  • ansible_collections/hybridops/common/roles/env_guard/
  • ansible_collections/hybridops/common/roles/gen_inventory/
  • ansible_collections/hybridops/common/roles/host_selector/
  • ansible_collections/hybridops/common/roles/ip_mapper/
  • ansible_collections/hybridops/common/roles/connectivity_test/

Each role includes:

  • defaults/main.yml with canonical paths and environment defaults (for example resolving a project root from playbook_dir).
  • meta/main.yml and meta/argument_specs.yml where applicable.
  • tasks/ broken into small, composable files (for example _validate.yml, run_tests.yml, compile_results.yml, save_results.yml).
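For illustration, a trimmed meta/argument_specs.yml might look like this (the option names are examples, not the roles' actual specs):

```yaml
# Illustrative sketch only: validate role inputs before any task runs.
argument_specs:
  main:
    short_description: Pre-deploy connectivity gate
    options:
      connectivity_use_cid:
        type: bool
        default: true
      connectivity_inherited_cid:
        type: str
        required: false
```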

6.2 Output paths and evidence

EGF roles write primary artefacts under:

```text
output/
  logs/
    ansible/
      env_guard_logs/<run_id>/
      connectivity_logs/<run_id>/
```

CI harnesses and role tests may copy selected artefacts into local tests/output/ directories for convenience, but must not change the canonical paths.

connectivity_test publishes run metadata via set_stats, including:

  • connectivity_logs_root
  • connectivity_run_id
  • connectivity_run_dir
  • connectivity_json_path
  • connectivity_jsonl_path
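A publishing task along these lines could set those keys (the artefact file names results.json and events.jsonl are illustrative):

```yaml
# Illustrative sketch only: expose run metadata to later plays and the CI controller.
- name: Publish connectivity run metadata
  ansible.builtin.set_stats:
    data:
      connectivity_logs_root: "{{ connectivity_logs_root }}"
      connectivity_run_id: "{{ connectivity_run_id }}"
      connectivity_run_dir: "{{ connectivity_run_dir }}"
      connectivity_json_path: "{{ connectivity_run_dir }}/results.json"
      connectivity_jsonl_path: "{{ connectivity_run_dir }}/events.jsonl"
```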

6.3 Correlation ID propagation

  • env_guard is responsible for seeding env_guard_correlation_id (UUID-based with a fallback format such as envguard-<epoch>-<hex>).
  • Downstream roles use an inherited/logging-only CID:

    ```yaml
    connectivity_use_cid: true
    connectivity_inherited_cid: "{{ correlation_id | default(egf_correlation_id | default('', true), true) | lower }}"
    connectivity_env_cid: "{{ lookup('env', 'EGF_CORR_ID') | default('', true) | lower }}"
    connectivity_cid_pre: "{{ connectivity_inherited_cid or connectivity_env_cid }}"
    ```
  • Log messages include [cid=...] prefixes for key events (unmapped hosts, failures, summaries).

6.4 CI harness

The ansible-galaxy-hybridops workspace provides a Makefile that can:

  • Run role-level smoke tests (for example tests/smoke.yml under each role) against disposable inventories.
  • Build collection artefacts for Galaxy via targets such as release.dry-run and galaxy.build.all.
  • Clean up locally built tarballs and uninstall hybridops.* collections when iterating on releases.

The hybridops-platform repository is treated as the primary consumer:

  • Collections are installed from Galaxy via deployment/requirements.yml.
  • Ansible uses deployment/ansible.cfg (plus the default COLLECTIONS_PATHS) to locate hybridops.*.
  • The EGF connectivity gate is exercised from deployment/ using the standard CI playbooks (for example ci/playbooks/cicd_connectivity_gate.yml), which in turn produce artefacts under output/logs/ansible/env_guard_logs/ and output/logs/ansible/connectivity_logs/.
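A minimal deployment/requirements.yml for this setup could be as small as the following (the version constraint is illustrative):

```yaml
# Illustrative sketch only: install the EGF roles from Galaxy.
collections:
  - name: hybridops.common
    version: ">=1.0.0,<2.0.0"
```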

7. Operational impact and validation

7.1 Day-to-day flows

Engineers interact with EGF primarily via:

  • Governed deploy playbooks, which include:

    ```yaml
    - hosts: localhost
      roles:
        - hybridops.common.env_guard
        - hybridops.common.host_selector
        - hybridops.common.connectivity_test
        # followed by application or platform roles
    ```
  • CI workflows that:

      • run env_guard in non-interactive mode with HOS_ENV set and gated maintenance windows,
      • run connectivity_test and archive the JSON/JSONL artefacts,
      • optionally run the legacy bridge roles (gen_inventory, ip_mapper) when NetBox is not yet authoritative.

7.2 Runbooks and HOWTOs

  • Runbooks under docs/ops/runbooks/egf/ describe:

      • How to run the connectivity gate.
      • How to interpret Env Guard audit logs and markdown reports.
      • How to triage failed risk checks or connectivity failures.

  • HOWTO guides under docs/howto/ describe:

      • Adding a new governed pipeline to an existing playbook.
      • Integrating EGF roles in CI workflows.

7.3 Evidence and metrics

Evidence is collected via:

  • JSON and JSONL artefacts from connectivity_test.
  • Markdown reports and audit logs from env_guard.
  • CI job logs that include [cid=...] prefixes and artefact paths.

Potential metrics:

  • Number of failed versus successful Env Guard runs per environment.
  • Frequency of connectivity failures by environment or host group.
  • Percentage of pipelines that include the full EGF chain.

These metrics can be surfaced via dashboards or summarised for internal reviews and external conversations.


8. References