
Cost Guardrail Breach During DR/Burst (Decision: DENY)

Purpose: Provide a clear procedure for responding when the Cost Decision Service returns a DENY (or equivalent) decision for a DR or cloud burst action. It covers how to interpret the decision, how to choose between override, degraded mode, and postponement, and how to capture evidence.

Owner: Platform / SRE team (HybridOps.Studio)
Trigger: A DR or burst-related workflow (typically GitHub Actions) calls the Cost Decision Service and receives a DENY decision.
Impact: The requested DR/burst action is blocked on cost grounds. Service availability or performance may remain degraded until an alternative path is chosen.
Severity: P2 – high impact, but by design this is a governed decision, not an uncontrolled failure.

This runbook aligns with:

Evidence for this runbook should be stored under:

  • output/artifacts/cost/cost-guardrail-<date>/
  • output/artifacts/dr/cost-guardrail-<date>/


1. Scenario overview

The platform uses a Cost Decision Service as part of the DR/burst control loop:

  1. A technical signal (for example, jenkins_critical_down, platform_unavailable) is detected by Prometheus/Alertmanager.
  2. Alertmanager triggers a GitHub Actions DR or burst workflow.
  3. The workflow calls the Cost Decision Service with context:
    • Environment and target (on-prem, cloud-dr, extra burst capacity).
    • Estimated cost for the action (for example, per hour/day).
    • Budget guardrail configuration for the current period.
  4. The Cost Decision Service returns ALLOW, DENY, or SIMULATE_ONLY.
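
The final step of the control loop can be sketched as a small routing function inside the workflow. This is a minimal illustration, not the platform's actual workflow code: the function name and the step labels it prints are hypothetical, and treating SIMULATE_ONLY as a dry run is an assumption based on its name.

```shell
# Hypothetical helper: map a Cost Decision Service decision to the next step.
# The labels printed here are illustrative only.
route_decision() {
  case "$1" in
    ALLOW)         echo "proceed" ;;
    DENY)          echo "follow-deny-runbook" ;;
    SIMULATE_ONLY) echo "dry-run-only" ;;        # assumed meaning
    *)             echo "treat-as-technical-failure" ;;
  esac
}
```

Routing any unrecognised value to a separate branch keeps a malformed or empty response from being mistaken for a governed DENY.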

This runbook covers the path where the decision is:

  • DENY – the requested DR/burst action is not allowed within current cost guardrails.

The objective is to:

  • Understand why it was denied.
  • Decide whether to:
    • Respect the denial and operate in a degraded but safe mode, or
    • Seek explicit override approval and re-run with override.
  • Capture all decisions as evidence for governance and FinOps.

2. Preconditions and safety checks

Before taking action:

  1. Confirm this is a Cost Decision Service DENY, not a technical failure

    • Check the GitHub Actions workflow logs for the Cost Decision Service step.
    • Confirm the service responded successfully with a decision: "DENY" (or equivalent), not an HTTP/network error.

  2. Confirm the nature of the underlying technical incident

    • Is this a DR event (for example, on-prem cluster impaired) or a burst request (for example, scale out for load)?
    • Check relevant platform runbooks:
      • DR cutover and failback runbooks.
      • Jenkins controller outage runbook.
      • db-01 failover runbook (if a database issue is involved).

  3. Locate and create evidence folders

    • Establish an event-specific folder:

        mkdir -p output/artifacts/cost/cost-guardrail-<date>/
        mkdir -p output/artifacts/dr/cost-guardrail-<date>/

    • Replace <date> with a timestamp (for example, 2025-12-02T210000Z).

  4. Check if this is a drill or a real incident

    • Inspect the workflow inputs and Cost Decision Service payload:
      • mode: "dr-drill" or "production" (or similar).
    • This affects:
      • Communication style.
      • Whether override is even considered.

  5. Confirm current business priority

    • For a portfolio/demo environment:
      • Availability vs cost tolerance may be different than for a paid environment.
    • For a hypothetical real environment:
      • Clarify whether contractual SLOs or critical obligations would justify override.

Record these initial observations in the incident ticket and in a text file under output/artifacts/cost/cost-guardrail-<date>/context.txt.
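
The folder creation and the initial context record can be combined in one helper. The sketch below assumes a hypothetical function name and a hypothetical EVIDENCE_ROOT variable (so it can be pointed at a scratch directory); the keys written to context.txt are illustrative and not a required format.

```shell
# Hypothetical helper: create both evidence folders for one event and seed
# context.txt. EVIDENCE_ROOT is an assumed override for the default root.
# usage: init_evidence [timestamp] [mode] [incident-type]
init_evidence() {
  local stamp root dir
  stamp="${1:-$(date -u +%Y-%m-%dT%H%M%SZ)}"   # e.g. 2025-12-02T210000Z
  root="${EVIDENCE_ROOT:-output/artifacts}"
  dir="$root/cost/cost-guardrail-$stamp"
  mkdir -p "$dir" "$root/dr/cost-guardrail-$stamp"
  {
    echo "event: cost-guardrail-$stamp"
    echo "mode: ${2:-unknown}"            # for example dr-drill or production
    echo "incident_type: ${3:-unknown}"   # for example dr or burst
  } > "$dir/context.txt"
  echo "$dir"   # print the cost evidence folder for later steps
}
```

For example, `init_evidence "" dr-drill dr` uses the current UTC time as the timestamp and prints the path of the cost evidence folder.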


3. Phase 1 – Inspect the Cost Decision Service response

Goal: Understand why the decision is DENY.

  1. Extract the Cost Decision Service payload

    • From the GitHub Actions logs, copy the JSON response into a file:

        # Example: captured from workflow logs or artifact
        cat > output/artifacts/cost/cost-guardrail-<date>/cost-decision.json <<'EOF'
        {...}
        EOF

  2. Identify key fields

    Look for at least:

    • decision – must be DENY for this runbook.
    • reason or rationale – textual reason (for example, "monthly budget exceeded").
    • estimated_cost – projected cost of the requested action.
    • budget_remaining – remaining budget for the relevant period.
    • policy_id – which policy triggered the denial.

  3. Summarise the decision

    • In a short file (summary.txt) under the same folder, summarise:
      • Why the decision is DENY.
      • Which policy and thresholds were involved.
      • Whether this is a per-environment policy (for example, env: lab vs env: production).

This summary becomes part of the governance evidence.
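
The inspection steps above can be partly scripted. The sketch below assumes jq is installed and that the response is a flat JSON object using the field names listed in this section; both function names are hypothetical. It only seeds summary.txt with the raw fields, and the written explanation still has to be added by hand.

```shell
# Hypothetical helper: print the decision, or TECHNICAL_ERROR when the file is
# not valid JSON or has no string "decision" field (not a governed DENY).
decision_of() {
  jq -er '.decision | strings' "$1" 2>/dev/null || echo "TECHNICAL_ERROR"
}

# Hypothetical helper: print the key fields as plain "name: value" lines.
summarise_decision() {
  jq -r '"decision: \(.decision)",
         "reason: \(.reason // .rationale // "n/a")",
         "estimated_cost: \(.estimated_cost // "n/a")",
         "budget_remaining: \(.budget_remaining // "n/a")",
         "policy_id: \(.policy_id // "n/a")"' "$1"
}
```

A typical use is `summarise_decision cost-decision.json > summary.txt`, run only after `decision_of cost-decision.json` has printed DENY.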


4. Phase 2 – Decide between degraded mode and override

Goal: Make a deliberate, documented decision on whether to respect the DENY or seek override.

  1. Assess technical impact

    • For DR:
      • How impaired is the on-prem environment?
      • Are there alternate paths (for example, partial service, read-only mode, minimal capacity)?
    • For burst:
      • Is the current capacity saturated?
      • Would declining to burst cause a clear user impact?

  2. Assess financial and governance impact

    • Compare:
      • Estimated DR/burst cost vs budget remaining.
      • Nature of current environment (portfolio, lab, production).
    • Consider whether approving override:
      • Is justified by critical user or business impact.
      • Would create unacceptable financial risk.

  3. Default stance

    • For drills and lab/portfolio environments, the default is to respect DENY and treat it as a successful demonstration of guardrails.
    • For hypothetical production, the default is to respect DENY unless a named business owner explicitly approves the override and that approval is documented.

  4. Decision options

    • Option A – Respect DENY and operate in degraded mode.
    • Option B – Request and document override, then re-run.
    • Option C – Postpone DR/burst and schedule a later window (for example, after budget reset or policy change).

Document the chosen option and rationale in:

  • Incident ticket, and
  • output/artifacts/cost/cost-guardrail-<date>/decision.txt
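
As a rough aid for the financial comparison and for recording the outcome, the sketch below computes how far the estimate exceeds the remaining budget and writes decision.txt. Both function names and the layout of decision.txt are assumptions for illustration; the two numbers are the estimated_cost and budget_remaining values from the Cost Decision Service response.

```shell
# Hypothetical helper: amount by which the estimate exceeds the remaining
# budget, to two decimals (0.00 when the action would fit).
# usage: budget_shortfall <estimated_cost> <budget_remaining>
budget_shortfall() {
  awk -v cost="$1" -v left="$2" \
    'BEGIN { d = cost - left; if (d < 0) d = 0; printf "%.2f\n", d }'
}

# Hypothetical helper: record the chosen option and rationale as evidence.
# usage: record_decision <cost-folder> <A|B|C> <rationale>
record_decision() {
  printf 'option: %s\nrationale: %s\nrecorded_at: %s\n' \
    "$2" "$3" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$1/decision.txt"
}
```

A shortfall figure does not make the decision; it gives the approver a concrete number to weigh against the technical impact.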

5. Phase 3 – Path A: Respect DENY and operate in degraded mode

If you choose to respect the Cost Decision Service decision:

  1. Select a degraded operating mode

    Examples:

    • Keep a minimal on-prem footprint running (essential services only).
    • Serve some workloads in read-only mode (for example, NetBox read-only).
    • Temporarily accept higher latency or lower throughput.

  2. Apply technical safeguards

    • Ensure automation is not continuously retrying DR/burst attempts:
      • Disable or pause the triggering workflow temporarily.
      • Put DR/burst jobs in a safe status (for example, disabled in Jenkins or GitHub Actions).

  3. Communicate status

    • If this is a multi-user environment, clearly communicate:
      • That DR/burst was intentionally blocked on cost grounds.
      • What degraded behaviour users should expect.

  4. Record operational state

    • Briefly describe the degraded mode in:
      • output/artifacts/dr/cost-guardrail-<date>/degraded-mode.txt

  5. Plan follow-up

    • Decide whether you will:
      • Adjust budgets or policies for the future.
      • Improve capacity planning to reduce need for emergency burst.
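
For the safeguard and record-keeping steps, one possible approach is sketched below. It assumes the triggering workflow lives in GitHub Actions and that the GitHub CLI (gh) is installed and authenticated; the function names, the DRY_RUN variable, and the contents of degraded-mode.txt are hypothetical. By default the command is only printed so it can be reviewed before anything is disabled.

```shell
# Hypothetical helper: pause a GitHub Actions workflow so it stops retrying.
# DRY_RUN defaults to 1, which prints the command instead of running it.
# usage: pause_workflow <workflow-file-or-name>
pause_workflow() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "gh workflow disable $1"
  else
    gh workflow disable "$1"
  fi
}

# Hypothetical helper: record the degraded operating mode as DR evidence.
# usage: record_degraded_mode <dr-folder> <description> [paused-workflow]
record_degraded_mode() {
  printf 'degraded_mode: %s\nworkflow_paused: %s\n' \
    "$2" "${3:-none}" > "$1/degraded-mode.txt"
}
```

Remember to re-enable the workflow (`gh workflow enable`) once the event is closed, otherwise genuine future triggers will be silently ignored.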

6. Phase 4 – Path B: Request override and re-run

If you choose to request override (primarily a conceptual path in this portfolio):

  1. Document override request

    • Capture:
      • Who is requesting override.
      • Who would approve it (for example, product or business owner).
      • Why override is justified vs cost risk.
    • Save this as:
      • output/artifacts/cost/cost-guardrail-<date>/override-request.txt

  2. Record hypothetical approval

    • For a real environment:
      • Capture actual approval (for example, written confirmation).
    • For a portfolio demo:
      • Explain that override is being simulated for demonstration and that no real funds are at risk.

  3. Re-run workflow with override flag

    • Re-trigger the DR/burst workflow with a flag (for example, override: true or a dedicated event type).
    • Ensure the Cost Decision Service:
      • Records that this is an override.
      • Returns ALLOW with an override field set.

  4. Monitor DR/burst actions

    • If this is a non-destructive lab scenario, allow the workflow to:
      • Provision DR or burst resources.
      • Validate workloads.
      • Tear down resources after validation.

  5. Capture override evidence

    • Save updated Cost Decision Service responses to:
      • output/artifacts/cost/cost-guardrail-<date>/cost-decision-override.json
    • Capture any additional DR/burst artefacts under:
      • output/artifacts/dr/cost-guardrail-<date>/
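
The re-run and the check on the new response could look like the sketch below. Several things here are assumptions rather than documented behaviour: that the workflow accepts a manual-dispatch input named override, that the response marks an override with an override field set to true (boolean or string), and that jq and the GitHub CLI are available. The function names are hypothetical.

```shell
# Hypothetical helper: print the re-run command for review before executing it.
# Assumes the workflow defines a manual-dispatch input called "override".
# usage: override_rerun_cmd <workflow-file-or-name>
override_rerun_cmd() {
  echo "gh workflow run $1 -f override=true"
}

# Hypothetical helper: succeed only if the saved response is an ALLOW that is
# explicitly marked as an override (field name and type are assumed).
# usage: is_valid_override <cost-decision-override.json>
is_valid_override() {
  jq -e '.decision == "ALLOW" and (.override == true or .override == "true")' \
    "$1" >/dev/null 2>&1
}
```

An ALLOW without the override marker should be treated as suspicious, because it suggests the guardrail was bypassed without being recorded.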

7. Phase 5 – Path C: Postpone and reschedule

If you decide to neither override nor stay in prolonged degraded mode:

  1. Postpone action

    • Treat this as a decision to defer DR/burst until:
      • Budget is refreshed, or
      • Policies are re-tuned.

  2. Document the deferral

    • In decision.txt and the incident ticket, state:
      • That action was deferred due to cost.
      • Any temporary mitigations applied.

  3. Update cost or DR plans

    • Consider changes to:
      • Budgeting for DR/burst.
      • Environment sizing to avoid repeated near-misses.

8. Phase 6 – Evidence and close-out

Before closing the runbook:

  1. Ensure the following folders contain artefacts for this event:

    • output/artifacts/cost/cost-guardrail-<date>/
    • output/artifacts/dr/cost-guardrail-<date>/

  2. Check that you have:

    • Original Cost Decision Service response (DENY).
    • Summary of the decision and rationale.
    • Chosen path (A/B/C) and justification.
    • Any override notes (if applicable).
    • Description of degraded mode or follow-up actions.

  3. Update Evidence 4 (if this was a deliberate drill) to:

    • Mention the event as a proof point that cost guardrails are not just theoretical.

  4. Close the incident or drill with:

    • Clear final state.
    • Lessons learned and any backlog items.
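
Before closing, the evidence folders can be checked mechanically. The sketch below uses a hypothetical function name and covers only the files this runbook names for every path (context.txt, cost-decision.json, summary.txt, decision.txt); path-specific files such as degraded-mode.txt or override-request.txt still need a manual check.

```shell
# Hypothetical helper: report missing or empty core evidence files.
# Returns 0 when everything is present, 1 otherwise.
# usage: check_evidence <cost-folder> <dr-folder>
check_evidence() {
  local missing f
  missing=0
  for f in context.txt cost-decision.json summary.txt decision.txt; do
    if [ ! -s "$1/$f" ]; then
      echo "missing or empty: $1/$f"
      missing=1
    fi
  done
  if [ ! -d "$2" ]; then
    echo "missing folder: $2"
    missing=1
  fi
  return "$missing"
}
```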

9. Validation checklist

  • [ ] Confirmed that the Cost Decision Service returned a genuine DENY (not a technical error).
  • [ ] Underlying technical incident (DR or burst need) was understood and recorded.
  • [ ] Cost decision payload and rationale stored under output/artifacts/cost/.
  • [ ] A deliberate choice was made between degraded mode, override, or postponement.
  • [ ] If degraded mode was chosen, technical safeguards and communication steps were applied.
  • [ ] If override was chosen, approvals and re-run behaviour were recorded.
  • [ ] DR/burst workflows were left in a safe state, with no uncontrolled retries.
  • [ ] Evidence was captured and, if appropriate, referenced from Evidence 4.

References


Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation