Cost Guardrail Breach During DR/Burst (Decision: DENY)

Purpose: Provide a clear procedure for responding when the Cost Decision Service returns a DENY (or equivalent) decision for a DR or cloud burst action, including how to interpret the decision, choose between override vs degraded mode, and capture evidence.

Owner: Platform / SRE team (HybridOps.Studio)
Trigger: A DR or burst-related workflow (typically GitHub Actions) calls the Cost Decision Service and receives a DENY decision.
Impact: The requested DR/burst action is blocked on cost grounds. Service availability or performance may remain degraded until an alternative path is chosen.
Severity: P2 – high impact, but by design this is a governed decision, not an uncontrolled failure.

This runbook aligns with:

Evidence for this runbook should be stored under:

output/artifacts/cost/cost-guardrail-<date>/
output/artifacts/dr/cost-guardrail-<date>/

1. Scenario overview

The platform uses a Cost Decision Service as part of the DR/burst control loop:

- A technical signal (for example, jenkins_critical_down, platform_unavailable) is detected by Prometheus/Alertmanager.
- Alertmanager triggers a GitHub Actions DR or burst workflow.
- The workflow calls the Cost Decision Service with context:
  - Environment and target (on-prem, cloud-dr, extra burst capacity).
  - Estimated cost for the action (for example, per hour/day).
  - Budget guardrail configuration for the current period.
- The Cost Decision Service returns ALLOW, DENY, or SIMULATE_ONLY.
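
As a rough sketch of the final step in this loop, a workflow script might branch on the returned decision as shown here. The function name `handle_cost_decision`, the messages, and the exit codes are assumptions for illustration; the source only defines the three decision values.

```shell
# Hypothetical helper: map a Cost Decision Service decision to a workflow outcome.
# Exit codes 10 and 20 are illustrative assumptions, not part of any service contract.
handle_cost_decision() {
  local decision="$1"
  case "$decision" in
    ALLOW)
      echo "proceed: DR/burst action permitted"
      return 0
      ;;
    SIMULATE_ONLY)
      echo "simulate: run dry-run steps only"
      return 0
      ;;
    DENY)
      echo "blocked: follow the cost guardrail breach runbook"
      return 10
      ;;
    *)
      # Anything else is treated as a technical problem, not a governed decision.
      echo "error: unrecognised decision '${decision}'" >&2
      return 20
      ;;
  esac
}
```

Keeping DENY on its own exit code lets the workflow distinguish a governed block from a malformed or missing response.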

This runbook covers the path where the decision is:

DENY – the requested DR/burst action is not allowed within current cost guardrails.

The objective is to:

- Understand why the action was denied.
- Decide whether to:
  - Respect the denial and operate in a degraded but safe mode, or
  - Seek explicit override approval and re-run with override.
- Capture all decisions as evidence for governance and FinOps.

2. Preconditions and safety checks

Before taking action:

Confirm this is a Cost Decision Service DENY, not a technical failure
- Check the GitHub Actions workflow logs for the Cost Decision Service step.
- Confirm the service responded successfully with decision: "DENY" (or equivalent), not an HTTP/network error.

Confirm the nature of the underlying technical incident
- Is this a DR event (for example, on-prem cluster impaired) or a burst request (for example, scale out for load)?
- Check relevant platform runbooks:
  - DR cutover and failback runbooks.
  - Jenkins controller outage runbook.
  - db-01 failover runbook (if a database issue is involved).

Locate and create evidence folders
- Establish event-specific folders:

mkdir -p output/artifacts/cost/cost-guardrail-<date>/
mkdir -p output/artifacts/dr/cost-guardrail-<date>/

- Replace <date> with a timestamp (for example, 2025-12-02T210000Z).

Check if this is a drill or a real incident
- Inspect the workflow inputs and Cost Decision Service payload:
  - mode: "dr-drill" or "production" (or similar).
- This affects:
  - Communication style.
  - Whether override is even considered.

Confirm current business priority
- For a portfolio/demo environment:
  - The balance between availability and cost tolerance may differ from that of a paid environment.
- For a hypothetical real environment:
  - Clarify whether contractual SLOs or critical obligations would justify override.

Record these initial observations in the incident ticket and in the text file output/artifacts/cost/cost-guardrail-<date>/context.txt.
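
The folder creation and the context file can be combined into one small helper. This is a minimal sketch: the function name `init_evidence`, its argument order, and the keys written to context.txt are assumptions, while the folder layout follows the paths used in this runbook.

```shell
# Hypothetical helper: create both evidence folders for one event and seed context.txt.
# Arguments: <timestamp> [base-dir] [mode] [incident-type]; the defaults are assumptions.
init_evidence() {
  local ts="$1"
  local base="${2:-output/artifacts}"
  local mode="${3:-unknown}"
  local incident="${4:-unknown}"
  local cost_dir="${base}/cost/cost-guardrail-${ts}"
  local dr_dir="${base}/dr/cost-guardrail-${ts}"

  mkdir -p "$cost_dir" "$dr_dir"

  # Plain "key: value" lines keep the file easy to read and to grep later.
  {
    echo "event_timestamp: ${ts}"
    echo "mode: ${mode}"
    echo "incident_type: ${incident}"
    echo "decision_source: Cost Decision Service"
  } > "${cost_dir}/context.txt"

  # Print the cost evidence folder so callers can reuse it.
  echo "$cost_dir"
}
```

A call such as `init_evidence "$(date -u +%Y-%m-%dT%H%M%SZ)" output/artifacts dr-drill dr` produces a timestamp in the same shape as the example above.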

3. Phase 1 – Inspect the Cost Decision Service response

Goal: Understand why the decision is DENY.

Extract the Cost Decision Service payload
- From the GitHub Actions logs, copy the JSON response into a file:

# Example: captured from workflow logs or artifact
cat > output/artifacts/cost/cost-guardrail-<date>/cost-decision.json <<'EOF'
{...}
EOF

Identify key fields
- Look for at least:
  - decision – must be DENY for this runbook.
  - reason or rationale – textual reason (for example, "monthly budget exceeded").
  - estimated_cost – projected cost of the requested action.
  - budget_remaining – remaining budget for the relevant period.
  - policy_id – which policy triggered the denial.

Summarise the decision
- In a short file (summary.txt) under the same folder, summarise:
  - Why the decision is DENY.
  - Which policy and thresholds were involved.
  - Whether this is a per-environment policy (for example, env: lab vs env: production).

This summary becomes part of the governance evidence.
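
The field extraction and summary could be scripted roughly as follows. This sketch assumes the response is a small, flat JSON object with the field names listed above. The helper names `json_field` and `write_decision_summary` are hypothetical, and the sed patterns are a deliberate simplification that avoids extra dependencies; a real JSON parser is more robust if one is available.

```shell
# Extract a top-level value from a small, flat JSON file (simplified, not a full parser).
json_field() {
  local file="$1" key="$2" value
  # Try a quoted string value first, then fall back to a bare number or literal.
  value="$(sed -n "s/.*\"${key}\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$file" | head -n 1)"
  if [ -z "$value" ]; then
    value="$(sed -n "s/.*\"${key}\"[[:space:]]*:[[:space:]]*\([-0-9.a-zA-Z]*\).*/\1/p" "$file" | head -n 1)"
  fi
  printf '%s\n' "$value"
}

# Write summary.txt next to the saved response; refuse anything that is not a DENY.
write_decision_summary() {
  local json="$1" out decision reason
  out="$(dirname "$json")/summary.txt"
  decision="$(json_field "$json" decision)"
  if [ "$decision" != "DENY" ]; then
    echo "expected decision DENY, got '${decision:-none}'" >&2
    return 1
  fi
  # The runbook allows either "reason" or "rationale" as the field name.
  reason="$(json_field "$json" reason)"
  [ -n "$reason" ] || reason="$(json_field "$json" rationale)"
  {
    echo "decision: ${decision}"
    echo "reason: ${reason}"
    echo "estimated_cost: $(json_field "$json" estimated_cost)"
    echo "budget_remaining: $(json_field "$json" budget_remaining)"
    echo "policy_id: $(json_field "$json" policy_id)"
  } > "$out"
  echo "$out"
}
```

The generated file covers only the raw fields; the thresholds involved and whether the policy is per-environment still need to be added by hand.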

4. Phase 2 – Decide between degraded mode and override

Goal: Make a deliberate, documented decision on whether to respect the DENY or seek override.

Assess technical impact
- For DR:
  - How impaired is the on-prem environment?
  - Are there alternate paths (for example, partial service, read-only mode, minimal capacity)?
- For burst:
  - Is the current capacity saturated?
  - Would declining to burst cause clear user impact?

Assess financial and governance impact
- Compare:
  - Estimated DR/burst cost vs budget remaining.
  - Nature of current environment (portfolio, lab, production).
- Consider whether approving override:
  - Is justified by critical user or business impact.
  - Would create unacceptable financial risk.

Default stance
- For drills and lab/portfolio environments:
  - Default is to respect DENY and treat it as a successful demonstration of guardrails.
- For hypothetical production:
  - Default is to respect DENY unless a clearly identified business owner explicitly approves the override and that approval is documented.

Decision options
- Option A – Respect DENY and operate in degraded mode.
- Option B – Request and document override, then re-run.
- Option C – Postpone DR/burst and schedule a later window (for example, after budget reset or policy change).

Document the chosen option and rationale in:

- The incident ticket, and
- output/artifacts/cost/cost-guardrail-<date>/decision.txt
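
The cost comparison and the default stance can be expressed as a few small helpers. This is an illustrative sketch only: the function names, the `yes`/`no` approval flag, and the keys written to decision.txt are assumptions, and the suggested option is a starting point for the human decision rather than a replacement for it.

```shell
# Print how far the estimated cost exceeds the remaining budget (0.00 if it does not).
# awk is used because POSIX shell arithmetic is integer-only.
budget_shortfall() {
  local estimated="$1" remaining="$2"
  awk -v e="$estimated" -v r="$remaining" \
    'BEGIN { d = e - r; if (d < 0) d = 0; printf "%.2f\n", d }'
}

# Suggest the default option: drills and non-production environments respect DENY (A);
# production respects DENY unless a business owner has explicitly approved override (B).
default_option() {
  local mode="$1" env="$2" owner_approved="${3:-no}"
  if [ "$mode" = "dr-drill" ] || [ "$env" != "production" ]; then
    echo "A"
  elif [ "$owner_approved" = "yes" ]; then
    echo "B"
  else
    echo "A"
  fi
}

# Record the chosen option and rationale in decision.txt.
record_decision() {
  local file="$1" option="$2" rationale="$3"
  {
    echo "option: ${option}"
    echo "rationale: ${rationale}"
    echo "recorded_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  } > "$file"
}
```

Option C is never suggested automatically here, since postponement depends on budget timing that these helpers do not know about.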

5. Phase 3 – Path A: Respect DENY and operate in degraded mode

If you choose to respect the Cost Decision Service decision:

Select a degraded operating mode
- Examples:
  - Keep a minimal on-prem footprint running (essential services only).
  - Serve some workloads in read-only mode (for example, NetBox read-only).
  - Temporarily accept higher latency or lower throughput.

Apply technical safeguards
- Ensure automation is not continuously retrying DR/burst attempts:
  - Disable or pause the triggering workflow temporarily.
  - Put DR/burst jobs in a safe state (for example, disabled in Jenkins or GitHub Actions).

Communicate status
- If this is a multi-user environment, clearly communicate:
  - That DR/burst was intentionally blocked on cost grounds.
  - What degraded behaviour users should expect.

Record operational state
- Briefly describe the degraded mode in:
  - output/artifacts/dr/cost-guardrail-<date>/degraded-mode.txt

Plan follow-up
- Decide whether you will:
  - Adjust budgets or policies for the future.
  - Improve capacity planning to reduce the need for emergency burst.
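
For the safeguard step, pausing the triggering workflow might look like the sketch below, using the GitHub CLI's `gh workflow disable` and `gh workflow enable` commands. The workflow file name `dr-burst.yml` and the `DRY_RUN` switch are assumptions for illustration; the real workflow name depends on the repository.

```shell
# Hypothetical helper: pause the triggering workflow so it cannot keep retrying.
# With DRY_RUN=1 (the default here) the command is only printed, not executed.
pause_workflow() {
  local workflow="${1:-dr-burst.yml}"   # assumed workflow file name
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: gh workflow disable ${workflow}"
  else
    gh workflow disable "$workflow"
  fi
}

# Counterpart for close-out, once the cost situation is resolved.
resume_workflow() {
  local workflow="${1:-dr-burst.yml}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: gh workflow enable ${workflow}"
  else
    gh workflow enable "$workflow"
  fi
}
```

Defaulting to a dry run is a design choice: during an incident it is safer to see the command first and then re-run with `DRY_RUN=0`.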

6. Phase 4 – Path B: Request override and re-run

If you choose to request override (primarily a conceptual path in this portfolio):

Document override request
- Capture:
  - Who is requesting override.
  - Who would approve it (for example, product or business owner).
  - Why the override is justified despite the cost risk.
- Save this as:
  - output/artifacts/cost/cost-guardrail-<date>/override-request.txt

Record hypothetical approval
- For a real environment:
  - Capture actual approval (for example, written confirmation).
- For a portfolio demo:
  - Explain that override is being simulated for demonstration and that no real funds are at risk.

Re-run workflow with override flag
- Re-trigger the DR/burst workflow with a flag (for example, override: true or a dedicated event type).
- Ensure the Cost Decision Service:
  - Records that this is an override.
  - Returns ALLOW with an override field set.

Monitor DR/burst actions
- If this is a non-destructive lab scenario, allow the workflow to:
  - Provision DR or burst resources.
  - Validate workloads.
  - Tear down resources after validation.

Capture override evidence
- Save updated Cost Decision Service responses to:
  - output/artifacts/cost/cost-guardrail-<date>/cost-decision-override.json
- Capture any additional DR/burst artefacts under:
  - output/artifacts/dr/cost-guardrail-<date>/
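
The re-run and the check on the override response could be sketched as follows. Several details are assumptions: the workflow file name `dr-burst.yml`, a `workflow_dispatch` input named `override`, and a boolean `override` field in the response. The runbook itself only says that ALLOW should be returned "with an override field set".

```shell
# Hypothetical helper: re-trigger the workflow with an override input via the GitHub CLI.
# Assumes the workflow declares a workflow_dispatch input called "override".
rerun_with_override() {
  local workflow="${1:-dr-burst.yml}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "DRY RUN: gh workflow run ${workflow} -f override=true"
  else
    gh workflow run "$workflow" -f override=true
  fi
}

# Check a saved override response: decision must be ALLOW and override must be true.
verify_override_response() {
  local file="$1"
  if ! grep -Eq '"decision"[[:space:]]*:[[:space:]]*"ALLOW"' "$file"; then
    echo "decision is not ALLOW" >&2
    return 1
  fi
  if ! grep -Eq '"override"[[:space:]]*:[[:space:]]*true' "$file"; then
    echo "override field not set" >&2
    return 1
  fi
  echo "override recorded"
}
```

Running `verify_override_response` against cost-decision-override.json before saving it as evidence helps catch a plain ALLOW that was not actually recorded as an override.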

7. Phase 5 – Path C: Postpone and reschedule

If you decide to neither override nor stay in prolonged degraded mode:

Postpone action
- Treat this as a decision to defer DR/burst until:
  - Budget is refreshed, or
  - Policies are re-tuned.

Document the deferral
- In decision.txt and the incident ticket, state:
  - That action was deferred due to cost.
  - Any temporary mitigations applied.

Update cost or DR plans
- Consider changes to:
  - Budgeting for DR/burst.
  - Environment sizing to avoid repeated near-misses.

8. Phase 6 – Evidence and close-out

Before closing the runbook:

- Ensure the following folders contain artefacts for this event:
  - output/artifacts/cost/cost-guardrail-<date>/
  - output/artifacts/dr/cost-guardrail-<date>/
- Check that you have:
  - Original Cost Decision Service response (DENY).
  - Summary of the decision and rationale.
  - Chosen path (A/B/C) and justification.
  - Any override notes (if applicable).
  - Description of degraded mode or follow-up actions.
- Update Evidence 4 (if this was a deliberate drill) to mention the event as a proof point that cost guardrails are not just theoretical.
- Close the incident or drill with:
  - Clear final state.
  - Lessons learned and any backlog items.
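
A quick completeness check over the evidence folders might look like this sketch. The function name `check_evidence` and the mapping of files to paths A and B are assumptions derived from the file names used earlier in this runbook.

```shell
# Hypothetical helper: report missing or empty evidence files for an event.
# Arguments: <cost-dir> <dr-dir> [path: A, B, or C]. Returns 1 if anything is missing.
check_evidence() {
  local cost_dir="$1" dr_dir="$2" path="${3:-A}"
  local missing=0 f

  # Files expected for every path.
  for f in context.txt cost-decision.json summary.txt decision.txt; do
    [ -s "${cost_dir}/${f}" ] || { echo "missing: ${cost_dir}/${f}"; missing=1; }
  done

  # Path-specific files.
  case "$path" in
    A)
      [ -s "${dr_dir}/degraded-mode.txt" ] || { echo "missing: ${dr_dir}/degraded-mode.txt"; missing=1; }
      ;;
    B)
      for f in override-request.txt cost-decision-override.json; do
        [ -s "${cost_dir}/${f}" ] || { echo "missing: ${cost_dir}/${f}"; missing=1; }
      done
      ;;
  esac

  return "$missing"
}
```

An empty file counts as missing here (`-s`), on the assumption that a zero-byte placeholder is not useful evidence.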

9. Validation checklist

[ ] Confirmed that the Cost Decision Service returned a genuine DENY (not a technical error).
[ ] Underlying technical incident (DR or burst need) was understood and recorded.
[ ] Cost decision payload and rationale stored under output/artifacts/cost/.
[ ] A deliberate choice was made between degraded mode, override, or postponement.
[ ] If degraded mode was chosen, technical safeguards and communication steps were applied.
[ ] If override was chosen, approvals and re-run behaviour were recorded.
[ ] DR/burst workflows were left in a safe state, with no uncontrolled retries.
[ ] Evidence was captured and, if appropriate, referenced from Evidence 4.

References

Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation