Cost Guardrail Breach During DR/Burst (Decision: DENY)
Purpose: Provide a clear procedure for responding when the Cost Decision Service returns a DENY (or equivalent) decision for a DR or cloud burst action, including how to interpret the decision, choose between override and degraded mode, and capture evidence.
Owner: Platform / SRE team (HybridOps.Studio)
Trigger: A DR or burst-related workflow (typically GitHub Actions) calls the Cost Decision Service and receives a DENY decision.
Impact: The requested DR/burst action is blocked on cost grounds. Service availability or performance may remain degraded until an alternative path is chosen.
Severity: P2 – high impact, but by design this is a governed decision, not an uncontrolled failure.
This runbook aligns with:
- ADR-0701 – Use GitHub Actions as Stateless DR Orchestrator
- ADR-0801 – Treat Cost as a First-Class Signal for DR and Cloud Bursting
- HOWTO – Run a Cost-Aware DR Drill (Prometheus → GitHub Actions → DR Workflow)
- Evidence 4 – Delivery Platform, GitOps and Cluster Operations
Evidence for this runbook should be stored under:
- output/artifacts/cost/
- output/artifacts/dr/
1. Scenario overview
The platform uses a Cost Decision Service as part of the DR/burst control loop:
- A technical signal (for example, jenkins_critical_down, platform_unavailable) is detected by Prometheus/Alertmanager.
- Alertmanager triggers a GitHub Actions DR or burst workflow.
- The workflow calls the Cost Decision Service with context:
  - Environment and target (on-prem, cloud-dr, extra burst capacity).
  - Estimated cost for the action (for example, per hour/day).
  - Budget guardrail configuration for the current period.
- The Cost Decision Service returns ALLOW, DENY, or SIMULATE_ONLY.
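The last step of the loop above can be sketched as a small shell helper that a workflow step might use to branch on the returned decision. This is a minimal sketch, not the platform's actual workflow code: the function name next_action and the action labels are hypothetical, and the meaning given to SIMULATE_ONLY (plan without provisioning) is an assumption. Only the three decision values come from this runbook.

```shell
# Map a Cost Decision Service decision to the workflow's next action.
# The action labels (proceed, dry-run, halt, error) are illustrative assumptions.
next_action() {
  case "$1" in
    ALLOW)         echo "proceed" ;;          # run the DR/burst action
    SIMULATE_ONLY) echo "dry-run" ;;          # assumed: plan only, provision nothing
    DENY)          echo "halt" ;;             # blocked on cost grounds: follow this runbook
    *)             echo "error"; return 1 ;;  # unknown value: treat as a technical failure
  esac
}
```

Treating any unrecognised value as an error keeps a malformed response from being mistaken for a governed DENY.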
This runbook covers the path where the decision is:
DENY – the requested DR/burst action is not allowed within current cost guardrails.
The objective is to:
- Understand why it was denied.
- Decide whether to:
  - Respect the denial and operate in a degraded but safe mode, or
  - Seek explicit override approval and re-run with override.
- Capture all decisions as evidence for governance and FinOps.
2. Preconditions and safety checks
Before taking action:
- Confirm this is a Cost Decision Service DENY, not a technical failure
  - Check the GitHub Actions workflow logs for the Cost Decision Service step.
  - Confirm the service responded successfully with a decision: "DENY" (or equivalent), not an HTTP/network error.
- Confirm the nature of the underlying technical incident
  - Is this a DR event (for example, on-prem cluster impaired) or a burst request (for example, scale out for load)?
  - Check relevant platform runbooks:
    - DR cutover and failback runbooks.
    - Jenkins controller outage runbook.
    - db-01 failover runbook (if a database issue is involved).
- Locate and create evidence folders
  - Establish an event-specific folder:
        mkdir -p output/artifacts/cost/cost-guardrail-<date>/
        mkdir -p output/artifacts/dr/cost-guardrail-<date>/
  - Replace <date> with a timestamp (for example, 2025-12-02T210000Z).
- Check if this is a drill or a real incident
  - Inspect the workflow inputs and Cost Decision Service payload:
    - mode: "dr-drill" or "production" (or similar).
  - This affects:
    - Communication style.
    - Whether override is even considered.
- Confirm current business priority
  - For a portfolio/demo environment:
    - The balance between availability and cost tolerance may differ from that of a paid environment.
  - For a hypothetical real environment:
    - Clarify whether contractual SLOs or critical obligations would justify override.
Record these initial observations in the incident ticket and in output/artifacts/cost/cost-guardrail-<date>/context.txt.
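Two of the checks above lend themselves to a small script. The following is a minimal sketch under stated assumptions: the helper names is_genuine_deny and make_evidence_dirs are hypothetical, HTTP 200 is assumed to be the service's success status, and the response body is assumed to be JSON with a top-level decision field.

```shell
# Succeed only for a genuine DENY: a successful HTTP response whose body
# carries decision "DENY". Anything else is treated as a technical failure
# or a different decision.
# Arguments: HTTP status code, path to the saved response body.
is_genuine_deny() {
  local status="$1" body="$2"
  [ "$status" = "200" ] || return 1   # assumption: 200 is the success status
  grep -Eq '"decision"[[:space:]]*:[[:space:]]*"DENY"' "$body" 2>/dev/null
}

# Create both evidence folders for one event and print the timestamp used.
# Optional first argument: base directory (defaults to the current directory).
make_evidence_dirs() {
  local base="${1:-.}" stamp
  stamp=$(date -u +%Y-%m-%dT%H%M%SZ)   # same shape as 2025-12-02T210000Z
  mkdir -p "$base/output/artifacts/cost/cost-guardrail-$stamp" \
           "$base/output/artifacts/dr/cost-guardrail-$stamp"
  echo "$stamp"
}
```

Capturing the printed timestamp once (for example, stamp=$(make_evidence_dirs)) and reusing it keeps the cost and DR folders for the same event aligned.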
3. Phase 1 – Inspect the Cost Decision Service response
Goal: Understand why the decision is DENY.
- Extract the Cost Decision Service payload
  - From the GitHub Actions logs, copy the JSON response into a file:
        # Example: captured from workflow logs or artifact
        cat > output/artifacts/cost/cost-guardrail-<date>/cost-decision.json <<'EOF'
        {...}
        EOF
- Identify key fields
  Look for at least:
  - decision – must be DENY for this runbook.
  - reason or rationale – textual reason (for example, "monthly budget exceeded").
  - estimated_cost – projected cost of the requested action.
  - budget_remaining – remaining budget for the relevant period.
  - policy_id – which policy triggered the denial.
- Summarise the decision
  - In a short file (summary.txt) under the same folder, summarise:
    - Why the decision is DENY.
    - Which policy and thresholds were involved.
    - Whether this is a per-environment policy (for example, env: lab vs env: production).
This summary becomes part of the governance evidence.
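As a starting point for the summary, the key fields can be pulled out of the saved response with standard tools. This is a rough sketch, assuming the response is a flat JSON object whose values are plain strings or numbers; the helper names json_field and print_key_fields are hypothetical, and a real JSON parser such as jq would be more robust than this grep/sed extraction.

```shell
# Print the value of a top-level string or number field from a flat JSON file.
# Arguments: path to the JSON file, field name.
json_field() {
  local file="$1" key="$2"
  grep -Eo "\"$key\"[[:space:]]*:[[:space:]]*(\"[^\"]*\"|-?[0-9.]+)" "$file" \
    | head -n 1 \
    | sed -E 's/^"[^"]*"[[:space:]]*:[[:space:]]*//; s/^"(.*)"$/\1/'
}

# Print the five fields this runbook asks for, one per line.
# A field that is absent from the response prints with an empty value.
print_key_fields() {
  local file="$1" key
  for key in decision reason estimated_cost budget_remaining policy_id; do
    printf '%s: %s\n' "$key" "$(json_field "$file" "$key")"
  done
}
```

Redirecting the output of print_key_fields into summary.txt gives a skeleton to annotate by hand; if the service uses rationale instead of reason, adjust the field list accordingly.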
4. Phase 2 – Decide between degraded mode and override
Goal: Make a deliberate, documented decision on whether to respect the DENY or seek override.
- Assess technical impact
  - For DR:
    - How impaired is the on-prem environment?
    - Are there alternate paths (for example, partial service, read-only mode, minimal capacity)?
  - For burst:
    - Is the current capacity saturated?
    - Would declining to burst cause clear user impact?
- Assess financial and governance impact
  - Compare:
    - Estimated DR/burst cost vs budget remaining.
    - Nature of current environment (portfolio, lab, production).
  - Consider whether approving override:
    - Is justified by critical user or business impact.
    - Would create unacceptable financial risk.
- Default stance
  - For drills and lab/portfolio environments:
    - Default is to respect DENY and treat it as a successful demonstration of guardrails.
  - For hypothetical production:
    - Default is to respect DENY unless a clearly identified business owner explicitly approves the override and that approval is documented.
- Decision options
  - Option A – Respect DENY and operate in degraded mode.
  - Option B – Request and document override, then re-run.
  - Option C – Postpone DR/burst and schedule a later window (for example, after budget reset or policy change).
Document the chosen option and rationale in:
- Incident ticket, and
- output/artifacts/cost/cost-guardrail-<date>/decision.txt
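The cost comparison and the decision record can be sketched in a few lines of shell. This is an illustrative sketch only: the helper names budget_shortfall and record_decision and the single-line record format are assumptions, not something the platform prescribes.

```shell
# Print estimated cost minus remaining budget, to two decimals.
# A positive result means the action would exceed the remaining budget.
budget_shortfall() {
  awk -v cost="$1" -v remaining="$2" 'BEGIN { printf "%.2f\n", cost - remaining }'
}

# Append the chosen option (A, B, or C) and its rationale to decision.txt.
# Arguments: evidence folder, option letter, free-text rationale.
record_decision() {
  local dir="$1" option="$2" rationale="$3"
  case "$option" in
    A|B|C) ;;
    *) echo "option must be A, B, or C" >&2; return 1 ;;
  esac
  printf '%s option=%s rationale=%s\n' \
    "$(date -u +%Y-%m-%dT%H%M%SZ)" "$option" "$rationale" >> "$dir/decision.txt"
}
```

For example, an estimated cost of 42.50 against 10 remaining gives budget_shortfall 42.50 10 a result of 32.50, which is a useful figure to quote in the rationale. The same rationale still needs to go into the incident ticket.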
5. Phase 3 – Path A: Respect DENY and operate in degraded mode
If you choose to respect the Cost Decision Service decision:
- Select a degraded operating mode
  Examples:
  - Keep a minimal on-prem footprint running (essential services only).
  - Serve some workloads in read-only mode (for example, NetBox read-only).
  - Temporarily accept higher latency or lower throughput.
- Apply technical safeguards
  - Ensure automation is not continuously retrying DR/burst attempts:
    - Disable or pause the triggering workflow temporarily.
    - Put DR/burst jobs in a safe state (for example, disabled in Jenkins or GitHub Actions).
- Communicate status
  - If this is a multi-user environment, clearly communicate:
    - That DR/burst was intentionally blocked on cost grounds.
    - What degraded behaviour users should expect.
- Record operational state
  - Briefly describe the degraded mode in:
    output/artifacts/dr/cost-guardrail-<date>/degraded-mode.txt
- Plan follow-up
  - Decide whether you will:
    - Adjust budgets or policies for the future.
    - Improve capacity planning to reduce need for emergency burst.
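For the safeguard of pausing the triggering workflow, the GitHub CLI offers gh workflow disable and gh workflow enable. The wrapper below is a hedged sketch: the helper names, the DRY_RUN convention, and the workflow file name used in the usage note are assumptions, and it presumes gh is installed and authenticated against the right repository.

```shell
# Run a command, or only print it when DRY_RUN=1 (a convention of this sketch).
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Pause the triggering workflow so it cannot keep retrying the DR/burst action.
pause_workflow()  { run_or_print gh workflow disable "$1"; }

# Re-enable the workflow once the incident or drill is closed out.
resume_workflow() { run_or_print gh workflow enable "$1"; }
```

Running DRY_RUN=1 pause_workflow dr-burst.yml (a hypothetical file name) prints the command without touching the repository, which is a convenient way to rehearse the step during a drill.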
6. Phase 4 – Path B: Request override and re-run
If you choose to request override (primarily a conceptual path in this portfolio):
- Document override request
  - Capture:
    - Who is requesting override.
    - Who would approve it (for example, product or business owner).
    - Why override is justified vs cost risk.
  - Save this as:
    output/artifacts/cost/cost-guardrail-<date>/override-request.txt
- Record hypothetical approval
  - For a real environment:
    - Capture actual approval (for example, written confirmation).
  - For a portfolio demo:
    - Explain that override is being simulated for demonstration and that no real funds are at risk.
- Re-run workflow with override flag
  - Re-trigger the DR/burst workflow with a flag (for example, override: true or a dedicated event type).
  - Ensure the Cost Decision Service:
    - Records that this is an override.
    - Returns ALLOW with an override field set.
- Monitor DR/burst actions
  - If this is a non-destructive lab scenario, allow the workflow to:
    - Provision DR or burst resources.
    - Validate workloads.
    - Tear down resources after validation.
- Capture override evidence
  - Save updated Cost Decision Service responses to:
    output/artifacts/cost/cost-guardrail-<date>/cost-decision-override.json
  - Capture any additional DR/burst artefacts under:
    output/artifacts/dr/cost-guardrail-<date>/
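The re-run and the check on the override response might look like the sketch below. It assumes the workflow declares a workflow_dispatch input named override (mirroring the override: true example above) and that the override response is JSON with a top-level decision field and an override key; the helper names and the DRY_RUN convention are hypothetical. gh workflow run with -f key=value is the GitHub CLI's way of passing workflow inputs.

```shell
# Re-trigger the workflow with the override input set.
# With DRY_RUN=1 the command is printed instead of executed.
rerun_with_override() {
  local workflow="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "gh workflow run $workflow -f override=true"
  else
    gh workflow run "$workflow" -f override=true
  fi
}

# Succeed only if the saved response is ALLOW and carries an override field.
is_override_allow() {
  local file="$1"
  grep -Eq '"decision"[[:space:]]*:[[:space:]]*"ALLOW"' "$file" &&
    grep -Eq '"override"[[:space:]]*:' "$file"
}
```

Running is_override_allow against cost-decision-override.json before the workflow provisions anything confirms that the service recorded the override rather than simply flipping to ALLOW.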
7. Phase 5 – Path C: Postpone and reschedule
If you decide to neither override nor stay in prolonged degraded mode:
- Postpone action
  - Treat this as a decision to defer DR/burst until:
    - Budget is refreshed, or
    - Policies are re-tuned.
- Document the deferral
  - In decision.txt and the incident ticket, state:
    - That action was deferred due to cost.
    - Any temporary mitigations applied.
- Update cost or DR plans
  - Consider changes to:
    - Budgeting for DR/burst.
    - Environment sizing to avoid repeated near-misses.
8. Phase 6 – Evidence and close-out
Before closing the runbook:
- Ensure the following folders contain artefacts for this event:
  - output/artifacts/cost/cost-guardrail-<date>/
  - output/artifacts/dr/cost-guardrail-<date>/
- Check that you have:
  - Original Cost Decision Service response (DENY).
  - Summary of the decision and rationale.
  - Chosen path (A/B/C) and justification.
  - Any override notes (if applicable).
  - Description of degraded mode or follow-up actions.
- Update Evidence 4 (if this was a deliberate drill) to:
  - Mention the event as a proof point that cost guardrails are not just theoretical.
- Close the incident or drill with:
  - Clear final state.
  - Lessons learned and any backlog items.
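A quick completeness check before close-out can be scripted. The sketch below covers only the four files this runbook names for every path (context.txt, cost-decision.json, summary.txt, decision.txt); path-specific files such as degraded-mode.txt or override-request.txt would need to be added by hand. The helper name missing_evidence is hypothetical.

```shell
# List the common evidence files that are missing or empty for one event.
# Arguments: base directory, event timestamp (the <date> part of the folder name).
# Prints one file name per line; prints nothing when all four are present.
missing_evidence() {
  local base="$1" stamp="$2" f
  local dir="$base/output/artifacts/cost/cost-guardrail-$stamp"
  for f in context.txt cost-decision.json summary.txt decision.txt; do
    [ -s "$dir/$f" ] || echo "$f"
  done
}
```

An empty result from missing_evidence . 2025-12-02T210000Z means the common cost evidence is in place; any printed name is an item to resolve before closing the incident or drill.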
9. Validation checklist
- [ ] Confirmed that the Cost Decision Service returned a genuine DENY (not a technical error).
- [ ] Underlying technical incident (DR or burst need) was understood and recorded.
- [ ] Cost decision payload and rationale stored under output/artifacts/cost/.
- [ ] A deliberate choice was made between degraded mode, override, or postponement.
- [ ] If degraded mode was chosen, technical safeguards and communication steps were applied.
- [ ] If override was chosen, approvals and re-run behaviour were recorded.
- [ ] DR/burst workflows were left in a safe state, with no uncontrolled retries.
- [ ] Evidence was captured and, if appropriate, referenced from Evidence 4.
References
- ADR-0701 – Use GitHub Actions as Stateless DR Orchestrator
- ADR-0801 – Treat Cost as a First-Class Signal for DR and Cloud Bursting
- HOWTO – Run a Cost-Aware DR Drill (Prometheus → GitHub Actions → DR Workflow)
- Evidence 4 – Delivery Platform, GitOps and Cluster Operations
- output/artifacts/cost/
- output/artifacts/dr/
Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation