
Failback – Cloud Cluster to On-Prem RKE2

Purpose: Safely return primary workloads and traffic from the cloud DR cluster back to the on-prem RKE2 cluster after an incident has been resolved or a DR drill has completed.
Owner: Platform / SRE team (HybridOps.Studio)
Trigger: On-prem RKE2 and supporting services are healthy again, and incident command approves failback.
Impact: Production control-plane and workloads move from cloud to on-prem; DNS / Front Door endpoints revert to on-prem.
Severity: P2 – high-impact change but usually performed during stabilised conditions.
Pre-reqs: On-prem environment restored and validated; DR cutover runbook has previously been executed.


1. Scenario overview

This runbook assumes:

  • You previously executed DR cutover from on-prem to cloud using the DR runbook Runbook – DR Cutover: On-Prem RKE2 to Cloud Cluster.
  • On-prem RKE2, PostgreSQL LXC and related services have been repaired or rebuilt and are now healthy.
  • Cloud DR resources are still running and serving production traffic.

The goal is to:

  1. Confirm on-prem is ready to resume as primary.
  2. Re-sync or promote data back to on-prem if needed.
  3. Swap traffic (DNS / Front Door) back to on-prem.
  4. De-scale cloud DR resources within cost and risk tolerances.
  5. Capture full evidence for the failback operation.

2. Preconditions and checks

Before starting failback:

  1. Verify on-prem RKE2 health. From the control node, run:

    export KUBECONFIG=~/.kube/rke2-hybridops-onprem.yaml
    kubectl get nodes -o wide
    kubectl get pods -A
    
  All control-plane and worker nodes must be Ready, and core system components (CNI, DNS, ingress, metrics) must be healthy.

  2. Verify PostgreSQL LXC state. Confirm the primary on-prem PostgreSQL LXC (for example, db-01) is:

    • Running on the correct Proxmox node.
    • Passing basic health checks (for example, psql connectivity, replication status if used).

  Ensure its data is up to date, or that you have a plan to refresh it from the cloud DB.

  3. Confirm cloud cluster state. Validate that the cloud cluster is currently serving production traffic and is healthy:

    export KUBECONFIG=~/.kube/rke2-cloud-dr.yaml
    kubectl get nodes -o wide
    kubectl get pods -A
    
  4. Obtain stakeholder approval. The incident commander / product owner confirms:

    • It is acceptable to schedule a failback window.
    • Any user-facing impact is documented and communicated.

  5. Choose an evidence location. Decide a folder for this failback event, for example output/artifacts/dr/failback-<date>-cloud-to-onprem/.
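The node-health check above can be scripted so the operator gets a clear go/no-go signal instead of eyeballing kubectl output. A minimal sketch, assuming the default kubectl column layout; the check_nodes_ready helper is illustrative, not part of the existing tooling:

```shell
# check_nodes_ready: reads "kubectl get nodes --no-headers" output on stdin
# and succeeds only if every node's STATUS column is exactly "Ready"
# (cordoned or NotReady nodes cause a failure).
check_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit bad }'
}

# Live usage against the on-prem cluster:
#   export KUBECONFIG=~/.kube/rke2-hybridops-onprem.yaml
#   kubectl get nodes --no-headers | check_nodes_ready && echo "go for failback"
# Demonstrated here with a captured sample:
printf 'cp-01 Ready control-plane 40d v1.30.1\nwk-01 Ready worker 40d v1.30.1\n' \
  | check_nodes_ready && echo "all nodes Ready"
```

The same helper can be reused later against the cloud kubeconfig when confirming cloud cluster state.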


3. Phase 1 – Prepare on-prem environment for primary role

Goal: Ensure on-prem RKE2 and PostgreSQL are ready to become authoritative again.

  1. Synchronise application configuration. Ensure GitOps manifests or Helm values for on-prem RKE2 are:

    • Up to date with any changes made during DR.
    • Reviewed to avoid drift between cloud and on-prem environments.

  2. Agree the database/data sync strategy. If the cloud DB is currently primary, either:

    • Perform a controlled replication back to on-prem, or
    • Export/import data in a planned maintenance window aligned with application expectations.

  Ensure any decisions here are documented and appropriate for the system’s data consistency model.

  3. Dry-run application deployments. For key workloads (for example, NetBox and supporting platform services), run a dry-run apply via GitOps/Helm where supported, and check that there are no obvious configuration errors (missing secrets, bad endpoints).

  4. Record pre-failback state. Capture basic evidence of on-prem readiness:

    kubectl get nodes -o wide > output/artifacts/dr/failback-<date>-cloud-to-onprem/kubectl-nodes-onprem-before.txt
    kubectl get pods -A > output/artifacts/dr/failback-<date>-cloud-to-onprem/kubectl-pods-all-onprem-before.txt
    
  Replace <date> with your actual event stamp.
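The evidence-capture commands above can be wrapped in a small helper so every snapshot lands in the same event folder with a consistent name. A sketch, assuming a writable working directory; the capture function and the $(date ...) event stamp are illustrative conventions, not existing tooling:

```shell
# Evidence folder for this failback event, following the runbook's
# output/artifacts/dr/failback-<date>-cloud-to-onprem/ convention.
# The date format here is an assumed stamp; use your team's standard.
EVIDENCE_DIR="output/artifacts/dr/failback-$(date +%Y-%m-%d)-cloud-to-onprem"
mkdir -p "$EVIDENCE_DIR"

# capture <file> <command...>: run a command and tee its output into the
# evidence folder as well as to the terminal, so the operator sees it live.
capture() {
  local name="$1"; shift
  "$@" | tee "$EVIDENCE_DIR/$name"
}

# Live usage:
#   capture kubectl-nodes-onprem-before.txt kubectl get nodes -o wide
#   capture kubectl-pods-all-onprem-before.txt kubectl get pods -A
# Demonstrated with a trivial command:
capture sample.txt echo "pre-failback snapshot"
```

The same helper works unchanged in Phases 2–4 for capturing cutover and teardown evidence.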


4. Phase 2 – Coordinate data and application cutover

Goal: Minimise data inconsistency during the switch from cloud to on-prem.

  1. Quiesce writes if required. For workloads that require strong consistency, coordinate a short freeze of write operations (for example, maintenance mode or brief downtime). For read-heavy / eventually consistent workloads, document the acceptable level of risk and behaviour.

  2. Final data sync. Run the final sync or promotion of the on-prem database:

    • Ensure on-prem PostgreSQL is now the primary.
    • Confirm replication and/or application configuration reflects this.

  3. Update application connections. Confirm that on-prem RKE2 workloads point to the on-prem PostgreSQL instance, and that applications in the cloud DR cluster are either:

    • Drained/disabled, or
    • Clearly flagged as secondary/non-serving instances.

  4. Sanity checks. Run application-level health checks on-prem (for example, the NetBox /health endpoint) and verify that the on-prem environment can serve test traffic before any DNS/Front Door changes.
Record these steps and outputs in the failback proof folder.
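One way to make the "ensure on-prem PostgreSQL is now the primary" step verifiable rather than assumed is to assert on pg_is_in_recovery(). A minimal sketch; the db-01 host, postgres user, and the assert_primary helper are hypothetical and must be adapted to the actual LXC setup:

```shell
# assert_primary: reads the output of
#   psql -tAc 'SELECT pg_is_in_recovery();'
# on stdin and succeeds only when the instance reports "f", i.e. it is a
# primary and not a streaming-replication standby.
assert_primary() {
  read -r flag
  [ "$flag" = "f" ]
}

# Live usage (hypothetical host/user):
#   psql -h db-01 -U postgres -tAc 'SELECT pg_is_in_recovery();' \
#     | assert_primary && echo "on-prem PostgreSQL is primary"
# Demonstrated with a captured value:
echo "f" | assert_primary && echo "on-prem PostgreSQL is primary"
```

If streaming replication is used, pairing this with a check of pg_stat_replication on the new primary confirms that any remaining standbys have reattached.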


5. Phase 3 – Switch traffic back to on-prem

Goal: Move user-facing traffic from cloud cluster to on-prem RKE2.

  1. Prepare DNS / Front Door changes. Identify the current configuration:

    • DNS records pointing to cloud, or
    • Azure Front Door / load balancer using the cloud cluster as backend.

  2. Update the configuration. Either update DNS records to point back to the on-prem ingress/entry point, or update Front Door / the load balancer to:

    • Restore the on-prem backend as primary.
    • Optionally keep cloud as a warm standby with reduced weight.

  3. Monitor propagation and behaviour. As the DNS/Front Door changes propagate:

    • Monitor application logs and metrics on-prem: traffic should be increasing, with error rates and latency within SLO.
    • Monitor the cloud cluster to ensure traffic is decreasing as expected.

  4. Record cutover details. Note the exact time of failback and archive configuration changes (for example, before/after snippets of DNS or Front Door config) into the proof folder.
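DNS propagation can be checked mechanically before declaring the cutover complete. A sketch; app.example.com and the 203.0.113.10 on-prem VIP are documentation-range placeholders for the real entry point, and verify_dns_target is a hypothetical helper:

```shell
# Expected on-prem entry point (placeholder address; substitute the real
# ingress VIP or load-balancer address).
ONPREM_IP="203.0.113.10"

# verify_dns_target: reads "dig +short <name>" output on stdin and succeeds
# once any answer line exactly matches the expected on-prem address.
verify_dns_target() {
  grep -qx "$ONPREM_IP"
}

# Live usage, repeated until propagation completes (ideally from several
# resolvers / vantage points):
#   dig +short app.example.com | verify_dns_target && echo "DNS points on-prem"
# Demonstrated with a captured answer:
echo "203.0.113.10" | verify_dns_target && echo "DNS points on-prem"
```

Capturing the before/after output of the same dig query is a cheap, timestamped artefact for the proof folder.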

6. Phase 4 – De-scale or decommission DR resources

Goal: Reduce cloud DR footprint while staying within cost and resilience targets.

  1. Run the DR teardown/de-scale workflow. Use a dedicated GitHub Actions workflow (for example, dr-scale-down-cloud.yml) that:

    • Scales down DR node pools.
    • Optionally decommissions non-essential DR resources.
    • Leaves minimal capacity for monitoring and future drills.

  2. Cost check. Confirm that post-failback cost artefacts under output/artifacts/cost/ reflect the updated, reduced DR spend.

  3. Validate residual DR posture. Ensure any remaining cloud resources are:

    • Clearly labelled as DR.
    • Documented in the DR inventory.

  Confirm that the next DR drill can still spin up capacity quickly from this baseline.
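Once the de-scale workflow finishes, the residual footprint can be asserted rather than eyeballed. A sketch, assuming a baseline of one remaining DR node; max_dr_nodes is a hypothetical helper, and the limit must match your agreed baseline:

```shell
# max_dr_nodes <limit>: reads "kubectl get nodes --no-headers" output on
# stdin and succeeds when at most <limit> nodes remain (default 1).
max_dr_nodes() {
  local limit="${1:-1}"
  local count
  count="$(wc -l | tr -d '[:space:]')"   # one line per node
  [ "$count" -le "$limit" ]
}

# Live usage against the cloud DR cluster:
#   KUBECONFIG=~/.kube/rke2-cloud-dr.yaml kubectl get nodes --no-headers \
#     | max_dr_nodes 1 && echo "DR footprint at baseline"
# Demonstrated with a captured sample:
printf 'dr-wk-01 Ready worker 2d v1.30.1\n' \
  | max_dr_nodes 1 && echo "DR footprint at baseline"
```

Running this at the end of the teardown workflow (and saving its output to the proof folder) makes the cost-baseline claim auditable.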

7. Phase 5 – Post-failback monitoring and close-out

  1. Monitor the on-prem platform. Continue to watch:

    • RKE2 node and pod health.
    • Application SLOs and error budgets.
    • PostgreSQL LXC health and backups.

  2. Confirm with stakeholders. Communicate that:

    • Primary operations are now fully back on on-prem RKE2.
    • DR/cloud resources have been de-scaled as planned.

  3. Update documentation. Update incident/drill records with:

    • Start/end times for DR and failback.
    • Any manual interventions or issues encountered.

  4. File follow-up tasks for:

    • Runbook improvements.
    • Automation adjustments.
    • Cost model tweaks.

8. Validation checklist

  • [ ] On-prem RKE2 cluster is healthy and running core platform workloads.
  • [ ] On-prem PostgreSQL LXC is primary and serving expected applications.
  • [ ] Cloud DR cluster is no longer the primary entry point for user traffic.
  • [ ] DNS / Front Door configuration now points to on-prem.
  • [ ] Cloud DR resources are scaled down to agreed baseline levels.
  • [ ] Evidence for failback is stored under output/artifacts/dr/ and, where relevant, output/artifacts/infra/rke2/ and output/artifacts/cost/.
  • [ ] Stakeholders have confirmed that failback is complete and stable.

Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation