
Failback – Cloud Cluster to On-Prem RKE2

Purpose: Safely return primary workloads and traffic from the cloud DR cluster back to the on-prem RKE2 cluster after an incident has been resolved or a DR drill has completed.
Owner: Platform / SRE team (HybridOps.Studio)
Trigger: On-prem RKE2 and supporting services are healthy again, and incident command approves failback.
Impact: Production control-plane and workloads move from cloud to on-prem; DNS / Front Door endpoints revert to on-prem.
Severity: P2 – high-impact change but usually performed during stabilised conditions.
Pre-reqs: On-prem environment restored and validated; DR cutover runbook has previously been executed.


1. Scenario overview

This runbook assumes:

  • You previously executed DR cutover from on-prem to cloud using the DR runbook Runbook – DR Cutover: On-Prem RKE2 to Cloud Cluster.
  • On-prem RKE2, PostgreSQL LXC and related services have been repaired or rebuilt and are now healthy.
  • Cloud DR resources are still running and serving production traffic.

The goal is to:

  1. Confirm on-prem is ready to resume as primary.
  2. Re-sync or promote data back to on-prem if needed.
  3. Swap traffic (DNS / Front Door) back to on-prem.
  4. De-scale cloud DR resources within cost and risk tolerances.
  5. Capture full evidence for the failback operation.

2. Preconditions and checks

Before starting failback:

  1. Verify on-prem RKE2 health. From the control node, run:

    export KUBECONFIG=~/.kube/rke2-hybridops-onprem.yaml
    kubectl get nodes -o wide
    kubectl get pods -A
    
  All control-plane and worker nodes must be Ready, and core system components (CNI, DNS, ingress, metrics) must be healthy.

  2. Verify PostgreSQL LXC state. Confirm the primary on-prem PostgreSQL LXC (for example, db-01) is:

    • Running on the correct Proxmox node.
    • Passing basic health checks (for example, psql connectivity, replication status if used).

  Ensure its data is up to date, or that you have a plan to refresh it from the cloud DB.

  3. Confirm cloud cluster state. Validate that the cloud cluster is currently serving production traffic and is healthy:

    export KUBECONFIG=~/.kube/rke2-cloud-dr.yaml
    kubectl get nodes -o wide
    kubectl get pods -A
    
  4. Obtain stakeholder approval. The incident commander / product owner confirms:

    • It is acceptable to schedule a failback window.
    • Any user-facing impact is documented and communicated.

  5. Choose an evidence location. Decide a folder for this failback event, for example output/artifacts/dr/failback-<date>-cloud-to-onprem/.
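The node-health check above can be scripted so the operator gets a clear go/no-go signal instead of eyeballing kubectl output. A minimal sketch, assuming the default kubectl column layout; the check_nodes_ready helper is illustrative, not part of the existing tooling:

```shell
# check_nodes_ready: reads "kubectl get nodes --no-headers" output on stdin
# and succeeds only if every node's STATUS column is exactly "Ready"
# (cordoned or NotReady nodes cause a failure).
check_nodes_ready() {
  awk '$2 != "Ready" { bad = 1 } END { exit bad }'
}

# Live usage against the on-prem cluster:
#   export KUBECONFIG=~/.kube/rke2-hybridops-onprem.yaml
#   kubectl get nodes --no-headers | check_nodes_ready && echo "go for failback"
# Demonstrated here with a captured sample:
printf 'cp-01 Ready control-plane 40d v1.30.1\nwk-01 Ready worker 40d v1.30.1\n' \
  | check_nodes_ready && echo "all nodes Ready"
```

The same helper can be reused later against the cloud kubeconfig when confirming cloud cluster state.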


3. Phase 1 – Prepare on-prem environment for primary role

Goal: Ensure on-prem RKE2 and PostgreSQL are ready to become authoritative again.

  1. Synchronise application configuration. Ensure GitOps manifests or Helm values for on-prem RKE2 are:

    • Up to date with any changes made during DR.
    • Reviewed to avoid drift between cloud and on-prem environments.

  2. Agree the database/data sync strategy. If the cloud DB is currently primary, either:

    • Perform a controlled replication back to on-prem, or
    • Export/import data in a planned maintenance window aligned with application expectations.

  Ensure any decisions here are documented and appropriate for the system’s data consistency model.

  3. Dry-run application deployments. For key workloads (for example, NetBox and supporting platform services), run a dry-run apply via GitOps/Helm where supported, and check that there are no obvious configuration errors (missing secrets, bad endpoints).

  4. Record pre-failback state. Capture basic evidence of on-prem readiness:

    kubectl get nodes -o wide > output/artifacts/dr/failback-<date>-cloud-to-onprem/kubectl-nodes-onprem-before.txt
    kubectl get pods -A > output/artifacts/dr/failback-<date>-cloud-to-onprem/kubectl-pods-all-onprem-before.txt
    
  Replace <date> with your actual event stamp.
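The evidence-capture commands above can be wrapped in a small helper so every snapshot lands in the same event folder with a consistent name. A sketch, assuming a writable working directory; the capture function and the $(date ...) event stamp are illustrative conventions, not existing tooling:

```shell
# Evidence folder for this failback event, following the runbook's
# output/artifacts/dr/failback-<date>-cloud-to-onprem/ convention.
# The date format here is an assumed stamp; use your team's standard.
EVIDENCE_DIR="output/artifacts/dr/failback-$(date +%Y-%m-%d)-cloud-to-onprem"
mkdir -p "$EVIDENCE_DIR"

# capture <file> <command...>: run a command and tee its output into the
# evidence folder as well as to the terminal, so the operator sees it live.
capture() {
  local name="$1"; shift
  "$@" | tee "$EVIDENCE_DIR/$name"
}

# Live usage:
#   capture kubectl-nodes-onprem-before.txt kubectl get nodes -o wide
#   capture kubectl-pods-all-onprem-before.txt kubectl get pods -A
# Demonstrated with a trivial command:
capture sample.txt echo "pre-failback snapshot"
```

The same helper works unchanged in Phases 2–4 for capturing cutover and teardown evidence.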


4. Phase 2 – Coordinate data and application cutover

Goal: Minimise data inconsistency during the switch from cloud to on-prem.

  1. Quiesce writes if required. For workloads that require strong consistency, coordinate a short freeze of write operations (for example, maintenance mode or brief downtime). For read-heavy / eventually consistent workloads, document the acceptable level of risk and behaviour.

  2. Final data sync. Run the final sync or promotion of the on-prem database:

    • Ensure on-prem PostgreSQL is now the primary.
    • Confirm replication and/or application configuration reflects this.

  3. Update application connections. Confirm that on-prem RKE2 workloads point to the on-prem PostgreSQL instance, and that applications in the cloud DR cluster are either:

    • Drained/disabled, or
    • Clearly flagged as secondary/non-serving instances.

  4. Sanity checks. Run application-level health checks on-prem (for example, the NetBox /health endpoint) and verify that the on-prem environment can serve test traffic before any DNS/Front Door changes.
Record these steps and outputs in the failback proof folder.
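One way to make the "ensure on-prem PostgreSQL is now the primary" step verifiable rather than assumed is to assert on pg_is_in_recovery(). A minimal sketch; the db-01 host, postgres user, and the assert_primary helper are hypothetical and must be adapted to the actual LXC setup:

```shell
# assert_primary: reads the output of
#   psql -tAc 'SELECT pg_is_in_recovery();'
# on stdin and succeeds only when the instance reports "f", i.e. it is a
# primary and not a streaming-replication standby.
assert_primary() {
  read -r flag
  [ "$flag" = "f" ]
}

# Live usage (hypothetical host/user):
#   psql -h db-01 -U postgres -tAc 'SELECT pg_is_in_recovery();' \
#     | assert_primary && echo "on-prem PostgreSQL is primary"
# Demonstrated with a captured value:
echo "f" | assert_primary && echo "on-prem PostgreSQL is primary"
```

If streaming replication is used, pairing this with a check of pg_stat_replication on the new primary confirms that any remaining standbys have reattached.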


5. Phase 3 – Switch traffic back to on-prem

Goal: Move user-facing traffic from cloud cluster to on-prem RKE2.

  1. Prepare DNS / Front Door changes. Identify the current configuration:

    • DNS records pointing to cloud, or
    • Azure Front Door / load balancer using the cloud cluster as backend.

  2. Update the configuration. Either update DNS records to point back to the on-prem ingress/entry point, or update Front Door / the load balancer to:

    • Restore the on-prem backend as primary.
    • Optionally keep cloud as a warm standby with reduced weight.

  3. Monitor propagation and behaviour. As the DNS/Front Door changes propagate:

    • Monitor application logs and metrics on-prem: traffic should be increasing, with error rates and latency within SLO.
    • Monitor the cloud cluster to ensure traffic is decreasing as expected.

  4. Record cutover details. Note the exact time of failback and archive configuration changes (for example, before/after snippets of DNS or Front Door config) into the proof folder.
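DNS propagation can be checked mechanically before declaring the cutover complete. A sketch; app.example.com and the 203.0.113.10 on-prem VIP are documentation-range placeholders for the real entry point, and verify_dns_target is a hypothetical helper:

```shell
# Expected on-prem entry point (placeholder address; substitute the real
# ingress VIP or load-balancer address).
ONPREM_IP="203.0.113.10"

# verify_dns_target: reads "dig +short <name>" output on stdin and succeeds
# once any answer line exactly matches the expected on-prem address.
verify_dns_target() {
  grep -qx "$ONPREM_IP"
}

# Live usage, repeated until propagation completes (ideally from several
# resolvers / vantage points):
#   dig +short app.example.com | verify_dns_target && echo "DNS points on-prem"
# Demonstrated with a captured answer:
echo "203.0.113.10" | verify_dns_target && echo "DNS points on-prem"
```

Capturing the before/after output of the same dig query is a cheap, timestamped artefact for the proof folder.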

6. Phase 4 – De-scale or decommission DR resources

Goal: Reduce cloud DR footprint while staying within cost and resilience targets.

  1. Run the DR teardown/de-scale workflow. Use a dedicated GitHub Actions workflow (for example, dr-scale-down-cloud.yml) that:

    • Scales down DR node pools.
    • Optionally decommissions non-essential DR resources.
    • Leaves minimal capacity for monitoring and future drills.

  2. Cost check. Confirm that post-failback cost artefacts under output/artifacts/cost/ reflect the updated, reduced DR spend.

  3. Validate residual DR posture. Ensure any remaining cloud resources are:

    • Clearly labelled as DR.
    • Documented in the DR inventory.

  Confirm that the next DR drill can still spin up capacity quickly from this baseline.
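Once the de-scale workflow finishes, the residual footprint can be asserted rather than eyeballed. A sketch, assuming a baseline of one remaining DR node; max_dr_nodes is a hypothetical helper, and the limit must match your agreed baseline:

```shell
# max_dr_nodes <limit>: reads "kubectl get nodes --no-headers" output on
# stdin and succeeds when at most <limit> nodes remain (default 1).
max_dr_nodes() {
  local limit="${1:-1}"
  local count
  count="$(wc -l | tr -d '[:space:]')"   # one line per node
  [ "$count" -le "$limit" ]
}

# Live usage against the cloud DR cluster:
#   KUBECONFIG=~/.kube/rke2-cloud-dr.yaml kubectl get nodes --no-headers \
#     | max_dr_nodes 1 && echo "DR footprint at baseline"
# Demonstrated with a captured sample:
printf 'dr-wk-01 Ready worker 2d v1.30.1\n' \
  | max_dr_nodes 1 && echo "DR footprint at baseline"
```

Running this at the end of the teardown workflow (and saving its output to the proof folder) makes the cost-baseline claim auditable.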

7. Phase 5 – Post-failback monitoring and close-out

  1. Monitor the on-prem platform. Continue to watch:

    • RKE2 node and pod health.
    • Application SLOs and error budgets.
    • PostgreSQL LXC health and backups.

  2. Confirm with stakeholders. Communicate that:

    • Primary operations are now fully back on on-prem RKE2.
    • DR/cloud resources have been de-scaled as planned.

  3. Update documentation. Update incident/drill records with:

    • Start/end times for DR and failback.
    • Any manual interventions or issues encountered.

  4. File follow-up tasks for:

    • Runbook improvements.
    • Automation adjustments.
    • Cost model tweaks.

8. Validation checklist

  • [ ] On-prem RKE2 cluster is healthy and running core platform workloads.
  • [ ] On-prem PostgreSQL LXC is primary and serving expected applications.
  • [ ] Cloud DR cluster is no longer the primary entry point for user traffic.
  • [ ] DNS / Front Door configuration now points to on-prem.
  • [ ] Cloud DR resources are scaled down to agreed baseline levels.
  • [ ] Evidence for failback is stored under output/artifacts/dr/ and, where relevant, output/artifacts/infra/rke2/ and output/artifacts/cost/.
  • [ ] Stakeholders have confirmed that failback is complete and stable.

Maintainer: HybridOps.Studio
License: MIT-0 for code, CC-BY-4.0 for documentation