Dual ISP Load Balancing for Resiliency¶

Status¶

Accepted — the platform adopts a dual-ISP configuration with policy-based routing and automatic failover for sustained uptime and route diversity.

Context¶

HybridOps.Studio’s on-prem infrastructure depends on external internet reachability for:

Remote Git repository access and CI/CD workflows
DR connectivity to public cloud (Azure, GCP)
External observability federation and notifications

A single ISP introduces risk: outages, maintenance windows, or degraded routes can stall automation and replication.

Dual ISPs with health-checked routing policies significantly improve resiliency and provide a more realistic enterprise-style WAN scenario for the HybridOps.Studio blueprint.

Decision¶

Implement a dual-WAN design at the edge using the network stack defined in:

Key elements:

Primary ISP: wan_a (tier 1, higher bandwidth).
Secondary ISP: wan_b (tier 2, lower-cost / backup path).
Gateway / path selection:
Health checks per ISP (ICMP and/or HTTP probes).
Policy-based routing and/or gateway groups to steer traffic.
Hybrid cloud integration:
Dual IPsec/BGP paths to cloud via both ISPs.
Preferred path via wan_a with automatic failover to wan_b.
Observability:
Prometheus monitors RTT, packet loss, and flap events on each uplink.

Failover Behaviour (High-Level)¶

Health probe detects packet loss or latency above a defined threshold on wan_a.
Edge routing stack:
Switches default route to wan_b.
Moves IPsec/BGP sessions to use wan_b as source.
Alerts are raised via Prometheus / Alertmanager.
When wan_a is stable for a sustained window (for example ≥ 3 minutes), traffic is failed back in a controlled way.

Exact thresholds and timers are defined in the associated runbook and configuration.

Consequences¶

Positive¶

Provides realistic enterprise-grade resiliency for WAN connectivity.
Protects CI/CD, DR replication, and monitoring from single-ISP failures.
Demonstrates clear separation between core routing (Proxmox, ADR-0102) and edge/WAN routing (VyOS/CSR stack).
Enables controlled testing of failure scenarios (pulling one ISP, simulating brownouts).

Negative¶

Increases complexity at the edge (more routes, more health checks, more moving parts).
Requires careful NAT and port-forwarding design for inbound services to behave correctly across ISPs.
More involved troubleshooting when issues arise (must distinguish ISP failure vs. local misconfiguration).

Neutral¶

Additional cost for second ISP link, but acceptable for the blueprint’s learning and showcase value.
Implementation can start with lab-only simulation (EVE-NG) and later be extended to physical links.

References¶

ADR-0102 – Proxmox as Intra-Site Core Router
ADR-0107 – VyOS as Cost-Effective Edge Router
ADR-0108 – Full Mesh Topology for High Availability
ADR-0201 – EVE-NG Network Lab Architecture
Runbook: Dual ISP Load Balancing
Diagram: Dual ISP Architecture
Run artefacts & logs: gateway failover logs
Health-check logs
BGP / IPsec session logs
Ping / traceroute before/after failover

Maintainer: HybridOps.Studio License: MIT-0 for code, CC-BY-4.0 for documentation unless otherwise stated.