How we achieve multi-provider DNS redundancy

How Hale Stack Built Highly Resilient DNS with Multi‑Provider Syncing


At Hale Stack, we believe that critical infrastructure, especially DNS, cannot live on a single point of failure. That’s why, over the past year, we’ve built out an automation platform that keeps DNS records in sync across up to three DNS providers, while listing all of those providers in the registrar’s NS (nameserver) configuration. This setup gives our clients both redundancy and failover protection.


Here’s how we did it, and why we think it makes a huge difference.


1. The Challenge: DNS Risk & Provider Dependence

  • Many organizations rely on a single DNS provider (e.g., Cloudflare or AWS Route 53). If that provider has an outage or degradation, DNS resolution can fail, potentially taking your entire website or service offline.

  • Manual syncing between DNS providers is error-prone, time-consuming, and doesn’t scale.

  • Registrar-level changes (changing name servers) are rarely automated, and making live changes can be risky.


We saw this risk as a major weakness, especially for clients whose uptime and reliability are non-negotiable.


2. Our Solution: Management API + Multi‑Provider Sync

Here’s a breakdown of how we built our automated, resilient DNS platform:

  • Central DNS Record Source of Truth: We maintain a central repository of DNS records (A, CNAME, MX, TXT, etc.) in our own versioned configuration store. This store is the single source of truth for every domain we manage. Because it is itself a dependency, we are considering redundancy upgrades for it as well.


  • Automated Management API Integration: For each DNS provider we support (up to three per domain), we use their management APIs to programmatically read and write DNS records. Whenever a record changes in our central store, we push updates via the API to all configured DNS providers.


  • Bidirectional Reconciliation

    • On a regular schedule (or triggered by events), we fetch the current DNS state from each provider via API.

    • We compare each provider’s records to our central source.

    • If there is drift or any discrepancy (for example, a record was added manually in one provider), we raise alerts.


  • Registrar NS Configuration: At the registrar level, we list all of the DNS providers’ name servers in the NS records. This means the domain is delegating to multiple DNS providers in parallel. If one or two providers are unreachable, DNS queries still succeed via the others.


  • Monitoring & Health Checks: We continuously monitor DNS resolution health: query times, TTLs, record consistency, propagation success, etc. If a provider becomes unhealthy (e.g., consistently slow, failing API, or serving stale data), we trigger alerts and remediation.
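The reconciliation step described above can be illustrated with a minimal sketch. It is not our production code: the `Record` type, the `diff_records` and `reconcile` helpers, and the provider names are hypothetical, and in-memory lists stand in for the record sets that would be fetched through each provider's management API. It only shows the comparison against the central store and the drift report that would feed alerting.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Record:
    """One DNS record; hypothetical shape used only for this sketch."""
    name: str
    rtype: str
    value: str
    ttl: int = 300


def diff_records(desired, actual):
    """Compare the central record set with one provider's record set."""
    desired_set, actual_set = set(desired), set(actual)
    return {
        # Present in the central store but absent at the provider.
        "missing": sorted(desired_set - actual_set),
        # Present at the provider but unknown to the central store.
        "unexpected": sorted(actual_set - desired_set),
    }


def reconcile(desired, providers):
    """Return drift per provider; an empty dict means all providers match."""
    report = {}
    for provider_name, actual in providers.items():
        drift = diff_records(desired, actual)
        if drift["missing"] or drift["unexpected"]:
            report[provider_name] = drift
    return report


# Central source of truth (documentation-range addresses, assumed values).
central = [
    Record("www.example.com", "A", "192.0.2.10"),
    Record("example.com", "MX", "10 mail.example.com"),
]

# Stand-ins for the state each provider's API would return.
providers = {
    "provider_a": list(central),                                             # in sync
    "provider_b": central + [Record("test.example.com", "A", "192.0.2.99")],  # manual addition
    "provider_c": central[:1],                                               # missing the MX record
}

drift = reconcile(central, providers)
for provider_name, details in drift.items():
    print(provider_name, details)
```

Treating records as hashable values keeps the comparison to two set differences, so a manually added record and a missing record show up as separate categories that can be alerted on differently.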


How Hale Stack’s DNS Architecture Saved the Day During the Cloudflare Outage


On November 18, 2025, Cloudflare experienced a significant global outage that impacted core parts of its network, including traffic routing, proxies, and service delivery. More info here: The Cloudflare Blog


Here’s what happened and how Hale Stack’s DNS design helped our clients stay online:


  1. Cloudflare’s Root Cause

    • According to Cloudflare's incident report, a permissions change in one of their ClickHouse database systems caused a “feature file” used for Bot Management to grow well beyond its expected size.

    • That oversized file was propagated across their network, causing proxy components to fail and leading to elevated HTTP 5xx errors.

    • The outage lasted several hours, with core services recovering only after rollback of the problematic configuration and restart of critical proxy services.


  2. Impact on Standard DNS Setups

    • Many customers rely solely on Cloudflare for their DNS. During the outage, some reported being unable to resolve their domains or experiencing high error rates. More info: Krebs on Security

    • Because DNS is often delegated exclusively to Cloudflare’s nameservers, if their infrastructure degrades, there is no fallback.


  3. Why Hale Stack Clients Kept Serving DNS. Thanks to our multi‑provider DNS strategy:

    • Even though Cloudflare’s DNS/control plane was impacted, the other DNS providers in our clients’ configuration remained available. Because we had set up up to three providers in the registrar’s NS records, recursive resolvers could still query the healthy providers.

    • Our continuous reconciliation ensured that the DNS records on the “backup” providers were up-to-date and fully mirrored with no stale or missing entries.

    • As a result, DNS resolution for our clients’ domains remained intact, and their services kept functioning without interruption, or with only a minimal, very short-term performance impact of less than 10 seconds.


  4. Rapid Response & Visibility


    • Our monitoring systems immediately flagged any anomalies in DNS resolution or increased query latency.

    • Our ops team was able to verify via API that Cloudflare was degraded as a provider.

    • Because of our automated sync, we didn’t need to manually reconfigure DNS or perform emergency registrar work. Failover happened naturally at the DNS resolver level, because recursive resolvers retry the remaining delegated nameservers when one does not respond.
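To make the resolver-level failover concrete, here is a simplified simulation. It assumes a resolver that tries the delegated nameservers in random order and moves on when one times out; real recursive resolvers typically use more elaborate server selection (for example, preferring servers with lower measured latency), so treat this as an illustration of the principle only. The `resolve` and `fake_query` functions and all nameserver names are hypothetical, and no network traffic is sent.

```python
import random


def resolve(name, nameservers, query, rng=None):
    """Try delegated nameservers in random order until one answers.

    `query(ns, name)` returns an answer or raises TimeoutError/ConnectionError.
    Returns the answer together with the list of servers that were tried.
    """
    rng = rng or random.Random()
    order = list(nameservers)
    rng.shuffle(order)
    tried = []
    for ns in order:
        tried.append(ns)
        try:
            return query(ns, name), tried
        except (TimeoutError, ConnectionError):
            continue  # this provider is unreachable, fall through to the next one
    raise RuntimeError(f"no nameserver answered for {name}")


# One nameserver per provider, as listed in the registrar's NS configuration.
NAMESERVERS = [
    "ns1.provider-a.example",
    "ns1.provider-b.example",
    "ns1.provider-c.example",
]
DOWN = {"ns1.provider-a.example"}  # simulate one provider being unreachable


def fake_query(ns, name):
    """Stand-in for a real DNS query; fails for servers in DOWN."""
    if ns in DOWN:
        raise TimeoutError(ns)
    return "192.0.2.10"


answer, tried = resolve("www.example.com", NAMESERVERS, fake_query, random.Random(0))
print(answer, tried)
```

The lookup only fails when every delegated nameserver is unreachable, which is why the records on every provider have to be kept identical: any one of them may end up answering a given query.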


Key Takeaways & Lessons Learned

  • Multi-provider DNS matters: Relying on a single DNS provider introduces systemic risk. Having parallel DNS providers dramatically improves resilience.

  • APIs make automation possible: By using management APIs from DNS providers, you can programmatically keep records in sync, reducing risk and manual error.

  • Registrar-level redundancy is powerful: Listing all DNS providers’ nameservers at the registrar ensures queries can be answered even if one or more providers go down.

  • Observability is critical: Continuous monitoring and reconciliation are key. If a provider drifts, you need to catch it early and not during an outage.

  • Practice for failure: The Cloudflare outage wasn’t due to an external attack; it was a configuration and scaling mistake. Even trusted providers can fail in unexpected ways. Hale Stack’s architecture treated DNS as a first-class critical service with redundancy.


Final Word


At Hale Stack, our goal is to make infrastructure invisible — in the best way. Our clients shouldn’t have to think about DNS when they’re focused on their core business. By building a resilient, automated, multi‑provider DNS solution, we turned what is often a vulnerability into a strength.


The recent Cloudflare outage was a real-world proof point: our design worked. DNS stayed up, failover happened seamlessly, and our clients stayed online.

If you’re interested in how Hale Stack can help you architect resilient DNS (or broader managed infrastructure), feel free to reach out. We’d be happy to walk you through our approach.

 
 
 



Copyright 2025 Hale Stack LLC

All Rights Reserved​
