Cloud & DevOps

DNS Configuration Is Still Breaking Production

Apple's latest macOS release broke custom DNS settings for thousands of developers. Specifically, it started intercepting queries for the .internal TLD — which many companies use for private network resources — and routing them to mDNS (multicast DNS) instead of the configured DNS resolver. Services that had worked fine for years suddenly couldn't be reached. The fix wasn't obvious. The error messages were unhelpful. And as usual with DNS problems, the symptoms looked like everything except DNS.

This is a story about DNS configuration — a topic that every developer encounters, few understand deeply, and everyone underestimates until it breaks something in production at 2 AM.

The .internal TLD Incident

Many organizations use .internal as a private TLD for internal services: api.internal, db.staging.internal, grafana.internal. It seemed like a safe choice — it's not a public TLD, it clearly signals 'this is internal,' and it's been used this way for decades.

Then Apple's new OS decided that .internal should be handled by mDNS — the same protocol that resolves myprinter.local on your home network. When your Mac tries to resolve api.internal, instead of asking your configured DNS server, it broadcasts an mDNS query on the local network. Your DNS server never sees the query. The response is either nothing (timeout) or the wrong address (if something on the local network happens to respond to that name).
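One way to catch this interception in the act is to watch both DNS ports while triggering the lookup. A sketch (requires root; `api.internal` stands in for whatever internal name is failing):

```shell
# Watch unicast DNS (port 53) and mDNS (port 5353) simultaneously.
# If the query for api.internal appears on 5353 as a multicast packet
# and never on 53, the system resolver is diverting it to mDNS.
sudo tcpdump -i any -n 'udp port 53 or udp port 5353'
# ...then, in another terminal, trigger a lookup:
ping -c 1 api.internal
```

Seeing the query on 5353 is the smoking gun: your configured DNS server was never consulted.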

The symptoms: curl api.internal hangs for 5 seconds then fails. SSH connections to internal hosts time out. VPN users can't reach internal resources. Docker containers can't resolve service names. And because the failure manifests as a generic connection timeout, most developers start debugging their VPN, their firewall, their application — everything except DNS.

How DNS Resolution Actually Works on Modern Systems

The 'DNS is simple' mental model — your computer asks a DNS server for an IP, the server responds — hasn't been accurate for years. Modern systems have a resolution pipeline with multiple stages, each of which can intercept, modify, or redirect queries.

DNS resolution pipeline on a typical macOS/Linux system:
Application calls getaddrinfo("api.internal")
↓
1. /etc/hosts — checked first. Static overrides.
↓
2. NSSwitch / resolver configuration
- Linux: /etc/nsswitch.conf controls lookup order
- macOS: /etc/resolver/* for per-domain DNS servers
↓
3. System resolver (systemd-resolved, mDNSResponder)
- May intercept .local, .internal, or other special TLDs
- May apply DNSSEC validation
- May cache responses
↓
4. DNS server (from DHCP, VPN, or manual config)
- Queries sent to configured upstream resolver
- May go through a corporate DNS proxy
↓
5. Recursive resolver (8.8.8.8, 1.1.1.1, corporate)
- Walks the DNS hierarchy: root → TLD → authoritative
↓
Response flows back up the chain

At each stage, something can go wrong. The system resolver might intercept a query before it reaches your configured DNS server. The VPN might override DNS settings. DHCP might push DNS servers that take precedence over your manual configuration. And caching at every level means that a fix might not take effect immediately.
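You can watch stage 1 win with only standard tools: getent walks the full NSS pipeline, so /etc/hosts answers before any DNS packet is sent, while dig skips /etc/hosts entirely.

```shell
# getent resolves through the full pipeline, so /etc/hosts wins:
getent hosts localhost   # answered from /etc/hosts — no DNS query sent
# dig, by contrast, goes straight to a DNS server and never reads
# /etc/hosts — which is why dig and your application can disagree
# about the very same hostname.
```

This difference matters when debugging: an application resolves the way getent does, not the way dig does.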

The /etc/resolv.conf Lie

On Linux, /etc/resolv.conf was historically the single source of truth for DNS configuration. Set your nameserver there, and the system uses it. Simple.

Modern Linux distributions have made this dramatically more complicated. If you're running systemd-resolved (which Ubuntu, Fedora, and most systemd-based distros do by default), /etc/resolv.conf is a symlink to a stub resolver at 127.0.0.53. Your actual DNS configuration lives in systemd-resolved's state, managed through resolvectl. Editing /etc/resolv.conf directly either gets overwritten on the next network change or breaks the stub resolver.

# What you THINK your DNS config is:
$ cat /etc/resolv.conf
nameserver 127.0.0.53  # This is systemd-resolved's stub
# What your DNS config ACTUALLY is:
$ resolvectl status
Global:
DNS Servers: 8.8.8.8 8.8.4.4
DNS Domain: ~.
Link 2 (eth0):
DNS Servers: 10.0.0.1  # From DHCP — this takes precedence
DNS Domain: corp.internal
Link 5 (wg0):  # WireGuard VPN
DNS Servers: 10.100.0.1
DNS Domain: ~internal  # VPN claims the .internal domain
# Three different DNS servers, each handling different domains.
# /etc/resolv.conf shows none of this.
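When the routing shown by resolvectl status is wrong, the fix is applied per link rather than by editing /etc/resolv.conf. A sketch, assuming a VPN interface wg0 and an internal DNS server at 10.100.0.1 (both examples):

```shell
# Route only *.internal queries to the VPN's DNS server (systemd-resolved).
# Interface name and server address are examples — substitute your own.
sudo resolvectl dns wg0 10.100.0.1
sudo resolvectl domain wg0 '~internal'  # '~' = routing domain, not a search domain
resolvectl query api.internal           # verify which link now handles the query
```

The `~` prefix is the important detail: without it, `internal` would also be appended to short hostnames as a search domain.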

Docker adds another layer. Docker containers get their own /etc/resolv.conf, typically copied from the host's configuration when the container starts. If the host's DNS changes (VPN connects, network switches), running containers keep the old DNS configuration until they're restarted. This is a common source of 'it works on restart but fails after a while' bugs.
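Two ways to pin container DNS so a host-side change doesn't strand running containers (server addresses and names below are examples):

```shell
# Per-container: override the copied host config at start time.
docker run --dns 10.0.0.1 --dns-search corp.internal myimage
# Daemon-wide: set defaults in /etc/docker/daemon.json, e.g.
#   { "dns": ["10.0.0.1", "8.8.8.8"] }
# Either way, containers started before a DNS change keep the old
# configuration until restarted:
docker restart my-service
```

Neither option makes running containers track host DNS changes; they only make the configuration explicit instead of inherited.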

Split DNS and VPN Conflicts

Split DNS — using different DNS servers for different domains — is standard in corporate environments. Your VPN routes *.corp.internal queries to the corporate DNS server while public queries go to your normal resolver. This works well when it's configured correctly and fails in confusing ways when it's not.

Common failure modes:

  • VPN pushes DNS servers as global defaults. Some VPN clients configure the VPN's DNS server as the default resolver for all queries, not just internal ones. All your DNS queries — including personal browsing — now go through the corporate DNS server. This is slow (because the corporate DNS is in another datacenter) and a privacy concern.
  • DNS server ordering conflicts. When multiple DNS servers are configured, the system tries them in order. If the first server is slow or unreachable (VPN disconnected but DNS config persists), every DNS query waits for a timeout before trying the second server. This adds 5-second delays to every connection.
  • Search domains append silently. A search domain of corp.internal means that a query for api also tries api.corp.internal. This is useful (type ssh api instead of ssh api.corp.internal) until it's not — when a short hostname collides with an internal service name, queries resolve to the wrong IP.
  • mDNS intercepts internal domains. As the .internal incident shows, the system resolver may intercept queries for certain TLDs before they reach any configured DNS server. This is particularly insidious because the interception happens silently — no error message, just a timeout or wrong response.
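On macOS, per-domain DNS routing can be declared with a file in /etc/resolver/. A sketch assuming an internal server at 10.100.0.1 (an example address); on releases that divert .internal to mDNS this may not win, so verify with scutil:

```shell
# Send *.internal queries to a specific server (macOS).
sudo mkdir -p /etc/resolver
printf 'nameserver 10.100.0.1\n' | sudo tee /etc/resolver/internal
# Confirm how the domain is actually routed:
scutil --dns | grep -B 1 -A 3 internal
```

Each file in /etc/resolver/ is named after the domain it covers and uses resolv.conf syntax.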

Debugging DNS: The Toolkit

When DNS breaks, you need tools that show you exactly what's happening at each stage of the resolution pipeline.

# 1. Check what your system actually resolves:
$ dig api.internal
# Shows: query sent to which server, response received
# The 'SERVER' line tells you which DNS server answered
# 2. Query a specific DNS server directly:
$ dig @10.100.0.1 api.internal
# Bypasses the system resolver — goes directly to the specified server
# If this works but 'dig api.internal' doesn't, the problem is
# in the system resolver, not the DNS server
# 3. Check systemd-resolved state (Linux):
$ resolvectl query api.internal
# Shows which interface/server handled the query
# 4. Check macOS DNS routing:
$ scutil --dns
# Shows per-domain DNS configuration on macOS
# Look for your .internal domain — is it being routed correctly?
# 5. Monitor DNS queries in real time:
$ sudo tcpdump -i any port 53
# See actual DNS packets. If no packets go out for your query,
# the system resolver is intercepting it locally.
# 6. Flush DNS cache (when fixing config):
$ sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder  # macOS
$ resolvectl flush-caches  # Linux with systemd-resolved

The most important diagnostic step: query your DNS server directly with dig @server hostname. If that works but normal resolution doesn't, the problem is between your application and the DNS server — system resolver interception, caching, or routing. If direct queries also fail, the problem is the DNS server itself or network connectivity to it.
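The SERVER line in dig's output is what identifies who actually answered. A small sketch of checking it programmatically — the sample line below is canned dig output showing the systemd-resolved stub address:

```shell
# Extract the answering server from a dig SERVER line.
sample=';; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)'
server=$(printf '%s\n' "$sample" | sed -n 's/.*SERVER: \([0-9.]*\)#.*/\1/p')
echo "$server"   # 127.0.0.53 — the local stub, not the upstream resolver
```

If this prints the stub address when you expected your real DNS server, the answer came from the local resolver's cache or interception, not from upstream.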

Choosing Your Private TLD

The .internal incident raises the practical question: what TLD should you use for internal services? There's no universally safe answer, but some options are better than others.

  • .internal — Formally reserved by ICANN for private use in 2024, and long listed in RFC 6762's appendix of commonly used private TLDs, but Apple's mDNS handling makes it problematic on macOS. May work fine on Linux.
  • .local — Reserved for mDNS (RFC 6762). Using it for DNS-resolved services will conflict with Bonjour/Avahi. Don't use it.
  • .corp, .home, .mail — Not reserved for private use. ICANN has so far held off delegating them as public TLDs because of collision risk, but your internal resolution would depend on that policy never changing. Don't use them.
  • Subdomain of a domain you own — internal.yourcompany.com. This is the safest option. You control the parent domain, so there's no risk of collision. The downside: it's longer to type.
  • .test, .example, .invalid, .localhost — Reserved by RFC 6761 and guaranteed never to be assigned as public TLDs. .test is the cleanest option for internal services that aren't production.

The boring-but-correct answer: use a subdomain of a domain you own. api.internal.yourcompany.com is unambiguous, won't collide with anything, and works correctly on every OS and DNS resolver. It's longer, but DNS problems at 2 AM are worse than typing a few extra characters.
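Serving such a subdomain internally is straightforward with any resolver. A minimal dnsmasq sketch, assuming made-up internal addresses and a public upstream:

```shell
# Answer internal.yourcompany.com names locally; forward the rest upstream.
# All addresses are examples — substitute your own.
cat <<'EOF' | sudo tee /etc/dnsmasq.d/internal.conf
address=/api.internal.yourcompany.com/10.0.1.20
address=/grafana.internal.yourcompany.com/10.0.1.21
server=8.8.8.8
EOF
sudo systemctl restart dnsmasq
```

Because the zone sits under a domain you own, the same records could later be published (or split-horizoned) with any DNS server without renaming anything.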

Lessons for Production Systems

Every DNS outage teaches the same lessons, and we keep relearning them.

  1. Monitor DNS resolution, not just DNS servers. Your DNS server being healthy doesn't mean resolution works. Monitor the full path: can your application actually resolve the hostnames it needs? A synthetic check that resolves critical hostnames every 30 seconds catches problems faster than server monitoring.
  2. Set short TTLs for internal DNS. When you need to change an internal DNS record, long TTLs mean cached stale data. 60-second TTLs for internal services give you fast failover with minimal resolver load.
  3. Test OS updates against your DNS setup. The .internal breakage only affected macOS users. Test your development environment after major OS updates — DNS resolution changes are rarely in the release notes.
  4. Document your DNS architecture. Which DNS servers handle which domains? Where are the split points? What happens when the VPN disconnects? This documentation is the first thing you'll need during an outage and the last thing anyone writes.
  5. Have a fallback. If DNS resolution fails for an internal service, can your application use a hardcoded IP? This isn't elegant, but a hardcoded IP that works is better than a DNS name that doesn't resolve.
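The synthetic check from lesson 1 can be as small as a shell function that exercises the same NSS lookup path applications use (the hostnames to monitor are placeholders; wire this into cron or your monitoring agent):

```shell
#!/bin/sh
# Synthetic resolution check: uses getent so it exercises the full
# pipeline (hosts file, system resolver, DNS), like a real application.
check_resolution() {
    if getent hosts "$1" > /dev/null; then
        echo "ok   $1"
    else
        echo "FAIL $1" >&2
        return 1
    fi
}
# Call with your real internal hostnames, e.g.:
#   check_resolution api.internal.yourcompany.com
check_resolution localhost
```

A nonzero exit from the script is your alert signal; the FAIL line names the hostname that stopped resolving.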

DNS is one of those fundamental services that's invisible when it works and catastrophic when it doesn't. The .internal breakage is a reminder that DNS configuration is more fragile than it looks — a single OS update can change resolution behavior for millions of users. Build your systems with the assumption that DNS will break, and you'll be less surprised when it does.