DNS feels simple right up until the day it ruins your rollout.
A certificate check fails because one resolver still sees the old record. Traffic keeps flowing to the previous load balancer longer than expected. One region works, another does not. Someone says, "DNS has not propagated yet," and the room nods even though nobody has clearly explained what that means.
That is usually a sign that the team is using DNS every day without carrying the right mental model.
DNS is not a magical global phone book that updates everywhere at once. It is a distributed, cached naming system with distinct roles, different failure modes, and enough old edge cases to surprise even experienced engineers.
Once you understand the roles and the caches, a lot of DNS stops feeling mystical.
Start with the two roles people blur together
Two kinds of systems matter most in normal DNS lookups:
- recursive resolvers
- authoritative name servers
They do very different jobs.
Recursive resolver
This is the system your machine, container, or VPC resolver usually talks to first.
Its job is to go find the answer on your behalf, cache it, and return it.
It is the "please figure this out for me" side of DNS.
Examples include:
- your ISP resolver
- a public resolver such as Google Public DNS or Cloudflare
- an enterprise or VPC resolver
Authoritative server
This is the system that serves the DNS data for a zone because it is responsible for that zone.
It is the "I am the source of truth for this name" side of DNS.
If your domain is managed by Route 53, Cloudflare DNS, NS1, or another DNS provider, those systems are acting as authoritative servers for your zone.
That distinction matters because one of the most common DNS misunderstandings is treating whichever resolver answered you as though it were the source of truth.
It usually is not. It is often serving you a cached answer.
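One quick way to feel the difference is to ask the same question twice: once through a recursive resolver and once against one of the zone's own authoritative servers. Below is a minimal sketch using dnspython (a third-party library); the record name, zone, and resolver IP are placeholders you would swap for your own.

```python
# Minimal sketch using dnspython (pip install dnspython). The record,
# zone, and resolver IP below are placeholders; substitute your own.
import dns.resolver

NAME = "www.example.com"    # hypothetical record to look up
ZONE = "example.com"        # the zone that record lives in
RECURSIVE_IP = "8.8.8.8"    # whichever recursive resolver you want to inspect

def query_via(server_ip: str, name: str):
    """Send the query to one specific server and return the answer RRset."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [server_ip]
    return r.resolve(name, "A").rrset

# 1. Ask a recursive resolver. It may answer from cache, so the TTL you see
#    can already be partly counted down.
print("recursive    :", query_via(RECURSIVE_IP, NAME))

# 2. Find an authoritative server for the zone and ask it directly.
#    Its answer reflects what is actually published for the zone right now.
ns_name = str(dns.resolver.resolve(ZONE, "NS").rrset[0].target)
ns_ip = dns.resolver.resolve(ns_name, "A").rrset[0].address
print("authoritative:", query_via(ns_ip, NAME))
```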
What actually happens during a lookup
At a high level, a normal lookup looks like this:
- an application asks the local system to resolve a name
- the local system sends the query to a recursive resolver
- if the recursive resolver already has a fresh cached answer, it returns it
- if not, it walks the DNS hierarchy until it reaches the authoritative servers for the relevant zone
- it caches the answer and returns it to the client
That "walks the hierarchy" part is where DNS gets its shape.
Very roughly, a resolver may learn:
- which servers know about the root
- which servers know about the top-level domain, such as .com
- which authoritative servers know about example.com
- what the actual record for api.example.com is
In practice, recursive resolvers cache heavily, so most lookups are not full walks from the top every time.
That is why caching is not a side feature in DNS. It is the reason DNS can scale.
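If you want to see that walk happen, you can drive it by hand with dnspython's low-level query API. This is only a sketch: the root-server address and the example name are assumptions, error handling is omitted, and a real resolver handles far more cases (TCP fallback, CNAMEs, retries, DNSSEC) than this does.

```python
# A hand-rolled version of the hierarchy walk a recursive resolver performs.
# The root-server IP and the name are assumptions; error handling is omitted.
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

NAME = "www.example.com"    # hypothetical name to resolve
server = "198.41.0.4"       # a.root-servers.net

for _ in range(10):         # guard against referral loops
    query = dns.message.make_query(NAME, dns.rdatatype.A)
    response = dns.query.udp(query, server, timeout=5)

    if response.answer:
        # A server that is authoritative for the name answered.
        for rrset in response.answer:
            print("answer:", rrset)
        break

    # Otherwise this is a referral: pick the next server to ask.
    glue = [rr for rrset in response.additional
            if rrset.rdtype == dns.rdatatype.A
            for rr in rrset]
    if glue:
        server = glue[0].address
    else:
        # No glue: resolve one of the referred NS names separately
        # (a real resolver does this itself).
        ns_name = str(response.authority[0][0].target)
        server = dns.resolver.resolve(ns_name, "A").rrset[0].address
    print("referred to", server)
```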
Why caching changes everything
Engineers often talk about DNS records as if changing a record updates the world.
It does not.
Changing a record updates the authoritative source of truth. Recursive resolvers and clients may still hold older cached answers until their cached entries expire.
That is why DNS changes can look inconsistent for a while:
- one resolver has the new answer
- another still has the old one
- a client OS may have its own cache
- a browser may also have cached behavior layered on top
This is the real idea behind what people casually call "propagation."
The record is not slowly floating across the internet. More often, the new answer already exists at the authoritative source, but different caches are aging out at different times.
That is a much more useful model than treating propagation like weather.
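A small sketch makes that visible: ask several recursive resolvers the same question and compare what each one is serving right now. This assumes dnspython, and the name and resolver IPs are placeholders.

```python
# Sketch: ask several recursive resolvers the same question and compare.
# The name and resolver IPs are placeholders.
import dns.resolver

NAME = "www.example.com"
RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1"}

for label, ip in RESOLVERS.items():
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    rrset = r.resolve(NAME, "A").rrset
    # After a change, one resolver may already serve the new value while
    # another keeps its cached copy until the TTL runs out.
    addresses = ", ".join(rd.address for rd in rrset)
    print(f"{label:<12} ttl={rrset.ttl:<6} {addresses}")
```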
TTL is a control knob, not a guarantee
Each record carries a TTL (time to live) that tells caches how long they may retain the answer.
The important word there is "may."
TTL strongly influences cache lifetime, but DNS behavior in the real world can still involve:
- stub resolver caches
- operating system caches
- application-level caches
- resolvers with policy behavior you do not control
So a low TTL helps changes move faster, but it does not mean every client will observe the new answer instantly at TTL expiration.
Still, TTL matters a lot. It is how you trade off:
- faster reaction to changes and failovers
- lower query load and better cache efficiency
Low TTLs increase agility but reduce cache efficiency. High TTLs reduce load but make stale answers linger longer during changes.
That is an operational tradeoff, not just a record field.
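One way to internalize the tradeoff is to watch a cache age: query the same recursive resolver repeatedly and observe the remaining TTL tick down, then jump back up when the resolver re-fetches from the authoritative side. A rough sketch, assuming dnspython and a placeholder name and resolver:

```python
# Sketch: watch a cached answer age out at one recursive resolver.
# Name and resolver IP are placeholders; assumes dnspython.
import time
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]

for _ in range(6):
    rrset = resolver.resolve("www.example.com", "A").rrset
    # The TTL in the response is the remaining cache lifetime at this
    # resolver, not necessarily the TTL configured on the record.
    print("remaining ttl:", rrset.ttl)
    time.sleep(10)
```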
Negative caching matters too
Caching is not only for successful answers.
Resolvers can also cache negative answers, like "this name does not exist."
That matters because a bad rollout can teach resolvers that a name is absent, and they may continue believing that for a while even after you add the record.
Teams often forget negative caching until they hit exactly this problem and wonder why adding the record did not fix things immediately.
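If you want to see how long a missing name may be remembered, look at the SOA record in the authority section of the NXDOMAIN response; per RFC 2308, the negative-cache lifetime is bounded by the smaller of that SOA record's TTL and its minimum field. A sketch, with a made-up missing name and a placeholder resolver, assuming dnspython:

```python
# Sketch: how long may an NXDOMAIN be cached? Per RFC 2308, the negative
# TTL is bounded by min(SOA record TTL, SOA minimum) from the authority
# section. The missing name and resolver IP are placeholders.
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

MISSING = "definitely-not-there.example.com"   # hypothetical missing name
RESOLVER_IP = "8.8.8.8"

query = dns.message.make_query(MISSING, dns.rdatatype.A)
response = dns.query.udp(query, RESOLVER_IP, timeout=5)

if response.rcode() == dns.rcode.NXDOMAIN:
    for rrset in response.authority:
        if rrset.rdtype == dns.rdatatype.SOA:
            soa = rrset[0]
            negative_ttl = min(rrset.ttl, soa.minimum)
            print(f"this NXDOMAIN may be cached for up to {negative_ttl}s")
```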
Why NXDOMAIN and SERVFAIL are not the same
This distinction saves a lot of debugging time.
NXDOMAIN usually means the name does not exist.
SERVFAIL usually means the resolver could not successfully answer, often because something failed upstream or the authoritative side was broken in some way.
Those are very different failure modes.
If you treat them the same, you can waste a lot of time debugging the wrong layer.
As a rough rule:
- NXDOMAIN points you toward naming or record existence
- SERVFAIL points you toward resolution or authoritative health problems
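In code, keeping the failure modes separate is mostly a matter of not collapsing them into one generic error path. A sketch using the exceptions dnspython raises, with a placeholder name; mapping NoNameservers to "SERVFAIL-ish" is a simplification, since it fires whenever every configured server failed to give a usable answer:

```python
# Sketch: keep DNS failure modes distinct instead of catching one generic
# error. The name is a placeholder; assumes dnspython.
import dns.exception
import dns.resolver

def classify(name: str) -> str:
    try:
        dns.resolver.resolve(name, "A", lifetime=5)
        return "resolved"
    except dns.resolver.NXDOMAIN:
        return "NXDOMAIN: check naming, records, zones"
    except dns.resolver.NoNameservers:
        # Raised when all servers failed to answer usefully (often SERVFAIL).
        return "SERVFAIL-ish: check resolver and authoritative health"
    except dns.exception.Timeout:
        return "timeout: check network path and server reachability"

print(classify("www.example.com"))
```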
CNAMEs, aliases, and the shape of the answer
Not every DNS answer is a direct address record.
Sometimes the answer is: "this name is actually another name, go resolve that next."
That is what a CNAME does.
This matters because:
- one lookup can turn into multiple lookups
- extra indirection affects caching and debugging
- some providers flatten or alias records at zone apexes in provider-specific ways
The practical lesson is simple: a name that looks straightforward in a dashboard may still resolve through several layers before an address is returned.
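One low-effort habit that helps: print the whole answer section rather than just the final address, so any CNAME hops are visible. A sketch with dnspython and a placeholder name:

```python
# Sketch: print the whole answer section so CNAME hops are visible.
# The name is a placeholder; assumes dnspython.
import dns.rdatatype
import dns.resolver

answer = dns.resolver.resolve("www.example.com", "A")

for rrset in answer.response.answer:
    kind = dns.rdatatype.to_text(rrset.rdtype)
    values = ", ".join(str(rd) for rd in rrset)
    print(f"{rrset.name} {rrset.ttl} {kind} {values}")
# A chain like  name -> CNAME -> other-name -> A  means each hop has its
# own TTL and can be cached (and go stale) independently.
```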
Why backend and platform engineers should care
DNS is not just a browser concern.
It shows up in:
- load balancer cutovers
- service discovery
- SMTP delivery through MX records
- certificate validation flows
- failover design
- CDN and edge routing
- database and cache endpoint naming
If you change traffic destinations with DNS, you are making a change whose rollout behavior is controlled by caches you do not fully own.
That is why DNS requires different operational instincts than a normal config push.
Where the "propagation" myth causes trouble
The phrase "wait for propagation" is often used as a placeholder for several different realities:
- recursive resolvers still have old cached data
- clients are using different recursive resolvers
- the old answer is still inside application or OS caches
- the authoritative change was made in one place but not another
- the domain is delegated incorrectly
- negative caching is still in effect
Saying "propagation" can be fine as shorthand, but it becomes harmful when it replaces actual diagnosis.
A better question is:
Which layer still has the old view of reality?
That question usually gets you somewhere useful.
Split-horizon DNS is powerful and confusing
Sometimes the same name intentionally resolves differently depending on where the query comes from.
That can happen for internal versus external clients, for private service discovery, or for environment-specific routing.
This is often called split-horizon DNS.
It is useful, but it also creates debugging traps:
- the name resolves one way on your laptop
- another way in production
- and a third way from inside a cluster or VPC
If you forget that DNS answers can depend on the query path, logs from different places can look contradictory even when the system is behaving as designed.
What usually breaks in production
The recurring DNS failures are more mundane than people expect.
Common ones:
- wrong record value
- wrong zone or delegation
- stale caches after a change
- low TTL assumed to mean instant cutover
- negative caching after a missing record rollout
- split-horizon confusion
- resolvers returning different cached views during failover
- relying on DNS round-robin as though it were strong load balancing
The pattern is familiar: the failure is often not "DNS is random." It is "the naming system is cached and distributed, and we forgot that during the change."
What to measure and verify
DNS can feel slippery because engineers often inspect only one viewpoint.
Useful checks include:
- querying multiple recursive resolvers
- querying authoritative servers directly
- checking TTL values on returned answers
- checking whether you are seeing cached or authoritative results
- checking whether failures are NXDOMAIN, SERVFAIL, or timeout-driven
- verifying delegation and NS records
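A short script can combine several of those checks. The sketch below (dnspython, placeholder zone and record names) looks up the zone's NS set and then asks each authoritative server directly, which helps catch a change that was published in one place but not another.

```python
# Sketch: check delegation and per-server consistency. Look up the zone's
# NS set, then ask each authoritative server directly and compare answers.
# Zone and record names are placeholders; assumes dnspython.
import dns.resolver

ZONE = "example.com"
NAME = "www.example.com"

ns_names = [str(rd.target) for rd in dns.resolver.resolve(ZONE, "NS")]
print("delegated to:", ns_names)

for ns in ns_names:
    ns_ip = dns.resolver.resolve(ns, "A").rrset[0].address
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ns_ip]
    rrset = r.resolve(NAME, "A").rrset
    # Differences between servers usually mean a change was published in
    # one place but not another, or delegation points somewhere stale.
    print(f"{ns:<28} ttl={rrset.ttl:<6} {[rd.address for rd in rrset]}")
```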
For operational dashboards, helpful signals often include:
- resolver error rates
- authoritative query latency
- record change frequency
- cache hit rates if you run recursive infrastructure
- regional differences during planned cutovers
The main point is not to ask "what does DNS say?" as though there were one answer. The main point is to ask who answered, from which cache, at what time.
The mental model worth keeping
If you only want the durable version, keep this:
Authoritative servers hold the source of truth for a zone.
Recursive resolvers go fetch that truth and cache it.
Clients usually talk to the recursive side, not the authoritative side.
Changes appear inconsistent because caches expire at different times.
Most so-called propagation issues are really caching and delegation issues wearing a vague name.
That is why DNS can feel both simple and slippery. Authority over each zone may be centralized, but the observed answers are distributed through layers of caches and query paths.
If you keep one sentence from this post, keep this one:
DNS is a cached distributed naming system, not a globally synchronized database.