DNS feels simple right up until the day it ruins your rollout.
A certificate check fails because one resolver still sees the old record. Traffic keeps flowing to the previous load balancer longer than expected. One region works, another does not. Someone says, "DNS has not propagated yet," and the room nods even though nobody has clearly explained what that means.
That is usually a sign that the team is using DNS every day without carrying the right mental model.
DNS is not a magical global phone book that updates everywhere at once. It is a distributed, cached naming system with distinct roles, different failure modes, and enough old edge cases to surprise even experienced engineers.
Once you understand the roles and the caches, a lot of DNS stops feeling mystical.
Start with the two roles people blur together
Two kinds of systems matter most in normal DNS lookups:
- recursive resolvers
- authoritative name servers
They do very different jobs.
Recursive resolver
This is the system your machine, container, or VPC resolver usually talks to first.
Its job is to go find the answer on your behalf, cache it, and return it.
It is the "please figure this out for me" side of DNS.
Examples include:
- your ISP resolver
- a public resolver such as Google Public DNS or Cloudflare
- an enterprise or VPC resolver
Authoritative server
This is the system that serves the DNS data for a zone because it is responsible for that zone.
It is the "I am the source of truth for this name" side of DNS.
If your domain is managed by Route 53, Cloudflare DNS, NS1, or another DNS provider, those systems are acting as authoritative servers for your zone.
That distinction matters because one of the most common DNS misunderstandings is treating whichever resolver answered you as though it were the source of truth.
It usually is not. It is often serving you a cached answer.
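One quick way to feel the difference is to ask the same question twice: once through a recursive resolver and once against one of the zone's own authoritative servers. Below is a minimal sketch using dnspython (a third-party library); the record name, zone, and resolver IP are placeholders you would swap for your own.

```python
# Minimal sketch using dnspython (pip install dnspython). The record,
# zone, and resolver IP below are placeholders; substitute your own.
import dns.resolver

NAME = "www.example.com"    # hypothetical record to look up
ZONE = "example.com"        # the zone that record lives in
RECURSIVE_IP = "8.8.8.8"    # whichever recursive resolver you want to inspect

def query_via(server_ip: str, name: str):
    """Send the query to one specific server and return the answer RRset."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [server_ip]
    return r.resolve(name, "A").rrset

# 1. Ask a recursive resolver. It may answer from cache, so the TTL you see
#    can already be partly counted down.
print("recursive    :", query_via(RECURSIVE_IP, NAME))

# 2. Find an authoritative server for the zone and ask it directly.
#    Its answer reflects what is actually published for the zone right now.
ns_name = str(dns.resolver.resolve(ZONE, "NS").rrset[0].target)
ns_ip = dns.resolver.resolve(ns_name, "A").rrset[0].address
print("authoritative:", query_via(ns_ip, NAME))
```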
What actually happens during a lookup
At a high level, a normal lookup looks like this:
- an application asks the local system to resolve a name
- the local system sends the query to a recursive resolver
- if the recursive resolver already has a fresh cached answer, it returns it
- if not, it walks the DNS hierarchy until it reaches the authoritative servers for the relevant zone
- it caches the answer and returns it to the client
That "walks the hierarchy" part is where DNS gets its shape.
Very roughly, a resolver may learn:
- which servers know about the root
- which servers know about the top-level domain, such as .com
- which authoritative servers know about example.com
- what the actual record for api.example.com is
In practice, recursive resolvers cache heavily, so most lookups are not full walks from the top every time.
That is why caching is not a side feature in DNS. It is the reason DNS can scale.
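If you want to see that walk happen, you can drive it by hand with dnspython's low-level query API. This is only a sketch: the root-server address and the example name are assumptions, error handling is omitted, and a real resolver handles far more cases (TCP fallback, CNAMEs, retries, DNSSEC) than this does.

```python
# A hand-rolled version of the hierarchy walk a recursive resolver performs.
# The root-server IP and the name are assumptions; error handling is omitted.
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

NAME = "www.example.com"    # hypothetical name to resolve
server = "198.41.0.4"       # a.root-servers.net

for _ in range(10):         # guard against referral loops
    query = dns.message.make_query(NAME, dns.rdatatype.A)
    response = dns.query.udp(query, server, timeout=5)

    if response.answer:
        # A server that is authoritative for the name answered.
        for rrset in response.answer:
            print("answer:", rrset)
        break

    # Otherwise this is a referral: pick the next server to ask.
    glue = [rr for rrset in response.additional
            if rrset.rdtype == dns.rdatatype.A
            for rr in rrset]
    if glue:
        server = glue[0].address
    else:
        # No glue: resolve one of the referred NS names separately
        # (a real resolver does this itself).
        ns_name = str(response.authority[0][0].target)
        server = dns.resolver.resolve(ns_name, "A").rrset[0].address
    print("referred to", server)
```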
Why caching changes everything
Engineers often talk about DNS records as if changing a record updates the world.
It does not.
Changing a record updates the authoritative source of truth. Recursive resolvers and clients may still hold older cached answers until their cached entries expire.
That is why DNS changes can look inconsistent for a while:
- one resolver has the new answer
- another still has the old one
- a client OS may have its own cache
- a browser may also have cached behavior layered on top
This is the real idea behind what people casually call "propagation."
The record is not slowly floating across the internet. More often, the new answer already exists at the authoritative source, but different caches are aging out at different times.
That is a much more useful model than treating propagation like weather.
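A small sketch makes that visible: ask several recursive resolvers the same question and compare what each one is serving right now. This assumes dnspython, and the name and resolver IPs are placeholders.

```python
# Sketch: ask several recursive resolvers the same question and compare.
# The name and resolver IPs are placeholders.
import dns.resolver

NAME = "www.example.com"
RESOLVERS = {"google": "8.8.8.8", "cloudflare": "1.1.1.1"}

for label, ip in RESOLVERS.items():
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ip]
    rrset = r.resolve(NAME, "A").rrset
    # After a change, one resolver may already serve the new value while
    # another keeps its cached copy until the TTL runs out.
    addresses = ", ".join(rd.address for rd in rrset)
    print(f"{label:<12} ttl={rrset.ttl:<6} {addresses}")
```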
TTL is a control knob, not a guarantee
Each record carries a TTL (time to live) that tells caches how long they may retain the answer.
The important word there is "may."
TTL strongly influences cache lifetime, but DNS behavior in the real world can still involve:
- stub resolver caches
- operating system caches
- application-level caches
- resolvers with policy behavior you do not control
So a low TTL helps changes move faster, but it does not mean every client will observe the new answer instantly at TTL expiration.
Still, TTL matters a lot. It is how you trade off:
- faster reaction to changes and failovers
- lower query load and better cache efficiency
Low TTLs increase agility but reduce cache efficiency. High TTLs reduce load but make stale answers linger longer during changes.
That is an operational tradeoff, not just a record field.
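One way to internalize the tradeoff is to watch a cache age: query the same recursive resolver repeatedly and observe the remaining TTL tick down, then jump back up when the resolver re-fetches from the authoritative side. A rough sketch, assuming dnspython and a placeholder name and resolver:

```python
# Sketch: watch a cached answer age out at one recursive resolver.
# Name and resolver IP are placeholders; assumes dnspython.
import time
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]

for _ in range(6):
    rrset = resolver.resolve("www.example.com", "A").rrset
    # The TTL in the response is the remaining cache lifetime at this
    # resolver, not necessarily the TTL configured on the record.
    print("remaining ttl:", rrset.ttl)
    time.sleep(10)
```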
Negative caching matters too
Caching is not only for successful answers.
Resolvers can also cache negative answers, like "this name does not exist."
That matters because a bad rollout can teach resolvers that a name is absent, and they may continue believing that for a while even after you add the record.
Teams often forget negative caching until they hit exactly this problem and wonder why adding the record did not fix things immediately.
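If you want to see how long a missing name may be remembered, look at the SOA record in the authority section of the NXDOMAIN response; per RFC 2308, the negative-cache lifetime is bounded by the smaller of that SOA record's TTL and its minimum field. A sketch, with a made-up missing name and a placeholder resolver, assuming dnspython:

```python
# Sketch: how long may an NXDOMAIN be cached? Per RFC 2308, the negative
# TTL is bounded by min(SOA record TTL, SOA minimum) from the authority
# section. The missing name and resolver IP are placeholders.
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

MISSING = "definitely-not-there.example.com"   # hypothetical missing name
RESOLVER_IP = "8.8.8.8"

query = dns.message.make_query(MISSING, dns.rdatatype.A)
response = dns.query.udp(query, RESOLVER_IP, timeout=5)

if response.rcode() == dns.rcode.NXDOMAIN:
    for rrset in response.authority:
        if rrset.rdtype == dns.rdatatype.SOA:
            soa = rrset[0]
            negative_ttl = min(rrset.ttl, soa.minimum)
            print(f"this NXDOMAIN may be cached for up to {negative_ttl}s")
```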
Why NXDOMAIN and SERVFAIL are not the same
This distinction saves a lot of debugging time.
NXDOMAIN usually means the name does not exist.
SERVFAIL usually means the resolver could not successfully answer, often because something failed upstream or the authoritative side was broken in some way.
Those are very different failure modes.
If you treat them the same, you can waste a lot of time debugging the wrong layer.
As a rough rule:
- NXDOMAIN points you toward naming or record existence
- SERVFAIL points you toward resolution or authoritative health problems
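In code, keeping the failure modes separate is mostly a matter of not collapsing them into one generic error path. A sketch using the exceptions dnspython raises, with a placeholder name; mapping NoNameservers to "SERVFAIL-ish" is a simplification, since it fires whenever every configured server failed to give a usable answer:

```python
# Sketch: keep DNS failure modes distinct instead of catching one generic
# error. The name is a placeholder; assumes dnspython.
import dns.exception
import dns.resolver

def classify(name: str) -> str:
    try:
        dns.resolver.resolve(name, "A", lifetime=5)
        return "resolved"
    except dns.resolver.NXDOMAIN:
        return "NXDOMAIN: check naming, records, zones"
    except dns.resolver.NoNameservers:
        # Raised when all servers failed to answer usefully (often SERVFAIL).
        return "SERVFAIL-ish: check resolver and authoritative health"
    except dns.exception.Timeout:
        return "timeout: check network path and server reachability"

print(classify("www.example.com"))
```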
CNAMEs, aliases, and the shape of the answer
Not every DNS answer is a direct address record.
Sometimes the answer is: "this name is actually another name, go resolve that next."
That is what a CNAME does.
This matters because:
- one lookup can turn into multiple lookups
- extra indirection affects caching and debugging
- some providers flatten or alias records at zone apexes in provider-specific ways
The practical lesson is simple: a name that looks straightforward in a dashboard may still resolve through several layers before an address is returned.
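One low-effort habit that helps: print the whole answer section rather than just the final address, so any CNAME hops are visible. A sketch with dnspython and a placeholder name:

```python
# Sketch: print the whole answer section so CNAME hops are visible.
# The name is a placeholder; assumes dnspython.
import dns.rdatatype
import dns.resolver

answer = dns.resolver.resolve("www.example.com", "A")

for rrset in answer.response.answer:
    kind = dns.rdatatype.to_text(rrset.rdtype)
    values = ", ".join(str(rd) for rd in rrset)
    print(f"{rrset.name} {rrset.ttl} {kind} {values}")
# A chain like  name -> CNAME -> other-name -> A  means each hop has its
# own TTL and can be cached (and go stale) independently.
```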
Why backend and platform engineers should care
DNS is not just a browser concern.
It shows up in:
- load balancer cutovers
- service discovery
- SMTP delivery through MX records
- certificate validation flows
- failover design
- CDN and edge routing
- database and cache endpoint naming
If you change traffic destinations with DNS, you are making a change whose rollout behavior is controlled by caches you do not fully own.
That is why DNS requires different operational instincts than a normal config push.
Where the "propagation" myth causes trouble
The phrase "wait for propagation" is often used as a placeholder for several different realities:
- recursive resolvers still have old cached data
- clients are using different recursive resolvers
- the old answer is still inside application or OS caches
- the authoritative change was made in one place but not another
- the domain is delegated incorrectly
- negative caching is still in effect
Saying "propagation" can be fine as shorthand, but it becomes harmful when it replaces actual diagnosis.
A better question is:
Which layer still has the old view of reality?
That question usually gets you somewhere useful.
Split-horizon DNS is powerful and confusing
Sometimes the same name intentionally resolves differently depending on where the query comes from.
That can happen for internal versus external clients, for private service discovery, or for environment-specific routing.
This is often called split-horizon DNS.
It is useful, but it also creates debugging traps:
- the name resolves one way on your laptop
- another way in production
- and a third way from inside a cluster or VPC
If you forget that DNS answers can depend on the query path, logs from different places can look contradictory even when the system is behaving as designed.
What usually breaks in production
The recurring DNS failures are more mundane than people expect.
Common ones:
- wrong record value
- wrong zone or delegation
- stale caches after a change
- low TTL assumed to mean instant cutover
- negative caching after a missing record rollout
- split-horizon confusion
- resolvers returning different cached views during failover
- relying on DNS round-robin as though it were strong load balancing
The pattern is familiar: the failure is often not "DNS is random." It is "the naming system is cached and distributed, and we forgot that during the change."
What to measure and verify
DNS can feel slippery because engineers often inspect only one viewpoint.
Useful checks include:
- querying multiple recursive resolvers
- querying authoritative servers directly
- checking TTL values on returned answers
- checking whether you are seeing cached or authoritative results
- checking whether failures are NXDOMAIN, SERVFAIL, or timeout-driven
- verifying delegation and NS records
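A short script can combine several of those checks. The sketch below (dnspython, placeholder zone and record names) looks up the zone's NS set and then asks each authoritative server directly, which helps catch a change that was published in one place but not another.

```python
# Sketch: check delegation and per-server consistency. Look up the zone's
# NS set, then ask each authoritative server directly and compare answers.
# Zone and record names are placeholders; assumes dnspython.
import dns.resolver

ZONE = "example.com"
NAME = "www.example.com"

ns_names = [str(rd.target) for rd in dns.resolver.resolve(ZONE, "NS")]
print("delegated to:", ns_names)

for ns in ns_names:
    ns_ip = dns.resolver.resolve(ns, "A").rrset[0].address
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [ns_ip]
    rrset = r.resolve(NAME, "A").rrset
    # Differences between servers usually mean a change was published in
    # one place but not another, or delegation points somewhere stale.
    print(f"{ns:<28} ttl={rrset.ttl:<6} {[rd.address for rd in rrset]}")
```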
For operational dashboards, helpful signals often include:
- resolver error rates
- authoritative query latency
- record change frequency
- cache hit rates if you run recursive infrastructure
- regional differences during planned cutovers
The main point is not to ask "what does DNS say?" as though there were one answer. The main point is to ask who answered, from which cache, at what time.
The mental model worth keeping
If you only want the durable version, keep this:
Authoritative servers hold the source of truth for a zone.
Recursive resolvers go fetch that truth and cache it.
Clients usually talk to the recursive side, not the authoritative side.
Changes appear inconsistent because caches expire at different times.
Most so-called propagation issues are really caching and delegation issues wearing a vague name.
That is why DNS can feel both simple and slippery. Authority over each zone may be centralized, but the observed answers are distributed through layers of caches and query paths.
If you keep one sentence from this post, keep this one:
DNS is a cached distributed naming system, not a globally synchronized database.