This blog post has been making the rounds. Since it is about a
sequence of DNS operational failures, it seems somewhat relevant here.

https://slack.engineering/what-happened-during-slacks-dnssec-rollout/

tl;dr: the first try was rolled back due to what turned out to be an
unrelated failure at some ISP.

second try was rolled back when they found they had a CNAME at a zone
apex, which they had never noticed until it caused DNSSEC validation
errors.
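
A routine pre-flight check would have caught that one. As a rough
sketch (Python with dnspython; the zone name is just a placeholder,
not anything from the post), query the apex explicitly for a CNAME and
treat any answer as a failure, since a CNAME can't coexist with the
SOA and NS records the apex has to carry:

    import dns.resolver

    def cname_at_apex(zone: str) -> bool:
        # Any CNAME answer at the apex is a misconfiguration;
        # NODATA or NXDOMAIN means there is no CNAME there.
        try:
            dns.resolver.resolve(zone, "CNAME")
            return True
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            return False

    print(cname_at_apex("example.com"))   # placeholder zone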

third try was rolled back when they found random-looking failures that
they eventually tracked down to bugs in Amazon's Route 53 DNS server.
They had a wildcard with A but not AAAA records. When someone did an
AAAA query, the response was wrong: it said there were no records at
all for the name, not just no AAAA records. This caused failures for
8.8.8.8 clients, since Google does aggressive NSEC caching, but not
for 1.1.1.1 clients, because Cloudflare doesn't.
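
That symptom is also easy to probe for. Another rough dnspython sketch
of the sort of routine test that would have caught it: ask a resolver
that does aggressive NSEC caching (8.8.8.8 below) for the AAAA first,
then make sure the A still resolves; the hostname is a placeholder for
any wildcard-covered, A-only name:

    import dns.resolver

    def wildcard_survives_aaaa_probe(name: str,
                                     resolver_ip: str = "8.8.8.8") -> bool:
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [resolver_ip]
        try:
            r.resolve(name, "AAAA")       # NODATA is the correct answer here
        except dns.resolver.NoAnswer:
            pass
        except dns.resolver.NXDOMAIN:
            return False                  # the wildcard should make the name exist
        try:
            r.resolve(name, "A")          # the A records must still be visible
            return True
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            return False                  # A records gone: the failure mode

    print(wildcard_survives_aaaa_probe("something.example.com"))  # placeholder

Of course it only shows anything once the zone is signed and the
resolver is validating, which is presumably part of why the failures
looked so random.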

They also got some bad advice, e.g., yes the .COM zone adds and
deletes records very quickly, but that doesn't mean you can unpublish
a DS and just turn off DNSSEC right away, because the DS TTL is a day
and validators keep using cached copies until it expires. Their tooling
somehow didn't let them republish the DNSKEY at the zone apex that
matched the DS, only a new one that didn't.
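
That last one is checkable too: fetch the DS set and the apex DNSKEY
set and confirm every published DS matches a key actually being
served. Rough dnspython sketch again, placeholder zone name, and it
assumes both RRsets exist:

    import dns.dnssec
    import dns.name
    import dns.resolver

    DIGESTS = {1: "SHA1", 2: "SHA256", 4: "SHA384"}

    def every_ds_has_a_dnskey(zone: str) -> bool:
        name = dns.name.from_text(zone)
        dnskeys = dns.resolver.resolve(zone, "DNSKEY")
        for ds in dns.resolver.resolve(zone, "DS"):
            algo = DIGESTS.get(ds.digest_type)
            if algo is None:
                continue                  # digest type we can't recompute; skip
            if not any(dns.dnssec.make_ds(name, key, algo) == ds
                       for key in dnskeys):
                return False              # a DS in the parent with no matching DNSKEY
        return True

    print(every_ds_has_a_dnskey("example.com"))  # placeholder zone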

It is clear from the blog post that this is a fairly sophisticated
group of ops people, who had a reasonable test plan, a bunch of test
points set up in dnsviz, and so forth.  Neither of these bugs seems
very exotic, and both could have been caught by routine tests.

Can or should we offer advice on how to do this better, sort of like
RFC 8901 but one level of DNS expertise down?

R's,
John
