This blog post has been making the rounds. Since it is about a sequence of DNS operational failures, it seems somewhat relevant here.
https://slack.engineering/what-happened-during-slacks-dnssec-rollout/

tl;dr:

- The first try was rolled back due to what turned out to be an unrelated failure at some ISP.

- The second try was rolled back when they found they had a CNAME at a zone apex, which they had never noticed until it caused DNSSEC validation errors.

- The third try was rolled back when they hit random-looking failures that they eventually tracked down to bugs in Amazon's Route 53 DNS server. They had a wildcard with A but not AAAA records. When someone did an AAAA query, the response was wrong: it said there were no records at all, not just no AAAA records. That caused failures for clients of 8.8.8.8, since Google does aggressive NSEC, but not for clients of 1.1.1.1, because Cloudflare doesn't.

They also got some bad advice. Yes, the .COM zone adds and deletes records very quickly, but that doesn't mean you can unpublish a DS and just turn off DNSSEC, because the DS TTL is a day. On top of that, their tooling somehow didn't let them republish the DNSKEY at the zone apex that matched the DS, only a new one that didn't.

It is clear from the blog post that this is a fairly sophisticated group of ops people, who had a reasonable test plan, a bunch of test points set up in dnsviz, and so forth. Neither of these bugs seems very exotic, and both could have been caught by routine tests (I've sketched a few such checks below).

Can or should we offer advice on how to do this better, sort of like RFC 8901 but one level of DNS expertise down?

R's,
John
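P.S. Here is roughly what I mean by routine tests, as dnspython sketches. None of this is Slack's or Amazon's tooling, and the zone and server names are placeholders. First, flagging a CNAME at the zone apex before you sign, since a CNAME can't coexist with the SOA and NS records the apex has to have:

    import dns.resolver

    def apex_has_cname(zone: str) -> bool:
        # A CNAME at the apex can't coexist with the SOA/NS RRsets that
        # must live there, and it will trip DNSSEC validators.
        try:
            dns.resolver.resolve(zone, "CNAME")
            return True
        except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
            return False

    print(apex_has_cname("example.com"))   # should print False for a sane zone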
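Second, the wildcard problem: for a name covered by a wildcard that has A but no AAAA records, the signed negative answer to an AAAA query has to say "no AAAA here", not "nothing here at all". A rough way to check that at the authoritative server (reading the type bitmap out of the rdata text is a heuristic, but good enough for a smoke test):

    import dns.message
    import dns.query
    import dns.rdatatype

    def negative_aaaa_still_admits_a(name: str, auth_server_ip: str) -> bool:
        # Ask the authoritative server for AAAA with DNSSEC records included.
        q = dns.message.make_query(name, "AAAA", want_dnssec=True)
        resp = dns.query.tcp(q, auth_server_ip, timeout=5)
        if resp.answer:
            return True   # the name really has AAAA records, nothing to prove
        # The NSEC/NSEC3 proof in the authority section carries a type bitmap.
        # It must still list A; if it claims no types exist at the name, a
        # resolver doing aggressive NSEC caching (RFC 8198, e.g. 8.8.8.8)
        # will start answering the A queries negatively too, which is the
        # random-looking breakage described in the blog post.
        for rrset in resp.authority:
            if rrset.rdtype in (dns.rdatatype.NSEC, dns.rdatatype.NSEC3):
                for rdata in rrset:
                    if "A" in rdata.to_text().split():
                        return True
        return False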
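Third, the rollback math. The registry publishing or pulling the DS quickly doesn't help you; validators keep using whatever DS and DNSKEY they have cached until the TTLs run out, and that is how long you must keep serving a matching DNSKEY and valid signatures. A sketch of the arithmetic (going through a recursive resolver shows the remaining cached TTL; ask the parent's servers directly if you want the full value):

    import dns.resolver

    def rollback_commitment_seconds(zone: str) -> int:
        # You are committed to serving a DNSKEY matching the published DS,
        # with valid RRSIGs, until cached copies of both have expired.
        ds_ttl = dns.resolver.resolve(zone, "DS").rrset.ttl
        dnskey_ttl = dns.resolver.resolve(zone, "DNSKEY").rrset.ttl
        return max(ds_ttl, dnskey_ttl)

    print(rollback_commitment_seconds("example.com"))   # a .COM DS TTL starts at 86400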