Benjamin Kaduk has entered the following ballot position for draft-ietf-dnsop-serve-stale-09: No Objection
When responding, please keep the subject line intact and reply to all
email addresses included in the To and CC lines. (Feel free to cut this
introductory paragraph, however.)

Please refer to https://www.ietf.org/iesg/statement/discuss-criteria.html
for more information about IESG DISCUSS and COMMENT positions.

The document, along with other ballot positions, can be found here:
https://datatracker.ietf.org/doc/draft-ietf-dnsop-serve-stale/

----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------

Thanks for this document; it's a good, comprehensive discussion of the
issues related to this topic and will improve the stability of the
Internet.  I have several minor comments and a few side notes that are
expected to lead at most to my own elucidation (but no textual
changes).

Section 2

   For a comprehensive treatment of DNS terms, please see [RFC8499].

(side note: I myself would not use the word "comprehensive" when it
explicitly says that "some DNS-related terms are interpreted quite
differently by different DNS experts", but I understand why it is used
here.)

Section 3

   There are a number of reasons why an authoritative server may become
   unreachable, including Denial of Service (DoS) attacks, network
   issues, and so on.  If a recursive server is unable to contact the
   authoritative servers for a query but still has relevant data that

side note: the way this is worded might make a reader wonder if the
recursive is expected to attempt to contact all known authoritatives
before declaring failure.

   Several recursive resolver operators, including Akamai, currently
   use stale data for answers in some way.  A number of recursive
   resolver

I did not follow the discussions that led to this wording, but one of
my colleagues at Akamai suggested that "currently fall back to stale
data for answers under some circumstances" might be a nicer wording,
though I note that Adam has already proposed some text here as well,
which is probably fine.

Section 4

   The definition of TTL in [RFC1035] Sections 3.2.1 and 4.1.3 is
   amended to read:

   TTL  a 32-bit unsigned integer number of seconds that specifies the
      duration that the resource record MAY be cached before the source
      of the information MUST again be consulted.  Zero values are
      interpreted to mean that the RR can only be used for the
      transaction in progress, and should not be cached.  Values SHOULD
      be capped on the orders of days to weeks, with a recommended cap
      of 604,800 seconds (seven days).  If the data is unable to be
      authoritatively refreshed when the TTL expires, the record MAY be
      used as though it is unexpired.  See the Section 5 and Section 6
      sections for details.

I recommend using "[this document]" in the section references, since a
reader reading the updated content in the context of RFC 1035 might
look there instead of here.

Section 5

   The resolver then checks its cache for any unexpired records that
   satisfy the request and returns them if available.  If it finds no
   relevant unexpired data and the Recursion Desired flag is not set in
   the request, it should immediately return the response without
   consulting the cache for expired records.  Typically this response
   would be a referral to authoritative nameservers covering the zone,
   but the specifics are implementation-dependent.

side note: I'm slightly surprised that the semantics of the absence of
Recursion Desired are not more tightly nailed down, but neither is it
the role of this document to specify them.
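To check my own reading of the lookup order above, a minimal sketch in
Python (entirely mine, not the draft's; Cache, resolve, and referral
are hypothetical stand-ins):

    import time

    class Cache:
        # Toy cache mapping a query to (answer, absolute expiry time);
        # all names here are hypothetical, not from the draft.
        def __init__(self):
            self._entries = {}

        def store(self, query, answer, ttl):
            self._entries[query] = (answer, time.time() + ttl)

        def lookup(self, query, include_expired=False):
            entry = self._entries.get(query)
            if entry is None:
                return None
            answer, expiry = entry
            if include_expired or time.time() <= expiry:
                return answer
            return None

    def answer_query(query, cache, recursion_desired, resolve, referral):
        # First preference: unexpired records that satisfy the request.
        answer = cache.lookup(query)
        if answer is not None:
            return answer
        # No fresh data and RD unset: respond immediately (typically
        # with a referral), without consulting expired records.
        if not recursion_desired:
            return referral(query)
        # Otherwise resolve normally; only when resolution fails fall
        # back to any stale records still held in the cache.
        answer = resolve(query)
        if answer is not None:
            return answer
        return cache.lookup(query, include_expired=True)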
   When no authorities are able to be reached during a resolution
   attempt, the resolver should attempt to refresh the delegation and
   restart the iterative lookup process with the remaining time on the
   query resolution timer.  This resumption should be done only once
   during one resolution effort.

Is the "during one" more like a global cap or more like "during a
given"?

Section 6

   The client response timer is another variable which deserves
   consideration.  If this value is too short, there exists the risk
   that stale answers may be used even when the authoritative server is
   actually reachable but slow; this may result in sub-optimal answers
   being returned.  Conversely, waiting too long will negatively impact
   user experience.

Not just sub-optimal but potentially even wrong or actively harmful
answers, no?

   The balance for the failure recheck timer is responsiveness in
   detecting the renewed availability of authorities versus the extra
   resource use for resolution.  If this variable is set too large,
   stale answers may continue to be returned even after the
   authoritative server is reachable; per [RFC2308], Section 7, this
   should be no more than five minutes.  If this variable is too small,
   authoritative servers may be rapidly hit with a significant amount
   of traffic when they become reachable again.

I think part of the concern is also that setting the value too small
will cause additional traffic towards the authoritative even while it
is nonresponsive/nonreachable, which could aggravate any DoS attack
ongoing against the authoritative.  Which is to say, that perhaps
"become reachable again" does not quite reflect the full set of
considerations.

   Regarding the TTL to set on stale records in the response,
   historically TTLs of zero seconds have been problematic for some
   implementations, and negative values can't effectively be
   communicated to existing software.  Other very short TTLs could lead
   to congestive collapse as TTL-respecting clients rapidly try to
   refresh.  The recommended value of 30 seconds not only sidesteps
   those potential problems with no practical negative consequences, it
   also rate limits further queries from any client that honors the
   TTL, such as a forwarding resolver.

I wonder a little whether an RFC 8085 reference would make sense here,
but that's not exactly my area of expertise.

   There's also no record of TTLs in the wild having the most
   significant bit set in DNS-OARC's "Day in the Life" samples.  With
   no

Should we have a reference for DNS-OARC's samples?

   apparent reason for operators to use them intentionally, that leaves
   either errors or non-standard experiments as explanations as to why
   such TTLs might be encountered, with neither providing an obviously
   compelling reason as to why having the leading bit set should be
   treated differently from having any of the next eleven bits set and
   then capped per Section 4.

side note(?): This discussion, as roughly "we can't think of any reason
why the change would be problematic", calls to mind the ongoing
discussions of RFC (text) format changes, where arguments are being
made for more-strict backwards/historical compatibility.  That said, I
have no reason to doubt the WG consensus position here, hence "side
note".
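As an aside, to make the TTL handling from Sections 4 and 6 concrete
for myself, a tiny sketch (mine, with hypothetical names; the constants
are the draft's recommended values):

    TTL_CAP = 604_800   # seven days, the recommended cap from Section 4
    STALE_TTL = 30      # recommended TTL to place on a stale answer

    def effective_ttl(wire_ttl):
        # Interpret the wire value as a 32-bit unsigned integer, so a
        # TTL with the most significant bit set is just a very large
        # value that gets capped like any other, not a special case.
        return min(wire_ttl & 0xFFFFFFFF, TTL_CAP)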
Section 7

   Be aware that Canonical Name (CNAME) and DNAME [RFC6672] records
   mingled in the expired cache with other records at the same owner
   name can cause surprising results.  This was observed with an
   initial implementation in BIND when a hostname changed from having
   an IPv4 Address (A) record to a CNAME.  The version of BIND being
   used did not evict other types in the cache when a CNAME was
   received, which in normal operations is not a significant issue.
   However, after both records expired and the authorities became
   unavailable, the fallback to stale answers returned the older A
   instead of the newer CNAME.

I'm not sure to what extent the lesson from this scenario is limited to
"CNAME/DNAME are special" versus "when serving stale, serve the
least-stale you have".

Section 8

   Details of Apple's implementation are not currently known.

I'm amenable to the other reviewer's comment that this section might be
interesting to keep, RFC 6982 notwithstanding, in which case this might
be more appropriately worded as "publicly disclosed" -- one assumes
that the Apple employees that wrote it know what it does!

Section 10

   The most obvious security issue is the increased likelihood of
   DNSSEC validation failures when using stale data because signatures
   could be returned outside their validity period.  Stale negative
   records can

We seem to be carefully not giving explicit guidance about using
"stale" DNSSEC keys in addition to stale resolution records.  If the
consequences of potentially using expired key material are more severe
than the consequences of potentially using expired DNS records (as it
seems to me), perhaps we should explicitly reiterate that serve-stale
is not an excuse to ignore key validity periods (as we are implicitly
doing here)?  (A small sketch of what I mean appears at the end of this
message.)

   In [CloudStrife], it was demonstrated how stale DNS data, namely
   hostnames pointing to addresses that are no longer in use by the
   owner of the name, can be used to co-opt security such as to get
   domain-validated certificates fraudulently issued to an attacker.
   While this document does not create a new vulnerability in this
   area, it does potentially enlarge the window in which such an attack
   could be made.  A proposed mitigation is that certificate
   authorities should fully look up each name starting at the DNS root
   for every name lookup.  Alternatively, CAs should use a resolver
   that is not serving stale data.

[I think Adam has probably already covered this one, but keeping just
in case.]  I note that the target of this guidance (CAs) is not
obviously in the expected readership set for a document about DNS
recursive resolver operational considerations.  Can we do more to
expand the visibility of this guidance to the audience where it would
be most useful?  (I don't see an obvious candidate for, e.g., an
additional Updates: relationship, but perhaps someone has other ideas.)
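Returning to my DNSSEC comment under Section 10, this is the check I
would expect to remain in force even when stale answers are served (a
sketch of mine with hypothetical names; RRSIG here is just a stand-in
type, not a real parser):

    import time
    from typing import NamedTuple

    class RRSIG(NamedTuple):
        # Hypothetical stand-in for a parsed RRSIG record.
        inception: float    # signature inception, seconds since epoch
        expiration: float   # signature expiration, seconds since epoch

    def signature_usable(sig, now=None):
        # Serve-stale extends TTLs, not signature validity periods: a
        # validator should still reject a signature outside its window.
        now = time.time() if now is None else now
        return sig.inception <= now <= sig.expiration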