Paul, apologies for taking nearly a year to recall this message and respond to it:
https://www.ietf.org/mail-archive/web/dnsop/current/msg21367.html

I'll trim down citation material for response, but that's not to mean
the parts I am not responding to are ignored. For example, I pretty
much agree with the first five paragraphs. The first part that jumps
out to me as bearing further discussion starts here:

Paul Vixie writes:
> another method that's been deployed of avoiding simultaneous "don't
> have" with "great need" is to liberally reinterpret TTL such that
> RRsets can be reused beyond their explicit TTL lifetime, while their
> refresh queries proceed in the background. commonly, the authority
> servers responsible for answering these refresh events are down or
> unreachable at the time of most acute need.

Here "in the background" implies to me that there is a perception that
the stale answer is given preference over refreshing the data. To be
clear, that is not the case. The draft is explicit that an attempt
should be made -- in the foreground -- to refresh before falling back
to stale data. Stale data is therefore used only in exceptional
circumstances, and I also would not be inclined to describe the
unreachability of authorities as "commonly". Operational experience
bears this out.

> the danger of TTL stretching is that reuse beyond TTL may cause
> RRsets that are in fact supposed to be unreachable, to be
> effectively reachable. examples include security-related takedown of
> criminal DNS servers or networks, or failover strategies where end
> systems will not try to reach their backup servers unless they
> cannot reach their primary servers, and the unreachability of those
> primary servers is hidden from them by TTL
> stretching. fundamentally, an RRset and its TTL are the property of
> the zone administrator, and it's controversial for any other party
> to use this data beyond its specified use parameters.

This is the meat of this message to me. Can you please elaborate on
the scenarios where this takedown situation is a problem?
What are the circumstances by which a takedown is only able to be
effected through some mechanism which would be subverted by
serve-stale? Removing the delegation still works, as does repointing
the delegation, or rerouting the authority addresses, or physically
taking over the authorities and thereby being able to change their
answers. The only scenario that I can imagine it not working is when
you can physically disable links to the to-be-disabled authorities but
have none of the other remedies available. Is this something that
happens?

I know you don't mean to say this, but it's also hard not to have it
come to mind that this sounds a bit like "we can't do this because bad
people might use it to their advantage." Maybe, but beyond the
question of how, does that sufficiently outweigh the benefit
non-baddies can get? We have lots of technology that bad people can
use to do bad things. It's really hard to evaluate that without a more
detailed look at the threat model.

Similarly, I'm wondering about these other existing systems that rely
on unusable primary delegations to fix those delegations to point to
backup servers, especially with the typical TTLs in TLDs being the
dominating consideration for actually being able to cause the
failover. That's not to say I doubt such systems exist, because of
course the DNS is constantly monitored by any reasonable provider.
Every such monitoring system with which I have personal experience has
many checks that would not be impacted by serve-stale. I'm
specifically interested in learning more about the systems for which
serve-stale causes breakage, and how they might end up getting a
stale-serving resolver without an affirmative administrative choice to
install and enable the feature in such a resolver for such a
monitoring system.

> most of us recognize that TTL's will continue to be stretched no
> matter what changes are or are not made to the specification, and so
> we expect the resulting RFC to document current practice _without
> recommending it_ and to also document a new practice _with
> recommendations_ as to its proper uses.

I think you'll need more support for the assertion that "most of us
.... expect". Based on the conversation that's happened around this so
far, and with my best attempt at fairly evaluating feedback both on
the list and in person, my own impression is that most implementers
and operators with whom I've spoken are supportive of the immediate
resilience benefit of serve-stale as described in the draft.

> noone has proposed any new signaling between the stub and the
> recursive, but it's possible that a stub may want a true TTL and so
> we might add signaling from the stub (as initiator) saying, don't
> stretch, or perhaps saying, if this is a stretched TTL, tell me so
> explicitly.

The draft that predated your message by a couple of weeks proposed the
functionality whereby a stub could indeed know explicitly that any
given RRSet in the response was stale. The recently republished draft
offers a simplified alternative method for discussion. Personally I
still prefer the more featureful option as providing the most clear
information, but in any event signaling is and has been part of the
document.
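
To make the ordering discussed earlier concrete (fresh cache first,
then a foreground refresh, and stale data only if that refresh fails),
here is a minimal Python sketch. It is an illustration under stated
assumptions, not the draft's text or any resolver's actual code: the
names resolve, CacheEntry, Unreachable and query_authorities are
hypothetical, and the max_stale cap and its default value are
assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    rdata: str
    expires_at: float  # absolute time at which the original TTL runs out


class Unreachable(Exception):
    """Raised by the (hypothetical) upstream query when no authority answers."""


def resolve(name, cache, query_authorities, now=None, max_stale=86400.0):
    """Return (rdata, is_stale) for name.

    query_authorities(name) is a hypothetical callable that returns
    (rdata, ttl) or raises Unreachable. max_stale is an assumed cap, in
    seconds, on how long past expiry an entry may still be served.
    """
    now = time.time() if now is None else now
    entry = cache.get(name)

    # 1. An unexpired entry is served as usual; nothing is stretched.
    if entry is not None and now < entry.expires_at:
        return entry.rdata, False

    # 2. The entry is missing or expired: try to refresh in the foreground.
    try:
        rdata, ttl = query_authorities(name)
    except Unreachable:
        # 3. Only when the refresh fails is stale data considered at all.
        if entry is not None and now - entry.expires_at <= max_stale:
            return entry.rdata, True
        raise

    cache[name] = CacheEntry(rdata, now + ttl)
    return rdata, False
```

The second element of the returned pair stands in for whatever
stale-signaling a resolver might expose to a stub; how that would be
carried on the wire is deliberately left out of the sketch.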