Hi Petr, thank you for the feedback!
> On Jul 28, 2022, at 5:06 AM, Petr Špaček <pspa...@isc.org> wrote:
>
> On 27. 07. 22 19:42, internet-dra...@ietf.org wrote:
>> A New Internet-Draft is available from the on-line Internet-Drafts
>> directories. This draft is a work item of the Domain Name System
>> Operations WG of the IETF.
>>
>>     Title    : Negative Caching of DNS Resolution Failures
>>     Authors  : Duane Wessels
>>                William Carroll
>>                Matthew Thomas
>>     Filename : draft-ietf-dnsop-caching-resolution-failures-00.txt
>
> I think this is an important clarification to the protocol and we should
> adopt it and work on it.
>
> I like the document up until the end of section 2.
>
> After that I have reservations about the specific proposals put forth in
> section 3.
>
> I hope this will kick off discussion; please don't take points personally.
> I'm questioning the technical aspects.
>
>> 3. DNS Negative Caching Requirements
>>
>> 3.1. Retries and Timeouts
>>
>> A resolver MUST NOT retry more than twice (i.e., three queries in
>> total) before considering a server unresponsive.
>>
>> This document does not place any requirements on timeout values,
>> which may be implementation- or configuration-dependent. It is
>> generally expected that typical timeout values range from 3 to 30
>> seconds.
>
> I'm curious about the reasoning behind this.
>
> My motivation:
> A random drop or a temporarily saturated/malfunctioning link should not
> cause the resolver to fail for several seconds.

This section can certainly be improved and we are open to specific
suggestions. For example, I think we could say “MUST NOT retry a given
query more than twice…”, i.e., tie this to the concept of scope in
section 3.3.

> As an extreme case, think of a validating resolver on a laptop forwarding
> elsewhere. Should two packet drops really cause it to SERVFAIL for
> several seconds?
It was not our intention to say that three timeouts mark a forwarder as
unusable for a long period of time. Maybe there are different rules for
forwarders vs. authoritative servers. Or maybe scoping it to individual
queries would be sufficient.

> Related to this, I have a principal objection:
> IMHO we should NOT be inventing flow control from scratch ourselves. On
> the contrary, we should be borrowing prior art from existing flow
> control algorithms and adapting it where necessary.

Sure, I think we're open to that if there is something appropriate we can
reference. Can you think of any relevant prior art?

>> 3.2. TTLs
>>
>> Resolvers MUST cache resolution failures for at least 5 seconds.
>> Resolvers SHOULD employ an exponential backoff algorithm to increase
>> the amount of time for subsequent resolution failures. For example,
>> the initial TTL for negatively caching a resolution failure is set to
>> 5 seconds. The TTL is doubled after each retry that results in
>> another resolution failure. Consistent with [RFC2308], resolution
>> failures MUST NOT be cached for longer than 5 minutes.
>
> My motivation: rapid recovery.
>
> Why 5 seconds? Why not 1? Or why not 0.5 s? I would like to see the
> reasoning behind the specific numbers.

We put 5 seconds here simply because it feels like a reasonable amount of
time that a person would be willing to wait for a retry, and as a starting
point for a discussion (which we are now having; hooray!).

> IMHO most problems are caused by unlimited retries, and as soon as _a_
> limit is in place the problem is alleviated,

But the limit needs to be bound by some amount of time, right?

> and with exponential backoff we should be able to start small. I'm not
> sure that a specific number should be mandated.

I agree that having exponential backoff would make a small initial TTL
feasible. Would you support a MUST requirement for exponential backoff?
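As an aside, to make the 3.2 schedule concrete, here is a minimal Python
sketch (our own illustration, not text from the draft; the function and
constant names are hypothetical): the TTL starts at 5 seconds, doubles on
each consecutive failure, and is capped at RFC 2308's 5-minute maximum.

```python
# Illustrative sketch of the section 3.2 backoff rule (names are ours,
# not the draft's): initial TTL 5 s, doubled per consecutive failure,
# never exceeding the 5-minute cap from RFC 2308.

NEG_TTL_INITIAL = 5   # seconds; the draft's proposed minimum
NEG_TTL_MAX = 300     # 5 minutes, per RFC 2308

def negative_cache_ttl(consecutive_failures: int) -> int:
    """TTL for caching the Nth consecutive resolution failure (N >= 1)."""
    ttl = NEG_TTL_INITIAL * 2 ** (consecutive_failures - 1)
    return min(ttl, NEG_TTL_MAX)
```

With these numbers the schedule is 5, 10, 20, 40, 80, 160, 300, 300, ...
so the cap is reached on the seventh consecutive failure.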
>> 3.3. Scope
>>
>> Resolution failures MUST be cached against the specific query tuple
>> <query name, type, class, server IP address>.
>
> Why was this tuple selected? Why not <class, zone, server IP> for, say,
> timeouts? Or why not <server IP> for timeouts?

This was copied from RFC 2308 (sections 7.1 and 7.2).

> What about the transport protocol and its parameters (TCP, UDP,
> DoT, etc.)?

Yes, that is an aspect the draft hasn't considered. Would you like to see
that included in the tuple?

> My motivation:
> - Simplify cache management.
> - Imagine an attacker attempting to misuse this new cache. The cache has
>   to be bounded in size, it has to somehow manage overflow, etc.
>
> Generally I think this MUST is too prescriptive. It should allow for
> less specific caching if an implementation decides it is fit for a given
> type of failure and configuration, or depending on operational
> conditions.

This is similar to points raised by Mukund. How would you feel about
something like this:

  MUST at least cache against <server IP address>
  SHOULD cache against <name, type, class, address>
  MAY cache against <name, type, class, address, transport>

DW

_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop