I welcome birds of a feather. Need to define / refine the problem statement first.
On 12/7/23 12:30 AM, Petr Špaček wrote:
> On 07. 12. 23 1:05, Fred Morris wrote:
>> On Wed, 6 Dec 2023, Evan Hunt wrote:
>> I say go ahead, if nothing else consider it a "scream test". But can
>> you take a moment and tell us which stakeholder group(s) you think
>> you're optimizing for, why, and how?
>
> On the technical level we optimize using real (anonymized!) traffic
> provided to us by operators. Here's what we need:
> https://kb.isc.org/docs/collecting-client-queries-for-dns-server-testing
>
> If you want us to optimize for your use-case let's talk how we can get
> the data and replicate your setup!

I run Dnstap (for $reasons), but I'd be able to run dnscap, and from the look of that KB page you only want the queries. I'm not sure that really captures the qualitative issue(s). I plan to dig into this some more over the winter anyway; maybe I should turn the tables and ask if there are other systemic issues I should look at or for?

I'm using DNS largely for purposes other than FQDN -> address mapping. The things I've written have gotten enough uptake that I'm past the "kook" stage and into the "conspiracy" stage, but although I get some feedback at this point it's all basically anecdotal; I don't have a "movement" that I can ask for disciplined feedback.

I've done a number of different things poking at the same elephant over the past few years, and what I consistently see is a focus on "a query and a response". I'm not sure that is adequate systems thinking for the issues at hand. There seem to be a number of them, and they all point to inadequate systems thinking. That happens. As a neighboring example, adding more packet buffering to routers and wifi hotspots should be an unambiguous Good Thing, right? Even a decade after finding out that it's not, there are still people and constituent groups which haven't gotten the memo.

The key thing I'm going to set up and examine this winter is the impact of qname minimization. But there are enough of these that maybe some sort of memo is in order. Maybe somebody else wants to work on it with me?

So here are some things which I've noticed about DNS in the field and the lack of systems thinking around it. The first two (frags and TC=1) are fairly well known, and are provided as examples of where systems thinking is weak and what that means in practice. But most importantly: "systems thinking in the DNS is provably weak".

Frags. Frags are good? No, they are bad. If a single UDP fragment isn't delivered, the datagram can't be reassembled. The server thinks all is fine and good and Procrustes' algorithm has made it all fit, but the datagram failing to be reassembled means that at the application layer no reply was received from the server at all. It really doesn't matter whether TC=1 is set or not, because it will never make it to the application. If traffic shaping mistakenly and simplistically thinks "dropping UDP is ok", that goes double for UDP frags: losing any one fragment loses the whole reply.

TC=1 is permission-based; (different implication) what if it only works over TCP? There is no provision in the algorithm to try TCP if no response is received via UDP. The 1980s recursion algorithm makes the decision to use TCP a polite-society thing. The querant doesn't just try it. It waits for the server to say "here you are, this is what I can do for you; but I encourage you to please try again with TCP" and the querant thinks "oh how nice of you, what an excellent idea; thank you, I will". There is no provision in the algorithm to unilaterally try TCP when UDP has failed to perform well, or at all. A sketch of what unilateral fallback could look like is below.
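Purely to illustrate the shape of the decision, here's a rough sketch in Python using dnspython. The server address is a documentation-range placeholder and the timeout is an arbitrary assumption; this is not how any particular stub or recursive resolver actually behaves.

    import dns.exception
    import dns.flags
    import dns.message
    import dns.query

    def resolve_with_fallback(qname, rdtype="A", server="192.0.2.53", timeout=2.0):
        query = dns.message.make_query(qname, rdtype)
        try:
            # Classic behaviour: UDP first, and only go to TCP if the server
            # "gives permission" by setting TC=1 in its reply.
            response = dns.query.udp(query, server, timeout=timeout)
            if not (response.flags & dns.flags.TC):
                return response
        except dns.exception.Timeout:
            # No UDP reply at all (lost frags, traffic shaping, whatever).
            # Nothing in the classic algorithm says to try TCP here,
            # but nothing prevents it either.
            pass
        # Unilateral fallback: ask the same question again over TCP.
        return dns.query.tcp(query, server, timeout=timeout)

The dozen lines aren't the point; the point is that in the traditional algorithm the decision to use TCP rests entirely on the server setting TC=1.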
This is arguably most important for stub resolvers. If the issue were simply buffer bloat, then forcing queries over TCP wouldn't provide observably better performance (which it often does, and that's why this is worth mentioning). The suspicion has to be traffic shaping, but I don't know that that's the case; crappy SOHO routers are largely black boxes. As an aside: are people still blocking TCP/53? It wasn't that long ago that this was conventional security theater.

Aggressive UDP retry presumes fast over correct responses. Or at least "correct enough", even if not the most timely. In pursuit of happy eyeballs, speed over everything else! The fastest thing is a static zone file which never changes. But the real world today encompasses forwarders as well as database backends (and this is for FQDN -> address mappings!), and in the quest for the fastest possible response, caches get built on top of the database so that something can be served which meets the objective of what is measured (response time). Without going into technical details, please accept that this increases complexity and the work needed to keep what's served to the querant as fresh as practicable. On the other hand, if a typical response time of 1/10th of a second is acceptable, there's time to wait for the database and no need for the additional complexity. Some datastores might take even longer than that (nobody cares about happy eyeballs in that use case).

What is the reason for caching resolvers? We see proposals to do prefetch for answers which are soon to expire from cache. If the network is slow enough that that matters, and prefetch works, why the continued obsession with superfast authoritative responses? And if a query is a prefetch, is the retry schedule any less aggressive?

Aggressive UDP retry mints unique requests. Anecdotally it has been observed that the aggressive retries from caching resolvers directed at authoritatives mint a new query (query id) for each retry. (I have to ask, because of the next item on the list:) is this a de-optimization in the name of privacy? The same application (a caching resolver) is issuing what are, as far as the protocol is concerned, different queries which presumably could have come from different applications on the same host. If there's a full-blown recursive resolver living on the host, wouldn't those apps avail themselves of the resource? Personally I would hope so. So can the authoritative server debounce (reply to only one request within some time period), or does it have to reply to each and every one of them on the off chance that they're coming from different applications? (And if they're using the stub resolver, shouldn't it be caching? And if they're not using the stub resolver, maybe their "very good reason" should include dealing with whatever the issue is and not passing it off to the DNS? Or maybe a competent sysadmin should be sandboxing that app with a caching resolver in front of it?)

Qname minimization generates more requests. Without explaining in detail what qname minimization is for or what it entails: traditionally a DNS query contains the full query name when sent to every authoritative server, regardless of whether it is conceivable that that server can answer the query rather than provide a referral. With qname minimization this is not the case, and the query is tailored to the type of response(s) the authoritative server is anticipated to be able to provide. A toy sketch of the difference follows.
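To make the extra traffic concrete, here's a toy sketch (Python again) of the query names a minimizing resolver might send, one label at a time, versus the full name going to every server in the delegation chain. Implementations differ in which QTYPE they use at each step and how many labels they add at a time, so treat this as an illustration only.

    def minimized_qnames(full_qname):
        # Yield the successive QNAMEs a qname-minimizing resolver might send,
        # one label deeper per delegation step, instead of the full name.
        labels = full_qname.rstrip(".").split(".")
        for i in range(1, len(labels) + 1):
            yield ".".join(labels[-i:]) + "."

    # Traditional: "www.example.com." goes to the root, the com servers, and
    # the example.com servers alike. Minimized: the root might be asked for
    # "com.", the com servers for "example.com.", and only the example.com
    # servers see "www.example.com." -- more queries, and every intermediate
    # name (possibly an empty non-terminal) has to produce a sensible answer.
    for qname in minimized_qnames("www.example.com."):
        print(qname)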
Aside from the additional traffic, the crux of most of what can go wrong happens with empty non-terminals. Empty non-terminals are comparatively rare in the portions of the namespace utilized for FQDN -> address mapping. Based on this observation, maybe qname minimization should be limited to that use case? Why aren't there tuning / configuration options around this? (I won't be surprised if there are, for at least some implementations.)

If this resonates with you, feel free to reach out. If you use the trualias morris.dns.systems.thinking....@m3047.net that will help me manage things if there are more than a handful of interested parties.

--
Fred Morris