John Miller <johnm...@brandeis.edu> wrote:
> Thanks for the reply, Tony. With the recent glibc bug, I figured most
> folks would be off putting out those fires!
If they haven't done it by now then, gosh, I feel sorry for them. (It's SO NICE to have a redundant service that you can patch and upgrade without affecting users. Operational priority 1: reduce anxiety and improve mental health!)

And you (dear readers) should not have to care about the rest of this post - there's very little a hostmaster can do to improve things when you've lost your uplink. All I have been able to do for my users is explain why the DNS servers did the best they could, while my colleagues did the real work to improve our uplinks and reduce these failures.

> > [...] but it's still "slow" when we lose external connectivity
> > because (I think) of attempts at TLS OCSP lookups :-(
>
> We've run into similar issues in the past: people were hitting a
> captive portal that didn't allow access to the CAs for OCSP
> verification.

OCSP is just one example of an unexpected external dependency. I mention it particularly because it was confusing: different browsers make different trade-offs about TLS certificate revocation, so we got inconsistent problem reports.

> We're not quite there with regard to traffic volume: we're somewhere
> around 150 qps on each server (maybe 500-600 qps campus-wide), but as
> happened to you, we saw the same 3-4x spike in volume.

Right. This number should be fairly consistent across different sites, because the traffic increase is due to two things: firstly, stub resolvers usually try a query three times (with a 10s timeout for each query, so an overall 30s timeout), and secondly, users might retry manually (or they might go and get coffee).

> Likewise, we went from roughly 20 active clients per server (going off
> of UDP socket stats from sar) to over 1000.

The other statistic you can look at is the client count from `rndc status`. Its numbers should be basically the same as your socket stats, modulo a factor of 2 for IPv4 and IPv6.
When you hit a limit, though, the numbers will be clipped: they are no use for measuring demand, and no use for estimating what will happen when you lift the limit.

Also, there's another factor of 3 between queries from stub resolvers and queries by named: each query from named has a 3s request timeout and an overall 9s or 10s timeout (IIRC). So (I think) this basically means that when you lose your uplink, you can roughly expect the BIND client count to be about 3x3 = 9 times your normal qps, plus angry manual retries, minus frustrated coffee breaks.

> The servers themselves were quietly twiddling their thumbs at 0.1 load:
> strictly a case of the application doing the throttling.

Yep, mostly waiting for replies that will never come, which doesn't require much CPU.

Tony.
-- 
f.anthony.n.finch <d...@dotat.at> http://dotat.at/
Forties, Cromarty: Southwest 4 or 5, backing south 6 to gale 8, perhaps severe gale 9 later. Slight or moderate, becoming rough or very rough later. Showers, then rain. Good, becoming moderate or poor.