Thanks again for all the advice. No problems today so far. I suspect this was some type of network issue, which affected only part of the Google network. As noted, we had zero failures for all domains but this one, and we had no failures sending to paid Google Workspace domains.

Emanuel:
> This was caused by a temporary issue reaching the DNS servers. Let me pass this along to see if it can be turned into a temporary (4xy) error.

Thanks for letting me know. If you happen to know, which name server was failing?

Benny:
> dont include anything you dont control, unless you own a google server at all

That's really kind of funny. We were having a certain amount of deliverability problems running our own mail servers for a few domains so we gave up and switched to Google Workspace. (You can see my comments to this list in the history somewhere.) Now when we have problems, I say complain to Google. I have to point our MXes, SPFs, DKIMs at Google. What I won't do is use them as a DNS provider, but maybe I'll have to if these "temporary DNS failures" crop up more frequently.

Opti:
> I agree on a longer TTL in general if you’re not doing maint but a short TTL shouldn’t cause failures by itself… unless you’re maxing a limit on lookups or something?
> Looks like it’s on cloudflare who claims not to cap/cut off lookups but maybe you have some reporting on that end you could check out/confirm lookup errors idk. You’d think your DNS monitoring would catch it though.

There are no lookup limits. It's not on Cloudflare. We run our own name servers at two different providers (Flexential and Linode/Akamai).

Kai:
> A TTL of just 300 seconds is way too short IMHO. If anything happens to
> your DNS you just have five minutes to fix the problem. Set the TTL to
> at least 3600 seconds.

Benny:
> more or less static data should be ttl of minimal 12h or 43200 seconds

Google doesn't like this advice either:

    $ dig gmail.com
    gmail.com. 300 IN A 142.251.15.19
    gmail.com. 300 IN A 142.251.15.17
    gmail.com. 300 IN A 142.251.15.18
    gmail.com. 300 IN A 142.251.15.83

TTL is only relevant with enough hits from the exact same caching server, which is not the usual situation for 99% of the mail/http servers out there. We only have a few dozen hits a day for this particular domain.
Even with a million mail messages a day from 100 different sources, this would be at most 28,800 hits a day (each source re-queries at most once per 300-second TTL, i.e. 288 times a day), or roughly a hit every 3 seconds. Most single-core servers can handle that DNS load (our servers are much larger).

Why so short? We have had two very bad experiences with a network provider that required us to switch our servers at a moment's notice. With a half-day TTL, that's not possible. We have had zero problems with the 5-minute TTL. In fact, it makes sure that our name servers are used more frequently so that if there are problems, we hear about them more quickly than with a half-day TTL.

Rob
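
For anyone who wants to check the load estimate above, here is a minimal Python sketch. It assumes each of the 100 sources sits behind its own caching resolver that re-queries a single record at most once per TTL; the helper name `max_daily_lookups` is purely illustrative and not from any library.

```python
SECONDS_PER_DAY = 86_400


def max_daily_lookups(ttl_seconds: int, resolvers: int) -> int:
    """Upper bound on daily queries reaching the authoritative servers.

    Assumption: every resolver caches the answer for the full TTL, so it
    refreshes the record at most once per TTL window.
    """
    refreshes_per_resolver = SECONDS_PER_DAY // ttl_seconds
    return refreshes_per_resolver * resolvers


# 5-minute TTL, 100 distinct caching resolvers
lookups = max_daily_lookups(ttl_seconds=300, resolvers=100)
print(lookups)                    # 28800
print(SECONDS_PER_DAY / lookups)  # 3.0 seconds between hits, on average

# Same 100 resolvers with the suggested 12-hour TTL, for comparison
print(max_daily_lookups(ttl_seconds=43_200, resolvers=100))  # 200
```

Note that the bound depends only on the TTL and the number of caching resolvers, not on how many messages are sent. Resolvers that look up several record types (MX, A, TXT) would multiply the figure accordingly.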
_______________________________________________
mailop mailing list
mailop@mailop.org
https://list.mailop.org/listinfo/mailop