Hi there list,
this week we stumbled upon an issue where we could not send mail to
certain domains, for instance em...@umcg.nl.
Nov 16 17:04:08 mail postfix/smtp[13330]: warning: no MX host for umcg.nl has a
valid address record
Nov 16 17:04:08 mail postfix/smtp[13330]: 1D1D21422C2: to=<em...@umcg.nl>,
relay=none, delay=2257, delays=2256/0.02/0.52/0, dsn=4.4.3, status=deferred (Host or
domain name not found. Name service error for
name=umcg-nl.mail.protection.outlook.com type=A: Host not found, try again)
It turned out that this was the cause:
$ dig MX umcg.nl +short
10 umcg-nl.mail.protection.outlook.com.
$ dig NS mail.protection.outlook.com. +short
ns1-proddns.glbdns.o365filtering.com.
ns2-proddns.glbdns.o365filtering.com.
$ dig A umcg-nl.mail.protection.outlook.com. \
@ns1-proddns.glbdns.o365filtering.com. +edns +dnssec |
grep FORMERR
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46904
;; WARNING: EDNS query returned status FORMERR -
retry with '+nodnssec +noedns'
Apparently some Microsoft Office 365 mail servers do not support EDNS
and return FORMERR. This propagated through our DNS recursors as
SERVFAIL and caused the lookup to fail.
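For anyone who wants to check another nameserver for the same behaviour, here is a minimal sketch of the comparison above. The function names `dig_status` and `compare_edns` are made up for this mail; the dig options are the ones used in the transcript.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of dig or postfix.

# Print the status field (NOERROR, FORMERR, SERVFAIL, ...) from the
# ";; ->>HEADER<<-" line of dig output read on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query the same name at the same server with and without EDNS/DNSSEC
# and print both statuses next to each other.
compare_edns() {
    name=$1
    server=$2
    with=$(dig A "$name" "@$server" +edns +dnssec | dig_status)
    without=$(dig A "$name" "@$server" +noedns +nodnssec | dig_status)
    printf 'with EDNS:    %s\nwithout EDNS: %s\n' "$with" "$without"
}
```

Usage would be something like `compare_edns umcg-nl.mail.protection.outlook.com. ns1-proddns.glbdns.o365filtering.com.`; a server that shows the problem reports FORMERR on the first line only.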
A temporary workaround was to preheat the DNS cache by manually querying
said domain without EDNS and then flush the queue entries:
$ dig A umcg-nl.mail.protection.outlook.com. \
@ns1-proddns.glbdns.o365filtering.com. +noedns +nodnssec +short
213.199.154.87
213.199.154.23
# postqueue -i THE_ITEM
But that's obviously not the right solution.
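For anyone stuck with the same stopgap, this is roughly how the affected queue IDs can be pulled from the log instead of picking them by hand. It is a sketch that assumes the log line layout shown above (one entry per line, unlike the wrapped quotes in this mail); the function name `deferred_dns_ids` and the log path are assumptions.

```shell
#!/bin/sh
# Sketch only: deferred_dns_ids is a hypothetical helper name.

# Read postfix log lines on stdin and print the unique queue IDs of
# smtp deliveries deferred with dsn=4.4.3 (name service error).
deferred_dns_ids() {
    sed -n 's|.*postfix/smtp\[[0-9]*]: \([0-9A-Za-z]*\): to=.*dsn=4\.4\.3,.*|\1|p' |
        sort -u
}

# Possible use, after warming the cache (log path is an assumption):
#   deferred_dns_ids < /var/log/mail.log | xargs -n 1 postqueue -i
```

The `warning:` lines are skipped because they have no `to=` field right after the first word following the process ID.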
Some more digging revealed that EDNS was enabled on the query through
`smtp_addr_list`:
else if (smtp_tls_insecure_mx_policy > TLS_LEV_MAY)
res_opt = RES_USE_DNSSEC;
RES_USE_DNSSEC causes the subsequent queries to be sent with EDNS0
(RES_USE_EDNS0) and the DO flag set, and that broke our interoperability
with the Microsoft Office 365 DNS servers.
The fix was then to lower the policy level (`smtp_tls_insecure_mx_policy`
in the source) from 5 (dane) to 1 (may) by setting the following in
main.cf, where the default is dane:
smtp_tls_dane_insecure_mx_policy = may
For the record, according to the logs this miscommunication started on
our servers on the 2nd of November (although I cannot rule out that
something changed on our side). Running postfix 3.1.0-3 (Ubuntu Xenial) here.
My questions -- finally:
- Apart from Microsoft upgrading their servers to 2016 and supporting
EDNS, is this issue something postfix should handle?
- Would postfix have handled FORMERR but not SERVFAIL and are my caching
resolvers to blame?
- Should postfix retry the query without EDNS on unexpected errors?
- Should the default smtp_tls_dane_insecure_mx_policy be 'dane'?
Or would something more conservative be appropriate, given that it can
cause this kind of miscommunication?
Thanks for your input.
Cheers,
Walter Doekes
OSSO B.V.