On 10/14/2015 12:00 PM, Bill Cole wrote:
Describe, in detail, the new SA technology which fights abuse of new
TLDs.
Prior to v3.4.1, the mechanism for detecting and parsing hostnames to
identify body URIs used an embedded array of hardcoded domains in
Mail/SpamAssassin/Util/RegistrarBoundaries.pm. As a result, many URIs
in the new TLDs were not detected and so were never filtered as URIs.
v3.4.1 adds the new Mail/SpamAssassin/RegistryBoundaries.pm, along with
the file 20_aux_tlds.cf in the canonical rule set, which contains a
comprehensive, maintained list of TLDs and other registry-managed
domains.
A mention of why the list is even needed:
Most URLs are obvious and of the form
"http://sub.domain.tld/blahblahblah" and easy to detect. However, mail
clients will also accept things like "sub.domain.tld/blahblahblah"
without the protocol. We want to detect as many URLs as possible and
ideally zero non-URLs, because each detected candidate can turn into
multiple DNS lookups. The list of TLDs gives us a way to eliminate
obvious non-URLs, but it was designed when the worst we had to deal
with was 100-ish ccTLDs that rarely changed. Nowadays it's easy for
spammers to buy up garbage domains like example.bacon / example.click /
example.industries, which makes an up-to-date list of TLDs much more
important.
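
To make the idea concrete, here is a rough sketch of TLD-based
filtering of schemeless URL candidates. It is Python for illustration
only, not SA's actual Perl code, and the real parser is far more
involved. The names KNOWN_TLDS, CANDIDATE and find_uris are made up for
this example, and the TLD set is a tiny hand-picked sample rather than
the maintained list.

```python
import re

# Hypothetical sample only; the real list is much larger and is
# maintained in the rule set rather than hardcoded.
KNOWN_TLDS = {"com", "org", "net", "click", "industries"}

# A candidate is an optional http(s) scheme, a dotted hostname,
# and an optional path.
CANDIDATE = re.compile(
    r"(?:(?P<scheme>https?)://)?"
    r"(?P<host>(?:[a-z0-9-]+\.)+[a-z0-9-]+)"
    r"(?P<path>/\S*)?",
    re.IGNORECASE,
)


def find_uris(text, tlds=KNOWN_TLDS):
    """Return candidates that have a scheme or end in a known TLD."""
    found = []
    for match in CANDIDATE.finditer(text):
        host = match.group("host").lower()
        tld = host.rsplit(".", 1)[-1]
        # An explicit scheme is accepted as-is; a schemeless
        # candidate must end in a TLD from the list.
        if match.group("scheme") or tld in tlds:
            found.append(match.group(0))
    return found
```

With the sample set, "sub.domain.click/blah" is kept while "file.txt"
and "v3.4.1" are dropped. With a stale list that lacks "click", the
same schemeless URL is silently missed, which is the failure described
above.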