Hi,
On Mon, Nov 21, 2016 at 1:07 PM, Bill Cole <sausers-20150...@billmail.scconsult.com> wrote:
> On 21 Nov 2016, at 3:18, Matus UHLAR - fantomas wrote:
>
>> On 20.11.16 19:46, Alex wrote:
>>>
>>> Am I reading this rule wrong, or does the presence of a .info domain
>>> enough to warrant a 2.8 score?
>>>
>>> * 2.1 URI_NO_WWW_INFO_CGI URI: CGI in .info TLD other than third-level
>>> "www"
>>>
>>>
>>> <https://clientservices.ogletreedeakins.info/rs/vm.ashx?ct=24F76A1AD5E20AEDC1D180ACD125901ADFBE7BB3D38714D4CF371647BF8D90DDD78032>*
>>>
>>> uri URI_NO_WWW_INFO_CGI
>>> /^(?:https?:\/\/)?[^\/]+(?<!\/www)\.[^.]{7,}\.info\/(?=\S{15,})\S*\?/i
>>>
>>> This particular email was scored at 5.30, and wouldn't have hit if it
>>> didn't also hit SORBS, but such a score seemed quite high for just the
>>> presence of a type of TLD.
>>
>>
>> it's not based only on .info tld:
>>
>> 1. TLD .info
>> 2. no 'www'
>> 3. third level domain
>> 4. at least 6 characters 2nd-level domain
>
>
> That's a 7 not a 6 :)
>
> The RE says a bit more, and is maybe clearer using words:
>
> http[s]://<hostname: not 'www'>.<domainname: 7 or more non-dots>.info/<15 or
> more non-whitespace characters including a literal ?>
>
> Note that the trailing '\?' in the RE means a literal '?' indicating that
> the URI has a CGI-style query string. That makes this a very specific URI
> pattern. There's nothing "wrong" with such a URI except for the fact that
> objectively the frequency of that uncommon pattern is much higher in spam
> than non-spam.
>
> I *suspect* that the pattern could be tightened a bit to reduce false
> positives without missing the spam that hits this rule, but I don't have any
> data to support that.

Thank you all for your explanations. I understood that it also involved a CGI-style query string, but just didn't mention it. I have a handful of other non-spam URIs that hit this rule, if they would help tighten it up a bit.

The part I was unsure of was whether those 2.1 points were warranted, because I've only ever seen this pattern in ham. Now I understand that they are.
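
For anyone who wants to check candidate URIs against the pattern outside of SpamAssassin, here is a rough Python sketch. It is only an approximation: it uses Python's re module rather than the Perl regex engine SpamAssassin runs on, it does not reproduce whatever URI normalization SpamAssassin applies before matching, and the function name hits_rule is made up for illustration. The sample URI is the one quoted above with the quoted-printable escapes (=3D and the soft line break) decoded.

```python
import re

# The URI_NO_WWW_INFO_CGI pattern as quoted in this thread, compiled with
# Python's re module (an assumption: SpamAssassin itself uses Perl regexes).
URI_NO_WWW_INFO_CGI = re.compile(
    r"^(?:https?:\/\/)?[^\/]+(?<!\/www)\.[^.]{7,}\.info\/(?=\S{15,})\S*\?",
    re.IGNORECASE,
)


def hits_rule(uri: str) -> bool:
    """Hypothetical helper: True if the URI matches the quoted pattern."""
    return URI_NO_WWW_INFO_CGI.search(uri) is not None


# The URI from the original report (quoted-printable decoded).
SAMPLE = (
    "https://clientservices.ogletreedeakins.info/rs/vm.ashx"
    "?ct=24F76A1AD5E20AEDC1D180ACD125901ADFBE7BB3D38714D4CF371647BF8D90DDD78032"
)

if __name__ == "__main__":
    candidates = [
        SAMPLE,
        # Same shape, but the third-level label is "www".
        "http://www.ogletreedeakins.info/rs/vm.ashx?ct=24F76A1AD5E20A",
        # Second-level label shorter than 7 characters.
        "https://clientservices.short.info/rs/vm.ashx?ct=24F76A1AD5E20A",
        # No '?' anywhere, so no CGI-style query string.
        "https://clientservices.ogletreedeakins.info/rs/some/long/path/index.html",
    ]
    for uri in candidates:
        print(hits_rule(uri), uri)
```

Of these four, only the first should match; each of the others drops one of the conditions Bill spelled out (non-"www" host, 7+ character domain label, query string).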