Re: 0.001 rules - why?

Cedric Knight Mon, 10 Aug 2009 16:34:00 -0700

Matus UHLAR - fantomas wrote:
> On 09.08.09 11:33, Cedric Knight wrote:
>> I'm using Bayes and network tests, and have found a few rules with a
>> good ratio of ham to spam, but that score only 0.001 in the default
>> rules.
>
> apparently there's no use for them alone and the score isn't 0 just
> because that would cause them not to be processed.


OK, but why are they determined to be useless?  Some do have scores in
other scoresets, but not in scoreset 3 (Bayes + network).  Is it a
consequence of score generation, based purely on ratio, or is it set
manually?  (In the cases where the rule scores 0.001 on all scoresets, I
assume it is pegged manually.)

>
>> Here are the ones I'm talking about:
>>
>> FH_HELO_EQ_D_D_D_D
>>
>> Overlaps with HELO_DYNAMIC_IPADDR2 and TVD_RCVD_IP,
>
> this is a big problem IMHO; I've even filed a bug report because of this

It does mean when you get FPs, they are serious FPs.  But if the rules
were reorganised so they didn't overlap (using meta rules or assertions
in the pattern like those below), I think there's something to be gained
from scoring each case separately.
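
As a rough illustration of the meta-rule approach, something like the
following would score only the non-overlapping hits.  The LOCAL_ rule
name is made up for this example, and 0.8 is just a starting guess on my
part, not a generated score:

    # Hypothetical local rule: fires only when FH_HELO_EQ_D_D_D_D hits
    # and neither of the overlapping rules does.
    meta      LOCAL_HELO_D_D_D_D_ONLY  FH_HELO_EQ_D_D_D_D && !HELO_DYNAMIC_IPADDR2 && !TVD_RCVD_IP
    describe  LOCAL_HELO_D_D_D_D_ONLY  HELO looks like d-d-d-d, other dynamic-HELO rules missed it
    score     LOCAL_HELO_D_D_D_D_ONLY  0.8

This relies on FH_HELO_EQ_D_D_D_D still being evaluated, which it is as
long as its own score stays non-zero (e.g. at 0.001).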

Or, if rule A and rule B overlap such that B hits fewer spam, but has a
better ratio, then I'd say both should be active and score a positive
value: it's just that the score for B should be less if A will also hit.
 (If B hit only a subset of A's spam and had a *worse* ratio, then yes,
it should be scored near 0.)

Does score generation currently take this approach?

>
>> but if you redefine it as
>>
>> header   FH_HELO_EQ_D_D_D_D    X-Spam-Relays-Untrusted =~ /^[^\]]+ helo=(?!(?:[a-z]\S*)?\d+[^\d\s]\d+[^\d\s]\d+[^\d\s]\d+[^\d\s][^\.]*\.\S+\.\S+[^\]]+ auth= )[^ ]{0,15}\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3}/
>>
>> it clearly hits a good number that are missed by the other rules, with a
>> similar ratio.
>
> it would match every host sending from a generic IP address (if they
> know the address and its rDNS), which is very common for DSL, cable,
> dialup etc. users.

Hmmm... (a) in most cases, hosts on a dynamic address will use a machine
name rather than their rDNS; (b) is it not an assumption that these will
not be connecting directly to the MTA unless they are trusted?   Cf
RCVD_IN_SORBS_DUL, RCVD_IN_PBL, DOS_OE_TO_MX (although these depend on
last external, rather than last untrusted).

Anyway, my point is that *empirically* this rule seems to do well,
testing against a sample over the past week, even if we exclude mail
that also hits HELO_DYNAMIC_IPADDR2 and TVD_RCVD_IP.  If I search
through recent hits, it's all botnet stuff, and the ratio last week was
similar to URIBL_JP_SURBL (~0.1% ham hits).   If it doesn't do as well
for someone else, maybe it's down to some interesting difference in
setup, e.g. using greylisting.  A score of 0.8 or 1.0 seems to work well
for me.
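
For anyone who wants to try the same thing, a local override along these
lines should be enough.  This is only a sketch: 0.8 is the value that
suits my mail, and where your local.cf lives depends on the installation:

    # In local.cf: a single value applies to all four scoresets,
    # replacing the default 0.001.
    score FH_HELO_EQ_D_D_D_D 0.8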

>> FH_HOST_EQ_VERIZON_P
>> Being based in the UK, don't have many dealings with Verizon customers,
>> so YMMV on this one.  Still, only around 0.2% of hits are ham.
>
> you should understand that SA has many users living in a country with many
> Verizon customers and the rules should be done so that they could be used
> generally

Viz the US.  Certainly SpamAssassin should be as widely usable as
possible, although there are problems with non-Western character sets...
OK, I withdraw my suggestion about this one as it relies purely on
factor (b) above, but still think the others are worth a go.

CK
