On Tue, 2009-03-03 at 08:32 -0800, Marc Perkel wrote:
> Spamassassin works by adding up points. Rule A is 2 points, Rule B is 2 
> points therefore the score is 4 points. But is this the best way to 
> score? I don't think so.

While there are a lot of "fixed score rules", this description doesn't
do SA justice. There are a lot of meta-rules that do pretty much exactly
that, combining sub-rules in various logical ways.

> What I'm seeing in the real world is that it's the combinations of rules 
> in ways where Rule A + Rule B = 10 points rather than 4. Or maybe Rule A 
> + Rule B = -2 points because even though both are spam indicators 
> individually, together they are a ham indicator.

yup, meta-rules
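
A rough sketch of the idea, in Python rather than SA rule syntax, just to
show the arithmetic. The rule names and scores are made up for
illustration and are not stock SA rules. The meta-rule fires only when
both sub-rules hit and carries its own score, so the total is no longer
the plain sum of the two.

```python
# Illustrative model only: rule names and scores are made up, not stock SA rules.
SCORES = {
    "RULE_A": 2.0,
    "RULE_B": 2.0,
    # Meta-rule: fires only when both sub-rules hit, and carries its own score.
    "META_A_AND_B": 6.0,
}


def total_score(hits, scores=SCORES):
    """Sum the scores of all rules that fired, including the meta-rule."""
    fired = set(hits)
    if {"RULE_A", "RULE_B"} <= fired:
        fired.add("META_A_AND_B")
    return sum(scores[name] for name in fired)


print(total_score(["RULE_A"]))            # one rule alone
print(total_score(["RULE_A", "RULE_B"]))  # both rules plus the meta-rule
```

Give the meta-rule a negative score instead (say -6.0) and the same
combination comes out at -2, which covers the "together they are a ham
indicator" case.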

> I do almost all my spam filtering with Exim rules and I use SA for the 
> 1% that are still undetermined from Exim. And in Exim I have more 
> information to work with than SA because I can look at things (behavior) 
> that SA doesn't see. But what I'm doing is combining more rules than SA 
> is and - my point - SA can benefit from rule combinations.

SA *does* benefit from rule combinations. SA *can* benefit from such
MTA-visible information, if you make it visible to SA, too. For example
by injecting some headers that indicate these MTA tests and variables.

> As an example. The following by themselves are weak indicators of spam.
> 
> Dynamic IP
> Bad HELO
> Hitting high-numbered MX records
> Not closing with QUIT

Just tell SA about the latter two, and write a trivial meta.
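
To illustrate the logic (again in Python, not SA rule syntax): assume the
MTA adds one header per test it ran. The header names below are
hypothetical, you would pick your own in the MTA config. In SA itself
this would be a handful of header rules plus one meta that requires all
of them.

```python
from email import message_from_string

# Hypothetical header names; the MTA would have to be configured to add them.
MTA_FLAGS = (
    "X-MTA-Dynamic-IP",
    "X-MTA-Bad-HELO",
    "X-MTA-High-MX",
    "X-MTA-No-QUIT",
)


def bot_meta_hit(raw_message):
    """True only if every MTA-side indicator header is present and says 'yes'."""
    msg = message_from_string(raw_message)
    return all(
        (msg.get(name) or "").strip().lower() == "yes" for name in MTA_FLAGS
    )


sample = (
    "X-MTA-Dynamic-IP: yes\n"
    "X-MTA-Bad-HELO: yes\n"
    "X-MTA-High-MX: yes\n"
    "X-MTA-No-QUIT: yes\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(bot_meta_hit(sample))
```

Any single indicator on its own leaves the combination unset, so the
weak tests never score individually unless you want them to.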

> By themselves each would produce a LOT of false positives. But together 
> it's 100% definite it's a spam bot and not only can the message be 
> rejected, but the IP can be blacklisted. Another example. You do an RBL 
> lookup and the IP is listed in:
> 
> RBL-A 0.5
> RBL-B 0.5
> RBL-C 0.5
> 
> Score = 1.5 - NO - Score = 5! Usually multiple RBLs is a stronger 
> indicator than the sum of the scores. But suppose you find the IP listed 

Such rules do exist already, and are being evaluated in the sandboxes
for QA. grep for META...

> in the Hostkarma yellow list (yellow means mixed source of spam such as 
> yahoo, gmail, and hotmail) then the RBLs don't matter. In the above 
> example, say the 3 RBLs are US based and the spam is coming from 
> yahoo.fr. Most everything coming from yahoo France to American users is 
> spam and might get listed on low quality RBLs. But my point is that you 
> wouldn't want to assign a negative score for the yellow listing because 
> yellow doesn't mean it's not spam, it means it shouldn't be blacklisted. 
> The better logic is - if not yellow then add up the black scores. (A + B 
> + C) * !yellow. Better to look up yellow first and then skip the RBLs if 
> found.

Exactly such exonerating sub-rules are already used in a lot of
meta-rules. The same concept can easily be applied to RBL tests, too --
if SA uses such "yellow" lists.

There is a real problem with what you just proposed, though. E.g. the
existence of a URI that should not be blacklisted doesn't mean a hit for
another URI should be dropped. Sneaking in a link to apache.org doesn't
render mail.ru any less spammy.

Such exoneration must strictly be limited to different RBL matches for
one and the same URI *only*.
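
A minimal sketch of that restriction, in Python for illustration only.
The list names, scores and the `listings` structure are assumptions, not
anything SA provides. The point is that a yellow hit cancels the black
hits for the same URI / IP and nothing else.

```python
# Made-up list names and scores, for illustration only.
BLACK_SCORES = {"RBL-A": 0.5, "RBL-B": 0.5, "RBL-C": 0.5}


def rbl_score(listings, yellow="YELLOW"):
    """Add up black-list scores per URI host or IP.

    `listings` maps each URI host or IP to the set of lists it appears on.
    A yellow listing cancels the black hits for that same key only.
    """
    total = 0.0
    for key, lists in listings.items():
        if yellow in lists:
            continue  # exonerated: skip the black hits for this key only
        total += sum(BLACK_SCORES.get(name, 0.0) for name in lists)
    return total


listings = {
    "apache.org": {"YELLOW"},
    "mail.ru": {"RBL-A", "RBL-B"},
}

print(rbl_score(listings))  # the mail.ru hits still count
```

Evaluating (A + B + C) * !yellow once over the whole message would
instead let the apache.org link wipe out the mail.ru hits.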

> The important point here is that SA needs to evolve beyond the concept 
> of using addition to compute scores. Ideally there should be more
> hard-coded rule combinations, or Bayesian statistics used to find rule
> combinations that are a more accurate indication than the rules
> themselves.

Sounds like a dynamically adjusting GA run. I guess the problem here is
that this really needs a *lot* of ham, which most sites probably don't
have to the extent necessary. Also, keep in mind that's a rather costly
computation, to say the least.

> Anyhow - just throwing this out there for people to chew on and think about.

Not much news here, if you look closer at the existing rules. :)

I guess one would need a new plugin for the above "yellow" RBLs, because
the exoneration has to be limited to hits for the same URI / IP, as
mentioned above. Also, of course, one first needs a reliable and
publicly available do-not-blacklist RBL.


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
