On 2 Jul 2007, Justin Mason spake thusly:
>
> Tom Allison writes:
>> For some years now there has been a lot of effective spam filtering
>> using statistical approaches with variations on Bayesian theory; some
>> of these are inverse chi-square modifications to Naive Bayes, and even
>> CRM114 and other "languages" have been developed to improve the
>> scoring of statistical analysis of spam. For all statistical
>> processes the spamicity is always between 0 and 1.
>
> Actually, I think this is just a convention adopted by Paul Graham
> in his "Plan for Spam" blog post; SpamAssassin was there beforehand
> with the (ham < 5 < spam) range idea. ;) But anyway...
Well, it's a probability, isn't it: P(spam). All probabilities are
expressed as numbers between 0 and 1, therefore... But no, there's
nothing magic about it.

> The big issue is that, as others have noted, there are very few
> negative-scoring rules, because it's trivial for spammers to forge them.
> The only safe ways to do good ham rules, generally, are:
>
> - network whitelisting
> - SPF/DK/DKIM-driven whitelists
> - site-specific rules
> - Bayes-like "learned" tokens derived from a ham corpus

If you wanted to replace all other scoring mechanisms with the Bayes DB,
you'd need a second Bayes DB for this anyway, or you'd need the tokens
corresponding to typically negative-scoring rules to have values which
cannot appear in the body of an email. Anything else would enable
spammers to force both FPs and FNs by customizing spam appropriately to
include suitable NO_FOO/YES_FOO values.

-- 
`... in the sense that dragons logically follow evolution so they would
be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
furiously
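The token-namespacing idea above can be sketched in a few lines. This is a
minimal illustration, not SpamAssassin's or CRM114's actual implementation:
the `*RULE*` prefix, the tokenizer, and the per-token probabilities are all
hypothetical. The point is only that if the body tokenizer can never emit a
character like `*`, then a rule-derived token containing `*` cannot be forged
by putting its name in the message text.

```python
import re
from functools import reduce

def body_tokens(text):
    # Plain word tokens only; '*' never survives this split, so a
    # spammer writing "*RULE*SPF_PASS" in a body can't forge the token.
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

def rule_token(rule_name):
    # Synthetic token for a fired rule, outside the body-token space.
    return "*RULE*" + rule_name

def spamicity(tokens, prob):
    # Naive-Bayes combination of per-token spam probabilities:
    # P(spam) = prod(p) / (prod(p) + prod(1 - p)); always in [0, 1].
    ps = [prob.get(t, 0.5) for t in tokens]
    num = reduce(lambda a, b: a * b, ps, 1.0)
    den = num + reduce(lambda a, b: a * b, [1 - p for p in ps], 1.0)
    return num / den

# Hypothetical learned probabilities, including one for a ham-ish rule.
prob = {"viagra": 0.99, "meeting": 0.2, rule_token("SPF_PASS"): 0.05}

tokens = body_tokens("free viagra now") + [rule_token("SPF_PASS")]
score = spamicity(tokens, prob)
assert 0.0 <= score <= 1.0

# The literal text "*RULE*SPF_PASS" in a body does not produce the token:
assert rule_token("SPF_PASS") not in body_tokens("buy *RULE*SPF_PASS now")
```

With a single shared Bayes DB this trick is what stands in for the "second
Bayes DB": rule tokens and body tokens coexist, but only the filter itself
can ever emit the former.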