On 2 Jul 2007, Justin Mason spake thusly:
>
> Tom Allison writes:
>> For some years now there has been a lot of effective spam filtering
>> using statistical approaches with variations on Bayesian theory; some
>> of these are inverse chi-square modifications to Naive Bayes, and even
>> CRM114 and other "languages" have been developed to improve the
>> scoring of statistical analysis of spam. For all statistical
>> processes the spamicity is always between 0 and 1.
>
> Actually, I think this is just a convention adopted by Paul Graham
> in his "Plan for Spam" blog post; SpamAssassin was there beforehand
> with the (ham < 5 < spam) range idea. ;) But anyway...
Well, it's a probability, isn't it: P(spam). All probabilities are
expressed as numbers between 0 and 1, therefore... But no, there's
nothing magic about it.

> The big issue is that, as others have noted, there are very few
> negative-scoring rules, because it's trivial for spammers to forge them.
> The only safe ways to do good ham rules, generally, are:
>
> - network whitelisting
> - SPF/DK/DKIM-driven whitelists
> - site-specific rules
> - Bayes-like "learned" tokens derived from a ham corpus

If you wanted to replace all other scoring mechanisms with the Bayes DB,
you'd need a second Bayes DB for this anyway, or you'd need the tokens
corresponding to typically negative-scoring rules to have values which
cannot appear in the body of an email. Anything else would enable
spammers to force both FPs and FNs by customizing spam appropriately to
include suitable NO_FOO/YES_FOO values.

-- 
`... in the sense that dragons logically follow evolution so they would
be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
furiously
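The token-namespacing idea above can be sketched in a few lines. This is a
minimal illustration, not SpamAssassin's or CRM114's actual implementation:
the `*RULE*` prefix, the tokenizer, and the per-token probabilities are all
hypothetical. The point is only that if the body tokenizer can never emit a
character like `*`, then a rule-derived token containing `*` cannot be forged
by putting its name in the message text.

```python
import re
from functools import reduce

def body_tokens(text):
    # Plain word tokens only; '*' never survives this split, so a
    # spammer writing "*RULE*SPF_PASS" in a body can't forge the token.
    return re.findall(r"[A-Za-z0-9_]+", text.lower())

def rule_token(rule_name):
    # Synthetic token for a fired rule, outside the body-token space.
    return "*RULE*" + rule_name

def spamicity(tokens, prob):
    # Naive-Bayes combination of per-token spam probabilities:
    # P(spam) = prod(p) / (prod(p) + prod(1 - p)); always in [0, 1].
    ps = [prob.get(t, 0.5) for t in tokens]
    num = reduce(lambda a, b: a * b, ps, 1.0)
    den = num + reduce(lambda a, b: a * b, [1 - p for p in ps], 1.0)
    return num / den

# Hypothetical learned probabilities, including one for a ham-ish rule.
prob = {"viagra": 0.99, "meeting": 0.2, rule_token("SPF_PASS"): 0.05}

tokens = body_tokens("free viagra now") + [rule_token("SPF_PASS")]
score = spamicity(tokens, prob)
assert 0.0 <= score <= 1.0

# The literal text "*RULE*SPF_PASS" in a body does not produce the token:
assert rule_token("SPF_PASS") not in body_tokens("buy *RULE*SPF_PASS now")
```

With a single shared Bayes DB this trick is what stands in for the "second
Bayes DB": rule tokens and body tokens coexist, but only the filter itself
can ever emit the former.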