For some years now there has been a lot of effective spam filtering
using statistical approaches built on variations of Bayesian theory.
Some of these are inverse chi-square modifications to Naive Bayes;
CRM114 and other "languages" have even been developed to improve the
scoring of statistical spam analysis. For all of these statistical
processes, the spamicity is always between 0 and 1.
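As a rough illustration of the inverse chi-square idea mentioned above, here is a minimal Python sketch of Fisher's method for combining per-token spam probabilities (in the style popularized by Gary Robinson; the function names are my own, not any filter's actual API):

```python
import math

def chi2q(x2, df):
    """Survival function of the chi-square distribution for even df,
    computed with the standard series expansion."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_combine(probs):
    """Combine per-token spam probabilities with Fisher's method:
    under the null hypothesis, -2 * sum(ln p) follows a chi-square
    distribution with 2n degrees of freedom."""
    n = len(probs)
    # near 1 when the tokens look spammy
    s = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # near 1 when the tokens look hammy
    h = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Robinson's symmetric indicator, always in [0, 1]
    return (1.0 + s - h) / 2.0
```

This is why the statistical score always lands between 0 and 1: the combined indicator is a probability-like quantity, not an open-ended point total.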
Before this, and alongside it, has been the approach of
SpamAssassin, wherein every email is evaluated against a library of
rules and a number of points is assigned for each rule that hits.
Given enough points, the email is classified as spam; otherwise, ham.
To accommodate the Bayesian process, SA was modified with a Bayes
engine and the ability to add points depending on where the Bayesian
score fell (>.85, >.95, ...). For all of these processes, the score
falls between something negative and something positive, depending on
the total number of hits and the points assigned to them.
It occurred to me that this process of assigning points (added or
subtracted) to each "HIT" is slightly arbitrary. There is a long
process of evaluating the "most effective score" for each rule and
then providing that as the default. The mail admin has the option to
retune these various parameters as needed. To me, this looks like a
lot of knobs I can turn on a very complex machine I will probably
never really understand. In short, if I touch it, I will break it.
But the arbitrary part of the process is this manual balancing act
between how many points to apply to something and getting the call
from the CEO about his overabundance of East European teenage
solicitors (or lack thereof).
The thought I had, and have been working on for a while, is to change
how the scoring is done. Rather than making Bayes a part of the
scoring process, make the scoring process a part of the Bayesian
statistical engine. You would simply feed the scoring hits (binary
yes/no) into the Bayesian process as tokens, to be examined alongside
the other tokens in the message. It would then be the Bayes process
that determines the effective number of points assigned to each HIT,
based on what it has learned about that hit from you. So the tags
ADVANCE_FEE_1 and ADVANCE_FEE_2 would be represented as tokens of the
format:
ADVANCE_FEE_1=YES or NO
ADVANCE_FEE_2=YES or NO
and each of these tokens would then be evaluated through your
learning process.
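As a small sketch of the encoding step (the rule names here are just examples; nothing in this snippet is SpamAssassin's actual interface):

```python
# Hypothetical set of rules whose outcomes we want to tokenize.
RULES = ["ADVANCE_FEE_1", "ADVANCE_FEE_2", "RAZOR2_CHECK"]

def rule_tokens(hits):
    """Turn each rule's binary outcome into a NAME=YES/NO token
    that the Bayes engine can count like any ordinary word token."""
    return [f"{name}={'YES' if name in hits else 'NO'}" for name in RULES]

# A message that hit only ADVANCE_FEE_1 would contribute:
tokens = rule_tokens({"ADVANCE_FEE_1"})
```

Note that encoding misses as explicit NAME=NO tokens, rather than just omitting them, is what lets the engine learn from a rule's *absence* as well as its presence.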
An advantage of this would be eliminating the process of determining
the best number of points to assign, or of deciding whether you even
want a rule included. Point assignments would be determined from the
statistical hits (number of spam, number of ham) and would be tuned
on a per-site or per-user basis, depending on the Bayes engine
configuration. Each user, by means of their feedback, would tune the
importance of each rule applied.
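One way that per-token tuning could work is Robinson's smoothed probability estimate, which pulls rarely seen tokens toward a neutral prior. This is only a sketch under that assumption, not any existing engine's code:

```python
def spamicity(spam_hits, ham_hits, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style smoothed spam probability for one token
    (e.g. "ADVANCE_FEE_1=YES"), given how often it appeared in the
    user's trained spam and ham corpora of sizes n_spam and n_ham."""
    # Normalize by corpus size so unbalanced training stays fair.
    p_spam = spam_hits / max(n_spam, 1)
    p_ham = ham_hits / max(n_ham, 1)
    n = spam_hits + ham_hits
    raw = p_spam / (p_spam + p_ham) if (p_spam + p_ham) > 0 else x
    # Blend the raw ratio with the neutral prior x, weighted by
    # strength s: unseen tokens score exactly x = 0.5.
    return (s * x + n * raw) / (s + n)
```

Under this scheme, a user who keeps training a rule's YES token as spam drives its spamicity toward 1, which is exactly the "effective points" tuning described above, done per user with no manual knob.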
Whether you want to include a rule would be determined for you
automatically, based on the resulting scoring. If a rule has an
overall historical performance of 0.499, it's pretty obvious that it
is incapable of "seeing" your kind of spam/ham. But if you throw
together a rule, run it for a week, and find it's scoring 0.001 or
0.999, then you have evidence of how effective the rule is and can
continue to use it. It is conceivable that you could start with all
known rules and later remove every rule sitting nominally at 0.500,
improving performance through an objective process. The same would
apply to any of the networked rules like botnet, DCC, or Razor,
because they just produce a tag line and a YES/NO indication.
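That pruning pass could be as simple as the following sketch (the rule names and the 0.1 margin are illustrative assumptions, not recommended values):

```python
def prune_rules(rule_scores, margin=0.1):
    """Keep only rules whose historical spamicity is informative,
    i.e. far enough from the neutral 0.5 in either direction."""
    return {name: p for name, p in rule_scores.items()
            if abs(p - 0.5) >= margin}

history = {"GOOD_RULE": 0.97, "USELESS_RULE": 0.499, "HAM_RULE": 0.02}
kept = prune_rules(history)  # USELESS_RULE is dropped
```

The point is that the cut is made from observed statistics rather than from anyone's judgment about what a rule "should" be worth.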
I've been working on something like this myself to great effect, but
it would be far more practical to leverage the knowledge and
capability that already exist in SpamAssassin. However, I'm not
familiar enough with SpamAssassin to know how to gain visibility into
all the rules run and all their results (hits are easy in
PerMsgStatus, but misses are not). If someone would be willing to
give me some pointers, or a roadmap of sorts, it would be
appreciated.
Many thanks to those of you who have read this far, for your patience
and consideration.