On Jun 30, 2007, at 4:46 AM, John Andersen wrote:


On Friday 29 June 2007, Tom Allison wrote:

It would be the Bayes process that determines the effective number of
points you assign for each HIT based on what it's learned about it
from you.  So tags such as ADVANCE_FEE_1 and ADVANCE_FEE_2 would be
represented as tokens of the form:
ADVANCE_FEE_1=YES or NO
ADVANCE_FEE_2=YES or NO
and each of these tokens would then be evaluated based on your
learning process.

Sort of like a multiple linear regression analysis, where you simply start
dropping terms with low coefficients to simplify the calculation.
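
To make that concrete, here is a rough Python sketch of how rule hits could be folded into a Bayes learner as YES/NO tokens. The class name, the Laplace smoothing, and everything else below are my own invention for illustration -- this is not SA's actual Bayes implementation.

# Sketch only: fold rule hits into a toy naive Bayes learner as tokens.
from collections import defaultdict
import math

class RuleHitBayes:
    def __init__(self):
        self.spam_counts = defaultdict(int)
        self.ham_counts = defaultdict(int)
        self.nspam = 0
        self.nham = 0

    @staticmethod
    def tokens(hits, all_rules):
        # Every known rule becomes a YES or NO token, as described above.
        return ["%s=%s" % (r, "YES" if r in hits else "NO") for r in all_rules]

    def train(self, hits, all_rules, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in self.tokens(hits, all_rules):
            counts[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def spam_probability(self, hits, all_rules):
        # Naive Bayes with Laplace smoothing: tokens seen mostly in spam
        # pull the score up, ham-ish tokens pull it down.
        log_odds = math.log((self.nspam + 1.0) / (self.nham + 1.0))
        for tok in self.tokens(hits, all_rules):
            p_spam = (self.spam_counts[tok] + 1.0) / (self.nspam + 2.0)
            p_ham = (self.ham_counts[tok] + 1.0) / (self.nham + 2.0)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))

A token like ADVANCE_FEE_1=YES that the learner keeps seeing in spam ends up contributing a large positive term, while tokens that never discriminate contribute almost nothing -- which is the "dropping terms with low coefficients" effect in practice.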

Interesting idea.

You have a bit of a chicken-and-egg problem at the start, until
some learning takes place in the system.


For a purely Bayesian filter this is always the case.
But I have found, through mailing lists and personal experience, that it can be mitigated through a variety of approaches.

The first approach is to deploy SA only after you have trained it on some past corpus of mail you've captured. Opinions on how many messages you need for it to be effective vary from tens to thousands. This is strictly a YMMV issue.
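
If the corpus is already sorted into spam and ham folders, the bulk training is just a couple of sa-learn runs; the little Python wrapper below is only a sketch, and the directory paths are invented.

# Sketch: bulk-train SA's Bayes DB from a saved corpus before going live.
# sa-learn accepts message files or directories as arguments.
import subprocess

corpus = {
    "--spam": "/home/me/corpus/spam",  # invented path
    "--ham": "/home/me/corpus/ham",    # invented path
}
for flag, path in corpus.items():
    subprocess.run(["sa-learn", flag, path], check=True)

# Flush the journal so the new tokens are usable right away.
subprocess.run(["sa-learn", "--sync"], check=True)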

Personally, I use a train-on-error approach (never auto-train or train on everything -- only the minimum needed to correct mistakes), and about 10 emails gets me above 90%. My scoring is a little fuzzy, though -- I use a ternary Yes/No/Maybe scoring process. If I exclude the Maybes I reach 100% success in very short order; including the Maybes I reach about 98% after training on ~100 messages. Either way, the worst is over in the first day.
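
The train-on-error loop itself is simple; reusing the toy learner sketched above, it might look like this. The 0.2/0.8 thresholds for the Maybe band are arbitrary numbers of mine, not anything SA uses.

# Sketch: only mistakes and "Maybe" messages get fed back to the learner.
def verdict(p_spam, lo=0.2, hi=0.8):
    if p_spam >= hi:
        return "Yes"
    if p_spam <= lo:
        return "No"
    return "Maybe"

def review(bayes, msg_hits, rules, actually_spam):
    v = verdict(bayes.spam_probability(msg_hits, rules))
    wrong = (v == "Yes") != actually_spam
    if v == "Maybe" or wrong:
        # Train on error: never on everything, only what was missed or unsure.
        bayes.train(msg_hits, rules, is_spam=actually_spam)
    return v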

Another method would be to seed the Bayes data from an SQL script that preloads certain tokens and values. It's a bit of a "hack" in my opinion, but it would be effective, and any discrepancies would quickly be corrected by further training. In the case of SA I would seed the rule-hit tokens into the tables as the simplest, yet still effective, route.
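
Purely to illustrate the seeding idea, here is what such a preload could look like against a toy SQLite table. This is NOT SpamAssassin's real SQL Bayes schema (SA hashes its tokens, among other things), so check the schema files shipped with SA before touching a live bayes_* database.

# Sketch: preload token counts into a toy table to give Bayes a head start.
import sqlite3

seed = [
    # (token, spam_count, ham_count) -- the counts are invented examples
    ("ADVANCE_FEE_1=YES", 50, 1),
    ("ADVANCE_FEE_2=YES", 40, 1),
]

db = sqlite3.connect("toy_bayes.db")
db.execute("""CREATE TABLE IF NOT EXISTS tokens
              (token TEXT PRIMARY KEY, spam_count INT, ham_count INT)""")
db.executemany("INSERT OR REPLACE INTO tokens VALUES (?, ?, ?)", seed)
db.commit()
db.close()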

