On Jun 30, 2007, at 4:46 AM, John Andersen wrote:


On Friday 29 June 2007, Tom Allison wrote:

It would be the Bayes process that determines the effective number of
points you assign for each HIT based on what it's learned about it
from you.  So tags such as ADVANCE_FEE_1 and ADVANCE_FEE_2 would be
represented as tokens of the form:
ADVANCE_FEE_1=YES or NO
ADVANCE_FEE_2=YES or NO
and each of these tokens would then be evaluated based on your
learning process.

Sort of like a multiple linear regression analysis, where you simply start
dropping terms with low coefficients to simplify the calculation.
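
To make that concrete, here is a rough Python sketch of how rule hits could be folded into a Bayes learner as YES/NO tokens. The class name, the Laplace smoothing, and everything else below are my own invention for illustration -- this is not SA's actual Bayes implementation.

# Sketch only: fold rule hits into a toy naive Bayes learner as tokens.
from collections import defaultdict
import math

class RuleHitBayes:
    def __init__(self):
        self.spam_counts = defaultdict(int)
        self.ham_counts = defaultdict(int)
        self.nspam = 0
        self.nham = 0

    @staticmethod
    def tokens(hits, all_rules):
        # Every known rule becomes a YES or NO token, as described above.
        return ["%s=%s" % (r, "YES" if r in hits else "NO") for r in all_rules]

    def train(self, hits, all_rules, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        for tok in self.tokens(hits, all_rules):
            counts[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def spam_probability(self, hits, all_rules):
        # Naive Bayes with Laplace smoothing: tokens seen mostly in spam
        # pull the score up, ham-ish tokens pull it down.
        log_odds = math.log((self.nspam + 1.0) / (self.nham + 1.0))
        for tok in self.tokens(hits, all_rules):
            p_spam = (self.spam_counts[tok] + 1.0) / (self.nspam + 2.0)
            p_ham = (self.ham_counts[tok] + 1.0) / (self.nham + 2.0)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))

A token like ADVANCE_FEE_1=YES that the learner keeps seeing in spam ends up contributing a large positive term, while tokens that never discriminate contribute almost nothing -- which is the "dropping terms with low coefficients" effect in practice.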

Interesting idea.

You have a bit of a chicken-and-egg problem at the start, until
some learning takes place in the system.


For a purely Bayesian filter this is always the case.
But I have found, through mailing lists and personal experience, that it can be mitigated through a variety of approaches.

The first approach is to deploy SA only after you have trained it on some past corpus of mail you've captured. Opinions on how many messages you need for it to be effective vary from tens to thousands. This is strictly a YMMV issue.
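
If the corpus is already sorted into spam and ham folders, the bulk training is just a couple of sa-learn runs; the little Python wrapper below is only a sketch, and the directory paths are invented.

# Sketch: bulk-train SA's Bayes DB from a saved corpus before going live.
# sa-learn accepts message files or directories as arguments.
import subprocess

corpus = {
    "--spam": "/home/me/corpus/spam",  # invented path
    "--ham": "/home/me/corpus/ham",    # invented path
}
for flag, path in corpus.items():
    subprocess.run(["sa-learn", flag, path], check=True)

# Flush the journal so the new tokens are usable right away.
subprocess.run(["sa-learn", "--sync"], check=True)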

Personally, I use a train-on-error approach (never auto-train or train on everything -- only the minimum needed to correct mistakes), and about 10 emails gets me above 90%. My scoring is a little fuzzy, though -- I use a ternary Yes/No/Maybe scoring process. If I exclude the Maybes I reach 100% success in very short order; including the Maybes I reach about 98% after training on ~100 messages. Either way, the worst is over in the first day.
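
The train-on-error loop itself is simple; reusing the toy learner sketched above, it might look like this. The 0.2/0.8 thresholds for the Maybe band are arbitrary numbers of mine, not anything SA uses.

# Sketch: only mistakes and "Maybe" messages get fed back to the learner.
def verdict(p_spam, lo=0.2, hi=0.8):
    if p_spam >= hi:
        return "Yes"
    if p_spam <= lo:
        return "No"
    return "Maybe"

def review(bayes, msg_hits, rules, actually_spam):
    v = verdict(bayes.spam_probability(msg_hits, rules))
    wrong = (v == "Yes") != actually_spam
    if v == "Maybe" or wrong:
        # Train on error: never on everything, only what was missed or unsure.
        bayes.train(msg_hits, rules, is_spam=actually_spam)
    return v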

Another method would be to seed the Bayes data from an SQL script that preloads certain tokens and values. It's a bit of a "hack" in my opinion, but it would be effective, and any discrepancies would quickly be corrected by further training. In the case of SA I would seed the rule-hit tokens into the tables as the simplest, yet still effective, route.
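
Purely to illustrate the seeding idea, here is what such a preload could look like against a toy SQLite table. This is NOT SpamAssassin's real SQL Bayes schema (SA hashes its tokens, among other things), so check the schema files shipped with SA before touching a live bayes_* database.

# Sketch: preload token counts into a toy table to give Bayes a head start.
import sqlite3

seed = [
    # (token, spam_count, ham_count) -- the counts are invented examples
    ("ADVANCE_FEE_1=YES", 50, 1),
    ("ADVANCE_FEE_2=YES", 40, 1),
]

db = sqlite3.connect("toy_bayes.db")
db.execute("""CREATE TABLE IF NOT EXISTS tokens
              (token TEXT PRIMARY KEY, spam_count INT, ham_count INT)""")
db.executemany("INSERT OR REPLACE INTO tokens VALUES (?, ?, ?)", seed)
db.commit()
db.close()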

