I run a small ISP and have installed SpamAssassin to stop spam.  It catches 
a lot, and it's especially good at filtering out the worst, most offensive 
mail, but a good deal of spam still gets through the filter, even after a 
user's Bayes DB grows big enough for the Bayes tests to start firing.

I've noticed that a lot of the spam that makes it to my inbox has scores 
between 4 and 4.9 -- mail that scored positive on 5-10 rules, and that SA 
should be able to file as spam with little risk of a false positive, but 
doesn't.

The flaw, IMO, is the additive scoring.  Sure, any one of these rules 
triggered in isolation should only add 0.3 or 0.1 to the final score.  But 
the probability that a message is spam should go sky-high when, say, five 
substantially different 0.2 and 0.1 rules all come back positive for a 
single message.

The statistics should bear this out as a useful test.
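
To make that concrete: under a naive independence assumption, combining 
evidence multiplicatively in odds space makes five weak hits add up fast.  
A back-of-the-envelope sketch in Python (the per-rule probabilities here 
are invented for illustration, not measured from any corpus):

    # Hypothetical spam probabilities for five weak, independent rules --
    # the numbers are made up to illustrate the point.
    per_rule_p = [0.7, 0.7, 0.65, 0.6, 0.6]

    # Bayes-style combination: multiply likelihood ratios in odds space.
    odds = 1.0  # assume 1:1 prior odds of spam vs. ham
    for p in per_rule_p:
        odds *= p / (1.0 - p)

    combined_p = odds / (1.0 + odds)
    print(f"combined P(spam) = {combined_p:.2f}")  # ~0.96

Five rules that are each only modestly better than a coin flip put the 
combined probability near 0.96, while their additive scores would total 
well under the 5.0 threshold.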

Without ditching the current scoring altogether in favor of a multiplicative 
model (a la Bayes), what if there were a post-analysis scoring step that 
took into account the total number of positive rules (or rule families, if 
there is such a division)?  Instead of looking at each test as though it 
occurred in isolation, this would put all the tests into sharper context 
without throwing away a lot of scoring code.

I'm sure the perceptron could come up with a more accurate gradation, but I 
imagine it would look something like this:
 0 rules - 0.0
 1 rule  - 0.0
 2 rules - 0.0
 3 rules - 0.0
 4 rules - 1.0
 5 rules - 2.0
 6 rules - 3.0
 7-9 rules - 4.0
 10+ rules - 5.0
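
For what it's worth, here's a rough Python sketch of that post-analysis 
step -- the breakpoints just mirror the gradation above, and counting a 
rule as "positive" whenever its score is above zero is my assumption, not 
anything SA actually exposes:

    # Bonus keyed on the number of rules that fired; the breakpoints
    # mirror the illustrative gradation above, not tuned values.
    def hit_count_bonus(hits: int) -> float:
        if hits >= 10:
            return 5.0
        if hits >= 7:
            return 4.0
        if hits >= 4:
            return float(hits - 3)  # 4, 5, 6 hits -> 1.0, 2.0, 3.0
        return 0.0

    def final_score(rule_scores: list[float]) -> float:
        base = sum(rule_scores)                      # SA's additive total
        hits = sum(1 for s in rule_scores if s > 0)  # count positive rules
        return base + hit_count_bonus(hits)

    # Eight weak 0.2 hits only sum to 1.6 additively, but the bonus
    # pushes the total over the 5.0 threshold:
    print(f"{final_score([0.2] * 8):.1f}")  # 5.6

That keeps the existing per-rule scores intact and only bolts a correction 
on at the end.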

Thoughts?
 -tom
