Re: update on floating dividing score between spam and ham messages

jdow Mon, 11 Jul 2005 11:08:19 -0700

From: "Matt Kettler" <[EMAIL PROTECTED]>

> Joe Flowers wrote:
> > I don't know if this will help anyone or not, but I wanted to report
> > back just in case.
> >
> > In early April, I completely unhinged the dividing line between what SA
> > score is used to mark a message as spam or ham (5.00 = default). This
> > allows the system and this dividing line to drift "freely" to anywhere
> > that SA will allow, without bound. This anti-spam setup has worked
> > consistently much much better the whole time than in any previous
> > implementation that we have done and with very little maintenance. We
> > are very happy with it and are looking forward to implementing future SA
> > versions in the same fashion.
> >
> > I'm not exactly sure the following numbers represent the whole time
> > since April, but they should be pretty close.
> >
> > We've had 360,922 spam messages and 396,983 ham messages with a
> > normalized average spam score of 6.8714134 and a normalized average ham
> > score of -2.1532284.  I have the dividing line "set" at 30% of the
> > distance between the average ham score and average spam score (30% above
> > the average ham score). So, the dividing line is currently floating
> > around 0.55416414.
>
>
> The only problem I see with this approach is that it treats false positives
> and false negatives as being equally bad.
>
> In general, you're adjusting the score bias so the number of FP's and FNs are
> approximately equal. Although STATISTICS*.txt would suggest that this boundary
> occurs somewhere near 2.0, your own local biases could change this
> considerably.
>
>
> SA's normal scoreset is evolved with the concept that it's better to have 99
> false negatives than 1 false positive. The concept here is most people use
> scripts to move their spam into a separate folder, or auto delete it. With
> that going on, a FP is potentially lost valid email, whereas a FN is a minor
> inconvenience.
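
One way to make that 99-to-1 preference concrete is to score each candidate
threshold by a weighted error count. This is only a sketch: the sample scores,
the candidate list, and the helper names are made up for illustration, and it
assumes a message is tagged as spam when its score is at or above the
threshold.

```python
def weighted_cost(threshold, ham_scores, spam_scores, fp_weight=99.0):
    """Cost of a threshold: each false positive counts fp_weight, each false negative counts 1."""
    false_positives = sum(1 for s in ham_scores if s >= threshold)   # ham tagged as spam
    false_negatives = sum(1 for s in spam_scores if s < threshold)   # spam let through
    return fp_weight * false_positives + false_negatives


def best_threshold(ham_scores, spam_scores, candidates, fp_weight=99.0):
    """Return the candidate threshold with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda t: weighted_cost(t, ham_scores, spam_scores, fp_weight),
    )


# Hypothetical scores, NOT taken from any real mail corpus.
ham_scores = [-3.0, -2.1, -1.0, 0.5, 4.2]
spam_scores = [1.8, 3.9, 6.5, 7.2, 12.0]
candidates = [0.0, 2.0, 5.0]

chosen = best_threshold(ham_scores, spam_scores, candidates)
print(chosen)
```

With a heavy penalty on false positives, the highest candidate wins on this
toy data even though it lets two spam messages through.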


Operating experience here seems to indicate that the SA score evolution
is not optimal. What you want to do is create a <cough> brassiere curve
for the markups for ham and spam. The greater the separation <choke> the
better the results for a decision point between them. The bias to
prevent false negatives probably means you do not want the decision
point right in the center. But anything you can do that widens the gap
between the typical ham and spam score distributions is a good thing. It
makes the results less sensitive to exactly where the decision point is
set and lowers the overall error rates. I think this is part of the
reason I have so much success on a box vastly overloaded with SARE and
other rules. The good rules pile one on the other until it's VERY clear
what is ham and what is spam.
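
For reference, the floating decision point from Joe's message reduces to one
line of arithmetic. A minimal sketch using his reported averages; the function
name and the `fraction` parameter are my own labels, not anything in SA.

```python
def floating_threshold(avg_ham, avg_spam, fraction=0.30):
    """Place the decision point `fraction` of the way from the average ham score to the average spam score."""
    return avg_ham + fraction * (avg_spam - avg_ham)


# Averages as reported in the quoted message.
avg_ham = -2.1532284
avg_spam = 6.8714134

threshold = floating_threshold(avg_ham, avg_spam)
print(f"{threshold:.8f}")  # prints 0.55416414, the figure Joe quotes
```

A fraction of 0.50 would put the point dead center between the two averages;
anything else moves it off center toward one side or the other.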

(It surely would be nice if there were some really good indications of
"not spam". However, nothing has ever appeared other than absence of
hits on spam-sign.)

{^_^}
