Matt:

I know you know a lot more about this than I do, but for what it's worth, your impressions/intuitions are very close to mine. Originally, back in April, I started off using the "average of the means", but that let through way too much spam.

So, what I have it set to now is 30% above the average spam score, which works out to 20% below the "average of the means". The assumption is that the optimal spot is somewhere between the two averages.
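
For concreteness, here's a minimal Python sketch of just the "20% below the average of the means" part of that calculation (the corpus means below are hypothetical placeholders, not our real numbers):

    # Hypothetical corpus statistics; substitute your own.
    ham_mean = 2.0    # average SA score over known ham
    spam_mean = 15.0  # average SA score over known spam

    # "Average of the means": the naive midpoint between the corpora.
    midpoint = (ham_mean + spam_mean) / 2.0

    # Bias the threshold 20% below the midpoint, on the assumption
    # that the optimal spot is somewhere between the two averages.
    threshold = midpoint * 0.8

    print(f"midpoint={midpoint:.2f} threshold={threshold:.2f}")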

Also, that nasty drop-off that produces a lot of FPs matches my intuition too, and as of yet, we haven't run into it.

Now, if only the two curves could be slid farther apart so that there was a big dead zone between them... Although, without upgrading to a newer version of SA, I don't see how I can expect much better results.

BTW, if anyone knows a command-line program that can easily run through a bunch of mbox files and tell me how many messages are in them, I will report back how many ham and how many spam messages I have fed to Bayes. It's far from perfect, but it may offer some interesting info regarding the 100:1 (fn:fp) ratio.
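
In case it helps, here's a minimal sketch using Python's standard-library mailbox module to count messages across mbox files (the paths come from the command line; nothing here is specific to any particular setup):

    import mailbox
    import sys

    # Count messages in each mbox file named on the command line.
    total = 0
    for path in sys.argv[1:]:
        count = len(mailbox.mbox(path))  # parse the file, count messages
        print(f"{path}: {count}")
        total += count
    print(f"total: {total}")

Run it as, e.g., "python count_mbox.py ham/*.mbox spam/*.mbox" and it prints a per-file count plus a grand total.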

Joe


Matt Kettler wrote:

Joe Flowers wrote:
Matt Kettler wrote:

The only problem I see with this approach is that it treats false positives and false negatives as being equally bad.


We do get many more false negatives than false positives; false positives are rare. We certainly don't get 1 fp for every fn.

In general, you're adjusting the score bias so that the numbers of FPs and FNs are approximately equal.
This is not what we are seeing in practice. It's not even close to 50-50.


Based on JM's comments about the score distribution for hams being non-linear, this makes sense. If the distributions were linear for both, you'd get 50/50 by dividing the score between the two means.

Since the ham is going to have a pretty sharp drop-off somewhere slightly above its mean, your split-score approach won't be as bad as 1:1, but it's also likely not to be as good as the 100:1 that the 5.0 threshold should get you.

It's an interesting concept, and it would be worthwhile to graph FP vs. FN rates against thresholds.
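
For anyone who wants to try it, here's a minimal Python sketch of that kind of sweep; the score lists are hypothetical placeholders for per-message SA scores from hand-sorted ham and spam corpora:

    # Hypothetical per-message scores; substitute real corpus scores.
    ham_scores = [-1.2, 0.3, 1.8, 2.5, 3.1, 4.9, 6.2]
    spam_scores = [3.4, 5.1, 7.8, 12.0, 15.5, 18.2, 22.9]

    # Sweep thresholds from 0.0 to 10.0 and report FP/FN rates:
    # a ham scoring at or above the threshold is a false positive,
    # a spam scoring below it is a false negative.
    for i in range(21):
        t = i * 0.5
        fp = sum(1 for s in ham_scores if s >= t) / len(ham_scores)
        fn = sum(1 for s in spam_scores if s < t) / len(spam_scores)
        print(f"threshold={t:4.1f}  FP={fp:6.1%}  FN={fn:6.1%}")

Plotting FP and FN against the threshold from real corpora would show where the two curves cross and how sharp the ham drop-off really is.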

This graph from JM's post is real data:
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html

But it doesn't go below 5.0. It would be interesting to see how those curves
continue as you approach 0.

This graph is a good conceptual one, in the sense of idealized "normal" curves rather than real numbers:
http://taint.org/xfer/2005/score-dist-doodle.gif

That graph would suggest that somewhere below 5.0 there is a threshold at which the ham FP rate gets MUCH worse in a very sudden way. However, there's no actual score associated with it. I'd venture to guess that your "average of the means" is going to wind up picking something near, but just above, that threshold.

That's a bit of an intuitive guess, but it also has some roots in reality. The average score of a ham message on a curve like that is going to wind up being somewhere in the middle of that nasty drop-off. By biasing just above that, you should bring yourself into the second part of the curve, where decreases in the threshold score have a somewhat modest impact on the FP rate.


