Matt:
I know you know a lot more about this than I do, but for what it's
worth, your impressions/intuitions are very close to mine.
Originally back in April, I started off using the "average of the
means", but that let through way too much spam.
So what I have now is the threshold set to 30% above the average spam
score, which is 20% below the "average of the means".
The assumption being that the optimal spot is somewhere between the two
averages.
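
To make the arithmetic concrete, here is a minimal sketch of the
"average of the means" and a threshold biased 20% below it. The function
name, the bias parameter, and the assumption that per-message scores are
available as plain lists are all illustrative; nothing here is part of
SpamAssassin itself.

```python
from statistics import mean

def midpoint_thresholds(ham_scores, spam_scores, bias=0.20):
    """Return (ham_mean, spam_mean, average_of_means, biased_threshold).

    bias=0.20 places the threshold 20% below the average of the means;
    the name and default are assumptions made for this sketch.
    """
    ham_mean = mean(ham_scores)
    spam_mean = mean(spam_scores)
    # "Average of the means": halfway between the two class means.
    average_of_means = (ham_mean + spam_mean) / 2
    # Pull the threshold down to catch more spam, at the cost of more FPs.
    biased_threshold = average_of_means * (1 - bias)
    return ham_mean, spam_mean, average_of_means, biased_threshold
```

Feeding it the scores from a hand-sorted ham corpus and spam corpus gives
the two means plus both candidate thresholds to compare.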
Also, that nasty drop-off that produces a lot of FPs is in my intuition
too, and as of yet we haven't run into it.
Now, if the two curves could be slid further apart so that there is a big
dead zone... Although, without upgrading to a newer version of SA, I
don't see how I can expect much better results.
BTW, if anyone knows a command-line program that can easily run through
a bunch of mbox files and tell me how many messages are in them, I will
report back how many ham and how many spam messages I have fed to
bayes. It's far from perfect, but it may offer some interesting info
regarding the 100:1 (fn:fp) ratio.
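
One possible way to get those counts is Python's standard mailbox
module. This is only a sketch: the function name is made up for
illustration, and it assumes the files are well-formed mbox files that
already exist on disk.

```python
import mailbox

def count_mbox_messages(paths):
    """Return a dict mapping each mbox path to its message count."""
    counts = {}
    for path in paths:
        # create=False: raise an error instead of silently creating
        # an empty mailbox when the path is wrong.
        box = mailbox.mbox(path, create=False)
        try:
            counts[path] = len(box)
        finally:
            box.close()
    return counts
```

Calling it with a list of file names (say, one ham mbox and one spam
mbox, names hypothetical) returns the per-file counts, which can be
summed per class to get the ham and spam totals.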
Joe
Matt Kettler wrote:
Joe Flowers wrote:
Matt Kettler wrote:
The only problem I see with this approach is that it treats false
positives and false negatives as being equally bad.
We do get many more false negatives than false positives, even though we
don't get false positives very often - they are rare.
We certainly don't get 1 fp for every fn.
In general, you're adjusting the score bias so the numbers of FPs and
FNs are approximately equal.
This is not what we are seeing in practice. It's not even close to 50-50.
Based on JM's comments about the score distribution for hams being non-linear,
this makes sense. If the distribution were linear for both, you'd get 50/50 by
dividing the score between the two means.
Since the ham is going to have a pretty sharp drop-off somewhere slightly above
its mean, your split-score approach won't be as bad as 1:1, but it's also likely
to not be as good as the 100:1 which the 5.0 threshold should get you.
It's an interesting concept, and it would be very interesting to graph out FP vs
FN rates against thresholds.
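
A rough sketch of how the data for such a graph could be produced,
assuming you have lists of scores for hand-classified ham and spam. The
function name and inputs are hypothetical, and it assumes a message is
tagged as spam when its score is at or above the threshold.

```python
def fp_fn_rates(ham_scores, spam_scores, thresholds):
    """Return a list of (threshold, fp_rate, fn_rate) tuples.

    fp_rate: fraction of ham scoring at or above the threshold.
    fn_rate: fraction of spam scoring below the threshold.
    """
    rows = []
    for t in thresholds:
        false_pos = sum(1 for s in ham_scores if s >= t)
        false_neg = sum(1 for s in spam_scores if s < t)
        rows.append((t,
                     false_pos / len(ham_scores),
                     false_neg / len(spam_scores)))
    return rows
```

Running it over thresholds from 0 up to 5.0 in small steps and plotting
the two rate columns would show where the ham curve drops off.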
This graph from JM's post is real data:
http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
But it doesn't go below 5.0. It would be interesting to see how those curves
continue as you approach 0.
This graph is a good conceptual one in the "normal" sense of numbers:
http://taint.org/xfer/2005/score-dist-doodle.gif
That graph would suggest that somewhere below 5.0 there is a threshold at which
the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
associated. I'd venture to guess that your "average of the means" is going to
wind up picking something near, but just above that threshold.
That's a bit of an intuitive guess, but also it has some roots in reality. The
average score of a ham message on a curve like that is going to wind up being
somewhere in the middle of that nasty drop off. By biasing just above that you
should bring yourself into the second part of the curve, where decreases in
score have a somewhat modest impact on FP rate.