Re: update on floating dividing score between spam and ham messages

Justin Mason Mon, 11 Jul 2005 10:42:31 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


the real-world figures can be seen for various thresholds in
the rules/STATISTICS*.txt files...

- --j.

Matt Kettler writes:
> Joe Flowers wrote:
> > Matt Kettler wrote:
> > 
> >> The only problem I see with this approach is that it treats false positives
> >> and false negatives as being equally bad.
> >>  
> >>
> > 
> > We do get many more false negatives than false positives, even though we
> > don't get false positives very often - they are rare.
> > We certainly don't get 1 fp for every fn.
> > 
> >> In general, you're adjusting the score bias so the number of FPs and FNs
> >> are approximately equal.
> > 
> > 
> > This is not what we are seeing in practice. It's not even close to 50-50.
> > 
> 
> Based on JM's comments about the score distribution for hams being non-linear,
> this makes sense. If the distribution was linear for both you'd get 50/50 by
> dividing the score between the two means.
> 
> Since the ham is going to have a pretty sharp drop-off somewhere slightly above
> its mean, your split score approach won't be as bad as 1:1, but it's also
> likely to not be as good as the 100:1 which the 5.0 threshold should get you.
> 
> It's an interesting concept, and it would be very interesting to graph out FP
> vs FN rates against thresholds.
> 
> This graph from JM's post is real data:
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
> 
> But it doesn't go below 5.0. It would be interesting to see how those curves
> continue as you approach 0.
> 
> This graph is a good conceptual one in the "normal" sense of numbers:
> http://taint.org/xfer/2005/score-dist-doodle.gif
> 
> That graph would suggest that somewhere below 5.0 there is a threshold at which
> the ham FP rate gets MUCH worse in a very sudden way. However, there's no score
> associated. I'd venture to guess that your "average of the means" is going to
> wind up picking something near, but just above that threshold.
> 
> That's a bit of an intuitive guess, but it also has some roots in reality. The
> average score of a ham message on a curve like that is going to wind up being
> somewhere in the middle of that nasty drop-off. By biasing just above that you
> should bring yourself into the second part of the curve, where decreases in
> score have a somewhat modest impact on FP rate.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC0q8dMJF5cimLx9ARAuLrAKCQnoc8eo2rAvIDYIWX0DfW/T0NZgCePoyH
WZS8C6aamuWZ3H6C6n8k2n0=
=Hruw
-----END PGP SIGNATURE-----
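
For anyone who wants to try the comparison discussed above on their own
corpus, here is a rough sketch (not part of the original message) that
computes FP and FN rates at a given threshold and the "average of the
means" split score. The function names and the sample scores are made up
for illustration, and it assumes a message is tagged as spam when its
score is at or above the threshold.

```python
from statistics import mean


def fp_fn_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) for one threshold.

    Assumption: a message is tagged as spam when score >= threshold.
    """
    fp = sum(1 for s in ham_scores if s >= threshold)   # ham tagged as spam
    fn = sum(1 for s in spam_scores if s < threshold)   # spam let through
    return fp / len(ham_scores), fn / len(spam_scores)


def midpoint_of_means(ham_scores, spam_scores):
    """The floating split score: halfway between the ham and spam means."""
    return (mean(ham_scores) + mean(spam_scores)) / 2


# Hypothetical scores, invented for this example; substitute real ones.
ham = [-2.0, -1.0, 0.0, 0.0, 0.5, 1.0, 1.5, 4.0]
spam = [2.0, 4.5, 5.5, 6.0, 7.0, 8.0, 9.0, 10.0]

split = midpoint_of_means(ham, spam)
for t in (split, 5.0):
    fp_rate, fn_rate = fp_fn_rates(ham, spam, t)
    print(f"threshold {t:.2f}: FP rate {fp_rate:.3f}, FN rate {fn_rate:.3f}")
```

Sweeping the threshold from 0 up to 5.0 in small steps and plotting both
rates would give the FP vs FN graph asked for above.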
