-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

the real-world figures can be seen for various thresholds in the
rules/STATISTICS*.txt files...

- --j.

Matt Kettler writes:
> Joe Flowers wrote:
> > Matt Kettler wrote:
> >
> >> The only problem I see with this approach is that it treats false
> >> positives and false negatives as being equally bad.
> >
> > We do get many more false negatives than false positives, even though we
> > don't get false positives very often - they are rare.
> > We certainly don't get 1 fp for every fn.
> >
> >> In general, you're adjusting the score bias so the number of FP's and
> >> FNs are approximately equal.
> >
> > This is not what we are seeing in practice. It's not even close to 50-50.
> >
>
> Based on JM's comments about the score distribution for hams being
> non-linear, this makes sense. If the distribution was linear for both you'd
> get 50/50 by dividing the score between the two means.
>
> Since the ham is going to have a pretty sharp drop-off somewhere slightly
> above its mean, your split score approach won't be as bad as 1:1, but it's
> also likely to not be as good as 100:1, which the 5.0 threshold should get
> you.
>
> It's an interesting concept, and it would be very interesting to graph out
> FP vs FN rates against thresholds.
>
> This graph from JM's post is real data:
> http://spamassassin.apache.org/presentations/HEANet_2002/img12.html
>
> But it doesn't go below 5.0. It would be interesting to see how those curves
> continue as you approach 0.
>
> This graph is a good conceptual one in the "normal" sense of numbers:
> http://taint.org/xfer/2005/score-dist-doodle.gif
>
> That graph would suggest that somewhere below 5.0 there is a threshold at
> which the ham FP rate gets MUCH worse in a very sudden way. However, there's
> no score associated. I'd venture to guess that your "average of the means"
> is going to wind up picking something near, but just above that threshold.
>
> That's a bit of an intuitive guess, but also it has some roots in reality.
> The average score of a ham message on a curve like that is going to wind up
> being somewhere in the middle of that nasty drop off. By biasing just above
> that you should bring yourself into the second part of the curve, where
> decreases in score have a somewhat modest impact on FP rate.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFC0q8dMJF5cimLx9ARAuLrAKCQnoc8eo2rAvIDYIWX0DfW/T0NZgCePoyH
WZS8C6aamuWZ3H6C6n8k2n0=
=Hruw
-----END PGP SIGNATURE-----
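
For anyone who wants to try the "graph FP vs FN rates against thresholds" idea on their own corpus, here is a rough sketch of the arithmetic. It is not anything shipped with SpamAssassin: the function names `fp_fn_rates` and `midpoint_threshold` are hypothetical, the score lists are made-up illustrative numbers rather than real measurements, and it assumes a message is tagged as spam when its score is at or above the threshold.

```python
from statistics import mean


def fp_fn_rates(ham_scores, spam_scores, threshold):
    """Return (fp_rate, fn_rate) for one threshold.

    Assumption: a message is tagged as spam when score >= threshold.
    FP rate is the fraction of ham tagged as spam; FN rate is the
    fraction of spam that slips under the threshold.
    """
    fp = sum(1 for s in ham_scores if s >= threshold)
    fn = sum(1 for s in spam_scores if s < threshold)
    return fp / len(ham_scores), fn / len(spam_scores)


def midpoint_threshold(ham_scores, spam_scores):
    """The 'average of the means' threshold discussed in the thread."""
    return (mean(ham_scores) + mean(spam_scores)) / 2


# Made-up scores for illustration only; substitute scores from your own mail.
ham = [-2.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 6.0]
spam = [3.0, 4.0, 8.0, 10.0, 12.0, 15.0, 20.0, 24.0]

mid = midpoint_threshold(ham, spam)
for t in (0.0, 2.5, 5.0, mid, 10.0):
    fp_rate, fn_rate = fp_fn_rates(ham, spam, t)
    print(f"threshold {t:5.2f}: FP rate {fp_rate:.3f}, FN rate {fn_rate:.3f}")
```

Feeding the per-threshold pairs into any plotting tool gives the FP and FN curves, and the row for the midpoint shows where the "average of the means" lands on them.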