Joe Flowers wrote on Mon, 11 Jul 2005 12:09:29 -0400:

> We are very glad and happy about this concept and implementation.
Well, the big question is: how many of your spam messages score between the default 5 and your "floating score"? If it is many, there's obviously something wrong with your setup: your spam is not scoring high enough. Additionally, it means that Bayes auto-learn will be fed less spam than it could be, because your overall spam score is way too low.

Our average ham score is indeed around -2, as yours is. And it's a very high peak: there are more mails at -2 than all other ham mails combined. However, our spam score peak is *way* higher than yours is: it "flattens" out between 18 and 30, so the average is somewhere around 25 or so. (I deduced that from looking at the raw figures, not by calculating a median or average.) I consider your average spam score of 6 *extremely* bad from a detection standpoint.

With a threshold of 0.5 I would get a *considerable* amount of ham scored as spam. With the default of 5 we get almost none, not even one per day. I doubt that your rate of FPs is nearly non-existent with a spam threshold of 0.5. There *must* be a considerable rate of FPs, you just don't hear about it.

I think the general approach on this list is to make spam score as spammy as possible. That's what we do as well. Instead of driving spam to the sky you are trying to find some non-existing "barrier" which may indeed float, because tomorrow's messages score differently from yesterday's. It does not float at all in the long run. And it exists *only* in the long run. It may throw off the next day's detection quite heavily, since there's no guarantee that spam and ham look the same the next day or even float around that point. It's not even a statistical figure; you deliberately set it to 30%, probably because you get too much spam if you set it higher. That's bad, really bad detection ...

If much of your spam is lower than 5, then the spam detection rate of your SA is quite bad. You should improve that instead of trying to find a barrier which gives you the best FP:FN ratio.
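
To make the question about the band between the two thresholds concrete, here is a minimal Python sketch (not SpamAssassin code) that counts spam in such a band and compares FP and FN rates at two thresholds. The function names and the score lists are made up for illustration, and it assumes a message is tagged as spam when its score is at or above the threshold.

```python
def rates_at_threshold(ham_scores, spam_scores, threshold):
    """Return (false_positive_rate, false_negative_rate) for a threshold.

    Assumption: a message is tagged as spam when score >= threshold.
    """
    false_positives = sum(1 for s in ham_scores if s >= threshold)
    false_negatives = sum(1 for s in spam_scores if s < threshold)
    return false_positives / len(ham_scores), false_negatives / len(spam_scores)


def count_in_band(scores, low, high):
    """Count scores with low <= score < high."""
    return sum(1 for s in scores if low <= s < high)


# Made-up scores for illustration only; not taken from any real mail log.
ham = [-2.0, -2.0, -2.0, -1.5, 0.3, 0.8, 1.2, 2.5]
spam = [0.7, 3.0, 4.5, 6.0, 18.0, 22.0, 25.0, 30.0]

for threshold in (0.5, 5.0):
    fp_rate, fn_rate = rates_at_threshold(ham, spam, threshold)
    print(f"threshold {threshold}: FP rate {fp_rate:.3f}, FN rate {fn_rate:.3f}")

print("spam scoring between 0.5 and 5:", count_in_band(spam, 0.5, 5.0))
```

Run against real score lists pulled from your own logs, the last count is the number asked about above: if it is large, the low threshold is only hiding the fact that spam does not score high enough.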
Such a barrier may indeed give you the best ratio with your bad setup, but not the lowest FP rate, and probably not the best ratio compared with a setup that drives spam to the sky. I see your approach as an interesting way of optimizing the threshold when you don't get optimal scores. But you would be better off optimizing the scores.

BTW: what exactly does "normalized" mean in this context?

Kai

--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org