List Mail User wrote:
>> ...
>>
>
> Matt Kettler replied:
>
>> John Tice wrote:
>>
>>> Greetings,
>>> This is my first post after having lurked some. So, I'm getting these
>>> same "RE: good" spams but they're hitting eight rules and typically
>>> scoring between 30 and 40. I'm really unsophisticated compared to you
>>> guys, and it begs the question -- what am I doing wrong? All I use is a
>>> tweaked user_prefs wherein I have gradually raised the scores on
>>> standard rules found in spam that slips through over a period of time.
>>> These particular spams are over the top on bayesian (1.0), have
>>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>>> flags them for me. I'm trying to understand the reason you would not
>>> want to have these types of rules set high enough? I must be way over
>>> optimized -- what am I not getting?
>>>
>> BAYES_99, by definition, has a 1% false positive rate.
>>
>
> Matt,
>
> If we were to presume a uniform distribution between an estimate of
> 99% and 100%, then the FP rate would be .5%, not 1%.

You're right Paul, my bad.
But again, I don't care if it's 0.01%. The question here is "is jacking up the score of BAYES_99 to be greater than required_hits a good idea?" The answer is "No, because BAYES_99 is NOT a 100% accurate test. By definition it does have a non-zero FP rate."

> And for large sites
> (i.e. 10s of thousands of messages a day or more), this may be what occurs;
> But what I see and what I assume many other small sites see is a very much
> non-uniform distribution; From the last 30 hours, the average estimate (re.
> the value reported in the "bayes=xxx" clause) for spam hitting the BAYES_99
> rule is .999941898013269 with about two thirds of them reporting bayes=1 and
> a lowest value of bayes=0.998721756590216.
>

Yes, that's to be expected with Chi-Squared combining.

> While SA is quite robust largely because of the design feature that
> no single reason/cause/rule should by itself mark a message as spam, I have
> to guess that the FP rate that the majority of users see for BAYES_99 is far
> below 1%. From the estimators reported above, I would expect that I would
> have seen a .003% FP rate for the last day plus a little, if only I received
> 100,000 or so spam messages to have been able to see it:).
>

True, but it's still not nearly zero. Even in the corpus testing, which is run by "the best of the best" in SA administration and maintenance, BAYES_99 matched 0.0396% of ham, or 21 out of 53,091 hams. (Based on set-3 of SA 3.1.0)

Given we are dealing with a user who doesn't even understand why you might not want this set "high enough", I would expect the level of sophistication in bayes maintenance

Besides, if you want to make a mathematics-based argument against me, start by explaining how the perceptron is mathematically flawed. It assigned the original score based on real-world data, not our vast oversimplifications. You should have good reason to question its design before second-guessing its scoring based on speculation such as this.
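
For anyone who wants to check the figures traded above, here is a minimal Python sketch of the arithmetic. It assumes, as the quoted argument does, that a Bayes estimate p is calibrated, so a message hitting the rule is ham with probability 1 - p. That is a simplification made for the sake of the numbers, not a property SA guarantees. The function names are made up for illustration.

```python
def fp_rate(false_positives: int, ham_total: int) -> float:
    """Observed false-positive rate: fraction of ham that hit the rule."""
    return false_positives / ham_total


def expected_fp_uniform(lo: float, hi: float) -> float:
    """Expected FP rate if estimates were uniform on [lo, hi].

    ASSUMPTION: the estimate is a calibrated probability of spam, so the
    expected FP rate is 1 minus the mean estimate. This is the thread's
    simplification, not measured behaviour.
    """
    return 1.0 - (lo + hi) / 2.0


# Corpus figure quoted in the thread: 21 of 53,091 hams hit BAYES_99.
print(f"corpus BAYES_99 FP rate: {fp_rate(21, 53_091):.4%}")

# Uniform-distribution estimates discussed in the thread.
print(f"uniform on [0.99, 1.00]:   {expected_fp_uniform(0.99, 1.0):.3%}")
print(f"uniform on [0.9999, 1.00]: {expected_fp_uniform(0.9999, 1.0):.4%}")
```

The observed corpus rate is well under the uniform-distribution 0.5%, but it is measured on real ham and it is still not zero, which is the point being argued.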
> I don't change the scoring from the defaults, but if people were to
> want to, maybe they could change the rules (or add a rule) for BAYES_99_99
> which would take only scores higher than bayes=.9999 and which (again with
> a uniform distribution) have an expected FP rate of .005% - then re-score
> that just closer (but still less) than the spam threshold,

I'd agree. However, the OP has already made BAYES_99 > required_hits. Bad idea. Period.
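
To make the "bad idea" concrete, here is a rough Python sketch of additive scoring: a message is flagged when the scores of the rules it hits sum to at least the threshold. The score values and the `is_spam` helper are hypothetical, picked only to illustrate the point. They are not SA's actual defaults or code, and real rule sets also include negative-scoring rules that this sketch ignores.

```python
def is_spam(hits: set[str], scores: dict[str, float], required: float) -> bool:
    """Flag a message when the summed scores of its rule hits reach `required`."""
    total = sum(scores.get(rule, 0.0) for rule in hits)
    return total >= required


REQUIRED = 5.0  # hypothetical threshold, standing in for required_hits

# Hypothetical score sets: one keeps BAYES_99 below the threshold,
# the other raises it above the threshold as the OP did.
below_threshold = {"BAYES_99": 3.5}
above_threshold = {"BAYES_99": 6.0}

# A ham message that Bayes misjudges and that hits nothing else.
misjudged_ham = {"BAYES_99"}

print(is_spam(misjudged_ham, below_threshold, REQUIRED))  # False: needs corroboration
print(is_spam(misjudged_ham, above_threshold, REQUIRED))  # True: one rule decides alone
```

With the score above the threshold, every BAYES_99 false positive becomes a false-positive message, whatever the other rules say.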