Re: Those "Re: good obfupills" spams

Matt Kettler Sat, 29 Apr 2006 06:55:04 -0700

List Mail User wrote:
>> ...
>>     
>
> Matt Kettler replied:
>
>   
>> John Tice wrote:
>>     
>>> Greetings,
>>> This is my first post after having lurked some. So, I'm getting these
>>> same "RE: good" spams but they're hitting eight rules and typically
>>> scoring between 30 and 40. I'm really unsophisticated compared to you
>>> guys, and it begs the question: what am I doing wrong? All I use is a
>>> tweaked user_prefs wherein I have gradually raised the scores on
>>> standard rules found in spam that slips through over a period of time.
>>> These particular spams are over the top on bayesian (1.0), have
>>> multiple database hits, forged rcvd_helo and so forth. Bayesian alone
>>> flags them for me. I'm trying to understand the reason you would not
>>> want to have these types of rules set high enough? I must be way
>>> over-optimized. What am I not getting?
>>>       
>> BAYES_99, by definition, has a 1% false positive rate.
>>
>>     
>
>       Matt,
>
>       If we were to presume a uniform distribution between an estimate of
> 99% and 100%, then the FP rate would be .5%, not 1%. 
You're right Paul, my bad..


But again, I don't care if it's 0.01%. The question here is "is jacking
up the score of BAYES_99 to be greater than required_hits a good idea?"
The answer is "No", because BAYES_99 is NOT a 100% accurate test. By
definition it does have a non-zero FP rate.

>  And for large sites
> (i.e. 10s of thousands of messages a day or more), this may be what occurs.
> But what I see and what I assume many other small sites see is a very much
> non-uniform distribution. From the last 30 hours, the average estimate (re.
> the value reported in the "bayes=xxx" clause) for spam hitting the BAYES_99
> rule is .999941898013269 with about two thirds of them reporting bayes=1 and
> a lowest value of bayes=0.998721756590216.
>   
Yes, that's to be expected with Chi-Squared combining.
>       While SA is quite robust largely because of the design feature that
> no single reason/cause/rule should by itself mark a message as spam, I have
> to guess that the FP rate that the majority of users see for BAYES_99 is far
> below 1%.  From the estimators reported above, I would expect that I would
> have seen a .003% FP rate for the last day plus a little, if only I received
> 100,000 or so spam messages to have been able to see it:).
>   
True, but it's still not nearly zero. Even in the corpus testing, which
is run by "the best of the best" in SA administration and maintenance,
BAYES_99 matched 0.0396% of ham, or 21 out of 53,091 hams. (Based on
set-3 of SA 3.1.0)
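
As a quick check of the figures in this thread, here is a minimal Python sketch of the arithmetic. It only reproduces the numbers quoted above (21 of 53,091 hams, and the uniform-distribution estimates for the 0.99 and 0.9999 cutoffs). The function name uniform_fp_pct is made up for illustration, and treating "one minus the average Bayes estimate" as an FP rate is the same simplification used in the discussion, not a measured result.

```python
# Figures quoted in the thread: BAYES_99 ham hits in the set-3 corpus run.
ham_hits = 21
ham_total = 53_091

# Measured share of ham that hit BAYES_99, as a percentage.
measured_fp_pct = 100.0 * ham_hits / ham_total


def uniform_fp_pct(lower, upper=1.0):
    """Rough FP estimate (percent) assuming the Bayes estimate is uniformly
    distributed between lower and upper: one minus the midpoint.
    This is the thread's simplification, not a real measurement."""
    return 100.0 * (1.0 - (lower + upper) / 2.0)


print(f"measured BAYES_99 ham hit rate:  {measured_fp_pct:.4f}%")
print(f"uniform estimate on [0.99, 1]:   {uniform_fp_pct(0.99):.3f}%")
print(f"uniform estimate on [0.9999, 1]: {uniform_fp_pct(0.9999):.4f}%")
```

All three numbers are small, and none of them is zero.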

Given we are dealing with a user who doesn't even understand why you might
not want this set "high enough", I would expect the level of
sophistication in bayes maintenance to be lower than that.

Besides.. If you want to make a mathematics based argument against me,
start by explaining how the perceptron is mathematically flawed. It
assigned the original score based on real-world data, not our vast
oversimplifications. You should have good reason to question its design
before second-guessing its scoring based on speculation such as this.

>       I don't change the scoring from the defaults, but if people were to
> want to, maybe they could change the rules (or add a rule) for BAYES_99_99
> which would take only scores higher than bayes=.9999 and which (again with
> a uniform distribution) have an expected FP rate of .005% - then re-score
> that just closer (but still less) than the spam threshold, 

I'd agree.. However, the OP has already made BAYES_99 > required_hits.
Bad idea. Period.
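
To make the point concrete, here is a toy Python sketch of additive scoring against a threshold. It is not SpamAssassin's code; the scores, the threshold value and the rule name OTHER_RULE are hypothetical numbers picked only for illustration.

```python
# Hypothetical threshold and scores, for illustration only.
REQUIRED_HITS = 5.0


def is_spam(rules_hit, scores, required=REQUIRED_HITS):
    """Sum the scores of the rules that fired and compare to the threshold."""
    total = sum(scores[name] for name in rules_hit)
    return total >= required


# BAYES_99 kept below the threshold vs. raised above it.
below_threshold = {"BAYES_99": 3.5, "OTHER_RULE": 2.0}
above_threshold = {"BAYES_99": 5.5, "OTHER_RULE": 2.0}

# A ham message that Bayes got wrong and that hits nothing else.
misjudged_ham = ["BAYES_99"]

print(is_spam(misjudged_ham, below_threshold))  # Bayes alone is not enough
print(is_spam(misjudged_ham, above_threshold))  # Bayes alone decides

# Real spam hitting a second rule is still caught with the lower score.
print(is_spam(["BAYES_99", "OTHER_RULE"], below_threshold))
```

With the score above the threshold, every Bayes mistake on ham becomes a false positive on its own. With it below, some other rule has to agree first.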
