From: "John Rudd" <[EMAIL PROTECTED]>
On Jul 26, 2006, at 6:40 AM, Chris Santerre wrote:
> -----Original Message-----
> From: John Rudd [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, July 26, 2006 6:38 AM
> To: Sietse van Zanen
> Cc: SpamAssassin Users
> Subject: Re: SA Score -> Confidence Percentage
>
>
>
> I can see how plugins and add-on rules all affect it, but certainly
> they have some sort of base comparison that lets them know when
> they've gotten the right score values for the base rules, right?
I'm confused by your statement. (I'm also distracted by shiny
objects....)
When rule scores are formed, they are based off a large corpus and
additional rules, and are set in the very moment they are scored.
Yes, that is the corpus I am referring to.
When that score is developed, how is it decided that the scores have
settled? When "95% of the spam in the corpus got ranked 5 or
higher"? 80%? 100%? That's the comparison I'm looking for.
Such a thing is inherently impossible to generate. It varies from
machine setup to machine setup, per mail account on any given
setup, and with time.
I am a bit of a heretic in this group because I take the nasty step
of taking rules that are almost always right (one error per thousand
or more hits) and making sure the score on each rule is designed to
push the score AWAY from 5.0 in the appropriate direction. (I have
BAYES_00 scored a little more negative than default and BAYES_99
scored all the way up at 5. Erm, and I'm surprised that the latter has
not caused me real false hits yet.) I try to get my setup to avoid
scoring at 5.0, and in fact be as far away from 5.0 as practical.
So while I see a lot of scores well above 10, or even a fair number
above 50, I see very few in the 5 to 7 range. Then I see a lot of
ham coming through with scores well below 5. The distribution is
more or less a <choke> brassiere curve. So basically I am set up
such that very nearly 100% of my spam scores above 5 and even
more nearly 100% of my ham scores below 5. The errors are on the
order of 1/1000 and 1/4000, respectively, for my personal mail
these days. I do not have any idea how many spams score above 6 or 7
or 8. I don't care. I do notice that ham can on VERY rare instances
score as high as 10. That happens maybe once a year. I simply review
the spam folder sorted by scores and look at the low scores for false
positives and the high scores for the amusement value. (In the past I
have had some that scored over 100 all on "small score" rules. I do
admit to having some rules that score 100. I consider that to be a
twit filter score. I've noticed enough other people have these sites
effectively blocked that I no longer see any "ebav.com" spams.
That was one of the 100s.)
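
For anyone who wants to measure the same two error rates on their own
mail, here is a minimal Python sketch. It is only an illustration: the
function name error_rates is hypothetical (it is not part of
SpamAssassin), the (score, is_spam) pairs are made up, and it assumes
you have already hand-labeled each message as spam or ham.

```python
def error_rates(messages, threshold=5.0):
    """Return (missed_spam_rate, flagged_ham_rate) for hand-labeled mail.

    messages: iterable of (score, is_spam) pairs, where is_spam is the
    human verdict, not SpamAssassin's.
    """
    messages = list(messages)
    spam = [score for score, is_spam in messages if is_spam]
    ham = [score for score, is_spam in messages if not is_spam]

    # Spam that scored below the threshold slipped through.
    missed = sum(1 for score in spam if score < threshold)
    # Ham that scored at or above the threshold was wrongly flagged.
    flagged = sum(1 for score in ham if score >= threshold)

    missed_rate = missed / len(spam) if spam else 0.0
    flagged_rate = flagged / len(ham) if ham else 0.0
    return missed_rate, flagged_rate


# Made-up sample data, for illustration only.
sample = [
    (52.3, True), (14.1, True), (8.8, True), (4.2, True),
    (-2.6, False), (0.4, False), (1.9, False), (10.1, False),
]

missed_rate, flagged_rate = error_rates(sample)
print(f"spam scoring below 5: {missed_rate:.1%}")
print(f"ham scoring 5 or above: {flagged_rate:.1%}")
```

With real mail the sample needs to be large before rates on the order
of 1/1000 mean anything; a handful of messages like the above only
shows the arithmetic.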
The stock scoring is a little more conservative than my system with its
40+ sets of rules plus user_prefs rules. So I'd expect to see the error
rate be somewhat higher than what I experience. But your numbers will
change dramatically as Bayes training proceeds, IF it proceeds without
any false training, which can happen as an SA system self-starts. So
any fixed "this number corresponds to this percent spam and this percent
ham" table is impossible. It's a dynamic number. You could build an
adaptation of one or the other of the sa-stats.pl scripts that would
produce such a table for YOUR system, however.
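
A per-system table of that kind could look roughly like the Python
sketch below. It is not an adaptation of sa-stats.pl itself and does
not parse any log format; it assumes you already have (score, is_spam)
pairs for hand-labeled mail. The name threshold_table and the sample
data are hypothetical.

```python
def threshold_table(messages, thresholds):
    """For each threshold, report what percent of spam and what percent
    of ham scored at or above it.

    messages: iterable of (score, is_spam) pairs from hand-labeled mail.
    Returns a list of (threshold, spam_pct, ham_pct) tuples.
    """
    messages = list(messages)
    spam = [score for score, is_spam in messages if is_spam]
    ham = [score for score, is_spam in messages if not is_spam]

    def pct_at_or_above(scores, threshold):
        # Guard against an empty class so we never divide by zero.
        if not scores:
            return 0.0
        return 100.0 * sum(1 for s in scores if s >= threshold) / len(scores)

    return [
        (t, pct_at_or_above(spam, t), pct_at_or_above(ham, t))
        for t in thresholds
    ]


# Made-up sample data, for illustration only.
sample = [
    (52.3, True), (14.1, True), (8.8, True), (6.0, True), (4.2, True),
    (-2.6, False), (0.4, False), (1.9, False), (10.1, False),
]

for threshold, spam_pct, ham_pct in threshold_table(sample, [5, 7, 10]):
    print(f"score >= {threshold}: {spam_pct:.1f}% of spam, {ham_pct:.1f}% of ham")
```

Rerun it periodically: the numbers are a snapshot of one system at one
time and will move as Bayes training and the rule sets change.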
{^_^}