[SAtalk] Default Bayes scoring, and default cutoff value - too many false positives

Gary Funck Thu, 14 Aug 2003 11:58:30 -0700

Hello,

I've been running SA with Bayes enabled for only the past few days. Bayes has
been auto-learned on two rather large corpora, which yielded about 1100
auto-learn messages (per the Bayes journal file). I've noticed the number of
false negatives (i.e., spam mis-classified as ham) has dropped to almost zero,
but I'm seeing maybe half a dozen false positives (ham mis-classified as spam)
per day. I'm having to whitelist friends and newsletters that previously went
through just fine.


Generally, I'm using SA in local mode, and backing out to network mode only
when local says no ham was found. Given my ham to spam ratio (roughly 1 to 5)
that's been okay, but it probably leads to a surprising result where mail is
over-aggressively classified as spam. I'm using 2.60 CVS (6/30) at the moment,
but I think the same problem would come up on version 2.55.

The problem is that the messages mis-classified as spam have only, or nearly
only, BAYES_99 asserted. The various BAYES rules are scored as follows:

score BAYES_00 0 0 -5.300 -5.200
score BAYES_01 0 0 -5.400 -5.400
score BAYES_10 0 0 -5.300 -4.701
score BAYES_20 0 0 -4.701 -2.601
score BAYES_30 0 0 -1.070 -0.927
score BAYES_40 0 0 -0.001 -0.001
score BAYES_44 0 0 -0.001 -0.001
score BAYES_50 0 0 0.001 0.001
score BAYES_56 0 0 0.001 0.001
score BAYES_60 0 0 1.997 1.101
score BAYES_70 0 0 2.593 2.310
score BAYES_80 0 0 5.300 2.862
score BAYES_90 0 0 4.027 3.002
score BAYES_99 0 0 5.200 3.008

Using BAYES_99 as an example, it will be scored 5.2 with Bayes enabled, while
running in local (non-network) mode, and only 3.008 when networking is enabled.
The trouble is that 5.2 exceeds the default cutoff of 5. So only if a large
auto-whitelist value, or some other negative score, kicked in would this
message escape being mis-classified as spam. The 3.008 network value might be
nearer the mark: a very high weighting, but one that would require some other
tests to kick in before the message is classified as spam.
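
To make the arithmetic concrete, here is a minimal Python sketch (not
SpamAssassin code) of how a lone BAYES_99 hit compares against the cutoff in
each mode. The score lines are copied from the list above. The helper names
(parse_scores, score_set, is_spam) are made up for illustration, and the
"total >= threshold" comparison is an assumption about how the cutoff is
applied.

```python
# Score lines copied from the post. The four columns are score sets:
# 0 = no Bayes, local; 1 = no Bayes, network; 2 = Bayes, local; 3 = Bayes, network.
SCORE_LINES = """\
score BAYES_80 0 0 5.300 2.862
score BAYES_90 0 0 4.027 3.002
score BAYES_99 0 0 5.200 3.008
"""

def parse_scores(text):
    """Return {rule_name: [four floats]} from 'score NAME a b c d' lines."""
    scores = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and parts[0] == "score":
            scores[parts[1]] = [float(p) for p in parts[2:]]
    return scores

def score_set(bayes, network):
    """Pick the column index for the current mode (hypothetical helper)."""
    return (2 if bayes else 0) + (1 if network else 0)

def is_spam(hits, scores, bayes, network, threshold=5.0):
    """Sum the scores of the rules that hit; assume spam means total >= threshold."""
    column = score_set(bayes, network)
    total = sum(scores[name][column] for name in hits)
    return total >= threshold, total

if __name__ == "__main__":
    scores = parse_scores(SCORE_LINES)
    for network in (False, True):
        for threshold in (5.0, 7.0):
            flagged, total = is_spam(["BAYES_99"], scores, True, network, threshold)
            print(f"network={network} threshold={threshold} total={total} spam={flagged}")
```

With only BAYES_99 asserted, the local-mode total of 5.2 crosses a cutoff of 5
but not one of 7, while the network-mode total of 3.008 crosses neither.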

What I'm working up to here: for those of you using Bayes, did you also move
your threshold value up (to, say, 7 or above), or do you simply tolerate more
false positives? (I'd have to say that the four or five false positives I'm
now seeing per day, and didn't see before, are too many for my tastes.)



