I have a sitewide Bayes DB and am wondering: does having too large a Bayes DB reduce its efficiency? I normally don't look at SA much unless users start complaining about misclassified mail, but as Joanne pointed out in a different thread, my Bayes DB seems to be trained rather poorly.

Autolearn is set so that spam has to score over 20 and ham has to score negative before it's autolearned, and a cron job runs once a week feeding my spamtraps to sa-learn (roughly the setup sketched below). The mail is forwarded on to a Notes system after scanning, and Notes tends to mangle the headers somewhat once it gets a message, so I haven't found an easy way to get messages out of Notes and into sa-learn.
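For reference, this is approximately what that looks like in config; the spamtrap mailbox path and the cron schedule are placeholders, not my exact setup:

    # local.cf -- autolearn thresholds as described above
    bayes_auto_learn 1
    bayes_auto_learn_threshold_spam    20.0
    bayes_auto_learn_threshold_nonspam -0.1

    # weekly crontab entry feeding the spamtrap mbox to sa-learn
    # (/var/mail/spamtrap is a placeholder path)
    0 4 * * 0  sa-learn --spam --mbox /var/mail/spamtrap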
Here are some Bayes stats from the sa-stats script (the spam-rules and ham-rules sections):
TOP SPAM RULES FIRED
RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
----------------------------------------------------------
   1  BAYES_99    5209      8.16    13.20    74.55    0.43
  18  BAYES_50     676      1.06     1.71     9.68   35.23
  24  BAYES_60     365      0.57     0.93     5.22    2.11
  29  BAYES_95     324      0.51     0.82     4.64    0.63
  36  BAYES_80     284      0.44     0.72     4.06    1.28

TOP HAM RULES FIRED
RANK  RULE NAME  COUNT  %OFRULES  %OFMAIL  %OFSPAM  %OFHAM
----------------------------------------------------------
   2  BAYES_00   15842     11.90    40.16     1.09   48.80
   3  BAYES_50   11437      8.59    28.99     9.68   35.23
  18  BAYES_40    1378      1.04     3.49     0.43    4.24
  20  BAYES_20    1313      0.99     3.33     0.21    4.04
  24  BAYES_05    1047      0.79     2.65     0.11    3.23
  30  BAYES_60     686      0.52     1.74     5.22    2.11
  40  BAYES_80     416      0.31     1.05     4.06    1.28
  65  BAYES_95     205      0.15     0.52     4.64    0.63
  91  BAYES_99     138      0.10     0.35    74.55    0.43
And here's the output of sa-learn --dump magic:
0.000 0 3 0 non-token data: bayes db version
0.000 0 179879 0 non-token data: nspam
0.000 0 39790 0 non-token data: nham
0.000 0 768075 0 non-token data: ntokens
0.000 0 1137449552 0 non-token data: oldest atime
0.000 0 1138907710 0 non-token data: newest atime
0.000 0 1138907665 0 non-token data: last journal sync atime
0.000 0 1138729032 0 non-token data: last expiry atime
0.000 0 1279299 0 non-token data: last expire atime delta
0.000 0 255614 0 non-token data: last expire reduction count
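If raw DB size does turn out to matter, I'm assuming the relevant knobs are the expiry cap in local.cf and a manual expiry run, along these lines (150000 is just the shipped default; I haven't changed mine):

    # local.cf -- token-count ceiling that triggers expiry
    bayes_expiry_max_db_size 150000

    # force an expiry pass by hand
    sa-learn --force-expire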
This is SA 3.1, BTW.
Thanks
Andy