I'm using a sitewide Bayes DB and am wondering whether letting it grow too large reduces its efficiency.  I normally don't look at SA much unless users start complaining about misclassified mail, but as Joanne pointed out in a different thread, my Bayes DB seems to be rather poorly trained.  Autolearn is set to learn spam only from messages that score over 20 and ham only from messages with a negative score, and a cron job runs once a week feeding my spamtraps to sa-learn.  The mail is forwarded on to a Notes system after scanning, and Notes mangles the headers somewhat once it gets the message, so I haven't found an easy way to get messages out of Notes and into sa-learn.

Here are some Bayes stats from the sa-stats script (rule hits on spam first, then rule hits on ham):

RANK    RULE NAME                       COUNT %OFRULES %OFMAIL %OFSPAM  %OFHAM
------------------------------------------------------------
   1    BAYES_99                         5209     8.16   13.20   74.55    0.43
  18    BAYES_50                          676     1.06    1.71    9.68   35.23
  24    BAYES_60                          365     0.57    0.93    5.22    2.11
  29    BAYES_95                          324     0.51    0.82    4.64    0.63
  36    BAYES_80                          284     0.44    0.72    4.06    1.28

   2    BAYES_00                        15842    11.90   40.16    1.09   48.80
   3    BAYES_50                        11437     8.59   28.99    9.68   35.23
  18    BAYES_40                         1378     1.04    3.49    0.43    4.24
  20    BAYES_20                         1313     0.99    3.33    0.21    4.04
  24    BAYES_05                         1047     0.79    2.65    0.11    3.23
  30    BAYES_60                          686     0.52    1.74    5.22    2.11
  40    BAYES_80                          416     0.31    1.05    4.06    1.28
  65    BAYES_95                          205     0.15    0.52    4.64    0.63
  91    BAYES_99                          138     0.10    0.35   74.55    0.43
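
As a rough sanity check on how to read these numbers, the sketch below back-calculates how many spam and ham messages the percentages imply. It assumes the first table counts hits on spam and the second counts hits on ham, and that each count equals the class total times its percentage. The names implied_total, spam_rows, and ham_rows are made up for this illustration and are not part of sa-stats.

```python
# (rule, count, percent of class) copied from the two tables above.
# Assumption: first table = hits on spam, second table = hits on ham.
spam_rows = [("BAYES_99", 5209, 74.55), ("BAYES_50", 676, 9.68)]
ham_rows = [("BAYES_00", 15842, 48.80), ("BAYES_50", 11437, 35.23)]


def implied_total(count, percent):
    """Estimate the class size from one row: count = total * percent / 100."""
    return count / (percent / 100.0)


# Use the largest row of each table, since it has the least rounding error.
spam_total = implied_total(spam_rows[0][1], spam_rows[0][2])
ham_total = implied_total(ham_rows[0][1], ham_rows[0][2])

print(f"implied spam messages: {spam_total:.0f}")
print(f"implied ham messages:  {ham_total:.0f}")
```

The BAYES_50 rows give nearly the same totals, which suggests the reading is consistent: roughly 7,000 spam and 32,500 ham in the reporting period, with about 35% of the ham and 10% of the spam landing in BAYES_50.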

And here's the output of sa-learn --dump magic:

0.000          0          3          0  non-token data: bayes db version
0.000          0     179879          0  non-token data: nspam
0.000          0      39790          0  non-token data: nham
0.000          0     768075          0  non-token data: ntokens
0.000          0 1137449552          0  non-token data: oldest atime
0.000          0 1138907710          0  non-token data: newest atime
0.000          0 1138907665          0  non-token data: last journal sync atime
0.000          0 1138729032          0  non-token data: last expiry atime
0.000          0    1279299          0  non-token data: last expire atime delta
0.000          0     255614          0  non-token data: last expire reduction count
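
To put the dump in perspective, here is a minimal sketch that pulls out the learned spam/ham counts and the token atime window. It assumes the third column holds the value and that the atime fields are Unix epoch seconds. The helper parse_magic is a hypothetical name for this illustration, not an SA tool.

```python
# Lines copied from the dump above.
DUMP = """\
0.000          0     179879          0  non-token data: nspam
0.000          0      39790          0  non-token data: nham
0.000          0     768075          0  non-token data: ntokens
0.000          0 1137449552          0  non-token data: oldest atime
0.000          0 1138907710          0  non-token data: newest atime
"""


def parse_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:")
        values[label.strip()] = int(fields.split()[2])
    return values


magic = parse_magic(DUMP)

# Ratio of learned spam to learned ham messages.
ratio = magic["nspam"] / magic["nham"]

# Assumption: atimes are epoch seconds, so the difference is a span in seconds.
span_days = (magic["newest atime"] - magic["oldest atime"]) / 86400

print(f"spam:ham learned = {ratio:.2f}:1")
print(f"token atime span = {span_days:.1f} days")
```

On the numbers above this works out to about 4.5 spam messages learned for every ham message, with the tokens covering a window of about 17 days.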

SA 3.1 BTW.

Thanks
Andy
