On Thu, 7 Feb 2013, Bob Proulx wrote:

I am having Bayes false positive misclassifications and am trying to
tune and improve this situation.  I am using SpamAssassin to classify
mailing list messages and so there is a lot of mail from a variety of
sources feeding SA.  And a lot of spam of course.

[snip..]

Various details:

$ ll -hog .spamassassin/bayes_*
-rw------- 1  12K Feb  7 16:30 .spamassassin/bayes_journal
-rw------- 1  75M Feb  7 16:29 .spamassassin/bayes_seen
-rw------- 1 5.0M Feb  7 16:29 .spamassassin/bayes_toks

$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       1557          0  non-token data: nspam
0.000          0      57088          0  non-token data: nham
0.000          0     179777          0  non-token data: ntokens
0.000          0 1360231447          0  non-token data: oldest atime
0.000          0 1360277897          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1360274921          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0       5817          0  non-token data: last expire reduction count


Something's really wrong here: those "dump magic" numbers don't match up with
the size of your bayes files.
For example, you have a non-empty 'bayes_journal' file but the 'last journal
sync atime' is zero (implying it has never been synced). The size of your
bayes_seen file is consistent with several million messages learned, not a
few tens of thousands.

Are you -sure- those bayes files correspond to the bayes database your
"dump magic" is reporting? (Which one is your SA using for its operations?)

If you watch that "bayes_journal" file over an hour or two, does it gradually
increase in size and then suddenly drop? (That's normal operation.) If so, then
the 'last journal sync atime' should correspond to when it dropped in size
(the sync operation). When the journal cycles, the nspam/nham should go up.
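
One rough way to do that watching, as a sketch only: the helper below prints the file's size in bytes at a fixed interval, so a grow-then-drop pattern is easy to spot. The name watch_size is made up for illustration and is not part of SpamAssassin, and the path in the usage line assumes the per-user ~/.spamassassin location from the listing above.

```shell
# watch_size FILE INTERVAL COUNT
# Hypothetical helper (not an SA tool): print "HH:MM:SS <bytes>" for FILE
# every INTERVAL seconds, COUNT times. Prints "missing" if FILE is absent.
watch_size() {
    file=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -e "$file" ]; then
            # wc -c may pad its output with spaces on some systems
            size=$(wc -c < "$file" | tr -d ' ')
        else
            size=missing
        fi
        printf '%s %s\n' "$(date +%H:%M:%S)" "$size"
        i=$((i + 1))
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `watch_size ~/.spamassassin/bayes_journal 300 24` would take 24 samples five minutes apart, i.e. roughly two hours.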

If you learn some spam/ham by hand, do the nspam/nham counters go up?
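
A minimal sketch of that check, assuming the "dump magic" layout shown above (count in the third column, label in the last): the helper pulls out just the two counters so a before/after comparison is easy to read. The name bayes_counts and the message path are hypothetical; `--dump magic`, `--spam` and `--sync` are the standard sa-learn options.

```shell
# bayes_counts: hypothetical helper, not an SA command.
# Reads `sa-learn --dump magic` output on stdin and prints "nspam=N nham=M".
bayes_counts() {
    awk '$NF == "nspam" { s = $3 }
         $NF == "nham"  { h = $3 }
         END { printf "nspam=%s nham=%s\n", s, h }'
}

# Intended use (commented out; the message path is a placeholder):
#   sa-learn --dump magic | bayes_counts       # counters before
#   sa-learn --spam /path/to/known-spam.eml    # learn one message by hand
#   sa-learn --sync                            # flush the journal into the db
#   sa-learn --dump magic | bayes_counts       # counters after
```

Note that a message sa-learn has already seen will be skipped, so the counters should only move for mail that has not been learned before.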


--
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
