On Thu, 7 Feb 2013, Bob Proulx wrote:
I am having Bayes false positive misclassifications and am trying to
tune and improve this situation. I am using SpamAssassin to classify
mailing list messages and so there is a lot of mail from a variety of
sources feeding SA. And a lot of spam of course.
[snip..]
Various details:
$ ll -hog .spamassassin/bayes_*
-rw------- 1 12K Feb 7 16:30 .spamassassin/bayes_journal
-rw------- 1 75M Feb 7 16:29 .spamassassin/bayes_seen
-rw------- 1 5.0M Feb 7 16:29 .spamassassin/bayes_toks
$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0       1557          0  non-token data: nspam
0.000          0      57088          0  non-token data: nham
0.000          0     179777          0  non-token data: ntokens
0.000          0 1360231447          0  non-token data: oldest atime
0.000          0 1360277897          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1360274921          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0       5817          0  non-token data: last expire reduction count
Something's really wrong here: those "dump magic" numbers don't match up
with the size of your bayes files.
For example, you have a non-empty 'bayes_journal' file, but the last journal
sync atime is zero (implying it has never been synced). The size of your
bayes_seen file (75M) is consistent with several million messages learned,
not a few tens of thousands (your nspam + nham comes to 58,645).
Are you -sure- those bayes files belong to the same bayes database that your
"dump magic" is reporting on? (Which one is your SA using for its operations?)
If you watch that "bayes_journal" file over an hour or two, does it gradually
increase in size and then suddenly drop? (That's normal operation.) If so,
the 'last journal sync atime' should correspond to when it dropped in size
(the sync operation). When the journal cycles, the nspam/nham counts should
go up.
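One way to watch for that cycle is to log the journal's size next to the
current epoch time, so that a sudden drop can be compared directly against
the 'last journal sync atime' value (which is also in epoch seconds). A
minimal sketch only: `journal_size` is a made-up helper name, not part of
SpamAssassin, and `date +%s` assumes a GNU or BSD date.

```shell
# Hypothetical helper: print "<epoch seconds> <size in bytes>" for a file.
journal_size() {
    # Feeding the file on stdin makes wc print only the byte count;
    # tr strips the padding that some wc implementations add.
    printf '%s %s\n' "$(date +%s)" "$(wc -c < "$1" | tr -d ' ')"
}
```

Run it in a loop for an hour or two, for example
`while sleep 300; do journal_size ~/.spamassassin/bayes_journal; done`,
assuming the files in your listing live under your home directory. Adjust
the path to whichever database your SA is really using.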
If you learn some spam/ham by hand, do the nspam/nham counters go up?
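To make that before/after comparison less error-prone, the two counters can
be pulled out of the "dump magic" output. A small sketch, assuming the output
format shown in your dump (the count is the third column); `magic_counts` is
a made-up name, not an SA command.

```shell
# Hypothetical helper: read `sa-learn --dump magic` output on stdin
# and print "<nspam> <nham>".
magic_counts() {
    awk '/non-token data: nspam$/ { spam = $3 }
         /non-token data: nham$/  { ham  = $3 }
         END { print spam, ham }'
}
```

Fed the dump you posted, this prints `1557 57088`. Run
`sa-learn --dump magic | magic_counts`, learn one message with
`sa-learn --spam /path/to/message` (a placeholder path), then run the first
command again. nspam should be one higher, unless that message had already
been learned.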
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{