Here is what I'm using to do the same globally, based on each user's mail,
but it could be tweaked to work per user. This happens to be a family-only
server, so I generally do the spam/ham review for each user as needed:
root@omega:/usr/local/bin# more sa-learn-systemwide
#!/bin/sh
#
# sa-learn-systemwide
#
# Run sa-learn against user Maildir folders for ham / spam token learning
#
#
LOGFILE="/var/log/sa-learn-run.log"
SALEARNBIN="/usr/bin/sa-learn"
SAUSERNAME="Debian-exim"
SADBPATH="/var/spool/exim4/.spamassassin/bayes"
SAFOLDERS="/etc/spamassassin/sa-learn-folders.conf"
MAILTO="root@localhost"
#
# Execute sa-learn token database expire of old tokens
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token expiration ..." >> $LOGFILE
$SALEARNBIN --force-expire --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE 2>&1
#
# Execute sa-learn against configured folders
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting Learning ..." >> $LOGFILE
$SALEARNBIN --no-sync --username=$SAUSERNAME --dbpath=$SADBPATH --folders=$SAFOLDERS >> $LOGFILE 2>&1
#
# Execute sa-learn sync
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token journal sync ..." >> $LOGFILE
$SALEARNBIN --sync --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE 2>&1
#
# Execute chown
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Fixing file permissions ..." >> $LOGFILE
chown -c Debian-exim:Debian-exim $SADBPATH* >> $LOGFILE 2>&1
#
# Execute sa-learn stats dump
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting stats dump ..." >> $LOGFILE
$SALEARNBIN --dump magic --progress --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE
root@omega:/usr/local/bin# more /etc/spamassassin/sa-learn-folders.conf
spam:dir:/home/*/Maildir/.SPAM.Spam-Missed/{cur,new}
spam:dir:/home/*/Maildir/.SPAM.Spam-Mail/{cur,new}
ham:dir:/home/*/Maildir/.SPAM.Spam-Ham/{cur,new}
ham:dir:/home/*/Maildir/{cur,new}
ham:dir:/home/*/Maildir/.Sent/{cur,new}
root@omega:/usr/local/bin#
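For anyone who wants the per-user variant mentioned at the top, here is a rough sketch rather than something I run. It assumes the default file-based Bayes store lives under each home directory at .spamassassin/bayes and reuses the folder names from the listing above; learn_user is a made-up helper name.

```shell
#!/bin/sh
# Per-user sketch. The paths below are assumptions modelled on the
# system-wide script; adjust them to the local layout.
SALEARNBIN="${SALEARNBIN:-/usr/bin/sa-learn}"

# learn_user HOMEDIR (hypothetical helper): train the Bayes database kept
# under HOMEDIR from that user's own Maildir folders.
learn_user() {
    home="$1"
    maildir="$home/Maildir"
    dbpath="$home/.spamassassin/bayes"   # assumed per-user database location
    [ -d "$maildir" ] || return 0        # skip accounts without a Maildir

    for sub in cur new; do
        if [ -d "$maildir/.SPAM.Spam-Mail/$sub" ]; then
            "$SALEARNBIN" --no-sync --dbpath="$dbpath" --spam "$maildir/.SPAM.Spam-Mail/$sub"
        fi
        if [ -d "$maildir/$sub" ]; then
            "$SALEARNBIN" --no-sync --dbpath="$dbpath" --ham "$maildir/$sub"
        fi
    done

    # Fold the journal into the database once, after all folders are learned.
    "$SALEARNBIN" --sync --dbpath="$dbpath"
}
```

It could be called as learn_user /home/alice, or in a loop over /home/*. If it runs as root, the database files would still need a chown back to the owning user afterwards, as in the system-wide script.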
Log snip:
Mon Mar 30 09:00:01 EDT 2015 sa-learn: Starting token expiration ...
bayes: synced databases from journal in 0 seconds: 304 unique entries (605 total entries)
Mon Mar 30 09:00:06 EDT 2015 sa-learn: Starting Learning ...
Learned tokens from 24 message(s) (6971 message(s) examined)
Mon Mar 30 09:06:11 EDT 2015 sa-learn: Starting token journal sync ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Fixing file permissions ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Starting stats dump ...
0.000 0 3 0 non-token data: bayes db version
0.000 0 84238 0 non-token data: nspam
0.000 0 379365 0 non-token data: nham
0.000 0 142093 0 non-token data: ntokens
0.000 0 1427425402 0 non-token data: oldest atime
0.000 0 1427720336 0 non-token data: newest atime
0.000 0 1427720773 0 non-token data: last journal sync atime
0.000 0 1427720406 0 non-token data: last expiry atime
0.000 0 228435 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count
There are obvious issues if users leave spam sitting in their inbox, but if
they later move it to the spam folder it will get relearned correctly. In
this case I trust the users with well-behaved mail clients, so I also feed
the sent mail in as ham.
Spam older than 14 days gets deleted from the spam folder.
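The cleanup itself is not shown in this message. One way it could be done, as a sketch only, is with find; purge_old_spam is a hypothetical helper name and the 14-day default simply mirrors the figure above.

```shell
#!/bin/sh
# purge_old_spam DIR [DAYS] (hypothetical helper): delete regular files in
# DIR whose modification time is more than DAYS days old (default 14).
purge_old_spam() {
    dir="$1"
    days="${2:-14}"
    [ -d "$dir" ] || return 0    # nothing to do if the folder is missing
    # -mtime +N matches files older than N full 24-hour periods.
    find "$dir" -type f -mtime +"$days" -exec rm -f {} +
}
```

For example, purge_old_spam /home/alice/Maildir/.SPAM.Spam-Mail/cur would clear that folder of old messages. It keys on file modification time, which may differ from the date the message was received.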
-James
On 3/27/2015 2:09 PM, RW wrote:
On Fri, 27 Mar 2015 15:16:13 +0000
Michael wrote:
Hi,
I would like to automatically train each user's Bayes database in the
following way:
Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur
The idea is to train the Bayes for each user without the need to
take care of learning Spam/Ham on their own.
The reason for taking the "cur" folder instead of the "new" folder
is that I assume that the contents of these folders have already
been verified for false-positives/negatives by the user.
"cur" doesn't imply that the mail has been read; for that you
need to check the seen flag in the filename, an S somewhere after the
colon.
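As a sketch of that check: Maildir filenames carry their flags after ":2,", so a message counts as read only if an S appears in that suffix. The function name is_seen and the sample filenames are made up for illustration.

```shell
#!/bin/sh
# is_seen FILE (hypothetical helper): succeed if the Maildir filename
# carries the Seen flag, i.e. an S among the flags that follow ":2,".
is_seen() {
    name=${1##*/}                  # strip any leading directory
    case "$name" in
        *:2,*) flags=${name##*:2,} ;;
        *)     return 1 ;;         # no flag suffix at all (e.g. still in new/)
    esac
    case "$flags" in
        *S*) return 0 ;;
        *)   return 1 ;;
    esac
}
```

A loop over Maildir/cur/* could then pass only the files for which is_seen succeeds to sa-learn, leaving unread mail out of the ham training.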
A problem that could occur is when the user always deletes all mails
in .Spam/cur. Then the Bayes is only trained with Ham, but never
Spam. Or isn't that a problem?
Not if you tell them - then it's their fault if it doesn't work.
Alternately you could have a separate train-spam folder and empty it
after training.
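A minimal sketch of the train-then-empty idea, assuming a dedicated folder per user (the folder name in the comment is hypothetical, as is the helper name train_and_empty). Messages are removed only when sa-learn exits successfully.

```shell
#!/bin/sh
SALEARNBIN="${SALEARNBIN:-/usr/bin/sa-learn}"

# train_and_empty DIR (hypothetical helper): learn every message in DIR as
# spam, then delete the messages if, and only if, sa-learn succeeded.
train_and_empty() {
    dir="$1"    # e.g. a user's Maildir/.train-spam/cur (assumed folder name)
    [ -d "$dir" ] || return 0
    if "$SALEARNBIN" --spam "$dir"; then
        find "$dir" -type f -exec rm -f {} +
    else
        echo "sa-learn failed for $dir; leaving messages in place" >&2
        return 1
    fi
}
```

A message delivered between the learn and the delete would be removed untrained, so a stricter version would snapshot the file list first and delete only those files.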
You could also supplement spam training by autolearning only spam, e.g.
I have:
bayes_auto_learn 1
bayes_auto_learn_on_error 1
bayes_auto_learn_threshold_nonspam -2000.0
Personally I've never seen a spam mistrained as ham with the
default threshold and sensible rule scores.
I think where some people go wrong is that they don't specify
aggressive custom scores correctly. With autolearning it's better to
keep conservative scores in the non-Bayes scoresets e.g.
score SOME_RULE 2 2 8 8
not
score SOME_RULE 8
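With Bayes enabled there's no difference in classification, since both
forms score 8 in the Bayes scoresets. The latter also scores 8 in the
non-Bayes scoresets that autolearning uses, so it is more likely to
cause mistraining on FPs.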