Here is what I use to do the same thing globally, training one shared database from every user's mail; it could be tweaked to run per user. This happens to be a family-only server, so I generally do the spam/ham review for each user as needed:

root@omega:/usr/local/bin# more sa-learn-systemwide
#!/bin/sh
#
# sa-learn-systemwide
#
# Run sa-learn against user Maildir folders for ham / spam token learning
#
#

LOGFILE="/var/log/sa-learn-run.log"

SALEARNBIN="/usr/bin/sa-learn"
SAUSERNAME="Debian-exim"
SADBPATH="/var/spool/exim4/.spamassassin/bayes"
SAFOLDERS="/etc/spamassassin/sa-learn-folders.conf"
MAILTO="root@localhost"


#
# Execute sa-learn token database expire of old tokens
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token expiration ..." >> $LOGFILE
$SALEARNBIN --force-expire --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE 2>&1

#
# Execute sa-learn against configured folders
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting Learning ..." >> $LOGFILE
$SALEARNBIN --no-sync --username=$SAUSERNAME --dbpath=$SADBPATH --folders=$SAFOLDERS >> $LOGFILE 2>&1

#
# Execute sa-learn sync
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token journal sync ..." >> $LOGFILE
$SALEARNBIN --sync --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE 2>&1

#
# Execute chown
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Fixing file permissions ..." >> $LOGFILE
chown -c Debian-exim:Debian-exim $SADBPATH* >> $LOGFILE 2>&1


#
# Execute sa-learn stats dump
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting stats dump ..." >> $LOGFILE
$SALEARNBIN --dump magic --progress --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE


root@omega:/usr/local/bin# more /etc/spamassassin/sa-learn-folders.conf
spam:dir:/home/*/Maildir/.SPAM.Spam-Missed/{cur,new}
spam:dir:/home/*/Maildir/.SPAM.Spam-Mail/{cur,new}
ham:dir:/home/*/Maildir/.SPAM.Spam-Ham/{cur,new}
ham:dir:/home/*/Maildir/{cur,new}
ham:dir:/home/*/Maildir/.Sent/{cur,new}
root@omega:/usr/local/bin#
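
For the per-user tweak mentioned at the top, the learning step could be looped over the home directories instead of using one shared database. This is only a rough sketch, not something I run: it assumes each user keeps a Bayes database at ~/.spamassassin/bayes and that the folder names match the ones in the config above. HOMEBASE, RUN and learn_per_user are names made up for the example. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: per-user variant of the learning step (assumptions noted inline).
SALEARNBIN="${SALEARNBIN:-/usr/bin/sa-learn}"
HOMEBASE="${HOMEBASE:-/home}"   # assumption: all mail users live under /home
RUN="${RUN-echo}"               # dry run by default; start with RUN= (empty) to execute

learn_per_user() {
    for maildir in "$HOMEBASE"/*/Maildir; do
        [ -d "$maildir" ] || continue
        userhome=${maildir%/Maildir}
        # assumption: each user has their own database at this path
        dbpath="$userhome/.spamassassin/bayes"
        $RUN "$SALEARNBIN" --dbpath="$dbpath" --spam "$maildir/.SPAM.Spam-Mail/cur"
        $RUN "$SALEARNBIN" --dbpath="$dbpath" --ham "$maildir/cur"
    done
}

learn_per_user
```

If this runs as root, the per-user database files would probably need their ownership fixed afterwards, the same way the chown step does for the shared database.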

Log snip:

Mon Mar 30 09:00:01 EDT 2015 sa-learn: Starting token expiration ...
bayes: synced databases from journal in 0 seconds: 304 unique entries (605 total entries)
Mon Mar 30 09:00:06 EDT 2015 sa-learn: Starting Learning ...
Learned tokens from 24 message(s) (6971 message(s) examined)
Mon Mar 30 09:06:11 EDT 2015 sa-learn: Starting token journal sync ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Fixing file permissions ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Starting stats dump ...
0.000          0          3          0  non-token data: bayes db version
0.000          0      84238          0  non-token data: nspam
0.000          0     379365          0  non-token data: nham
0.000          0     142093          0  non-token data: ntokens
0.000          0 1427425402          0  non-token data: oldest atime
0.000          0 1427720336          0  non-token data: newest atime
0.000          0 1427720773          0  non-token data: last journal sync atime
0.000          0 1427720406          0  non-token data: last expiry atime
0.000          0     228435          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

There are obvious issues if users leave spam sitting in their inbox, since the inbox is learned as ham, but once they move a message to the spam folder it gets relearned correctly. In this case I trust the users and their well-behaved mail clients, so I also feed the sent mail in as ham.

Spam older than 14 days gets deleted from the spam folder.
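
A cleanup along those lines could look roughly like the following. This is a sketch rather than the exact job in use: the folder being purged (.SPAM.Spam-Mail) is an assumption, and SPAMGLOB, MAXAGE_DAYS and purge_old_spam are names made up for the example.

```shell
#!/bin/sh
# Sketch: delete messages older than 14 days from each user's spam folder.
SPAMGLOB="${SPAMGLOB:-/home/*/Maildir/.SPAM.Spam-Mail}"   # assumption: which folder is purged
MAXAGE_DAYS=14

purge_old_spam() {
    # $SPAMGLOB is left unquoted on purpose so the shell expands the wildcard
    for folder in $SPAMGLOB; do
        [ -d "$folder" ] || continue
        for sub in cur new; do
            [ -d "$folder/$sub" ] || continue
            # -mtime +14 matches files last modified more than 14 full days ago
            find "$folder/$sub" -type f -mtime +"$MAXAGE_DAYS" -exec rm -f {} +
        done
    done
}
```

Calling purge_old_spam from a daily cron job, after the learning run, would keep the folder trimmed without removing anything before it has been learned.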


-James

On 3/27/2015 2:09 PM, RW wrote:
On Fri, 27 Mar 2015 15:16:13 +0000
Michael wrote:

Hi,

I would like to automatically train each user's Bayes database in the
following way:

Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur

The idea is to train the Bayes for each user without the need to
take care of learning Spam/Ham on their own.

The reason for taking the "cur" folder instead of the "new" folder
is that I assume that the contents of these folders have already
been verified for false-positives/negatives by the user.

"cur" doesn't imply that the mail has been read; for that you
need to check the seen flag in the filename, an S somewhere after the
colon.
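
To illustrate the seen-flag check: Maildir filenames end in ":2," followed by flag letters, so a shell pattern match on the filename is enough. A minimal sketch, where is_seen is a made-up helper name:

```shell
#!/bin/sh
# Sketch: succeed only if a Maildir filename carries the Seen (S) flag.
is_seen() {
    # Strip the directory part, then look for an S after the ":2," marker
    case "${1##*/}" in
        *:2,*S*) return 0 ;;
        *)       return 1 ;;
    esac
}
```

A training job could then loop over the files in cur and pass only those for which is_seen succeeds to sa-learn.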


A problem that could occur is when the user always deletes all mails
in .Spam/cur. Then the Bayes is only trained with Ham, but never
Spam. Or isn't that a problem?

Not if you tell them - then it's their fault if it doesn't work.
Alternately you could have a separate train-spam folder and empty it
after training.
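
The separate train-spam folder idea could be sketched like this. The folder name in the comment is hypothetical, train_and_empty is a made-up function name, and the folder is only emptied when sa-learn exits successfully:

```shell
#!/bin/sh
# Sketch: learn a dedicated training folder as spam, then empty it.
SALEARNBIN="${SALEARNBIN:-/usr/bin/sa-learn}"

train_and_empty() {
    folder=$1    # e.g. a hypothetical /home/alice/Maildir/.train-spam
    [ -d "$folder/cur" ] || return 0
    # Delete the messages only if the learning run succeeded
    if "$SALEARNBIN" --spam "$folder/cur"; then
        find "$folder/cur" -type f -exec rm -f {} +
    fi
}
```

One caveat with this approach: a message dropped into the folder between the learning run and the delete would be removed without being learned, so moving the files aside first may be safer on a busy server.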

You could also supplement spam training by autolearning only spam, e.g.
I have:

bayes_auto_learn 1
bayes_auto_learn_on_error 1
bayes_auto_learn_threshold_nonspam -2000.0

Personally, I've never seen spam mistrained as ham with the
default threshold and sensible rule scores.

I think where some people go wrong is that they don't specify
aggressive custom scores correctly. With autolearning it's better to
keep conservative scores in the non-Bayes scoresets e.g.

score SOME_RULE  2 2 8 8

not

score SOME_RULE  8

There's no difference in classification, but the latter is more likely to
cause mistraining on FPs.


