Re: How to automatically train each users Bayes?

James Michael Keller Mon, 30 Mar 2015 07:42:40 -0700

Here is what I'm using to do the same globally, based on each user's mail, but it could be tweaked to work per user. This happens to be a family-only server, so I'm generally doing the spam/ham review for each user as needed:


root@omega:/usr/local/bin# more sa-learn-systemwide
#!/bin/sh
#
# sa-learn-systemwide
#
# Run sa-learn against user Maildir folders for ham / spam token learning
#
#


LOGFILE="/var/log/sa-learn-run.log"

SALEARNBIN="/usr/bin/sa-learn"
SAUSERNAME="Debian-exim"
SADBPATH="/var/spool/exim4/.spamassassin/bayes"
SAFOLDERS="/etc/spamassassin/sa-learn-folders.conf"
MAILTO="root@localhost"


#
# Execute sa-learn token database expire of old tokens
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token expiration ..." >> $LOGFILE

# Redirect stdout first, then stderr, so that both end up in the log
$SALEARNBIN --force-expire --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE 2>&1


#
# Execute sa-learn against configured folders
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting Learning ..." >> $LOGFILE

$SALEARNBIN --no-sync --username=$SAUSERNAME --dbpath=$SADBPATH --folders=$SAFOLDERS >> $LOGFILE 2>&1


#
# Execute sa-learn sync
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting token journal sync ..." >> $LOGFILE

$SALEARNBIN --sync --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE 2>&1


#
# Execute chown
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Fixing file permissions ..." >> $LOGFILE
chown -c Debian-exim:Debian-exim $SADBPATH* >> $LOGFILE 2>&1


#
# Execute sa-learn stats dump
#
TIMESTAMP=`date`
echo $TIMESTAMP "sa-learn: Starting stats dump ..." >> $LOGFILE

$SALEARNBIN --dump magic --progress --username=$SAUSERNAME --dbpath=$SADBPATH >> $LOGFILE



root@omega:/usr/local/bin# more /etc/spamassassin/sa-learn-folders.conf
spam:dir:/home/*/Maildir/.SPAM.Spam-Missed/{cur,new}
spam:dir:/home/*/Maildir/.SPAM.Spam-Mail/{cur,new}
ham:dir:/home/*/Maildir/.SPAM.Spam-Ham/{cur,new}
ham:dir:/home/*/Maildir/{cur,new}
ham:dir:/home/*/Maildir/.Sent/{cur,new}
root@omega:/usr/local/bin#
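
For per-user databases instead of one global one, the same idea could be looped over the home directories. This is only a rough sketch and not something I run: it assumes the login name equals the home directory name, that each user has a `.Spam` folder as in the original question, and that each user's database lives at `~/.spamassassin/bayes` (the usual default, but treat it as an assumption). As far as I know `--username` does not switch users, so `--dbpath` is pointed at each user's files explicitly. `HOMEBASE` and `learn_all_users` are names made up for the example.

```sh
#!/bin/sh
# Sketch: train one Bayes database per user instead of a single global one.
# HOMEBASE and learn_all_users are illustrative names, not part of SpamAssassin.
HOMEBASE="${HOMEBASE:-/home}"
SALEARNBIN="${SALEARNBIN:-/usr/bin/sa-learn}"

learn_all_users() {
    for maildir in "$HOMEBASE"/*/Maildir; do
        [ -d "$maildir" ] || continue        # glob matched nothing, or not a directory
        home=${maildir%/Maildir}
        user=${home##*/}                      # assumes login name == home directory name
        dbpath="$home/.spamassassin/bayes"    # assumed per-user database location

        # Ham from the inbox, spam from the user's .Spam folder, as in the question.
        "$SALEARNBIN" --no-sync --username="$user" --dbpath="$dbpath" --ham "$maildir/cur"
        if [ -d "$maildir/.Spam/cur" ]; then
            "$SALEARNBIN" --no-sync --username="$user" --dbpath="$dbpath" --spam "$maildir/.Spam/cur"
        fi

        # Flush the journal into this user's database.
        "$SALEARNBIN" --sync --username="$user" --dbpath="$dbpath"
    done
}
```

Calling `learn_all_users` from a daily cron job would do one pass over every account. When run as root, files it creates will be owned by root, so a chown step like the one in the script above would still be needed for each user.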

Log snip:

Mon Mar 30 09:00:01 EDT 2015 sa-learn: Starting token expiration ...
bayes: synced databases from journal in 0 seconds: 304 unique entries (605 total entries)
Mon Mar 30 09:00:06 EDT 2015 sa-learn: Starting Learning ...
Learned tokens from 24 message(s) (6971 message(s) examined)
Mon Mar 30 09:06:11 EDT 2015 sa-learn: Starting token journal sync ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Fixing file permissions ...
Mon Mar 30 09:06:14 EDT 2015 sa-learn: Starting stats dump ...
0.000          0          3          0  non-token data: bayes db version
0.000          0      84238          0  non-token data: nspam
0.000          0     379365          0  non-token data: nham
0.000          0     142093          0  non-token data: ntokens
0.000          0 1427425402          0  non-token data: oldest atime
0.000          0 1427720336          0  non-token data: newest atime
0.000          0 1427720773          0  non-token data: last journal sync atime
0.000          0 1427720406          0  non-token data: last expiry atime
0.000          0     228435          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

There are obvious issues if users leave spam sitting in their inbox, but if they move it to the spam folder it will get relearned correctly. In this case I trust the users with well-behaved mail clients, so I also feed the sent mail in as ham.


Spam older than 14 days gets deleted from the spam folder.
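
The cleanup can be done with find. A minimal sketch, assuming GNU or BSD find; which folder gets purged and how it is scheduled are assumptions here, and `purge_old_spam` is a made-up name:

```sh
#!/bin/sh
# Sketch: delete messages older than a given number of days from a Maildir folder.
# purge_old_spam and its arguments are illustrative names.
purge_old_spam() {
    folder=$1
    maxage=${2:-14}
    # -mtime +N matches files whose age in whole days exceeds N.
    find "$folder/cur" "$folder/new" -type f -mtime +"$maxage" -exec rm -f {} +
}
```

For example, `purge_old_spam /home/someuser/Maildir/.SPAM.Spam-Mail 14` would clear one user's spam folder; the user name is a placeholder.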


-James

On 3/27/2015 2:09 PM, RW wrote:

On Fri, 27 Mar 2015 15:16:13 +0000
Michael wrote:

Hi,

I would like to automatically train each user's Bayes database in the
following way:

Do the following once a day for each user:
1.) sa-learn -u username --ham ../maildir/cur
2.) sa-learn -u username --spam ../maildir/.Spam/cur

The idea is to train the Bayes for each user without the need to
take care of learning Spam/Ham on their own.

The reason for taking the "cur" folder instead of the "new" folder
is that I assume that the contents of these folders have already
been verified for false-positives/negatives by the user.

"cur" doesn't imply that the mail has been read; for that you
need to check the seen flag in the filename, an S somewhere after the
colon.
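
The seen-flag check can be done on the filename alone. A small sketch, assuming the standard Maildir naming where the flags follow `:2,`; `is_seen` and `list_seen` are made-up helper names:

```sh
#!/bin/sh
# Sketch: report whether a Maildir filename carries the S (seen) flag.
# is_seen and list_seen are illustrative names.
is_seen() {
    case ${1##*/} in
        *:2,*S*) return 0 ;;   # flags follow ":2," and S marks the message as seen
        *)       return 1 ;;
    esac
}

# Print the files in a directory (e.g. a Maildir cur folder) that have been read.
list_seen() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        if is_seen "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

The output of `list_seen` could then be passed to sa-learn in place of the whole `cur` directory, so that only mail the user has actually opened is learned.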

A problem that could occur is when the user always deletes all mails
in .Spam/cur. Then the Bayes is only trained with Ham, but never
Spam. Or isn't that a problem?

Not if you tell them - then it's their fault if it doesn't work.
Alternately you could have a separate train-spam folder and empty it
after training.
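
The train-and-empty approach could be sketched like this. It assumes a dedicated folder that users move missed spam into; the folder is only emptied if sa-learn exits successfully. `train_and_empty` is a made-up name and the folder path is whatever you choose.

```sh
#!/bin/sh
# Sketch: learn a dedicated train-spam folder as spam, then empty it.
# train_and_empty is an illustrative name.
SALEARNBIN="${SALEARNBIN:-/usr/bin/sa-learn}"

train_and_empty() {
    folder=$1
    # Keep the messages if learning fails, so nothing is lost untrained.
    "$SALEARNBIN" --spam "$folder/cur" "$folder/new" || return 1
    # Remove only the message files; the cur/new directories stay in place.
    find "$folder/cur" "$folder/new" -type f -exec rm -f {} +
}
```

Since the folder is emptied every run, it does not matter if the user also clears out their normal spam folder.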

You could also supplement spam training by autolearning only spam, e.g.
I have:

bayes_auto_learn 1
bayes_auto_learn_on_error 1
bayes_auto_learn_threshold_nonspam -2000.0

Personally I've never seen a spam mistrained as ham with the
default threshold and sensible rule scores.

I think where some people go wrong is that they don't specify
aggressive custom scores correctly. With autolearning it's better to
keep conservative scores in the non-Bayes scoresets e.g.

score SOME_RULE  2 2 8 8

not

score SOME_RULE  8

There's no difference in classification, but the latter is more likely to
cause mistraining on FPs.
