Bayes training via inotify (incron)

Eric Wong Fri, 22 Aug 2014 01:36:06 -0700

Hi all, I'm a happy SA user since around 2004/2005.

Since 2008, I've been using Linux inotify (via incron) to do automatic
Bayes training.  Previously I did something similar using:
        find ... | spamc -L ...  via cron.


I also used to run several filters with SA (crm114+dspam), but since
2008, I've been using SA-only.

I always thought inotify was an obvious way to train for anybody using
Maildirs on Linux, so I set it up for my server and basically forgot
about it since it worked well.  Fast forward to 2014 and I realize what
I do is not widespread.  I figure I'll attempt to document things here
to a wider audience on this sa-users list and hopefully help other users
out.

Code-wise, there are two custom shell scripts (below) built around
incron <http://incron.aiken.cz/>.

The idea is to use inotify (via incron) to watch for new files appearing
in Maildirs.  We only want to train seen messages as ham, and old (but
not necessarily seen) messages as spam.  The overall goal is to let
users train their filters without leaving their favorite mail user
agent (mine is mutt); all I do is move mail to the right folder.
I have one "spam" folder where I throw spam and every other directory
is not spam.
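
As a rough illustration of that rule, here is a minimal sketch of the
classification on its own.  The function name classify_maildir_path and
the sample paths are hypothetical; it assumes the usual Maildir ":2,"
flag suffix, where S means seen and T means trashed.

```shell
#!/bin/sh
# Hypothetical helper: decide how a Maildir path would be trained.
# Prints "spam", "ham", or "skip".
classify_maildir_path() {
        case $1 in
        *[/.]spam/cur/*) echo spam ;;   # anything moved into a spam folder's cur/
        *:2,*T*) echo skip ;;           # trashed messages are ignored
        *:2,*S*) echo ham ;;            # seen messages in any other folder
        *) echo skip ;;                 # unseen, non-spam: leave alone
        esac
}

classify_maildir_path "/home/u/Maildir/.INBOX.spam/cur/1408696566.M1P2.host:2,"
classify_maildir_path "/home/u/Maildir/.INBOX.good/cur/1408696566.M1P2.host:2,S"
classify_maildir_path "/home/u/Maildir/.INBOX.good/cur/1408696566.M1P2.host:2,ST"
classify_maildir_path "/home/u/Maildir/.INBOX.good/cur/1408696566.M1P2.host:2,"
```

The four sample calls print spam, ham, skip, and skip, in that order.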

Every message written to Maildir involves a rename, so we only have
incron watch for IN_MOVED_TO events.

The flow is as follows, all for a single Unix user account:

    incron -> report-spam -> sendmail -> MTA -> dc-dlvr -> spamc -> spamd
                      (my sendmail+MTA is postfix)

The scripts I wrote are report-spam and dc-dlvr; the latter is also a
system-wide MDA.  I use dovecot (dc-dlvr => "dovecot deliver") as my
IMAP server, but other IMAP servers supporting Maildir should also work.
This should also work with MTAs other than postfix.

More comments inline in the scripts:
------------------------ report-spam -------------------------------
#!/bin/sh
# License: GPLv3 or later <http://www.gnu.org/licenses/gpl-3.0.txt>
# Usage: report-spam /path/to/message/in/maildir
# This is intended for use with incron or similar systems.
# my incrontab(5) looks like this:
#  /path/to/maildir/.INBOX.good/cur IN_MOVED_TO /path/to/report-spam $@/$#
#  /path/to/maildir/.INBOX.spam/cur IN_MOVED_TO /path/to/report-spam $@/$#
#  (note: I have many "good" dirs for various mailing lists I follow)

# skip gigantic emails which SA does not handle
bytes=$(stat -c %s "$1")
if test "$bytes" -gt 512000
then
        exit
fi

# Only tested with the /usr/sbin/sendmail which ships with postfix
#
# *** Why not call spamc directly in this script? ***
# I route this through my MTA so it gets queued properly.
# incrond has no concurrency limits and will fork a new process on
# every single event, which sucks with rename storms when a client
# commits folder changes.  The sendmail executable exits quickly and
# queues up the message for training.  This should also ensure fairness
# to newly arriving mail.  Instead of installing/configuring
# another queueing system, I reuse the queue in the MTA.
# See dc-dlvr for corresponding trainspam/trainham handlers.

DO_SENDMAIL='/usr/sbin/sendmail -oi'

# trainspam and trainham plus-addresses are handled by dc-dlvr:
case $1 in
*[/.]spam/cur/*) # non-new messages in spam get trained
        exec $DO_SENDMAIL "$USER+trainspam" < "$1"
        ;;
*:2,*S*) # otherwise, seen messages only
        case $1 in
        *:2,*T*) exit 0 ;; # ignore trashed messages
        esac
        exec $DO_SENDMAIL "$USER+trainham" < "$1"
        ;;
esac
---------------------------- dc-dlvr ---------------------------------
#!/bin/sh
# License: GPLv3 or later <http://www.gnu.org/licenses/gpl-3.0.txt>
# This is installed as /etc/dc-dlvr on my system
# to use with postfix main.cf: mailbox_command = /etc/dc-dlvr "$EXTENSION"
DELIVER=/usr/lib/dovecot/deliver

# change if your spamc/spamd listens elsewhere
spamc='spamc'

# delivery targets for report-spam.
# allow plus addressing to train spam filters, $1 is the $EXTENSION
# which may be "trainspam" or "trainham".  Only allow spam training
# when $CLIENT_ADDRESS is empty (local client)
# I check $CLIENT_ADDRESS (set by Postfix) so spammers who read this
# script won't be able to mistrain my Bayes remotely.
case $1,$CLIENT_ADDRESS in
trainspam,) exec $spamc -L spam > /dev/null 2>&1 ;;
trainham,) exec $spamc -L ham > /dev/null 2>&1 ;;
esac

# "|| exit 1" must sit outside $(...); inside, it only exits the subshell
TMPMSG=$(mktemp -t dc-dlvr.orig.$USER.XXXXXX) || exit 1
rm_list=$TMPMSG

# pre-filter, for infrequently read lists which do their own spam filtering:
# this runs whatever shell commands a user desires (may read $TMPMSG)
if test -r ~/.dc-dlvr.pre
then
        # this branch is only for huge, high-traffic lists which I only skim.
        # I short-circuit SA for some huge lists which do good
        # filtering on their own.
        set -e

        cat > $TMPMSG
        DEFAULT_INBOX=$(. ~/.dc-dlvr.pre)
        case $DEFAULT_INBOX in
        '') exec rm -f $rm_list ;;
        INBOX) ;; # do nothing
        *)
                $DELIVER -m $DEFAULT_INBOX < $TMPMSG
                exec rm -f $rm_list
                ;;
        esac

        # normal SA filtering path
        PREMSG=$(mktemp -t dc-dlvr.orig.$USER.XXXXXX) || exit 1
        rm_list="$rm_list $PREMSG"
        set +e
        mv -f $TMPMSG $PREMSG
        $spamc -E --headers < $PREMSG > $TMPMSG
else
        # normal SA filtering for email I'm expected to read:
        $spamc -E --headers > $TMPMSG
fi
err=$?

# normal delivery
set -e

case $err in
1) $DELIVER -m INBOX.spam < $TMPMSG ;;
*)
        # users may override normal delivery and have it go elsewhere
        if test -r ~/.dc-dlvr.rc
        then
                # whatever shell commands a user desires
                . ~/.dc-dlvr.rc
        else
                $DELIVER -m INBOX < $TMPMSG
        fi
        ;;
esac

# cleanup
exec rm -f $rm_list
---------------------------------- 8< -----------------------------
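
For illustration, a ~/.dc-dlvr.pre hook might look like the sketch
below.  dc-dlvr sources the file and captures its standard output as the
target folder: empty output discards the message, "INBOX" sends it
through the normal SA path, and anything else is delivered straight to
that folder.  The list address, the folder name, and the pick_folder
helper are all hypothetical.

```shell
# Hypothetical ~/.dc-dlvr.pre (sourced by dc-dlvr, which sets $TMPMSG)
pick_folder() {
        # Look at the header block only (everything up to the first blank
        # line).  Assumes the List-Id header is not folded across lines.
        if sed '/^$/q' "$1" | grep -qi '^List-Id:.*<big-list\.example\.org>'
        then
                echo INBOX.big-list     # bypass SA, deliver straight here
        else
                echo INBOX              # fall through to normal SA filtering
        fi
}

# only run when dc-dlvr has provided a message to inspect
if test -n "${TMPMSG:-}"
then
        pick_folder "$TMPMSG"
fi
```

Using "if" rather than "&&" for the final guard matters here: dc-dlvr
runs the hook under "set -e", so the file should not end with a
non-zero status.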

I wonder if other users out there have similar setups.  It should be
possible for kqueue users (assuming there is something like incrond for
kqueue OSes).  Anyways I hope this all makes sense and helps some SA
users out there.  I'll be glad to try and answer any questions/comments.

I haven't needed to do much else to configure SA after all these years.
Most of the Debian-provided defaults are good, but I enable the
automatic cron updates.

Thanks for reading this far! :)

-- 
EW
