Hi all, I've been a happy SA user since around 2004/2005. Since 2008, I've been using Linux inotify (via incron) to do automatic Bayes training. Previously, I did something similar via cron using: find ... | spamc -L ...
I also used to run several filters alongside SA (crm114+dspam), but
since 2008, I've been using SA only.

I always thought inotify was an obvious way to train for anybody using
Maildirs on Linux, so I set it up for my server and basically forgot
about it since it worked well.

Fast forward to 2014 and I realize what I do is not widespread.  I
figure I'll attempt to document things here for a wider audience on
this sa-users list and hopefully help other users out.

Code-wise, there are two non-standard primary shell scripts (below)
revolving around incron <http://incron.aiken.cz/>

The idea is to use inotify (via incron) to watch for new files
appearing in Maildirs.  We only want to train seen messages as ham, and
old (but not necessarily seen) messages as spam.

The overall goal of this is to allow a user to train their filters
without leaving their favorite mail user agent (mine is mutt); all I do
is move mail to the right folder.  I have one "spam" folder where I
throw spam, and every other directory is not spam.

Every message written to a Maildir involves a rename, so we only have
incron watch for IN_MOVED_TO events.

The flow is as follows, all for a single Unix user account:

	incron -> report-spam -> sendmail -> MTA -> dc-dlvr -> spamc -> spamd

(my sendmail+MTA is postfix)

The scripts I wrote are report-spam and dc-dlvr; the latter is also a
system-wide MDA.  I use dovecot (dc-dlvr => "dovecot deliver") as my
IMAP server, but other IMAP servers supporting Maildir should also
work.  This should also work with MTAs other than postfix.

More comments inline in the scripts:

------------------------ report-spam -------------------------------
#!/bin/sh
# License: GPLv3 or later <http://www.gnu.org/licenses/gpl-3.0.txt>
# Usage: report-spam /path/to/message/in/maildir
# This is intended for use with incron or similar systems.
# my incrontab(5) looks like this:
#  /path/to/maildir/.INBOX.good/cur IN_MOVED_TO /path/to/report-spam $@/$#
#  /path/to/maildir/.INBOX.spam/cur IN_MOVED_TO /path/to/report-spam $@/$#
# (note: I have many "good" dirs for various mailing lists I follow)

# skip gigantic emails which SA does not handle
bytes=$(stat -c %s "$1")
if test "$bytes" -gt 512000
then
	exit
fi

# Only tested with the /usr/sbin/sendmail which ships with postfix
#
# *** Why not call spamc directly in this script? ***
# I route this through my MTA so it gets queued properly.
# incrond has no concurrency limits and will fork a new process on
# every single event, which sucks with rename storms when a client
# commits folder changes.  The sendmail executable exits quickly and
# queues up the message for training.  This should also ensure fairness
# to newly arriving mail.  Instead of installing/configuring
# another queueing system, I reuse the queue in the MTA.
# See dc-dlvr for corresponding trainspam/trainham handlers.
# (left unquoted when used below so the -oi argument is split off)
DO_SENDMAIL='/usr/sbin/sendmail -oi'

# trainspam and trainham plus-addresses are handled by dc-dlvr:
case $1 in
*[/.]spam/cur/*) # non-new messages in spam get trained
	exec $DO_SENDMAIL "$USER+trainspam" < "$1"
	;;
*:2,*S*) # otherwise, seen messages only
	case $1 in
	*:2,*T*) exit 0 ;; # ignore trashed messages
	esac
	exec $DO_SENDMAIL "$USER+trainham" < "$1"
	;;
esac
---------------------------- dc-dlvr ---------------------------------
#!/bin/sh
# License: GPLv3 or later <http://www.gnu.org/licenses/gpl-3.0.txt>
# This is installed as /etc/dc-dlvr on my system
# to use with postfix main.cf: mailbox_command = /etc/dc-dlvr "$EXTENSION"
DELIVER=/usr/lib/dovecot/deliver

# change if your spamc/spamd listens elsewhere
spamc='spamc'

# delivery targets for report-spam.
# allow plus addressing to train spam filters, $1 is the $EXTENSION
# which may be "trainspam" or "trainham".
# Only allow spam training when $CLIENT_ADDRESS is empty (local client).
# I check $CLIENT_ADDRESS (set by Postfix) so spammers who read this
# script won't be able to mistrain my Bayes remotely.
case $1,$CLIENT_ADDRESS in
trainspam,) exec $spamc -L spam > /dev/null 2>&1 ;;
trainham,) exec $spamc -L ham > /dev/null 2>&1 ;;
esac

# "|| exit 1" sits outside $(...) so a mktemp failure aborts the script
TMPMSG=$(mktemp -t dc-dlvr.orig.$USER.XXXXXX) || exit 1
rm_list=$TMPMSG

# pre-filter, for infrequently read lists which do their own spam filtering:
# this runs whatever shell commands a user desires (may read $TMPMSG)
if test -r ~/.dc-dlvr.pre
then
	# this branch is only for huge, high-traffic lists which I only skim.
	# I short-circuit SA for some huge lists which do good
	# filtering on their own.
	set -e
	cat > "$TMPMSG"
	DEFAULT_INBOX=$(. ~/.dc-dlvr.pre)
	case $DEFAULT_INBOX in
	'') exec rm -f $rm_list ;;
	INBOX) ;; # do nothing
	*)
		$DELIVER -m "$DEFAULT_INBOX" < "$TMPMSG"
		exec rm -f $rm_list
		;;
	esac

	# normal SA filtering path
	PREMSG=$(mktemp -t dc-dlvr.orig.$USER.XXXXXX) || exit 1
	rm_list="$rm_list $PREMSG"
	set +e
	mv -f "$TMPMSG" "$PREMSG"
	$spamc -E --headers < "$PREMSG" > "$TMPMSG"
else
	# normal SA filtering for email I'm expected to read:
	$spamc -E --headers > "$TMPMSG"
fi
err=$?

# normal delivery
set -e

case $err in
1) $DELIVER -m INBOX.spam < "$TMPMSG" ;;
*)
	# users may override normal delivery and have it go elsewhere
	if test -r ~/.dc-dlvr.rc
	then
		# whatever shell commands a user desires
		. ~/.dc-dlvr.rc
	else
		$DELIVER -m INBOX < "$TMPMSG"
	fi
	;;
esac

# cleanup
exec rm -f $rm_list
---------------------------------- 8< -----------------------------

I wonder if other users out there have similar setups.  Something
similar should be possible for kqueue users (assuming there is
something like incrond for kqueue OSes).

Anyways, I hope this all makes sense and helps some SA users out there.
I'll be glad to try and answer any questions/comments.
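
In case anybody wants to check the filename matching before pointing
incron at a real Maildir, something along these lines might help.  This
is only a sketch: classify_maildir_path is a hypothetical helper name
and not part of the scripts above.  It reuses the case patterns from
report-spam, but prints the decision instead of calling sendmail.

```shell
#!/bin/sh
# Dry-run sketch (hypothetical helper, not part of report-spam itself):
# print what report-spam would do with a Maildir path, without
# invoking sendmail.  The patterns are copied from report-spam.
classify_maildir_path() {
	case $1 in
	*[/.]spam/cur/*)
		# anything in a spam folder's cur/ is trained as spam
		echo trainspam
		;;
	*:2,*S*)
		# elsewhere, only seen (S) messages are trained as ham,
		# unless they are also trashed (T)
		case $1 in
		*:2,*T*) echo skip ;;
		*) echo trainham ;;
		esac
		;;
	*)
		# unseen message outside of the spam folder: do nothing
		echo skip
		;;
	esac
}

# example usage: classify each path given on the command line
for path in "$@"
do
	printf '%s\t%s\n' "$(classify_maildir_path "$path")" "$path"
done
```

Saved under some name of your choosing, it can be run by hand against
a few paths from a "good" dir and the spam dir; each output line is the
decision (trainspam, trainham or skip) followed by the path.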

I haven't needed to do much else to configure SA after all these years;
most of the Debian-provided defaults are good, but I do enable the
automatic cron updates.

Thanks for reading this far! :)

-- 
EW