On Mon, 28 Dec 2015 23:42:03 -0500
Bill Cole wrote:

> Using these facts, my learning script that runs as root and reads
> from multiple real users' Maildirs does this to learn ham:
> 
>    for AFILE in $HAMS ; do formail < $AFILE ; done |
>      sudo -H -u $SAUSER sa-learn --ham --mbox
> 
> Where $HAMS is the list of ham message files and $SAUSER is the user 
> handling the system-wide BayesDB. I use formail there just to give
> each message a leading 'From ' line (i.e. mbox format) so that the
> whole bunch can be piped into a single sa-learn invocation.

IIRC, when you do that, sa-learn just creates a temporary file and then
runs on that.

> The alternative without formail would be to pipe each raw message into
> its own sa-learn. 

The alternative is to give it a directory. It can work out for itself
whether it's maildir or just a directory of files. If you need to train
an arbitrary selection of files, you could symlink them into a
temporary directory. If you run spamd it's also possible to train via
spamc.
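
A minimal sketch of the symlink approach, reusing $HAMS and $SAUSER from
the quoted script. The function names link_messages and
learn_ham_selection are made up for illustration. It assumes the paths
in $HAMS are absolute, and that $SAUSER can read the message files
themselves, which the piped version did not need because root did the
reading there.

```shell
# link_messages DIR FILE...
# Symlink each FILE into DIR under a numbered name, so that files with the
# same basename in different Maildirs do not collide.
# Assumes each FILE is an absolute path; a relative path would leave a
# dangling link.
link_messages() {
    dir=$1
    shift
    n=0
    for f in "$@"; do
        n=$((n + 1))
        ln -s "$f" "$dir/$n" || return 1
    done
}

# learn_ham_selection FILE...
# Hypothetical wrapper: build the temporary directory, train on it, clean up.
learn_ham_selection() {
    tmp=$(mktemp -d) || return 1
    chmod 755 "$tmp"    # mktemp -d creates it 0700; $SAUSER must be able to enter it
    link_messages "$tmp" "$@"
    sudo -H -u "$SAUSER" sa-learn --ham "$tmp"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

It would be called as "learn_ham_selection $HAMS". No formail and no
mbox are involved, so the messages reach sa-learn byte for byte as they
sit on disk.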


Personally I'd avoid the unforced use of mbox around Bayes without
being sure that "From-escaping" is taken into account. The problem is
that formail will replace a "From " at the beginning of a body line with
">From ", which changes the msgid hash and prevents the correct
retraining of mail that was trained without going through formail,
e.g. the correction of autotraining.
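
To see which messages would be affected, something like the following
could be run over the files before training. It is only a sketch:
escape_from is a simplified sed stand-in for From-escaping, not formail
itself, and cksum is used purely to show that the bytes differ. It is
not how SpamAssassin computes its own identifier. Both function names
are made up.

```shell
# escape_from: simplified stand-in for mbox From-escaping.
# Reads a raw message on stdin and prefixes '>' to lines starting "From ".
escape_from() {
    sed 's/^From />From /'
}

# unchanged_by_escaping FILE
# Succeeds only if escaping would leave FILE byte-identical, i.e. the
# message contains no line that starts with "From ".
unchanged_by_escaping() {
    [ "$(cksum < "$1")" = "$(escape_from < "$1" | cksum)" ]
}
```

A loop such as "for f in $HAMS; do unchanged_by_escaping $f || echo $f;
done" would then list the messages whose content changes on the way
through an mbox.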

I just had a quick look and I can't see any handling of From-escaping
in SpamAssassin. It's not a major problem, but in this case it's an
easily avoidable one.
