Re: sa-learn from a cronjob?

Bob Proulx Thu, 24 Apr 2014 13:39:12 -0700

RW wrote:
> Ian Zimmerman wrote:
> > RW wrote:
> > RW> I don't think it will work for the purpose mentioned, and if it's
> > RW> working properly for you, there's a lot you're not mentioning.

I looked at the script and it looks like an example that would work
fine for Ian.  There are some points of shell programming style that I
would like to avoid seeing propagated in an example though. :-)  But I
think that it is great that Ian shared his script just the same.  This
is one of those things where if ten of us showed our working examples
we would have 12 different scripts.

The biggest thing that hurts Ian's script as a general example is that
it is using ssh to connect to the server running spamassassin.  Most
developers use ssh every day and so that is very normal.  But most of
the masses of email users will not be in a position to use ssh
effectively.  A mail administrator would be able to see the example for
what it is and then write that part differently though.

> > RW> It's only looking for mail in the immediate post-delivery state
> > RW> after it's been put into the mailbox by an MTA or MDA and before
> > RW> it's been detected as new mail by an MUA (directly or via IMAP).
> > RW> It wont learn mail put into the folders by an MUA or IMAP at all.

No.  That isn't what the script is doing.

The script is looping through mail files in a maildir and processing
them remotely on the server through sa-learn.  After processing the
messages it is moving the messages to mark them as having been read.

The script is obviously meant to be run periodically by cron.  At that
time it will walk through every message that has been stored into the
ham and spam mailboxes.  A user would only need to store the message
into the appropriate mailbox.  A spam message into the spam mailbox
and then later in the background the cron task will send the spam
message through sa-learn --spam for learning.  Same for --ham.  The
script is fairly obvious, straightforward, and brute force.
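
As a rough sketch of the loop described above (this is not Ian's
actual script, which runs sa-learn over ssh), something like the
following Python would do the same walk locally.  The function name
learn_maildir and the injectable run parameter are made up for
illustration, and it assumes that messages to be learned sit in the
maildir's new/ directory and are moved to cur/ once learned.

```python
import os
import subprocess
from pathlib import Path


def learn_maildir(maildir, kind, run=subprocess.run):
    """Feed each message in maildir/new to sa-learn, then move it to cur/.

    kind is "spam" or "ham".  run is injectable (hypothetical parameter)
    so the loop can be exercised without SpamAssassin installed.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    maildir = Path(maildir)
    learned = []
    for msg in sorted((maildir / "new").iterdir()):
        if not msg.is_file():
            continue
        # sa-learn accepts --spam or --ham followed by a message file.
        result = run(["sa-learn", "--" + kind, str(msg)])
        if result.returncode != 0:
            continue  # leave the message in new/ so the next run retries it
        # Maildir convention: names in cur/ end in ":2,<flags>"; S means Seen.
        name = msg.name if ":2," in msg.name else msg.name + ":2,S"
        os.rename(msg, maildir / "cur" / name)
        learned.append(name)
    return learned
```

A cron job would then call learn_maildir once for the spam mailbox
with "spam" and once for the ham mailbox with "ham".  The race Ian
mentions between the sa-learn run and the move to cur is still there
in this sketch.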

> > RW> You need to use separate destination mailboxes.
> > 
> > These are _not_ general purpose Maildirs.  The normal mail processing
> > pipe (MTA -> LDA -> IMAP -> MUA) knows nothing about them.  To mark
> > something as spam/ham, a user (me) executes a custom macro in the MUA
> > which pipes the message through the safecat command to "deliver" it
> > explicitly to one of these directories. 
> 
> You might have mentioned that because it means it's not the solution you
> implied when you wrote "Here is my cronjob for that purpose". It's
> certainly not appropriate to users that don't like the command line.

Sorry but you are incorrect.  Users of Ian's system need not use the
command line.  His solution directly answered Dan's question.

Dan Mahoney wrote:
> I'd like to basically have my IMAP server default to handing out two
> imap mailboxes that get auto-crontabbed to training bayes.

Ian Zimmerman wrote:
> Here is my cronjob for that purpose, in its entirety.  Note that
> each of ~/spam-corpora{ham,spam} is a Maildir.  There is a small
> race condition between the sa-learn run and the move to cur, which
> wasn't worth fixing in my case; if you use this and fix it let me
> know :)

Which is exactly what his script does.  (I don't like the
implementation as written because the shell scripting has some rough
spots.  But...)

> > Basically, Maildir is just a convenient container format here.  It
> > could be a database or whatever.
> > 
> > Does that answer your objections?
> 
> A Maildir isn't any more convenient than two simple directories. It
> doesn't really matter if you are the only user, but in general putting
> a Maildir that mustn't be opened in home directories wouldn't be a
> very good idea.

I am having a hard time understanding what you are objecting to here.
Dan was the one with the question.  Ian shared something that would do
the task.  It looks like you are having a hard time understanding how
this worked.  If so then please ask questions so as to understand it.
It doesn't make sense to gripe about it without reason.  Sharing and
commenting and peer review and iterating a solution and improving it
is how community efforts work and succeed and grow.

Your comment that a maildir isn't better than two simple directories
implies that you are not familiar with the maildir mailbox format.
Maildir is a de facto standard mailbox format used by most imap
servers.  Using maildir mailboxes would definitely be better than
using two simple directories.  Standard is better than better!

There isn't any reason that it "mustn't be opened".  In fact the
opposite.  The user must be able to open the mailbox and must be able
to save misclassified messages there for learning.  If they do that by
mistake then they can pull the message back out before the cron task
runs.  (That timing is one of my issues with the script that I would
want to see improved.)
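
One possible way to give the user that window, purely as an
assumption on my part and not something in Ian's script, is to skip
any message that has been sitting in the mailbox for less than some
grace period.  A minimal sketch, where messages_past_grace and
grace_seconds are hypothetical names:

```python
import time
from pathlib import Path


def messages_past_grace(folder, grace_seconds, now=None):
    """Return message files in folder last modified at least grace_seconds ago.

    Hypothetical helper: newer messages are left alone so that a user
    who saved one by mistake still has time to pull it back out.
    """
    now = time.time() if now is None else now
    folder = Path(folder)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and now - p.stat().st_mtime >= grace_seconds
    )
```

The cron task would then pass only the returned files to sa-learn and
pick up the rest on a later run.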

Using a maildir for these two purposes makes a lot of sense.  A user
reading email in any of the popular ways these days can simply save
the message into the appropriate mailbox.  That could be from an imap
client or a web mail browser.  If they get a spam message they can
simply save it into the spam mailbox.  Then Ian's process is to use a
cron task to periodically send all email that has been saved into the
spam mailbox through sa-learn --spam on the server, training the
SpamAssassin Bayes engine on the message.  And the opposite, with
--ham, for non-spam messages that were misclassified.  For the end
mail reading user no command line knowledge is needed.  They simply
need to be able to save email into mail folders.  Simple for them.
All of the effort is in the backend on the mail server.  It would work
for a large number of users.

Bob
