On Thu, 2013-07-25 at 09:39 +0100, James Griffin wrote:
> Thu 25.Jul'13 at  1:31:16 +0200, Karsten Bräckelmann

> > NOTE: Be careful of using sa-learn in different environments or ways in
> > parallel. For example via the dovecot anti-spam plugin, from a cron job
> > harvesting mbox files, maildir, processed through formail or even worse
> > an MUA...
>  
> I'm new to this list but have been using SA for a number of years.
> Having read your note above, I thought I'd ask for a little more info,
> in particular piping a message from mutt, my MUA, to mark a single mail
> as spam and move it to an appropriate mailbox.

> You say this is a bad idea - so I'm wondering if it's best I no longer do
> that, and why?
> 
> I'm using SA 3.3.2. My mail is also scanned using procmail prior to
> being filtered into MH mailboxes.

That's how almost everyone does it. ;)  Auto-learn prior to delivery
(procmail calling SA and later delivering in your case) and manually
training hand-classified mail.


The important part here is mixing different ways to feed mail for Bayes
training. I have observed trailing newline issues between

(a) the dovecot anti-spam plugin's output,  (b) 'formail' splitting out
single messages from mbox files, and  (c) running 'sa-learn --mbox'.
That was over the years, and only as far as I recall. It might even
have been specific to one system.

The point is, differences in trailing newlines, slightly altered MIME
structures or headers will be invisible to Bayes as far as the tokens
(the content) are concerned. The internal hash identifying a given
message as having been seen by Bayes will differ, though.

As a result, messages could be learned twice, or not be forgotten.
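
To illustrate the effect, here is a minimal Python sketch. It is NOT how
SpamAssassin computes its internal message identifier; tokens() and
fingerprint() are hypothetical stand-ins, assuming only that the tokens
come from the words in the message while the "seen before" check depends
on the exact bytes. One extra trailing newline leaves the tokens
unchanged but changes the fingerprint:

```python
import hashlib
import re


def tokens(raw: bytes) -> frozenset:
    """Rough stand-in for a tokenizer: the set of lowercased words."""
    return frozenset(re.findall(rb"[a-z0-9]+", raw.lower()))


def fingerprint(raw: bytes) -> str:
    """Hypothetical 'seen before' fingerprint: SHA-1 over the raw bytes."""
    return hashlib.sha1(raw).hexdigest()


original = b"Subject: cheap pills\n\nBuy cheap pills now\n"
# The same message after passing through another tool that appended
# one extra trailing newline.
via_other_path = original + b"\n"

same_tokens = tokens(original) == tokens(via_other_path)
same_fingerprint = fingerprint(original) == fingerprint(via_other_path)
print(same_tokens, same_fingerprint)
```

With identical tokens but a different fingerprint, a learner keyed on
the fingerprint would treat the second copy as a new message.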


Simple test if you're safe and everything works as expected:

Identify a message M that has been learned already, e.g. via the dovecot
anti-spam plugin, or SA auto-learning. Then apply your usual other
method of training, like sa-learn'ing the whole mbox or maildir storage
containing the message M, or running the mutt macro in your case.

If the message M has been learned *again*, it has been altered by one of
the methods. Which is bad, obviously. If Bayes identifies M as having
been seen before and refuses re-training, you're good.
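
The check above could be scripted roughly as follows. This is a sketch
under assumptions: it expects sa-learn to print a summary line of the
form "Learned tokens from N message(s) (M message(s) examined)", which
may differ on your version, and the helper names (parse_summary,
relearned, feed_again) are made up for this example. Adapt feed_again()
to whatever your second training path really is.

```python
import re
import subprocess

# Assumed shape of the sa-learn summary line, e.g.
# "Learned tokens from 0 message(s) (1 message(s) examined)"
SUMMARY = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def parse_summary(output: str) -> tuple:
    """Return (learned, examined) from sa-learn's summary line."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no sa-learn summary line found")
    return int(match.group(1)), int(match.group(2))


def relearned(output: str) -> bool:
    """True if at least one message was learned, i.e. not recognized."""
    learned, _examined = parse_summary(output)
    return learned > 0


def feed_again(path: str, kind: str = "--spam") -> bool:
    """Feed an already-learned message to sa-learn a second time.

    Returns True if Bayes learned it again (bad), False if it was
    recognized as already seen (good).
    """
    result = subprocess.run(
        ["sa-learn", kind, path],
        capture_output=True, text=True, check=True,
    )
    return relearned(result.stdout)
```

For the mutt case, save the message M the same way the macro pipes it
and call feed_again() on that file; a False result means Bayes
recognized the message.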


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
