Matt,

Thank you, that makes things a lot clearer. Is there any way to utilise
forwarded messages, or is it a lost cause?

Thanks
Andrew

On Fri, 2006-11-24 at 10:22 -0500, Matt Kettler wrote:
> Andrew Sykes wrote:
> > Hi,
> >
> > I'm writing some code to integrate SpamAssassin with Apache JAMES.
> >
> > I want to set up an address to allow me to pipe spam into sa-learn. I
> > have a prototype of this working fine, but would like to allow various
> > webmail client users to be able to forward spam messages to this
> > address.
> >
> > As I have very limited understanding of how SA works, I don't want to
> > end up blocking the forwarding addresses.
> >
> > If I whitelist the forwarding addresses, can I then simply pipe a
> > forwarded spam from that address into sa-learn or is there more to it?
> >   
> 
> There's MUCH more to it. In fact, whitelisting won't really affect what
> sa-learn does at all.
> 
> Generally speaking, forwarded messages are mostly useless to sa-learn.
> Exactly how useless depends a bit on the mail client.
> 
> SA tokenizes MANY mail headers, including Received:, not just From: and
> To:. All the headers in a forwarded message are completely new, so the
> sa-learn process will be learning the headers generated by forwarding,
> not those of the spam.
> 
> SA also tokenizes the body of the message. However, most mail clients
> substantially modify the body of the message when you forward.
> Generally speaking they only preserve one of the MIME sections in a
> multipart/alternative message. Spammers FREQUENTLY have text/plain
> sections which are dissimilar from the text/html. By forwarding you're
> losing all but one MIME section (generally text/html is kept).
> 
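
To illustrate the point about lost MIME sections, here is a minimal
Python sketch using the standard library email package. The message
contents and the "keep only the HTML part" forwarding behaviour are
assumptions made up for the example; real clients vary.

```python
from email.message import EmailMessage

# Hypothetical spam: the text/plain part is innocuous filler,
# while the text/html part carries the actual pitch.
original = EmailMessage()
original["Subject"] = "Great offer"
original.set_content("The weather was mild and the garden looked lovely.")
original.add_alternative(
    "<html><body><b>Buy cheap pills now</b></body></html>", subtype="html"
)


def part_types(msg):
    """Return the content types of all non-container MIME parts."""
    return [p.get_content_type() for p in msg.walk() if not p.is_multipart()]


# Simulate a client that keeps only the HTML part when forwarding
# (an assumption for this sketch, not a rule every client follows).
html_part = original.get_body(preferencelist=("html",))
forwarded = EmailMessage()
forwarded["Subject"] = "Fwd: Great offer"
forwarded.set_content(html_part.get_content(), subtype="html")

print(part_types(original))   # ['text/plain', 'text/html']
print(part_types(forwarded))  # ['text/html']
```

The text/plain section never reaches the learner in the forwarded copy,
so any tokens it would have contributed are simply gone.
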
> On top of this, most mail clients also insert "Forwarded message:" type
> text into the body, and add Fwd: to the subject.
> 
> SA also tokenizes the in-body MIME headers describing how the message
> was encoded. However, when you forward, the mail client doing the
> forward re-encodes things its own way. What might have been base64
> encoded may now be quoted-printable, 8 bit, or 7 bit.
> 
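
The re-encoding effect can be sketched with Python's email package as
well. The body text below is invented for the example; the point is only
that the same text carries a different Content-Transfer-Encoding header
and a different raw payload depending on who encoded it.

```python
from email.message import EmailMessage

# Hypothetical spam body, made up for this sketch.
body = "Limited time offer, click here today\n"


def encoded(cte):
    """Build a text/plain message using the given transfer encoding."""
    msg = EmailMessage()
    msg.set_content(body, cte=cte)
    return msg


as_base64 = encoded("base64")            # e.g. how the spammer sent it
as_qp = encoded("quoted-printable")      # e.g. how a client re-encodes it

print(as_base64["Content-Transfer-Encoding"])  # base64
print(as_qp["Content-Transfer-Encoding"])      # quoted-printable

# Decoded text is the same, but the MIME header and raw payload differ.
same_text = as_base64.get_content().strip() == as_qp.get_content().strip()
same_raw = as_base64.get_payload() == as_qp.get_payload()
print(same_text, same_raw)  # True False
```
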
> So, fundamentally, as far as bayes is concerned the forwarded message is
> a completely different message than the original spam.
> 
> You can try this sometime by taking an original spam and a forwarded
> version of it, and feeding them both to spamassassin or sa-learn with
> "-D bayes" added. This will cause the debug output to list all the
> tokens used. Take a look at the tokens. Some are the same, but many are
> different.
> 
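
For a rough feel of the comparison described above without running SA,
here is a small Python sketch. Note that naive_tokens() is a hypothetical
stand-in and NOT SpamAssassin's real Bayes tokenizer, and both sample
messages are invented; the "-D bayes" debug output remains the way to see
the actual tokens.

```python
import re
from email import message_from_string


def naive_tokens(raw):
    """Very rough stand-in for a Bayes tokenizer (not SA's own):
    header words get a header-name prefix, body words are lower-cased."""
    msg = message_from_string(raw)
    tokens = set()
    for name, value in msg.items():
        for word in re.findall(r"[\w.@-]+", value):
            tokens.add(f"H{name}:{word.lower()}")
    for word in re.findall(r"\w+", msg.get_payload()):
        tokens.add(word.lower())
    return tokens


# Invented sample messages for illustration only.
original_raw = (
    "Received: from mail.spammer.example by mx.example.org\n"
    "From: offers@spammer.example\n"
    "Subject: Cheap pills\n"
    "\n"
    "Buy cheap pills online today\n"
)
forwarded_raw = (
    "Received: from webmail.example.org by mx.example.org\n"
    "From: user@example.org\n"
    "Subject: Fwd: Cheap pills\n"
    "\n"
    "Forwarded message:\n"
    "Buy cheap pills online today\n"
)

orig = naive_tokens(original_raw)
fwd = naive_tokens(forwarded_raw)
shared = orig & fwd

print("shared:         ", sorted(shared))
print("only in original:", sorted(orig - fwd))
print("only in forward: ", sorted(fwd - orig))
```

Even in this toy case the body words survive, while the sender and relay
header tokens are replaced by those of the forwarding user.
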
-- 
Kind Regards
Andrew Sykes <[EMAIL PROTECTED]>
Sykes Development Ltd
http://www.sykesdevelopment.com
