SpamAssassin's Bayes mechanism and message headers

2009-03-17 Thread floss

Hello,

Instead of excluding a lot of possibly set message headers with
"bayes_ignore_header" and ending up with strange configs like:


bayes_ignore_header Return-Path
bayes_ignore_header Received
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Resent-For
bayes_ignore_header X-Resent-By
bayes_ignore_header X-Resent-To
bayes_ignore_header Resent-To
bayes_ignore_header Sender
bayes_ignore_header Precedence
bayes_ignore_header X-Antispam
bayes_ignore_header X-Sieve
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity
bayes_ignore_header To
bayes_ignore_header X-Sieve
bayes_ignore_header X-WEBDE-FORWARD

bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Antispam
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity

(found on the net)

shouldn't SpamAssassin's Bayes mechanism simply ignore the complete
message header and look only at the body?


That seems useful to me. What do you think?
(Are static tests not good enough for the message headers?)
It would also seem more useful to me to enable only specific header
fields and ignore all others. I understand that, for example, From, To
or Subject may contain useful tokenizable information, but most
fields seem uninteresting, and it is hard to find them all or to be
sure you got them all.


Is there a config option to tell SpamAssassin's Bayes mechanism not to
look at the message header, or does SpamAssassin already ignore the
header by default?

Perhaps there are regular expressions?

If it parses the message header, it seems you have to read the RFCs
and look at some tools to find out what kinds of message headers are set.


Thanks.
Philipp



Re: SpamAssassin's Bayes mechanism and message headers

2009-03-18 Thread floss

Matt Kettler wrote:


fl...@pbartels.info wrote:

Hello,

Instead of excluding a lot of possibly set message headers with
"bayes_ignore_header" and ending up with strange configs like:

bayes_ignore_header Return-Path
bayes_ignore_header Received
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Resent-For
bayes_ignore_header X-Resent-By
bayes_ignore_header X-Resent-To
bayes_ignore_header Resent-To
bayes_ignore_header Sender
bayes_ignore_header Precedence
bayes_ignore_header X-Antispam
bayes_ignore_header X-Sieve
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity
bayes_ignore_header To
bayes_ignore_header X-Sieve
bayes_ignore_header X-WEBDE-FORWARD

bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Antispam
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity

(found on the net)

Where?


Just search for bayes_ignore_header and you'll find a lot of results,
some of them with long lists of bayes_ignore_header settings like the
one above.


Because I found it so often, I'm wondering whether it's really useful or not.

There is also an example in the default local.cf:
#   Set headers which may provide inappropriate cues to the Bayesian
#   classifier
#
# bayes_ignore_header X-Bogosity
# bayes_ignore_header X-Spam-Flag
# bayes_ignore_header X-Spam-Status



shouldn't SpamAssassin's Bayes mechanism simply ignore the complete
message header and look only at the body?
That seems useful to me.

It seems like a very misguided idea to me.

Is there any reason to think headers make bad tokens?


For example, "X-Spam-Flag: NO" can cause problems if you don't
remove it before parsing and don't set it yourself. (You would never do
that, and I don't know how SA really handles it internally, but it's a
good example, because it's exactly the kind of header that says the
mail is ham.)


To me it seems Bayes would now conclude that all messages with
"X-Spam-Flag: NO" are not spam. Sure, Bayes is not a binary system, but
this header field would push the mail a bit toward being treated as
non-spam. (Or, if all spammers set this flag, no spam messages would be
pushed toward being treated as spam.)


Problem:
There could be other fields that normally indicate a message is not
spam. If a spammer uses one of them and the Bayes system does not
ignore it, the message is treated more like non-spam.
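To make the concern concrete, here is a minimal sketch of how a single strongly ham-leaning token can pull down a combined score. It uses a simple naive Bayes combination, which is an assumption for illustration only; SpamAssassin's real combiner is more involved. All probabilities, including the one for the header token, are made-up numbers.

```python
from math import prod


def combine(probabilities):
    """Naive Bayes combination of per-token spam probabilities."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# Hypothetical per-token spam probabilities for three body tokens.
body_tokens = [0.9, 0.8, 0.7]

# Hypothetical probability for a header token such as "X-Spam-Flag: NO"
# that was learned almost only from ham.
ham_leaning_header_token = 0.05

without_header = combine(body_tokens)
with_header = combine(body_tokens + [ham_leaning_header_token])

print(f"without header token: {without_header:.3f}")
print(f"with header token:    {with_header:.3f}")
```

In this toy setup the message still scores as spam, but noticeably less so, which is the effect bayes_ignore_header is meant to avoid for headers added by upstream filters.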



Do you have any test data showing this improves your bayes accuracy?

I'd expect a significant reduction in accuracy from this, but if you've
got real data showing otherwise, I'd love to see it.  My own informal
testing shows header tokens are *VERY* useful, particularly Received:
header tokens.

No, I'm just thinking about it.



SpamAssassin contains quite a bit of code to break the headers down and
tokenize them in a useful way. It doesn't just extract a bunch of words
from the headers and throw them in the database, it actually encodes
things like what header a word exists in as a part of the token itself.
i.e.: "Drug" in the From: header is a different token than "Drug" in the
To: header, which is different from "Drug" in the body.
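The idea of encoding the header name into the token can be sketched roughly as follows. This is not SpamAssassin's actual code (which is written in Perl); the `H:<header>:` prefix, the function name and the word-splitting regex are all assumptions chosen for illustration.

```python
import re

WORD = re.compile(r"[A-Za-z0-9']+")


def tokenize_message(headers, body):
    """Return a set of tokens; header words carry their header name."""
    tokens = set()
    for name, value in headers:
        for word in WORD.findall(value):
            # Hypothetical scheme: prefix each word with its header name,
            # so the same word in different headers yields different tokens.
            tokens.add(f"H:{name.lower()}:{word.lower()}")
    for word in WORD.findall(body):
        # Body words are stored without a prefix.
        tokens.add(word.lower())
    return tokens


tokens = tokenize_message(
    [("From", "Drug Store <sales@example.com>"),
     ("To", "drug-info@example.org")],
    "Cheap drug offers inside",
)
print(sorted(t for t in tokens if t.endswith("drug")))
```

With this scheme, "drug" from the From: header, from the To: header and from the body ends up as three separate tokens, each of which can learn its own spam probability.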



What do you think?
(Are static tests not good enough for the message headers?)

No.  Static rules are not any better for headers than they are for body
text. Bayes allows SA to adapt to rapid mutations in spam. These
mutations exist in both the headers, and the body.

It would also seem more useful to me to enable only specific header
fields and ignore all others. I understand that, for example, From, To
or Subject may contain useful tokenizable information, but most
fields seem uninteresting, and it is hard to find them all or to be
sure you got them all.

Is there a config option to tell SpamAssassin's Bayes mechanism not to
look at the message header, or does SpamAssassin already ignore the
header by default?

No, the entire design of the SA bayes mechanism intentionally tries to
tokenize headers.  A lot of work went into making it do this very well.
Why would you want to disable it?



See above.


If you don't like bayes, by all means disable it, but why cut off its
legs? If you're going to use the CPU and IO time to run bayes, let it
run well.

Perhaps there are regular expressions?

If it parses the message header, it seems you have to read the RFCs
and look at some tools to find out what kinds of message headers are set.


SA extensively parses the headers. It parses *all* headers, even
nonstandard ones that I could randomly configure a server to add like
"X-Matts-funky-header: Hi!".

There is no complete list of headers in the RFCs, because you can add an
X- header with any name you can think of.
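As a small illustration of why no fixed list can cover every header, the sketch below parses a sample message with Python's standard `email` module and drops only the headers named in an ignore set, loosely analogous to bayes_ignore_header. The sample message, the `IGNORED` set and the function name are assumptions for the example, not SpamAssassin internals.

```python
from email import message_from_string

RAW = """\
Received: from mail.example.com by mx.example.org; Tue, 17 Mar 2009 10:00:00 +0100
From: someone@example.com
To: someone.else@example.org
Subject: Hello
X-Spam-Flag: NO
X-Matts-funky-header: Hi!

Body text.
"""

# Hypothetical ignore list, compared case-insensitively.
IGNORED = {"x-spam-flag"}


def headers_for_bayes(raw, ignored):
    """Return (name, value) pairs for every header not in the ignore set."""
    msg = message_from_string(raw)
    return [(name, value) for name, value in msg.items()
            if name.lower() not in ignored]


kept = headers_for_bayes(RAW, IGNORED)
names = [name for name, _ in kept]
print(names)
```

The nonstandard X-Matts-funky-header is kept like any other header, while only the explicitly ignored X-Spam-Flag is dropped.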





Yes, I know. But there is a list of standardized and a li