Re: SpamAssassins bayes mechanism and message headers

floss Wed, 18 Mar 2009 00:00:02 -0700

Matt Kettler <mkettler...@verizon.net> wrote:

fl...@pbartels.info wrote:

Hello,


instead of disabling a lot of possibly set message headers using
"bayes_ignore_header" and ending up with strange configs like:

bayes_ignore_header Return-Path
bayes_ignore_header Received
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Resent-For
bayes_ignore_header X-Resent-By
bayes_ignore_header X-Resent-To
bayes_ignore_header Resent-To
bayes_ignore_header Sender
bayes_ignore_header Precedence
bayes_ignore_header X-Antispam
bayes_ignore_header X-Sieve
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity
bayes_ignore_header To
bayes_ignore_header X-Sieve
bayes_ignore_header X-WEBDE-FORWARD

bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Antispam
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity

(found on the net)

Where?

Just search for bayes_ignore_header and you'll find a lot of results, some of them with long lists of bayes_ignore_header settings like the one above.


Because I found it so often, I'm wondering whether it's really useful or not.

There is also an example in the default local.cf:
#   Set headers which may provide inappropriate cues to the Bayesian
#   classifier
#
# bayes_ignore_header X-Bogosity
# bayes_ignore_header X-Spam-Flag
# bayes_ignore_header X-Spam-Status


Shouldn't SpamAssassin's bayes mechanism just ignore the complete
message header and just look at the body?
This seems useful in my opinion.

It seems like a very misguided idea to me.

Is there any reason to think headers make bad tokens?

For example, the "X-Spam-Flag: NO" header can cause problems if you don't remove it before parsing and don't set it yourself. (You'll never do that, and I don't know how SA really handles it internally, but it's a good example, because it's exactly a header that says the mail is ham.)

To me it seems bayes would now think all messages with "X-Spam-Flag: NO" are not spam. Sure, bayes is not a binary thinking system, but this header field would push the mail a bit towards being treated as not spam. (Or, if all spammers set this flag, no spam messages are pushed towards being treated as spam.)


Problem:

Now there could exist other fields that normally indicate the message is not spam. If they are used by a spammer and are not ignored by the bayes system, the message is handled more like non-spam.

Do you have any test data showing this improves your bayes accuracy?

I'd expect a significant reduction in accuracy from this, but if you've
got real data showing otherwise, I'd love to see it.  My own informal
testing shows header tokens are *VERY* useful, particularly Received:
header tokens.

No, I'm just thinking about it.


SpamAssassin contains quite a bit of code to break the headers down and
tokenize them in a useful way. It doesn't just extract a bunch of words
from the headers and throw them in the database, it actually encodes
things like what header a word exists in as a part of the token itself.
For example, "Drug" in the From: header is a different token than "Drug" in the
To: header, which is different from "Drug" in the body.
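
The idea of encoding the header name into the token can be illustrated with a minimal Python sketch. This is not SpamAssassin's actual tokenizer: the "H:<header>:<word>" prefix scheme, the `tokenize` function and its `ignored_headers` parameter are hypothetical names chosen for illustration, with `ignored_headers` only loosely mimicking what bayes_ignore_header does.

```python
import re
from email import message_from_string

# Very rough notion of a "word"; real tokenizers are far more careful.
WORD = re.compile(r"[A-Za-z0-9$'-]+")

def tokenize(raw_message, ignored_headers=()):
    """Return a set of tokens; header words carry their header name as a prefix."""
    msg = message_from_string(raw_message)
    ignored = {name.lower() for name in ignored_headers}
    tokens = set()
    for name, value in msg.items():
        if name.lower() in ignored:
            continue  # skipped, similar in spirit to bayes_ignore_header
        for word in WORD.findall(value):
            # Hypothetical prefix scheme: the same word yields a different
            # token for every header it appears in.
            tokens.add(f"H:{name}:{word.lower()}")
    body = msg.get_payload()
    if isinstance(body, str):  # single-part messages only in this sketch
        for word in WORD.findall(body):
            tokens.add(word.lower())
    return tokens

SAMPLE = (
    "From: Drug Store <shop@example.com>\n"
    "To: Drug Buyer <me@example.org>\n"
    "X-Spam-Flag: NO\n"
    "\n"
    "Cheap drug offer\n"
)

if __name__ == "__main__":
    for token in sorted(tokenize(SAMPLE)):
        print(token)
```

With this scheme "drug" produces three distinct tokens for the sample message (one for From:, one for To:, one for the body), so each can be learned as spammy or hammy independently.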

What do you mean?
(Are static tests not good enough for the message headers?)

No.  Static rules are not any better for headers than they are for body
text. Bayes allows SA to adapt to rapid mutations in spam. These
mutations exist in both the headers, and the body.

It also seems more useful to me to activate just specific header
fields and ignore all others. I understand that, for example, From, To or the
Subject may contain useful tokenizable information, but most
fields seem uninteresting, and it's hard to find them all or to be sure you got
them all.

Is there a config option to tell SpamAssassin's bayes mechanism not to
look at the message header, or does SpamAssassin already not look at the
header by default?

No, the entire design of the SA bayes mechanism intentionally tries to
tokenize headers.  A lot of work went into making it do this very well.
Why would you want to disable it?


See above.

If you don't like bayes, by all means disable it, but why cut off its
legs? If you're going to use the CPU and IO time to run bayes, let it
run well.

Perhaps there are regular expressions ?

If it parses the message header, it seems you have to read the RFCs
and look at some tools to find out what kind of message headers are set.


SA extensively parses the headers. It parses *all* headers, even
nonstandard ones that I could randomly configure a server to add like
"X-Matts-funky-header: Hi!".

There is no complete list of headers in the RFCs, because you can add a
X- header with any name you can think of.
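
As a small illustration of why no fixed list of header names can be complete, the following Python sketch uses the standard library's email parser (not SpamAssassin's own parser) to read a message carrying an arbitrary nonstandard header. The sample message and the `header_names` helper are made up for this example.

```python
from email import message_from_string

RAW = (
    "Received: from mail.example.net by mx.example.org\n"
    "Subject: Hello\n"
    "X-Matts-funky-header: Hi!\n"
    "\n"
    "Body text\n"
)

def header_names(raw_message):
    """List every header name present, standard or not, in message order."""
    return list(message_from_string(raw_message).keys())

if __name__ == "__main__":
    print(header_names(RAW))
```

A generic parser returns the invented header alongside the standard ones, so anything that tokenizes headers generically will see it without being told about it in advance.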

Yes, I know. But there is a list of standardized mail headers and a list of commonly used ones. I also wrote that you have to look into some tools to find out which headers they set. And as you write, you could define your own headers, which will be used every time until I deactivate them. If there is a problem with tokenizing headers, it would be made worse by the fact that you can't know all headers.

The question is now: how spam-like are these headers?
A good answer seems to be: let SA find out using bayes ;)

Using SA's Bayes mechanism sounds like a nice solution for unknown headers or headers you specifically want to be used by SA, but there is my problem above, and because of it I'm unsure whether it's useful to ignore some headers or not.

Actually I think some wrongly identified tokens won't be a problem, because there would be some (hopefully more) tokens identifying the message as spam. And that's just the way bayes works. So it seems you don't have to deactivate headers yourself, but why are some people deactivating so many headers?
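
To make the "other tokens outvote one misleading token" argument concrete, here is a minimal Python sketch of the classic naive Bayes combining formula. It is a simplification for illustration only: SpamAssassin's own combining differs in detail, and the per-token probabilities below are invented, not measured.

```python
from math import prod

def combine(token_probs):
    """Combine per-token spam probabilities with the naive Bayes formula.

    Each probability is clamped away from 0 and 1 so that a single token
    can never decide the result alone or cause a division by zero.
    """
    clamped = [min(max(p, 0.01), 0.99) for p in token_probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)

if __name__ == "__main__":
    # Hypothetical values: one hammy header token (for example a forged
    # "X-Spam-Flag: NO") next to three spammy tokens from the same message.
    print(combine([0.1, 0.9, 0.9, 0.9]))
    # The same message without the misleading token.
    print(combine([0.9, 0.9, 0.9]))
```

With these made-up numbers the misleading token lowers the combined score only slightly (it stays at about 0.99), which matches the intuition that a few wrong tokens are outweighed as long as enough other tokens point the right way.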

(Search the web or look here in the list.)

Deactivating Received, Sender and To really makes no sense...

Thanks
Philipp
