Matt Kettler <mkettler...@verizon.net> wrote:
fl...@pbartels.info wrote:
Hello,
instead of disabling a lot of possibly set message headers using
"bayes_ignore_header" and ending up with strange configs like:
bayes_ignore_header Return-Path
bayes_ignore_header Received
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Resent-For
bayes_ignore_header X-Resent-By
bayes_ignore_header X-Resent-To
bayes_ignore_header Resent-To
bayes_ignore_header Sender
bayes_ignore_header Precedence
bayes_ignore_header X-Antispam
bayes_ignore_header X-Sieve
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity
bayes_ignore_header To
bayes_ignore_header X-Sieve
bayes_ignore_header X-WEBDE-FORWARD
bayes_ignore_header X-purgate
bayes_ignore_header X-purgate-ID
bayes_ignore_header X-purgate-Ad
bayes_ignore_header X-GMX-Antispam
bayes_ignore_header X-Antispam
bayes_ignore_header X-Spamcount
bayes_ignore_header X-Spamsensitivity
(found on the net)
Where?
Just search for bayes_ignore_header and you'll find a lot of results,
some of them with long lists of bayes_ignore_header settings like the
one above.
Because I found it so often, I'm wondering whether it's really useful
or not.
There is also an example in the default local.cf:
# Set headers which may provide inappropriate cues to the Bayesian
# classifier
#
# bayes_ignore_header X-Bogosity
# bayes_ignore_header X-Spam-Flag
# bayes_ignore_header X-Spam-Status
shouldn't SpamAssassin's Bayes mechanism just ignore the complete
message header and just look at the body?
This seems useful in my opinion.
It seems like a very misguided idea to me.
Is there any reason to think headers make bad tokens?
For example, "X-Spam-Flag: NO" can cause problems if you don't
remove it before parsing and don't set it yourself. (You'd never do
that, and I don't know how SA really handles it internally, but it's a
good example, because it's exactly a header that says the mail is ham.)
To me it seems Bayes would now think all messages with "X-Spam-Flag:
NO" are not spam. Sure, Bayes is not a binary system, but this
header field would push the mail a bit toward being treated as ham.
(Or if all spammers set this flag, no spam messages are pushed toward
being treated as spam.)
Problem:
There could be other fields that normally indicate the message is not
spam. If a spammer uses them and the Bayes system does not ignore
them, the message is treated more like ham.
Do you have any test data showing this improves your bayes accuracy?
I'd expect a significant reduction in accuracy from this, but if you've
got real data showing otherwise, I'd love to see it. My own informal
testing shows header tokens are *VERY* useful, particularly Received:
header tokens.
No, I'm just thinking about it.
SpamAssassin contains quite a bit of code to break the headers down and
tokenize them in a useful way. It doesn't just extract a bunch of words
from the headers and throw them in the database, it actually encodes
things like which header a word appears in as part of the token itself.
e.g.: "Drug" in the From: header is a different token than "Drug" in the
To: header, which is different from "Drug" in the body.
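
A rough sketch of that idea in Python, to make it concrete. This is not
SA's tokenizer: the "H:<header>:" prefix and the tokenize() helper are
made up for illustration, and SA's real token format and splitting
rules are different. It only shows how the same word can end up as
distinct tokens depending on where it appears, including in a
nonstandard header.

```python
from email import message_from_string


def tokenize(raw_message):
    """Return tokens; header words carry a (hypothetical) header-name prefix."""
    msg = message_from_string(raw_message)
    tokens = []
    # Every header is visited, standard or not.
    for name, value in msg.items():
        for word in str(value).split():
            tokens.append(f"H:{name}:{word}")
    # Body words are stored without a prefix (assumes a non-multipart message).
    for word in msg.get_payload().split():
        tokens.append(word)
    return tokens


raw = (
    "From: Drug Store <sales@example.com>\n"
    "To: Drug Fan <fan@example.org>\n"
    "X-Matts-funky-header: Hi!\n"
    "\n"
    "Drug prices slashed\n"
)
tokens = tokenize(raw)
for token in tokens:
    print(token)
```

Here "H:From:Drug", "H:To:Drug" and plain "Drug" are three separate
database entries, so each one can get its own spam probability.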
What do you mean?
(Are static tests not good enough for the message headers?)
No. Static rules are not any better for headers than they are for body
text. Bayes allows SA to adapt to rapid mutations in spam. These
mutations exist in both the headers, and the body.
It also seems more useful to me to enable just specific header
fields and ignore all others. I understand that, for example, From, To
or Subject may contain useful tokenizable information, but most
fields seem uninteresting, and it's hard to find them all or to be
sure you got them all.
Is there a config option to tell SpamAssassin's Bayes mechanism not to
look at the message header, or does SpamAssassin already not look at
the header by default?
No, the entire design of the SA bayes mechanism intentionally tries to
tokenize headers. A lot of work went into making it do this very well.
Why would you want to disable it?
See above.
If you don't like bayes, by all means disable it, but why cut off its
legs? If you're going to use the CPU and IO time to run bayes, let it
run well.
Perhaps there are regular expressions?
If it parses the message header, it seems you have to read the RFCs
and look at some tools to find out what kind of message headers are set.
SA extensively parses the headers. It parses *all* headers, even
nonstandard ones that I could randomly configure a server to add like
"X-Matts-funky-header: Hi!".
There is no complete list of headers in the RFCs, because you can add an
X- header with any name you can think of.
Yes, I know. But there is a list of standardized mail headers and a
list of often used ones. I also wrote that you have to look into some
tools to find out what headers they set. And as you write, you could
define your own headers, which will be used every time until I
deactivate them. If there is a problem with tokenizing headers, it
is made worse by the fact that you can't know all headers.
The question is now: how spam-like are these headers?
A good answer seems to be: let SA find out using Bayes ;)
Using SA's Bayes mechanism sounds like a nice solution for unknown
headers or headers you specifically want SA to use, but there is my
problem above, and because of it I'm unsure whether it's useful to
ignore some headers or not.
Actually, I think some wrongly identified tokens won't be a problem,
because there would be some (hopefully more) tokens identifying the
message as spam. And that's just the way Bayes works. So it seems you
don't have to deactivate headers yourself, but why are some people
deactivating so many headers?
(Search the web or look here in the list.)
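
To check the reasoning about wrongly identified tokens, here is a small
Python sketch. It uses the plain naive Bayes combination of per-token
probabilities, which is an assumption for illustration and not SA's
actual combining code. The probabilities are invented, including the
0.05 for a hypothetical "X-Spam-Flag: NO" token.

```python
import math


def combine(probabilities):
    """Naive Bayes combination of per-token spam probabilities (0 < p < 1)."""
    # Work in log space so long token lists do not underflow.
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Invented values: four spammy tokens and one misleading ham-like header token.
spammy_tokens = [0.9, 0.9, 0.9, 0.9]
misleading_header = 0.05  # hypothetical "X-Spam-Flag: NO" token

without_header = combine(spammy_tokens)
with_header = combine(spammy_tokens + [misleading_header])

print(f"without misleading header: {without_header:.4f}")
print(f"with misleading header:    {with_header:.4f}")
```

With these made-up numbers the misleading token lowers the combined
score only slightly and it stays well above 0.9, because the spammy
tokens outweigh it.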
Deactivating Received, Sender and To really makes no sense...
Thanks
Philipp