On Wed, 2009-03-18 at 08:00 +0100, fl...@pbartels.info wrote:
> Matt Kettler <mkettler...@verizon.net> wrote:
> > Is there any reason to think headers make bad tokens?
>
> For example the "X-Spam-Flag: NO" can cause problems if you don't
> remove it before parsing and don't set it yourself. (You'll never do
> that and I don't know how SA really handles it internally but it's a
> good example, because it's exactly a header that tells the mail is ham.)
>
> For me it seems bayes would think now all messages with "X-Spam-Flag:
> NO" are not spam. Sure bayes is not a binary thinking system but this
> header field would push the mail a bit to be treated as no spam. (Or
> if all spammers set this Flag, no spam messages are pushed to be
> treated as spam.)

Nah, you're reading too much into that header, from a human point of
view. Bayes does not understand the semantics of "Spam Flag == No" as
you do...

> Problem:
> Now there could exist other fields that normally indicate the message
> is no spam. If they are used by a spammer and it is not ignored by the
> bayes system the message is handled more like no spam.

You ignored Bayes in that example. :)  If spammers start injecting
previously innocent headers en masse, the Bayes spam probability for
that token will quickly become neutral or even spammy upon learning,
depending on the amount of ham and spam with that header.

> Using SA's Bayes mechanism sounds like a nice solution for unknown
> headers or headers you specially want to be used by SA but there is my
> problem above and because of it I'm feeling unsure if it's useful to
> ignore some headers or not.

It is useful to bayes_ignore_header custom headers you add *locally* (by
your MDA, MUA, maybe MTA), which do not provide useful information or
have been injected *after* scanning -- to prevent a subsequent manual
sa-learn run from picking up those useless headers.

By default, SA already ignores commonly used, useless headers. So this
really applies to your custom headers only.
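As a rough illustration of the point above that a header token drifts
toward neutral once spammers start injecting it, here is a minimal
sketch. It assumes a simplified relative-frequency estimate per token;
this is NOT SpamAssassin's actual formula, and the function name and all
counts are made up for the example.

```python
def token_spam_probability(spam_with, ham_with, total_spam, total_ham):
    """Simplified per-token estimate (an assumption, not SA's real math).

    Compares how often the token shows up in learned spam versus learned
    ham. 0.0 means "only ever seen in ham", 1.0 "only ever seen in spam",
    and 0.5 is neutral.
    """
    spam_freq = spam_with / total_spam if total_spam else 0.0
    ham_freq = ham_with / total_ham if total_ham else 0.0
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen in either corpus: treat as neutral
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical corpus of 1000 learned ham and 1000 learned spam messages.
# Before: the header appears in 900 ham messages and in no spam at all.
before = token_spam_probability(0, 900, 1000, 1000)

# After spammers inject it en masse: 900 spam messages carry it too.
after = token_spam_probability(900, 900, 1000, 1000)

print(before, after)
```

With these made-up counts the token starts out as a pure ham indicator
and ends up neutral once as much spam as ham carries the header. If it
later shows up in more spam than ham, the estimate moves above 0.5.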
> Actually I think some wrongly identified tokens won't be a problem
> because there would be some (hopefully more) tokens identifying the
> message as spam. And that's just the way bayes works. So it seems you
> don't have to deactivate headers yourself but why are some people
> deactivating so many headers?

You should ask those who do. We don't, and we don't advocate it.

-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}