From: Matt Kettler <mkettler...@verizon.net>
   Date: Wed, 18 Mar 2009 19:49:53 -0400
   
   Jeff Mincy wrote:
   >    From: Matt Kettler <mkettler...@verizon.net>
   >    Date: Tue, 17 Mar 2009 21:30:02 -0400
   >    
   >    fl...@pbartels.info wrote:
   >    > Hello,
   >    >
   >    > instead of disabling a lot of possibly set message headers using
   >    > "bayes_ignore_header" and ending up in strange configs like:
   >    >
   >    > bayes_ignore_header Return-Path
   >    ...
   >    > (found on the net)
   >    Where?
   >    >
   >    > shouldn't SpamAssassin's bayes mechanism just ignore the complete
   >    > message header and just look at the body?
   >    > This seems useful in my opinion.
   >    It seems like a very misguided idea to me.
   >    
   >    Is there any reason to think headers make bad tokens?
   >    Do you have any test data showing this improves your bayes accuracy?
   >
   > Yes - I think some headers make extremely bad tokens for bayes, for
   > example the X-Mailer/User-Agent headers.  40% of the spam I get
   > claims to have Microsoft Outlook as an X-Mailer.  So bayes rapidly
   > determines that *UAMicrosoft (etc) is an extremely strong token.
   > These *UA tokens were enough to push a short ham message to BAYES_99.
   > When I added a bayes_ignore_header the score dropped to ~BAYES_40.
   >   
   That seems rather extraordinarily strange. Did the messages match no
   other tokens at all?  (ie: did you run it through spamassassin -D bayes
   before and after?)
   
This was the X-Spam-Bayes header that was added at the time:
   X-Spam-Bayes: bayes=1.0000, N=27(19-0+13), ham=(), spam=(HTo:U*mincy,
      HTo:D*com, HTo:D*rcn.com, H*F:D*net, H*UA:Build)

This header was added using:
   add_header all Bayes bayes=_BAYES_, N=_BAYESTC_(_BAYESTCLEARNED_-_BAYESTCHAMMY_+_BAYESTCSPAMMY_), ham=(_HAMMYTOKENS(5,short)_), spam=(_SPAMMYTOKENS(5,short)_)


So, there are 27 tokens: 19 learned, 0 hammy, 13 spammy.
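
As a rough illustration of how those counts read off the header, here
is a minimal Python sketch that pulls the N= field apart.  The function
name parse_bayes_counts and the regular expression are assumptions made
up for this example; they match only the add_header template shown
above and are not part of any SpamAssassin API.

```python
import re

# Matches the "N=tokens(learned-hammy+spammy)" field produced by the
# add_header template above. The pattern is an assumption tied to that
# template, not a format defined by SpamAssassin itself.
_N_FIELD = re.compile(r"N=(\d+)\((\d+)-(\d+)\+(\d+)\)")


def parse_bayes_counts(header_value):
    """Return the token counts from an X-Spam-Bayes header value, or None."""
    match = _N_FIELD.search(header_value)
    if match is None:
        return None
    tokens, learned, hammy, spammy = (int(g) for g in match.groups())
    return {"tokens": tokens, "learned": learned, "hammy": hammy, "spammy": spammy}


# Header value copied (shortened) from the message above.
header = "bayes=1.0000, N=27(19-0+13), ham=(), spam=(HTo:U*mincy, HTo:D*com)"
print(parse_bayes_counts(header))
```

For the header above this gives tokens=27, learned=19, hammy=0 and
spammy=13.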

   I'd be very interested in what's going on there, because it makes very
   little sense unless the message really matched very, very little other
   existing training.
   
3 of the top 5 spammy tokens (HTo:U*mincy, HTo:D*com, HTo:D*rcn.com)
come from the 'To: mi...@rcn.com' header.  The H*UA:Build token came
from an
  'X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)'
header.  As I recall, there were various other H*UA:Outlook etc. tokens.
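
To make that breakdown easier to see, here is a small Python sketch
that groups the five reported tokens by the header they came from.  The
prefix table is an assumption built only from what is said in this
thread (HTo: from To, H*UA: from X-Mailer/User-Agent, and H*F: inferred
to be From); it is not taken from the SpamAssassin source, and the
names PREFIX_TO_HEADER and header_of are made up for the example.

```python
from collections import Counter

# Prefix-to-header mapping, assumed from this thread only:
# HTo: tokens come from To, H*UA: from X-Mailer/User-Agent,
# and H*F: is inferred to come from From.
PREFIX_TO_HEADER = {
    "HTo:": "To",
    "H*F:": "From",
    "H*UA:": "X-Mailer/User-Agent",
}


def header_of(token):
    """Return the header a short-form token came from, or 'other'."""
    for prefix, header in PREFIX_TO_HEADER.items():
        if token.startswith(prefix):
            return header
    return "other"


# The five spammy tokens listed in the X-Spam-Bayes header.
top_spammy = ["HTo:U*mincy", "HTo:D*com", "HTo:D*rcn.com", "H*F:D*net", "H*UA:Build"]
counts = Counter(header_of(token) for token in top_spammy)

for header, n in counts.most_common():
    print(f"{header}: {n}")
```

All five of the strongest spam tokens are header tokens, and three of
them come from the To header alone.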

Bayes was 100.000% sure that this message was spam based on the To,
X-Mailer, and From headers.  The envelopes of all email messages that
I read at home are addressed to mi...@rcn.com (ignoring for the moment
that mi...@starpower.net also happens to get to me).  The 'To:' header
is going to be either my own mi...@rcn.com address or some made-up
email address that will never be repeated.  So Bayes will see my
email address in both spam and ham.  At the time, more than 80% of the
email I was getting at rcn.com was spam, so 'To: mi...@rcn.com' was
turned into three strong spam tokens.  My real mi...@rcn.com email
address in the To header says nothing about the spamminess of the
message.  (This is in contrast to the mi...@starpower.net email
address, which almost certainly indicates spam and has been added to
blacklist_to.)  So my solution was to add 'bayes_ignore_header To
From' and use blacklist_to/blacklist_from for the suspect email
addresses.  I came up with a similar justification for adding
'bayes_ignore_header X-Mailer'.

The body of the message was a single sentence asking me about my
primary music software.

If you want to see more detail, let's take it off the public mailing
list.

-jeff
