Bob Proulx wrote:
>
>
> I am guessing that H*c is a header and some specific token.
> Is there a key somewhere that will help decode these?
>   
From Bayes.pm:

%HEADER_NAME_COMPRESSION = (
  'Message-Id'          => '*m',
  'Message-ID'          => '*M',
  'Received'            => '*r',
  'User-Agent'          => '*u',
  'References'          => '*f',
  'In-Reply-To'         => '*i',
  'From'                => '*F',
  'Reply-To'            => '*R',
  'Return-Path'         => '*p',
  'Return-path'         => '*rp',
  'X-Mailer'            => '*x',
  'X-Authentication-Warning' => '*a',
  'Organization'        => '*o',
  'Organisation'        => '*o',
  'Content-Type'        => '*c',
  'X-Spam-Relays-Trusted' => '*RT',
  'X-Spam-Relays-Untrusted' => '*RU',

);


So H*r = Received: header, etc.
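
For what it's worth, the token names are just "H", then the compressed
header name (or the full name when there's no entry in that map), a colon,
and a word pulled from that header.  Something along these lines -- a
simplified sketch, not the actual Bayes.pm code:

  # Simplified sketch of how the prefixes line up with the debug output;
  # the real tokenizer does more work (word splitting, length limits,
  # special handling for some headers).
  sub header_token_prefix {
      my ($hdr) = @_;
      my $short = $HEADER_NAME_COMPRESSION{$hdr} || $hdr;
      return 'H' . $short;
  }

  # header_token_prefix('Received')  eq 'H*r'
  # header_token_prefix('X-MimeOLE') eq 'HX-MimeOLE'   (no compression entry)
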
>   [15528] dbg: bayes: token 'H*MI:OEA0023' => 0.985096774193548
>   [15528] dbg: bayes: token 'H*M:OEA0023' => 0.985096774193548
>   [15528] dbg: bayes: token 'H*UA:Express' => 0.985060557114832
>   [15528] dbg: bayes: token 'H*x:Express' => 0.985059973253254
>   [15528] dbg: bayes: token 'HX-MimeOLE:V6.00.2900.2962' => 0.976898908840907
>   [15528] dbg: bayes: token 'HX-MimeOLE:MimeOLE' => 0.976313886128059
>   [15528] dbg: bayes: token 'HX-MSMail-Priority:Normal' => 0.974305670960733
>   [15528] dbg: bayes: token 'HX-MimeOLE:Microsoft' => 0.959224439139177
>   [15528] dbg: bayes: token 'HX-MimeOLE:Produced' => 0.959178732453666
>
> It has really learned Outlook as a spam source.  But there should be
> plenty of valid messages to have offset these.  I keep running
> sa-learn --ham on all valid messages hoping that it would offset the
> spam ones.  As you can see from the numbers, there are 150,000 messages
> and apparently all in the last 2.34 days too.  (But that does not
> quite make sense to me either.)
>
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:lists.example.com' => 0.950917490471412
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:sk:monty-p' => 0.95091594711816
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:199.232.76.173' => 0.95091594711816
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:envfrom' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:auth' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:helo' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:intl' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:ident' => 0.950880625609595
>   [15528] dbg: bayes: token 'HX-Spam-Relays-Internal:rdns' => 0.950880625609595
>
> It seems to have learned one of the trusted_network machines as a spam
> relay.  Hmm...  That seems like a bug.
>   
Perhaps... either that, or you're doing your spam learning after this
machine has added its headers, but very little of your ham learning has them.

>   [8683] dbg: received-header: relay 199.232.76.173 trusted? yes internal? yes
>
>   
>> That should at least let you know what it is your bayes DB has learned
>> that's bad.
>>
>> If it's not too horrible, you might be able to use sa-learn --backup to
>> dump the DB, edit it by hand, and sa-learn --restore it.
>>     
>
> Hmm...  That is an idea.  A good suggestion.  Of course everything has
> been hashed, so I would need to reverse-engineer them back to something
> meaningful, but that should be possible with a message to test against.
>
> I think the Bayes is learning things from the MIME structure that it
> should not be learning, such as multipart/alternative.  Is there a way
> to whitelist tokens so that they do not show up in the Bayes DB at all?
>
>   
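
As far as I know there isn't a per-token whitelist, but there is the
bayes_ignore_header option in local.cf, which keeps whole headers out of
the tokenizer.  It won't touch body/MIME-structure tokens, and I'm not sure
it catches the synthetic X-Spam-Relays-* pseudo-headers, but for ordinary
headers like the ones above it's worth a try, e.g.:

  bayes_ignore_header X-MimeOLE
  bayes_ignore_header X-MSMail-Priority
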
>> However, you'd need to find the correct SHA1 of the offending tokens...
>> not sure if that will be in the debug output.
>>     
>
> Yes.  Correlating one to the other is going to be a pain.
>
> Thanks for the suggestions.
>
> Bob
>
>   
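
For the record, correlating the hashes back to plain tokens might not be
too painful with a small script.  If I remember right, the 3.x Bayes DB
keys are truncated SHA1 hashes of the token text, so you can hash candidate
tokens yourself and grep for them in the sa-learn --backup dump.  A rough,
untested sketch (the exact truncation may vary by version):

  # Rough sketch only -- assumes the DB key is a slice of the SHA1 of the
  # token text, which is my recollection of how 3.x stores tokens.
  #   dump:    sa-learn --backup > bayes.dump
  #   restore: sa-learn --restore bayes.dump
  use Digest::SHA1 qw(sha1_hex);   # Digest::SHA works too

  my @candidates = (
      'HX-MimeOLE:Produced',              # taken from the debug output above
      'HX-Spam-Relays-Internal:helo',
  );

  for my $tok (@candidates) {
      my $hex = sha1_hex($tok);
      # The dump stores a truncated form (I believe 5 bytes / 10 hex chars),
      # so grep bayes.dump for a leading or trailing slice of this.
      printf "%-40s sha1: %s\n", $tok, $hex;
  }
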
