From: Axb <axb.li...@gmail.com>
   Date: Tue, 14 Oct 2014 23:37:36 +0200
   
   On 10/14/2014 11:08 PM, Adam Katz wrote:
   >> On Tue, 14 Oct 2014 16:10:52 +0200 Axb <axb.li...@gmail.com> wrote:
   >>> and to avoid further discussions of what header may pollute bayes or
   >>> not, I've removed all header entries which are not directly related
   >>> to AV/filter products.
   >
   > On 10/14/2014 07:17 AM, David F. Skoll wrote:
   >> I'm not sure I agree with being too clever about Bayes.  Surely by its
   >> very nature, the Bayes algorithm will itself indicate which tokens
   >> are relevant and which are not?  Isn't that the whole point of Bayes?
   >>
   >> I think being too clever about massaging the data that gets fed to
   >> Bayes may be counter-productive.  For sure, *some* massaging is in order;
   >> a token should be a semantic unit, so something like "www.example.com"
   >> should probably be one token rather than three, but beyond that I wonder
   >> if it's good or not to massage the data?
   >
   > The purpose of bayes_ignore_header is twofold:
   >
   >   1. Prevent inheriting other systems' false positives (ensure better
   >      independence)
   >   2. Prevent relying upon headers that won't exist at delivery time (e.g.
   >      added by the mailbox server)
   >
   > This is why it's so important to ignore other spam engines, which
   > basically fit into both of those categories.
   
   I'd love to have the option (switch) to use Bayes on msg bodies ONLY, 
   though I doubt anybody would be a taker for such a project.
   (I'd even be willing to "$pon$or" such an addition to SA)
   
Wouldn't that be fairly easy to implement by intercepting the call to
_tokenize_headers in Plugin/Bayes.pm?

  # Tokenize the headers
  my %hdrs = $self->_tokenize_headers ($msg);
  while( my($prefix, $value) = each %hdrs ) {
    push(@tokens, $self->_tokenize_line ($value, "H$prefix:", 0));
  }
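
A body-only switch could then be a guard around that block. This is an
untested sketch: "bayes_body_only" is a made-up option name (not an
existing SA setting), and I'm assuming the plugin can read its config
through $self->{conf}. The option would also have to be registered with
the config parser before it could be set in local.cf.

  # Tokenize the headers, unless the hypothetical body-only switch is on.
  # NOTE: bayes_body_only is an invented name, used for illustration only.
  unless ($self->{conf}->{bayes_body_only}) {
    my %hdrs = $self->_tokenize_headers ($msg);
    while( my($prefix, $value) = each %hdrs ) {
      push(@tokens, $self->_tokenize_line ($value, "H$prefix:", 0));
    }
  }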

-jeff
