From: Axb <axb.li...@gmail.com>
Date: Tue, 14 Oct 2014 23:37:36 +0200

On 10/14/2014 11:08 PM, Adam Katz wrote:
>> On Tue, 14 Oct 2014 16:10:52 +0200 Axb <axb.li...@gmail.com> wrote:
>>> and to avoid further discussions of what header may pollute bayes or
>>> not, I've removed all header entries which are not directly related
>>> to AV/filter products.
>
> On 10/14/2014 07:17 AM, David F. Skoll wrote:
>> I'm not sure I agree with being too clever about Bayes. Surely by its
>> very nature, the Bayes algorithm will itself indicate which tokens
>> are relevant and which are not? Isn't that the whole point of Bayes?
>>
>> I think being too clever about massaging the data that gets fed to
>> Bayes may be counter-productive. For sure, *some* massaging is in order;
>> a token should be a semantic unit, so something like "www.example.com"
>> should probably be one token rather than three, but beyond that I wonder
>> if it's good or not to massage the data?
>
> The purpose of bayes_ignore_header is twofold:
>
> 1. Prevent inheriting other systems' false positives (ensure better
>    independence)
> 2. Prevent relying upon headers that won't exist at delivery time (e.g.
>    added by the mailbox server)
>
> This is why it's so important to ignore other spam engines, which
> basically fit into both of those categories.

I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)

Wouldn't that be fairly easy to implement by intercepting the call to
_tokenize_headers in Plugin/Bayes.pm?

    # Tokenize the headers
    my %hdrs = $self->_tokenize_headers ($msg);
    while( my($prefix, $value) = each %hdrs ) {
      push(@tokens, $self->_tokenize_line ($value, "H$prefix:", 0));
    }

-jeff
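
To make the idea concrete, here is a rough standalone sketch of what such a switch might look like. It does not touch SpamAssassin itself: toy_tokenize_headers and toy_tokenize_line are simplified stand-ins for _tokenize_headers and _tokenize_line, and the body_only flag is a hypothetical name, not an existing SA option. The only point is that skipping the header loop leaves the body tokens untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the header tokenizer: returns (header name => value).
sub toy_tokenize_headers {
    my ($msg) = @_;
    my %hdrs;
    for my $line (split /\n/, $msg->{headers}) {
        # Skip lines that do not look like "Name: value".
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $hdrs{$name} = $value;
    }
    return %hdrs;
}

# Toy stand-in for the line tokenizer: whitespace split plus a prefix.
sub toy_tokenize_line {
    my ($text, $prefix) = @_;
    return map { "$prefix$_" } grep { length } split /\s+/, $text;
}

# Collect tokens for a message. Header tokens are skipped entirely
# when the (hypothetical) body_only flag is set.
sub tokenize {
    my ($msg, %opt) = @_;
    my @tokens = toy_tokenize_line($msg->{body}, '');
    unless ($opt{body_only}) {
        my %hdrs = toy_tokenize_headers($msg);
        for my $prefix (sort keys %hdrs) {
            push @tokens, toy_tokenize_line($hdrs{$prefix}, "H$prefix:");
        }
    }
    return @tokens;
}

my $msg = {
    headers => "Subject: cheap pills\nX-Spam-Flag: YES",
    body    => "buy cheap pills now",
};

my @all  = tokenize($msg);                  # body + header tokens
my @body = tokenize($msg, body_only => 1);  # body tokens only
print scalar(@all), " tokens with headers, ", scalar(@body), " body only\n";
```

In the real plugin the flag would presumably have to come from a config setting, which is the part that actually needs designing; the guard around the loop is the easy bit.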