On Thu, 2013-07-25 at 08:07 -0400, Ian Turner wrote:
> > See where I am heading? Any chance your Bayes DB is completely borked?
> >   sa-learn --dump magic
> 
> Not sure what to do with this, but here you go:

> 0.000          0      29074          0  non-token data: nspam
> 0.000          0      46274          0  non-token data: nham
> 0.000          0     158157          0  non-token data: ntokens

You have trained more ham than spam. That's not necessarily a problem,
and opinions differ greatly. But it might be an indication that your
Bayes DB is skewed.

> > And for further digging, which are the top hammy / spammy tokens? See
> > M::SA::Conf [1], section Template Tags.
> 
> They are in the pastes in the X-Spam-JPW-Report: header.

That's useful. All three samples are rather similar, including the
tokens. Some headers from the first one:

  Return-path: <bounce-248-77802767-someone=example....@clickettle.com>

  Hammy tokens:
   0.001-+--HX-Envelope-From:sk:someone.,
   0.005-+--HX-Envelope-From:sk:bounce-,
   0.006-+--HX-Spam-Relays-External:sk:someone.,
   0.006-+--H*RU:sk:someone.,
   0.016-1--loans;

  Spammy tokens:
   0.903-+--Fast,
   0.862-1--33179,
   0.847-1--Miami,
   0.847-1--miami

HAM:  The strong hammy tokens are almost exclusively from the headers
(*RU stands for Untrusted Relays). In particular the X-Envelope-From
tokens are suspicious, given the Envelope-From / Return-Path value.

Are you filtering mailing-list traffic through SA? Do you manually train
them as ham, or did that happen by auto-learning?

If it's not mailing-lists but e.g. newsletters and the like, it might
be worth using bayes_ignore_header on the Envelope-From header. The
strongest hammy tokens effectively boil down to "bounce detection" and
"has its own address in the Return-Path".
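In case it helps, a minimal local.cf snippet along those lines might
look like this (header names taken from the tokens above; adjust to
whatever your MTA actually adds):

```
# Hypothetical local.cf fragment -- keep Bayes from tokenizing the
# locally added envelope-sender header:
bayes_ignore_header X-Envelope-From
```

That only affects future learning, of course; tokens already in the
database stay there until they expire or get retrained.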

SPAM:  The spammy tokens are highly suspicious, too. As you confirmed,
you are manually training these as spam. And all three samples feature
an address in "Miami, FL 33179" at the bottom.

Yet the declassification distance for "33179", "Miami" and "miami" (the
lowercase version of the former, generated by SA's Bayes) is a mere 1.
That means learning the token as the opposite type just *once* makes it
lose its current classification.

Which seems rather unlikely, unless you frequently have such addresses
in ham, too. Nah, still unlikely.

Besides, once these are declassified, there would be only a single
spammy token left -- the header above shows there are only 4, and the
message simply contained no 5th spammy token that was also in the
database.
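To illustrate what's going on with that distance of 1 -- this is just a
rough sketch, *not* SA's actual Bayes internals (SA uses Robinson-style
probabilities with extra smoothing), but it shows how a token's
frequency-normalized spam probability depends on per-token hit counts
plus the global nspam/nham totals, and how few opposite-type learns it
can take to flip a rarely seen token:

```python
# Illustrative only: simplified token probability and
# "declassification distance" from spam/ham hit counts.

def token_prob(spam_hits, ham_hits, nspam, nham):
    """Frequency-normalized spam probability of a token."""
    s = spam_hits / nspam if nspam else 0.0
    h = ham_hits / nham if nham else 0.0
    return s / (s + h) if (s + h) else 0.5

def declass_distance(spam_hits, ham_hits, nspam, nham):
    """How many opposite-type learns flip the token across 0.5."""
    spammy = token_prob(spam_hits, ham_hits, nspam, nham) > 0.5
    n = 0
    while (token_prob(spam_hits, ham_hits, nspam, nham) > 0.5) == spammy:
        n += 1
        if spammy:           # learn as ham to declassify a spammy token
            ham_hits += 1
            nham += 1
        else:                # learn as spam to declassify a hammy token
            spam_hits += 1
            nspam += 1
    return n

# A token seen once in spam and once in ham, with the corpus sizes
# from the dump above (29074 spam / 46274 ham): one more ham learn
# already flips it.
print(declass_distance(1, 1, 29074, 46274))   # -> 1
```

The point being: with counts that low, a single mis-train is enough to
flip the token, which is why a declassification distance of 1 on your
top spammy tokens suggests they have barely been trained at all.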


Do you use site-wide or per-user Bayes? Do you (manually) train by the
same user SA runs as while filtering?

You also might need some serious spam training.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
