On Thu, 2013-07-25 at 08:07 -0400, Ian Turner wrote:
> > See where I am heading? Any chance your Bayes DB is completely
> > borked?
> >
> >   sa-learn --dump magic
>
> Not sure what to do with this, but here you go:
> 0.000          0      29074          0  non-token data: nspam
> 0.000          0      46274          0  non-token data: nham
> 0.000          0     158157          0  non-token data: ntokens

You have trained more ham than spam. That's not necessarily a problem,
and opinions differ greatly, but it might be an indication that your
Bayes DB is skewed.

> > And for further digging, which are the top hammy / spammy tokens?
> > See M::SA::Conf [1], section Template Tags.
>
> They are in the pastes in the X-Spam-JPW-Report: header.

That's useful. All three samples are rather similar, including the
tokens. Some headers from the first one:

  Return-path: <bounce-248-77802767-someone=example....@clickettle.com>

  Hammy tokens:  0.001-+--HX-Envelope-From:sk:someone.,
                 0.005-+--HX-Envelope-From:sk:bounce-,
                 0.006-+--HX-Spam-Relays-External:sk:someone.,
                 0.006-+--H*RU:sk:someone., 0.016-1--loans

  Spammy tokens: 0.903-+--Fast, 0.862-1--33179, 0.847-1--Miami,
                 0.847-1--miami

HAM: The strong hammy tokens are almost exclusively from the headers
(*RU stands for Untrusted Relays). In particular, the X-Envelope-From
tokens are suspicious, given the Envelope-From / Return-Path value. Are
you filtering mailing-list traffic through SA? Do you manually train
those messages as ham, or did that happen by auto-learning? If it's not
mailing lists but e.g. newsletters and the like, it might be worth
applying bayes_ignore_header to the Envelope-From header. The most
hammy tokens amount to "looks like bounce handling" and "has own
address in the Return-Path".

SPAM: The spammy tokens are highly suspicious, too. As you confirmed,
you are manually training these as spam, and all three samples feature
an address in "Miami, FL 33179" at the bottom. Yet the declassification
distance for "33179", "Miami" and "miami" (the lowercase version of the
former, generated by SA Bayes) is a mere 1. That means learning the
token as the opposite class just *once* makes it lose its current
classification -- which seems rather unlikely, unless you frequently
have such addresses in ham, too. Nah, still unlikely.
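To illustrate what that declassification distance means, here is a
rough Python sketch of a Robinson-style token probability, similar in
spirit to what SA's Bayes computes per token. The formula and the
constants are illustrative approximations, not SA's exact
implementation:

```python
def token_prob(spam_hits, ham_hits, nspam, nham, s=0.16, x=0.5):
    """f(w) = (s*x + n*p) / (s + n), p being the corpus-size-normalized
    spam ratio of the token. Constants s and x are illustrative."""
    ps = spam_hits / nspam if nspam else 0.0
    ph = ham_hits / nham if nham else 0.0
    if ps + ph == 0.0:
        return x                      # unseen token: neutral
    p = ps / (ps + ph)
    n = spam_hits + ham_hits
    return (s * x + n * p) / (s + n)

# Counts from the dump above: nspam=29074, nham=46274.
# A token seen once, in spam only, is strongly spammy:
print(token_prob(1, 0, 29074, 46274))   # ~0.93, spammy
# In this simplified model one ham training doesn't flip it yet...
print(token_prob(1, 1, 29074, 46274))   # ~0.61, still spammy
# ...but a second one does. That count-dependent flip point is what
# the declassification distance in the token report measures.
print(token_prob(1, 2, 29074, 46274))   # ~0.45, now hammy
```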
Besides, once those are declassified, there would be only a single
spammy token left -- the header above shows there are only 4. There
simply was no 5th spammy token in that message that is also in the
database.

Do you use site-wide or per-user Bayes? Do you (manually) train as the
same user SA runs as while filtering?

You also might need some serious spam training.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
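P.S. If the hammy Envelope-From tokens do turn out to come from
legitimate newsletter traffic, something along these lines in local.cf
would keep that header out of the Bayes tokenizer -- the header name is
a guess based on your token dump, adjust it to whatever your gateway
actually adds:

```
# local.cf -- keep the gateway-added envelope header out of Bayes
bayes_ignore_header X-Envelope-From
```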