On Fri, 2014-05-16 at 11:24 -0700, Ian Zimmerman wrote:
> In the last few (~10) days, I have seen a marked increase in FNs,
> usually with Bayes values in the 50s and 60s.

That's a neutral bayes classification. Other rules should still be able
to identify the spam.

> On close inspection, I see that the hash-busting garbage appended is
> (faux) technical computing talk instead of the usual cookbooks or
> classical literature :-p That is, scrambled Stack Overflow discussions
> and the like. And of course that is what most of my ham is about, so
> it makes very good sense that Bayes gets confused.

Tech talk might result in a lower bayes score for you and me, but it
sure doesn't for the average user. Classical English literature tokens,
on the other hand, have been learned as spammy here, due to past
efforts of copy-n-paste bayes "poison".

Same thing, different audience... These attempts usually end up merely
"confusing" bayes (as you aptly put it). They very rarely result in
significantly hammy scoring.

> [...] isn't this a situation where something like James' suggestion
> would help?

No. What makes you believe spammers will stick to dumping the fodder at
the end? There are multiple techniques to inject text and hide it, with
the visible payload at the end. (HTML tricks and clever MIME structure
come to mind immediately.)

Besides, arbitrarily limiting the number of tokens makes you vulnerable
to a whole new class of attacks. The actual spammy payload at the end
will not be seen by bayes at all, because the threshold was already
exceeded by all the random strings at the beginning.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}