On Fri, 2014-05-16 at 11:24 -0700, Ian Zimmerman wrote:
> In the last few (~10) days, I have seen a marked increase in FNs,
> usually with Bayes values in the 50s and 60s.

That's a neutral bayes classification. Other rules should still be able
to identify the spam.

> On close inspection, I see that the hash-busting garbage appended is
> (faux) technical computing talk instead of the usual cookbooks or
> classical literature :-p  That is, scrambled Stack Overflow discussions
> and the like.  And of course that is what most of my ham is about, so
> it makes very good sense that Bayes gets confused.

Tech talk might result in a lower bayes score for you and me, but it
sure doesn't for the average user. Classical English literature tokens
on the other hand have been learned as spammy here, due to past efforts
of copy-n-paste bayes "poison". Same thing, different audience...

These attempts usually end up merely "confusing" bayes (as you aptly put
it). They very rarely result in significantly hammy scoring.


> [...] isn't this a situation where something like James' suggestion
>  would help?

No. What makes you believe spammers will stick to dumping the fodder at
the end? There are multiple techniques to inject text and hide it, with
the visible payload at the end. (HTML tricks and clever MIME structure
come to mind immediately.)

Besides, arbitrarily limiting the number of tokens makes you vulnerable
to a whole new class of attacks. The actual spammy payload at the end
will never be seen by bayes, because the limit was already used up by
all the random strings at the beginning.
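
To illustrate the token limit problem, here is a minimal Python sketch.
It is not how SpamAssassin tokenizes or scores mail. The cap (LIMIT),
the helper tokens_seen() and the sample tokens are all made up for the
example.

```python
def tokens_seen(message_tokens, limit=None):
    """Return the tokens a classifier would look at.

    `limit` is a hypothetical cap on the number of tokens; None means
    no cap at all.
    """
    if limit is None:
        return list(message_tokens)
    return list(message_tokens[:limit])


# Assumed sample data: tech-talk filler up front, real payload at the end.
filler = ["malloc", "pointer", "segfault", "compiler"] * 50  # 200 tokens
payload = ["cheap", "pills", "click", "here"]
message = filler + payload

LIMIT = 150  # hypothetical cap, smaller than the amount of filler

capped = tokens_seen(message, LIMIT)
full = tokens_seen(message)

# Which payload tokens does the classifier get to see in each case?
payload_seen_capped = [t for t in payload if t in capped]
payload_seen_full = [t for t in payload if t in full]

print("payload tokens seen with cap:   ", payload_seen_capped)
print("payload tokens seen without cap:", payload_seen_full)
```

With the cap in place, the filler alone uses up the budget and none of
the payload tokens are ever looked at. Without it, all of them are.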


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
