John et al,

I recall from a thread last year that there were supposed to be some rules 
to check for zero-width joiner characters... but I'm seeing recent spam that 
contains them without hitting any such rules.

Here's one spample, where the zero-width entity #x200B (strictly U+200B ZERO 
WIDTH SPACE rather than ZWJ proper, which is U+200D) is being used to try to 
sidestep Bayes detection of highly spammy words.
https://pastebin.com/kx0jVBtZ

I know there are legitimate uses for ZWJ chars in some scripts, so we can't use 
their mere existence as evidence of spam, but presumably those charsets would 
be denoted explicitly and so could be meta'd out... and, presumably, would also 
not contain the ZWJ chars in between obviously Latin letters.
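
To make that heuristic concrete, here is a rough Python sketch (not an actual 
SpamAssassin rule) that flags zero-width characters only when they sit between 
two ASCII letters, and shows what the text looks like once they are stripped. 
The names ZERO_WIDTH, count_zw_obfuscation and strip_zero_width are made up for 
illustration, and the set of code points is my own assumption, not an 
exhaustive list.

```python
import html
import re

# Zero-width code points often seen in obfuscation (illustrative assumption,
# not exhaustive): ZWSP, ZWNJ, ZWJ, WORD JOINER, and BOM / ZWNBSP.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"

# One or more zero-width characters sandwiched between two ASCII letters.
ZW_BETWEEN_LATIN = re.compile(
    r"(?<=[A-Za-z])[" + ZERO_WIDTH + r"]+(?=[A-Za-z])"
)


def count_zw_obfuscation(body: str) -> int:
    """Count zero-width runs that sit between two ASCII letters."""
    # html.unescape turns entities such as &#x200B; into real characters.
    text = html.unescape(body)
    return len(ZW_BETWEEN_LATIN.findall(text))


def strip_zero_width(body: str) -> str:
    """Return the entity-decoded text with zero-width characters removed."""
    text = html.unescape(body)
    return text.translate({ord(ch): None for ch in ZERO_WIDTH})


if __name__ == "__main__":
    sample = "fr&#x200B;ee off&#x200B;er"
    print(count_zw_obfuscation(sample))
    print(strip_zero_width(sample))
```

Text in scripts that legitimately use ZWJ (between Devanagari letters, say) 
would not match, since the lookarounds only accept ASCII letters on both sides.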

Any idea why this spample didn't hit the ZWJ obfuscation rules?

I'm getting quite a lot of zero-hour snowshoe spam lately and it's not even 
hitting BOGUS_MIME_VERSION any more... almost all of it is BAYES_50.  Some of 
that is because of these ZWJ tricks, though the rest, I dunno -- I'm still not 
sure if my DB is misbehaving or if these spams are just carefully crafted 
enough to avoid highly-spammy words.  (Training and rescanning gets me BAYES_99 
on those same spams so the DB is definitely training...)

Thoughts on nuking these Bayes-evading stealthy spams?

Thanks!

--- Amir
