On Wed, 26 Jun 2019 14:07:00 -0600 Amir Caspi wrote:

> John et al,
>
> I recall from a prior thread last year that there were supposed to be
> some rules to check for zero-width joiner characters... but I'm
> seeing spams recently that have these, but don't hit any such rules.
>
> Here's one spample, where the ZWJ entity #x200B is being used to try
> to sidestep Bayes detection of highly spammy words.
> https://pastebin.com/kx0jVBtZ

It's actually a zero-width space (U+200B), not a zero-width joiner. I created a second version with the ZWSs globally stripped, ran both through Bayes, and diffed the tokens that contributed to each result. YMMV, but I found it made no difference at all. With the previous run of ZW[N]J spams I found that they actually helped Bayes by breaking up words into fragments that repeated in related spams.
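
For anyone who wants to repeat that kind of comparison, here is a rough Python sketch of the strip-and-diff step. It is not SpamAssassin's Bayes tokenizer: `naive_tokens` is a hypothetical stand-in that simply splits on non-word characters, and the function names are my own. With this toy tokenizer the zero-width characters act as separators, so it only illustrates how obfuscated words break into fragments; whether the real Bayes tokens differ depends on the actual tokenizer and message.

```python
import re

# U+200B zero-width space, U+200C zero-width non-joiner, U+200D zero-width joiner
ZERO_WIDTH_CHARS = "\u200b\u200c\u200d"
STRIP_TABLE = {ord(ch): None for ch in ZERO_WIDTH_CHARS}

# The same characters written as HTML numeric entities, e.g. "&#x200B;"
ZERO_WIDTH_ENTITY = re.compile(r"&#x200[bcd];", re.IGNORECASE)


def strip_zero_width(text):
    """Return text with zero-width characters and their entities removed."""
    text = ZERO_WIDTH_ENTITY.sub("", text)
    return text.translate(STRIP_TABLE)


def naive_tokens(text):
    """Hypothetical tokenizer: lowercase runs of word characters.

    Zero-width characters are not word characters, so they split words.
    """
    return re.findall(r"\w+", text.lower())


def token_diff(original):
    """Return (tokens only in original, tokens only in stripped version)."""
    before = set(naive_tokens(original))
    after = set(naive_tokens(strip_zero_width(original)))
    return before - after, after - before
```

Calling `token_diff` on a message body gives the fragments that exist only because of the zero-width characters, and the whole words that only appear once they are stripped. If the same fragments show up across related spams, that is the case where the obfuscation ends up helping Bayes.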