On Fri, 30 Nov 2018 15:49:31 -0700 Amir Caspi wrote:

> > It makes it harder to write rules detecting these tricks, but it
> > may happen eventually. As far as Bayes is concerned, it would be a
> > shame to lose the information.
>
> I'm not sure I see how Bayes can take decent advantage of these
> zero-width chars. If they are interspersed randomly within words,
> then Bayes has to tokenize each and every permutation (or, at least,
> very many permutations) of each word in order to be decently
> effective. But if the zero-width chars are stripped out, then Bayes
> only has to tokenize the regular, displayable word. Am I missing
> something?
Yes, you need something in between: a tokenization that avoids
learning the hundreds of obfuscation variants, but doesn't throw away
the fact that obfuscation occurred.

> But offering both converted and non-converted options is likely the
> best option, and then having Bayes work on the normalized version
> resolves the above.

Not simply on the normalized text; that way you lose information. In
the example I gave, the word:

  <inv>has

would get tokenized twice, once through the body and once through the
list of obfuscated words in the pseudo-header, producing the tokens:

  'has'
  'HX-Obfuscated-Norm:has'

The former token would likely be neutral and drop out, but the second
would probably only appear in spam.

The upshot of this is that invisible obfuscation:

- no longer breaks body rules
- is easier for Bayes to learn than non-obfuscated text
- can still be tested via X-Obfuscated-Orig without the complexity of
  rawbody
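
To make the scheme concrete, here is a minimal Python sketch of the
idea. This is not SpamAssassin's actual tokenizer; the character set
and the tokenize() helper are illustrative assumptions only:

import re

# Zero-width / invisible characters commonly used for obfuscation
# (illustrative set, not exhaustive).
INVISIBLE = re.compile(r'[\u200b\u200c\u200d\u2060\ufeff]')

def tokenize(word):
    """Emit Bayes tokens for one word: the normalized body token plus,
    if the word was obfuscated, a pseudo-header token recording that."""
    normalized = INVISIBLE.sub('', word)
    tokens = [normalized]                    # body token, e.g. 'has'
    if normalized != word:                   # invisible chars were present
        tokens.append('HX-Obfuscated-Norm:' + normalized)
    return tokens

# Every placement of the invisible char ('h<inv>as', 'ha<inv>s', ...)
# collapses to the same two tokens, so Bayes learns one strong token
# instead of hundreds of permutations.
print(tokenize('ha\u200bs'))   # ['has', 'HX-Obfuscated-Norm:has']
print(tokenize('has'))         # ['has']

The second call shows that unobfuscated text is untouched, which is
why the plain 'has' token stays neutral while the pseudo-header token
only ever shows up in spam.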