On Fri, 30 Nov 2018 15:49:31 -0700 Amir Caspi wrote:

> > It makes it harder to write rules detecting these tricks, but it
> > may happen eventually. As far as Bayes is concerned, it would be a
> > shame to lose the information.
>
> I'm not sure I see how Bayes can take decent advantage of these
> zero-width chars. If they are interspersed randomly within words,
> then Bayes has to tokenize each and every permutation (or, at least,
> very many permutations) of each word in order to be decently
> effective. But if the zero-width chars are stripped out, then Bayes
> only has to tokenize the regular, displayable word. Am I missing
> something?
Yes, you need something in between: a tokenization that avoids
learning the hundreds of obfuscation variants, but doesn't throw away
the fact that obfuscation occurred.

> But offering both converted and non-converted options is likely the
> best option, and then having Bayes work on the normalized version
> resolves the above.

Not simply on the normalized text; that way you lose information. In
the example I gave, the word:

  <inv>has

would get tokenized twice, once through the body and once through the
list of obfuscated words in the pseudo-header, producing the tokens:

  'has'
  'HX-Obfuscated-Norm:has'

The former token would likely be neutral and drop out, but the second
would probably only appear in spam.

The upshot of this is that invisible obfuscation:

- no longer breaks body rules
- is easier for Bayes to learn than non-obfuscated text
- can still be tested via X-Obfuscated-Orig without the complexity of
  rawbody
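
To make the scheme concrete, here is a minimal Python sketch of the
idea. This is not SpamAssassin's actual tokenizer; the character set
and the tokenize() helper are illustrative assumptions only:

import re

# Zero-width / invisible characters commonly used for obfuscation
# (illustrative set, not exhaustive).
INVISIBLE = re.compile(r'[\u200b\u200c\u200d\u2060\ufeff]')

def tokenize(word):
    """Emit Bayes tokens for one word: the normalized body token plus,
    if the word was obfuscated, a pseudo-header token recording that."""
    normalized = INVISIBLE.sub('', word)
    tokens = [normalized]                    # body token, e.g. 'has'
    if normalized != word:                   # invisible chars were present
        tokens.append('HX-Obfuscated-Norm:' + normalized)
    return tokens

# Every placement of the invisible char ('h<inv>as', 'ha<inv>s', ...)
# collapses to the same two tokens, so Bayes learns one strong token
# instead of hundreds of permutations.
print(tokenize('ha\u200bs'))   # ['has', 'HX-Obfuscated-Norm:has']
print(tokenize('has'))         # ['has']

The second call shows that unobfuscated text is untouched, which is
why the plain 'has' token stays neutral while the pseudo-header token
only ever shows up in spam.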