On Sat, 10 Oct 2015 10:56:14 +0200 Mark Martinec wrote:
> > BTW with normalize_charset 0 it looks like a spammer can effectively
> > turn off body tokenization by using UTF-16 (with correct
> > endianness).
>
> Yes. There are also other tricks that a spammer can play.
> It's not possible to emulate all the different behaviours of
> various mail reading programs. Still, in the case we have
> it would make sense to also try utf-16le, since this is
> the default endianness on Windows.

It might be sensible to strip nulls. That way, if text contains
unconverted UTF-16 (either because conversion failed or normalization
is off), encoded ASCII characters get converted correctly into single
bytes. Most body rules will then work, and Bayes can tokenize the text.
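
To illustrate the null-stripping idea, here is a rough sketch. It is in Python rather than SpamAssassin's Perl, and strip_nuls is a made-up name for illustration, not an existing function in SpamAssassin. It only shows what happens to the bytes, under the assumption that the body is ASCII text encoded as UTF-16 in either byte order:

```python
def strip_nuls(raw: bytes) -> bytes:
    """Drop every NUL byte from an undecoded body.

    For ASCII characters encoded as UTF-16 (LE or BE), the high byte is
    0x00, so removing NULs leaves the plain single-byte ASCII text.
    """
    return raw.replace(b"\x00", b"")


# ASCII text as a Windows client would typically send it (no BOM here).
sample = "Cheap pills".encode("utf-16-le")
print(strip_nuls(sample))  # b'Cheap pills'
```

This is only a fallback, not a conversion: characters above U+00FF still come out as two unrelated bytes, and a leading byte-order mark, if present, is left in place.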