On Sat, 10 Oct 2015 10:56:14 +0200
Mark Martinec wrote:

> > BTW with normalize_charset 0 it looks like a spammer can effectively
> > turn-off body tokenization by using UTF-16 (with correct
> > endianness).
> 
> Yes. There are also other tricks that a spammer can play.
> It's not possible to emulate all different behaviours of
> various mail reading programs. Still, in the case we have
> it would make sense to also try UTF-16LE, since this is
> the default endianness on Windows.

It might be sensible to strip NUL bytes. That way, if the text
contains unconverted UTF-16 (either because conversion failed or
normalization is off), encoded ASCII characters collapse correctly
into single bytes. Most body rules will then work, and Bayes can
tokenize the text.
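
To illustrate the idea, here is a minimal sketch in Python. It is not
SpamAssassin code; the function name strip_nulls is made up for this
example. It only shows that dropping NUL bytes gives the same ASCII
text for either endianness.

```python
def strip_nulls(body: bytes) -> bytes:
    """Drop every NUL byte from the raw body (hypothetical helper).

    An ASCII character encoded as UTF-16 is one ASCII byte plus one
    NUL byte, in either order, so removing the NULs leaves plain ASCII.
    """
    return body.replace(b"\x00", b"")


text = "Buy cheap pills"
raw_le = text.encode("utf-16-le")  # little-endian, no BOM
raw_be = text.encode("utf-16-be")  # big-endian, no BOM

# Both byte orders reduce to the same single-byte text.
print(strip_nulls(raw_le))
print(strip_nulls(raw_be))
```

Note that this only helps for characters in the ASCII range. Other
code units stay behind as two arbitrary bytes, and a leading BOM, if
present, is left in place too.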
