Re: charset=utf-16 tricks out SA

Mark Martinec Fri, 09 Oct 2015 05:23:11 -0700

Reindl Harald wrote:

no custom body rules hit like they do for ISO/UTF8 :-(

What is your normalize_charsets setting?


enabled, that's what i meant with "like they do for ISO/UTF8" and
adding "dear potencial partner" to CUST_BODY_17 did not change the
score

see attached sample and rule below

body      CUST_BODY_17    /.*(1st page ranking of google|dear potencial partner).*/i
score     CUST_BODY_17    1.0
describe  CUST_BODY_17    Contains Low


The problem with this message is that it declares its encoding
as UTF-16, i.e. without explicitly stating endianness like
UTF-16BE or UTF-16LE, and there is no BOM at the
beginning of each textual part, so endianness cannot be
determined. RFC 2781 says that big-endian encoding
should be assumed in the absence of a BOM.
See https://en.wikipedia.org/wiki/UTF-16
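
To make the RFC 2781 rule concrete, here is a minimal Python sketch of
how a decoder might choose the byte order for a part labelled plain
"UTF-16". It is an illustration only, not SpamAssassin's own (Perl)
code, and the helper name split_utf16 is made up for this example.

```python
import codecs

def split_utf16(raw):
    """Return (codec, payload) for bytes labelled plain 'UTF-16'.

    Hypothetical helper: a leading BOM decides the byte order and is
    stripped from the payload; without a BOM we fall back to
    big-endian, as RFC 2781 recommends.
    """
    for bom, codec in ((codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
                       (codecs.BOM_UTF16_BE, "utf-16-be")):  # FE FF
        if raw.startswith(bom):
            return codec, raw[len(bom):]
    # No BOM: the true byte order is unknown, so assume big-endian.
    return "utf-16-be", raw
```

Note that the last branch is only a guess: a BOM-less little-endian
body is still handed to the big-endian decoder.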

In the provided message the actual endianness is LE, and
BOM is missing, so decoding as UTF-16BE fails and the
rule does not hit. Garbage-in, garbage-out.

If you manually edit the sample and replace UTF-16
with UTF-16LE (and normalize is enabled), your rule should
hit - at least it does so in the current trunk code.
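
Both outcomes can be reproduced outside SpamAssassin with a few lines
of Python. This is only a sketch: the byte string stands in for the
sample's body, the regex is a simplified copy of the CUST_BODY_17
pattern, and rule_hits is a made-up helper, not anything SA provides.

```python
import re

# Simplified copy of the alternation in CUST_BODY_17 (case-insensitive).
RULE = re.compile(r"1st page ranking of google|dear potencial partner", re.I)

def rule_hits(raw, codec):
    """Decode raw body bytes with the given codec and test the rule."""
    text = raw.decode(codec, errors="replace")
    return RULE.search(text) is not None

# Little-endian UTF-16 bytes with no BOM, standing in for the sample body.
raw = "Dear potencial partner".encode("utf-16-le")

be_hit = rule_hits(raw, "utf-16-be")  # RFC 2781 default when there is no BOM
le_hit = rule_hits(raw, "utf-16-le")  # charset label corrected by hand
print(be_hit, le_hit)
```

Decoded as big-endian, each ASCII letter lands in an unrelated code
point (for example "D" becomes U+4400), so the pattern has nothing to
match; decoded as little-endian, the text comes out intact.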

If this seems to be common in the wild, please open a
bug ticket, as Kevin suggested, and attach the sample there.

  Mark
