On Tue, 2014-06-10 at 13:53 -0400, Alex wrote:
> I'm not very familiar with how to manage language encoding, and hoped
> someone could help. Some time ago I wrote a rule that looks for
> subjects that consist of a single word that's more than N characters.
> It works, but I'm learning that it's performed before the content of
> the subject is converted into something human-readable.

This is not true. Header rules are matched against the decoded string
by default. To prevent decoding of quoted-printable or base-64 encoded
headers, the :raw modifier needs to be appended to the header name.
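Just to illustrate the difference (rule names and patterns are made up
for this example only), a pair of test rules could look like this:

  # default: matches the decoded, human-readable Subject
  header TEST_DECODED  Subject =~ /环球旅讯/
  # :raw appended to the header name: matches the still MIME-encoded
  # Subject as found in the raw message
  header TEST_RAW      Subject:raw =~ /=\?utf-8\?B\?/i

Both should hit a Subject like yours, the first on the decoded UTF-8
text, the second on the base-64 encoded form.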
> Instead, it operates on something like:
>
> Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=

That's a base-64 encoded UTF-8 string, decoded for header rules. To see
for yourself, just echo your test header into

  spamassassin -D -L --cf="header TEST Subject =~ /.+/"

and the debug output will show you what it matched:

  dbg: rules: ran header rule TEST ======> got hit: "《环球旅讯》原创:在线旅游"

> How can I write a header rule that operates on the decoded utf
> content?
>
> header __SUB_NOSPACE Subject =~ /^.\S+$/
> header __SUB_VERYLONG Subject =~ /^.{20,200}\S+$/
> meta LOC_SUBNOSPACE (__SUB_VERYLONG && __SUB_NOSPACE)

Again, header rules by default operate on the decoded string.

I assume your actual problem is that the __SUB_VERYLONG rule hits.
Since the above test rule shows the complete decoded Subject, we can
tell it's 13 chars long, clearly below the "verylong" threshold of 20
chars. The hit is not caused by the encoding, though, but by the regex
operating on bytes rather than characters.

Let's see what a 20 byte chunk of that UTF-8 string looks like. A
modified rule will match the first 20 bytes only:

  header TEST Subject =~ /^.{20}/

The result shows the string is longer than 20 bytes, and the match even
ends right within a single UTF-8 encoded char:

  got hit: "《环球旅讯》<E5><8E>"

To make the regex matching aware of UTF-8 encoding, and match chars
instead of (raw) bytes, we will need the normalize_charset option,
along with yet another modification of the test rule, now matching the
first 10 chars only:

  header TEST Subject =~ /^.{10}/
  normalize_charset 1

  got hit: "《环球旅讯》原创:在"

The effect is clear: that 10 char match with normalize_charset enabled
is even longer than the above 20 byte match.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}