On Tue, 2014-06-10 at 13:53 -0400, Alex wrote:
> I'm not very familiar with how to manage language encoding, and hoped
> someone could help. Some time ago I wrote a rule that looks for
> subjects that consist of a single word that's more than N characters.
> It works, but I'm learning that it's performed before the content of
> the subject is converted into something human-readable.

This is not true. Header rules are matched against the decoded string
by default. To prevent decoding of quoted-printable or base-64 encoded
headers, the :raw modifier needs to be appended to the header name.
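Just to illustrate the difference (rule names and patterns are made up
for this example only), a pair of test rules could look like this:

  # default: matches the decoded, human-readable Subject
  header TEST_DECODED  Subject =~ /环球旅讯/
  # :raw appended to the header name: matches the still MIME-encoded
  # Subject as found in the raw message
  header TEST_RAW      Subject:raw =~ /=\?utf-8\?B\?/i

Both should hit a Subject like yours, the first on the decoded UTF-8
text, the second on the base-64 encoded form.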
> Instead, it operates on something like:
>
> Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=

That's a base-64 encoded UTF-8 string, decoded for header rules. To see
for yourself, just echo your test header into

  spamassassin -D -L --cf="header TEST Subject =~ /.+/"

and the debug output will show you what it matched:

  dbg: rules: ran header rule TEST ======> got hit: "《环球旅讯》原创:在线旅游"

> How can I write a header rule that operates on the decoded utf
> content?
>
> header __SUB_NOSPACE Subject =~ /^.\S+$/
> header __SUB_VERYLONG Subject =~ /^.{20,200}\S+$/
> meta LOC_SUBNOSPACE (__SUB_VERYLONG && __SUB_NOSPACE)

Again, header rules by default operate on the decoded string.

I assume your actual problem is that the __SUB_VERYLONG rule hits.
Since the above test rule shows the complete decoded Subject, we can
tell it's 13 chars long, clearly below the "verylong" threshold of 20
chars. The hit is not caused by the encoding, though, but by the regex
operating on bytes rather than characters.

Let's see what a 20 byte chunk of that UTF-8 string looks like. A
modified rule will match the first 20 bytes only:

  header TEST Subject =~ /^.{20}/

The result shows the string is longer than 20 bytes, and the match even
ends right within a single UTF-8 encoded char:

  got hit: "《环球旅讯》<E5><8E>"

To make the regex matching aware of UTF-8 encoding, and match chars
instead of (raw) bytes, we will need the normalize_charset option,
along with yet another modification of the test rule, now matching the
first 10 chars only:

  header TEST Subject =~ /^.{10}/
  normalize_charset 1

  got hit: "《环球旅讯》原创:在"

The effect is clear: that 10 char match with normalize_charset enabled
is even longer than the above 20 byte match.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}