On Tue, 2014-06-10 at 21:22 -0400, Daniel Staal wrote:
> --As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged to
> have said:
> > Worse, enabling charset normalization completely breaks UTF-8 chars
> > in the regex. At least in my ad-hoc --cf command line testing.
>
> --As for the rest, it is mine.
>
> This sounds like something where `use feature 'unicode_strings'` might
> have an effect.
Possibly.

> enabling normalization is probably setting the internal utf8 flag on
> incoming text, which could change the semantics of the regex matching.

Nope. *digging into code*

This option mainly affects rendered textual parts and headers, treating
them with Encode::Detect. More complex than just setting an internal
flag. What exactly made the ad-hoc regex rules fail is beyond the scope
of tonight's code-diving.

> If that's the case, it raises the question of if we want Spamassassin
> to require Perl 5.12 (which includes that feature) - the current base
> version is 5.8.1. Unicode support has been evolving in Perl; 5.8
> supports it generally, but there were bugs. I think 5.12 got most of
> them, but I'm not sure. (And of course it's not the current version of
> Perl.)

The normalize_charset option requires Perl 5.8.5. All the ad-hoc rule
testing in this thread has been done with SA 3.3.2 on Perl 5.14.2
(Debian 7.5). So this is not an issue of requiring a more recent Perl
version.

While of course something to potentially improve on itself, the topic of
charset normalization is just a by-product explaining the original
issue: header rules and string encoding, with a grain of charset
encoding salt.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8?
c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){
putchar(t[s]);h=m;s=0; }}}