On Tue, 2014-06-10 at 21:22 -0400, Daniel Staal wrote:
> --As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged to
> have said:
> > Worse, enabling charset normalization completely breaks UTF-8 chars
> > in the regex. At least in my ad-hoc --cf command line testing.
>
> --As for the rest, it is mine.
>
> This sounds like something where `use feature 'unicode_strings'` might
> have an effect.
Possibly.

> enabling normalization is probably setting the internal utf8 flag on
> incoming text, which could change the semantics of the regex matching.

Nope. *digging into code*

This option mainly affects rendered textual parts and headers, treating
them with Encode::Detect. More complex than just setting an internal
flag. What exactly made the ad-hoc regex rules fail is beyond the scope
of tonight's code-diving.

> If that's the case, it raises the question of if we want Spamassassin
> to require Perl 5.12 (which includes that feature) - the current base
> version is 5.8.1. Unicode support has been evolving in Perl; 5.8
> supports it generally, but there were bugs. I think 5.12 got most of
> them, but I'm not sure. (And of course it's not the current version of
> Perl.)

The normalize_charset option requires Perl 5.8.5. All the ad-hoc rule
testing in this thread has been done with SA 3.3.2 on Perl 5.14.2
(Debian 7.5). So this is not an issue of requiring a more recent Perl
version.

While of course something to potentially improve on itself, the topic of
charset normalization is just a by-product explaining the original
issue: header rules and string encoding, with a grain of charset
encoding salt.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8?
c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){
putchar(t[s]);h=m;s=0; }}}