[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen Sun, 14 Aug 2011 20:31:22 -0700

Tom Christiansen <tchr...@perl.com> added the comment:

I wrote:


>> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

> So I'm finding.  Perhaps that's why I keep getting confused. I do have
> a pretty firm notion of what UCS-2 and UTF-16 are, and so I get
> sometimes self-contradictory results.  Can you think of anywhere that
> Python acts like UCS-2 and not UTF-16?  I'm not sure I have found one,
> although the regex thing might count.

I just thought of one.  The casemapping functions don't work right on
Deseret, which is a non-BMP case-changing script.  That's one I submitted
as a bug, because I figure if the UTF-8 decoder can decode the non-BMP
code points into paired UTF-16 surrogates, then the casing functions had
jolly well better be able to deal with them.  If the UTF-8 decoder knows
it is only going to UCS-2, then it should have raised an exception on my
non-BMP source.  Since it went to UTF-16, the rest of the language should
have behaved accordingly.
Java does this right, BTW, despite its UTF-16ness.
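
For concreteness, here is a minimal sketch of the check being described.
It assumes Python 3.3 or later, where the narrow/wide build distinction no
longer exists, so it shows the behavior argued for above rather than
reproducing the narrow-build failure.  The helper name utf16_code_units is
made up for illustration; the two code points are DESERET SMALL LETTER
LONG I (U+10428) and DESERET CAPITAL LETTER LONG I (U+10400).

```python
import struct

def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

small = "\U00010428"    # DESERET SMALL LETTER LONG I
capital = "\U00010400"  # DESERET CAPITAL LETTER LONG I

# A non-BMP code point occupies two UTF-16 code units: a high surrogate
# (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF).
units = utf16_code_units(small)
is_surrogate_pair = (
    len(units) == 2
    and 0xD800 <= units[0] <= 0xDBFF
    and 0xDC00 <= units[1] <= 0xDFFF
)

# Case mapping should operate on the code point, not on the code units.
upper_ok = small.upper() == capital
lower_ok = capital.lower() == small

print(is_surrogate_pair, upper_ok, lower_ok)
```

On a narrow build of the era, the same string was stored as those two
surrogate code units, which is the situation the casing complaint is about.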

--tom

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________
