Re: PEP 393 vs UTF-8 Everywhere

Steve D'Aprano Sat, 21 Jan 2017 18:49:05 -0800

On Sun, 22 Jan 2017 07:21 am, Pete Forman wrote:

> Marko Rauhamaa <ma...@pacujo.net> writes:
> 
>>> py> low = '\uDC37'
>>
>> That should raise a SyntaxError exception.
> 
> Quite. My point was that with older Python on a narrow build (Windows
> and Mac) you need to understand that you are using UTF-16 rather than
> Unicode.


But you're *not* using UTF-16, at least not proper UTF-16, in older narrow
builds. If you were, then a surrogate pair in a Unicode string u'...' would
be treated as a single supplementary code point, but it isn't.
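
To make "proper UTF-16" concrete, here is a rough sketch for Python 3.3 or
later of how a surrogate pair maps to one supplementary code point. The
helper name combine_surrogates is my own invention for illustration, not
something from the standard library; the arithmetic is the standard UTF-16
decoding rule.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one supplementary code point.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The pair D802 DD00 stands for the single code point U+10900.
code_point = combine_surrogates(0xD802, 0xDD00)

# A real UTF-16 decoder does the same thing: four bytes (two code units)
# come back as a string holding one code point.
decoded = b'\x02\xd8\x00\xdd'.decode('utf-16-le')
```

A narrow build stores the two code units but never performs this combining
step when you index a string or take its length.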

unichr() doesn't support supplementary code points in narrow builds:

[steve@ando ~]$ python2.7 -c "print len(unichr(0x10900))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


and even if you sneak a supplementary code point in, it is treated wrongly,
as two code units rather than one code point:

[steve@ando ~]$ python2.7 -c "print len(u'\U00010900')"
2
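
For comparison, a small sketch of the same string on Python 3.3 or later,
where PEP 393 applies. The function utf16_code_units is a made-up name for
this example; it just counts how many 16-bit units the UTF-16 encoding
needs, which is the number a narrow build reports as the length.

```python
def utf16_code_units(s):
    """Count the 16-bit code units needed to encode s as UTF-16.

    Hypothetical helper for illustration only.
    """
    # utf-16-le writes no byte order mark, so every 2 bytes is one code unit.
    return len(s.encode('utf-16-le')) // 2

s = '\U00010900'
code_points = len(s)              # what Python 3.3+ reports
code_units = utf16_code_units(s)  # what a narrow build reports as len()
```

Here code_points is 1 and code_units is 2, so the narrow build's answer is
a count of code units, not of characters.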


So Python narrow builds are more like a bastard hybrid of UCS-2 and UTF-16.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list
