On Tue, 18 Jun 2013 00:12:34 -0400, Dave Angel wrote:

> On 06/17/2013 10:42 PM, Steven D'Aprano wrote:
>> On Mon, 17 Jun 2013 21:06:57 -0400, Dave Angel wrote:
>>
>>> On 06/17/2013 08:41 PM, Steven D'Aprano wrote:
>>>>
>>>> <SNIP>
>>>>
>>>> In Python 3.2 and older, the data will be either UTF-4 or UTF-8,
>>>> selected when the Python compiler itself is compiled.
>>>
>>> I think that was a typo. Do you perhaps UCS-2 or UCS-4
>>
>> Yes, that would be better.
>>
>> UCS-2 is identical to UTF-16, except it doesn't support non-BMP
>> characters and therefore doesn't have surrogate pairs.
>>
>> UCS-4 is functionally equivalent to UTF-16,
>
> Perhaps you mean UTF-32 ?
Yes, sorry for the repeated confusion.

>> as far as I can tell. (I'm
>> not really sure what the difference is.)
>>
>>
> Now you've got me curious, by bringing up surrogate pairs. Do you know
> whether a narrow build (say 3.2) really works as UTF16, so when you
> encode a surrogate pair (4 bytes) to UTF-8, it encodes a single Unicode
> character into a single UTF-8 sequence (prob. 4 bytes long) ?

In a Python narrow build, the internal storage of strings is equivalent
to UTF-16: all characters in the Basic Multilingual Plane require two
bytes:

py> import sys
py> sys.maxunicode
65535
py> sys.getsizeof('π') - sys.getsizeof('')
2

Outside of the BMP, characters are treated as a pair of surrogates:

py> c = chr(0x10F000)  # one character...
py> len(c)  # ...stored as a pair of surrogates
2

Encoding and decoding work fine:

py> c.encode('utf-8').decode('utf-8') == c
True
py> c.encode('utf-8')
b'\xf4\x8f\x80\x80'

The problem with surrogates is that it is possible to accidentally
separate the pair, which leads to broken, invalid text:

py> c[0].encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbfc' in position 0: surrogates not allowed

(The error message is a little misleading; a narrow-build string may
contain surrogates, but they can only be encoded when they make up a
valid pair.)

Python's handling of UTF-16 is, as far as I know, correct. What isn't
correct is that the high-level Python string methods assume that two
bytes == one character, which can lead to surrogates being separated,
which gives you junk text. Wide builds don't have this problem, because
every character == four bytes, and neither does Python 3.3 or later,
which no longer has separate narrow and wide builds.

-- 
Steven
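P.S. In case it helps to see where the '\udbfc' in that traceback comes
from, here is a rough sketch of the surrogate arithmetic. The helper
names to_surrogate_pair and from_surrogate_pair are made up for this
example and are not part of Python; the constants are the standard
UTF-16 ones. It should run on any Python 3, narrow or wide.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10F000)
print(hex(high), hex(low))

# Cross-check against the utf-16 codec, which emits the same two units.
units = chr(0x10F000).encode('utf-16-be')
codec_high = int.from_bytes(units[:2], 'big')
codec_low = int.from_bytes(units[2:], 'big')
print((codec_high, codec_low) == (high, low))
```

The high surrogate is what c[0] returns on a narrow build, which is why
encoding it on its own fails.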