On Sat, 16 Mar 2013 15:09:56 +1100, Chris Angelico wrote:

> On Sat, Mar 16, 2013 at 2:56 PM, Mark Lawrence <breamore...@yahoo.co.uk>
> wrote:
>> On 16/03/2013 02:44, Thomas 'PointedEars' Lahn wrote:
>>>
>>> Chris Angelico wrote:
>>>
>>>
>> Thomas and Chris, would the two of you be kind enough to explain to
>> morons such as myself how all the ECMAScript stuff relates to Python's
>> unicode as implemented via PEP 393 as you've lost me, easily done I
>> know.
>
> Sure. Here's the brief version: It's all about how a string is exposed
> to a script.
>
> * Python 3.2 Narrow gives you UTF-16. Non-BMP characters count twice.
> * Python 3.2 Wide gives you UTF-32. Each character counts once.
> * Python 3.3 gives you UTF-32, but will store it as compactly as
>   possible.
> * ECMAScript specifies the semantics of Python 3.2 Narrow.
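Before getting into the details, here's a quick sketch of what Chris's third bullet point ("as compactly as possible") looks like from the interactive prompt. The exact sys.getsizeof figures vary with Python version and platform, so treat them as indicative only:

py> import sys
py> # per-character storage on 3.3: ASCII, Latin-1, other BMP, astral
py> [sys.getsizeof(c * 1000) for c in ("a", "\u00e9", "\uffff", "\U0001F600")]

The four results should work out to roughly 1000, 1000, 2000 and 4000 bytes respectively, give or take a small fixed per-string overhead: 1, 1, 2 and 4 bytes per character, depending on the widest code point in the string.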
And just for the record: Unicode actually doesn't define what a "character" is. Instead, it talks about "code points", but for our purposes we can gloss over the differences and pretend that they are almost the same thing, except where noted. Some code points represent characters, some represent non-characters, and some are currently unused.

UTF-16 is a *variable length* storage mechanism that says each code point takes either two or four bytes. Since Unicode includes more code points than will fit in two bytes, UTF-16 includes a mechanism for dealing with the additional code points:

* The first 65,536 code points are defined as the "Basic Multilingual Plane", or BMP. Each code point in the BMP is represented in UTF-16 by a single 16-bit value.

* The remaining 16 planes of 65,536 code points each are the "supplementary planes". Each code point in a supplementary plane is represented by two 16-bit values, called a "surrogate pair". The "lead surrogate" will be in the range 0xD800...0xDBFF and the "trail surrogate" will be in the range 0xDC00...0xDFFF.

The disadvantage here is that you can't tell how far into a string you need to go to get to (say) the 1000th character (code point). If all of the first 1000 code points are in the BMP, then you can jump directly to byte offset 2000. If all of them are in supplementary planes, then you can jump directly to byte offset 4000. But since you don't usually know how many supplementary-plane code points are in the string, you have to walk through the string:

def code_point_at(string, index):
    # Pseudo-code to return the code point at position `index`, treating
    # `string` as a sequence of 16-bit values (pairs of bytes).
    offset = 0   # Offset into the string, in 16-bit units.
    counter = 0  # Number of code points walked past so far.
    while offset < len(string):
        if 0xD800 <= string[offset] <= 0xDBFF:
            # Lead surrogate of a surrogate pair.
            if counter == index:
                return string[offset:offset+2]  # both halves of the pair
            counter += 1
            offset += 2  # Skip the trail surrogate as well.
        elif 0xDC00 <= string[offset] <= 0xDFFF:
            # Trail surrogate found outside of a surrogate pair.
            raise ValueError("malformed UTF-16")
        else:
            # BMP code point.
            if counter == index:
                return string[offset]
            counter += 1
            offset += 1
    raise IndexError("index out of range")

What a mess! Slow and annoying to get right. Not surprisingly, most implementations of UTF-16 don't do this, and Python is one of them. Instead, they assume that all code points take up the same space, and consequently they let you create *invalid Unicode strings* by splitting a surrogate pair. This is in a Python 3.2 narrow build:

py> s = chr(70000)
py> len(s)
2
py> a = s[0]
py> a == s
False
py> print(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud804' in position 0: surrogates not allowed

Oops! We have created an invalid Unicode string. A wide build will fix this, because it uses UTF-32 instead.

UTF-32 is a *fixed width* storage mechanism where every code point takes exactly four bytes. Since the entire Unicode range will fit in four bytes, that ensures that every code point is covered, and there is no need to walk the string every time you perform an indexing operation. But it means that if you're one of the 99.9% of users who mostly use characters in the BMP, your strings take twice as much space as necessary. If you only use Latin-1 or ASCII, your strings take four times as much space as necessary.

So a Python wide build uses more memory, but gets string processing right.

Python 3.3 has the best of both worlds, fixing the invalid string processing and avoiding memory waste.
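As an aside, the arithmetic behind surrogate pairs is simple enough to sketch in a few lines of Python. This is purely illustrative (it is not how any Python build actually stores strings), but it shows where the '\ud804' in the traceback above comes from:

def to_surrogate_pair(codepoint):
    # Split a supplementary-plane code point into its UTF-16 surrogate pair.
    assert 0x10000 <= codepoint <= 0x10FFFF
    offset = codepoint - 0x10000          # the 20 significant bits
    lead = 0xD800 + (offset >> 10)        # high 10 bits -> 0xD800...0xDBFF
    trail = 0xDC00 + (offset & 0x3FF)     # low 10 bits  -> 0xDC00...0xDFFF
    return lead, trail

def from_surrogate_pair(lead, trail):
    # Recombine a surrogate pair into the original code point.
    assert 0xD800 <= lead <= 0xDBFF and 0xDC00 <= trail <= 0xDFFF
    return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

py> [hex(u) for u in to_surrogate_pair(70000)]
['0xd804', '0xdd70']
py> hex(from_surrogate_pair(0xD804, 0xDD70))
'0x11170'

0x11170 is 70000, and 0xD804 is exactly the lead surrogate that the narrow build handed back as s[0] above.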
One of the complications here is that often people conflate UTF-16 and UCS-2. It isn't entirely clear to me, but it seems that ECMAScript may be doing that. UCS-2 specifies that *all characters* are represented by *exactly* 16 bits (two bytes), and hence it is completely unable to deal with the supplementary planes at all.

If I have read the ECMAScript spec correctly, it *completely ignores* the issue of surrogate pairs, and so is underspecified. Instead it talks about normalised form, which is entirely unrelated.

Normalisation relates to the idea that many characters can be represented in two or more ways. For example, the character "ä" can be represented in at least two forms:

Normalised (composed) form:
    U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

Canonical decomposition:
    U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS

So both of these strings represent the same letter, even though the second uses two code points and the first only one:

py> a = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
py> b = "a\N{COMBINING DIAERESIS}"

Arguably, they should print the same way too, although I think that will depend on how smart your terminal is, and/or on the font you use. But I digress.

The ECMAScript spec carefully specifies that it makes no guarantees about normalisation, which is right and proper for a language, but it says nothing about surrogates, and that is very poor.

So as I said, I suspect that ECMAScript is actually referring to UCS-2 when it mentions UTF-16. If I'm right, that's pretty lousy.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list
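P.S. For anyone who wants to see the normalisation point at the interactive prompt, the standard library's unicodedata module converts between the two forms. A quick sketch:

py> import unicodedata
py> a = "\N{LATIN SMALL LETTER A WITH DIAERESIS}"
py> b = "a\N{COMBINING DIAERESIS}"
py> a == b
False
py> unicodedata.normalize("NFC", b) == a   # compose b into one code point
True
py> unicodedata.normalize("NFD", a) == b   # decompose a into two
True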