Kurt Mueller wrote:
[...]
> Now the part of the two Python builds is still somewhat unclear to me.
[...]
> In Python 2.7:
>
> As I learned from the ord() manual:
> If a unicode argument is given and Python was built with UCS2 Unicode,

Where does the manual mention UCS-2? As far as I know, no version of
Python uses that.

> (I suppose this is the narrow build in your terms),

Mostly right, but not quite. "Narrow build" means that Python uses
UTF-16, not UCS-2, although the two are very similar. See below for
further details. But to make it more confusing, *parts* of Python (like
the unichr function) assume UCS-2, and refuse to accept values over
0xFFFF.

> then the character’s code point must be in the range [0..65535] inclusive;

Half-right. Unicode code points are always in the range U+0000 to
U+10FFFF, or in decimal, [0...1114111]. But Python "narrow builds" don't
quite handle that correctly, and only half-support code points from
[65536...1114111]. The reasons are complicated, but see below.

UCS-2 is an implementation of an early, obsolete version of Unicode
which is limited to just 65536 characters (technically: "code points")
instead of the full range of 1114112 characters supported by Unicode.
UCS-2 is very similar to UTF-16. Both use a 16-bit "code unit" to
represent characters. In UCS-2, each character is represented by
precisely 1 code unit, numbered between 0 and 65535 (0x0000 and 0xFFFF
in hex). In UTF-16, the most common characters (the Basic Multilingual
Plane) are likewise represented by 1 code unit, between 0 and 65535, but
there is a range of "characters" (actually code points) which are
reserved for use as so-called "surrogate pairs". Using hex:

Code points U+0000 to U+D7FF:
  - represent the same character in UCS-2 and UTF-16;

Code points U+D800 to U+DFFF:
  - represent reserved but undefined characters in UCS-2;
  - represent surrogates in UTF-16 (see below);

Code points U+E000 to U+FFFF:
  - represent the same character in UCS-2 and UTF-16;

Code points U+010000 to U+10FFFF:
  - impossible to represent in UCS-2;
  - represented by TWO surrogates in UTF-16.

For example, the Unicode code point U+1D11E (MUSICAL SYMBOL G CLEF)
cannot be represented at all in UCS-2, because it is past U+FFFF.

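The four ranges listed above can be turned into a small lookup. This is
only a sketch: the function names are hypothetical (nothing here comes
from the standard library), and it works on plain integers, so it should
behave the same on narrow builds, wide builds and Python 3.

```python
# Hypothetical helpers illustrating the code point ranges listed above.
# They only do integer arithmetic; no Unicode strings are involved.

MAX_CODE_POINT = 0x10FFFF


def _check(cp):
    """Reject integers that are not Unicode code points at all."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point: %r" % (cp,))


def is_surrogate(cp):
    """True for U+D800..U+DFFF, the range UTF-16 reserves for surrogates."""
    _check(cp)
    return 0xD800 <= cp <= 0xDFFF


def ucs2_code_units(cp):
    """Code units needed in UCS-2: 1, or None if not representable."""
    _check(cp)
    return 1 if cp <= 0xFFFF else None


def utf16_code_units(cp):
    """Code units needed in UTF-16: 1 up to U+FFFF, 2 beyond it."""
    _check(cp)
    return 1 if cp <= 0xFFFF else 2


if __name__ == "__main__":
    # One sample from each of the four ranges.
    for cp in (0x0041, 0xD834, 0xFFFD, 0x1D11E):
        print("U+%04X: UCS-2 units=%s, UTF-16 units=%s, surrogate=%s"
              % (cp, ucs2_code_units(cp), utf16_code_units(cp),
                 is_surrogate(cp)))
```

Note that a lone surrogate such as 0xD834 still counts as one code unit
here; the sketch reports how wide a value is, not whether it is a
meaningful character by itself.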
In UTF-16, it cannot be represented as a single 16-bit code unit;
instead it is represented as two code units, 0xD834 0xDD1E. That is
called a "surrogate pair".

The problem with Python's narrow builds is that, although characters are
variable width (the most common are 1 code unit, 16 bits, the rest are 2
code units), the Python implementation assumes that all characters are a
fixed 16 bits. So if your string is a single character like U+1D11E,
instead of treating it as a string of length one with ordinal value
0x1D11E, Python will treat it as a string of length *two* with ordinal
values 0xD834 and 0xDD1E. (In other words, Python narrow builds fail to
deal with surrogate pairs correctly.)

Although you cannot create that string using unichr, you can create it
using the \U notation:

py> unichr(0x1D11E)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
py> u'\U0001D11E'
u'\U0001d11e'

> I understand: In a UCS2 build each character of a Unicode string uses
> 16 Bits and can represent code points from U-0000..U-FFFF.

That is correct. So UCS-2 can only represent a small subset of Unicode.

> From the unichr(i) manual I learn:
> The valid range for the argument depends how Python was configured
> – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF].
> I understand: narrow build is UCS2, wide build is UCS4

UCS-4 is exactly the same as UTF-32, and wide builds use a fixed 32 bits
for every code point, so that's correct.

> - In a UCS2 build each character of an Unicode string uses 16 Bits and has
> code points from U-0000..U-FFFF (0..65535)

As I said, it's not strictly correct: Python is actually using UTF-16,
but it's a buggy or incomplete UTF-16, with parts of the system assuming
UCS-2.

> - In a UCS4 build each character of an Unicode string uses 32 Bits and has
> code points from U-00000000..U-0010FFFF (0..1114111)

Correct.

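For anyone who wants to check the numbers in this thread, here is a
rough sketch that splits a code point into its surrogate pair by hand
and then compares the result with the standard "utf-16-be" and
"utf-32-be" codecs. The helper names are my own (hypothetical), and
counting UTF-16 code units is only an approximation of what len()
reports on a narrow build, not a look at the interpreter's internals.

```python
import struct


def surrogate_pair(cp):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000             # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def utf16_units(text):
    """List of 16-bit code units, as a narrow build would store them."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def utf32_units(text):
    """List of 32-bit code units, as a wide build would store them."""
    data = text.encode("utf-32-be")
    return list(struct.unpack(">%dI" % (len(data) // 4), data))


if __name__ == "__main__":
    clef = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF
    print("by hand:", ["%04X" % u for u in surrogate_pair(0x1D11E)])
    print("UTF-16: ", ["%04X" % u for u in utf16_units(clef)])
    print("UTF-32: ", ["%08X" % u for u in utf32_units(clef)])
    # Two UTF-16 code units versus one UTF-32 code unit for the same
    # single code point.
    print(len(utf16_units(clef)), len(utf32_units(clef)))
```

The hand-computed pair and the codec output should agree, which is a
cheap way to convince yourself that the two "characters" a narrow build
shows you are really one code point.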
Remember that UCS-4 and UTF-32 are exactly the same: every code point
from U+0000 to U+10FFFF is represented by a single 32-bit value. So our
earlier example, U+1D11E (MUSICAL SYMBOL G CLEF) would be represented as
0x0001D11E in UTF-32 and UCS-4.

Remember, though, these internal representations are (nearly) irrelevant
to Python code. In Python code, you just consider that a Unicode string
is an array of ordinal values from 0x0 to 0x10FFFF, each representing a
single code point U+0000 to U+10FFFF. The only reason I say "nearly" is
that narrow builds don't *quite* work right if the string contains
surrogate pairs.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list