2009/8/25 <ru...@yahoo.com>:
> In Python 2.5 on Windows I could do [*1]:
>
>  # Create a unicode character outside of the BMP.
>  >>> a = u'\U00010040'
>
>  # On Windows it is represented as a surrogate pair.
>  >>> len(a)
>  2
>  >>> a[0],a[1]
>  (u'\ud800', u'\udc40')
>
>  # Create the same character with the unichr() function.
>  >>> a = unichr (65600)
>  >>> a[0],a[1]
>  (u'\ud800', u'\udc40')
>
>  # Although the unichr() function works fine, its
>  # inverse, ord(), doesn't.
>  >>> ord (a)
>  TypeError: ord() expected a character, but string of length 2 found
>
> On Python 2.6, unichr() was "fixed" (using the word
> loosely) so that it too now fails with characters outside
> the BMP.
>
>  >>> a = unichr (65600)
>  ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>
> Why was this done rather than changing ord() to accept a
> surrogate pair?
>
> Does not this effectively make unichr() and ord() useless
> on Windows for all but a subset of unicode characters?
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Hi,
I'm not sure about the exact reasons for this behaviour on narrow
builds either (perhaps it is meant to keep the input and output of both
functions consistently at exactly one character?).
However, if I need these functions for the higher unicode planes, the
following rather hackish replacements seem to work on Python 2. I presume
there are smarter ways of dealing with this, but anyway...

hth,
 vbr

#### not (systematically) tested #####################################

import sys

def wide_ord(char):
    """Return the code point of a character or of a UTF-16 surrogate pair."""
    try:
        return ord(char)
    except TypeError:
        # On narrow builds a non-BMP character is stored as two code units:
        # a high surrogate followed by a low surrogate.
        if (len(char) == 2
                and 0xD800 <= ord(char[0]) <= 0xDBFF
                and 0xDC00 <= ord(char[1]) <= 0xDFFF):
            return ((ord(char[0]) - 0xD800) * 0x400
                    + (ord(char[1]) - 0xDC00) + 0x10000)
        raise TypeError("invalid character input")

def wide_unichr(i):
    """Return the unicode string for code point i, also on narrow builds."""
    if i <= sys.maxunicode:
        return unichr(i)
    # Narrow build: let the unicode-escape codec build the surrogate pair.
    # "%08x" avoids the trailing "L" that hex() appends to long integers.
    return ("\\U%08x" % i).decode("unicode-escape")
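
PS: for reference, the surrogate arithmetic behind wide_ord() can also be written on plain integers, which sidesteps the narrow/wide build question entirely. This is only a minimal sketch; split_surrogates() and join_surrogates() are names I made up for illustration, not anything from the standard library, and it is tested no more thoroughly than the code above.

```python
# Hypothetical helper names; pure integer arithmetic, so the result does not
# depend on whether the interpreter is a narrow or a wide build.

def split_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the supplementary planes")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def join_surrogates(high, low):
    """Return the code point encoded by a (high, low) surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = split_surrogates(65600)
print("%#x %#x" % (high, low))  # 0xd800 0xdc40
```

For the character from the original question (65600, i.e. U+10040) this gives the same pair that Python 2.5 showed on Windows, and join_surrogates() maps the pair back to 65600.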