On 08/26/2009 08:52 PM, Steven D'Aprano wrote:
> On Wed, 26 Aug 2009 16:27:33 -0700, rurpy wrote:
>
>> But regardless, the significant question is, what is the reason for
>> having ord() (and unichr) not work for surrogate pairs and thus not
>> usable with a large number of unicode characters that Python otherwise
>> supports?
>
>
> I'm no expert on Unicode, but my guess is that the reason is out of a
> desire for simplicity: unichr() should always return a single char, not a
> pair of chars, and similarly ord() should take as input a single char,
> not two, and return a single number.
>
> Otherwise it would be ambiguous whether ord(surrogate_pair) should return
> a pair of ints representing the codes for each item in the pair, or a
> single int representing the code point for the whole pair.
>
> E.g. given your earlier example:
>
>>>> a = u'\U00010040'
>>>> len(a)
> 2
>>>> a[0]
> u'\ud800'
>>>> a[1]
> u'\udc40'
>
> would you expect ord(a) to return (0xd800, 0xdc40) or 0x10040?

The latter.

> If the
> latter, what about ord(u'ab')?

I would expect a TypeError* (as ord() currently raises) because the
string length is not 1 and 'ab' is not a surrogate pair.

*Actually I would have expected ValueError but I'm not going to
lose sleep over it.

> Remember that a unicode string can contain code points that aren't valid
> characters:
>
>>>> ord(u'\ud800') # reserved for surrogates, not a character
> 55296
>
> so if ord() sees a surrogate pair, it can't assume it's meant to be
> treated as a surrogate pair rather than a pair of code points that just
> happens to match a surrogate pair.

Well, actually, yes it can. :-) Python has already made a strong
statement that such a pair is the representation of a character:

  >>> a = ''.join([u'\ud800',u'\udc40'])
  >>> a
  u'\U00010040'

That is, Python prints, and treats in nearly all other contexts,
that combination as a character.

This is related to the practicality argument: what is the ratio of
the need to treat a surrogate pair as a character, consistent with
the rest of Python, vs the need to treat it as a string of two
separate (and invalid in the unicode sense?) characters?

And if you want to treat each half of the pair separately it's not
exactly hard: ord(a[0]), ord(a[1]).

> None of this means you can't deal with surrogate pairs, it just means you
> can't deal with them using ord() and unichr().

Kind of like saying, it doesn't mean you can't deal with integers
larger than 2**32, you just can't multiply and divide them.

> The above is just my guess, I'd be interested to hear what others say.
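
For concreteness, here is a rough sketch of the behaviour I am
arguing for. The name wide_ord is made up for illustration and is
not anything Python provides. It assumes the input is either a single
code unit or a high/low surrogate pair, and it raises TypeError for
anything else, as ord() does today for strings of the wrong length.

```python
def wide_ord(s):
    """Hypothetical helper: like ord(), but also accepts a surrogate pair."""
    if len(s) == 1:
        # Ordinary case: a single code unit, same as ord().
        return ord(s)
    if len(s) == 2:
        hi, lo = ord(s[0]), ord(s[1])
        # A valid pair is a high surrogate followed by a low surrogate.
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            # Standard UTF-16 decoding of the pair into one code point.
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    # Anything else (e.g. u'ab') is rejected.
    raise TypeError("wide_ord() expected a character or a surrogate "
                    "pair, but string of length %d found" % len(s))
```

With this, wide_ord(u'\ud800\udc40') gives 0x10040, wide_ord(u'a')
gives 97, and wide_ord(u'ab') raises TypeError, which is the
disambiguation described above.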