On 08/26/2009 11:51 PM, "Martin v. Löwis" wrote:
>[...]
>> But regardless, the significant question is, what is
>> the reason for having ord() (and unichr) not work for
>> surrogate pairs and thus not usable with a large number
>> of unicode characters that Python otherwise supports?
>
> See PEP 261, http://www.python.org/dev/peps/pep-0261/
> It specifies all this.

The PEP (AFAICT) says only what we already know: that on narrow
builds unichr() will raise an exception with an argument >= 0x10000,
and that ord() is unichr()'s inverse. I have read the PEP twice now
and still see no justification for that decision; it appears to have
been made by fiat.[*1]

Could you or someone please point me to specific justification
for having unichr and ord work only for a subset of unicode
characters on narrow builds, as opposed to the more general
and IMO useful behavior proposed earlier in this thread?

----------------------------------------------------------
[*1]
The PEP says:
  * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
    length-one string.

  * unichr(i) for 2**16 <= i <= TOPCHAR will return a
    length-one string on wide Python builds. On narrow
    builds it will raise ValueError.
and
  * ord() is always the inverse of unichr()

which of course we know; that is the current behavior. But
there is no reason given for that behavior.

Under the second *unicode bullet point, there are two issues
raised:
  1) Should surrogate pairs be disallowed on narrow builds?
     That appears to have been answered in the negative and is
     not relevant to my question.
  2) Should access to code points above TOPCHAR be allowed?
     Not relevant to my question.

  * every Python Unicode character represents exactly
    one Unicode code point (i.e. Python Unicode
    Character = Abstract Unicode character)

I'm not sure what this means (what's an abstract unicode
character?). If it mandates that u'\ud800\udc40' be treated as
a len() 2 string, that is the current case, but it does not say
anything about how unichr and ord should behave. If it mandates
that that string must always be treated as two separate code
points, then Python itself violates it by printing that string
as u'\U00010040' rather than u'\ud800\udc40'.

Finally we read:

  * There is a convention in the Unicode world for
    encoding a 32-bit code point in terms of two
    16-bit code points. These are known as
    "surrogate pairs". Python's codecs will adopt
    this convention.

Is a distinction made between Python and Python codecs, with
only the latter having any knowledge of surrogate pairs? I guess
that would explain why Python prints a surrogate pair as a single
character. But this seems arbitrary and counter-useful if applied
to ord() and unichr(). What possible use-case is there for *not*
recognizing surrogate pairs in those two functions?

Nothing else in the PEP seems remotely relevant.
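
For reference, the mapping between a surrogate pair and its code
point is simple arithmetic. Below is a minimal sketch of that
conversion, assuming the standard UTF-16 formulas. The helper names
code_point_to_pair and pair_to_code_point are hypothetical (they are
not Python builtins, and this is not necessarily the exact behavior
proposed earlier in the thread). They work on plain integers, so the
result does not depend on narrow vs. wide builds.

```python
# Hypothetical helpers, not part of Python: standard UTF-16
# surrogate arithmetic on integer code units.

def code_point_to_pair(cp):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside supplementary range: %#x" % cp)
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def pair_to_code_point(high, low):
    """Combine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

With these, the pair from the example above, (0xD800, 0xDC40), maps
to 0x10040 and back again.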