On 08/29/2009 12:06 PM, Steven D'Aprano wrote:
[...]
>> The reasons for the current behavior so far:
>>
>> 1.
>>> What you propose would break the property "unichr(i) always returns a
>>> string of length one, if it returns anything at all".
>>
>> Yes. And i don't see the problem with that. Why is that property more
>> desirable than the non-existent property that a Unicode literal always
>> produces one python character?
>
> What do you mean? Unicode literals don't always produce one character,
> e.g. u'abcd' is a Unicode literal with four characters.

I'm sorry, I should have been clearer. I meant the literal
representation of a *single* unicode character: u'\u4000' results in
a string of length 1, whereas u'\U00010040' results in a string of
length 2 (in a narrow build). In both cases the literal represents a
single unicode code point.

> I think it's fairly self-evident that a function called uniCHR [emphasis
> added] should return a single character (technically a single code
> point).

There are two concepts of character here: the 16-bit things that
encode a character in a Python unicode string (in a narrow build
Python), and a character in the sense of one of the ~2**20 unicode
characters. Python has chosen to represent the latter (when outside
the BMP) as a pair of surrogate characters from the former. I don't
see why one would assume that CHR means the python 16-bit character
concept rather than the full unicode character concept. In fact,
rather the opposite.

> But even if you can come up with a reason for unichr() to return
> two or more characters,

I've given a number of reasons why it should return a two-character
representation of a non-BMP character, one of which is that that is
how Python has chosen to represent such characters internally. I
won't repeat the other reasons again. I'm not sure why you think more
than two characters would ever be possible.

> this would break code that relies on the
> documented promise that the length of the output of unichr() is always
> one.

Ah, OK. This is the good reason I was looking for. I did not realize
(until prompted by your remark to go back and look at the early docs)
that unichr had been documented to return a single character since
2.0 and that wide character support was added in 2.2. Martin v.
Loewis also implied that, I now see, although the implication was too
deep for me to pick up.

So although it leads to a suboptimal situation, I agree that
maintaining the documented behavior was necessary.

[...]

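For anyone following along, here is a rough sketch of the arithmetic behind the surrogate-pair representation discussed above. The function names to_surrogate_pair() and from_surrogate_pair() are made up for illustration and are not part of Python; the sketch is written for Python 3, where it runs regardless of build, and it only shows the standard UTF-16 mapping, not how the interpreter implements it.

```python
# Hypothetical helpers (not part of Python) showing the UTF-16 mapping
# between a non-BMP code point and its pair of 16-bit surrogates.

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10040)
print(hex(high), hex(low))                  # the two 16-bit units
print(hex(from_surrogate_pair(high, low)))  # back to the code point
```

Since a code point above U+FFFF needs only 20 bits after subtracting 0x10000, and each surrogate carries 10 of them, two 16-bit units always suffice.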
> I would much rather see a pair of new functions, wideord() and
> widechr() used for converting between surrogate pairs and numbers.

I guess if it were still 2001 and Python 2.2 was coming out I would
be in favor of this too. :-)

-- 
http://mail.python.org/mailman/listinfo/python-list