On Feb 21, 10:48 am, a...@pythoncraft.com (Aahz) wrote:
> In article <499f397c.7030...@v.loewis.de>,
> "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> >> Yes, I know that. But every concrete representation of a unicode string
> >> has to have an encoding associated with it, including unicode strings
> >> produced by the Python parser when it parses the ascii string "u'\xb5'"
> >>
> >> My question is: what is that encoding?
> >
> >The internal representation is either UTF-16, or UTF-32; which one is
> >a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
> countless threads about the distinction between UTF and UCS?
Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer to Unicode 1.1 and earlier, with no surrogates. We target Unicode 5.1.

If you naively encode UCS-2 as UTF-8 you really end up with CESU-8: you miss the step where you combine surrogate pairs (which only exist in UTF-16) into a single supplementary character. Lo and behold, that's actually what current Python does in some places. It's not pretty.

See bugs #3297 and #3672.
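
To make the missing step concrete, here is a rough sketch of the difference. It assumes a modern Python 3 (where str is always full Unicode and the "surrogatepass" error handler exists), not the 2.x builds under discussion, and the helper names are made up for illustration; it is not what the interpreter does internally.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def naive_utf8(text):
    """Encode each UTF-16 code unit on its own, skipping the combining step.

    For any character outside the BMP the result is CESU-8, not UTF-8.
    """
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass")
        for unit in utf16_code_units(text)
    )


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = utf16_code_units(clef)
print(hex(high), hex(low))                 # 0xd834 0xdd1e
print(hex(combine_surrogates(high, low)))  # 0x1d11e
print(clef.encode("utf-8").hex())          # f09d849e (4 bytes, real UTF-8)
print(naive_utf8(clef).hex())              # eda0b4edb49e (6 bytes, CESU-8)
```

For BMP characters such as u'\xb5' the two encodings agree; they only diverge once a surrogate pair is involved.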