Re: What encoding does u'...' syntax use?

Adam Olsen Sat, 21 Feb 2009 12:40:50 -0800

On Feb 21, 10:48 am, [email protected] (Aahz) wrote:
> In article <[email protected]>,
>
> =?ISO-8859-15?Q?=22Martin_v=2E_L=F6wis=22?=  <[email protected]> wrote:
> >> Yes, I know that.  But every concrete representation of a unicode string
> >> has to have an encoding associated with it, including unicode strings
> >> produced by the Python parser when it parses the ascii string "u'\xb5'"
>
> >> My question is: what is that encoding?
>
> >The internal representation is either UTF-16, or UTF-32; which one is
> >a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4?  Or am I misremembering the
> countless threads about the distinction between UTF and UCS?


Nope, that's partly mislabeling and partly a bug.  UCS-2/UCS-4 refer
to Unicode 1.1 and earlier, with no surrogates.  We target Unicode
5.1.

If you naively encode UCS-2 as UTF-8, you really end up with CESU-8:
you miss the step where you combine surrogate pairs (which only exist
in UTF-16) into a single supplementary character.  Lo and behold,
that's actually what current Python does in some places.  It's not
pretty.
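
To make the difference concrete, here is a rough sketch of both paths
for a single supplementary character.  It assumes a Python 3
interpreter (for chr() and bytes), and the helper names are made up
for illustration; this is not the code the interpreter itself uses.

```python
# Sketch only: hypothetical helpers, written for Python 3.

def to_surrogates(cp):
    """Split a supplementary code point (> 0xFFFF) into a UTF-16 surrogate pair."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

def combine_surrogates(hi, lo):
    """The step a naive encoder skips: merge a surrogate pair into one code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

def encode_3byte(unit):
    """Encode one 16-bit value with the 3-byte UTF-8 bit pattern (0x0800..0xFFFF)."""
    return bytes([0xE0 | (unit >> 12),
                  0x80 | ((unit >> 6) & 0x3F),
                  0x80 | (unit & 0x3F)])

cp = 0x1D11E                      # MUSICAL SYMBOL G CLEF, outside the BMP
hi, lo = to_surrogates(cp)        # what a 16-bit build stores internally

# Naive path: encode each surrogate on its own, which gives CESU-8 (6 bytes).
cesu8 = encode_3byte(hi) + encode_3byte(lo)

# Correct path: combine the pair first, then encode as UTF-8 (4 bytes).
utf8 = chr(combine_surrogates(hi, lo)).encode("utf-8")

print(cesu8.hex(), utf8.hex())
```

The two byte strings differ in both length and content, so a strict
UTF-8 decoder should reject the naive output rather than read it back
as the same character.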

See bugs #3297 and #3672.
--
http://mail.python.org/mailman/listinfo/python-list
