Re: unicode string alteration

BAvant Garde Thu, 12 Aug 2010 16:00:58 -0700

Thanks MRAB, 

I'll have to do some reading about Unicode surrogates. I also need to research 
which Python versions/platforms are narrow builds and which are wide. Much to 
learn here.


Thanks!  

--- On Thu, 8/12/10, MRAB <pyt...@mrabarnett.plus.com> wrote:

From: MRAB <pyt...@mrabarnett.plus.com>
Subject: Re: unicode string alteration
To: python-list@python.org
Date: Thursday, August 12, 2010, 12:31 PM

BAvant Garde wrote:
> HELP!!!
> I need help with a unicode issue that has me stumped. I must be doing 
> something wrong because I don't believe this condition would have slipped 
> through testing.
> 
> Wherever the string u'\udbff\udc00' occurs, u'\U0010fc00' or unichr(1113088) 
> is substituted and the file loses 1 character, resulting in all trailing 
> characters being shifted out of position. No other corrupt strings have been 
> detected.
> The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04 
> where the maximum ord # is 1114111 (wide Python build).
> Using Python 2.5.4 on Windows-ME where the maximum ord # is 65535 (narrow 
> Python build) the string u'\U0010fc00' also occurs and it "seems" that the 
> substitution takes place but no characters are lost and file sizes are ok. 
> Note that ord(u'\U0010fc00') causes the following error:
>     "TypeError: ord() expected a character, but string of length 2 found"
> The condition is otherwise invisible in 2.5.4 and is handled internally 
> without any apparent effect on processing with characters u'\udbff' and 
> u'\udc00' each being separately accessible.
> 
> The first part of the attachment repeats this email but also has examples and 
> illustrates other related oddities.
>    Any help would be greatly appreciated.
> 
It's not an error, it's a "surrogate pair". Surrogate pairs are part of
the Unicode specification.

Unicode codepoints go up to U+0010FFFF.

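If you're using 16 bits per codepoint, like in a narrow build of Python,
then the codepoints above U+FFFF _can't_ be represented directly, so
each one is represented by a pair of 16-bit values called a "surrogate
pair": a high surrogate in the range U+D800..U+DBFF followed by a low
surrogate in the range U+DC00..U+DFFF.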
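
For anyone who wants to see the arithmetic, here is a minimal sketch of how a
codepoint above U+FFFF is split into a surrogate pair. The helper name
to_surrogate_pair is made up for this example (it is not a Python built-in),
and it works on plain integers, so it should behave the same on narrow and
wide builds.

```python
def to_surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into (high, low) surrogate values."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x10FC00)
print("%04X %04X" % (high, low))        # DBFF DC00
```

Those two values are the u'\udbff' and u'\udc00' from the original post.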

If, on the other hand, you're using 32 bits per codepoint, like in a
wide build of Python, then the codepoints above U+FFFF _can_ be
represented directly, so surrogate pairs aren't needed and, indeed,
shouldn't be there.
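
A quick way to tell which kind of build you are running is sys.maxunicode,
which is the "maximum ord #" mentioned in the original post. The sketch below
is only an illustration; build_kind is a hypothetical helper name, and the
optional argument exists just so the check can be tried with either value.
Note that Python 3.3 and later no longer have the narrow/wide distinction and
always report 1114111.

```python
import sys


def build_kind(max_codepoint=None):
    """Return 'wide' or 'narrow' based on the largest supported codepoint."""
    if max_codepoint is None:
        max_codepoint = sys.maxunicode   # 65535 on narrow, 1114111 on wide
    return "wide" if max_codepoint > 0xFFFF else "narrow"


print(build_kind())
```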

What you're seeing in the wide build is Python replacing a surrogate
pair with the codepoint that it represents, which is actually the right
thing to do because, as I said, the surrogate pairs really shouldn't be
there.
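
To check the numbers from the original post, here is a small sketch of the
reverse calculation, combining a high and a low surrogate into the codepoint
they stand for. The function name from_surrogate_pair is invented for this
example, and it uses plain integers rather than unicode strings so that it
does not depend on the build.

```python
def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate value into one codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


codepoint = from_surrogate_pair(0xDBFF, 0xDC00)
print(codepoint)                 # 1113088
print("U+%06X" % codepoint)      # U+10FC00
```

Two 16-bit values become one codepoint, which is why the string ends up one
character shorter on the wide build.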
-- http://mail.python.org/mailman/listinfo/python-list
