Thanks MRAB, I'll have to do some reading about Unicode surrogates. I also need to research which Python versions/platforms are narrow builds and which are wide. Much to learn here.
Thanks!

--- On Thu, 8/12/10, MRAB <pyt...@mrabarnett.plus.com> wrote:

From: MRAB <pyt...@mrabarnett.plus.com>
Subject: Re: unicode string alteration
To: python-list@python.org
Date: Thursday, August 12, 2010, 12:31 PM

BAvant Garde wrote:
> HELP!!!
> I need help with a unicode issue that has me stumped. I must be doing
> something wrong because I don't believe this condition would have slipped
> thru testing.
>
> Wherever the string u'\udbff\udc00' occurs u'\U0010fc00' or unichr(1113088)
> is substituted and the file loses 1 character resulting in all trailing
> characters being shifted out of position. No other corrupt strings have been
> detected.
> The condition was noticed while testing in Python 2.6.5 on Ubuntu 10.04
> where the maximum ord # is 1114111 (wide Python build).
> Using Python 2.5.4 on Windows-ME where the maximum ord # is 65535 (narrow
> Python build) the string u'\U0010fc00' also occurs and it "seems" that the
> substitution takes place but no characters are lost and file sizes are ok.
> Note that ord(u'\U0010fc00') causes the following error:
>     "TypeError: ord() expected a character, but string of length 2 found"
> The condition is otherwise invisible in 2.5.4 and is handled internally
> without any apparent effect on processing with characters u'\udbff' and
> u'\udc00' each being separately accessible.
>
> The first part of the attachment repeats this email but also has examples and
> illustrates other related oddities.
> Any help would be greatly appreciated.
>
It's not an error, it's a "surrogate pair".

Surrogate pairs are part of the Unicode specification. Unicode codepoints go up to U+0010FFFF. If you're using 16 bits per codepoint, like in a narrow build of Python, then the codepoints above U+FFFF _can't_ be represented directly, so they are represented by a pair of codepoints called a "surrogate pair".
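
To make the arithmetic concrete, here is a minimal sketch of how a codepoint above U+FFFF splits into a surrogate pair under the UTF-16 rules. The helper name to_surrogate_pair is made up for this illustration and is not part of Python. It works on plain integers, so it should behave the same on narrow and wide builds.

```python
def to_surrogate_pair(codepoint):
    """Split a codepoint above U+FFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; not a standard library function.
    """
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10FC00)   # 0x10FC00 == 1113088
print("%04x %04x" % (high, low))          # prints: dbff dc00
```

This is the same pair, u'\udbff\udc00', that appears in the original report.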
If, on the other hand, you're using 32 bits per codepoint, like in a wide build of Python, then the codepoints above U+FFFF _can_ be represented directly, so surrogate pairs aren't needed, and, indeed, shouldn't be there.

What you're seeing in the wide build is Python replacing a surrogate pair with the codepoint that it represents, which is actually the right thing to do because, as I said, the surrogate pairs really shouldn't be there.
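
The reverse direction, which is effectively what the wide build does with u'\udbff\udc00', can be sketched the same way. The names from_surrogate_pair and code_units_needed are made up for this illustration. sys.maxunicode is the real attribute that reports 65535 on a narrow build and 1114111 on a wide one (Python 3.3 and later always report 1114111).

```python
import sys


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair into the codepoint it represents.

    Hypothetical helper for illustration; not a standard library function.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_units_needed(codepoint, narrow):
    """Return how many string items one codepoint occupies (hypothetical helper)."""
    return 2 if narrow and codepoint > 0xFFFF else 1


narrow = sys.maxunicode == 0xFFFF          # True only on a narrow build
codepoint = from_surrogate_pair(0xDBFF, 0xDC00)
print(hex(codepoint))                       # prints: 0x10fc00
print(code_units_needed(codepoint, narrow)) # 2 on a narrow build, 1 on a wide build
```

On a wide build the pair collapses into a single item, which would account for the string becoming one character shorter. On a narrow build it stays as two items, which is consistent with the ord() error quoted above.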
-- http://mail.python.org/mailman/listinfo/python-list