Re: Unicode surrogate pairs (Python 3.4)

Jon Ribbens Sun, 03 May 2015 08:37:07 -0700

On 2015-05-03, Chris Angelico <[email protected]> wrote:
> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
><[email protected]> wrote:
>> If I have a string containing surrogate pairs like this in Python 3.4:
>>
>>   "\udb40\udd9d"
>>
>> How do I convert it into the proper form:
>>
>>   "\U000E019D"
>>
>> ? The answer appears not to be "unicodedata.normalize".
>
> No, it's not, because Unicode normalization is a very specific thing.
> You're looking for a fix for some kind of encoding issue; Unicode
> normalization translates between combining characters and combined
> characters.
>
> You shouldn't even actually _have_ those in your string in the first
> place. How did you construct/receive that data? Ideally, catch it at
> that point, and deal with it there.


That would, unfortunately, be "tell the Unicode Consortium to format
their documents differently", which seems unlikely to happen. I'm
trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

> But if you absolutely have to convert the surrogates, it ought to be
> possible to do a sloppy UCS-2 conversion to bytes, then a proper
> UTF-16 decode on the result.

Python doesn't appear to have UCS-2 support, so I guess what you're
saying is that I have to write my own surrogate-decoder? This seems
a little surprising.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Unicode surrogate pairs (Python 3.4)

Reply via email to