On 2015-05-03 16:32, Jon Ribbens wrote:
On 2015-05-03, Chris Angelico <ros...@gmail.com> wrote:
On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens
<jon+use...@unequivocal.co.uk> wrote:
If I have a string containing surrogate pairs like this in Python 3.4:
"\udb40\udd9d"
How do I convert it into the proper form:
"\U000E019D"
? The answer appears not to be "unicodedata.normalize".
No, it's not, because Unicode normalization is a very specific thing.
You're looking for a fix for some kind of encoding issue; Unicode
normalization translates between combining characters and combined
characters.
You shouldn't even actually _have_ those in your string in the first
place. How did you construct/receive that data? Ideally, catch it at
that point, and deal with it there.
That would, unfortunately, be "tell the Unicode Consortium to format
their documents differently", which seems unlikely to happen. I'm
trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
That document looks like it's encoded in UTF-8.
But if you absolutely have to convert the surrogates, it ought to be
possible to do a sloppy UCS-2 conversion to bytes, then a proper
UTF-16 decode on the result.
Python doesn't appear to have UCS-2 support, so I guess what you're
saying is that I have to write my own surrogate-decoder? This seems
a little surprising.
--
https://mail.python.org/mailman/listinfo/python-list