On 2015-05-03, Chris Angelico <ros...@gmail.com> wrote: > On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens ><jon+use...@unequivocal.co.uk> wrote: >> If I have a string containing surrogate pairs like this in Python 3.4: >> >> "\udb40\udd9d" >> >> How do I convert it into the proper form: >> >> "\U000E019D" >> >> ? The answer appears not to be "unicodedata.normalize". > > No, it's not, because Unicode normalization is a very specific thing. > You're looking for a fix for some kind of encoding issue; Unicode > normalization translates between combining characters and combined > characters. > > You shouldn't even actually _have_ those in your string in the first > place. How did you construct/receive that data? Ideally, catch it at > that point, and deal with it there.
That would, unfortunately, be "tell the Unicode Consortium to format their documents differently", which seems unlikely to happen. I'm trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt > But if you absolutely have to convert the surrogates, it ought to be > possible to do a sloppy UCS-2 conversion to bytes, then a proper > UTF-16 decode on the result. Python doesn't appear to have UCS-2 support, so I guess what you're saying is that I have to write my own surrogate-decoder? This seems a little surprising. -- https://mail.python.org/mailman/listinfo/python-list