On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens <jon+use...@unequivocal.co.uk> wrote: > If I have a string containing surrogate pairs like this in Python 3.4: > > "\udb40\udd9d" > > How do I convert it into the proper form: > > "\U000E019D" > > ? The answer appears not to be "unicodedata.normalize".
No, it's not, because Unicode normalization is a very specific thing. You're looking for a fix for some kind of encoding issue; Unicode normalization translates between combining characters and combined characters. You shouldn't even actually _have_ those in your string in the first place. How did you construct/receive that data? Ideally, catch it at that point, and deal with it there. But if you absolutely have to convert the surrogates, it ought to be possible to do a sloppy UCS-2 conversion to bytes, then a proper UTF-16 decode on the result. ChrisA -- https://mail.python.org/mailman/listinfo/python-list