On 2015-05-03, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 2015-05-03 16:32, Jon Ribbens wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>
> That document looks like it's encoded in UTF-8.

It is. But it also, for reasons best known to the Unicode Consortium,
contains escape sequences of the form \uXXXX, which need to be parsed
into the appropriate character, and some of *those* are then halves of
surrogate pairs, which need to be combined further into a single code
point.
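
For what it's worth, here is a minimal sketch of that two-step conversion, assuming Python 3.4 or later and assuming the only escapes that matter are exactly four hex digits after a literal backslash-u. The names `decode_escapes` and `_ESCAPE` are made up for illustration and are not from any library. The second step borrows a general trick (a round trip through UTF-16 with the `surrogatepass` error handler) rather than anything specific to the IDNA test file.

```python
import re

# Hypothetical helper names; only the literal form \uXXXX (4 hex digits) is handled.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_escapes(text):
    """Turn literal \\uXXXX escapes into characters, merging surrogate pairs."""
    # Step 1: replace each escape with the single code unit it names.
    # A character outside the BMP is still two separate surrogates here.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16. Encoding writes each surrogate as
    # one 16-bit unit; decoding joins a high/low pair into one code point.
    # "surrogatepass" stops the encoder from rejecting the surrogates.
    raw = units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


if __name__ == "__main__":
    sample = r"a\uD83D\uDE00b"
    result = decode_escapes(sample)
    print(len(sample), len(result))
```

The file itself would still be opened as UTF-8 in the usual way, with something like this applied to each line (or field) after reading. Characters that are already present as real UTF-8 text pass through the round trip unchanged.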