On 2015-05-03 17:26, Jon Ribbens wrote:
> On 2015-05-03, MRAB <pyt...@mrabarnett.plus.com> wrote:
>> On 2015-05-03 16:32, Jon Ribbens wrote:
>>> That would, unfortunately, be "tell the Unicode Consortium to format
>>> their documents differently", which seems unlikely to happen. I'm
>>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>> That document looks like it's encoded in UTF-8.
> It is. But it also, for reasons best known to the Unicode Consortium,
> contains strings of the form \uXXXX which need to be parsed into the
> appropriate character, and some of *those* are then surrogate pairs,
> which need to be further converted.
Ah, so it's r"\udb40\udd9d". :-)
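For what it's worth, here is a minimal sketch of that two-step conversion in Python 3. It assumes the file has already been read as UTF-8 and that every escape has exactly four hex digits; the name decode_escapes is made up for illustration and is not part of any library:

```python
import re

# Matches one \uXXXX escape (a backslash, "u", then exactly four hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def decode_escapes(text):
    # Step 1: turn each escape into one code unit. Halves of a surrogate
    # pair come out as two separate (lone) surrogate characters.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16. "surrogatepass" allows the lone
    # surrogates to be encoded; the decoder then joins adjacent high/low
    # halves into a single character and leaves unpaired ones as they are.
    return units.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")
```

With that, decode_escapes(r"\udb40\udd9d") should come back as the single character U+E019D.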
There's also a mistake in this bit (it should say > 0xFFFF; 0x10FFFF is
the highest code point there is, so nothing can be above it):
"""
# Note that according to the \uXXXX escaping convention, a supplemental
character (> 0x10FFFF) is represented
# by a sequence of two surrogate characters: the first between D800 and
DBFF, and the second between DC00 and DFFF.
"""
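The arithmetic behind those two ranges can be sketched like this, assuming the standard UTF-16 rule that surrogate pairs cover code points 0x10000 through 0x10FFFF. The function names are hypothetical, chosen only for this example:

```python
def to_surrogates(code_point):
    # Only code points above the BMP (0x10000..0x10FFFF) need a pair.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)            # top 10 bits: D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)           # bottom 10 bits: DC00..DFFF
    return high, low

def from_surrogates(high, low):
    # Inverse of to_surrogates: rebuild the code point from the two halves.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For the pair above, from_surrogates(0xDB40, 0xDD9D) gives 0xE019D, and to_surrogates(0xE019D) gives the same two values back.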