On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens <jon+use...@unequivocal.co.uk> wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium to format
> their documents differently", which seems unlikely to happen. I'm
> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt

Ah, so what you _actually_ have is "\\udb40\\udd9d": the backslashes
are literally in your input, as escape sequences that spell out a
surrogate pair. I'm not sure what the best way to deal with that is...
it's a bit of a mess. You may find yourself needing to combine the
pairs manually, unless there's a way to ask Python to encode to a
pseudo-UCS-2 that allows surrogates. Some languages have sloppy
conversions available, but Python's are quite strict (which is
correct). Is there an error handler that can do this?

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
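
One possible answer to the error-handler question, offered as a minimal sketch rather than a tested recipe for IdnaTest.txt: the "surrogatepass" handler lets the UTF-16 codec encode lone surrogates, so a round trip through UTF-16 should join each high/low pair into a single code point. The helper name decode_u_escapes is hypothetical, and the sketch assumes the only escapes that matter are of the literal \uXXXX form (four hex digits) discussed above; any other escape syntax in the file is left untouched.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text):
    """Hypothetical helper: expand literal \\uXXXX escapes in text.

    Surrogate pairs written as two escapes are combined into one
    code point; all other characters pass through unchanged.
    """
    # Step 1: turn each escape into the code unit it names. This can
    # leave lone surrogates in the str, which Python permits.
    with_units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: "surrogatepass" encodes those surrogates as plain UTF-16
    # code units; the strict decode then reads each pair as one
    # character. An unpaired surrogate raises UnicodeDecodeError here.
    return with_units.encode("utf-16", "surrogatepass").decode("utf-16")


if __name__ == "__main__":
    result = decode_u_escapes("\\udb40\\udd9d")
    print(len(result), hex(ord(result)))
```

The regex substitution is used instead of the "unicode_escape" codec because that codec treats its input as Latin-1 and would mangle any real non-ASCII characters already present in the line.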