Re: Unicode surrogate pairs (Python 3.4)

Jon Ribbens Sun, 03 May 2015 09:37:53 -0700

On 2015-05-03, Chris Angelico <[email protected]> wrote:
> On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens
><[email protected]> wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>
> Ah, so what you _actually_ have is "\\udb40\\udd9d" - the backslashes
> are in your input.


Well, they were, but I already wrote code to convert them into the
strings I showed in my original post.
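
For anyone following along, that conversion step might look something
like the sketch below. It is not the actual code mentioned above; the
name unescape_u is made up for illustration, and it assumes every
escape is a backslash, a "u" and exactly four hex digits.

```python
import re

# Hypothetical helper: matches a literal backslash, "u", and four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def unescape_u(text: str) -> str:
    """Replace each literal \\uXXXX escape with the code unit it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

raw = r"\uDB40\uDD9D"        # 12 characters, as they would appear in the file
decoded = unescape_u(raw)    # 2 characters, both of them surrogates
```

The result is still two separate surrogate code points, not one
astral character.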

> I'm not sure what the best way to deal with that is... it's a bit of
> a mess. You may find yourself needing to do something manually,
> unless there's a way to ask Python to encode to pseudo-UCS-2 that
> allows surrogates. Some languages may have sloppy conversions
> available, but Python's seems to be quite strict (which is correct).
> Is there an errors handler that can do this?

I did some experimentation, and it looks like the answer is:

  "\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")
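
A slightly fuller sketch of the same round trip, in case it helps
anyone else. The function name join_surrogates is just an
illustration; the only real calls are str.encode and bytes.decode.
The "surrogatepass" handler lets the two surrogates through on the
way out, and the ordinary strict decode then reads them back as one
UTF-16 surrogate pair.

```python
def join_surrogates(text: str) -> str:
    """Combine high/low surrogate pairs in text into single code points."""
    # Encoding: "surrogatepass" writes the surrogates as raw UTF-16 code units.
    # Decoding: the default strict handler merges each valid pair.
    return text.encode("utf-16", "surrogatepass").decode("utf-16")

paired = "\udb40\udd9d"
joined = join_surrogates(paired)

assert len(paired) == 2          # two surrogate code points going in
assert len(joined) == 1          # one code point coming out
assert ord(joined) == 0xE019D
```

Because the decode step is strict, a lone surrogate with no partner
raises UnicodeDecodeError rather than passing through silently.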

Thanks for your help!
-- 
https://mail.python.org/mailman/listinfo/python-list
