Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Jon Ribbens
On 2015-05-03, MRAB wrote:
> There's also a mistake in this bit:
>
> """
> # Note that according to the \u escaping convention, a supplemental
> character (> 0x10) is represented
> # by a sequence of two surrogate characters: the first between D800 and
> DBFF, and the second between DC00
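
The surrogate ranges quoted above (first unit D800 to DBFF, second unit starting at DC00) can also be combined by hand. The following is only a sketch of the standard UTF-16 arithmetic, not code from the thread; the helper name `combine_surrogates` is hypothetical, and DFFF as the upper end of the second range is taken from the UTF-16 definition because the preview is cut off before it.

```python
import re

# High surrogate (D800-DBFF) immediately followed by a low surrogate (DC00-DFFF).
_SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")


def combine_surrogates(text):
    """Replace each surrogate pair in text with the code point it encodes.

    Hypothetical helper for illustration; unpaired surrogates are left as-is.
    """
    def _join(match):
        high, low = match.group()
        # Each surrogate carries 10 bits; the result is offset by 0x10000.
        code_point = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
        return chr(code_point)

    return _SURROGATE_PAIR.sub(_join, text)
```

For the pair discussed in this thread, 0xDB40 contributes 0x340 and 0xDD9D contributes 0x19D, so the result is 0x10000 + (0x340 << 10) + 0x19D = 0xE019D.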

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread MRAB
On 2015-05-03 17:26, Jon Ribbens wrote:
On 2015-05-03, MRAB wrote:
On 2015-05-03 16:32, Jon Ribbens wrote:
That would, unfortunately, be "tell the Unicode Consortium to format their documents differently", which seems unlikely to happen. I'm trying to read in: http://www.unicode.org/Public/idn

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Chris Angelico
On Mon, May 4, 2015 at 2:30 AM, Jon Ribbens wrote:
> I did some experimentation, and it looks like the answer is:
>
> "\udb40\udd9d".encode("utf16", "surrogatepass").decode("utf16")
>
> Thanks for your help!

Ha! That's the one. I went poking around but couldn't find the name for it. That's exac
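
The one-liner quoted in this message can be wrapped in a small function to make the round trip easier to check. This is a minimal sketch, assuming Python 3.4 or later (where the `surrogatepass` error handler is accepted by the UTF-16 codec); the function name `join_surrogate_pairs` is hypothetical and does not appear in the thread.

```python
def join_surrogate_pairs(text):
    """Merge surrogate pairs in a str into single code points.

    Hypothetical wrapper around the expression quoted in the thread.
    """
    # "surrogatepass" lets the encoder emit surrogate code points as raw
    # UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16", "surrogatepass")
    # A normal UTF-16 decode then reads each high/low pair as one character.
    return raw.decode("utf-16")


pair = "\udb40\udd9d"          # two surrogate code points, len(pair) == 2
joined = join_surrogate_pairs(pair)
```

Here `joined` compares equal to "\U000E019D" and has length 1. Note that the decode step is strict, so a string containing an unpaired surrogate should be expected to raise `UnicodeDecodeError` rather than pass through.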

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Jon Ribbens
On 2015-05-03, Chris Angelico wrote:
> On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/Id

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Jon Ribbens
On 2015-05-03, MRAB wrote:
> On 2015-05-03 16:32, Jon Ribbens wrote:
>> That would, unfortunately, be "tell the Unicode Consortium to format
>> their documents differently", which seems unlikely to happen. I'm
>> trying to read in: http://www.unicode.org/Public/idna/6.3.0/IdnaTest.txt
>>
> That do

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread MRAB
On 2015-05-03 16:32, Jon Ribbens wrote:
On 2015-05-03, Chris Angelico wrote:
On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens wrote:
If I have a string containing surrogate pairs like this in Python 3.4:

"\udb40\udd9d"

How do I convert it into the proper form:

"\U000E019D"

? The answer ap

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Chris Angelico
On Mon, May 4, 2015 at 1:32 AM, Jon Ribbens wrote:
>> You shouldn't even actually _have_ those in your string in the first
>> place. How did you construct/receive that data? Ideally, catch it at
>> that point, and deal with it there.
>
> That would, unfortunately, be "tell the Unicode Consortium t

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Marko Rauhamaa
Jon Ribbens:
> Python doesn't appear to have UCS-2 support, so I guess what you're
> saying is that I have to write my own surrogate-decoder? This seems a
> little surprising.

Try UTF-16.

Marko
--
https://mail.python.org/mailman/listinfo/python-list

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Jon Ribbens
On 2015-05-03, Chris Angelico wrote:
> On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens wrote:
>> If I have a string containing surrogate pairs like this in Python 3.4:
>>
>> "\udb40\udd9d"
>>
>> How do I convert it into the proper form:
>>
>> "\U000E019D"
>>
>> ? The answer appears not to be "u

Re: Unicode surrogate pairs (Python 3.4)

2015-05-03 Thread Chris Angelico
On Mon, May 4, 2015 at 12:40 AM, Jon Ribbens wrote:
> If I have a string containing surrogate pairs like this in Python 3.4:
>
> "\udb40\udd9d"
>
> How do I convert it into the proper form:
>
> "\U000E019D"
>
> ? The answer appears not to be "unicodedata.normalize".

No, it's not, because Unic