Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread Neil Cerutti
On 2013-10-09, Ned Batchelder wrote:
> On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote:
>> and what Unicode.org does not say is that these coding schemes
>> (like any coding scheme) should be used in an exclusive way.
>
> Can you clarify what you mean by "in an exclusive way"?
Ned, pay no attention

Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread Ned Batchelder
On 10/9/13 4:22 AM, wxjmfa...@gmail.com wrote:
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708
"All three encoding forms can be used to represent the full range of encoded characters in the Unicode Standard; ...

Re: Encoding of surrogate code points to UTF-8

2013-10-09 Thread wxjmfauth
On Wednesday, 9 October 2013 08:20:05 UTC+2, Steven D'Aprano wrote:
> http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf#G13708 "All three
> encoding forms can be used to represent the full range of encoded
> characters in the Unicode Standard; ... Each of the three Unicode

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
On Tue, 08 Oct 2013 21:28:25 -0400, Terry Reedy wrote:
> On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
>> On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
>>
>>> In any case, "\ud800\udc01" isn't a valid unicode string.
>>
>> I don't think this is correct. Can you show me where the standard

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Terry Reedy
On 10/8/2013 6:30 PM, Steven D'Aprano wrote:
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
In any case, "\ud800\udc01" isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says that Unicode strings[1] may not contain surrogates? I think tha

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
On Tue, 08 Oct 2013 15:14:33 +, Neil Cerutti wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string.
I don't think this is correct. Can you show me where the standard says that Unicode strings[1] may not contain surrogates? I think that is a critical point, and the FAQ conflates

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
On Tue, 08 Oct 2013 18:00:58 +0100, MRAB wrote:
> The only time you should get a surrogate pair in a Unicode string is in
> a narrow build, which doesn't exist in Python 3.3 and later.
Incorrect.
py> sys.version
'3.3.0rc3 (default, Sep 27 2012, 18:44:58) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)
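The rest of this message is cut off in the archive, so what follows is not the author's session. It is only a minimal sketch, assuming CPython 3.3 or later, of the point under dispute: a string literal can still hold surrogate code points written as escapes, and such a string is distinct from the single supplementary-plane character. The variable names are illustrative.

```python
import sys

# Assumption: CPython 3.3+, where there is no narrow/wide build split.
s = '\ud800\udc01'      # two surrogate code points, written as escapes
astral = '\U00010001'   # one supplementary-plane code point

print(sys.maxunicode == 0x10FFFF)   # full code point range is available
print(len(s), len(astral))          # the two strings differ in length
print([hex(ord(ch)) for ch in s])   # each surrogate is its own code point
print(s == astral)                  # the pair is not merged into U+10001
```

The escapes are stored as given; nothing in the string type combines them into one character.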

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Terry Reedy
On 10/8/2013 5:47 PM, Terry Reedy wrote:
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
But reading the previous entry in the FAQs:
http://www.unicode.org/faq/utf_bom.html#utf8-4
I interpret this as meaning that I should be able to encode valid pairs of surrogates.
It says you should be able

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Terry Reedy
On 10/8/2013 9:52 AM, Steven D'Aprano wrote:
I think this is a bug in Python's UTF-8 handling, but I'm not sure. If I've read the Unicode FAQs correctly, you cannot encode *lone* surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread wxjmfauth
>>> sys.version
'3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)]'
>>> '\ud800'.encode('utf-8')
Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
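For readers following along, here is a minimal sketch of the behaviour reported in this message, assuming CPython 3.3 or later. The `surrogatepass` error handler does not appear in the message itself. It is included only as one documented way to make the codec emit the three-byte form instead of raising, and the variable names are illustrative.

```python
# Assumes CPython 3.3+; variable names are illustrative only.
lone = '\ud800'  # a lone high-surrogate code point

# The strict UTF-8 codec refuses surrogate code points.
try:
    lone.encode('utf-8')
    rejected = False
    reason = None
except UnicodeEncodeError as exc:
    rejected = True
    reason = exc.reason  # the codec's stated reason for the refusal

# The 'surrogatepass' error handler (not part of the original post)
# emits the generalized three-byte form instead of raising.
forced = lone.encode('utf-8', 'surrogatepass')
round_trip = forced.decode('utf-8', 'surrogatepass')

print(rejected, reason)
print(forced, round_trip == lone)
```

The bytes produced this way are not well-formed UTF-8 in the Unicode Standard's sense, which is why the handler has to be requested explicitly on both encode and decode.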

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread MRAB
On 08/10/2013 16:23, Pete Forman wrote:
Steven D'Aprano writes:
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'utf-8' codec can't encode cha

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Neil Cerutti
On 2013-10-08, Neil Cerutti wrote:
> In any case, "\ud800\udc01" isn't a valid unicode string. In a
> perfect world it would automatically get converted to
> '\u00010001' without intervention.
This last paragraph is erroneous. I must have had a typo in my testing.
--
Neil Cerutti
--
https://ma

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Pete Forman
Steven D'Aprano writes:
> I think this is a bug in Python's UTF-8 handling, but I'm not sure.
[snip]
> py> s = '\ud800\udc01'
> py> s.encode('utf-8')
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 0:

Re: Encoding of surrogate code points to UTF-8

2013-10-08 Thread Neil Cerutti
On 2013-10-08, Steven D'Aprano wrote:
> py> c = '\N{LINEAR B SYLLABLE B038 E}'
> py> surr_pair = c.encode('utf-16be')
> py> print(surr_pair)
> b'\xd8\x00\xdc\x01'
>
> and then use those same values as the code points, I ought to be able to
> encode to UTF-8, as if it were the same \N{LINEAR B SYL
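The relationship between the UTF-16 code units quoted here and the character they stand for can be worked through by hand. The sketch below assumes CPython 3.3 or later. `combine_surrogates` is a hypothetical helper written for this illustration, not a standard-library function; it applies the usual UTF-16 pairing arithmetic.

```python
# Assumes CPython 3.3+; combine_surrogates is a hypothetical helper.
def combine_surrogates(high, low):
    """Map a UTF-16 surrogate pair to the scalar value it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a high/low surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

c = '\N{LINEAR B SYLLABLE B038 E}'       # a single code point
units = c.encode('utf-16be')             # two 16-bit code units
high = int.from_bytes(units[:2], 'big')  # first code unit
low = int.from_bytes(units[2:], 'big')   # second code unit

scalar = combine_surrogates(high, low)
print(hex(high), hex(low), hex(scalar))
print(chr(scalar) == c)                  # the pair stands for the character
print(c.encode('utf-8'))                 # UTF-8 encodes the scalar value
print('\ud800\udc01' == c)               # surrogate code points are not it
```

UTF-8 is computed from the scalar value, so the four-byte result comes from `c` itself. A string holding the two surrogate code points is a different string, and the strict codec rejects it.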

Encoding of surrogate code points to UTF-8

2013-10-08 Thread Steven D'Aprano
I think this is a bug in Python's UTF-8 handling, but I'm not sure. If I've read the Unicode FAQs correctly, you cannot encode *lone* surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
Tra