Re: Encoding of surrogate code points to UTF-8

Terry Reedy Tue, 08 Oct 2013 14:49:24 -0700

On 10/8/2013 9:52 AM, Steven D'Aprano wrote:

I think this is a bug in Python's UTF-8 handling, but I'm not sure.


If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:

http://www.unicode.org/faq/utf_bom.html#utf8-5

Sure enough, using Python 3.3:

py> surr = '\udc80'

I am pretty sure that if Python were being strict, that would raise anerror, as the result is not a valid unicode string. Allowing the aboveor not was debated and laxness was allowed for at least the followingpractical reasons.

1. Python itself uses the invalid surrogate codepoints forsurrogateescape error-handling.

http://www.python.org/dev/peps/pep-0383/

2. Invalid strings are needed for tests ;-)
-- like the one you do next.

3. Invalid strings may be needed for interfacing with other C APIs.

py> surr.encode('utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed

Default strict encoding (utf-8 or otherwise) will only encode validunicode strings. Encode invalid strings with surrogate codepoints withsurrogateescape error handling.

But reading the previous entry in the FAQs:

http://www.unicode.org/faq/utf_bom.html#utf8-4

I interpret this as meaning that I should be able to encode valid pairs
of surrogates.

It says you should be able to 'convert' them, and that the result forutf-8 encoding must be a single 4-bytes code for the correspondingsupplementary codepoint.

So if I find a code point that encodes to a surrogate pair
in UTF-16:

py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'

and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:

py> s = '\ud800\udc01'


This is now a string with two invalid codepoints instead of one ;-).
As above, it would be rejected if Python were being strict.

py> s.encode('utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed


Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points.


No, it is being too lax about allowing them at all.

I believe there is an issue on the tracker (maybe closed) about the docfor unicode escapes in string literals. Perhaps is should say moreclearly that inserting surrogates is allowed but results in an invalidstring that cannot be normally encoded.


--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list

Re: Encoding of surrogate code points to UTF-8

Reply via email to