Ezio Melotti <ezio.melo...@gmail.com> added the comment:

RFC 4627 doesn't say much about lone surrogates:
   A string is a sequence of zero or more Unicode characters [UNICODE].
   [...]

   All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".
   [...]

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".


Raymond> JSON is UTF-8 by definition and it is a useful feature that invalid 
UTF-8 won't load.

Even if the input strings are not encodable in UTF-8 because they contain lone 
surrogates, they can still be converted to an \uXXXX escape, and the resulting 
JSON document will be valid UTF-8.
AFAIK json always uses \uXXXX, so it doesn't produce invalid UTF-8 documents.

While decoding, both json.loads('"\xed\xa0\x80"') and json.loads('"\ud800"') 
result in u'\ud800', but the first is not a valid UTF-8 document because it 
contains an invalid UTF-8 byte sequence that represent a lone surrogate, 
whereas the second one contains only ASCII bytes and it's therefore valid.
Python 2.7 should probably reject '"\xed\xa0\x80"', but since its UTF-8 codec 
is somewhat permissive already, I'm not sure it makes much sense changing the 
behavior now.  Python 3 doesn't have this problem because it works only with 
unicode strings, so you can't pass invalid UTF-8 byte sequences.

OTOH the Unicode standard says that lone surrogates shouldn't be passed around, 
so we might decide to replace them with the replacement char U+FFFD, raise an 
error, or even provide a way to decide what should be done with them (something 
like the errors argument of codecs).

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11489>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to