Alexander Belopolsky <belopol...@users.sourceforge.net> added the comment:

> It appears this is an invalid unicode character.
> Shouldn't this be caught by decode("utf8")

It should, and it is in Python 3.x:

>>> b'\xed\xa8\x80'.decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte

Python 2.7 behavior seems to be a bug.

>>> '\xed\xa8\x80'.decode("utf8")
u'\uda00'

Note also the following difference:

In 3.x:

>>> b'\xed\xa8\x80'.decode("utf8", 'replace')
'��'

In 2.7:

>>> '\xed\xa8\x80'.decode("utf8", 'replace')
u'\uda00'

I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip 
just fine in 2.x and there is likely to be existing code that relies on this.
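
For code that depends on this round trip, a minimal sketch of how it could be requested explicitly in Python 3, assuming 3.1 or later where the "surrogatepass" error handler exists. The names raw and text are only illustrative:

```python
# The three bytes from the report; they encode the lone surrogate U+DA00.
raw = b"\xed\xa8\x80"

# Strict decoding rejects the sequence in Python 3.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" lets surrogate code points through in both directions,
# which resembles the lenient 2.7 behaviour shown above.
text = raw.decode("utf-8", "surrogatepass")
round_tripped = text.encode("utf-8", "surrogatepass")
```

Here text is the single code point U+DA00 and round_tripped equals raw, while the strict decode fails.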

>  Shouldn't anything generated by json.dumps be parsed by json.loads?

This on the other hand should probably be fixed by either rejecting lone 
surrogates in json.dumps or accepting them in json.loads or both.  The last 
alternative would be consistent with the common wisdom of being conservative in 
what you produce but liberal in what you accept.
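
A rough sketch of the "rejecting" option, written for Python 3 and not taken from any actual patch. dumps_strict is a hypothetical helper name; it assumes that a strict UTF-8 encode of the unescaped output is an acceptable way to detect lone surrogates:

```python
import json


def dumps_strict(obj):
    """Hypothetical helper: like json.dumps, but refuse lone surrogates."""
    # With ensure_ascii=False, surrogate code points are left unescaped in
    # the output string, so a strict UTF-8 encode raises UnicodeEncodeError
    # if any string (or key) in obj contains one.
    json.dumps(obj, ensure_ascii=False).encode("utf-8")
    return json.dumps(obj)


lone = "\uda00"

# Plain json.dumps escapes the surrogate instead of rejecting it.
escaped = json.dumps(lone)

try:
    dumps_strict({"key": lone})
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

Characters outside the BMP are not affected, since in a Python 3 str they are single code points and not surrogate pairs.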

----------
nosy: +belopolsky, haypo
versions: +Python 2.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11489>
_______________________________________