New submission from Walter Dörwald <wal...@livinglogic.de>:

The following code raises an exception with a misleading message:

>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

The cause of the exception is *not* an invalid continuation byte, but 
UTF-8-encoded surrogates. In fact, the 'surrogatepass' error handler decodes 
the same input without raising an exception:

>>> b'\xed\xa0\xbd\xed\xb3\x9e'.decode("utf-8", "surrogatepass")
'\ud83d\udcde'
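
For anyone reproducing this from a script rather than the prompt, here is a minimal sketch. The helper name `try_decode` is hypothetical and not part of any library, and the exact `reason` text may differ between Python versions, so the code prints it instead of assuming it.

```python
data = b'\xed\xa0\xbd\xed\xb3\x9e'

def try_decode(raw, errors="strict"):
    """Hypothetical helper: return (text, None) on success, (None, exc) on failure."""
    try:
        return raw.decode("utf-8", errors), None
    except UnicodeDecodeError as exc:
        return None, exc

# Strict decoding fails; the exception carries the reason and the byte range.
text, exc = try_decode(data)
print(exc.reason, exc.start, exc.end)

# 'surrogatepass' accepts the same bytes and yields two surrogate code points.
text2, exc2 = try_decode(data, "surrogatepass")
print(ascii(text2))  # ascii() avoids encoding errors when printing surrogates
```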

I would have expected an exception message like:

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-2: surrogates not allowed
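
As an illustration of how the proposed message maps onto the exception's fields, the sketch below constructs a `UnicodeDecodeError` by hand. It does not show what the decoder does today; it only shows that `start=0, end=3` (with `end` exclusive) renders as "position 0-2".

```python
data = b'\xed\xa0\xbd\xed\xb3\x9e'

# Arguments: encoding, object, start, end (exclusive), reason.
# The values are chosen here to match the message proposed above.
expected = UnicodeDecodeError("utf-8", data, 0, 3, "surrogates not allowed")

print(expected)
print(data[expected.start:expected.end])  # the three bytes of the first surrogate
```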

(Note that the input bytes are an improper UTF-8 encoding of U+1F4DE 
(TELEPHONE RECEIVER): each half of the surrogate pair is encoded separately.)
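
A short sketch of the relationship described in the note: decoding with 'surrogatepass' gives the surrogate pair, and a round trip through UTF-16 (one way to join a pair, not the only one) recombines it into the single code point U+1F4DE, which can then be encoded as proper UTF-8.

```python
import unicodedata

raw = b'\xed\xa0\xbd\xed\xb3\x9e'

# Step 1: let the surrogates through; the result is two separate code points.
pair = raw.decode("utf-8", "surrogatepass")

# Step 2: a UTF-16 round trip joins the high and low surrogate into one character.
combined = pair.encode("utf-16", "surrogatepass").decode("utf-16")

# Step 3: the well-formed UTF-8 encoding of that character is four bytes long.
proper = combined.encode("utf-8")

print(hex(ord(combined)), unicodedata.name(combined))
print(proper)
```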

----------
components: Unicode
messages: 327357
nosy: doerwalter, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Misleading error message in str.decode()
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34935>
_______________________________________