Chris Angelico added the comment:
Got around to tracking down where this is actually being done. It's in
Objects/stringlib/codecs.h and it looks to be a hot area for optimization. I
don't want to fiddle with it without knowing a lot about the performance
implications (UTF-8 encode/decode being
Ezio Melotti added the comment:
> Nice document. Is that actually how Python's decoder checks things?
Yes, Python follows the Unicode standard.
> * E0 followed by 80..9F: "non-shortest form"
> * ED followed by A0..BF: "surrogate"
> * F4 followed by 90..BF: "outside defined range"
If you get a
Chris Angelico added the comment:
Nice document. Is that actually how Python's decoder checks things? Does the
decoder have different definitions of "valid continuation byte" based on the
lead byte? If that's the case... well, ten out of ten for complying with the
spec, to be sure, but unfortu
Ezio Melotti added the comment:
Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93
of the book, or 40 of the pdf) shows that if the start byte is ED the
continuation byte must be in range 80..9F. This means that, in order to decode
a sequence starting with ED, you
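
A minimal sketch of the lead-byte-dependent check described above, assuming the second-byte ranges of Table 3-7 ("Well-Formed UTF-8 Byte Sequences"). The names second_byte_range and second_byte_ok are hypothetical, chosen for illustration; this is not the code in Objects/stringlib/codecs.h.

```python
# Illustrative only: second-byte ranges per lead byte, following Table 3-7
# of the Unicode standard. Function names are hypothetical.

# Lead bytes whose second byte is restricted beyond the usual 80..BF.
_SPECIAL = {
    0xE0: (0xA0, 0xBF),  # 80..9F would be a non-shortest form
    0xED: (0x80, 0x9F),  # A0..BF would encode a surrogate
    0xF0: (0x90, 0xBF),  # 80..8F would be a non-shortest form
    0xF4: (0x80, 0x8F),  # 90..BF would be outside the defined range
}


def second_byte_range(lead):
    """Return (low, high) allowed for the byte after `lead`, or None.

    None means `lead` does not start a multi-byte sequence: it is ASCII,
    a continuation byte, or a byte that never appears in UTF-8.
    """
    if lead in _SPECIAL:
        return _SPECIAL[lead]
    if 0xC2 <= lead <= 0xDF or 0xE1 <= lead <= 0xEF or 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    return None


def second_byte_ok(lead, second):
    """True if `second` may follow `lead` in well-formed UTF-8."""
    allowed = second_byte_range(lead)
    return allowed is not None and allowed[0] <= second <= allowed[1]


# The sequence from the original report starts with ED B4.
print(second_byte_ok(0xED, 0xB4))
print(second_byte_ok(0xED, 0x9F))
```

With these ranges, B4 is rejected after ED even though it falls in the generic continuation range 80..BF, which is consistent with the "invalid continuation byte" message in the report.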
Serhiy Storchaka added the comment:
The UTF-8 codec can't decode byte 0xed because 0xed is not a valid UTF-8 sequence and
the following byte is not an expected valid continuation byte.
The UTF-8 codec can produce three types of errors:
* "invalid start byte". When the byte is not a start byte of a UTF-8 sequence
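
A small sketch for inspecting which error the decoder reports for a given input, using the standard UnicodeDecodeError.reason attribute. The helper name decode_error_reason and the sample byte strings are assumptions made for illustration; the exact wording of each reason may vary between Python versions.

```python
def decode_error_reason(data):
    """Return the UTF-8 decoder's error reason for `data`, or None if it decodes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


samples = [
    b"\x80",          # a continuation byte where a start byte is expected
    b"\xed\xb4\x80",  # the sequence from the original report
    b"\xe2\x82",      # a three-byte sequence cut off before its last byte
    b"abc",           # plain ASCII, decodes without error
]

for sample in samples:
    print(sample, "->", decode_error_reason(sample))
```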
New submission from Chris Angelico:
>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
The actual problem here is that this byte sequence would decode to U