[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-04-27 Thread Chris Angelico
Chris Angelico added the comment: Got around to tracking down where this is actually being done. It's in Objects/stringlib/codecs.h and it looks to be a hot area for optimization. I don't want to fiddle with it without knowing a lot about the performance implications (UTF-8 encode/decode being

[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-13 Thread Ezio Melotti
Ezio Melotti added the comment:
> Nice document. Is that actually how Python's decoder checks things?
Yes, Python follows the Unicode standard.
> * E0 followed by 80..9F: "non-shortest form"
> * ED followed by A0..BF: "surrogate"
> * F4 followed by 90..BF: "outside defined range"
If you get a

[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-13 Thread Chris Angelico
Chris Angelico added the comment: Nice document. Is that actually how Python's decoder checks things? Does the decoder have different definitions of "valid continuation byte" based on the lead byte? If that's the case... well, ten out of ten for complying with the spec, to be sure, but unfortu

[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-13 Thread Ezio Melotti
Ezio Melotti added the comment: Table 3-7 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (page 93 of the book, or 40 of the pdf) shows that if the start byte is ED, the continuation byte must be in the range 80..9F. This means that, in order to decode a sequence starting with ED, you
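
The constraint described above can be sketched in Python. This is only an illustration: `second_byte_range` and `decodes` are hypothetical helpers, not part of CPython, and the ranges are transcribed from the well-formed UTF-8 byte sequence table in the Unicode standard. The last two lines compare the table against what the built-in decoder accepts.

```python
def second_byte_range(lead):
    """Return the (low, high) range allowed for the byte after `lead`.

    Hypothetical helper for illustration only. Returns None when `lead`
    does not start a multi-byte UTF-8 sequence.
    """
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)  # 80..9F would be a non-shortest form
    if lead == 0xED:
        return (0x80, 0x9F)  # A0..BF would encode a surrogate
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)  # 80..8F would be a non-shortest form
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)  # 90..BF would exceed U+10FFFF
    return None


def decodes(data):
    """Return True if the built-in strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


low, high = second_byte_range(0xED)
print(decodes(bytes([0xED, high, 0x80])))      # ED 9F 80: inside the range
print(decodes(bytes([0xED, high + 1, 0x80])))  # ED A0 80: just outside it
```
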

[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-08 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The UTF-8 codec can't decode byte 0xed because 0xed alone is not a valid UTF-8 sequence and the following byte is not the expected valid continuation byte. The UTF-8 codec can produce errors of three types:
* "invalid start byte". When the byte is not the start byte of a UTF-8 sequence
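
The error categories can be observed through the `reason` attribute of `UnicodeDecodeError`. The snippet above is cut off after the first category, so the other two messages checked here are an assumption based on what CPython 3 reports, not on the truncated text. `decode_error_reason` is a hypothetical helper written for this sketch.

```python
def decode_error_reason(data):
    """Return the strict UTF-8 decoder's error reason, or None on success.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


samples = [
    b"\x80",          # a continuation byte where a start byte is required
    b"\xed\xb4\x80",  # ED followed by a byte outside 80..9F
    b"\xe2\x82",      # a three-byte sequence cut short
    b"abc",           # valid input, no error
]
for sample in samples:
    print(sample, decode_error_reason(sample))
```
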

[issue23614] Opaque error message on UTF-8 decoding to surrogates

2015-03-08 Thread Chris Angelico
New submission from Chris Angelico:
>>> b"\xed\xb4\x80".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte
The actual problem here is that this byte sequence would decode to U
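
One way to see what the report describes is to decode the same bytes with the standard `surrogatepass` error handler, which lets surrogate code points through. This is a sketch for inspection only, not a suggested fix. The variable names are chosen for this example.

```python
data = b"\xed\xb4\x80"

# Strict decoding fails, as in the report.
strict_reason = None
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    strict_reason = exc.reason

# "surrogatepass" allows the decoder to produce a lone surrogate.
text = data.decode("utf-8", errors="surrogatepass")
code_point = ord(text)
is_surrogate = 0xD800 <= code_point <= 0xDFFF

print(strict_reason, hex(code_point), is_surrogate)
```
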