[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Marc-Andre Lemburg Thu, 01 Apr 2010 00:44:58 -0700

Marc-Andre Lemburg <m...@egenix.com> added the comment:

John Machin wrote:
> 
> John Machin <sjmac...@users.sourceforge.net> added the comment:
> 
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that 
> is not valid in the current state. I don't understand your explanation, 
> especially "does not have the high bit set". I think you mean "is a valid 
> starter byte". See example 3 below.


I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.

That said, I find the Unicode Consortium's solution a bit awkward.
In UTF-8, the first byte of a multi-byte sequence defines the number
of bytes that make up the sequence. If some of those bytes are invalid,
the whole sequence is invalid, and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
lead to better results: losing bytes from a stream is far less
likely than having a few bits flipped in the data.
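
To make the difference between the two approaches concrete, here is a
minimal Python 3 sketch. It is not the CPython decoder. The names
decode_replace and sequence_length and the policy labels "whole" and
"failing" are made up for illustration, and the "failing" branch only
checks that continuation bytes fall in 0x80..0xBF, which is simpler than
the full rules in the Unicode standard.

```python
REPLACEMENT = "\ufffd"


def sequence_length(lead):
    """Length announced by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_replace(data, policy):
    """Decode bytes as UTF-8, emitting U+FFFD for malformed input.

    policy == "whole":   replace every byte the lead byte announced.
    any other value:     replace only up to the first failing byte, then
                         resume decoding at that byte (simplified check).
    """
    out = []
    pos = 0
    while pos < len(data):
        n = sequence_length(data[pos])
        if n:
            try:
                # Strict decoding succeeds only for a complete, valid sequence.
                out.append(data[pos:pos + n].decode("utf-8"))
                pos += n
                continue
            except UnicodeDecodeError:
                pass
        if policy == "whole" and n:
            # Treat the whole announced sequence as lost.
            end = min(pos + n, len(data))
        else:
            # Skip the lead byte plus any continuation bytes that follow it,
            # stopping at the first byte that is not a continuation byte.
            end = pos + 1
            limit = min(pos + n, len(data))
            while end < limit and 0x80 <= data[end] <= 0xBF:
                end += 1
        out.append(REPLACEMENT)
        pos = end
    return "".join(out)


# 0xE1 announces three bytes, but the third byte is ASCII "A".
damaged = b"a\xe1\x80Ab"
print(ascii(decode_replace(damaged, "whole")))    # 'a\ufffdb'
print(ascii(decode_replace(damaged, "failing")))  # 'a\ufffdAb'
```

With the "whole" policy the "A" is swallowed together with the broken
sequence; with the "failing" policy it survives as a regular code point.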

----------
title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 -> 
str.decode('utf8',    'replace') -- conformance with Unicode 5.2.0

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
