Marc-Andre Lemburg <m...@egenix.com> added the comment:

John Machin wrote:
>
> John Machin <sjmac...@users.sourceforge.net> added the comment:
>
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that
> is not valid in the current state. I don't understand your explanation,
> especially "does not have the high bit set". I think you mean "is a valid
> starter byte". See example 3 below.

I just had a quick look at the code and saw that it tests for the high bit on the subsequent bytes. Looking closer, you're right and the situation is a bit more complex, but the solution still looks simple: only endinpos has to be adjusted more carefully, depending on what the various checks find.

That said, I find the Unicode Consortium's solution a bit awkward. In UTF-8, the first byte of a multi-byte sequence defines the number of bytes that make up the sequence. If some of those bytes are invalid, the whole sequence is invalid, and the fact that some of those bytes may be interpretable as regular code points does not necessarily give better results. The reason is that losing bytes from a stream is far less likely than having a few bits flipped in the data.

----------
title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 -> str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________
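
To make the two policies discussed in this comment concrete, here is a minimal Python 3 sketch. The helper names `expected_length` and `decode_whole_sequence` are hypothetical and not part of CPython. They model the "lead byte decides how much is consumed" view, and the result is compared with the built-in `'replace'` handler, which on current CPython 3 releases follows the Unicode recommendation of replacing only the invalid prefix and then resuming at the next byte.

```python
REPLACEMENT = "\ufffd"


def expected_length(lead):
    """Sequence length announced by a UTF-8 lead byte (1 for anything else)."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 1  # ASCII, stray continuation byte, or invalid lead byte


def decode_whole_sequence(data):
    """Hypothetical policy: a bad sequence swallows every byte its lead byte announced."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode("utf-8"))  # strict decode of one sequence
        except UnicodeDecodeError:
            out.append(REPLACEMENT)  # one marker for the whole announced span
        i += n
    return "".join(out)


# 0xF1 announces a 4-byte sequence; 0x80 is a valid continuation, "A" is not.
damaged = b"\xf1\x80AB"
whole = decode_whole_sequence(damaged)        # consumes F1 80 41 42 as one unit
builtin = damaged.decode("utf-8", "replace")  # replaces F1 80, then keeps "AB"
print(repr(whole), repr(builtin))
```

Under the whole-sequence policy the trailing "AB" is lost, which is harmless if the damage was a flipped bit inside a genuine 4-byte sequence but discards real text if bytes were dropped from the stream. The built-in handler makes the opposite trade-off.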