[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Marc-Andre Lemburg Sat, 03 Apr 2010 04:41:45 -0700

Marc-Andre Lemburg <m...@egenix.com> added the comment:

Ezio Melotti wrote:
> 
> Ezio Melotti <ezio.melo...@gmail.com> added the comment:
> 
> Here's a new patch. Should be complete but I want to test it some more before 
> committing.
> I decided to follow RFC 3629, putting 0 instead of 5/6 for bytes in range 
> F5-FD (we can always put them back in the unlikely case that the Unicode 
> Consortium changes its mind) and also for other invalid ranges (e.g. C0-C1). 
> This led to some simplification in the code.


Ok.
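
For readers following along, the lookup described in the quoted patch note can be sketched roughly as below. This is only an illustration of the RFC 3629 lead-byte ranges, not the actual table from the patch; the names utf8_sequence_length and UTF8_LENGTH_TABLE are made up for this example.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if the
    byte cannot start a UTF-8 sequence under RFC 3629.

    Hypothetical helper for illustration; not the codec's real code.
    """
    if lead_byte <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    # Everything else is marked 0: continuation bytes (80-BF),
    # C0-C1 (would only start overlong forms), and F5-FF
    # (would encode values above U+10FFFF or the old 5/6-byte forms).
    return 0


# A 256-entry table indexed by byte value, built from the helper above.
UTF8_LENGTH_TABLE = [utf8_sequence_length(b) for b in range(256)]
```

With a table like this, a decoder can treat any byte that maps to 0 as an immediate error instead of trying to consume a longer sequence.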

> I also found out that, according to RFC 3629, surrogates are considered 
> invalid and they can't be encoded/decoded, but the UTF-8 codec actually does 
> it. I included tests and a fix but left them commented out because this is 
> out of the scope of this patch, and it probably needs a discussion on 
> python-dev.

Right, but that idea is controversial. In Python we need to be able to
put those surrogate code points into source code (encoded as UTF-8) as
well as into pickle and marshal dumps of Unicode objects, so we can't
consider them invalid UTF-8.
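
As a point of comparison, the trade-off can be seen in current Python 3 releases, where the strict UTF-8 codec rejects lone surrogates and the "surrogatepass" error handler lets them through. The sketch below shows that later behaviour only; it does not describe the codec as it stood when this comment was written, and encodes_strictly is a name invented for the example.

```python
# A lone high surrogate, which is not a valid Unicode scalar value.
lone_surrogate = "\ud800"


def encodes_strictly(text: str) -> bool:
    """Return True if text can be encoded with the strict UTF-8 codec.

    Hypothetical helper for illustration.
    """
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# The strict codec refuses the surrogate, as RFC 3629 requires.
strict_ok = encodes_strictly(lone_surrogate)

# "surrogatepass" encodes it anyway, using the ordinary 3-byte pattern,
# and decodes it back to the same code point.
raw = lone_surrogate.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")
```

Here strict_ok is False, raw is the three bytes ED A0 80, and round_tripped equals the original string, so data that must carry surrogates can still be serialized by opting in explicitly.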

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________