[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Ezio Melotti Fri, 02 Sep 2011 17:28:11 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

Or they are still called UTF-8 but used in combination with different error 
handlers, like surrogateescape and surrogatepass.  The "plain" UTF-* codecs 
should produce data that can be used for "open interchange", rejecting all the 
invalid data, both during encoding and decoding.
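To make the distinction concrete, here is a minimal sketch of how the strict codec and the two error handlers differ. It assumes the behaviour of a current CPython 3 interpreter, which is not necessarily what the versions discussed in this issue do; the helper name strict_utf8_rejects is made up for illustration.

```python
# Sketch only: assumes current CPython 3 behaviour.
LONE_SURROGATE = "\ud800"


def strict_utf8_rejects(text):
    """Return True if the plain UTF-8 codec refuses to encode `text`."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# surrogatepass lets a lone surrogate through as a three-byte sequence.
passed = LONE_SURROGATE.encode("utf-8", "surrogatepass")

# surrogateescape maps each undecodable byte to a low surrogate
# (U+DC80..U+DCFF) and turns it back into the same byte on encoding.
escaped = b"\xff".decode("utf-8", "surrogateescape")
restored = escaped.encode("utf-8", "surrogateescape")
```

Only the strict call is meant for open interchange; the bytes produced with surrogatepass are not valid UTF-8 and a strict decoder will reject them.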


Chapter 3 of the Unicode Standard, definition D79, also says:
"""
To ensure that the mapping for a Unicode encoding form is one-to-one, all 
Unicode scalar values, including those corresponding to noncharacter code 
points and unassigned code points, must be mapped to unique code unit 
sequences. Note that this requirement does not extend to high-surrogate and 
low-surrogate code points, which are excluded by definition from the set of 
Unicode scalar values.
"""

and this seems to imply that the only unencodable code points are the 
non-scalar values, i.e. surrogates and code points > U+10FFFF.  Noncharacters 
thus shouldn't receive any special treatment (at least during encoding).
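As a rough illustration of that reading of D79, the sketch below compares "is a Unicode scalar value" with "can be encoded by the strict UTF-8 codec" for a few code points. It assumes current CPython 3 behaviour; is_scalar_value, utf8_encodable and SAMPLES are names made up for this example.

```python
# Sketch only: assumes current CPython 3 behaviour.
def is_scalar_value(cp):
    """True for code points in U+0000..U+10FFFF that are not surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def utf8_encodable(cp):
    """True if the strict UTF-8 codec can encode the code point `cp`."""
    try:
        chr(cp).encode("utf-8")
    except (ValueError, OverflowError, UnicodeEncodeError):
        # chr() rejects values above U+10FFFF; the codec rejects surrogates.
        return False
    return True


SAMPLES = [
    0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF,  # noncharacters
    0xD800, 0xDFFF,                    # surrogates
    0x110000,                          # beyond the code space
]

for cp in SAMPLES:
    print(f"U+{cp:04X} scalar={is_scalar_value(cp)} encodable={utf8_encodable(cp)}")
```

Under these assumptions the two predicates agree on every sample: the noncharacters encode like any other scalar value, and only the surrogates and the out-of-range value are refused.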

Tom, do you agree with this?  What does Perl do with them?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________