[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Ezio Melotti Fri, 02 Sep 2011 03:13:15 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

> To start with, no code point which when bitwise ANDed with 0xFFFE 
> returns 0xFFFE can ever appear in a valid UTF-* stream, but Python
> allows this without any error.


> That means that both 0xNN_FFFE and 0xNN_FFFF are illegal in all 
> planes, where NN is 00 through 10 in hex.  So that's 2 noncharacters
> times 17 planes = 34 code points illegal for interchange that Python 
> is passing through illegally.  

> The remaining 32 nonsurrogate code points illegal for open interchange
> are 0xFDD0 through 0xFDEF.  Those are not allowed either, but Python
> doesn't seem to care.
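
The set of noncharacters described in the quoted text can be sketched as a small predicate. This is only an illustration: the name `is_noncharacter` is hypothetical and not part of Python's standard library. It checks the two ranges mentioned above and confirms the total of 34 + 32 = 66 code points.

```python
def is_noncharacter(cp):
    """Return True if cp is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration; not a stdlib function.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF: 32 contiguous noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of each of the 17 planes: U+nnFFFE and U+nnFFFF.
    return (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
```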

It's not entirely clear to me what the UTF-8 codec is supposed to do with this.

For example U+FFFE is <EF BF BE> in UTF-8, and this is valid according to table 
3-7, Chapter 03[0]:
"""
Code points     1st byte  2nd byte  3rd byte
U+E000..U+FFFF  EE..EF    80..BF    80..BF
"""
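
As a quick check of the byte sequence above, the following sketch encodes U+FFFE and compares each byte against the ranges in the table row. It assumes a current Python 3 interpreter, which may not behave the same way as the builds discussed in this issue.

```python
# Encode the noncharacter U+FFFE with the built-in UTF-8 codec.
encoded = "\ufffe".encode("utf-8")
print(encoded)  # b'\xef\xbf\xbe'

# Compare each byte with the U+E000..U+FFFF row of table 3-7.
first, second, third = encoded
in_table_range = (
    0xEE <= first <= 0xEF
    and 0x80 <= second <= 0xBF
    and 0x80 <= third <= 0xBF
)
print(in_table_range)

# The same bytes decode back to the original code point.
decoded = encoded.decode("utf-8")
```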

Chapter 16, section 16.7 "Noncharacters" says[1]:
"""
Noncharacters are code points that are permanently reserved in the Unicode 
Standard for internal use. They are forbidden for use in open interchange of 
Unicode text data.
"""

and
"""
Applications are free to use any of these noncharacter code points internally 
but should never attempt to exchange them.
"""
These passages seem to suggest that encoding them is forbidden.


"""
If a noncharacter is received in open interchange, an application is not 
required to interpret it in any way. It is good practice, however, to recognize 
it as a noncharacter and to take appropriate action, such as replacing it with 
U+FFFD replacement character, to indicate the problem in the text. It is not 
recommended to simply delete noncharacter code points from such text, because 
of the potential security issues caused by deleting uninterpreted characters.
"""
Here, decoding seems to be allowed, possibly with a replacement (that would 
depend on the error handler used, though, so the default 'strict' would turn 
this into an error).
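
One possible reading of this, sketched as a wrapper around the existing codec rather than a change to the codec itself. The function name `decode_flagging_noncharacters` and its choice of `ValueError` are assumptions made for illustration; this is not how Python's UTF-8 codec behaves.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def _is_nonchar(ch):
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def decode_flagging_noncharacters(data, errors="strict"):
    """Decode UTF-8, then treat noncharacters according to `errors`.

    Hypothetical helper: 'strict' raises, 'replace' substitutes U+FFFD,
    and any other handler leaves noncharacters untouched.
    """
    text = data.decode("utf-8", errors)
    if errors == "strict":
        for index, ch in enumerate(text):
            if _is_nonchar(ch):
                raise ValueError(
                    "noncharacter U+%04X at index %d" % (ord(ch), index)
                )
        return text
    if errors == "replace":
        return "".join(REPLACEMENT if _is_nonchar(ch) else ch for ch in text)
    # Noncharacters are deliberately not deleted, following the quoted advice.
    return text
```

For example, `decode_flagging_noncharacters(b"a\xef\xbf\xbeb", "replace")` yields `"a\ufffdb"`, while the default `"strict"` raises on the same input.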


Chapter 03, after D14, says:
"""
In general, a conforming process may indicate the presence of a code point 
whose use has not been designated (for example, by showing a missing glyph in 
rendering or by signaling an appropriate error in a streaming protocol), even 
though it is forbidden by the standard from interpreting that code point as an 
abstract character.
"""

and in C7:
"""
If a noncharacter that does not have a specific internal use is unexpectedly 
encountered in processing, an implementation may signal an error or replace the 
noncharacter with U+FFFD replacement character. If the implementation chooses 
to replace, delete or ignore a noncharacter, such an action constitutes a 
modification in the interpretation of the text. In general, a noncharacter 
should be treated as an unassigned code point.
"""

This doesn't clearly say what the codec is supposed to do.
On one hand, it suggests that an error can be raised, i.e. the noncharacter 
is considered invalid, like out-of-range code points (>U+10FFFF) or lone 
surrogates. 
On the other hand, it says that noncharacters should be treated as unassigned 
code points, i.e. encoded/decoded normally, and table 3-7 doesn't list them as 
invalid.
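
The two possible treatments can be compared side by side. The sketch below assumes a current CPython 3 build, where a lone surrogate is rejected by the UTF-8 codec while a noncharacter is encoded normally; the interpreter versions discussed in this issue may differ. `try_encode` is a hypothetical helper.

```python
def try_encode(text):
    """Return the UTF-8 bytes for text, or None if the codec rejects it."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return None


# Noncharacter: treated like any other code point.
nonchar_result = try_encode("\ufffe")

# Lone surrogate: treated as invalid.
surrogate_result = try_encode("\ud800")

# Out-of-range code point: cannot even be constructed.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```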


[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
[1]: http://www.unicode.org/versions/Unicode6.0.0/ch16.pdf

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________