[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Tom Christiansen Sun, 14 Aug 2011 09:56:11 -0700

Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote
   on Sun, 14 Aug 2011 07:15:09 -0000:


>> Unicode says you can't put surrogates or noncharacters in a
>> UTF-anything stream.  It's a bug to do so and pretend it's a
>> UTF-whatever.

> The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> codec was following RFC 2279, it was producing valid UTF-8.  With RFC
> 3629 a number of things changed in a non-backward compatible way.
> Therefore we couldn't just change the behavior of the UTF-8 codec nor
> rename it to something else in Python 2.  We had to wait till Python 3
> in order to fix it.

I'm a bit confused on this.  You no longer fix bugs in Python 2?

I've dug out the references that state that you are not allowed to do things
the way you are doing them.  This is from the published Unicode Standard,
version 6.0.0, chapter 3, Conformance.  It is a very important chapter.

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Python is in violation of that published Standard by interpreting
noncharacter code points as abstract characters and tolerating them in
character encoding forms like UTF-8 or UTF-16.  The excerpt below explains
that conformant processes are forbidden from doing this.

    Code Points Unassigned to Abstract Characters

     C1 A process shall not interpret a high-surrogate code point or a
         low-surrogate code point as an abstract character.
       · The high-surrogate and low-surrogate code points are designated for
         surrogate code units in the UTF-16 character encoding form. They are
         unassigned to any abstract character.

==>  C2 A process shall not interpret a noncharacter code point as an
         abstract character.
       · The noncharacter code points may be used internally, such as for
         sentinel values or delimiters, but should not be exchanged publicly.

     C3 A process shall not interpret an unassigned code point as an abstract
         character.
       · This clause does not preclude the assignment of certain generic
         semantics to unassigned code points (for example, rendering with a
         glyph to indicate the position within a character block) that allow
         for graceful behavior in the presence of code points that are
         outside a supported subset.
       · Unassigned code points may have default property values. (See D26.)
       · Code points whose use has not yet been designated may be assigned to
         abstract characters in future versions of the standard. Because of
         this fact, due care in the handling of generic semantics for such
         code points is likely to provide better robustness for
         implementations that may encounter data based on future versions of
         the standard.

Next we have exactly how something you call UTF-{8,16,32} must be formed.
*This* is the Standard against which these things are measured; it is not the
RFC.

You are of course perfectly free to say you conform to this and that RFC, but
you must not say you conform to the Unicode Standard when you don't.  These
are different things.  I feel it does users a grave disservice to ignore the
Unicode Standard in this, and sheer casuistry to rely on an RFC definition
while ignoring the Unicode Standard whence it originated, because this borders
on being intentionally misleading.

    Character Encoding Forms

     C8 When a process interprets a code unit sequence which purports to be
         in a Unicode character encoding form, it shall interpret that code
         unit sequence according to the corresponding code point sequence.
==>    · The specification of the code unit sequences for UTF-8 is given in D92.
       · The specification of the code unit sequences for UTF-16 is given in
         D91.
       · The specification of the code unit sequences for UTF-32 is given in
         D90.

     C9 When a process generates a code unit sequence which purports to be in
         a Unicode character encoding form, it shall not emit ill-formed code
         unit sequences.
       · The definition of each Unicode character encoding form specifies the
         ill-formed code unit sequences in the character encoding form. For
         example, the definition of UTF-8 (D92) specifies that code unit
         sequences such as <C0 AF> are ill-formed.

==> C10 When a process interprets a code unit sequence which purports to be
         in a Unicode character encoding form, it shall treat ill-formed code
         unit sequences as an error condition and shall not interpret such
         sequences as characters.
       · For example, in UTF-8 every code unit of the form 110xxxxx₂ must be
         followed by a code unit of the form 10xxxxxx₂. A sequence such as
         110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated. When
         faced with this ill-formed code unit sequence while transforming or
         interpreting text, a conformant process must treat the first code
         unit 110xxxxx₂ as an illegally terminated code unit sequence--for
         example, by signaling an error, filtering the code unit out, or
         representing the code unit with a marker such as U+FFFD replacement
         character.
       · Conformant processes cannot interpret ill-formed code unit
         sequences. However, the conformance clauses do not prevent processes
         from operating on code unit sequences that do not purport to be in a
         Unicode character encoding form. For example, for performance
         reasons a low-level string operation may simply operate directly on
         code units, without interpreting them as characters. See,
         especially, the discussion under D89.
       · Utility programs are not prevented from operating on "mangled" text.
         For example, a UTF-8 file could have had CRLF sequences introduced
         at every 80 bytes by a bad mailer program. This could result in some
         UTF-8 byte sequences being interrupted by CRLFs, producing illegal
         byte sequences. This mangled text is no longer UTF-8. It is
         permissible for a conformant program to repair such text,
         recognizing that the mangled text was originally well-formed UTF-8
         byte sequences. However, such repair of mangled data is a special
         case, and it must not be used in circumstances where it would cause
         security problems. There are important security issues associated
         with encoding conversion, especially with the conversion of
         malformed text. For more information, see Unicode Technical Report
         #36, "Unicode Security Considerations."
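
To make C9 and C10 concrete, here is a minimal Python 3 sketch that feeds the
two ill-formed sequences mentioned in the excerpt, <C0 AF> and a 110xxxxx lead
byte followed by a 0xxxxxxx byte, to the built-in "utf-8" codec.  The "strict"
and "replace" error handlers are standard; the helper names decode_strict and
decode_with_marker are made up for this illustration, and the behavior shown
is that of a current Python 3, not of the Python 2 codec under discussion.

```python
def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on ill-formed input (C10)."""
    return data.decode("utf-8", errors="strict")


def decode_with_marker(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for bytes that cannot be decoded."""
    return data.decode("utf-8", errors="replace")


# A 110xxxxx lead byte (0xC3) followed by a 0xxxxxxx byte (0x28, "(").
TRUNCATED_PAIR = b"\xc3("
# The <C0 AF> sequence that D92 names as ill-formed.
C0_AF = b"\xc0\xaf"

for sample in (TRUNCATED_PAIR, C0_AF):
    try:
        decode_strict(sample)
    except UnicodeDecodeError as exc:
        print(sample, "rejected:", exc.reason)
    # The "replace" handler marks the bad bytes instead of raising.
    print(sample, "->", ascii(decode_with_marker(sample)))
```

Signaling an error and substituting U+FFFD are two of the three reactions the
C10 note lists; which one is appropriate depends on the application.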

Here is the part that explains why Python narrow builds are actually UTF-16,
not UCS-2, and why its documentation needs to be updated:

    D89 In a Unicode encoding form: A Unicode string is said to be in a
           particular Unicode encoding form if and only if it consists of a
           well-formed Unicode code unit sequence of that Unicode encoding
           form.
        · A Unicode string consisting of a well-formed UTF-8 code unit
           sequence is said to be in UTF-8. Such a Unicode string is referred
           to as a valid UTF-8 string, or a UTF-8 string for short.
        · A Unicode string consisting of a well-formed UTF-16 code unit
           sequence is said to be in UTF-16. Such a Unicode string is
           referred to as a valid UTF-16 string, or a UTF-16 string for short.
        · A Unicode string consisting of a well-formed UTF-32 code unit
           sequence is said to be in UTF-32. Such a Unicode string is
           referred to as a valid UTF-32 string, or a UTF-32 string for short.

==> Unicode strings need not contain well-formed code unit sequences under
    all conditions. This is equivalent to saying that a particular Unicode
    string need not be in a Unicode encoding form.

        · For example, it is perfectly reasonable to talk about an operation
           that takes the two Unicode 16-bit strings, <004D D800> and
           <DF02 004D>, each of which contains an ill-formed UTF-16 code unit
           sequence, and concatenates them to form another Unicode string
           <004D D800 DF02 004D>, which contains a well-formed UTF-16 code
           unit sequence. The first two Unicode strings are not in UTF-16,
           but the resultant Unicode string is.
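
The concatenation example under D89 can be replayed directly on lists of
16-bit code units.  What follows is only a sketch: is_well_formed_utf16 and
decode_utf16 are hypothetical helper names, and the check covers nothing
beyond surrogate pairing, which is all that UTF-16 well-formedness requires
of 16-bit code units.

```python
def is_well_formed_utf16(units):
    """Return True if every surrogate code unit is part of a proper pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return False
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        i += 1
    return True


def decode_utf16(units):
    """Decode a list of 16-bit code units with Python's strict UTF-16 codec."""
    raw = b"".join(unit.to_bytes(2, "little") for unit in units)
    return raw.decode("utf-16-le")


left = [0x004D, 0xD800]    # <004D D800>: ends in an unpaired high surrogate
right = [0xDF02, 0x004D]   # <DF02 004D>: starts with an unpaired low surrogate
joined = left + right      # <004D D800 DF02 004D>

print(is_well_formed_utf16(left))    # False
print(is_well_formed_utf16(right))   # False
print(is_well_formed_utf16(joined))  # True
print(ascii(decode_utf16(joined)))   # 'M\U00010302M'
```

Neither half is in UTF-16 on its own, yet the buffer built from them is, which
is why the pieces must remain representable in memory.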

    [...]

     D14 Noncharacter: A code point that is permanently reserved for internal
           use and that should never be interchanged. Noncharacters consist
           of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and
           the values U+FDD0..U+FDEF.
         · For more information, see Section 16.7, Noncharacters.
         · These code points are permanently reserved as noncharacters.

     D15 Reserved code point: Any code point of the Unicode Standard that is
           reserved for future assignment. Also known as an unassigned code
           point.
         · Surrogate code points and noncharacters are considered assigned
           code points, but not assigned characters.
         · For a summary classification of reserved and other types of code
           points, see Table 2-3.

    In general, a conforming process may indicate the presence of a code
    point whose use has not been designated (for example, by showing a
    missing glyph in rendering or by signaling an appropriate error in a
    streaming protocol), even though it is forbidden by the standard from
    interpreting that code point as an abstract character.
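
D14 is compact enough to enumerate.  The sketch below lists the noncharacters
exactly as the definition describes them (U+nFFFE and U+nFFFF for n from 0 to
hexadecimal 10, plus U+FDD0..U+FDEF) and separates them from the surrogates
covered by C1.  The names noncharacters, NONCHARACTERS and classify are
illustrative only and come from neither the Standard nor Python.

```python
def noncharacters():
    """Yield the noncharacter code points listed in D14."""
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    yield from range(0xFDD0, 0xFDEF + 1)
    # U+nFFFE and U+nFFFF for n from 0 to 0x10 (17 planes, 34 code points).
    for plane in range(0x10 + 1):
        yield (plane << 16) | 0xFFFE
        yield (plane << 16) | 0xFFFF


NONCHARACTERS = frozenset(noncharacters())


def classify(code_point):
    """Return a coarse label for a code point, following C1, C2 and D14."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        return "surrogate"
    if code_point in NONCHARACTERS:
        return "noncharacter"
    return "other"


print(len(NONCHARACTERS))  # 66
for cp in (0x0041, 0xD800, 0xFDD0, 0xFFFE, 0x10FFFF):
    print(f"U+{cp:04X}", classify(cp))
```

The "other" label deliberately lumps assigned characters together with the
reserved code points of D15; telling those apart needs the character database
for a specific version of the Standard.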

Here's how I read all that.

The noncharacters and the unpaired surrogates are illegal for interchange, and
their presence in a UTF means that that UTF is not conformant to the
requirements for what a UTF shall contain.  Nonetheless, internally it is
necessary that all code points, even noncharacter code points and surrogates,
be representable, and doing so does not mean that you are no longer in that
encoding form.  However, you must not allow such things into a UTF stream,
because doing so means that that stream is no longer a UTF stream.

That's why I say that you are out of conformance by having encoders and
decoders of UTF streams tolerate noncharacters.  You are not allowed to call
something a UTF and do non-UTF things with it, because this is in violation of
conformance requirement C2.  Therefore you must either (1) change what you
call the thing that does the nonconforming thing, or (2) change it to no
longer do the nonconforming thing.  If you do neither, then Python no longer
conforms to the formal requirements for handling such things as these are
defined by the Unicode Standard, and therefore that version of Python is no
longer conformant to the version of the Unicode Standard that it purports
conformance to.  And yes, that's a long way of saying it's lying.
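
As a rough illustration of option (2), here is a sketch of an encoder wrapper
that refuses surrogates and noncharacters before any bytes leave the process.
It assumes a current Python 3, where str is indexed by code point; the names
find_forbidden and encode_for_interchange are hypothetical, and this shows one
way to apply the reading above, not how Python's own codecs behave.

```python
def find_forbidden(text):
    """Return (index, kind) of the first surrogate or noncharacter, or None."""
    for index, char in enumerate(text):
        cp = ord(char)
        if 0xD800 <= cp <= 0xDFFF:
            return index, "surrogate"
        # U+FDD0..U+FDEF, or the last two code points of any plane.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return index, "noncharacter"
    return None


def encode_for_interchange(text, encoding="utf-8"):
    """Encode text, rejecting code points deemed illegal for interchange."""
    problem = find_forbidden(text)
    if problem is not None:
        index, kind = problem
        raise ValueError(
            f"{kind} U+{ord(text[index]):04X} at index {index} "
            "must not be written to a UTF stream"
        )
    return text.encode(encoding)


print(encode_for_interchange("caf\u00e9"))  # b'caf\xc3\xa9'
try:
    encode_for_interchange("bad\uffff")
except ValueError as exc:
    print("refused:", exc)
```

Keeping the check outside the codec leaves in-memory strings free to hold any
code point, which matches the distinction drawn above between internal
representation and interchange.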

It's also why having noncharacters and surrogates in memory does *not*
suddenly mean that they are not stored in a UTF, because you have to be able
to do that to build up buffers per the concatenation example under definition
D89.  Therefore, Python uses UTF-16 internally and should not say it uses
UCS-2, because that is inaccurate and incorrect; in short, it's wrong.  That
doesn't help anybody.

At least, that's how I read the Unicode Standard.  Perhaps a more careful
reading than mine would admit alternate interpretations.  If you have not
reread its Chapter 3 in its entirety of late, you probably want to do so.
There is quite a bit of material there that is fundamental to any process that
claims to be conformant with the Unicode Standard.

I hope that makes sense.  These things can be extremely difficult to read,
for they are subtle and quick to anger. :)

--tom

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________