Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote
   on Sun, 14 Aug 2011 07:15:09 -0000:

>> Unicode says you can't put surrogates or noncharacters in a
>> UTF-anything stream. It's a bug to do so and pretend it's a
>> UTF-whatever.

> The UTF-8 codec described by RFC 2279 didn't say so, so, since our
> codec was following RFC 2279, it was producing valid UTF-8. With RFC
> 3629 a number of things changed in a non-backward compatible way.
> Therefore we couldn't just change the behavior of the UTF-8 codec nor
> rename it to something else in Python 2. We had to wait till Python 3
> in order to fix it.

I'm a bit confused on this. You no longer fix bugs in Python 2?

I've dug out the references that state that you are not allowed to do
things the way you are doing them. This is from the published Unicode
Standard version 6.0.0, chapter 3, Conformance. It is a very important
chapter.

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Python is in violation of that published Standard by interpreting
noncharacter code points as abstract characters and tolerating them in
character encoding forms like UTF-8 or UTF-16. This explains that
conformant processes are forbidden from doing this.

    Code Points Unassigned to Abstract Characters

    C1  A process shall not interpret a high-surrogate code point or a
        low-surrogate code point as an abstract character.

        · The high-surrogate and low-surrogate code points are
          designated for surrogate code units in the UTF-16 character
          encoding form. They are unassigned to any abstract character.

==> C2  A process shall not interpret a noncharacter code point as an
        abstract character.

        · The noncharacter code points may be used internally, such as
          for sentinel values or delimiters, but should not be
          exchanged publicly.

    C3  A process shall not interpret an unassigned code point as an
        abstract character.

        · This clause does not preclude the assignment of certain
          generic semantics to unassigned code points (for example,
          rendering with a glyph to indicate the position within a
          character block) that allow for graceful behavior in the
          presence of code points that are outside a supported subset.

        · Unassigned code points may have default property values.
          (See D26.)

        · Code points whose use has not yet been designated may be
          assigned to abstract characters in future versions of the
          standard. Because of this fact, due care in the handling of
          generic semantics for such code points is likely to provide
          better robustness for implementations that may encounter
          data based on future versions of the standard.

Next we have exactly how something you call UTF-{8,16,32} must be
formed. *This* is the Standard against which these things are measured;
it is not the RFC.

You are of course perfectly free to say you conform to this and that
RFC, but you must not say you conform to the Unicode Standard when you
don't. These are different things. I feel it does users a grave
disservice to ignore the Unicode Standard in this, and sheer casuistry
to rely on an RFC definition while ignoring the Unicode Standard whence
it originated, because this borders on being intentionally misleading.

    Character Encoding Forms

    C8  When a process interprets a code unit sequence which purports
        to be in a Unicode character encoding form, it shall interpret
        that code unit sequence according to the corresponding code
        point sequence.

==>     · The specification of the code unit sequences for UTF-8 is
          given in D92.

        · The specification of the code unit sequences for UTF-16 is
          given in D91.

        · The specification of the code unit sequences for UTF-32 is
          given in D90.

    C9  When a process generates a code unit sequence which purports
        to be in a Unicode character encoding form, it shall not emit
        ill-formed code unit sequences.

        · The definition of each Unicode character encoding form
          specifies the ill-formed code unit sequences in the character
          encoding form. For example, the definition of UTF-8 (D92)
          specifies that code unit sequences such as <C0 AF> are
          ill-formed.

==> C10 When a process interprets a code unit sequence which purports
        to be in a Unicode character encoding form, it shall treat
        ill-formed code unit sequences as an error condition and shall
        not interpret such sequences as characters.

        · For example, in UTF-8 every code unit of the form 110xxxxx₂
          must be followed by a code unit of the form 10xxxxxx₂. A
          sequence such as 110xxxxx₂ 0xxxxxxx₂ is ill-formed and must
          never be generated. When faced with this ill-formed code unit
          sequence while transforming or interpreting text, a
          conformant process must treat the first code unit 110xxxxx₂
          as an illegally terminated code unit sequence--for example,
          by signaling an error, filtering the code unit out, or
          representing the code unit with a marker such as U+FFFD
          replacement character.

        · Conformant processes cannot interpret ill-formed code unit
          sequences. However, the conformance clauses do not prevent
          processes from operating on code unit sequences that do not
          purport to be in a Unicode character encoding form. For
          example, for performance reasons a low-level string operation
          may simply operate directly on code units, without
          interpreting them as characters. See, especially, the
          discussion under D89.

        · Utility programs are not prevented from operating on
          "mangled" text. For example, a UTF-8 file could have had CRLF
          sequences introduced at every 80 bytes by a bad mailer
          program. This could result in some UTF-8 byte sequences being
          interrupted by CRLFs, producing illegal byte sequences. This
          mangled text is no longer UTF-8. It is permissible for a
          conformant program to repair such text, recognizing that the
          mangled text was originally well-formed UTF-8 byte sequences.
          However, such repair of mangled data is a special case, and
          it must not be used in circumstances where it would cause
          security problems. There are important security issues
          associated with encoding conversion, especially with the
          conversion of malformed text. For more information, see
          Unicode Technical Report #36, "Unicode Security
          Considerations."

Here is the part that explains why Python narrow builds are actually
UTF-16, not UCS-2, and why its documentation needs to be updated:

    D89 In a Unicode encoding form: A Unicode string is said to be in a
        particular Unicode encoding form if and only if it consists of
        a well-formed Unicode code unit sequence of that Unicode
        encoding form.

        · A Unicode string consisting of a well-formed UTF-8 code unit
          sequence is said to be in UTF-8. Such a Unicode string is
          referred to as a valid UTF-8 string, or a UTF-8 string for
          short.

        · A Unicode string consisting of a well-formed UTF-16 code unit
          sequence is said to be in UTF-16. Such a Unicode string is
          referred to as a valid UTF-16 string, or a UTF-16 string for
          short.

        · A Unicode string consisting of a well-formed UTF-32 code unit
          sequence is said to be in UTF-32. Such a Unicode string is
          referred to as a valid UTF-32 string, or a UTF-32 string for
          short.

==>     Unicode strings need not contain well-formed code unit
        sequences under all conditions. This is equivalent to saying
        that a particular Unicode string need not be in a Unicode
        encoding form.

        · For example, it is perfectly reasonable to talk about an
          operation that takes the two Unicode 16-bit strings,
          <004D D800> and <DF02 004D>, each of which contains an
          ill-formed UTF-16 code unit sequence, and concatenates them
          to form another Unicode string <004D D800 DF02 004D>, which
          contains a well-formed UTF-16 code unit sequence. The first
          two Unicode strings are not in UTF-16, but the resultant
          Unicode string is.

    [...]

    D14 Noncharacter: A code point that is permanently reserved for
        internal use and that should never be interchanged.
        Noncharacters consist of the values U+nFFFE and U+nFFFF (where
        n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.

        · For more information, see Section 16.7, Noncharacters.

        · These code points are permanently reserved as noncharacters.

    D15 Reserved code point: Any code point of the Unicode Standard
        that is reserved for future assignment. Also known as an
        unassigned code point.

        · Surrogate code points and noncharacters are considered
          assigned code points, but not assigned characters.

        · For a summary classification of reserved and other types of
          code points, see Table 2-3.

    In general, a conforming process may indicate the presence of a
    code point whose use has not been designated (for example, by
    showing a missing glyph in rendering or by signaling an appropriate
    error in a streaming protocol), even though it is forbidden by the
    standard from interpreting that code point as an abstract
    character.

Here's how I read all that. The noncharacters and the unpaired
surrogates are illegal for interchange, and their presence in a UTF
means that that UTF is not conformant to the requirements for what a
UTF shall contain. Nonetheless, internally it is necessary that all
code points, even noncharacter code points and surrogates, be
representable, and doing so does not mean that you are no longer in
that encoding form. However, you must not allow such things into a UTF
stream, because doing so means that that stream is no longer a UTF
stream.

That's why I say that you are out of conformance by having encoders and
decoders of UTF streams tolerate noncharacters. You are not allowed to
call something a UTF and do non-UTF things with it, because this is in
violation of conformance requirement C2. Therefore you must either
(1) change what you are calling the thing you are doing the
nonconforming thing to, or you must (2) change it to no longer do the
nonconforming thing.
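
To make option (2) concrete, here is a minimal sketch in Python 3 of
what a stricter encoder could look like under the reading above. The
names is_surrogate, is_noncharacter, and strict_utf8_encode are
hypothetical helpers written for this illustration; they are not part
of Python's codecs, and the ranges are simply the ones quoted from D14
and C1.

```python
def is_surrogate(cp: int) -> bool:
    """True for the surrogate code points U+D800..U+DFFF (see C1)."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for the noncharacters listed in D14.

    That is U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane
    n from 0 to 0x10.
    """
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strict_utf8_encode(text: str) -> bytes:
    """Hypothetical encoder that refuses surrogates and noncharacters.

    This is an illustration of the strict reading argued for above,
    not the behavior of the built-in "utf-8" codec.
    """
    for index, ch in enumerate(text):
        cp = ord(ch)
        if is_surrogate(cp) or is_noncharacter(cp):
            raise ValueError(
                f"U+{cp:04X} at index {index} is not allowed in interchange"
            )
    return text.encode("utf-8")
```

For comparison, on a current CPython 3 the built-in codec already
rejects a lone surrogate ("\ud800".encode("utf-8") raises
UnicodeEncodeError) but accepts a noncharacter such as "\ufffe", which
is the tolerance being objected to here.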

If you do neither, then Python no longer conforms to the formal
requirements for handling such things as these are defined by the
Unicode Standard, and therefore that version of Python is no longer
conformant to the version of the Unicode Standard that it purports
conformance to. And yes, that's a long way of saying it's lying.

It's also why having noncharacters, including surrogates, in memory
does *not* suddenly mean that they are not stored in a UTF, because you
have to be able to do that to build up buffers per the concatenation
example under definition D89. Therefore, Python uses UTF-16 internally
and should not say it uses UCS-2, because that is inaccurate and
incorrect; in short, it's wrong. That doesn't help anybody.

At least, that's how I read the Unicode Standard. Perhaps a more
careful reading than mine would admit alternate interpretations. If you
have not reread its Chapter 3 of late in its entirety, you probably
want to do so. There is quite a bit of material there that is
fundamental to any process that claims to be conformant with the
Unicode Standard.

I hope that makes sense. These things can be extremely difficult to
read, for they are subtle and quick to anger. :)

--tom

----------
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>