Marc-Andre Lemburg <m...@egenix.com> added the comment:

Ezio Melotti wrote:
>
> New submission from Ezio Melotti <ezio.melo...@gmail.com>:
>
> From Chapter 03 of the Unicode Standard 6[0], D91:
> """
> • UTF-16 encoding form: The Unicode encoding form that assigns each Unicode
>   scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single
>   unsigned 16-bit code unit with the same numeric value as the Unicode scalar
>   value, and that assigns each Unicode scalar value in the range
>   U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
> • Because surrogate code points are not Unicode scalar values, isolated
>   UTF-16 code units in the range 0xD800..0xDFFF are ill-formed.
> """
> I.e. UTF-16 should be able to decode correctly a valid surrogate pair, and
> encode a non-BMP character using a valid surrogate pair, but it should
> reject lone surrogates both during encoding and decoding.
>
> On Python 3, the utf-16 codec can encode all the codepoints from U+0000 to
> U+10FFFF (including (lone) surrogates), but it's not able to decode lone
> surrogates (not sure if this is by design or if it just fails because it
> expects another (missing) surrogate).
>
> ----------------------------------------------
>
> From Chapter 03 of the Unicode Standard 6[0], D90:
> """
> • UTF-32 encoding form: The Unicode encoding form that assigns each Unicode
>   scalar value to a single unsigned 32-bit code unit with the same numeric
>   value as the Unicode scalar value.
> • Because surrogate code points are not included in the set of Unicode scalar
>   values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.
> """
> I.e. UTF-32 should reject both lone surrogates and valid surrogate pairs,
> both during encoding and during decoding.
>
> On Python 3, the utf-32 codec can encode and decode all the codepoints from
> U+0000 to U+10FFFF (including surrogates).
>
> ----------------------------------------------
>
> I think that:
> * this should be fixed in 3.3;
> * it's a bug, so the fix /might/ be backported to 3.2. However, it's also
>   a fairly big change in behavior, so it might be better to leave it for 3.3
>   only;
> * it's better to leave 2.7 alone, even the utf-8 codec is broken there;
> * the surrogatepass error handler should work with the utf-16 and utf-32
>   codecs too.
>
> Note that this has been already reported in #3672, but eventually only the
> utf-8 codec was fixed.
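For reference, the two behaviours discussed in the report can be checked with a short sketch. It assumes a Python 3 interpreter in which the utf-8 fix from #3672 is present. The helper names `to_surrogate_pair` and `encode_lone_surrogate` are made up for this illustration and are not part of the codecs API. The pair arithmetic is the usual reading of Table 3-5, and the second part only exercises the utf-8 codec, since that is the one stated to handle lone surrogates already.

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates.

    Hypothetical helper: subtract 0x10000, then put the top 10 bits in the
    high surrogate and the bottom 10 bits in the low surrogate.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def encode_lone_surrogate(errors):
    """Encode the lone surrogate U+D800 with the utf-8 codec.

    Hypothetical helper: returns the encoded bytes, or None if the codec
    rejects the input with the given error handler.
    """
    try:
        return "\ud800".encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


# A valid surrogate pair for a non-BMP character (U+1F600).
high, low = to_surrogate_pair(0x1F600)
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")

# A lone surrogate under each of the error handlers mentioned in the thread.
results = {
    handler: encode_lone_surrogate(handler)
    for handler in ("strict", "surrogatepass", "ignore", "replace")
}

for handler, encoded in results.items():
    print(handler, encoded)
```

With the utf-8 codec, strict mode is expected to reject the lone surrogate, while surrogatepass lets it through and ignore/replace apply their usual handling. The same loop could be pointed at the utf-16 and utf-32 codecs to see how a given interpreter behaves.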
All UTF codecs should reject lone surrogates in strict error mode,
but let them pass using the surrogatepass error handler (the UTF-8
codec already does) and apply the usual error handling for ignore
and replace.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________