[issue12892] UTF-16 and UTF-32 codecs should reject (lone) surrogates

Ezio Melotti Sun, 04 Sep 2011 03:49:51 -0700

New submission from Ezio Melotti <ezio.melo...@gmail.com>:

From Chapter 3 of the Unicode Standard 6.0 [0], D91:
"""
• UTF-16 encoding form: The Unicode encoding form that assigns each Unicode 
scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single 
unsigned 16-bit code unit with the same numeric value as the Unicode scalar 
value, and that assigns each Unicode scalar value in the range 
U+10000..U+10FFFF to a surrogate pair, according to Table 3-5.
• Because surrogate code points are not Unicode scalar values, isolated UTF-16 
code units in the range 0xD800..0xDFFF are ill-formed.
"""
I.e. UTF-16 should correctly decode a valid surrogate pair and encode a 
non-BMP character as a valid surrogate pair, but it should reject lone 
surrogates during both encoding and decoding.


On Python 3, the utf-16 codec can encode every code point from U+0000 to 
U+10FFFF, lone surrogates included, but it cannot decode lone surrogates. 
I am not sure whether that is by design or whether decoding fails only 
because the codec expects a second surrogate that is missing.
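
To make the asymmetry above easy to check on any given interpreter, here is a 
minimal probe sketch. The helper names can_encode and can_decode are 
hypothetical, not part of any library, and the result for encoding a lone 
surrogate depends on whether the interpreter already includes the change 
proposed here, so no output is claimed for that line.

```python
# Probe sketch: how does the running interpreter's UTF-16 codec treat
# surrogates under the default (strict) error handler?
# can_encode / can_decode are hypothetical helper names.

def can_encode(text, codec="utf-16-le"):
    """Return True if `text` encodes with the strict error handler."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def can_decode(data, codec="utf-16-le"):
    """Return True if `data` decodes with the strict error handler."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


# A non-BMP character must round-trip through a valid surrogate pair.
non_bmp = "\U00010000"
pair = non_bmp.encode("utf-16-le")      # code units D800 DC00, little-endian
assert pair == b"\x00\xd8\x00\xdc"
assert pair.decode("utf-16-le") == non_bmp

lone_high = "\ud800"                    # lone high surrogate as a str
lone_high_bytes = b"\x00\xd8"           # the same code unit as raw bytes

# Version-dependent: True on interpreters that still encode lone surrogates.
print("encode lone surrogate:", can_encode(lone_high))
# A lone surrogate at the end of the input, and one followed by 'A'.
print("decode lone surrogate:", can_decode(lone_high_bytes))
print("decode unpaired + 'A':", can_decode(b"\x00\xd8A\x00"))
```

The -le variant is used only so that no BOM is added and the byte strings 
stay easy to read.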

----------------------------------------------

From Chapter 3 of the Unicode Standard 6.0 [0], D90:
"""
• UTF-32 encoding form: The Unicode encoding form that assigns each Unicode 
scalar value to a single unsigned 32-bit code unit with the same numeric value 
as the Unicode scalar value.
• Because surrogate code points are not included in the set of Unicode scalar 
values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed.
"""
I.e. UTF-32 should reject every surrogate code point, whether it appears 
alone or as half of an otherwise valid surrogate pair, during both encoding 
and decoding.

On Python 3, the utf-32 codec can encode and decode all the codepoints from 
U+0000 to U+10FFFF (including surrogates).

----------------------------------------------

I think that:
  * this should be fixed in 3.3;
  * it's a bug, so the fix /might/ be backported to 3.2.  However, it's also a 
fairly big change in behavior, so it might be better to leave it for 3.3 only;
  * it's better to leave 2.7 alone, since even the utf-8 codec is broken there;
  * the surrogatepass error handler should work with the utf-16 and utf-32 
codecs too.
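
As a rough sketch of what the last item in the list above would mean in 
practice: with surrogatepass, a string containing a lone surrogate should 
survive an encode/decode round trip through utf-16 and utf-32. The helper 
roundtrip is a hypothetical name, and the sketch assumes an interpreter in 
which surrogatepass is supported by these codecs; on one where it is not, the 
calls below may raise instead.

```python
# Sketch, assuming surrogatepass is supported by the utf-16/utf-32 codecs
# of the running interpreter. roundtrip is a hypothetical helper name.

def roundtrip(text, codec):
    """Encode and decode `text`, letting surrogates through explicitly."""
    raw = text.encode(codec, "surrogatepass")
    return raw.decode(codec, "surrogatepass")


text = "a\ud800b"   # lone high surrogate between two ASCII letters

for codec in ("utf-16-le", "utf-32-le"):
    try:
        text.encode(codec)          # strict handler, the default
        strict_ok = True
    except UnicodeEncodeError:
        strict_ok = False
    print(codec,
          "| strict encode accepted:", strict_ok,
          "| surrogatepass round trip:", roundtrip(text, codec) == text)
```

The point of the design is that strict stays the default and rejects 
surrogates, while callers who really need to carry them opt in by name.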


Note that this was already reported in #3672, but in the end only the 
utf-8 codec was fixed.

[0]: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

----------
assignee: ezio.melotti
components: Unicode
messages: 143490
nosy: ezio.melotti, gvanrossum, haypo, lemburg, loewis, tchrist
priority: high
severity: normal
stage: test needed
status: open
title: UTF-16 and UTF-32 codecs should reject (lone) surrogates
type: behavior
versions: Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12892>
_______________________________________