[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Adam Olsen Fri, 11 Jul 2008 16:33:53 -0700

Adam Olsen <[EMAIL PROTECTED]> added the comment:

Simpler way to reproduce this (on Linux):


$ rm unicodetest.pyc 
$ 
$ python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: u'\ud800\udd23' u'\U00010123'
$ 
$ python -c 'import unicodetest'
Result: True
Len: 1 1
Repr: u'\U00010123' u'\U00010123'

Storing surrogates in UTF-32 is ill-formed[1], so the first run
(before the pyc exists) definitely shouldn't be failing on Linux
(with a UTF-32 build).
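
As a rough illustration of why the two reprs above name the same
character, here is the standard UTF-16 surrogate arithmetic as a small
sketch. The function names are made up for this example and it is
written for Python 3; it is not code from the interpreter.

```python
# Hypothetical helpers illustrating UTF-16 surrogate arithmetic.
# Names are invented for this sketch; this is not CPython's code.

def combine_surrogates(high, low):
    """Combine a high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate: %#06x" % high)
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate: %#06x" % low)
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Split a supplementary code point into a (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % cp)
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(combine_surrogates(0xD800, 0xDD23)))  # 0x10123
```

So \ud800\udd23 and \U00010123 denote the same character; the only
question is whether it is stored as two code units or one.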

The repr could go either way, as Unicode doesn't cover escape sequences.
We could allow u'\ud800\udd23' literals to magically become
u'\U00010123' on UTF-32 builds.  We already allow repr(u'\ud800\udd23')
to magically become "u'\U00010123'" on UTF-16 builds (which is why the
repr test always passes there, rather than always failing).
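
For illustration only, one way to get the "magically become" behaviour
at the string level is a round trip through UTF-16. This is a sketch
for Python 3 (3.4 or later, where the 'surrogatepass' error handler
works with the UTF-16 codecs); join_surrogate_pairs is a made-up name,
not an existing function, and this is not a proposal for how the
parser should do it.

```python
# Hypothetical helper: join adjacent surrogate pairs in a str by
# round-tripping through UTF-16. Assumes Python 3.4+.

def join_surrogate_pairs(text):
    """Return text with each high+low surrogate pair merged."""
    # 'surrogatepass' lets surrogate code points be encoded as
    # UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16-le", "surrogatepass")
    # Decoding reads a well-formed pair as one supplementary character.
    return raw.decode("utf-16-le", "surrogatepass")


two_units = "\ud800\udd23"   # two code points in Python 3
joined = join_surrogate_pairs(two_units)
print(len(two_units), len(joined))
print(joined == "\U00010123")
```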

The bigger problem is how strictly we prohibit ill-formed character
sequences.  We already prevent values above U+10FFFF, but not
inappropriate surrogates.
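
To make the distinction concrete, here is a minimal sketch (Python 3,
hypothetical function names) of the check a UTF-32 build could apply:
a code point is acceptable only if it is a Unicode scalar value, i.e.
at most U+10FFFF and not in the surrogate range.

```python
# Hypothetical validation helpers; names are invented for this sketch.

def is_scalar_value(cp):
    """True if cp is a Unicode scalar value (valid in UTF-32)."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def surrogate_positions(text):
    """Return the indices of surrogate code points in text."""
    return [i for i, ch in enumerate(text)
            if 0xD800 <= ord(ch) <= 0xDFFF]


# Values above U+10FFFF are already rejected by the interpreter ...
try:
    chr(0x110000)
except ValueError:
    print("chr(0x110000) rejected")

# ... but surrogates are accepted and have to be found by hand.
print(surrogate_positions("a\ud800b"))  # [1]
```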


[1] Search for D90 in http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

----------
nosy: +Rhamphoryncus
Added file: http://bugs.python.org/file10880/unicodetest.py

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________