[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Adam Olsen Sat, 12 Jul 2008 12:04:07 -0700

Adam Olsen <[EMAIL PROTECTED]> added the comment:

Marc, perhaps Unicode has refined their definitions since you last looked?


Valid UTF-8 *cannot* contain surrogates[1].  If it does, you have
CESU-8[2][3], not UTF-8.
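
To make the distinction concrete, here is a small sketch. It assumes a
current Python 3 interpreter, not the 2008 builds under discussion, and it
uses the stdlib "surrogatepass" error handler only to produce the
CESU-8-style bytes for comparison.

```python
# Illustration only: assumes a current Python 3 interpreter, whose
# behaviour differs from the 2008 interpreters discussed in this thread.

ch = "\U00010000"  # first code point outside the BMP

# Well-formed UTF-8: one four-byte sequence.
utf8 = ch.encode("utf-8")

# CESU-8 style: each UTF-16 surrogate (D800, DC00) is encoded as its own
# three-byte sequence. "surrogatepass" lets the utf-8 codec emit these
# otherwise forbidden bytes.
cesu8 = "\ud800\udc00".encode("utf-8", "surrogatepass")

print(utf8.hex())
print(cesu8.hex())

# A strict UTF-8 decoder refuses the surrogate bytes.
try:
    cesu8.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

print(strict_rejects)
```

The two byte strings describe the same character, but only the first is
UTF-8.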

So there are two bugs.  First, the UTF-8 codec should refuse to load
surrogates.  Second, since the original bug showed up before the .pyc is
created, something in the parsing or compilation stage is producing
CESU-8.
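
As a rough sketch of the check the first bug calls for (not the codec's
actual implementation), the hypothetical helper below flags byte strings
that contain an encoded surrogate. It assumes the input is otherwise
UTF-8-shaped, so that 0xED only ever appears as a lead byte.

```python
def has_encoded_surrogate(data: bytes) -> bool:
    """Return True if data holds a UTF-8-style encoding of U+D800..U+DFFF.

    Hypothetical helper for illustration. A surrogate encoded this way
    always starts with the lead byte 0xED followed by a continuation
    byte in the range 0xA0..0xBF (0xED 0x80..0x9F is U+D000..U+D7FF,
    which is fine).
    """
    for i in range(len(data) - 1):
        if data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xBF:
            return True
    return False


# CESU-8-style encoding of U+10000 (surrogates D800 and DC00).
print(has_encoded_surrogate(b"\xed\xa0\x80\xed\xb0\x80"))
# Well-formed UTF-8 encoding of the same character.
print(has_encoded_surrogate(b"\xf0\x90\x80\x80"))
```

A strict decoder would raise an error at that point instead of returning
a flag.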


[1] 4th bullet point of D92 in
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
[2] http://unicode.org/reports/tr26/
[3] http://en.wikipedia.org/wiki/CESU-8

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3297>
_______________________________________