STINNER Victor <victor.stin...@haypocalc.com> added the comment: > I also found out that, according to RFC 3629, surrogates > are considered invalid and they can't be encoded/decoded, > but the UTF-8 codec actually does it.
Python2 does, but Python3 raises an error. Python 2.7a4+ (trunk:79675, Apr 3 2010, 16:11:36) >>> u"\uDC80".encode("utf8") '\xed\xb2\x80' Python 3.2a0 (py3k:79441, Mar 26 2010, 13:04:55) >>> "\uDC80".encode("utf8") UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed Deny encoding surrogates (in utf8) causes a lot of crashs in Python3, because most functions calling suppose that _PyUnicode_AsString() does never fail: see #6687 (and #8195 and a lot of other crashs). It's not a good idea to change it in Python 2.7, because it would require a huge work and we are close to the first beta of 2.7. ---------- nosy: +haypo _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue8271> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com