New submission from Antoine Pitrou <pit...@free.fr>:

The utf-7 codec happily encodes lone surrogates, but it won't decode them:

>>> "\ud801".encode("utf-7")
b'+2AE-'
>>> "\ud801\ud801".encode("utf-7")
b'+2AHYAQ-'
>>> "\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-4: second surrogate missing at end of shift sequence
>>> "\ud801\ud801".encode("utf-7").decode("utf-7")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/cpython/default/Lib/encodings/utf_7.py", line 12, in decode
    return codecs.utf_7_decode(input, errors, True)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: second surrogate missing

I don't know which behaviour is better but round-tripping is certainly a desirable property of any codec.

----------
components: Interpreter Core, Unicode
messages: 146919
nosy: ezio.melotti, loewis, pitrou
priority: normal
severity: normal
status: open
title: utf-7 inconsistent with surrogates
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13333>
_______________________________________
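
For anyone who wants to check the round-trip property on their own build, here is a minimal sketch. The helper name round_trips is hypothetical (it is not part of the codecs module), and the result for lone surrogates under utf-7 may differ between Python versions, so the script only reports what it observes rather than asserting an outcome.

```python
def round_trips(text, codec):
    """Return True if encoding and then decoding `text` with `codec` gives `text` back.

    Hypothetical helper for illustration only; not a stdlib function.
    """
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # UnicodeEncodeError and UnicodeDecodeError are both subclasses
        # of UnicodeError, so a failure on either side lands here.
        return False


if __name__ == "__main__":
    samples = ("hello", "\ud801", "\ud801\ud801", "\U00010400")
    for sample in samples:
        # ascii() keeps lone surrogates printable on any terminal encoding.
        print(ascii(sample), round_trips(sample, "utf-7"))
```

A False for the lone-surrogate samples corresponds to the asymmetry reported above (encoding succeeds, decoding fails); a True would mean the codec round-trips them on that interpreter.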