Serhiy Storchaka added the comment:

> It's Unicode that considers unpaired surrogates invalid, not UTF-8 by itself.
It's UTF-8 too. See RFC 3629:

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
   to first decode the UTF-16 data to obtain character numbers, which
   are then encoded in UTF-8 as described above.

----------
nosy: +storchaka

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11489>
_______________________________________
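
The RFC 3629 rule quoted above can be illustrated with a minimal sketch. It assumes Python 3, whose strict "utf-8" codec rejects surrogate code points; the helper names utf16le_to_utf8, can_encode_utf8 and is_valid_utf8 are hypothetical and exist only for this illustration, not in any library.

```python
def utf16le_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16-LE to code points first, then encode those as UTF-8.

    This is the two-step procedure RFC 3629 describes: a surrogate pair is
    combined into one character number before any UTF-8 bytes are produced.
    """
    text = data.decode("utf-16-le")
    return text.encode("utf-8")


def can_encode_utf8(text: str) -> bool:
    """Return True if the strict UTF-8 codec accepts every code point in text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under the strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+1F600 is the surrogate pair D83D DE00 in UTF-16; it becomes one
# four-byte UTF-8 sequence, not two three-byte ones.
pair = b"\x3d\xd8\x00\xde"
print(utf16le_to_utf8(pair))

# A lone surrogate (U+D800) has no UTF-8 encoding under the strict codec.
print(can_encode_utf8("\ud800"))

# The byte sequence ED A0 80 is what a naive encoder would emit for U+D800;
# the strict decoder rejects it.
print(is_valid_utf8(b"\xed\xa0\x80"))
```

Older interpreters or non-strict error handlers (for example "surrogatepass") may behave differently, so this shows only the strict default behaviour.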