Ezio Melotti <ezio.melo...@gmail.com> added the comment:

Or they are still called UTF-8 but used in combination with different error handlers, like surrogateescape and surrogatepass. The "plain" UTF-* codecs should produce data that can be used for "open interchange", rejecting all invalid data during both encoding and decoding.
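To make the distinction concrete, here is a minimal sketch of how the strict codec and the two handlers named above differ. It assumes a current CPython 3, where the plain UTF-8 codec rejects lone surrogates; that may not match the interpreter versions this issue is about. The helper name try_encode is made up for the illustration.

```python
# Sketch only: assumes a current CPython 3 where strict UTF-8 rejects surrogates.

def try_encode(text, errors="strict"):
    """Hypothetical helper: return the UTF-8 bytes, or None if encoding fails."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None

lone_surrogate = "\ud800"

# The "plain" codec refuses the lone surrogate.
strict_result = try_encode(lone_surrogate)

# surrogatepass encodes the surrogate as if it were an ordinary code point.
passed = try_encode(lone_surrogate, "surrogatepass")

# surrogateescape maps each undecodable byte to a low surrogate on decoding
# and restores the original byte on encoding.
raw = b"\xff"
escaped = raw.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")

print(strict_result, passed, repr(escaped), round_trip)
```

With the handlers, the result is no longer suitable for open interchange: the bytes produced by surrogatepass are rejected again by a strict decoder.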
Chapter 3, D79 also says:
"""
To ensure that the mapping for a Unicode encoding form is one-to-one, all Unicode scalar values, including those corresponding to noncharacter code points and unassigned code points, must be mapped to unique code unit sequences. Note that this requirement does not extend to high-surrogate and low-surrogate code points, which are excluded by definition from the set of Unicode scalar values.
"""

This seems to imply that the only unencodable code points are the non-scalar values, i.e. surrogates and code points above U+10FFFF. Noncharacters therefore shouldn't receive any special treatment (at least during encoding).

Tom, do you agree with this? What does Perl do with them?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________