Marc-Andre Lemburg <m...@egenix.com> added the comment:

Tom Christiansen wrote:
>
> I'm pretty sure that anything that claims to be UTF-{8,16,32} needs
> to reject both surrogates *and* noncharacters. Here's something from the
> published Unicode Standard's p.24 about noncharacter code points:
>
>     • Noncharacter code points are reserved for internal use, such as for
>       sentinel values. They should never be interchanged. They do, however,
>       have well-formed representations in Unicode encoding forms and survive
>       conversions between encoding forms. This allows sentinel values to be
>       preserved internally across Unicode encoding forms, even though they are
>       not designed to be used in open interchange.
>
> And here from the Unicode Standard's chapter on Conformance, section 3.2,
> p. 59:
>
>     C2 A process shall not interpret a noncharacter code point as an
>        abstract character.
>
>         • The noncharacter code points may be used internally, such as for
>           sentinel values or delimiters, but should not be exchanged publicly.

You have to remember that Python is used to build applications. It's
up to the applications to conform to Unicode or not and the
application also defines what "exchange" means in the above
context.

Python itself needs to be able to deal with assigned non-character
code points as well as unassigned code points or code points that
are part of special ranges such as the surrogate ranges.

I'm +1 on not allowing e.g. lone surrogates in UTF-8 data, because
we have a way to optionally allow these via an error handler,
but -1 on making changes that cause full range round-trip safety
of the UTF encodings to be lost without a way to turn the functionality
back on.

----------
title: Python lib re cannot handle Unicode properly due to narrow/wide bug -> Python lib re cannot handle Unicode properly due to narrow/wide bug

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________
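
For readers following along, the distinction drawn in the comment (lone surrogates rejected by default but allowed through an error handler, noncharacters surviving a round trip) can be sketched as follows. This is only an illustration and assumes a current Python 3 interpreter, which may behave differently from the version under discussion when the comment was written. The helper name `round_trips` is hypothetical; `surrogatepass` is the standard error handler name in Python 3.

```python
def round_trips(text, encoding, errors="strict"):
    """Return True if text survives encode followed by decode unchanged.

    Hypothetical helper for illustration; not part of the standard library.
    """
    return text.encode(encoding, errors).decode(encoding, errors) == text


# U+FDD0, U+FFFF and U+10FFFF are noncharacters; U+D800 is a lone surrogate.
noncharacters = "\ufdd0\uffff\U0010ffff"
lone_surrogate = "\ud800"

# Assumption: Python 3 codecs. Noncharacters pass through the strict UTF codecs.
noncharacters_ok = all(
    round_trips(noncharacters, enc) for enc in ("utf-8", "utf-16", "utf-32")
)

# A lone surrogate is rejected by the strict UTF-8 encoder...
try:
    lone_surrogate.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# ...but can be allowed explicitly via the "surrogatepass" error handler.
surrogate_ok = round_trips(lone_surrogate, "utf-8", errors="surrogatepass")

print(noncharacters_ok, strict_rejected, surrogate_ok)
```

The opt-in handler is what keeps full-range round-tripping available to applications that need it, while the default stays strict about surrogates.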