Nick Coghlan added the comment:

surrogateescape and surrogatepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode such text with a different codec while it still contains surrogates, you'll either get an exception (the default) or mojibake (if you use surrogateescape/surrogatepass as the output error handler). They only work as a transparent pass-through if the input and output encodings match.
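
A minimal sketch of the three outcomes described above. The sample string, the choice of ASCII and Latin-1 as the mismatched codecs, and the variable names are illustrative assumptions, not part of the original comment.

```python
# Bytes whose real encoding (UTF-8) is unknown to the code reading them.
raw = "na\u00efve".encode("utf-8")  # b'na\xc3\xafve'

# Decoding as ASCII with surrogateescape carries the two non-ASCII bytes
# through as the lone surrogates U+DCC3 and U+DCAF.
text = raw.decode("ascii", "surrogateescape")

# Same codec and handler on output: transparent pass-through.
same_codec = text.encode("ascii", "surrogateescape")

# Different codec with the default (strict) handler: an exception.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Different codec with surrogateescape: the original bytes are emitted
# unchanged, so a reader expecting Latin-1 sees mojibake.
mismatched = text.encode("latin-1", "surrogateescape")
seen_by_reader = mismatched.decode("latin-1")
```

Here `same_codec` equals `raw`, while `seen_by_reader` is the UTF-8 byte pair misread as two Latin-1 characters: the bytes survive, but their meaning does not.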
I'd be fine with putting these data scrubbing functions somewhere other than in codecs, though. I'm not sure unicodedata is the right place, but a new module like "string.internals" might be, as these functions have more to do with Python's internal text representation than with anything else. Such a module could also be a home for things like a chunking utility that splits a string into substrings that each use as little memory as possible, for feeding into a StringIO instance before throwing the original away.

I also don't think they're urgent - the introduction of /etc/locale.conf makes modern Linux far more consistent in getting locale settings right, and even older platforms tend to get the locale right for user processes.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue18814>
_______________________________________