How about another str-like type, a sequence of char-or-bytes? Could be called strbytes or stringwithinvalidcharacters. It would support whatever subset of str functionality makes sense / is easy to implement plus a to_escaped_str() method (that does the escaping the PEP talks about) for people who want to use regexes or other str-only stuff.
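To make the idea a bit more concrete, a rough sketch of such a type might look like the following. Everything in it is an assumption made for illustration, not a worked-out design: the constructor taking a mix of str and int parts, the tuple storage, and the '?'-plus-decimal escape format.

```python
from itertools import groupby


class strbytes:
    """Hypothetical sequence whose items are characters or raw byte values."""

    def __init__(self, *parts):
        # Each part is either a str (decoded text) or an int 0..255 (a raw byte).
        items = []
        for part in parts:
            if isinstance(part, str):
                items.extend(part)
            elif isinstance(part, int) and 0 <= part <= 255:
                items.append(part)
            else:
                raise TypeError('parts must be str or int in range(256)')
        self._items = tuple(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        # Indexing and slicing both return strbytes, much as str returns str.
        if isinstance(index, slice):
            return strbytes(*self._items[index])
        return strbytes(self._items[index])

    def __eq__(self, other):
        return isinstance(other, strbytes) and self._items == other._items

    def __hash__(self):
        return hash(self._items)

    def __repr__(self):
        # Show runs of characters as one string and raw bytes as integers.
        parts = []
        for is_text, run in groupby(self._items, key=lambda i: isinstance(i, str)):
            if is_text:
                parts.append(repr(''.join(run)))
            else:
                parts.extend(repr(byte) for byte in run)
        return 'strbytes(%s)' % ', '.join(parts)

    def to_escaped_str(self):
        # Assumed escape format: '?' followed by the byte value in decimal.
        return ''.join(
            item if isinstance(item, str) else '?%d' % item
            for item in self._items
        )
```

Keeping the items in a tuple makes the object immutable and hashable, so it could be used as a dict key the way file names often are.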
Here is a description by example:

os.listdir('.') -> [strbytes('normal_file'), strbytes('bad', 128, 'file')]
strbytes('a')[0] -> strbytes('a')
strbytes('bad', 128, 'file')[3] -> strbytes(128)
strbytes('bad', 128, 'file').to_escaped_str() -> 'bad?128file'

Having a separate type is cleaner than a "str that isn't exactly what it represents". And making the escaping an explicit (but rarely-needed) step would be less surprising for users.

Anyway, I don't know a whole lot about this issue so there may be an obvious reason this is a bad idea.

On Wed, Apr 22, 2009 at 6:50 AM, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
> I'm proposing the following PEP for inclusion into Python 3.1.
> Please comment.
>
> Regards,
> Martin
>
> PEP: 383
> Title: Non-decodable Bytes in System Character Interfaces
> Version: $Revision: 71793 $
> Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $
> Author: Martin v. Löwis <mar...@v.loewis.de>
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 22-Apr-2009
> Python-Version: 3.1
> Post-History:
>
> Abstract
> ========
>
> File names, environment variables, and command line arguments are
> defined as being character data in POSIX; the C APIs however allow
> passing arbitrary bytes - whether these conform to a certain encoding
> or not. This PEP proposes a means of dealing with such irregularities
> by embedding the bytes in character strings in such a way that allows
> recreation of the original byte string.
>
> Rationale
> =========
>
> The C char type is a data type that is commonly used to represent both
> character data and bytes. Certain POSIX interfaces are specified and
> widely understood as operating on character data, however, the system
> call interfaces make no assumption on the encoding of these data, and
> pass them on as-is.
> With Python 3, character strings use a
> Unicode-based internal representation, making it difficult to ignore
> the encoding of byte strings in the same way that the C interfaces can
> ignore the encoding.
>
> On the other hand, Microsoft Windows NT has corrected the original
> design limitation of Unix, and made it explicit in its system
> interfaces that these data (file names, environment variables, command
> line arguments) are indeed character data, by providing a
> Unicode-based API (keeping a C-char-based one for backwards
> compatibility).
>
> For Python 3, one proposed solution is to provide two sets of APIs: a
> byte-oriented one, and a character-oriented one, where the
> character-oriented one would be limited to not being able to represent
> all data accurately. Unfortunately, for Windows, the situation would
> be exactly the opposite: the byte-oriented interface cannot represent
> all data; only the character-oriented API can. As a consequence,
> libraries and applications that want to support all user data in a
> cross-platform manner have to accept a mish-mash of bytes and characters
> exactly in the way that caused endless troubles for Python 2.x.
>
> With this PEP, a uniform treatment of these data as characters becomes
> possible. The uniformity is achieved by using specific encoding
> algorithms, meaning that the data can be converted back to bytes on
> POSIX systems only if the same encoding is used.
>
> Specification
> =============
>
> On Windows, Python uses the wide character APIs to access
> character-oriented APIs, allowing direct conversion of the
> environmental data to Python str objects.
>
> On POSIX systems, Python currently applies the locale's encoding to
> convert the byte data to Unicode. If the locale's encoding is UTF-8,
> it can represent the full set of Unicode characters, otherwise, only a
> subset is representable. In the latter case, using private-use
> characters to represent these bytes would be an option.
> For UTF-8,
> doing so would create an ambiguity, as the private-use characters may
> regularly occur in the input also.
>
> To convert non-decodable bytes, a new error handler "python-escape" is
> introduced, which decodes non-decodable bytes into a private-use
> character U+F01xx, which is believed to not conflict with private-use
> characters that currently exist in Python codecs.
>
> The error handler interface is extended to allow the encode error
> handler to return byte strings immediately, in addition to returning
> Unicode strings which then get encoded again.
>
> If the locale's encoding is UTF-8, the file system encoding is set to
> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>
> Discussion
> ==========
>
> While providing a uniform API to non-decodable bytes, this interface
> has the limitation that the chosen representation only "works" if the data
> get converted back to bytes with the python-escape error handler
> also. Encoding the data with the locale's encoding and the (default)
> strict error handler will raise an exception, encoding them with UTF-8
> will produce nonsensical data.
>
> For most applications, we assume that they eventually pass data
> received from a system interface back into the same system
> interfaces. For example, an application invoking os.listdir() will
> likely pass the result strings back into APIs like os.stat() or
> open(), which then encodes them back into their original byte
> representation. Applications that need to process the original byte
> strings can obtain them by encoding the character strings with the
> file system encoding, passing "python-escape" as the error handler
> name.
>
> Copyright
> =========
>
> This document has been placed in the public domain.
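For comparison, the round trip the quoted draft describes can be tried with the error handler that current Python 3 releases ship under the name "surrogateescape" (the draft calls the handler "python-escape" and the codec "utf-8b"). This is only a sketch: the two helper names are made up for illustration, and UTF-8 is assumed as the file system encoding.

```python
def decode_filename(raw, encoding='utf-8'):
    # Hypothetical helper: undecodable bytes 0x80..0xFF become lone
    # surrogates U+DC80..U+DCFF instead of raising an error.
    return raw.decode(encoding, 'surrogateescape')


def encode_filename(name, encoding='utf-8'):
    # Hypothetical helper: the same handler turns those surrogates
    # back into the original bytes.
    return name.encode(encoding, 'surrogateescape')


raw = b'bad\x80file'              # 0x80 on its own is not valid UTF-8
name = decode_filename(raw)

print(ascii(name))                # 'bad\udc80file'
print(encode_filename(name) == raw)

# Without the handler, the escaped string cannot be encoded at all.
try:
    name.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)
```

The difference from the strbytes idea is that here the result is an ordinary str, so the escaping is implicit and only shows up when the string is encoded without the matching handler.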