Toshio Kuratomi <[EMAIL PROTECTED]> added the comment:

'''
@a.badger: The behaviour (drop non encodable strings) is not really a problem if you configure correctly your program and computer. Eg. you spoke about CGI-WSGI: if your website also speak UTF-8, you will be able to read all environment variables. So this issue is not important, it only appears when your website/OS is not well configured. I mean the problem is not in Python but outside Python. The PATH variable contains directory names, if you have only names encodable in your filesystem encoding (UTF-8 most of the time), you will be able to use the PATH variable. If a directory has an non decodable name, rename the directory but don't try to fix Python!
'''

The idea that having mixed encodings on a system is a misconfiguration is a fallacy.

1) In a multiuser setup, each user has a choice of what encoding to use. So mixed encodings are both possible and valid.
2) In a legacy system, your operating system may have all utf-8 naming for the core OS while all of the old data files are being mounted with another encoding that the legacy programs on the host expect.
3) On an nfs mount, data may come from users on different machines from widely separated areas using different system encodings.
4) The same thing as 1-3 but applied to any of the data a site may be passing via an environment variable rather than just file and directory names.
5) An application may have to deal with encodings different from the system default due to limitations of another program. Since one of python's many uses is as a glue language, it needs to be able to deal with these quirks.
6) The application you're interfacing with may just be using bytes rather than text in the environment variables.

Let me put it this way: If I write a file in a latin-1 encoding and put it on my system that has a utf-8 system encoding, what does python-3 do?

1) If I try to open it as a text file: "open('filename', 'r')" it throws a UnicodeDecodeError when I attempt to read bytes from it that are not valid utf-8.
2) As a programmer I then know to open it as binary "open('filename', 'rb')" and do my own decoding of the data now that I've been made aware that I must take this corner case into account.

Some notes:

1) This seems to be the right general procedure to take when handling things that are usually text but can contain arbitrary bytes.
2) This makes use of python's exception infrastructure to tell the programmer plainly what's going wrong instead of silently ignoring values that the programmer may not have encountered in their test data but could exist in the real world.
Would you rather get a bug report from a user that says: "FooApp gives me a UnicodeDecodeError traceback pointing at line 345" (how open() works) or "FooApp never authenticates me" (which you then have to track down to the fact that the credentials on the user's system are being passed in an env var and are not in the system encoding)?
3) This learns the correct lesson from python-2's unicode problems: Stop the mixture of bytes and unicode at the border so the programmer can be explicit about how to deal with the odd-ball data there. It does not become squeamish about throwing a Unicode exception; that squeamishness is the wrong lesson to learn from python-2.
4) It also doesn't refuse to acknowledge that the world outside python is not as simple and elegant as the world inside python, and it allows the programmer to write an interface to that world instead of forcing them to go outside of python to deal with it.

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue4006>
_______________________________________
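
P.S. For anyone who wants to see the open() procedure from the comment in action, here is a minimal sketch. It makes some assumptions that are not in the comment itself: the helper name read_text_or_fallback is hypothetical, the sample text is made up, and encoding="utf-8" is passed explicitly to stand in for "a system whose default encoding is utf-8" so that the result does not depend on the locale of the machine running it.

```python
import os
import tempfile

# Hypothetical sample data: "cafe" with an accented e, encoded as latin-1.
# The byte 0xE9 followed by a newline is not a valid utf-8 sequence.
LATIN1_BYTES = "caf\u00e9\n".encode("latin-1")


def read_text_or_fallback(path, primary="utf-8", fallback="latin-1"):
    """Try the expected encoding first; on failure, reopen the file as
    binary and decode the bytes explicitly with the fallback encoding."""
    try:
        with open(path, "r", encoding=primary) as f:
            return f.read(), primary
    except UnicodeDecodeError:
        # The exception told us plainly that the data is not in the
        # expected encoding, so handle the corner case on purpose.
        with open(path, "rb") as f:
            return f.read().decode(fallback), fallback


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "filename")
    with open(path, "wb") as f:
        f.write(LATIN1_BYTES)

    # Step 1: opening as utf-8 text fails loudly instead of dropping data.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        raised = False
    except UnicodeDecodeError:
        raised = True

    # Step 2: the programmer falls back to binary and decodes explicitly.
    text, used = read_text_or_fallback(path)
```

The fallback encoding here is a choice made by the programmer, which is the point: the error surfaces at the border, and the decision about the odd-ball bytes is made explicitly rather than silently.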