Martin Pool <m...@sourcefrog.net> added the comment:

On 22 December 2011 12:32, STINNER Victor <rep...@bugs.python.org> wrote:
>
> STINNER Victor <victor.stin...@haypocalc.com> added the comment:
>
> On 22/12/2011 02:16, Martin Pool wrote:
>> The proposal is that in some cases where Python currently assumes
>> filenames are ascii on Linux, it ought to instead assume they are
>> utf-8.
>
> Oh, I expected a use case describing the problem, not the proposed
> solution :-)

The problem as I see it is this: On Linux, filenames are generally
(but not always) in UTF-8; people fairly commonly end up with no
locale configured, which causes Python to decode filenames as ascii.
It is easy for this to end up with them hitting UnicodeErrors.

>>> You want to use UTF-8 instead of ASCII, so what? What do you
>>> want to do with your nicely well decoded filenames? You cannot print it
>>> to your terminal nor pass it to a subprocess, because your terminal uses
>>> ASCII, as subprocess. I don't see how it would help you.
>>
>> When the application has a unicode string,
>
> Where does this string come from? (It is an important question).

It comes, for example, from the name of a file, or a directory, or the
contents of a symlink. Or the problem applies equally when the
program has got a unicode string (for example off the network in a
defined encoding) and it is trying to use it to access the filesystem.

> If your locale encoding is ASCII, you cannot write such non-ASCII
> filenames using the keyboard for example.

Sure you can. The user could enter a backslash-escaped name, which
the program knows to decode to unicode. The point is the program has
a choice of how it deals with user input, whereas it does not have as
much control in Python of how filenames are encoded.

> > with working around this when the filenames really are
> > valid in what should be the user's locale,
>
> On your computer, UTF-8 is maybe a good candidate for "what should be
> the user's locale", but you cannot generalize for all computers.
>
> I also wanted to force UTF-8 everywhere, but you cannot do that or your
> program will just not work in some configurations.

Just to be clear, I'm not proposing to force UTF-8 everywhere. I am
only proposing to 'break' the case where the user has non-ascii
filenames but, intentionally or not, a locale that specifies only
ascii is used. With this change, Python will try to decode them as
utf-8, and fail if they're not utf-8.

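To make the failure and the proposed change concrete, here is a rough
sketch. It is only an illustration: decode_filename is a hypothetical
helper, not an existing Python API, and the list of ASCII aliases is an
assumption. It also uses strict decoding and so ignores the
surrogateescape handling that Python 3 normally applies to filenames.

```python
# Illustrative sketch only: decode_filename is a hypothetical helper,
# not an existing Python API and not how CPython decodes filenames.

# Assumed spellings under which a locale may report plain ASCII.
ASCII_ALIASES = ("ascii", "us-ascii", "ansi_x3.4-1968")


def decode_filename(raw: bytes, locale_encoding: str) -> str:
    """Decode a filename, trying strict UTF-8 when the locale is ASCII-only."""
    if locale_encoding.lower() in ASCII_ALIASES:
        # Proposed behaviour: assume UTF-8, and still fail on anything else.
        return raw.decode("utf-8")
    return raw.decode(locale_encoding)


# Bytes as they would typically be stored on a Linux filesystem.
name = "café.txt".encode("utf-8")

# Current behaviour with an ASCII locale: strict decoding fails.
try:
    name.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Proposed behaviour: the same bytes decode cleanly as UTF-8.
decoded = decode_filename(name, "ANSI_X3.4-1968")

# A name that is not valid UTF-8 (Latin-1 bytes here) still raises.
try:
    decode_filename("café.txt".encode("latin-1"), "ascii")
    latin1_ok = True
except UnicodeDecodeError:
    latin1_ok = False
```

Only the ASCII-locale case changes in this sketch; any other locale
encoding is used exactly as configured.
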
I am coming to think the best step here is just for the OS to do more
to make sure the application does get the appropriate locale. (For
example, Ubuntu in recent releases uses a pam hook to set LANG for
cron jobs, to avoid the example described above.)

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13643>
_______________________________________