Martin Pool <m...@sourcefrog.net> added the comment:

On 21 December 2011 12:41, Antoine Pitrou <rep...@bugs.python.org> wrote:
>
> Antoine Pitrou <pit...@free.fr> added the comment:
>
>> The standard encoding is UTF-8.
>
> How so? I don't know of any Linux or Unix spec which says so. If you get
> the Linux heads to standardize this then I'll certainly be very happy
> (and countless others will, too). But AFAIK this is not the case and I
> don't see why you are asking Python to make a choice that OS vendors
> refuse to make. You are certainly asking the wrong project to solve this
> problem.

It is a de facto, not de jure, standard: UTF-8 is how filenames are
typically stored. Other software (e.g. the GNOME file handling utilities)
makes this assumption. See e.g.
<http://www.cl.cam.ac.uk/~mgk25/unicode.html#linux>.

I would be happy to see an authoritative document saying this is how
things _should_ be stored, but I can't find one yet. But in Unix there
are no ultimate authorities: even if someone announced that filenames are
UTF-8, there will obviously continue to be many machines where in
practice they are not.

I started asking about it over here, to see if at least Ubuntu can have
an opinion that this is how things should normally be:
https://lists.ubuntu.com/archives/ubuntu-devel/2011-December/034588.html

I'm not sure what you expect a technical solution at the OS level would
look like. The API is 8-bit strings and that's not likely to change.
It's possible to have a situation where no locale is specified.
Applications unavoidably need to have some opinion about what to do
there.

Other applications assume the filenames are UTF-8. Python assumes that
text in general will be UTF-8 (getdefaultencoding). It is almost like
your caricature of OS developers as being anglocentric, but in fact here
it's Python that assumes everything is probably ASCII - or more
charitably, it is just assuming that failing when things aren't ASCII is
the best tradeoff. Maybe it is.

One OS-level fix is to try to reduce the number of situations where
people see no locale, or the C locale, and give them C.UTF-8 instead.
That is probably worth doing. But having no locale can still happen, and
I think Python could handle that better, so the changes are
complementary.

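To make the tradeoff concrete, here is a minimal sketch, assuming Python 3
on a Unix-like system. The byte string and the helper decode_filename are
made up for illustration and are not part of any patch on this issue. It
shows what happens to a UTF-8 encoded name under a strict ASCII decode,
under a UTF-8 decode, and under the surrogateescape error handler
(PEP 383), which keeps undecodable bytes round-trippable.

```python
import sys

# Bytes as the 8-bit filesystem API would hand them back for a file
# named "cafe" with an acute accent on the e, assuming it was written
# as UTF-8. This value is illustrative only.
raw_name = b"caf\xc3\xa9"


def decode_filename(raw, encoding):
    """Hypothetical helper: strict decode, or None if the bytes do not fit."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# What a strict ASCII assumption does with a non-ASCII name.
as_ascii = decode_filename(raw_name, "ascii")

# What the de facto UTF-8 assumption does with the same bytes.
as_utf8 = decode_filename(raw_name, "utf-8")

# surrogateescape maps undecodable bytes to lone surrogates, so the
# original bytes can be recovered even when the encoding guess is wrong.
escaped = raw_name.decode("ascii", "surrogateescape")
round_tripped = escaped.encode("ascii", "surrogateescape")

print("default text encoding:", sys.getdefaultencoding())
print("filesystem encoding:  ", sys.getfilesystemencoding())
print("strict ascii decode:  ", as_ascii)
print("utf-8 decode:         ", ascii(as_utf8))
print("round trip preserved: ", round_tripped == raw_name)
```

The two encoding lines depend on the interpreter version and the locale
the script runs under, so comparing a run under LANG=C with one under a
UTF-8 locale is the interesting part.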
----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13643>
_______________________________________