Thanks for writing PEP 383, MvL. I recently ran into this
problem in Python 2.x in the Tahoe project [1]. I think Tahoe
is a good use case showing what some people need.
For example, the assumption that a file will later be written back
into the same local filesystem (and thus luckily use the same
encoding) from which it originally came doesn't hold for us, because
Tahoe is used for file-sharing as well as for backup-and-restore.

One of my first conclusions in pursuing this issue is that we can
never use the Python 2.x unicode APIs on Linux, just as we can never
use the Python 2.x str APIs on Windows [2]. (You mentioned this
ugliness in your PEP.) My next conclusion was that the Linux way of
doing encoding of filenames really sucks compared to, for example,
the Mac OS X way. I'm heartened to see that David Wheeler is trying
to persuade the maintainers of Linux filesystems to improve some of
this: [3].

My final conclusion was that we needed to have two kinds of
workaround for the Linux suckage: first, if decoding using the
suggested filesystem encoding fails, then we fall back to mojibake
[4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not
sure if it matters and I haven't yet understood if utf-8b offers
another alternative for this case). Second, if decoding succeeds
using the suggested filesystem encoding on Linux, then write down the
encoding that we used and include that with the filename. This
expands the size of our filenames significantly, but it is the only
way to allow some future programmer to undo the damage of a falsely-
successful decoding. Here's our whole plan: [5].
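
To make the two workarounds concrete, here is a minimal sketch of the idea, not the actual Tahoe code. It is written for Python 3 for brevity, and the names DecodedName, decode_filename and original_bytes are made up for illustration. It assumes the caller passes in the suggested filesystem encoding (for example from sys.getfilesystemencoding()).

```python
from typing import NamedTuple


class DecodedName(NamedTuple):
    """Hypothetical record stored alongside each filename."""
    text: str            # the unicode filename we keep
    encoding: str        # the encoding actually used to produce `text`
    decode_failed: bool  # True if we fell back to iso-8859-1 mojibake


def decode_filename(raw: bytes, fs_encoding: str) -> DecodedName:
    """Decode a raw filename, recording which encoding was used."""
    try:
        # Second workaround: decoding "succeeded", but it may be falsely
        # successful, so write down the encoding next to the name.
        return DecodedName(raw.decode(fs_encoding), fs_encoding, False)
    except UnicodeDecodeError:
        # First workaround: fall back to mojibake. iso-8859-1 maps every
        # byte 0x00-0xFF to a code point, so this decode cannot fail.
        return DecodedName(raw.decode("iso-8859-1"), "iso-8859-1", True)


def original_bytes(name: DecodedName) -> bytes:
    """Undo the decoding using the recorded encoding.

    This round-trips for the encodings used below; it is an assumption
    that it does so for every codec a system might suggest.
    """
    return name.text.encode(name.encoding)


raw = b"caf\xe9"  # "cafe" with e-acute, as written by a latin-1 system

failed = decode_filename(raw, "utf-8")      # not valid utf-8: falls back
falsely_ok = decode_filename(raw, "koi8-r")  # decodes, but to the wrong text
```

In both cases a later programmer can get the original bytes back with original_bytes() and try again with a better guess. On the iso-8859-1 versus windows-1252 question: Python's cp1252 codec leaves a few bytes (such as 0x81) undefined, so a strict decode with it can itself fail, whereas iso-8859-1 never does.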

Regards,
Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html
    (see the footnote of this message)
[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47