On Fri, 2003-01-03 at 18:11, Jochen Voss wrote:
> Is this meant to apply to programs like "ls", "bash", "touch", and
> "emacs"?
Yes.

> I imagine that the transition period could be a hard time
> for users who (like me) use non-ASCII characters in file-names.

That is probably true.  But we really have no other choice.  See below.

> As I see it, the current (broken ?) behaviour is, to use the user's
> locale setting (LC_CTYPE) to encode file names.

It appears so, and yes, this behavior is completely and fundamentally
broken.  Say you have a Chinese friend who logs onto your computer and
sets LANG to something like zh_TW.Big5.  When he tries to 'ls' your
files, it will completely fail; likewise, when you try to look at his,
it will not work at all.

Moreover, say the system administrator does something like 'find /home'.
The resulting stream will be a mixture of ISO-8859-X and Big5, and it is
impossible to reliably tell the two apart.

The problem doesn't only occur on multiuser systems, either: your
Chinese friend could send you a .ogg file named using Big5, and your
Latin-1 system would simply fail to display the filename.

Finally, having the encoding of filenames depend on the current locale
often doesn't make sense even for a single user.  What if you are a
software developer in an ISO-8859-1 locale, and you want to test the
Japanese translation of your software?  You run it with LANG=ja_JP.eucJP
or something to get the translations displayed, and as a side effect,
all the non-ASCII filenames on your system will fail to work.

In summary, UTF-8 is the *only* sane character set to use for
filenames.  Major upstream software for Debian like GNOME is moving
towards requiring UTF-8 for filenames, and we should too.  See for
example:

  http://www.gtk.org/gtk-2.0.0-notes.html

Microsoft Windows has used Unicode for filenames for a long time because
of issues like these.  MacOS also uses Unicode.  And as Tollef said,
Red Hat 8 has already switched to defaulting to UTF-8 for new systems.
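To make the ambiguity concrete, here is a minimal Python sketch (my
illustration, not part of the original mail) showing that the same
filename bytes decode differently, or not at all, depending on which
locale encoding the reader assumes:

```python
# The bytes a Big5 user would write for the filename "中文":
raw = "中文".encode("big5")

# A Latin-1 reader happily "decodes" them into mojibake:
print(raw.decode("latin-1"))

# A UTF-8 reader rejects them outright, because the byte stream
# is not well-formed UTF-8:
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)
```

This is exactly why a 'find /home' stream mixing encodings cannot be
reliably interpreted: every legacy 8-bit encoding will "successfully"
decode anything, while only UTF-8 can actually reject foreign bytes.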
> During the
> transition period non-ASCII file names will have two possible
> representations in the file system (LC_CTYPE vs. UTF-8).  I think
> we should clarify the following points before introducing the above
> into policy:
>
> 1) Should interpretation of existing files' names as UTF-8
>    be implemented before the encoding of newly created files'
>    names is switched?

I am not sure what policy can say here.  For people using filenames in
legacy encodings, perhaps policy could suggest that programs try to fall
back to the user's locale encoding if the filename is not valid UTF-8.
This might become common practice, but I don't think policy should
require it.

Again, major chunks of upstream software with Unicode support (like
GNOME) are *already* interpreting filenames as UTF-8 by default.  I am
just trying to bring policy in line with best practice in this regard.

> 2) How should already existing files with non-ASCII names
>    be converted?

There are lots of different options; we could have a package
'unicode-transition' in base which would convert all local filesystems,
or we could do it as part of a base-files upgrade.  But mainly, this is
a technical issue separate from policy, in my opinion.  We can hash out
those detailed plans separately from this proposal.
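The fallback suggested above (try UTF-8 first, then the user's locale
encoding) can be sketched in a few lines of Python.  The function name
and the hard-coded Latin-1 default are illustrative assumptions, not any
real API; a real program would take the fallback from LC_CTYPE:

```python
def display_name(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode a filename for display: try UTF-8 first, then fall
    back to a legacy encoding.  This works because UTF-8 is strict:
    byte strings from other encodings almost never validate."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")

print(display_name("naïve.txt".encode("utf-8")))  # valid UTF-8 passes through
print(display_name(b"caf\xe9.ogg"))               # Latin-1 bytes fall back
```

Note the asymmetry: the UTF-8 attempt can fail cleanly, but the legacy
fallback always "succeeds", which is why UTF-8 must be tried first.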