-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Feb 14, 2017 at 10:19:14PM +0000, Chris Vine wrote:
> On Tue, 14 Feb 2017 21:52:01 +0000 (UTC)
> Mike Gran <spk...@yahoo.com> wrote:
> [snip]
> > > In particular, filenames are *not*, nor can they be mapped to,
> > > Unicode strings in Linux.
> >
> > True.  Linux should follow OpenBSD and make all locales UTF-8.
>
> Filenames and locales are not necessarily related.  When you access a
> networked file system, you get the filename encoding you are given,
> which may or may not be the same as the particular locale encoding on
> your particular machine on one particular day, and may or may not be a
> unicode encoding.  Glib, for example, enables you to set this with the
> G_FILENAME_ENCODING environmental variable [...]

which is, btw., "just a better approximation", but still wrong: the
application creating a directory might have been "in" a different locale
(and thus have used a different encoding) than the one creating the file
within that directory. Most notably, a path might cross several mount
points, so it can well have fragments coming from several file systems.

I think the only sane way to see a Linux file system path is the way
Linux sees it: as a byte string. Sure, some helper infrastructure to try
to make characters out of that mess would be welcome, but it should be
absolutely robust wrt. unexpected input (e.g. bad UTF-8) and leave
control to the application. Not easy.

> g_filename_to_utf8() and g_filename_from_utf8() functions for this
> purpose.

To me, that seems insufficient, unless this just applies to one (e.g.
the last) path element. Skimming the docs I can't see whether you are
only supposed to do that or whether you can dump whole paths (or path
fragments) into those functions.

> You can tie the filename encoding to the locale encoding by
> defining the G_BROKEN_FILENAMES environmental variable but that is
> deprecated (the name suggests what they think about that idea).
>
> You may possibly agree with this: I am not clear from your post what
> connection you were making between locales and filenames.  But if
> OpenBSD requires all _filenames_ to be in valid UTF-8, that is a bad
> decision in my view.

NT has done that too. I don't know: there are arguments for both
approaches -- that depends on whether you think file names are composed
of characters (makes sense, no?) or whether the OS doesn't care what's
in them (just leave null and slash alone!). It's moving between those
two views that's hard.

Personally, I'd tend to have Guile being agnostic (i.e. byte arrays) at
the lowest level (no conversions), and offer the application what it
knows (on BSD or "modern" Windows say: "yes, that's UTF-8" and on Linux
say "No idea, but you can try to convert").
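
To make the "bytes at the bottom, conversion only on request" idea a bit
more concrete, here is a minimal sketch. It is in Python rather than
Guile, purely for illustration, and the helper names (split_path,
try_decode, describe, to_text_lossless, from_text_lossless) are made up
for this example; they are not part of Guile, Glib or any other library.
The assumptions are that paths arrive as raw bytes, that '/' is the only
byte with structural meaning, and that each path element is decoded on
its own, since different elements may have been created under different
encodings.

```python
def split_path(path):
    """Split a raw byte path on b'/' only; nothing is decoded here."""
    return [part for part in path.split(b"/") if part]


def try_decode(component, encodings=("utf-8",)):
    """Try each candidate encoding strictly, in order.

    Returns (text, encoding) for the first one that decodes cleanly, or
    (None, None) so that the caller keeps control over what to do with
    bytes that do not decode.
    """
    for enc in encodings:
        try:
            return component.decode(enc, errors="strict"), enc
        except UnicodeDecodeError:
            continue
    return None, None


def describe(path, encodings=("utf-8",)):
    """Per path element: (raw bytes, decoded text or None, encoding or None)."""
    return [(c, *try_decode(c, encodings)) for c in split_path(path)]


def to_text_lossless(component):
    """Decode as UTF-8, smuggling undecodable bytes through as lone
    surrogates (Python's 'surrogateescape' handler), so nothing is lost."""
    return component.decode("utf-8", errors="surrogateescape")


def from_text_lossless(text):
    """Inverse of to_text_lossless: restores the original bytes exactly."""
    return text.encode("utf-8", errors="surrogateescape")
```

With the input b"/srv/caf\xc3\xa9/na\xefve.txt", describe() reports the
first two elements as valid UTF-8 and returns (None, None) for the third
one, whose \xef byte is Latin-1 and not valid UTF-8; the raw bytes stay
available in every case. The two *_lossless helpers show one way to hand
such a name to string-oriented code and still get the same bytes back.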
Current locale is just a weak hint one might use in heuristics. For
things like environment variables and command line arguments, locale is
a stronger hint (but not 100%).

> Linux is capable of treating filenames as just a null-terminated array
> of bytes with '/' as the directory separator.  It is encoding agnostic,
> and that works just fine.

Or not. For the OS all is fine, for the applications it's a small hell
-- see those Glib functions you quoted, which -- given their interfaces
-- can't possibly do the right thing (dropping their names in a search
engine to skim their documentation turns up quite a lot of failure
modes, if you know what I mean).

regards
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlikHOgACgkQBcgs9XrR2kYBLACggihOlLCNLcUjlrsWh0vQMuH8
JxEAnRye7C4d1GNDJi7x6nLgI1PMamex
=+A5K
-----END PGP SIGNATURE-----