On Sun, Sep 02, 2012 at 11:11:56PM -0400, Dan B. wrote:
> Roger Leigh wrote:
> >On Sat, Sep 01, 2012 at 07:32:48PM -0400, Dan B. wrote:
> >...
> >
> >>Which common programs (e.g., getty, xterm/etc., sed/grep?) do something
> >>different based on the charset portion of the local setting?
> >
> >All of them, in short.
> >
> >When you run a terminal emulator such as xterm, it will get the
> >encoding to use inside the emulator using nl_langinfo(3). ...
>
> What about the virtual consoles?

Virtual consoles are slightly different.  Because they start up
/before/ you log in, they switch Unicode mode on or off depending
on the default system locale (/etc/default/locale).  See
unicode_start_stop in /etc/init.d/console-screen.kbd.sh.  You
can switch them into Unicode mode with unicode_start, which sends
an escape sequence to select the ISO-2022 UTF-8 charset.

> Whether I choose a default system locale of UTF-8 or None (in the
> dialog for "dpkg-reconfigure locales"), and log out and log in (to
> make sure the shell has a chance to get fresh settings), then
>
>   echo $'\xC2\xA2'
>
> displays the same thing (the cent sign).

"None" might result in UTF-8 as a default.  Try ISO-8859-1 to
explicitly specify a non-Unicode locale.  Note that you'll need to
generate a suitable locale, e.g. en_GB.ISO-8859-1, with
locale-gen/localedef.

> Is the virtual console supposed to follow the locale's character
> encoding?  If so, does something else (e.g., something in /etc/init.d/)
> need to be run to make a difference?

/etc/init.d/console-screen.kbd.sh as above.

> Actually, what I really want to know is how to revert the sorting of
> file names from ls (and Emacs dired listings) from the order caused
> by having "en_US" in LANG=en_US.UTF-8 back to the traditional (old)
> Unix order (e.g., what LANG=C would yield) without messing up all the
> UTF-8 support that's all over Linux now.

> First of all, can UTF-8 be combined with the "C" locale as in
> LANG=C.UTF-8?

Yes (and no).  You can certainly generate such a locale.  In fact,
I'm a strong proponent of having a C.UTF-8 locale as the default
locale in glibc.  However, right now the locale you get by
generating it is not completely compatible with a real C locale
(i.e. conformant with the C and POSIX standards).  Hopefully this
will be the case in the future.

> Do I probably want something closer to LANG=en_US.UTF-8 LC_COLLATE=C
> (in order to reduce the amount of locale settings I'm overriding)?

Just set LC_COLLATE=C.
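As a rough sketch of trying this for a single command (the scratch
directory and file names are made up for illustration, and I'm
assuming a POSIX-ish shell with mktemp available):

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)
touch "$dir/alpha" "$dir/beta" "$dir/Makefile" "$dir/README"

# Override collation for this one command only.  LC_ALL is cleared
# because a non-empty LC_ALL would take precedence over LC_COLLATE;
# LANG (and so LC_CTYPE) is left as it is.
listing=$(LC_ALL= LC_COLLATE=C ls -1 "$dir")
printf '%s\n' "$listing"
# Uppercase names sort before lowercase ones in C order:
# Makefile, README, alpha, beta

rm -r "$dir"
```

To make it stick for a whole session you would export LC_COLLATE=C
from your shell startup file instead, provided LC_ALL isn't set
anywhere.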
So you keep the UTF-8 LC_CTYPE, but the sort order is taken from C.
However, this will likely mis-sort any character outside the ASCII
range, since C is a 7-bit ASCII locale.  [Note: you probably do not
want this!]  In general, I would advise using the default collation
for your locale, though in code it's common to switch to C for
locale-independent sorting.

> >When you run sed/grep, the encoding will affect how it processes the
> >text.
>
> Are you sure about sed?
>
> I tried probing how LANG= vs. LANG=en_US.UTF-8 affected whether
> the regular expression "[a-z]" matched "X".  Grep seems to be
> affected as expected, but sed never matched.  (That's on Squeeze.)

It's the same version in wheezy, so I would not expect a change
here.  I'm not sure how [a-z] matches--I'd have to check if it's
locale-independent.  In general, I'd use POSIX character classes
like [:alpha:], [:upper:] and [:lower:] (written inside a bracket
expression, e.g. [[:lower:]]) to work properly in all locales.


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux    http://people.debian.org/~rleigh/
 `. `'   schroot and sbuild  http://alioth.debian.org/projects/buildd-tools
   `-    GPG Public Key      F33D 281D 470A B443 6756 147C 07B3 C8BC 4083 E800