Thanks for this illuminating response to what I thought might have been mere user naivete.
On Sun, Feb 05, 2012 at 05:55:48PM -0700, Bob Proulx wrote:
> Neal Murphy wrote:
> > For quite some time now, I've been getting peeved with egrep not
> > doing what it should.
>
> You don't like it and I don't like it but the powers that be have
> decided that within a locale, within libc, character collation
> sequences will be dictionary ordering where case is folded and
> punctuation is ignored.  They failed to see how this would negatively
> impact almost everything.  Creeping features.
>
> And because punctuation is ignored it causes a lot of problems with
> utilities such as sort.  You didn't have to say LC_ALL=C for the first
> thirty years.  But you do now.  (Or at least since the mid 1990's.)
> In almost all scripts dealing with sort ordering you will find it
> necessary to set LC_ALL=C to get expected results.  I have been a
> rather outspoken critic of this design decision on other mailing
> lists.
>
> > I have Squeeze installed and up-to-date. In an xterm running bash or on a
> > console running bash or dash, this command:
> >   ls -C1 | egrep "^[A-Z]"
> > returns all lines except those beginning with 'a'.
>
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" without 'a'!

Holy rat piss! Understanding this is a lot to ask of someone new (or
even old) to a Unix-like environment. I usually recommend that people
learn Unix tools (it's fun!), but this is exceptionally obtuse, if not
downright unfriendly.

> In the en_US.UTF-8 locale (for example) what would traditionally have
> been [A-Z] and [a-z] now must be specified as [[:upper:]] and
> [[:lower:]] instead.
>
> [An Aside: I do find [:space:] a useful character class meaning any of
> the whitespace characters.  It is POSIX compliant and can be cut and
> pasted nicely.  Of course perl heads will want to use \s and \S.]
>
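For anyone who wants to try the difference on their own machine, here
is a minimal sketch. The two sample words are made up, and the output of
the middle command depends on your locale and grep version, so I have
not written an expected result for it.

```shell
#!/bin/sh
# Hypothetical sample input: one lowercase-initial and one uppercase-initial word.
words='apple
Banana'

# Range under the C locale: [A-Z] covers exactly the ASCII bytes 'A' through 'Z'.
c_range=$(printf '%s\n' "$words" | LC_ALL=C grep -E '^[A-Z]')

# Same range under whatever locale the shell currently has.
# With dictionary collation this is where 'apple' may or may not slip through.
locale_range=$(printf '%s\n' "$words" | grep -E '^[A-Z]')

# POSIX character class: asks for "uppercase letter" directly, no range involved.
upper_class=$(printf '%s\n' "$words" | grep -E '^[[:upper:]]')

printf 'C locale, [A-Z]:       %s\n' "$c_range"
printf 'current locale, [A-Z]: %s\n' "$(printf '%s' "$locale_range" | tr '\n' ' ')"
printf '[[:upper:]]:           %s\n' "$upper_class"
```

The first and last lines should both show only Banana. If the middle
line also shows apple, you are seeing the collation behavior described
above.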
> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both gnu awk and gnu grep IIRC,
> awk is in experimental now) to reverse this behavior in the userland
> code.  So what libc has put in will finally be reversed by
> applications voting with code to take it out.  The newest gnu awk is
> re-implementing ranges A-Z and a-z as you would expect.
>
> Here is a reference:
>
>   http://www.gnu.org/s/gawk/manual/html_node/Ranges-and-Locales.html
>
> > Even the following commands exhibit similar behavior:
> >
> >   alias|sed -e 's/^a/b/'|egrep "^[A-Z]" # passes sed's output untouched
> >   alias|sed -e 's/^a/A/'|egrep "^[A-Z]" # passes sed's output untouched
> >
> > These commands behave the same way on another Squeeze installation
> > at another location. Also, 'grep -E' behaves the same way.
> >
> > The commands behave as expected on a different GNU/Linux system.
> >
> > Does anyone else see this behavior? Or do I need to clean my pipe and smoke
> > something better?
>
> The character collation sequence affects almost everything on the
> system that sorts.  This includes commands such as 'ls' and also your
> shell (e.g. 'echo *') too.  Plus things like 'expr'.  Everything that
> uses libc strcoll(3) and that is pretty much everything.  It is a part
> of the conversion for a program to be multi-byte character aware.
> Programs were converted en masse in the 90's in order to support UTF-8
> and non-English locales.  Mostly that was good.  But for things like
> this I think it was quite bad.
>
> Personally I have the following in my $HOME/.bashrc file.
>
>   export LANG=en_US.UTF-8
>   export LC_COLLATE=C

Thanks, that's very useful.

> That sets most of my locale to a UTF-8 one but forces sorting to be
> standard C/POSIX.  This probably won't work in the general case since
> I have no idea how that would interact with all character sets.  I
> expect it will interact very badly with big5 for example.
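A quick way to check what that LC_COLLATE=C setting does to sort, as a
sketch with made-up sample letters. LC_ALL takes precedence over
LC_COLLATE when it is set to a non-empty value, so the example blanks it
for the one command; that detail is my addition, not part of the
.bashrc lines quoted above.

```shell
#!/bin/sh
# Hypothetical sample input mixing upper and lower case.
sample='b
A
a
B'

# Blank LC_ALL for this command so that LC_COLLATE=C is what sort actually sees.
c_sorted=$(printf '%s\n' "$sample" | LC_ALL= LC_COLLATE=C sort)

# With C collation the order is by byte value: A, B, a, b.
printf '%s\n' "$c_sorted"
```

If a plain "sort" on the same input gives a different order, the
difference is the locale's collation at work.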
> And I don't know how it deals with other non-English character sets.
> But it gives me some relief.
>
> Bob

-- 
Joel Roth

Archive: http://lists.debian.org/20120206034505.GB19228@sprite