Neal Murphy wrote:
> For quite some time now, I've been getting peeved with egrep not
> doing what it should.

You don't like it and I don't like it, but the powers that be have
decided that within a locale, within libc, character collation
sequences will use dictionary ordering, where case is folded and
punctuation is ignored. They failed to see how this would negatively
impact almost everything. Creeping features. And because punctuation
is ignored, it causes a lot of problems with utilities such as sort.
You didn't have to say LC_ALL=C for the first thirty years. But you do
now. (Or at least since the mid 1990s.) In almost all scripts dealing
with sort ordering you will find it necessary to set LC_ALL=C to get
the expected results. I have been a rather outspoken critic of this
design decision on other mailing lists.

> I have Squeeze installed and up-to-date. In an xterm running bash or on a
> console running bash or dash, this command:
>   ls -C1 | egrep "^[A-Z]"
> returns all lines except those beginning with 'a'.

The collation sequence of [a-z] in dictionary ordering is really
"aAbBcC...xXyYzZ" and not "abc...z". So when you say "[a-z]" you are
getting "aAbBcC...xXyYz" without 'Z', and when you say "[A-Z]" you are
really getting "AbBcC...xXyYzZ" without 'a'! That is exactly what you
are seeing. In the en_US.UTF-8 locale (for example), what would
traditionally have been [A-Z] and [a-z] must now be specified as
[[:upper:]] and [[:lower:]] instead.

[An aside: I do find [:space:] a useful character class meaning any of
the whitespace characters. It is POSIX compliant and can be cut and
pasted nicely. Of course Perl heads will want to use \s and \S.]

In better news, after years and years of dealing with this problem,
there is now a move by applications (both GNU awk and GNU grep, IIRC;
awk is in experimental now) to reverse this behavior in the userland
code. So what libc has put in will finally be reversed by applications
voting with code to take it out. The newest GNU awk is re-implementing
ranges A-Z and a-z as you would expect.
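
The difference between a range and a character class described above is
easy to check at a prompt. What follows is only a sketch: the sample
names are made up for illustration, and it assumes a POSIX shell with
grep available. Forcing LC_ALL=C on the one command makes [A-Z] the
plain ASCII range, while [[:upper:]] asks for uppercase letters without
depending on the collation order.

```shell
# Hypothetical sample data: mixed-case names, one per line.
names='alpha
Bravo
charlie
Delta
Zulu
zebra'

# With LC_ALL=C on the command, [A-Z] is the ASCII range A..Z only.
range_c=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[A-Z]')

# A character class asks for "uppercase" directly, so it does not
# depend on where the range endpoints fall in the collation order.
class_upper=$(printf '%s\n' "$names" | grep -E '^[[:upper:]]')

printf 'range in C locale:\n%s\n' "$range_c"
printf 'upper class:\n%s\n' "$class_upper"
```

Dropping LC_ALL=C from the first command in a dictionary-ordered locale
may also let lowercase names through, depending on the grep and libc
versions installed, which is the behavior reported in this thread.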

Here is a reference on the gawk change:

  http://www.gnu.org/s/gawk/manual/html_node/Ranges-and-Locales.html

> Even the following commands exhibit similar behavior:
>
>   alias|sed -e 's/^a/b/'|egrep "^[A-Z]" # passes sed's output untouched
>   alias|sed -e 's/^a/A/'|egrep "^[A-Z]" # passes sed's output untouched
>
> These commands behave the same way on another Squeeze installation
> at another location. Also, 'grep -E' behaves the same way.
>
> The commands behave as expected on a different GNU/Linux system.
>
> Does anyone else see this behavior? Or do I need to clean my pipe and smoke
> something better?

The character collation sequence affects almost everything on the
system that sorts. This includes commands such as 'ls' and also your
shell (e.g. 'echo *'). Plus things like 'expr'. Everything that uses
libc strcoll(3) is affected, and that is pretty much everything. It is
a part of the conversion for a program to be multi-byte character
aware. Programs were converted en masse in the 1990s in order to
support UTF-8 and non-English locales. Mostly that was good. But for
things like this I think it was quite bad.

Personally I have the following in my $HOME/.bashrc file.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

That sets most of my locale to a UTF-8 one but forces sorting to be
standard C/POSIX. This probably won't work in the general case, since I
have no idea how it would interact with all character sets. I expect it
will interact very badly with big5, for example. And I don't know how
it deals with other non-English character sets. But it gives me some
relief.

Bob
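
The effect of the collation setting on sort can be sketched in a few
lines. The input letters are made up; the point is only that C
collation orders by byte value, so uppercase ASCII letters sort before
lowercase ones. The second command applies the LANG and LC_COLLATE
settings to a single command, unsetting LC_ALL first because LC_ALL
takes precedence over LC_COLLATE. It assumes a POSIX shell; if
en_US.UTF-8 is not installed, sort should fall back to the C locale
anyway.

```shell
# Hypothetical input: four letters in mixed case.
letters='b
A
a
B'

# LC_ALL=C overrides every other locale variable for this command.
sorted_all=$(printf '%s\n' "$letters" | LC_ALL=C sort)

# A UTF-8 locale in general, but C collation. LC_ALL is unset in a
# subshell so that it cannot override LC_COLLATE.
sorted_collate=$(printf '%s\n' "$letters" |
    (unset LC_ALL; LANG=en_US.UTF-8 LC_COLLATE=C sort))

printf 'LC_ALL=C:\n%s\n' "$sorted_all"
printf 'LC_COLLATE=C:\n%s\n' "$sorted_collate"
```

Setting LC_ALL=C per command is the safer choice inside scripts, since
it does not depend on what the caller has exported.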