Vincent Lefevre wrote:
> Bob Proulx wrote:
> > The collation sequence of [a-z] in dictionary ordering is really
> > "aAbBcC...xXyYzZ" and not "abc...z".  So when you say "[a-z]" you are
> > getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> > really getting "AbBcC...xXyYzZ" with 'A'!
>
> This is not what I observe (though I was expecting this behavior)
> on Debian/unstable. Is it a bug?

To me it just tells me you are running Sid/Testing with the newer
grep.  Try it on a Squeeze machine to observe the previous behavior.
Squeeze released with 2.6.3 but Sid currently has 2.10.  Etch released
with 2.5.3.

Here is the upstream discussion of rational ranges:

  http://lists.gnu.org/archive/html/bug-grep/2011-11/msg00106.html

Continues:

  http://lists.gnu.org/archive/html/bug-grep/2011-12/msg00003.html

Implementation:

  http://lists.gnu.org/archive/html/bug-grep/2012-01/msg00088.html

I haven't been following the upstream grep project closely and so I am
making some assumptions which may be incorrect.  But the behavior
matches what I am seeing so seems reasonable to assume it.

Confusing things is that there were Debian specific range patches in
grep that have been noted as coming and going in the changelog.
Seeing those worries me about my assumptions.  But don't have the time
to pull the code and look just to satisfy my curiosity.  If you find
out otherwise I would be interested in knowing.

> xvii% export LC_ALL=en_US.utf8
> xvii% echo BC | grep '[a-z]'

In Sid with 2.10, yes.  With grep 2.6.3 in Squeeze:

  $ echo BC | LC_ALL=en_US.UTF-8 grep '[a-z]'
  BC

> At least "sort" seems to behave as expected:
>
> xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort
> AB
> BC
> CD
> ab
> bc
> cd
> xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=en_US.utf8 sort
> ab
> AB
> bc
> BC
> cd
> CD

For more on this topic try it with some punctuation (which is ignored)
in place.  Since the punctuation is ignored it can produce some very
surprising sort results.

  $ printf '%s\n' AB A.B BC B.C CD ab bc b.c cd | LC_ALL=en_US.UTF-8 sort
  ab
  AB
  A.B
  bc
  b.c
  BC
  B.C
  cd
  CD

  $ printf '%s\n' AB A.B BC B.C CD ab bc b.c cd | LC_ALL=C sort
  A.B
  AB
  B.C
  BC
  CD
  ab
  b.c
  bc
  cd

> But the grep man page still says:
>
>   Within a bracket expression, a range expression consists of two
>   characters separated by a hyphen.  It matches any single character that
>   sorts between the two characters, inclusive, using the locale's
>   collating sequence and character set.  For example, in the default C
>   locale, [a-d] is equivalent to [abcd].  Many locales sort characters in
>   dictionary order, and in these locales [a-d] is typically not
>   equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example.
>   To obtain the traditional interpretation of bracket expressions, you
>   can use the C locale by setting the LC_ALL environment variable to the
>   value C.

I don't see any problem with that wording.  The opening for almost any
behavior comes from "using the locale's collating sequence and
character set" which isn't defined by grep but is defined by libc.
Was there something there in particular that you didn't like?

Fortunately setting LC_ALL=C converges all of the behavior across all
of the versions.  It would be a nightmare to keep track of all of the
individual versions and behaviors otherwise.

Bob
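
A minimal sketch for checking this on a particular machine, assuming a
POSIX shell and a grep on the PATH.  The name range_matches is a
hypothetical helper for illustration, not anything shipped with grep.
It reports whether a bracket range matches a string under a given
locale:

```shell
#!/bin/sh
# Sketch only: range_matches is a made-up helper name, not part of grep.
# Usage: range_matches LOCALE BRACKET_EXPRESSION TEXT
range_matches() {
    # Run grep quietly under the requested locale and report the result.
    if printf '%s\n' "$3" | LC_ALL="$1" grep -q "$2"; then
        echo match
    else
        echo no-match
    fi
}

# C locale: [a-z] is the traditional a..z, so "BC" does not match.
range_matches C '[a-z]' BC

# Dictionary-order locale: the result depends on the grep version and on
# whether the locale is installed, so no particular output is assumed.
range_matches en_US.UTF-8 '[a-z]' BC
```

Only the LC_ALL=C line is predictable across versions.  Going by the
results quoted above, the en_US.UTF-8 line would print "match" with
grep 2.6.3 and "no-match" with 2.10.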