On 2012-02-05 17:55:48 -0700, Bob Proulx wrote:
> The collation sequence of [a-z] in dictionary ordering is really
> "aAbBcC...xXyYzZ" and not "abc...z". So when you say "[a-z]" you are
> getting "aAbBcC...xXyYz" without 'Z' and when you say "[A-Z]" you are
> really getting "AbBcC...xXyYzZ" with 'A'!
This is not what I observe (though I was expecting this behavior)
on Debian/unstable. Is it a bug?

xvii% export LC_ALL=en_US.utf8
xvii% locale
LANG=POSIX
LANGUAGE=
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8
xvii% echo BC | grep '[a-z]'
xvii% echo BC | grep '[A-z]'
grep: Invalid range end
xvii% echo BC | LC_ALL=C grep '[A-z]'
BC

The test with '[A-z]' shows that something happens with the collating
rules, but then I would have expected

  echo BC | grep '[a-z]'

to output BC. At least "sort" seems to behave as expected:

xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=C sort
AB
BC
CD
ab
bc
cd
xvii% printf '%s\n' AB BC CD ab bc cd | LC_ALL=en_US.utf8 sort
ab
AB
bc
BC
cd
CD

> In better news, after years and years of dealing with this problem,
> there is now a move by applications (both gnu awk and gnu grep IIRC,
> awk is in experimental now) to reverse this behavior in the userland
> code.

Perhaps this explains what I'm seeing with grep, except for [A-z].
But the grep man page still says:

    Within a bracket expression, a range expression consists of two
    characters separated by a hyphen. It matches any single character
    that sorts between the two characters, inclusive, using the
    locale's collating sequence and character set. For example, in
    the default C locale, [a-d] is equivalent to [abcd]. Many locales
    sort characters in dictionary order, and in these locales [a-d]
    is typically not equivalent to [abcd]; it might be equivalent to
    [aBbCcDd], for example. To obtain the traditional interpretation
    of bracket expressions, you can use the C locale by setting the
    LC_ALL environment variable to the value C.
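
To make the man page's example concrete, here is a minimal sketch of the
two readings of [a-d]. It deliberately sidesteps the locale question:
both commands run in the C locale, and the dictionary-order reading is
spelled out by hand as [aBbCcDd], which is only the set the man page
offers as a possibility, not what any given locale is guaranteed to
produce. The function names (sample, traditional, dictionary) are made
up for illustration.

```shell
#!/bin/sh
# Five one-character test lines, mixing upper and lower case.
sample() { printf '%s\n' a B b D d; }

# Traditional reading: in the C locale, [a-d] is exactly [abcd].
traditional() { sample | LC_ALL=C grep '[a-d]'; }

# Dictionary-order reading, written out explicitly as the set the
# man page mentions (an assumption, not a locale lookup).
dictionary() { sample | LC_ALL=C grep '[aBbCcDd]'; }

echo "traditional:"; traditional
echo "dictionary:";  dictionary
```

With the first reading only the lowercase lines are printed; with the
second, B and D are matched as well. This is the difference I was
expecting to see between LC_ALL=C and LC_ALL=en_US.utf8 above.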
-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <http://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <http://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)