[ I know I'm going to regret this... ]

> `[a-z]' is case insensitive
>
> You are encountering problems with locales. POSIX mandates that `[a-z]'
> uses the current locale's collation order -- in C parlance, that means
> strcoll(3) instead of strcmp(3).

As of the 2008 standard, this is no longer true. Ranges are now implementation defined. This is what gives us the leeway to move to range interpretation not based on locales.

Although in theory locales seem like a good idea, and having '[a-z]' include all kinds of other characters between the ASCII 'a' and 'z' sounds nice, well over 10 years of experience has shown me, at least, that it only confuses users and leads to problems.

For example, in some vendor en_US.UTF-8 locales, the ordering is

    AaBb ... YyZz

and in others it is:

    aAbB ... yYzZ

So try and explain why '[a-z]' includes all of a...z but only A...Y or B...Z !!!

In short, nothing but pain and confusion and endless bug reports.

By defining '[a-z]' as using the machine's character set, you know what you're getting, and you are compatible with original Unix practice. (You are in for slight confusion on an EBCDIC machine, but that was always the case anyway, and that is several orders of magnitude less of a problem than the mess created by locales.)

After moving gawk to historic range interpretation, the number of bug reports related to this has dropped to close to zero. I'm happier, and my users are happier.

I'd be thrilled if the GLIBC locale tables would be fixed. But in the meantime, I have decided to leave this whole issue behind me.

I'll go crawl back under my rock now.

Arnold
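
For readers who want to see the AaBb versus aAbB problem concretely, here is a minimal sketch in Python. It is a toy model only: each ordering is represented as a plain string, and a range is taken to be everything between its two endpoints in that string. Real locale collation tables are considerably more elaborate, and the helper names (`collation_order`, `range_members`, `uppercase_in_range`) are made up for this illustration; they are not part of gawk, GLIBC, or any library.

```python
import string


def collation_order(upper_first):
    """Build a toy interleaved ordering of the ASCII letters.

    upper_first=True  gives "AaBb...YyZz"
    upper_first=False gives "aAbB...yYzZ"
    (Hypothetical stand-ins for the two vendor orderings described above.)
    """
    pairs = zip(string.ascii_uppercase, string.ascii_lowercase)
    if upper_first:
        return "".join(u + l for u, l in pairs)
    return "".join(l + u for u, l in pairs)


def range_members(lo, hi, order):
    """Return every character of `order` from lo through hi, inclusive."""
    start, end = order.index(lo), order.index(hi)
    return order[start:end + 1]


def uppercase_in_range(order):
    """Which uppercase letters does '[a-z]' pick up under this ordering?"""
    return "".join(c for c in range_members("a", "z", order) if c.isupper())


# Historic interpretation: the ordering is simply the character set's own
# code order, which for ASCII puts all of A-Z before all of a-z.
ascii_order = "".join(sorted(string.ascii_letters))

upper_first = uppercase_in_range(collation_order(True))   # B through Z
lower_first = uppercase_in_range(collation_order(False))  # A through Y
historic = uppercase_in_range(ascii_order)                # empty string

print("AaBb... ordering, uppercase matched by [a-z]:", upper_first)
print("aAbB... ordering, uppercase matched by [a-z]:", lower_first)
print("ASCII ordering,   uppercase matched by [a-z]:", repr(historic))
```

Under both interleaved orderings every lowercase letter falls inside the range, but exactly one uppercase letter is left out, and which one depends on the ordering. Under the character-set ordering the range contains the 26 lowercase letters and nothing else.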