Re: locale specific ordering in EN_US -- why is a

Aharon Robbins Mon, 01 Jul 2013 11:59:40 -0700

[ I know I'm going to regret this... ]

> `[a-z]' is case insensitive
>
>   You are encountering problems with locales.  POSIX mandates that `[a-z]'
>   uses the current locale's collation order -- in C parlance, that means
>   strcoll(3) instead of strcmp(3).


As of the 2008 POSIX standard, this is no longer true. Ranges are now
implementation-defined. This is what gives us the leeway to move to
a range interpretation that is not based on locales.

Although in theory locales seem like a good idea, and having '[a-z]'
include all kinds of other characters between the ASCII 'a' and 'z'
sounds nice, well over 10 years of experience has shown me, at least,
that it only confuses users and leads to problems.

For example, in some vendor en_US.UTF-8 locales, the ordering is

        AaBb ... YyZz

and in others it is:

        aAbB ... yYzZ

So try to explain to a user why '[a-z]' includes all of a...z but only
A...Y or B...Z, depending on the locale!
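
To make the asymmetry concrete, here is a minimal sketch in Python. It
does not consult any real locale; the two strings are hypothetical
stand-ins for the collation sequences shown above, and the helper names
are made up for illustration.

```python
import string

# Hypothetical collation sequences modeled on the two orderings above.
# Real locale tables are far larger; these cover only the ASCII letters.
pairs = list(zip(string.ascii_uppercase, string.ascii_lowercase))
upper_first = "".join(u + l for u, l in pairs)  # AaBb ... YyZz
lower_first = "".join(l + u for u, l in pairs)  # aAbB ... yYzZ

def letters_in_range(order, lo, hi):
    """Return every character of `order` that sorts between lo and hi, inclusive."""
    return order[order.index(lo):order.index(hi) + 1]

def uppercase_matched(order):
    """Return the uppercase letters that a collation-based '[a-z]' would pick up."""
    return "".join(c for c in letters_in_range(order, "a", "z") if c.isupper())

print(uppercase_matched(upper_first))  # B through Z: 'A' sorts before 'a'
print(uppercase_matched(lower_first))  # A through Y: 'Z' sorts after 'z'
```

In both cases every lowercase letter is inside the range; which single
uppercase letter is left out depends only on whether the locale sorts
upper or lower case first.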

In short, nothing but pain and confusion and endless bug reports.

By defining '[a-z]' as using the machine's character set, you
know what you're getting, and you are compatible with original
Unix practice. (You are in for slight confusion on an EBCDIC
machine, but that was always the case anyway, and that is several
orders of magnitude less of a problem than the mess created by locales.)
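
As a rough illustration of what "using the machine's character set"
means, the Python sketch below compares raw character codes. It is not
gawk's code; the helper name is made up, and Python's cp037 codec is
used only as an assumed stand-in for an EBCDIC machine.

```python
import re
import string

def in_codepoint_range(ch, lo="a", hi="z"):
    """True if ch lies between lo and hi by raw character code."""
    return ord(lo) <= ord(ch) <= ord(hi)

# With ASCII-compatible codes, '[a-z]' is exactly the 26 lowercase letters.
matched = "".join(c for c in string.ascii_letters if in_codepoint_range(c))
print(matched)

# Python's own re module treats ranges this way, so 'B' does not match.
print(re.fullmatch(r"[a-z]", "B"))

# On EBCDIC the lowercase letters are not contiguous, so the same rule
# also sweeps in whatever characters sit in the gaps between them.
lo, hi = "a".encode("cp037")[0], "z".encode("cp037")[0]
ebcdic_range = bytes(range(lo, hi + 1)).decode("cp037")
ebcdic_extras = [c for c in ebcdic_range if c not in string.ascii_lowercase]
print(len(ebcdic_range), len(ebcdic_extras))
```

The result depends only on the encoding, never on the user's locale
settings, which is the predictability being argued for here.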

After moving gawk to historic range interpretation, the number
of bug reports related to this has dropped to close to zero.
I'm happier, and my users are happier.

I'd be thrilled if the GLIBC locale tables were fixed. But
in the meantime, I have decided to leave this whole issue behind me.

I'll go crawl back under my rock now.

Arnold
