Re: pg_collation.collversion for C.UTF-8

Jeff Davis Thu, 25 May 2023 11:30:38 -0700

On Wed, 2023-04-19 at 14:07 +1200, Thomas Munro wrote:
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by
> the glibc project) isn't supposed to be versioned, but it's extremely
> unfortunate that a bunch of OSes (Debian and maybe more) have been
> sorting text in some other order under that name for years.


What should we do with locales like C.UTF-8 in both libc and ICU? 

We either need to capture it and use the memcmp/pg_ascii code paths so
it doesn't use the provider at all (like C); or if we send it to the
provider, we can't have too many expectations about what will be done
with it (even if we know what "should" happen).

If we capture it like the C locale, then where do we draw the line? Any
locale that begins with "C."? What if the language part is C but there
is some other part to the locale? What about lower case? Should all of
these apply the same way except with POSIX? What about backwards
compatibility?

If we pass it to the provider:

* ICU: Recent versions of ICU don't recognize C.UTF-8 at all, and if
you try to open it, you'll get the root collator (with warning or
error, which is not great for such a common locale name). ICU versions
63 and earlier recognize C.UTF-8 as en-US-u-va-posix (a.k.a.
en_US_POSIX), which has some adjustments to match expectations of C
sorting (e.g. upper case first).

* libc: problems as raised in this thread.

Regards,
        Jeff Davis

Re: pg_collation.collversion for C.UTF-8

Reply via email to