On Wed, 2023-04-19 at 14:07 +1200, Thomas Munro wrote:
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied by
> the glibc project) isn't supposed to be versioned, but it's extremely
> unfortunate that a bunch of OSes (Debian and maybe more) have been
> sorting text in some other order under that name for years.

What should we do with locales like C.UTF-8 in both libc and ICU?

We either need to capture it and use the memcmp/pg_ascii code paths so
it doesn't use the provider at all (like C); or if we send it to the
provider, we can't have too many expectations about what will be done
with it (even if we know what "should" happen).

If we capture it like the C locale, then where do we draw the line? Any
locale that begins with "C."? What if the language part is C but there
is some other part to the locale? What about lower case? Should all of
these apply the same way except with POSIX? What about backwards
compatibility?

If we pass it to the provider:

* ICU: Recent versions of ICU don't recognize C.UTF-8 at all, and if
  you try to open it, you'll get the root collator (with warning or
  error, which is not great for such a common locale name). ICU
  versions 63 and earlier recognize C.UTF-8 as en-US-u-va-posix
  (a.k.a. en_US_POSIX), which has some adjustments to match
  expectations of C sorting (e.g. upper case first).

* libc: problems as raised in this thread.

Regards,
	Jeff Davis