On Wed, 2023-04-19 at 07:48 +1200, Thomas Munro wrote:
> Many OSes have a locale with this name.  I don't know this history,
> who did it first etc, but now I am wondering if they all took the
> "obvious" interpretation, that it should be code-point based,
> extrapolating from "C" (really memcmp order):

memcmp() is not the same as code-point order in all encodings, right?

I've been thinking that we should have a "provider=none" for the
special cases that use memcmp(). It's not using libc as a collation
provider; it's really postgres in control of the semantics. That would
clean up the documentation and the code a bit, and make it clearer
which locales are being passed to the provider and which ones aren't.

If we are passing it to a provider (e.g. "C.UTF-8"), we shouldn't make
unnecessary assumptions about what the provider will do with it.

For what it's worth, in my recent ICU language tag work, I
special-cased ICU locales with language "C" or "POSIX" to map to
"en-US-u-va-posix", disregarding everything else (collation attributes,
etc.). I believe that's the right thing based on the behavior I
observed: for the POSIX variant of en-US, ICU seems to disregard other
things such as case insensitivity. But it still ultimately goes to the
provider and ICU has particular rules for that locale -- I don't assume
memcmp-like semantics or code point order.

Regards,
	Jeff Davis
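
To illustrate the memcmp() versus code-point question, here is a rough sketch in Python (not PostgreSQL code; the helper names bytewise_order and code_point_order are made up for this example). It sorts the same three Cyrillic letters by their encoded bytes, which approximates what a memcmp()-based comparison sees, and by Unicode code point. In UTF-8 the two orders agree; in KOI8-R they do not.

```python
# Illustration only: hypothetical helpers, not taken from PostgreSQL.

def bytewise_order(strings, encoding):
    """Sort strings by their encoded bytes, approximating memcmp() order.

    Python compares bytes objects lexicographically as unsigned bytes,
    with a shorter prefix sorting first.
    """
    return sorted(strings, key=lambda s: s.encode(encoding))


def code_point_order(strings):
    """Sort strings by Unicode code point (Python's native str ordering)."""
    return sorted(strings)


letters = ["я", "а", "ю"]  # U+044F, U+0430, U+044E

by_code_point = code_point_order(letters)
by_utf8_bytes = bytewise_order(letters, "utf-8")
by_koi8r_bytes = bytewise_order(letters, "koi8_r")

print("code point:", by_code_point)
print("UTF-8 bytes:", by_utf8_bytes)
print("KOI8-R bytes:", by_koi8r_bytes)
```

In KOI8-R the letter "ю" is byte 0xC0 and "а" is byte 0xC1, so a bytewise comparison puts "ю" first even though its code point is higher. That is the kind of case where "C" meaning "memcmp order" and "C" meaning "code-point order" give different answers.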