Jeff Davis wrote:

> What about ICU? How should provider=icu locale=C.UTF-8 behave? We
> could:
>
> a. Just pass it to the provider and see what happens (older versions of
> ICU would interpret it as en-US-u-va-posix; newer versions would give
> the root locale).
>
> b. Consistently interpret it as en-US-u-va-posix.
>
> c. Don't pass it to the provider at all and treat it with memcmp
> semantics.

I think b) and c) are quite problematic.

First, en-US-u-va-posix does not sort like C.UTF-8 in glibc.
For one thing, it seems that en-US-u-va-posix assigns zero weights to
some codepoints, which makes it semantically different.
For instance, consider ZERO WIDTH SPACE (U+200B):

postgres=# select 'ab' < E'a\u200Ba' COLLATE "C.utf8";
 ?column?
----------
 t

postgres=# select 'ab' < E'a\u200Ba' COLLATE "en-US-u-va-posix-x-icu";
 ?column?
----------
 f

Even if ICU folks refer to u-va-posix as approximating POSIX (as in [1]),
for our purpose, either it sorts by codepoints or it does not,
and it clearly does not.

One consequence is that en-US-u-va-posix-x-icu needs to be versioned,
and indexes depending on it need to be rebuilt on upgrades.
OTOH the goal with C.UTF-8, which is achieved in glibc>=2.35, is to not
need that.

Also, it's not just about sorting. The semantics for the ctype-kind
functions are also different. Consider matching '\d' in a regexp.
With C.UTF-8 (glibc-2.35), we only match ASCII characters 0-9, or 10
codepoints. With "en-US-u-va-posix-x-icu" we match 660 codepoints
comprising all the digit characters in all languages, plus a bunch of
variants for mathematical symbols.

For instance, consider U+FF10 (Fullwidth Digit Zero):

postgres=# select E'\uff10' collate "C.utf8" ~ '\d';
 ?column?
----------
 f

postgres=# select E'\uff10' collate "en-US-u-va-posix-x-icu" ~ '\d';
 ?column?
----------
 t

If someone dumps their C.UTF-8 database to reload into an
ICU/en-US-u-va-posix database, there is no guarantee that it even
reloads, because of semantic differences occurring in constraints.
In general it will surely reload, but the apps might not behave the
same with the new database, in a way that might be problematic.
It's fine if that's what they want and they explicitly ask for this
conversion, but it's not fine if it's postgres that has quietly decided
that for them.
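
The two differences shown above (ordering around U+200B, and what '\d' matches) can be reproduced outside PostgreSQL with a small Python sketch. This is only an analogy, not the PostgreSQL or ICU code path: Python's str comparison stands in for codepoint ordering, the helper strip_ignorable is a hypothetical stand-in for a collation that gives zero weight to U+200B, and re.ASCII versus the default Unicode mode stands in for the two '\d' semantics.

```python
import re

ZWSP = "\u200b"            # ZERO WIDTH SPACE
FULLWIDTH_ZERO = "\uff10"  # FULLWIDTH DIGIT ZERO


def strip_ignorable(s: str) -> str:
    """Hypothetical stand-in for a zero-weight rule: drop U+200B.

    This is NOT the ICU collation algorithm; it only shows why the
    comparison result flips once the character carries no weight.
    """
    return s.replace(ZWSP, "")


# Codepoint ordering: 'b' (U+0062) sorts before U+200B.
codepoint_result = "ab" < "a" + ZWSP + "a"

# With U+200B ignored, the comparison becomes 'ab' < 'aa'.
ignorable_result = strip_ignorable("ab") < strip_ignorable("a" + ZWSP + "a")

# ASCII-only digit class versus Unicode-aware digit class.
ascii_digit = re.fullmatch(r"\d", FULLWIDTH_ZERO, flags=re.ASCII) is not None
unicode_digit = re.fullmatch(r"\d", FULLWIDTH_ZERO) is not None

print(codepoint_result, ignorable_result)  # True False
print(ascii_digit, unicode_digit)          # False True
```

The printed pairs line up with the t/f results of the psql queries above, which is the point: the same input gives opposite answers depending on which semantics are silently chosen.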

About c) "don't pass it to the provider", it would be doable for sorting
(ignoring the "glibc before 2.35 does not sort like that" issue), but
not for the ctype-kind functions, where postgres' own code doesn't
have the Unicode knowledge.

About a) "just pass it to the provider", that seems better than b) or
c), but still, when a user asks for provider=icu locale=C.UTF-8, it's
very probably a pilot error.

To me the user would be best served by a warning, if not an error,
informing them that it's quite probably not the combination they want.


[1] https://sourceforge.net/p/icu/mailman/icu-support/thread/CAN49p6pvQKP93j8LMn3zBWhpk-T0qYD0TCuiHMv6Z3UPGFh3QQ%40mail.gmail.com/#msg35638356


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite