Re: Built-in CTYPE provider

Daniel Verite Wed, 20 Dec 2023 04:49:44 -0800

Jeff Davis wrote:

> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database collation,
> and then use COLLATE clauses with ICU collations where necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing so
> intentionally, rather than implicitly accepting index corruption risks
> based on an environment variable.

Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step toward better defaults
for apps like Postgres that need both long-term stability
in sorts and Unicode coverage for ctype-dependent functions.

But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.

ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.
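
To illustrate the asymmetry, here is a minimal sketch, assuming a
PostgreSQL 16 server built with ICU on a system where the libc locale
name en_US.utf8 exists. The collation names are made up for the example.

```sql
-- libc provider: sort order and ctype can be chosen separately,
-- but the ctype behavior then depends on the OS's libc.
CREATE COLLATION bytewise_sort_libc_ctype
  (provider = libc, lc_collate = 'C', lc_ctype = 'en_US.utf8');

-- ICU provider: a single locale drives both sorting and ctype,
-- so there is no way to request "C"-style bytewise ordering here.
CREATE COLLATION icu_root_example
  (provider = icu, locale = 'und');
```

The first form gives bytewise sorting but not cross-OS ctype behavior;
the second gives cross-OS behavior but not bytewise sorting.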

If the Postgres default were bytewise sorting plus locale-agnostic
ctype functions derived directly from the Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer give the human-favored sort.
For the presentation layer, we would have to write, for instance,
 ORDER BY textcol COLLATE "unicode" for the root collation,
or name a specific language/country collation if needed.
But everything else seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, and alignment between Unicode
updates and Postgres updates.
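
A rough sketch of what that would look like in practice, assuming
PostgreSQL 16 with ICU support (where the predefined "unicode"
collation is available). The table and its contents are hypothetical.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE words (textcol text);
INSERT INTO words VALUES ('apple'), ('Banana'), ('cherry'), ('Éclair');

-- Bytewise order: stable and fast, but uppercase ASCII letters
-- sort before lowercase ones, which is not the human-favored order.
SELECT textcol FROM words ORDER BY textcol COLLATE "C";

-- Presentation layer: request linguistic order explicitly
-- with the root collation.
SELECT textcol FROM words ORDER BY textcol COLLATE "unicode";
```

Indexes built under the bytewise default would not depend on libc or
ICU versions; only queries that ask for a linguistic collation would.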


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
