Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the database collation,
> and then use COLLATE clauses with ICU collations where necessary.

+1

> I am also still concerned that we have the wrong defaults. Almost
> nobody thinks libc is a great provider, but that's the default, and
> there were problems trying to change that default to ICU in 16. If we
> had a builtin provider, that might be a better basis for a default
> (safe, fast, always available, and documentable). Then, at least if
> someone picks a different locale at initdb time, they would be doing so
> intentionally, rather than implicitly accepting index corruption risks
> based on an environment variable.

Yes. The introduction of the bytewise-sorting, locale-agnostic
C.UTF-8 in glibc is also a step in the direction of providing
better defaults for apps like Postgres, that need both
long-term stability in sorts and Unicode coverage for
ctype-dependent functions.

But C.UTF-8 is not available everywhere, and there's still the
problem that Unicode updates through libc are not aligned
with Postgres releases.

ICU has the advantage of cross-OS compatibility,
but it does not provide any collation with bytewise sorting
like C or C.UTF-8, and we don't allow a combination like
"C" for sorting and ICU for ctype operations. When opting
for a locale provider, it has to be for both sorting
and ctype, so an installation that needs cross-OS
compatibility, good Unicode support and long-term stability
of indexes cannot get that with ICU as we expose it
today.

If the Postgres default was bytewise sorting+locale-agnostic
ctype functions directly derived from Unicode data files,
as opposed to libc/$LANG at initdb time, the main
annoyance would be that "ORDER BY textcol" would no
longer be the human-favored sort.
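
To make the "no longer the human-favored sort" point concrete, here is a
minimal sketch of bytewise ordering. It is written in Python rather than SQL
so that it runs without a server. The word list and the helper name
bytewise_sort are made up for illustration. It only mimics the idea of
comparing the UTF-8 encoded bytes, and is not a description of how any
Postgres provider is implemented.

```python
# Illustration only: the sample words and the helper name are hypothetical.
words = ["zebra", "Apple", "éclair", "apple", "Zebra", "eclair"]


def bytewise_sort(values):
    """Sort strings by their UTF-8 encoded bytes, ignoring any locale."""
    return sorted(values, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    # Uppercase ASCII sorts before lowercase ASCII, and accented
    # letters (multi-byte in UTF-8) sort after every ASCII letter.
    print(bytewise_sort(words))
```

With this ordering "Zebra" comes before "apple" and "éclair" comes after
"zebra". That order depends only on the encoded bytes, not on a locale
definition, but it is not what a reader expects in a report.
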
For the presentation layer, we would have to write for instance
ORDER BY textcol COLLATE "unicode" for the root collation,
or a specific region-country if needed.
But all the rest seems better, especially cross-OS compatibility,
truly immutable and faster indexes for fields that
don't require linguistic ordering, and alignment between Unicode
updates and Postgres updates.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite