On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
> FWIW, the idea that we're going to develop a built-in provider seems
> to be solid, for the reasons Jeff mentions: it can be stable, and
> under our control. But it seems like we might need built-in providers
> for everything rather than just CTYPE to get those advantages, and I
> fear we'll get sucked into needing a lot of tailoring rather than
> just being able to get by with one "vanilla" implementation.

For the database default collation, I suspect a lot of users would jump
at the chance to have "vanilla" semantics. Tailoring is more important
for individual collation objects than for the database-level collation.

There are reasons you might select a tailored database collation, like
if most of the users accessing it are from a single locale, or if the
application connected to the database is expecting it in a certain
form.

But there are a lot of users for whom neither of those things is true,
and it makes zero sense to order all of the text indexes in the
database according to any one particular locale. I think these users
would prioritize stability and performance for the database collation,
and then use COLLATE clauses with ICU collations where necessary.

The question for me is how good the "vanilla" semantics need to be to
be useful as a database-level collation. Most of the performance and
stability problems come from collation, so it makes sense to me to
provide a fast and stable memcmp collation paired with richer ctype
semantics (as proposed here). Users who want something more probably
want the Unicode "root" collation, which can be provided by ICU today.

I am also still concerned that we have the wrong defaults. Almost
nobody thinks libc is a great provider, but that's the default, and
there were problems trying to change that default to ICU in
PostgreSQL 16. If we had a builtin provider, that might be a better
basis for a default (safe, fast, always available, and documentable).
Then, at least if someone picks a different locale at initdb time, they
would be doing so intentionally, rather than implicitly accepting index
corruption risks based on an environment variable.

Regards,
	Jeff Davis
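
To make the "memcmp collation paired with richer ctype semantics" split
concrete, here is a minimal Python sketch. It is only an illustration,
not PostgreSQL code: the word list and the helper name memcmp_key are
made up for the example, and Python's str.upper() merely stands in for
Unicode-aware ctype behavior.

```python
# Illustration only: hypothetical data and helper, not PostgreSQL code.
words = ["zebra", "Apple", "éclair", "apple", "Zebra"]


def memcmp_key(s: str) -> bytes:
    # Comparing the UTF-8 bytes mimics a memcmp()-style ordering.
    # The result depends only on the encoded bytes, never on a locale.
    return s.encode("utf-8")


# Collation: plain byte order, stable across OS and library upgrades.
memcmp_sorted = sorted(words, key=memcmp_key)

# Ctype: case mapping still understands non-ASCII characters.
uppercased = [w.upper() for w in words]

print(memcmp_sorted)  # ['Apple', 'Zebra', 'apple', 'zebra', 'éclair']
print(uppercased)     # ['ZEBRA', 'APPLE', 'ÉCLAIR', 'APPLE', 'ZEBRA']
```

The byte-order result is not linguistically pleasing (uppercase sorts
before lowercase, and accented letters sort last), which is where a
COLLATE clause with an ICU collation would come in for the queries that
need it.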