On 22.07.24 19:55, Robert Haas wrote:
> Every other piece of software in the world has to deal with changes as
> a result of the addition of new code points, and probably less
> commonly, revisions to existing code points. Presumably, their stuff
> breaks too, from time to time. I mean, I find it a bit difficult to
> believe that web browsers or messaging applications on phones only
> ever display emoji, and never try to do any sort of string sorting.

The sorting isn't the problem. We have a versioning mechanism for collations. What we do with the version information is clearly not perfect yet, but the mechanism exists, and you can hack together queries that answer the question "did anything change here that would affect my indexes?". And you could build more tooling around that and so on.

The problem being considered here is updates to Unicode itself, as distinct from the collation tables. A Unicode update can impact at least two things:

- Code points that were previously unassigned are now assigned. That's obviously a very common thing with every Unicode update. The new character will have new properties attached to it, so the result of various functions that use such properties (upper(), lower(), normalize(), etc.) could change, because previously the code point had no properties, and so those functions would not do anything interesting with the character.

- Certain properties of an existing character can change. Like, a character used to be a letter and now it's a digit. (This is an example; I'm not sure if that particular change would be allowed.) In the extreme case, this could have the same impact as the above, but in practice the kinds of changes that are allowed wouldn't affect typical indexes.
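
To make the first case concrete, here is a rough sketch in Python, using its unicodedata module as a stand-in for the PostgreSQL functions mentioned above. The helper names first_unassigned() and describe() are made up for illustration, and the results depend on the Unicode version bundled with the interpreter, which is exactly the point.

```python
import unicodedata


def first_unassigned(start=0x0370, stop=0x0400):
    """Return the first code point in [start, stop) that this Python's
    Unicode tables treat as unassigned (general category 'Cn'), or None."""
    for cp in range(start, stop):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    return None


def describe(cp):
    """Collect a few property-dependent results for one code point."""
    ch = chr(cp)
    return {
        "category": unicodedata.category(ch),
        "upper_changes": ch.upper() != ch,
        "lower_changes": ch.lower() != ch,
        "nfkc_changes": unicodedata.normalize("NFKC", ch) != ch,
    }


if __name__ == "__main__":
    # The Unicode version these answers are based on.
    print("Unicode version:", unicodedata.unidata_version)

    # An unassigned code point has no properties, so nothing happens to it.
    cp = first_unassigned()
    if cp is not None:
        print(f"U+{cp:04X}", describe(cp))

    # An assigned letter (U+00E9) for comparison: upper() changes it.
    print("U+00E9", describe(0x00E9))
```

If a later Unicode version assigns that code point, for example as a lowercase letter, the same calls can start returning different answers for data that is already stored.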

I don't think this has anything in particular to do with the new builtin collation provider. That is just one new consumer of this.

