Re: Update Unicode data to Unicode 16.0.0

Jeff Davis Wed, 22 Jan 2025 14:13:05 -0800

On Wed, 2025-01-22 at 19:03 +0100, Peter Eisentraut wrote:
> Building a collation provider on this came much later.  It was
> possibly a mistake how that was done.


It wasn't a mistake. "Stability within a PG major version" was called a
*benefit* near the top of the first email on the subject[1]. It was
considered a benefit because it offered a level of stability that
neither libc nor ICU could offer. As far as I know, it's still
considered to be a benefit today by more people than not (e.g. [2]).

The concerns about Unicode updates come from a misunderstanding of the
level of stability offered in the past:

* IMMUTABLE was initially a planner concept[3], which is why it didn't
care much about dependence on GUCs for instance.

* Expression / predicate indexes rely on immutability to mean something
more strict, and for that, dependence on GUCs creates a problem[4].
(Also, partitioning.)

* It's hard to make an immutable UDF without a SET search_path clause,
but until version 17, that was such a huge performance hit that it was
not usable in an expression index. There will be a lot of not-truly-
immutable UDFs used in expression indexes for a long time.

* Ordinary text indexes rely on the collation libraries to be stable,
which is hard to control because they could be updated by the OS. Only
recently has it become barely possible to freeze the version of libc[5]
without freezing the whole OS version. And if you do manage to freeze
both libc and ICU, you risk missing security fixes.

* pg_upgrade implicitly relies on IMMUTABLE to mean something even more
strict: stability across major versions. That's a problem for
expression indexes on functions like NORMALIZE(). And, if using the
optional built-in provider, also a problem for expression indexes on
LOWER(), etc.
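
As a minimal sketch of the last point, these are the kinds of expression
indexes whose stored values depend on the Unicode data compiled into the
server. It assumes PostgreSQL 17 or later with a UTF8 database; the table
and index names are hypothetical:

```sql
-- Hypothetical table; "pg_c_utf8" is a collation of the built-in provider.
CREATE TABLE users (name text COLLATE "pg_c_utf8");

-- With the built-in provider, lower() uses the server's own Unicode
-- case-mapping tables, so the indexed values depend on the Unicode version.
CREATE INDEX users_lower_idx ON users (lower(name));

-- normalize() depends on the server's Unicode normalization tables
-- regardless of which collation provider is in use.
CREATE INDEX users_nfc_idx ON users (normalize(name, NFC));

-- Reports the Unicode version this server was built with.
SELECT unicode_version();
```

Comparing the unicode_version() output before and after a major-version
upgrade is one way to tell whether indexes like these may need attention.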

At each moment we took steps that made sense at the time and in context
and I am not criticizing any of those steps. The biggest practical
problem was unforeseen dramatic changes in glibc that broke a lot of
text indexes. The rest of the problems are a mix of design issues,
feature interactions, and implementation details that were not resolved
before the built-in provider existed and are still not resolved today.

I do not accept the premise that there is a problem with the built-in
provider. I didn't throw caution to the wind and neither did the
reviewers: you, Daniel, Jeremy, and I did a ton of work to understand,
mitigate, and document the risks (along with a lot of help from
Thomas's earlier work). Users who opt in to the built-in provider opt
in to occasional controlled changes according to the rather strict
Unicode stability policies[6]. These policies mitigate risks
dramatically, especially for those using only assigned code points,
which can be checked with the SQL function unicode_assigned().
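
A minimal sketch of such a check, assuming PostgreSQL 17 or later with a
UTF8 server encoding; the table name is hypothetical:

```sql
-- Returns true only if every code point in the string is assigned in the
-- Unicode version the server was built with.
SELECT unicode_assigned('abc');

-- Hypothetical table that rejects text containing unassigned code points,
-- so that stored values stay within the range the stability policies cover.
CREATE TABLE docs (
    body text CHECK (unicode_assigned(body))
);
```

Enforcing the check at write time is one possible design; the same function
could instead be run as an audit query over existing data.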

Regards,
        Jeff Davis

[1] 
https://www.postgresql.org/message-id/ff4c2f2f9c8fc7ca27c1c24ae37ecaeaeaff6b53.ca...@j-davis.com

[2]
https://www.postgresql.org/message-id/3729436.1721322211%40sss.pgh.pa.us

[3]
https://www.postgresql.org/message-id/3428810.1721160969%40sss.pgh.pa.us

[4]

   CREATE TABLE t(f float4);
   CREATE UNIQUE INDEX t_idx ON t((f::text));
   SET extra_float_digits = 0;
   INSERT INTO t VALUES (1.23456789);
   INSERT INTO t VALUES (1.23456789); -- error
   SET extra_float_digits = 1;
   INSERT INTO t VALUES (1.23456789); -- success

[5] https://github.com/awslabs/compat-collation-for-glibc

[6] https://www.unicode.org/policies/stability_policy.html
