On Sat, 2024-06-29 at 15:08 -0700, Noah Misch wrote:
> lower(), initcap(), upper(), and regexp_matches() are
> PROVOLATILE_IMMUTABLE.
> Until now, we've delegated that responsibility to the user. The user is
> supposed to somehow never update libc or ICU in a way that changes
> outcomes from these functions.

To me, "delegated" connotes a clear and organized transfer of
responsibility to the right person to solve it. In that sense, I
disagree that we've delegated it.

What's happened here is evolution of various choices that seemed
reasonable at the time. Unfortunately, the consequences are hard for us
to manage and even harder for users to manage themselves.

> Now that postgresql.org is taking that responsibility for builtin
> C.UTF-8, how should we govern it? I think the above text and [1]
> convey that we'll update the Unicode data between major versions,
> making functions like lower() effectively STABLE. Is that right?

Marking them STABLE is not a viable option; that would break a lot of
valid use cases, e.g. an index on LOWER().

Unicode already has its own governance, including a stability policy
that includes case mapping:

https://www.unicode.org/policies/stability_policy.html#Case_Pair

Granted, that policy does not guarantee that the results will never
change. In particular, the results can change if using unassigned code
points that are later assigned to Cased characters. That's not terribly
common though; for instance, there are zero changes in
uppercase/lowercase behavior between Unicode 14.0 (2021) and 15.1
(current) -- even for code points that were unassigned in 14.0 and later
assigned. I checked this by modifying case_test.c to look at unassigned
code points as well.

There's a greater chance that character properties can change (e.g.
whether a character is "alphabetic" or not) in new releases of Unicode.
Such properties can affect regex character classifications, and in some
cases the results of initcap (because it uses the "alphanumeric"
classification to determine word boundaries).

I don't think we need code changes for 17. Some documentation changes
might be helpful, though. Should we have a note around LOWER()/UPPER()
that users should REINDEX any dependent indexes when the provider is
updated?
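
To illustrate the unassigned-code-point case mentioned above, here is a
rough sketch in Python. It is not the case_test.c change, and it does
not compare 14.0 with 15.1: it assumes CPython's unicodedata module,
which only ships a Unicode 3.2.0 snapshot (ucd_3_2_0) alongside the
interpreter's current data. The helper name newly_cased_code_points is
made up for this example. It lists code points that were unassigned in
3.2.0 but have a non-trivial case mapping today:

```python
import unicodedata

# Assumption: CPython's bundled Unicode 3.2.0 snapshot stands in for
# "an older Unicode version"; str.lower()/str.upper() use the
# interpreter's current Unicode data.
old = unicodedata.ucd_3_2_0


def newly_cased_code_points():
    """Return code points that were unassigned (category Cn) in Unicode
    3.2.0 but whose lower() or upper() mapping is non-trivial now."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        if old.category(ch) != "Cn":
            continue  # already assigned (or surrogate etc.) in 3.2.0
        if ch.lower() != ch or ch.upper() != ch:
            found.append(cp)
    return found


changed = newly_cased_code_points()
print("current Unicode version:", unicodedata.unidata_version)
print("code points whose case behavior changed since 3.2.0:", len(changed))
```

Over a gap that long, the list is non-empty (U+1E9E, added in Unicode
5.1, is one example), which is the kind of change an index on LOWER()
would be exposed to if such code points were already stored.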

> (This thread had some discussion[2] that datcollversion/collversion
> won't necessarily change when a major versions changes lower()
> behavior.)

datcollversion/collversion track the version of the collation
specifically (text ordering only), not the ctype (character semantics).
When using the libc provider, get_collation_actual_version() completely
ignores the ctype.

It would be interesting to consider tracking the versions separately,
though.

Regards,
	Jeff Davis