On 12/13/23 5:28 AM, Jeff Davis wrote:
> On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
>> My biggest concern is around maintenance. Every year Unicode is
>> assigning new characters to existing code points, and those existing
>> code points can of course already be stored in old databases before
>> libs are updated.
>
> Is the concern only about unassigned code points?
>
> I already committed a function "unicode_assigned()" to test whether a
> string contains only assigned code points, which can be used in a
> CHECK() constraint. I also posted[5] an idea about a per-database
> option that could reject the storage of any unassigned code point,
> which would make it easier for users highly concerned about
> compatibility.

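For anyone following along, here is a minimal sketch of the CHECK() usage described in the quoted text. The table and column names are made up for illustration, and it assumes a UTF8 database on a build that includes unicode_assigned(); I have not tied it to any particular patch version.

```sql
-- Hypothetical table: "notes" and "body" are illustrative names only.
-- Assumes server encoding UTF8 and a build that has unicode_assigned().
CREATE TABLE notes (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    body text CHECK (unicode_assigned(body))
);

-- U+0C8C (decimal 3212) is an assigned Kannada letter, so this row
-- should satisfy the constraint.
INSERT INTO notes (body) VALUES (chr(3212));

-- U+0C8D (decimal 3213) is unassigned as of Unicode 15.1, so this row
-- should be rejected by the CHECK constraint.
INSERT INTO notes (body) VALUES (chr(3213));
```

The constraint only guards new writes against the Unicode version compiled into the server at that time; it says nothing about rows stored before the constraint existed.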
I didn't know about this. Did a few smoke tests against today's head on
git and it's nice to see the function working as expected. :)

test=# select unicode_version();
 unicode_version
-----------------
 15.1

test=# select chr(3212),unicode_assigned(chr(3212));
 chr | unicode_assigned
-----+------------------
 ಌ   | t

-- unassigned code point inside assigned block
test=# select chr(3213),unicode_assigned(chr(3213));
 chr | unicode_assigned
-----+------------------
     | f

test=# select chr(3214),unicode_assigned(chr(3214));
 chr | unicode_assigned
-----+------------------
 ಎ   | t

-- unassigned block
test=# select chr(67024),unicode_assigned(chr(67024));
 chr | unicode_assigned
-----+------------------
     | f

test=# select chr(67072),unicode_assigned(chr(67072));
 chr | unicode_assigned
-----+------------------
 𐘀   | t

Looking closer, patches 3 and 4 look like an incremental extension of
this earlier idea; the perl scripts download data from unicode.org and
we've specifically defined Unicode version 15.1 and the scripts turn the
data tables inside-out into C data structures optimized for lookup. That
C code is then checked in to the PostgreSQL source code files
unicode_category.h and unicode_case_table.h - right?

Am I reading correctly that these two patches add C functions
pg_u_prop_* and pg_u_is* (patch 3) and unicode_*case (patch 4) but we
don't yet reference these functions anywhere? So this is just getting
some plumbing in place?

>> And we may end up with
>> something like the timezone database where we need to periodically
>> add a more current ruleset - albeit alongside as a new version in
>> this case.
>
> There's a build target "update-unicode" which is run to pull in new
> Unicode data files and parse them into static C arrays (we already do
> this for the Unicode normalization tables). So I agree that the tables
> should be updated but I don't understand why that's a problem.

I don't want to get stuck on this.
I agree with the general approach of beginning to add a provider for
locale functions inside the database. We have a while before Unicode 16
comes out. Plenty of time for bikeshedding.

My prediction is that updating this built-in provider eventually won't
be any different from ICU or glibc. It depends a bit on how we
specifically build on this plumbing - but when Unicode 16 comes out,
I'll try to come up with a simple repro on a default DB config where
changing the Unicode version causes corruption (it was pretty easy to
demonstrate for ICU collation, if you knew where to look)... but I
don't think that discussion should derail this commit, because for now
we're just starting the process of getting Unicode 15.1 into the
PostgreSQL code base. We can cross the "update" bridge when we come to
it.

Later on down the road, from a user perspective, I think we should be
careful about confusion where providers are used inconsistently. It's
not great if one function follows built-in Unicode 15.1 rules but
another function uses Unicode 13 rules because it happened to call an
ICU function or a glibc function. We could easily end up with multiple
providers processing different parts of a single SQL statement, which
could lead to strange results in some cases.

Ideally a user just specifies a default provider for their database,
and the rules for that version of Unicode are used as consistently as
possible - unless a user explicitly overrides their choice in a
table/column definition, query, etc. But it might take a little time
and work to get to this point.

-Jeremy

-- 
http://about.me/jeremy_schneider