On 12/5/23 3:46 PM, Jeff Davis wrote:
> CTYPE, which handles character classification and upper/lowercasing
> behavior, may be simpler than it first appears. We may be able to get
> a net decrease in complexity by just building in most (or perhaps all)
> of the functionality.
>
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
> class. Unicode defines what character properties map into these
> classes in TR #18 [1], specifying both a "Standard" variant and a
> "POSIX Compatible" variant. The main difference with the POSIX variant
> is that symbols count as punctuation.
>
> === LOWER()/INITCAP()/UPPER() ===
>
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up to
> Unicode. If the input string is normalized, the output string must be
> normalized, too. Weirdly, there's no room in the SQL spec to localize
> LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
> specifies one example, which is that "ß" becomes "SS" when folded to
> upper case. INITCAP() is not in the SQL spec.
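As a quick aside, the Unicode behaviors described in the quoted text are easy to poke at with Python's built-in Unicode character database. This is just a sketch of what the Unicode tables say, not of PostgreSQL's code or of any particular CTYPE provider:

```python
# Illustrating the Unicode behaviors discussed above using Python's
# built-in Unicode tables (not PostgreSQL code).
import unicodedata

# Full case mapping: "ß" (U+00DF) uppercases to the two-character "SS",
# the one example the SQL spec calls out.
assert "ß".upper() == "SS"

# Character classification rests on Unicode General Category
# assignments; e.g. "Nd" (decimal digit) backs [[:digit:]] / \d,
# for ASCII and non-ASCII digits alike.
assert unicodedata.category("5") == "Nd"
assert unicodedata.category("٥") == "Nd"  # ARABIC-INDIC DIGIT FIVE

# TR #18's "POSIX Compatible" variant folds symbols (categories S*)
# into punctuation; the "Standard" variant keeps them apart from P*.
assert unicodedata.category("+").startswith("S")  # Sm: math symbol
assert unicodedata.category(",").startswith("P")  # Po: punctuation

print("all assertions passed")
```

Note in particular that the ß → SS mapping changes the string length, so any provider doing simple one-to-one codepoint mapping gets this case wrong.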
I'll be honest: even though this is primarily about CTYPE and not collation, I still need to keep re-reading the initial email slowly to let it sink in and understand it better... at least for me, it's complex to reason through. 🙂

I'm trying to make sure I understand clearly what user-visible impact/change we're talking about. After a bit of brainstorming and looking through the PG docs, I'm not seeing much more than the two things you've mentioned here: the set of regexp_* functions PG provides, and these three generic functions. That alone doesn't seem highly concerning.

I haven't checked the source code for the regexp_* functions yet, but are they just passing through to an external library? Are we actually able to easily change the CTYPE provider for them? If nobody knows/replies, then I'll find some time to look.

One other thing that comes to mind: how does the parser do case folding for relation names? Is that using OS-provided libc as of today, or did we code it to use ICU if that's the DB default? I'm guessing libc, and global catalogs probably need to be handled in a consistent manner, even across different encodings.

(Kind of related... did you ever see the demo where I create a user named '🏃' and then try to connect to a database with a non-Unicode encoding? 💥😜 At least it seems to be able to walk the index without decoding strings to find other users, but the way these global catalogs work scares me a little.)

-Jeremy

--
http://about.me/jeremy_schneider