On 12/5/23 3:46 PM, Jeff Davis wrote:
> CTYPE, which handles character classification and upper/lowercasing
> behavior, may be simpler than it first appears. We may be able to get
> a net decrease in complexity by just building in most (or perhaps all)
> of the functionality.
>
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
> class. Unicode defines what character properties map into these
> classes in TR #18 [1], specifying both a "Standard" variant and a
> "POSIX Compatible" variant. The main difference with the POSIX variant
> is that symbols count as punctuation.
>
> === LOWER()/INITCAP()/UPPER() ===
>
> The LOWER() and UPPER() functions are defined in the SQL spec with
> surprising detail, relying on specific Unicode General Category
> assignments. How to map characters seems to be left (implicitly) up to
> Unicode. If the input string is normalized, the output string must be
> normalized, too. Weirdly, there's no room in the SQL spec to localize
> LOWER()/UPPER() at all to handle issues like [1]. Also, the standard
> specifies one example, which is that "ß" becomes "SS" when folded to
> upper case. INITCAP() is not in the SQL spec.
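As a quick aside, the Unicode behaviors described in the quoted text are easy to poke at with Python's built-in Unicode character database. This is just a sketch of what the Unicode tables say, not of PostgreSQL's code or of any particular CTYPE provider:

```python
# Illustrating the Unicode behaviors discussed above using Python's
# built-in Unicode tables (not PostgreSQL code).
import unicodedata

# Full case mapping: "ß" (U+00DF) uppercases to the two-character "SS",
# the one example the SQL spec calls out.
assert "ß".upper() == "SS"

# Character classification rests on Unicode General Category
# assignments; e.g. "Nd" (decimal digit) backs [[:digit:]] / \d,
# for ASCII and non-ASCII digits alike.
assert unicodedata.category("5") == "Nd"
assert unicodedata.category("٥") == "Nd"  # ARABIC-INDIC DIGIT FIVE

# TR #18's "POSIX Compatible" variant folds symbols (categories S*)
# into punctuation; the "Standard" variant keeps them apart from P*.
assert unicodedata.category("+").startswith("S")  # Sm: math symbol
assert unicodedata.category(",").startswith("P")  # Po: punctuation

print("all assertions passed")
```

Note in particular that the ß → SS mapping changes the string length, so any provider doing simple one-to-one codepoint mapping gets this case wrong.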
I'll be honest: even though this is primarily about CTYPE and not collation, I still need to keep re-reading the initial email slowly to let it sink in and understand it better... at least for me, it's complex to reason through. 🙂

I'm trying to make sure I understand clearly what user-visible impact/change we're talking about. After a bit of brainstorming and looking through the PG docs, I'm not seeing much more than the two things you've mentioned here: the set of regexp_* functions PG provides, and these three generic functions. That alone doesn't seem highly concerning.

I haven't checked the source code for the regexp_* functions yet, but are they just passing through to an external library? Are we actually able to easily change the CTYPE provider for them? If nobody knows/replies, then I'll find some time to look.

One other thing that comes to mind: how does the parser do case folding for relation names? Is that using OS-provided libc as of today, or did we code it to use ICU if that's the DB default? I'm guessing libc, and global catalogs probably need to be handled in a consistent manner, even across different encodings.

(Kind of related... did you ever see the demo where I create a user named '🏃' and then try to connect to a database with a non-Unicode encoding? 💥😜 At least it seems to be able to walk the index without decoding strings to find other users, but the way these global catalogs work scares me a little.)

-Jeremy

--
http://about.me/jeremy_schneider