Re: Built-in CTYPE provider

Jeff Davis Mon, 12 Feb 2024 18:01:44 -0800

On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote:
> Various comments are updated to include the term "character class". 
> I 
> don't recognize that as an official Unicode term.  There are
> categories 
> and properties.  Let's check this.


It's based on
https://www.unicode.org/reports/tr18/#Compatibility_Properties

so I suppose the right name is "properties".

> Is it potentially confusing that only some pg_u_prop_* have a posix
> variant?  Would it be better for a consistent interface to have a
> "posix" argument for each and just ignore it if not used?  Not sure.

I thought about it but didn't see a clear advantage one way or another.

> About this initdb --builtin-locale option and analogous options 
> elsewhere:  Maybe we should flip this around and provide a --libc-
> locale 
> option, and have all the other providers just use the --locale
> option. 
> This would be more consistent with the fact that it's libc that is 
> special in this context.

Would --libc-locale affect all the environment variables or just
LC_CTYPE/LC_COLLATE? How do we avoid breakage?

I like the general direction here but we might need to phase in the
option or come up with a new name. Suggestions welcome.

> Do we even need the "C" locale?  We have established that "C.UTF-8"
> is 
> useful, but if that is easily available, who would need "C"?

I don't think we should encourage its use generally but I also don't
think it will disappear any time soon. Some people will want it on
simplicity grounds. I hope fewer people will use "C" when we have a
better builtin option.

> Some changes in this patch appear to be just straight renamings, like
> in
> src/backend/utils/init/postinit.c and 
> src/bin/pg_upgrade/t/002_pg_upgrade.pl.  Maybe those should be put
> into 
> the previous patch instead.
> 
> On the collation naming: My expectation would have been that the 
> "C.UTF-8" locale would be exposed as the UCS_BASIC collation.

I'd like that. We have to sort out a couple things first, though:

1. The SQL spec mentions the capitalization of "ß" as "SS"
specifically. Should UCS_BASIC use the unconditional mappings in
SpecialCasing.txt? I already have some code to do that (not posted
yet).

2. Should UCS_BASIC use the "POSIX" or "Standard" variant of regex
properties? (The main difference seems to be whether symbols get
treated as punctuation or not.)

3. What do we do about potential breakage for existing users of
UCS_BASIC who might be expecting C-like behavior?

Regards,
        Jeff Davis

Re: Built-in CTYPE provider

Reply via email to