Re: Unicode full case mapping: PG_UNICODE_FAST, and standard-compliant UCS_BASIC

Noah Misch Thu, 17 Apr 2025 14:59:17 -0700

On Fri, Jan 17, 2025 at 04:06:20PM -0800, Jeff Davis wrote:
> Committed 0001 and 0002.


> Upon reviewing the discussion threads, I removed the Unicode "adjust to
> Cased" behavior when titlecasing. As Peter pointed out[1], it doesn't
> match the documentation or expectations for INITCAP().

While commit d3d0983 changed most of the non-test pg_u_*() "bool posix"
arguments, it left a pg_u_isalnum(u, true) in initcap_wbnext(), a subroutine
of strtitle_builtin().  The above paragraph may or may not be saying that's
intentional.  Example of the consequence at non-ASCII decimal digits:

SELECT
        str,
        re,
        regexp_count(str COLLATE pg_c_utf8, re) AS count_c_utf8,
        regexp_count(str COLLATE pg_unicode_fast, re) AS count_unicode_fast,
        regexp_count(str COLLATE unicode, re) AS count_unicode,
        initcap(str COLLATE pg_c_utf8) AS initcap_c_utf8,
        initcap(str COLLATE pg_unicode_fast) AS initcap_unicode_fast,
        initcap(str COLLATE unicode) AS initcap_unicode
FROM
        (VALUES (U&'foo\0661bar baz')) AS str_t(str),
        (VALUES ('[[:digit:]]')) AS re_t(re)
ORDER BY 1, 2;

str                  │ foo١bar baz
re                   │ [[:digit:]]
count_c_utf8         │ 0
count_unicode_fast   │ 1
count_unicode        │ 1
initcap_c_utf8       │ Foo١Bar Baz
initcap_unicode_fast │ Foo١Bar Baz
initcap_unicode      │ Foo١bar Baz
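
The divergence in the initcap columns can be reproduced with a small sketch.
This is Python, not the C of the actual implementation, and it only models the
word-boundary decision.  The names is_alnum() and initcap_sketch() are
hypothetical, str.isalpha() merely approximates the Unicode Alphabetic
property, and plain upper() stands in for real titlecase mapping.

```python
import unicodedata


def is_alnum(ch, posix):
    """Hypothetical stand-in for an alnum test with a "bool posix" flag."""
    if ch.isalpha():  # approximation of the Unicode Alphabetic property
        return True
    if posix:
        # POSIX-compatible digits: ASCII 0-9 only
        return "0" <= ch <= "9"
    # Otherwise: any character in general category Nd (decimal number)
    return unicodedata.category(ch) == "Nd"


def initcap_sketch(s, posix):
    """Uppercase the first character of each alnum run, lowercase the rest."""
    out = []
    in_word = False
    for ch in s:
        if is_alnum(ch, posix):
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            # A non-alnum character ends the current word
            out.append(ch)
            in_word = False
    return "".join(out)


s = "foo\u0661bar baz"  # U+0661 ARABIC-INDIC DIGIT ONE
posix_result = initcap_sketch(s, posix=True)
unicode_result = initcap_sketch(s, posix=False)
```

With posix=True, U+0661 is not alnum, so it ends the word and "bar" is
capitalized, as in the pg_c_utf8 and pg_unicode_fast columns.  With
posix=False, the digit is part of the word and "bar" stays lowercase, as in
the "unicode" column.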

Should initcap_wbnext() pass in a locale-dependent "bool posix" argument like
the other calls the commit changed?  Related message from the development of
pg_c_utf8, which you shared downthread:
https://www.postgresql.org/message-id/610d7f1b-c68c-4eb8-a03d-1515da304c58%40manitou-mail.org


Long-term, pg_u_isword() should have a "bool posix" argument.  Currently, only
tests call that function.  If it got a non-test caller,
https://www.unicode.org/reports/tr18/#word would have pg_u_isword() follow the
choice of posix compatibility like pg_u_isalnum() does.
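
As a rough illustration of what that could mean, here is a sketch in Python
(not the PostgreSQL C code) of a word test that follows the UTS #18 "word"
definition: alphabetic, mark, digit, connector punctuation, or join control,
with only the digit component depending on the posix choice.  The names
is_digit() and is_word() and the "posix" parameter are hypothetical, and
str.isalpha() only approximates the Unicode Alphabetic property.

```python
import unicodedata

# Join_Control property: ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER
JOIN_CONTROL = {"\u200c", "\u200d"}


def is_digit(ch, posix):
    """ASCII 0-9 when posix is set, else general category Nd."""
    if posix:
        return "0" <= ch <= "9"
    return unicodedata.category(ch) == "Nd"


def is_word(ch, posix):
    """Hypothetical word test with a posix flag, per the UTS #18 components."""
    cat = unicodedata.category(ch)
    return (
        ch.isalpha()               # approximation of \p{alpha}
        or cat.startswith("M")     # \p{gc=Mark}
        or is_digit(ch, posix)     # \p{digit}, posix-dependent
        or cat == "Pc"             # \p{gc=Connector_Punctuation}
        or ch in JOIN_CONTROL      # \p{Join_Control}
    )
```

Under this sketch, U+0661 counts as a word character only when posix is
false, which is the same split pg_u_isalnum() shows in the initcap example.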
