On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> I think the code for (2) and for "I/i in Turkish" haven't returned. Given
> commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain lower(),
> the regex code likely needs a similar restoration. If not, the regex comments
> would need to change to match the code.

Great find, thank you! I'm curious how you came across this difference:
was it through testing or code inspection?

Patch attached. I also updated the top of the comment so that it's clear
that it refers to the libc provider specifically, and that ICU still has
an issue with non-UTF8 encodings.

Also, the force-to-ASCII-behavior special case is different for
pg_wc_tolower()/pg_wc_toupper() vs. LOWER()/UPPER(): the former depends
only on whether it's the default locale, whereas the latter depends on
whether it's the default locale and the encoding is single-byte.

Therefore the results in the tr_TR.UTF-8 locale for the libc provider are
inconsistent:

  => select 'i' ~* 'I', 'I' ~* 'i', lower('I') = 'i', upper('i') = 'I';
   ?column? | ?column? | ?column? | ?column?
  ----------+----------+----------+----------
   t        | t        | f        | f

That behavior goes back a long way, so I'm not suggesting that we change
it.

Regards,
	Jeff Davis

From e8a68f42f5802d138ba04043b25b7d42862be29d Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Mon, 14 Apr 2025 11:34:11 -0700
Subject: [PATCH v1] Another unintentional behavior change in commit
 e9931bfb75.

Reported-by: Noah Misch <n...@leadboat.com>
Discussion: https://postgr.es/m/20250412123430.8c.nmi...@google.com
---
 src/backend/regex/regc_pg_locale.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index ed7411df83d..41b993ad773 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -21,9 +21,10 @@
 #include "utils/pg_locale.h"
 
 /*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
  *
  * 1. In C/POSIX collations, we use hard-wired code. We can't depend on
  * the <ctype.h> functions since those will obey LC_CTYPE. Note that these
@@ -33,8 +34,9 @@
  *
  * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
  * This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode. On some platforms
- * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
+ * as the wchar_t representation of Unicode. (XXX: This could be a problem
+ * for ICU in non-UTF8 encodings.) On some platforms wchar_t is only 16 bits
+ * wide, so we have to punt for codepoints > 0xFFFF.
  *
  * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
  * values up to 255, and punt for values above that.
@@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_uppercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
@@ -590,10 +598,16 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_lowercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
-- 
2.34.1
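
For anyone who wants to look at the ASCII special case in isolation, here
is a minimal standalone sketch of the check the patch adds. It is not the
PostgreSQL code: the sketch_* names and the is_default parameter are
hypothetical stand-ins for pg_ascii_tolower()/pg_ascii_toupper(),
pg_wc_tolower()/pg_wc_toupper() and pg_regex_locale->is_default, and it
falls back to the process-global towlower()/towupper() instead of the
_l variants with a locale handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <wctype.h>

/* Hypothetical stand-in for pg_ascii_tolower(): ASCII only, ignores locale. */
unsigned char
sketch_ascii_tolower(unsigned char ch)
{
	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	return ch;
}

/* Hypothetical stand-in for pg_ascii_toupper(): ASCII only, ignores locale. */
unsigned char
sketch_ascii_toupper(unsigned char ch)
{
	if (ch >= 'a' && ch <= 'z')
		ch -= 'a' - 'A';
	return ch;
}

/*
 * Sketch of the patched lowercase path: for the default locale, ASCII
 * characters get C behavior; everything else goes to the locale-aware
 * function (here the global-locale towlower(), an assumption made to keep
 * the example self-contained).
 */
unsigned int
sketch_wc_tolower(unsigned int c, bool is_default)
{
	if (is_default && c <= 127)
		return sketch_ascii_tolower((unsigned char) c);
	return (unsigned int) towlower((wint_t) c);
}

/* Same shape for the uppercase path. */
unsigned int
sketch_wc_toupper(unsigned int c, bool is_default)
{
	if (is_default && c <= 127)
		return sketch_ascii_toupper((unsigned char) c);
	return (unsigned int) towupper((wint_t) c);
}

/* With is_default set, 'I' and 'i' fold to each other in any locale. */
void
sketch_self_check(void)
{
	assert(sketch_wc_tolower('I', true) == 'i');
	assert(sketch_wc_toupper('i', true) == 'I');
	assert(sketch_wc_tolower('7', true) == '7');
}
```

With is_default true, 'I' and 'i' fold to each other before the locale is
ever consulted, which would explain why both ~* comparisons in the
tr_TR.UTF-8 example return t while lower()/upper() do not.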