On Sat, 2025-04-12 at 05:34 -0700, Noah Misch wrote:
> I think the code for (2) and for "I/i in Turkish" haven't returned. 
> Given
> commit e3fa2b0 restored the v17 "I/i in Turkish" treatment for plain
> lower(),
> the regex code likely needs a similar restoration.  If not, the regex
> comments
> would need to change to match the code.

Great find, thank you! I'm curious how you came about this difference,
was it through testing or code inspection?

Patch attached. I also updated the top of the comment so that it's
clear that it's referring to the libc provider specifically, and that
ICU still has an issue with non-UTF8 encodings.

Also, the force-to-ASCII-behavior special case is different for
pg_wc_tolower/uppper vs LOWER()/UPPER: the former depends only on
whether it's the default locale, whereas the latter depends on whether
it's the default locale and the encoding is single-byte. Therefore the
results in the tr_TR.UTF-8 locale for the libc provider are
inconsistent:

  => select 'i' ~* 'I', 'I' ~* 'i', lower('I') = 'i', upper('i') = 'I';
   ?column? | ?column? | ?column? | ?column? 
  ----------+----------+----------+----------
   t        | t        | f        | f

That behavior goes back a long way, so I'm not suggesting that we
change it.

Regards,
        Jeff Davis

From e8a68f42f5802d138ba04043b25b7d42862be29d Mon Sep 17 00:00:00 2001
From: Jeff Davis <j...@j-davis.com>
Date: Mon, 14 Apr 2025 11:34:11 -0700
Subject: [PATCH v1] Another unintentional behavior change in commit
 e9931bfb75.

Reported-by: Noah Misch <n...@leadboat.com>
Discussion: https://postgr.es/m/20250412123430.8c.nmi...@google.com
---
 src/backend/regex/regc_pg_locale.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index ed7411df83d..41b993ad773 100644
--- a/src/backend/regex/regc_pg_locale.c
+++ b/src/backend/regex/regc_pg_locale.c
@@ -21,9 +21,10 @@
 #include "utils/pg_locale.h"
 
 /*
- * To provide as much functionality as possible on a variety of platforms,
- * without going so far as to implement everything from scratch, we use
- * several implementation strategies depending on the situation:
+ * For the libc provider, to provide as much functionality as possible on a
+ * variety of platforms without going so far as to implement everything from
+ * scratch, we use several implementation strategies depending on the
+ * situation:
  *
  * 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
  * the <ctype.h> functions since those will obey LC_CTYPE.  Note that these
@@ -33,8 +34,9 @@
  *
  * 2a. When working in UTF8 encoding, we use the <wctype.h> functions.
  * This assumes that every platform uses Unicode codepoints directly
- * as the wchar_t representation of Unicode.  On some platforms
- * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
+ * as the wchar_t representation of Unicode.  (XXX: This could be a problem
+ * for ICU in non-UTF8 encodings.)  On some platforms wchar_t is only 16 bits
+ * wide, so we have to punt for codepoints > 0xFFFF.
  *
  * 2b. In all other encodings, we use the <ctype.h> functions for pg_wchar
  * values up to 255, and punt for values above that.  This is 100% correct
@@ -562,10 +564,16 @@ pg_wc_toupper(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_uppercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towupper_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_toupper((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return toupper_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
@@ -590,10 +598,16 @@ pg_wc_tolower(pg_wchar c)
 		case PG_REGEX_STRATEGY_BUILTIN:
 			return unicode_lowercase_simple(c);
 		case PG_REGEX_STRATEGY_LIBC_WIDE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (sizeof(wchar_t) >= 4 || c <= (pg_wchar) 0xFFFF)
 				return towlower_l((wint_t) c, pg_regex_locale->info.lt);
 			/* FALL THRU */
 		case PG_REGEX_STRATEGY_LIBC_1BYTE:
+			/* force C behavior for ASCII characters, per comments above */
+			if (pg_regex_locale->is_default && c <= (pg_wchar) 127)
+				return pg_ascii_tolower((unsigned char) c);
 			if (c <= (pg_wchar) UCHAR_MAX)
 				return tolower_l((unsigned char) c, pg_regex_locale->info.lt);
 			return c;
-- 
2.34.1

Reply via email to