Here's a very short patch to experiment with the idea of using
Windows' native UTF-8 support when possible, ie when using
"en-US.UTF-8" in a UTF-8 database.  Otherwise it continues to use the
special Windows-only wchar_t conversion that allows for locales whose
encoding doesn't match the database's, ie the reason you're allowed to
use "English_United States.1252" in a UTF-8 database on that OS,
something we wouldn't allow on Unix.

As I understand it, that mechanism dates from the pre-Windows 10 era
when it had no .UTF-8 locales but users wanted or needed to use UTF-8
databases.  I think some locales used encodings that we don't even
support as server encodings, eg SJIS in Japan, so that was a
workaround.  I assume you could use "ja-JP.UTF-8" these days.

CI tells me it compiles and passes, but I am not a Windows person, I'm
primarily interested in code cleanup and removing weird platform
differences.  I wonder if someone directly interested in Windows would
like to experiment with this and report whether (1) it works as
expected and (2) "en-US.UTF-8" loses performance compared to "en-US"
(which I guess uses WIN1252 encoding and triggers the conversion
path?), and similarly for other locale pairs you might be interested
in?

It's possible that strcoll_l() converts the whole string to wchar_t
internally anyway, in which case it might turn out to be marginally
slower.  We often have to copy the char strings up front ourselves in
the regular strcoll_l() path in order to null-terminate them,
something that is skipped in the wchar_t conversion path, which
combines widening with null-termination in one step.  Not sure if
that'd kill the idea, but it'd at least be nice to know if we might
eventually be able to drop the special code paths and strange
configuration possibilities compared to Unix, and use it in less
performance-critical paths.  At the very least, the comments are
wrong...
From ea19cd4953c45adb235af33e173933c0fb0f5730 Mon Sep 17 00:00:00 2001
From: Thomas Munro <[email protected]>
Date: Sat, 25 Oct 2025 18:04:06 +1300
Subject: [PATCH] Allow UTF-8 locales to use strcoll_l() on Windows.

On Windows we allow locales with non-UTF-8 encodings in UTF-8 databases
for historical reasons, and convert to wchar_t when collating strings.
Allow plain strcoll_l() to be reached instead of wcscoll_l() when the
locale uses UTF-8.

XXX Does this work as expected?

XXX How is the performance?  It might be converting to wchar_t
internally anyway, depending on whether it can work incrementally like
ICU.
---
 src/backend/utils/adt/pg_locale_libc.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/pg_locale_libc.c b/src/backend/utils/adt/pg_locale_libc.c
index f56b5dbdd37..a075ab26893 100644
--- a/src/backend/utils/adt/pg_locale_libc.c
+++ b/src/backend/utils/adt/pg_locale_libc.c
@@ -722,7 +722,8 @@ create_pg_locale_libc(Oid collid, MemoryContext context)
 	if (!result->collate_is_c)
 	{
 #ifdef WIN32
-		if (GetDatabaseEncoding() == PG_UTF8)
+		if (GetDatabaseEncoding() == PG_UTF8 &&
+			pg_get_encoding_from_locale(collate, true) != PG_UTF8)
 			result->collate = &collate_methods_libc_win32_utf8;
 		else
 #endif
@@ -975,8 +976,13 @@ get_collation_actual_version_libc(const char *collcollate)
 /*
  * strncoll_libc_win32_utf8
  *
- * Win32 does not have UTF-8. Convert UTF8 arguments to wide characters and
- * invoke wcscoll_l().
+ * Historical versions of Windows didn't have UTF-8 locales.  To support UTF-8
+ * databases, we allowed *any* locale to be used in UTF-8 databases (see
+ * check_locale_encoding()).  This function supports mismatched encodings by
+ * converting strings to wchar_t on the fly and calling wcscoll_l().
+ *
+ * This is not called for UTF-8 locales in UTF-8 databases, but is still needed
+ * as long as we tolerate mismatches.
  *
  * An input string length of -1 means that it's NUL-terminated.
  */
-- 
2.51.1
