I commented on bug-gettext that these fixes work for me when
environment variables LANG, LC_*, or LANGUAGE aren't set. Thanks again!

Commit 00211fc69c92 ("setlocale: Support the UTF-8 environment on
native Windows.") introduces Windows support for C.UTF-8 in
setlocale.c. Emulating C.UTF-8 with "English_United States.65001" looks
like a great idea for LC_CTYPE. However, it's problematic for some
other locale categories. I think "C.UTF-8" should map to a mixed
locale.

I noticed the informative link[1] in the Gettext commit 3873b7f1c777
("intl: Treat C.UTF-8 locale like C locale."). The wiki page makes
me suspect that LC_ALL=C.UTF-8 should set LC_CTYPE to
"English_United States.65001" while setting all other categories to
"C". This seems pretty clear for LC_COLLATE and LC_NUMERIC but I'm not
sure about the other categories.
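
To make the idea concrete, here is a minimal sketch of such a mixed
locale. It is not what setlocale.c does; set_mixed_utf8_locale() is a
hypothetical helper name, and the caller is assumed to pass
"English_United States.65001" as the LC_CTYPE name on native Windows.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of Gnulib: set every category to "C",
   then override only LC_CTYPE with the given UTF-8 capable locale.
   Returns 1 on success and 0 if either setlocale() call fails. */
int set_mixed_utf8_locale(const char *ctype_locale)
{
    /* Start from the plain "C" locale for all categories. */
    if (setlocale(LC_ALL, "C") == NULL)
        return 0;

    /* Only character classification and encoding become UTF-8 aware. */
    return setlocale(LC_CTYPE, ctype_locale) != NULL;
}
```

After a successful call, setlocale(LC_COLLATE, NULL) and
setlocale(LC_NUMERIC, NULL) should still report "C".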

[1] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

LC_COLLATE=C.UTF-8 should sort in Unicode codepoint order. In UTF-8 this
is the same as byte order, thus "C" does the right thing (at least for
valid UTF-8 inputs). In contrast, "English_United States.65001" sorts by
English rules, for example, putting "ä" before "b". Test with
strcoll("b", "ä"): byte order gives a negative result, English rules a
positive one.
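
A small helper along these lines can be used for the comparison. It is
only a sketch: collate_order() is a name I made up, and "ä" is written
as its UTF-8 bytes so the result doesn't depend on the source file
encoding.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: report how strcoll() orders a and b under the
   given LC_COLLATE locale. Returns -1, 0, or 1, or -2 if the locale
   cannot be set. */
int collate_order(const char *locale_name, const char *a, const char *b)
{
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return -2;

    int r = strcoll(a, b);

    /* Normalize to the sign only; the magnitude is unspecified. */
    return (r > 0) - (r < 0);
}
```

collate_order("C", "b", "\xc3\xa4") returns -1 because 0x62 is less
than 0xC3. Passing "English_United States.65001" as the locale name on
Windows shows how that locale orders the same pair.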

LC_NUMERIC can make a difference if thousands separators are requested
with the ' flag. Thousands separators in printf() are a POSIX feature
that UCRT's printf() doesn't support. However, MinGW-w64's replacement
via "#define __USE_MINGW_ANSI_STDIO 1" provides an implementation that
does support them.
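
The following sketch shows one way to check this. format_grouped() is
a hypothetical helper name; the define only matters with MinGW-w64 and
should be harmless elsewhere.

```c
#define __USE_MINGW_ANSI_STDIO 1  /* MinGW-w64: use its own printf family */
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: format value with the POSIX ' flag under the
   given LC_NUMERIC locale. Returns the snprintf() result, or -1 if
   the locale cannot be set. */
int format_grouped(const char *locale_name, long value,
                   char *buf, size_t size)
{
    if (setlocale(LC_NUMERIC, locale_name) == NULL)
        return -1;

    /* The ' flag requests the locale's thousands grouping, if any. */
    return snprintf(buf, size, "%'ld", value);
}
```

With an LC_NUMERIC locale whose thousands separator is empty, as in
"C", the digits come out ungrouped. With a locale that defines a
separator, the output depends on what LC_NUMERIC was set to, which is
why the mapping of C.UTF-8 matters here.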

-- 
Lasse Collin
