Hi,

On 2022-09-24 23:16, Thorsten Glaser wrote:
> Package: locales
> Version: 2.35-1
> Severity: normal
> X-Debbugs-Cc: t...@mirbsd.de
>
> While adjusting my localedata patch script to the latest glibc uploads
> I discovered a surprising difference in some categories — for example:

Starting with glibc 2.35, we no longer patch glibc to add C.UTF-8
support; instead we use the upstream code, which comes with the
following NEWS entry [1]:

  * Support for the C.UTF-8 locale has been added to glibc. The locale
    supports full code-point sorting for all valid Unicode code points. A
    limitation in the framework for fnmatch, regexec, and regcomp requires
    a compromise to save space and only ASCII-based range expressions are
    supported for now (see bug 28255). The full size of the locale is only
    ~400KiB, with 346KiB coming from LC_CTYPE information for Unicode. This
    locale harmonizes downstream C.UTF-8 already shipping in various
    downstream distributions. The locale is not built into glibc, and must
    be installed.

The point of having it merged upstream is that all distributions will
now use the same definition for the C.UTF-8 locale, which was not the
case before.

> (sid-amd64)tglase@tglase:~ $ LC_ALL=C ./tstspc
> U+0009
> U+000A
> U+000B
> U+000C
> U+000D
> U+0020
> (sid-amd64)tglase@tglase:~ $ LC_ALL=C.UTF-8 ./tstspc
> U+0009
> U+000A
> U+000B
> U+000C
> U+000D
> U+0020
> U+1680
> U+2000
> U+2001
> U+2002
> U+2003
> U+2004
> U+2005
> U+2006
> U+2008
> U+2009
> U+200A
> U+2028
> U+2029
> U+205F
> U+3000

This is expected, given that the LC_CTYPE information used for the
C.UTF-8 locale comes from Unicode.

> The test program is thus: gcc -O2 -Wall -Wextra -Wformat -o tstspc tstspc.c
>
> //--------------------------------cut-here------------------------------
[snip]
> //--------------------------------cut-here------------------------------
>
>
> In my localedata patch script, I take specific care to change the
> copy of i18n_ctype before applying it to C.UTF-8 as follows:
>
> space → <U0009>..<U000D>;<U0020>
> cntrl → <U0000>..<U001F>;<U007F>
> blank → <U0009>;<U0020>
>
> They are as mandated by POSIX for the C locale. I believe I said
> in my original 2013 proposal for a C.UTF-8 locale that it should
> be as close to C as possible while using UTF-8 as encoding.

Those are mandated for the POSIX C locale, but POSIX does not (yet) say
anything about the C.UTF-8 locale. The choice made by upstream was
discussed over many years [2]; if you disagree with it, please take it
up with upstream.

Regards
Aurelien

[1] https://sourceware.org/git/?p=glibc.git;a=blob;f=NEWS;h=faa7ec1871da1a34ed943fd8d406496e58fb2c2e;hb=f94f6d8a3572840d3ba42ab9ace3ea522c99c0c2
[2] https://sourceware.org/glibc/wiki/Proposals/C.UTF-8

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurel...@aurel32.net                 http://www.aurel32.net