On Sun, Nov 28, 2010 at 05:21:33PM +0000, Thorsten Glaser wrote:
> Fun to be reading this. Me like ;-)
>
> Anyway. With my Debian hat on, the C/POSIX locales must not use
> UTF-8 as encoding, because otherwise, all kind of hell breaks
> loose (consider running 'tr u x' on a binary or other legacy
> encoded text file, and tr is just an example).

From my reading of the standards, a UTF-8 C locale would be required
to behave identically to the existing ASCII C locale:

• will consider all byte sequences valid
• will use only the ASCII collation sequences (LC_COLLATE would be
  identical)
• LC_CTYPE would probably also be identical (SUS specifies this less
  strictly than LC_COLLATE), but for backward compatibility should
  probably remain the same.

About the only differences would be that the transliteration table
is no longer needed, and that nl_langinfo(CODESET) would return
UTF-8. That's pretty much it.

I'd like to pursue this in the long term, but I doubt I'll have the
time to commit to it for several months. If anyone else wishes to
tackle it, feel free to go for it!

> There are plans
> for C.UTF-8 though, and I’m a bit ashamed at having slacked off
> there…

No worries, there's not much going to happen at this stage in the
squeeze freeze. Hopefully easy to get added early in the wheezy
cycle though!

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=522776 (the very end)
and #609306 (same bug but a feature request for eglibc).


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
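
To make the 'tr u x' concern and the byte-validity point above concrete, here is a rough Python sketch (Unix only, since it relies on nl_langinfo). The helper names tr_bytes, tr_strict_utf8 and codeset_of are made up for illustration and are not part of any tool discussed in this thread. It contrasts a byte-wise translation, which accepts any input, with a hypothetical variant that insists on valid UTF-8, and then asks a locale for its codeset. Whether a locale named C.UTF-8 exists on a given system is an assumption, so the lookup returns None when it is missing.

```python
import locale


def tr_bytes(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Byte-wise analogue of `tr src dst`: never decodes, so any input is valid."""
    return data.translate(bytes.maketrans(src, dst))


def tr_strict_utf8(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical strict variant: decode as UTF-8 first, then translate."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on legacy bytes
    return text.translate(str.maketrans(src, dst)).encode("utf-8")


def codeset_of(name: str):
    """Return nl_langinfo(CODESET) under locale `name`, or None if unavailable."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return locale.nl_langinfo(locale.CODESET)
    except locale.Error:
        return None
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore process-wide state


# A Latin-1 encoded string: not valid UTF-8, but perfectly good bytes.
latin1 = "grüße du".encode("latin-1")

print(tr_bytes(latin1, b"u", b"x"))  # byte-wise: only the ASCII 'u' changes

try:
    tr_strict_utf8(latin1, "u", "x")
except UnicodeDecodeError as err:
    print("strict UTF-8 handling rejects the input:", err.reason)

# The codeset names are platform-dependent, so no output is shown here.
print(codeset_of("C"), codeset_of("C.UTF-8"))
```

The byte-wise function corresponds to the "all byte sequences valid" behaviour described in the list above; the strict variant shows the failure mode Thorsten is worried about if a C locale started rejecting non-UTF-8 input.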