Charles Wilson <cygwin <at> cwilson.fastmail.fm> writes:

> > If cygwin wants to be POSIX compatible then the C locale cannot use
> > UTF-8.
>
> Not true. POSIX places no restriction against the C locale using a
> multi-byte charset.
>
> "The tables in Locale Definition describe the characteristics and
> behavior of the POSIX locale for data consisting entirely of characters
> from the portable character set and the control character set. For other
> characters, the behavior is unspecified. For C-language programs, the
> POSIX locale shall be the default locale when the setlocale() function
> is not called."
>
> IOW, it only imposes requirements on how the POSIX locale operates on
> the basic 128 characters (*interpreted as characters*, with zero regard
> to their hexadecimal values). For ASCII and UTF-8...those characters are
> the "lower 128" 7bit hex values, and are the same; behavior with respect
> to "other characters" -- the "upper 128" for single byte, and any
> multibyte -- is explicitly "unspecified". So C.UTF-8 is a perfectly
> valid default POSIX locale.

I concur with Chuck's reading of POSIX - the C locale is allowed to use a
multibyte character encoding, _precisely_ because behavior is unspecified
if an application ever attempts to interpret any 8-bit bytes in a character
context. Using the UTF-8 charset in the "C" locale is permitted by POSIX,
and any application that thinks that "C" implies a unibyte charset is
broken.

In non-character contexts (such as strcmp), the "C" locale guarantees that
8-bit bytes will sort in the same order as extended ASCII (i.e., based on
the bytes' values, regardless of whether a byte represents any character,
and regardless of whether the charset has multibyte encodings). And
thankfully, UTF-8 has the nice property that strcmp happens to also perform
character sorting (at least, for properly normalized character sequences).
The problem is only visible when using character contexts.
But that is exactly what gcc is doing - it is using the charset
determination (UTF-8 in cygwin's case) coupled with the "C" locale to make
decisions on which quoting characters to use, and that's where gcc is
falling foul of POSIX.

> The underlying issue is actually gcc: its i18n messages appear
> explicitly to "translate" from (e.g.) _("error in file '%s'") to "error
> in file {fancy-left-quote}%s{fancy-right-quote}" when the encoding is
> UTF-8. Working around that by specifying setlocale("C") isn't
> sufficient, without also specifying the encoding...

The correct workaround is indeed to specify a locale with specific charset
encodings, rather than relying on plain "C" (hopefully cygwin will support
"C.ASCII", if it does not already).

> But not all systems will recognize "C.ASCII" as /THE/ C locale, with
> explicit ASCII encoding; they might not recognize "C.ASCII" at all.
> Looks like to me that this silence concerning encoding is a hole in the
> standard.

As far as I know, the hole is intentional. But if others would like me to,
I am willing to pursue the action of raising a defect against the POSIX
standard, requesting that the next version of POSIX consider including a
standardized name for a locale with guaranteed single-byte encoding.

--
Eric Blake