On Mon, Nov 05, 2007 at 10:23:43PM -0500, Chet Ramey wrote:
> Rich Felker wrote:
>
> > I'm not sure what you mean. For a Latin-1 locale there is no
> > difference, but if the locale is a different legacy locale, the
> > wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
> > defined) needs to be returned. If you're doubtful about the intent of
> > the standard, why not file a request for interpretation?
>
> I'm not doubtful about the standard's intent. When the user has not
> chosen to use a locale that contains multibyte characters, not only
> should bash not second-guess the user by returning a multibyte
> character, functions such as mbrtowc or mblen/mbrlen will not return
> "multibyte" values (e.g., mbrlen will return `1' and mbrtowc will return
> `-61' -- converted to 195, since it's unsigned -- as its wchar value
> while converting 1 character in your example).

This 195 _is_ its value as a multibyte character in a locale with
ISO-8859-1 as its codeset. In such a locale, it's also the value of
the byte (interpreted as unsigned). So here it doesn't matter which
you use; either is equally correct.

Where something different happens is if your locale has a different
codeset. For instance, in KOI8-R, the character "²" is placed on a
different byte (9D) than in ISO-8859-1 (B2). But regardless of your
locale,

  $ printf %d\\n \'²

should print 178, provided that your system implementation uses the
same values for wchar_t regardless of locale. These semantics are
useful because they actually tell you something about the identity of
the character.

But most importantly, it's just illogical for the function to behave
differently based on whether MB_CUR_MAX is 1 or something greater than
1, rather than being based on the actual locale encoding. "²" is a "²"
in a KOI8-R locale just as much as it is a "²" in a UTF-8 locale.
Bash's printf should not treat the KOI8-R locale badly just because
all of its characters happen to fit into one byte. The mbrtowc
function will give the correct result for all locales, whether or not
they have characters that take multiple bytes to represent;
special-casing the locales that don't just gives illogical (and
non-conformant!) behavior.

Rich


P.S. For my own usage I'd be plenty happy as long as the bug is fixed
in UTF-8 based locales, since that's all I ever intend to use. But I
maintain that the current behavior is incorrect and non-conformant in
other locales as well. If you want a compromise, why not make the
correct behavior dependent on strict POSIX mode?
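
For anyone who wants to see the codeset point without installing extra
locales, here is a rough sketch. It is in Python rather than C, and it
uses Python's built-in codecs as a stand-in for what mbrtowc does with
the locale's codeset; the helper name char_value is made up for this
illustration and is not anything from bash or libc. It assumes a
system where wchar_t holds Unicode scalar values.

```python
# Illustration only: Python codecs stand in for the locale's codeset.
# bash itself would go through mbrtowc() in C; char_value is a
# hypothetical helper, not part of any real API.

def char_value(raw: bytes, codeset: str) -> int:
    """Decode exactly one character from raw bytes in the given codeset
    and return its Unicode scalar value."""
    text = raw.decode(codeset)
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)

# The same character, SUPERSCRIPT TWO (U+00B2), encoded in three codesets.
SUPERSCRIPT_TWO = "\u00b2"
encodings = {
    codeset: SUPERSCRIPT_TWO.encode(codeset)
    for codeset in ("iso-8859-1", "koi8-r", "utf-8")
}

# The byte sequences differ per codeset; the decoded value does not.
for codeset, raw in encodings.items():
    print(f"{codeset:11} bytes={raw.hex()} value={char_value(raw, codeset)}")
```

The bytes column changes from one codeset to the next while the value
column stays the same, which is the behavior being asked of printf
here: report the identity of the character, not the byte it happens
to sit on.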