2010/1/4 Thomas Wolff:
> My assumption has been that *printf should be byte-transparent unless where
> it uses explicit wide character arguments.

What's that assumption based on?

> After all, legacy applications that do not care about locales at all may
> legitimately assume this since a C char [] is a byte sequence;

Erm, the meaning of a byte sequence is up to each function.

> this is not affected by the legacy casual usage of the word "character"
> referring to a char value which does not automatically imply "wide
> character".

There is no casual usage of "byte" and "character" in the POSIX
standard. See
http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html.
In particular:

3.84 Byte: An individually addressable unit of data storage that is
exactly an octet, used to store a character or a portion of a character;
see also Character. A byte is composed of a contiguous sequence of 8
bits. The least significant bit is called the "low-order" bit; the most
significant is called the "high-order" bit.

3.87 Character: A sequence of one or more bytes representing a single
graphic symbol or control code.

3.92 Character String: A contiguous sequence of characters terminated by
and including the first null byte.

3.367 String: A contiguous sequence of bytes terminated by and including
the first null byte.

(And yep, a lot of confusion would go away if the 'char' type was called
'byte' instead, but of course that's out of the question.)

> In that thread, someone had originally confused char * with wchar [] - the
> issue resolves cleanly if these are properly distinguished.
>
> Comments on the EILSEQ clause from that thread:
>>
>> > It's talking about "characters" rather than "bytes" there, which I
>> > think does leave the behaviour for invalid bytes undefined,

That sentence had nothing to do with EILSEQ. Here it is in its original
context:

"I couldn't find specific text about invalid bytes in the POSIX printf
spec, but it does say the following:

"The format is a character string, beginning and ending in its initial
shift state, if any.
The format is composed of zero or more directives: ordinary characters,
which are simply copied to the output stream, and conversion
specifications, each of which shall result in the fetching of zero or
more arguments."

It's talking about "characters" rather than "bytes" there, which I think
does leave the behaviour for invalid bytes undefined, so newlib's printf
implementation is in its rights to just stop processing the string at
one of those."

To emphasise this again, the printf spec explicitly says that "the format
is a *character* string".

> I don't think there is such a thing like an invalid multibyte character in a
> char [] unless it is being interpreted with a multi-byte function, that's
> what e.g. the mb* functions are for.

Well, you're wrong. See the definition of 'character'.

> In a legacy application, especially in an sprintf which may not even be
> intended for printing, there is no intent to apply a multi-byte
> interpretation. This is over-imposing semantics on a basic C type.

No, it's necessary for printf to work correctly with all character sets.
For example, the second byte in a double-byte SJIS character can
actually be the same as the ASCII code for '%'. Hence, if printf blindly
copied bytes until encountering a '%', it would not be possible to print
such characters.

> So I do not agree that printf is right here, and if it were, the third line
> in the example would have had to fail as well, actually.

Including invalid bytes in the format string is undefined behaviour.
Anything can happen. And what likely happened is that the compiler
replaced the third sprintf call with strcpy (which is specified on
strings rather than character strings).

The real discussion to be had here is whether "C" should continue to
mean UTF-8 or return to ASCII for the sake of Linux compatibility. See
http://cygwin.com/ml/cygwin-developers/2009-12/msg00112.html for that.
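
To make the point about a trail byte colliding with '%' concrete, here
is a rough sketch in C. It does not use a real character set: the
encoding is a made-up double-byte scheme (lead bytes 0x81..0x9F, any
following byte taken as the trail byte), and the names is_lead_byte,
count_percent_bytes and count_percent_chars are hypothetical helpers,
not newlib or POSIX functions. It only contrasts a byte-wise scan for
'%' with a character-wise one.

```c
#include <assert.h>
#include <stddef.h>

/* Toy double-byte encoding (an assumption for illustration, not real
   Shift_JIS): bytes 0x81..0x9F are lead bytes, and whatever byte
   follows a lead byte is its trail byte. All other bytes are
   single-byte characters. */
int is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0x9F;
}

/* Byte-wise scan: treats every 0x25 byte as the start of a directive. */
size_t count_percent_bytes(const char *fmt)
{
    size_t n = 0;

    assert(fmt != NULL);
    for (; *fmt != '\0'; fmt++)
        if (*fmt == '%')
            n++;
    return n;
}

/* Character-wise scan: steps over whole characters, so a trail byte
   equal to 0x25 is never mistaken for '%'. Returns (size_t)-1 if a
   lead byte is immediately followed by the terminating null byte,
   i.e. the string is not a valid character string in this encoding. */
size_t count_percent_chars(const char *fmt)
{
    size_t n = 0;

    assert(fmt != NULL);
    while (*fmt != '\0') {
        if (is_lead_byte((unsigned char)*fmt)) {
            if (fmt[1] == '\0')
                return (size_t)-1;  /* truncated double-byte character */
            fmt += 2;               /* skip lead and trail byte together */
        } else {
            if (*fmt == '%')
                n++;
            fmt++;
        }
    }
    return n;
}
```

With the input "\x81\x25" "%d", the byte-wise scan reports two '%'
bytes while the character-wise scan reports one directive, which is the
difference between treating the format as a string and treating it as a
character string.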

Andy