At Fri, 22 Apr 2005 12:14:53 +0100,
Ross Paterson wrote:
> According to the spec, mbrtowc(&wc, buf, 1, &st) should either return 1
> and set wc, or return 0, (size_t)-1 or (size_t)-2.  In this locale it
> returns either 0 or 1, but doesn't always set wc in the latter case,
> as the following test program shows.  I believe it should be returning
> (size_t)-2 (incomplete encoding) for (most) letters, and setting wc in
> all the other cases (except \0).

It works OK when I change the test program as follows:

--- test.c~ 2005-04-23 11:45:25.000000000 +0900
+++ test.c 2005-04-23 12:17:11.000000000 +0900
@@ -7,7 +7,7 @@
 main()
 {
     int c;
     mbstate_t st;
-    char buf[1];
+    char buf[2];
     size_t size;
     wchar_t wc;
@@ -15,8 +15,9 @@
     for (c = 0; c <= 0xff; c++) {
         wc = 0xbaad;
         buf[0] = c;
+        buf[1] = '\0';
         memset(&st, 0, sizeof(st));
-        size = mbrtowc(&wc, buf, 1, &st);
+        size = mbrtowc(&wc, buf, 2, &st);
         printf("c = 0x%02x, size = %d, wc = U+%04X\n", c, size, wc);
     }
     return 0;

> (In iconvdata/tcvn5712-1.c, this decoding is treated as stateful, but
> I don't think it should be.)

It has five combining characters:

    http://www.informatik.uni-leipzig.de/~duc/software/misc/tcvn.txt

TCVN5712:1993 is a very unusual encoding, because 0xb0..0xb4 are
postposed combining characters.  This means that even after we read the
first character, we cannot decide the output character until we have
read the second one.  Historically it was designed to be stateless, but
in practice it is completely stateful, so the original intent makes no
difference.

If we detect that the input sequence ends in an intermediate state, that
is, if the pending character is between U+0041 and U+01B0, we should
output that character to the stream.  However, the current
implementation does not handle this correctly; it just stores the
character in the internal buffer.

I'll investigate this bug.

Regards,
-- gotom
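
P.S. To make the lookahead problem concrete, here is a minimal sketch of
a one-byte-lookahead decoder.  It is not the code in
iconvdata/tcvn5712-1.c; the names (struct unit, struct decoder,
decoder_feed, decoder_flush) are hypothetical, and it deliberately
simplifies by holding back every non-combining byte and by not mapping
anything to Unicode.  The only fact taken from the encoding is that
bytes 0xb0..0xb4 are postposed combining marks.

```c
#include <assert.h>
#include <stddef.h>

/* One decoded unit: a base byte plus an optional trailing combining
   byte.  mark == 0 means no combining mark followed the base. */
struct unit {
    unsigned char base;
    unsigned char mark;
};

/* Hypothetical decoder state: at most one base byte is held back. */
struct decoder {
    int have_pending;
    unsigned char pending;
};

/* Bytes 0xb0..0xb4 are the postposed combining marks. */
static int is_combining(unsigned char c)
{
    return c >= 0xb0 && c <= 0xb4;
}

/* Feed one input byte.  Writes at most one unit to *out and returns
   the number of units written (0 or 1). */
static size_t decoder_feed(struct decoder *d, unsigned char c,
                           struct unit *out)
{
    assert(d != NULL && out != NULL);

    if (d->have_pending) {
        out->base = d->pending;
        if (is_combining(c)) {
            /* The mark attaches to the held base; nothing is pending. */
            out->mark = c;
            d->have_pending = 0;
        } else {
            /* Emit the held base alone; c becomes the new held base. */
            out->mark = 0;
            d->pending = c;
        }
        return 1;
    }

    if (is_combining(c)) {
        /* Simplification: a mark with no base is passed through as is. */
        out->base = c;
        out->mark = 0;
        return 1;
    }

    /* A base byte: we cannot emit it until we see what follows. */
    d->pending = c;
    d->have_pending = 1;
    return 0;
}

/* End of input: emit the held base, if any.  Returns 0 or 1. */
static size_t decoder_flush(struct decoder *d, struct unit *out)
{
    assert(d != NULL && out != NULL);

    if (!d->have_pending)
        return 0;
    out->base = d->pending;
    out->mark = 0;
    d->have_pending = 0;
    return 1;
}
```

With this sketch, feeding a single base byte produces no output until
either a second byte arrives or decoder_flush() is called.  That is the
same situation the one-byte mbrtowc() call in the test program runs
into, and the flush step corresponds to the part that the current
implementation gets wrong.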