From the "Linux Programmer’s Manual" (release 3.15 of the Linux man-pages): "If the n bytes starting at s do not contain a complete multibyte character, mbrtowc() returns (size_t) -2."
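
The return values described in that passage can be told apart as in the following minimal sketch. The helper name `classify_mbrtowc_result` is hypothetical and not part of any library; the classes follow the C standard's description of mbrtowc().

```c
#include <stddef.h>

/* Hypothetical helper (not a library function): names the result class
   of an mbrtowc() return value. */
static const char *classify_mbrtowc_result(size_t ret)
{
    if (ret == (size_t)-1)
        return "invalid";     /* encoding error; errno is set to EILSEQ */
    if (ret == (size_t)-2)
        return "incomplete";  /* valid so far, but more bytes are needed */
    if (ret == 0)
        return "null";        /* the null wide character was decoded */
    return "complete";        /* this call completed one character */
}
```

Because (size_t)-2 signals an incomplete but so far valid sequence, a caller that gets it should supply further bytes with the same conversion state rather than treat it as an error.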
On Mon, Jul 27, 2009 at 6:56 PM, Andy Koppe wrote:
> I've encountered what looks like a bug in mbrtowc's handling of UTF-8.
> Here's an example:
>
> #include <stdio.h>
> #include <locale.h>
> #include <stdlib.h>
> #include <wchar.h>
>
> int main(void) {
>   wchar_t wc;
>   size_t ret;
>   mbstate_t s = { 0 };
>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>   printf("%i\n", mbrtowc(&wc, "\xe2", 1, 0));
>   printf("%i\n", mbrtowc(&wc, "\x94", 1, 0));
>   printf("%i\n", mbrtowc(&wc, "\x84", 1, 0));
>   printf("%x\n", wc);
>   return 0;
> }
>
> The sequence E2 94 84 should translate to U+2504. Instead, the second
> and third calls to mbrtowc report encoding errors. It does work
> correctly if the three bytes are passed to mbrtowc() in one go:
>
>   printf("%i\n", mbrtowc(&wc, "\xe2\x94\x84", 3, 0));
>
> Andy