GCC 4.4.1 incorrectly converts the code point U+FFFF when generating a two-byte character. It mistakes this code point for a supplemental-plane one, and generates the invalid surrogate pair U+D7FF U+DFFF. This bug is present as far back as GCC 3.4.6.
Here is a test program that demonstrates the bug, and could function as a regression test. This program uses char16_t, but GCC 3.4.5 as shipped with MinGW also shows this bug when wchar_t is used.

--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--
/* gcc-utf16-test.c -- demonstrate a bug in GCC 4.4.1, that causes the
   code point U+FFFF to convert incorrectly to UTF-16.
   Compile on GCC 4.4.1 with -std=gnu99. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    static const __CHAR16_TYPE__ teststr1[] = u"\uFFFF";
    static const __CHAR16_TYPE__ teststr2[] = u"\U00010000";
    size_t i;

    printf("The string \"\\uFFFF\" converts as:");
    for (i = 0; teststr1[i] != 0; i++)
        printf(" U+%04X", teststr1[i]);
    printf("\n");
    if (teststr1[0] != 0xFFFF || teststr1[1] != 0) {
        printf("This conversion is INCORRECT. It should be U+FFFF.\n");
        return EXIT_FAILURE;
    }

    printf("The string \"\\U00010000\" converts as:");
    for (i = 0; teststr2[i] != 0; i++)
        printf(" U+%04X", teststr2[i]);
    printf("\n");
    if (teststr2[0] != 0xD800 || teststr2[1] != 0xDC00 || teststr2[2] != 0) {
        printf("This conversion is INCORRECT. It should be U+D800 U+DC00.\n");
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}
--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--

The problem is a simple off-by-one error in the function one_utf8_to_utf16 in libcpp/charset.c.
The following patch against the GCC 4.4.1 source corrects the bug:

--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--
--- gcc-4.4.1/libcpp/charset.c.old	2009-04-09 19:23:07.000000000 -0400
+++ gcc-4.4.1/libcpp/charset.c	2009-10-12 04:06:25.000000000 -0400
@@ -354,7 +354,7 @@
       return EILSEQ;
     }
 
-  if (s < 0xFFFF)
+  if (s <= 0xFFFF)
     {
       if (*outbytesleftp < 2)
 	{
--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--CUT HERE--

-- 
           Summary: "\uFFFF" converts incorrectly to two-byte character
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: chasonr at newsguy dot com
 GCC build triplet: x86_64-unknown-linux
  GCC host triplet: x86_64-unknown-linux
GCC target triplet: x86_64-unknown-linux

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=41698