------- Additional Comments From joseph at codesourcery dot com 2004-12-16 23:04 -------
Subject: Re: UCNs not recognized in identifiers (c++/c99)

The following example illustrates the problems with lack of normalisation. (I still expect WG14 and WG21 to consider the lack of normalisation to be both the current meaning of the standards and their correct meaning in context, though future revisions might change the exact lists of characters, but this is an appropriate example to present to them and shows why diagnostics would be needed for various cases.)

\u05e9\u05bc\u05c1
\u05e9\u05c1\u05bc

are valid identifiers in C99 but not C++ while

\ufb2c

is a valid identifier in C++ but not in C99. In Unicode, the three are canonically equivalent, the first being both NFC and NFD.

05BC HEBREW POINT DAGESH OR MAPIQ (combining class 21)
05C1 HEBREW POINT SHIN DOT (combining class 24)
05E9 HEBREW LETTER SHIN (combining class 0)
FB2C HEBREW LETTER SHIN WITH DAGESH AND SHIN DOT (combining class 0)

(U+FB2C is excluded from the compositions allowed in NFC, hence the decomposed form being NFC.)

So with current C and C++ standards users cannot portably link some pointed Hebrew identifiers between the two languages; it would be advisable for them to avoid such identifiers.

Warning for any use of the characters permitted by C++ but not C seems appropriate in the expectation that such characters will cease to be permitted in future, regardless of any other changes there may be.

Making the C++ extern "C" \ufb2c into something else would seem to me to be the road to madness, though we could see how other implementations of the C++ ABI interpret it as regards identifiers with UCNs.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449
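
The canonical equivalence described in this comment can be checked outside the compiler. The following is a minimal sketch using Python's standard `unicodedata` module; the names `FORMS`, `code_points` and `normalised` are made up for this illustration and are not part of GCC or of either standard. It only inspects Unicode normalisation and says nothing about which spellings a given compiler accepts.

```python
import unicodedata

# The three spellings discussed in the comment, as Python strings.
# The dictionary keys are informal labels chosen for this sketch.
FORMS = {
    "shin, dagesh, shin dot": "\u05e9\u05bc\u05c1",
    "shin, shin dot, dagesh": "\u05e9\u05c1\u05bc",
    "precomposed U+FB2C": "\ufb2c",
}


def code_points(text):
    """Render a string as space-separated upper-case hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def normalised(text, form):
    """Return the code points of `text` after normalising to `form` (e.g. "NFC")."""
    return code_points(unicodedata.normalize(form, text))


if __name__ == "__main__":
    # Canonical combining classes of the individual characters.
    for ch in "\u05bc\u05c1\u05e9\ufb2c":
        print(f"{ord(ch):04X} {unicodedata.name(ch)} "
              f"(combining class {unicodedata.combining(ch)})")

    # Each spelling alongside its NFC and NFD forms.
    for label, text in FORMS.items():
        print(f"{label}: {code_points(text)} "
              f"NFC={normalised(text, 'NFC')} NFD={normalised(text, 'NFD')}")
```

If all three spellings report the same NFC and NFD sequences, they are canonically equivalent in the sense used above, even though a compiler that compares identifiers code point by code point would treat them as three distinct names.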