https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168
--- Comment #2 from Lewis Hyatt <lhyatt at gmail dot com> --- (In reply to Martin Liška from comment #0) > I see the following regression: > > $ cat red.cc > #ifdef WINDOWS > § > #endif > > $ g++-9 red.cc -c > $ g++ red.cc -c > red.cc:2:1: error: extended character § is not valid in an identifier > 2 | �� > | ^ The corrupted colorization in the diagnostics is a bug that I submitted a patch for already. David prefers that fix to wait for GCC 11. Regarding the behavior, if you replace the § with the equivalent UCN: #ifdef WINDOWS \u00A7 #endif then you will get the same behavior with older GCC before my patch too. My patch causes the UTF-8 to be interpreted as an identifier rather than a stray token, hence it ends up with the same error. As it happens, if you compile your test in C mode, it will succeed, because the UTF-8 logic for C mode treats the invalid character as a stray token rather than part of an identifier, then it gets compiled out fine. In C++, it is rather a syntax error by design so it triggers this error. When switching to UCN syntax, it is an error for both C and C++ so fails either way. Looking at the relevant code in charset.c (_cpp_valid_ucn and _cpp_valid_utf8) ... I think it is probably just a matter of checking pfile->state.skipping in more places. I made _cpp_valid_utf8 so as to preserve all the analogous behavior of the existing _cpp_valid_ucn. It seems that _cpp_valid_ucn checks pfile->state.skipping in some cases, like for $ in identifiers, but not for others, such as the invalid character case. I am happy to submit a patch to fix this, but I am not sure in what all cases it is correct to skip the error. For instance, this code can be made to trigger an error too, in C90 mode: $ cat t.c #ifdef WINDOWS int \u00E4; #endif $ gcc-8 -c t.c -std=c90 -fextended-identifiers t.c:2:5: warning: universal character names are only valid in C++ and C99 int \u00E4; ^ That is because _cpp_valid_ucn doesn't check pfile->state.skipping for this case either. I think, especially in C++, there are probably at least some cases where an error should be triggered even in conditionally compiled code, but I don't know enough off hand to say for sure. FWIW, the below patch fixes the present issue, but it doesn't tackle equivalent UCN behavior or fix the related issues... I just need some guidance as to the expected behavior to do that. -Lewis diff --git a/libcpp/charset.c b/libcpp/charset.c index d9281c5fb97..129f234349e 100644 --- a/libcpp/charset.c +++ b/libcpp/charset.c @@ -1260,7 +1260,7 @@ _cpp_valid_utf8 (cpp_reader *pfile, way). In C, this byte rather becomes grammatically a separate token. */ - if (CPP_OPTION (pfile, cplusplus)) + if (!pfile->state.skipping && CPP_OPTION (pfile, cplusplus)) cpp_error (pfile, CPP_DL_ERROR, "extended character %.*s is not valid in an identifier", (int) (*pstr - base), base); @@ -1273,7 +1273,7 @@ _cpp_valid_utf8 (cpp_reader *pfile, break; case 2: - if (identifier_pos == 1) + if (!pfile->state.skipping && identifier_pos == 1) { /* This is treated the same way in C++ or C99 -- lexed as an identifier which is then invalid because an identifier is