https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94168

--- Comment #2 from Lewis Hyatt <lhyatt at gmail dot com> ---
(In reply to Martin Liška from comment #0)
> I see the following regression:
> 
> $ cat red.cc
> #ifdef WINDOWS
> §
> #endif
> 
> $ g++-9 red.cc -c
> $ g++ red.cc -c
> red.cc:2:1: error: extended character § is not valid in an identifier
>     2 | ��
>       | ^

The corrupted colorization in the diagnostics is a bug that I submitted a patch
for already. David prefers that fix to wait for GCC 11.

Regarding the behavior, if you replace the § with the equivalent UCN:

#ifdef WINDOWS
\u00A7
#endif

then you will get the same behavior with older GCC before my patch too. My
patch causes the UTF-8 to be interpreted as an identifier rather than a stray
token, hence it ends up with the same error.

As it happens, if you compile your test in C mode, it will succeed, because the
UTF-8 logic for C mode treats the invalid character as a stray token rather
than part of an identifier, then it gets compiled out fine. In C++, it is
rather a syntax error by design so it triggers this error. When switching to
UCN syntax, it is an error for both C and C++ so fails either way.

Looking at the relevant code in charset.c (_cpp_valid_ucn and _cpp_valid_utf8)
... I think it is probably just a matter of checking pfile->state.skipping in
more places. I made _cpp_valid_utf8 so as to preserve all the analogous
behavior of the existing _cpp_valid_ucn. It seems that _cpp_valid_ucn checks
pfile->state.skipping in some cases, like for $ in identifiers, but not for
others, such as the invalid character case.

I am happy to submit a patch to fix this, but I am not sure in what all cases
it is correct to skip the error. For instance, this code can be made to trigger
an error too, in C90 mode:

$ cat t.c
#ifdef WINDOWS
int \u00E4;
#endif

$ gcc-8 -c t.c -std=c90 -fextended-identifiers
t.c:2:5: warning: universal character names are only valid in C++ and C99
 int \u00E4;
     ^

That is because _cpp_valid_ucn doesn't check pfile->state.skipping for this
case either. I think, especially in C++, there are probably at least some cases
where an error should be triggered even in conditionally compiled code, but I
don't know enough off hand to say for sure.

FWIW, the below patch fixes the present issue, but it doesn't tackle equivalent
UCN behavior or fix the related issues... I just need some guidance as to the
expected behavior to do that.

-Lewis

diff --git a/libcpp/charset.c b/libcpp/charset.c
index d9281c5fb97..129f234349e 100644
--- a/libcpp/charset.c
+++ b/libcpp/charset.c
@@ -1260,7 +1260,7 @@ _cpp_valid_utf8 (cpp_reader *pfile,
             way).  In C, this byte rather becomes grammatically a separate
             token.  */

-         if (CPP_OPTION (pfile, cplusplus))
+         if (!pfile->state.skipping && CPP_OPTION (pfile, cplusplus))
            cpp_error (pfile, CPP_DL_ERROR,
                       "extended character %.*s is not valid in an identifier",
                       (int) (*pstr - base), base);
@@ -1273,7 +1273,7 @@ _cpp_valid_utf8 (cpp_reader *pfile,
          break;

        case 2:
-         if (identifier_pos == 1)
+         if (!pfile->state.skipping && identifier_pos == 1)
            {
              /* This is treated the same way in C++ or C99 -- lexed as an
                 identifier which is then invalid because an identifier is

Reply via email to