Re: [PATCH] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]

Jason Merrill Thu, 25 Jul 2024 11:35:30 -0700

On 7/17/24 6:04 PM, Jakub Jelinek wrote:

Hi!


The following patch implements the easy parts of the paper.
When @$` are added to the basic character set, it means that
R"@$`()@$`" should now be valid (here I've noticed most of the
raw string tests were tested solely with -std=c++11 or -std=gnu++11
and I've tried to change that), and on the other side even if
by extension $ is allowed in identifiers, \u0024 or \U00000024
or \u{24} should not be, similarly how \u0041 is not allowed.
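
As a rough illustration of the raw string side of this, the delimiter rule can be sketched as a predicate. This is a minimal sketch that assumes an ASCII execution character set and models the standard's d-char rule directly; it is not the libcpp lex_raw_string code, and the cxx26 flag and both function names are hypothetical.

```cpp
#include <cassert>

// Sketch only: models the d-char rule from the standard, not libcpp.
// cxx26 is a hypothetical flag standing in for "C++26 or later".
constexpr bool is_basic_graphic (char c, bool cxx26)
{
  // Printable ASCII other than space.
  if (c < 0x21 || c > 0x7e)
    return false;
  // '$', '@' and '`' only join the basic character set in C++26.
  if (c == '$' || c == '@' || c == '`')
    return cxx26;
  return true;
}

// A raw string delimiter character is a basic character set member other
// than space, parentheses, backslash and the whitespace control characters.
constexpr bool is_raw_delimiter_char (char c, bool cxx26)
{
  return is_basic_graphic (c, cxx26) && c != '(' && c != ')' && c != '\\';
}

static_assert (!is_raw_delimiter_char ('@', false), "not a d-char before C++26");
static_assert (is_raw_delimiter_char ('@', true), "a d-char in C++26");
```

Under this model every character of the @$` delimiter in R"@$`()@$`" is accepted only when cxx26 is true.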

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

The paper in 3.1 claims though that
#include <stdio.h>

#define STR(x) #x

int main()
{
   printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
}
should have been accepted before this paper (and rejected after it),
but g++ rejects it.

I've tried to understand it, but am confused on what is the right
behavior and why.

Consider
#define STR(x) #x
const char *a = "\u00b7";
const char *b = STR(\u00b7);
const char *c = "\u0041";
const char *d = STR(\u0041);
const char *e = STR(a\u00b7);
const char *f = STR(a\u0041);
const char *g = STR(a \u00b7);
const char *h = STR(a \u0041);
const char *i = "\u066d";
const char *j = STR(\u066d);
const char *k = "\u0040";
const char *l = STR(\u0040);
const char *m = STR(a\u066d);
const char *n = STR(a\u0040);
const char *o = STR(a \u066d);
const char *p = STR(a \u0040);

Neither clang nor gcc emit any diagnostics on the a, c, i and k
initializers, those are certainly valid (c is invalid in C23 though).  g++
emits with -pedantic-errors errors on all the others, while clang++ on the
ones with STR involving \u0041, \u0040 and a\u066d.  The chosen values are
\u0040 '@' as something being changed by this paper, \u0041 'A' as basic
character set char valid in identifiers before/after, \u00b7 as an example
of character which is pedantically valid in identifiers if not at the start
and \u066d as something pedantically not valid in identifiers.

Now, https://eel.is/c++draft/lex.charset#6 says that UCN used outside of a
string/character literal which corresponds to basic character set character
(or control character) is ill-formed, that would make d, f, h cases invalid
for C++ and l, n, p cases invalid for C++26.
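
The lex.charset#6 rule above can be sketched as a small predicate. This is only an illustration of the rule as described here, not libcpp code; the function names and the cxx26 flag are hypothetical, and the control character range (U+0000 to U+001F and U+007F to U+009F) is my reading of the standard's definition.

```cpp
#include <cassert>

// Sketch only: membership in the basic character set, before and after
// P2558R2.  cxx26 is a hypothetical flag for "C++26 or later".
constexpr bool in_basic_character_set (char32_t c, bool cxx26)
{
  if (c == U'\t' || c == U'\n' || c == U'\v' || c == U'\f')
    return true;
  if (c < 0x20 || c > 0x7e)
    return false;
  // The three characters added by the paper.
  if (c == U'$' || c == U'@' || c == U'`')
    return cxx26;
  return true;
}

constexpr bool is_control_character (char32_t c)
{
  return c <= 0x1f || (c >= 0x7f && c <= 0x9f);
}

// True if a UCN naming c is ill-formed outside of a string or character
// literal according to the rule quoted above.
constexpr bool ucn_ill_formed_outside_literal (char32_t c, bool cxx26)
{
  return is_control_character (c) || in_basic_character_set (c, cxx26);
}

static_assert (ucn_ill_formed_outside_literal (0x41, false), "d, f, h");
static_assert (!ucn_ill_formed_outside_literal (0x40, false), "l, n, p in C++23");
static_assert (ucn_ill_formed_outside_literal (0x40, true), "l, n, p in C++26");
```

With this model \u00b7 and \u066d are not caught by this rule in either mode, so their validity depends only on the identifier and preprocessing-token rules.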

https://eel.is/c++draft/lex.name states which characters can appear at the
start of the identifier and which can appear after the start.  And
https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
either identifier, or tons of other things, or "each non-whitespace
character that cannot be one of the above"

Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
invalid if the preprocessing token is being converted into token.

And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
the basic character set matches the last category, the program is
ill-formed."

Now, e.g.  for the C++23 STR(\u0040) case, \u0040 is there not in the basic
character set, so valid outside of the literals (not the case anymore in
C++26), but it isn't nondigit and doesn't have XID_Start property, so it
isn't IMHO an identifier and so must be the "each non-whitespace character
that cannot be one of the above" case.  Why doesn't the above mentioned
https://eel.is/c++draft/lex.pptoken#2 sentence make that invalid?


Your argument makes sense to me, though...

Ignoring
that, I'd say it would be then stringized and that feels like it is what
clang++ is doing.  Now, e.g.  for the STR(a\u066d) case, I wonder why that
isn't lexed as an identifier followed by \u066d "each non-whitespace
character that cannot be one of the above" token and stringified similarly,
clang++ rejects that.
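
To make the alternative lexing concrete, here is a toy model in which a character that cannot continue an identifier simply ends it and becomes its own "other" token. Everything in it is hypothetical: the Token type, the lex function, and the two predicates, which are deliberately simplified stand-ins for the real XID_Start/XID_Continue checks (ASCII letters, digits, underscore, plus U+00B7 as a continue-only character). It ignores pp-numbers and all other token kinds.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model only, not libcpp.
struct Token
{
  bool is_identifier;   // false: the "cannot be one of the above" category
  std::u32string text;
};

static bool can_start_identifier (char32_t c)
{
  return c == U'_' || (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

static bool can_continue_identifier (char32_t c)
{
  return can_start_identifier (c) || (c >= U'0' && c <= U'9') || c == 0xb7;
}

static std::vector<Token> lex (const std::u32string &in)
{
  std::vector<Token> out;
  std::size_t i = 0;
  while (i < in.size ())
    {
      if (in[i] == U' ')
        {
          ++i;
          continue;
        }
      if (can_start_identifier (in[i]))
        {
          // Consume the longest run of identifier characters.
          std::size_t start = i++;
          while (i < in.size () && can_continue_identifier (in[i]))
            ++i;
          out.push_back ({true, in.substr (start, i - start)});
        }
      else
        {
          // Anything else becomes a single-character "other" token.
          out.push_back ({false, in.substr (i, 1)});
          ++i;
        }
    }
  return out;
}
```

In this model the characters of a\u066d lex as the identifier a followed by a separate U+066D token, while a\u00b7 stays a single identifier.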

What GCC libcpp seems to be doing is that forms_identifier_p calls
_cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells it is first
or second+ in identifier, and e.g.  _cpp_valid_ucn then for UCNs valid in
string literals calls
   else if (identifier_pos)
     {
       int validity = ucn_valid_in_identifier (pfile, result, nst);

       if (validity == 0)
         cpp_error (pfile, CPP_DL_ERROR,
                    "universal character %.*s is not valid in an identifier",
                    (int) (str - base), base);
       else if (validity == 2 && identifier_pos == 1)
         cpp_error (pfile, CPP_DL_ERROR,
    "universal character %.*s is not valid at the start of an identifier",
                    (int) (str - base), base);
     }
so basically all those invalid in identifiers cases emit an error and
pretend to be valid in identifiers, rather than what e.g.  _cpp_valid_utf8
does for C but not for C++ and only for the chars completely invalid in
identifiers rather than just valid in identifiers but not at the start:
           /* In C++, this is an error for invalid character in an identifier
              because logically, the UTF-8 was converted to a UCN during
              translation phase 1 (even though we don't physically do it that
              way).  In C, this byte rather becomes grammatically a separate
              token.  */
           if (CPP_OPTION (pfile, cplusplus))
             cpp_error (pfile, CPP_DL_ERROR,
                        "extended character %.*s is not valid in an identifier",
                        (int) (*pstr - base), base);
           else
             {
               *pstr = base;
               return false;
             }
The comment doesn't really match what is done in recent C++ versions because
there UCNs are translated to characters and not the other way around.

...it seems wrong that calling forms_identifier_p gives an error and
returns true for characters that can't be part of an identifier, which I
would expect to produce a false result.  If we want to complain about
the pptoken#2 issue, that seems like it should happen in the CPP_OTHER
section of _cpp_lex_direct.

Our diagnostic for STR(\u0041) is similarly unhelpful, saying just "not
valid in an identifier" rather than anything about the basic character
set or that it should be spelled "A".

But if we're going to give an error either way, fixing this seems a low
priority.

2024-07-17  Jakub Jelinek  <ja...@redhat.com>

        PR c++/110343
libcpp/
        * lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character set.
        (lex_raw_string): For C++26 allow $@` characters in prefix.
        * charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in identifiers.
gcc/testsuite/
        * c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
        remove c++ specific dg-options.
        * c-c++-common/raw-string-2.c: Likewise.
        * c-c++-common/raw-string-4.c: Likewise.
        * c-c++-common/raw-string-5.c: Likewise.  Expect some diagnostics
        only for non-c++26, for c++26 expect different.
        * c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
        remove c++ specific dg-options.
        * c-c++-common/raw-string-11.c: Likewise.
        * c-c++-common/raw-string-13.c: Likewise.
        * c-c++-common/raw-string-14.c: Likewise.
        * c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
        change c++ specific dg-options to just -Wtrigraphs.
        * c-c++-common/raw-string-16.c: Likewise.
        * c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
        remove c++ specific dg-options.
        * c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
        remove -std=c++11 from c++ specific dg-options.
        * c-c++-common/raw-string-19.c: Likewise.
        * g++.dg/cpp26/raw-string1.C: New test.
        * g++.dg/cpp26/raw-string2.C: New test.

--- libcpp/lex.cc.jj    2024-07-17 11:36:49.897873247 +0200
+++ libcpp/lex.cc       2024-07-17 20:04:43.936793506 +0200
@@ -2718,7 +2718,10 @@ lex_raw_string (cpp_reader *pfile, cpp_t
                       || c == '*' || c == '+' || c == '-' || c == '/'
                       || c == '^' || c == '&' || c == '|' || c == '~'
                       || c == '!' || c == '=' || c == ','
-                      || c == '"' || c == '\''))
+                      || c == '"' || c == '\''
+                      || ((c == '$' || c == '@' || c == '`')
+                          && CPP_OPTION (pfile, cplusplus)
+                          && CPP_OPTION (pfile, lang) > CLK_CXX23)))
            prefix[prefix_len++] = c;
          else
            {
--- libcpp/charset.cc.jj        2024-01-05 08:35:13.696827331 +0100
+++ libcpp/charset.cc   2024-07-17 20:18:13.665467035 +0200
@@ -1808,7 +1808,12 @@ _cpp_valid_ucn (cpp_reader *pfile, const
        result = 1;
      }
    else if (identifier_pos && result == 0x24
-          && CPP_OPTION (pfile, dollars_in_ident))
+          && CPP_OPTION (pfile, dollars_in_ident)
+          /* In C++26 when dollars are allowed in identifiers,
+             we should still reject \u0024 as $ is part of the basic
+             character set.  */
+          && !(CPP_OPTION (pfile, cplusplus)
+               && CPP_OPTION (pfile, lang) > CLK_CXX23))

I wonder about moving $ handling into the next else, so we don't need to
worry about the basic charset here?


But the patch is OK.

Jason
