[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Tom Honermann via Phabricator via cfe-commits Mon, 12 Dec 2022 09:02:50 -0800

tahonermann accepted this revision.
tahonermann added a comment.
This revision is now accepted and ready to land.


This looks good. I'm accepting the change despite one remaining issue that was 
neither introduced nor addressed by this patch; the diagnostic caret will still 
be in the wrong location for many of the diagnostics issued by 
`tryReadNamedUCN()`. Most of them use `StartPtr` which we've established points 
to `N` instead of `\`. If you want to make an additional update to address 
that, I'll be happy to review again. I understand if you would rather that be 
fixed via a separate change.



================
Comment at: clang/lib/Lex/Lexer.cpp:3323
     if (Diagnose)
       Diag(StartPtr, diag::warn_ucn_escape_incomplete);
     return llvm::None;
----------------
tahonermann wrote:
> I was able to confirm that `StartPtr` does point to `N`. See the diagnostic 
> generated at https://godbolt.org/z/qnajcMeso; the diagnostic caret points to 
> `N` instead of `\`.
>   <source>:1:2: warning: incomplete delimited universal character name; 
> treating as '\' 'N' '{' identifier [-Wunicode]
>   \N{abc
>    ^
I think this still needs to be addressed. `tryReadNumericUCN()` handles this 
case by requiring that callers pass the location of the `\` as `SlashLoc`. I 
guess this is technically required since the `\` character could be written as 
either `\` or the `??/` trigraph.


================
Comment at: clang/lib/Lex/Lexer.cpp:3386
+
+  if (!Result || CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 4))
     StartPtr = CurPtr;
----------------
cor3ntin wrote:
> tahonermann wrote:
> > cor3ntin wrote:
> > > tahonermann wrote:
> > > > cor3ntin wrote:
> > > > > cor3ntin wrote:
> > > > > > tahonermann wrote:
> > > > > > > This is a bit of a tangent, but under what circumstances would 
> > > > > > > `CurPtr - StartPtr` not be equal to `Buffer.size() + 4`? 
> > > > > > > Actually, I'm not sure that +4 is correct. It looks like 
> > > > > > > `StartPtr` is expected to point to `N` at the beginning of the 
> > > > > > > function, so the initial `\` is not included in the range. If `N` 
> > > > > > > isn't seen, the function returns early. Likewise, if either of 
> > > > > > > the `{` or `}` delimiters is not found, the function returns 
> > > > > > > early.
> > > > > > > 
> > > > > > > I think the goal of this code is to skip the 
> > > > > > > `getAndAdvanceChar()` machinery when the buffer is being claimed 
> > > > > > > (no need to parse UCNs or trigraphs in the character name), but 
> > > > > > > it looks like, particularly with this DR, that we might always be 
> > > > > > > able to do that.
> > > > > > afaict, 4 is correct here because we are one past-the-end.
> > > > > > I do however agree that this whole pattern which is used a few 
> > > > > > times is probably unnecessary, i do think it would be good to 
> > > > > > investigate. Not in this patch though, imo
> > > > > I looked into it, I'm sure we could improve but not easily, 
> > > > > `getAndAdvanceChar` does set some flags on the token in the presence 
> > > > > of trigraphs/line splicing and we need those flags to be set, this is 
> > > > > the reason we do need to call that method.
> > > > > It's not super efficient but it's such an edge case... I'd rather not 
> > > > > touch that now
> > > > My concern is that, as is, the code always takes the `else` branch 
> > > > (except when `Result` is non-null). Assuming a buffer containing "X", 
> > > > the pointers are arranged as follows (where `$` is one past the end).
> > > >   \ N { X } $
> > > >     |   |   `- CurPtr
> > > >     |   `- Buffer
> > > >     `- StartPtr
> > > > `CurPtr - StartPtr` is 4, but `Buffer.size() + 4` is 5 (`Buffer.size()` 
> > > > is 1 in this case).
> > > > 
> > > > I think there might be an easy test to see if this is working as 
> > > > intended. If it isn't, I would expect a diagnostic to be issued if 
> > > > trigraphs are enabled and the buffer contains a trigraph sequence. 
> > > > Something like:
> > > >   \N{LOTUS??>}
> > > I can try to add tests
> > > 
> > > > My concern is that, as is, the code always takes the else branch 
> > > > (except when Result is non-null). 
> > > Yes, the if branch sole purpose is to set some state in Result.
> > > 
> > > At the start of the function, StartPtr points to `\`
> > > And I'll leave a comment, maybe that will clear up future confusions
> > > 
> > > There may be a potential refactor here, which is to have 
> > > `getAndAdvanceChar` take a `bool & ShouldCleanupToken` parameter instead 
> > > of a token so that we don't have to do that dance, but it's a bit outside 
> > > of the scope of this patch...
> > > At the start of the function, StartPtr points to `\`
> > 
> > It doesn't look like it does. The first use of `StartPtr` is at line 3314 
> > and it expects to read `N`:
> >   3314:   char C = getCharAndSize(StartPtr, CharSize);
> >   3315:   assert(C == 'N' && "expected \\N{...}");
> Yes, you are right. There was a bug in \u too, probably has been there for 
> ages.
> It's unfortunately not testable, any incorrect value would call 
> getCharAndSize which is going to do the right thing.
Thank you; the added comment makes this more clear. Basically, we only want to 
skip the call to `getAndAdvanceChar()` when we're sure the token wasn't spelled 
with trigraphs or line spaces.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138861/new/

https://reviews.llvm.org/D138861

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Reply via email to