Issue 125461
Summary clang mishandles a unicode escape beyond maximum codepoint limit.
Labels clang
Assignees
Reporter mrolle45
Normally a Unicode escape sequence (a universal character name) can appear within an identifier token. However, if the escape denotes a value greater than the maximum legal code point (0x10FFFF), the result should not be a valid identifier. Instead, clang emits a diagnostic for the illegal code point, then ignores the escape sequence and continues lexing the identifier.
Thus, for example, the string `Y\U00110000Z` is lexed as an identifier `YZ`.  The string `Y\U00110001Z` is lexed as the same identifier `YZ`.

Here's sample code, and the output from `clang -E`:
```c
#define Y\U00110000Z() "Y\U00110000Z is an identifier"
#define Y\U00110001Z() "Y\U00110001Z is an identifier"
Y\U00110000Z()
YZ()

"Y\U00110001Z is an identifier"
"Y\U00110001Z is an identifier"

```
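You can see that `YZ` was defined as a macro twice, with the second definition replacing the first, and it is invoked twice, once with the bad Unicode escape and once without. Both invocations expand to the string from the second definition.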

I don't know what you think the proper response should be, since the C++ standard indicates this as undefined behavior.  I suppose that `Y` should be an identifier by itself, and then the `\U00110000` should be an error token, followed by the identifier `Z`.  You decide.  But certainly don't make `YZ` an identifier.
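To make the suggested tokenization concrete, here is a minimal sketch of a scanner that ends the identifier at an out-of-range escape and returns the escape as its own error token. It is not clang's lexer: `next_token`, `ucn_length`, `valid_scalar`, and the `TOK_*` kinds are all hypothetical names, only the `\U` form with eight hex digits is handled, and the identifier rules are simplified to ASCII alphanumerics, `_`, and in-range escapes.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical token kinds for this sketch only. */
enum tok_kind { TOK_IDENT, TOK_BAD_UCN, TOK_OTHER, TOK_END };

struct token {
    enum tok_kind kind;
    const char *start; /* first character of the token */
    size_t len;        /* length in source characters */
};

/* If s begins with \U followed by exactly 8 hex digits, store the value
   in *cp and return 10 (the escape's length); otherwise return 0. */
static size_t ucn_length(const char *s, uint32_t *cp) {
    if (s[0] != '\\' || s[1] != 'U') return 0;
    char digits[9];
    for (int i = 0; i < 8; i++) {
        if (!isxdigit((unsigned char)s[2 + i])) return 0;
        digits[i] = s[2 + i];
    }
    digits[8] = '\0';
    *cp = (uint32_t)strtoul(digits, NULL, 16);
    return 10;
}

/* A code point is usable only if it is <= 0x10FFFF and not a surrogate. */
static int valid_scalar(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Scan one token at s. An out-of-range escape terminates the current
   identifier and is then returned as a separate TOK_BAD_UCN token. */
static struct token next_token(const char *s) {
    assert(s != NULL);
    struct token t = { TOK_END, s, 0 };
    if (*s == '\0') return t;

    uint32_t cp;
    size_t n = ucn_length(s, &cp);
    if (n != 0 && !valid_scalar(cp)) {
        t.kind = TOK_BAD_UCN;
        t.len = n;
        return t;
    }

    /* Simplified identifier: ASCII alphanumerics, '_', in-range escapes. */
    size_t len = 0;
    for (;;) {
        unsigned char c = (unsigned char)s[len];
        if (isalnum(c) || c == '_') { len++; continue; }
        n = ucn_length(s + len, &cp);
        if (n != 0 && valid_scalar(cp)) { len += n; continue; }
        break; /* includes the case of an out-of-range escape */
    }
    if (len > 0) {
        t.kind = TOK_IDENT;
        t.len = len;
        return t;
    }
    t.kind = TOK_OTHER; /* any other single character */
    t.len = 1;
    return t;
}
```

Calling `next_token` repeatedly on the C string `"Y\\U00110000Z"` yields an identifier `Y`, a `TOK_BAD_UCN` token covering the ten characters of the escape, and an identifier `Z`, so `YZ` is never formed.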

The error comes from trying to convert the UTF-32 value 0x110000 to UTF-8. The conversion (rightly) fails, which leaves the partial `Y` in the buffer, and the lexer then keeps on lexing with the character `Z`.
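The failure mode described above can be reproduced in a few lines. This is only a sketch of the pattern, not clang's actual code: `encode_utf8`, `parse_ucn`, and `spell_identifier` are hypothetical helpers, and only the `\U` form with eight hex digits is recognized. The key point is that a failed conversion appends nothing, yet the loop carries on with the next character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode one code point as UTF-8.
   Returns the number of bytes written (1..4), or 0 if cp is above
   0x10FFFF or is a surrogate and therefore cannot be encoded. */
static size_t encode_utf8(uint32_t cp, char out[4]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "\UXXXXXXXX" at s. Returns 1 and stores the value on success. */
static int parse_ucn(const char *s, uint32_t *cp) {
    if (s[0] != '\\' || s[1] != 'U') return 0;
    uint32_t v = 0;
    for (int i = 2; i < 10; i++) {
        int h = hex_value(s[i]);
        if (h < 0) return 0;
        v = (v << 4) | (uint32_t)h;
    }
    *cp = v;
    return 1;
}

/* Build the UTF-8 spelling of an identifier-like source string.
   Mimics the reported behavior: when the conversion fails (n == 0),
   nothing is appended and scanning simply continues. */
static void spell_identifier(const char *src, char *dst, size_t cap) {
    assert(dst != NULL && cap > 0);
    size_t len = 0;
    while (*src != '\0') {
        uint32_t cp;
        if (parse_ucn(src, &cp)) {
            char utf8[4];
            size_t n = encode_utf8(cp, utf8);
            if (len + n < cap) {
                memcpy(dst + len, utf8, n); /* copies 0 bytes on failure */
                len += n;
            }
            src += 10; /* skip the whole escape either way */
        } else {
            if (len + 1 < cap) dst[len++] = *src;
            src++;
        }
    }
    dst[len] = '\0';
}
```

With this sketch, `spell_identifier` turns both `"Y\\U00110000Z"` and `"Y\\U00110001Z"` (as C string literals) into `YZ`, which matches the collision seen in the preprocessor output. A fix along these lines would have to stop or flag the token when `encode_utf8` returns 0 instead of falling through.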

