| Field | Value |
|---|---|
| Issue | 125461 |
| Summary | clang mishandles a unicode escape beyond maximum codepoint limit. |
| Labels | clang |
| Assignees | |
| Reporter | mrolle45 |

Normally, a Unicode escape sequence can appear within an identifier token. If the escape denotes a value greater than the maximum legal code point (0x10FFFF), however, the result should not be a valid identifier. Instead, clang emits a diagnostic for the illegal code point, then ignores the escape sequence and continues lexing the identifier.
Thus, for example, `Y\U00110000Z` is lexed as the identifier `YZ`, and `Y\U00110001Z` is lexed as the same identifier `YZ`.
Here's sample code, and the output from `clang -E`:
```c
/* Input: both definitions end up naming the macro YZ. */
#define Y\U00110000Z() "Y\U00110000Z is an identifier"
#define Y\U00110001Z() "Y\U00110001Z is an identifier"
Y\U00110000Z()
YZ()
/* Output of clang -E: */
"Y\U00110001Z is an identifier"
"Y\U00110001Z is an identifier"
```
You can see that `YZ` was defined as a macro twice and is invoked twice, once spelled with the bad Unicode escape and once without. Both invocations expand to the second definition.
I don't know what you think the proper response should be, since the C++ standard indicates this is undefined behavior. I suppose that `Y` should be an identifier by itself, followed by an error token for `\U00110000`, followed by the identifier `Z`. You decide. But `YZ` certainly should not be an identifier.
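To make the suggested tokenization concrete, here is a minimal sketch of a toy tokenizer that splits `Y\U00110000Z` into an identifier, an error token, and another identifier. This is not clang's lexer: the names `tok_kind`, `token`, `parse_ucn8`, and `next_token` are hypothetical, only the 8-digit `\U` form is handled, and any Unicode scalar value is accepted inside an identifier (a real lexer would also check the allowed identifier ranges).

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical token kinds for this sketch only. */
enum tok_kind { TOK_IDENT, TOK_BAD_UCN, TOK_OTHER, TOK_END };

struct token {
    enum tok_kind kind;
    const char *start; /* points into the input string */
    size_t len;        /* length of the token's spelling */
};

/* Parse "\UXXXXXXXX" at s. Returns 10 and stores the value if the escape
   is syntactically well-formed, otherwise returns 0. */
static size_t parse_ucn8(const char *s, uint32_t *value)
{
    uint32_t v = 0;
    if (s[0] != '\\' || s[1] != 'U')
        return 0;
    for (size_t i = 2; i < 10; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isxdigit(c)) /* also stops at the terminating NUL */
            return 0;
        v = v * 16 + (uint32_t)(isdigit(c) ? c - '0' : tolower(c) - 'a' + 10);
    }
    *value = v;
    return 10;
}

/* A Unicode scalar value: at most 0x10FFFF and not a surrogate. */
static int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Return the next token and advance *p past it. */
static struct token next_token(const char **p)
{
    const char *s = *p;
    struct token t = { TOK_END, s, 0 };
    uint32_t cp;
    size_t n;

    if (*s == '\0')
        return t;

    /* An out-of-range escape becomes its own error token. */
    n = parse_ucn8(s, &cp);
    if (n != 0 && !is_scalar_value(cp)) {
        t.kind = TOK_BAD_UCN;
        t.len = n;
        *p = s + n;
        return t;
    }

    /* Otherwise collect identifier characters and in-range escapes,
       stopping in front of an out-of-range escape. */
    for (;;) {
        if (isalnum((unsigned char)*s) || *s == '_') {
            s++;
            continue;
        }
        n = parse_ucn8(s, &cp);
        if (n != 0 && is_scalar_value(cp)) {
            s += n;
            continue;
        }
        break;
    }
    if (s == t.start) { /* not an identifier: consume one character */
        t.kind = TOK_OTHER;
        s++;
    } else {
        t.kind = TOK_IDENT;
    }
    t.len = (size_t)(s - t.start);
    *p = s;
    return t;
}
```

Calling `next_token` repeatedly on the C string `"Y\\U00110000Z"` yields `TOK_IDENT` for `Y`, `TOK_BAD_UCN` for the 10-character escape, `TOK_IDENT` for `Z`, and then `TOK_END`, so `YZ` is never formed.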
The error comes from trying to convert the UTF-32 character 0x110000 to UTF-8. The conversion (rightly) fails, leaving the partial `Y` in the buffer, and the lexer then keeps on lexing with the character `Z`.
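The failure mode described above can be illustrated with a small sketch. It is not clang's code: `encode_utf8` and `build_spelling` are hypothetical helpers, and the sketch assumes the identifier spelling is built by appending to a buffer. It shows how a rejected conversion, if its result is ignored, leaves `YZ` behind for any out-of-range value.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is not a Unicode
   scalar value (greater than 0x10FFFF, or a surrogate). */
static size_t encode_utf8(uint32_t cp, char *out)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Build the spelling 'Y' + UTF-8(cp) + 'Z' into buf (at least 7 bytes).
   Returns 1 if the escape was encodable, 0 otherwise. When the encoder
   rejects cp, nothing is appended for it, so buf ends up holding "YZ".
   A caller that ignores the return value reproduces the reported bug. */
static int build_spelling(uint32_t cp, char *buf)
{
    size_t len = 0;
    size_t n;

    buf[len++] = 'Y';
    n = encode_utf8(cp, buf + len); /* writes nothing on failure */
    len += n;
    buf[len++] = 'Z';
    buf[len] = '\0';
    return n != 0;
}
```

With this sketch, `build_spelling(0x110000, buf)` and `build_spelling(0x110001, buf)` both return 0 and both leave `"YZ"` in `buf`, which matches the two macro names colliding in the report. Checking the return value and stopping the identifier at that point is one way to avoid the collision.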