Issue 123560
Summary Are the codes in UnicodeCharSets.h accurate, and which version of the Unicode standard are they for?
Labels new issue
Assignees
Reporter mrolle45
The ranges in `XIDStartRanges[]` and `XIDContinueRanges[]` don't seem to correspond exactly to the code points listed as `XID_Start` and `XID_Continue` in the official Unicode data files. I'm reading https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt for version 15.1, which is the version the last commit says the file was updated for.
As an example, the standard document has
```
00F8..01BA    ; XID_Start # L& [195] LATIN SMALL LETTER O WITH STROKE..LATIN SMALL LETTER EZH WITH TAIL
01BB          ; XID_Start # Lo       LATIN LETTER TWO WITH STROKE
01BC..01BF    ; XID_Start # L&   [4] LATIN CAPITAL LETTER TONE FIVE..LATIN LETTER WYNN
01C0..01C3    ; XID_Start # Lo   [4] LATIN LETTER DENTAL CLICK..LATIN LETTER RETROFLEX CLICK
01C4..0293    ; XID_Start # L& [208] LATIN CAPITAL LETTER DZ WITH CARON..LATIN SMALL LETTER EZH WITH CURL
0294          ; XID_Start # Lo       LATIN LETTER GLOTTAL STOP
0295..02AF    ; XID_Start # L&  [27] LATIN LETTER PHARYNGEAL VOICED FRICATIVE..LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
02B0..02C1    ; XID_Start # Lm  [18] MODIFIER LETTER SMALL H..MODIFIER LETTER REVERSED GLOTTAL STOP
```
but your file has
```c
{0x00F8, 0x02C1}
```
Thus there are several code points in this range that clang considers identifier start characters but that are not such in the Unicode standard.

By the way, the standard file also lists the same code points under `ID_Start`:
```
00F8..01BA    ; ID_Start # L& [195] LATIN SMALL LETTER O WITH STROKE..LATIN SMALL LETTER EZH WITH TAIL
01BB          ; ID_Start # Lo       LATIN LETTER TWO WITH STROKE
01BC..01BF    ; ID_Start # L&   [4] LATIN CAPITAL LETTER TONE FIVE..LATIN LETTER WYNN
01C0..01C3    ; ID_Start # Lo   [4] LATIN LETTER DENTAL CLICK..LATIN LETTER RETROFLEX CLICK
01C4..0293    ; ID_Start # L& [208] LATIN CAPITAL LETTER DZ WITH CARON..LATIN SMALL LETTER EZH WITH CURL
0294          ; ID_Start # Lo       LATIN LETTER GLOTTAL STOP
0295..02AF    ; ID_Start # L&  [27] LATIN LETTER PHARYNGEAL VOICED FRICATIVE..LATIN SMALL LETTER TURNED H WITH FISHHOOK AND TAIL
02B0..02C1    ; ID_Start # Lm  [18] MODIFIER LETTER SMALL H..MODIFIER LETTER REVERSED GLOTTAL STOP
```

This is important to me personally because I am working on a C preprocessor that has an option to emulate clang version 20 (with the `-E` switch), and to test it I am feeding it various identifiers containing Unicode characters.