NFC / NFD in identifiers

Arfrever.FTA at GMail dot Com Thu, 07 May 2020 15:55:16 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94990


            Bug ID: 94990
           Summary: NFC / NFD in identifiers
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: Arfrever.FTA at GMail dot Com
  Target Milestone: ---

GCC 10 introduced support for non-ASCII characters in identifiers.
However, this support is incomplete with respect to NFC / NFD
normalization [1]:

$ gcc -o /tmp/test -x c - <<<"int ś = 123; int main() {return ś;}"
<stdin>: In function ‘main’:
<stdin>:1:34: warning: `s\U00000301' is not in NFC [-Wnormalized=]
<stdin>:1:34: error: ‘ś’ undeclared (first use in this function)
<stdin>:1:34: note: each undeclared identifier is reported only once for each function it appears in
$ 
(The first occurrence uses [LATIN SMALL LETTER S WITH ACUTE]; the second uses
[LATIN SMALL LETTER S, COMBINING ACUTE ACCENT].)
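
The difference between the two spellings can be checked outside the compiler.
The following is a minimal sketch using Python's standard unicodedata module;
the variable names are only illustrative. It shows that the two spellings are
different code point sequences but canonically equivalent:

```python
import unicodedata

precomposed = "\u015b"   # LATIN SMALL LETTER S WITH ACUTE (NFC spelling)
decomposed = "s\u0301"   # LATIN SMALL LETTER S + COMBINING ACUTE ACCENT (NFD spelling)

# The raw code point sequences differ, so a plain string comparison of the
# two identifier spellings fails.
different_raw = precomposed != decomposed

# Both spellings are canonically equivalent: normalizing either one to the
# same form yields identical strings.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == decomposed

print(different_raw, same_nfc, same_nfd)
print([unicodedata.name(c) for c in precomposed])
print([unicodedata.name(c) for c in decomposed])
```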


Since many combining sequences have no precomposed equivalent and so cannot be
expressed as a single character in NFC [2][3], it would make more sense for GCC
to perform NFD normalization [4] of all identifiers.

[1] https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
[2] https://en.wikipedia.org/wiki/Precomposed_character
[3] https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode
[4] https://en.wikipedia.org/wiki/Combining_character
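
As a rough illustration of the proposal, the sketch below keys a toy symbol
table on the NFD form of each identifier. It is written in Python only for
brevity; SymbolTable and canonical_key are hypothetical names and do not
reflect GCC's internal data structures.

```python
import unicodedata


def canonical_key(identifier: str) -> str:
    """Hypothetical helper: map any spelling of an identifier to its NFD form."""
    return unicodedata.normalize("NFD", identifier)


class SymbolTable:
    """Toy symbol table for illustration only; not GCC's implementation."""

    def __init__(self) -> None:
        self._symbols = {}

    def declare(self, name: str, value: int) -> None:
        # Store the symbol under its normalized spelling.
        self._symbols[canonical_key(name)] = value

    def lookup(self, name: str) -> int:
        # Normalize the spelling used at the point of reference as well.
        return self._symbols[canonical_key(name)]


table = SymbolTable()
table.declare("\u015b", 123)      # declared with the precomposed spelling
found = table.lookup("s\u0301")   # referenced with the decomposed spelling
print(found)

# "b" + COMBINING ACUTE ACCENT has no precomposed character, so NFC leaves
# it as two code points.
no_precomposed = unicodedata.normalize("NFC", "b\u0301")
print(len(no_precomposed))
```

With such a scheme, the declaration and the reference in the test case above
would resolve to the same symbol regardless of which spelling each one uses.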


For comparison, Python at least performs some normalization of identifiers:
$ python -c 'ś = 123; print(ś)'
123
$ 
Python identifiers are normalized to precomposed (NFC) form when a precomposed
character exists; otherwise the combining sequence is kept:
$ python -c $'á = 1\nb́ = 1\nimport unicodedata\nfor k, v in dict(globals()).items():\n if v == 1:\n  print(k, [unicodedata.name(c) for c in k])'
á ['LATIN SMALL LETTER A WITH ACUTE']
b́ ['LATIN SMALL LETTER B', 'COMBINING ACUTE ACCENT']
$
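
The same behaviour can be checked without depending on how the terminal or
editor encodes the characters, by writing the code points as escapes. This is
a minimal sketch; the namespace dictionary and variable names are only
illustrative. Note that the Python language reference specifies NFKC for
identifiers, which gives the same result as NFC for these examples.

```python
namespace = {}

# Both identifiers are written in decomposed form (base letter followed by
# COMBINING ACUTE ACCENT) in the source text that Python parses.
exec("s\u0301 = 123\nb\u0301 = 1", namespace)

# Python stores the first name under the precomposed spelling ...
stored_as_precomposed = "\u015b" in namespace
# ... so a reference using the precomposed spelling finds it.
value = eval("\u015b", namespace)

# No precomposed "b with acute" exists, so that name keeps two code points.
b_keys = [k for k in namespace if k.startswith("b")]

print(stored_as_precomposed, value, [len(k) for k in b_keys])
```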
