https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94990

Bug ID: 94990
Summary: NFC / NFD in identifiers
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: Arfrever.FTA at GMail dot Com
Target Milestone: ---

GCC 10 introduced support for non-ASCII characters in identifiers.
However, this support is incomplete with respect to NFC / NFD normalization [1]:

$ gcc -o /tmp/test -x c - <<<"int ś = 123; int main() {return ś;}"
<stdin>: In function ‘main’:
<stdin>:1:34: warning: `s\U00000301' is not in NFC [-Wnormalized=]
<stdin>:1:34: error: ‘ś’ undeclared (first use in this function)
<stdin>:1:34: note: each undeclared identifier is reported only once for each function it appears in
$

(The first occurrence uses [LATIN SMALL LETTER S WITH ACUTE]; the second uses [LATIN SMALL LETTER S, COMBINING ACUTE ACCENT].)

Since many potential sequences cannot be represented in NFC form [2][3], it would make more sense for GCC to perform NFD normalization [4] of all identifiers.

[1] https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
[2] https://en.wikipedia.org/wiki/Precomposed_character
[3] https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode
[4] https://en.wikipedia.org/wiki/Combining_character

For comparison, at least the Python language performs some normalization of identifiers:

$ python -c 'ś = 123; print(ś)'
123
$

Python identifiers are normalized to NFC form when possible, and are kept in NFD form otherwise:

$ python -c $'á = 1\nb́ = 1\nimport unicodedata\nfor k, v in dict(globals()).items():\n  if v == 1:\n    print(k, [unicodedata.name(c) for c in k])'
á ['LATIN SMALL LETTER A WITH ACUTE']
b́ ['LATIN SMALL LETTER B', 'COMBINING ACUTE ACCENT']
$
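
The mismatch described in this report can be reproduced outside any compiler. The following is a minimal sketch using Python's standard `unicodedata` module. The helper `describe` and the variable names are made up for illustration, and nothing here reflects how GCC handles identifiers internally. It only shows that the two spellings of the identifier differ as raw strings, that normalizing both to one form makes them equal, and that some sequences have no precomposed form at all.

```python
import unicodedata


def describe(text):
    """Return the Unicode names of the code points in text (illustrative helper)."""
    return [unicodedata.name(ch) for ch in text]


precomposed = "\u015b"  # LATIN SMALL LETTER S WITH ACUTE
decomposed = "s\u0301"  # LATIN SMALL LETTER S + COMBINING ACUTE ACCENT

# The two spellings render alike but are different code point sequences.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form makes them compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# "b" + COMBINING ACUTE ACCENT has no precomposed code point,
# so NFC leaves it as two code points.
b_acute = "b\u0301"
b_acute_nfc = unicodedata.normalize("NFC", b_acute)

print(raw_equal, nfc_equal, nfd_equal)
print(describe(b_acute_nfc))
```

An identifier table that compared raw code points would treat the two spellings as distinct names, which matches the "undeclared" error above. Note that the Python language reference specifies NFKC for identifiers; for the characters used in this report, NFKC and NFC give the same result.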