New submission from Tom Christiansen <tchr...@perl.com>:

Unicode character names share a common namespace with formal aliases and with named sequences, but Python recognizes only the original name. That means not everything in the namespace is accessible from Python. (If this is construed to be an extant bug rather than an absent feature, you probably want to change this from a wish to a bug in the ticket.)

This is a problem because aliases correct errors in the original names, and are the preferred versions. For example, ISO screwed up when they called U+01A2 LATIN CAPITAL LETTER OI. It is actually LATIN CAPITAL LETTER GHA according to the file NameAliases.txt in the Unicode Character Database. However, Python blows up when you try to use this:

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER GHA}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name
    Exit 1

This is unfortunate, because the formal aliases correct egregious blunders, such as the Standard reading "BRAKCET" instead of "BRACKET":

    $ uninames '^\s+%'
    Ƣ  01A2   LATIN CAPITAL LETTER OI
              % LATIN CAPITAL LETTER GHA
    ƣ  01A3   LATIN SMALL LETTER OI
              % LATIN SMALL LETTER GHA
              * Pan-Turkic Latin alphabets
    ೞ  0CDE   KANNADA LETTER FA
              % KANNADA LETTER LLLA
              * obsolete historic letter
              * name is a mistake for LLLA
    ຝ  0E9D   LAO LETTER FO TAM
              % LAO LETTER FO FON
              = fo fa
              * name is a mistake for fo sung
    ຟ  0E9F   LAO LETTER FO SUNG
              % LAO LETTER FO FAY
              * name is a mistake for fo tam
    ຣ  0EA3   LAO LETTER LO LING
              % LAO LETTER RO
              = ro rot
              * name is a mistake, lo ling is the mnemonic for 0EA5
    ລ  0EA5   LAO LETTER LO LOOT
              % LAO LETTER LO
              = lo ling
              * name is a mistake, lo loot is the mnemonic for 0EA3
    ࿐  0FD0   TIBETAN MARK BSKA- SHOG GI MGO RGYAN
              % TIBETAN MARK BKA- SHOG GI MGO RGYAN
              * used in Bhutan
    ꀕ  A015   YI SYLLABLE WU
              % YI SYLLABLE ITERATION MARK
              * name is a misnomer
    ︘  FE18   PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
              % PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
              * misspelling of "BRACKET" in character name is a known defect
              # <vertical> 3017
    𝃅  1D0C5  BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
              % BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
              * misspelling of "FTHORA" in character name is a known defect

In Perl, \N{...} grants access to the single, shared, common namespace of Unicode character names, formal aliases, and named sequences without distinction:

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")'
    Ƣ

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")' | uniquote -x
    \x{1A2}
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")' | uniquote -x
    \x{1A2}

It is my suggestion that Python do the same thing. There are currently only 11 of these formal aliases.

The third element in this shared namespace of names, named sequences, consists of multiple code points masquerading under one name. They come from the NamedSequences.txt file in the Unicode Character Database. An example entry is:

    LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300

There are 418 of these named sequences as of Unicode 6.0.0. This shows that Perl can also access named sequences:

    $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
    Ā̀

    $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
    \x{100}\x{300}

    $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'
    ㇷ゚

    $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
    \x{31F7}\x{309A}

Since it is a single namespace, it makes sense that all members of that namespace should be accessible using \N{...} as a sort of equal-opportunity accessor mechanism, and it does not make sense that they not be. Just make sure you take only the approved named sequences from the NamedSequences.txt file.
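
To make the single-namespace idea concrete, here is a minimal sketch in Python of a lookup that consults formal aliases and named sequences before falling back to the original names. It is only an illustration, not a proposed patch: the helper name charname_lookup and the two sample tables are hypothetical, the tables hold just a few entries taken from the examples above, and the alias line layout (code point, semicolon, alias) is an assumption modelled on the NamedSequences.txt entry shown earlier. A real implementation would read the full data files from the Unicode Character Database.

```python
import unicodedata

# Illustrative excerpts only (hypothetical tables, not the full UCD files).
# Assumed alias layout: "<hex code point>;<alias name>".
SAMPLE_ALIASES = """\
01A2;LATIN CAPITAL LETTER GHA
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
"""

# Layout as in the NamedSequences.txt entry quoted above:
# "<sequence name>;<hex code point> <hex code point> ..."
SAMPLE_SEQUENCES = """\
LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300
KATAKANA LETTER AINU P;31F7 309A
"""


def parse_aliases(text):
    """Map each alias name to a one-character string."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split(";")
        table[fields[1].strip()] = chr(int(fields[0], 16))
    return table


def parse_sequences(text):
    """Map each sequence name to a multi-character string."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        name, codepoints = line.split(";", 1)
        table[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in codepoints.split()
        )
    return table


ALIASES = parse_aliases(SAMPLE_ALIASES)
SEQUENCES = parse_sequences(SAMPLE_SEQUENCES)


def charname_lookup(name):
    """Resolve a name from the shared namespace (hypothetical helper).

    Order: formal aliases, then named sequences, then original names.
    Unknown names raise KeyError, as unicodedata.lookup() does.
    """
    if name in ALIASES:
        return ALIASES[name]
    if name in SEQUENCES:
        return SEQUENCES[name]
    return unicodedata.lookup(name)
```

With these sample tables, charname_lookup("LATIN CAPITAL LETTER GHA") and charname_lookup("LATIN CAPITAL LETTER OI") both return the single character U+01A2, while charname_lookup("KATAKANA LETTER AINU P") returns a two-code-point string. Because the three kinds of names never collide, the order in which the tables are consulted does not affect the result.
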
It would be unwise to give users access to the provisional sequences located in a neighboring file I shall not name :) because those are not guaranteed never to be withdrawn the way the others are, and so you would risk introducing an incompatibility.

If you look at the ICU UCharacter class, you can see that they provide a more

----------
components: Interpreter Core
messages: 142136
nosy: mrabarnett, tchrist
priority: normal
severity: normal
status: open
title: \N{...} neglects formal aliases and named sequences from Unicode charnames namespace
type: feature request
versions: Python 2.7, Python 3.1, Python 3.2, Python 3.3

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________