[issue12753] \N{...} neglects formal aliases and named sequences from Unicode charnames namespace

Tom Christiansen Mon, 15 Aug 2011 10:48:45 -0700

New submission from Tom Christiansen <[email protected]>:

Unicode character names share a common namespace with formal aliases and with 
named sequences, but Python recognizes only the original name. That means not 
everything in the namespace is accessible from Python.  (If this is construed 
to be an extant bug rather than an absent feature, you probably want to change 
this from a wish to a bug in the ticket.)


This is a problem because aliases correct errors in the original names, and are 
the preferred versions.  For example, ISO screwed up when they called U+01A2 
LATIN CAPITAL LETTER OI.  It is actually LATIN CAPITAL LETTER GHA according to 
the file NameAliases.txt in the Unicode Character Database.  However, Python 
blows up when you try to use this:

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ

    % env PYTHONIOENCODING=utf8 python3.2-narrow -c 'print("\N{LATIN CAPITAL LETTER GHA}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-27: unknown Unicode character name
    Exit 1

This is unfortunate, because the formal aliases correct egregious blunders, such 
as the Standard reading "BRAKCET" instead of "BRACKET":

$ uninames '^\s+%'
 Ƣ  01A2        LATIN CAPITAL LETTER OI
        % LATIN CAPITAL LETTER GHA
 ƣ  01A3        LATIN SMALL LETTER OI
        % LATIN SMALL LETTER GHA
        * Pan-Turkic Latin alphabets
 ೞ  0CDE        KANNADA LETTER FA
        % KANNADA LETTER LLLA
        * obsolete historic letter
        * name is a mistake for LLLA
 ຝ  0E9D        LAO LETTER FO TAM
        % LAO LETTER FO FON
        = fo fa
        * name is a mistake for fo sung
 ຟ  0E9F        LAO LETTER FO SUNG
        % LAO LETTER FO FAY
        * name is a mistake for fo tam
 ຣ  0EA3        LAO LETTER LO LING
        % LAO LETTER RO
        = ro rot
        * name is a mistake, lo ling is the mnemonic for 0EA5
 ລ  0EA5        LAO LETTER LO LOOT
        % LAO LETTER LO
        = lo ling
        * name is a mistake, lo loot is the mnemonic for 0EA3
 ࿐  0FD0        TIBETAN MARK BSKA- SHOG GI MGO RGYAN
        % TIBETAN MARK BKA- SHOG GI MGO RGYAN
        * used in Bhutan
 ꀕ A015        YI SYLLABLE WU
        % YI SYLLABLE ITERATION MARK
        * name is a misnomer
 ︘ FE18        PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET
        % PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET
        * misspelling of "BRACKET" in character name is a known defect
        # <vertical> 3017
 𝃅  1D0C5       BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS
        % BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS
        * misspelling of "FTHORA" in character name is a known defect
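
Until \N{...} itself knows about these, a workaround can be sketched in Python 
by consulting a hand-built alias table before falling back to the interpreter's 
own name database.  This is only a minimal sketch: the helper name 
lookup_with_aliases and the FORMAL_ALIASES table are hypothetical, and the 
table copies just a few of the corrected names from the listing above.

```python
import unicodedata

# Formal aliases copied from the listing above (corrected name -> code point).
# Only some of the entries are reproduced; this table is an illustration,
# not a complete copy of NameAliases.txt.
FORMAL_ALIASES = {
    "LATIN CAPITAL LETTER GHA": 0x01A2,
    "LATIN SMALL LETTER GHA": 0x01A3,
    "KANNADA LETTER LLLA": 0x0CDE,
    "LAO LETTER RO": 0x0EA3,
    "LAO LETTER LO": 0x0EA5,
    "YI SYLLABLE ITERATION MARK": 0xA015,
}


def lookup_with_aliases(name):
    """Hypothetical helper: try the formal aliases, then the original names."""
    key = name.upper()
    if key in FORMAL_ALIASES:
        return chr(FORMAL_ALIASES[key])
    # Raises KeyError if the interpreter's database does not know the name.
    return unicodedata.lookup(name)
```

With this, lookup_with_aliases("LATIN CAPITAL LETTER GHA") and 
lookup_with_aliases("LATIN CAPITAL LETTER OI") both return the same character, 
U+01A2.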

In Perl, \N{...} grants access to the single, shared, common namespace of 
Unicode character names, formal aliases, and named sequences without 
distinction:

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'
    Ƣ
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")'
    Ƣ

    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER OI}")'  | uniquote -x
    \x{1A2}
    % env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER GHA}")' | uniquote -x
    \x{1A2}

It is my suggestion that Python do the same thing. There are currently only 11 
of these formal aliases.

The third element in this shared namespace, named sequences, are multiple 
code points masquerading under one name.  They come from the 
NamedSequences.txt file in the Unicode Character Database.  An example entry is:

    LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300

There are 418 of these named sequences as of Unicode 6.0.0.  This shows that 
Perl can also access named sequences:

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")'
  Ā̀

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}")' | uniquote -x
  \x{100}\x{300}

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")'
  ㇷ゚

  $ env PERL_UNICODE=S perl -Mcharnames=:full -le 'print("\N{KATAKANA LETTER AINU P}")' | uniquote -x
  \x{31F7}\x{309A}
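
For comparison, entries in the NAME;CODEPOINTS format shown earlier can be 
turned into Python strings along the following lines.  This is a rough sketch, 
not a proposed implementation: the function name parse_named_sequences is 
hypothetical, the handling of "#" comments and blank lines is an assumption 
about the file layout, and the sample text holds only the two sequences used in 
the Perl examples instead of the real file.

```python
def parse_named_sequences(text):
    """Map each sequence name to the string made of its code points."""
    sequences = {}
    for line in text.splitlines():
        # Assumption: '#' starts a comment; blank lines carry no data.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, code_points = line.split(";", 1)
        # Code points are space-separated hexadecimal numbers.
        sequences[name.strip()] = "".join(
            chr(int(cp, 16)) for cp in code_points.split()
        )
    return sequences


# Two entries in the NamedSequences.txt format, taken from the examples above.
SAMPLE = """\
# sample data only
LATIN CAPITAL LETTER A WITH MACRON AND GRAVE;0100 0300
KATAKANA LETTER AINU P;31F7 309A
"""

NAMED_SEQUENCES = parse_named_sequences(SAMPLE)
```

Each value in NAMED_SEQUENCES is an ordinary multi-character string, so a 
named sequence resolves to more than one code point, unlike a character name.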


Since it is a single namespace, it makes sense that all members of that 
namespace should be accessible using \N{...} as a sort of equal-opportunity 
accessor mechanism, and it does not make sense that they not be.
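
To make the idea of a single accessor concrete, here is a small Python sketch 
that expands literal \N{...} escapes in a string by consulting original names, 
formal aliases, and named sequences alike.  Everything in it is illustrative: 
the names ALIASES, SEQUENCES, resolve, and expand_named_escapes are 
hypothetical, and the two tables each hold a single entry from the examples 
in this report.

```python
import re
import unicodedata

# Tiny illustrative tables; a fuller version would be built from
# NameAliases.txt and NamedSequences.txt.
ALIASES = {"LATIN CAPITAL LETTER GHA": "\u01A2"}
SEQUENCES = {"LATIN CAPITAL LETTER A WITH MACRON AND GRAVE": "\u0100\u0300"}


def resolve(name):
    """Look a name up in the aliases and sequences, then the original names."""
    for table in (ALIASES, SEQUENCES):
        if name in table:
            return table[name]
    # Raises KeyError for names the interpreter's database does not know.
    return unicodedata.lookup(name)


def expand_named_escapes(text):
    r"""Replace each literal \N{NAME} in text with the string it names."""
    return re.sub(r"\\N\{([^}]+)\}", lambda m: resolve(m.group(1)), text)
```

Because the three kinds of name share one namespace, the order in which the 
tables are consulted should not change the result.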

Just make sure you take only the approved named sequences from the 
NamedSequences.txt file. It would be unwise to give users access to the 
provisional sequences located in a neighboring file I shall not name :) because 
those are not guaranteed never to be withdrawn the way the others are, and so 
you would risk introducing an incompatibility.

If you look at the ICU UCharacter class, you can see that they provide a more

----------
components: Interpreter Core
messages: 142136
nosy: mrabarnett, tchrist
priority: normal
severity: normal
status: open
title: \N{...} neglects formal aliases and named sequences from Unicode 
charnames namespace
type: feature request
versions: Python 2.7, Python 3.1, Python 3.2, Python 3.3

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12753>
_______________________________________