Steven D'Aprano <steve+pyt...@pearwood.info> added the comment:

Here's my implementation:

from unicodedata import name
from unicodedata import lookup as _lookup
from fnmatch import translate
from re import compile, I

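_NAMES = None  # lazily built cache of every canonical character name

def getnames():
    """Return the list of all canonical names, building it on first use."""
    global _NAMES
    if _NAMES is None:
        _NAMES = []
        for i in range(0x110000):
            # name() returns the default '' for unnamed code points.
            s = name(chr(i), '')
            if s:
                _NAMES.append(s)
    return _NAMES

def lookup(name_or_glob):
    # A glob returns a list of matching names (case-insensitive);
    # a plain name falls through to unicodedata.lookup.
    if any(c in name_or_glob for c in '*?['):
        match = compile(translate(name_or_glob), flags=I).match
        return [name for name in getnames() if match(name)]
    else:
        return _lookup(name_or_glob)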




The major limitation of my implementation is that it doesn't match name aliases 
or sequences.

http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt

For example:

lookup('TAMIL SYLLABLE TAA?')  # NamedSequence

ought to return ['தா'] but doesn't.
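
As a rough sketch of how sequences might be folded in (not part of my implementation above), the names could be read from data in the NamedSequences.txt layout, "name;code points". The SAMPLE text, parse_sequences and glob_sequences below are hypothetical names used only for illustration, and the sample is a two-line hand-copied excerpt rather than the real file:

```python
from fnmatch import translate
import re

# Hypothetical excerpt in the NamedSequences.txt layout ("name;code points").
# A real implementation would read the full UCD file instead.
SAMPLE = """\
# Comment lines start with '#'
LATIN SMALL LETTER R WITH TILDE;0072 0303
TAMIL SYLLABLE TAA;0BA4 0BBE
"""

def parse_sequences(text):
    """Map each sequence name to the string built from its code points."""
    table = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        seq_name, codepoints = line.split(';')
        table[seq_name.strip()] = ''.join(
            chr(int(cp, 16)) for cp in codepoints.split())
    return table

def glob_sequences(pattern, table):
    """Return the strings whose sequence name matches the glob, ignoring case."""
    match = re.compile(translate(pattern), flags=re.I).match
    return [chars for seq_name, chars in table.items() if match(seq_name)]

print(glob_sequences('tamil syllable t*', parse_sequences(SAMPLE)))
```

Note that this sketch returns the characters themselves rather than their names, which is what the example above expects.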

Parts of the Unicode documentation use the convention that canonical names are 
in UPPERCASE, aliases are in lowercase, and sequences are in Mixed Case, and I 
think that we should follow that convention:

http://www.unicode.org/charts/aboutcharindex.html

That makes it easy to see what is the canonical name and what isn't.
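
A minimal sketch of that convention, assuming each result is already tagged with what kind of name it is. The display_name helper and the 'canonical' / 'alias' / 'sequence' labels are hypothetical, and str.title() is only an approximation of Mixed Case for names containing digits or hyphens:

```python
def display_name(name, kind):
    """Format a name by kind: canonical, alias or sequence."""
    styles = {
        'canonical': str.upper,  # UPPERCASE
        'alias': str.lower,      # lowercase
        'sequence': str.title,   # Mixed Case (approximate)
    }
    return styles[kind](name)

print(display_name('TAMIL SYLLABLE TAA', 'sequence'))
```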

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35549>
_______________________________________