On Thu, Dec 15, 2016 at 4:53 PM, Steve D'Aprano
<steve+pyt...@pearwood.info> wrote:
> Suppose I have a Unicode character, and I want to determine the script or
> scripts it belongs to.
>
> For example:
>
> U+0033 DIGIT THREE "3" belongs to the script "COMMON";
> U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
> U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
>
> Is this information available from Python?

Tools/unicode/makeunicodedata.py doesn't include data from "Scripts.txt", so
the unicodedata module has no script property. If adding an external
dependency is ok, then you can use PyICU. For example:

    >>> import icu
    >>> icu.Script.getScript('\u0033').getName()
    'Common'
    >>> icu.Script.getScript('\u0061').getName()
    'Latin'
    >>> icu.Script.getScript('\u03be').getName()
    'Greek'

There isn't documentation specific to Python, so you'll have to figure
things out experimentally with reference to the C API.

http://icu-project.org/apiref/icu4c
http://icu-project.org/apiref/icu4c/uscript_8h.html
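
If an external dependency is not an option, another route is to read the
"Scripts.txt" data yourself. What follows is only a minimal sketch of that
idea, using the standard library alone. The names parse_scripts, script_of
and SAMPLE_SCRIPTS_TXT are hypothetical helpers made up for illustration,
and the embedded sample is a short hand-written excerpt that mimics the file
format (code point or range, semicolon, script name, optional "#" comment);
it is not a verbatim copy of the real file. The 'Unknown' fallback for
unlisted code points is assumed to match the default stated in the file
header.

```python
import bisect

# Hand-written sample in the Scripts.txt format (NOT the full UCD file).
SAMPLE_SCRIPTS_TXT = """\
# code point or range ; script name # comment
0030..0039    ; Common # digits
0041..005A    ; Latin  # capital letters
0061..007A    ; Latin  # small letters
00AA          ; Latin  # a single code point entry
03A3..03E1    ; Greek  # part of the Greek block
"""


def parse_scripts(text):
    """Return a sorted list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        codepoints, script = (field.strip() for field in line.split(';'))
        first, _, last = codepoints.partition('..')
        # A single code point is stored as a range of length one.
        ranges.append((int(first, 16), int(last or first, 16), script))
    ranges.sort()
    return ranges


def script_of(char, ranges):
    """Look up the script of a one-character string by binary search."""
    cp = ord(char)
    # Index of the last range whose first code point is <= cp.
    i = bisect.bisect_right(ranges, (cp, 0x110000, '')) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return 'Unknown'  # not listed in the data that was parsed


RANGES = parse_scripts(SAMPLE_SCRIPTS_TXT)
for ch in '3a\u03be':
    print('U+%04X' % ord(ch), script_of(ch, RANGES))
```

With the sample above, anything outside the five listed ranges comes back as
'Unknown', which only reflects the truncated data. To get real answers you
would pass the contents of the actual Scripts.txt from the Unicode Character
Database to parse_scripts instead.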