On 2016-12-15 21:57, Terry Reedy wrote:
On 12/15/2016 1:06 PM, MRAB wrote:
On 2016-12-15 16:53, Steve D'Aprano wrote:
Suppose I have a Unicode character, and I want to determine the script or
scripts it belongs to.
For example:
U+0033 DIGIT THREE "3" belongs to the script "COMMON";
U+0061 LATIN SMALL LETTER A "a" belongs to the script "LATIN";
U+03BE GREEK SMALL LETTER XI "ξ" belongs to the script "GREEK".
Is this information available from Python?
More about Unicode scripts:
http://www.unicode.org/reports/tr24/
http://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt
http://www.unicode.org/Public/UCD/latest/ucd/ScriptExtensions.txt
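
The Scripts.txt format is simple enough that a lookup can be sketched with the standard library alone. What follows is only a rough sketch, not an existing API: the names parse_scripts and script_of are made up for illustration, and SAMPLE is a small illustrative excerpt written in the file's format rather than the real data file. Code points missing from the data are reported as "Unknown", which is the default script value, so with this tiny sample most characters will come back as "Unknown".

```python
import bisect

# Illustrative excerpt in the Scripts.txt format (NOT the full file).
SAMPLE = """\
# code point or range ; script name # trailing comment
0030..0039    ; Common # DIGIT ZERO..DIGIT NINE
0061..007A    ; Latin # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03A3..03E1    ; Greek # GREEK CAPITAL LETTER SIGMA..GREEK LETTER SAMPI
"""


def parse_scripts(text):
    """Return a sorted list of (start, end, script) tuples (hypothetical helper)."""
    ranges = []
    for line in text.splitlines():
        # Drop the trailing comment and skip blank or comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        codes, script = (field.strip() for field in line.split(";"))
        # A line holds either a single code point or a "first..last" range.
        first, _, last = codes.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, script))
    ranges.sort()
    return ranges


def script_of(char, ranges):
    """Look up the script of a single character (hypothetical helper)."""
    cp = ord(char)
    # Rebuilt on every call for brevity; cache this list for real use.
    starts = [start for start, _, _ in ranges]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return "Unknown"


ranges = parse_scripts(SAMPLE)
for ch in "3a\u03be":
    print("U+%04X" % ord(ch), script_of(ch, ranges))
```

To use real data, read a downloaded copy of Scripts.txt and pass its contents to parse_scripts in place of SAMPLE. ScriptExtensions.txt lists several scripts per entry, so it would need a slightly different parser.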
Interestingly, there's issue 6331 "Add unicode script info to the
unicode database". Looks like it didn't make it into Python 3.6.
https://bugs.python.org/issue6331
Opened in 2009 with patch and 2 revisions for 2.x. At least the Python
code needs to be updated.
Approved in principle by Martin, the unicodedata curator at the time, who is
no longer active. Neither, for the most part, are the other 2 listed in the
Experts Index.
From what I could see, both the Python API (there is no doc patch yet)
and internal implementation need more work. If I were to get involved,
I would look at the APIs of PyICU (see Eryk Sun's post) and the
unicodescript module on PyPI (mentioned by Pander Musubi, on the issue).
For what it's worth, the post has prompted me to get back to a module I
started which will report such Unicode properties, essentially the ones
that the regex module supports. It just needs a few more tweaks and
packaging up...