Elizabeth Myers added the comment:

> I think this needs to be fixed, then - we need to study why there are
> so many new records (e.g. what script contributes most new records),
> and then look for alternatives.

The "Common" script appears to be very fragmented and may be the cause of the 
issues.

> One alternative could be to create a separate Trie for scripts.

Not having seen the C version yet, I have one written in Python, custom-made 
for storing the script database and based on the general idea of a range tree. 
It stores the ranges individually, straight out of Scripts.txt. Each node is 
keyed on the average of the lower and upper bounds of its range (the two 
bounds can be equal). When searching, you compare the codepoint to the average 
stored in the current node and use that to decide which subtree to descend 
into.
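
Roughly, the node structure and lookup look something like this (just a sketch 
of the idea, not the actual code; the names are made up):

class Node:
    __slots__ = ("low", "high", "script", "mid", "left", "right")

    def __init__(self, low, high, script, left=None, right=None):
        self.low, self.high, self.script = low, high, script
        self.mid = (low + high) // 2      # average of the two bounds
        self.left, self.right = left, right

def build(ranges):
    """Build a balanced tree from sorted, non-overlapping ranges."""
    if not ranges:
        return None
    mid = len(ranges) // 2
    low, high, script = ranges[mid]
    return Node(low, high, script,
                build(ranges[:mid]), build(ranges[mid + 1:]))

def lookup(node, cp):
    """Return the script of codepoint cp, or None if it has no entry."""
    while node is not None:
        if node.low <= cp <= node.high:
            return node.script
        # Compare against the node's midpoint to pick a subtree.
        node = node.left if cp < node.mid else node.right
    return None

build() takes the (low, high, script) triples parsed from Scripts.txt, sorted 
by their low bound, and e.g. lookup(tree, 0x0391) returns "Greek" on a tree 
built from the full data.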

Without coalescing neighbouring ranges that belong to the same script, the 
tree has 1,606 nodes (for Unicode 7.0, which added a lot of scripts). After 
coalescing, it comes out to 806 nodes.
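
The coalescing step is just a linear pass over the sorted ranges, something 
like this (again only a sketch; it merges a range into the previous one only 
when the script matches and the bounds are contiguous, so unassigned gaps 
still fall through to "no script"):

def coalesce(ranges):
    """Merge adjacent, contiguous ranges that share a script."""
    merged = []
    for low, high, script in ranges:
        if merged and merged[-1][2] == script and merged[-1][1] + 1 == low:
            # Extend the previous range instead of adding a new node.
            merged[-1] = (merged[-1][0], high, script)
        else:
            merged.append((low, high, script))
    return merged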

If anyone cares, I'll be more than happy to post code for inspection.

> I don't know what this will be used for, but one application is
> certainly regular expressions. So we need an efficient test whether
> the character is in the expected script or not. It would be bad if
> such a test would have to do a .lower() on each lookup.

This is actually required for restriction-level detection as described in 
Unicode TR39, for all levels of restriction above ASCII-only 
(http://www.unicode.org/reports/tr39/#Restriction_Level_Detection).
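
For illustration, a greatly simplified version of that kind of check, built on 
the lookup() sketch above (real TR39 detection uses Script_Extensions and 
resolved script sets, so treat this only as a sketch):

def scripts_used(text, tree):
    """Collect the scripts in text, ignoring Common/Inherited."""
    seen = set()
    for ch in text:
        script = lookup(tree, ord(ch))
        if script not in (None, "Common", "Inherited"):
            seen.add(script)
    return seen

def is_single_script(text, tree):
    # Roughly the spirit of TR39's "Single Script" restriction level.
    return len(scripts_used(text, tree)) <= 1

The point is that checks like these do one lookup per character, so the 
per-character script lookup has to be cheap.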

----------
nosy: +Elizacat

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue6331>
_______________________________________