Ezio Melotti added the comment:

> I'm not sure what the "Other_ID_Start property" mentioned in [1] and
> [2] means, though. Can we get someone with more in-depth knowledge of
> unicode to help with this? 

See http://www.unicode.org/reports/tr31/#Backward_Compatibility.
Basically these characters were valid ID_Start characters in earlier versions 
of Unicode but no longer qualify on the basis of their general category; the 
property keeps them listed for backward compatibility.  I think it's safe to 
leave them out (perhaps they could/should be removed from the Python parser 
too), but if you want to add them, the list includes only 4 characters (there 
are 12 more for Other_ID_Continue).
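
For reference, a small sketch for inspecting these characters on a given
Python build. The four code points listed are, to my knowledge, the
Other_ID_Start set from PropList.txt at the time of this thread (later Unicode
versions may list more); the name OTHER_ID_START is just an illustrative
constant, not something from the stdlib or IDLE.

```python
import unicodedata

# Assumed list: Other_ID_Start code points as of the Unicode version
# current to this thread (newer versions of PropList.txt may add more).
OTHER_ID_START = ["\u2118", "\u212e", "\u309b", "\u309c"]

for ch in OTHER_ID_START:
    # Print the code point, its name, its general category, and whether
    # this interpreter accepts it as a one-character identifier.
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isidentifier(),
    )
```

None of the four has a letter category, which is why a check based only on
unicodedata.category would not treat them as identifier starts.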

> The real question is how to do this *fast*, since HyperParser does a
> *lot* of these checks. Do you think caching would be a good approach?

I think it would be enough to check explicitly for ASCII chars first, since 
most of them will be ASCII anyway.  If a char is not ASCII you can fall back 
to unicodedata.category() (or str.isidentifier(), if it does the right thing).
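
A minimal sketch of that idea, assuming single-character input. The function
and constant names are hypothetical (not HyperParser's actual code), and the
category set follows the UAX #31 derivation of ID_Start while deliberately
leaving out the Other_ID_Start characters discussed above.

```python
import string
import unicodedata

# ASCII fast path: letters and underscore can start an identifier.
_ASCII_ID_START = frozenset(string.ascii_letters + "_")

# General categories that make up ID_Start (Other_ID_Start is ignored here).
_ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


def is_id_start_char(ch):
    """Return True if the single character ch may start an identifier."""
    if ord(ch) < 128:
        # Most source text is ASCII, so this set lookup handles the
        # common case without touching the Unicode database.
        return ch in _ASCII_ID_START
    # Slow path for non-ASCII characters.
    return unicodedata.category(ch) in _ID_START_CATEGORIES


print(is_id_start_char("a"), is_id_start_char("1"), is_id_start_char("\u00e9"))
```

Replacing the slow path with ch.isidentifier() would be the other option
mentioned above; whether the two agree for every character is worth checking
before relying on it.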

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21765>
_______________________________________