On 5 July 2013 03:03, Dave Angel <da...@davea.name> wrote:
> On 07/04/2013 09:24 PM, Steven D'Aprano wrote:
>> On Thu, 04 Jul 2013 17:54:20 +0100, Rotwang wrote:
>>> It's perhaps worth mentioning that some non-ascii characters are allowed
>>> in identifiers in Python 3, though I don't know which ones.
>>
>> PEP 3131 describes the rules:
>>
>> http://www.python.org/dev/peps/pep-3131/
>
> The isidentifier() method will let you weed out the characters that cannot
> start an identifier. But there are other groups of characters that can
> appear after the starting "letter". So a more reasonable sample might be
> something like:
...
> In particular,
> http://docs.python.org/3.3/reference/lexical_analysis.html#identifiers
>
> has a definition for id_continue that includes several interesting
> categories. I expected the non-ASCII digits, but there's other stuff there,
> like "nonspacing marks" that are surprising.
>
> I'm pretty much speculating here, so please correct me if I'm way off.
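The start/continue distinction can be seen directly with str.isidentifier(). This is only a rough sketch: the helper names can_start and can_continue are mine, not anything from the standard library, and the trick of prefixing "a" simply assumes that "a" is a valid starting character.

```python
import unicodedata

def can_start(ch):
    # A single character is a valid identifier only if it may start one.
    return ch.isidentifier()

def can_continue(ch):
    # Prefix a known-valid start character ("a") so that ch is tested
    # in a non-initial position.
    return ("a" + ch).isidentifier()

# ASCII letter, underscore, ASCII digit, a nonspacing mark (Mn),
# a non-ASCII decimal digit (Nd), and a character valid in neither role.
samples = ["x", "_", "5", "\u0301", "\u0663", "-"]

for ch in samples:
    print(ascii(ch), unicodedata.category(ch), can_start(ch), can_continue(ch))
```

The nonspacing mark U+0301 and the Arabic-Indic digit U+0663 are both rejected as a first character but accepted after one, which is the id_continue behaviour mentioned above.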
For my calculation above, I used this code I quickly mocked up:

    import unicodedata as unidata
    from sys import maxunicode
    from collections import defaultdict
    from itertools import chain

    def get():
        xid_starts = set()
        xid_continues = set()

        id_start_categories = "Lu, Ll, Lt, Lm, Lo, Nl".split(", ")
        id_continue_categories = "Mn, Mc, Nd, Pc".split(", ")

        characters = (chr(n) for n in range(maxunicode + 1))

        print("Making normalized characters")

        normalized = (unidata.normalize("NFKC", character)
                      for character in characters)
        normalized = set(chain.from_iterable(normalized))

        print("Assigning to categories")

        for character in normalized:
            category = unidata.category(character)

            if category in id_start_categories:
                xid_starts.add(character)
            elif category in id_continue_categories:
                xid_continues.add(character)

        return xid_starts, xid_continues

Please note that "xid_continues" actually represents "xid_continue - xid_start".

-- 
http://mail.python.org/mailman/listinfo/python-list