Re: unicode as valid naming symbols

Ned Batchelder Tue, 01 Apr 2014 06:37:12 -0700

On 4/1/14 9:00 AM, Chris Angelico wrote:

On Tue, Apr 1, 2014 at 10:59 PM, Antoon Pardon
<[email protected]> wrote:

On 01-04-14 12:58, Chris Angelico wrote:

But because, in the future, Python may choose to create new operators,
the simplest and safest way to ensure safety is to put a boundary on
what can be operators and what can be names; Unicode character classes
are perfect for this. It's also possible that all Unicode whitespace
characters might become legal for indentation and separation (maybe
they are already??), so obviously they're ruled out as identifiers;
anyway, I honestly do not think people would want to use U+2007 FIGURE
SPACE inside a name. So if we deny whitespace, and accept letters and
digits, it makes good sense to deny mathematical symbols so as to keep
them available for operators. (It also makes reasonable sense to
*permit* mathematical symbols, thus allowing you to use them for
functions/methods, in the same way that you can use "n", "o", and "t",
but not "not"; but with word operators, the entire word has to be used
as-is before it's a collision - with a symbolic one, any instance of
that symbol inside a name will change parsing entirely. It's a
trade-off, and Python's made a decision one way and not the other.)


This mostly makes sense to me. The only caveat I have is that since we
also allow _ (U+005F LOW LINE) in names, which belongs to the category
<punctuation, connector>, we should allow other symbols within this
category in a name.

But I confess that is mostly personal taste, since I find names_like_this
ugly. Names-like-this look better to me but that wouldn't be workable
in Python. But maybe there is some connector that would be aesthetically
pleasing and not cause other problems.


That's reasonable. The Pc category doesn't have much in it:

http://www.fileformat.info/info/unicode/category/Pc/list.htm

If the definition of "characters permitted in identifiers" is derived
exclusively from the Unicode categories, including Pc would make fine
sense. Probably the definition should be: First character is L* or Pc,
subsequent characters are L*, N*, or Pc, and either Mn or M*
(combining characters). Or something like that.
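
A rough sketch of that hypothetical rule, purely for illustration. The function name is made up, and this is not how Python actually decides what an identifier is; it only encodes "first character L* or Pc, subsequent characters L*, N*, Pc, or M*" using the standard unicodedata module.

```python
import unicodedata


def is_hypothetical_identifier(name):
    """Check a name against the category-only rule sketched above.

    Hypothetical helper: first character must be L* or Pc; every later
    character must be L*, N*, M*, or Pc.
    """
    if not name:
        return False
    first_cat = unicodedata.category(name[0])
    if not (first_cat.startswith("L") or first_cat == "Pc"):
        return False
    for ch in name[1:]:
        cat = unicodedata.category(ch)
        # cat is a two-letter code such as "Ll", "Nd", "Mn", "Pc".
        if not (cat[0] in "LNM" or cat == "Pc"):
            return False
    return True


for candidate in ["names_like_this", "names-like-this", "a\u203fb", "a\u2007b"]:
    print(ascii(candidate), is_hypothetical_identifier(candidate))
```

Under this rule U+203F UNDERTIE (category Pc) would be accepted inside a name, while HYPHEN-MINUS (Pd) and U+2007 FIGURE SPACE (Zs) would not.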

Maybe I'm misunderstanding the discussion... It seems like we're talking about a hypothetical definition of identifiers based on Unicode character categories, but there's no need: Python 3 has defined precisely that. From the docs (https://docs.python.org/3/reference/lexical_analysis.html#identifiers):


---<snip>---------

Python 3.0 introduces additional characters from outside the ASCII range (see PEP 3131). For these characters, the classification uses the version of the Unicode Character Database as included in the unicodedata module.


Identifiers are unlimited in length. Case is significant.

identifier   ::=  xid_start xid_continue*

id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">


The Unicode category codes mentioned above stand for:

    Lu - uppercase letters
    Ll - lowercase letters
    Lt - titlecase letters
    Lm - modifier letters
    Lo - other letters
    Nl - letter numbers
    Mn - nonspacing marks
    Mc - spacing combining marks
    Nd - decimal numbers
    Pc - connector punctuations

    Other_ID_Start - explicit list of characters in PropList.txt to support backwards compatibility
    Other_ID_Continue - likewise

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.


---<end snip>-----
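
You can poke at these rules from the interpreter. Here's a small sketch using str.isidentifier() and unicodedata.normalize(); the describe() helper and the sample names are mine, just for illustration. It tries a Pc character other than the underscore (U+203F UNDERTIE), a hyphen, a math symbol (U+2211), and a ligature (U+FB01) that NFKC folds to plain "fi".

```python
import unicodedata

UNDERTIE = "\u203f"  # category Pc, like the underscore


def describe(name):
    """Return (is it a valid identifier?, its NFKC-normalized form)."""
    return name.isidentifier(), unicodedata.normalize("NFKC", name)


candidates = [
    "names_like_this",
    "names" + UNDERTIE + "like" + UNDERTIE + "this",  # Pc inside a name
    UNDERTIE + "name",                                # Pc as first character
    "names-like-this",                                # hyphen is Pd, not Pc
    "total\u2211",                                    # N-ARY SUMMATION, category Sm
    "\ufb01le",                                       # starts with LATIN SMALL LIGATURE FI
]
for name in candidates:
    print(ascii(name), describe(name))

# Identifiers are normalized to NFKC while parsing, so assigning to the
# ligature spelling binds the plain ASCII name.
namespace = {}
exec("\ufb01le = 1", namespace)
print("file" in namespace)
```

So a connector like the undertie already works inside a name, though not as its first character, and the hyphen and the summation sign are rejected.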


ChrisA



--
Ned Batchelder, http://nedbatchelder.com
