On 4/1/14 9:00 AM, Chris Angelico wrote:
On Tue, Apr 1, 2014 at 10:59 PM, Antoon Pardon
<antoon.par...@rece.vub.ac.be> wrote:
On 01-04-14 12:58, Chris Angelico wrote:
But because, in the future, Python may choose to create new operators,
the simplest and safest way to ensure safety is to put a boundary on
what can be operators and what can be names; Unicode character classes
are perfect for this. It's also possible that all Unicode whitespace
characters might become legal for indentation and separation (maybe
they are already??), so obviously they're ruled out as identifiers;
anyway, I honestly do not think people would want to use U+2007 FIGURE
SPACE inside a name. So if we deny whitespace, and accept letters and
digits, it makes good sense to deny mathematical symbols so as to keep
them available for operators. (It also makes reasonable sense to
*permit* mathematical symbols, thus allowing you to use them for
functions/methods, in the same way that you can use "n", "o", and "t",
but not "not"; but with word operators, the entire word has to be used
as-is before it's a collision - with a symbolic one, any instance of
that symbol inside a name will change parsing entirely. It's a
trade-off, and Python's made a decision one way and not the other.)
This mostly makes sense to me. The only caveat I have is that since we
also allow _ (U+005F LOW LINE) in names, which belongs to the category
<punctuation, connector>, we should allow other symbols within this
category in a name.
But I confess that is mostly personal taste, since I find names_like_this
ugly. Names-like-this look better to me, but that wouldn't be workable
in Python. But maybe there is some connector that would be aesthetically
pleasing and would not cause other problems.
That's reasonable. The Pc category doesn't have much in it:
http://www.fileformat.info/info/unicode/category/Pc/list.htm
If the definition of "characters permitted in identifiers" is derived
exclusively from the Unicode categories, including Pc would make fine
sense. Probably the definition should be: First character is L* or Pc,
subsequent characters are L*, N*, Pc, or combining marks (either just
Mn or all of M*). Or something like that.
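
As a rough sketch only, the rule proposed just above could be written
with the unicodedata module as follows. The function name
is_name_by_category is made up for illustration, the "all of M*"
reading is assumed, and this is not how Python actually decides what
an identifier is.

```python
import unicodedata


def is_name_by_category(name):
    """Hypothetical check: first char is L* or Pc; the rest are L*, N*, M* or Pc."""
    if not name:
        return False

    def start_ok(ch):
        cat = unicodedata.category(ch)
        return cat.startswith("L") or cat == "Pc"

    def continue_ok(ch):
        cat = unicodedata.category(ch)
        # cat[0] is the major class: L = letter, N = number, M = mark
        return cat[0] in "LNM" or cat == "Pc"

    return start_ok(name[0]) and all(continue_ok(ch) for ch in name[1:])


print(is_name_by_category("names_like_this"))   # "_" is Pc
print(is_name_by_category("names-like-this"))   # "-" is Pd, so rejected
print(is_name_by_category("x\u2007y"))          # FIGURE SPACE is Zs, so rejected
```

Under this sketch mathematical symbols (category Sm) and whitespace
fall outside the allowed classes, which is the boundary described
earlier in the thread.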
Maybe I'm misunderstanding the discussion... It seems like we're talking
about a hypothetical definition of identifiers based on Unicode
character categories, but there's no need: Python 3 has defined
precisely that. From the docs
(https://docs.python.org/3/reference/lexical_analysis.html#identifiers):
---<snip>---------
Python 3.0 introduces additional characters from outside the ASCII range
(see PEP 3131). For these characters, the classification uses the
version of the Unicode Character Database as included in the unicodedata
module.
Identifiers are unlimited in length. Case is significant.
identifier   ::= xid_start xid_continue*
id_start     ::= <all characters in general categories Lu, Ll, Lt, Lm,
                 Lo, Nl, the underscore, and characters with the
                 Other_ID_Start property>
id_continue  ::= <all characters in id_start, plus characters in the
                 categories Mn, Mc, Nd, Pc and others with the
                 Other_ID_Continue property>
xid_start    ::= <all characters in id_start whose NFKC normalization
                 is in "id_start xid_continue*">
xid_continue ::= <all characters in id_continue whose NFKC
                 normalization is in "id_continue*">
The Unicode category codes mentioned above stand for:
Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers
Mn - nonspacing marks
Mc - spacing combining marks
Nd - decimal numbers
Pc - connector punctuations
Other_ID_Start - explicit list of characters in PropList.txt to
support backwards compatibility
Other_ID_Continue - likewise
All identifiers are converted into the normal form NFKC while parsing;
comparison of identifiers is based on NFKC.
---<end snip>-----
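
For what it's worth, the quoted rules can be poked at from the
interpreter. This is only a quick sketch: the sample names are my own,
and the results depend on the Unicode database shipped with your
Python 3. str.isidentifier() applies the identifier grammar above, and
the exec() at the end shows the NFKC conversion of a name.

```python
import unicodedata

candidates = [
    "names_like_this",            # U+005F LOW LINE, category Pc
    "names\u203flike\u203fthis",  # U+203F UNDERTIE, also category Pc
    "names-like-this",            # U+002D HYPHEN-MINUS, category Pd
    "x\u2007y",                   # U+2007 FIGURE SPACE, category Zs
    "a\u2211b",                   # U+2211 N-ARY SUMMATION, category Sm
]

# Ask Python itself whether each string is a valid identifier.
results = {name: name.isidentifier() for name in candidates}
for name, ok in results.items():
    print(ascii(name), ok)

# Identifiers are converted to NFKC while parsing, so a name spelled with
# U+FB01 LATIN SMALL LIGATURE FI ends up bound as plain "file".
ligature_name = "\ufb01le"
normalized = unicodedata.normalize("NFKC", ligature_name)
namespace = {}
exec(ligature_name + " = 1", namespace)
print(normalized, "file" in namespace)
```

So a connector other than the underscore is already usable in the
middle of a name, while the hyphen, whitespace, and mathematical
symbols are not.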
ChrisA
--
Ned Batchelder, http://nedbatchelder.com