Abe Dillon writes:
> Note that the entire documentation is 250 words while just the syntax
> portion of Python docs for the re module is over 3000 words.
Since Verbal Expressions (below, VEs, indicating notation) "compile"
to regular expressions (spelling out indicates the internal matching
implementation), the documentation of VEs presumably ignores
everything except the limited language it's useful for. To actually
understand VEs, you need to refer to the RE docs. Not a win IMO.
> > You think that example is more readable than the proposed translation
> > ^(http)(s)?(\:\/\/)(www\.)?([^\ ]*)$
> > which is better written
> > ^https?://(www\.)?[^ ]*$
> > or even
> > ^https?://[^ ]*$
>
>
> Yes. I find it *far* more readable. It's not a soup of symbols like Perl
> code. I can only surmise that you're fluent in regex because it seems
> difficult for you to see how the above could be less readable than English
> words.
Yes, I'm fairly fluent in regular expression notation (below, REs).
I've maintained a compiler for one dialect.
I'm not interested in the difference between words and punctuation
though. The reason I find the middle RE most readable is that it
"looks like" what it's supposed to match, in a contiguous string as
the object it will match will be contiguous. If I need to parse it to
figure out *exactly* what it matches, yes, that takes more effort.
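As an illustration of my own (not from the thread), the middle RE does the job in a couple of lines of Python:

```python
import re

# The middle pattern: scheme, optional "www.", then anything without a space.
url_re = re.compile(r"^https?://(www\.)?[^ ]*$")

print(bool(url_re.match("https://www.python.org/")))  # True
print(bool(url_re.match("ftp://example.com/")))       # False: wrong scheme
```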
But to understand a VE's semantics correctly, I'd have to look it up
as often as you have to look up REs because many words chosen to notate
VEs have English meanings that are (a) ambiguous, as in all natural
language, and (b) only approximate matches to RE semantics.
> I could tell it only matches URLs that are the only thing inside
> the string because it clearly says: start_of_line() and
> end_of_line().
That's not the problem. The problem is the semantics of the method
"find". "then" would indeed read better, although it doesn't exactly
match the semantics of concatenation in REs.
> I would have had to refer to a reference to know that "^" doesn't
> always mean "not", it sometimes means "start of string" and
> probably other things. I would also have to check a reference to
> know that "$" can mean "end of string" (and probably other things).
And you'll still have to do that when reading other people's REs.
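For the record (my illustration, not from the thread), the two meanings of "^" are easy to demonstrate:

```python
import re

# "^" anchors at the start of the string, but inside a character
# class "[^...]" it means negation.
assert re.match(r"^abc", "abcdef")                   # "^" = start of string
assert re.search(r"[^0-9]+", "123x").group() == "x"  # "[^0-9]" = not a digit
# "$" likewise anchors at the end of the string:
assert re.search(r"c$", "abc")
```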
> > Are those groups capturing in Verbal Expressions? The use of
> > "find" (~ "search") rather than "match" is disconcerting to the
> > experienced user.
>
> You can alternately use the word "then". The source code is just
> one python file. It's very easy to read. I actually like "then"
> over "find" for the example:
You're missing the point. The reader does not get to choose the
notation, the author does. I do understand what several varieties of
RE mean, but the variations are of two kinds: basic versus extended
(i.e., which tokens need to be escaped to be taken literally, and which
ones take on special meaning when escaped), and extensions (which can be
ignored). Modern RE facilities are essentially all of the extended
variety. Once you've learned that, you're in good shape for almost
any RE that should be written outside of an obfuscated code contest.
This is a fundamental principle of Python design: don't make readers
of code learn new things. That includes using notation developed
elsewhere in many cases.
> What does alternation look like?
>
> .OR(option1).OR(option2).OR(option3)...
>
> How about alternation of
> > non-trivial regular expressions?
>
> .OR(other_verbal_expression)
Real examples, rather than pseudo code, would be nice. I think you,
too, will find that examples of even fairly simple nested alternations
containing other constructs become quite hard to read, as they fall
off the bottom of the screen.
For example, the VE equivalent of
scheme = "(https?|ftp|file):"
would be (AFAICT):
scheme = VerEx().then(VerEx().then("http")
.maybe("s")
.OR("ftp")
.OR("file"))
.then(":")
which is pretty hideous, I think. And the colon is captured by a
group. If perversely I wanted to extract that group from a match,
what would its index be?
I guess you could keep the linear arrangement with
scheme = (VerEx().add("(")
.then("http")
.maybe("s")
.OR("ftp")
.OR("file")
.add(")")
.then(":"))
but is that really an improvement over
scheme = VerEx().add("(https?|ftp|file):")
;-)
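The group-index question, at least, is easy to answer in RE notation (a sketch of mine, not part of the thread): a capturing group gets index 1, and a non-capturing group (?:...) avoids the capture entirely.

```python
import re

# With a capturing group, the scheme is group 1:
m = re.match(r"(https?|ftp|file):", "https://example.com")
print(m.group(1))   # "https"

# A non-capturing group (?:...) keeps the alternation grouped
# without capturing anything:
m = re.match(r"(?:https?|ftp|file):", "ftp://host")
print(m.groups())   # () -- no captured groups
```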
> > As far as I can see, Verbal Expressions are basically a way of
> > making it so painful to write regular expressions that people
> > will restrict themselves to regular expressions
>
> What's so painful to write about them?
One thing that's painful is that VEs "look like" context-free
grammars, but are clumsier and lack their powerful semantics. You can get
the readability you want with greater power using grammars, which is
why I would prefer we work on getting a parser module into the stdlib.
But if one doesn't know about grammars, it's still not great. The
main pains about writing VEs for me are (1) reading what I just wrote,
(2) accessing capturing groups, and (3) verbosity. Even a VE to
accurately match what is normally a fairly short string, such as the
scheme, credentials, authority, and port portions of a "standard" URL,
is going to be hundreds of characters long and likely dozens of lines
if folded as in the examples.
Another issue is that we already have a perfectly good poor man's
matching library: glob. The URL example becomes
http{,s}://{,www.}*
Granted you lose the anchors, but how often does that matter? You
apparently don't use them often enough to remember them.
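A caveat on my part: Python's fnmatch module has no {,s} brace alternation, so a sketch in Python has to use a coarser pattern than the shell-style example above:

```python
import fnmatch

# Coarser than http{,s}://{,www.}* -- fnmatch has no brace alternation.
assert fnmatch.fnmatch("https://www.python.org/", "http*://*")
assert not fnmatch.fnmatch("mailto:[email protected]", "http*://*")
```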
> Does your IDE not have autocompletion?
I don't want an IDE. I have Emacs.
> I find REs so painful to write that I usually just use string
> methods if at all feasible.
Guess what? That's the right thing to do anyway. They're a lot more
readable and efficient when partitioning a string into two or three
parts, or recognizing a short list of affixes. But chaining many
methods, as VEs do, is not a very Pythonic way to write a program.
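For instance (my sketch, not from the thread), partitioning and affix checks read perfectly well with plain string methods:

```python
# String methods for a simple split: no RE needed.
url = "https://www.python.org/about"
scheme, sep, rest = url.partition("://")
assert sep == "://"             # separator was found
assert scheme == "https"
assert rest.startswith("www.")  # affix check, no pattern language
```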
> > I don't think that this failure to respect the developer's taste
> > is restricted to this particular implementation, either.
>
> I generally find it distasteful to write a pseudolanguage in
> strings inside of other languages (this applies to SQL as well).
You mean like arithmetic operators? (Lisp does this right, right?
Only one kind of expression, the function call!) It's a matter of
what you're used to. I understand that people new to text-processing,
or who don't do so much of it, don't find REs easy to read. So how is
this a huge loss? They don't use regular expressions very often! In
fact, they're far more likely to encounter, and possibly need to
understand, REs written by others!
> Especially when the design principles of that pseudolanguage are
> *diametrically opposed* to the design principles of the host
> language. A key principal of Python's design is: "you read code a
> lot more often than you write code, so emphasize
> readability". Regex seems to be based on: "Do the most with the
> fewest key-strokes.
So is all of mathematics. There's nothing wrong with concise
expression for use in special cases.
> Readability be damned!". It makes a lot more sense to wrap the
> pseudolanguage in constructs that bring it in-line with the host
> language than to take on the mental burden of trying to comprehend
> two different languages at the same time.
>
> If you disagree, nothing's stopping you from continuing to write
> REs the old-fashioned way.
I don't think that RE and SQL are "pseudo" languages, no. And I, and
most developers, will continue to write regular expressions using the
much more compact and expressive RE notation. (In fact with the
exception of the "word" method, in VEs you still need to use RE notation
to express most of the Python extensions.) So what you're saying is
that you don't read much code, except maybe your own. Isn't that your
problem? Those of us who cooperate widely on applications using
regular expressions will continue to communicate using REs. If that
leaves you out, that's not good. But adding VEs to the stdlib (and
thus encouraging their use) will split the community into RE users and
VE users, if VEs are at all useful. That's bad. I don't see the
potential usefulness of VEs to infrequent users of regular expressions
outweighing the downsides of "many ways to do it" in the stdlib.
> Can we at least agree that baking special re syntax directly into
> the language is a bad idea?
I agree that there's no particular need for RE literals. If one wants
to mark an RE as some special kind of object, re.compile() does that
very well both by converting to a different type internally and as a
marker syntactically.
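To sketch what I mean (my example): the compiled pattern is a distinct type, so both readers and isinstance checks can tell it apart from a plain string.

```python
import re

# re.compile() as a marker: the result is a re.Pattern, not a str.
url_re = re.compile(r"^https?://[^ ]*$")
assert isinstance(url_re, re.Pattern)
assert not isinstance(r"^https?://[^ ]*$", re.Pattern)
```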
> On Wed, Mar 29, 2017 at 11:49 PM, Nick Coghlan <[email protected]> wrote:
>
> > We don't really want to ease the use of regexps in Python - while
> > they're an incredibly useful tool in a programmer's toolkit,
> > they're so cryptic that they're almost inevitably a
> > maintainability nightmare.
I agree with Nick. Regular expressions, whatever the notation, are a
useful tool (no suspension of disbelief necessary for me, though!).
But they are cryptic, and it's not just the notation. People (even
experienced RE users) are often surprised by what fairly simple
regular expressions match in a given text, because people want to read
a regexp as instructions to a one-pass greedy parser, and it isn't.
For example, above I wrote
scheme = "(https?|ftp|file):"
rather than
scheme = "(\w+):"
because it's not unlikely that I would want to treat those differently
from other schemes such as mailto, news, and doi. In many
applications of regular expressions (such as tokenization for a
parser) you need many expressions. Compactness really is a virtue in
REs.
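To make the difference between the two scheme patterns concrete (a sketch of mine, not part of the original mail):

```python
import re

# The generic r"(\w+):" also matches schemes one may want to
# treat differently, such as mailto.
specific = re.compile(r"(https?|ftp|file):")
generic = re.compile(r"(\w+):")

assert specific.match("mailto:[email protected]") is None
assert generic.match("mailto:[email protected]").group(1) == "mailto"
assert specific.match("https://example.com").group(1) == "https"
```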
Steve
_______________________________________________
Python-ideas mailing list
[email protected]
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/