On 04/14/2017 08:12 AM, Malik Rumi wrote:
I am running some tests using the site regex101 to figure out the correct
regexes to use for a project. I was surprised at how slow it was, constantly
needing to increase the timeouts. I went Googling for a reason, and a solution,
and found Russ Cox’s article from 2007:
https://swtch.com/~rsc/regexp/regexp1.html . I couldn’t understand why, if this
was even remotely correct, we don’t use NFAs in Python, which led me here:
https://groups.google.com/forum/#!msg/comp.lang.python/L1ZFI_R2hAo/C12Nf3patWIJ;context-place=forum/comp.lang.python
where all of these issues were addressed. Unfortunately, this is also from
2007.
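To convince myself this really was the backtracking blow-up Cox describes, I
timed a toy pattern with nested quantifiers (my own example, not one from his
article) against Python's re module:

    import re
    import time

    # Nested quantifiers: the classic catastrophic-backtracking shape.
    # On input that almost matches, a backtracking engine tries an
    # exponential number of ways to split the a's between the two +'s.
    pattern = re.compile(r'(a+)+$')

    for n in (20, 22, 24, 26):
        text = 'a' * n + 'b'   # the trailing 'b' forces the match to fail
        start = time.perf_counter()
        pattern.match(text)
        print(n, round(time.perf_counter() - start, 2))

Each extra 'a' roughly doubles the running time, while a Thompson-NFA engine
of the kind Cox describes would handle all of these in linear time.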
BTW, John Machin in one of his replies cites Navarro’s paper, but that link is broken.
Navarro’s work can now be found at
http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.21.3112&rep=rep1&type=pdf
But be forewarned, it is 68 pages of dense reading.
Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)
Cheers,
John
Well, just as an idea, there is a portable C library for this at
http://laurikari.net/tre/ released under LGPL. If one is willing to
give up PCRE extensions for speed, it might be worth the work to
wrap this library using SWIG.
Kirk Sluder
(BTW, this link is also old. TRE is now at https://github.com/laurikari/tre/ )
I am not a computer science major. I am not new to Python, but I don’t think
I’m qualified to take on the idea of creating a new NFA module for Python. Nor
am I entirely sure I want to try something new (to me) like TRE.
Most threads related to this topic are from 2007 or earlier. I did find this
https://groups.google.com/forum/#!searchin/comp.lang.python/regex$20speed%7Csort:relevance/comp.lang.python/O7rUwVoD2t0/NYAQM0mUX7sJ
from 2011 but I did not do an exhaustive search.
The bottom line is I wanted to know if anything has changed since 2007, and whether
there is a) any hope for improving regex speeds in Python, or b) some third-party
module/library already out there that solves this problem. Or should I
just take this advice?
The cheap way in terms of programmer time is to pipe out to grep or
awk on this one.
Kirk Sluder
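For what it's worth, the shape of that from Python is roughly the following
(the pattern and 'logfile.txt' are made-up placeholders):

    import subprocess

    # Delegate the matching to grep's DFA-based engine.
    pattern = r'ERROR [0-9]+'
    result = subprocess.run(
        ['grep', '-E', pattern, 'logfile.txt'],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # grep exits with status 1 when nothing matches, so avoid check=True here.
    matching_lines = result.stdout.splitlines()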
Thanks.
I'll also throw in the obligatory quote from Jamie Zawinski, "Some
people, when confronted with a problem, think 'I know, I'll use regular
expressions.' Now they have two problems."
It's not that regexes are the wrong tool for any job; I personally use
them all the time. But they're, tautologically, the wrong tool for any
job that can be done better with a different one. In Python, you've got
"in", .startswith, .endswith, and .split to handle simple parsing tasks.
On the other end, you've got lxml and the like to handle complex tasks
that provably cannot be done with regexes at all, let alone efficiently.
This leaves them in a fairly bounded middle ground, where my task is
complex enough to warrant something as difficult to read as a regex, but
still simple enough to be solved by one. And that ground is definitely
occupied. But not vast. So is it worth the time to try to write a more
efficient regex engine for Python? Yours if you want it to be, but not
mine.
--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.