On 04/14/2017 08:12 AM, Malik Rumi wrote:
I am running some tests using the site regex101 to figure out the correct
regexes to use for a project. I was surprised at how slow it was, constantly
needing to increase the timeouts. I went Googling for a reason, and a solution,
and found Russ Cox’s article from 2007:
https://swtch.com/~rsc/regexp/regexp1.html . I couldn’t understand why, if this
was even remotely correct, we don’t use NFAs in Python, which led me here:
https://groups.google.com/forum/#!msg/comp.lang.python/L1ZFI_R2hAo/C12Nf3patWIJ;context-place=forum/comp.lang.python
where all of these issues were addressed. Unfortunately, this is also from
2007.
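To convince myself this really was the backtracking blow-up Cox describes, I
timed a toy pattern with nested quantifiers (my own example, not one from his
article) against Python's re module:

    import re
    import time

    # Nested quantifiers: the classic catastrophic-backtracking shape.
    # On input that almost matches, a backtracking engine tries an
    # exponential number of ways to split the a's between the two +'s.
    pattern = re.compile(r'(a+)+$')

    for n in (20, 22, 24, 26):
        text = 'a' * n + 'b'   # the trailing 'b' forces the match to fail
        start = time.perf_counter()
        pattern.match(text)
        print(n, round(time.perf_counter() - start, 2))

Each extra 'a' roughly doubles the running time, while a Thompson-NFA engine
of the kind Cox describes would handle all of these in linear time.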
BTW, John Machin in one of his replies cites Navarro’s paper, but that link is broken.
Navarro’s work can now be found at
http://citeseer.ist.psu.edu/viewdoc/download?doi=10.1.1.21.3112&rep=rep1&type=pdf
But be forewarned, it is 68 pages of dense reading.
Getting back to the "It would be nice ..." bit: yes, it would be nice
to have even more smarts in re, but who's going to do it? It's not a
"rainy Sunday afternoon" job :-)
Cheers,
John
Well, just as an idea, there is a portable C library for this at
http://laurikari.net/tre/ released under LGPL. If one is willing to
give up PCRE extensions for speed, it might be worth the work to
wrap this library using SWIG.
Kirk Sluder
(BTW, this link is also old. TRE is now at https://github.com/laurikari/tre/ )
I am not a computer science major. I am not new to Python, but I don’t think
I’m qualified to take on the idea of creating a new NFA module for Python. Nor
am I entirely sure I want to try something new (to me) like TRE.
Most threads related to this topic are from 2007 or earlier. I did find this
https://groups.google.com/forum/#!searchin/comp.lang.python/regex$20speed%7Csort:relevance/comp.lang.python/O7rUwVoD2t0/NYAQM0mUX7sJ
from 2011 but I did not do an exhaustive search.
The bottom line is I wanted to know if anything has changed since 2007, and whether
there is a) any hope for improving regex speeds in Python, or b) some third-party
module/library already out there that solves this problem. Or should I
just take this advice?
The cheap way in terms of programmer time is to pipe out to grep or
awk on this one.
Kirk Sluder
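For what it's worth, the shape of that from Python is roughly the following
(the pattern and 'logfile.txt' are made-up placeholders):

    import subprocess

    # Delegate the matching to grep's DFA-based engine.
    pattern = r'ERROR [0-9]+'
    result = subprocess.run(
        ['grep', '-E', pattern, 'logfile.txt'],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    # grep exits with status 1 when nothing matches, so avoid check=True here.
    matching_lines = result.stdout.splitlines()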
Thanks.
I'll also throw in the obligatory quote from Jamie Zawinski, "Some
people, when confronted with a problem, think 'I know, I'll use regular
expressions.' Now they have two problems."
It's not that regexes are the wrong tool for any job; I personally use
them all the time. But they're, tautologically, the wrong tool for any
job that can be done better with a different one. In Python, you've got
"in", .startswith, .endswith, and .split to handle simple parsing tasks.
On the other end, you've got lxml and the like to handle complex tasks
that provably cannot be done with regexes at all, let alone efficiently.
This leaves them in a fairly bounded middle ground, where my task is
complex enough to warrant something as difficult to read as a regex, but
still simple enough to be solved by one. And that ground is definitely
occupied. But not vast. So is it worth the time to try to write a more
efficient regex engine for Python? Yours if you want it to be, but not
mine.
--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.