On 28/02/2023 at 3:44, Thomas Passin wrote:
On 2/27/2023 9:16 PM, avi.e.gr...@gmail.com wrote:
And, just for fun, since there is nothing wrong with your code, this
minor change is terser:
>>> import re
>>> example = 'X - abc_degree + 1 + qq + abc_degree + 1'
>>> for match in re.finditer(re.escape('abc_degree + 1'), example):
...     print(match.start(), match.end())
...
4 18
26 40
Just for more fun :) -
Without knowing how general your expressions will be, I think the
following version is very readable, certainly more readable than regexes:
example = 'X - abc_degree + 1 + qq + abc_degree + 1'
KEY = 'abc_degree + 1'
for i in range(len(example)):
    if example[i:].startswith(KEY):
        print(i, i + len(KEY))
# prints:
# 4 18
# 26 40
I think it's often a good idea to use a standard library function
instead of rolling your own. The issue becomes less clear-cut when the
standard library doesn't do exactly what you need (as here, where
re.finditer() uses regular expressions while the use case only uses
simple search strings). Ideally there would be a str.finditer() method
we could use, but in its absence I think the almost-but-not-quite
fitting re.finditer() is still worth considering.
Two reasons:
(1) I think it's clearer: the name tells us what it does (though of
course we could solve this in a hand-written version by wrapping it in a
suitably named function).
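For example (this sketch is not in the original post, and finditer_str
is just a name I made up for illustration), such a wrapper could be as
thin as:

import re

def finditer_str(key, text):
    # Yield (start, end) spans of every occurrence of the plain string
    # `key` in `text`; re.escape ensures characters like '+' are taken
    # literally rather than as regex syntax.
    for match in re.finditer(re.escape(key), text):
        yield match.start(), match.end()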
(2) Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but
slowly. In some situations that doesn't matter, but in others it will.
For better performance, string-searching algorithms jump ahead, either
when they have found a match or when they know for sure there cannot be
a match for some distance (see e.g. the Boyer–Moore string-search
algorithm). You could write such a more efficient algorithm yourself,
but then it becomes more complex and more error-prone, and a
well-tested existing function becomes quite attractive.
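A possible middle ground (again a sketch, not from the original post)
is to loop with str.find(), which lets the C implementation of the
string type do the scanning and jumps straight to the next candidate
position; like the simple loop, it reports overlapping matches too:

def using_str_find(key, text):
    matches = []
    start = text.find(key)
    while start != -1:
        matches.append((start, start + len(key)))
        # Resume the search one character past the previous hit so that
        # overlapping occurrences are found as well.
        start = text.find(key, start + 1)
    return matches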
To illustrate the difference in performance, I did a simple test (using
the paragraph above as test text):
import re
import timeit

def using_re_finditer(key, text):
    matches = []
    for match in re.finditer(re.escape(key), text):
        matches.append((match.start(), match.end()))
    return matches

def using_simple_loop(key, text):
    matches = []
    for i in range(len(text)):
        if text[i:].startswith(key):
            matches.append((i, i + len(key)))
    return matches
CORPUS = """Searching for a string in another string, in a performant way, is
not as simple as it first appears. Your version works correctly, but slowly.
In some situations it doesn't matter, but in other cases it will. For better
performance, string searching algorithms jump ahead either when they found a
match or when they know for sure there isn't a match for some time (see e.g.
the Boyer–Moore string-search algorithm). You could write such a more
efficient algorithm, but then it becomes more complex and more error-prone.
Using a well-tested existing function becomes quite attractive."""
KEY = 'in'
print('using_simple_loop:',
      timeit.repeat(stmt='using_simple_loop(KEY, CORPUS)', globals=globals(),
                    number=1000))
print('using_re_finditer:',
      timeit.repeat(stmt='using_re_finditer(KEY, CORPUS)', globals=globals(),
                    number=1000))
This does 5 runs of 1000 repetitions each, and reports the time in
seconds for each of those runs.
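As a quick sanity check (not part of the original timing script), one
can first confirm that both functions report the same spans for this
corpus:

assert using_simple_loop(KEY, CORPUS) == using_re_finditer(KEY, CORPUS)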
Result on my machine:
using_simple_loop: [0.13952950000020792, 0.13063130000000456,
0.12803450000001249, 0.13186180000002423, 0.13084610000032626]
using_re_finditer: [0.003861400000005233, 0.004061900000124297,
0.003478999999970256, 0.003413100000216218, 0.0037320000001273]
We find that in this test re.finditer() is more than 30 times faster
(despite the overhead of regular expressions).
While speed isn't everything in programming, with such a large
difference in performance and (to me) no real disadvantages of using
re.finditer(), I would prefer re.finditer() over writing my own.
--
"The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom."
-- Isaac Asimov