On 18Jul2018 17:40, Larry Martell <larry.mart...@gmail.com> wrote:
On Tue, Jul 17, 2018 at 11:43 AM, Neil Cerutti <ne...@norwich.edu> wrote:
On 2018-07-16, Larry Martell <larry.mart...@gmail.com> wrote:
I had some code that did this:

meas_regex = '_M\d+_'
meas_re = re.compile(meas_regex)

if meas_re.search(filename):
    stuff1()
else:
    stuff2()

I then had to change it to this:

if meas_re.search(filename):
    if 'MeasDisplay' in filename:
        stuff1a()
    else:
        stuff1()
else:
    if 'PatternFov' in filename:
        stuff2a()
    else:
        stuff2()

This code needs to process many tens of 1000's of files, and it
runs often, so it needs to run very fast. Needless to say, my
change has made it take 2x as long. Can anyone see a way to
improve that?

As others have mentioned, your stuff*() function must be doing very little work, because I'd expect the regexp stuff to be fairly quick.

Yeah, that was my first thought, but I haven't been able to come up
with a regex that works.

There are 4 cases I need to detect:

case1 = 'spam_M123_eggs_MeasDisplay_sausage'
case2 = 'spam_M123_eggs_sausage_and_spam'
case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
case4 = 'spam_spam_spam_eggs_sausage_and_spam'

I thought this regex would work:

'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'

Did you try making that a raw string:

 r'(......}'

to avoid mangling the backslashes (which Python will interpret before they get to the regexp parser)?

Print meas_regex to check it got past Python intact. Just print(meas_regex).

Also, "{0,1}" is usually written "?".
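
A minimal sketch of both points, for illustration only. It uses "\b" rather than "\d" because "\b" is an escape that Python itself recognises (backspace), so the mangling is easy to see; "\d" happens to pass through a normal string unchanged, though recent Python versions warn about it. The names plain, raw, long_form and short_form are made up for this example.

```python
import re

# In a normal string literal Python turns "\b" into a backspace character
# before the regexp parser ever sees it.
plain = '\bspam\b'
raw = r'\bspam\b'

# The raw version keeps both backslashes, so it is two characters longer.
print(len(plain), len(raw))

# Only the raw version means "word boundary" to the regexp parser.
plain_hit = re.search(plain, 'eggs spam eggs') is not None
raw_hit = re.search(raw, 'eggs spam eggs') is not None
print(plain_hit, raw_hit)

# "{0,1}" and "?" are two spellings of the same quantifier.
long_form = re.search(r'(_M\d+_){0,1}', '_M123_eggs')
short_form = re.search(r'(_M\d+_)?', '_M123_eggs')
print(long_form.group(1), short_form.group(1))
```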

And then I could look at the match objects and see which of the 4
cases it was. But try as I might, I could not get it to work. Any
regex gurus want to tell me what I am doing wrong here?

Backslashes aside, it looks ok to me. So I'd better run it... Code:

   from __future__ import print_function
   import re

   case1 = 'spam_M123_eggs_MeasDisplay_sausage'
   case2 = 'spam_M123_eggs_sausage_and_spam'
   case3 = 'spam_spam_spam_PatternFov_eggs_sausage_and_spam'
   case4 = 'spam_spam_spam_eggs_sausage_and_spam'

   meas_regex = r'(_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}'
   print("meas_regex =", meas_regex)

   meas_re = re.compile(meas_regex)

   for case in case1, case2, case3, case4:
     print(case, end=" ")
     m = meas_re.search(case)
     if m:
       print("MATCH: group1 =", m.group(1), "group2 =", m.group(2))
     else:
       print("NO MATCH")

Output:

   meas_regex = (_M\d+_){0,1}.*?(MeasDisplay|PatternFOV){0,1}
   spam_M123_eggs_MeasDisplay_sausage MATCH: group1 = None group2 = None
   spam_M123_eggs_sausage_and_spam MATCH: group1 = None group2 = None
   spam_spam_spam_PatternFov_eggs_sausage_and_spam MATCH: group1 = None group2 = None
   spam_spam_spam_eggs_sausage_and_spam MATCH: group1 = None group2 = None

Ah, and there's the problem. Though I'm surprised to get the Nones in the .group()s instead of the empty string; possibly that reflects "0 occurrences". [...] A little testing with other tweaks to the regexp supports that. No matter. To your problem:

When you write "(_M\d+_){0,1}" or anything that is optional like that, it can match the empty string (the "0"). And that _always_ matches.

Likewise the second part of the pattern.
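
The effect can be seen in isolation with a small sketch. The target string and the names optional_re and mandatory_re are made up for this example; only the "_M\d+_" group comes from the pattern under discussion.

```python
import re

optional_re = re.compile(r'(_M\d+_)?')
mandatory_re = re.compile(r'(_M\d+_)')

target = 'spam_M123_eggs'

m_opt = optional_re.search(target)
# The optional group succeeds immediately at offset 0 by matching nothing,
# so search() never moves on to where "_M123_" actually sits.
print(m_opt.span(), repr(m_opt.group(0)), m_opt.group(1))

m_man = mandatory_re.search(target)
# Without the "?", search() has to keep scanning until the text is found.
print(m_man.span(), m_man.group(1))
```

The optional form reports an empty match at the very start of the string with group 1 unset, which is where the Nones in the output above come from.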

Because you want to know about _both_ the "_M\d+_" _and_ the "MeasDisplay|PatternFOV" you can't put them both in the same pattern: if you make them optional, the pattern always matches the empty string even if the target is later on; if you make them mandatory (no "{0,1}") your pattern will only work when both are present.

Similar pitfalls apply for any combination, making one optional and the other mandatory: you can't do all 4 possibilities (neither, just the first, just the second, both) with one regex (== one match/search test).

So your code was already optimal.
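
For reference, the two-test approach can be written as one small function, sketched here under some assumptions. The function name classify and the four labels are hypothetical, standing in for the stuff*() calls. The substring test uses "PatternFov" as spelled in the filenames and the original code, whereas the combined pattern above spells it "PatternFOV".

```python
import re

meas_re = re.compile(r'_M\d+_')

def classify(filename):
    """Return a made-up label for which of the four cases a filename is."""
    if meas_re.search(filename):
        # Measurement file: with or without the MeasDisplay marker.
        return 'meas+display' if 'MeasDisplay' in filename else 'meas'
    # Not a measurement file: with or without the PatternFov marker.
    return 'pattern' if 'PatternFov' in filename else 'plain'

cases = [
    'spam_M123_eggs_MeasDisplay_sausage',
    'spam_M123_eggs_sausage_and_spam',
    'spam_spam_spam_PatternFov_eggs_sausage_and_spam',
    'spam_spam_spam_eggs_sausage_and_spam',
]
labels = [classify(c) for c in cases]
print(labels)
```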

I am surprised that your program took twice as long to run with your doubled test though. These are filenames, yes? So shouldn't the stuff*() functions be opening the file or something? I would expect that to dominate the runtime, and your extra name testing not to be the slowdown.

What's going on inside the stuff*() functions? Might they also have become more complex with your new cases?
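
One rough way to check where the time goes is to time the name tests on their own, without any stuff*() call. This is only a sketch: the function name_tests, the sample filename and the repeat count are made up for this example, and the figure it prints will vary from machine to machine.

```python
import re
import timeit

meas_re = re.compile(r'_M\d+_')
filename = 'spam_M123_eggs_MeasDisplay_sausage'

def name_tests():
    # Only the name checks from the original code, with no stuff*() call.
    if meas_re.search(filename):
        return 'MeasDisplay' in filename
    return 'PatternFov' in filename

n = 100000
elapsed = timeit.timeit(name_tests, number=n)
per_call = elapsed / n
print("name tests: %.3g seconds per filename" % per_call)
```

Comparing that figure with the time a single stuff*() call takes should show whether the extra name test can plausibly account for the slowdown.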

Cheers,
Cameron Simpson <c...@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
