On 05/06/2022 at 16:52, John Carter wrote:
I'd like to propose a simple addition to the 'fnmatch' module: specifically, a
function that returns a list of names that match a pattern AND a list of those
that don't.
In a recent project I found that I wished to split a list of files and move
those that matched a pattern to one folder and those that didn't to another
folder. I was using fnmatch.filter to select the first set of files and then a
list comprehension to generate the second set.
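For context, the selection step looked roughly like this (the directory name and
pattern here are only illustrative):

import os
import fnmatch

names = os.listdir('incoming')                       # all candidate file names
matched = fnmatch.filter(names, '*.log')             # names matching the pattern
unmatched = [n for n in names if n not in matched]   # everything else -- the slow part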
For a small number of files (~10) this was perfectly adequate. However, as I needed
to process many files (>>10000), the execution time was very significant.
Profiling the code showed that the time was spent in generating the second set. I
tried a number of solutions, including designing a negative filter, walking the file
system to find the files that had not yet been moved, and ever more convoluted ways
to improve the second selection. Eventually I gave in and hacked a local copy of
fnmatch.py as below:
def split(names, pat):
    """Return a list of the names in NAMES that match PAT,
    and a second list of those that do not."""
    result = []
    notresult = []
    pat = os.path.normcase(pat)
    pattern_match = _compile_pattern(pat)
    if os.path is posixpath:
        # normcase on posix is a no-op. Optimize it away from the loop.
        for name in names:
            if pattern_match(name):
                result.append(name)
            else:
                notresult.append(name)
    else:
        for name in names:
            if pattern_match(os.path.normcase(name)):
                result.append(name)
            else:
                notresult.append(name)
    return result, notresult
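For illustration, with split available the whole job then reduces to one call plus
the moves; a minimal sketch (the directory names and pattern are hypothetical):

import os
import shutil
from fnmatch import split   # only present in the locally hacked copy

matched, unmatched = split(os.listdir('incoming'), '*.log')
for name in matched:
    shutil.move(os.path.join('incoming', name), 'matched')
for name in unmatched:
    shutil.move(os.path.join('incoming', name), 'unmatched')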
The change is simply the addition of else clauses to the 'if pattern_match(...)'
statements. This solved the problem, and benchmarking showed that it only took a
very small additional time (about 20 ms for a million strings) to generate both lists.
Number of test cases: 1000000
Example data: ['Ba1txmKkiC', 'KlJx.f_AGj', 'Umwbw._Wa9', '4YlgA5LVpI']
Search pattern: '*A*'

Test              Time (sec)    Positive    Negative
WCmatch.filter      1.953125       26211           0
filter              0.328125       14259           0
split               0.343750       14259       85741
List comp.        270.468751       14259       85741
The list comprehension, [x for x in a if x not in b]*, was nearly 900 times
slower.
'fnmatch' was an appropriate solution to this problem, as typing 'glob'-style
search patterns was easier than having to enter regular expressions when
prompted by my code.
I would like to propose that split, even though it is very simple, be included
in the 'fnmatch' module.
John
*a is the original and b is those that match.
Searching an element in a list scans the elements
one by one, so [x for x in a if x not in b]
has quadratic complexity. Try list(set(a) - set(b))
instead.
>>> import timeit
>>> timeit.timeit('[x for x in a if x not in b]',
...               setup='a = list(range(10_000)); b = list(range(1000))', number=100)
5.258457429998089
>>> timeit.timeit('list(set(a)-set(b))',
...               setup='a = list(range(10_000)); b = list(range(1000))', number=100)
0.07163406799372751
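One caveat worth noting: set(a) - set(b) does not preserve the original order of a
(and drops duplicates). If that matters, the set can be used purely for the membership
test while keeping the comprehension; a minimal sketch with made-up example data:

import fnmatch

a = ['spam.log', 'eggs.txt', 'spam.txt', 'ham.log']   # example names
b = fnmatch.filter(a, '*.log')                        # the matched subset
b_set = set(b)                                        # O(1) average membership tests
unmatched = [x for x in a if x not in b_set]          # linear, order of a preserved
print(unmatched)                                      # ['eggs.txt', 'spam.txt']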
Jean