Gilles Ganault wrote:
> Problem is, when I add re.DOTALL, the search takes less than a second
> for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
> files holding similar contents.
> 
> Why such a huge difference in performance?
> 
> ========= Using Re =============
> import re
> import time
> 
> pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"
> 
> pages = ["500KB.html","1MB.html"]
> 
> #Veeeeeeeeeeery slow when parsing 1MB file !
> p = re.compile(pattern,re.IGNORECASE|re.MULTILINE|re.DOTALL)
> #p = re.compile(pattern,re.IGNORECASE|re.MULTILINE)
> 
> for page in pages:
>       f = open(page, "r") 
>       response = f.read() 
>       f.close()
> 
>       start = time.strftime("%H:%M:%S", time.localtime(time.time()))
>       print "before findall @ " + start
>       packed = p.findall(response)
>       if packed:
>               for item in packed:
>                       print item
> ===========================
> 

I don't know if it'll result in a performance difference, but since you're just 
saving the result of p.findall() to a variable in order to iterate over it, 
you might as well use p.finditer() instead. Note that finditer() yields match 
objects rather than strings, so print the captured group:

        for match in p.finditer(response):
                print match.group(1)

At least then it can start printing as soon as it hits a match instead of 
needing to find all the matches first.
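
For comparison, here is a minimal self-contained sketch of the two calls side 
by side. The sample string is made up for illustration and only stands in for 
the HTML pages; the pattern and flags are the ones from the original post. 
print() is written with parentheses so the snippet should run on either 
Python 2 or Python 3.

```python
import re

# Hypothetical sample text standing in for the HTML pages in the original post.
sample = (
    '<span class="defaut">12:30 first\nshow</span>\n'
    '<span class="defaut">14:45 second\nshow</span>\n'
)

# Same pattern and flags as in the original post, written as a raw string.
p = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
               re.IGNORECASE | re.MULTILINE | re.DOTALL)

# findall() builds the complete list of captured strings before returning.
all_times = p.findall(sample)

# finditer() returns a lazy iterator of match objects, so each result can be
# handled as soon as it is found; group(1) is the captured time.
lazy_times = []
for match in p.finditer(sample):
    lazy_times.append(match.group(1))
    print(match.group(1))
```

Both approaches find the same matches; the only difference is when the 
results become available, so this is unlikely to change the total search time.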

-Jay
-- 
http://mail.python.org/mailman/listinfo/python-list
