Gilles Ganault wrote:
> Problem is, when I add re.DOTALL, the search takes less than a second
> for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both
> files holding similar contents.
>
> Why such a huge difference in performance?
>
> ========= Using Re =============
> import re
> import time
>
> pattern = "<span class=.?defaut.?>(\d+:\d+).*?</span>"
>
> pages = ["500KB.html","1MB.html"]
>
> #Veeeeeeeeeeery slow when parsing 1MB file !
> p = re.compile(pattern,re.IGNORECASE|re.MULTILINE|re.DOTALL)
> #p = re.compile(pattern,re.IGNORECASE|re.MULTILINE)
>
> for page in pages:
>     f = open(page, "r")
>     response = f.read()
>     f.close()
>
>     start = time.strftime("%H:%M:%S", time.localtime(time.time()))
>     print "before findall @ " + start
>     packed = p.findall(response)
>     if packed:
>         for item in packed:
>             print item
> ===========================
>
I don't know whether it will make a performance difference, but since
you're only saving the result of findall() in a variable in order to
iterate over it, you might as well use finditer() instead:

    for match in p.finditer(response):
        print match.group(1)

Note that finditer() yields match objects rather than strings, hence
the group(1) to get the captured time. At least then it can start
printing as soon as it hits a match instead of needing to find all the
matches first.

-Jay
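
For anyone who wants to try the two calls side by side, here is a minimal sketch in Python 3 syntax (the quoted script is Python 2). The pattern is the one from the original post, written as a raw string; the `SAMPLE` text and the function names `times_eager` and `times_lazy` are made up for illustration and stand in for the real HTML files. It only shows the difference in how results are delivered and makes no claim about the slowdown itself.

```python
import re

# Pattern from the quoted post, as a raw string so the backslashes are literal.
PATTERN = re.compile(
    r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical sample input standing in for 500KB.html / 1MB.html.
SAMPLE = (
    '<span class="defaut">12:30 first\nline</span>\n'
    "<SPAN class='defaut'>08:05</SPAN>\n"
)


def times_eager(text):
    """Build the complete list of captured times before returning."""
    return PATTERN.findall(text)


def times_lazy(text):
    """Yield each captured time as soon as its match is found."""
    for match in PATTERN.finditer(text):
        # finditer() yields match objects, so pull out group 1 explicitly.
        yield match.group(1)


if __name__ == "__main__":
    for item in times_lazy(SAMPLE):
        print(item)
```

Both functions return the same values; the generator version just hands over the first one without scanning the rest of the input first.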