On Sat, 21 Jul 2007 22:18:56 -0400, Carsten Haese <[EMAIL PROTECTED]> wrote:
>That's your problem right there. RE is not the right tool for that job.
>Use an actual HTML parser such as BeautifulSoup

Thanks a lot for the tip. I tried it, and it does look interesting, although I've been unsuccessful using a regex with BS to find all occurrences of the pattern.

Incidentally, as far as using re alone is concerned, it appears that re.MULTILINE isn't enough to get "." to match across newlines: re.DOTALL must be added.

Problem is, when I add re.DOTALL, the search takes less than a second for a 500KB file... and about 1 min 30 s for a file that's 1MB, with both files holding similar contents. Why such a huge difference in performance?

========= Using Re =============
import re
import time

# Raw string so that \d reaches the regex engine unchanged
pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"

pages = ["500KB.html", "1MB.html"]

# Veeeeeeeeeeery slow when parsing 1MB file !
p = re.compile(pattern, re.IGNORECASE | re.MULTILINE | re.DOTALL)
#p = re.compile(pattern, re.IGNORECASE | re.MULTILINE)

for page in pages:
    # Read the whole page into memory
    f = open(page, "r")
    response = f.read()
    f.close()

    # Print the wall-clock time just before the search starts
    start = time.strftime("%H:%M:%S", time.localtime(time.time()))
    print("before findall @ " + start)

    # findall returns the list of captured hh:mm strings
    packed = p.findall(response)
    if packed:
        for item in packed:
            print(item)
===========================

Thank you.

-- 
http://mail.python.org/mailman/listinfo/python-list
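
A minimal sketch of the flag difference described above, run on a made-up two-line snippet. The sample string is hypothetical, not taken from the actual files, and the code is written for Python 3. It only illustrates that re.MULTILINE changes how ^ and $ behave, while re.DOTALL is what lets "." match a newline. It does not try to explain the slowdown.

```python
import re

# Hypothetical sample: the time and the closing tag sit on different lines.
html = '<span class="defaut">12:34\n some text</span>'

pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"

# re.MULTILINE only affects ^ and $; "." still stops at a newline,
# so ".*?" cannot reach "</span>" on the next line.
multiline_only = re.findall(pattern, html, re.IGNORECASE | re.MULTILINE)

# re.DOTALL lets "." match newlines, so the match can span both lines.
with_dotall = re.findall(pattern, html, re.IGNORECASE | re.DOTALL)

print(multiline_only)
print(with_dotall)
```

With this sample, only the re.DOTALL version finds the time, because the closing tag is on a different line than the opening one.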