Re: [2.5] Regex doesn't support MULTILINE?

Gilles Ganault Sat, 21 Jul 2007 22:02:30 -0700

On Sat, 21 Jul 2007 22:18:56 -0400, Carsten Haese
<[EMAIL PROTECTED]> wrote:
>That's your problem right there. RE is not the right tool for that job.
>Use an actual HTML parser such as BeautifulSoup


Thanks a lot for the tip. I tried it, and it does look interesting,
although I've been unsuccessful using a regex with BS to find all
occurrences of the pattern.
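
For comparison, here is a minimal sketch of the parser-based approach. It
swaps in the standard library's html.parser (Python 3; the module is named
HTMLParser in Python 2.5) rather than BeautifulSoup, so it is not the API
suggested above. The class name SpanTimes and the sample markup are
hypothetical, and it assumes the time is the first text inside each span.

```python
import re
from html.parser import HTMLParser


class SpanTimes(HTMLParser):
    """Collect the leading HH:MM text of every <span class="defaut">."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.times = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        css_class = dict(attrs).get("class") or ""
        if tag == "span" and css_class.lower() == "defaut":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            m = re.match(r"\s*(\d+:\d+)", data)
            if m:
                self.times.append(m.group(1))


# Hypothetical sample; the real pages from this post are not available.
sample = ('<span class="defaut">12:30 foo</span>\n'
          '<span class="defaut">08:15\nbar</span>')

parser = SpanTimes()
parser.feed(sample)
parser.close()
print(parser.times)
```

The regex here only inspects the text of one span at a time, so it never
has to scan across the whole file.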

Incidentally, as far as using re alone is concerned, it appears that
re.MULTILINE isn't enough to get "." to match newlines: re.DOTALL
must be added.
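
A small sketch of the difference between the two flags, using a made-up
two-line string: re.MULTILINE only changes where "^" and "$" may match,
while re.DOTALL is the flag that lets "." match a newline.

```python
import re

text = "first line\nsecond line"

# MULTILINE: "^" matches at the start of every line, not just the string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# MULTILINE alone: "." still stops at "\n", so this cannot span two lines.
no_dotall = re.search(r"first.*second", text, re.MULTILINE)

# DOTALL: "." also matches "\n", so the pattern can cross the line break.
with_dotall = re.search(r"first.*second", text, re.DOTALL)

print(line_starts)
print(no_dotall)
print(with_dotall.group(0) if with_dotall else None)
```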

Problem is, when I add re.DOTALL, the search takes less than a second
for a 500KB file... and about 1 min 30 s for a file that's 1MB, with
both files holding similar contents.

Why such a huge difference in performance?

========= Using Re =============
import re
import time

# Raw string, so that "\d" reaches the regex engine unchanged
pattern = r"<span class=.?defaut.?>(\d+:\d+).*?</span>"

pages = ["500KB.html","1MB.html"]

#Veeeeeeeeeeery slow when parsing 1MB file !
p = re.compile(pattern,re.IGNORECASE|re.MULTILINE|re.DOTALL)
#p = re.compile(pattern,re.IGNORECASE|re.MULTILINE)

for page in pages:
        f = open(page, "r") 
        response = f.read() 
        f.close()

        start = time.strftime("%H:%M:%S", time.localtime(time.time()))
        print "before findall @ " + start
        packed = p.findall(response)
        if packed:
                for item in packed:
                        print item
===========================
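
One possible cause of the slowdown, not confirmed in this thread: with
re.DOTALL, ".*?" may run past line ends, so wherever the opening tag and
the time match but "</span>" is far away or missing, the engine scans
ahead, possibly to the end of the file, before giving up at that position.
The sketch below assumes the span holds plain text only and replaces ".*?"
with "[^<]*", which matches newlines without re.DOTALL and stops at the
next "<". The sample string is hypothetical.

```python
import re

# Hypothetical sample; the real 500KB and 1MB files are not available.
html = ('<span class="defaut">12:30 foo</span>\n'
        '<span class="defaut">08:15\nbar</span>')

# Pattern from the script above: lazy ".*?" that may cross newlines.
lazy = re.compile(r"<span class=.?defaut.?>(\d+:\d+).*?</span>",
                  re.IGNORECASE | re.DOTALL)

# Variant: "[^<]*" cannot run past the next tag, so a failed attempt
# is abandoned quickly instead of scanning the rest of the input.
bounded = re.compile(r"<span class=.?defaut.?>(\d+:\d+)[^<]*</span>",
                     re.IGNORECASE)

lazy_hits = lazy.findall(html)
bounded_hits = bounded.findall(html)
print(lazy_hits)
print(bounded_hits)
```

The two patterns are not equivalent: the bounded one will not match a span
that contains nested tags before its closing "</span>".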

Thank you.
-- 
http://mail.python.org/mailman/listinfo/python-list
