Support Desk wrote:

> the code I am using is
> regex = r'<a href=["|\']([^"|\']+)["|\']>'

that's way too fragile to work with real-life HTML (what if the link has a TITLE attribute, for example? or contains whitespace after the HREF?)
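
to make the fragility concrete, here's a small sketch that runs the quoted pattern against a few hand-written anchor tags. the sample tags and the example.com URLs are made up for illustration and are not from the original poster's page.

```python
import re

# The pattern quoted above, unchanged. Note that "|" inside a character
# class is a literal pipe, not alternation.
regex = r'<a href=["|\']([^"|\']+)["|\']>'

# Hypothetical sample tags (not taken from the original poster's page).
samples = [
    '<a href="http://example.com/a">',             # bare HREF only
    '<a href="http://example.com/b" title="B">',   # extra TITLE attribute
    '<a href="http://example.com/c" >',            # whitespace after the HREF
    "<A HREF='http://example.com/d'>",             # upper case, single quotes
]

found = []
for tag in samples:
    found.extend(re.findall(regex, tag, re.IGNORECASE))

# Only the first and last tags are matched; the TITLE and trailing
# whitespace cases are silently skipped.
print(found)
```

the pattern requires the closing quote to be followed immediately by ">", so anything between the HREF value and the end of the tag makes the whole tag invisible to it.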

you might want to consider using a real HTML parser for this task.
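
a minimal sketch of the parser approach, assuming Python 3 and the standard library's html.parser module (the code quoted in this thread is Python 2, where the same class lives in the HTMLParser module). the names LinkCollector and extract_links are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the HREF value of every A start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <A HREF=...>
        # and <a href=...> are handled the same way.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the HREF values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical input: extra attributes and trailing whitespace are fine here.
sample = (
    '<A HREF="http://example.com/b" title="B">b</A> '
    "<a class=\"c\" href='http://example.com/c' >c</a>"
)
print(extract_links(sample))
```

the parser takes care of attribute order, quoting style and whitespace, so the code only has to say which tag and attribute it wants.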

> page_text = urllib.urlopen('http://somesite.com')
> page_text = page_text.read()
>
> links = re.findall(regex, text, re.IGNORECASE)

the RE looks fine for the subset of all valid A elements that it can handle, though.

got any examples of pages where you see that behaviour?

</F>

--
http://mail.python.org/mailman/listinfo/python-list
