Support Desk wrote:
the code I am using is
regex = r'<a href=["|\']([^"|\']+)["|\']>'
that's way too fragile to work with real-life HTML (what if the link has
a TITLE attribute, for example? or contains whitespace after the HREF?)
you might want to consider using a real HTML parser for this task.
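
for illustration, here's a minimal sketch of the parser approach, assuming the standard library's html.parser module (Python 3) is acceptable. the LinkCollector class and extract_links function are hypothetical names made up for this example, not anything from the original code:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # the parser lowercases tag and attribute names for us;
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all href values found in <a> tags of html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# extra attributes, upper case, and trailing whitespace are all tolerated
sample = (
    '<p><A HREF="one.html" TITLE="first">1</A> '
    "<a class=x href='two.html' >2</a></p>"
)
print(extract_links(sample))
```

since the parser hands over attributes as name/value pairs, attribute order, quoting style, and stray whitespace stop mattering.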
page_text = urllib.urlopen('http://somesite.com')
page_text = page_text.read()
links = re.findall(regex, page_text, re.IGNORECASE)
the RE looks fine for the subset of all valid A elements that it can
handle, though.
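
to make that subset concrete, here's a small sketch that runs the posted pattern over a few hand-written tags. the sample strings and the find_links helper are made up for this example:

```python
import re

# the pattern exactly as posted
regex = r'<a href=["|\']([^"|\']+)["|\']>'


def find_links(text):
    """Apply the posted pattern, ignoring case (hypothetical helper)."""
    return re.findall(regex, text, re.IGNORECASE)


samples = [
    '<a href="plain.html">',             # matches
    "<A HREF='upper.html'>",             # matches, thanks to re.IGNORECASE
    '<a href="titled.html" title="x">',  # no match: extra attribute after href
    '<a href="spaced.html" >',           # no match: whitespace before ">"
    '<a class="nav" href="late.html">',  # no match: href is not the first attribute
]

for tag in samples:
    print(tag, '->', find_links(tag))
```

only the first two tags yield a link; the other three come back as empty lists.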
got any examples of pages where you see that behaviour?
</F>
--
http://mail.python.org/mailman/listinfo/python-list