gervaz wrote:
> Hi all, I need to find all the address in a html source page, I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</b>)?</a>'
> but the [^</a>]+ pattern retrieve all the strings not containing <
> or / or a etc, although I just not want the word "</a>". How can I
> specify: 'do not search the string "blabla"?'
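
On the regex question itself: [^</a>] is a character class, so it excludes the single characters <, /, a and >, not the string "</a>". One common way to say "any text that does not contain this string" is a negative lookahead checked before each character. Below is a minimal sketch in Python 3, assuming a made-up sample page (the `page` string and the name `link_re` are illustrations only); the two optional <b> groups are written as non-capturing here, which differs from the original pattern.

```python
import re

# Hypothetical sample page, made up for illustration only.
page = (
    '<a href="http://mysite.com/a/b.html">plain name</a> '
    '<a href="http://mysite.com/c.html"><b>bold name</b></a> '
    '<a href="http://other.com/x.html">elsewhere</a>'
)

# (?:(?!</a>).)+? consumes one character at a time, and only while the
# text at the current position does not start with "</a>".
# The trailing ? makes the repetition lazy, so an optional </b> is left
# for the (?:</b>)? group instead of ending up inside the name.
link_re = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">'
    r'(?:<b>)?(?P<name>(?:(?!</a>).)+?)(?:</b>)?</a>'
)

# Collect (url, name) pairs for every link that points at mysite.com.
links = [(m.group("url"), m.group("name")) for m in link_re.finditer(page)]

for url, name in links:
    print(url, name)
```

This only handles markup shaped exactly like the sample; attributes in a different order, single quotes or nested tags will defeat it, which is the usual argument for a real HTML parser.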

Have you considered BeautifulSoup?

from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

# page holds the HTML source as a string
for a in BeautifulSoup(page)("a"):
    try:
        href = a["href"]
    except KeyError:
        # <a> tag without an href attribute (e.g. a named anchor)
        pass
    else:
        url = urlparse(href)
        # keep only links that point at the wanted host
        if url.hostname == "mysite.com":
            print href

Peter