To complement what Peter wrote: I'd approach this problem using XPath. XPath is a query language for XML/HTML documents, and it's a great tool to have in your web scraping toolbox (among other things). Python's excellent lxml library lets you evaluate XPath expressions right from Python. Here's how I might tackle this problem:
== [ scrape.py ] ======================================================
from lxml import etree

# ...somehow get HTML/XML into the variable xml
root = etree.HTML(xml)

hrefs = root.xpath("//a[@href and starts-with(@href, 'http://')]/@href")
# magic =========> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

print(hrefs)  # if you want to see what this looks like
== [ end scrape.py ] ==================================================

The argument to the xpath method here is an XPath expression. Its overall form is:

    //a[.....]/@href

The '//a' at the beginning means: starting at the root node of the document, find all a (anchor) elements that match the condition specified by '.....'. The '/@href' at the end means: give me the href attribute of the nodes (if any) that remain.

Looking inside the square brackets (what's known as the predicate in the XPath world), we find:

    @href and starts-with(@href, 'http://')

The 'and' bit should be clear: there are two conditions, and both need to hold. The first part says that the a element must have an href attribute. The second part says that the value of that href attribute must start with 'http://'.

In fact, we could simplify the predicate to just:

    starts-with(@href, 'http://')

If an element does not even have an href attribute, its (empty) value does not start with 'http://', so it gets filtered out anyway. No error occurs, and no exception is thrown, when the XPath evaluator applies the starts-with function to an a element that has no href attribute.
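If you want to convince yourself of that, here is a minimal sketch under the same assumptions as scrape.py above: it reuses the root variable from that script, and the file name and the variable hrefs_simple are only for illustration.

== [ scrape_simplified.py ] ===========================================
# Same query as before, but with the simplified predicate: a elements
# without an href attribute fail starts-with() and drop out on their own.
hrefs_simple = root.xpath("//a[starts-with(@href, 'http://')]/@href")

print(hrefs_simple)  # should print the same list of links as before
== [ end scrape_simplified.py ] =======================================

Hope this helps.

Best regards,

Jesse

--
Jesse Alama
http://xml.sh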