Hi All,

I'm writing a spider program. I need to visit several URLs and extract information from the HTML source, including links. So far I have been using FancyURLopener together with my own function that extracts the links from an HTML page. The problem is that I constantly have to change that function, because the markup varies so much from site to site. Some sites use lower-case tag names, others upper-case. Some quote attribute values (href="page.html"), others do not (href=page.html). I have even found unclosed quotations (<a href="page.html>), <a tags that are opened twice and never closed, and so on. There are many kinds of malformed HTML pages out there, and it seems I am not able to handle all of them myself.

The question: is there a good Python library for extracting links and images from (possibly malformed) HTML source code, something like the "references" function in Lynx? I need to handle both relative and absolute references, and I also need the anchor text and the position of the anchor inside the HTML source file.
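To make the requirement concrete, here is a rough sketch of the kind of result I am after, written against the standard library only (html.parser and urllib.parse.urljoin). The names LinkExtractor and extract_links and the example.com base URL are my own placeholders, not part of any existing library. It copes with upper-case tags and unquoted attribute values, but I have not verified how it treats truly broken input such as an unclosed quotation mark, and it ignores images, so please read it as an illustration of the wanted output rather than a solution.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects (href, absolute_url, text, position)."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self._current = None  # the link being collected, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        self._finish()  # a new <a> implicitly ends an unclosed one
        self._current = {
            "href": href,
            "url": urljoin(self.base_url, href),  # resolve relative refs
            "text": [],
            "pos": self.getpos(),  # (line number, column offset) of the tag
        }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self._finish()

    def close(self):
        super().close()
        self._finish()  # flush a link that was never closed

    def _finish(self):
        if self._current is None:
            return
        c = self._current
        text = " ".join("".join(c["text"]).split())  # collapse whitespace
        self.links.append((c["href"], c["url"], text, c["pos"]))
        self._current = None


def extract_links(html, base_url):
    """Return a list of (href, absolute_url, anchor_text, (line, column))."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<A HREF=page.html>Sample link</A>\n<a href="/top.html">Top</a>'
    for link in extract_links(page, "http://example.com/dir/index.html"):
        print(link)
```

Note that the position reported here is a (line, column) pair as returned by getpos(), not an absolute character offset into the file; converting one into the other would be an extra step.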
For example, this malformed link:

<a href="page.html>Sample link</a>

could be converted to:

['page.html', 'http://samplesite.current_location/page.html', 'Sample link']

Thanks in advance,

Les

-- 
http://mail.python.org/mailman/listinfo/python-list