Laszlo Zsolt Nagy wrote:

> [...]
> For example this malformed link:
>
> <a href="page.html>Sample link</a>
>
> could be converted to:
>
> ['page.html','http://samplesite.current_location/page.html','Sample link']

Your options AFAIK are:

* Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/)
* Various implementations of tidy (uTidyLib, mxTidy)
* XIST (http://www.livinglogic.de/Python/xist)

For XIST, code that extracts the above info from an HTML page looks like this:

--------
import sys

from ll import url
from ll.xist import parsers
from ll.xist.ns import html

def links(u):
    # Parse the page; tidy=True runs the source through tidy to clean it up
    node = parsers.parseURL(u, tidy=True, base=None)
    # Iterate over all <a> elements in the tree
    for x in node//html.a:
        # (href as written, href resolved against the page URL, link content)
        yield str(x["href"]), str(u/str(x["href"])), unicode(x)

for data in links(url.URL(sys.argv[1])):
    print data
--------

This outputs something like:

('http://www.python.org/', 'http://www.python.org/', u'\r\n ')
('http://www.python.org/search/', 'http://www.python.org/search/', u'Search')
('http://www.python.org/download/', 'http://www.python.org/download/', u'Download')
('http://www.python.org/doc/', 'http://www.python.org/doc/', u'Documentation')
...

Hope that helps,
   Walter Dörwald
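
P.S. For comparison, here is a rough sketch of the same (href, absolute URL, link text) extraction using only the Python 3 standard library (html.parser and urllib.parse.urljoin). It is not part of XIST or any of the packages listed above; the names LinkCollector and extract_links are made up for this illustration. Note also that html.parser does not run tidy or try to repair broken markup, so a link with an unterminated quote like the one in the question may not be recovered this way. Treat it as an illustration of how the triples are built, not as a replacement for a tolerant parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, absolute_url, text) per <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text chunks seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Start collecting text for this link
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text)
            # urljoin resolves relative hrefs against the page URL
            absolute = urljoin(self.base_url, self._href)
            self.links.append((self._href, absolute, text))
            self._href = None


def extract_links(html_text, base_url):
    """Return a list of (href, absolute_url, text) tuples for html_text."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links
```

Calling extract_links(page_source, page_url) on a well-formed page returns one tuple per link, in document order, with relative hrefs resolved against page_url.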