Paul Boddie wrote:
> Ravi Teja wrote:
> >
> > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> > web pages in general.
>
> import libxml2dom
> import urllib
> f = urllib.urlopen("http://wiki.python.org/moin/")
> s = f.read()
> f.close()
> # s contains HTML not XML text
> d = libxml2dom.parseString(s, html=1)
> # get the community-related links
> for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
>     print label.nodeValue

I wasn't aware that your module does HTML as well.

> Of course, lxml should be able to do this kind of thing as well. I'd be
> interested to know why this "is not a good idea", though.

No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving to be regarded as compliant,
and much of the HTML out there is not well-formed.

-- 
http://mail.python.org/mailman/listinfo/python-list
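To illustrate the point about strict versus forgiving parsers, here is a minimal sketch. It assumes Python 3 and uses only the standard library (xml.etree.ElementTree as the strict XML parser, html.parser as the tolerant one), not libxml2dom or lxml. The SOUP string and the helper names strict_xml_parses, LinkText and link_labels are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical "tag soup": unquoted attribute values, unclosed <li> and <br>.
SOUP = "<ul><li><a href=community>Community</a><br><li><a href=docs>Docs</a></ul>"


def strict_xml_parses(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # A compliant XML parser may simply reject input that is not well-formed.
        return False


class LinkText(HTMLParser):
    """Collect the text found inside <a> elements, tolerating tag soup."""

    def __init__(self):
        super().__init__()
        self.in_a = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.labels.append(data)


def link_labels(text):
    """Feed the text to the tolerant HTML parser and return the link labels."""
    parser = LinkText()
    parser.feed(text)
    parser.close()
    return parser.labels


print(strict_xml_parses(SOUP))  # the strict parser rejects the soup
print(link_labels(SOUP))        # the HTML parser still recovers the labels
```

The strict parser stops at the first unquoted attribute, while the HTML parser recovers both link labels. That difference is presumably why the html=1 switch in the quoted code matters for pages that are not well-formed.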