[EMAIL PROTECTED] wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!

> I can't validate it and xml.dom.minidom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope of avoiding regular expression hacks to extract the data from this
> page?

The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

    import libxml2dom
    import urllib

    # Fetch the page and let libxml2's forgiving HTML parser build the DOM.
    f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
    d = libxml2dom.parse(f, html=1)
    f.close()

You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to work
with the given document. Other tools may give acceptable results,
however.

Paul
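
For the "other tools" route, here is a rough sketch using only the standard library's html.parser module (Python 3), which also tolerates ill-formed markup. This is not the libxml2dom recipe above. The CellCollector class, the extract_cells helper and the SAMPLE markup are all made up for illustration; the real page's structure will differ, so treat this as a starting point only.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new cell starts here, even if the previous <td> was never closed.
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def extract_cells(html_text):
    """Return the stripped, non-empty text of all table cells in html_text."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    return [cell.strip() for cell in parser.cells if cell.strip()]


# Invented, deliberately ill-formed sample: unclosed <td>/<tr>, unquoted attribute.
SAMPLE = (
    "<table><tr><td class=label>Last Price<td>36.12"
    "<tr><td>Volume</td><td>1.2 Mil</table>"
)

if __name__ == "__main__":
    print(extract_cells(SAMPLE))
```

To try it on a live page, fetch the markup first (for example with urllib.request.urlopen), decode it to a string and pass that to extract_cells. No regular expressions are needed either way.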