In article <[EMAIL PROTECTED]>, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.dom.minidom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope of avoiding regular expression hacks to extract the data from this
> page?

Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/

Or BeautifulSoup, as others have suggested.

--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more
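
P.S. If you'd rather see the idea than take it on faith, here is a
minimal sketch of the "tolerant parser instead of regexes" approach.
It assumes Python 3 and uses the standard library's html.parser module
rather than HTMLData or BeautifulSoup, so it says nothing about how
those libraries work inside. The CellCollector class, the extract_cells
helper and the SAMPLE markup are all made up for illustration; SAMPLE is
not taken from the MSN page.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.cells = []    # text of each finished cell
        self._buf = None   # text pieces of the open cell, or None

    def _flush(self):
        # Close the currently open cell, if there is one.
        if self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()  # a new <td> implicitly closes the previous one
            self._buf = []

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


def extract_cells(markup):
    """Return the text of all <td> cells found in markup (hypothetical helper)."""
    parser = CellCollector()
    parser.feed(markup)
    parser.close()
    parser._flush()  # pick up a cell left open at end of input
    return parser.cells


# Made-up ill-formed fragment: unclosed <td>/<tr> tags, unquoted attribute.
SAMPLE = ("<table><tr><td class=label>Last Price<td>36.50"
          "<tr><td>Volume<td>1.2 Mil</table>")
print(extract_cells(SAMPLE))
```

xml.dom.minidom gives up on input like SAMPLE because it is not
well-formed XML; an event-driven HTML parser just reports the tags it
sees and leaves the cleanup to you. For a real page you would still
have to work out which cells hold the data you want.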