On Mon, 17 Sep 2007 17:31:19 -0300, <[EMAIL PROTECTED]> wrote:

> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this. I thought I could use the minidom module to do it,
> but all I get is a screwy traceback:
>
> Traceback (most recent call last):
>   File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
>     parser.Parse(buffer, 0)
> ExpatError: mismatched tag: line 1, column 357

So your HTML is not a well-formed XML document, like many HTML pages, and you can't use an XML parser on it. (Even a valid HTML document may not be valid XML.) Let's try with some mismatched tags:

py> text = '''<html>
... <body>
... <p>lots of <div>screwy text including divs and <span>spans</p>
... <Row status="o">
... <RecordNum>1126264</RecordNum>
... <Make>Mitsubishi</Make>
... <Model>Mirage DE</Model>
... </Row>
... </body>
... </html>'''
py>
py> import xml.dom.minidom
py> doc = xml.dom.minidom.parseString(text)
Traceback (most recent call last):
  ...
xml.parsers.expat.ExpatError: mismatched tag: line 3, column 60

You will need a more robust parser, like BeautifulSoup
<http://www.crummy.com/software/BeautifulSoup/>

py> from BeautifulSoup import BeautifulSoup
py> soup = BeautifulSoup(text)
py> for row in soup.findAll("row"):
...   print row.recordnum, row.make.contents, row.model.string
...
<recordnum>1126264</recordnum> [u'Mitsubishi'] Mirage DE

Depending on your document, you may prefer to extract the XML blocks using BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML parser) or xml.etree.ElementTree.

-- 
Gabriel Genellina
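
For the last suggestion (pull out the XML blocks first, then hand each one to a real XML parser), here is a rough sketch using only the standard library. Note the assumptions: it targets Python 3, and a regular expression stands in for the BeautifulSoup extraction step, which only works if each <Row> block is itself well formed and rows are not nested. ROW_BLOCK and extract_rows are names made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Same sample document as in the session above: broken HTML around a clean <Row> block.
text = '''<html>
<body>
<p>lots of <div>screwy text including divs and <span>spans</p>
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>'''

# Hypothetical helper pattern: a non-greedy match from "<Row" to the first "</Row>".
# Assumes rows are not nested and that the tag is spelled exactly "Row".
ROW_BLOCK = re.compile(r"<Row\b.*?</Row>", re.DOTALL)


def extract_rows(html):
    """Return one dict per <Row> block, parsed with ElementTree."""
    rows = []
    for block in ROW_BLOCK.findall(html):
        elem = ET.fromstring(block)  # raises ParseError if the block is not well formed
        record = {child.tag: child.text for child in elem}
        record["status"] = elem.get("status")
        rows.append(record)
    return rows


rows = extract_rows(text)
for row in rows:
    print(row["RecordNum"], row["Make"], row["Model"])
```

Unlike the BeautifulSoup session above, ElementTree keeps the original tag case, so the keys are RecordNum, Make and Model rather than their lowercased forms. If the surrounding HTML can contain something that looks like a <Row> tag inside comments or scripts, a tolerant HTML parser is the safer way to locate the blocks.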