[EMAIL PROTECTED] wrote:
> I am attempting to extract some XML from an HTML document that I get
> returned from a form based web page. For some reason, I cannot figure
> out how to do this.
> Here's a sample of the html:
>
> <html>
> <body>
> lots of screwy text including divs and spans
> <Row status="o">
> <RecordNum>1126264</RecordNum>
> <Make>Mitsubishi</Make>
> <Model>Mirage DE</Model>
> </Row>
> </body>
> </html>
>
> What's the best way to get at the XML? Do I need to somehow parse it
> using the HTMLParser and then parse that with minidom or what?

lxml makes this pretty easy:

    >>> from lxml import etree
    >>> parser = etree.HTMLParser()
    >>> tree = etree.parse(the_file_or_url, parser)

The result is an ordinary element tree that can be treated as XML, e.g. with XPath, XSLT, tree iteration, ...

You will also get plain XML when you serialise it:

    >>> xml_string = etree.tostring(tree)

Note that this doesn't add any namespaces, so you will not magically get valid XHTML or something. You could rewrite the tags by hand, though.

Stefan
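
To make the above concrete, here is a minimal sketch that runs the same approach on the sample from the question, parsing from a string rather than a file or URL. The helper name `extract_rows` and the dict-per-row output are assumptions for illustration only, not part of lxml. One thing to watch for: the HTML parser in libxml2 lowercases element names, so the XPath expression looks for `row`, not `Row`.

```python
from lxml import etree

# The sample document from the question, verbatim.
SAMPLE = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_rows(html_text):
    """Hypothetical helper: return the <Row> elements found in an HTML string."""
    parser = etree.HTMLParser()
    root = etree.fromstring(html_text, parser)
    # The HTML parser lowercases tag names, hence "row" rather than "Row".
    return root.xpath("//row")


rows = extract_rows(SAMPLE)

# Turn each row into a plain dict of child tag -> text.
records = [{child.tag: child.text for child in row} for row in rows]

# Serialise the first row back to XML text (tag names stay lowercased).
xml_string = etree.tostring(rows[0], encoding="unicode")

print(records)
print(xml_string)
```

If the original capitalisation matters downstream, the tags would have to be renamed by hand after parsing, in the same way as noted above for namespaces.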