On Sep 18, 1:56 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I am attempting to extract some XML from an HTML document that I get
> > returned from a form based web page. For some reason, I cannot figure
> > out how to do this.
> > Here's a sample of the html:
> >
> > <html>
> > <body>
> > lots of screwy text including divs and spans
> > <Row status="o">
> > <RecordNum>1126264</RecordNum>
> > <Make>Mitsubishi</Make>
> > <Model>Mirage DE</Model>
> > </Row>
> > </body>
> > </html>
> >
> > What's the best way to get at the XML? Do I need to somehow parse it
> > using the HTMLParser and then parse that with minidom or what?
>
> lxml makes this pretty easy:
>
>    >>> parser = etree.HTMLParser()
>    >>> tree = etree.parse(the_file_or_url, parser)
>
> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
> tree iteration, ... You will also get plain XML when you serialise it to XML:
>
>    >>> xml_string = etree.tostring(tree)
>
> Note that this doesn't add any namespaces, so you will not magically get valid
> XHTML or something. You could rewrite the tags by hand, though.
>
> Stefan

I got it to work with lxml. See below:

    from io import BytesIO
    from lxml import etree

    def Parser(filename):
        # Parse the (possibly malformed) HTML. The HTML parser lowercases
        # tag names, which is why the comparisons below use lowercase.
        parser = etree.HTMLParser()
        tree = etree.parse(filename, parser)
        # Serialise the tree to XML, then walk it element by element.
        xml_string = etree.tostring(tree)
        owner = address = None
        for action, elem in etree.iterparse(BytesIO(xml_string)):
            if elem.tag == 'primaryowner':
                owner = elem.text
            elif elem.tag == 'customeraddress':
                address = elem.text
        print('Primary Owner: %s' % owner)
        print('Address: %s' % address)

Does this make sense? It works pretty well, but I don't really understand
everything that I'm doing.

Mike
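
For what it's worth, the round trip through tostring() and iterparse() may not be needed, since Stefan points out that the parsed tree can be iterated directly. Below is a minimal sketch of that idea, assuming the sample markup quoted at the top of this message. SAMPLE_HTML and extract_fields are made-up names for illustration only, and the tag names are lowercase because the HTML parser lowercases them.

```python
from lxml import etree

# Sample markup from the original question (assumed input for this sketch).
SAMPLE_HTML = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_fields(html_text, wanted):
    """Return {tag: text} for the first occurrence of each wanted tag.

    Hypothetical helper: walks the parsed tree directly instead of
    serialising it and parsing it a second time.
    """
    parser = etree.HTMLParser()
    root = etree.fromstring(html_text, parser)
    found = {}
    for elem in root.iter():
        # Tag names come back lowercased from the HTML parser.
        if elem.tag in wanted and elem.tag not in found:
            found[elem.tag] = (elem.text or "").strip()
    return found


fields = extract_fields(SAMPLE_HTML, ("recordnum", "make", "model"))
for name, value in fields.items():
    print("%s: %s" % (name, value))
```

To read from a file or URL instead of a string, etree.parse(source, parser).getroot() should give an equivalent root element to iterate over.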