[EMAIL PROTECTED] wrote:
> On Sep 18, 1:56 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
>> [EMAIL PROTECTED] wrote:
>>> I am attempting to extract some XML from an HTML document that I get
>>> returned from a form based web page. For some reason, I cannot figure
>>> out how to do this.
>>> Here's a sample of the html:
>>> <html>
>>> <body>
>>> lots of screwy text including divs and spans
>>> <Row status="o">
>>> <RecordNum>1126264</RecordNum>
>>> <Make>Mitsubishi</Make>
>>> <Model>Mirage DE</Model>
>>> </Row>
>>> </body>
>>> </html>
>>> What's the best way to get at the XML? Do I need to somehow parse it
>>> using the HTMLParser and then parse that with minidom or what?
>> lxml makes this pretty easy:
>>
>> >>> parser = etree.HTMLParser()
>> >>> tree = etree.parse(the_file_or_url, parser)
>>
>> This is actually a tree that can be treated as XML, e.g. with XPath, XSLT,
>> tree iteration, ... You will also get plain XML when you serialise it to XML:
>>
>> >>> xml_string = etree.tostring(tree)
>>
>> Note that this doesn't add any namespaces, so you will not magically get
>> valid XHTML or something. You could rewrite the tags by hand, though.
>>
>> Stefan
>
> I got it to work with lxml. See below:
>
> def Parser(filename):
>     parser = etree.HTMLParser()
>     tree = etree.parse(r'path/to/nextpage.htm', parser)
>     xml_string = etree.tostring(tree)
>     events = ("recordnum", "primaryowner", "customeraddress")
>     context = etree.iterparse(StringIO(xml_string), tag='')
>     for action, elem in context:
>         tag = elem.tag
>         if tag == 'primaryowner':
>             owner = elem.text
>         elif tag == 'customeraddress':
>             address = elem.text
>         else:
>             pass
>
>     print 'Primary Owner: %s' % owner
>     print 'Address: %s' % address
>
> Does this make sense? It works pretty well, but I don't really
> understand everything that I'm doing.
>
> Mike
>
Question: once you have the document in memory as an XML tree, why do you go back to event-based handling to extract your data? Try manipulating the tree directly. As far as I know, the HTML parser lowercases tag names (your own code already compares against 'primaryowner' in lowercase), so search for "row" rather than "Row":

    parser = etree.HTMLParser()
    tree = etree.parse(r'path/to/nextpage.htm', parser)
    myrows = tree.findall(".//row")
    # Then work with the sub-elements.
    for r in myrows:
        rnumelem = r.find("recordnum")
        makeelem = r.find("make")
        modelelem = r.find("model")
        # ... and so on for the other fields.

-- 
http://mail.python.org/mailman/listinfo/python-list
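
For reference, here is a self-contained sketch of the tree-based approach, run against the sample HTML from the original question instead of a file. The function name extract_rows and the dictionary layout are made up for illustration. The lowercase tag names assume that lxml's HTML parser lowercases element names, which the 'primaryowner' comparison in the quoted code suggests; adjust them if your tree looks different.

```python
from io import StringIO

from lxml import etree

# Sample document from the original question (the stray text is wrapped
# in a <div> here purely for illustration).
SAMPLE_HTML = """<html>
<body>
<div>lots of screwy text including divs and spans</div>
<Row status="o">
<RecordNum>1126264</RecordNum>
<Make>Mitsubishi</Make>
<Model>Mirage DE</Model>
</Row>
</body>
</html>"""


def extract_rows(source):
    """Parse HTML from a path, URL or file-like object and return one
    dict per <Row> element (hypothetical helper, not part of lxml)."""
    parser = etree.HTMLParser()
    tree = etree.parse(source, parser)
    rows = []
    # Assumption: the HTML parser has lowercased the tag names.
    for row in tree.findall(".//row"):
        rows.append({
            "status": row.get("status"),
            # findtext() returns the text of the first matching child,
            # or None if there is no such child.
            "recordnum": row.findtext("recordnum"),
            "make": row.findtext("make"),
            "model": row.findtext("model"),
        })
    return rows


rows = extract_rows(StringIO(SAMPLE_HTML))
for r in rows:
    print("Record %(recordnum)s: %(make)s %(model)s" % r)
```

Working on the tree this way avoids serialising the document and parsing it a second time with iterparse, and findtext() gives None for a missing field rather than leaving a variable undefined.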