On Fri, 02 Mar 2007 15:32:58 -0800, [EMAIL PROTECTED] wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
> xml.dom.minidom.parseString("text of web page") gives errors about it
> not being well formed XML.
> Do I just need to add something like <?xml ...?> or what?

Many HTML Transitional pages are so badly formed that you can't really build a DOM from them. I've written several grabbers that pull TV data from HTML pages and convert it to XML. Basically, there are three ways to get the information:

# Use find(): If you only need a few pieces of data, you may be able to find some HTML that always appears right before the data you need.
# Use regular expressions: These can very quickly pull all the data from a table or similar into a nice list. The only problem is that regular expressions have a somewhat steep learning curve.
# Use a SAX parser: This iterates through all the HTML items without caring whether they validate. You define a method to be called each time it finds a tag, a piece of text, etc.

> What is best way to do this?

In the beginning I mostly used the SAX approach, but it generates a lot of code, which is not necessarily more readable than the regular expressions.

-- 
http://mail.python.org/mailman/listinfo/python-list
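
For illustration, the three approaches from the reply might look roughly like the sketch below. Everything in it is an assumption made for the example: the sample page, the "time" and "title" class names, and the function names are hypothetical, and the standard library's html.parser.HTMLParser stands in for the SAX-style parser mentioned above, since it also calls a method per tag or piece of text and tolerates unclosed tags.

```python
import re
from html.parser import HTMLParser

# Hypothetical, deliberately malformed page (no closing </td> or </tr>).
PAGE = """<html><body>
<table>
<tr><td class="time">20:00<td class="title">News
<tr><td class="time">20:30<td class="title">Weather
</table>
</body></html>"""


def first_title_with_find(page):
    """Approach 1: find a marker that always precedes the data."""
    marker = '<td class="title">'
    start = page.find(marker)
    if start == -1:
        return None  # marker not present on this page
    start += len(marker)
    end = page.find("<", start)  # the data runs until the next tag
    if end == -1:
        end = len(page)
    return page[start:end].strip()


# Approach 2: one regular expression captures both cells of each row.
ROW_RE = re.compile(r'<td class="time">([^<]*)<td class="title">([^<]*)')


def rows_with_regex(page):
    return [(t.strip(), title.strip()) for t, title in ROW_RE.findall(page)]


class RowParser(HTMLParser):
    """Approach 3: event-driven parsing, one callback per tag or text."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per <tr>
        self._field = None  # class of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
            self._field = None
        elif tag == "td":
            self._field = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "table":
            self._field = None  # stop collecting text after the table

    def handle_data(self, data):
        text = data.strip()
        if self._field and self.rows and text:
            self.rows[-1][self._field] = text


def rows_with_parser(page):
    parser = RowParser()
    parser.feed(page)
    parser.close()
    return [(row.get("time"), row.get("title")) for row in parser.rows]


if __name__ == "__main__":
    print(first_title_with_find(PAGE))
    print(rows_with_regex(PAGE))
    print(rows_with_parser(PAGE))
```

The parser version is noticeably longer than the regular expression for the same result, which matches the trade-off described in the reply; on the other hand, it does not depend on the exact attribute spelling and order that the find() marker and the regular expression hard-code.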