[EMAIL PROTECTED] wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is best way to do this?

May I suggest html5lib [1]? It is based on the parsing section of the WHATWG "HTML5" spec [2], which is in turn based on the behavior of the major web browsers, so it should parse more or less* any invalid markup you throw at it. Despite the name "html5lib", it works with any (X)HTML document.

Out of the box you have the option of producing a minidom tree, an ElementTree, or a "simpletree" - a lightweight, DOM-like, html5lib-specific tree.

If you are happy to pull from SVN, I recommend that version; it has a few bug fixes over the 0.2 release as well as improved features, including better error reporting and detection of the encoding from <meta> elements (the next release is imminent).

[1] http://code.google.com/p/html5lib/
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

* There might be a problem if, e.g., the document uses a character encoding that Python does not support; otherwise it should parse anything.
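
Once you have one of the trees mentioned above, pulling the data out is ordinary tree walking. Here is a minimal sketch of the ElementTree route, under a few assumptions of mine: the sample markup and the helper names extract_links and page_title are made up for illustration, and the standard library's xml.etree.ElementTree parser stands in for html5lib so the snippet runs without installing anything (that only works because the sample is well-formed). With html5lib you would take the root element from its parser instead. I am also assuming the elements live in the XHTML namespace, which may depend on the parser and its settings.

```python
import xml.etree.ElementTree as ET

# Assumption: elements are in the XHTML namespace, as in a page that
# declares xmlns="http://www.w3.org/1999/xhtml".
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample page, invented for this sketch.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example page</title></head>
  <body>
    <p>See <a href="http://example.com/a">first</a> and
       <a href="http://example.com/b">second</a>.</p>
  </body>
</html>"""


def extract_links(root):
    """Return (href, text) pairs for every XHTML <a> element with an href."""
    links = []
    for a in root.iter("{%s}a" % XHTML_NS):
        href = a.get("href")
        if href is not None:
            # itertext() also collects text from nested inline elements.
            links.append((href, "".join(a.itertext()).strip()))
    return links


def page_title(root):
    """Return the text of the first <title> element, or None if absent."""
    title = root.find(".//{%s}title" % XHTML_NS)
    return title.text if title is not None else None


# Stand-in for the parsing step: the stdlib parser only accepts
# well-formed input, unlike a browser-style parser such as html5lib.
root = ET.fromstring(SAMPLE)
print(page_title(root))
for href, text in extract_links(root):
    print(href, text)
```

The two helpers only rely on the ElementTree element interface (iter, find, get, itertext), so they should not care which parser built the tree, as long as the tag names carry the namespace assumed here.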