hi paul... so you're the guy behind libxml2dom, eh..!! glad to say hey!

so this really is an issue with libxml2dom. ok, good, at least i know where the issue is. and yeah, i know the real issue is the fact that the html isn't valid!! it shouldn't have multiple "html" trees...

from what i can tell, this isn't really solved via tidy/beautifulsoup either, as a multiple html tree structure probably won't be seen as invalid from a token perspective. ok, i can live with this, i can accommodate it.

but tell me, when the parse module/class for libxml2dom does its thing, why does it not carry on through the string when it reaches a </html>, if there's more text left to process???

oh, also, regarding screen parsing/crawling, i've seen a number of sites that discuss using a web testing app, like Selenium, to drive a browser process in order to really capture all the required data. any thoughts on the pros/cons of this kind of approach to scraping data...

thanks

-bruce

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Paul Boddie
Sent: Tuesday, August 26, 2008 8:48 AM
To: python-list@python.org
Subject: Re: libxml2dom - parsing maligned html

On 26 Aug, 17:28, "bruce" <[EMAIL PROTECTED]> wrote:
> so it's as if the parseString only reads the initial "html" tree. i've
> reviewed as much as i can find regarding libxml2dom to try to figure out how
> i can get it to read/parse/handle both html trees/nodes.

Maybe there's some possibility to have libxml2 read directly from a file descriptor and to stop after parsing the first document, leaving the descriptor open; currently, this isn't supported by libxml2dom, however. Another possibility is to feed text to libxml2 until it can return a well-formed document, which I do as part of the libxml2dom.xmpp module, but I don't really support this feature in the public API. Again, improvements to libxml2dom may happen if I find the time to do them.

Paul
--
http://mail.python.org/mailman/listinfo/python-list
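
One way to work around the behaviour discussed in this thread, where only the first "html" tree gets parsed, might be to split the input string at each closing </html> tag before handing the pieces to the parser. The sketch below is only an illustration under that assumption: split_html_documents is a hypothetical helper, not part of libxml2dom or libxml2, and it uses nothing but the standard re module. It will misfire if a literal "</html>" appears inside a script or comment.

```python
import re

# Hypothetical helper, not part of libxml2dom: split a string that holds
# several concatenated HTML documents into one string per document.
_HTML_END = re.compile(r"</html\s*>", re.IGNORECASE)


def split_html_documents(text):
    """Return a list of chunks, each ending at a closing </html> tag."""
    documents = []
    start = 0
    for match in _HTML_END.finditer(text):
        chunk = text[start:match.end()].strip()
        if chunk:
            documents.append(chunk)
        start = match.end()
    # Keep any trailing text after the last </html> so nothing is silently lost.
    tail = text[start:].strip()
    if tail:
        documents.append(tail)
    return documents
```

Each returned chunk could then be passed on its own to whatever parser is in use, for example the parseString call mentioned above, so that every "html" tree gets its own document.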