On Sun, 30 Mar 2008 00:19:08 -0300, Michael Wieher <[EMAIL PROTECTED]> wrote:

> Was this not of any use?
>
> http://www.boddie.org.uk/python/HTML.html
>
> I think, since HTML is a sub-set of XML, any XML parser could be adapted
> to do this...

That's not true. A perfectly valid HTML document might not even be well-formed XML: some closing tags are not mandatory, attribute values need not be quoted, tags may be written in uppercase, etc. Example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>

The above document validates with no errors on http://validator.w3.org

If you are talking about XHTML documents, yes, they *should* be valid XML documents.

> I doubt there's an HTML-specific version, but I would imagine you
> could wrap any XML parser, or really, create your own that derives from
> the XML-parser-class...

The problem is that many HTML and XHTML pages that you find on the web aren't valid; some are ridiculously invalid. Browsers have a "quirks" mode, and can more or less guess the writer's intent only because HTML tags have some meaning. A generic XML parser, on the other hand, usually just refuses to continue parsing an ill-formed document. You can't simply "adapt any XML parser to do that".

BeautifulSoup, for example, does a very good job of interpreting and extracting data from "tag soup", and may be useful to the OP.

http://www.crummy.com/software/BeautifulSoup/

-- 
Gabriel Genellina
-- 
http://mail.python.org/mailman/listinfo/python-list
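
P.S. A minimal sketch of the point made above, assuming Python 3 and only the standard library. It feeds the example HTML 4.01 document to a generic XML parser (xml.etree.ElementTree), which should reject it, and then to the lenient html.parser module, which here stands in for a tag-soup parser such as BeautifulSoup. The names DOC, parses_as_xml, TagCollector and collect are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The example document from the message: valid HTML 4.01, not well-formed XML.
DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">\n'
    '<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'
)


def parses_as_xml(text):
    """Return True if a generic XML parser accepts the text, else False."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and non-blank text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag, attribute dict)
        self.text = []        # list of stripped text fragments

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def collect(text):
    """Run the lenient HTML parser over the text and return the collector."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector


if __name__ == "__main__":
    print("accepted by XML parser:", parses_as_xml(DOC))
    result = collect(DOC)
    print("start tags seen by html.parser:", result.start_tags)
    print("text seen by html.parser:", result.text)
```

The XML parser gives up at the first mismatched tag, while the HTML parser keeps going and still reports the tags, the unquoted attribute and the text. Note that html.parser only tokenizes; it does not repair the tree the way a browser or BeautifulSoup tries to.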