John Nagle wrote:
> I have a small web crawler robust enough to parse
> real-world HTML, which can be appallingly bad. I currently use
> an extra-robust version of BeautifulSoup, and even that sometimes
> blows up. So I'm very interested in a new Python parser which supposedly
> handles bad HTML in the same way browsers do. But if it's slower
> than BeautifulSoup, there's a problem.
Well, if performance matters in any way, you can always use lxml's blazingly fast parser first, possibly trying a couple of different configurations, and only if all of them fail, fall back to running html5lib over the same input. That should give you a tremendous speed-up over your current code in most cases, while keeping things robust in the hard cases.

Note the numbers that Ian Bicking has for HTML parser performance:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

You should be able to run lxml's parser ten times in different configurations (e.g. different charset overrides) before it even reaches the time that BeautifulSoup would need to parse a document once.

Given that undeclared character set detection is something where BS is a lot better than lxml, you can also mix the best of both worlds and use BS's character set detection to configure lxml's parser if you notice that the first parsing attempts fail.

And yes, html5lib performs pretty badly in comparison (or did, at the time). But the numbers seem to indicate that if you can drop the ratio of documents that require a run of html5lib below 30% and use lxml's parser for the rest, you will still be faster than with BeautifulSoup alone.
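To make that concrete, here is a rough sketch of such a cascade. It is untested, uses today's package names (lxml.html, html5lib), and the parse_html() helper, the `data` argument (assumed to be the raw byte string you fetched) and the fixed encoding list are all made up for illustration:

import lxml.html
import lxml.etree

def parse_html(data, encodings=('utf-8', 'iso-8859-1')):
    # 1) Try lxml's own parser first; it is by far the fastest option.
    try:
        return lxml.html.fromstring(data)
    except (lxml.etree.ParserError, UnicodeDecodeError, ValueError):
        pass

    # 2) Retry with explicit charset overrides.  Instead of a fixed
    #    list, this is where BS's character set detection could
    #    suggest a better guess (the "best of both worlds" mix
    #    mentioned above).
    for encoding in encodings:
        try:
            parser = lxml.html.HTMLParser(encoding=encoding)
            return lxml.html.fromstring(data, parser=parser)
        except (lxml.etree.ParserError, UnicodeDecodeError, ValueError):
            pass

    # 3) Last resort: html5lib, building an lxml tree so the calling
    #    code does not need to care which parser produced the result.
    import html5lib
    return html5lib.parse(data, treebuilder="lxml").getroot()

One caveat: lxml's HTML parser rarely raises at all on broken input, it just recovers as well as it can. So in practice "failure" may rather mean "the tree came out useless", which you would have to detect yourself before deciding to fall through to html5lib.

Stefan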