Stefan Behnel wrote:
> John Nagle wrote:
>> I have a small web crawler robust enough to parse
>> real-world HTML, which can be appallingly bad. I currently use
>> an extra-robust version of BeautifulSoup, and even that sometimes
>> blows up. So I'm very interested in a new Python parser which supposedly
>> handles bad HTML in the same way browsers do. But if it's slower
>> than BeautifulSoup, there's a problem.
>
> Well, if performance matters in any way, you can always use lxml's
> blazingly fast parser first, possibly trying a couple of different
> configurations, and only if all of them fail, fall back to running
> html5lib over the same input.
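
Something like this two-stage parse, presumably (a minimal sketch; the
recover flag, the lxml treebuilder, and the bare try/except standing in
for a real failure test are all my assumptions, not Stefan's):

    import html5lib
    from lxml import etree

    def parse_html(data):
        # First pass: lxml's fast libxml2-based parser in recovery mode.
        try:
            root = etree.fromstring(data, etree.HTMLParser(recover=True))
            if root is not None:
                return root.getroottree()
        except etree.XMLSyntaxError:
            pass
        # Fallback: html5lib, which implements the HTML5 parsing
        # algorithm; treebuilder="lxml" yields the same kind of tree.
        return html5lib.parse(data, treebuilder="lxml")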
Detecting "fail" is difficult. A common problem is badly terminated
comments which eat most of the document if you follow the spec. The
document seems to parse correctly, but most of it is missing. The
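
One crude way to catch that (my own heuristic, not an established
recipe) is to compare how much visible text survived the parse:

    import re
    import lxml.html

    def looks_truncated(raw_html, tree, min_ratio=0.5):
        # Compare visible text in the parsed tree against a crudely
        # tag-stripped version of the raw input; a runaway comment
        # that swallowed the page shows up as a big shortfall.
        # The 0.5 threshold is arbitrary -- tune it on your corpus.
        parsed_text = tree.text_content()
        crude_text = re.sub(r"<[^>]+>", " ", raw_html)
        return len(parsed_text) < min_ratio * len(crude_text)

    # An unterminated comment can eat everything after it:
    raw = "<p>intro</p><!-- oops <p>a much longer rest of the page</p>"
    tree = lxml.html.fromstring(raw)
    print(looks_truncated(raw, tree))  # True if the comment ate the rest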
The HTML 5 spec actually covers things like
<!This is a bogus SGML directive>
and treats it as a bogus comment. (That's because HTML 5 doesn't
include general SGML; the only directive recognized is DOCTYPE.
Anything else after "<!" is treated as a token-level error.)
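
html5lib already implements that rule; a quick check (the tree walker
is just one way to inspect the result):

    import html5lib

    # Per the HTML5 tokenizer, "<!" followed by anything other than a
    # comment or DOCTYPE enters the "bogus comment" state: text up to
    # the next ">" becomes a comment node instead of eating the page.
    doc = html5lib.parse("<!This is a bogus SGML directive><p>still here</p>")

    walker = html5lib.getTreeWalker("etree")
    for token in walker(doc):
        if token["type"] == "Comment":
            print("comment:", token["data"])  # the bogus directive
        elif token["type"] == "Characters":
            print("text:", token["data"])     # "still here"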

So using an agreed-upon parsing method, in the form of html5lib, is
desirable: it should mimic browser behavior.
John Nagle