On 3/11/2012 2:45 PM, Cameron Simpson wrote:
On 11Mar2012 13:30, John Nagle <na...@animats.com> wrote:
|     "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
|     But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated.  I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

   http://www.crummy.com/software/BeautifulSoup/bs4/doc/

   http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser
     "Beautiful Soup 4 uses html.parser by default, but you can plug in
     lxml or html5lib and use that instead."

   I want to use HTML5 standard parsing of bad HTML.  (HTML5 formally
defines how to parse bad comments, for example.)  I currently have
a modified version of BeautifulSoup that's more robust than the
standard one, but it doesn't handle errors the same way browsers do.

                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
