On 11Mar2012 13:30, John Nagle <na...@animats.com> wrote:
| "html5lib" is apparently not thread safe.
| (see "http://code.google.com/p/html5lib/issues/detail?id=189")
| Looking at the code, I've only found about three problems.
| They're all the usual "cached in a global without locking" bug.
| A few locks would fix that.
|
| But html5lib calls the XML SAX parser. Is that thread-safe?
| Or is there more trouble down at the bottom?
|
| (I run a multi-threaded web crawler, and currently use BeautifulSoup,
| which is thread safe, although dated. I'm looking at converting to
| html5lib.)

IIRC, BeautifulSoup4 may do that for you:

  http://www.crummy.com/software/BeautifulSoup/bs4/doc/
  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser

  "Beautiful Soup 4 uses html.parser by default, but you can plug in
  lxml or html5lib and use that instead."

Just for interest, re locking, I wrote a little decorator the other day,
thus:

    @locked_property
    def foo(self):
        # compute foo here ...
        return foo_value

and am rolling its use out amongst my classes. Code:

    def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
        ''' A property whose access is controlled by a lock if unset.
        '''
        if prop_name is None:
            prop_name = '_' + func.__name__
        def getprop(self):
            ''' Attempt lockless fetch of property first.
                Use lock if property is unset.
            '''
            p = getattr(self, prop_name)
            if p is unset_object:
                with getattr(self, lock_name):
                    # Re-check under the lock: another thread may have set it.
                    p = getattr(self, prop_name)
                    if p is unset_object:
                        p = func(self)
                        setattr(self, prop_name, p)
            return p
        return property(getprop)

It tries to be lockless in the common case. I suspect it is only safe in
CPython, where there is a GIL. If raw Python assignments and fetches can
overlap (e.g. Jython, I think?) I probably need a shared "read" lock around
the first "p = getattr(self, prop_name)".

Any remarks?

Cheers,
-- 
Cameron Simpson <c...@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
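
P.S. For anyone wanting to try the decorator, here is a minimal sketch of how it might be exercised from several threads. It assumes each instance sets up both `_lock` and the backing attribute (`_foo` here) before first access, since the getter uses a plain getattr with no default. The `Expensive` class, its `calls` counter and the `demo` helper are hypothetical names made up for illustration, and the check only says something about CPython.

```python
import threading


def locked_property(func, lock_name='_lock', prop_name=None, unset_object=None):
    """A property computed at most once per instance, locked while unset."""
    if prop_name is None:
        prop_name = '_' + func.__name__

    def getprop(self):
        # Fast path: read the cached value without taking the lock.
        p = getattr(self, prop_name)
        if p is unset_object:
            with getattr(self, lock_name):
                # Re-check under the lock in case another thread got here first.
                p = getattr(self, prop_name)
                if p is unset_object:
                    p = func(self)
                    setattr(self, prop_name, p)
        return p

    return property(getprop)


class Expensive(object):
    """Hypothetical class: the decorator expects _lock and _foo to exist."""

    def __init__(self):
        self._lock = threading.Lock()
        self._foo = None      # the "unset" marker (unset_object defaults to None)
        self.calls = 0        # counts how often foo is actually computed

    @locked_property
    def foo(self):
        self.calls += 1       # only ever runs while holding self._lock
        return 42


def demo(nthreads=8):
    """Read obj.foo from several threads; return (compute count, values seen)."""
    obj = Expensive()
    results = []

    def worker():
        results.append(obj.foo)

    threads = [threading.Thread(target=worker) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return obj.calls, results
```

Note that a property whose legitimate computed value is None would be recomputed on every access with the default unset_object; passing a distinct sentinel object as unset_object (and initialising the attribute to it) would avoid that.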