Paul Rubin <no.em...@nospam.invalid> writes:
> Stefan Behnel <stefan...@behnel.de> writes:
>> Well, if multi-core performance is so important here, then there's a pretty
>> simple thing the OP can do: switch to lxml.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it only
> works on well-formed XML. The point of Beautiful Soup is that it works on
> all kinds of garbage hand-written legacy HTML with mismatched tags and
> other sorts of errors. Beautiful Soup is slower because it's full of
> special cases and hacks for that reason, and it is written in Python.
> Writing something that complex in C to handle so much potentially
> malicious input would be quite a lot of work, and very difficult to ensure
> was really safe. Look at the many browser vulnerabilities we've seen over
> the years due to that sort of problem, for example. But, for web crawling,
> you really do need to handle the messy and wrong HTML properly.

If the difference is great enough, you might get a benefit from analyzing all pages with lxml and throwing invalid pages into a bucket for later processing with BeautifulSoup.
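
A minimal sketch of that two-pass idea (not from the thread) might look like the following. It assumes lxml and bs4 are installed, runs each page through lxml's HTML parser with error recovery turned off, and drops anything that raises a parse error into a bucket for a later BeautifulSoup pass. The names `pages`, `handle_tree`, and `handle_soup` are placeholders, and the sample data in `__main__` is made up.

    # Two-pass parsing sketch: strict lxml first, BeautifulSoup for the rest.
    import lxml.html
    from lxml import etree
    from bs4 import BeautifulSoup  # the original thread predates bs4; adjust for BeautifulSoup 3

    def handle_tree(url, tree):
        # Placeholder for the fast-path processing of an lxml element tree.
        print(url, "parsed by lxml, root tag:", tree.tag)

    def handle_soup(url, soup):
        # Placeholder for the slow-path processing of a BeautifulSoup document.
        print(url, "parsed by BeautifulSoup:", soup.name)

    def crawl(pages):
        """pages is an iterable of (url, html_text) pairs."""
        bucket = []  # pages lxml could not parse without error recovery
        strict = lxml.html.HTMLParser(recover=False)  # raise instead of repairing markup
        for url, html in pages:
            try:
                tree = lxml.html.fromstring(html, parser=strict)
            except etree.XMLSyntaxError:
                bucket.append((url, html))  # defer to the slow path
                continue
            handle_tree(url, tree)
        # Second pass: only the messy pages pay the BeautifulSoup cost.
        for url, html in bucket:
            handle_soup(url, BeautifulSoup(html, "html.parser"))

    if __name__ == "__main__":
        crawl([
            ("http://example.com/ok", "<html><body><p>fine</p></body></html>"),
            # Mismatched tags that the strict pass may reject:
            ("http://example.com/bad", "<p><b>mismatched</i> tags"),
        ])

Whether this pays off depends on how many pages survive the strict pass: if most real-world pages trip lxml with recovery disabled, the bucket ends up holding nearly everything and you are back to BeautifulSoup's speed for the bulk of the crawl.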