Paul Rubin <no.em...@nospam.invalid> writes:
> Stefan Behnel <stefan...@behnel.de> writes:
>> Well, if multi-core performance is so important here, then there's a pretty
>> simple thing the OP can do: switch to lxml.
>>
>> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it only
> works on well-formed XML. The point of Beautiful Soup is that it works on
> all kinds of garbage hand-written legacy HTML with mismatched tags and
> other sorts of errors. Beautiful Soup is slower because it's full of
> special cases and hacks for that reason, and it is written in Python.
> Writing something that complex in C to handle so much potentially
> malicious input would be quite a lot of work, and very difficult to ensure
> was really safe. Look at the many browser vulnerabilities we've seen over
> the years due to that sort of problem, for example. But, for web crawling,
> you really do need to handle the messy and wrong HTML properly.

If the difference is great enough, you might get a benefit from analyzing all pages with lxml and throwing invalid pages into a bucket for later processing with BeautifulSoup.
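
A minimal sketch of that two-pass idea (not from the thread) might look like the following. It assumes lxml and bs4 are installed, runs each page through lxml's HTML parser with error recovery turned off, and drops anything that raises a parse error into a bucket for a later BeautifulSoup pass. The names `pages`, `handle_tree`, and `handle_soup` are placeholders, and the sample data in `__main__` is made up.

    # Two-pass parsing sketch: strict lxml first, BeautifulSoup for the rest.
    import lxml.html
    from lxml import etree
    from bs4 import BeautifulSoup  # the original thread predates bs4; adjust for BeautifulSoup 3

    def handle_tree(url, tree):
        # Placeholder for the fast-path processing of an lxml element tree.
        print(url, "parsed by lxml, root tag:", tree.tag)

    def handle_soup(url, soup):
        # Placeholder for the slow-path processing of a BeautifulSoup document.
        print(url, "parsed by BeautifulSoup:", soup.name)

    def crawl(pages):
        """pages is an iterable of (url, html_text) pairs."""
        bucket = []  # pages lxml could not parse without error recovery
        strict = lxml.html.HTMLParser(recover=False)  # raise instead of repairing markup
        for url, html in pages:
            try:
                tree = lxml.html.fromstring(html, parser=strict)
            except etree.XMLSyntaxError:
                bucket.append((url, html))  # defer to the slow path
                continue
            handle_tree(url, tree)
        # Second pass: only the messy pages pay the BeautifulSoup cost.
        for url, html in bucket:
            handle_soup(url, BeautifulSoup(html, "html.parser"))

    if __name__ == "__main__":
        crawl([
            ("http://example.com/ok", "<html><body><p>fine</p></body></html>"),
            # Mismatched tags that the strict pass may reject:
            ("http://example.com/bad", "<p><b>mismatched</i> tags"),
        ])

Whether this pays off depends on how many pages survive the strict pass: if most real-world pages trip lxml with recovery disabled, the bucket ends up holding nearly everything and you are back to BeautifulSoup's speed for the bulk of the crawl.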