On Fri, Feb 08, 2008 at 09:01:06AM +0000, Andy Robinson wrote:
> FWIW, we parse tens of thousands of pages every week to build let
> people republish content into nice PDFs. Beautiful Soup was the only
> thing that made this sane, as many pages are not structured to be easy
> to parse. Like you we found the network was the limit, and simply
> kicking off several scraping processes in parallel solved that (e.g.
> one run of a script parses hotels from A-F, the next from G-M and so
> on...). I can't imagine using anything else.
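
For anyone wanting to try the "A-F, G-M and so on" split, here is a rough sketch of the idea. It is not the script described above: the function names are made up for illustration, scrape_range() is a placeholder that only filters names instead of fetching anything, and it uses a thread pool inside one process rather than several separate script runs, on the assumption that the work is network-bound.

```python
import string
from concurrent.futures import ThreadPoolExecutor


def letter_ranges(n_chunks):
    """Split A-Z into n_chunks contiguous (first, last) letter ranges."""
    letters = string.ascii_uppercase
    size, extra = divmod(len(letters), n_chunks)
    ranges = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional letter each.
        end = start + size + (1 if i < extra else 0)
        ranges.append((letters[start], letters[end - 1]))
        start = end
    return ranges


def scrape_range(names, lo, hi):
    """Hypothetical worker: handle only the names whose initial is in lo..hi.

    A real worker would fetch and parse a page per name here; this
    placeholder just returns the names it would have processed.
    Names that do not start with A-Z fall outside every range.
    """
    return [n for n in names if lo <= n[:1].upper() <= hi]


def run(names, n_chunks=4):
    """Run one worker per letter range and collect results in range order."""
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        futures = [
            pool.submit(scrape_range, names, lo, hi)
            for lo, hi in letter_ranges(n_chunks)
        ]
        return [f.result() for f in futures]
```

Splitting by initial letter is simple but uneven if the names cluster on a few letters; a work queue would balance better, at the cost of more moving parts.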

We do HTML parsing all day, every day, so I wrote a Python extension
module in C to do it. But we had very particular requirements: we need
not only to understand "real-life" HTML, but also to generate detailed,
precise diagnostics whenever the HTML is not correct according to the
spec. The C module is only 900 lines of code, though.

_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk