Re: [python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

Jon Ribbens Fri, 08 Feb 2008 04:16:12 -0800

On Fri, Feb 08, 2008 at 09:01:06AM +0000, Andy Robinson wrote:
> FWIW, we parse tens of thousands of pages every week to let
> people republish content into nice PDFs.  Beautiful Soup was the only
> thing that made this sane, as many pages are not structured to be easy
> to parse.  Like you we found the network was the limit, and simply
> kicking off several scraping processes in parallel solved that (e.g.
> one run of a script parses hotels from A-F, the next from G-M and so
> on...). I can't imagine using anything else.
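
For anyone who wants to try the split-by-letter approach described
above, here is a rough sketch. It is not Andy's code: the letter ranges
beyond A-F and G-M, the function names, and the `fetch` callable are
all made up for illustration, and it uses threads from the standard
library rather than separate script runs, on the assumption that the
work is network-bound.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical letter ranges; the quoted message only names A-F and G-M.
RANGES = [("A", "F"), ("G", "M"), ("N", "S"), ("T", "Z")]


def shard_for(name):
    """Return the index of the range covering name's first letter, or None."""
    first = name.strip()[:1].upper()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    return None


def partition(names):
    """Split names into one list per range, plus a list of leftovers."""
    shards = [[] for _ in RANGES]
    leftovers = []  # names that do not start with A-Z
    for name in names:
        i = shard_for(name)
        (shards[i] if i is not None else leftovers).append(name)
    return shards, leftovers


def scrape_all(names, fetch):
    """Run fetch over every shard in parallel, one worker per shard.

    fetch is a stand-in for whatever downloads and parses a single page.
    """
    shards, leftovers = partition(names)
    work = shards + [leftovers]
    with ThreadPoolExecutor(max_workers=len(work)) as pool:
        # map() preserves shard order, so results come back range by range.
        results = list(pool.map(lambda shard: [fetch(n) for n in shard], work))
    return [item for shard in results for item in shard]
```

Fixed letter ranges are easy to reason about but can be lopsided if
most names start with the same few letters; a shared work queue would
balance better at the cost of a little more code.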


We do HTML parsing all day, every day, so I wrote a Python extension
module in C to do it. But we had very particular requirements:
we needed not only to understand "real-life" HTML, but also to
generate detailed, precise diagnostics whenever the HTML is not
correct according to the spec. The C module is only 900 lines of
code, though.
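
To give a flavour of "tolerant parsing plus diagnostics", here is a
small pure-Python sketch. It is not the C module and does far less: it
only checks tag nesting, using `html.parser` from the standard library
of a modern Python. The class name, the `diagnose` helper, and the
message wording are all invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (HTML void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DiagnosticParser(HTMLParser):
    """Keeps parsing through bad markup while recording nesting problems."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of currently open elements
        self.problems = []  # human-readable diagnostics

    def _report(self, message):
        line, col = self.getpos()
        self.problems.append("line %d, col %d: %s" % (line, col, message))

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> or <div/> like a plain start tag, as browsers do.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID:
            self._report("end tag for void element <%s>" % tag)
            return
        if tag not in self.stack:
            self._report("unexpected </%s>" % tag)
            return
        # Pop back to the matching element, flagging everything skipped.
        while self.stack:
            open_tag = self.stack.pop()
            if open_tag == tag:
                break
            self._report("<%s> not closed before </%s>" % (open_tag, tag))

    def close(self):
        super().close()
        while self.stack:
            self._report("<%s> never closed" % self.stack.pop())


def diagnose(html):
    """Return a list of nesting diagnostics for an HTML string."""
    parser = DiagnosticParser()
    parser.feed(html)
    parser.close()
    return parser.problems
```

For example, `diagnose("<p><b>bold</p>")` returns a single message
saying that `<b>` was not closed before `</p>`, while well-formed input
such as `diagnose("<p>ok</p>")` returns an empty list. Checking against
the actual spec (allowed children, optional end tags, attributes) needs
a good deal more than this.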