On Thu, Feb 7, 2008 at 7:11 PM, Shaun Laughey <[EMAIL PROTECTED]> wrote:
> Hi,
> I have used Beautiful Soup for parsing html.
> It works very nicely and I didn't see much of an issue with speed in
> parsing several hundred html files every hour or so.
> I also rolled my own using various regexes and stuff nicked from a
> Perl lib. It was awful and feature-incomplete. Beautiful Soup worked
> better.
>
> Shaun Laughey.

To clarify, I use BeautifulSoup for a small project that parses
frequently changing HTML on a number of websites (>1MB each), extracts
the content of specific tags, filters out certain strings from the
content, and serves it up in a consistent format. The input HTML comes
from the wild, and often contains odd tags, funny characters, and other
inconsistencies. It has so far worked near-perfectly for the last 9
months.

Speed is commonly reported as a problem with BS, which is why I
mentioned it, but when I profiled the code in an effort to speed it up
I discovered that 90%+ of the time taken was accounted for by network
latency in getting the data from the remote sites.

Alex
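For anyone curious, the extract-and-filter pattern described above might look roughly like the sketch below. The sample HTML, tag names, and the "PROMO" filter string are all hypothetical stand-ins, not Alex's actual code; the point is just that BeautifulSoup copes with messy markup (entities, unknown tags, a missing closing tag) without complaint.

```python
from bs4 import BeautifulSoup

# Hypothetical in-the-wild HTML: an entity, an unknown tag, and a
# missing </html> -- the sort of inconsistencies mentioned above.
html = """
<html><body>
<h1>Latest &amp; Greatest</h1>
<div class="story">
  <p>First item</p>
  <p>PROMO: Second item</p>
  <p>Third item</p>
</div>
<blink>legacy tag, parsed anyway</blink>
</body>
"""

soup = BeautifulSoup(html, "html.parser")

# Extract the content of specific tags...
headline = soup.h1.get_text()
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

# ...and filter out certain strings from the content
# ("PROMO:" is a placeholder filter for illustration).
cleaned = [text for text in paragraphs if not text.startswith("PROMO:")]

print(headline)  # entity &amp; is decoded to a plain ampersand
print(cleaned)
```

This uses the modern `bs4` package; the 2008-era import was `from BeautifulSoup import BeautifulSoup`, but the API shape is much the same.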
_______________________________________________
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk