Re: [python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

2008-02-08 Thread Andy Robinson
On 07/02/2008, Alexander Harrowell <[EMAIL PROTECTED]> wrote:
> To clarify, I use BeautifulSoup for a small project that parses frequently
> changing HTML on a number of websites (>1MB each), extracts the content of
> specific tags, filters out certain strings from the content, and serves it
> up in a consistent format. The input HTML comes from the wild, and often
> contains odd tags, funny characters, and other inconsistencies.
>
> It has so far worked near-perfectly for the last 9 months. Speed appears to
> be a conventional problem with BS, which is why I mentioned it, but when I
> analysed the code in an effort to speed it up I discovered that 90%+ of the
> time taken was accounted for by network latency in getting the data from the
> remote sites.
>


FWIW, we parse tens of thousands of pages every week to let
people republish content into nice PDFs.  Beautiful Soup was the only
thing that made this sane, as many pages are not structured to be easy
to parse.  Like you, we found the network was the limit, and simply
kicking off several scraping processes in parallel solved that (e.g.
one run of a script parses hotels from A-F, the next from G-M and so
on...). I can't imagine using anything else.
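
A rough sketch of the alphabetical split described above, in case it
helps anyone: each run of the script takes one letter range and only
handles the names that fall inside it. The function names and the
sample hotel names are made up for illustration; this is not our
production code, and the actual fetching/parsing step is left out.

```python
import string


def letter_ranges(n_chunks):
    """Split A-Z into n_chunks contiguous (first, last) letter ranges."""
    letters = string.ascii_uppercase
    if not 1 <= n_chunks <= len(letters):
        raise ValueError("n_chunks must be between 1 and 26")
    size, extra = divmod(len(letters), n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional letter each.
        end = start + size + (1 if i < extra else 0)
        ranges.append((letters[start], letters[end - 1]))
        start = end
    return ranges


def in_range(name, first, last):
    """True if the name's initial letter lies in first..last inclusive."""
    initial = name.strip()[:1].upper()
    return bool(initial) and first <= initial <= last


def select(names, first, last):
    """Names that one run of the script would be responsible for."""
    return [n for n in names if in_range(n, first, last)]


# Hypothetical sample data.
hotels = ["Astoria", "grand plaza", "Metropole", "Savoy", "Waldorf"]
for first, last in letter_ranges(4):
    print(first, last, select(hotels, first, last))
```

Each range can then be handed to a separate process (or a separate
invocation of the script), so the runs wait on the network at the same
time rather than one after another.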

Best Regards,
-- 
Andy Robinson
CEO/Chief Architect
ReportLab Europe Ltd.
165 The Broadway, Wimbledon, London SW19 1NE, UK
Tel +44-20-8544-8049
___
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk


[python-uk] Conference

2008-02-08 Thread John Pinner
As well as PyCon UK (12th-14th September) we have another UK
conference which may be of interest, the UKUUG Spring Conference.

It's not a pure Python conference, but it has significant Python
content (possibly because I'm involved in the organisation ;)

The url is http://spring2008.ukuug.org

The official UKUUG announcement is below

Best wishes,

John
--
John Pinner



UKUUG is pleased to announce full details about the forthcoming Spring
Conference & Tutorials.

The event will take place on 31st March, 1st & 2nd April in Birmingham.

3 parallel tutorials will be held on Monday 31st March and a two day
conference (with parallel streams) will take place on Tuesday 1st &
Wednesday 2nd April.

In addition the UKUUG is hosting the UK's first PostgreSQL User
Conference on Wednesday 2nd April at the same venue.

All the information, including abstracts, bios, and an online booking
form can be found at:

http://spring2008.ukuug.org/

Take a look now




Re: [python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?

2008-02-08 Thread Jon Ribbens
On Fri, Feb 08, 2008 at 09:01:06AM +, Andy Robinson wrote:
> FWIW, we parse tens of thousands of pages every week to let
> people republish content into nice PDFs.  Beautiful Soup was the only
> thing that made this sane, as many pages are not structured to be easy
> to parse.  Like you we found the network was the limit, and simply
> kicking off several scraping processes in parallel solved that (e.g.
> one run of a script parses hotels from A-F, the next from G-M and so
> on...). I can't imagine using anything else.

We do HTML parsing all day every day, so I wrote a Python extension
module in C to do it. But we had very particular requirements,
specifically that we need to not only understand "real-life" HTML,
but also generate detailed, precise diagnostics whenever the HTML
is not correct according to the spec. The C module is only 900 lines
of code though.
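
To give a flavour of what "diagnostics" means here, below is a toy
sketch in pure Python using the standard library's html.parser. It is
not the C module and does far less: it only reports unbalanced tags
with line/column positions, and it is stricter than the spec in that
it treats every non-void element as needing an explicit end tag. The
class and function names are invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Collects a message for every unbalanced tag it sees."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open tags as (name, (line, col))
        self.problems = []   # human-readable diagnostics

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((tag, self.getpos()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens nothing.
        pass

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        line, col = self.getpos()
        if tag not in [name for name, _ in self.stack]:
            self.problems.append(
                f"line {line}, col {col}: </{tag}> has no matching open tag")
            return
        # Pop back to the matching open tag, reporting anything skipped.
        while self.stack:
            name, (oline, ocol) = self.stack.pop()
            if name == tag:
                break
            self.problems.append(
                f"line {oline}, col {ocol}: <{name}> not closed before </{tag}>")


def check(html):
    """Return a list of diagnostics for the given HTML string."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for name, (line, col) in checker.stack:
        checker.problems.append(
            f"line {line}, col {col}: <{name}> never closed")
    return checker.problems


for message in check("<div><b>bold</div></i>"):
    print(message)
```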