Re: [python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?
On 07/02/2008, Alexander Harrowell <[EMAIL PROTECTED]> wrote:
> To clarify, I use BeautifulSoup for a small project that parses frequently
> changing HTML on a number of websites (>1MB each), extracts the content of
> specific tags, filters out certain strings from the content, and serves it
> up in a consistent format. The input HTML comes from the wild, and often
> contains odd tags, funny characters, and other inconsistencies.
>
> It has so far worked near-perfectly for the last 9 months. Speed appears to
> be a conventional problem with BS, which is why I mentioned it, but when I
> analysed the code in an effort to speed it up I discovered that 90%+ of the
> time taken was accounted for by network latency in getting the data from the
> remote sites.
>

FWIW, we parse tens of thousands of pages every week to let people
republish content into nice PDFs. Beautiful Soup was the only thing that
made this sane, as many pages are not structured to be easy to parse.

Like you, we found the network was the limit, and simply kicking off
several scraping processes in parallel solved that (e.g. one run of a
script parses hotels from A-F, the next from G-M and so on...).

I can't imagine using anything else.

Best Regards,

--
Andy Robinson
CEO/Chief Architect
ReportLab Europe Ltd.
165 The Broadway, Wimbledon, London SW19 1NE, UK
Tel +44-20-8544-8049
___
python-uk mailing list
python-uk@python.org
http://mail.python.org/mailman/listinfo/python-uk
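The alphabetical split described in the message above can be sketched roughly as follows. This is only an illustration, not the poster's actual script: `partition_by_initial`, `scrape_in_parallel` and the `fetch_and_parse` callback are hypothetical names, and it uses threads from the standard library rather than separate processes, on the assumption that the work is dominated by network latency.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_initial(names, ranges=("A-F", "G-M", "N-S", "T-Z")):
    """Group names into buckets keyed by the range their first letter falls in."""
    buckets = {r: [] for r in ranges}
    buckets["other"] = []  # names that do not start with A-Z (digits, empty, ...)
    for name in names:
        initial = name[:1].upper()
        for r in ranges:
            lo, hi = r.split("-")
            if lo <= initial <= hi:
                buckets[r].append(name)
                break
        else:
            buckets["other"].append(name)
    return buckets


def scrape_bucket(names, fetch_and_parse):
    """Process one bucket sequentially; each bucket runs in its own worker."""
    return [fetch_and_parse(name) for name in names]


def scrape_in_parallel(names, fetch_and_parse):
    """Run one worker per alphabetical bucket and collect results per bucket.

    fetch_and_parse is a caller-supplied (hypothetical) function that would
    download a page and parse it, e.g. with Beautiful Soup.
    """
    buckets = partition_by_initial(names)
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        futures = {
            key: pool.submit(scrape_bucket, bucket, fetch_and_parse)
            for key, bucket in buckets.items()
        }
        return {key: future.result() for key, future in futures.items()}
```

Because each worker spends most of its time waiting on the network, threads are usually enough here; separate processes, as in the message, achieve the same overlap and also isolate a crash in one range from the others.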
[python-uk] Conference
As well as PyCon UK (12th-14th September) we have another UK conference
which may be of interest, the UKUUG Spring Conference. It's not a pure
Python conference, but has significant Python content (possibly because
I'm involved in the organisation ;)

The url is http://spring2008.ukuug.org

The official UKUUG announcement is below.

Best wishes,

John
--
John Pinner

UKUUG is pleased to announce full details about the forthcoming Spring
Conference & Tutorials.

The event will take place on 31st March, 1st & 2nd April in Birmingham.
3 parallel tutorials will be held on Monday 31st March and a two day
conference (with parallel streams) will take place on Tuesday 1st &
Wednesday 2nd April.

In addition the UKUUG is hosting the UK's first PostgreSQL User
Conference on Wednesday 2nd April at the same venue.

All the information, including abstracts, bios, and an online booking
form can be found at:

http://spring2008.ukuug.org/

Take a look now
Re: [python-uk] Favourite ways of scrubbing HTML/whitelisting specific HTML tags?
On Fri, Feb 08, 2008 at 09:01:06AM +, Andy Robinson wrote:
> FWIW, we parse tens of thousands of pages every week to let
> people republish content into nice PDFs. Beautiful Soup was the only
> thing that made this sane, as many pages are not structured to be easy
> to parse. Like you we found the network was the limit, and simply
> kicking off several scraping processes in parallel solved that (e.g.
> one run of a script parses hotels from A-F, the next from G-M and so
> on...). I can't imagine using anything else.

We do HTML parsing all day every day, so I wrote a Python extension
module in C to do it. But we had very particular requirements,
specifically that we need to not only understand "real-life" HTML, but
also generate detailed, precise diagnostics whenever the HTML is not
correct according to the spec.

The C module is only 900 lines of code though.
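To give a flavour of what "diagnostics" on real-life HTML might look like, here is a minimal sketch using only the standard library's `html.parser`. It is not the C module mentioned above and does not follow the HTML spec's rules (for instance, it reports an unclosed `<p>` even where the spec allows the end tag to be omitted); `TagBalanceChecker` and `check_balance` are names invented for this example.

```python
from html.parser import HTMLParser

# Elements that never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Keep parsing whatever arrives, but record tag-balance problems."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open elements as (tag, line, column)
        self.diagnostics = []  # human-readable messages

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            line, col = self.getpos()
            self.stack.append((tag, line, col))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        line, col = self.getpos()
        if tag not in [open_tag for open_tag, _, _ in self.stack]:
            self.diagnostics.append(f"{line}:{col}: stray </{tag}>")
            return
        # Pop back to the matching start tag, reporting everything skipped.
        while self.stack:
            open_tag, oline, ocol = self.stack.pop()
            if open_tag == tag:
                break
            self.diagnostics.append(
                f"{oline}:{ocol}: <{open_tag}> not closed before </{tag}>"
            )


def check_balance(html):
    """Return a list of diagnostics for the given HTML text."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for open_tag, line, col in checker.stack:
        checker.diagnostics.append(f"{line}:{col}: <{open_tag}> never closed")
    return checker.diagnostics
```

For example, `check_balance("<p><b>hi</p>")` yields a single message pointing at the unclosed `<b>`, while the parser itself carries on to the end of the input.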