Re: web page text extractor

kublai Thu, 12 Jul 2007 12:48:44 -0700

On Jul 13, 2:19 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> kublai wrote:
> > For a project, I need to develop a corpus of online news stories.  I'm
> > looking for an application that, given the url of a web page, "copies"
> > the rendered text of the web page (not the source HTML text), opens a
> > text editor (Notepad), and displays the copied text for the user to
> > examine and save into a text file. Graphics and sidebars to be
> > ignored. The examples I have come across are much too complex for me
> > to customize for this simple job. Can anyone point me in the right
> > direction?
>
> Super-simplistic:
>
>   >>> import lxml.etree as et
>   >>> parser = et.HTMLParser()
>   >>> tree = et.parse("http://the/page.html", parser)
>   >>> print tree.xpath("string(/html/body)")
>
> http://codespeak.net/lxml/
>
> You may want to use the incredibly versatile "lxml.html.clean" module first to
> remove any annoying content. It's not released yet but available in a branch:
>
> http://codespeak.net/svn/lxml/branch/html/
>
> Stefan


Hi, Stefan,
This looks very interesting. I will look into this first thing
tonight. Gotta hit some golf bugs, I mean, balls first. It's a
beautiful afternoon here in Edmonton.
Cheers,
gk
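
For reference, a rough standard-library sketch of the workflow discussed
in this thread (page text in, text file out, opened in Notepad). It is not
Stefan's lxml approach: it swaps in Python's own html.parser and uses
Python 3 syntax. The names TextExtractor, extract_text, save_and_open and
story.txt are made up for illustration. It drops script and style contents
but makes no attempt to detect sidebars, so treat it as a starting point
only.

```python
import subprocess
import urllib.request
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def save_and_open(url, path="story.txt"):
    """Hypothetical helper: fetch url, save its text, open it in Notepad.

    Assumes Windows (notepad.exe) and a page that decodes as UTF-8.
    """
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(extract_text(html))
    subprocess.Popen(["notepad.exe", path])
```

Calling save_and_open("http://the/page.html") would then write story.txt
and bring it up in Notepad for review before saving it into the corpus.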

-- 
http://mail.python.org/mailman/listinfo/python-list
