Re: web page text extractor

Stefan Behnel Thu, 12 Jul 2007 11:20:59 -0700

kublai wrote:
> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTML text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?


Super-simplistic:

  >>> import lxml.etree as et
  >>> parser = et.HTMLParser()
  >>> tree = et.parse("http://the/page.html", parser)
  >>> print(tree.xpath("string(/html/body)"))

http://codespeak.net/lxml/
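
To try the same XPath trick without fetching anything, here is a minimal sketch that parses an in-memory page instead of a URL. The SAMPLE_HTML string and the extract_body_text() helper are made up for illustration, and the whitespace collapsing is an extra step of mine, not something lxml does for you.

```python
import lxml.etree as et

# Hypothetical stand-in for a downloaded news page.
SAMPLE_HTML = """
<html>
  <head><title>Sample story</title></head>
  <body>
    <h1>Headline</h1>
    <p>First   paragraph of the
       story.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""


def extract_body_text(html_source):
    """Return the text content of <body> with runs of whitespace collapsed."""
    parser = et.HTMLParser()
    root = et.fromstring(html_source, parser)
    # string() concatenates every text node below <body>, in document order.
    raw_text = root.xpath("string(/html/body)")
    # Collapse the indentation and line breaks left over from the markup.
    return " ".join(raw_text.split())


print(extract_body_text(SAMPLE_HTML))
```

Note that the <title> text is left out because it lives in <head>, not <body>.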

You may want to use the incredibly versatile "lxml.html.clean" module first to
remove any annoying content. It's not released yet but available in a branch:

http://codespeak.net/svn/lxml/branch/html/
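
If you only need to get rid of script and style content before extracting the text, a rough stand-in can be written with plain lxml.etree. This is only a sketch of the idea and not the lxml.html.clean module itself; the SAMPLE_HTML string, the extract_visible_text() helper and the default tag list are assumptions for illustration.

```python
import lxml.etree as et

# Hypothetical page with content that is never shown to the reader.
SAMPLE_HTML = """<html><body>
<script>var tracker = 1;</script>
<style>p { color: red; }</style>
<p>Story text.</p>
</body></html>"""


def extract_visible_text(html_source, unwanted_tags=("script", "style")):
    """Drop the unwanted elements, then return the collapsed <body> text."""
    root = et.fromstring(html_source, et.HTMLParser())
    # Remove the elements together with their content; with_tail=False
    # keeps any text that follows the removed element.
    et.strip_elements(root, *unwanted_tags, with_tail=False)
    raw_text = root.xpath("string(/html/body)")
    return " ".join(raw_text.split())


print(extract_visible_text(SAMPLE_HTML))
```

Sidebars and navigation blocks differ from site to site, so removing those would need page-specific tag or attribute tests on top of this.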

Stefan
