On Jul 13, 2:19 am, Stefan Behnel <[EMAIL PROTECTED]> wrote:
> kublai wrote:
> > For a project, I need to develop a corpus of online news stories. I'm
> > looking for an application that, given the url of a web page, "copies"
> > the rendered text of the web page (not the source HTML text), opens a
> > text editor (Notepad), and displays the copied text for the user to
> > examine and save into a text file. Graphics and sidebars to be
> > ignored. The examples I have come across are much too complex for me
> > to customize for this simple job. Can anyone lead me to the right
> > direction?
>
> Super-simplistic:
>
>   >>> import lxml.etree as et
>   >>> parser = et.HTMLParser()
>   >>> tree = et.parse("http://the/page.html", parser)
>   >>> print(tree.xpath("string(/html/body)"))
>
> http://codespeak.net/lxml/
>
> You may want to use the incredibly versatile "lxml.html.clean" module first to
> remove any annoying content. It's not released yet but available in a branch:
>
> http://codespeak.net/svn/lxml/branch/html/
>
> Stefan
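
For anyone who cannot install lxml, here is a rough sketch of the same "give me the visible text" idea using only the standard library's html.parser. This is a swapped-in approach, not what Stefan posted, and it is far cruder than lxml: it only skips <head>, <script> and <style>, so sidebars would still come through. The names TextExtractor and page_text are made up for illustration, and the function takes an HTML string rather than a URL.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <head>, <script> and <style> content."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text runs and collapse internal whitespace.
        if not self._skip_depth and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self):
        return "\n".join(self._chunks)


def page_text(html):
    """Return the visible text of an HTML string, one text run per line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


if __name__ == "__main__":
    sample = (
        "<html><head><title>ignored</title></head><body>"
        "<h1>Headline</h1><script>var x = 1;</script>"
        "<p>First   paragraph.</p></body></html>"
    )
    print(page_text(sample))
```

The returned string could then be written to a .txt file and opened in Notepad for review, for example with subprocess.run(["notepad", path]) on Windows.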
Hi, Stefan,

This looks very interesting. I will look into this first thing tonight.
Gotta hit some golf bugs, I mean, balls first. It's a beautiful afternoon
here in Edmonton.

Cheers,
gk