Re: web page text extractor

Miki Thu, 12 Jul 2007 06:50:54 -0700

Hello jk,

> For a project, I need to develop a corpus of online news stories.  I'm
> looking for an application that, given the url of a web page, "copies"
> the rendered text of the web page (not the source HTNL text), opens a
> text editor (Notepad), and displays the copied text for the user to
> examine and save into a text file. Graphics and sidebars to be
> ignored. The examples I have come across are much too complex for me
> to customize for this simple job. Can anyone lead me to the right
> direction?
Going simple :)


    from os import system
    from sys import argv

    OUTFILE = "geturl.txt"
    system("lynx -dump %s > %s" % (argv[1], OUTFILE))
    system("start notepad %s" % OUTFILE)
(You can find lynx at http://lynx.browser.org/)

Note the removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.

HTH,
--
Miki <[EMAIL PROTECTED]>
http://pythonwise.blogspot.com

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: web page text extractor

Reply via email to