Re: web page text extractor

2007-07-22 Thread Thomas Dickey
Miki <[EMAIL PROTECTED]> wrote: > (You can find lynx at http://lynx.browser.org/) Not exactly. The current version of lynx is 2.8.6. It's available at http://lynx.isc.org/lynx2.8.6/ 2.8.7 development & patches: http://lynx.isc.org/current/index.html -- Thomas E. Dickey http://i

Re: web page text extractor

2007-07-13 Thread rdahlstrom
To maintain paragraphs, replace any <p> or <br> tags with your operating system's line ending (CRLF on Windows). On Jul 13, 8:57 am, kublai <[EMAIL PROTECTED]> wrote: > On Jul 13, 5:44 pm, Paul McGuire <[EMAIL PROTECTED]> wrote: > > > > > On Jul 12, 4:42 am, kublai <[EMAIL PROTECTED]> wrote: > > > > Hello, > > > > Fo
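A rough sketch of that suggestion, assuming the page source is already in a string. The function name `text_with_paragraphs` and both regular expressions are illustrative and not taken from the thread; a regex is only an approximation of real HTML parsing.

```python
import os
import re

# Matches <p>, </p>, <br>, <br/>, <br /> and variants with attributes.
# The \b keeps tags such as <pre> or <param> from matching.
BREAK_TAG = re.compile(r'<\s*/?\s*(?:br|p)\b[^<>]*>', re.IGNORECASE)

# Matches any remaining tag (same idea as stripping everything between < and >).
ANY_TAG = re.compile(r'<[^<>]*>')


def text_with_paragraphs(html, newline=os.linesep):
    """Turn paragraph/line-break tags into line endings, then drop other tags."""
    text = BREAK_TAG.sub(newline, html)
    return ANY_TAG.sub('', text)


if __name__ == '__main__':
    sample = '<p>First paragraph.</p><p>Second<br>paragraph.</p>'
    print(text_with_paragraphs(sample))
```

`os.linesep` is the platform's line ending (CRLF on Windows). When writing to a file opened in text mode, passing `newline='\n'` instead avoids doubled carriage returns.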

Re: web page text extractor

2007-07-13 Thread kublai
On Jul 13, 5:44 pm, Paul McGuire <[EMAIL PROTECTED]> wrote: > On Jul 12, 4:42 am, kublai <[EMAIL PROTECTED]> wrote: > > > Hello, > > > For a project, I need to develop a corpus of online news stories. I'm > > looking for an application that, given the url of a web page, "copies" > > the rendered t

Re: web page text extractor

2007-07-13 Thread Paul McGuire
On Jul 12, 4:42 am, kublai <[EMAIL PROTECTED]> wrote: > Hello, > > For a project, I need to develop a corpus of online news stories. I'm > looking for an application that, given the url of a web page, "copies" > the rendered text of the web page (not the source HTML text), opens a > text editor (N

Re: web page text extractor

2007-07-12 Thread kublai
On Jul 13, 2:19 am, Stefan Behnel <[EMAIL PROTECTED]> wrote: > kublai wrote: > > For a project, I need to develop a corpus of online news stories. I'm > > looking for an application that, given the url of a web page, "copies" > > the rendered text of the web page (not the source HTML text), opens

Re: web page text extractor

2007-07-12 Thread Stefan Behnel
kublai wrote: > For a project, I need to develop a corpus of online news stories. I'm > looking for an application that, given the url of a web page, "copies" > the rendered text of the web page (not the source HTML text), opens a > text editor (Notepad), and displays the copied text for the user

Re: web page text extractor

2007-07-12 Thread kublai
On Jul 12, 10:22 pm, Jon Rosebaugh <[EMAIL PROTECTED]> wrote: > On 2007-07-12 04:42:25 -0500, kublai <[EMAIL PROTECTED]> said: > > > For a project, I need to develop a corpus of online news stories. I'm > > looking for an application that, given the url of a web page, "copies" > > the rendered tex

Re: web page text extractor

2007-07-12 Thread Alex Popescu
On Jul 12, 5:24 pm, "Andre Engels" <[EMAIL PROTECTED]> wrote: > 2007/7/12, Andre Engels <[EMAIL PROTECTED]>: > > I forgot to include > > import urllib2, re > > here > > > def textonly(url): > ># Get the HTML source on url and give only the main text > >f = urllib2.urlopen(url) > >text =

Re: web page text extractor

2007-07-12 Thread Jon Rosebaugh
On 2007-07-12 04:42:25 -0500, kublai <[EMAIL PROTECTED]> said: > For a project, I need to develop a corpus of online news stories. I'm > looking for an application that, given the url of a web page, "copies" > the rendered text of the web page (not the source HTML text), opens a > text editor (Not

Re: web page text extractor

2007-07-12 Thread Andre Engels
2007/7/12, Andre Engels <[EMAIL PROTECTED]>: I forgot to include import urllib2, re here > def textonly(url): ># Get the HTML source on url and give only the main text >f = urllib2.urlopen(url) >text = f.read() >r = re.compile('\<[^\<\>]*\>') >newtext = r.sub('',text) >w
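For readers on Python 3, where `urllib2` no longer exists, the same approach might look roughly like the sketch below. It covers only the visible part of the snippet (fetch, then strip tags with the regex). The helper name `strip_tags` and the UTF-8 decoding with replacement are assumptions added here, not part of the original post.

```python
import re
import urllib.request

# Same idea as the pattern in the post: anything between < and > counts as a tag.
TAG = re.compile(r'<[^<>]*>')


def strip_tags(html):
    """Remove everything that looks like an HTML tag from a string."""
    return TAG.sub('', html)


def textonly(url):
    """Get the HTML source at url and return it with the tags removed."""
    with urllib.request.urlopen(url) as f:
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        html = f.read().decode('utf-8', errors='replace')
    return strip_tags(html)


if __name__ == '__main__':
    print(strip_tags('<p>Hello <b>world</b></p>'))
```

Note that this leaves the contents of `<script>` and `<style>` elements in the output and does not decode entities such as `&amp;`, so it only approximates the rendered text.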

Re: web page text extractor

2007-07-12 Thread Andre Engels
2007/7/12, kublai <[EMAIL PROTECTED]>: > For a project, I need to develop a corpus of online news stories. I'm > looking for an application that, given the url of a web page, "copies" > the rendered text of the web page (not the source HTML text), opens a > text editor (Notepad), and displays the

Re: web page text extractor

2007-07-12 Thread Miki
Hello jk, > For a project, I need to develop a corpus of online news stories. I'm > looking for an application that, given the url of a web page, "copies" > the rendered text of the web page (not the source HTML text), opens a > text editor (Notepad), and displays the copied text for the user to

web page text extractor

2007-07-12 Thread kublai
Hello, For a project, I need to develop a corpus of online news stories. I'm looking for an application that, given the URL of a web page, "copies" the rendered text of the web page (not the source HTML text), opens a text editor (Notepad), and displays the copied text for the user to examine and