Re: web page text extractor

Alex Popescu Thu, 12 Jul 2007 08:56:41 -0700

On Jul 12, 5:24 pm, "Andre Engels" <[EMAIL PROTECTED]> wrote:
> 2007/7/12, Andre Engels <[EMAIL PROTECTED]>:
>
> I forgot to include
>
> import urllib2, re
>
> here
>
> > def textonly(url):
> >    # Get the HTML source on url and give only the main text
> >    f = urllib2.urlopen(url)
> >    text = f.read()
> >    r = re.compile('\<[^\<\>]*\>')
> >    newtext = r.sub('',text)
> >    while newtext != text:
> >       text = newtext
> >       newtext = r.sub('',text)
> >    return text
>
> --
> Andre Engels, [EMAIL PROTECTED]
> ICQ: 6260644  --  Skype: a_engels


Andre, unfortunately I think your solution will not ignore inlined
scripting, inlined styling, etc.: the regex only strips the tags
themselves, so the contents of script and style elements stay in the
output.
On the other hand, I don't think there are many solutions available
other than the Lynx approach somebody has already suggested.
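
To illustrate the problem, here is a rough sketch of a parser-based
variant that drops the contents of script and style elements. It is
only a sketch under assumptions: it uses Python 3's html.parser rather
than the urllib2-era code quoted above, the names TextOnlyParser and
text_only are made up for this example, and it has not been tried on
messy real-world pages.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>.

    Hypothetical helper for illustration; not part of the original code.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments found outside skipped elements
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def text_only(html):
    """Return the visible text of an HTML string, whitespace-normalised."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Fetching the page is left out on purpose; text_only takes the HTML as
a string, so it can be fed whatever the download step returns once it
has been decoded.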

bests,
./alex
--
.w( the_mindstorm )p.


-- 
http://mail.python.org/mailman/listinfo/python-list
