On Jul 12, 5:24 pm, "Andre Engels" <[EMAIL PROTECTED]> wrote:
> 2007/7/12, Andre Engels <[EMAIL PROTECTED]>:
>
> I forgot to include
>
> import urllib2, re
>
> here
>
> > def textonly(url):
> >     # Get the HTML source on url and give only the main text
> >     f = urllib2.urlopen(url)
> >     text = f.read()
> >     r = re.compile('\<[^\<\>]*\>')
> >     newtext = r.sub('',text)
> >     while newtext != text:
> >         text = newtext
> >         newtext = r.sub('',text)
> >     return text
>
> --
> Andre Engels, [EMAIL PROTECTED]
> ICQ: 6260644  --  Skype: a_engels

Andre, unfortunately I think your solution will not ignore inline scripts, inline styles, etc.: the regex removes the tags themselves but leaves the text between <script>...</script> and <style>...</style> in the result. On the other hand, I don't think there are many solutions available, other than the Lynx approach somebody has already suggested.

bests,
./alex
--
.w( the_mindstorm )p.
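
For what it's worth, here is a rough sketch of one way to drop the script and style contents without calling out to Lynx. It is only an illustration, not a tested replacement for textonly(): it assumes Python 3's standard html.parser module (the code quoted above is Python 2 with urllib2), it works on an HTML string rather than a URL, and the names TextOnlyParser and text_only are made up for this example.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}  # elements whose contents are dropped

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # text fragments kept so far
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def text_only(html):
    """Return the visible text of an HTML string, whitespace-normalised."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = (
        "<html><head><style>p { color: red; }</style>"
        "<script>var x = '<b>hi</b>';</script></head>"
        "<body><p>Hello <b>world</b></p></body></html>"
    )
    print(text_only(sample))
```

This still does nothing about comments rendered as text, hidden elements, or layout, so for anything close to what a browser shows, the Lynx approach is probably still the safer bet.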