Re: how to retrieve the body text alone of a webpage

jimgardener Sat, 02 Oct 2010 07:07:10 -0700

Thanks guys,
I tried this:

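from BeautifulSoup import BeautifulSoup
import time
import urllib


def get_page_body_text(url):
    # Fetch the page and parse it with BeautifulSoup 3 (Python 2).
    h = urllib.urlopen(url)
    data = h.read()
    soup = BeautifulSoup(data)
    # Collect every text node under <body> and join them into one string.
    body_texts = soup.body(text=True)
    text = ''.join(body_texts)
    return text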

...
    while True:
        #print 'size=%d'%len(get_page_body_text('http://www.google.com'))
        print 'size=%d'%len(get_page_body_text('http://sampleblogbyjim.blogspot.com/'))
        time.sleep(5)

When google.com is the URL, the code gets the correct length of
data. Then I tried a blog which I created for fun.
This causes the code to crash with an error:


  File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>",

Any idea how this can be taken care of? The blog site must be generating
bad HTML. How do you deal with such a problem?
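
One workaround I could try is to remove the script blocks before handing
the page to BeautifulSoup. This is only a sketch, and it assumes the bad
end tag comes from inline JavaScript that builds "</script>" by string
concatenation, which is what the error message suggests. The helper name
strip_script_blocks is my own, not part of any library, and it uses only
the standard re module:

```python
import re

# Matches an opening <script ...> tag, everything after it, and the first
# literal closing </script> tag. A split string such as "</scr' + 'ipt>"
# is not a literal closing tag, so it stays inside the match and is removed
# together with the rest of the script body.
SCRIPT_BLOCK = re.compile(r'<script\b[^>]*>.*?</script\s*>',
                          re.IGNORECASE | re.DOTALL)


def strip_script_blocks(html):
    # Hypothetical helper: return the markup with <script> blocks removed.
    return SCRIPT_BLOCK.sub('', html)
```

In get_page_body_text this would mean calling
BeautifulSoup(strip_script_blocks(data)) instead of BeautifulSoup(data).
A regex is a blunt tool for HTML, so this may still miss other kinds of
broken markup.
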
thanks
jim

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-us...@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.
