On May 27, 5:01 am, [EMAIL PROTECTED] wrote:
> Hi,
>
> I wish to extract all the words on a set of webpages and store them in
> a large dictionary. I then wish to produce a list with the most common
> words for the language under consideration. So, my code below reads
> the page -
>
> http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm
>
> a Welsh-language page. I hope then to establish the 1000 most commonly
> used words in Welsh. The problem I'm having is that
> soup.findAll(text=True) is returning the likes of -
>
> u'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"'
>
> and -
>
> <a href=" \'+url+\'?rss=\'+rssURI+\'" class="sel"
>
> Any suggestions how I might overcome this problem?
>
> Thanks,
>
> Barry.
>
> Here's my code -
>
> import urllib2
> from BeautifulSoup import BeautifulSoup
>
> # Uncomment these three lines to go through a proxy:
> # proxy_support = urllib2.ProxyHandler({"http": "http://999.999.999.999:8080"})
> # opener = urllib2.build_opener(proxy_support)
> # urllib2.install_opener(opener)
>
> page = urllib2.urlopen('http://news.bbc.co.uk/welsh/hi/newsid_7420000/newsid_7420900/7420967.stm')
> soup = BeautifulSoup(page)
>
> pageText = soup.findAll(text=True)
> print pageText
As an alternative data point, you can try out the htmlStripper example
on the pyparsing wiki:

http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list
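[Editor's note] The strings Barry is seeing are the doctype declaration and
the contents of <script> elements, which findAll(text=True) returns along
with the visible text. Below is a minimal sketch of the filtering idea that
avoids BeautifulSoup entirely, using only the standard library's HTML parser
(shown in modern Python 3 syntax, unlike the Python 2 code in the thread).
The names TextExtractor and word_counts are made up for this example, and
the sample HTML is a stand-in for the BBC page.

```python
# Sketch: collect only visible text (skipping <script>/<style> contents,
# comments, and the doctype), then tally word frequencies.
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Gather text nodes, ignoring anything inside <script> or <style>."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # how many script/style elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)
    # Comments and the doctype are routed to handle_comment()/handle_decl(),
    # which we deliberately leave as no-ops, so they are dropped.

def word_counts(html):
    """Return a Counter of lower-cased words in the visible text of html."""
    parser = TextExtractor()
    parser.feed(html)
    text = ' '.join(parser.chunks)
    # Runs of letters only, so punctuation and digits don't become "words";
    # works for accented Welsh characters too.
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(w.lower() for w in words)

# Toy page exercising each kind of unwanted node from the thread.
sample = '''<!DOCTYPE html><html><head>
<style>body { color: red }</style>
<script>var url = "ignored";</script></head>
<body><!-- a comment -->
<p>mae hen wlad fy nhadau</p><p>mae pawb</p>
</body></html>'''
counts = word_counts(sample)
```

After feeding a whole set of pages through word_counts and summing the
Counters, counts.most_common(1000) would give the frequency list Barry is
after. The same skip-the-parent idea applies in BeautifulSoup: discard
strings whose parent tag is script or style, and discard Comment and
doctype nodes.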