On Fri, Nov 29, 2013 at 12:44 PM, Joel Goldstick <joel.goldst...@gmail.com> wrote:
>
> On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence
> <breamore...@yahoo.co.uk> wrote:
>
>> On 29/11/2013 16:56, Max Cuban wrote:
>>
>>> I have the following code to extract certain links from a webpage:
>>>
>>> from bs4 import BeautifulSoup
>>> import urllib2, sys
>>> import re
>>>
>>> def tonaton():
>>>     site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>>     hdr = {'User-Agent' : 'Mozilla/5.0'}
>>>     req = urllib2.Request(site, headers=hdr)
>>>     jobpass = urllib2.urlopen(req)
>>>     invalid_tag = ('h2')
>>>     soup = BeautifulSoup(jobpass)
>>>     print soup.find_all('h2')
>>>
>>> The links are contained in the 'h2' tags so I get the links as follows:
>>>
>>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>>> <h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
>>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>>
>>> But I'm interested in getting rid of all the 'h2' tags so that I have
>>> links only in this manner:
>>>
>>> <a href="/en/cashiers-accra">cashiers </a>
>>> <a href="/en/cake-baker-accra">Cake baker</a>
>>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>>
>>>
>>> This is more a Beautiful Soup question than a Python one. Have you gone
>>> through their tutorial? Check here:
>>>
>>
> They have an example that looks close here:
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/
>
> One common task is extracting all the URLs found within a page’s <a> tags:
>
> for link in soup.find_all('a'):
>     print(link.get('href'))
> # http://example.com/elsie
> # http://example.com/lacie
> # http://example.com/tillie
>
> In your case, you want the href values for the child of the h2 references.
>
> So this might be close (untested)
>
Pardon my typo.
Try this:
>
> for link in soup.find_all('h2'):
>     print(link.a.get('href'))
> # /en/cashiers-accra
> # /en/cake-baker-accra
> # /en/automobile-technician-accra
> # /en/marketing-officer-accra-4
>
> --
> Joel Goldstick
> http://joelgoldstick.com
>
--
Joel Goldstick
http://joelgoldstick.com
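
For completeness, here is a rough, self-contained sketch of the same idea that can be run without hitting the site. It assumes Python 3 with bs4 installed and uses the built-in "html.parser" backend. The sample markup is copied from the h2 lines quoted above, except for the last h2 (one with no link inside), which is made up to show why a None check is worth having. The names SAMPLE_HTML and links_in_h2 are only illustrative, not anything from the original script.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the h2 elements quoted in the thread. The final
# h2 (no link inside) is a hypothetical case added to exercise the guard.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
<h2>Heading without a link</h2>
"""


def links_in_h2(html):
    """Return the first <a> tag inside each <h2>, skipping h2s with no link."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = []
    for heading in soup.find_all("h2"):
        anchor = heading.a  # None when the h2 contains no <a>
        if anchor is not None:
            anchors.append(anchor)
    return anchors


if __name__ == "__main__":
    for anchor in links_in_h2(SAMPLE_HTML):
        print(anchor)              # the bare <a ...>...</a> element, no h2
        print(anchor.get("href"))  # just the link target
```

Printing the tag itself gives the `<a href=...>` form asked for in the original question, while `.get('href')` gives only the URL path. Without the None check, an h2 that has no link would raise AttributeError on `.get`.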
--
https://mail.python.org/mailman/listinfo/python-list