On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamore...@yahoo.co.uk> wrote:

> On 29/11/2013 16:56, Max Cuban wrote:
>
>> I have the following code to extract certain links from a webpage:
>>
>> from bs4 import BeautifulSoup
>> import urllib2, sys
>> import re
>>
>> def tonaton():
>>     site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>     hdr = {'User-Agent' : 'Mozilla/5.0'}
>>     req = urllib2.Request(site, headers=hdr)
>>     jobpass = urllib2.urlopen(req)
>>     invalid_tag = ('h2')
>>     soup = BeautifulSoup(jobpass)
>>     print soup.find_all('h2')
>>
>> The links are contained in the 'h2' tags so I get the links as follows:
>>
>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>> <h2><a href="/en/automobile-technician-accra">Automobile
>> Technician</a></h2>
>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>
>> But I'm interested in getting rid of all the 'h2' tags so that I have
>> links only in this manner:
>>
>> <a href="/en/cashiers-accra">cashiers </a>
>> <a href="/en/cake-baker-accra">Cake baker</a>
>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>
>>
>> This is more a Beautiful Soup question than a Python one. Have you gone
>> through their tutorial? Check here:
>>
>

They have an example that looks close here:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/

  One common task is extracting all the URLs found within a page’s <a> tags:

    for link in soup.find_all('a'):
        print(link.get('href'))
    # http://example.com/elsie
    # http://example.com/lacie
    # http://example.com/tillie

In your case, you want the href values for the <a> child of each h2
element, so loop over the h2 tags rather than over every <a> tag. This
might be close (untested):

    for h2 in soup.find_all('h2'):
        print(h2.a.get('href'))

If you want the whole <a> tag rather than just its href, print h2.a
instead.

-- 
Joel Goldstick
http://joelgoldstick.com
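
For anyone finding this thread later, here is a slightly fuller sketch of the same idea. It is only an illustration: it assumes Python 3 and bs4 with the built-in "html.parser" backend, it parses the sample markup from the question instead of fetching the live page, and the names SAMPLE_HTML and links_inside_h2 are made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real run would parse the
# fetched page instead of this string.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
"""


def links_inside_h2(html):
    """Return the first <a> tag found inside each <h2> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for h2 in soup.find_all("h2"):
        a = h2.find("a")
        # Skip headings that do not contain a link at all.
        if a is not None:
            links.append(a)
    return links


if __name__ == "__main__":
    for a in links_inside_h2(SAMPLE_HTML):
        print(a)              # the <a> tag itself, without the <h2> wrapper
        print(a.get("href"))  # just the link target
```

Checking for None guards against an h2 heading that has no link in it, which the one-line h2.a.get('href') version would trip over with an AttributeError.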