On 29/11/2013 16:56, Max Cuban wrote:
I have the following code to extract certain links from a webpage:

from bs4 import BeautifulSoup
import urllib2, sys
import re

def tonaton():
     site = "http://tonaton.com/en/job-vacancies-in-ghana";
     hdr = {'User-Agent' : 'Mozilla/5.0'}
     req = urllib2.Request(site, headers=hdr)
     jobpass = urllib2.urlopen(req)
     invalid_tag = ('h2')
     soup = BeautifulSoup(jobpass)
     print soup.find_all('h2')

The links are contained in the 'h2' tags so I get the links as follows:

<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>

But I'm interested in getting rid of all the 'h2' tags so that I have links 
only in this manner:

<a href="/en/cashiers-accra">cashiers </a>
<a href="/en/cake-baker-accra">Cake baker</a>
<a href="/en/automobile-technician-accra">Automobile Technician</a>
<a href="/en/marketing-officer-accra-4">Marketing Officer</a>


I therefore updated my code to look like this:

def tonaton():
     site = "http://tonaton.com/en/job-vacancies-in-ghana";
     hdr = {'User-Agent' : 'Mozilla/5.0'}
     req = urllib2.Request(site, headers=hdr)
     jobpass = urllib2.urlopen(req)
     invalid_tag = ('h2')
     soup = BeautifulSoup(jobpass)
     jobs = soup.find_all('h2')
     for tag in invalid_tag:
         for match in jobs(tag):
             match.replaceWithChildren()
     print jobs

But I couldn't get it to work, even though I thought that was the best logic I 
could come up with. I'm a newbie, though, so I know there is something better that 
could be done.

Any help will be gratefully appreciated

Thanks


Please help us to help you. A good start is your versions of Python and OS. But more importantly here, what does "couldn't get it to work" mean? Is the output not what you expected? Or do you get a traceback? If so, please give us the whole of the output, not just the last line.
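
In the meantime, here is a minimal sketch of one way to get just the links, offered as a guess at what you are after rather than a diagnosis. It assumes bs4 is installed, uses the markup quoted above as a stand-in for the live page (SAMPLE_HTML and extract_job_links are names made up for illustration), and passes "html.parser" explicitly so no extra parser is needed. Two things stand out in the second version of tonaton(): ('h2') is just the string 'h2', not a tuple, so the for loop iterates over the characters 'h' and '2'; and find_all() returns a list-like ResultSet, which I would not expect to be callable the way jobs(tag) tries to call it.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output quoted above, so no network access
# is needed. This is a stand-in for the real page, not the real page.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
"""


def extract_job_links(html):
    """Return the <a> tags found inside <h2> tags (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h2"):
        # Search each individual Tag for its links, rather than calling
        # the ResultSet that find_all() returned.
        links.extend(heading.find_all("a"))
    return links


links = extract_job_links(SAMPLE_HTML)
for link in links:
    print(link)
```

With the live page you would pass the response from urllib2.urlopen(req) to the helper instead of SAMPLE_HTML, as in your own code.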

One last thing, I observe that you've a gmail address. This is currently guaranteed to send shivers down my spine. So if you're using google groups, would you be kind enough to read and action this, https://wiki.python.org/moin/GoogleGroupsPython, thanks.

--
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

