Jackie wrote:
> I want to get the information of the professors (name,title) from the
> following link:
>
> "http://www.economics.utoronto.ca/index.php/index/person/faculty/"

That's even XHTML, no need to go through BeautifulSoup. Use lxml instead.

http://codespeak.net/lxml

> Ideally, I'd like to have an output file where each line is one Prof,
> including his name and title. In practice, I use the CSV module.
> ----------------------------------------------------
>
> import urllib,re,csv
>
> url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
>
> sock = urllib.urlopen(url)
> htmlSource = sock.read()
> sock.close()

import csv
import lxml.etree as et

url = "http://www.economics.utoronto.ca/index.php/index/person/faculty/"
tree = et.parse(url)

> namePattern = re.compile(r'class="name">(.*)</a>')
> titlePattern = re.compile(r'</a>, (.*)\s*</td>')
>
> name = namePattern.findall(htmlSource)
> title_temp = titlePattern.findall(htmlSource)
> title =[]
> for item in title_temp:
>     item_new=" ".join(item.split())  #Suppress the spaces between 'title' and </td>
>     title.extend([item_new])
>
>
> output =[]
> for i in range(len(name)):
>     output.insert(i,[name[i],title[i]])  #Generate a list of [name, title]

# untested
# text of the cell that holds the name link, with whitespace collapsed
get_name_text = et.XPath('normalize-space(td[a/@class="name"])')

name_list = []
for name_row in tree.xpath('//tr[td/a/@class = "name"]'):
    # split on commas and pad, so every row has exactly three fields
    name_list.append(
        tuple(get_name_text(name_row).split(",", 3) + ["","",""])[:3] )

> writer = csv.writer(open("professor.csv", "wb"))
> writer.writerows(output)  #output CSV file

writer = csv.writer(open("professor.csv", "wb"))
writer.writerows(name_list)  #output CSV file

> -------------- End of Program ----------------------------------------------
>
> 3.Should I close the opened csv file("professor.csv")? How to close
> it?

I guess it has a "close()" function?

Stefan
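
On the question of closing "professor.csv": the csv writer object itself has no close() method, it is the underlying file object that gets closed. Below is a minimal sketch of one way to do that with a "with" block, assuming Python 3 (the code in this thread is Python 2, where the file would be opened with mode "wb" instead). The names write_rows, sample_rows and out_path are made up for illustration, and the rows are placeholders, not data from the faculty page.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a CSV file and report whether the file was closed."""
    # The with block closes the file on exit, even if writing raises.
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return handle.closed


# Placeholder rows standing in for the scraped (name, title) pairs.
sample_rows = [("Name One", "Professor"), ("Name Two", "Associate Professor")]

# Hypothetical output location in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "professor_example.csv")
closed = write_rows(out_path, sample_rows)
```

Without a "with" block, the equivalent is to keep a reference to the file object returned by open() and call its close() method once writerows() has finished.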