There are ten web pages I want to process, from http://www.af.shejis.com/new_lw/html/125926.shtml to http://www.af.shejis.com/new_lw/html/125936.shtml
Each of them declares the Chinese charset "gb2312", and Firefox displays all of them correctly, as readable Chinese. My task is to fetch every page, extract its HTML title, and display the title in a Linux shell terminal. My problem is that for some pages I get a human-readable title (in Chinese), but for other pages I get garbled text. Since every page has the same charset, I don't know why I can't get every title the same way. Here's my Python code, get_title.py:

[CODE]
#!/usr/bin/python
import urllib2
from BeautifulSoup import BeautifulSoup

min_page=125926
max_page=125936

def make_page_url(page_index):
    return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",str(page_index),ur".shtml"])

def get_page_title(page_index):
    url=make_page_url(page_index)
    print "now getting: ", url
    user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers={'User-Agent':user_agent}
    req=urllib2.Request(url,None,headers)
    response=urllib2.urlopen(req)
    #print response.info()
    page=response.read()

    #extract title with Beautiful Soup
    soup=BeautifulSoup(page)
    full_title=str(soup.html.head.title.string)

    #title is in the format of "title --title"
    #use this code to delete the "--" and the duplicate title
    title=full_title[full_title.rfind('-')+1::]

    return title

for i in xrange(min_page,max_page):
    print get_page_title(i)
[/CODE]

Will somebody please help me out? Thanks in advance.
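One thing that might be worth checking, and this is only a guess, not something confirmed for these pages: a page that declares "gb2312" can still contain characters from the larger GBK/GB18030 sets, which may confuse charset auto-detection. The sketch below decodes the raw bytes explicitly as "gb18030" (a superset of gb2312) and then trims the title the same way as get_title.py. The helper name `extract_title`, the regular expression, and the sample HTML are all made up for illustration, and it uses a plain regex instead of BeautifulSoup so that it runs offline.

```python
# -*- coding: utf-8 -*-
import re

# Match the raw bytes between <title> and </title>, before any decoding.
TITLE_RE = re.compile(br"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(raw_page, encoding="gb18030"):
    """Return the text after the last '-' in the page title, or None.

    Hypothetical helper: raw_page is the byte string returned by
    response.read(). The encoding default is an assumption.
    """
    match = TITLE_RE.search(raw_page)
    if match is None:
        return None
    # Decode explicitly instead of relying on charset auto-detection.
    # "replace" substitutes a marker for any byte sequence that is
    # still invalid instead of raising an exception.
    full_title = match.group(1).decode(encoding, "replace")
    # Titles look like "title --title"; keep what follows the last '-'.
    return full_title[full_title.rfind(u"-") + 1:].strip()

if __name__ == "__main__":
    # Made-up sample page; its title contains a character that is
    # outside the basic gb2312 set.
    sample = u"<html><head><title>朱镕基 --朱镕基</title></head></html>"
    print(extract_title(sample.encode("gb18030")))
```

If the explicit decode gives readable titles for the pages that were garbled before, the same encoding could presumably be passed to BeautifulSoup through its `fromEncoding` argument, and the resulting unicode title encoded to the terminal's own encoding before printing.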