On Jan 27, 5:18 am, "Frank Potter" <[EMAIL PROTECTED]> wrote:
> There are ten web pages I want to deal with,
> from http://www.af.shejis.com/new_lw/html/125926.shtml
> to http://www.af.shejis.com/new_lw/html/125936.shtml
>
> Each of them uses the Chinese charset "gb2312", and Firefox
> displays all of them correctly, as readable Chinese.
>
> My job is to fetch every page, extract its HTML title, and
> display the title in a Linux shell terminal.
>
> My problem is that for some pages I get a human-readable title (in
> Chinese), but for other pages I get garbled text. Since each page
> has the same charset, I don't know why I can't get every title in the
> same way.
>
> Here's my Python code, get_title.py:
>
> [CODE]
> #!/usr/bin/python
> import urllib2
> from BeautifulSoup import BeautifulSoup
>
> min_page=125926
> max_page=125936
>
> def make_page_url(page_index):
>     return ur"".join([ur"http://www.af.shejis.com/new_lw/html/",str(page_index),ur".shtml"])
>
> def get_page_title(page_index):
>     url=make_page_url(page_index)
>     print "now getting: ", url
>     user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
>     headers={'User-Agent':user_agent}
>     req=urllib2.Request(url,None,headers)
>     response=urllib2.urlopen(req)
>     #print response.info()
>     page=response.read()
>
>     #extract title by beautiful soup
>     soup=BeautifulSoup(page)
>     full_title=str(soup.html.head.title.string)
>
>     #title is in the format of "title --title"
>     #use this code to delete the "--" and the duplicate title
>     title=full_title[full_title.rfind('-')+1::]
>
>     return title
>
> for i in xrange(min_page,max_page):
>     print get_page_title(i)
> [/CODE]
>
> Will somebody please help me out? Thanks in advance.

This pyparsing solution seems to extract what you were looking for,
but I don't know if this will render to Chinese or not.

-- Paul

from pyparsing import makeHTMLTags,SkipTo
import urllib

titleStart,titleEnd = makeHTMLTags("title")
scanExpr = (titleStart + SkipTo("- -",include=True) +
            SkipTo(titleEnd).setResultsName("titleChars") + titleEnd)

def extractTitle(htmlSource):
    titleSource = scanExpr.searchString(htmlSource, maxMatches=1)[0]
    return titleSource.titleChars

for urlIndex in range(125926,125936+1):
    url = "http://www.af.shejis.com/new_lw/html/%d.shtml" % urlIndex
    pg = urllib.urlopen(url)
    html = pg.read()
    pg.close()
    print url,':',extractTitle(html)

Gives:

http://www.af.shejis.com/new_lw/html/125926.shtml : GSM±¾µØÍø×éÍø·½Ê½
http://www.af.shejis.com/new_lw/html/125927.shtml : GSM ±¾µØÍø×éÍø·½Ê½³õ̽
http://www.af.shejis.com/new_lw/html/125928.shtml : GSMµÄÊý¾ÝÒµÎñ
http://www.af.shejis.com/new_lw/html/125929.shtml : GSMµÄÊý¾ÝÒµÎñºÍ³ÐÔØÄÜÁ¦
http://www.af.shejis.com/new_lw/html/125930.shtml : GSMµÄÍøÂçÑݽø- ´ÓGSMµ½GPRSµ½3G £¨¸½Í¼£©
http://www.af.shejis.com/new_lw/html/125931.shtml : GSM¶ÌÏûÏ ¢ÒµÎñÔÚË®Çé×Ô¶¯²â±¨ÏµÍ³ÖеÄÓ¦ÓÃ¬Ø
http://www.af.shejis.com/new_lw/html/125932.shtml : £Ç£Ó £Í½»»»ÏµÍ³µÄÍøÂçÓÅ»¯
http://www.af.shejis.com/new_lw/html/125933.shtml : GSMÇл»µô»°µÄ·ÖÎö¼ °½â¾ö°ì·¨
http://www.af.shejis.com/new_lw/html/125934.shtml : GSMÊÖ»ú²¦½ÐÊл°Ä £¿é¾ÖÓû§¹ÊÕϵÄÆÊÎö
http://www.af.shejis.com/new_lw/html/125935.shtml : GSMÊÖ»úµ½WCDMAÖն˵ÄÑݱä
http://www.af.shejis.com/new_lw/html/125936.shtml : GSMÊÖ»úµÄάÐÞ·½·¨
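
On the rendering question: the titles printed above look like GB2312 bytes being displayed as Latin-1, so the extraction itself is probably fine and only the decoding step is missing. Below is a minimal sketch of that step, assuming the page really is "gb2312" as the original post says. The helper names decode_title and repair_mojibake are hypothetical (they are not part of pyparsing or BeautifulSoup), and the sample bytes are the first title from the output above, rebuilt by hand. It is written for Python 3 but uses only bytes.decode and str.encode, so it should carry over to the Python 2 code in this thread with little change.

```python
# -*- coding: utf-8 -*-
# Sketch only: helper names are hypothetical, and the charset is assumed
# to be gb2312, as stated in the original post.


def decode_title(raw_title, charset="gb2312"):
    """Decode raw title bytes from the page into a Unicode string.

    Undecodable bytes become U+FFFD instead of raising an exception.
    """
    return raw_title.decode(charset, "replace")


def repair_mojibake(garbled, charset="gb2312"):
    """Undo a wrong Latin-1 decode.

    Latin-1 maps every byte to exactly one character, so encoding the
    garbled text back to Latin-1 recovers the original bytes, which can
    then be decoded with the intended charset.
    """
    return garbled.encode("latin-1").decode(charset)


# The bytes behind the first title in the output above.
raw = b"GSM\xb1\xbe\xb5\xd8\xcd\xf8\xd7\xe9\xcd\xf8\xb7\xbd\xca\xbd"

# Route 1: decode the bytes directly.
title = decode_title(raw)

# Route 2: start from the garbled form a Latin-1 terminal would show.
garbled = raw.decode("latin-1")
fixed = repair_mojibake(garbled)

# Both routes should give the same Unicode title.
print(title == fixed)
```

To display the result, the decoded title would still have to be encoded to whatever the terminal expects (for example title.encode("utf-8") on a UTF-8 terminal under Python 2). Printing the raw GB2312 bytes only looks right on a terminal that is itself set to GB2312.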