On 8 Jan, 11:44, Water Lin <water...@ymail.invalid> wrote:
> h0uk <vardan.pogos...@gmail.com> writes:
> > On 8 Jan, 08:44, Water Lin <water...@ymail.invalid> wrote:
> >> I am a new guy to use Python, but I want to parse a html page now. I
> >> tried to use HTMLParse. Here is my sample code:
> >> ----------------------
> >> from HTMLParser import HTMLParser
> >> from urllib2 import urlopen
> >
> >> class MyParser(HTMLParser):
> >>     title = ""
> >>     is_title = ""
> >>     def __init__(self, url):
> >>         HTMLParser.__init__(self)
> >>         req = urlopen(url)
> >>         self.feed(req.read())
> >
> >>     def handle_starttag(self, tag, attrs):
> >>         if tag == 'div' and attrs[0][1] == 'articleTitle':
> >>             print "Found link => %s" % attrs[0][1]
> >>             self.is_title = 1
> >
> >>     def handle_data(self, data):
> >>         if self.is_title:
> >>             print "here"
> >>             self.title = data
> >>             print self.title
> >>             self.is_title = 0
> >> -----------------------
> >
> >> For the tag
> >> -------
> >> <div class="articleTitle">open article title</div>
> >> -------
> >
> >> I use my code to parse it. I can locate the div tag but I don't know how
> >> to get the text for the tag which is "open article title" in my example.
> >
> >> How can I get the html content? What's wrong in my handle_data function?
> >
> >> Thanks
> >
> >> Water Lin
> >
> >> --
> >> Water Lin's notes and pencils:http://en.waterlin.org
> >> Email: water...@ymail.com
>
> > I want to say your code works well
>
> But in handle_data I can't print self.title. I don't why I can't set the
> self.title in handle_data.
>
> Thanks
>
> Water Lin
>
> --
> Water Lin's notes and pencils:http://en.waterlin.org
> Email: water...@ymail.com

I have tested your code as:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    title = ""
    is_title = ""

    def __init__(self, data):
        HTMLParser.__init__(self)
        self.feed(data)

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs[0][1] == 'articleTitle':
            print "Found link => %s" % attrs[0][1]
            self.is_title = 1

    def handle_data(self, data):
        if self.is_title:
            print "here"
            self.title = data
            print self.title
            self.is_title = 0

if __name__ == "__main__":
    m = MyParser("""
<div class="secttlbarwrap">
<table cellpadding=0 cellspacing=0 width="100%"><tr><td>
<div style="background: url(/groups/roundedcorners?c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
<td bgcolor="#999999" width="100%" height="4"><img alt="" width=1 height=1><td>
<div style="background: url(/groups/roundedcorners?c=999999&bc=white&w=4&h=4&a=af) -4px 0px; width: 4px; height: 4px">
</div></table></div>
<div class="articleTitle">open article title</div>
<div class="secttlbar">
<div class="lf secttl">
<span id="thread_subject_site"> Ask how to use HTMLParser </span>
</div>
<div class="rf secmsg frtxt padt2">
<a class="uitl" id="showoptions_lnk2" href="#" onclick="TH_ToggleOptionsPane(); return false;">Parametrs</a>
</div>
<div class="hght0 clear" style="font-size:0;"></div>
</div>""")

Everything was printed and handled fine, and the 'print self.title'
statement works as well. Try running my code.

Vardan.

--
http://mail.python.org/mailman/listinfo/python-list
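
A side note on why the same parser can behave differently on a real page. The parser calls handle_data for every run of text, including the whitespace between tags, so a flag that is set in handle_starttag and cleared on the first handle_data call only catches the title when the text follows the opening tag directly. Also, attrs[0][1] assumes that class is the first attribute, and it raises IndexError on a <div> with no attributes at all. Below is a rough sketch of a more defensive variant, not a definitive fix. Assumptions: it targets Python 3 (html.parser) rather than the Python 2 HTMLParser module used in this thread, and the names TitleParser, titles, _depth and _chunks are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 module name (assumption: Python 3)


class TitleParser(HTMLParser):
    """Collect the text of every <div class="articleTitle"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []   # finished titles, stored per instance
        self._depth = 0    # greater than 0 while inside a matching div
        self._chunks = []  # text pieces of the title being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A div nested inside the title div: track it so that its
            # closing tag is not mistaken for the end of the title.
            self._depth += 1
        elif "articleTitle" in (dict(attrs).get("class") or "").split():
            # Look the class up by name instead of by position.
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        # May be called several times for one element, so accumulate.
        if self._depth:
            self._chunks.append(data)


parser = TitleParser()
parser.feed('<div><p>intro</p></div>\n'
            '<div class="articleTitle">open article title</div>')
parser.close()
print(parser.titles)  # ['open article title']
```

Keeping the state in __init__ rather than as class attributes means that each parser instance has its own title list, and joining the chunks at the closing tag also covers titles that contain inline markup such as <b>.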