Re: Ask how to use HTMLParser

Dave Angel Fri, 08 Jan 2010 02:39:19 -0800

Water Lin wrote:

h0uk <[email protected]> writes:

On 8 Jan, 08:44, Water Lin <[email protected]> wrote:

I am new to Python, and I want to parse an HTML page. I
tried to use HTMLParser. Here is my sample code:
----------------------
from HTMLParser import HTMLParser
from urllib2 import urlopen

class MyParser(HTMLParser):
    title = ""
    is_title = ""
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs[0][1] == 'articleTitle':
            print "Found link => %s" % attrs[0][1]
            self.is_title = 1

    def handle_data(self, data):
        if self.is_title:
            print "here"
            self.title = data
            print self.title
            self.is_title = 0
-----------------------

For the tag
-------
<div class="articleTitle">open article title</div>
-------

I use my code to parse it. I can locate the div tag, but I don't know how
to get the text inside the tag, which is "open article title" in my example.

How can I get the HTML content? What's wrong with my handle_data function?

Thanks

Water Lin

--
Water Lin's notes and pencils:http://en.waterlin.org
Email: [email protected]

I want to say your code works well.


But in handle_data I can't print self.title. I don't know why I can't set
self.title in handle_data.

Thanks

Water Lin

I don't know HTMLParser, but I see a possible confusion point in your class definition.

You have both class-attributes and instance-attributes of the same names (title and is_title). So if you have more than one instance of MyParser, then they won't see each other's changes. Normally, I'd move the initialization of such attributes into the __init__() method, so the behavior is clear.
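
A rough sketch of what that could look like, untested against the real page. Several things here are assumptions made for illustration: the import fallback (so the same code loads on Python 2 or 3), feeding a literal string instead of fetching a URL inside __init__(), and the dict(attrs) lookup, which is swapped in for attrs[0][1] so that a div with no attributes does not raise IndexError.

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as in the original post
except ImportError:
    from html.parser import HTMLParser     # Python 3 location of the same class


class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # Per-instance state, set up here instead of at class level.
        self.title = ""
        self.is_title = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; dict() makes the lookup
        # safe even when the tag has no attributes at all.
        if tag == 'div' and dict(attrs).get('class') == 'articleTitle':
            self.is_title = True

    def handle_data(self, data):
        # Only keep the text that directly follows the matching start tag.
        if self.is_title:
            self.title = data
            self.is_title = False


# Sample input taken from the question; no network access needed.
parser = MyParser()
parser.feed('<div class="articleTitle">open article title</div>')
parser.close()
```

With the sample tag from the question, parser.title should end up holding the text between the opening and closing div tags.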

When an instance-attribute has the same name as a class-attribute, the instance-attribute takes precedence, and "hides" the class-attribute, for further processing in that same instance. So effectively, the class-attribute acts as a default value.
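
A small illustration of that hiding behavior, independent of HTMLParser. The class name Demo and its values are made up for this example only.

```python
class Demo(object):
    title = "default"          # class attribute, acts as a shared default


a = Demo()
b = Demo()

a.title = "changed"            # creates an instance attribute on a only

print(a.title)                 # instance attribute hides the class attribute
print(b.title)                 # b still falls back to the class attribute
print(Demo.title)              # the class attribute itself is untouched

# The instance dictionaries show where each value actually lives.
print('title' in a.__dict__)
print('title' in b.__dict__)
```

Assigning through an instance never changes the class attribute, which is why one instance's change is invisible to the other.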



--
http://mail.python.org/mailman/listinfo/python-list
