Kenneth McDonald wrote:

> The problem I'm having with HTMLParser is simple; I don't seem to be
> getting the actual text in the HTML document. I've implemented the
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
> it never seems to receive any data. Is there another way to access the
> text chunks as they come along?

the method is called "handle_data":

    http://docs.python.org/lib/module-HTMLParser.html

> HTMLParser would probably be the way to go if I can figure this out. It
> seems much simpler than htmllib, and satisfies my requirements.
>
> htmllib will write out the text data (using the AbstractFormatter and
> AbstractWriter), but my problem here is conceptual. I simply don't
> understand why all of these different "levels" of abstractness are
> necessary, nor how to use them.

if you're not interested in HTML *rendering*, use sgmllib instead.

    http://docs.python.org/lib/module-sgmllib.html

the only difference between the libs is that HTMLParser is a bit
stricter; on the other hand, if you want to parse really messy HTML,
you should probably use BeautifulSoup instead:

    http://www.crummy.com/software/BeautifulSoup/

</F>
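
for illustration, here is a minimal sketch of a subclass that overrides handle_data to collect the text chunks. the names TextCollector and extract_text are made up for this example and are not part of the library. the import uses the Python 3 spelling (html.parser); with the Python 2 module discussed in this thread it would be "from HTMLParser import HTMLParser" instead, and the super() call would need to become HTMLParser.__init__(self).

```python
# Python 3 spelling; in Python 2 the module is named HTMLParser
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that gathers the text chunks of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # the parser calls this for each run of text between tags;
        # whitespace-only runs are skipped here (a choice made for this sketch)
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Feed a complete HTML string to the parser and return the text chunks."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return parser.chunks


if __name__ == "__main__":
    print(extract_text("<h1>Title</h1><p>Hello <b>world</b>!</p>"))
```

a method named do_data is never called by the parser, which would explain why the original subclass saw no text; only the handle_* names are hooked up.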