I'm writing a program that will parse HTML and (mostly) convert it to MediaWiki format. The two Python modules I'm aware of for this are HTMLParser and htmllib. However, I'm currently running into difficulty (real or conceptual) with both, and was wondering if I could get some advice.
The problem I'm having with HTMLParser is simple: I don't seem to be getting the actual text of the HTML document. I've implemented the do_data method of HTMLParser.HTMLParser in my subclass, but it never seems to receive any data. Is there another way to access the text chunks as they come along? HTMLParser would probably be the way to go if I can figure this out; it seems much simpler than htmllib, and it satisfies my requirements.

htmllib will write out the text data (using AbstractFormatter and AbstractWriter), but my problem there is conceptual: I simply don't understand why all of these different "levels" of abstraction are necessary, nor how to use them. As an example, the HTML <i>text</i> should be converted to ''text'' (double single-quotes at each end) in my MediaWiki markup output. This would obviously be easy to achieve if I simply had an HTML parser that called a method for each start tag, text chunk, and end tag. But htmllib calls the tag methods in HTMLParser, and then does more work with both a formatter and a writer. To me, both seem unnecessarily complex (though I suppose I can see the benefit of a writer before generators gave us the opportunity to simply yield chunks of output to be processed by external code). In any case, I don't really have a good idea of how to use htmllib to get my converted start tags, then the content, then the converted closing tags, written out.

Please feel free to point to examples, code, etc. Probably the simplest solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken
-- 
http://mail.python.org/mailman/listinfo/python-list
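P.S. To make concrete the kind of parser I'm describing (one method called per start tag, text chunk, and end tag), here is a minimal sketch. One note: the text callback in HTMLParser.HTMLParser appears to be spelled handle_data rather than do_data, which may be why no data is arriving. The sketch below uses the newer html.parser spelling of the module (the callbacks are the same), and the WIKI_MARKUP table is just an illustration covering two tags:

```python
# Minimal sketch of a callback-based HTML -> MediaWiki converter.
# The module is HTMLParser in Python 2 and html.parser in Python 3;
# the text callback in both is handle_data.
from html.parser import HTMLParser

# Illustrative tag map (assumption, not a standard table): extend as needed.
WIKI_MARKUP = {'i': "''", 'b': "'''"}

class WikiConverter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []  # accumulated chunks of MediaWiki output

    def handle_starttag(self, tag, attrs):
        # Emit the opening wiki markup for tags we know about.
        self.out.append(WIKI_MARKUP.get(tag, ''))

    def handle_endtag(self, tag):
        # MediaWiki quoting is symmetric, so the same markup closes it.
        self.out.append(WIKI_MARKUP.get(tag, ''))

    def handle_data(self, data):
        # This is where the text chunks between tags arrive.
        self.out.append(data)

    def convert(self, html):
        self.feed(html)
        self.close()
        return ''.join(self.out)

print(WikiConverter().convert("some <i>text</i> here"))
# -> some ''text'' here
```

Something along these lines would sidestep the formatter/writer layers of htmllib entirely, since the subclass writes its output directly.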