HTMLParser is behaving in what I find to be strange ways, and I would like to better understand what it is doing and why.
First, it doesn't appear to translate HTML escape characters. I don't know the actual terminology, but things like &amp; don't get translated into & as one would like. Furthermore, not only does HTMLParser not translate it, it seems to omit it from the text altogether! This prevents me from doing the translation myself, so I can't even work around the issue. Why is it doing this? Is there some mode I need to set? Can anyone else duplicate this behaviour? Is it a bug?

Secondly, HTMLParser often calls handle_data() several times in a row, without any calls to handle_starttag() in between. I did not expect this. In HTML, you either have text or you have tags, so why split up my text into successive handle_data() calls? This makes no sense to me. At the very least, it does this in response to text containing &amp;-like escape sequences (or whatever they're called), apparently so that it can skip over them instead of translating them. Again, why is it doing this? Is there some mode I need to set? Can anyone else duplicate this behaviour? Is it a bug?

These are serious problems for me and I would greatly appreciate a deeper understanding of these issues.

Thank you...
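
For anyone who wants to try duplicating this, here is a minimal sketch of what I am seeing. It assumes Python 3's html.parser module with convert_charrefs=False, which may not match the version others are running; the class name EventRecorder, the helper collect_events(), and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records start tags and text chunks as they arrive."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so the parser does not
        # decode character references inside text on its own.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(markup):
    """Feed markup to a fresh EventRecorder and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in collect_events("<p>fish &amp; chips</p>"):
        print(event)
```

With these assumptions, the text inside the paragraph arrives as two separate handle_data() calls, and neither chunk contains the ampersand.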