Carl Banks wrote:
> On Apr 4, 2:43 pm, Robert Kern <[EMAIL PROTECTED]> wrote:
>> Carl Banks wrote:
>>> On Apr 4, 2:08 pm, John Nagle <[EMAIL PROTECTED]> wrote:
>>>> BeautifulSoup can't parse this page usefully at all.
>>>> It treats the entire page as a text chunk. It's actually
>>>> HTMLParser that parses comments, so this is really an HTMLParser
>>>> level problem.
>>> Google for a program called "tidy". Install it, and run it as a
>>> filter on any HTML you download. "tidy" has invested in it quite a
>>> bit of work understanding common bad HTML and how browsers deal with
>>> it. It would be pointless to duplicate that work in the Python
>>> standard library; let HTMLParser be small and tight, and outsource the
>>> handling of floozy input to a dedicated program.
>> Well, BeautifulSoup is just such a dedicated library.
>
> No, not really.

Yes, it is. Whether it succeeds in all particulars is beside the point. The only
mission of BeautifulSoup is to handle bad HTML. That tidy doesn't successfully
handle some other subset of bad HTML doesn't mean it's not a dedicated program
for handling bad HTML.

>> However, it defers its
>> handling of comments to HTMLParser. That's the problem.
>
> Well, it's up to the writers of Beautiful Soup to decide how much bad
> HTML they want to accept. ISTM they're happy to live with the
> limitations of HTMLParser, meaning that they do not consider Beautiful
> Soup to be a library dedicated to reading every piece of bad HTML out
> there.

Sorry, let me be clearer: the problem is that they haven't overridden
SGMLParser's handling of comments (not HTMLParser, sorry) the way they have
overridden many other parts of SGMLParser. Yes, any fix should go into
BeautifulSoup and not SGMLParser. All it takes is someone to code up their
desired behavior for these perverse comments and submit it to Leonard Richardson.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
-- 
http://mail.python.org/mailman/listinfo/python-list
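
For readers unfamiliar with what "overriding the handling of comments" looks like in practice, here is a minimal sketch of the subclass-and-override pattern. It is not BeautifulSoup's code and not a fix for the page discussed in this thread. It assumes Python 3's html.parser.HTMLParser as a stand-in for SGMLParser, and the class name CommentCollector is hypothetical. Only the public handle_comment and handle_data hooks are overridden.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass: keeps comments apart from ordinary text."""

    def __init__(self):
        super().__init__()
        self.text = []      # chunks of character data, in document order
        self.comments = []  # comment bodies, without the <!-- and --> markers

    def handle_data(self, data):
        # Called by the base parser for text found between tags.
        self.text.append(data)

    def handle_comment(self, data):
        # The base class ignores comments; a subclass decides what to do.
        self.comments.append(data)


parser = CommentCollector()
parser.feed("<p>a<!-- note -->b</p>")
parser.close()
print(parser.comments)
print("".join(parser.text))
```

A library that wants more forgiving behavior for malformed comments would put that logic in its own subclass in the same way, which is why the fix belongs in BeautifulSoup rather than in the base parser.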