Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <[EMAIL PROTECTED]> wrote:
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

Well, BeautifulSoup is just such a dedicated library. However, it defers its
handling of comments to HTMLParser. That's the problem.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
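
For reference, the "run it as a filter" suggestion quoted above could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name tidy_filter is made up, it assumes a modern Python 3 and an HTML Tidy executable named "tidy" on the PATH, and the flags shown (-q, -utf8, --force-output yes) are one plausible choice. Whether tidy actually repairs the malformed comments under discussion is not established here.

```python
import shutil
import subprocess


def tidy_filter(html, tidy_cmd="tidy"):
    """Pipe HTML text through an external tidy program and return the result.

    Hypothetical helper: returns the input unchanged if the program is
    not installed or if it produces no output.
    """
    exe = shutil.which(tidy_cmd)
    if exe is None:
        return html  # tidy is not on PATH, so pass the input through

    proc = subprocess.run(
        [exe, "-q", "-utf8", "--force-output", "yes"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # discard tidy's warning report
    )
    # tidy exits with status 1 for warnings and 2 for errors, so a
    # nonzero exit status is not treated as a failure here.
    cleaned = proc.stdout.decode("utf-8", errors="replace")
    return cleaned or html
```

The cleaned string would then be handed to BeautifulSoup (or HTMLParser) in place of the raw download. Falling back to the original text keeps the pipeline usable on machines where tidy is missing, at the cost of silently skipping the cleanup.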