On Apr 4, 4:55 pm, Robert Kern <[EMAIL PROTECTED]> wrote:
> Carl Banks wrote:
> > On Apr 4, 2:43 pm, Robert Kern <[EMAIL PROTECTED]> wrote:
> >> Carl Banks wrote:
> >>> On Apr 4, 2:08 pm, John Nagle <[EMAIL PROTECTED]> wrote:
> >>>> BeautifulSoup can't parse this page usefully at all.
> >>>> It treats the entire page as a text chunk. It's actually
> >>>> HTMLParser that parses comments, so this is really an HTMLParser
> >>>> level problem.
> >>> Google for a program called "tidy". Install it, and run it as a
> >>> filter on any HTML you download. "tidy" has invested in it quite a
> >>> bit of work understanding common bad HTML and how browsers deal with
> >>> it. It would be pointless to duplicate that work in the Python
> >>> standard library; let HTMLParser be small and tight, and outsource the
> >>> handling of floozy input to a dedicated program.
> >> Well, BeautifulSoup is just such a dedicated library.
>
> > No, not really.
>
> Yes, it is. Whether it succeeds in all particulars is besides the point. The
> only mission of BeautifulSoup is to handle bad HTML.

I think the authors of BeautifulSoup have the right to decide what their
own mission is.

Carl Banks