Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <[EMAIL PROTECTED]> wrote:
> >     The syntax that browsers understand as HTML comments is much less
> > restrictive than what BeautifulSoup understands.  I keep running into
> > sites with formally incorrect HTML comments which are parsed happily
> > by browsers.  Here's yet another example, this one from
> > "http://www.webdirectory.com".  The page starts like this:
> >
> > <!Hello there! Welcome to The Environment Directory!>
> > <!Not too much exciting HTML code here but it does the job! >
> > <!See ya, - JD >
> >
> > <HTML><HEAD>
> > <TITLE>Environment Web Directory</TITLE>
> >
> > Those are, of course, invalid HTML comments.  But Firefox, IE, etc.
> > handle them without problems.
> >
> >     BeautifulSoup can't parse this page usefully at all.
> > It treats the entire page as a text chunk.  It's actually
> > HTMLParser that parses comments, so this is really an HTMLParser
> > level problem.
>
> Google for a program called "tidy".  Install it, and run it as a
> filter on any HTML you download.  "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it.  It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.
That's a good suggestion.  In fact it looks like there's a Python API for tidy:

http://utidylib.berlios.de/

Tried it; it seems to get rid of the <! comments > just fine.
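For anyone who wants to try the same filtering step, here is a minimal sketch. It assumes uTidylib's documented tidy.parseString() entry point (with libtidy option names such as output_xhtml and tidy_mark), uses the Python 2-era HTMLParser module discussed in the thread, and the TitleGrabber class is just a toy helper invented for the example, not part of any library:

    # Sketch: run malformed HTML through tidy (via uTidylib) before
    # handing it to HTMLParser.  Assumes uTidylib is installed and exposes
    # tidy.parseString() as documented at http://utidylib.berlios.de/.
    import tidy
    from HTMLParser import HTMLParser

    # The snippet quoted above, with the bogus <!...> pseudo-comments.
    BAD_HTML = """
    <!Hello there! Welcome to The Environment Directory!>
    <!Not too much exciting HTML code here but it does the job! >
    <!See ya, - JD >
    <HTML><HEAD>
    <TITLE>Environment Web Directory</TITLE>
    </HEAD><BODY></BODY></HTML>
    """

    class TitleGrabber(HTMLParser):
        """Toy parser that just records the <title> text."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.in_title = False
            self.title = ''
        def handle_starttag(self, tag, attrs):
            if tag == 'title':
                self.in_title = True
        def handle_endtag(self, tag):
            if tag == 'title':
                self.in_title = False
        def handle_data(self, data):
            if self.in_title:
                self.title += data

    # tidy rewrites or drops the bogus <!...> comments and returns
    # well-formed markup; str() of the result gives the cleaned document.
    cleaned = str(tidy.parseString(BAD_HTML, output_xhtml=1, tidy_mark=0))

    parser = TitleGrabber()
    parser.feed(cleaned)
    parser.close()
    print parser.title   # should print: Environment Web Directory

The point of the sketch is only the ordering: tidy runs first as a cleanup filter, and HTMLParser (or BeautifulSoup) only ever sees the repaired markup, so it never has to cope with the invalid comments itself.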