Carl Banks wrote:
> On Apr 4, 2:08 pm, John Nagle <[EMAIL PROTECTED]> wrote:
>> The syntax that browsers understand as HTML comments is much less
>> restrictive than what BeautifulSoup understands. I keep running into
>> sites with formally incorrect HTML comments which are parsed happily
>> by browsers. Here's yet another example, this one from
>> "http://www.webdirectory.com". The page starts like this:
>>
>> <!Hello there! Welcome to The Environment Directory!>
>> <!Not too much exciting HTML code here but it does the job! >
>> <!See ya, - JD >
>>
>> <HTML><HEAD>
>> <TITLE>Environment Web Directory</TITLE>
>>
>> Those are, of course, invalid HTML comments. But Firefox, IE, etc. handle
>> them without problems.
>>
>> BeautifulSoup can't parse this page usefully at all.
>> It treats the entire page as a text chunk. It's actually
>> HTMLParser that parses comments, so this is really an HTMLParser
>> level problem.
>
> Google for a program called "tidy". Install it, and run it as a
> filter on any HTML you download. "tidy" has invested in it quite a
> bit of work understanding common bad HTML and how browsers deal with
> it. It would be pointless to duplicate that work in the Python
> standard library; let HTMLParser be small and tight, and outsource the
> handling of floozy input to a dedicated program.

eGenix have produced the mxTidy library that handily incorporates these
features in a way that makes them easy for Python programmers to use.
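
If you would rather call the command-line program directly, the "run it
as a filter" idea can be sketched roughly as below. This is only an
illustration, not mxTidy's API: the helper name tidy_filter and the
TIDY_COMMAND tuple are made up for this example, it assumes a "tidy"
executable is on the PATH, and I have not checked whether tidy repairs
the particular <!...> comments shown above.

```python
import subprocess

# Assumed command line (adjust to taste): "-q" suppresses the banner,
# "-utf8" sets UTF-8 for input and output, "-asxhtml" requests XHTML,
# and "--force-output yes" asks tidy to emit output even on errors.
TIDY_COMMAND = ("tidy", "-q", "-utf8", "-asxhtml", "--force-output", "yes")


def tidy_filter(html, command=TIDY_COMMAND, encoding="utf-8"):
    """Pipe html through an external filter and return its stdout as text.

    tidy_filter is a hypothetical helper name; command can be any
    program that reads markup on stdin and writes markup to stdout.
    """
    result = subprocess.run(
        list(command),
        input=html.encode(encoding),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors), so the exit
    # status alone is not treated as failure; only empty output is.
    if not result.stdout:
        raise RuntimeError(result.stderr.decode(encoding, "replace"))
    return result.stdout.decode(encoding, "replace")
```

The cleaned string returned by tidy_filter(raw_html) is what you would
then hand to BeautifulSoup or HTMLParser in place of the raw download.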

regards
 Steve
--
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC/Ltd          http://www.holdenweb.com
Skype: holdenweb     http://del.icio.us/steve.holden
Recent Ramblings       http://holdenweb.blogspot.com