html = '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah </body></html>'
>>> import htmllib
>>> import formatter
>>> parser=htmllib.HTMLParser(formatter.NullFormatter())
>>> parser.feed(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
    k = self.parse_declaration(i)
  File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
    self.error(
  File "/usr/lib/python2.4/htmllib.py", line 40, in error
    raise HTMLParseError(message)
htmllib.HTMLParseError: unexpected '<' char in declaration

The error is caused by the unclosed DOCTYPE declaration: the parser reaches the '<' of <head> before it finds the '>' that should end the declaration.

What is the best way to handle this kind of document? Should I use a regex to check for the declaration and strip it, or does HTMLParser offer something for this? Can I override the default sgmllib behaviour?

I have to keep using htmllib because existing modules depend on it.

Thanks
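
For the regex option mentioned above, a minimal sketch of what the pre-processing step might look like. The helper name strip_doctype and the pattern are assumptions made for illustration; they are not part of htmllib or sgmllib. The pattern also assumes the declaration has no internal subset (no '[ ... ]' block) and no '>' inside its quoted strings.

```python
import re

# Hypothetical helper, not part of htmllib or sgmllib.
# The pattern matches "<!DOCTYPE", then any run of characters that are
# neither '<' nor '>', then an optional closing '>'. An unclosed
# declaration is therefore cut off just before the next tag begins.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE[^<>]*>?', re.IGNORECASE)

def strip_doctype(text):
    """Return text with any DOCTYPE declaration removed, closed or not."""
    return _DOCTYPE_RE.sub('', text)

html = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah </body></html>')

cleaned = strip_doctype(html)
# The cleaned string would then be passed to the parser in place of the
# original, e.g. parser.feed(cleaned) instead of parser.feed(html).
```

Stripping the declaration before calling feed() leaves the parser and the existing modules untouched, which may matter more here than recovering the DOCTYPE itself.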