Sakcee wrote:
> html =
> '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
> </body></html>'
>
html = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff">
Foo foo , blah blah
</body>
</html>
"""

Try checking your HTML. It looks really messy: the DOCTYPE declaration is
never closed, it sits inside <html> instead of before it, and the bgcolor
value is unquoted. Also, a single ' does not delimit a multi-line string in
Python; use triple quotes, as in the code above.

As a suggestion, you should really focus on learning HTML basics ;)

Regards
Jesus (Neurogeek)

> >>> import htmllib
> >>> import formatter
> >>> parser = htmllib.HTMLParser(formatter.NullFormatter())
> >>> parser.feed(html)
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
>     self.error(
>   File "/usr/lib/python2.4/htmllib.py", line 40, in error
>     raise HTMLParseError(message)
> htmllib.HTMLParseError: unexpected '<' char in declaration
>
> the error is generated by unclosed DOCTYPE declaration
>
> what is the best way to handle this kind of document. should I use
> regex to check and strip, or does HTMLParser offers something? , can i
> override default sgmllib behaviour
> I have to work with this htmllib because of existing modules .
>
> thanks
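
On the "regex to check and strip" question: if the documents cannot be fixed at the source, one possible approach is to repair the declaration before the text reaches parser.feed(). The sketch below is an assumption, not something htmllib or sgmllib offers; the helper name close_doctype and its pattern are hypothetical. It only covers a DOCTYPE that runs straight into the next tag, and it would mangle a DOCTYPE that carries an internal subset in square brackets.

```python
import re

# Hypothetical helper, not part of htmllib or sgmllib.
# Matches "<!DOCTYPE ..." only when the declaration runs into the next "<"
# without ever reaching its own closing ">".
_UNCLOSED_DOCTYPE = re.compile(r'<!DOCTYPE[^<>]*(?=<)', re.IGNORECASE)


def close_doctype(html):
    """Return html with a missing '>' appended to an unterminated DOCTYPE."""
    def _close(match):
        # Drop whitespace before the next tag, then close the declaration.
        return match.group(0).rstrip() + '>'
    # Only the first declaration is touched; well-formed ones never match.
    return _UNCLOSED_DOCTYPE.sub(_close, html, count=1)


# The string from the original question, joined onto one logical line.
broken = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
          '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah '
          '</body></html>')
repaired = close_doctype(broken)
```

In the session quoted above, the call would then become parser.feed(close_doctype(html)). That combination has not been verified against Python 2.4's htmllib here, so treat it as a starting point rather than a confirmed fix.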