Sakcee wrote:
> html =
> '<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> <head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah
> </body></html>'
> 
> 

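# DOCTYPE closed with '>' and moved ahead of <html>; triple quotes let
# the string span several lines.
html = """
        <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
        <html>
         <head>
         </head>
         <body bgcolor="#ffffff">
                Foo foo , blah blah
         </body>
        </html>
        """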

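Try checking your HTML; it looks really messy. The DOCTYPE declaration
is never closed with '>' and it sits inside <html> instead of before it.
Also, a ' string is not meant to span multiple lines in Python; use a
triple-quoted string instead. You can try the code above.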
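
If you can't fix the document at the source, one option is to clean it
up before it reaches the parser. Below is only a rough sketch of the
"regex to check and strip" idea: the helper name strip_unclosed_doctype
is made up for illustration, and it assumes the quoted identifiers in
the DOCTYPE never contain '<' or '>'.

```python
import re

# Hypothetical helper, not part of htmllib or sgmllib.
# Matches a DOCTYPE declaration that runs straight into the next tag
# without ever being closed by '>'. Assumes the declaration itself
# contains no '<' or '>' characters.
_UNCLOSED_DOCTYPE = re.compile(r'<!DOCTYPE[^<>]*(?=<)', re.IGNORECASE)


def strip_unclosed_doctype(html):
    """Return html with any unclosed DOCTYPE declaration removed."""
    return _UNCLOSED_DOCTYPE.sub('', html)


# The document from the original post, with its unclosed DOCTYPE.
broken = ('<html><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
          '<head></head> <body bgcolor=#ffffff>\r\n Foo foo , blah blah\n'
          '</body></html>')

cleaned = strip_unclosed_doctype(broken)
```

The cleaned string could then be handed to parser.feed() as before. A
properly closed DOCTYPE is left untouched, since the pattern only
matches when the next angle bracket after the declaration is a '<'.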

As a suggestion, you should really focus on learning HTML basics ;)

Regards

Jesus (Neurogeek)

> >>> import htmllib
> >>> import formatter
> >>> parser = htmllib.HTMLParser(formatter.NullFormatter())
> >>> parser.feed(html)
> 
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
>     self.goahead(0)
>   File "/usr/lib/python2.4/sgmllib.py", line 165, in goahead
>     k = self.parse_declaration(i)
>   File "/usr/lib/python2.4/markupbase.py", line 132, in parse_declaration
>     self.error(
>   File "/usr/lib/python2.4/htmllib.py", line 40, in error
>     raise HTMLParseError(message)
> htmllib.HTMLParseError: unexpected '<' char in declaration
> 
> 
> The error is caused by the unclosed DOCTYPE declaration.
> 
> What is the best way to handle this kind of document? Should I use a
> regex to check and strip it, or does HTMLParser offer something? Can I
> override the default sgmllib behaviour?
> I have to work with htmllib because of existing modules.
> 
> 
> thanks
> 

-- 
http://mail.python.org/mailman/listinfo/python-list