Yotam Medini <yo...@users.sourceforge.net> added the comment:

HTMLParser.py fails inside <script> ... </script>: it can be fooled by JavaScript conditional expressions that use the less-than operator '<'. In the attached example:
$ tar tvzf lt-in-script-example.tgz | cut -c24-
  796 2010-09-30 16:52 h2t.py
23678 2010-09-30 16:39 t.html

here's what happens:

$ python h2t.py t.html /tmp/t.txt
HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
Traceback (most recent call last):
  File "h2t.py", line 31, in <module>
    text = html2text(f_html.read())
  File "h2t.py", line 23, in html2text
    te = TextExtractor(html)
  File "h2t.py", line 15, in __init__
    self.feed(html)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332

I have a suggested patch HTMLParser.diff fixing this problem, soon to be attached.

-- yotam

----------
nosy: +yotam
Added file: http://bugs.python.org/file19072/lt-in-script-example.tgz

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue670664>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
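
For readers without the attachment, here is a minimal sketch of the kind of input described in the report: a '<' comparison inside a <script> element. It is not the attached h2t.py or t.html. The SAMPLE string, the extract() helper and this TextExtractor class are hypothetical stand-ins, and the code targets the Python 3 html.parser module rather than the Python 2 HTMLParser.py shown in the traceback. Whether this exact input triggered the original failure is not known; the sketch only shows the behavior one would want, with the script body passed through as raw data.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text outside <script> and raw script bodies separately.

    Hypothetical stand-in for the TextExtractor named in the traceback;
    the real h2t.py is only available in the attachment.
    """

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chunks = []
        self.script_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so collect and join later.
        if self.in_script:
            self.script_chunks.append(data)
        else:
            self.text_chunks.append(data)


def extract(html):
    """Return (visible text, script source) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_chunks), "".join(parser.script_chunks)


# Assumed sample input: a less-than comparison inside a script element.
SAMPLE = "<p>before</p><script>if (a < b) { f(); }</script><p>after</p>"
text, script = extract(SAMPLE)
print(repr(text))
print(repr(script))
```

If the parser treats the script element as raw text, the '<' in the comparison should not be read as the start of a tag, and the script source should come back unchanged.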