Bugs item #1144533, was opened at 2005-02-19 21:02
Message generated for change (Comment added) made by leogah
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1144533&group_id=5470
Category: Python Library
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Allan Hoeltje (ahoeltje)
Assigned to: Nobody/Anonymous (nobody)
Summary: htmllib quote parse error within a <script>
Initial Comment:
I am using htmllib to parse web pages for plain-text content. I
came across a web page that contained a script construct similar
to the example below. Note that the script itself writes out a script
element. htmllib appears to be confused by the single and double
quotes used between the real <script> and </script> tags.
I am using "Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC
3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin" on a
PowerBook G4 running OSX 10.3.8.
<html>
<body>
<h1> This is a test </h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' +
rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>
Here is the Python trace:
Traceback (most recent call last):
  File "cleanFeed.py", line 26, in ?
    clean = stripHtml.strip( feed )
  File "/Users/allan/Desktop/Mood for Today/stripHtml.py", line 144, in strip
    parser.feed(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 327, in parse_endtag
    self.error("bad end tag: %s" % `rawdata[i:j]`)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 1, column 309
----------------------------------------------------------------------
Comment By: Richard Brodie (leogah)
Date: 2005-03-09 00:51
Message:
Generally speaking, you are better off conditioning random
junk pulled off the web (with uTidylib or similar) before
feeding it to HTMLParser, which tends to report errors when
it finds them.
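See: http://www.w3.org/TR/html4/appendix/notes.html#h-B.3.2.1
for an explanation of why the error message is strictly correct:
in HTML 4, the first "</" inside a script element ends its content,
so "</scr' + 'ipt>" is read as a (malformed) end tag.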
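
One way to condition such input before parsing, sketched here under
assumptions: the HTML 4 note linked above suggests writing "<\/" instead
of "</" inside JavaScript strings, and a small preprocessor can apply
that escape to script bodies. The names escape_script_end_tags and
SAMPLE_SCRIPT are hypothetical, and the regular expression is only a
rough approximation of finding script elements, not a real HTML parser.

```python
import re

# Hypothetical sample: a script that writes another script element.
SAMPLE_SCRIPT = "<script>document.write('<scr' + 'ipt></scr' + 'ipt>');</script>"

# Rough match for a script element: open tag, body, real close tag.
# Assumption: the first literal "</script>" is the real close tag.
_SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)",
    re.IGNORECASE | re.DOTALL,
)


def escape_script_end_tags(markup):
    """Replace "</" with "<\\/" inside script bodies only."""

    def _fix(match):
        open_tag, body, close_tag = match.groups()
        # A backslash before "/" is harmless inside a JavaScript string
        # literal, but it hides the "</" sequence from an HTML parser.
        return open_tag + body.replace("</", "<\\/") + close_tag

    return _SCRIPT_RE.sub(_fix, markup)
```

After this pass the only "</" left in the sample is the one in the real
closing tag, so a strict parser no longer sees a bad end tag there.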
Someone may step in with a patch to make HTMLParser more
tolerant in this case; there will always be something else
though.
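
For comparison, a minimal sketch of the reporter's use case (plain text
out of HTML) written against Python 3's html.parser module rather than
the Python 2.3 HTMLParser from the traceback. This assumes a Python 3
interpreter, where script content is treated as raw data until the real
closing tag. TextExtractor, strip_html and SAMPLE are hypothetical names,
not the reporter's stripHtml module.

```python
from html.parser import HTMLParser

# The example page from the report.
SAMPLE = """<html>
<body>
<h1> This is a test </h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' +
rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>
"""


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside script or style."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments seen so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def strip_html(markup):
    """Return the visible text of markup as a single string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Calling strip_html(SAMPLE) keeps the heading text and drops the script
body, including the "</scr' + 'ipt>" fragment that triggered the error
in the report.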
----------------------------------------------------------------------