Hi all,
while playing with PBP/mechanize/ClientForm, I ran into a problem with the way htmllib.HTMLParser was handling encoded tag attributes.
Specifically, the following HTML was not being handled correctly:
<option value="Small (6&quot;)">Small (6)</option>
The 'value' attr was being given the escaped value, 'Small (6&quot;)', not the correct unescaped value, 'Small (6")'.
It turns out that sgmllib.SGMLParser (on which htmllib.HTMLParser is based) does not unescape tag attributes. However, HTMLParser.HTMLParser (the newer, more XHTML-friendly class) does do so.
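To show the behavior I'd expect from the parser, here is a minimal sketch using HTMLParser.HTMLParser. The AttrCollector class is made up for illustration, and the fallback import of html.parser is an assumption for interpreters where the HTMLParser module has been renamed.

```python
try:
    from HTMLParser import HTMLParser   # module name in Python 2.x
except ImportError:
    from html.parser import HTMLParser  # assumed name on later interpreters


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attrs passed to handle_starttag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        self.attrs_seen.extend(attrs)


collector = AttrCollector()
collector.feed('<option value="Small (6&quot;)">Small (6)</option>')
collector.close()
print(collector.attrs_seen)
```

With this parser the 'value' attr should come back unescaped, as 'Small (6")', which is what I'd like sgmllib to do as well.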
My proposed fix is to change sgmllib to unescape tag attributes in the same way that HTMLParser.HTMLParser does. A context diff against sgmllib.py from Python 2.4 is at the bottom of this message.
I'm posting to this newsgroup before submitting the patch because I'm not too familiar with these classes and I want to make sure this behavior is correct.
One question I had was this: as you can see from the code below, a simple string.replace is done to replace encoded strings with their unencoded translations. Should handle_entityref be used instead, as with standard HTML text?
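To make the question concrete, here is a standalone sketch of the replace-based approach, written as a plain function rather than a method. The unescape_amp_first variant is hypothetical and only there to show why the "&amp;" replacement has to come last.

```python
def unescape(s):
    """Replace the five named entities, in the same order as the patch."""
    if '&' not in s:
        return s
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    s = s.replace("&apos;", "'")
    s = s.replace("&quot;", '"')
    s = s.replace("&amp;", "&")  # must be last
    return s


def unescape_amp_first(s):
    """Hypothetical wrong ordering, for contrast only."""
    s = s.replace("&amp;", "&")
    s = s.replace("&lt;", "<")
    return s


print(unescape('Small (6&quot;)'))     # the attribute value from the example
print(unescape('&amp;lt;'))            # stays an escaped '<' after one pass
print(unescape_amp_first('&amp;lt;'))  # decoded twice
print(unescape('&#34;'))               # numeric references are left alone
```

The last call shows the limitation behind my question: plain replacement only knows these five names, so numeric character references and other named entities pass through unchanged.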
Another question: should this fix, if appropriate, be back-ported to older versions of Python? (I doubt sgmllib has changed much, so it should be pretty simple to do.)
thanks for any advice, --titus
*** /u/t/software/Python-2.4/Lib/sgmllib.py 2004-09-08 18:49:58.000000000 -0700
--- sgmllib.py 2004-12-16 23:30:51.000000000 -0800
***************
*** 272,277 ****
--- 272,278 ----
              elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
                   attrvalue[:1] == '"' == attrvalue[-1:]:
                  attrvalue = attrvalue[1:-1]
+                 attrvalue = self.unescape(attrvalue)
              attrs.append((attrname.lower(), attrvalue))
              k = match.end(0)
          if rawdata[j] == '>':
***************
*** 414,419 ****
--- 415,432 ----
      def unknown_charref(self, ref): pass
      def unknown_entityref(self, ref): pass
+     # Internal -- helper to remove special character quoting
+     def unescape(self, s):
+         if '&' not in s:
+             return s
+         s = s.replace("&lt;", "<")
+         s = s.replace("&gt;", ">")
+         s = s.replace("&apos;", "'")
+         s = s.replace("&quot;", '"')
+         s = s.replace("&amp;", "&") # Must be last
+
+         return s
+
  class TestSGMLParser(SGMLParser):

--
http://mail.python.org/mailman/listinfo/python-list