Py 2.5: Bug in sgmllib

Michael Butscher Sun, 22 Oct 2006 04:27:02 -0700

Hi,

If I execute the following two lines in Python 2.5 (feeding in a 
*unicode* string):


import sgmllib
sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')



I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)



The reason is that the character reference &#223; is converted to the 
*byte* string "\xdf" by SGMLParser.convert_codepoint. Combining this byte 
string with the rest of the unicode string makes Python decode it as 
ASCII, which fails for byte 0xdf.
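
The failing step can be reproduced without sgmllib. This is only a 
minimal sketch: it assumes that Python 2 decodes a byte string with the 
default ASCII codec whenever it is combined with a unicode string, and 
the explicit decode() call stands in for that implicit step. The names 
raw and error_message are illustrative, and the "except ... as" syntax 
needs Python 2.6 or later.

```python
# The byte string that chr(223) yields on Python 2, written as a bytes
# literal so the example behaves the same way on Python 3.
raw = b'\xdf'

try:
    # Stand-in for the implicit conversion Python 2 performs when a byte
    # string is combined with a unicode string.
    raw.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message should name the same byte, 0xdf, as the traceback above.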


Workaround (not thoroughly tested): Override convert_codepoint in a 
derived class with:

    def convert_codepoint(self, codepoint):
        return unichr(codepoint)
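
For completeness, a sketch of how the override might sit in a derived 
class. Everything except SGMLParser, convert_codepoint, reset, 
unknown_starttag and feed is hypothetical: AttrCollector, 
codepoint_to_text and the seen list are names made up for this example. 
The two try/except blocks only exist so that the file also imports on 
Python 3, where sgmllib no longer exists. As with the workaround itself, 
this is not thoroughly tested.

```python
try:
    import sgmllib            # available on Python 2 only
except ImportError:
    sgmllib = None            # the module was removed in Python 3

try:
    unichr                    # Python 2 builtin
except NameError:
    unichr = chr              # Python 3: chr() already returns text


def codepoint_to_text(codepoint):
    """Return the character for codepoint as text, not as a byte string."""
    return unichr(codepoint)


if sgmllib is not None:
    class AttrCollector(sgmllib.SGMLParser):
        """Hypothetical parser that records the attributes of start tags."""

        def reset(self):
            sgmllib.SGMLParser.reset(self)
            self.seen = []

        def convert_codepoint(self, codepoint):
            # The override from the workaround above.
            return codepoint_to_text(codepoint)

        def unknown_starttag(self, tag, attrs):
            # Called for tags that have no start_<tag>/do_<tag> handler.
            self.seen.append((tag, attrs))

    parser = AttrCollector()
    parser.feed(u'<a title="te&#223;t"></a>')
    parser.close()
    print(repr(parser.seen))
```

Note that returning unicode from convert_codepoint only makes sense when 
the parser is fed unicode strings. Feeding byte strings that contain 
non-ASCII bytes would run into the same mixing problem from the other 
side.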



Is this a bug, or is SGMLParser not meant to be used with unicode strings 
(in which case that should be documented)?



Michael
-- 
http://mail.python.org/mailman/listinfo/python-list
