
If I execute the following two lines in Python 2.5 (feeding in a 
*unicode* string):

    import sgmllib
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

I get the exception:

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in 
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)

The reason is that the character reference &#223; is converted to the 
*byte* string "\xdf" by SGMLParser.convert_codepoint. When this byte 
string is combined with the surrounding unicode string, Python decodes 
it implicitly as ASCII, which fails for 0xdf.
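
To show the failing step in isolation, here is a minimal sketch that 
performs the ASCII decode explicitly instead of relying on Python 2's 
implicit conversion. It is not the sgmllib code itself; the names 
byte_ref and error are only for illustration, and because it uses the 
b"" literal and "except ... as" it assumes Python 2.6 or later rather 
than 2.5:

```python
# Stand-in for what convert_codepoint returns for &#223;: the single byte 0xdf.
byte_ref = b"\xdf"

# Python 2 decodes a byte string as ASCII when it is combined with a
# unicode string. Doing that decode explicitly raises the same error.
try:
    byte_ref.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The printed message should match the last line of the traceback above, 
since 0xdf lies outside the ASCII range 0..127.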

Workaround (not thoroughly tested): Override convert_codepoint in a 
derived class with:

    def convert_codepoint(self, codepoint):
        return unichr(codepoint)

Is this a bug, or is SGMLParser not meant to be used with unicode 
strings (in which case that should be documented)?

