Hi, if I execute the following two lines in Python 2.5 (to feed in a *unicode* string):

    import sgmllib
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')

I get the exception:

    Traceback (most recent call last):
      File "<pyshell#10>", line 1, in <module>
        sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
      File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
        self.goahead(0)
      File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
        k = self.parse_starttag(i)
      File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
        self._convert_ref, attrvalue)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)

The reason is that the character reference &#223; is converted to the *byte* string "\xdf" by SGMLParser.convert_codepoint. Joining this byte string with the rest of the unicode string then fails, because Python tries to decode the byte as ASCII.

Workaround (not thoroughly tested): override convert_codepoint in a derived class with:

    def convert_codepoint(self, codepoint):
        return unichr(codepoint)

Is this a bug, or is SGMLParser not meant to be used with unicode strings (in which case that should be documented)?

Michael
--
http://mail.python.org/mailman/listinfo/python-list
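
P.S. To illustrate why returning unichr(codepoint) avoids the error, here is a minimal sketch that does not use sgmllib itself. It mimics, in simplified form, the regex substitution seen in the traceback: when the callback returns a unicode character, the substitution never has to join a byte string with unicode text. The names CHARREF and expand_charrefs are hypothetical, and the pattern is only an assumption about what a numeric character reference looks like, not the one sgmllib actually uses.

```python
import re

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # on Python 3, chr already returns a text character

# Hypothetical, simplified pattern for decimal character references like &#223;
CHARREF = re.compile(r'&#(\d+);')

def expand_charrefs(text):
    """Replace each decimal character reference with its unicode character."""
    # The callback returns unicode, so the result stays a unicode string.
    return CHARREF.sub(lambda m: unichr(int(m.group(1))), text)

result = expand_charrefs(u'te&#223;t')
```

Here result is a unicode string containing the single character U+00DF in place of the reference, which is what the overridden convert_codepoint is meant to achieve inside the parser.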