Here's another example of the annoying "attributes must be ASCII
but sgmllib doesn't check" problem.
Run "http://www.serversdirect.com" through BeautifulSoup, and watch it
blow up at this bogus HTML:
<LI>Support Multi-Core Intel® Xeon® processor 3200/3000 sequence
</LISUPPORT sequence 32003000 processor xeon® intel® multi-core>
sgmllib takes the whole bogus end tag, ® symbols included, as the tag name,
then uses it as part of a Python attribute name in a getattr() call:
    SGMLParser.feed(self, markup or "")
  File "/usr/local/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/sgmllib.py", line 138, in goahead
    k = self.parse_endtag(i)
  File "/usr/local/lib/python2.5/sgmllib.py", line 315, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/local/lib/python2.5/sgmllib.py", line 353, in finish_endtag
    method = getattr(self, 'end_' + tag)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 46: ordinal not in range(128)
And it's downhill from there. Probably worth fixing, since it's one of the
few real-world HTML bugs that blow up BeautifulSoup completely.
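
One possible shape for a fix, as a rough sketch only: check the tag name
before it ever reaches getattr(), and treat anything that isn't a plain ASCII
tag name as an unknown tag. The names handler_name, _SAFE_TAG and BOGUS_TAG
are made up for illustration; none of this is in sgmllib or BeautifulSoup,
and the whitelist pattern is my assumption about what a sane tag name looks
like. BOGUS_TAG assumes the parser lowercases the end tag text from the HTML
above, which is consistent with the position reported in the traceback.

```python
# -*- coding: utf-8 -*-
import re

# Assumed whitelist: an ASCII letter, then ASCII letters, digits, _ . : -
_SAFE_TAG = re.compile(r'^[A-Za-z][A-Za-z0-9_.:-]*\Z')

def handler_name(prefix, tag):
    """Return prefix + tag if tag is a plain ASCII tag name, else None.

    Hypothetical helper: a caller would fall back to its unknown-tag
    handling on None instead of passing the name to getattr().
    """
    if _SAFE_TAG.match(tag):
        return prefix + tag
    return None

# Lowercased text of the bogus end tag quoted above (assumption: this is
# the tag string the parser ends up with).
BOGUS_TAG = u'lisupport sequence 32003000 processor xeon\xae intel\xae multi-core'

print(handler_name('end_', u'li'))          # end_li
print(handler_name('end_', BOGUS_TAG))      # None
print(('end_' + BOGUS_TAG).index(u'\xae'))  # 46
```

The last line is just a sanity check: the first ® in 'end_' + tag sits at
index 46, the same position the UnicodeEncodeError complains about.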
John Nagle
SiteTruth