Codec lookup fails for bad codec name, blowing up BeautifulSoup

John Nagle Fri, 09 Nov 2007 12:47:33 -0800

   I just had our web page parser fail on "www.nasa.gov".
It seems that NASA returns an HTTP header with a charset of ".utf8", which
is non-standard.  This goes into BeautifulSoup, which blows up trying to
find a suitable codec.


   This happens because BeautifulSoup does this:

     def _codec(self, charset):
         if not charset: return charset
         codec = None
         try:
             codecs.lookup(charset)
             codec = charset
         except LookupError:
             pass
         return codec

The documentation for codecs.lookup says:

        lookup(encoding)
        Looks up a codec tuple in the Python codec registry and returns
        the function tuple as defined above.

        Encodings are first looked up in the registry's cache. If not found,
        the list of registered search functions is scanned.
        If no codecs tuple is found, a LookupError is raised.

So BeautifulSoup's lookup ought to be safe, right?  Wrong.
What actually happens is a ValueError exception:

        File "./sitetruth/BeautifulSoup.py", line 1770, in _codec
        codecs.lookup(charset)
        File "/usr/local/lib/python2.5/encodings/__init__.py", line 97,
        in search_function
        globals(), locals(), _import_tail)
        ValueError: Empty module name
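
To check which exception a given interpreter actually raises for a name, a small probe along these lines may help. It is only a sketch: "lookup_failure" is a helper name made up here, and the result for ".utf8" depends on the Python version (2.5 gives the ValueError shown above; other releases may raise LookupError instead).

```python
import codecs
import sys

def lookup_failure(charset):
    """Return the name of the exception class that codecs.lookup(charset)
    raises, or None if the lookup succeeds.

    Hypothetical diagnostic helper, not part of BeautifulSoup or codecs.
    """
    try:
        codecs.lookup(charset)
    except Exception:
        # sys.exc_info() is used so the same code runs on old and new Pythons.
        return sys.exc_info()[0].__name__
    return None
```

Calling lookup_failure(".utf8") and lookup_failure("no-such-codec-xyz") side by side shows whether the two bad names fail the same way on the interpreter at hand.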

This is a known bug.  It's in the old tracker on SourceForge:

        [ python-Bugs-960874 ] codecs.lookup can raise exceptions other
        than LookupError

but not in the new tracker.

The "resolution" back in 2004 was "Won't Fix", without a change
to the documentation.  Grrr.

I patched BeautifulSoup to work around the problem by catching
ValueError as well as LookupError:

     def _codec(self, charset):
         if not charset: return charset
         codec = None
         try:
             codecs.lookup(charset)
             codec = charset
         except (LookupError, ValueError):
             pass
         return codec
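
The same guard can be tried outside BeautifulSoup as a standalone function. This is a sketch, not BeautifulSoup's own code: "safe_codec" is a name invented here, and it assumes LookupError and ValueError are the only exceptions worth swallowing.

```python
import codecs

def safe_codec(charset):
    """Return charset if Python has a codec for it, else None.

    Hypothetical standalone version of the patched _codec method.
    """
    if not charset:
        return charset  # pass None or "" through unchanged, as _codec does
    try:
        codecs.lookup(charset)
    except (LookupError, ValueError):
        # LookupError is the documented failure; ValueError is what
        # Python 2.5 raised for ".utf8" in the traceback above.
        return None
    return charset
```

With this, safe_codec(".utf8") should come back as None rather than raising, so the caller can fall back to another guess at the encoding.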


                                        John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list
