Re: Encoding sniffer?

Stuart Bishop Wed, 11 Jan 2006 00:55:09 -0800

[EMAIL PROTECTED] wrote:
>     Andreas> Does anyone know of a Python module that is able to sniff the
>     Andreas> encoding of text?
> 
> I have such a beast.  Search here:
> 
>     http://orca.mojam.com/~skip/python/
> 
> for "decode".
> 
> Skip


We have similar code. It looks functionally the same, except that ours also:

        Checks whether the string starts with a BOM.
        Detects probable ISO-8859-15 using a set of characters common
        in ISO-8859-15 but uncommon in ISO-8859-1.
        Has doctests :-)

    # Detect BOM
    # UTF-32 first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes
    _boms = [
        (codecs.BOM_UTF32_BE, 'utf_32_be'),
        (codecs.BOM_UTF32_LE, 'utf_32_le'),
        (codecs.BOM_UTF16_BE, 'utf_16_be'),
        (codecs.BOM_UTF16_LE, 'utf_16_le'),
        ]

    try:
        for bom, encoding in _boms:
            if s.startswith(bom):
                return unicode(s[len(bom):], encoding)
    except UnicodeDecodeError:
        pass

    [...]

    # If we have characters in this range, it is probably ISO-8859-15
    if re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None:
        try:
            return unicode(s, 'ISO-8859-15')
        except UnicodeDecodeError:
            pass
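
For anyone reading this on a newer Python, here is a rough, self-contained
sketch of the same two checks. It is not our actual code: the name
sniff_decode, the UTF-8 BOM entry, the plain UTF-8 attempt and the final
Latin-1 fallback are all assumptions of mine, standing in for the part
elided above.

```python
import codecs
import re

# BOMs, UTF-32 before UTF-16: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so the longer one has to be tested first.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),          # assumption: not in the excerpt
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]

# Byte values that mean different things in ISO-8859-1 and ISO-8859-15.
_LATIN9_HINT = re.compile(rb"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]")


def sniff_decode(data):
    r"""Guess the encoding of a byte string and return the decoded text.

    >>> sniff_decode(codecs.BOM_UTF16_LE + "hi".encode("utf_16_le"))
    'hi'
    >>> sniff_decode(b"10 \xa4") == "10 \u20ac"
    True
    """
    # 1. A BOM is the strongest hint available.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                return data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                break  # looked like a BOM, but the payload is invalid

    # 2. Assumption: try strict UTF-8 before any 8-bit guess.
    try:
        return data.decode("utf_8")
    except UnicodeDecodeError:
        pass

    # 3. Bytes that differ between the two Latin sets suggest ISO-8859-15.
    if _LATIN9_HINT.search(data) is not None:
        return data.decode("iso8859_15")

    # 4. Assumed last resort: Latin-1 accepts every byte value.
    return data.decode("latin_1")
```

Running "python -m doctest" on a file containing this exercises the two
examples in the docstring. Note that a stray 0xA4 gets read as a euro sign
rather than a currency sign, which is exactly the bet the heuristic makes.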

Feel free to fold these checks into your code. Otherwise, I can probably
post ours somewhere if necessary.

-- 
Stuart Bishop <[EMAIL PROTECTED]>
http://www.stuartbishop.net/

-- 
http://mail.python.org/mailman/listinfo/python-list
