[EMAIL PROTECTED] wrote:
> Andreas> Does anyone know of a Python module that is able to sniff the
> Andreas> encoding of text?
>
> I have such a beast. Search here:
>
>     http://orca.mojam.com/~skip/python/
>
> for "decode".
>
> Skip

We have similar code. It looks functionally the same, except that ours also has:

  - A check for whether the string starts with a BOM.
  - Detection of probable ISO-8859-15, using a set of characters that are
    common in ISO-8859-15 but uncommon in ISO-8859-1.
  - Doctests :-)

    import codecs
    import re

    # Detect BOM
    # The UTF-32 BOMs are listed first because BOM_UTF32_LE starts with
    # the same two bytes as BOM_UTF16_LE.
    _boms = [
        (codecs.BOM_UTF32_BE, 'utf_32_be'),
        (codecs.BOM_UTF32_LE, 'utf_32_le'),
        (codecs.BOM_UTF16_BE, 'utf_16_be'),
        (codecs.BOM_UTF16_LE, 'utf_16_le'),
        ]
    try:
        for bom, encoding in _boms:
            if s.startswith(bom):
                return unicode(s[len(bom):], encoding)
    except UnicodeDecodeError:
        pass

    [...]

    # If we have characters in this range, it is probably ISO-8859-15
    if re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None:
        try:
            return unicode(s, 'ISO-8859-15')
        except UnicodeDecodeError:
            pass

Feel free to update your available code. Otherwise, I can probably post
ours somewhere if necessary.

-- 
Stuart Bishop <[EMAIL PROTECTED]>
http://www.stuartbishop.net/
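
For anyone reading this on Python 3, here is a rough, self-contained sketch of how the two checks described above might fit together. It is not the code referred to in this thread: the function name `sniff_decode` is hypothetical, the UTF-8 attempt in the middle is my own assumption standing in for the elided part, and the final ISO-8859-1 fallback is likewise assumed.

```python
import codecs
import re

# UTF-32 BOMs are checked first because BOM_UTF32_LE begins with the
# same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]

# Byte values whose meaning differs between ISO-8859-1 and ISO-8859-15.
_LATIN9_HINT = re.compile(rb"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]")


def sniff_decode(data: bytes) -> str:
    """Guess the encoding of `data` and return the decoded text."""
    # Step 1: a BOM, if present, names the encoding directly.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                return data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                break  # BOM was misleading; fall through to the guesses

    # Step 2 (assumption, not from the thread): try strict UTF-8.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 3: bytes in the hint set suggest ISO-8859-15.
    if _LATIN9_HINT.search(data):
        return data.decode("iso-8859-15")

    # Step 4 (assumption): ISO-8859-1 accepts any byte sequence.
    return data.decode("iso-8859-1")
```

With this sketch, `sniff_decode(b"10 \xa4")` takes the ISO-8859-15 branch, since 0xA4 is the euro sign there, while `sniff_decode(b"caf\xe9")` falls through to ISO-8859-1.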
-- http://mail.python.org/mailman/listinfo/python-list