[EMAIL PROTECTED] wrote:
> Andreas> Does anyone know of a Python module that is able to sniff the
> Andreas> encoding of text?
>
> I have such a beast. Search here:
>
>     http://orca.mojam.com/~skip/python/
>
> for "decode".
>
> Skip

We have similar code. It looks functionally the same except that we also:
- Check whether the string starts with a BOM.
- Detect probable ISO-8859-15 using a set of characters common
  in ISO-8859-15 but uncommon in ISO-8859-1.
- Doctests :-)

    # Detect BOM. The UTF-32 BOMs are checked first because the
    # UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
    _boms = [
        (codecs.BOM_UTF32_BE, 'utf_32_be'),
        (codecs.BOM_UTF32_LE, 'utf_32_le'),
        (codecs.BOM_UTF16_BE, 'utf_16_be'),
        (codecs.BOM_UTF16_LE, 'utf_16_le'),
        ]
    try:
        for bom, encoding in _boms:
            if s.startswith(bom):
                return unicode(s[len(bom):], encoding)
    except UnicodeDecodeError:
        pass
[...]
    # If we have characters in this range, it is probably ISO-8859-15
    if re.search(r"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]", s) is not None:
        try:
            return unicode(s, 'ISO-8859-15')
        except UnicodeDecodeError:
            pass

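For anyone who wants to try the two checks in isolation, here is a minimal self-contained sketch. It is not our actual module: the name guess_decode is made up for illustration, it is written for Python 3 bytes rather than Python 2 str, and the UTF-8 attempt and the final ISO-8859-1 fallback are my assumptions standing in for the part elided above.

```python
import codecs
import re

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so checking UTF-16 LE first would misread UTF-32 LE input.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]

# Byte values that mean different characters in ISO-8859-1 and ISO-8859-15.
_LATIN9_HINT = re.compile(rb"[\xa4\xa6\xa8\xb4\xb8\xbc-\xbe]")


def guess_decode(data):
    """Hypothetical helper: decode bytes using BOM and ISO-8859-15 hints."""
    # Step 1: trust a BOM if the rest of the data decodes cleanly.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            try:
                return data[len(bom):].decode(encoding)
            except UnicodeDecodeError:
                break  # looked like a BOM, but the payload is invalid

    # Step 2 (assumption, not from the excerpt): try strict UTF-8.
    try:
        return data.decode("utf_8")
    except UnicodeDecodeError:
        pass

    # Step 3: bytes in the hint set suggest ISO-8859-15.
    if _LATIN9_HINT.search(data) is not None:
        return data.decode("iso8859_15")

    # Step 4 (assumption): fall back to ISO-8859-1, which never fails.
    return data.decode("iso8859_1")
```

With this sketch, guess_decode(b"10 \xa4") comes back with a euro sign rather than the generic currency sign, because byte 0xA4 is one of the hint characters. As with any sniffing, this is a guess: ISO-8859-1 text that really does contain one of those bytes will be misread.
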
Feel free to update your available code. Otherwise, I can probably post ours
somewhere if necessary.
--
Stuart Bishop <[EMAIL PROTECTED]>
http://www.stuartbishop.net/
-- http://mail.python.org/mailman/listinfo/python-list
