Fredrik Lundh <[EMAIL PROTECTED]> wrote:
> "[EMAIL PROTECTED]" wrote:
>
>> Question: what is a good strategy for taking an 8bit
>> string of unknown encoding and recovering the largest
>> amount of reasonable information from it (translated to
>> utf8 if needed)? The string might be in any of the
>> myriad encodings that predate unicode. Has anyone
>> done this in Python already? The output must be clean
>> utf8 suitable for arbitrary xml parsers.
>
> some alternatives:
>
> braindead bruteforce:
>
> try to do strict decoding as utf-8. if you succeed, you have an utf-8
> string. if not, assume iso-8859-1.

That was a mistake I made once. Do not use iso8859-1 as the Python codec:
it maps every byte value to a character, so decoding never fails and any
encoding tried after it is never reached. Instead, create your own codec,
called e.g. iso8859-1-ncc, that leaves the control ranges undefined
(just a sketch):

    import codecs

    # identity mapping for bytes 32-127 and 160-255 only; bytes 0-31
    # and 128-159 are left out, so strict decoding fails on them
    decoding_map = codecs.make_identity_dict(range(32, 128) + range(128 + 32, 256))
    decoding_map.update({})  # extra mappings go here (e.g. tab, newline)
    encoding_map = codecs.make_encoding_map(decoding_map)

and then use:

    def try_encodings(s, encodings):
        "decode string s with the first of the given encodings that works"
        for enc in encodings:
            try:
                return unicode(s, enc)
            except UnicodeDecodeError:
                pass
        return None

    guessed_unicode_text = try_encodings(text, ['utf-8', 'iso8859-1-ncc', 'cp1252', 'macroman'])

It seems to work surprisingly well if you know approximately which
language(s) the text is expected to be in (e.g. for Central European
languages, replace cp1252 with cp1250 and iso8859-1-ncc with
iso8859-2-ncc).

-- 
Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/
garabik @ kassiopeia.juls.savba.sk
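
For readers on Python 3, where the snippets in the post do not run unchanged (there is no unicode(), and range objects cannot be added together), the same idea might look roughly like the sketch below. It is not the poster's code: the search function _search_ncc, the helper guess_decode, the CANDIDATES list name, and the decision to keep tab, LF and CR decodable are all assumptions made for illustration. The codec name iso8859-1-ncc is the one suggested in the post; it only exists after it has been registered with codecs.register.

```python
import codecs

# Identity map for bytes 0x20-0x7F and 0xA0-0xFF only.
# The ranges 0x00-0x1F and 0x80-0x9F stay undefined, so a strict
# decode raises UnicodeDecodeError when it meets one of those bytes.
decoding_map = codecs.make_identity_dict(list(range(32, 128)) + list(range(160, 256)))
# Assumption (not in the post): keep tab, LF and CR decodable.
decoding_map.update({9: 9, 10: 10, 13: 13})
encoding_map = codecs.make_encoding_map(decoding_map)


def _ncc_decode(data, errors="strict"):
    # Returns a (text, bytes_consumed) tuple, as the codec API expects.
    return codecs.charmap_decode(data, errors, decoding_map)


def _ncc_encode(text, errors="strict"):
    # Returns a (bytes, characters_consumed) tuple.
    return codecs.charmap_encode(text, errors, encoding_map)


def _search_ncc(name):
    # Lookup names arrive lower-cased; newer Pythons also turn '-' into '_',
    # so normalise before comparing.
    if name.replace("-", "_") == "iso8859_1_ncc":
        return codecs.CodecInfo(name="iso8859-1-ncc",
                                encode=_ncc_encode,
                                decode=_ncc_decode)
    return None


codecs.register(_search_ncc)


def guess_decode(data, encodings):
    """Return (encoding, text) for the first encoding that decodes data strictly."""
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None


CANDIDATES = ["utf-8", "iso8859-1-ncc", "cp1252", "macroman"]
```

With this candidate list, b"caf\xe9" is rejected by utf-8 and accepted by the custom codec, while b"\x93hi\x94" falls through to cp1252 because 0x93 and 0x94 sit in the range the custom codec leaves undefined. The order of the list matters: a codec that accepts every byte value should come last, otherwise nothing after it is ever tried.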