Mike Meyer wrote:
> "Diez B. Roggisch" <[EMAIL PROTECTED]> writes:
>
>> Michal wrote:
>>
>>> Is there any way to detect a string's encoding in Python?
>>> I need to process several files. Each of them could be encoded in a
>>> different charset (iso-8859-2, cp1250, etc.). I want to detect it
>>> and encode it to utf-8 (with the string function encode).
>>
>> But there is _no_ way to be absolutely sure. 8 bits are 8 bits, so each
>> file is "legal" in all encodings.
>
> Not quite. Some encodings don't use all the valid 8-bit characters, so
> if you encounter a character not in an encoding, you can eliminate it
> from the list of possible encodings. This doesn't really help much by
> itself, though.
----- test.py
for enc in ["cp1250", "latin1", "iso-8859-2"]:
    print enc
    try:
        str.decode("".join([chr(i) for i in xrange(256)]), enc)
    except UnicodeDecodeError, e:
        print e
-----

192:~ deets$ python2.4 /tmp/test.py
cp1250
'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>
latin1
iso-8859-2

So cp1250 doesn't have all codepoints defined - but the others do. Sure, this helps you eliminate one of the three choices the OP wanted to choose between - but how many texts do you actually have with a byte 129 (0x81) in them?

Regards,
Diez
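A minimal sketch of the "try candidate encodings in order" approach under discussion - the function name, candidate order, and sample input below are assumptions for illustration, not code from the thread:

----- to_utf8.py (illustrative sketch, Python 2)
# Try each candidate encoding in order; re-encode the first one that
# decodes without error to UTF-8.
def to_utf8(data, candidates=("cp1250", "iso-8859-2", "latin1")):
    for enc in candidates:
        try:
            return data.decode(enc).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")

if __name__ == "__main__":
    # 0x81 rules out cp1250, so the byte gets interpreted as iso-8859-2 here.
    print repr(to_utf8("\x81abc"))
-----

Since latin1 and iso-8859-2 both assign a character to every byte value, whichever of them comes first in the candidate list always "wins" - which is exactly the limitation described above: an undefined byte can rule an encoding out, but nothing in the data can positively confirm one.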