On Tue, Jun 28, 2016, at 06:25, Chris Angelico wrote:
> For the OP's situation, frankly, I doubt there'll be anything other
> than UTF-8, Latin-1, and CP-1252. The chances that someone casually
> mixes CP-1252 with (say) CP-1254 would be vanishingly small. So the
> simple decode of "UTF-8, or failing that, 1252" is probably going to
> give correct results for most of the content. The trick is figuring
> out a correct boundary for the check; line-by-line may be sufficient,
> or it may not.

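As a rough sketch of the line-by-line approach described in the quoted text: try UTF-8 on each line and fall back to 1252 for any line that fails. The function name decode_mixed_lines is my own, not anything from the thread or the standard library, and using errors='replace' on the fallback is an assumption made so that bytes Python's cp1252 codec leaves undefined don't raise.

```python
def decode_mixed_lines(data: bytes) -> str:
    """Decode each line as UTF-8, falling back to Windows-1252 per line.

    Hypothetical helper for illustration only.
    """
    out = []
    # bytes.splitlines breaks only at ASCII line boundaries (\n, \r, \r\n)
    for line in data.splitlines(keepends=True):
        try:
            out.append(line.decode('utf-8'))
        except UnicodeDecodeError:
            # Assumption: replace bytes cp1252 leaves undefined (0x81 etc.)
            # with U+FFFD instead of raising.
            out.append(line.decode('cp1252', errors='replace'))
    return ''.join(out)


if __name__ == '__main__':
    # One UTF-8 line and one 1252 line, both spelling the same word.
    print(decode_mixed_lines(b"caf\xc3\xa9\ncaf\xe9\n"))
```

The weakness of this boundary shows up as soon as a single line mixes both encodings: the whole line falls back to 1252, and any UTF-8 sequences in it come out as mojibake.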
For completeness, this can be done character-by-character (i.e. try to decode a UTF-8 character, and if that fails, decode the offending byte as 1252) with an error handler:

    import codecs

    def cp1252_errors(exception):
        # Only decoding errors carry the bytes we need to reinterpret.
        if not isinstance(exception, UnicodeDecodeError):
            raise exception
        data, idx = exception.object, exception.start
        byte = data[idx:idx+1]
        try:
            # Return the replacement text and the position to resume at.
            return byte.decode('windows-1252'), idx+1
        except UnicodeDecodeError:
            # python's cp1252 doesn't accept 0x81, etc
            return byte.decode('latin1'), idx+1

    codecs.register_error('cp1252', cp1252_errors)

    assert b"t\xe9st\xc3\xadng".decode('utf-8', errors='cp1252') == "t\u00e9st\u00edng"

This is probably sufficient for most purposes; byte sequences that happen to be valid UTF-8 characters but mean something sensible in cp-1252 are rare.

Just be thankful that that's all you have to deal with - the equivalent problem for Japanese encodings, for instance, is much harder (you'd probably want the boundary to be "per run of non-ASCII* characters" if lines don't suffice, and detecting the difference between UTF-8, Shift-JIS, and EUC-JP is nontrivial). There's a reason the word "mojibake" comes from Japanese.

*well, JIS X 0201, which is ASCII but for 0x5C and 0x7E. And unless you've got ISO-2022 codes to provide context for that, you've just got to guess what those two bytes mean. Fortunately (fsvo), many environments' fonts display the relevant ASCII characters as their JIS alternatives, taking that choice away from you.