Benjamin Kaplan <bsk16 <at> case.edu> writes:

> On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan <mail <at> anjanesh.net> wrote:
> > It does auto-detect it as cp1252 - look at the files in the traceback and
> > you'll see lib\encodings\cp1252.py. Since cp1252 seems to be the wrong
> > encoding, try opening it as utf-8 or latin1 and see if that fixes it.
Benjamin, "auto-detect" has strong connotations of the open() call (with mode including text and encoding not specified) reading some/all of the file and trying to guess what the encoding might be -- a futile pursuit and not what the docs say: """encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent, but any encoding supported by Python can be passed. See the codecs module for the list of supported encodings""" On my machine [Windows XL SP3] sys.getdefaultencoding() returns 'utf-8'. It would be interesting to know (1) what is produced on Anjanesh's machine (2) how the default encoding is derived (I would have thought I was a prime candidate for 'cp1252') (3) whether the 'default encoding' of open() is actually the same as the 'default encoding' of sys.getdefaultencoding() -- one would hope so but the docs don't say so. > Thanks a lot ! utf-8 and latin1 were accepted ! Benjamin and Anjanesh, Please understand that any_random_rubbish.decode('latin1') will be "accepted". This is *not* useful information to be greeted with thanks and exclamation marks. It is merely a by-product of the fact that *any* single-byte character set like latin1 that uses all 256 possible bytes can not fail, by definition; no character "maps to <undefined>". > If you want to read the file as text, find out which encoding it actually is. In one of those encodings, you'll probably see some nonsense characters. If you are just looking at the file as a sequence of bytes, open the file in binary mode rather than text. That way, you'll avoid this issue all together (just make sure you use byte strings instead of unicode strings). In fact, inspection of Anjanesh's report: """UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to <undefined> The string at position 10442 is something like this : "query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý"," """ draws two observations: (1) there is nothing in the reported string that can be unambiguously identified as corresponding to "0x9d" (2) it looks like a small snippet from a Python source file! Anjanesh, Is it a .py file? If so, is there something like "# encoding: cp1252" or "# encoding: utf-8" near the start of the file? *Please* tell us what sys.getdefaultencoding() returns on your machine. Instead of "something like", please report exactly what is there: print(ascii(open('the_file', 'rb').read()[10442-20:10442+21])) Cheers, John -- http://mail.python.org/mailman/listinfo/python-list