On Thu, Jan 29, 2009 at 4:19 PM, John Machin <sjmac...@lexicon.net> wrote:
> Benjamin Kaplan <bsk16 <at> case.edu> writes:
> >
> > On Thu, Jan 29, 2009 at 12:09 PM, Anjanesh Lekshminarayanan
> > <mail <at> anjanesh.net> wrote:
> > > It does auto-detect it as cp1252- look at the files in the
> > > traceback and you'll see lib\encodings\cp1252.py. Since cp1252
> > > seems to be the wrong encoding, try opening it as utf-8 or latin1
> > > and see if that fixes it.
>
> Benjamin, "auto-detect" has strong connotations of the open() call
> (with mode including text and encoding not specified) reading some/all
> of the file and trying to guess what the encoding might be -- a futile
> pursuit and not what the docs say:
>
> """encoding is the name of the encoding used to decode or encode the
> file. This should only be used in text mode. The default encoding is
> platform dependent, but any encoding supported by Python can be
> passed. See the codecs module for the list of supported encodings"""
>
> On my machine [Windows XP SP3] sys.getdefaultencoding() returns
> 'utf-8'. It would be interesting to know
> (1) what is produced on Anjanesh's machine
> (2) how the default encoding is derived (I would have thought I was a
> prime candidate for 'cp1252')
> (3) whether the 'default encoding' of open() is actually the same as
> the 'default encoding' of sys.getdefaultencoding() -- one would hope
> so but the docs don't say so.

First of all, you're right, that might be confusing. I was thinking of
auto-detect as in "check the platform and locale and guess what they
usually use". I wasn't thinking of it the way web browsers use the term.

I think it uses locale.getpreferredencoding(). On my machine, I get
sys.getdefaultencoding() == 'utf-8' and locale.getpreferredencoding()
== 'cp1252'. When I open a file without specifying the encoding, it's
cp1252.

> > Thanks a lot ! utf-8 and latin1 were accepted !
>
> Benjamin and Anjanesh, Please understand that
> any_random_rubbish.decode('latin1') will be "accepted".
> This is *not* useful information to be greeted with thanks and
> exclamation marks. It is merely a by-product of the fact that *any*
> single-byte character set like latin1 that uses all 256 possible bytes
> can not fail, by definition; no character "maps to <undefined>".

If you check my response to Anjanesh's comment, I mentioned that he
should either find out which encoding it is in particular or open the
file in binary mode. I suggested utf-8 and latin1 because those are the
most likely candidates for his file since cp1252 was already excluded.
Looking at a character map, 0x9d is a control character in latin1, so
the page is probably UTF-8 encoded. Thinking about it now, it could also
be MacRoman, but that isn't as common as UTF-8.

> > If you want to read the file as text, find out which encoding it
> > actually is. In one of those encodings, you'll probably see some
> > nonsense characters. If you are just looking at the file as a
> > sequence of bytes, open the file in binary mode rather than text.
> > That way, you'll avoid this issue altogether (just make sure you
> > use byte strings instead of unicode strings).
>
> In fact, inspection of Anjanesh's report:
> """UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in
> position 10442: character maps to <undefined>
> The string at position 10442 is something like this :
> "query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý"," """
>
> draws two observations:
> (1) there is nothing in the reported string that can be unambiguously
> identified as corresponding to "0x9d"
> (2) it looks like a small snippet from a Python source file!
>
> Anjanesh, Is it a .py file? If so, is there something like
> "# encoding: cp1252" or "# encoding: utf-8" near the start of the
> file? *Please* tell us what sys.getdefaultencoding() returns on your
> machine.
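
For anyone who wants to check this on their own machine, here is a minimal sketch, assuming Python 3. It reports sys.getdefaultencoding(), locale.getpreferredencoding(), and the encoding that open() actually picks for a text file when none is given. The helper names report_default_encodings and open_default_encoding are made up for this example, and whether open() really follows the locale value is exactly the open question above, so treat the output as something to compare rather than a guarantee.

```python
import locale
import os
import sys
import tempfile


def open_default_encoding():
    """Open a scratch file in text mode with no encoding and report what was used."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path) as handle:  # text mode, encoding deliberately not specified
            return handle.encoding
    finally:
        os.remove(path)


def report_default_encodings():
    """Collect the different 'default' encodings discussed in this thread."""
    return {
        "sys.getdefaultencoding()": sys.getdefaultencoding(),
        "locale.getpreferredencoding(False)": locale.getpreferredencoding(False),
        "open() with no encoding": open_default_encoding(),
    }


for label, value in report_default_encodings().items():
    print(label, "->", value)
```

If the last two lines of output name the same codec and the first one differs, that would match what I see here: the open() default tracks the locale, not sys.getdefaultencoding().
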
>
> Instead of "something like", please report exactly what is there:
>
> print(ascii(open('the_file', 'rb').read()[10442-20:10442+21]))
>
> Cheers,
> John
>
> --
> http://mail.python.org/mailman/listinfo/python-list
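
Along the same lines, here is a small sketch of the two points made in this thread: latin1 can decode any byte string, so being "accepted" proves nothing, while cp1252 has no mapping for 0x9d. The sample bytes are a made-up stand-in for the real file (a right double quotation mark, U+201D, which encodes to E2 80 9D in UTF-8), and try_decodings and context_bytes are hypothetical helper names, not anything from the standard library.

```python
def try_decodings(data, candidates=("utf-8", "cp1252", "latin1")):
    """Map each candidate encoding to True if data decodes without error."""
    results = {}
    for name in candidates:
        try:
            data.decode(name)
            results[name] = True
        except UnicodeDecodeError:
            results[name] = False
    return results


def context_bytes(data, pos, width=20):
    """Return the bytes around pos, like the slice John asks for."""
    return data[max(0, pos - width):pos + width + 1]


# Every possible byte value: latin1 maps all 256, so it cannot fail.
all_bytes = bytes(range(256))
print(try_decodings(all_bytes))

# Made-up sample: U+201D encoded as UTF-8 contains the byte 0x9d.
sample = '"query":"0 1\u201d"'.encode("utf-8")
print(try_decodings(sample))

# Show the raw bytes around the offending position instead of guessing.
pos = sample.index(b"\x9d")
print(pos, ascii(context_bytes(sample, pos, width=5)))
```

On the real file you would pass open('the_file', 'rb').read() and 10442 instead of the sample and its computed position.
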