Benjamin Kaplan <benjamin.kaplan <at> case.edu> writes:

> First of all, you're right that might be confusing. I was thinking of
> auto-detect as in "check the platform and locale and guess what they
> usually use". I wasn't thinking of it like the web browsers use it. I
> think it uses locale.getpreferredencoding().
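
For anyone who wants to check that guess on their own machine, here is a minimal sketch, assuming Python 3. The variable names are illustrative, and the values printed depend entirely on platform, locale, and Python version, so no particular output is implied.

```python
import locale
import tempfile

# The locale's preferred text encoding. Passing False asks Python not to
# change the process locale while working it out.
preferred = locale.getpreferredencoding(False)
print("locale.getpreferredencoding():", preferred)

# Open a scratch file in text mode without an encoding= argument and
# inspect which encoding Python actually picked for it.
with tempfile.TemporaryFile("w+") as f:
    chosen = f.encoding
print("encoding chosen by default:", chosen)
```

If the two names differ on a given setup (for example under UTF-8 mode or a newer Python), that is a sign the default is not purely locale-driven there.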
You're probably right. I'd forgotten about locale.getpreferredencoding().
I'll raise a request on the bug tracker to get some more precise wording
in the open() docs.

> On my machine, I get sys.getpreferredencoding() == 'utf-8' and
> locale.getdefaultencoding()== 'cp1252'.

sys <-> locale ... +1 long-range transposition typo of the year :-)

> If you check my response to Anjanesh's comment, I mentioned that he
> should either find out which encoding it is in particular or he should
> open the file in binary mode. I suggested utf-8 and latin1 because
> those are the most likely candidates for his file since cp1252 was
> already excluded.

The OP is on a Windows machine. His file looks like a source code file.
He is unlikely to be creating latin1 files himself on a Windows box.
Under the hypothesis that he is accidentally or otherwise reading
somebody else's source files as data, it could be any encoding. In one
package with which I'm familiar, the encoding is declared as cp1251 in
every .py file; AFAICT the only file with non-ASCII characters is an
example script containing his wife's name!

The OP's 0x9d is a defined character in code pages 1250, 1251, 1256, and
1257 -- admittedly all as implausible as the latin1 control character.

> Looking at a character map, 0x9d is a control character in latin1, so
> the page is probably UTF-8 encoded. Thinking about it now, it could
> also be MacRoman but that isn't as common as UTF-8.

Late breaking news: I presume you can see two instances of U+00DD (LATIN
CAPITAL LETTER Y WITH ACUTE) in the OP's report

    "query":"0 1»Ý \u2021 0\u201a0 \u2021»Ý","

Well, u'\xdd'.encode('utf8') is '\xc3\x9d' ... the Bayesian score for
utf8 just went up a notch.

The preceding character U+00BB (looks like >>) doesn't cause an
exception because 0xBB, unlike 0x9D, is defined in cp1252.

Curiously, looking at the \uxxxx escape sequences: \u2021 is "double
dagger", \u201a is "single low-9 quotation mark" ...
what appears to be the value part of an item in a hard-coded dictionary
is about as comprehensible as the Voynich manuscript.

Trouble with cases like this is as soon as they become interesting, the
OP often snatches somebody's one-liner that "works" (i.e. doesn't raise
an exception), makes a quick break for the county line, and they're not
seen again :-)

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
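
The byte-level claims in this post can be checked with a short script. This is only a sketch, assuming Python 3 (the post itself uses Python 2 syntax such as u'\xdd'). The helper name decode_byte and the list of candidate codecs are illustrative choices and have nothing to do with the OP's code.

```python
import unicodedata

# Encodings discussed in the thread, spelled as Python codec names.
CANDIDATES = ["cp1250", "cp1251", "cp1252", "cp1256", "cp1257",
              "latin-1", "mac_roman"]

def decode_byte(byte, encoding):
    """Return the character a single byte decodes to, or None if the
    codec has no mapping for it (hypothetical helper for this sketch)."""
    try:
        return bytes([byte]).decode(encoding)
    except UnicodeDecodeError:
        return None

# How does each candidate treat the troublesome byte 0x9D?
results = {enc: decode_byte(0x9D, enc) for enc in CANDIDATES}
for enc, ch in results.items():
    if ch is None:
        print(f"{enc:10} 0x9D is undefined")
    else:
        print(f"{enc:10} 0x9D -> U+{ord(ch):04X} "
              f"(category {unicodedata.category(ch)})")

# U+00DD encoded as UTF-8 ends in the byte 0x9D.
utf8_bytes = "\u00dd".encode("utf-8")
print(utf8_bytes)  # b'\xc3\x9d'

# 0xBB, unlike 0x9D, has a cp1252 mapping, so U+00BB never raises there.
print(decode_byte(0xBB, "cp1252") == "\u00bb")
```

A category of Cc in the output marks a control character, which is what makes latin1 an implausible reading of 0x9D even though the decode itself succeeds.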