On Sun, 25 Dec 2016 04:50 pm, Grady Martin wrote:

> On 2016-12-22 at 22:38, subhabangal...@gmail.com wrote:
>> I am getting the error:
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15:
>> invalid start byte
>
> The following is a reflex of mine, whenever I encounter Python 2 Unicode
> errors:
>
> import sys
> reload(sys)
> sys.setdefaultencoding('utf8')

This is a BAD idea, and doing it by "reflex" without very careful thought is
just cargo-cult programming. You should not thoughtlessly change the default
encoding without knowing what you are doing -- and if you know what you are
doing, you won't change it at all.

The Python interpreter *intentionally* removes setdefaultencoding at startup
for a reason. Changing the default encoding can break the interpreter, and
it is NEVER what you actually need. If you think you want it because it
fixes "Unicode errors", all you are doing is covering up bugs in your code.

Here is some background on why setdefaultencoding exists, and why it is
dangerous:

https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

If you have set the Python 2 default encoding to anything but ASCII, you are
now running a broken system with subtle bugs, including in data structures
as fundamental as dicts.

The standard behaviour:

py> d = {u'café': 1}
py> for key in d:
...     print key == 'caf\xc3\xa9'
...
False

As we should expect: the key in the dict, u'café', is *not* the same as the
byte-string 'caf\xc3\xa9'. But watch how we can break dictionaries by
changing the default encoding:

py> import sys
py> reload(sys)
<module 'sys' (built-in)>
py> sys.setdefaultencoding('utf-8')  # don't do this
py> for key in d:
...     print key == 'caf\xc3\xa9'
...
True

So Python now thinks that 'caf\xc3\xa9' is a key. Or does it?

py> d['caf\xc3\xa9']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'caf\xc3\xa9'

By changing the default encoding, we now have something which is both a key
and not a key of the dict at the same time. The comparison succeeds because
the byte-string is implicitly decoded as UTF-8, but the lookup still fails
because the byte-string and the Unicode string hash differently.

> A relevant Stack Exchange thread awaits you here:
>
> http://stackoverflow.com/a/21190382/2230956

And that's why I don't trust StackOverflow. It's not bad for answering
simple questions, but once the question becomes more complex the quality of
accepted answers goes down the toilet. The highest voted answer is *wrong*
and *dangerous*.

And then there's this comment:

    Until this moment I was forced to include "# -- coding: utf-8 --" at
    the begining of each document. This is way much easier and works as
    charm

I have no words for how wrong that is. And this comment:

    ty, this worked for my problem with python throwing
    UnicodeDecodeError on var = u"""vary large string"""

No it did not. There is no possible way that Python will throw that
exception on assignment to a Unicode string literal.

It is posts like this that demonstrate how untrustworthy StackOverflow can
be.

--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.
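
For the UnicodeDecodeError that started this thread, a less drastic route than touching the default encoding is to decode the bytes explicitly where they enter the program. What follows is only a sketch under stated assumptions: the sample bytes in `raw` are made up for illustration, and the guess that the data is Windows-1252 (where byte 0x96 is an en dash) is an assumption, not something the original poster confirmed. It is written to run under both Python 2.7 and Python 3.

```python
from __future__ import print_function

# Made-up sample bytes with 0x96 at position 15, mirroring the reported error.
raw = b'Price range: 10\x9620 USD'

failed_at = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    # 0x96 is not a valid first byte of a UTF-8 sequence, so decoding fails.
    failed_at = err.start
    print('utf-8 decode failed at position', failed_at, '-', err.reason)

# ASSUMPTION: the data was really written as Windows-1252. Replace 'cp1252'
# with whatever encoding your data actually uses.
text = raw.decode('cp1252')
print(repr(text))

# Work with Unicode text inside the program; encode only on the way out.
out = text.encode('utf-8')
print(repr(out))
```

The point of doing it this way is that a wrong guess about the encoding stays visible at one explicit call, instead of being spread silently across every implicit str/unicode conversion in the interpreter.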