On Thu, Apr 7, 2016, at 04:47, Daiyue Weng wrote:
> Hi, when I read a file, the file string contains Mojibake chars at the
> beginning, the code is like,
>
> file_str = open(file_path, 'r', encoding='utf-8').read()
> print(repr(open(file_path, 'r', encoding='utf-8').read()))
>
> part of the string (been printing) containing Mojibake chars is like,
>
> '锘縶\n "name": "__NAME__"'

Based on a hunch, I tried something: "锘縶" happens to be the
GBK/GB18030 interpretation of the bytes "ef bb bf 7b", which is a UTF-8
byte order mark followed by "{". So what happened is that someone wrote
text in UTF-8 with a byte order mark, and someone else read this as
GBK/GB18030 and wrote the resulting characters as UTF-8.

So it may be easier to simply special-case it:

    if file_str[:2] == '锘縶':
        file_str = '{' + file_str[2:]
    elif file_str[:2] == '锘縖':
        file_str = '[' + file_str[2:]

In principle, the whole process could be reversed as
file_str = file_str.encode('gbk').decode('utf-8') (which still leaves
the recovered byte order mark, U+FEFF, at the front of the string), but
that would be overkill if the string contains no other non-ASCII
characters and can't contain anything at the start except these. Plus,
if there are any other non-ASCII characters in the string, it's anyone's
guess as to whether they survived the process in a way that allows you
to reverse it.
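
For comparison, here is a rough sketch of both approaches as small
functions. The names strip_mojibake_bom and undo_gbk_mojibake are made up
for illustration and are not from the original code. The second function
assumes the damage really was "UTF-8 bytes read as GBK", and it decodes
with 'utf-8-sig' rather than 'utf-8' so the recovered byte order mark is
dropped as well. If the round trip fails, it gives up and returns the
input unchanged.

```python
def strip_mojibake_bom(text):
    """Special-case fix: replace a garbled BOM + bracket at the start."""
    # Each pair is (what the BOM plus bracket looks like after being
    # misread as GBK, the bracket that was actually meant).
    for garbled, real in (('锘縶', '{'), ('锘縖', '[')):
        if text.startswith(garbled):
            return real + text[len(garbled):]
    return text


def undo_gbk_mojibake(text):
    """General fix: reverse a 'UTF-8 read as GBK' round trip if possible."""
    try:
        # Encoding as GBK recovers the original bytes; decoding those as
        # 'utf-8-sig' yields the intended text and strips a leading BOM.
        return text.encode('gbk').decode('utf-8-sig')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not reversible (or never garbled this way): leave it alone.
        return text


if __name__ == '__main__':
    sample = '锘縶\n "name": "__NAME__"'
    print(repr(strip_mojibake_bom(sample)))
    print(repr(undo_gbk_mojibake(sample)))
```

The special-case version only ever touches the first two characters, so
it cannot make the rest of the string worse. The general version rewrites
the whole string, which is why it falls back to the original text on any
encoding error instead of guessing.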