On 10/20/2018 8:24 AM, pjmcle...@gmail.com wrote:
On Saturday, October 13, 2018 at 7:24:14 PM UTC-4, MRAB wrote:
I have a sort of decode error:

    UnicodeDecodeError: 'utf-8' can't decode byte 0xb0 in position 83064: invalid start byte

and it seems to refer to my code line:

    data = f.read()

which is part of this block of code:

    # Read content of files
    for path in files:
        with open(join("docs", path), encoding="utf-8") as f:
        #with open(join("docs", path)) as f:
            data = f.read()
            process_data(data)

Would the solution fix be this?

    data = f.read(), decoding = "utf-8"
    #OR
    data = f.read(), decoding = "ascii"
    # is this the right fix or previous or both wrong??
Both statements are invalid syntax. The encoding is set in the open() call, not in f.read().
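As a minimal sketch of where the encoding goes (the temporary file and its contents are made up for illustration and stand in for one of the files under "docs"): read() takes no encoding argument, so the choice is made either in open() or, when reading bytes, in an explicit decode() call.

```python
import os
import tempfile

# Hypothetical stand-in for one of the files under "docs".
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("20\u00b0C")

    # The encoding belongs to open(); f.read() then returns a str.
    with open(path, encoding="utf-8") as f:
        data = f.read()

    # Alternative: read raw bytes and decode them explicitly.
    with open(path, mode="rb") as f:
        data_from_bytes = f.read().decode("utf-8")
```

Both variants should give the same string here; the second is only useful when you want to look at the bytes before deciding how to decode them.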
What you need to find out: is the 0xb0 byte a one-byte error in a file that is otherwise utf-8, or is 'utf-8' the wrong encoding for the whole file? Things I might do:
1. Change the encoding in open() to 'ascii' and see whether the exception message still refers to position 83064 or reports a non-ascii byte earlier in the file. The latter would mean that there is at least one earlier non-ascii sequence that was decoded successfully as utf-8. This would suggest that 'utf-8' might be correct and that the '0xb0' byte is an error.
2. In the latter case, add "errors='handler'", where 'handler' is something other than the default 'strict'. Look in the doc or see help(open) for alternatives.
3. In open(), replace "encoding='utf-8'" with "mode='rb'" so that f.read() creates data as bytes instead of a text string. Then print, say, data[83000:83200] to see the context of the non-ascii byte.
4. Change the encoding in open() to 'latin-1'. The file will then be read as text without error, even if latin-1 is the wrong encoding.
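The four steps above can be tried on a small byte string before touching the real file. This is only a sketch: diagnose() and the sample bytes are hypothetical, and with a real file you would obtain raw from open(path, mode='rb').read() as in step 3.

```python
def diagnose(raw, encoding="utf-8", context=20):
    """Return details of the first decode failure in raw, or None if it decodes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return {
            "position": exc.start,                       # index of the offending byte
            "byte": raw[exc.start],                      # its value as an int
            "context": raw[lo:exc.start + context],      # surrounding bytes (step 3)
        }
    return None


# Hypothetical sample: valid utf-8 text followed by one stray 0xb0 byte.
sample = "caf\u00e9 at 20".encode("utf-8") + b"\xb0" + b"C"

# Step 1: compare where 'utf-8' and 'ascii' first fail.
utf8_report = diagnose(sample, "utf-8")
ascii_report = diagnose(sample, "ascii")

# Step 2: a non-strict error handler substitutes U+FFFD for the bad byte.
replaced = sample.decode("utf-8", errors="replace")

# Step 4: latin-1 maps every byte to a character, so it never raises.
latin1_text = sample.decode("latin-1")

print(utf8_report)
print(ascii_report)
```

In this sample 'ascii' fails earlier than 'utf-8' does, which is the situation step 1 describes: an earlier non-ascii sequence decoded cleanly as utf-8, so the later 0xb0 looks like an isolated bad byte rather than a sign of a wrong encoding.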
--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list