On 10/20/2018 8:24 AM, pjmcle...@gmail.com wrote:
On Saturday, October 13, 2018 at 7:24:14 PM UTC-4, MRAB wrote:

i have a sort of decode error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 83064: invalid start byte
*****************
and it seems to refer to my code line:
***********
data = f.read()
***************
which is part of this block of code
********************
# Read content of files
     for path in files:
         with open(join("docs", path), encoding="utf-8") as f:
         #with open(join("docs", path)) as f:
             data = f.read()
             process_data(data)
***********************************************

would the solution fix be this?
**********************
data = f.read(), decoding = "utf-8"  #OR
data = f.read(), decoding = "ascii" # is this the right fix or previous or both wrong??

Both statements are invalid syntax. The encoding is set in the open() call, not in f.read().
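As a minimal sketch of where the decoding can legitimately go: either pass encoding= to open(), or open the file in binary mode and call .decode() on the bytes yourself. The temporary file and the names sample_text, data_text and data_from_bytes are made up for illustration and are not part of your program.

```python
import os
import tempfile

# Hypothetical sample file, written as UTF-8 bytes.
sample_text = "caf\u00e9\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample_text.encode("utf-8"))
    path = tmp.name

# Option A: let open() decode; f.read() returns str.
with open(path, encoding="utf-8") as f:
    data_text = f.read()

# Option B: read raw bytes, then decode explicitly.
with open(path, mode="rb") as f:
    data_from_bytes = f.read().decode("utf-8")

os.remove(path)
```

Both options give the same string here. Option B is probably what the "decoding = ..." attempt was reaching for.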

What you need to find out: is '0xb0' a one-byte error or is 'utf-8' the wrong encoding? Things I might do:

1. Change the encoding in open() to 'ascii' and see if the exception message still refers to position 83064 or if there is a non-ascii character earlier in the file. The latter would mean that there is at least one earlier non-ascii sequence that was decoded as utf-8. This would suggest that 'utf-8' might be correct and that the '0xb0' byte is an error.

2. In the latter case, add "errors='handler'", where 'handler' is something other than the default 'strict', such as 'replace', 'ignore', or 'backslashreplace'. Look in the doc or see help(open) for alternatives.

3. In open(), replace "encoding='utf-8'" with "mode='rb'" so that f.read() creates data as bytes instead of a text string. Then print, say, data[83000:83200] to see the context of the non-ascii byte.

4. Change the encoding in open() to 'latin-1'. The file will then be read as text without error, even if latin-1 is the wrong encoding, because latin-1 assigns a character to every possible byte value.
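The four checks above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample bytes, the temporary file, and the helper first_bad_offset() are hypothetical stand-ins for your real file in "docs", and the helper decodes the raw bytes directly instead of reading the position out of a traceback.

```python
import os
import tempfile

# Hypothetical sample: ASCII text plus one Latin-1 degree sign (byte 0xb0).
sample = b"temperature: 21" + b"\xb0" + b"C\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

def first_bad_offset(path, encoding):
    """Return the byte offset of the first decode failure, or None."""
    with open(path, mode="rb") as f:
        raw = f.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None

# Check 1: compare where 'ascii' and 'utf-8' first fail.
# Equal offsets mean there is no earlier non-ascii byte in the file.
ascii_pos = first_bad_offset(path, "ascii")
utf8_pos = first_bad_offset(path, "utf-8")

# Check 2: a non-strict error handler; bad bytes become U+FFFD.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()

# Check 3: binary mode, then look at the bytes around the failure.
with open(path, mode="rb") as f:
    raw = f.read()
context = raw[max(0, utf8_pos - 10):utf8_pos + 10]

# Check 4: latin-1 never fails, whether or not it is the right encoding.
with open(path, encoding="latin-1") as f:
    latin = f.read()

os.remove(path)
```

With your real file, printing context (or data[83000:83200] as in item 3) should show whether the surrounding text looks like it wants a degree sign or some other Latin-1 character at that spot.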



--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
