Re: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10442: character maps to

Cameron Simpson Sun, 10 Jun 2018 14:33:01 -0700

On 10Jun2018 13:04, [email protected] <[email protected]> wrote:

here is the full error once again
to summarize, my script works fine in python2
i get this error trying to run it in python3
plz see below after the error, my settings for python 2 and python 3
for me it seems i need to change some settings to 'utf-8'..either just in 
python 3, since thats where i am having issues or change the settings to 
'utf-8' both in python 2 and 3....i would appreciate feedback b4 i do some 
trial and error
thanks for the consideration
tommy


***********************************************
Traceback (most recent call last):
File "createIndex.py", line 132, in <module>
c.createindex()
File "creatIndex.py", line 102, in createIndex
pagedict=self.parseCollection()
File "createIndex.py", line 47, in parseCollection
for line in self.collFile:

File "C:\Users\Robert\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7414: character maps to <undefined>

Ok, this is more helpful. It says that the decoding error, which occurred in...\cp1252.py, was decoding lines from the file self.collFile.


What is that file? And how was it opened?

Also, your settings below may indeed be important.

***************************************************
python 3 settings
import sys
import locale
locale.getpreferredencoding()
'cp1252'

The setting above is the default encoding used when you open a file in text mode in Python 3, but you can override it.
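
A minimal sketch for checking this on your own machine. The helper name default_text_encoding is made up for illustration; it opens a scratch file without an encoding= argument and reports what Python chose:

```python
import locale
import os
import tempfile


def default_text_encoding():
    """Return the encoding open() picks when no encoding= is given."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path) as f:  # text mode, no explicit encoding
            return f.encoding
    finally:
        os.remove(path)


if __name__ == "__main__":
    print("locale preferred encoding:", locale.getpreferredencoding(False))
    print("open() default encoding:  ", default_text_encoding())
```

On the Windows setup shown above I would expect both lines to report cp1252, but the result depends on the platform and Python version.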

In Python 3 this matters a lot, because Python 3 strings are Unicode. In Python 2, strings are just bytes, and are not "decoded" (there is a whole separate "unicode" type for that when it matters).

So in Python 3 the text file reader is decoding the text in the file according to what it expects the encoding to be.
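
To illustrate, here is a small sketch of the same bytes decoded under two different encodings. The sample bytes are my assumption (the UTF-8 form of a right double quotation mark, which happens to end in 0x9d); I do not know what is actually in your file. The helper try_decode is hypothetical:

```python
# UTF-8 bytes of U+201D, RIGHT DOUBLE QUOTATION MARK (assumed sample data).
raw = b"\xe2\x80\x9d"


def try_decode(data, encoding):
    """Decode bytes strictly; return the text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(repr(try_decode(raw, "utf-8")))   # decodes to one character
    print(repr(try_decode(raw, "cp1252")))  # None: 0x9d is undefined in cp1252
```

The cp1252 attempt fails on the 0x9d byte in the same way as your traceback.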

Find the place where self.collFile is opened. You can specify the decoding method there by adding the "encoding=" parameter to the open() call. It is defaulting to "cp1252" because that is what locale.getpreferredencoding() returns, but presumably the actual file data are not encoded that way.

You can (a) find out what encoding _is_ used in the file and specify that or (b) tell Python to be less picky. Choice (a) is better if it is feasible.
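
One way to start on (a) is to look at the raw bytes. The sketch below uses a hypothetical helper, undecodable_offsets, which lists where a given encoding gives up. It is intended for single-byte encodings such as cp1252; for multi-byte encodings the slicing may split a character.

```python
def undecodable_offsets(data, encoding, limit=10):
    """Return up to `limit` byte offsets in `data` that `encoding` rejects."""
    offsets = []
    pos = 0
    while pos < len(data) and len(offsets) < limit:
        try:
            data[pos:].decode(encoding)
            break  # the rest decodes cleanly
        except UnicodeDecodeError as exc:
            offsets.append(pos + exc.start)
            pos += exc.end  # resume just past the rejected byte(s)
    return offsets


if __name__ == "__main__":
    # 'path-to-the-coll-file' is a placeholder, as elsewhere in this message.
    with open("path-to-the-coll-file", "rb") as f:  # binary mode: no decoding
        data = f.read()
    for offset in undecodable_offsets(data, "cp1252"):
        # Show some surrounding bytes to help recognise the real encoding.
        print(offset, data[max(0, offset - 20):offset + 20])
```

Seeing the bytes around each offset (for example a run like \xe2\x80\x9d) is often enough to recognise the real encoding.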

If you have to guess because you don't know the encoding, one possibility is that collFile contains utf-8 or utf-16; of these 2, utf-8 seems more likely given the 0x9d byte causing the trouble. Try adding:


 encoding='utf-8'

to the open() call, eg:

 self.collFile = open('path-to-the-coll-file', encoding='utf-8')

at the appropriate place.
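
If you want to test the guesses before editing the script, something like the following sketch might help. The function name first_working_encoding and the candidate list are my own choices, not anything from your code; it reads the file as bytes and reports the first candidate that decodes without error:

```python
def first_working_encoding(path, candidates=("utf-8", "utf-16")):
    """Return the first candidate encoding that decodes the whole file, else None."""
    with open(path, "rb") as f:  # read raw bytes, no decoding yet
        data = f.read()
    for encoding in candidates:
        try:
            data.decode(encoding)  # strict: raises on any invalid byte
        except UnicodeError:
            continue
        return encoding
    return None


if __name__ == "__main__":
    # 'path-to-the-coll-file' is a placeholder for your real file name.
    print(first_working_encoding("path-to-the-coll-file"))
```

A clean decode does not prove the guess is right, only that it is not obviously wrong, so it is still worth eyeballing a few lines of the result.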

If that just produces a different decoding error, you have 2 choices: pick an encoding where every byte is "valid", such as 'iso8859-1', or tell the decoder to just cope with the errors by adding the errors="replace", errors="ignore" or errors="backslashreplace" parameter to the open() call. (errors="namereplace" is only supported when writing, so it does not help when reading a file.)

Both these choices have downsides.

There are several ISO8859 encodings, and they might all be wrong for your file, leading to _incorrect_ text lines.

The errors="..." parameter also has downsides: you will end up with missing (errors="ignore") or incorrect (errors="replace" or errors="backslashreplace") text, because the decoder has to do something with the data: drop it or replace it with something wrong. The former loses data while the latter puts in bad data, but at least it is visible if you inspect the data later.
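
A small sketch of those tradeoffs, using made-up sample bytes rather than anything from your file. It calls bytes.decode() directly for brevity; open() accepts the same errors= values. The "backslashreplace" handler is one I picked for the sketch because it works when reading:

```python
# 0xff can never appear in valid UTF-8, so strict decoding would fail here.
bad = b"abc\xffdef"

ignored = bad.decode("utf-8", errors="ignore")             # bad byte dropped
replaced = bad.decode("utf-8", errors="replace")           # U+FFFD put in its place
escaped = bad.decode("utf-8", errors="backslashreplace")   # visible \xff escape

# iso8859-1 accepts every byte, but valid UTF-8 comes out as the wrong text.
wrong = b"\xe2\x80\x9d".decode("iso8859-1")

if __name__ == "__main__":
    print(repr(ignored))   # 'abcdef'
    print(repr(replaced))  # 'abc\ufffddef'
    print(repr(escaped))   # 'abc\\xffdef'
    print(len(wrong))      # 3 characters where the UTF-8 reading has 1
```

None of these raise an error, which is exactly why the damage can go unnoticed until later.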


The full documentation for Python 3's open() call is here:

 https://docs.python.org/3/library/functions.html#open

where the various encoding= and errors= choices are described.

Cheers,
Cameron Simpson <[email protected]>
--
https://mail.python.org/mailman/listinfo/python-list
