On 10Jun2018 13:04, bellcanada...@gmail.com <bellcanada...@gmail.com> wrote:
here is the full error once again
to summarize, my script works fine in python2
i get this error trying to run it in python3
plz see below after the error, my settings for python 2 and python 3
for me it seems i need to change some settings to 'utf-8'..either just in
python 3, since thats where i am having issues or change the settings to
'utf-8' both in python 2 and 3....i would appreciate feedback b4 i do some
trial and error
thanks for the consideration
tommy
***********************************************
Traceback (most recent call last):
  File "createIndex.py", line 132, in <module>
    c.createindex()
  File "creatIndex.py", line 102, in createIndex
    pagedict=self.parseCollection()
  File "createIndex.py", line 47, in parseCollection
    for line in self.collFile:
  File "C:\Users\Robert\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7414: character maps to <undefined>
Ok, this is more helpful. It says that the decoding error, which occurred in
...\cp1252.py, was decoding lines from the file self.collFile.
What is that file? And how was it opened?
Also, your settings below may indeed be important.
***************************************************
python 3 settings
import sys
import locale
locale.getpreferredencoding()
'cp1252'
The setting above is the default encoding used when you open a file in text
mode in Python 3, but you can override it.
In Python 3 this matters a lot, because Python 3 strings are Unicode. In Python
2, strings are just bytes, and are not "decoded" (there is a whole separate
"unicode" type for that when it matters).
So in Python 3 the text file reader is decoding the text in the file according
to what it expects the encoding to be.
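Here is a small sketch of that decoding step done by hand, without a file. I don't know what is actually in your file, so the curly quote character below is just an assumption: it is one common character whose utf-8 encoding contains the byte 0x9d.

```python
# Assumed example character: the right double quotation mark (U+201D).
# Its UTF-8 encoding is three bytes, the last of which is 0x9d.
raw = "\u201d".encode("utf-8")
print(raw)

# cp1252 assigns no character to byte 0x9d, so a strict decode fails,
# which is the same kind of failure the traceback reports.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)

# Decoding with the encoding the bytes were written in recovers the text.
text = raw.decode("utf-8")
print(text == "\u201d")
```

The text file reader does the same thing for you behind the scenes, using whatever encoding open() was given or defaulted to.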
Find the place where self.collFile is opened. You can specify the decoding
method there by adding the "encoding=" parameter to the open() call. It is
defaulting to "cp1252" because that is what locale.getpreferredencoding()
returns, but presumably the actual file data are not encoded that way.
You can (a) find out what encoding _is_ used in the file and specify that or
(b) tell Python to be less picky. Choice (a) is better if it is feasible.
If you have to guess because you don't know the encoding, one possibility is
that collFile contains utf-8 or utf-16; of these 2, utf-8 seems more likely
given the 0x9d byte causing the trouble. Try adding:
encoding='utf-8'
to the open() call, eg:
self.collFile = open('path-to-the-coll-file', encoding='utf-8')
at the appropriate place.
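As a rough, self-contained illustration of the difference the parameter makes: the temporary file and the read_lines() helper below are made up for the demonstration and stand in for your real collection file and the code that opens it.

```python
import os
import tempfile

# Hypothetical stand-in for the collection file: UTF-8 text with curly quotes.
sample = "He said \u201chello\u201d\n"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("utf-8"))

def read_lines(path, encoding):
    """Read every line of the file in text mode using an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return list(f)

# Reading with the wrong encoding fails on the 0x9d byte.
try:
    read_lines(path, "cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# Reading with the encoding the file was written in works.
utf8_lines = read_lines(path, "utf-8")
os.remove(path)

print(cp1252_ok, utf8_lines)
```

Passing encoding= explicitly also means the script behaves the same on machines whose locale.getpreferredencoding() differs from yours.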
If that just produces a different decoding error, you have 2 choices: pick an
encoding where every byte is "valid", such as 'iso8859-1', or tell the
decoder to just cope with the errors by adding the errors="replace" or
errors="ignore" parameter to the open() call. (errors="namereplace" is only
supported when writing, so it will not help when reading a file.)
Both these choices have downsides.
There are several ISO8859 encodings, and they might all be wrong for your file,
leading to _incorrect_ text lines.
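To make that downside concrete, here is a sketch with made-up sample text (an assumption, not your data). The bytes are utf-8, and decoding them as 'iso8859-1' raises no error but does not give back the original text either.

```python
# Made-up sample text, encoded as UTF-8.
raw = "caf\u00e9 \u201cquoted\u201d".encode("utf-8")

# iso8859-1 maps every byte value to some character, so this never raises...
wrong = raw.decode("iso8859-1")

# ...but only the real encoding gives back the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
print(wrong == right)
```

No data is lost this way (encoding the wrong string back with 'iso8859-1' returns the original bytes), but every non-ASCII character in the text is silently wrong.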
The errors="..." parameter also has downsides: you will end up with
missing (errors="ignore") or incorrect (errors="replace", which substitutes
the U+FFFD replacement character) text, because the decoder has to do
something with the data: drop it or replace it with something wrong. The
former loses data while the latter puts in bad data, but at least it is
visible if you inspect the data later.
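A quick sketch of the two behaviours, using bytes.decode() on a made-up byte string rather than a file; the errors= parameter of open() accepts the same handler names.

```python
# Made-up bytes containing 0x9d, which cp1252 cannot decode.
raw = b"good \x9d text"

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("cp1252", errors="ignore")

# errors="replace" substitutes U+FFFD, the Unicode replacement character.
replaced = raw.decode("cp1252", errors="replace")

print(repr(ignored))
print(repr(replaced))
```

With "replace" you can at least search the result for "\ufffd" afterwards to see where the damage is.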
The full documentation for Python 3's open() call is here:
https://docs.python.org/3/library/functions.html#open
where the various encoding= and errors= choices are described.
Cheers,
Cameron Simpson <c...@cskk.id.au>