Re: UTF-8 Encoding Error

Cameron Simpson Thu, 22 Dec 2016 23:17:04 -0800

On 22Dec2016 22:38, Subhabrata Banerjee <[email protected]> wrote:

I am getting the error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 15: invalid start byte


as I try to read some files through TaggedCorpusReader. TaggedCorpusReader is a module of NLTK.
My files are saved in ANSI format in MS-Windows default.
I am using Python2.7 on MS-Windows 7.

I have tried the following options till now,
string.encode('utf-8').strip()
unicode(string)
unicode(str, errors='replace')
unicode(str, errors='ignore')
string.decode('cp1252')

But nothing is of much help.


It would help to see a very small program that produces your error message.

Generally you need to open text files in the same encoding used for their text. Which sounds obvious, but I'm presuming you've not done that.

Normally, when you open a file you can specify the text encoding. I am not a Windows guy, so I do not know what "ANSI format in MS-Windows default" means at the encoding level.
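
As a rough sketch of specifying the encoding at open time: io.open (present in Python 2.7, and the same function as the built-in open in Python 3) takes an encoding argument and hands back text instead of bytes. The cp1252 encoding and the file name sample_cp1252.txt are assumptions for illustration only. cp1252 is a common meaning of "ANSI" on Western-European Windows installs, but your files may use something else.

```python
import io
import os
import tempfile

# Assumption: the file is cp1252 ("ANSI" on many Western Windows systems).
ASSUMED_ENCODING = "cp1252"

# Build a small sample file containing byte 0x96 (an en dash in cp1252).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample_cp1252.txt")  # hypothetical file name
with open(path, "wb") as fp:
    fp.write(b"pages 10\x9620\n")

# io.open decodes while reading, so each line is already Unicode text.
with io.open(path, "r", encoding=ASSUMED_ENCODING) as fp:
    lines = [line.rstrip("\n") for line in fp]

print(repr(lines))
```

If the assumed encoding is wrong you will either get a decode error or text with the wrong characters in it, so it is worth checking a line you know the contents of.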


Supposing you had a bit of code like this:

 with open("filename.txt", "r") as fp:
     for line in fp:
         # line is a Python 2 str, but is a sequence of bytes internally
         unicode_line = line.decode('utf8')
         # unicode_line is a Python 2 _unicode_ object, which is text, a
         # sequence of Unicode codepoints

you could get an error like yours if the file _did not_ contain UTF-8 encoded text.
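
To make that concrete, here is a small sketch that reproduces the same kind of failure without any file at all. The sample bytes are made up for illustration. The only thing taken from your report is the offending byte value, 0x96.

```python
# Made-up sample data; only the 0x96 byte comes from the error report.
raw = b"pages 10\x9620"

try:
    # 0x96 is a UTF-8 continuation byte, so it cannot start a character.
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The position reported in the message is the offset of the bad byte within the bytes being decoded, which is why your traceback mentions position 15 for your data.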


If you used:
   unicode(str, errors='replace')
   unicode(str, errors='ignore')

I would not have expected the error you recite, but we would need to see an example program to be sure.
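
For reference, this sketch shows what the 'replace' and 'ignore' error handlers do to an undecodable byte, using made-up sample bytes and an explicit UTF-8 decode. Neither handler raises, but neither recovers the original character either, so they hide the problem instead of fixing it.

```python
# Made-up sample data containing one byte that is not valid UTF-8.
raw = b"pages 10\x9620"

# 'replace' substitutes U+FFFD (the replacement character) for the bad byte.
replaced = raw.decode("utf-8", "replace")

# 'ignore' silently drops the bad byte.
ignored = raw.decode("utf-8", "ignore")

print(repr(replaced))
print(repr(ignored))
```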

I would guess that the text in your file is not UTF-8 encoded, and that you need to specify the correct encoding to the .decode call.
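
A minimal sketch of that, assuming (and it is only an assumption) that the files are cp1252, where byte 0x96 is an en dash. The sample bytes are made up.

```python
# Made-up sample data; 0x96 is the byte from the error report.
raw = b"pages 10\x9620"

# Assumption: cp1252 is the real encoding of the data.
ASSUMED_ENCODING = "cp1252"

# Decoding with the matching encoding turns 0x96 into U+2013 (en dash).
text = raw.decode(ASSUMED_ENCODING)

print(repr(text))
```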


Cheers,
Cameron Simpson <[email protected]>
--
https://mail.python.org/mailman/listinfo/python-list
