Thanks for the thorough explanation. What I am doing is converting data for processing that will be tab (for columns) and newline (for rows) delimited. Some of the data contains tabs and newlines, so I have to convert them to something else to keep the file's integrity intact.
Not my idea; I've been left with the implementation, however.

"John Machin" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> On 28/06/2006 7:46 AM, Mike Currie wrote:
>> Can anyone explain why I'm getting an ascii encoding error when I'm
>> trying to write out using a UTF-8 encoder?
>>
>> >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>> >>> print filteredLine
>> thisêhasêàtabsêandêlineàbreaks
>> >>> f.write(filteredLine)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "C:\Python24\lib\codecs.py", line 501, in write
>>     return self.writer.write(data)
>>   File "C:\Python24\lib\codecs.py", line 178, in write
>>     data, consumed = self.encode(object, self.errors)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
>> ordinal not in range(128)
>
> Your fundamental problem is that you are trying to encode an 8-bit string
> as UTF-8. The codec tries to convert your string to Unicode first, using
> the default encoding (ascii), which fails.
>
> Get this into your head:
> You encode Unicode as ascii, latin1, cp1252, utf8, gagolitic, whatever
> into an 8-bit string.
> You decode whatever from an 8-bit string into Unicode.
>
> Here is a run-down on your problem, using just the encode/decode methods
> instead of codecs for illustration purposes:
>
> (1) Equivalent to what you did.
> |>> '\x88'.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
> ordinal not in range(128)
>
> (2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
> |>> '\x88'.decode('ascii').encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
> ordinal not in range(128)
>
> (3) Encoding Unicode as UTF-8 works, as expected.
> |>> u'\x88'.encode('utf-8')
> '\xc2\x88'
>
> (4) But you need to know what your 8-bit data is supposed to be encoded
> in, before you start.
> |>> '\x88'.decode('cp1252').encode('utf-8')
> '\xcb\x86'
> |>> '\x88'.decode('latin1').encode('utf-8')
> '\xc2\x88'
>
> I am rather puzzled as to what you are trying to achieve. You appear to
> believe that you possess one or more 8-bit strings, encoded in latin1,
> which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
> corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF
> to NEL, and NEL to LF and similarly with the other pair. Then you want to
> write the result, encoded in UTF-8, to a file. The purpose behind that
> baroque/byzantine capering would be .... what?
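For what it's worth, here is a rough sketch of the decode-first, encode-last rule applied to the tab/newline conversion described at the top. It is written for Python 3 rather than the Python 2.4 used in this thread, so 8-bit strings are `bytes` and Unicode strings are `str`. The names `to_utf8`, `FIELD_ESCAPES`, `protect_field` and `build_row` are made up for illustration, and the HT to HTS / LF to NEL mapping is only an assumption about the conversion being asked for, not a recommendation.

```python
def to_utf8(raw, source_encoding):
    """Decode 8-bit data using its known encoding, then encode as UTF-8."""
    text = raw.decode(source_encoding)  # 8-bit string -> Unicode
    return text.encode("utf-8")         # Unicode -> 8-bit UTF-8 string


# Assumed mapping: in-field HT (0x09) -> HTS (0x88), LF (0x0A) -> NEL (0x85).
FIELD_ESCAPES = {0x09: 0x88, 0x0A: 0x85}


def protect_field(text):
    """Replace tabs and newlines inside one field so they cannot be
    mistaken for the column and row delimiters."""
    return text.translate(FIELD_ESCAPES)


def build_row(fields):
    """Join already-decoded (Unicode) fields into one tab-delimited row."""
    return "\t".join(protect_field(f) for f in fields) + "\n"


# All substitution happens on Unicode text; encoding is the very last step.
row = build_row(["this\thas\ttabs", "and\nline\nbreaks"])
encoded = row.encode("utf-8")
```

The point of the ordering is that nothing ever asks the codec to guess: bytes are decoded with an encoding you name, the replacements are done on Unicode, and UTF-8 is applied only when writing out.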