Python UTF-8 and codecs
I'm trying to write out files that contain the UTF-8 characters 0x85 and 0x88. With every configuration I try, I get:

UnicodeError: ascii codec can't decode byte 0x85 in position 255: ordinal not in range(128)

I've tried codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'), and that doesn't work. I've also tried wrapping the file in a utf8_writer obtained from codecs.lookup('utf8').

Any clues?

Thanks
Mike
--
http://mail.python.org/mailman/listinfo/python-list
Re: Python UTF-8 and codecs
I did make a mistake; it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with newline and tab characters inside quoted fields. The idea is to convert all the newline and tab characters to 0x85 and 0x88 respectively, then process the files. Finally, right before importing them into a database, convert them back to newlines and tabs, thus preserving the field values.

Will Python not handle the control characters correctly?

"Serge Orlov" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> On 6/27/06, Mike Currie <[EMAIL PROTECTED]> wrote:
>> I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
>> them. Every configuration I try I get a UnicodeError: ascii codec can't
>> decode byte 0x85 in position 255: ordinal not in range(128)
>>
>> I've tried using the codecs.open('foo.txt', 'rU', 'utf-8',
>> errors='strict')
>> and that doesn't work and I've also tried wrapping the file in a
>> utf8_writer
>> using codecs.lookup('utf8')
>>
>> Any clues?
>
> Use unicode strings for non-ascii characters. The following program
> "works":
>
> import codecs
>
> c1 = unichr(0x85)
> f = codecs.open('foo.txt', 'wU', 'utf-8')
> f.write(c1)
> f.close()
>
> But unichr(0x85) is a control character, are you sure you want it?
> What is the encoding of your data?
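For readers on Python 3, the program quoted above might be sketched roughly as follows. This is an illustration under assumptions, not part of the original exchange: Python 3's str is already Unicode, so chr(0x85) stands in for unichr(0x85), the built-in open() with an encoding argument stands in for codecs.open(), and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

# Python 3 sketch: chr(0x85) plays the role of Python 2's unichr(0x85).
nel = chr(0x85)  # U+0085 NEXT LINE, a C1 control character

# Hypothetical location for 'foo.txt'; a temp directory keeps the example tidy.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# newline="" turns off newline translation so the file holds exactly what we write.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(nel)

# Read the file back as raw bytes to see what UTF-8 produced.
with open(path, "rb") as f:
    raw = f.read()

print(raw)  # U+0085 encodes to the two bytes C2 85
```

The point carried over from the quoted reply is that the value handed to the writer is Unicode text, so the UTF-8 encoder never has to guess how to interpret raw bytes.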
Re: Python UTF-8 and codecs
Okay,

Here is a sample of what I'm doing:

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this	has	
... tabs	and	line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
>>>
Ascii Encoding Error with UTF-8 encoder
Can anyone explain why I'm getting an ascii encoding error when I'm trying to write out using a UTF-8 encoder?

Thanks

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this	has	
... tabs	and	line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
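The traceback above names the 'ascii' codec even though the file was opened as UTF-8. One way to see why, sketched here in Python 3 with the hidden step written out by hand: in Python 2 the UTF-8 writer was given a byte string and first had to turn it into Unicode, which it did with the default ASCII codec. The sample bytes and the choice of latin-1 for the fix are assumptions made for illustration only.

```python
# Bytes like the Python 2 str in the transcript: "this", then byte 0x88.
raw = b"this\x88has"

# Spell out the implicit step: decode the bytes as ASCII before encoding.
try:
    raw.decode("ascii").encode("utf-8")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Declaring the source encoding explicitly avoids the ASCII step.
# latin-1 is an assumption here; it maps byte 0x88 to code point U+0088.
fixed = raw.decode("latin-1").encode("utf-8")
print(fixed)
```

The failure happens in the decode half, before any UTF-8 encoding takes place, which is why the exception is a UnicodeDecodeError and reports position 4, the index of the first non-ASCII byte.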
Re: Python UTF-8 and codecs
Well, not really. It doesn't affect the result. I still get the error message. Did you get a different result?

"Serge Orlov" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> On 6/27/06, Mike Currie <[EMAIL PROTECTED]> wrote:
>> Okay,
>>
>> Here is a sample of what I'm doing:
>>
>>
>> Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on
>> win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> filterMap = {}
>> >>> for i in range(0,255):
>> ...     filterMap[chr(i)] = chr(i)
>> ...
>> >>> filterMap[chr(9)] = chr(136)
>> >>> filterMap[chr(10)] = chr(133)
>> >>> filterMap[chr(136)] = chr(9)
>> >>> filterMap[chr(133)] = chr(10)
>
> This part is incorrect, it should be:
>
> filterMap = {}
> for i in range(0,128):
>     filterMap[chr(i)] = chr(i)
>
> filterMap[chr(9)] = unichr(136)
> filterMap[chr(10)] = unichr(133)
> filterMap[unichr(136)] = chr(9)
> filterMap[unichr(133)] = chr(10)
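The swap table discussed above can also be sketched with str.translate in Python 3, where every string is Unicode and no unichr() is needed. This is an illustrative alternative, not the code from the thread: the name SWAP and the sample line are assumptions, and only the four code points (9, 10, 0x88, 0x85) come from the messages.

```python
# Swap table: TAB <-> U+0088 and LF <-> U+0085 (code points from the thread).
# Characters not listed in the table pass through translate() unchanged.
SWAP = str.maketrans({
    "\t": "\x88",
    "\n": "\x85",
    "\x88": "\t",
    "\x85": "\n",
})

line = "this\thas\t\ntabs\tand\tline\nbreaks"

# Forward pass: hide tabs and newlines behind the C1 control characters.
filtered = line.translate(SWAP)

# The filtered text is Unicode, so encoding it as UTF-8 succeeds.
encoded = filtered.encode("utf-8")

# Reverse pass: the same table restores the original characters.
restored = encoded.decode("utf-8").translate(SWAP)

print(restored == line)
```

Because the mapping is its own inverse, one table serves both directions, provided the input never already contains U+0085 or U+0088 as real data.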
Re: Ascii Encoding Error with UTF-8 encoder
Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab (for columns) and newline (for rows) delimited. Some of the data contains tabs and newlines, so I have to convert them to something else to keep the file's structure intact.

Not my idea; I've been left with the implementation, however.

"John Machin" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> On 28/06/2006 7:46 AM, Mike Currie wrote:
>> Can anyone explain why I'm getting an ascii encoding error when I'm
>> trying to write out using a UTF-8 encoder?
>>
>
>> >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>> >>> print filteredLine
>> thisêhasêàtabsêandêlineàbreaks
>> >>> f.write(filteredLine)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "C:\Python24\lib\codecs.py", line 501, in write
>>     return self.writer.write(data)
>>   File "C:\Python24\lib\codecs.py", line 178, in write
>>     data, consumed = self.encode(object, self.errors)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
>> ordinal not in range(128)
>>
>
> Your fundamental problem is that you are trying to encode an 8-bit string
> as UTF-8. The codec tries to convert your string to Unicode first, using
> the default encoding (ascii), which fails.
>
> Get this into your head:
> You encode Unicode as ascii, latin1, cp1252, utf8, gagolitic, whatever
> into an 8-bit string.
> You decode whatever from an 8-bit string into Unicode.
>
> Here is a run-down on your problem, using just the encode/decode methods
> instead of codecs for illustration purposes:
>
> (1) Equivalent to what you did.
> |>> '\x88'.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
> ordinal not in range(128)
>
> (2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
> |>> '\x88'.decode('ascii').encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
> ordinal not in range(128)
>
> (3) Encoding Unicode as UTF-8 works, as expected.
> |>> u'\x88'.encode('utf-8')
> '\xc2\x88'
>
> (4) But you need to know what your 8-bit data is supposed to be encoded
> in, before you start.
> |>> '\x88'.decode('cp1252').encode('utf-8')
> '\xcb\x86'
> |>> '\x88'.decode('latin1').encode('utf-8')
> '\xc2\x88'
>
> I am rather puzzled as to what you are trying to achieve. You appear to
> believe that you possess one or more 8-bit strings, encoded in latin1,
> which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
> corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF
> to NEL, and NEL to LF and similarly with the other pair. Then you want to
> write the result, encoded in UTF-8, to a file. The purpose behind that
> baroque/byzantine capering would be what?
>
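The encode/decode run-down quoted above translates to Python 3 roughly as follows. This is a sketch for comparison, not something posted in the thread; the variable names are made up for the example. In Python 3 the two types are separate (bytes and str), so cases (1) and (2) can no longer happen silently.

```python
# One byte with value 0x88, as read from 8-bit data.
byte = b"\x88"

# (1) bytes objects have no .encode() in Python 3, so the implicit
# ASCII decode from the Python 2 session cannot be triggered this way.
has_encode = hasattr(byte, "encode")

# (3) Encoding Unicode text as UTF-8 works, as expected.
from_text = "\x88".encode("utf-8")

# (4) The same byte means different characters under different encodings,
# so the source encoding has to be known before decoding.
via_cp1252 = byte.decode("cp1252").encode("utf-8")
via_latin1 = byte.decode("latin-1").encode("utf-8")

print(has_encode, from_text, via_cp1252, via_latin1)
```

The cp1252 and latin-1 results differ because cp1252 maps byte 0x88 to U+02C6 while latin-1 maps it to U+0088, which is the point of case (4) in the quoted message.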