Python UTF-8 and codecs

2006-06-27 Thread Mike Currie
I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in 
them.  With every configuration I try, I get a UnicodeError: ascii codec can't 
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict') 
and that doesn't work. I've also tried wrapping the file in a utf8_writer 
using codecs.lookup('utf8')

Any clues?

Thanks
Mike


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python UTF-8 and codecs

2006-06-27 Thread Mike Currie
I did make a mistake, it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with new line and tab characters 
inside quoted fields.  The idea is to convert all the new line and tab 
characters to 0x85 and 0x88 respectively, then process the files.  Finally, 
right before importing them into a database, convert them back to new lines 
and tabs, thus preserving the field values.

Will python not handle the control characters correctly?
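
As a rough illustration of the swap described above (not the poster's actual code), the substitution and its inverse can be sketched with Unicode strings and `str.translate`. This assumes Python 3, where `str` is already Unicode; the names `TO_PLACEHOLDER`, `FROM_PLACEHOLDER`, `protect` and `restore` are made up for the example, and the round trip only holds if the input does not already contain U+0088 or U+0085.

```python
# Hypothetical helpers: swap tab/newline for the C1 controls U+0088 and U+0085.
TO_PLACEHOLDER = str.maketrans({"\t": "\x88", "\n": "\x85"})
FROM_PLACEHOLDER = str.maketrans({"\x88": "\t", "\x85": "\n"})


def protect(field):
    """Replace tabs and newlines inside a field with placeholder characters."""
    return field.translate(TO_PLACEHOLDER)


def restore(field):
    """Undo protect(), turning the placeholders back into tabs and newlines."""
    return field.translate(FROM_PLACEHOLDER)


original = "this\thas\t\ntabs\tand\tline\nbreaks"  # assumed sample field
protected = protect(original)
```

One caveat in Python 3: `str.splitlines()` treats U+0085 as a line boundary, so protected text is safer to split on "\n" explicitly.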


"Serge Orlov" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> On 6/27/06, Mike Currie <[EMAIL PROTECTED]> wrote:
>> I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
>> them.  With every configuration I try, I get a UnicodeError: ascii codec can't
>> decode byte 0x85 in position 255: ordinal not in range(128)
>>
>> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
>> and that doesn't work. I've also tried wrapping the file in a utf8_writer
>> using codecs.lookup('utf8')
>>
>> Any clues?
>
> Use unicode strings for non-ascii characters. The following program 
> "works":
>
> import codecs
>
> c1 = unichr(0x85)
> f = codecs.open('foo.txt', 'wU', 'utf-8')
> f.write(c1)
> f.close()
>
> But unichr(0x85) is a control character; are you sure you want it?
> What is the encoding of your data? 
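
A present-day equivalent of the quoted snippet might look like the following. It is a sketch that assumes Python 3, where `chr` plays the role of `unichr` and the built-in `open` accepts an `encoding` argument; the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

c1 = chr(0x85)  # U+0085 NEXT LINE, a C1 control character

# Assumed scratch location so the sketch does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# Text mode with an explicit encoding; newline="" disables newline translation.
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(c1)

# Read the raw bytes back to see how the character was encoded.
with open(path, "rb") as f:
    raw = f.read()
```

The file ends up holding the two-byte UTF-8 sequence for U+0085 rather than a single 0x85 byte, which matters if a later processing step expects single-byte placeholders.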




Re: Python UTF-8 and codecs

2006-06-27 Thread Mike Currie
Okay,

Here is a sample of what I'm doing:


Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this  has
... tabsand line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
>>>



Ascii Encoding Error with UTF-8 encoder

2006-06-27 Thread Mike Currie
Can anyone explain why I'm getting an ascii encoding error when I'm trying 
to write out using a UTF-8 encoder?

Thanks

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this  has
... tabsand line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)



Re: Python UTF-8 and codecs

2006-06-27 Thread Mike Currie
Well,  not really.  It doesn't affect the result.  I still get the error 
message.  Did you get a different result?


"Serge Orlov" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> On 6/27/06, Mike Currie <[EMAIL PROTECTED]> wrote:
>> Okay,
>>
>> Here is a sample of what I'm doing:
>>
>>
>> Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> filterMap = {}
>> >>> for i in range(0,255):
>> ...     filterMap[chr(i)] = chr(i)
>> ...
>> >>> filterMap[chr(9)] = chr(136)
>> >>> filterMap[chr(10)] = chr(133)
>> >>> filterMap[chr(136)] = chr(9)
>> >>> filterMap[chr(133)] = chr(10)
>
> This part is incorrect, it should be:
>
> filterMap = {}
> for i in range(0,128):
>     filterMap[chr(i)] = chr(i)
>
> filterMap[chr(9)] = unichr(136)
> filterMap[chr(10)] = unichr(133)
> filterMap[unichr(136)] = chr(9)
> filterMap[unichr(133)] = chr(10) 
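
Put together, the corrected mapping might be used along the following lines. This is a sketch in Python 3 terms, where every `str` is Unicode and `chr` replaces `unichr`; `filter_map` and `filtered_line` simply mirror the names in the session above, and the sample line is an assumed stand-in for the original input.

```python
# Identity mapping for the ASCII range, as in the correction above.
filter_map = {chr(i): chr(i) for i in range(128)}

# Swap tab/newline with the C1 controls U+0088 and U+0085, in both directions.
filter_map["\t"] = "\x88"
filter_map["\n"] = "\x85"
filter_map["\x88"] = "\t"
filter_map["\x85"] = "\n"

line = "this\thas\t\ntabs\tand\tline\nbreaks"  # assumed sample input
filtered_line = "".join(filter_map[ch] for ch in line)

# Encoding a Unicode string to UTF-8 involves no implicit ASCII decoding step.
encoded = filtered_line.encode("utf-8")
```

Because the map is symmetric, applying it a second time to `filtered_line` gives back the original line.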




Re: Ascii Encoding Error with UTF-8 encoder

2006-06-27 Thread Mike Currie
Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab (for 
columns) and newline (for rows) delimited.  Some of the data contains tabs 
and newlines, so I have to convert them to something else to keep the file's 
structure intact.

Not my idea, but I've been left with the implementation.

"John Machin" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> On 28/06/2006 7:46 AM, Mike Currie wrote:
>> Can anyone explain why I'm getting an ascii encoding error when I'm 
>> trying to write out using a UTF-8 encoder?
>>
>
>> >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>> >>> print filteredLine
>> thisêhasêàtabsêandêlineàbreaks
>> >>> f.write(filteredLine)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "C:\Python24\lib\codecs.py", line 501, in write
>>     return self.writer.write(data)
>>   File "C:\Python24\lib\codecs.py", line 178, in write
>>     data, consumed = self.encode(object, self.errors)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
>>
>
> Your fundamental problem is that you are trying to encode an 8-bit string 
> as UTF-8. The codec tries to convert your string to Unicode first, using 
> the default encoding (ascii), which fails.
>
> Get this into your head:
> You encode Unicode as ascii, latin1, cp1252, utf8, glagolitic, whatever 
> into an 8-bit string.
> You decode whatever from an 8-bit string into Unicode.
>
> Here is a run-down on your problem, using just the encode/decode methods 
> instead of codecs for illustration purposes:
>
> (1) Equivalent to what you did.
> |>> '\x88'.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)
>
> (2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
> |>> '\x88'.decode('ascii').encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)
>
> (3) Encoding Unicode as UTF-8 works, as expected.
> |>> u'\x88'.encode('utf-8')
> '\xc2\x88'
>
> (4) But you need to know what your 8-bit data is supposed to be encoded 
> in, before you start.
> |>> '\x88'.decode('cp1252').encode('utf-8')
> '\xcb\x86'
> |>> '\x88'.decode('latin1').encode('utf-8')
> '\xc2\x88'
>
> I am rather puzzled as to what you are trying to achieve. You appear to 
> believe that you possess one or more 8-bit strings, encoded in latin1, 
> which contain the C0 controls \x09 (HT) and \x0a (LF) AND the 
> corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF 
> to NEL, and NEL to LF and similarly with the other pair. Then you want to 
> write the result, encoded in UTF-8, to a file. The purpose behind that 
> baroque/byzantine capering would be  what?
> 
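
The same four cases can be replayed in Python 3, where the two types are kept apart as `bytes` (8-bit) and `str` (Unicode). This is only a sketch of the idea, and the variable names are illustrative; in Python 3, case (1) fails differently, because `bytes` has no `encode` method at all.

```python
raw = b"\x88"  # an 8-bit string whose encoding is not yet known

# (1) bytes cannot be encoded; only str (Unicode) can.
has_encode = hasattr(raw, "encode")

# (2) Decoding as ASCII fails, because 0x88 is outside range(128).
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# (3) Encoding Unicode as UTF-8 works.
from_unicode = "\x88".encode("utf-8")

# (4) The result depends on which encoding the 8-bit data is decoded from.
via_cp1252 = raw.decode("cp1252").encode("utf-8")
via_latin1 = raw.decode("latin1").encode("utf-8")
```

The two results in case (4) differ because cp1252 maps byte 0x88 to U+02C6, while latin1 maps it to U+0088.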

